Detection of problems in log files using analytics

Hello, Habr!

Log files systematically record the chronology of events occurring in the system (or program). This can be anything from a program startup record to detailed error and warning information. For us, log files are not just event diaries, but a very good tool for diagnosing and eliminating problems.

I often encountered situations where proper analysis of these files helped not only to identify and eliminate program failures, but also to prevent potential problems before they appeared.

Log files contain detailed records of events and errors that occur in a program or system. By analyzing these records, you can determine exactly when and why an error occurred, and you can trace the sequence of events that led to the error.

Log files

Log file formats

Text formats: common and human-readable, but can be inefficient with large amounts of data. Examples include plain text (.txt), CSV, JSON, and XML.

JSON example:

{
  "timestamp": "2023-12-19T12:00:00Z",
  "severity": "ERROR",
  "source": "AuthService",
  "message": "Failed to authenticate user",
  "eventId": 1001,
  "user": "user_otus",
  "details": {
    "ipAddress": "192.168.1.10",
    "attempt": 3
  }
}

XML logs look much the same, only with tags such as <log> instead of curly braces.
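
For example, the same record might look like this in XML (a sketch; the element names are illustrative, not taken from a specific standard):

<log>
  <timestamp>2023-12-19T12:00:00Z</timestamp>
  <severity>ERROR</severity>
  <source>AuthService</source>
  <message>Failed to authenticate user</message>
  <eventId>1001</eventId>
  <user>user_otus</user>
  <details>
    <ipAddress>192.168.1.10</ipAddress>
    <attempt>3</attempt>
  </details>
</log>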

CSV example:

timestamp,severity,source,message,eventId,user
"2023-12-19T12:00:01Z","INFO","DataService","Data retrieved successfully",1002,"user_otus"
"2023-12-19T12:00:02Z","WARN","DataService","Data retrieval slow",1003,"user_otus"

Binary formats: efficient for storage and processing, but require specialized tools to read. Examples include proprietary formats used by various programs.

In pseudocode, the structure of a binary log can look like this:

Binary log file:
- File header (defines the format and version)
- Sequence of records:
  - Record length (fixed size)
  - Timestamp (fixed size)
  - Severity level (fixed size)
  - Event ID (fixed size)
  - Message (variable size, terminated by a special character)
  - Additional data (if present)

Standard fields

Date and time: generally the most important field, indicating the exact time of the event.

Severity level: indicates the importance of the event. Usually includes levels such as DEBUG, INFO, WARN, ERROR, FATAL.

Event source: usually the system component where the event originated.

Message: describes the event or error, providing context or details.

Event ID: a unique event code that helps with classification and filtering.

User or session: information about the user or session associated with the event.

Additional information: may include the call stack, variable values, and other diagnostic data.
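
In Python, many of these standard fields can be emitted with the built-in logging module; a minimal sketch (the format string and logger name are just examples):

import logging

# Include timestamp, severity, source (logger name), process ID, and message
logging.basicConfig(
    format='%(asctime)s %(levelname)s %(name)s [%(process)d]: %(message)s',
    level=logging.INFO,
)

logger = logging.getLogger('AuthService')
logger.error('Failed to authenticate user')
# Example output:
# 2023-12-19 12:00:00,000 ERROR AuthService [1234]: Failed to authenticate user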

Formatting, standards, and extensibility

In text formats such as JSON or XML, data is often structured as key-value pairs, which makes it easier to parse. In binary formats, the data layout is typically optimized for read and write speed.
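
For example, parsing a JSON log record like the one shown earlier takes only a few lines of Python (a sketch; the field names follow that example):

import json

line = '{"timestamp": "2023-12-19T12:00:00Z", "severity": "ERROR", "source": "AuthService", "message": "Failed to authenticate user"}'

# The key-value structure parses directly into a dictionary
record = json.loads(line)
print(record['severity'], record['message'])  # ERROR Failed to authenticate user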

Some programs may adhere to specific standards, such as Syslog on Unix/Linux systems.

Syslog is used to send event and error messages to a specific file, to a remote server, or simply to display them on screen. Syslog messages have a strict structure that includes a timestamp, hostname, program name, process ID, and the message itself.

Let's write Python code that sends messages to syslog on a Unix/Linux system:

import syslog

# Open a connection to syslog
syslog.openlog(logoption=syslog.LOG_PID, facility=syslog.LOG_USER)

# Send messages of different severity levels
syslog.syslog(syslog.LOG_INFO, "Informational message")
syslog.syslog(syslog.LOG_WARNING, "Warning")
syslog.syslog(syslog.LOG_ERR, "Error message")

# Close the connection
syslog.closelog()

syslog.openlog() opens a connection to syslog. The logoption=syslog.LOG_PID parameter indicates that each message will include the process ID, and facility=syslog.LOG_USER specifies the type of application sending the message. syslog.syslog() sends messages of varying severity (LOG_INFO, LOG_WARNING, LOG_ERR), and syslog.closelog() closes the connection.

The resulting syslog message looks like this, for example:

Dec 19 12:00:00 hostname myprogram[1234]: Informational message

Dec 19 12:00:00 is the timestamp, hostname is the host name, myprogram is the program name, [1234] is the process ID, and Informational message is the text of the message itself.

The log file format should allow adding user fields without breaking the existing structure.
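
For example, in a JSON-based format new user-defined fields can simply be added next to the standard ones (the durationMs and customTags fields below are hypothetical):

{
  "timestamp": "2023-12-19T12:00:00Z",
  "severity": "ERROR",
  "source": "AuthService",
  "message": "Failed to authenticate user",
  "durationMs": 152,
  "customTags": ["auth", "retry"]
}

Parsers that do not know about the extra fields can simply ignore them, so the existing structure keeps working.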

Tools for log file analysis

ELK Stack

Elements

  • Elasticsearch: a distributed search engine used to store all collected logs. Provides good scalability; a query sketch is shown after this list.

  • Logstash: a data pipeline that collects, transforms, and forwards data to Elasticsearch. Supports a wide range of inputs and outputs, including files, databases, and cloud storage.

  • Kibana: a web interface for visualizing and analyzing data from Elasticsearch. Offers tools for building dashboards, graphs, and maps, enabling deep data analysis.

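Once logs land in Elasticsearch, they can be queried through its search API. Here is a minimal sketch in Python; the index pattern, severity field, and @timestamp field are assumptions based on the examples in this article and may differ in your setup:

import requests

# Search the apache-logs indices for ERROR-level records, newest first
# (field names are assumptions and depend on how your logs were parsed)
query = {
    "query": {"match": {"severity": "ERROR"}},
    "sort": [{"@timestamp": {"order": "desc"}}],
    "size": 10,
}

response = requests.post("http://localhost:9200/apache-logs-*/_search", json=query)
for hit in response.json()["hits"]["hits"]:
    print(hit["_source"])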

Splunk

Splunk automatically collects and indexes data from a variety of sources, including logs, configuration files, and metrics. It offers a query language (SPL) for searching, analyzing, and visualizing log data.

Splunk is my favorite.
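
To get a feel for SPL, here is a small illustrative search; the index, sourcetype, and field names are made up for the example:

index=web_logs sourcetype=access_combined status>=500
| stats count BY host
| sort -count

This counts server-error responses per host and sorts the hosts by error count.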

Graylog

Graylog collects logs from various sources in a single place for analysis and allows you to filter, sort, and analyze the data.

It offers tools for visualizing data and configuring event alerts.
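
Graylog typically ingests messages in GELF (Graylog Extended Log Format), a JSON structure in which custom fields are prefixed with an underscore; a sketch of such a payload (the values are illustrative):

{
  "version": "1.1",
  "host": "app-server-01",
  "short_message": "Failed to authenticate user",
  "timestamp": 1702987200,
  "level": 3,
  "_service": "AuthService",
  "_user": "user_otus"
}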


Let’s compare these tools in the table:

| Criterion | ELK Stack | Splunk | Graylog |
| --- | --- | --- | --- |
| Main function | Collection, aggregation, and analysis of logs | Collection, analysis, and visualization of big data | Centralized collection and analysis of logs |
| Scalability | High | High | Average |
| Ease of use | Medium (requires customization) | High (simple interface) | High (easy to set up) |
| Data search and analysis | Powerful search capabilities in Elasticsearch | Advanced searches and alerts | Effective search and alerts |
| Visualization | Flexible dashboards in Kibana | Flexible visualization and reporting | Standard dashboards and graphs |
| Extensibility | High (many plugins and integrations) | Modular plugin system | Support for plugins and integrations |
| Security | Access settings and encryption | Advanced management and security functions | Access and filtering settings |
| Licensing | Open source (with paid add-ons) | Proprietary (with free limited use) | Open source (with commercial options) |

Principles of log file analysis

Data collection and aggregation

Data collection includes:

  • Identifying all sources of log data, such as servers, applications, and network devices.

  • Bringing the data to a single format to simplify further analysis.

  • Using data collection agents or services (such as Filebeat).

Data aggregation includes:

  • Forwarding the collected data to a centralized repository (e.g., Elasticsearch).

  • Ensuring data uniformity, for example through schemas or templates.

  • Preparing the data for fast search and analysis.

Filebeat is a lightweight log collector that can forward data to Elasticsearch or Logstash. An example Filebeat configuration file for Apache log collection might look like this:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/apache2/*.log

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "apache-logs-%{+yyyy.MM.dd}"

Logstash can receive data from Filebeat, transform it, and send it to Elasticsearch. An example Logstash configuration for processing Apache logs:

input {
  beats {
    port => 5044
  }
}

filter {
  if [fileset][module] == "apache" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "apache-logs-%{+YYYY.MM.dd}"
  }
}

In our beloved Python, for example, this process might look like this.

Let's write code that reads lines from a log file and extracts certain data:

import re

# Open the log file
with open('/path/to/logfile.log', 'r') as file:
    for line in file:
        # Use a regular expression to extract the data
        match = re.search(r'ERROR: (.*) - at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', line)
        if match:
            error_message = match.group(1)
            timestamp = match.group(2)
            print(f'Timestamp: {timestamp}, Error: {error_message}')

Let's write a script that collects data from several log files and aggregates it into one list:

import glob

# List for storing the aggregated data
aggregated_logs = []

# Collect all log files in the directory
log_files = glob.glob('/path/to/logs/*.log')

# Read each file and append its lines to the list
for log_file in log_files:
    with open(log_file, 'r') as file:
        for line in file:
            aggregated_logs.append(line.strip())

We transform the data from log files before analysis; for example, converting the time format:

from datetime import datetime
import re

def preprocess_log_line(line):
    # Extract the timestamp and convert it
    timestamp_regex = r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})'
    match = re.search(timestamp_regex, line)
    if match:
        timestamp_str = match.group(1)
        timestamp = datetime.strptime(timestamp_str, '%d/%b/%Y:%H:%M:%S')
        return timestamp, line
    return None, line

# Read and process the log file
with open('/path/to/apache.log', 'r') as file:
    for line in file:
        timestamp, processed_line = preprocess_log_line(line)
        print(f'Timestamp: {timestamp}, Log Line: {processed_line}')

Filtering and sorting data

In the context of log files, filtering means extracting records that match certain criteria.

Regular expressions can be used to extract specific patterns from the text of log files.

Conditional logic can be used to select records by specific criteria, such as date, error level, and so on.

An example of filtering by error level:

import re

log_file_path = '/path/to/logfile.log'
error_level = 'ERROR'

with open(log_file_path, 'r') as file:
    for line in file:
        if re.search(f'{error_level}:', line):
            print(line.strip())

Sorting data

Sorting helps organize records in an order that simplifies analysis, for example by time or severity level.

Sorting by timestamp:

from datetime import datetime
import re

def extract_timestamp(line):
    # Match the Apache-style timestamp inside the brackets (without the timezone offset)
    match = re.search(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})', line)
    if match:
        return datetime.strptime(match.group(1), '%d/%b/%Y:%H:%M:%S')
    return None

log_entries = []

with open('/path/to/apache.log', 'r') as file:
    for line in file:
        timestamp = extract_timestamp(line)
        if timestamp:
            log_entries.append((timestamp, line.strip()))

# Sort by timestamp
log_entries.sort(key=lambda x: x[0])

for entry in log_entries:
    print(entry[1])

First we extract the timestamps from each line of the log file, then we sort the records by them.

ML for automatic anomaly detection

Unsupervised learning methods are often used to detect anomalies, as real anomalies are rarely known in advance. Algorithms such as Isolation Forest, one-class SVM, and autoencoders are used.

For example, let's try Isolation Forest:

import pandas as pd
from sklearn.ensemble import IsolationForest
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create an example dataframe
# Suppose we have logs with three parameters: CPU usage, memory usage, and number of requests
np.random.seed(42)
data = {
    'cpu_usage': np.random.normal(loc=50, scale=5, size=100),     # Normal distribution around 50 with std 5
    'memory_usage': np.random.normal(loc=70, scale=10, size=100), # Normal distribution around 70 with std 10
    'num_requests': np.random.normal(loc=100, scale=20, size=100) # Normal distribution around 100 with std 20
}

# Inject anomalies into the data
data['cpu_usage'][::10] = np.random.normal(loc=80, scale=5, size=10)      # Abnormally high CPU usage
data['memory_usage'][::10] = np.random.normal(loc=30, scale=5, size=10)   # Abnormally low memory usage
data['num_requests'][::10] = np.random.normal(loc=150, scale=30, size=10) # Abnormally high number of requests

# Create the DataFrame
df = pd.DataFrame(data)

# Initialize and fit the Isolation Forest model
iso_forest = IsolationForest(n_estimators=100, contamination=0.1)
predictions = iso_forest.fit_predict(df)

# Add the results to the dataframe
df['anomaly'] = predictions

# Visualize the data with anomalies (-1 = anomaly, 1 = normal)
sns.pairplot(df, hue="anomaly", palette={-1: 'red', 1: 'blue'})
plt.suptitle('Isolation Forest anomalies')
plt.show()
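
For comparison, the same data can be run through a one-class SVM, another of the algorithms mentioned above. A minimal sketch reusing the df dataframe from the previous example, with an illustrative nu value:

from sklearn.svm import OneClassSVM

# Fit a one-class SVM on the same feature columns (reusing df from above);
# nu roughly bounds the fraction of points treated as outliers
oc_svm = OneClassSVM(kernel='rbf', nu=0.1, gamma='scale')
svm_predictions = oc_svm.fit_predict(df[['cpu_usage', 'memory_usage', 'num_requests']])

# As with Isolation Forest, -1 marks anomalies and 1 marks normal points
df['anomaly_svm'] = svm_predictions
print(df['anomaly_svm'].value_counts())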

Log file analysis is an essential technical need for any project.

You can learn all the necessary analytics tools at OTUS online courses, which include open classes every day – the class calendar is available at the link.
