How to apply the PCA method to reduce the dimensionality of data

Hello, dear readers!

One of the key tasks when working with data is reducing its dimensionality in order to improve interpretability, speed up machine learning algorithms and, ultimately, improve the quality of decisions. Today we will talk about a method considered one of the most powerful tools in a data practitioner's arsenal: Principal Component Analysis, or PCA.

PCA is a statistical method that reduces the dimensionality of data while retaining as much information as possible. It is based on linear algebra and mathematical statistics, and it is a powerful tool for analyzing multivariate data. The main idea behind PCA is to find new features, called principal components, that capture as much of the variance in the original data as possible.

In practice, PCA can be used for a variety of purposes, including dimensionality reduction for data visualization, removing noise from data, improving the performance of machine learning models, and more.

A few reasons why PCA is worth considering:

  1. Improved data visualization: Dimensionality reduction allows data to be displayed in 2D or 3D space, facilitating visual exploration and data analysis.

  2. Reduction of computational complexity: Dimensionality reduction can significantly reduce the number of features, resulting in faster training of machine learning models and reduced resource consumption.

  3. Improving the quality of models: Many machine learning algorithms suffer from the curse of dimensionality. PCA can help reduce the dimensionality of data while preserving important features, resulting in better model performance.

  4. Searching for hidden patterns: PCA can help reveal hidden dependencies between features and show which combinations of features drive the variation in the data.

How principal component analysis (PCA) works

PCA is based on the idea of finding new features, called principal components, that capture the maximum variance in the data while being orthogonal to each other. These principal components form a new basis in the feature space, eliminating redundant information and reducing dimensionality.

Suppose we have a data matrix X, where each row is an observation and each column is a feature. Our goal is to find new features (principal components) that best describe the variability of the data. The principal components are computed as the eigenvectors of the covariance matrix of the data.

The covariance matrix allows us to measure how features are related to each other. The covariance between two features shows how much they vary together: a positive covariance means the features increase together, while a negative covariance indicates that they change in opposite directions. The covariance matrix of the data X is usually calculated as follows:

C = (1 / (n - 1)) * (X - u)^T * (X - u)

where:

  • C – the covariance matrix.

  • X – the data matrix.

  • u – the vector of feature means, subtracted from each row of X.

  • n – the number of observations.
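
To make the formula concrete, here is a minimal NumPy sketch (variable names follow the definitions above) that computes C by hand and checks it against np.cov:

import numpy as np

X = np.random.rand(100, 5)         # 100 observations, 5 features
u = X.mean(axis=0)                 # vector of feature means
n = X.shape[0]                     # number of observations

C = (X - u).T @ (X - u) / (n - 1)  # covariance matrix from the formula

# Sanity check against NumPy's built-in implementation
print(np.allclose(C, np.cov(X, rowvar=False)))  # True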

Basic steps of the PCA algorithm

  1. Data standardization: Before computing the principal components, standardize the data to zero mean and unit variance. Features on different scales would otherwise distort the PCA results.

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
  2. Calculation of the covariance matrix: After standardizing the data, we calculate the covariance matrix C.

    import numpy as np
    cov_matrix = np.cov(X_scaled, rowvar=False)
  3. Calculation of eigenvectors and eigenvalues: The next step is to calculate the eigenvectors and eigenvalues of the covariance matrix. This can be done using libraries such as NumPy.

    # eigh is suited to symmetric matrices such as the covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
  4. Sorting the principal components: The principal components are sorted in descending order of their eigenvalues, which makes it easy to select the most informative components.
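
    One way to do this in NumPy, assuming the eigenvalues and eigenvectors from step 3:

    idx = np.argsort(eigenvalues)[::-1]  # indices in descending order of eigenvalue
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]  # reorder the eigenvector columns to match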

  5. Projection of the data onto the principal components: Finally, we project the standardized data onto the new basis formed by the principal components, reducing the dimensionality of the data.

    k = 2  # number of components to keep; choosing k is discussed below
    projected_data = X_scaled.dot(eigenvectors[:, :k])

The principal components obtained in the last step are new features that can be used to analyze or train machine learning models.
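
The manual pipeline above can be cross-checked against scikit-learn's PCA. Up to the sign of each component (eigenvector signs are arbitrary), the projections should coincide; a minimal sketch, assuming X_scaled, projected_data and k from the steps above:

from sklearn.decomposition import PCA
import numpy as np

pca = PCA(n_components=k)
projected_sklearn = pca.fit_transform(X_scaled)

# Each column may differ from the manual result only by its sign
print(np.allclose(np.abs(projected_data), np.abs(projected_sklearn)))  # True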

Implementation of PCA

Example 1: Improving classification with PCA

In this example, we use the scikit-learn library to apply PCA to the Iris dataset and improve classification with a Support Vector Machine (SVM).

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the data
data = load_iris()
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train an SVM on the PCA-transformed data
svm = SVC()
svm.fit(X_train_pca, y_train)

# Evaluate model performance
accuracy = svm.score(X_test_pca, y_test)
print(f'Accuracy after PCA: {accuracy:.2f}')

Result:

Accuracy after PCA: 1.00

Example 2: Speeding up training on large datasets

PCA can be useful for speeding up the training of models on large datasets. In this example, we use the TensorFlow library and PCA to reduce the dimensionality of the data before training the neural network.

import tensorflow as tf
from sklearn.decomposition import PCA

# Load a large dataset
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

# Flatten the images into vectors
X_train = X_train.reshape(-1, 28 * 28)
X_test = X_test.reshape(-1, 28 * 28)

# Apply PCA
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Here we could train a neural network on X_train_pca
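
As a sketch of that last step, a small fully connected network could be trained on the 50 PCA features; the architecture and hyperparameters below are purely illustrative:

# A small illustrative classifier on the PCA-compressed features
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train_pca, y_train, epochs=5, validation_data=(X_test_pca, y_test))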

Example 3: Improving clustering

PCA can also be used to improve data clustering. In the following example, we use the KMeans algorithm from scikit-learn to cluster data and compare the results before and after applying PCA.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Create synthetic data
X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# Clustering without PCA
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)

# Clustering after applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
kmeans_pca = KMeans(n_clusters=4, random_state=42)
y_pred_pca = kmeans_pca.fit_predict(X_pca)

# Visualize the results
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title("Clustering without PCA")
plt.subplot(122)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred_pca, cmap='viridis')
plt.title("Clustering after PCA")
plt.show()

Data visualization

Example 1: Visualization of Iris data

We take the Iris dataset, apply PCA to reduce the dimensionality to two components, and visualize the data in 2D space.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the data
data = load_iris()
X, y = data.data, data.target

# Apply PCA to reduce dimensionality
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Visualize the data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis')
plt.title("Iris data visualized with PCA")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.show()
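
It is worth checking how much information this 2D picture retains. For Iris, the first two components together explain roughly 98% of the variance:

print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # total retained, about 0.98 for Iris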

Evaluation and interpretation of PCA results

When applying PCA, one of the important questions is how many principal components to keep. Choosing the wrong number can lead to loss of information or an overly complex model. There are several methods for estimating the optimal number of components, including the elbow method and the explained variance method.

Elbow method: This method analyzes the proportion of explained variance as a function of the number of components. We plot the number of components on the X axis and the proportion of explained variance on the Y axis. The resulting curve is elbow-shaped, and the point where the drop in explained variance levels off indicates the optimal number of components.

Example code for the elbow method:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Create a PCA instance
pca = PCA()

# Fit PCA on the data X
pca.fit(X)

# Plot the explained variance
explained_variance_ratio = pca.explained_variance_ratio_
plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Proportion of explained variance")
plt.title("Elbow method for choosing the number of components")
plt.show()

Explained variance method: This method chooses the number of components so that the proportion of explained variance reaches a given threshold (e.g., 95% or 99%). This preserves most of the information while reducing the dimensionality.

Example code for the explained variance method:

from sklearn.decomposition import PCA

# Create a PCA instance with a variance threshold
pca = PCA(0.95)  # keep 95% of the explained variance

# Fit PCA on the data X and transform it
X_reduced = pca.fit_transform(X)

Analysis of explained variance

After choosing the optimal number of components and transforming the data, it is important to analyze the explained variance. This allows us to understand how much information we have retained after dimensionality reduction.

Example code for the analysis of explained variance:

explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = explained_variance_ratio.cumsum()

# Visualize the cumulative explained variance
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative proportion of explained variance")
plt.title("Explained variance analysis")
plt.show()

Interpretation of the principal components

After reducing the dimensionality and choosing the optimal number of components, it becomes important to understand what these components represent. Interpreting the principal components helps to see which features they encode and which relationships between features they capture.

To interpret the principal components, one can analyze their weights (the eigenvectors, also called loadings) and the features associated with them. For example, in image analysis it may turn out that the first principal component relates to the illumination of the images, and the second to the orientation of the objects.

Example code for inspecting the principal components:

# Get the eigenvectors (weights) of the principal components
eigen_vectors = pca.components_

# Visualize the weights of the first few components
# (assumes pca was fitted on image data, e.g. the 28x28 MNIST images
#  from Example 2, with at least 5 components)
plt.figure(figsize=(10, 5))
for i in range(5):
    plt.subplot(1, 5, i + 1)
    plt.imshow(eigen_vectors[i].reshape(28, 28), cmap='viridis')
    plt.title(f"Principal component {i + 1}")
    plt.axis('off')
plt.show()

The interpretation of principal components depends on the specific task and data you are working with. Additional analysis and domain knowledge may be required to fully understand what the components mean.
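
As a minimal sketch of such an analysis, assuming the pca and data objects from the Iris visualization example above, one can print each component's loadings next to the feature names:

# Print the loadings of each principal component next to the feature names
for i, component in enumerate(pca.components_):
    print(f"Principal component {i + 1}:")
    for name, weight in zip(data.feature_names, component):
        print(f"  {name}: {weight:+.3f}")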

More examples of using PCA

Example 1: Dimensionality reduction for medical images

In the medical field, especially in MRI or CT images, the dimensionality of the data can be huge, making analysis difficult:

# Generate a synthetic dataset standing in for medical image features
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, n_features=3000, random_state=42)

# Apply PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)

Example 2: Improving the analysis of textual data

In text analysis, especially when working with large corpora, PCA can be used to reduce dimensionality and highlight the most important features:

# Build a small text dataset
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is the first document.", "This is the second document.", "And this is the third document."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Apply PCA (PCA requires a dense array, hence .toarray())
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X.toarray())
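
A side note: calling .toarray() densifies the TF-IDF matrix, which can be expensive for large corpora. scikit-learn's TruncatedSVD performs a closely related decomposition (known as latent semantic analysis for text) directly on the sparse matrix:

# An alternative that works on the sparse matrix without densifying it
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2)
X_reduced_sparse = svd.fit_transform(X)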

Example 3: Visualizing long time series

When working with time series, PCA can help visualize changes in the data, which can be useful when analyzing financial markets or monitoring production processes:

# Generate a noisy time series
import numpy as np
from sklearn.decomposition import PCA
time = np.linspace(0, 10, 1000)
signal = np.sin(time) + np.random.normal(0, 0.1, 1000)

# Embed the series in overlapping windows so that each row is an observation
# with several features (PCA cannot be applied to a single-column signal)
window = 50
X = np.lib.stride_tricks.sliding_window_view(signal, window)

# Apply PCA for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

Example 4: Improving the analysis of spectral data

In the analysis of spectral data such as spectrograms, PCA can help highlight important frequencies and reduce the dimensionality of the data:

# Generate a synthetic spectral dataset
import numpy as np
freqs = np.array([10, 20, 30, 40, 50])
data = np.array([np.sin(2 * np.pi * f * np.linspace(0, 1, 1000)) for f in freqs])

# Apply PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(data.T)

Example 5: Improving geodata processing

When working with geodata such as GPS coordinates, PCA can be used to reduce dimensionality and extract the most important factors:

# Generate a synthetic geodata set
import numpy as np
latitude = np.linspace(37.7749, 37.8049, 1000)
longitude = np.linspace(-122.4194, -122.3894, 1000)
coordinates = np.column_stack((latitude, longitude))

# Apply PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(coordinates)
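
Because the synthetic latitude and longitude above vary along a straight line, a single component should capture essentially all of the variance, which is easy to verify:

print(pca.explained_variance_ratio_)  # expected to be close to [1.0]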

Conclusion

PCA improves data analysis, reduces computational costs and highlights the most informative features. Keep in mind that choosing the right number of components and interpreting the results correctly play a key role in applying PCA successfully in your projects.

The article was prepared ahead of the start of enrollment in the online course “System Analyst. Advanced”. To find out whether your knowledge is sufficient for the course program, take the entrance test.
