Reverse normalization (denormalization)

Reverse normalization (denormalization)

Hello, Habre!

Normalization – It is the process of organizing data in the core to minimize redundancy and dependencies. It helps us avoid anomalies when inserting, updating or deleting data and, of course, simplifies the database schema.

But like everything in this world, normalization is not perfect. Sometimes it can lead to excessive complexity and reduced query performance. And here it is useful inverse normalization. If normalization is ordering, then reverse normalization is some “relaxation” of the rules for achieving certain goals. In a database, this means merging tables, adding redundant data, and so on, to speed up data reading and simplify queries, even if it compromises some of the “cleanliness” of the schema.

Inverse normalization is not about abandoning all principles of order, but rather the ability to find a balance between ideal normalization and the real needs of performance and ease of working with data. It helps speed up data reading by reducing the number of required joins and operations.

Reverse normalization makes queries more straightforward and understandable.

Planning the reverse normalization process

Before you begin, you need to understand what the purpose of inverse normalization is.

For example, the goals can be:

Reduction of response time for certain types of requests,

Ease of understanding and work with data,

Acceleration of reporting or analytical queries (generally specific functions),

Before deciding to denormalize, it is important to fully understand the current database schema. This can be done using DB analysis.

Based on the analysis, determine which tables and relationships are candidates for denormalization. Consider:

Tables with frequent and slow JOIN operations,

Data that are often used together: if certain data is always requested together, denormalizing it can speed up access.

Documentation and planning of the work process naturally follow.

How are things now with the DB scheme?

The most important step in the reverse normalization process: analyzing the current database schema. It is necessary to clearly understand what and how we are going to improve. In practice, it often happens that, due to its advantages, inverse normalization is pushed into places where it is not needed at all.

What is characteristic of bottlenecks?

Queries are taking longer than usual, especially under heavy load, indicating bottlenecks that are slowing down data processing.

Bottlenecks often load the processor, memory or disk space, which can be seen in relation to increased loading of these components. It is possible to analyze with the help of all known programs.

Queries involving multiples JOIN operations on large tables are often bottlenecks.

Excessive normalization

Too many small, specialized tablesthat must be joined to perform common queries may indicate excessive normalization.

The need to write complex queries with many JOIN operations to obtain on simple information may be a sign of over-normalization.

There are many more signs of over-normalization, but if the two above-mentioned signs are present, then this already indicates an excess.

Examples of bottlenecks in the SQL database

Multiple connection (JOIN) of tables

FROM Orders
JOIN Customers ON Orders.CustomerID = Customers.CustomerID
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID
JOIN Products ON OrderDetails.ProductID = Products.ProductID
WHERE Customers.Country = 'Disneyland';

If the tables are highly normalized, multiple JOIN operations can significantly slow down the execution of the query if the tables are large and optimized with the help of indexes.

Excessive information:

Table Orders:

OrderID | CustomerID | OrderDate

Table OrderDetails:

OrderDetailID | OrderID | ProductID | Quantity

Table Products:

ProductID | ProductName | UnitPrice

To get complete information about the order, you need to connect all three tables. This is a very time-consuming process.

Unoptimized queries:

FROM Orders
WHERE CustomerID IN (SELECT CustomerID FROM Customers WHERE Country = 'Germany');

Nested queries can be expensive if they return a large amount of data.

Absence or incorrect use of indices:

FROM Orders
WHERE OrderDate BETWEEN '2023-01-01' AND '2023-12-31';

If on a column OrderDate no index, each query will perform a full table scan, which can be very slow for large tables.

Implementation of changes

Always start with full backup of your database All changes should first be tested in an isolated environment that will not affect the production database.

The denormalization plan should be clearly documented, including the objectives, the tables and relationships selected, and the expected results.

Implementation in sql database in examples

A simple combination of tables

You have two tables Orders and OrderDetails, which are often used together. We can reduce the number JOIN operations by combining the data of one table.

-- новую таблаяу, объединяющую Orders и OrderDetails
CREATE TABLE CombinedOrders AS
SELECT Orders.OrderID, Orders.CustomerID, Orders.OrderDate, OrderDetails.ProductID, OrderDetails.Quantity
FROM Orders
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID;

Adding calculated columns

For example, we often need to calculate the order total. You can speed up this process using a calculation column:

-- Добавляем столбец TotalAmount в таблицу Orders
ALTER TABLE Orders ADD TotalAmount DECIMAL(10,2);

-- Обновляем TotalAmount на основе данных из OrderDetails
SET TotalAmount = (SELECT SUM(UnitPrice * Quantity) FROM OrderDetails WHERE Orders.OrderID = OrderDetails.OrderID);

Denormalization of many-to-many connections

There is a complex many-to-many relationship between Products and Categories. We can simplify queries by creating a denormalized table:

From the beginning:

    CourseID INT,
    CourseName VARCHAR(100)

    StudentID INT,
    StudentName VARCHAR(100)

-- таблица для связи многие-ко-многим
CREATE TABLE CourseStudents (
    CourseID INT,
    StudentID INT


-- добавление столбца с информацией о курсах непосредственно в таблицу студентов
ALTER TABLE Students ADD Courses VARCHAR(255);

-- обновление данных с информацией о курсах для каждого студента
UPDATE Students
SET Courses = (SELECT STRING_AGG(CourseName, ', ') FROM Courses JOIN CourseStudents ON Courses.CourseID = CourseStudents.CourseID WHERE CourseStudents.StudentID = Students.StudentID);

One-to-many denormalization

-- добавляем столбец 'TotalOrderAmount' в таблицу 'Orders'
ADD TotalOrderAmount DECIMAL(10,2);

-- обновляем 'TotalOrderAmount' суммой всех позиций заказа
SET TotalOrderAmount = (SELECT SUM(UnitPrice * Quantity) FROM OrderDetails WHERE OrderDetails.OrderID = Orders.OrderID);

Use of materialized representations

It is necessary to perform a complex analytical query frequently. You can speed up reading by creating a materialized play:

-- материализованное представление для сложного запроса
SELECT ProductID, SUM(Quantity) AS TotalSold, AVG(UnitPrice) AS AveragePrice
FROM OrderDetails

Caching complex requests

There are complex requests that often do not change, but require a lot of time to complete.

You can speed up access to these requests by caching them in a separate table:

-- таблица для кэширования результатов сложного запроса
CREATE TABLE CachedComplexQuery AS
SELECT ComplexQueryColumn1, ComplexQueryColumn2, ...
FROM ...
WHERE ...;

-- Обновляем кэш при необходимости
TRUNCATE TABLE CachedComplexQuery;
INSERT INTO CachedComplexQuery (ComplexQueryColumn1, ComplexQueryColumn2, ...)
SELECT ComplexQueryColumn1, ComplexQueryColumn2, ...
FROM ...
WHERE ...;

Denormalization using triggers

It is necessary to automatically update the denormalized data when the source tables change. We can ensure the relevance of denormalized data using triggers:

-- функцию триггера для обновления TotalAmount
  UPDATE Orders
  SET TotalAmount = (SELECT SUM(UnitPrice * Quantity) FROM OrderDetails WHERE OrderDetails.OrderID = NEW.OrderID)
  WHERE Orders.OrderID = NEW.OrderID;
$$ LANGUAGE plpgsql;

--  триггер, который вызывает функцию при изменении OrderDetails
FOR EACH ROW EXECUTE FUNCTION update_total_amount();

Status indicator

-- добавление столбца статуса в таблицу заказов

-- установка статуса на основе определенных условий
SET Status = CASE
    WHEN OrderDate < '2023-01-01' THEN 'Archived'
    ELSE 'Active'

Denormalization of hierarchical data

First of all:

-- таблица сотрудников с ссылкой на менеджера
CREATE TABLE Employees (
    EmployeeID INT,
    EmployeeName VARCHAR(100),
    ManagerID INT


-- добавление имени менеджера непосредственно в таблицу сотрудников
ALTER TABLE Employees ADD ManagerName VARCHAR(100);

-- обновление данных с именами менеджеров
UPDATE Employees e
SET ManagerName = (SELECT EmployeeName FROM Employees WHERE EmployeeID = e.ManagerID);

When denormalization becomes your enemy

When data is duplicated, the probability of errors increases exponentially. One wrong update and your data takes on a life of its own. Maintaining synchronization between denormalized data is like trying to arrange a meeting with five friends who all have different schedules.

Denormalization speeds up reading, but what about writing? Each update can become very slow. More data = more storage and processing costs.

With each new denormalized element, your system becomes more and more confusing. Short-term benefits can cause long-term problems.

Denormalization can make scaling difficult. Your database begins to resemble a large truck, which is difficult to return to a new path.

Denormalization can improve performance by reducing the number of connections and simplifying queries. However, it is important to remember that denormalization entails trade-offs, including increasing data redundancy and potentially complicating the data update process. Therefore, the decision on denormalization must be justified.

Related posts