Reverse normalization (denormalization)
Normalization is the process of organizing data in a database to minimize redundancy and dependency. It helps us avoid anomalies when inserting, updating, or deleting data and, of course, simplifies the database schema.
But like everything in this world, normalization is not perfect. Sometimes it leads to excessive complexity and reduced query performance. This is where denormalization becomes useful. If normalization is about ordering, denormalization is a deliberate "relaxation" of the rules to achieve specific goals. In a database this means merging tables, adding redundant data, and so on, in order to speed up data reading and simplify queries, even at the cost of some of the schema's "cleanliness".
Denormalization is not about abandoning all principles of order; it is the ability to find a balance between an ideal normalized schema and the real needs of performance and ease of working with data. It speeds up reads by reducing the number of required joins and operations.
Denormalization also makes queries more straightforward and understandable.
- 1 How are things now with the DB schema?
- 2 Implementation of changes
- 2.1 Implementation in an SQL database, with examples
- 2.1.1 A simple combination of tables
- 2.1.2 Adding calculated columns
- 2.1.3 Denormalization of many-to-many relationships
- 2.1.4 One-to-many denormalization
- 2.1.5 Use of materialized views
- 2.1.6 Caching complex queries
- 2.1.7 Denormalization using triggers
- 2.1.8 Status indicator
- 2.1.9 Denormalization of hierarchical data
- 2.1.10 When denormalization becomes your enemy
Planning the denormalization process
Before you begin, you need to understand what denormalization should achieve. For example, the goals might be:
- Reducing response time for certain types of queries,
- Making the data easier to understand and work with,
- Speeding up reporting or analytical queries (usually for specific functions).
Before deciding to denormalize, it is important to fully understand the current database schema; this is what schema analysis is for.
Based on that analysis, determine which tables and relationships are candidates for denormalization. Consider:
- Tables involved in frequent and slow JOIN operations,
- Data that is often used together: if certain data is always requested together, denormalizing it can speed up access.
Documenting and planning the work naturally follow.
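To find the "frequent and slow" queries in the first place, the database's own statistics can help. A sketch, assuming PostgreSQL with the pg_stat_statements extension enabled (the column names below exist as of PostgreSQL 13; older versions use `total_time`/`mean_time` instead):

```sql
-- Top 10 queries by total time spent, with call counts
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Queries that appear here with many calls and large mean times are the natural denormalization candidates.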
How are things now with the DB schema?
The most important step in the denormalization process is analyzing the current database schema. You need a clear picture of what you are going to improve and how. In practice, denormalization is often pushed, because of its advantages, into places where it is not needed at all.
What characterizes bottlenecks?
Queries take longer than usual, especially under heavy load. Bottlenecks also tend to load the CPU, memory, or disk, which shows up as increased utilization of these components in any common monitoring tool.
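For a single suspect query, the planner's own output is usually the first tool to reach for. A sketch, assuming PostgreSQL and the Orders/OrderDetails tables used in the examples in this article:

```sql
-- Show the actual plan, timings, and buffer usage for a suspect join
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM Orders
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID;
```

Sequential scans over large tables and expensive hash or merge join nodes in this output are exactly the symptoms the following sections address.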
Queries involving multiple JOIN operations on large tables are frequent bottlenecks. Too many small, specialized tables that must be joined to perform common queries may indicate excessive normalization.
The need to write complex queries with many JOIN operations just to obtain simple information can also be a sign of over-normalization.
There are many more signs of over-normalization, but if the two signs above are present, that alone already indicates an excess.
Examples of bottlenecks in an SQL database
Multiple table JOINs:
SELECT *
FROM Orders
JOIN Customers ON Orders.CustomerID = Customers.CustomerID
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID
JOIN Products ON OrderDetails.ProductID = Products.ProductID
WHERE Customers.Country = 'Disneyland';
If the tables are highly normalized, multiple JOIN operations can significantly slow down query execution, especially when the tables are large and not well covered by indexes.
Orders: OrderID | CustomerID | OrderDate
OrderDetails: OrderDetailID | OrderID | ProductID | Quantity
Products: ProductID | ProductName | UnitPrice
To get complete information about an order, you have to join all three of these tables, which is time-consuming.
Nested subqueries:
SELECT * FROM Orders
WHERE CustomerID IN (SELECT CustomerID FROM Customers WHERE Country = 'Germany');
Nested queries can be expensive if the subquery returns a large amount of data.
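The same filter can often be rewritten as a JOIN or an EXISTS, which optimizers frequently handle better than IN over a large subquery result; a sketch using the same tables:

```sql
-- Join formulation (CustomerID is assumed unique in Customers)
SELECT o.*
FROM Orders o
JOIN Customers c ON o.CustomerID = c.CustomerID
WHERE c.Country = 'Germany';

-- Equivalent EXISTS formulation
SELECT o.*
FROM Orders o
WHERE EXISTS (SELECT 1 FROM Customers c
              WHERE c.CustomerID = o.CustomerID AND c.Country = 'Germany');
```

Trying such rewrites is worth doing before concluding that the schema itself needs denormalization.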
Absence or incorrect use of indexes:
SELECT * FROM Orders
WHERE OrderDate BETWEEN '2023-01-01' AND '2023-12-31';
If there is no index on the OrderDate column, every such query performs a full table scan, which can be very slow on large tables.
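Before reaching for denormalization, the cheaper fix for this particular bottleneck is simply to add the missing index (the index name below is arbitrary):

```sql
-- Lets range queries on OrderDate use an index scan instead of a full table scan
CREATE INDEX idx_orders_orderdate ON Orders (OrderDate);
```

Only when indexing and query rewrites are not enough does restructuring the schema become justified.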
Implementation of changes
Always start with a full backup of your database. All changes should first be tested in an isolated environment that cannot affect the production database.
The denormalization plan should be clearly documented, including the objectives, the tables and relationships selected, and the expected results.
Implementation in an SQL database, with examples
A simple combination of tables
You have two tables, Orders and OrderDetails, that are often used together. We can reduce the number of JOIN operations by combining their data into one table:
-- Create a new table combining Orders and OrderDetails
CREATE TABLE CombinedOrders AS
SELECT Orders.OrderID, Orders.CustomerID, Orders.OrderDate, OrderDetails.ProductID, OrderDetails.Quantity
FROM Orders
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID;
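With the combined table in place, a query that previously needed a join becomes a read from a single table (the customer ID here is an arbitrary example value):

```sql
-- All order lines for one customer, no JOIN required
SELECT OrderID, OrderDate, ProductID, Quantity
FROM CombinedOrders
WHERE CustomerID = 42;
```

The cost is that CombinedOrders is a copy: it must be rebuilt or kept in sync when Orders or OrderDetails change.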
Adding calculated columns
Suppose we often need to calculate the order total. We can speed this up with a calculated column:
-- Add a TotalAmount column to the Orders table
ALTER TABLE Orders ADD TotalAmount DECIMAL(10,2);
-- Populate TotalAmount from the data in OrderDetails
UPDATE Orders
SET TotalAmount = (SELECT SUM(UnitPrice * Quantity) FROM OrderDetails WHERE Orders.OrderID = OrderDetails.OrderID);
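Once the column is populated, reading an order total no longer requires an aggregate over OrderDetails at query time (the 100.00 threshold is an arbitrary example):

```sql
-- Order totals are now a plain column read
SELECT OrderID, TotalAmount
FROM Orders
WHERE TotalAmount > 100.00;
```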
Denormalization of many-to-many relationships
There is a many-to-many relationship between Students and Courses. We can simplify queries by storing the course list directly on each student. The starting schema:
CREATE TABLE Courses (
    CourseID INT PRIMARY KEY,
    CourseName VARCHAR(100)  -- illustrative column definitions; only the identifiers used below are fixed
);

CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    StudentName VARCHAR(100)
);

-- Junction table for the many-to-many relationship
CREATE TABLE CourseStudents (
    CourseID INT REFERENCES Courses(CourseID),
    StudentID INT REFERENCES Students(StudentID),
    PRIMARY KEY (CourseID, StudentID)
);

-- Add a column with course information directly to the Students table
ALTER TABLE Students ADD Courses VARCHAR(255);

-- Populate each student's course list
UPDATE Students
SET Courses = (SELECT STRING_AGG(CourseName, ', ') FROM Courses JOIN CourseStudents ON Courses.CourseID = CourseStudents.CourseID WHERE CourseStudents.StudentID = Students.StudentID);
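With the denormalized column in place, a student's course list can be read without touching the junction table (StudentID 1 is an arbitrary example):

```sql
-- One-row read instead of a two-join aggregate
SELECT StudentID, Courses
FROM Students
WHERE StudentID = 1;
```

The trade-off: the comma-separated string is convenient to display but can no longer be filtered or joined relationally, so queries like "all students taking course X" should still go through CourseStudents.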
One-to-many denormalization
An order has many order lines; we can store the aggregate directly on the order:
-- Add a 'TotalOrderAmount' column to the 'Orders' table
ALTER TABLE Orders
ADD TotalOrderAmount DECIMAL(10,2);
-- Populate 'TotalOrderAmount' with the sum of all the order's line items
UPDATE Orders
SET TotalOrderAmount = (SELECT SUM(UnitPrice * Quantity) FROM OrderDetails WHERE OrderDetails.OrderID = Orders.OrderID);
Use of materialized views
Suppose a complex analytical query has to run frequently. We can speed up reads by creating a materialized view:
-- Materialized view for the complex query
CREATE MATERIALIZED VIEW SalesSummary AS
SELECT ProductID, SUM(Quantity) AS TotalSold, AVG(UnitPrice) AS AveragePrice
FROM OrderDetails
GROUP BY ProductID;
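A materialized view is a snapshot, so it must be refreshed when the underlying data changes. In PostgreSQL:

```sql
-- Rebuild the snapshot (locks the view against reads while running)
REFRESH MATERIALIZED VIEW SalesSummary;

-- Or keep the view readable during the rebuild;
-- this variant requires a unique index on the view
REFRESH MATERIALIZED VIEW CONCURRENTLY SalesSummary;
```

How often to refresh is a business decision: the staler the snapshot may be, the cheaper it is to maintain.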
Caching complex queries
There are complex queries whose results rarely change but take a long time to compute.
You can speed up access by caching their results in a separate table:
-- Table for caching the results of a complex query
CREATE TABLE CachedComplexQuery AS
SELECT ComplexQueryColumn1, ComplexQueryColumn2, ...
-- Refresh the cache when needed
TRUNCATE TABLE CachedComplexQuery;
INSERT INTO CachedComplexQuery (ComplexQueryColumn1, ComplexQueryColumn2, ...)
SELECT ComplexQueryColumn1, ComplexQueryColumn2, ...
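One detail worth handling: between the TRUNCATE and the INSERT the cache table is empty, so readers can observe missing data. In databases where TRUNCATE is transactional (PostgreSQL, for example), wrapping the refresh in a transaction avoids this:

```sql
-- Readers see either the old cache or the new one, never an empty table
BEGIN;
TRUNCATE TABLE CachedComplexQuery;
INSERT INTO CachedComplexQuery (ComplexQueryColumn1, ComplexQueryColumn2, ...)
SELECT ComplexQueryColumn1, ComplexQueryColumn2, ...;
COMMIT;
```

Note that in some other dialects (MySQL, for instance) TRUNCATE implicitly commits, so a DELETE would be needed instead.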
Denormalization using triggers
The denormalized data must be updated automatically whenever the source tables change. We can keep it current using triggers:
-- Trigger function that recalculates TotalAmount for the affected order
CREATE OR REPLACE FUNCTION update_total_amount() RETURNS TRIGGER AS $$
DECLARE
    affected INT;
BEGIN
    -- On DELETE only OLD is available; on INSERT/UPDATE we use NEW
    IF TG_OP = 'DELETE' THEN affected := OLD.OrderID; ELSE affected := NEW.OrderID; END IF;
    UPDATE Orders
    SET TotalAmount = (SELECT COALESCE(SUM(UnitPrice * Quantity), 0)
                       FROM OrderDetails WHERE OrderDetails.OrderID = affected)
    WHERE Orders.OrderID = affected;
    RETURN NULL; -- the return value of an AFTER row trigger is ignored
END;
$$ LANGUAGE plpgsql;
-- Trigger that calls the function whenever OrderDetails changes
CREATE TRIGGER UpdateTotalAmount AFTER INSERT OR UPDATE OR DELETE ON OrderDetails
FOR EACH ROW EXECUTE FUNCTION update_total_amount();
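A quick sanity check of the trigger (all IDs and prices here are hypothetical, and it assumes an order with OrderID = 100 already exists with no other lines; UnitPrice is included because the aggregate above reads it from OrderDetails):

```sql
INSERT INTO OrderDetails (OrderDetailID, OrderID, ProductID, Quantity, UnitPrice)
VALUES (1, 100, 7, 2, 9.99);

-- Under the assumptions above, TotalAmount should now be 2 * 9.99 = 19.98
SELECT TotalAmount FROM Orders WHERE OrderID = 100;
```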
Status indicator
-- Add a status column to the Orders table
ALTER TABLE Orders ADD Status VARCHAR(50);
-- Set the status based on specific conditions
UPDATE Orders
SET Status = CASE
    WHEN OrderDate < '2023-01-01' THEN 'Archived'
    ELSE 'Active'  -- example default branch
END;
Denormalization of hierarchical data
The starting schema:
-- Employees table with a self-reference to the manager
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    EmployeeName VARCHAR(100),
    ManagerID INT REFERENCES Employees(EmployeeID)
);
-- Add the manager's name directly to the Employees table
ALTER TABLE Employees ADD ManagerName VARCHAR(100);
-- Populate the manager names
UPDATE Employees e
SET ManagerName = (SELECT EmployeeName FROM Employees WHERE EmployeeID = e.ManagerID);
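For contrast, without the denormalized column even the simple question "who is each employee's manager?" needs a self-join, and walking a whole management chain needs a recursive query; a sketch:

```sql
-- The self-join that ManagerName now replaces
SELECT e.EmployeeName, m.EmployeeName AS ManagerName
FROM Employees e
LEFT JOIN Employees m ON m.EmployeeID = e.ManagerID;

-- Walking the full hierarchy with a recursive CTE
WITH RECURSIVE chain AS (
    SELECT EmployeeID, EmployeeName, ManagerID, 1 AS depth
    FROM Employees
    WHERE ManagerID IS NULL          -- start from the top of the hierarchy
    UNION ALL
    SELECT e.EmployeeID, e.EmployeeName, e.ManagerID, c.depth + 1
    FROM Employees e
    JOIN chain c ON e.ManagerID = c.EmployeeID
)
SELECT * FROM chain;
```

Denormalization only removes the first of these; deeper traversals still need the recursive form.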
When denormalization becomes your enemy
When data is duplicated, the probability of errors rises sharply. One wrong update and your data takes on a life of its own. Keeping denormalized copies in sync is like trying to arrange a meeting with five friends who all have different schedules.
Denormalization speeds up reading, but what about writing? Every update may have to touch several copies of the same fact, and more data means more storage and processing costs.
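At minimum, drift between denormalized copies and their source can be detected mechanically. For the TotalAmount example above, a periodic consistency check might look like this (IS DISTINCT FROM is the PostgreSQL/standard-SQL NULL-safe comparison; other dialects spell it differently):

```sql
-- Orders whose cached total no longer matches the recomputed value
SELECT o.OrderID, o.TotalAmount,
       COALESCE(SUM(d.UnitPrice * d.Quantity), 0) AS Recomputed
FROM Orders o
LEFT JOIN OrderDetails d ON d.OrderID = o.OrderID
GROUP BY o.OrderID, o.TotalAmount
HAVING o.TotalAmount IS DISTINCT FROM COALESCE(SUM(d.UnitPrice * d.Quantity), 0);
```

Any rows returned are exactly the "data with a life of its own" this section warns about.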
With each new denormalized element, your system becomes harder and harder to reason about. Short-term benefits can cause long-term problems.
Denormalization can also make scaling difficult: your database begins to resemble a large truck that is hard to steer onto a new road.
Denormalization can improve performance by reducing the number of joins and simplifying queries. However, it entails trade-offs, including increased data redundancy and a potentially more complicated update process. The decision to denormalize must therefore always be justified.