How select_related and prefetch_related work in Django

How select_related and prefetch_related work in Django

Often the definition of the difference between select_related and prefetch_related sounds like “the first for ForeignKey fields, the second for ManyToMany”, but this description does not reveal the essence of how these methods work. Below I have tried to show the difference between these methods and what effect they have on the generated SQL to retrieve the data with examples.

TLDR: The article will primarily be useful to those who are starting their acquaintance with Django, as well as to those who use select_related/prefetch_related in their daily work, but have not delved into the depths of Django.

Before we go any further, a little description of the models that will be used for the code examples
from django.db import models


class Department(models.Model):
    name = models.CharField(max_length=64)
    description = models.TextField()


class Country(models.Model):
    name = models.CharField(max_length=64)


class City(models.Model):
    name = models.CharField(max_length=64)
    country = models.ForeignKey(Country, on_delete=models.CASCADE,
                                related_name="cities")

class Employee(models.Model):
    first_name = models.CharField(max_length=64)
    last_name = models.CharField(max_length=64)

    department = models.ForeignKey(Department, on_delete=models.CASCADE,
                                   related_name="employees")
    email = models.EmailField()
    city = models.ForeignKey(City, on_delete=models.CASCADE,
                             related_name="employees")

What problem do select_related/prefetch_related solve

By default, Django does not load related objects with the main request, but instead uses lazy loading and defers the request to the database until the related objects are accessed.
This approach simplifies the work with linked objects, but can lead to an N + 1 problem, when each linked entity generates an additional database query.

For example, we want to display a list of employees and the department in which they work:

for employee in Employee.objects.all():
    print(employee.id, employee.first_name, employee.last_name, employee.department.name)

In this case, a request to receive records from the employee table will be executed, and a request to receive his department will be additionally generated for each employee.

Fixing a problem with select_related

To solve this problem we can use select_related. The select_related method loads related objects using a JOIN.

What the request will look like:

for employee in Employee.objects.all().select_related("department"):
    print(employee.id, employee.first_name, employee.last_name, employee.department.name)

The SQL query in this example will look something like this:

SELECT "employee_employee"."id",
       "employee_employee"."first_name",
       "employee_employee"."last_name",
       "employee_employee"."department_id",
       "employee_employee"."email",
       "employee_employee"."city_id",
       "employee_department"."id",
       "employee_department"."name",
       "employee_department"."description"
FROM "employee_employee"
INNER JOIN "employee_department" ON ("employee_employee"."department_id" = "employee_department"."id")

Now, when referring to the department attribute, an additional request to the database will not be created.

As parameters select_related accepts the names of ForeignKey/OneToOne fields or related_name of the OneToOne field in the related table. You can also pass field names in tables linked by a foreign key relationship, for example:

Employee.objects.all().select_related("city", "city__country")
# или вот так
Employee.objects.all().select_related("city").select_related("city__country")
# или вот так
Employee.objects.all().select_related("city__country")

In this case, the data for the city field will also be loaded.

None can be passed as a parameter to clear the list of related entities to be loaded with the main request. For example, in this case, together with the main request, data will be loaded only from the City model:

Employee.objects.all()
    .select_related("city", "city__country")
    .select_related(None)
    .select_related("city")

Minus select_related

Although select_related is used for optimization, using select_related can also slow down the query. select_related uses JOIN to load data, if there are many records in the main table and they refer to the same data in the related table, the data will be repeated in the resulting table.

For example, the result of such a query:

Employee.objects.all().select_related("city", "city__country")

Maybe this is the table:

The values ​​in the department__name column will be constantly repeated. The more data we load in this way, the more likely it is that the database will need additional memory to process the query.

An alternative approach is prefetch_related

Unlike select_related, prefetch_related fetches related objects with a separate request for each field passed as a parameter and performs object binding inside python.

This approach allows you to load objects for ManyToMany fields and records that refer to our table through the ForeignKey field using related_name.

However, prefetch_related can also be used where we use select_related to fetch related records using an additional query instead of a JOIN.

Using prefetch_related, you can speed up the execution of the request from the previous example:

Employee.objects.all().prefetch_related("city", "city__country")

The result will be the following SQL:

SELECT "employee_employee"."id",
       "employee_employee"."first_name",
       "employee_employee"."last_name",
       "employee_employee"."department_id",
       "employee_employee"."email",
       "employee_employee"."city_id"
FROM "employee_employee"

SELECT "employee_city"."id",
       "employee_city"."name",
       "employee_city"."country_id"
FROM "employee_city"
WHERE "employee_city"."id" IN (22, 23, 25, 26, 27, 28)

SELECT "employee_country"."id",
       "employee_country"."name"
FROM "employee_country"
WHERE "employee_country"."id" IN (4, 5, 6)

This resulted in 3 queries: 1 to retrieve data from the base model, one to retrieve data from the City model, and one to retrieve data from the Country table.

Even though 3 queries were executed, the total time was less than the execution time using select_related.

An example with loading objects that refer to the main model: for each department we display a list of employees, we use prefetch_related for optimization:

for department in Department.objects.all().prefetch_related("employees"):
    print(department.name, "":")
    for employee in department.employees.all():
        print("    ", employee.first_name, employee.last_name)

Generated SQL:

SELECT "employee_department"."id",
       "employee_department"."name",
       "employee_department"."description"
FROM "employee_department"

SELECT "employee_employee"."id",
       "employee_employee"."first_name",
       "employee_employee"."last_name",
       "employee_employee"."department_id",
       "employee_employee"."email",
       "employee_employee"."city_id"
FROM "employee_employee"
WHERE "employee_employee"."department_id" IN (13, 14, 15, 16)

An interesting point, in the case of such a combined QuerySet, as a result, 2 requests to the database will be executed, since Django will determine that the objects for the city relation were loaded via select_related:

Employee.objects.all().select_related("city").prefetch_related("city__country")
SELECT "employee_employee"."id",
       "employee_employee"."first_name",
       "employee_employee"."last_name",
       "employee_employee"."department_id",
       "employee_employee"."email",
       "employee_employee"."city_id",
       "employee_city"."id",
       "employee_city"."name",
       "employee_city"."country_id"
FROM "employee_employee"
INNER JOIN "employee_city" ON ("employee_employee"."city_id" = "employee_city"."id")

SELECT "employee_country"."id",
       "employee_country"."name"
FROM "employee_country"
WHERE "employee_country"."id" IN (4, 5, 6)

Similarly to select_related, the list of relations to load can be cleared:

Employee.objects.all().prefetch_related("city", "city__country").prefetch_related(None)

Additional optimizations, Prefetch object

Sometimes we want to do additional customization of the objects we want to load using prefetch_related. For such cases, Django allows us to pass a special Prefetch object as a prefetch_related parameter.

The Prefetch object receives 3 parameters as input:

  • lookup for finding a relationship, similar to the string we pass if we do not use the Prefetch object;

  • queryset , an optional parameter to set the QuerySet that will be used to load related objects;

  • to_attr, an optional parameter that can be used to change the field into which the related objects will be loaded, objects loaded in this way will be collected into a list rather than a QuerySet object.

The main restriction on the transferred queryset is the ban on using the values ​​​​and values_list methods, because in this case the resulting object will cease to be an instance of the model, but you can apply filters and annotations as to a regular queryset.

As an example, let’s load a list of employees for each department who live in a certain country and use the annotation to add information about the city, and also exclude the email field, which we do not use in our script:

employees = Employee.objects.defer("email")
    .filter(city__country_id=4)
    .annotate(city_name=F("city__name"))

queryset = Department.objects.all().prefetch_related(Prefetch("employees", queryset=employees))

for department in queryset:
    print(department.name, ":")
    for employee in department.employees.all():
        print("    ", employee.first_name, employee.last_name, employee.city_name)
SELECT "employee_department"."id",
       "employee_department"."name",
       "employee_department"."description"
FROM "employee_department"

SELECT "employee_employee"."id",
       "employee_employee"."first_name",
       "employee_employee"."last_name",
       "employee_employee"."department_id",
       "employee_employee"."city_id",
       "employee_city"."name" AS "city_name"
FROM "employee_employee"
INNER JOIN "employee_city" ON ("employee_employee"."city_id" = "employee_city"."id")
WHERE ("employee_city"."country_id" = 4
       AND "employee_employee"."department_id" IN (13, 14, 15, 16))

How to accidentally break optimization

When using prefetch_related, we need to remember that we are working with QuerySet objects to avoid unexpected behavior.

Filtering loaded objects will invalidate the cache at the QuerySet object level, and therefore result in an additional query.

In the example below, an additional query will be performed for each department object:

for department in Department.objects.all().prefetch_related("employees"):
    print(department.name, ":")
    for employee in department.employees.filter(city__country_id=4):
        print("    ", employee.first_name, employee.last_name)

Since in the case of prefetch_related object binding occurs at the python level, excluding a field containing a reference to the underlying model using the only/defer methods will result in additional requests to retrieve that field for each fetched object at binding time yawning

For example, in the case of such a script:

employees = Employee.objects.only("first_name", "last_name")
    .filter(city__country_id=4)
    .annotate(city_name=F("city__name"))

queryset = Department.objects.all().prefetch_related(Prefetch("employees", queryset=employees))

for department in queryset:
    print(department.name, ":")
    for employee in department.employees.all():
        print("    ", employee.first_name, employee.last_name, employee.city_name)

2 requests will be made to load the Department and Employee objects

SELECT "employee_department"."id",
       "employee_department"."name",
       "employee_department"."description"
FROM "employee_department"

SELECT "employee_employee"."id",
       "employee_employee"."first_name",
       "employee_employee"."last_name",
       "employee_city"."name" AS "city_name"
FROM "employee_employee"
INNER JOIN "employee_city" ON ("employee_employee"."city_id" = "employee_city"."id")
WHERE ("employee_city"."country_id" = 4
       AND "employee_employee"."department_id" IN (13, 14, 15, 16))

and many identical queries to get the department_id:

SELECT "employee_employee"."id",
       "employee_employee"."department_id"
FROM "employee_employee"
WHERE "employee_employee"."id" = 63265
LIMIT 21

Conclusion

In conclusion, I would like to say that the examples described above cover the main use cases of select_related/prefetch_related.

A more detailed description of the Prefetch methods and object can be found in the documentation:

https://docs.djangoproject.com/en/4.2/ref/models/querysets/#select-related

https://docs.djangoproject.com/en/4.2/ref/models/querysets/#prefetch-related

https://docs.djangoproject.com/en/4.2/ref/models/querysets/#prefetch-objects

Related posts