Pyrrhic victory of Domain-Driven Design

TL;DR: DDD inevitably leads to an excessive number of sagas in a project (an order of magnitude more than the minimum necessary), which, in turn, inevitably leads to violations of data integrity in the database.

DDD quite successfully solves its stated task: to give developers tools that let them cope with a complex domain (implement and maintain it correctly). But this victory turned out to be Pyrrhic: the tools that ensure the correctness of data in memory turned out to be unable to guarantee the correctness of that data in the database. And what is the use of initially correct data in memory if, over time (after being saved to the database and read back), it ceases to be correct? In essence, DDD has a fatal flaw: it inevitably leads to violations of data integrity (of business logic invariants) in the database.

Here “inevitably” is used in the same sense as in the familiar thesis “Big Ball of Mud inevitably makes further development of a project impossible”. Many Big Ball of Mud projects either do not survive long enough to face an unacceptable increase in the cost of making changes, or manage to implement all the necessary functionality before that point and need no further development. For them the “inevitable” never happens – they simply run out of time for it. But it does not cease to be inevitable…

The same goes for DDD: many projects may never encounter data integrity violations in the database. The essence of the problem is that implementing any new feature, or any change in business requirements, in any DDD project may lead, some time later, to the production database containing data that is incorrect from the business logic point of view. This time bomb is embedded in the very essence of DDD, so whether or not your DDD project blows up on it is purely a matter of luck: your project will either be lucky or it won’t (and even if it is lucky today, everything can change tomorrow).

By the way, the DDD leaders (Eric Evans, Vaughn Vernon, Udi Dahan, Greg Young…) are aware of this problem. In Vaughn Vernon’s 2011 paper “Effective Aggregate Design Part II: Making Aggregates Work Together” (which they all reviewed), it is described as “pending intervention”: if the next step of a chain cannot be completed, and you at least managed to notice that, then you have to ask a human to fix the problem in the production database by hand. Why this problem will inevitably arise in any DDD project is what I will explain now.

Almost everything in DDD is great, except for one tactical pattern: defining transaction boundaries by aggregate (an aggregate as the unit of transactional consistency). (Unfortunately, this is a key element of the entire “tactical” part of DDD, so the only way to avoid this pattern is to limit your project to “strategic DDD” and abandon “tactical DDD” entirely.)

DDD recommends that when an aggregate is changed, it must be saved to the database in its entirety, in one transaction. Developers tend to care a lot about data integrity, so this recommendation naturally leads to most of the project’s data being “glued together” into one giant aggregate. Unfortunately, that does not work (the infrastructure cannot cope): slowdowns and mass transaction aborts begin. So, for this approach to work (without creating performance problems and frequent conflicts between transactions), a rather strict recommendation is added: make aggregates as small as possible. Evans’ recommendation: ask the business whether it needs immediate consistency for a given invariant or whether eventual consistency is acceptable, and if the business is happy with the latter (hint: more often than not it is!), then extract part of the large aggregate into separate aggregate(s) and reconcile them with each other with some delay, through domain events. And so that following this recommendation does not lead to the other extreme (where almost every aggregate consists of a single entity), Evans added one more: ask yourself whether ensuring consistency between these parts of the aggregate is the responsibility of the user who invoked the current operation (use case). If so, keep these parts in one aggregate; if not (it is the responsibility of someone else, or of the system itself), then split them and use eventual consistency. Unfortunately, although this last recommendation eased the overall situation a little, it did not fundamentally change anything: it is the business that determines how data is divided between aggregates and where eventual consistency and sagas have to be used – i.e. whether we have sagas, and how many, is (almost) outside the developers’ control.
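A minimal Go sketch of this trade-off (all names – Warehouse, Order, OrderPlaced – are hypothetical): the first variant protects the invariant inside a single aggregate and a single transaction; the second splits it into two aggregates reconciled through a domain event, i.e. with eventual consistency.

```go
package ddd

import "errors"

var ErrNotEnoughStock = errors.New("not enough stock")

// Variant 1: one big aggregate. The invariant "Reserved <= Stock" is
// checked in memory and the whole aggregate is saved in one transaction.
type Warehouse struct {
	Stock    int
	Reserved int
}

func (w *Warehouse) Reserve(qty int) error {
	if w.Reserved+qty > w.Stock {
		return ErrNotEnoughStock
	}
	w.Reserved += qty // invariant holds; aggregate is persisted atomically
	return nil
}

// Variant 2: per the DDD recommendation, the aggregate is split in two.
// Order commits in its own transaction and emits a domain event;
// Warehouse reacts to it later, so the invariant is only eventually true.
type Order struct {
	ID  string
	Qty int
}

type OrderPlaced struct { // domain event, delivered via a message bus
	OrderID string
	Qty     int
}

func PlaceOrder(id string, qty int) (Order, OrderPlaced) {
	o := Order{ID: id, Qty: qty}
	return o, OrderPlaced{OrderID: id, Qty: qty} // published after commit
}

// The handler runs in a separate, later transaction. If it fails, the
// already-committed Order has to be compensated -- this is where sagas begin.
func (w *Warehouse) OnOrderPlaced(e OrderPlaced) error {
	if w.Reserved+e.Qty > w.Stock {
		return ErrNotEnoughStock // the Order exists, but stock cannot cover it
	}
	w.Reserved += e.Qty
	return nil
}
```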

This approach inevitably increases the number of places where eventual consistency is applied in a project by orders of magnitude, and many of them will need to be implemented as sagas (long-running transactions). (Granted, a full saga will not always be needed: in some cases the subsequent steps cannot fail in principle, and plain eventual consistency without sagas is enough… The bad news is that whether we can do without sagas is determined by business requirements, not by developers, so sooner or later sagas will be needed – and this may happen in a non-obvious way: a new feature adds a new step to an already existing eventual-consistency chain, and this step, unlike the previous ones, may fail. In such a situation it is easy to miss the need to turn the whole chain of steps into a saga, with compensation logic for all the previous steps.) And the more sagas there are in a project, the more situations there are in which the next step of a saga may fail. A failure requires a rollback/compensation of the previous steps… and here lies the key issue: the logic implementing compensation of saga steps is practically impossible to write and keep correct. Moreover, it is extremely hard even to notice that it is written incorrectly (or has suddenly become incorrect because business logic changed somewhere far away from this place in the code). The reason for this is not technology but people: there is a combinatorial explosion, and our brains simply cannot cope with the task. This, by the way, is the same reason why Big Ball of Mud projects cannot be maintained.
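To make the shape of the problem visible, here is a hand-rolled saga skeleton in Go (a sketch, not a production pattern; all names are hypothetical). The orchestration itself fits in twenty lines – it is the compensate functions, not shown here, that carry all the risk described above.

```go
package saga

import "fmt"

type step struct {
	name       string
	run        func() error
	compensate func() error // must undo run() -- the hard part to get right
}

// Execute runs steps in order; on failure it compensates the completed steps
// in reverse order. Note what it does NOT solve: each compensate() sees a
// database already modified by unrelated transactions, and nothing here
// checks that it is still semantically correct to "undo".
func Execute(steps []step) error {
	done := make([]step, 0, len(steps))
	for _, s := range steps {
		if err := s.run(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				if cerr := done[i].compensate(); cerr != nil {
					// Compensation itself failed: manual intervention
					// in the production database is the only way out.
					return fmt.Errorf("step %q failed (%w), compensation %q also failed: %v",
						s.name, err, done[i].name, cerr)
				}
			}
			return fmt.Errorf("step %q failed: %w", s.name, err)
		}
		done = append(done, s)
	}
	return nil
}
```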

Of course, there are specialized tools (e.g. Temporal and Cadence) that facilitate the implementation of sagas, but they solve only the technical complexity of sagas: automatic retries of steps on transient errors, persisting the current progress of a saga, keeping the logic of all steps in one place in the code, debugging/introspection of a saga. The main difficulty of sagas is not in any of that, but in the need to describe the compensation logic correctly.
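To show which part these tools do take off our shoulders, here is a minimal saga sketch assuming Temporal’s Go SDK (the workflow and activity names – TransferWorkflow, Debit, Credit, Refund – are hypothetical): retries, progress persistence, and having all steps in one place come for free, while the correctness of Refund remains entirely on us.

```go
package transfer

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// Activity stubs; in a real project they would talk to the database.
func Debit(ctx context.Context, acc string, amt int) error  { return nil }
func Credit(ctx context.Context, acc string, amt int) error { return nil }
func Refund(ctx context.Context, acc string, amt int) error { return nil }

func TransferWorkflow(wctx workflow.Context, from, to string, amt int) error {
	wctx = workflow.WithActivityOptions(wctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute, // Temporal retries transient failures for us
	})

	if err := workflow.ExecuteActivity(wctx, Debit, from, amt).Get(wctx, nil); err != nil {
		return err // nothing committed yet, nothing to compensate
	}
	if err := workflow.ExecuteActivity(wctx, Credit, to, amt).Get(wctx, nil); err != nil {
		// Temporal guarantees Refund will be *called* reliably, but whether
		// refunding is still semantically correct at this point is on us.
		if cerr := workflow.ExecuteActivity(wctx, Refund, from, amt).Get(wctx, nil); cerr != nil {
			return cerr // compensation failed too: "pending intervention"
		}
		return err
	}
	return nil
}
```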

Sagas, unlike traditional ACID transactions, do not provide the “I” (isolation) – they are ACD. Because of the lack of isolation, it is extremely difficult to take into account all the possible changes to data in the database that may have happened between the execution of a saga step and the later moment when that step has to be “compensated”. Even if everything was accounted for correctly at the moment the compensation logic was implemented, future changes in other parts of the project may make this logic incorrect. Tracing all possible consequences and interrelations is practically beyond human capacity, so the logic compensating saga steps will inevitably contain errors. And the more sagas, and steps within them, a project has, the more such errors there will be. Every such error means that sooner or later the integrity of the data in the database will be violated. Worse still, in most cases developers will not even know that the compensation logic is broken, because it is practically impossible to test it reliably (the tests would have to take into account all the consequences and interrelations mentioned above, which in practice people are not able to do).
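A minimal illustration of the missing “I” (all names are hypothetical): the compensation is written as an exact mirror of the step, which looks obviously correct – until a concurrent transaction slips in between them.

```go
package anomaly

type Account struct{ Bonus int }

// Saga step: grant 100 bonus points for a placed order.
func grantBonus(a *Account) { a.Bonus += 100 }

// Compensation, written as a mirror image of the step. With ACID isolation
// this would be correct; in a saga it is not: between the step and its
// compensation the customer may spend the points in a concurrent,
// perfectly valid transaction.
func revokeBonus(a *Account) { a.Bonus -= 100 }

// Timeline violating the invariant "Bonus >= 0":
//   Bonus = 0
//   grantBonus          -> 100  (saga step commits)
//   customer spends 70  -> 30   (another transaction, legal at that moment)
//   revokeBonus         -> -70  (compensation commits an impossible state)
```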

I consider this problem a fatal flaw for a simple reason: DDD puts enormous effort into making the code correctly maintain the invariants of the business logic (you could even say this is the main task of DDD)… and it copes with it quite well while the model is in memory, but it fails at this task for the model in the database. And for a business, the correctness of data in the database is obviously far more important than in memory.

You can, of course, object that a real business almost never has guarantees that all of its data is correct, so it is used to this situation and knows how to cope with it. Therefore, rare violations of data correctness in the production database (categorically unacceptable to a perfectionist programmer) may be quite acceptable to the business. And for a business, the benefits of DDD may outweigh these problems. All of this is true, but the problem does not thereby cease to be a fatal flaw.

It may seem that I am exaggerating and this problem is actually not that serious, because you have not encountered it in your DDD projects. But in reality the reasons for that are usually different:

  • A fairly simple microservice (Bounded Context), in which you can easily do without tactical DDD. If it has almost no interconnected aggregates, then it has no extra eventual consistency and, all the more, no extra sagas.

  • You use large aggregates (against DDD recommendations), and either:

    • this does not cause problems, because the load is quite low or there is almost no concurrent access to the aggregates (e.g., a given aggregate is changed only by a single owning user, manually, through the UI);

    • or it does cause problems, but they are ignored (or not even known about, due to insufficient monitoring).

  • Support for compensations is not implemented at all (up to the point where plain eventual consistency is used where there should be a saga), or is implemented in an oversimplified (not entirely correct) way. As long as everything runs along the happy path, this causes no problems. And when something does break, it is not a given that anyone will find out quickly (again, a question of monitoring). It will look like this (many months after the problem occurred): “oh look, there’s some garbage here in the database… this shouldn’t be possible in principle… there must be a bug somewhere… but we’ll never find out who corrupted the data months ago and why, so to hell with it!”

How is this problem usually dealt with in non-DDD projects? If we take a typical microservices project (in which the connections between microservices are designed according to strategic DDD patterns, but the implementation of the microservices themselves does not follow tactical DDD patterns), then such projects are usually designed so as to reduce the need for sagas to an absolute minimum (ideally, to zero). To achieve this, in such projects the transaction boundary is drawn around a microservice (Bounded Context), not around an aggregate.
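For contrast, a sketch of such a non-DDD approach in Go (the schema and names are hypothetical, placeholders assume PostgreSQL): two “aggregates” living in one Bounded Context are changed in a single local ACID transaction, so no saga or compensation logic is needed.

```go
package orders

import (
	"context"
	"database/sql"
)

// PlaceOrder updates two "aggregates" (orders and stock) that belong to the
// same Bounded Context in one local ACID transaction: the database itself
// guarantees the invariant "reserved <= total".
func PlaceOrder(ctx context.Context, db *sql.DB, orderID, sku string, qty int) (err error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			_ = tx.Rollback() // any failure aborts the whole change atomically
		}
	}()

	res, err := tx.ExecContext(ctx,
		`UPDATE stock SET reserved = reserved + $1
		  WHERE sku = $2 AND reserved + $1 <= total`, qty, sku)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return sql.ErrNoRows // not enough stock: nothing gets committed
	}
	if _, err = tx.ExecContext(ctx,
		`INSERT INTO orders (id, sku, qty) VALUES ($1, $2, $3)`,
		orderID, sku, qty); err != nil {
		return err
	}
	return tx.Commit()
}
```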
