Long ago (1987) some industry thought leaders, like Hector Garcia-Molina, realized that transactions are not a suitable way of maintaining consistency of long-running processes. Despite this, reliance on transactions for the consistency of writes done across multiple processes over long timeframes continued, with technologies such as Microsoft Distributed Transaction Coordinator or XA implementing the two-phase commit protocol.
The most sophisticated (ab)use of distributed transaction technology was probably the WS-AtomicTransaction spec from 2004, which stood out as too enterprisey even compared to the other WS-* specifications.
This article proposes an HTTP-based protocol that can be used to ensure a remote invocation is executed exactly once. It is an adaptation of the token-based deduplication idea described previously.
One of the most common system integration scenarios is executing a function remotely and fetching the result. It is easy if the target function is pure (does not have any side effects). But what if it does?
The basic building blocks of the HTTP protocol and REST approach are not enough to guarantee exactly-once execution in such a scenario.
Side effects

In the previous post we introduced the token-based deduplication approach. It inverts the traditional principle of deduplication. Instead of dropping a message if another copy of that same message is known to have been processed, the token-based approach drops a message if there is no token for processing it. In other words, it uses negative (the token does not exist) rather than positive (processing information exists) proof of duplication.
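To make the inversion concrete, here is a minimal sketch, assuming a hypothetical token_store whose delete operation is atomic and reports whether the token existed (as, for example, a SQL DELETE’s row count does):

```python
# A minimal sketch of token-based deduplication. `token_store` is a
# hypothetical key-value store with an atomic delete that returns
# whether the key existed (e.g. the row count of a SQL DELETE).

def handle(message, token_store, process):
    # Try to consume the processing token. At most one concurrent
    # handler succeeds because the delete is atomic.
    token_existed = token_store.delete(message.token_id)

    if not token_existed:
        # Negative proof of duplication: no token means a copy of this
        # message has already been processed, so drop it.
        return

    # In a real system the token would be consumed in the same
    # transaction that applies the side effects of processing.
    process(message)
```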
One problem left unsolved by our previous attempts at designing a deduplication solution is the non-deterministic nature of data eviction. We know we can’t keep the deduplication data forever, but when can we safely delete it? Unfortunately, there is no good answer: the longer we keep the data, the less likely we are to miss a duplicate message. Fortunately, there is a way around the problem.
When it’s gone, it’s gone

So far our algorithms have depended on the existence of information to be able to discard a duplicate message.
In one of the previous posts we introduced the Outbox pattern. The Outbox implements the consistent messaging idea by storing the ID of the incoming message and the collection of outgoing messages in the outbox records inside the application database. The correctness of the Outbox behavior depends on the ability to tap into the application state change transaction. The big advantage of this pattern is its relative simplicity, compared to alternative solutions.
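As an illustration, here is a minimal sketch of the pattern; the db and broker APIs and the business_logic function are assumptions made up for this example:

```python
# A minimal sketch of the Outbox pattern. `db`, `broker` and
# `business_logic` are hypothetical; the key point is that the incoming
# message ID, the new state and the outgoing messages are all stored in
# one application database transaction.

def handle(message, db, broker):
    with db.transaction() as tx:
        record = tx.find_outbox_record(message.id)
        if record is None:
            state = tx.load_state(message.business_id)
            new_state, outgoing = business_logic(state, message)
            tx.save_state(new_state)
            record = tx.create_outbox_record(message.id, outgoing)
        # If the record already exists this is a duplicate: skip the
        # state change and reuse the stored outgoing messages.

    # Dispatch after the commit. A crash here is safe: the next
    # delivery of the same message re-sends from the stored record.
    for outgoing_message in record.undispatched_messages():
        broker.send(outgoing_message)
        db.mark_dispatched(message.id, outgoing_message)
```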
Designing distributed algorithms is a challenging task. It’s not that hard to reason about “happy paths” and show that at least sometimes the algorithm does what we intended. Checking that it always behaves as expected is a completely different story.
By definition, any non-trivial distributed system is concurrent and fails partially. These two elements add up to a combinatorial explosion of possible executions - too many to fit in a single person’s head or to analyze by hand.
Distributed algorithms are difficult. If you find yourself struggling to understand one of them, we assure you – you are not alone. We have spent the last couple of years researching ways to ensure exactly-once message processing in systems that exchange messages in an asynchronous and durable way (a.k.a. message queues) and you know what? We still struggle and make silly mistakes. The reason is that even a very simple distributed algorithm generates a vast number of possible execution paths.
It is widely known that exactly-once message delivery is impossible in a distributed system. But what is exactly-once delivery? To answer this question we first need to ask what we understand as message delivery. This is not an easy task. In real life the receiving system is not a single blob of code. It consists of multiple layers. Is the message considered delivered when all its bytes are read from the network cable?
In the previous post we showed how consistent messaging can be implemented by storing point-in-time state snapshots and using these snapshots for publishing outgoing messages. We discussed some pros and cons of that approach. This time we will focus on an alternative approach, based on storing the outgoing messages before they are dispatched.
Consistent messaging requires the ability to ensure exactly the same side effects (in the form of outgoing messages) each time a copy of a given incoming message is processed.
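One practical consequence is that the re-published messages have to be identical to the originals, including their IDs, so that downstream endpoints can deduplicate them. One way to achieve this (an illustration, not necessarily the exact mechanism used in these posts) is to derive outgoing message IDs deterministically from the incoming message ID:

```python
import uuid

# Deriving outgoing message IDs deterministically: uuid5 always yields
# the same ID for the same (namespace, name) pair, so reprocessing a
# copy of the incoming message produces identically identified output.

def outgoing_message_id(incoming_id: uuid.UUID, index: int) -> uuid.UUID:
    return uuid.uuid5(incoming_id, str(index))
```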
In the previous posts, we described consistent messaging and justified its usefulness in building robust distributed systems. Enough with the theory, it’s high time to show some code! First comes the state-based approach.
Context

State-based consistent messaging comes with two requirements, illustrated in the sketch below:

- “Point-in-time” state availability - it’s possible to restore any past version of the state.
- Deterministic message handling logic - for a given state and input message, every handler execution gives the same result.
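Here is a minimal sketch of how the two requirements work together; the versioned store API is hypothetical, invented for this example:

```python
# A minimal sketch of state-based consistent messaging. `store` is a
# hypothetical versioned state store that remembers which state version
# each processed message observed.

def process(message, store, handler, broker):
    version = store.version_seen_by(message.id)
    if version is not None:
        # Duplicate: restore the point-in-time state the message saw
        # when it was first processed (requirement 1).
        state = store.load_version(version)
    else:
        state = store.load_current()

    # Because the handler is deterministic (requirement 2), the same
    # state and message yield the same new state and outgoing messages.
    new_state, outgoing = handler(state, message)

    store.save(new_state, processed=message.id)  # no-op for a duplicate
    for outgoing_message in outgoing:
        broker.send(outgoing_message)
```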
In the previous post we explained what a messaging infrastructure is. We showed that it is necessarily a distributed thing with parts running in different processes. We’ve seen the trade-offs involved in building the messaging infrastructure and what conditions must be met to guarantee consistent message processing on top of such infrastructure. This time we will explain why it is reasonable to expect that most line-of-business systems require a messaging infrastructure.
The messaging infrastructure is all the components required to exchange messages between parts of the business logic code.
There are two parts to the messaging infrastructure. One is the message broker that manages the message queues (and possibly topics). The other, equally important part consists of all the libraries that run in the same process as the application logic and expose the API for processing messages.
Whenever an application wants to send a message, it calls the in-process part of the messaging infrastructure which, in turn, communicates with the out-of-process part.
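A minimal sketch of that split; the class and method names are assumptions for illustration, not any real library’s API:

```python
import json

class BrokerClient:
    """The out-of-process part: a proxy for the broker, reached over
    the network."""
    def publish(self, queue: str, payload: bytes) -> None:
        ...  # network call to the broker process

class MessagingInfrastructure:
    """The in-process part: the library the application code calls."""
    def __init__(self, broker: BrokerClient):
        self._broker = broker

    def send(self, queue: str, message: dict) -> None:
        # Serialization, headers, retries and so on live here,
        # in the application process.
        self._broker.publish(queue, json.dumps(message).encode("utf-8"))
```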
If there’s one distributed protocol every software engineer knows, it’s Two-Phase Commit, also known as 2PC. Although in use for several decades[1], it’s been in steady decline, mainly due to the lack of support in cloud environments.
For quite some time it was the de-facto standard for building enterprise distributed systems. That said, with the cloud becoming the default deployment model, designers need to learn how to build reliable systems without it.
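As a reminder of what the protocol does, here is a bare-bones sketch of the two phases (a generic illustration, not any specific product’s API):

```python
# A bare-bones illustration of two-phase commit. The coordinator asks
# every participant to prepare; only if all vote yes does it commit.

def two_phase_commit(coordinator_log, participants) -> bool:
    # Phase 1: prepare. Each participant durably readies its part of
    # the work and votes on the outcome.
    votes = [participant.prepare() for participant in participants]

    # Phase 2: commit or abort, after durably logging the decision.
    if all(votes):
        coordinator_log.write("commit")
        for participant in participants:
            participant.commit()
        return True
    coordinator_log.write("abort")
    for participant in participants:
        participant.rollback()
    return False
```

The need for a highly available coordinator with a durable log hints at why managed cloud services rarely support the protocol.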
In the previous post we talked about exactly-once processing, looking at the endpoint from the outside. Here we will re-focus on an individual endpoint and see what exactly-once means for an endpoint’s state.
It’s not about the execution history

Exactly-once has spawned some heated debates in the past[1], so let’s make it clear what it means in our context - or, more importantly, what it doesn’t. Here we talk about exactly-once processing, not delivery, the two being quite different things.
If our experience in the IT industry has taught us anything, it is that drawing boundaries is the most important part of the design process. Boundaries are essential for understanding and communicating a design. The sync-async boundary is one example that is useful when designing distributed systems.
Purely sync

The most popular technology for building synchronous systems is HTTP. Such systems often consist of multiple layers of HTTP endpoints.
Modern messaging infrastructures offer delivery guarantees that make it non-trivial to build distributed systems. Robust solutions require a good understanding of what can and can’t happen and how that affects business level behavior.
This post looks at the main challenges from the system consistency perspective and sketches possible solutions.
A system

We will assume that the systems in focus consist of endpoints, each owning a distinct piece of state. Every endpoint processes input messages, modifying its internal state and producing output messages.
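That model can be captured as a small interface; the names below are ours, chosen for illustration:

```python
from dataclasses import dataclass

@dataclass
class Message:
    id: str
    payload: dict

class Endpoint:
    """An endpoint owns a distinct piece of state and maps an input
    message to a state modification plus output messages."""

    def __init__(self, initial_state: dict):
        self.state = initial_state

    def process(self, message: Message) -> list[Message]:
        # Modify self.state and return the output messages.
        raise NotImplementedError
```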