This post is a little bit out of the ordinary for our blog, but I’d like to share a few notes about a situation that is sometimes reported by our customers. It leads to a more interesting discussion about dealing with a common problem in distributed systems.
In some situations, your application may receive a webhook for the same event/message multiple times from Postmark. Under normal circumstances, this should be relatively rare, but it is something your webhook processing should be prepared to handle.
Understanding why this can happen is not only important when working with webhooks in Postmark, but more generally useful in dealing with distributed systems. After all, all apps that use APIs are distributed systems, whether they want to be or not!
At least once delivery
Postmark makes a best effort for “at least once delivery” of webhooks. This is in contrast to “exactly once delivery,” which no system can truly guarantee. Attempting to “fake it” adds substantially more complexity to both the sender's and the receiver's code.
Why is this? Well, posting a webhook is essentially a distributed transaction. Our servers open an HTTP connection to your server, your server accepts the message, and then responds with a 200-level HTTP status code. After that, we mark the webhook as processed in our internal databases and we move on to the next job. Under normal circumstances, this all works, and the webhook will only be sent to your application one time.
However, there are many possible places where this transaction can fail, both on the sending and receiving sides, as well as the network in between. Here are a few examples:
- The receiving app takes too long to return an OK response to our system (we'll wait for 15-30s depending on the type of webhook).
- The receiving app sends an OK response immediately, but network packets are dropped or there is high latency between the two servers, and our system doesn't receive the confirmation that your system accepted the webhook.
- We can't mark the webhook as completed after receiving an OK response from your system because an internal system is down/under high load (this is really uncommon).
- The sending system crashes after sending the HTTP POST, but before processing the OK HTTP response (this is also uncommon).
Since many of these issues are indistinguishable from a “broken” webhook, we want to give webhook endpoints multiple chances to accept them. Therefore we will retry delivery of webhooks multiple times, over several hours, before they are marked as failures in our system and we stop retrying.
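To make the second failure case concrete, here is a toy simulation of a sender that keeps retrying until it actually sees an acknowledgement. It is a sketch only: `deliver_at_least_once`, `ack_arrives`, and the retry limit are hypothetical names and values chosen for illustration, not a description of how Postmark's retry logic is implemented.

```python
def deliver_at_least_once(receive, ack_arrives, max_attempts=5):
    """Resend until an acknowledgement actually reaches the sender.

    receive:      the receiver's handler, called once per delivery attempt.
    ack_arrives:  ack_arrives(attempt) -> bool, a stand-in for the network;
                  False means the receiver's OK response was lost.
    Returns the number of attempts that were made.
    """
    for attempt in range(1, max_attempts + 1):
        receive()                    # the receiver processes the webhook
        if ack_arrives(attempt):     # the sender only stops once it sees the OK
            return attempt
    return max_attempts              # give up after the last attempt


deliveries = []

attempts = deliver_at_least_once(
    receive=lambda: deliveries.append("delivery event"),
    ack_arrives=lambda attempt: attempt > 1,  # the first OK is lost in transit
)

print(attempts, len(deliveries))
```

The receiver did everything right on the first attempt, yet it still sees the same event twice, because the sender has no way to tell a lost confirmation apart from a failed delivery.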
Systems that purportedly provide “exactly once delivery” must introduce an additional network roundtrip into the process in order for the two systems to, essentially, confirm the confirmation. This both amplifies all of the above issues, and is substantially more complex to implement for both the sender and receiver, as compared to “at least once delivery.”
This is why Postmark will make a best effort to deliver webhooks multiple times, even if your system appears to be responding with a 200 OK status.
How to deal with duplicate webhooks
There is a way to handle the above cases gracefully, and it is a core quality of many distributed systems: idempotency. An example of this in mathematics is multiplying a number by 1. It doesn't matter how many times you multiply it, the result will always be the original number:
5 x 1 = 5
5 x 1 x 1 = 5
Alternatively, adding 1 to a number is not idempotent — the outcome changes each time you apply it:
5 + 1 = 6
5 + 1 + 1 = 7
In this context, what this means is that if your system receives the same input multiple times, it will always be in the same state after the input is applied, regardless of how many times you apply that same input.
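The same contrast shows up in code. As a small sketch (the function names and the `evt-1` id are made up for illustration), adding an id to a set is idempotent, while bumping a counter is not:

```python
def record_event(seen, event_id):
    """Idempotent: adding an id that is already present changes nothing."""
    seen.add(event_id)
    return seen


def count_event(counter):
    """Not idempotent: every call changes the result."""
    return counter + 1


seen = set()
record_event(seen, "evt-1")
record_event(seen, "evt-1")      # same input applied a second time

count = 0
count = count_event(count)
count = count_event(count)       # same input applied a second time

print(seen, count)
```

After two applications the set still holds a single id, but the counter has moved twice. The rest of this post is about making the second kind of operation behave like the first.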
The benefits of building a system that has idempotent qualities are hard to overstate: it becomes easier to scale out work, to replay or recover from different types of issues, and to tolerate communication failures between systems.
To achieve idempotency, your inputs must be uniquely identifiable. In the case of a Postmark webhook, you may look at the message id and recipient to determine if it is unique, for example. (Side note: we will likely enhance the webhooks in the future to provide a unique id for all such events, but for now, the onus is on you to determine this.)
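One way to turn those two values into a single identity is sketched below. The field names `MessageID` and `Recipient` are an assumption about how the parsed webhook payload looks in your code, so check them against the payloads you actually receive. The helper name `webhook_identity` and the choice to lowercase the recipient are likewise just illustrative.

```python
import hashlib


def webhook_identity(payload):
    """Build a stable identity string from a parsed webhook payload.

    Assumes the payload is a dict with "MessageID" and "Recipient" keys
    (an assumption for this sketch; verify against your real payloads).
    """
    message_id = payload["MessageID"]
    # Normalize the address so trivial differences don't create new identities.
    recipient = payload["Recipient"].strip().lower()
    raw = f"{message_id}|{recipient}"
    # Hashing gives a fixed-length key that is convenient to index.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


identity = webhook_identity(
    {"MessageID": "example-message-id", "Recipient": "User@Example.com"}
)
print(identity)
```

If your endpoint receives several kinds of events for the same message and recipient, you would probably need to fold the event type into the key as well, otherwise distinct events would be treated as duplicates.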
Once you have selected an appropriate identity, you can use this to determine whether the webhook should be processed or can be discarded because it has already been processed. What “processed” means varies substantially from one application to another, but we can discuss a potentially common example: incrementing stats.
In this example, you'll have a relational database that contains a stats table. When you receive a webhook, you increment a row in that stats table, and you respond with 200 OK. However, in the case of “at least once delivery”, this presents a problem, because you may end up double-counting the event. That's no good.
Fortunately, you can resolve this by adding a secondary table in the same database that simply holds identities. When you update your stats table, also insert your identity into this second table during the same transaction. If the transaction fails due to a conflict in this table, you know it has been applied previously, and therefore does not need to be applied again. This is a simplified example, and would be heavily influenced by your app's scale, but hopefully illustrates one technique for handling this type of issue in your code.
It is important to note that in either case, whether the webhook is being applied for the first time or has been recognized as a duplicate, a webhook endpoint should respond with a 200 OK status to the external system since the payload has been applied successfully. Responding with something like a 409 Conflict on a second attempt simply propagates irrelevant information. The desired state from processing the webhook has already been achieved, and further attempts will not change that, so the system should indicate success instead.
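In handler terms, that might look like the sketch below. The in-memory `processed` set and `stats` dict stand in for durable storage, and `handle_webhook` is a hypothetical function that returns the status your endpoint would send back.

```python
from http import HTTPStatus

processed = set()            # stand-in for durable storage of identities
stats = {"delivered": 0}     # stand-in for the stats table


def handle_webhook(identity):
    """Return the HTTP status the endpoint should respond with."""
    if identity not in processed:
        processed.add(identity)
        stats["delivered"] += 1
    # First delivery or duplicate, the desired state has been reached,
    # so the sender is told the webhook succeeded.
    return HTTPStatus.OK


first = handle_webhook("msg-1|a@example.com")
retry = handle_webhook("msg-1|a@example.com")

print(int(first), int(retry), stats["delivered"])
```

The sender sees the same success response on both attempts and stops retrying, while the counter on your side has only moved once.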
These techniques are not only applicable to handling webhooks from Postmark; you have more than likely applied some of these concepts in your own code already. If not, think about how making your code idempotent could improve its resilience in the face of failure. You’ll be amazed at how powerful this simple concept can be.