An update on email delays from December 16th
I recognize that many of you reading this may not have heard from me before. I’m Erik Enge, Head of Engineering at Postmark. The goal of everything we do at Postmark is to provide a reliable email delivery service to you, our customer. This weekend challenged that, and we failed to meet the expectations you have of us. I was hoping that the first time you’d hear from me would be under better circumstances.
Over the weekend, Postmark experienced a massive and sophisticated DDoS attack that resulted in a severe reduction of our sending capability and email delays for our customers. Despite the scale of the attack, we didn’t lose any email or data, and we continued to accept and queue messages while experiencing significant delays when sending email and updating activity pages.
This incident represented a major disruption, and you deserve a complete explanation. We want to provide as much information as we can while maintaining the safety and security of our—and our customers’—infrastructure.
Before I get to the details, I want to say that we are truly sorry about this incident and its impact on you, your teams, and your customers. So let’s begin.
A brief summary #
Around 12:00 (noon) EST on December 16th, 2022, a third-party data center we use to host certain Postmark services was hit with a large-scale DDoS attack. The attack ultimately lasted ~44.5 hours, ending around 9:30am EST on December 18th, 2022. During the entirety of the incident we continued to accept and deliver mail, although at greatly reduced speeds.
The impact #
The greatest impact on Postmark and our customers occurred in the first 12 hours. Because of the duration of the attack and the length of time that messages were queued, some customers experienced soft bounces as deliveries to receivers timed out. A handful of customers saw residual delays on December 19th.
Another unfortunate side effect of the attack was that we experienced very high volumes of traffic to our status page, which was hosted with a different provider. This impacted customers in three ways:
- Those who visited status.postmarkapp.com experienced a 500 error
- We were unable to notify customers who subscribed to status updates via email
- Customers who used our status API were no longer able to access this endpoint
What happened #
When the attack began, we observed downstream effects on our networking infrastructure. DNS queries in particular were failing, and there was general network unreliability. Our vendor became aware of the attack and shared with us that a large-scale UDP Reflection attack was under way (for an explanation of what that is, this article and this video both provide friendly explanations). They enabled a mitigation effort that successfully blocked the attack and stabilized our network. This sounds like good news, but in fact it was just the beginning of our next issue.
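To give a feel for why reflection attacks generate so much traffic, here is a minimal, purely illustrative sketch of the amplification arithmetic. The request size, response size, and attacker bandwidth are assumed, typical values for open DNS resolvers; they are not measurements from this incident.

```python
# Illustrative only: rough bandwidth amplification factor (BAF) for a
# UDP reflection attack. The sizes below are assumed, typical values for
# open DNS resolvers; they are not measurements from this incident.
request_bytes = 60       # small query sent with a spoofed source address
response_bytes = 3000    # much larger response, delivered to the victim

baf = response_bytes / request_bytes     # amplification factor
attacker_gbps = 2                        # assumed attacker bandwidth
victim_gbps = attacker_gbps * baf        # traffic arriving at the target

print(f"amplification: {baf:.0f}x")
print(f"{attacker_gbps} Gbps of spoofed requests -> ~{victim_gbps:.0f} Gbps at the target")
```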
During a DDoS it can be difficult to distinguish legitimate traffic from attack traffic, and that was the case during this incident. Mitigation efforts successfully blocked the attack traffic, but they also caught legitimate email traffic. An unintended consequence of the mitigation effort was that Postmark’s sending capacity was dramatically reduced—by nearly 60x.
Postmark continued to accept and send messages throughout the attack; however, our mail queues were also accumulating millions of messages due to our reduced sending capacity. During this time we redirected some of the queues through alternate pathways, allowing some mail to flow. The impact of this redirection was that some customers saw delays in events appearing on the activity page while we worked to speed up the queueing and sending processes.
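To make that capacity drop concrete, here is a rough back-of-the-envelope sketch of how a backlog builds when sending capacity falls to 1/60th of normal. The inbound volume is an assumed round number for illustration, not an actual Postmark figure; only the 60x reduction comes from the incident itself.

```python
# Rough sketch: backlog growth when sending capacity drops to 1/60 of
# normal. The inbound rate is an assumed round number, not a real
# Postmark volume.
inbound_per_hour = 1_000_000                     # assumed messages accepted per hour
reduced_send_per_hour = inbound_per_hour / 60    # capacity after mitigation side effects

backlog = 0
for hour in range(1, 13):                        # the first 12 hours of the incident
    backlog += inbound_per_hour - reduced_send_per_hour
    print(f"hour {hour:2d}: ~{backlog / 1e6:.1f}M messages queued")
```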
Shortly after midnight EST on December 17th we were able to establish a more robust alternate network path, and we were once again able to re-route queued emails to unaffected parts of our infrastructure. Over the next two hours we sent the majority of the pending messages at our regular speeds.
We still had several million emails queued in a different part of our infrastructure, and our team carefully tested and implemented a change that delivered the remaining email over the next two hours. We had to be careful as we ramped back up, because sending too much mail too fast could cause issues with the receiving MTAs. Our team made adjustments to our sending volumes and opened direct lines of communication with receivers to avoid reputation and deliverability issues.
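For illustration, here is a minimal sketch of the kind of gradual ramp-up described above. The starting rate, growth factor, and ceiling are assumptions, and this is not Postmark’s actual delivery code; in practice you would also watch deferral and bounce rates from receivers before each increase.

```python
# Minimal sketch of a gradual send-rate ramp-up after a backlog. All
# numbers are assumptions for illustration; this is not Postmark's
# actual delivery pipeline.
def ramp_up_send_rate(start_per_min=5_000, growth=1.5, ceiling=200_000):
    """Yield a per-minute send budget that grows gradually toward normal
    throughput, so receiving MTAs see a steady increase instead of a
    sudden flood of queued mail."""
    rate = start_per_min
    while rate < ceiling:
        yield int(rate)
        rate *= growth
    while True:
        yield ceiling

if __name__ == "__main__":
    budgets = ramp_up_send_rate()
    for minute in range(10):
        # in a real system, dispatch up to this many queued messages this
        # minute and check receiver feedback before continuing to ramp
        print(f"minute {minute}: send up to {next(budgets)} messages")
```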
By 9:00am EST on December 17th, we had delivered all queued mail and email delivery was back to normal.
How big was this attack? #
We are still working with our vendor to understand the scope of the attack. Our current understanding is that this UDP Reflection attack generated hundreds of gigabits per second of traffic over those 44.5 hours. To provide some perspective: when we say that this was a large-scale attack, we mean it. According to Cloudflare’s most recent DDoS threat report, only 0.1% of attacks exceed 100 Gbps, and 94% of attacks end in less than 20 minutes. Ours lasted 44.5 hours, a duration that is truly exceptional.
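As a rough scale check, assume a conservative sustained rate of 100 Gbps (an assumption for the math; the attack reached hundreds of gigabits per second at times, and the rate certainly varied). Over 44.5 hours, that alone works out to roughly two petabytes of attack traffic.

```python
# Back-of-the-envelope scale check. 100 Gbps is an assumed, conservative
# sustained rate; the actual attack peaked at hundreds of Gbps.
gbps = 100
hours = 44.5

total_gigabits = gbps * hours * 3600          # sustained rate over the full duration
total_terabytes = total_gigabits / 8 / 1000   # gigabits -> gigabytes -> terabytes

print(f"~{total_terabytes:,.0f} TB (~{total_terabytes / 1000:.1f} PB) of attack traffic")
```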
Keeping you informed #
While our top priority was mitigating the attack, a very close second was keeping our customers informed about what was happening. Our engineers were all hands on deck working with our data center to mitigate the attack, while marketing and customer success were responding to tweets and emails and keeping the status page up to date as new information became available.
That status page, which was built and maintained by our team and has served us well for many years, went down just as many of you were seeking updates. For a while, you couldn’t access our status page or receive the real-time updates you might have subscribed to.
We worked as fast as we could to migrate the status page and related notifications to a new provider, but we know that there was a brief period when you couldn’t access status information. If you had previously subscribed to email alerts for status updates, you don’t have to subscribe again on the new page; we moved that email list over to Sorry, the new provider.
The status API remains unavailable while we work on an alternative solution.
Thank you for your support #
It probably goes without saying that it was a rough weekend for the Postmark team as they worked 24/7 to respond to the attack. Your words of encouragement and offers of assistance mean the world to our team. Thank you for all the #hugops messages of support.
We’re very sorry about the delays and the impact on your business. I completely understand the frustration this causes: we know how critical transactional email is to you, which is why "Time To Inbox" is the primary metric we track and work to uphold internally.
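For readers curious what a metric like that captures, here is a minimal, illustrative sketch of how a time-to-inbox style measurement can be computed, assuming you record when a message is accepted by the API and when the receiving server confirms delivery. The timestamps and structure are made up; this is not Postmark’s internal tooling.

```python
from datetime import datetime
from statistics import median

# Illustrative sketch of a "time to inbox"-style measurement: the gap
# between API acceptance and delivery confirmation. The timestamps are
# made up; this is not Postmark's internal tooling.
accepted_at = {
    "msg-1": datetime(2022, 12, 16, 12, 0, 5),
    "msg-2": datetime(2022, 12, 16, 12, 0, 9),
    "msg-3": datetime(2022, 12, 16, 12, 0, 14),
}
delivered_at = {
    "msg-1": datetime(2022, 12, 16, 12, 0, 11),
    "msg-2": datetime(2022, 12, 16, 12, 0, 22),
    "msg-3": datetime(2022, 12, 16, 12, 0, 19),
}

latencies = [
    (delivered_at[m] - accepted_at[m]).total_seconds()
    for m in accepted_at
    if m in delivered_at
]
print(f"median time to inbox: {median(latencies):.1f}s")
```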
What we’re doing next #
Due to the nature of the attack there will be some things that I can share—and some that I cannot share—about what we’ve done and what we’re doing. I hope you’ll forgive some lack of details here and trust that we take this situation very seriously. Large-scale DDoS attacks are never easy to mitigate but we expect to serve our customers much better than we did on Friday.
We have completed the following actions:
- We’ve created alternate network paths for sending email so that a similar attack cannot have the same impact
- We’ve made changes to how our internal networking is structured to minimize the impact of another attack
- Through third-party vendors we now have more options in place to help mitigate future attacks in case any one mitigation causes issues with our ability to send email
This attack clarified how Postmark needs to evolve to be able to weather similar and larger attacks in the future without disappointing our customers. Although we are unlikely to speak about the details of that publicly, we are dedicated to making sure Postmark is resilient and reliable.