After yesterday’s issues, I’d like to apologize for the recent service outages we’ve had over the past few weeks and provide a detailed explanation of what’s been happening and what we are doing about it.
In the last few weeks we’ve seen the following:
- May 26th: A major DDoS attack against Postmark caused connectivity issues.
- June 1st: A major DDoS attack against our hosting provider caused 30 minutes of connectivity issues.
- June 16th: Telia, a prominent transit provider, experienced several minutes of packet loss, causing brief connectivity issues to our services.
- June 20th (yesterday): Telia experienced issues again, affecting routes from the EU to the US.
- June 20th (yesterday): Elasticsearch issues caused message delays and missing activity.
Events from yesterday
Before I jump in, I’d like to explain yesterday in more detail, along with the current status. At around 2:56pm EST we had an interruption in our Elasticsearch cluster after a hardware failure. While outbound sending is completely detached from this dataset, it did affect Inbound processing, causing some delays in triggering webhooks.
Elasticsearch is fully redundant and normally carries on after a node failure, but during the recovery of the failed nodes we made a mistake and caused a complete cluster outage. This had a ripple effect on our RabbitMQ cluster as events queued up, causing delays across both Outbound sending and Inbound processing. The delays had more impact on Inbound processing; most outbound messages processed quickly.
At around 6pm we regained control of both the Elasticsearch and RabbitMQ clusters and everything returned to normal. We didn’t escape without some wounds, though. The impact includes:
- Missing activity: We are currently restoring previous activity history. Some records from between 3pm and 6pm EST (June 20th) may be missing.
- Missing stats: While recovering RabbitMQ we had to purge some stats events from between 3pm and 6pm EST (June 20th). These records include counters for sends, opens, and bounces. Unfortunately, we cannot recover these records.
- Delays: Between 3pm and 6pm EST (June 20th), Inbound processing experienced delays of up to 5 minutes and Outbound sending delays of up to 4 minutes.
- Inbound duplicates: It is possible that some messages were posted multiple times to your inbound webhook URLs.
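If your application might be affected by the duplicate posts above, the usual safeguard is to make your webhook handler idempotent. Here is a minimal sketch, assuming the inbound JSON payload carries a unique `MessageID` field and using an in-memory set purely for illustration (a real handler would use a database or cache with a retention window):

```python
import json

# Illustrative in-memory store of processed IDs; in production this would
# be a durable store (database or cache) with a retention window.
seen_message_ids = set()

def handle_inbound(payload_json):
    """Process an inbound message exactly once, ignoring duplicate posts."""
    payload = json.loads(payload_json)
    message_id = payload["MessageID"]
    if message_id in seen_message_ids:
        # Already processed: safely acknowledge and skip the repost.
        return "duplicate"
    seen_message_ids.add(message_id)
    # ... actual processing: parse the message, store it, notify users ...
    return "processed"
```

With this in place, a repeated post for the same message is acknowledged but produces no side effects the second time.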
What are we doing to fix all of this?
During yesterday’s issues we took note of several things that could have made this easier. The Elasticsearch outage was mostly a mistake on our part, but the ripple effect it caused must be fixed. This includes separating Inbound processing from the Elasticsearch path and making RabbitMQ more resilient to the increased load in this scenario. We also discovered some code changes that could reduce the impact and are working on them now. Some of these issues only appear when something drastic like this happens, so we need to fix them immediately to avoid a repeat.
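To give a rough idea of the decoupling described above, here is a minimal sketch (names and structure are illustrative, not our actual code) of the pattern: message processing only enqueues an indexing event, and a separate consumer moves events into the search index, so an index outage delays search data without blocking processing itself.

```python
import queue

# Stand-in for a durable message queue (e.g. a RabbitMQ queue) sitting
# between message processing and the search index.
event_queue = queue.Queue()

def process_inbound(message):
    """Core processing path: never touches the search index directly."""
    # ... parse the message, store it, trigger webhooks ...
    # Indexing is deferred by enqueuing an event instead.
    event_queue.put({"type": "inbound", "message": message})

def drain_index_events(index):
    """Consumer: flush queued events into the search index when healthy."""
    count = 0
    while not event_queue.empty():
        index.append(event_queue.get())  # stands in for an index write
        count += 1
    return count
```

If the index is down, events simply accumulate in the queue and are drained once it recovers, which is why the queue itself also needs enough headroom to absorb the backlog.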
Our recent DDoS and transit provider issues have been rough. They are also very hard to control or prevent. The Telia issues affected many providers, and our data center is working to be as proactive as possible in rerouting traffic when these things come up. For the DDoS attacks, we are already working toward isolating the paths to each service so the impact of an attack is contained. This will also allow us to scale each piece separately and add a layer of DDoS protection on top. This work is already under way with our operations team.
We hate for you to have to deal with these kinds of problems, and we’re truly sorry. We built Postmark so you don’t have to worry about sending email and managing infrastructure; when these issues happen, we fail to keep that promise. Reliability is our most important feature and the core of our product. I’m confident that we will correct this and that an even more reliable service will come out of it.