Over the past few days, Postmark has experienced volatility in API response times and email delivery times. We owe everyone an explanation.
The majority of the issues were due to increased load on our MySQL database servers, mostly caused by disk IO saturation. This came to a head on Tuesday, September 22nd, when we experienced severe email delays and had to perform emergency maintenance, bringing the entire application down to fail over to a second MySQL server.
Most customers experienced slow response times on various API endpoints, as well as severe delays for a portion of emails. When our APIs detect problems with database connectivity, we save requests to disk to be picked up by a background process. As these background processes started working to recover saved emails, performance across the entire system began to deteriorate again. To counteract this, we drastically limited the rate at which saved files were recovered. This allowed live requests to be processed, but saved requests took much longer to be delivered, and some emails sat in the queue for over an hour.
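To make the save-to-disk and throttled-recovery behavior described above more concrete, here is a minimal sketch in Python. It is not our production code: the `store` callable, the `DatabaseUnavailable` exception, the spool directory layout, and the `max_per_second` parameter are all hypothetical names chosen for illustration.

```python
import json
import os
import tempfile
import time
import uuid


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) store callable when the database is unreachable."""


def accept_request(payload, store, spool_dir):
    """Try the database first; if it is unavailable, save the request to disk."""
    try:
        store(payload)
        return "stored"
    except DatabaseUnavailable:
        # The timestamp prefix keeps files roughly in arrival order when sorted.
        name = f"{time.time_ns()}-{uuid.uuid4().hex}.json"
        tmp_path = os.path.join(spool_dir, name + ".tmp")
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(payload, f)
        # Rename only after the write completes, so the recovery
        # process never picks up a half-written file.
        os.replace(tmp_path, os.path.join(spool_dir, name))
        return "spooled"


def recover_spooled(store, spool_dir, max_per_second, sleep=time.sleep):
    """Replay saved requests, pausing between files to cap the recovery rate."""
    interval = 1.0 / max_per_second
    recovered = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".json"):
            continue  # skip files that are still being written
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
        try:
            store(payload)
        except DatabaseUnavailable:
            break  # database is still struggling; leave the rest on disk
        os.remove(path)
        recovered += 1
        sleep(interval)  # throttle so live requests are not starved
    return recovered


if __name__ == "__main__":
    spool = tempfile.mkdtemp()

    def failing_store(payload):
        raise DatabaseUnavailable()

    accept_request({"to": "someone@example.com"}, failing_store, spool)
    delivered = []
    recover_spooled(delivered.append, spool, max_per_second=5)
```

The trade-off is the one we ran into: a lower `max_per_second` protects live traffic, but every saved request waits longer before it is delivered.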
We had been evaluating our MySQL usage and configuration since we began to notice slight issues a number of weeks ago. We did extensive research, consulted with MySQL experts (Percona), and came to the conclusion that we would get the most benefit from new hardware and configuration changes that required downtime, both of which we wanted time to evaluate and test. We still looked for and implemented changes that could be done online, but what we tried did not make any significant impact.
When we finally initiated an emergency failover around 6pm on Tuesday night, we noticed immediate improvements. After monitoring for the last 36 hours, we can say that we are back to normal performance and response times.
Lessons learned
There are two major areas where we will be focusing our efforts: capacity planning and failure handling.
With better capacity planning and a renewed focus on understanding the specific needs of our infrastructure, we will solve these kinds of problems before you ever begin to notice them. While we have many areas where we are able to predict an increase in traffic, a data store like MySQL can become a single bottleneck within a short timeframe. We are already acting on changes to scale vertically on a hardware level for increased IO using NVMe drives, and at the same time, implementing newer features of MySQL that can help with larger datasets.
With better failure handling, we will recover from downtime events faster and not impact performance further while recovery is taking place. This includes a faster database failover process and more control over throttling specific aspects of the email flow in the application.
Infrastructure is our business
The last several days have been stressful and exhausting for the team, but what is even worse is that we let you down. We have a great reputation for reliability in both uptime and email delivery and we believe that is the core of our business. Without fast API response times and quick delivery to the inbox, we can’t exist. There is not a single fancy feature we could launch that can compensate for bad performance or an unreliable service.
I’m confident that we will get through this and I want you to know it’s a primary focus for the entire team. We let you down, and now we need to earn back your trust.
If you have any concerns or questions, please email me directly at email@example.com.
This post was originally published Sep 24, 2015