This morning at about 8:15am EDT we had an outage on our primary MySQL server. It affected sending, the API and our web interface. No emails were lost, but you would have noticed delays. I want to give a brief overview of what happened, how we recovered and where we are now.
The outage occurred when our primary server ran out of disk space. We have systems on each server that monitor for things like this and alert us. In this case, disk usage grew much more quickly than normal, so by the time we noticed the alerts MySQL was already struggling.
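To illustrate why a fixed free-space threshold can fire too late when usage grows quickly, here is a minimal sketch of an alert based on the growth rate instead. It is not our actual monitoring code; the function names and the one-hour horizon are assumptions made up for this example.

```python
import shutil


def free_bytes(path="/"):
    """Return the free space, in bytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free


def seconds_until_full(free_then, free_now, elapsed_seconds):
    """Estimate seconds until the disk fills, from two free-space samples.

    Returns None when free space is stable or growing.
    """
    consumed = free_then - free_now
    if consumed <= 0 or elapsed_seconds <= 0:
        return None
    rate = consumed / elapsed_seconds  # bytes consumed per second
    return free_now / rate


def should_alert(free_then, free_now, elapsed_seconds, horizon_seconds=3600):
    """Alert when the projected time to a full disk is within the horizon.

    The one-hour default is an arbitrary value chosen for illustration.
    """
    eta = seconds_until_full(free_then, free_now, elapsed_seconds)
    return eta is not None and eta <= horizon_seconds
```

A check like this would sample `free_bytes()` on a schedule and pass two consecutive samples to `should_alert()`, so a sudden spike triggers an alert even while plenty of absolute space remains.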
In most cases when we have an issue with MySQL the impact is minimal. We run a high availability environment with MySQL-MMM, which usually detects any problems and automatically fails over to a healthy server. Unfortunately, in this case MySQL was still running - it was just having issues writing certain data to disk, so it never got the trigger to fail over.
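The gap here is that "the server is running" and "the server can write" are two different checks. As a rough sketch of the second kind, the probe below attempts a tiny write and treats any failure as unhealthy. This is not how MySQL-MMM works internally and not something we run today; the `heartbeat` table and `can_write` function are hypothetical, and SQLite stands in for MySQL in the usage example so it can run anywhere.

```python
import sqlite3


def can_write(conn):
    """Return True if a small write commits, False on any error.

    `conn` is any Python DB-API connection. The `heartbeat` table is a
    hypothetical single-row table created just for this probe.
    """
    try:
        cur = conn.cursor()
        cur.execute(
            "REPLACE INTO heartbeat (id, ts) VALUES (1, CURRENT_TIMESTAMP)"
        )
        conn.commit()
        return True
    except Exception:
        # Deliberately broad: for a health probe, any failure means unhealthy.
        return False


# Usage with an in-memory SQLite database standing in for the real server.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE heartbeat (id INTEGER PRIMARY KEY, ts TEXT)")
healthy = can_write(demo)
```

A probe along these lines would report a server as unhealthy when writes fail, even if the process is still up and accepting connections.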
We struggled a bit to free up some logs and space and after about 30 minutes decided to just fail over to the second server. In hindsight we should have done that right away, but we wanted to make sure replication was healthy first.
Once we failed over manually, things were back to normal and we just had to recover any failed jobs. We have a nice fail-safe in Postmark: if the DB is unavailable, we queue emails locally on each server so we can send them once we recover. This came in handy today!
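For readers curious about the general pattern, here is a minimal sketch of a local fallback queue. It is not Postmark's implementation; the function names, the JSON-lines spool file and the `save_to_db` callback are all assumptions made for illustration.

```python
import json
import os


def enqueue_or_spool(message, save_to_db, spool_path):
    """Try the database first; on any failure, append to a local spool file."""
    try:
        save_to_db(message)
        return "stored"
    except Exception:
        # One JSON document per line so the file can be replayed later.
        with open(spool_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
        return "spooled"


def drain_spool(save_to_db, spool_path):
    """Replay spooled messages once the database is back.

    Returns the number of messages replayed. If `save_to_db` raises, the
    spool file is left in place so nothing is lost. A real system would
    also need idempotent writes to avoid duplicates on a retry.
    """
    if not os.path.exists(spool_path):
        return 0
    with open(spool_path, encoding="utf-8") as f:
        messages = [json.loads(line) for line in f if line.strip()]
    for message in messages:
        save_to_db(message)
    os.remove(spool_path)
    return len(messages)
```

The key design choice is that the spool file is only removed after every message has been handed back to the database, which favors delayed delivery over lost mail.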
Moving forward, we’re working on some upgrades to the cluster that will focus on availability and performance. We’re also going to make the alerts a bit more sensitive to make sure we catch potential issues sooner.
I’m very sorry for the early morning chaos. It was definitely not the way I wanted to start my day and surely not how we wanted all of you to start your day either. As always, we’ll work hard to minimize issues like these and improve reliability. We know you depend on us, and thank you for your patience and for trusting us with your email delivery needs.