Today we had a really important but low-impact maintenance scheduled. This morning at 9am the team brought the application offline to migrate to a new, more powerful database. Email sending was not going to be affected during most of the process; we expected about 2–4 minutes of paused sending. Things did not go as we expected, and we had some long outages and delays in sending. I want to explain what happened in detail.
What happened #
Moving to the new database meant transferring our data from our datacenter in Virginia to Chicago. When we came back online, we noticed significantly worse performance. While we investigated the cause, emails were sending, but at a much slower rate.
To combat the slowness, we added many more workers, and about 20 minutes later sending was back to full speed. At this point we realized something strange was going on: test accounts we created after the migration were showing inconsistent data. As we continued to investigate, we realized the transfer from the old database to the new one was incomplete. This was totally unexpected and took a while to discover. During this time, no existing accounts were affected, and sending chugged along quickly.
We decided the best way to fix the issue was to re-import the old database dump. We put the app in emergency offline mode and ran the import. It completed after about 15 minutes, but records were still missing. At that point, the only solution was to revert to the old database. About 10 minutes later the app was back online and sending was instant.
The damage #
By reverting, we lost some data from today between 10am EST and 2:30pm EST. This includes some new account sign-ups, as well as any credit deductions and sent email counts. However, the sent records can still be seen in the activity page.
The good news is we didn’t lose any emails, and all emails are in Activity for searching.
What we’re doing now #
Russ and the team are investigating what happened. In all of our testing, we weren’t able to predict that the sync would be incomplete. Until we know the cause for sure, we will hold off on the migration. Our hope is to try again next weekend. Before we do, we will send an email notice about the maintenance so you’re fully aware of any upcoming downtime.
We are very aware of the effect this kind of day has on our customers and their businesses. We use Postmark ourselves and rely on its uptime. We’ve gotten to a really good place with testing, and most days you never hear about our deployments or updates. On behalf of myself, Chris, and our team, we’re very sorry. We will continue working to make sure these kinds of situations don’t happen again.
This post was originally published Mar 24, 2012