As you may have noticed, last week we ran into some issues on Postmark that caused email sending delays and even some API timeouts for some customers. It was a rough week. This weekend we did some maintenance on MongoDB, our primary data store, which we hoped would resolve some of the issues. This morning we experienced delays again and got them under control within about two hours. I want to give a status update and some background on what is going on.
First, some background
When we designed Postmark’s infrastructure, the idea was that if all else fails, we should at least be able to recover and send the email. Of course we have redundancy on our servers, but we also wanted recovery mechanisms inside the application itself. For instance, if the DB fails and our API can’t save messages, we redirect them to temporary storage on each local API instance. When the DB comes back, they are instantly recovered. Our thinking is that as long as the API instances are accepting messages, we can ensure nothing is lost. So far this has worked really well for us.
We use MongoDB for storing and archiving our email activity, such as the sent, in-queue, bounced, and spam messages that you view and search in Postmark’s web app. When we launched this feature, we approached it with the same logic: if all else fails, make sure the email is sent. The activity page started as more of an experiment for us. We wanted to see what we could track and how many messages could be stored as we grew. Since then, it’s become one of our most important features for tracking delivery issues and viewing the status of your emails. So when we have issues related to MongoDB, it’s noticeable.
We’ve been experiencing issues with MongoDB for months. We’ve also come to rely on it quite a bit for things like our bounce processing and internal statistics. When it works, it’s awesome, but when it doesn’t, the sky is falling. Earlier this year we made the decision to try CouchDB, mainly for the replication features it offered compared to MongoDB. It looked very promising for our needs, so we started to rewrite the MongoDB code to use CouchDB. It took a lot more time than we thought, but that’s not the worst part. Once we had some of the core code in place for Couch, we ran some performance benchmarks. Compared to MongoDB, the results were pretty bad. There are a lot of technical reasons for this and tradeoffs to consider, but after a lot of testing we reverted to MongoDB. Since then we’ve been full force on optimizing and upgrading MongoDB.
Jumping back to now
Let’s get back to last week and today. Essentially what happens is that when MongoDB is under higher load than normal, a large portion of our dataset starts hitting disk instead of indexes in RAM (currently 64GB). This causes huge iowait on the servers, which then slows down every query on Mongo. Since we rely on MongoDB to update things like in-queue and sent status, those updates start to get delayed. In addition, slower queries mean our processes that grab and send emails start to slow down as well. This caused some pretty severe delays in sending email from around 6am to 9am EST. While all of that was going on, and since we built things to accept failure, a lot of the activity pages might not have loaded or might have shown messages as in queue. So while the emails were sent, the system skipped updating the record from “in queue” to “sent.” Again, we’d rather send the emails than wait for MongoDB to recover.
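The “send first, update status best-effort” behavior described above amounts to something like this sketch. It’s illustrative only: the `smtp.send` and `db.update_status` calls are hypothetical placeholders, not our real internals.

```python
def deliver(smtp, db, message):
    """Send the email first; treat the activity-status update as best-effort."""
    # Delivery is the priority; this must not wait on MongoDB.
    smtp.send(message)
    try:
        # Flip the activity record from "in queue" to "sent".
        db.update_status(message["id"], "sent")
    except Exception:
        # If MongoDB is slow or down, skip the update rather than delay
        # or fail the send. The record simply stays "in queue" until
        # things recover, even though the email actually went out.
        pass
```

This is why, during the incident, some messages showed as “in queue” in the activity page even though they had already been delivered.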
What are we doing about it?
Running a service that is an extension of your infrastructure is serious, and we treat it that way. We’ve been working our asses off to not only get things back to normal but to improve our datastore so we can grow into it. We don’t have time to mess around, so we’re working with 10gen, the developers of MongoDB, as well as upgrading some of our infrastructure at Rackspace. By using tools like Server Density we’ve been able to spot some trends and understand where we can optimize. It will take a little more time to figure out the real bottlenecks and plan a better layout for our MongoDB servers. In the meantime we are going to migrate our replica sets to better hardware, away from the slower I/O of EC2 and onto physical servers. At the moment everything has stabilized, and we will continue monitoring it closely throughout the day.
Finally, I just want to say I am sorry. It’s been pretty stressful here and I know that the stress gets pushed down to our customers as well. Everyone has been great through this tough process and we really appreciate it. Please make sure to follow us on Twitter for the latest updates.
This post was originally published Jun 06, 2011