Lessons learned from our recent outages

I owe an explanation for the recent issues on Postmark. This week we had some pretty terrible performance on our MongoDB cluster, resulting in delayed emails, disabled activity and even a disabled Bounce API. No emails were lost during the outages, but I do want to explain what happened, what we did, and what we are doing about it.

Cause and effect

In short, one of our MongoDB clusters failed, forcing a failover that promoted a secondary machine to primary. Unfortunately, this new primary did not have enough resources (disk I/O) to keep up with the write load that Postmark throws at it. Since the other replica had failed, we basically had to wait until it was rebuilt before bringing it back, which took over a day with the amount of data we had on that cluster. Once we failed over again, everything instantly came back to normal.

The issue that occurred is not something new. We’ve had challenges with MongoDB in the past. I would say that it is not entirely about how MongoDB works, but how it fits our needs and how we use it. Postmark captures a lot of data each day, which we store both in MongoDB for archival and in Elasticsearch for searching.

We go through so much data that in order to maintain the performance that our customers expect, we need to purge data constantly. With a solution like MongoDB, this causes two problems.

First, we have to constantly delete data, which adds to the disk utilization on each primary in the replica set. Second, when data is purged, the space on disk is not released, meaning that we have to run constant compactions or resyncs to get it back.
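To make the second problem concrete, here is a toy model of it in Python. It is only a sketch of the behavior described above, not how MongoDB's storage engine actually works: the `ToyDataFile` class and its methods are hypothetical, and it deliberately ignores any reuse of freed space by later inserts.

```python
class ToyDataFile:
    """Toy model only: bytes held on disk vs. bytes of live documents."""

    def __init__(self):
        self.on_disk = 0  # space the data file occupies on disk
        self.live = 0     # space used by documents that still exist

    def insert(self, size):
        # Simplification: every insert grows the file.
        self.on_disk += size
        self.live += size

    def purge(self, size):
        # Deleting documents shrinks the live data, not the file.
        self.live -= size

    def compact(self):
        # A compaction (or a full resync) is what returns the space.
        self.on_disk = self.live


data_file = ToyDataFile()
for _ in range(10):
    data_file.insert(100)

data_file.purge(600)
before_compaction = (data_file.on_disk, data_file.live)

data_file.compact()
after_compaction = (data_file.on_disk, data_file.live)
```

In this model the file still occupies 1000 bytes after the purge even though only 400 bytes are live, and it only drops to 400 once `compact()` runs.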

As a result of these problems, our Bounce API was down for almost two days. We chose to disable that part of the API because it uses MongoDB queries to find emails, which puts additional load on the servers. We decided that it was better to keep email sending services in good shape, even if it meant the Bounce API was offline.

While embarrassing, I still feel this was the right decision. The other side effect of the performance issues is inaccurate “In queue” messages in activity. While all emails were sent successfully (although possibly delayed in some cases), the activity statuses were still being pulled from MongoDB data, which was inconsistent while the cluster was suffering. We are slowly updating the activity history to reflect the properly sent messages, and you can expect it to be back to normal within the next day or so. All new messages are being shown accurately.

A silver lining

One really good thing that came out of the performance issues is an updated Bounce API. The Bounce API has always gone directly to MongoDB when searching for email records. Since MongoDB is not really meant for searching, it was very slow, and if you use it you have probably had timeouts in the past. One of our developers, JP Toto, was able to rebuild it to make calls directly to Elasticsearch, which results in near-instant responses. This change will also allow us to enhance how you search for bounce (and sent) records in the future.

Rethinking our architecture

We’ve been on MongoDB for a long time. At this point we are rethinking our architecture and have a plan for some very thorough benchmarking of other NoSQL (and SQL) databases to see how they might fit our needs. The most important outcome of this benchmarking is making sure that our email archive cannot affect the real-time sending that our customers rely on. This means either an eventually consistent solution, heavy background processing, or a combination of both. We still have not ruled out MongoDB, but we have some ideas on how we can use it differently that might fit us better.
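As a rough illustration of the background processing idea, here is a minimal Python sketch in which the sending path only drops a record onto a queue and a separate worker writes it to the archive later. This is not our actual design: the names (`send_email`, `archive_worker`, `slow_archive_write`) are hypothetical, and an in-memory list and queue stand in for the archive database and whatever durable queue a real system would use.

```python
import queue
import threading
import time

# Hypothetical in-memory stand-ins for the archive store and its work queue.
archive = []
archive_queue = queue.Queue()


def slow_archive_write(record):
    # Simulate an archive database that is struggling to keep up.
    time.sleep(0.01)
    archive.append(record)


def archive_worker():
    # Background path: drain the queue at whatever pace the archive allows.
    while True:
        record = archive_queue.get()
        if record is None:  # sentinel value: stop the worker
            archive_queue.task_done()
            break
        slow_archive_write(record)
        archive_queue.task_done()


def send_email(message_id):
    # Real-time path: delivery itself is omitted here. The archive record
    # is queued without waiting for the slow write to finish.
    archive_queue.put({"id": message_id, "status": "sent"})
    return "sent"


worker = threading.Thread(target=archive_worker, daemon=True)
worker.start()

results = [send_email(i) for i in range(5)]

# The archive is eventually consistent: it catches up after sending returns.
archive_queue.join()
archive_queue.put(None)
worker.join()
```

The point of the sketch is that `send_email` returns without touching the archive, so a slow archive delays only the history, not the sending.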

Since most of our customers are also developers, we’ll make sure to post our benchmarking results once we are done.

More than anything, I want to sincerely apologize for the problems. For the most part, we know we are doing our job when you forget we exist because everything “just works”. We heard from many of you because that wasn’t the case in the past few days.

We’ll make sure that the next time we hear from you, it’s on a positive note. Thanks for your patience and support.