The past seven days have been a whirlwind. I know a lot of our customers have noticed issues with activity and sending over the last few days, following our migration. I want to give an update on what happened and what we have to look forward to.
First, why we moved to Cloudant
We’ve been using MongoDB ever since we released the activity feed, which was only a few months after we launched Postmark. It’s been a long and rough relationship, to say the least. Initially the problem was basic instability, back before MongoDB had replica sets. After that, it was search issues and disk access. Our dataset is pretty big and constantly growing. We solved that by moving all search and queries to Elastic Search (ES). It worked incredibly well, even though ES itself is young, has some quirks of its own, and required us to build our own oplog-to-ES follower. Over time, we got MongoDB to perform well by throwing a lot of SSDs at it, and everything became pretty stable. Even then, we still ran into issues and bugs in MongoDB, and we were not big fans of the master/slave architecture, so we decided to move on.
We evaluated a lot of options, such as Riak, CouchDB, Cassandra, and even MySQL. We were looking for a solution that would allow us to scale easily (by just adding a node) and something that fit in with our existing infrastructure (Elastic Search). About a year and a half ago we had talked to Cloudant (a Postmark customer). Even though at that point we were not ready to migrate (this was before ES), we really liked the offering. This time around, we decided that Cloudant was a perfect fit. Not only did it perform well in our tests, it was a drop-in replacement when it comes to Elastic Search and our data model.
On top of that, NoSQL and the shift to new data stores are overwhelming and pretty unpredictable. These data stores are all fairly new and far from the stability people are used to from MySQL and other traditional relational databases. This is exactly why we chose to go with Cloudant. Their level of support is incredible and their expertise cannot be matched. When we were in the thick of it, we had a team of people from their engineering group in chat helping us, and not just with cluster health, but with how to interact with the cluster in our code as well.
With Cloudant we were able to achieve a high level of customization. This allowed us to splurge on hardware like 8TB of SSDs and Intel Ivy Bridge-EP processors. The servers are hosted through ServerCentral at a DuPont Fabros facility outside of Chicago, one of the most advanced in the country.
A huge rewrite
At first the migration was pretty simple: we essentially just replaced how we interact with MongoDB and put Cloudant in its place. As we got into it, though, the performance was not what we required and we noticed a lot of areas for improvement. For instance, in our previous architecture, if MongoDB had high load, the entire sending process would slow down and queues would go through the roof. Postmark is kind of like a firehose in that respect. Even if the app is down, our servers will still capture emails offline. The problem here is that MongoDB was not part of the core sending process; it was just for viewing email history in the Web App, yet its load could still slow sending down. While a lot of that work was already in the background, we still had some work to do.
The new system, launched on Sunday, allows us to put all non-critical aspects of the system in the background. This way, even if our Cloudant cluster were to go offline completely, Postmark would still send out your valuable emails and keep churning. While the sending history is important, getting your emails to the inbox is our first priority.
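To make the idea concrete, here is a minimal sketch of that kind of decoupling, not our actual code. Everything in it is an assumption for illustration: `FlakyActivityStore` stands in for the activity database (it is not a real Cloudant client), an in-process `queue.Queue` stands in for a durable broker queue, and `send_email` and `drain_activity` are hypothetical names.

```python
import queue


class ActivityStoreDown(Exception):
    """Raised by the hypothetical activity store when it is unreachable."""


class FlakyActivityStore:
    """Stand-in for the activity database; not a real Cloudant client."""

    def __init__(self):
        self.available = True
        self.records = []

    def save(self, record):
        if not self.available:
            raise ActivityStoreDown("activity store unreachable")
        self.records.append(record)


# Stand-in for a durable broker queue (assumption: in production this
# would live outside the sending process).
activity_queue = queue.Queue()


def send_email(message, delivered):
    """Critical path: deliver first, then only enqueue the history record."""
    delivered.append(message["id"])  # pretend delivery happened here
    activity_queue.put({"id": message["id"], "status": "sent"})


def drain_activity(store):
    """Background step: write queued records, keeping them if the store is down."""
    written = 0
    while True:
        try:
            record = activity_queue.get_nowait()
        except queue.Empty:
            return written
        try:
            store.save(record)
            written += 1
        except ActivityStoreDown:
            # Put the record back and stop until the store recovers.
            activity_queue.put(record)
            return written
```

The point of the sketch is that `send_email` never calls the store directly, so an outage in the store only delays history and never blocks delivery.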
The migration on Sunday was not just a data migration from MongoDB to Cloudant. It was a huge refactor of code and processes. While we tend to work in smaller releases, we had to do it this way. After the initial migration everything looked great. Once records stopped going to MongoDB, we had some scripts ready to start back filling the old data into Cloudant. Sometime in the afternoon on Sunday, all hell broke loose. Our RabbitMQ queues went through the roof and records stopped getting processed into our Cloudant cluster. At first we thought it was a mix of regular production activity combined with the back fill imports. We managed to get through it, stopped the imports, and things went back to normal. We decided to regroup.
The next day, Monday, it happened again around the same time. We spent a lot of time with Cloudant analyzing the cluster performance, but everything looked normal. The problem was that we had to delay our back fill imports, so while you saw your recent activity, anything before Sunday was missing. We were still a little nervous to run the imports again. It wasn’t until Tuesday, when it happened again at the same time, that we realized the formatting of certain customer emails (tens of thousands of them) was breaking our RabbitMQ workers. Thankfully, once Milan patched it, everything went back to normal.
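For readers curious about the general failure mode, below is a rough sketch of one common way to keep a malformed message from stalling a queue worker. It is not the patch Milan wrote, and the details are assumptions: the post does not say what the formatting problem was, so the sketch simply assumes JSON payloads with `id` and `subject` fields, and `process_batch` and `to_activity_record` are hypothetical names.

```python
import json


def to_activity_record(payload):
    """Hypothetical handler: build an activity record from a parsed message."""
    return {"id": payload["id"], "subject": payload["subject"].strip()}


def process_batch(raw_messages, handle):
    """Process each message; set failures aside instead of stalling the worker."""
    processed, dead_letters = [], []
    for raw in raw_messages:
        try:
            payload = json.loads(raw)  # invalid JSON raises a ValueError subclass
            processed.append(handle(payload))
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the bad message and the reason so it can be inspected later.
            dead_letters.append({"raw": raw, "error": repr(exc)})
    return processed, dead_letters
```

With this shape, one badly formatted email costs one parked record rather than a backed-up queue, and the parked messages can be replayed once the handler is fixed.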
Where we are now
We finished the back fill imports yesterday. As of now, all previous bounce, sent and inbound history should be in the activity feed. The last step is to fix some incorrect “In queue” records left over from the RabbitMQ issues. We should have them settled over the course of today and tomorrow.
Overall, a gigantic success
While it was exhausting and stressful, this was a gigantic success for us. The Cloudant cluster is humming along nicely. We just had a customer slam us with emails for a really large brand giveaway - our charts and benchmarking tools hardly even noticed it. There are many, many positive things to take away from this migration, but let me list a few:
- A consistent and much more stable Elastic Search indexer (the CouchDB river is much better than our custom-written tool for Mongo).
- Most of our processes moved into the background, decoupling critical sending from email activity and search.
- A highly proficient team at Cloudant to help manage our cluster.
- Ability to store many more days of activity history.
Overall, the real success is moving forward. With our Cloudant migration out of the way, we can now get back to some much-needed features like Open Tracking, Link Tracking and some other really awesome ideas we have in the works.
I really want to thank you for putting up with some of the mess over the last few days. We expect better from ourselves and you deserve better. We’re looking forward to the next steps of Postmark. It’s going to be fun.