Over the past week and a half, Postmark has suffered numerous outages and delays on our front-end activity feed. While these issues have never negatively affected our API, sending, or the delivery of messages, they have caused a lot of confusion for our customers, and for that we are profoundly sorry.
About 10 days ago we performed a fairly routine move of our physical Cloudant (BigCouch) servers at ServerCentral to a new cabinet. We have four servers in total, all managed by Cloudant, so it’s perfectly feasible to move them, one at a time, and then bring them back into a cluster. Meanwhile, the front-end activity feed is powered by ElasticSearch, which uses the couchdb-river plugin to send user message data directly to ElasticSearch for indexing. 99% of the time this works extremely well.
After breaking apart the Cloudant cluster and putting it back together, however, the river plugin seemed to stop working and we could no longer update our front-end activity feed reliably. After several attempts to repair the river and the existing ElasticSearch index, we opted to rebuild the index from scratch using a more reliable ElasticSearch setup.
Another issue that manifested at the same time was duplicate statuses showing up in activity. About 10% of the time, messages would show in activity with both “queued” and “sent” statuses. While this didn’t mean users’ messages were being delayed, it did mean confusion for anyone trying to decide if their messages were actually being sent.
Since that time, we’ve been working hard to restore all services to their normal working condition. We found that the duplicate statuses are caused by our application trying to update the status of a message from “queued” to “sent” before the “queued” message had finished being written to disk on the Cloudant servers. It’s important to have a quorum of writes across the cluster so that we can guarantee redundancy for message data. Unfortunately, the writes haven’t been happening quickly enough, and the result has been messages marked as both “queued” and “sent”.
To rectify the duplicate status issue, we began bonding the network interfaces in our servers to provide more bandwidth. We’ve also been updating the application code to guarantee documents are written to Cloudant before we try to update them. During the bonding process, however, we had some hiccups in the cluster which again caused our ElasticSearch index to fall out of sync.
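As a rough illustration of the idea of confirming a write before updating it, here is a minimal Python sketch. It is not the actual application code and does not use the Cloudant API: `LaggyStore`, `wait_until_written`, and `mark_sent` are hypothetical names, and the in-memory store only imitates a cluster where a new write takes a few reads to become visible.

```python
import time


class WriteNotConfirmed(Exception):
    """Raised when a document never becomes readable at the expected revision."""


class LaggyStore:
    """Hypothetical in-memory stand-in for a clustered document store.

    A written document only becomes visible after `lag_reads` further read
    attempts, imitating a write that has not yet been fully committed.
    """

    def __init__(self, lag_reads=2):
        self.lag_reads = lag_reads
        self._visible = {}   # doc_id -> committed document
        self._pending = {}   # doc_id -> [document, reads remaining]
        self._rev = 0

    def write(self, doc):
        self._rev += 1
        stored = dict(doc, _rev=str(self._rev))
        self._pending[doc["_id"]] = [stored, self.lag_reads]
        return stored["_rev"]

    def read(self, doc_id):
        entry = self._pending.get(doc_id)
        if entry is not None:
            if entry[1] <= 0:
                # The pending write has "landed": make it visible.
                self._visible[doc_id] = entry[0]
                del self._pending[doc_id]
            else:
                entry[1] -= 1
        doc = self._visible.get(doc_id)
        return dict(doc) if doc is not None else None


def wait_until_written(store, doc_id, expected_rev, attempts=5, delay=0.01):
    """Poll until the document is readable at the expected revision."""
    for attempt in range(attempts):
        doc = store.read(doc_id)
        if doc is not None and doc.get("_rev") == expected_rev:
            return doc
        time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    raise WriteNotConfirmed(doc_id)


def mark_sent(store, doc_id, queued_rev, attempts=5, delay=0.01):
    """Only update the status once the "queued" write is confirmed."""
    doc = wait_until_written(store, doc_id, queued_rev, attempts, delay)
    doc["status"] = "sent"
    return store.write(doc)
```

In this sketch the caller keeps the revision returned when the “queued” document was written and passes it to `mark_sent`, so the update is applied to a confirmed document rather than racing the original write. A real client would rely on the datastore's own write acknowledgements and revision checks instead of polling.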
We’re currently rebuilding the message index and expect to have that finished and working with 100% accuracy inside of 24 hours.
Regardless of the technical issues we’ve had with our front-end activity recently, we consider our recent delays and data inaccuracy to be completely unacceptable and wildly outside of the performance standard we strive for each day. The problems have gone on much longer than we anticipated and frankly, we find it a bit embarrassing.
We expect to have the problems resolved permanently over the next 24 hours and would like to offer our sincerest apologies to our users for the haphazard issues over the last 10 days.
- The Postmark Team
This post was originally published Mar 19, 2013