Earlier today, around 2:05 PM EDT, our API servers started to drop connections and time out. The entire team dropped everything and started to investigate. After confirming the API servers and backend databases were healthy, we began looking into other causes. During this time we decided to redirect traffic to our off-site disaster recovery data center to avoid losing messages.
After about 20 minutes, we narrowed down the cause to our RabbitMQ cluster. Two of the three nodes were not properly handling requests, and the cluster did not fail over properly. The only solution was to forcibly kill the two nodes, leaving us with a single healthy node. Once we verified this was working again, we redirected traffic back to our Chicago data center and enabled sending again. This happened close to 3:00 PM EDT.
We know that at some point during the queuing of messages, some messages were picked up and sent more than once. We're combing through logs to see what could have caused the duplicates. If you find them in your account, please get in touch and we'll reimburse the credits.
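Duplicates like these are a known consequence of at-least-once delivery: when a broker node dies before acknowledging a message, the message can be redelivered to another consumer. A minimal, hypothetical sketch of consumer-side deduplication is below; the `message_id` key and the in-memory `seen` set are illustrative assumptions, not a description of Postmark's actual pipeline:

```python
# Illustrative sketch only: deduplicating deliveries by message ID.
# In a real system the "seen" store would be persistent and shared
# across consumers, with entries expiring after a retention window.

def make_deduplicating_handler(send):
    """Wrap a send function so each message ID is delivered at most once."""
    seen = set()  # hypothetical store of already-processed message IDs

    def deliver(message_id, payload):
        if message_id in seen:
            return False  # redelivered duplicate; skip it
        seen.add(message_id)
        send(payload)
        return True

    return deliver

# Usage: simulate a redelivery after a broker failover.
sent = []
deliver = make_deduplicating_handler(sent.append)
deliver("msg-1", "hello")   # first delivery is sent
deliver("msg-1", "hello")   # duplicate is dropped
```

The trade-off is that deduplication state must outlive any single node; otherwise a failover can wipe the store and reopen the duplicate window.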
Recently we’ve had a few outages from various issues. This was the worst by far. We are going to do a full analysis of our RabbitMQ cluster to figure out what happened, as well as ways to avoid it in the future.
Next week we go on our company retreat, so stability in Postmark will surely be a big topic. I’m really sorry for the trouble we have caused. I’ve always said that in Postmark our #1 priority is to never drop a connection (lose a message) and today we’ve failed you big time.
This post was originally published Aug 28, 2014