Review of partial outage on 2014-04-28

What happened

We had a partial outage of the Postmark website and Bounces API endpoint on Apr 24 2014, from approximately 1:00PM EST to 5:30PM EST. At about 1:30PM we narrowed the problem down to our CouchDB views. At 2:30PM, after discussing the problem with our partner Cloudant, we realized that our view shards had been deleted during some ongoing maintenance and would need to be rebuilt, which would take more than 24 hours. Since waiting that long was not acceptable, JP and Nick implemented a series of hotfixes that use different data sources to work around the loss of our CouchDB views. The issues were resolved around 5:30PM.

What was fixed

  • Viewing messages on the website
  • Single bounce API call, GET /bounces/{bounceID}
  • Bounce dump, GET /bounces/{bounceID}/dump
  • Activating bounces, PUT /bounces/{bounceID}/activate

These portions of the application will show a slight degradation in performance until our views are rebuilt.

What we are still fixing

Deleting a server from the website or the API does not work right now. We are working on a fix.

Update, Apr 29 12:28PM EST: deleting servers should be fully functional now.

What we’re doing in the future

We’re going to leave our hotfixes in place and enhance their use of alternate data sources, so that if CouchDB views degrade again, recovery will be much quicker. We’re also going to continue working closely with Cloudant to make sure our CouchDB cluster is as resilient as possible.
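
To illustrate the general idea of falling back to an alternate data source, here is a minimal Python sketch. It is not our production code: the function names, the `SourceUnavailable` exception, and the record shape are all hypothetical, chosen only to show the pattern of trying the primary source first and moving on to the next one when it cannot answer.

```python
class SourceUnavailable(Exception):
    """Hypothetical error raised by a data source that cannot answer right now."""


def fetch_with_fallback(bounce_id, sources):
    """Try each (name, fetch) pair in order and return the first successful result.

    Returns a (source_name, record) tuple so callers can log which source answered.
    Raises SourceUnavailable if every source fails.
    """
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(bounce_id)
        except SourceUnavailable as exc:
            # Remember why this source failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise SourceUnavailable("; ".join(errors))


def couchdb_view_lookup(bounce_id):
    # Hypothetical stand-in for a view query that fails while views are rebuilt.
    raise SourceUnavailable("view is being rebuilt")


def alternate_store_lookup(bounce_id):
    # Hypothetical stand-in for a slower, alternate data source.
    return {"id": bounce_id}


# Primary source first, alternate source second.
SOURCES = [
    ("couchdb_view", couchdb_view_lookup),
    ("alternate", alternate_store_lookup),
]

if __name__ == "__main__":
    source_name, record = fetch_with_fallback(42, SOURCES)
    print(source_name, record)
```

Ordering the sources from fastest to slowest means the alternate path is only paid for when the primary one is unavailable, which matches the slight performance degradation described above.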