In the last two months we’ve had two big maintenance windows. During each one we had to queue messages and bring most services down. This caused significant delays (up to 30 minutes) for the important, time-sensitive emails that you need to get to customers. Even though this maintenance was planned, our goal is to perform zero-downtime maintenance windows. I’d like to give some background on what happened and what we need to do next.
The majority of Postmark’s infrastructure resides in a data center where we purchase and manage the hardware. This has allowed us to have full control over the growth and performance of the product over time. We’ve had the opportunity to be very specific in our needs, whether it is NVMe drives for mail servers or MySQL performance, or 10GbE networking across our cabinets.
This year, we realized we’d need to upgrade the physical firewalls in our cabinets to keep up with the demand as Postmark grows. We invested in new cabinets, a brand new Juniper networking stack, much more powerful firewalls, and more hardware to sustain the growth. Overall it’s been fantastic. We’ve nearly doubled our growth and maintained our internal expectations for time to inbox and performance.
The last effort is to migrate over to our new firewalls. While in many cases this might be simple, we have many years of history on the current firewalls and needed to be careful. We knew this would be complicated and we knew it would require at least a short outage. Working together with our provider, Server Central, we planned the cut over for November 13th. It didn’t go well. After the cut over we realized some of the routes were not working, so we reverted. Fortunately we used our edge data centers to capture messages and queue them to avoid any lost data.
After a lot of testing, and after breaking the migration into smaller pieces, we gave the final cut over another shot on December 27th. It went really well, except for one important subnet, which caused us to revert again. During this time we saved and queued messages again for about 30 minutes, but as I said earlier, these delays are not up to our standards.
One more time
This has been a frustrating project internally, not only because of the issues it causes you, our customers, but also because of the time it has taken from our team. This firewall cut over will offer huge room for growth and set us up for a larger global data center expansion beyond our current edge networks. It will also allow future work that will avoid this kind of maintenance altogether. We are at the point where we need to finalize it, and we think we have the last pieces to make it work.
We’d like to give it one more shot on January 4th (10pm ET). Our hope is to have the outage down to 15 minutes or so. We will queue all messages to be sent out once the maintenance is complete, but the web UI will be down.
I’m writing this to give you some background and let you know this is not an acceptable scenario for us either. This migration is important for the growth of the product and for making sure we give you the most stable service imaginable. In other words, I’m asking you to put up with this one more time so we can put it in the past. We already have projects in the works for expanding Postmark’s global footprint and conducting zero-downtime maintenance windows. It’s what you should expect from an infrastructure service.
Thanks again for your patience through all of this.