Increasing Postmark's capacity: A parable of pipes
It's been fascinating to see Postmark grow over the past few years. Today, we're processing more emails and supporting more customers than ever before. Just last month, we saw email volume increase by 24%. Sure, these numbers are exciting, but you know what we're even more proud of? While we continue to grow, we're still delivering email faster and more reliably than our competition. And our customers love Postmark just as much as they did a few years ago.
We never want that to change. But we also have to be realistic about the capabilities of our current setup: If Postmark continues to grow at the rate we’re seeing today, we’ll hit our capacity limits at some point—and if that ever happened, the user experience would suffer. So it’s time to get ahead of that growth. Let’s give Postmark some bigger pipes!
The Mail Flow #
When our users want to send email, it goes through our system like so:
Or, at least, that's how we model it logically. That's the flow of data through Postmark. That's a useful way to understand our systems, but there's another layer underneath it. The physical.
That layer is a little more complicated, so we're going to build it up piece by piece. First, let's look at where these services actually are, and add a couple more components:
We accept mail and process it into SendService inside AWS. Then, for the actual mail sending, we send it across Direct Connect (DC) to a set of load balancers (for high availability and traffic management), then to MTA, and then out to the internet.
I won't go too deep into the mechanics of DC in this post. But it's cool, and if you're interested, you can check it out here.
This is how mail flows through our system, both before and after this change.
So, why did we bother to do anything at all?
The problem #
Let's update our diagram again, and expand on what DC actually is:
At the physical layer, DC is really two connections. Each of them is capable of passing 1Gbps* of traffic, and we can load balance across the two. These are physical cables plugged in (more or less) from AWS direct to us, hence: direct connect. As an aside, this introduces an interesting failure mode: our cables to the cloud can be plugged in backwards:
When we bought and initially set up DC, we thought 2x1G links would last us forever.
Well, we're processing more and more emails, so forever is getting closer.
During peak traffic, we're hitting 50% of our capacity across Direct Connect on our existing load balancers:
Each time our bandwidth consumption across Direct Connect hits that 1Gbps cap, we see our 'ack rate', or processing rate, for email drop to zero.
When we're at capacity for a connection, we'd call the network congested. When a computer attempts to send more traffic across a congested network, there are all sorts of bad things that can happen. Connections can time out, packets can be dropped, latency (the time it takes to send traffic) can increase.
The nice thing about the network is that it really wants to get it right. It won't just lose data. It'll keep trying, with a combination of queues, backoffs, and retries. All of that leads to reduced performance, above and beyond just the network capacity limit.
It's the difference between everyone going through a doorway in an orderly line, vs everyone having to stop and organize and take turns. It's extra overhead.
And it's limiting our capacity to send email.
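If you prefer code to doorways, here's a toy sketch of that overhead, assuming a simple exponential backoff (roughly what TCP and most mail software do in some form, though the numbers and logic below are entirely made up for illustration and aren't Postmark's actual retry behavior):

```python
import random

# Toy model of a sender on a congested network. The drop rates and delays are
# invented; the point is that time spent backing off and re-sending is time
# not spent moving new mail.
def time_to_deliver(drop_probability: float, base_delay: float = 0.1) -> float:
    """Seconds spent delivering one message, with exponential backoff on failure."""
    delay, waited = base_delay, 0.0
    while random.random() < drop_probability:  # attempt lost to congestion
        waited += delay                        # back off...
        delay *= 2                             # ...twice as long next time
    return waited + base_delay                 # plus the attempt that got through

random.seed(1)
for p in (0.0, 0.3, 0.6):
    avg = sum(time_to_deliver(p) for _ in range(10_000)) / 10_000
    print(f"drop rate {p:.0%}: ~{avg:.2f}s per message")
```

At a 0% drop rate, every message costs exactly one attempt. Push the network into congestion and the same message costs several attempts plus all the waiting in between.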
The fix #
We need more capacity. We need bigger pipes. And without downtime, of course.
So, the first step is to get that new capacity online. Rather than 2x1G links, we want 2x10G links.
We ordered them, got them all set up, and started running both our 1G and 10G links simultaneously. It looked something like this:
Because we are balancing equally across all 4 pipes, in this setup we're limited to the top speed of the slowest link: 1G. With 4 links, that’s 4Gbps total. An improvement, but we can do better. All we need to do is kill those 1G pipes, and we'll be left with this:
I went ahead and tested performance across the 10G links to confirm:
That's... not 20Gbps.
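As an aside, the idea behind a throughput test like this is simple: shove bytes across the link as fast as you can and count them. Here's a minimal sketch of that idea in Python; the host, port, and sizes are placeholders I made up, and a real test would point the client at a machine on the far side of the links rather than at loopback:

```python
import socket
import threading
import time

CHUNK = 1024 * 1024              # 1 MiB per send (arbitrary)
DURATION = 5                     # seconds to run the test
HOST, PORT = "127.0.0.1", 5201   # placeholder: a real test targets the far end of the link

def sink():
    # Accept one connection and discard everything it sends.
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while conn.recv(CHUNK):
                pass

def blast():
    payload = b"\x00" * CHUNK
    sent = 0
    with socket.create_connection((HOST, PORT)) as sock:
        start = time.monotonic()
        while time.monotonic() - start < DURATION:
            sock.sendall(payload)
            sent += len(payload)
        elapsed = time.monotonic() - start
    print(f"~{sent * 8 / elapsed / 1e9:.2f} Gbps")

threading.Thread(target=sink, daemon=True).start()
time.sleep(0.5)                  # give the sink a moment to start listening
blast()
```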
The fix for the fix #
We should have been able to push 20Gbps across these links, but we're only getting 2Gbps. This forces us to confront a basic fact about networks: we can only send as much data as the smallest link will support. If we imagine a scenario like this:
It doesn't matter that Server B has a 100G connection: Server A can never send or receive more than 1G. This is easy to grok as a concept, but finding those bottlenecks can be complicated.
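In code, the concept fits in a couple of lines; the numbers here are just the Server A / Server B example above, not our real topology:

```python
# Capacity of each hop along the path, in Gbps (illustrative numbers only)
path = {
    "server_a_nic": 1,
    "switch_uplink": 10,
    "server_b_nic": 100,
}

# End-to-end throughput can never exceed the slowest hop.
print(f"ceiling: {min(path.values())} Gbps")  # 1 Gbps, no matter how fat Server B's pipe is
```

The hard part isn't the min(); it's knowing every hop that belongs in the dictionary.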
Let's dive deeper into the physical side. We'll only look at the DC pathway after AWS, and the connections from our physical load balancers and MTA to the switches:
I know, I know.
MTA and the load balancers are listed twice, connected to both the 1G and 10G switches, because they have multiple network interfaces.
Most consumer computers will have a wireless card, and maybe an Ethernet port. All of our servers have at least two cards: a 1G card and a 10G card. Each of those cards then has at least 2 redundant ports. Those get uplinked to redundant switches. Which are uplinked to redundant routers. Which are... you get the picture.
But that's ok. For our purposes, at this level, we can ignore that. We just need to know the aggregate capacity (noted on the diagram above), and then we need to pop back up a level, to the logical layer, to view how traffic is flowing.
In our case, traffic flows in from DC to the load balancers over the 1G switches, then from the load balancers to MTA over the 10G switches. This avoids saturating the 1G connections as a consequence of traffic amplification. The load balancers sit in the middle of every connection: they receive each message from one side and send a copy out the other, for both the client and the server. So any traffic going through them immediately doubles. The data flow looks something like this:
Each of these arrows is a different 'segment':
Green is the traffic from SendService to the load balancers
Purple is the traffic from the load balancers to MTA
Brown is the traffic from MTA out to the internet
Now, in this scenario, we have 10x load balancers, each with 2G uplinks to the 1G switch (the green path). 10x2G = 20Gbps max capacity. Why are we stuck at 2Gbps?
Do you see it?
Peep the red circle:
The max capacity between our 10G and 1G switches is only 2G. It doesn't matter that we've got all these load balancers with multiple connections: we're never pushing past 2G.
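The arithmetic, using the numbers above (as I've drawn the diagram):

```python
# Ten load balancers with 2G uplinks looks like 20G of capacity on paper, but all
# of that traffic first has to cross the single 2G link between the 10G and 1G switches.
lb_uplinks_gbps = [2] * 10
inter_switch_gbps = 2

effective = min(sum(lb_uplinks_gbps), inter_switch_gbps)
print(f"effective capacity: {effective} Gbps")  # 2 Gbps, hence the red circle
```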
One way we can fix this is to bump that link between the switches. And we did, but there's a limit to what we can do there. Instead, let's move some things around:
We can take the load balancers off the 1G network altogether. Instead, we'll have 2x load balancers, each with 20Gbps pipes. That's 40Gbps total. Remember that we have to derate that because of traffic amplification, making it effectively 20Gbps. Nice.
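Or, as back-of-the-envelope code (the function is mine, for illustration only, not anything in our actual tooling):

```python
def effective_capacity_gbps(uplink_gbps: float, lb_count: int) -> float:
    # A proxying load balancer handles every byte twice: in from one side,
    # back out to the other. So halve whatever the uplinks add up to.
    return uplink_gbps * lb_count / 2

print(effective_capacity_gbps(20, 2))  # 2 x 20G = 40G raw, ~20.0 effective
```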
Let's run our tests again, and make sure we're getting 20Gbps:
About 8Gbps. Progress, but not what we're looking for.
The pipes are inside the pipes.
When we look at the flow diagram again, we're at least 20G everywhere back to AWS.
The problem is that our flow diagram is wrong.
Yeah, I know I made it just now, but you get to suffer with me. The green flow really looks like this:
All that traffic is being routed through the firewall. Which, while maybe not optimal, shouldn't matter. It's 20G, right?
The connections between the 10G switch and the firewall are absolutely 20G. But inside the firewall are more pipes. Pipes that are designed to require you to give vendors more money. When we go look at the specs for our firewall, we find this tidbit:
Now, there are things we could do to get around this, but, really, the path forward is new firewalls. That's a whole different topic.
So where are we at? #
We now have ~9Gbps to play with, a significant improvement over the 1Gbps we started with. Thanks to our new, bigger pipes, we've got better performance and faster mail sending for our top-of-hour traffic. And with 2x10G links, we'll have enough bandwidth to AWS to last us forever.