Craig Box's journeys, stories and notes...

Migrating your servers to Amazon EC2: Load balancing


When you run a large web site, you probably have a number of machines, across a number of different availability zones, but you need to present a single URL to the user. You distribute the load between your machines with (a redundant pair of) load balancers, and point your DNS to the floating IP of the balancers.

A number of options for doing similar exist for Amazon EC2 users: as a good balance between convenience and performance, we chose to use Amazon's Elastic Load Balancing (ELB) service offering, with a caveat listed below. While a good default position, this may not be for you; check the bottom of this article for some resources to help you choose.

ELB has some great features. As well as the regular load balancer feature of tracking of which backend instances are up, it proactively adds extra capacity (which I term 'nodes', so as not to get confused with backend instances) in the event of increasing load. You can also set ELB up to spin up more backend instances in the case of there not being enough to serve your requests. All this for a small per-hour and per-GB cost.

Side note: You may be thinking "Why not use round robin DNS, and put the IPs of more than one server?" This is a trap for young players; you actually make things worse, because any one of N machines failing means there's a 1/N chance a request goes to a broken instance.  There's a good writeup on Server Fault if you want more information.

Then and now

In the old world, our site sat behind a hardware load balancer appliance.  Being that we were using a shared device at a co-location provider; I never saw it, and thus can't give you the exact details: but the important part of this story is that when traffic got to our instance, its source IP was still set to the IP of the sender, or at least the last proxy server it went through on it's travels. This matters to us, because, just like Phil Zimmerman's brain, some of Symbian's code is export controlled, due to containing cryptographic awesomesauce. We need to know the source IP of all requests, in case they are requesting our restricted areas.

When you're in EC2, you're operating under their network rules which "will not permit an instance to send traffic with a source IP or MAC address other than its own". This also applies to the instances that run the ELB service. If you set up an ELB, your backend servers will see all their traffic coming from the IP addresses of your ELB nodes, telling them nothing about where it came from before that.

The story that is falling into place largely revolves around the X-Forwarded-For header, which is added to HTTP transactions by proxy servers. Our back-end servers are told the packet arrived from the load balancer, but if you tell ELB that it's using the HTTP protocol on this port, it adds the X-F-F header automatically: the backends can then look at the most recently added entry to the X-F-F and learn the source IP as the ELB knew it.1

Because the load balancer sits between the client and the server, who are either end of an encrypted transaction, it can't rip open a HTTPS packet and add an arbitrary header. So, we had a Heisenproblem: it was not possible to know where something came from, and have that same something happen securely. And, stuff you are only giving to certain allowed people is exactly the sort of stuff you probably want to distribute over SSL.

There were two possible solutions to this:

  1. Direct secure traffic directly to a backend instance
  2. Wait for Amazon to implement SSL termination on ELB

In order to go live, we did #1. It came with a bunch of downsides, such as having to instruct our cache to redirect requests for certain paths to a different URL, such that if you requested, you were taken to "But what happens when that server goes down?", you say! When I planned this article, it was going to include a nice little description of how we got Heartbeat sharing an elastic IP address, so that we always had our "secure" IP pointing to whichever one of (a pair of) our servers which was up. It's a useful trick, so I'll come back to it later.

However, I'm pleased to announce that since then, Amazon have introduced #2: support for SSL termination, so you can upload your certificate to your load balancer, and then it can add the X-F-F header to your secure packets, and you don't need to worry about it any more.2

I was similarly going to have to worry about how to handle database failover in EC2, but they introduced that between me looking and go-live. I surmise that if you wait long enough, Amazon will do everything for you, and now delay introducing anything! :)

Now we know all that, let's dig a little deeper into how ELB works.

A Little Deeper

Amazon is all about the short-TTL DNS. If they want to scale something, they do so, and change what their DNS server returns when you query it.

When you register an ELB, you get given a DNS name such as You're explicitly warned to set your chosen site name as a CNAME to this; and indeed if you use the IP as it stands now, one day your site will break (for reasons you will learn below.)

First oddity with this setup: you can't CNAME the root of a domain, so you have to make a redirect to, preferably one hosted somewhere outside the cloud, as needs to be an A record to an IP address. Some DNS providers have a facility for doing redirects using their own servers, which is an option here.

If you were to query that DNS record you would find that it has a 60 second TTL; thus if you query it twice, 2 mins apart, and you have more than one ELB node3 you may, at the discretion of Amazon's algorithms, get different results.  Try this:

$ dig 60 IN A
$ dig @ 60 IN A

Dude, where's my balancing?

When you register an ELB, you tell it the availability zones it should operate it. Each AZ has at least one ELB node, and that node will route you to instances in its own AZ, unless there are none available. That, along with the fact you are pseudo-randomly given a IP (with a minimum 60 second TTL), leads to a non-obvious conclusion. This actually happened to us - our policy is that odd numbered servers are in -1a, and even numbered servers are in -1b.

external:~$ ab -n 10
web1:~$ wc -l /var/log/apache2/access.log
10 /var/log/apache2/access.log
web2:~$ wc -l /var/log/apache2/access.log
0 /var/log/apache2/access.log
Lop-sided load

That is to say: if your servers are in multiple availability zones4, a single user doing requests in quick succession isn't load-balanced across your backend instances, so ELB doesn't appear to be working at all. Thankfully, it is, you just can't see it, because you're not looking from enough places at once. ELB is designed to work for a widely distributed client base, and in that case, you should expect about half the traffic on one instance, and half on the other. If you ran this test from a different location, you might see all 10 requests go to web2.

If you ask Amazon5, they can change the DNS for an ELB so that it presents all the IP addresses associated, not just one of them. This means your client has the choice to pick the IP each time it connects, and depending on how your application works, may be better for test servers.


The prime reason to use an ELB is that Amazon can transparently add more computing power to support your load if needed.  The converse of that is that when it is no longer needed, it will be removed. It bears mention that if they take an IP address out of the DNS, it will last at least 60 minutes before being taken out of service. Not everyone obeys a TTL on a DNS zone!

To reiterate: don't ever take what the name currently resolves to, and use that IP.  It's not yours and one day it will break.

Further reading

For this article, I have touched on some of the interesting parts of ELB. I didn't feel I needed to write a general introduction, as there are already several good resources out there:

Check back later for talk about databases, storage, security, mail and more!

  1. If you're worried about people spoofing the X-F-F, you can trust that the most recently added entry was yours, and throw away all the rest. 
  2. It's like my boss knew I'd been sitting on writing this post, and just had to pip me to the post! 
  3. Due to having more traffic than one node can service, or being hosted in more than one AZ 
  4. A good practice if you're trying to mitigate site failure - see "No single point of failure". 
  5. You may have to have commercial support for them to do this. 

Leave a Reply