When it comes to traditional load balancers, you can either splurge on expensive hardware or go the software route. Hardware load balancers typically have poor or outdated API designs and are, at least in my experience, slow. You can find a few software load balancing products with decent APIs, but free alternatives like HAproxy leave you with bolt-on software that generates the configuration file for you. Even then, if you need high throughput, you have to rely on vertical scaling of your load balancer or round-robin DNS to distribute traffic horizontally.

We were trying to figure out how to avoid buying a half million dollars’ worth of load balancers every time we needed a new data center. What if you didn’t want to use a regular layer 4/7 load balancer and, instead, relied exclusively on layer 3? This seems entirely possible, especially after reading about how CloudFlare uses Anycast to solve this problem. There are a few ways to accomplish this. You can go full-blown BGP and run it all the way down to your top-of-rack switches, but that’s a commitment and likely requires a handful of full-time network engineers on your team. Running a BGP daemon on your servers is the easiest way to mix “Anycast for load balancing” into your network, and you have multiple daemons to choose from, such as BIRD, Quagga, or ExaBGP.

After my own research, I decided that ExaBGP is the easiest way to manipulate routes. The entire application is written in Python, making it perfect for hacking. ExaBGP has a decent API, and even supports JSON for parts of it. The API works by reading commands from your process’s STDOUT and sending information to your process through STDIN. In the end, I’m looking for automated control over my network, rather than more configuration management.

At this point, I can create a basic “healthcheck” process that might look like:

#!/usr/bin/env bash
# ExaBGP reads commands from this script's STDOUT, so the only lines we
# ever print are the announce/withdraw commands below.
STATE="down"

while true; do
  # Consider the service healthy if the healthcheck page contains "OK".
  if curl -s localhost:4000/healthcheck.html 2>/dev/null | grep -q OK; then
    if [[ "$STATE" != "up" ]]; then
      echo "announce route 10.1.1.2/32 next-hop self"
      STATE="up"
    fi
  else
    if [[ "$STATE" != "down" ]]; then
      echo "withdraw route 10.1.1.2/32 next-hop self"
      STATE="down"
    fi
  fi

  sleep 2
done
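
To sanity-check the script outside of ExaBGP, you can run it by hand against a throwaway web server and watch for the announce line on STDOUT. A rough sketch, with the /tmp path, port, and Python 3.7+ http.server assumed purely for illustration:

# serve a passing healthcheck on port 4000, then run the script by hand
mkdir -p /tmp/hc && echo "OK" > /tmp/hc/healthcheck.html
python3 -m http.server 4000 --directory /tmp/hc &
/usr/local/bin/healthcheck.sh
# within a couple of seconds you should see:
#   announce route 10.1.1.2/32 next-hop self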

Then in your ExaBGP configuration file, you would add something like this:

group anycast-test {
  router-id 10.1.10.11;
  local-as 65001;    # ASN of this server
  peer-as 65002;     # ASN of the upstream router

  # ExaBGP runs this script and reads announce/withdraw commands from its STDOUT
  process watch-application {
    run /usr/local/bin/healthcheck.sh;
  }

  neighbor 10.1.10.1 {
    local-address 10.1.10.11;
  }
}
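
Two bits of host-side plumbing are worth calling out, sketched below with illustrative paths: the anycast service IP needs to be bound locally (typically on the loopback) so the host will actually accept traffic routed to it, and ExaBGP is started by pointing it at the configuration file.

# bind the service IP on the loopback so the host accepts routed traffic for it
ip addr add 10.1.1.2/32 dev lo

# start ExaBGP with the configuration above (config path is illustrative)
exabgp /etc/exabgp/anycast-test.conf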

Now, any time your curl | grep check is passing, your BGP neighbor (10.1.10.1) will have a route to your service IP (10.1.1.2). When the check begins to fail, the route will be withdrawn from the neighbor. If you deploy this on a handful of servers, your upstream BGP neighbor will have multiple equal-cost routes to the same service IP. At this point, you have to configure your router to spread traffic across those paths. In JUNOS, this would look like:

set policy-options policy-statement load-balancing-policy then load-balance per-packet
set routing-options forwarding-table export load-balancing-policy
commit

Even though the above says load-balance per-packet, it is actually more of a load-balance per-flow, since each TCP session will stick to one route rather than individual packets going to different backend servers. As far as I can tell, the naming stems from legacy chipsets that did not support per-flow packet distribution. You can read more about this configuration on Juniper’s website.
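
For completeness, the router also has to peer with each server and have BGP multipath enabled before it will install more than one of those equal-cost routes. A rough JUNOS sketch, with the group name chosen only for illustration and the ASNs/addresses matching the ExaBGP example above:

set routing-options autonomous-system 65002
set protocols bgp group anycast-servers type external
set protocols bgp group anycast-servers peer-as 65001
set protocols bgp group anycast-servers multipath
set protocols bgp group anycast-servers neighbor 10.1.10.11
commit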

There are some scale limitations, though. It comes down to how many ECMP next-hops your hardware router can handle. I know a Juniper MX240 can handle 16 next-hops, and I have heard rumors that a software update will bump this to 64, but it is something to keep in mind. A tiered approach may be appropriate if you need a high number of backend machines: a layer of route servers running BIRD/Quagga, with your backend services peering to them using ExaBGP. You could even use this approach to scale HAproxy horizontally.

In conclusion, replacing a traditional load balancer with layer 3 routing is entirely possible. In fact, done right, it can even give you more control over where traffic flows in your data center. I look forward to rolling this out with more backend services over the coming months and learning what problems may arise. The possibilities are endless, and I’d love to hear more about what others are doing.