High availability with ExaBGP

Vincent Bernat

When it comes to provide redundant services, several options are available:

  • The service can be hosted behind a set of load-balancers. They will detect any faulty node. However, you need to ensure that this new layer is also fault-tolerant.
  • The nodes providing the service can rely on IP failover to share a set of IP using protocols like VRRP1 or CARP. The IP address of a faulty node will be assigned to another node. This requires all nodes to be part of the same IP subnet.
  • The clients of the service can ask a third-party for available nodes. Usually, this is achieved through round-robin DNS where only working nodes are advertised in a DNS record. The failover can be quite long because of caches.

A common setup is a combination of these solutions: web servers are behind a couple of load-balancers achieving both redundancy and load-balancing. The load-balancers use VRRP to ensure redundancy. The whole setup is replicated to another datacenter and round-robin DNS is used to ensure both redundancy and load-balacing of the datacenters.

There is a fourth option which is similar to VRRP but relies on dynamic routing and therefore is not limited to nodes in the same subnet:

  • The nodes advertise their availability with BGP to announce the set of service IP addresses they are able to serve. Each address is weighted such that IP addresses are balanced among the available nodes.

We will explore how to implement this fourth option using ExaBGP, the BGP swiss army knife of networking, in a small lab based on KVM. You can grab the complete lab from GitHub. ExaBGP 3.2.5 is needed to run this lab.

Update (2019-05)

Also have a look at “Multi-tier load-balancing with Linux.” It describes the build of a highly available load-balancing layer of which ExaBGP is one of the components.

Environment#

We will be working in the following (IPv6-only) environment:

Lab with three web nodes
Basic routed network with three web nodes

BGP configuration#

BGP is enabled on ER2 and ER3 to exchange routes with peers and transits (only R1 for this lab). BIRD is used as a BGP daemon. Its configuration is pretty basic. Here is a fragment:

router id 1.1.1.2;

protocol static NETS { # ❶
  import all;
  export none;
  route 2001:db8::/40 reject;
}
protocol bgp R1 { # ❷
  import all;
  export where proto = "NETS";
  local as 64496;
  neighbor 2001:db8:1000::1 as 64511;
}
protocol bgp ER3 { # ❸
  import all;
  export all;
  next hop self;
  local as 64496;
  neighbor 2001:db8:1::3 as 64496;
}

First, in ❶, we declare the routes that we want to export to our peers. We don’t try to summarize them from the IGP. We just unconditonnaly export the networks that we own. Then, in ❷, R1 is defined as a neighbor and we export the static route we declared previously. We import any route that R1 is willing to send us. In ❸, we share everything we know with our pal, ER3, using internal BGP.

OSPF configuration#

OSPF will distribute routes inside the AS. It is enabled on ER2, ER3, DR6, DR7 and DR8. For example, here is the relevant part of the configuration of DR6:

router id 1.1.1.6;
protocol kernel {
   persist;
   import none;
   export all;
}
protocol ospf INTERNAL {
  import all;
  export none;
  area 0.0.0.0 {
    networks {
      2001:db8:1::/64;
      2001:db8:6::/64;
    };
    interface "eth0";
    interface "eth1" { stub yes; };
  };
}

ER2 and ER3 inject a default route into OSPF:

protocol static DEFAULT {
  import all;
  export none;
  route ::/0 via 2001:db8:1000::1;
}
filter default_route {
  if proto = "DEFAULT" then accept;
  reject;
}
protocol ospf INTERNAL {
  import all;
  export filter default_route;
  area 0.0.0.0 {
    networks {
      2001:db8:1::/64;
    };
    interface "eth1";
  };
}

Web nodes#

The web nodes are pretty basic. They have a default static route to the nearest router and that’s all. The interesting thing here is that they are each on a separate IP subnet: we cannot share an IP using VRRP.2

Why are these web servers on different subnets? Maybe they are not in the same datacenter or maybe your network architecture is using a routed access layer.

Let’s see how to use BGP to enable redundancy of these web nodes.

Redundancy with ExaBGP#

ExaBGP is a convenient tool to plug scripts into BGP. They can then receive and advertise routes. ExaBGP does the hard work of speaking BGP with your routers. The scripts just have to read routes from standard input or advertise them on standard output.

The big picture#

Here is the big picture:

Using ExaBGP to advertise web services
High availability using ExaBGP and two route servers

Let’s explain it step by step:

  1. Three IP addresses will be allocated for our web service: 2001:db8:30::1, 2001:db8:30::2 and 2001:db8:30::3. They are distinct from the real IP addresses of W1, W2 and W3.

  2. Each web node will advertise all of them to the route servers we added in the network. I will talk more about these route servers later.

  3. Each route comes with a metric to help the route server choose where it should be routed. We choose the metrics such that each IP address will be routed to a distinct web node (unless there is a problem).

  4. The route servers (which are not routers) will then advertise the best routes they learned to all the routers in our network. This is still done using BGP.

  5. Now, for a given IP address, each router knows to which web node the traffic should be directed.

Here are the respective metrics announced routes for W1, W2 and W3 when everything works as expected:

Route W1 W2 W3 Best Backup
2001:db8:30::1 102 101 100 W3 W2
2001:db8:30::2 101 100 102 W2 W1
2001:db8:30::3 100 102 101 W1 W3

ExaBGP configuration#

The configuration of ExaBGP is quite simple:

group rs {
  neighbor 2001:db8:1::4 {
    router-id 1.1.1.11;
    local-address 2001:db8:6::11;
    local-as 65001;
    peer-as 65002;
  }
  neighbor 2001:db8:8::5 {
    router-id 1.1.1.11;
    local-address 2001:db8:6::11;
    local-as 65001;
    peer-as 65002;
  }

  process watch-nginx {
      run /usr/bin/python /lab/healthcheck.py -s --config /lab/healthcheck-nginx.conf --start-ip 0;
  }
}

A script checks if the service (an nginx web server) is up and running and advertise the appropriate IP addresses to the two declared route servers. If we run the script manually, we can see the advertised routes:

$ python /lab/healthcheck.py --config /lab/healthcheck-nginx.conf --start-ip 0
INFO[healthcheck] send announces for UP state to ExaBGP
announce route 2001:db8:30::3/128 next-hop self med 100
announce route 2001:db8:30::2/128 next-hop self med 101
announce route 2001:db8:30::1/128 next-hop self med 102
[…]
WARNING[healthcheck] Check command was unsuccessful: 7
INFO[healthcheck] Output of check command:  curl: (7) Failed connect to ip6-localhost:80; Connection refused
WARNING[healthcheck] Check command was unsuccessful: 7
INFO[healthcheck] Output of check command:  curl: (7) Failed connect to ip6-localhost:80; Connection refused
WARNING[healthcheck] Check command was unsuccessful: 7
INFO[healthcheck] Output of check command:  curl: (7) Failed connect to ip6-localhost:80; Connection refused
INFO[healthcheck] send announces for DOWN state to ExaBGP
announce route 2001:db8:30::3/128 next-hop self med 1000
announce route 2001:db8:30::2/128 next-hop self med 1001
announce route 2001:db8:30::1/128 next-hop self med 1002

When the service becomes unresponsive, the healthcheck script detect the situation and retry several times before acknowledging that the service is dead. Then, the IP addresses are advertised with higher metrics and the service will be routed to another node (the one advertising 2001:db8:30::3/128 with metric 101).

This healthcheck script is now part of ExaBGP and can be invoked with:

$ python -m exabgp healthcheck --config /lab/healthcheck-nginx.conf --start-ip 0

Route servers#

We could have connected our ExaBGP servers directly to each router. However, if you have 20 routers and 10 web servers, you now have to manage a mesh of 200 sessions. The route servers are here for three purposes:

  1. Reduce the number of BGP sessions (from 200 to 60) between equipments (less configuration, less errors).

  2. Avoid modifying the configuration on routers each time a new service is added.

  3. Separate the routing decision (the route servers) from the routing process (the routers).

You may also ask yourself: “why not use OSPF?” Good question!

  • OSPF could be enabled on each web node and the IP addresses advertised using this protocol. However, OSPF has several drawbacks: it does not scale, there are restrictions on the allowed topologies, it is difficult to filter routes inside OSPF and a misconfiguration will likely impact the whole network. Therefore, it is considered a good practice to limit OSPF to network equipments.

  • The routes learned by the route servers could be injected into OSPF. On paper, OSPF has a “next-hop” field to provide an explicit next-hop. This would be handy as we wouldn’t have to configure adjacencies with each router. However, I have absolutely no idea how to inject BGP next-hop into OSPF next-hop. What happens is that BGP next-hop is resolved locally using OSPF routes. For example, if we inject BGP routes into OSPF from RS4, RS4 will know the appropriate next-hop but other routers will route the traffic to RS4.

Let’s look at how to configure our route servers. RS4 will use BIRD while RS5 will use Quagga. Using two different implementations will help with resiliency by avoiding a bug to hit the two route servers at the same time.

BIRD configuration#

There are two sides for BGP configuration: the BGP sessions with the ExaBGP nodes and the ones with the routers. Here is the configuration for the latter:

template bgp INFRABGP {
  export all;
  import none;
  local as 65002;
  rs client;
}
protocol bgp ER2 from INFRABGP {
  neighbor 2001:db8:1::2 as 65003;
}
protocol bgp ER3 from INFRABGP {
  neighbor 2001:db8:1::3 as 65003;
}
protocol bgp DR6 from INFRABGP {
  neighbor 2001:db8:1::6 as 65003;
}
protocol bgp DR7 from INFRABGP {
  neighbor 2001:db8:1::7 as 65003;
}
protocol bgp DR8 from INFRABGP {
  neighbor 2001:db8:1::8 as 65003;
}

The AS number used by our route server is 65002 while the AS number used for routers is 65003 (the AS numbers for web nodes will be 65001). They are reserved for private use by RFC 6996. All routes known by the route server are exported to the routers but no routes are accepted from them.

Let’s have a look at the other side:

# Only import loopback IPs
filter only_loopbacks { # ❶
  if net ~ [ 2001:db8:30::/64{128,128} ] then accept;
  reject;
}

# General template for an EXABGP node
template bgp EXABGP {
  local as 65002;
  import filter only_loopbacks; # ❷
  export none;
  route limit 10; # ❸
  rs client;
  hold time 6; # ❹
  multihop 10;
  igp table internal;
}

protocol bgp W1 from EXABGP {
  neighbor 2001:db8:6::11 as 65001;
}
protocol bgp W2 from EXABGP {
  neighbor 2001:db8:7::12 as 65001;
}
protocol bgp W3 from EXABGP {
  neighbor 2001:db8:8::13 as 65001;
}

To ensure separation of concerns, we are being a bit more picky. With ❶ and ❷, we only accept loopback addresses and only if they are contained in the subnet that we reserved for this use. No server should be able to inject arbitrary addresses into our network. With ❸, we also limit the number of routes that a server can advertise.

With ❹, we reduce the hold time from 240 to 6. It means that after 6 seconds, the peer is considered dead. This is quite important to be able to recover quickly from a dead machine. The minimal value is 3. We could have use a similar setting with the session with the routers.

Quagga configuration#

Quagga’s configuration is a bit more verbose but should be strictly equivalent:

router bgp 65002 view EXABGP
 bgp router-id 1.1.1.5
 bgp log-neighbor-changes
 no bgp default ipv4-unicast

 neighbor R peer-group
 neighbor R remote-as 65003
 neighbor R ebgp-multihop 10

 neighbor EXABGP peer-group
 neighbor EXABGP remote-as 65001
 neighbor EXABGP ebgp-multihop 10
 neighbor EXABGP timers 2 6
!
 address-family ipv6

 neighbor R activate
 neighbor R soft-reconfiguration inbound
 neighbor R route-server-client
 neighbor R route-map R-IMPORT import
 neighbor R route-map R-EXPORT export
 neighbor 2001:db8:1::2 peer-group R
 neighbor 2001:db8:1::3 peer-group R
 neighbor 2001:db8:1::6 peer-group R
 neighbor 2001:db8:1::7 peer-group R
 neighbor 2001:db8:1::8 peer-group R

 neighbor EXABGP activate
 neighbor EXABGP soft-reconfiguration inbound
 neighbor EXABGP maximum-prefix 10
 neighbor EXABGP route-server-client
 neighbor EXABGP route-map RSCLIENT-IMPORT import
 neighbor EXABGP route-map RSCLIENT-EXPORT export
 neighbor 2001:db8:6::11 peer-group EXABGP
 neighbor 2001:db8:7::12 peer-group EXABGP
 neighbor 2001:db8:8::13 peer-group EXABGP

 exit-address-family
!
ipv6 prefix-list LOOPBACKS seq 5 permit 2001:db8:30::/64 ge 128 le 128
ipv6 prefix-list LOOPBACKS seq 10 deny any
!
route-map RSCLIENT-IMPORT deny 10
!
route-map RSCLIENT-EXPORT permit 10
  match ipv6 address prefix-list LOOPBACKS
!
route-map R-IMPORT permit 10
!
route-map R-EXPORT deny 10
!

The view is here to not install routes in the kernel.3

Routers#

Configuring BIRD to receive routes from route servers is straightforward:

# BGP with route servers
protocol bgp RS4 {
  import all;
  export none;
  local as 65003;
  neighbor 2001:db8:1::4 as 65002;
  gateway recursive;
}
protocol bgp RS5 {
  import all;
  export none;
  local as 65003;
  neighbor 2001:db8:8::5 as 65002;
  multihop 4;
  gateway recursive;
}

It is important to set gateway recursive because most of the time, the next-hop is not reachable directly. In this case, by default, BIRD will use the IP address of the advertising router (the route servers).

Testing#

Let’s check that everything works as expected. Here is the view from RS5:

# show ipv6  bgp
BGP table version is 0, local router ID is 1.1.1.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*  2001:db8:30::1/128
                    2001:db8:6::11         102             0 65001 i
*>                  2001:db8:8::13         100             0 65001 i
*                   2001:db8:7::12         101             0 65001 i
*  2001:db8:30::2/128
                    2001:db8:6::11         101             0 65001 i
*                   2001:db8:8::13         102             0 65001 i
*>                  2001:db8:7::12         100             0 65001 i
*> 2001:db8:30::3/128
                    2001:db8:6::11         100             0 65001 i
*                   2001:db8:8::13         101             0 65001 i
*                   2001:db8:7::12         102             0 65001 i

Total number of prefixes 3

For example, traffic to 2001:db8:30::2 should be routed through 2001:db8:7::12 (which is W2). Other IP are affected to W1 and W3.

RS4 should see the same thing:4

$ birdc6 show route
BIRD 1.3.11 ready.
2001:db8:30::1/128 [W3 22:07 from 2001:db8:8::13] * (100/20) [AS65001i]
                   [W1 23:34 from 2001:db8:6::11] (100/20) [AS65001i]
                   [W2 22:07 from 2001:db8:7::12] (100/20) [AS65001i]
2001:db8:30::2/128 [W2 22:07 from 2001:db8:7::12] * (100/20) [AS65001i]
                   [W1 23:34 from 2001:db8:6::11] (100/20) [AS65001i]
                   [W3 22:07 from 2001:db8:8::13] (100/20) [AS65001i]
2001:db8:30::3/128 [W1 23:34 from 2001:db8:6::11] * (100/20) [AS65001i]
                   [W3 22:07 from 2001:db8:8::13] (100/20) [AS65001i]
                   [W2 22:07 from 2001:db8:7::12] (100/20) [AS65001i]

Let’s have a look at DR6:

$ birdc6 show route
2001:db8:30::1/128 via fe80::5054:56ff:fe6e:98a6 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:56ff:fe6e:98a6 on eth0 (100/20) [AS65001i]
2001:db8:30::2/128 via fe80::5054:60ff:fe02:3681 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:60ff:fe02:3681 on eth0 (100/20) [AS65001i]
2001:db8:30::3/128 via 2001:db8:6::11 on eth1 * (100/10) [AS65001i]
                   via 2001:db8:6::11 on eth1 (100/10) [AS65001i]

2001:db8:30::3 is owned by W1 which is behind DR6 while the two others are in another part of the network and will be reached through the appropriate link-local addresses learned by OSPF.

Let’s kill W1 by stopping nginx. A few seconds later, DR6 learns the new routes:

$ birdc6 show route
2001:db8:30::1/128 via fe80::5054:56ff:fe6e:98a6 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:56ff:fe6e:98a6 on eth0 (100/20) [AS65001i]
2001:db8:30::2/128 via fe80::5054:60ff:fe02:3681 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:60ff:fe02:3681 on eth0 (100/20) [AS65001i]
2001:db8:30::3/128 via fe80::5054:56ff:fe6e:98a6 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:56ff:fe6e:98a6 on eth0 (100/20) [AS65001i]

Demo#

For a demo, have a look at the following video:


  1. The primary target of VRRP is to make a default gateway for an IP network highly available by exposing a virtual router which is an abstract representation of multiple routers. The virtual IP address is held by the master router and any backup router can be promoted to master in case of a failure. In practice, VRRP can be used to achieve high availability of services too. ↩︎

  2. However, you could deploy an L2 overlay network using, for example, VXLAN and use VRRP on it. ↩︎

  3. I have been told on the Quagga mailing-list that such a setup is quite uncommon and that it would be better to not use views but use --no_kernel flag for bgpd instead. You may want to look at the whole thread for more details. ↩︎

  4. The output of birdc6 shows both the next-hop as advertised in BGP and the resolved next-hop if the route have to be exported into another protocol. I have simplified the output to avoid confusion. ↩︎