TL;DR: With its implementation of IPv4 routing tables using LPC-tries, Linux offers good lookup performance (50 ns for a full view) and low memory usage (64 MiB for a full view).
During the lifetime of an IPv4 datagram inside the Linux kernel, one
important step is the route lookup for the destination address
fib_lookup() function. From essential
information about the datagram (source and destination IP addresses,
interfaces, firewall mark, …), this function should quickly provide
a decision. Some possible options are:
- local delivery (
- forwarding to a supplied next hop (
- silent discard (
Route lookup in a trie🔗
Looking up a route in a routing table is to find the most specific prefix matching the requested destination. Let’s assume the following routing table:
$ ip route show scope global table 100 default via 203.0.113.5 dev out2 192.0.2.0/25 nexthop via 203.0.113.7 dev out3 weight 1 nexthop via 203.0.113.9 dev out4 weight 1 192.0.2.47 via 203.0.113.3 dev out1 192.0.2.48 via 203.0.113.3 dev out1 192.0.2.49 via 203.0.113.3 dev out1 192.0.2.50 via 203.0.113.3 dev out1
Here are some examples of lookups and the associated results:
|Destination IP||Next hop|
A common structure for route lookup is the trie, a tree structure where each node has its parent as prefix.
Lookup with a simple trie🔗
The following trie encodes the previous routing table:
For each node, the prefix is known by its path from the root node and the prefix length is the current depth.
A lookup in such a trie is quite simple: at each step, fetch the
nth bit of the IP address, where n is the current
depth. If it is 0, continue with the first child. Otherwise, continue
with the second. If a child is missing, backtrack until a routing
entry is found. For example, when looking for
192.0.2.50, we will
find the result in the corresponding leaf (at depth 32). However for
192.0.2.51, we will reach
192.0.2.50/31 but there is no second
child. Therefore, we backtrack until the
192.0.2.0/25 routing entry.
Adding and removing routes is quite easy. From a performance point of view, the lookup is done in constant time relative to the number of routes (due to maximum depth being capped to 32).
Quagga is an example of routing software still using this simple approach.
Lookup with a path-compressed trie🔗
In the previous example, most nodes only have one child. This leads to a lot of unneeded bitwise comparisons and memory is also wasted on many nodes. To overcome this problem, we can use path compression: each node with only one child is removed (except if it also contains a routing entry). Each remaining node gets a new property telling how many input bits should be skipped. Such a trie is also known as a Patricia trie or a radix tree. Here is the path-compressed version of the previous trie:
Since some bits have been ignored, on a match, a final check is executed to ensure all bits from the found entry are matching the input IP address. If not, we must act as if the entry wasn’t found (and backtrack to find a matching prefix). The following figure shows two IP addresses matching the same leaf:
The reduction on the average depth of the tree compensates the necessity to handle these false positives. The insertion and deletion of a routing entry is still easy enough.
Many routing systems are using Patricia trees:
Lookup with a level-compressed trie🔗
In addition to path compression, level compression2 detects parts of the trie that are densily populated and replace them with a single node and an associated vector of 2k children. This node will handle k input bits instead of just one. For example, here is a level-compressed version our previous trie:
Such a trie is called LC-trie or LPC-trie and offers higher lookup performances compared to a radix tree.
An heuristic is used to decide how many bits a node should handle. On Linux, if the ratio of non-empty children to all children would be above 50% when the node handles an additional bit, the node gets this additional bit. On the other hand, if the current ratio is below 25%, the node loses the responsibility of one bit. These values are not tunable.
Insertion and deletion becomes more complex but lookup times are also improved.
Implementation in Linux🔗
Here is the representation of our example routing table in memory:3
There are several structures involved:
struct fib_tablerepresents a routing table,
struct trierepresents a complete trie,
struct key_vectorrepresents either an internal node (when
bitsis not zero) or a leaf,
struct fib_infocontains the characteristics shared by several routes (like a next-hop gateway and an output interface),
struct fib_aliasis the glue between the leaves and the
The trie can be retrieved through
$ cat /proc/net/fib_trie Id 100: +-- 0.0.0.0/0 2 0 2 |-- 0.0.0.0 /0 universe UNICAST +-- 192.0.2.0/26 2 0 1 |-- 192.0.2.0 /25 universe UNICAST |-- 192.0.2.47 /32 universe UNICAST +-- 192.0.2.48/30 2 0 1 |-- 192.0.2.48 /32 universe UNICAST |-- 192.0.2.49 /32 universe UNICAST |-- 192.0.2.50 /32 universe UNICAST [...]
For internal nodes, the numbers after the prefix are:
- the number of bits handled by the node,
- the number of full children (they only handle one bit),
- the number of empty children.
Moreover, if the kernel was compiled with
some interesting statistics are available in
$ cat /proc/net/fib_triestat Basic info: size of leaf: 48 bytes, size of tnode: 40 bytes. Id 100: Aver depth: 2.33 Max depth: 3 Leaves: 6 Prefixes: 6 Internal nodes: 3 2: 3 Pointers: 12 Null ptrs: 4 Total size: 1 kB [...]
When a routing table is very dense, a node can handle many bits. For example, a densily populated routing table with 1 million entries packed in a /12 can have one internal node handling 20 bits. In this case, route lookup is essentially reduced to a lookup in a vector.
The following graph shows the number of internal nodes used relative to the number of routes for different scenarios (routes extracted from an Internet full view, /32 routes spreaded over 4 different subnets with various densities). When routes are densily packed, the number of internal nodes are quite limited.
So how performant is a route lookup? The maximum depth stays low
(about 6 for a full view), so a lookup should be quite fast. With the
help of a small kernel module, we can accurately benchmark5
The lookup time is loosely tied to the maximum depth. When the routing table is densily populated, the maximum depth is low and the lookup times are fast.
When forwarding at 10 Gbps, the time budget for a packet would be about 50 ns. Since this is also the time needed for the route lookup alone in some cases, we wouldn’t be able to forward at line rate with only one core. Nonetheless, the results are pretty good and they are expected to scale linearly with the number of cores.
The measurements are done with a Linux kernel 4.11 from Debian unstable. I have gathered performance metrics accross kernel versions in “Performance progression of IPv4 route lookup on Linux.”
Another interesting figure is the time it takes to insert all these routes into the kernel. Linux is also quite efficient in this area since you can insert 2 million routes in less than 10 seconds:
The memory usage is available directly in
/proc/net/fib_triestat. The statistic provided doesn’t account for
fib_info structures, but you should only have a handful of them
(one for each possible next-hop). As you can see on the graph below,
the memory use is linear with the number of routes inserted, whatever
the shape of the routes is.
The results are quite good. With only 256 MiB, about 2 million routes can be stored!
Unless configured without
supports several routing tables and has a system of configurable
rules to select the table to use. These rules can be configured with
ip rule. By default, there are three of them:
$ ip rule show 0: from all lookup local 32766: from all lookup main 32767: from all lookup default
Linux will first lookup for a match in the
local table. If it
doesn’t find one, it will lookup in the
main table and at last
local table contains routes for local delivery:
$ ip route show table local broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1 local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1 local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1 broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1 broadcast 192.168.117.0 dev eno1 proto kernel scope link src 192.168.117.55 local 192.168.117.55 dev eno1 proto kernel scope host src 192.168.117.55 broadcast 192.168.117.63 dev eno1 proto kernel scope link src 192.168.117.55
This table is populated automatically by the kernel when addresses are
configured. Let’s look at the three last lines. When the IP address
192.168.117.55 was configured on the
eno1 interface, the kernel
automatically added the appropriate routes:
- a route for
192.168.117.55for local unicast delivery to the IP address,
- a route for
192.168.117.255for broadcast delivery to the broadcast address,
- a route for
192.168.117.0for broadcast delivery to the network address.
127.0.0.1 was configured on the loopback interface, the same
kind of routes were added to the
local table. However, a loopback
address receives a special treatment and the kernel also adds the
whole subnet to the
local table. As a result, you can ping any IP in
$ ping -c1 127.42.42.42 PING 127.42.42.42 (127.42.42.42) 56(84) bytes of data. 64 bytes from 127.42.42.42: icmp_seq=1 ttl=64 time=0.039 ms --- 127.42.42.42 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms
main table usually contains all the other routes:
$ ip route show table main default via 192.168.117.1 dev eno1 proto static metric 100 192.168.117.0/26 dev eno1 proto kernel scope link src 192.168.117.55 metric 100
default route has been configured by some DHCP daemon. The
connected route (
scope link) has been automatically added by the
proto kernel) when configuring an IP address on the
default table is empty and has little use. It has been kept when
the current incarnation of advanced routing has been introduced in
Linux 2.1.68 after a first tentative using “classes” in Linux 184.108.40.206
Since Linux 4.1 (commit 0ddcf43d5d4a), when the set of rules is
left unmodified, the
local tables are merged and the
lookup is done with this single table (and the
default table if not
empty). Moreover, since Linux 3.0 (commit f4530fa574df), without
specific rules, there is no performance hit when enabling the support
for multiple routing tables. However, as soon as you add new rules,
some CPU cycles will be spent for each datagram to evaluate them. Here
is a couple of graphs demonstrating the impact of routing rules on
For some reason, the relation is linear when the number of rules is between 1 and 100 but the slope increases noticeably past this threshold. The second graph highlights the negative impact of the first rule (about 30 ns).
A common use of rules is to create virtual routers: interfaces are segregated into domains and when a datagram enters through an interface from domain A, it should use routing table A:
# ip rule add iif vlan457 table 10 # ip rule add iif vlan457 blackhole # ip rule add iif vlan458 table 20 # ip rule add iif vlan458 blackhole
The blackhole rules may be removed if you are sure there is a default route in each routing table. For example, we add a blackhole default with a high metric to not override a regular default route:
# ip route add blackhole default metric 9999 table 10 # ip route add blackhole default metric 9999 table 20 # ip rule add iif vlan457 table 10 # ip rule add iif vlan458 table 20
To reduce the impact on performance when many interface-specific rules are used, interfaces can be attached to VRF instances and a single rule can be used to select the appropriate table:
# ip link add vrf-A type vrf table 10 # ip link set dev vrf-A up # ip link add vrf-B type vrf table 20 # ip link set dev vrf-B up # ip link set dev vlan457 master vrf-A # ip link set dev vlan458 master vrf-B # ip rule show 0: from all lookup local 1000: from all lookup [l3mdev-table] 32766: from all lookup main 32767: from all lookup default
l3mdev-table rule was automatically added when
configuring the first VRF interface. This rule will select the routing
table associated to the VRF owning the input (or output) interface.
VRF was introduced in Linux 4.3 (commit 193125dbd8eb), the performance was greatly enhanced in Linux 4.8 (commit 7889681f4a6c) and the special routing rule was also introduced in Linux 4.8 (commit 96c63fa7393d, commit 1aa6c4f6b8cd). You can find more details about it in the kernel documentation.
The takeaways from this article are:
- route lookup times hardly increase with the number of routes,
- densily packed /32 routes lead to amazingly fast route lookups,
- memory use is low (128 MiB par million routes),
- no optimization is done on routing rules.
For IPv6, have a look at “IPv6 route lookup on Linux.”
The routing cache was subject to reasonably easy to launch denial of service attacks. It was also believed to not be efficient for high volume sites like Google but I have first-hand experience it was not the case for moderately high volume sites. ↩︎
For internal nodes, the
key_vectorstructure is embedded into a
tnodestructure. This structure contains information rarely used during lookup, notably the reference to the parent that is usually not needed for backtracking as Linux keeps the nearest candidate in a variable. ↩︎
One leaf can contain several routes (
struct fib_aliasis a list). The number of “prefixes” can therefore be greater than the number of leaves. The system also keeps statistics about the distribution of the internal nodes relative to the number of bits they handle. In our example, all the three internal nodes are handling 2 bits. ↩︎
The measurements are done in a virtual machine with one vCPU. The host is an Intel Core i5-4670K running at 3.7 GHz during the experiment (CPU governor was set to performance). The benchmark is single-threaded. It runs a warm-up phase, then executes about 100,000 timed iterations and keeps the median. Timings of individual runs are computed from the TSC. ↩︎