Here is the short version: to get better performance on your SSL terminator, use stud on 64bit system with patch from Émeric Brun for SSL session reuse with some AES cipher suite (128 or 256, does not really matter), without DHE, on as many cores as needed, a key size of 1024 bits unless more is needed.

UPDATED: I received some constructive comments about those tests. After reading this article, I invite you to look at additional benchmarks in the second round.

Introduction

A quick note: when I say SSL, you should read TLS v1 instead. Look at the article for TLS on Wikipedia for some background on this.

One year ago, Adam Langley, from Google, stated SSL was not computationally expensive any more:

In January this year (2010), Gmail switched to using HTTPS for everything by default. Previously it had been introduced as an option, but now all of our users use HTTPS to secure their email between their browsers and Google, all the time. In order to do this we had to deploy no additional machines and no special hardware. On our production frontend machines, SSL/TLS accounts for less than 1% of the CPU load, less than 10KB of memory per connection and less than 2% of network overhead. Many people believe that SSL takes a lot of CPU time and we hope the above numbers (public for the first time) will help to dispel that.

If you stop reading now you only need to remember one thing: SSL/TLS is not computationally expensive any more.

This is a very informative post containing tips on how to enhance SSL performance by reducing latency. However, unlike Gmail, you may still be worried by SSL raw performances. Maybe each of your frontends is able to serve 2000 requests per second and CPU overhead for SSL is therefore significative. Maybe you want to terminate SSL in your load balancer (even if you know this is not the best way to scale).

Tuning SSL

There are a lot of knobs you can use to get better performances from SSL: choosing the best implementation, using more CPU cores, switching to 64bit system, choosing the right cipher suite and the appropriate key size and enabling a session cache.

We will consider three SSL terminators. They all use OpenSSL behind the hood. stunnel is the oldest one and uses a threaded model. stud is a recent attempt to write a simple SSL terminator which is efficient and scalable. It uses the one-process-per-core model. nginx is a web server and it can be used as reverse proxy and therefore act as SSL terminator. It is known to be one of the most efficient web server, hence the choice here. It also features built-in basic load balancing. Since stud and stunnel1 do not have this feature, we use them with HAProxy, an high performance load-balancer that usually defers the SSL part to stunnel (but stud can act as a drop-in replacement here).

Those implementations were already tested by Matt Stancliff. He first concluded nginx sucked at SSL then that nginx did not suck at SSL (in the first case, nginx was the only one to select a DHE cipher suite). The details were very scarce. I hope to bring here more details.

This brings the importance of the selected cipher suite. The client and the server must agree on the cipher suite to use for symmetric encryption. The server will select the strongest one it supports from the list proposed by the client. If you enable some expensive cipher suite on your server, it is likely to be selected.

The last important knob is SSL session reuse. It matters for both performance and latency. During the first handshake, the server will send a session ID. The client can use this session ID to request an abbreviated handshake. This handshake is shorter (latency improvement by one round-trip) and allows us to reuse the previously negociated master secret (performance improvement by skipping the costliest part of the handshake). See the article from Adam Langley for additional information (with helpful drawings).

The following table shows what cipher suite is selected by some major web sites. I have also indicated wheter they support session reuse. See this post from Matt Palmer to know how to check this.

Site Cipher Session reuse?
www.google.com RC4-SHA
www.facebook.com RC4-MD5
twitter.com AES256-SHA
windowsupdate.microsoft.com RC4-MD5
www.paypal.com AES256-SHA
www.cmcicpaiement.fr DHE-RSA-AES256-SHA

RFC 5077 defines another mechanism for session resumption that does not require any server-side state. You can find additional information about this optional mechanism in my article about enabling session reuse.

Benchmarks

All those benchmarks have been run on a HP DL 380 G7, with two Xeon L5630 (running at 2.13GHz for a total of 8 cores), without hyperthreading, using a 2.6.39 kernel (HZ is set to 250) and two Intel 82576 NIC. Linux conntrack subsystem has been disabled and file limits have been raised over 100,000.

A Spirent Avalanche 2900 appliance is used to run the benchmarks.

We are interested in raw SSL performances in term of handshakes per second. Because we want to observe the effects of session reuse, the scenario runs by most tests lets clients doing four successive requests and reusing the SSL session for three of them. There is no HTTP keepalive. There is no compression. The size of the HTTP answer from the server is 1024 bytes. Those choices are done because our primary target is the number of SSL handshakes per second.

It should also be noted that HAProxy alone is able to handle 22,000 TPS on one core. During the tests, it was never the bottleneck.

Implementation

We run a first bench to compare nginx, stud and stunnel on one core. This bench runs on a 32bit system, with AES128-SHA1 cipher suite and a 1024bit key size. Here is the result for stud (version 0.1, with Émeric Brun’s patch for SSL session reuse):

stud, 1 CPU

The most important plot is the top one. The blue line is the attempted number of transactions per second (TPS) while the green one is the successful number of TPS. When the two lines start to diverge, this means we reach some kind of maximum: the number of unsuccessful TPS also starts to raise (the red line).

There are several noticeable points: the maximum TPS (788 TPS), the maximum TPS with an average response time less than 100 ms (766 TPS), less than 500 ms (776 TPS) and the maximum TPS with less than 0.01% of packet loss (783 TPS).

Let’s have a look at nginx (version 1.0.5):

nginx, 1 CPU

It is able to get the same performance as stud (763 TPS). However, over 512 TPS, you get more than 100 ms of response time. Over 556 TPS, you even get more than 500 ms! I get this behaviour in every bench with nginx (with or without proxy_buffering, with or without tcp_nopush, with or without tcp_nodelay) and I am unable to explain it. Maybe there is something wrong in the configuration. After hitting this limit, nginx starts to process connections by bursts. Therefore, the number of successful TPS is plotted using a dotted line while the moving average over 20 seconds is plotted with a solid line.

On the next plot, you can see the compared performance of the three implementations. On this kind of plot, the number of TPS that we keep is the maximum number of TPS where loss is less than 0.1% and average response time is less than 100 ms. stud achieves a performance of 766 TPS while nginx and stunnel are just above 500 TPS.

stunnel vs stud vs nginx, 1 CPU

Number of cores

With multiple cores, we can affect several of them to do SSL hard work. To get better performance, we pin each processes to a dedicated CPU. Here is the repartition used during the tests:

1 core 2 cores 4 cores 6 cores
CPU 1, Core 1 network network network network + haproxy
CPU 1, Core 2 haproxy haproxy haproxy SSL
CPU 1, Core 3 SSL SSL SSL SSL
CPU 1, Core 4 - SSL SSL SSL
CPU 2, Core 5 - - network SSL
CPU 2, Core 6 - - SSL SSL
CPU 2, Core 7 - - SSL SSL
CPU 2, Core 8 system system system system + haproxy

Remember, we have two CPU, 4 cores on each CPU. Cores on the same CPU share the same L2 cache (on this model). Therefore, the arrangement inside a CPU is not really important. When possible, we try to keep things together on the same CPU. The SSL processes always get exclusive use of a core since they will always be the most busy. The repartition is done with cpuset(7) for userland processes and by setting smp_affinity for network card interrupts.

We keep one core for the system. It is quite important to be able to connect and monitor the system correctly even when it is loaded. With so many cores, we can afford to reserve one core for this usage.

Beware! There is a trap when pining processes or IRQ to a core. You need to check /proc/cpuinfo to discover the mapping between kernel processors and physical cores. Sometimes, second kernel processor can be the first core of the second physical processor (instead of the second core of the first physical processor).

stunnel vs stud vs nginx, 6 CPU

As you can see in the plot above, only stud is able to scale properly. stunnel is not able to take advantage of the cores and its performances are worse than with a single core. I think this may be due to its threaded model and the fact that the userland for this 32bit system is a bit old. nginx is able to achieve the same TPS than stud but is disqualified because of the increased latency. We won’t use stunnel in the remaining tests.

32bit vs 64bit

The easiest way to achieve a performance boost is to switch to a 64bit system. TPS are doubled. On our 64bit system, thanks to the use of a version of OpenSSL with the appropriate support, AES-NI, a CPU extension to improve the speed of AES encryption/decryption, is enabled. Remaining tests are done with a 64bit system.

stud vs nginx, 64bit

Ciphers and key sizes

In our tests, the influence of ciphers is minimal. There is almost no difference between AES256 and AES128. Using RC4 adds some latency. The use of AES-NI may have helped to avoid this latency for AES.

On the other hand, using a 2048bit key size has a huge performance hit. TPS are divided by 5.

stud: cipher suites and key sizes

Session cache

The last thing to check is the influence of SSL session reuse. Since we are on a local network, we see only one of the positive effects of session reuse: better TPS because of reduced CPU usages. If there was a latency penalty, we should also have seen better TPS thanks to the removed round-trip.

stud: session reuse

This plot also explains why stud performances fall after the maximum: because of the failed transactions, the session cache is not as efficient. This phenomenon does not exist when session reuse is not enabled.

Conclusion

Here is a summary of the maximum TPS reached during the benchmark (with an average response time below 100ms). Our control use case is the following: 64bit, 6 cores, AES128, SHA1, 1024bit key, 4 requests into the same SSL session.

Context nginx 1.0.5 stunnel 4.41 stud 0.1 (patched)
1 core, 32bit 512 TPS 503 TPS 766 TPS
2 cores, 32bit 599 TPS - -
32bit 804 TPS 501 TPS 4251 TPS
- 1799 TPS - 9000 TPS
AES256 - - 8880 TPS
RC4-MD5 - - 7370 TPS
2048bit - - 1643 TPS
no session reuse - - 5844 TPS
80 requests per SSL session - - 10797 TPS

Therefore, stud in our control scenario is able to sustain 1500 TPS per core. It seems to be the best current option for a SSL termination. It is not available in Debian yet, but I intend to package it.

Here is how stud is invoked:

# ulimit -n 100000
# stud  -n 2 -f 172.31.200.15,443 -b 127.0.0.1,80 -c ALL \
>   -B 1000 -C 20000 --write-proxy \
>   =(cat server1024.crt server1024.key dhe1024)

You may also want to look at HAProxy configuration, nginx configuration and stunnel configuration.

UPDATED: More benchmarks are available in the second part. Keep reading!


  1. In fact, stunnel also has simple round-robin load balancing by specifying connect multiple times in the configuration.