Fast Huffman Decoder
Summary
The LiteSpeed HPACK Huffman decoder is twice as fast as the one used by nginx. This is one of the reasons LiteSpeed Web Server outperforms nginx in a variety of HTTP/2 benchmarks. This article describes how our Huffman decoder works and then tests its performance against that of the nginx Huffman decoder. The benchmarked code is available in a public GitHub repository, making the results easy to verify.
Introduction
The latest major release of LiteSpeed Web Server includes several new features and optimizations. Our web server outperforms nginx by a factor of 2 to 12 in several readily reproducible benchmarks.
How do we do it? The truth is, when we are not adding new features, we are optimizing existing code. We 1) examine each module and submodule to find one we can speed up; 2) speed it up; and 3) go back to step 1. Sometimes we wake up because a way to elide a branch in the middle of a tight loop has appeared to us in a dream. Then we turn on the terminal and put down the nocturnal epiphany in code.
Why do we do it? HTTP/2 is on its way to becoming the dominant web protocol. HTTP/2 performance is crucial for any modern web server. And, of course, we want to be faster than everyone else.
This article describes one of the optimizations that made ours the fastest HTTP/2 implementation in the world.
History
Well-led software projects are not afraid to borrow good ideas, for the hubris of the not-invented-here syndrome often leads to suboptimal implementations. Such is the story of the HPACK Huffman decoder in today’s major web servers. Apache, LiteSpeed, and nginx have all, until recently, used the same Huffman decoder implementation.
HPACK is a specialized compression mechanism for HTTP headers, designed specifically for HTTP/2. Header compression is one of HTTP/2’s “killer features:” it reduces the bandwidth necessary to transfer headers and thus improves throughput. Huffman coding is used to compress headers that are not in one of the HPACK tables.
In January of 2014, over a year before the HPACK RFC was published, Tatsuhiro Tsujikawa rewrote the Huffman decoder in nghttp2’s implementation of HPACK. (nghttp2 is used by Apache to provide HTTP/2 functionality.) The new code used the method described eleven years earlier by Renato Pajarola in his paper Fast Prefix Code Processing, which is significantly faster than prior methods. The idea is to process compressed data several bits at a time instead of one bit at a time. Tatsuhiro chose 4 bits as the input length. 4 is a good number: conveniently, it is exactly half the length of a byte. A shorter sequence would not have sped up the code as much, while a longer sequence would have grown the transition table.
Because HPACK Huffman codes do not change, the transition table is generated once — at compile time. The static table is then used by a simple, robust, and fast decoder function. This was a good idea and a good piece of code, so LiteSpeed, nginx, and others copied (with slight alterations) the code, and moved on.
Fast or Slow?
Our own version of the Huffman decoder looks like this (see litespeed.c):
```c
int
lshpack_dec_huff_decode_full (const unsigned char *src, int src_len,
                              unsigned char *dst, int dst_len)
{
    const unsigned char *p_src = src;
    const unsigned char *const src_end = src + src_len;
    unsigned char *p_dst = dst;
    unsigned char *dst_end = dst + dst_len;
    struct decode_status status = { 0, 1 };

    while (p_src != src_end)
    {
        if (p_dst == dst_end)
            return -2;
        if ((p_dst = hdec_huff_dec4bits(*p_src >> 4, p_dst, &status))
                                                                == NULL)
            return -1;
        if (p_dst == dst_end)
            return -2;
        if ((p_dst = hdec_huff_dec4bits(*p_src & 0xf, p_dst, &status))
                                                                == NULL)
            return -1;
        ++p_src;
    }

    if (!status.eos)
        return -1;

    return p_dst - dst;
}
```
Even though 4 bits at a time is faster than one bit at a time, the code is still pretty expensive. One way to look at this is to count the number of branches for each byte of input. Inside the `while` loop above, there are four `if` statements. This is in addition to the two conditionals in the `hdec_huff_dec4bits` function and the `while` loop conditional that checks for end of input. That is seven branches for each byte of input. Can we do better?
Common Case
HPACK defines 257 Huffman codes: one for each of the 256 possible bytes and one for the special end-of-stream (EOS) symbol. The codes range from 5 to 30 bits in length: the more likely the value, the shorter its code. What is important to realize is that two thirds of the Huffman codes are virtually never used in HTTP headers. Most of the time, only shorter codes are used. For example, no codes longer than 13 bits are used in the QIF corpus. In this restricted domain, there are more possibilities for optimization.
Big Table!
The idea is to generate a large table that can be indexed by the longest of the common Huffman codes. For example, a table with 64K entries can be indexed by a 16-bit value, representing a sequence of 16 input bits. In the common case, a 16-bit sequence is wide enough to contain Huffman codes for one or more output symbols (the average number is 2.14 output symbols per 16 bits).
Encountering a code longer than 16 bits would require falling back to the original, full Huffman decoder that processes the input 4 bits at a time.
Fast Huffman Decoder Details
(To keep the blog post manageable, only the most important snippets of code are included here. For your convenience, the table and the Fast Huffman Decoder function are available in their entirety here: litespeed.c, litespeed-table.h.)
The Table
As mentioned above, the table contains 64K entries, for the full range of values that fits into a 16-bit integer, from 0 to 65,535. Each entry contains three things:
- Sum of lengths of Huffman codes consumed from the input stream, in bits. This value has a valid range from 5, which is the length of the shortest HPACK Huffman code, to 16, which is the number of bits available for input. Remember, the input bits here are in the index value.
- Number of output bytes. The smallest value is 1, while the largest is 3. 3 is how many 5-bit codes can fit into 16 bits.
- Output bytes themselves.
```c
static const struct hdec { uint8_t lens; uint8_t out[3]; } hdecs[] =
{
/*     0 */ {(15<<2)|3,{48,48,48}},
/*     1 */ {(15<<2)|3,{48,48,48}},
/*     2 */ {(15<<2)|3,{48,48,49}},
/*     3 */ {(15<<2)|3,{48,48,49}},
/* --- 8< --- many rows skipped --- 8< --- */
/* 65532 */ {(15<<2)|1,{123,0,0}},
/* 65533 */ {(15<<2)|1,{123,0,0}},
/* 65534 */ {0,{0,0,0,}},
/* 65535 */ {0,{0,0,0,}},
};
```
To keep each element to 4 bytes, the Huffman code lengths and the number of output bytes are combined into a single struct member, `lens`. The 3-byte array `out` holds the output bytes themselves. For example, element 2 above consumes 15 bits of input and outputs 3 bytes: `0`, `0`, `1` (ASCII 48, 48, 49). Element 65533 consumes 15 bits of input and outputs 1 byte: `{` (ASCII 123).

The last two elements of the table have `lens` equal to zero. This indicates that there is no matching Huffman code that fits into 16 bits. The HPACK Huffman code table shows that, indeed, all codes longer than 16 bits begin with either the `11111111 11111110` (65534) or `11111111 11111111` (65535) bit sequence. When such a code is encountered, the Fast Huffman Decoder has to use the fallback mechanism.
Main Loop
The main loop can be described in this high-level pseudocode:
```
while (have input)
{
    fill buf;
    drain buf;
}
```
Several input bytes are copied into an intermediate representation. This batch of input bits is then processed: an element in the `hdecs` table is looked up and output is written, until the number of available bits (`avail_bits`) falls under 16. At this point, `buf` is refilled.
Input Batching
The input bytes are copied into an integer buffer, `buf`, which is the size of a pointer (8 bytes on 64-bit platforms). `avail_bits` holds the number of input bits available in `buf`. `avail_bits` must be at least 16 to be able to use the large table.
```c
if (src + sizeof(buf) <= src_end)
    /* --- 8< --- omitted for brevity, refer to litespeed.c --- 8< --- */ ;
else if (src < src_end)
    do
    {
        buf <<= 8;
        buf |= (uintptr_t) *src++;
        avail_bits += 8;
    }
    while (src < src_end && avail_bits <= sizeof(buf) * 8 - 8);
else
    break;  /* Normal case terminating condition: out of input */
```
Filling `buf` is optimized. If the number of remaining input bytes is at least the size of `buf`, we can use a `switch` statement. Otherwise, bytes are added to `buf` using a `while` loop.
Batch Processing
The batch of input bits in `buf` is used to look up elements in `hdecs`. If the output buffer has enough room to fit the maximum number of bytes that could possibly be encoded in `buf` (64 / 5 = 12 bytes on a 64-bit platform), the processing loop can skip the destination check.
```c
if (dst_end - dst >= (ptrdiff_t) (8 * sizeof(buf) / SHORTEST_CODE)
                                                && avail_bits >= 16)
{
    /* Fast path: don't check destination bounds */
    do
    {
        idx = buf >> (avail_bits - 16);
        hdec = hdecs[idx];
        dst[0] = hdec.out[0];
        dst[1] = hdec.out[1];
        dst[2] = hdec.out[2];
        dst += hdec.lens & 3;
        avail_bits -= hdec.lens >> 2;
    }
    while (avail_bits >= 16 && hdec.lens);
    if (avail_bits < 16)
        continue;
    goto slow_path;
}
```
The `do`/`while` loop above is the main reason the Fast Huffman Decoder is fast. The only conditional is the terminating condition, which checks both the number of available bits and `hdec` validity at the same time. The average number of consumed bits per entry in the `hdecs` table is 12.76. This translates to 0.63 branches per input byte. A far cry from seven! Of course, there are branch penalties elsewhere (filling the batch, error checking), but this loop's speed makes up for them in spades.
The next 16-bit input sequence to use is in the most significant bits of `buf`. Since `idx` is a `uint16_t`, there is no need to mask out the low 16 bits of `buf`: the assignment to `uint16_t` takes care of that.
All three bytes of the `hdec.out` array are written to the destination buffer. This saves a conditional at the cost of an occasional unnecessary write. 15% of `hdecs` entries output 3 bytes, while 84% output 2 bytes. On average, then, 5 times out of 6 an extra byte is written to `dst`. This is cheaper than a branch.
As described above, invalid entries have `lens` equal to zero. In this case, the values of `dst` and `avail_bits` are not changed inside the loop. The `if` statement after the loop is used to determine whether we ran out of bits (the common case) or found an invalid code. The latter forces us to fall back to the full Huffman decoder.
A slower `while` loop, which does check destination bounds, is used when there is not much room left in the destination buffer.
Fallback
When a long (or invalid) Huffman code is encountered, we have to use the full (original) Huffman decoder to process the rest of the string.
```c
/* Find previous byte boundary and finish decoding thence. */
while ((avail_bits & 7) && dst > orig_dst)
    avail_bits += encode_table[ *--dst ].bits;

src -= avail_bits >> 3;
r = lshpack_dec_huff_decode_full(src, src_end - src, dst, dst_end - dst);
if (r >= 0)
    return dst - orig_dst + r;
else
    return r;
```
We cannot start processing from the current input position if the Huffman code is not on a byte boundary. We need to rewind the input pointer toward the beginning, looking for the byte boundary. To do that, we use the bytes we have already output and the Huffman encoder table. We back up until `avail_bits` is a multiple of 8. In the worst case, we will stop at the beginning of the input, but that usually does not happen for an input of any appreciable length.
Benchmarks
We will compare the Fast Huffman Decoder with the Huffman decoder from nginx. To do that, I extracted the Huffman decoder functions from ls-hpack and nginx and wrote a small driver program that runs the decoder. See our Huffman decoder GitHub repo. The specific versions of the decoder are here (LiteSpeed) and here (nginx). The latter has been adapted from nginx 1.16.1. You should easily be able to clone this repository and replicate our results.
The driver is a simple program that reads a Huffman-encoded string from a file and decodes it a number of times. The less time a decoder takes to decode the input, the faster the decoder.
For example:
```
sh$ time -p ./comp-dec idle.huff 1000000 litespeed
real 8.35
user 8.33
sys 0.00
sh$ time -p ./comp-dec idle.huff 1000000 nginx
real 16.20
user 16.16
sys 0.00
```
It took the Fast Huffman Decoder 8.35 seconds to decode the Huffman-encoded string in `idle.huff` one million times, while the nginx Huffman decoder took 16.20 seconds.
We will test using four different inputs:
- `litespeed.huff`. This is the Huffman-encoded string “LiteSpeed”. It is used to test small inputs.
- `x-fb-debug.huff`. This is a medium-size string taken from a real x-fb-debug header from the Facebook response QIF file. It is used to test medium-size input.
- `idle.huff`. This file contains the first few paragraphs from The Idle Thoughts of an Idle Fellow by Jerome K. Jerome. It is used to test large input.
- `x-fb-backslash.huff`. This is the same as (2), except with a backslash inserted in the middle of the string. The backslash is encoded using a 19-bit code, and so the Fast Huffman Decoder will have to backtrack and fall back to the full decoder. This is used to test how well (or poorly) the Fast Huffman Decoder performs in that case.
We’ll run each input file five times for each decoder, LiteSpeed and nginx. Then we will compare median run times. (See the complete numbers at the bottom of this article.)
Results
| Input | nginx | LiteSpeed | Speed-up |
|---|---|---|---|
| litespeed.huff | 5.05s | 2.57s | 1.96x |
| x-fb-debug.huff | 5.33s | 2.75s | 1.94x |
| idle.huff | 16.09s | 8.22s | 1.96x |
| x-fb-backslash.huff | 4.86s | 4.21s | 1.15x |
Presented with normal input, the LiteSpeed Fast Huffman Decoder almost doubles the speed of the nginx Huffman decoder. Even when the Fast Huffman Decoder has to backtrack due to uncommon input, it still performs significantly better than the alternative.
Conclusion
The LiteSpeed Fast HPACK Huffman Decoder leverages its knowledge of the expected input to optimize for the common case. The fast code path uses two to three times fewer conditionals than the original decoder, making it twice as fast as the original Huffman decoder for all but the most unusual inputs. This is one of the many optimizations we performed — and continue to perform — to make ours the fastest web server, period.
We also use the Fast Huffman Decoder in our QPACK library, speeding up our HTTP/3 stack. We intend to maintain performance leadership with the next-generation HTTP version as well.
Post Scriptum
While writing this article, I found two more optimizations that would make the Fast Huffman Decoder even faster. Stay tuned!
Raw Numbers
litespeed.huff
Needed to set the number of iterations to 100,000,000 to get run times into the seconds.
| Decoder | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| nginx | 5.05 | 4.94 | 4.97 | 5.07 | 5.07 |
| LiteSpeed | 2.93 | 2.74 | 2.78 | 2.74 | 2.75 |
x-fb-debug.huff
Number of iterations is set to 10,000,000
| Decoder | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| nginx | 5.27 | 5.37 | 5.31 | 5.34 | 5.33 |
| LiteSpeed | 2.75 | 2.72 | 2.77 | 2.74 | 2.85 |
idle.huff
Number of iterations set to 1,000,000
| Decoder | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| nginx | 16.09 | 16.06 | 16.13 | 16.09 | 16.05 |
| LiteSpeed | 8.17 | 8.25 | 8.21 | 8.22 | 8.29 |
x-fb-backslash.huff
Number of iterations set to 10,000,000
| Decoder | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| nginx | 4.82 | 4.86 | 4.80 | 4.89 | 4.86 |
| LiteSpeed | 4.21 | 4.23 | 4.21 | 4.22 | 4.20 |
It is curious that this input differs by only one character from `x-fb-debug.huff`, yet the nginx decoder processes it markedly faster. I haven't been able, in the limited time available, to figure out why.