Fast Huffman Encoder
LiteSpeed’s HPACK Huffman Encoder has been optimized: it now runs twice as fast as before. In this article, we present the Fast Huffman Encoder, share a few optimization tricks and pitfalls, and benchmark several Huffman encoder implementations.
Introduction
Huffman encoder is used both in HTTP/2 and HTTP/3 stacks (HPACK and QPACK compression, respectively). Speeding up this component thus leads to a double win.
In our previous blog post, we discussed how we sped up our Huffman decoder. The main idea was that we can process the input two, rather than one, bytes at a time. The same approach can be used to speed up the Huffman encoder.
Unlike the decoder, the encoder implementations by LiteSpeed, nginx, h2o, and nghttp2 do not share a common ancestor. For that reason, we will benchmark all of them.
Starting Line
Before optimization, the main encoder loop in our implementation looks like this:
while (src != src_end) { cur_enc_code = encode_table[*src++]; if (bits_used + cur_enc_code.bits < sizeof(bits) * 8) { bits <<= cur_enc_code.bits; bits |= cur_enc_code.code; bits_used += cur_enc_code.bits; continue; } else if (p_dst + sizeof(bits) <= dst_end) { bits <<= sizeof(bits) * 8 - bits_used; bits_used = cur_enc_code.bits - (sizeof(bits) * 8 - bits_used); bits |= cur_enc_code.code >> bits_used; #if UINTPTR_MAX == 18446744073709551615ull *p_dst++ = bits >> 56; *p_dst++ = bits >> 48; *p_dst++ = bits >> 40; *p_dst++ = bits >> 32; #endif *p_dst++ = bits >> 24; *p_dst++ = bits >> 16; *p_dst++ = bits >> 8; *p_dst++ = bits; bits = cur_enc_code.code; /* OK not to clear high bits */ } else return -1; }
Where an element of encode_table
is as follows:
struct encode_el { uint32_t code; int bits; };
Each byte of input is used as an index into encode_table
. The associated encoded bit string is added to the accumulator buffer bits
, which is an 8-byte or 4-byte integer, depending on the platform. When bits
is full, its bytes are written out and the accumulator is reset.
We want to reduce the number of branches. A useful metric is “number of conditionals per byte of input.” In this loop, we check whether
- we’ve run out of input (for every input byte);
- the accumulator buffer is full (for every input byte); and
- we’ve run out of output (every 8 output bytes).
Estimating average compression ratio of 0.75, this translates to 2.1 branches per byte of input. Can we do better?
Big Table
Since there is no obvious way to reduce the number of conditionals in the processing loop, the next thing to do is to reduce the number of input units: that is, process two bytes at a time instead of one.
struct henc { unsigned lens; uint32_t code; } hencs[] = { [0x0000] = { 26, 0x3FF1FF8 }, [0x0001] = { 64, 0 }, /* --- 8< -- snip --- 8< --- */ [0x6161] = { 10, 0x63 }, /* --- 8< -- snip --- 8< --- */ [0xFFFF] = { 64, 0 }, };
This table has 64K entries. Each element contains the bit string code
that represents the Huffman encoding of the two input bytes and its length, lens
. Because code
is 32 bits wide, some two-byte encodings cannot be represented. They are marked by lens
having value 64. In this case, the encoder has to fall back to processing input byte by byte.
New Loop
The new loop is broken up into three parts below to place explanation next to the code. You can view the loop in its entirety on GitHub.
Read
while (src + sizeof(bits) * 8 / 5 + sizeof(idx) < src_end && p_dst + sizeof(bits) <= dst_end) { memcpy(&idx, src, 2); henc = &hencs[idx]; src += 2; while (bits_used + henc->lens < sizeof(bits) * 8) { bits <<= henc->lens; bits |= henc->code; bits_used += henc->lens; memcpy(&idx, src, 2); henc = &hencs[idx]; src += 2; }
To avoid source and destination checks inside the loop, they are performed once. The body of the loop is executed if there are enough source bytes to fill the accumulator with shortest Huffman codes and read two more bytes (that translates to 64 / 5 + 2 = 14 bytes on 64-bit platforms) and enough space in the destination buffer to fit all bytes in the accumulator.
With that out of the way two bytes are read into idx
and this value is used to point to the corresponding entry in the hencs
table. memcpy
is used to guarantee aligned access.
The loop condition of the inner while
loop checks two things:
- whether the accumulator will be filled by the current output bit string; and
- output bit string validity.
The reason for making 64
the invalid value in the hencs
table is to be able to combine the two checks above into a single expression. If 0
were the invalid value indicator, the condition would have to be more complicated:
henc->lens && bits_used + henc->lens < sizeof(bits) * 8
This optimization makes a noticeable difference by reducing the number of instructions in the inner loop.
Write
if (henc->lens < 64) { bits <<= sizeof(bits) * 8 - bits_used; bits_used = henc->lens - (sizeof(bits) * 8 - bits_used); bits |= henc->code >> bits_used; #if UINTPTR_MAX == 18446744073709551615ull *p_dst++ = bits >> 56; *p_dst++ = bits >> 48; *p_dst++ = bits >> 40; *p_dst++ = bits >> 32; #endif *p_dst++ = bits >> 24; *p_dst++ = bits >> 16; *p_dst++ = bits >> 8; *p_dst++ = bits; bits = henc->code; /* OK not to clear high bits */ }
If the current henc
entry is valid, the accumulated bytes are written to destination buffer. First, the accumulator bits
is filled to the max by adding first part of the current output bit string henc->code
.
Because the destination check has already been performed and we know the number of bytes to write, it is done in a series of shift, assignment, and increment statements.
henc->code
is simply copied to the bits
accumulator as is not necessary to clear any bits. The unused bits from henc->code
will be processed based on the value of bits_used
.
Note that it costs only one conditional to output a bits
-ful of bytes.
Fallback
else { src -= 2; break; } }
The fallback occurs when the current two-byte input sequence translates to an output bit string that is too long. The input pointer is backed up two bytes and the new loop is exited, with control proceeding naturally to the original byte-by-byte loop.
Count Branches
The loop checks:
- source and destination bounds once per every 8 output bytes;
- accumulator fill every two input bytes; and
- code validity every 8 output bytes.
Again, assuming compression ratio to be 0.75, this translates to 1 / (8 / 0.75) + 1 / 2 + 1 / (8 / 0.75) = 0.69 branches per input byte. This is three times fewer branches than in the original loop.
Benchmarks
We compare the optimized Huffman encoder (litespeed) to the original (litespeed-orig), as well as to Huffman decoders from nginx, h2o, and nghttp2 projects. I extracted the relevant pieces of code from each project and placed them into separate C files in our Huffman encoder benchmark GitHub repo.
We will test using four different inputs:
- litespeed.txt: This file contains the string “LiteSpeed”. It is used to test small inputs.
- x-fb-debug.txt: This is a medium-size string taken from a real x-fb-debug header from the Facebook response QIF file. This is used to test medium-size input.
- idle.huff: This file contains the first few paragraphs from The Idle Thoughts of an Idle Fellow by Jerome K. Jerome. It is used to test large input.
- x-fb-backslash.huff: This is the same as (2), except with three backslashes inserted in the middle of the string. The backslash is encoded using a 19-bit code, and so the Fast Huffman Encoder will have to fall back to using the original byte-by-bytes loop.
The table below was generated by the helper program bench.pl. In addition to running the driver and calculating median values, it first calibrates each input file so that the fastest of the encoders runs for at least five seconds. Each encoder runs an input file five times and the median run time is taken. These are the numbers for each encoder in the table below. They are in seconds.
Results
Input | Number of iterations | litespeed | litespeed-orig | nginx | h2o | nghttp2 |
litespeed.txt | 327680000 | 8.67 | 8.80 | 9.61 | 10.19 | 20.59 |
x-fb-debug.txt | 40960000 | 5.25 | 11.23 | 11.13 | 18.83 | 30.31 |
x-fb-backslash.txt | 40960000 | 7.22 | 12.65 | 11.20 | 19.80 | 32.21 |
idle.txt | 2560000 | 9.32 | 15.89 | 17.60 | 29.50 | 52.25 |
As expected, there is no speedup for litespeed.txt, as that string is shorter than 14 bytes. For x-fb-debug.txt, x-fb-backslash.txt, and idle.txt, litespeed is 2.14, 1.75, and 1.70 times, respectively, faster than litespeed-orig. litespeed Huffman encoder beats nginx, h2o, and nghttp2 in all tests. This is how much faster:
input | litespeed | nginx | h2o | nghttp2 |
litespeed.txt | 1.00 | 1.11 | 1.18 | 2.37 |
x-fb-debug.txt | 1.00 | 2.12 | 3.59 | 5.77 |
x-fb-backslash.txt | 1.00 | 1.55 | 2.74 | 4.46 |
idle.txt | 1.00 | 1.89 | 3.17 | 5.61 |
litespeed is about twice as fast (2.12 and 1.89) as nginx on normal input when the string is not tiny (and when it is tiny, its still faster).
h2o is an also-ran, while the nghttp2 encoder is the slowest of the bunch.
Conclusion
The Fast Huffman Encoder uses the same insight as the Fast Huffman Decoder: the subset of commonly used characters in HTTP headers is much smaller than the set of possible characters. For inputs of 12 characters or longer, the encoder virtually always uses the fast path, and the fallback is cheap. At the cost of 512KB of a read-only lookup table, we've doubled the Huffman encoding speed.
Portability
This is a bonus section for the dedicated. Welcome!
Table Compilation: Endianness
The initial description of the big table was a bit of a white lie. The table actually looks like this:
#if __BYTE_ORDER == __LITTLE_ENDIAN #define I(i,j) ((j<<8)|i) #else #define I(i,j) ((i<<8)|j) #endif #if UINTPTR_MAX == 18446744073709551615ull #define X32 32 #else #define X32 64 #endif struct henc { unsigned lens; uint32_t code; } hencs[] = { [I(0,0)] = {26,0x3FF1FF8}, [I(1,0)] = {64,0}, /* --- 8< --- snip --- 8< --- */ [I(97,97)] = {10,0x63}, /* --- 8< --- snip --- 8< --- */ [I(203,97)] = {X32,0xFFFFFBC3}, /* --- 8< --- snip --- 8< --- */ [I(255,255)] = {64,0}, }; #undef X32 #undef I
The I()
macro is used to generate different indexes on little-endian and big-endian platforms.
Avoid Undefined Behavior
The X32
macro effectively disables 32-bit output sequences on 32-bit platforms. This is because of the left shift used when outputting encoded buffer in the big loop. The C standard says the following about bitwise shift operators:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
This means that the result of the following statement may be undefined if we shift by 32:
bits <<= sizeof(bits) * 8 - bits_used;
To avoid bits_used
from being 0 here, the longest code must be smaller than 32 on a 32-bit platform. We guarantee this with the inner while
loop.
Aligned Access
We read encoder input as a sequence of two-byte integers by copying memory to a local variable:
memcpy(&idx, src, 2); henc = &hencs[idx];
On Intel x86 and x64 architecture, data does not have to be aligned, so one could (or could one?) get away with writing code like this:
henc = &hencs[ * (uint16_t *) src];
But then this would not work on platforms with stricter alignment requirements, such as SPARC.
Fortunately, the compiler is smart enough to compile both of these code snippets to exactly the same -- fast -- code.
Comments