A while back, for no real reason, I tried writing an implementation of a data structure tailored to the specific use case of the Named character reference state of HTML tokenization (here's the link to that experiment). Recently, I took that implementation, ported it to C++, and used it to make some efficiency gains and fix some spec compliance issues in the Ladybird browser.
Throughout this, I never actually looked at the implementations used in any of the major browser engines (no reason for this, just me being dumb). However, now that I have looked at Blink/WebKit/Gecko (Chrome/Safari/Firefox, respectively), I've realized that my implementation seems to be either on-par or better across the metrics that the browser engines care about:

- matching performance
- data size
- ease of use
So, I thought I'd take you through what I came up with and how it compares to the implementations in the major browser engines. Mostly, though, I just think the data structure I used is neat and want to tell you about it (fair warning: it's not novel).
A named character reference is an HTML entity specified using an ampersand (&) followed by an ASCII alphanumeric name. An ordained set of names will get transformed during HTML parsing into particular code point(s). For example, &xcirc; is a valid named character reference that gets transformed into the symbol ◯, while &amp; will get transformed into &.
Here's a few properties of named character references that are relevant for what we'll ultimately be aiming to implement:

- They always start with an ampersand (&)
- Not all of them end with a semicolon (;); for example, &not (with no trailing semicolon) is a valid named character reference
- Each name maps to either one or two code points (e.g. &fjlig; maps to U+0066 U+006A, which are just the ASCII letters fj)

Most crucially, though, the mappings of named character references are fixed. The HTML standard contains this note about named character references:
Note: This list is static and will not be expanded or changed in the future.
This means that it's now safe to represent the data in the minimum amount of bits possible without any fear of needing to accommodate more named character reference mappings in the future.
I'm specifically going to be talking about the Named character reference state of HTML tokenization. You can read the spec for it if you want (it's fairly short) but there's really only one sentence we're interested in:
Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table.
This sentence is really the crux of the implementation, but the spec is quite hand-wavy here and conceals a lot of complexity. The example given a bit later in the spec hints at why the above sentence is not so straightforward (edited to exclude irrelevant details; for context, &not is a valid named character reference, as is &notin;):
Example: If the markup contains the string I'm &notit; I tell you, the character reference is parsed as "not", as in, I'm ¬it; I tell you. But if the markup was I'm &notin; I tell you, the character reference would be parsed as "notin;", resulting in I'm ∉ I tell you.
That is, with the string &notit;, the characters up to and including &noti can still lead to a valid named character reference (&notin;, among others), so we only know that &notit; is invalid once we've reached &notit, which can no longer lead to a valid named character reference. What this means is that there needs to be some amount of backtracking involved, as the goal is to consume only the characters that are part of the longest valid named character reference.
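To make the backtracking concrete, here's a minimal sketch of the matching loop the spec implies. The Matcher here is a hypothetical stand-in for whatever data structure performs the per-character test (we'll build a real one below), and tryConsume/matched are made-up names:

// Returns the length of the longest valid named character reference
// at the start of `input` (0 if there is none).
fn matchLongest(matcher: anytype, input: []const u8) usize {
    var consumed: usize = 0;
    var last_match_len: usize = 0;
    for (input) |c| {
        // Stop as soon as no longer match is possible.
        if (!matcher.tryConsume(c)) break;
        consumed += 1;
        // Remember the end of the most recent complete match.
        if (matcher.matched()) last_match_len = consumed;
    }
    // Everything past `last_match_len` was overconsumed and must be
    // backtracked (the `it` of `&notit;`, for example).
    return last_match_len;
}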
That's not the only complication, though...
The spectre of document.write

Due to <script> tags, the input to HTML tokenization is mutable while it is being tokenized, meaning looking ahead is not always reliable/possible since we might not yet know what comes next. Consider this example:
<script>
document.write("&not");
</script>in;
The expected result after parsing is ∉, meaning the resolved character reference is &notin;. Let's go through why that is the case.

After the closing script tag is tokenized, the parser adds an "insertion point" after it, and then the script within the tag itself is executed. So, right before the script is executed, the tokenizer input can be visualized as (where ^ marks the insertion point):

^in;
And then after the <script> is executed and the document.write call has run, it can be visualized as:

&not^in;
What happens from the tokenizer's perspective is that after the closing </script> tag, &not comes next in the input stream (which was inserted by document.write), and then the characters after &not are in;, so ultimately a tokenizer going character-by-character should see an unbroken &notin;, recognize that as a valid character reference, and translate it to ∉.
To further show why this can be tricky to handle, consider also the possibility of document.write writing one character at a time, like so:

<script>
for (let char of "&not") {
  document.write(char);
}
</script>in;
This, too, is expected to result in ∉ (i.e. &notin;). Keep in mind that the tokenizer will advance after each document.write call, so if the tokenizer tries to lookahead past the insertion point at any point before the full script is run, it will resolve the wrong string as a character reference (&in;, &nin;, or &noin;). Here's a visualization that shows the insertion point and the various states of the input stream after each document.write call:

&^in;
&n^in;
&no^in;
&not^in;
Therefore, while HTML tokenizers can theoretically look ahead, they can never look past the end of an insertion point.
All of this is to say that a "consume the longest valid named character reference" implementation probably needs to use one of two strategies:
1. Use lookahead to compare against the full list of names, while being careful to never look past an insertion point
2. Consume one character at a time, keeping track of the longest full match along the way, and backtrack once no longer match is possible

The second strategy seems like the better approach to me, so that's what my implementation will be focused on. We'll see both strategies later on, though.
So, we want an implementation that can iterate character-by-character and (at any point) efficiently determine if it's possible for the next character to lead to a longer valid named character reference.
Trie implementation

A data structure that seems pretty good for this sort of thing is a trie. A trie is a specialized tree where each node contains a character that can come after the character of its parent node in a set of words. Below is a representation of a trie containing this small subset of named character references:

&not
&notin;
&notinva;
&notinvb;
&notinvc;
&notni;
&notniva;
&notnivb;
&notnivc;

[trie diagram]
With such a trie, you search for the next character within the list of the current node's children (starting from the root). If the character is found within the children, you then set that child node as the current node and continue on for the character after that, etc.
For invalid words, this means that you naturally stop searching as soon as possible (after the first character that cannot lead to a longer match). For valid words, you trace a path through the trie and end up on an end-of-word node (and you may also pass end-of-word nodes on the way there). Here's what the path through the trie would look like for the named character reference &notinvc;:

[diagram: path through the trie for &notinvc;]
You'll notice that the mapped code point (⋶) is present on the diagram above as well. This is because it is trivial to use a trie as a map to look up an associated value, since each word in the set must end at a distinct node in the trie (i.e. no two words can share an end-of-word node). Conveniently, using the trie as a map is exactly what we want to be able to do for named character references, since ultimately we need to convert the longest matched named character reference into the relevant code point(s).
One way to represent a trie node is to use an array of optional pointers for its children (where each index into the array represents a child node with that byte value as its character), like so:
const Node = struct {
    // This example supports all `u8` byte values.
    children: [256]?*Node,
    end_of_word: bool,
};
Earlier, I said that trie visualizations typically put the letters on the connections between nodes rather than the nodes themselves, and, with this way of representing the trie, I think that makes a lot of sense, since the connections are the information being stored on each node.
So, this representation can be visualized like so:

[diagram of the 'connections' representation]
With this, checking if a character can come after the current node is a straightforward O(1) array access:
if (node.children[c] != null) {
    // found child
}
but it comes at the cost of a lot of potentially wasted space, since most nodes will have many null children.
One way to mitigate the wasted space would be to switch from an array of children to a linked list of children, where the parent stores an optional pointer to its first child, and each child stores an optional pointer to its next sibling:
const Node = struct {
    char: u8,
    first_child: ?*Node,
    next_sibling: ?*Node,
    end_of_word: bool,
};
Now that char is stored on each node directly, I (in turn) think it makes sense to visualize the trie with the characters shown on the nodes themselves, like so:

[diagram of the 'linked list' representation]
While this 'linked list' representation saves on memory, it transforms the search for a particular child into an O(n) linear scan across the children:
var node = starting_node.first_child orelse return null;
while (true) {
    if (node.char == c) {
        // found child
    }
    node = node.next_sibling orelse return null;
}
This linear search can be slow, especially if the nodes are individually heap-allocated and therefore could be very spread out in memory, leading to a lot of random memory accesses and cache misses. Additionally, pointers themselves take up quite a bit of space (8 bytes on 64-bit machines). If we ultimately want to decrease the size of the node, getting rid of the pointer fields would be helpful as well.
We can solve multiple of these problems at once by:

- storing all of the nodes in one big contiguous array
- using indexes into that array instead of pointers
- storing each node's children consecutively, so that only a first_child_index is needed (the next sibling is simply the next element in the array, with a flag marking the last sibling in a child list)
With this approach, Node could look like this:
const Node = packed struct {
    char: u8,
    // It's safe to represent this with the minimum number of bits,
    // e.g. there are 6 nodes in our example so it can be represented in 3 bits.
    first_child_index: u3,
    // `last_sibling` replaces the need for the `next_sibling` field, since
    // accessing the next sibling is just an index increment.
    last_sibling: bool,
    end_of_word: bool,
};
And the array of nodes for this particular example trie would look like this:
const nodes = [6]Node{
    .{ .first_child_index = 1, .char = 0, .last_sibling = true, .end_of_word = false },
    .{ .first_child_index = 3, .char = 'G', .last_sibling = false, .end_of_word = false },
    .{ .first_child_index = 5, .char = 'H', .last_sibling = true, .end_of_word = false },
    .{ .first_child_index = 0, .char = 'G', .last_sibling = false, .end_of_word = true },
    .{ .first_child_index = 0, .char = 'L', .last_sibling = true, .end_of_word = true },
    .{ .first_child_index = 0, .char = 'F', .last_sibling = true, .end_of_word = true },
};
This representation can be visualized like so:

[diagram of the 'flattened' representation]
This still means that searching a child list uses an O(n) linear scan, but this representation makes those searches much more friendly to the CPU, and greatly reduces the size of each node.
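As a rough sketch, searching a child list with this representation could look like the following (findChild is a made-up helper, reusing the Node definition and the nodes array layout from above):

fn findChild(nodes: []const Node, parent: Node, c: u8) ?Node {
    // In the example array, index 0 is the root, so a first_child_index
    // of 0 doubles as "no children".
    if (parent.first_child_index == 0) return null;
    var i: usize = parent.first_child_index;
    while (true) : (i += 1) {
        if (nodes[i].char == c) return nodes[i];
        // Children are contiguous, so hitting the last sibling means no match.
        if (nodes[i].last_sibling) return null;
    }
}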
To get an idea of how the different representations compare, here's a breakdown of the data size for a trie containing the full set of named character references (2,231 words), for each of:

1. the 'connections' version (using [256]?*Node as the children field)
2. the 'linked list' version
3. the 'flattened' version

The result: the 'flattened' version is 1/514 the size of the 'connections' version, and 1/6 the size of the 'linked list' version.
As mentioned, the 'linked list' and 'flattened' versions sacrifice the O(1) lookup of the 'connections' version in favor of reducing the data size, so while the 'flattened' version claws some performance back from the 'linked list' version, the 'connections' version is the fastest:

- 'connections' version: 501.596ms (50ns per contains call)
- 'linked list' version: 965.138ms (96ns per contains call)
- 'flattened' version: 609.215ms (60ns per contains call)
One interesting thing to note is that the above results for representations 1 & 2 rely on a friendly allocation pattern for the nodes (i.e. the memory addresses of the nodes happening to end up fairly close to each other). This is admittedly pretty likely when constructing a trie all at once, but, if we intentionally force a horrendous allocation pattern, where each allocated node gets put on an entirely separate page, we can see the effects very clearly:
- 'connections' version: 1.025s (102ns per contains call) (each contains call takes ~2x longer than it did)
- 'linked list' version: 4.372s (437ns per contains call) (each contains call takes ~4x longer than it did)

If we run the relevant benchmarks through poop, we can confirm the cause of the slowdown (these results are from the 'connections' version):
Benchmark 1 (11 runs): ./trie-friendly-allocations
measurement mean ± σ min … max outliers delta
wall_time 496ms ± 24.1ms 440ms … 518ms 1 ( 9%) 0%
cpu_cycles 2.01G ± 100M 1.78G … 2.11G 1 ( 9%) 0%
instructions 1.94G ± 26.3K 1.94G … 1.94G 0 ( 0%) 0%
cache_references 107M ± 3.40M 98.8M … 109M 2 (18%) 0%
cache_misses 33.1M ± 1.15M 30.3M … 34.0M 2 (18%) 0%
branch_misses 21.7M ± 20.7K 21.7M … 21.7M 0 ( 0%) 0%
Benchmark 2 (5 runs): ./trie-horrendous-allocations
measurement mean ± σ min … max outliers delta
wall_time 1.07s ± 21.9ms 1.05s … 1.11s 0 ( 0%) 💩+116.4% ± 5.5%
cpu_cycles 4.17G ± 92.4M 4.09G … 4.33G 0 ( 0%) 💩+106.9% ± 5.6%
instructions 1.92G ± 38.4 1.92G … 1.92G 0 ( 0%) - 1.0% ± 0.0%
cache_references 145M ± 1.63M 144M … 147M 0 ( 0%) 💩+ 35.3% ± 3.3%
cache_misses 62.2M ± 839K 61.8M … 63.7M 0 ( 0%) 💩+ 88.3% ± 3.7%
branch_misses 21.5M ± 51.6K 21.4M … 21.5M 1 (20%) - 1.1% ± 0.2%
Note that the instruction counts are roughly the same between the 'friendly' and 'horrendous' versions, so the increased cpu_cycles and wall_time can presumably be attributed to the increase in cache misses and pointer chasing (the trie code itself is identical between the two versions).
When using a DAFSA for this particular task, we're trading off a ~20% difference in lookup speed for 2-3 orders of magnitude difference in data size. This seems pretty okay, especially for what we're ultimately interested in implementing: a fully static and unchanging data structure, so there's no need to worry about how easy it is to modify after construction.
A while back at the same Zig meetup where I gave a talk about my Windows resource compiler, Niles Salter aka Validark gave a talk titled Better data structures and where to find them. It was nominally about his novel autocomplete data structure, but the stated purpose of the talk was to get people interested in learning about data structures and potentially inventing their own.
During the talk, I thought back to when I contributed to an HTML parser implementation and had to leave proper named character reference tokenization as a TODO because I wasn't sure how to approach it. I can't remember if a deterministic acyclic finite state automaton (DAFSA) was directly mentioned in the talk, or if I heard about it from talking with Niles afterwards, or if I learned of it while looking into trie variations later on (since the talk was about a novel trie variation), but, in any case, after learning about the DAFSA, it sounded like a pretty good tool for the job of named character references, so I revisited named character reference tokenization with that tool in hand.
A DAFSA is essentially the 'flattened' representation of a trie, but, more importantly, certain types of redundant nodes are eliminated during its construction (the particulars of this aren't too relevant here so I'll skip them; see here if you're interested).
Going back to the same subset of named character references as the example in the "Trie implementation" section above, a DAFSA would represent that set of words like so:

[DAFSA diagram]

As you can see, the v, a, b, c, and ; nodes are now shared between all the words that use them. This takes the number of nodes down to 13 in this example (compared to 22 for the trie).
The downside of this node consolidation is that we lose the ability to associate a given end-of-word node with a particular value. In this DAFSA example, all words except not end on the exact same node, so how can we know where to look for the associated value(s) for those words?

Here's an illustration of the problem when matching the word &notinvc;:

[diagram]
To get around this downside, it'd be possible to use something like a separate hash map or devise some minimal perfect hashing scheme to lookup the associated code point(s) for a matching word after the fact, but, luckily, we don't have to worry too much about that because it turns out it is possible to do...
First detailed in Applications of finite automata representing large vocabularies (Cláudio L. Lucchesi, Tomasz Kowaltowski, 1993) (pdf), the technique for minimal perfect hashing using a DAFSA is actually rather simple/elegant:
Within each node, store a count of all possible valid words from that node. For the example we've been using, those counts would look like this:

[DAFSA diagram with per-node counts]
Then, to get a unique index for a given word, traverse the DAFSA as normal, but:

- While searching a node's list of children, add the number stored in each sibling node that you iterate past to the unique index
- Whenever you advance to a node that is marked as an end-of-word, add 1 to the unique index

For example, if we had a DAFSA with a, b, c, and d as possible first characters (in that order), and the word we're looking for starts with c, then we'll iterate over a and b when looking for c in the list of children, so we add the numbers of the a and b nodes (whatever they happen to be) to the unique index. Here's an illustration:

[diagram]
For a given word in the set, applying this algorithm will produce a number between 1 and the total number of words in the DAFSA (inclusive), and it's guaranteed that each word will end up with a unique number (i.e. this is a minimal perfect hash). Here's what that looks like for the example we've been using:

[diagram of the unique index for each word in the example DAFSA]
After you have the unique index of a word, you can then use a lookup array for the associated values and index into it using unique_index - 1.
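Putting the two rules together, here's a sketch of the whole unique index computation, reusing the 'flattened' node layout from earlier plus a number field holding each node's word count (uniqueIndex is a made-up name):

fn uniqueIndex(nodes: []const Node, word: []const u8) ?usize {
    var index: usize = 0;
    var node = nodes[0]; // the root node
    for (word) |c| {
        if (node.first_child_index == 0) return null; // no children
        var i: usize = node.first_child_index;
        while (true) : (i += 1) {
            if (nodes[i].char == c) break;
            // Rule 1: add the count of every sibling we iterate past.
            index += nodes[i].number;
            if (nodes[i].last_sibling) return null;
        }
        node = nodes[i];
        // Rule 2: add 1 every time we advance to an end-of-word node.
        if (node.end_of_word) index += 1;
    }
    if (!node.end_of_word) return null;
    return index; // between 1 and the total number of words (inclusive)
}

The associated code point(s) for a match would then live at values[index - 1].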
Ok, so now that we have two different data structures that seem pretty well suited for named character reference matching—a trie and a DAFSA—how do they compare? It's now (finally) time to start using the full set of named character references and all of their mapped code point(s) to see how they stack up.
Some numbers to keep in mind upfront:

- There are 2,231 named character references in total
- The DAFSA contains 3,872 nodes
As mentioned earlier, each named character reference is mapped to either one or two code points. Unicode code points have a range of 0x0-0x10FFFF, but if we actually look at the set of code points used by named character references, there are a few properties worth noting:

- The largest first code point is U+1D56B, which takes 17 bits to encode, so all first code point values can fit into a 17 bit wide unsigned integer
- There are only a small number of distinct second code points, so the second code point (or its absence) can be encoded as an index into a small lookup table using 4 bits

With both of these properties taken together, it's possible to encode the mapped code point values for any given named character reference in 21 bits. With padding between the elements of an array of 21-bit integers, though, that will round up to 4 bytes per element (11 bits of padding), so it ends up being the same as if 32 bit integers were used.
Here's a diagram of a possible memory layout of an array using this representation, showing the bits of the first code point, the bits of the second code point, and the padding bits between elements:

[diagram]
However, while using 21 bits to represent the mapped code point(s) does not automatically lead to any saved bytes over a 32 bit integer, it opens up the possibility to tightly pack an array of 21-bit elements in order to actually save some bytes. Yet, doing so means that storing/loading elements from the tightly packed array becomes trickier (both computationally and implementation-wise). Here's the same diagram as before, but with the elements tightly packed (no padding bits between elements):

[diagram]
You'll notice that no elements past the first start or end on byte boundaries, meaning that loading an element requires a fair amount of bitwise manipulation (bit shifting, masking, etc). This makes array accesses more expensive, but that isn't necessarily a big deal for our use case, since we only ever access the array of values once per named character reference, and only after we're certain we have a match. So, tightly bitpacking the value array is a viable way to save some extra bytes for our purposes.
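For illustration, here's a sketch of what loading element i from such a tightly packed array could look like (packedLookup is a made-up name, and it assumes the byte array has a few bytes of trailing padding so that the 4-byte read never goes out of bounds):

const std = @import("std");

fn packedLookup(bytes: []const u8, i: usize) u21 {
    const bit_offset = i * 21;
    const shift: u5 = @intCast(bit_offset % 8);
    // A 21-bit element never spans more than 4 bytes, so load 4 bytes,
    // shift the element down to the bottom, and truncate to 21 bits.
    const word = std.mem.readInt(u32, bytes[bit_offset / 8 ..][0..4], .little);
    return @truncate(word >> shift);
}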
For the DAFSA, the size calculation is pretty straightforward:

- 3,872 nodes at 4 bytes per node means the DAFSA itself takes up 15,488 bytes
- The value lookup array contains 2,231 elements, so it takes up either 8,924 bytes (4 bytes per element) or 5,857 bytes (tightly bitpacked 21-bit elements)

Ultimately, the goal is to keep the node size less than or equal to 32 bits while storing the following data on each node:

- The character (8 bits)
- The count of all possible valid words from the node, used for minimal perfect hashing; the largest count is 168, which can fit into 8 bits
- An end-of-word flag (1 bit)
- A last-sibling flag (1 bit)
- The index of the node's first child (12 bits, which is enough to index all 3,872 nodes)

In total, that's 8 + 8 + 1 + 1 + 12 = 30 bits, so we have 2 bits to spare while remaining within the 32 bit target size with this representation.
So, the DAFSA and the lookup array for the values (together) will use either 24,412 bytes (23.84 KiB) or 21,345 bytes (20.84 KiB) total.
For the trie, there's slightly more to discuss around data representation before we can get to the data size calculations. It was glossed over in the Trie implementation section, but when using the 'flattened' representation of a trie there are effectively two ways to handle value lookups for each word:

- Store an index into a separate values array on each node (making each node larger)
- Keep the nodes smaller, but make the values array larger so that it can be indexed using something derivable from the matched node itself

Beyond that, the details aren't super relevant. Suffice it to say that each node will either take up 5 bytes or 3 bytes depending on which of the two 'value array' strategies you choose (note: I actually mean 5 bytes and 3 bytes, as they can be represented as one array of 2- or 4-byte values and one array of 1-byte values so padding between elements doesn't factor in).
The summary is that, depending on the particular representation, the trie will use between 57,993 bytes and 68,777 bytes (56.63 KiB to 67.16 KiB) total, or, if the values array is tightly bitpacked, between 54,926 bytes and 55,227 bytes (53.64 KiB to 53.93 KiB) total.
Ultimately, the data size of the trie is going to be at least 2x larger than the equivalent DAFSA.
Luckily, there's an existing HTML parser implementation written in Zig called rem that uses a trie for its named character reference tokenization, so getting some relevant benchmark results from actual HTML parsing for a trie vs DAFSA comparison was pretty easy.
From my benchmarking, it turns out that the DAFSA implementation uses more instructions than the trie implementation because it needs to do extra work to build up the unique index during iteration, but the DAFSA saves on cache misses (presumably due to the smaller overall size of the DAFSA and its node re-use) and everything just about evens out in terms of wall clock time:
Benchmark 1 (449 runs): ./trie
measurement mean ± σ min … max outliers delta
wall_time 44.4ms ± 1.57ms 43.0ms … 61.8ms 3 ( 1%) 0%
peak_rss 61.8MB ± 62.5KB 61.6MB … 61.9MB 0 ( 0%) 0%
cpu_cycles 49.5M ± 463K 48.5M … 51.9M 19 ( 4%) 0%
instructions 76.2M ± 2.95 76.2M … 76.2M 5 ( 1%) 0%
cache_references 2.54M ± 21.6K 2.48M … 2.63M 12 ( 3%) 0%
cache_misses 119K ± 1.64K 115K … 128K 18 ( 4%) 0%
branch_misses 322K ± 1.02K 319K … 328K 18 ( 4%) 0%
Benchmark 2 (451 runs): ./dafsa
measurement mean ± σ min … max outliers delta
wall_time 44.3ms ± 561us 43.8ms … 48.4ms 19 ( 4%) - 0.4% ± 0.3%
peak_rss 61.6MB ± 66.6KB 61.5MB … 61.7MB 0 ( 0%) - 0.2% ± 0.0%
cpu_cycles 53.0M ± 566K 52.3M … 54.9M 11 ( 2%) 💩+ 7.0% ± 0.1%
instructions 78.7M ± 2.59 78.7M … 78.7M 6 ( 1%) 💩+ 3.2% ± 0.0%
cache_references 2.49M ± 30.0K 2.43M … 2.60M 29 ( 6%) ⚡- 2.0% ± 0.1%
cache_misses 90.9K ± 1.32K 86.4K … 95.4K 12 ( 3%) ⚡- 23.5% ± 0.2%
branch_misses 331K ± 730 330K … 337K 21 ( 5%) 💩+ 2.9% ± 0.0%
For the use-case of named character references, using a DAFSA instead of a trie cuts the size of the data at least in half while performing about the same.
First, let's take a look at what the Ladybird implementation looked like before my changes: state implementation, matching implementation. Here's a rough summary of the approach, in pseudo-code:
var match = null;
// Lookahead at the rest of the input
var remaining_input = input.substring(current_offset, input.length - current_offset);
// Check against each named character reference one-by-one
for (var entity : entities) {
    if (remaining_input.starts_with(entity)) {
        if (match == null or entity.length > match.length) {
            match = entity;
        }
    }
}
// If there is a match
if (match != null) {
    // Consume up to the end of the match that was found
    consume_and_advance(match.length - 1);
    // ...
}
This has two major problems:

1. It's inefficient: every match attempt iterates over the full list of 2,231 named character references, doing a starts_with check against the remaining input for each one.
2. It doesn't handle document.write correctly, as previously discussed in The spectre of document.write. It's doing lookahead, but it does not account for insertion points, as it makes the mistake of looking past them. So, if document.write is used to write one-character-at-a-time, it will attempt to resolve the named character reference before all the characters are available (e.g. in the one-character-at-a-time example from earlier, it will erroneously try matching against &in; and then exit the named character reference state).

My pull request focused on fixing both of those problems. The data structure I used is exactly the DAFSA implementation as described so far, with a value array that is not bitpacked, because:

- the extra bytes saved by bitpacking are fairly minor, and
- using an enum for the second code point would either mean using an extra bit (since enums are signed integers in C++) or using some workaround to keep it using the minimal number of bits

The last piece of the puzzle that I haven't mentioned yet is the NamedCharacterReferenceMatcher, which handles DAFSA traversal while providing an API well-tailored to the named character reference state specifically. The details aren't too important, so here are the relevant bits of the exposed API:
// If `c` is the code point of a child of the current `node_index`, the `node_index`
// is updated to that child and the function returns `true`.
// Otherwise, the `node_index` is unchanged and the function returns false.
bool try_consume_code_point(u32 c);
// Returns the number of code points consumed beyond the last full match.
u8 overconsumed_code_points();
// Returns the code points associated with the last match, if any.
Optional<NamedCharacterReferenceCodepoints> code_points();
So, with all that context, here's what the Ladybird implementation looks like after my changes (slightly simplified for clarity; here's the full implementation):
BEGIN_STATE(NamedCharacterReference)
{
    if (matcher.try_consume_code_point(current_code_point)) {
        temporary_buffer.append(current_code_point);
        continue; // stay in the NamedCharacterReference state and go to the next code point
    } else {
        DONT_CONSUME_CHARACTER;
    }

    auto overconsumed_code_points = matcher.overconsumed_code_points();
    if (overconsumed_code_points > 0) {
        backtrack_to(current_offset - overconsumed_code_points);
        temporary_buffer.shrink_by(overconsumed_code_points);
    }

    auto mapped_code_points = matcher.code_points();
    // If there is a match
    if (mapped_code_points) {
        // ...
    } else {
        FLUSH_CODEPOINTS_CONSUMED_AS_A_CHARACTER_REFERENCE;
        SWITCH_TO(AmbiguousAmpersand);
    }
}
Note also that Ladybird follows 'spec-driven development', meaning that the goal is for its code to be implemented to match the text of the relevant specification as closely as possible. Here's what the named character reference state specification looks like for reference:
13.2.5.73 Named character reference state
Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table. Append each character to the temporary buffer when it's consumed.
If there is a match
- ...
Otherwise
- Flush code points consumed as a character reference. Switch to the ambiguous ampersand state.
Overall, these changes made the Ladybird tokenizer:

- faster: 8x faster in a benchmark heavy on named character references (and ~1.23x faster in the average case)
- more spec-compliant: it now handles document.write emitting one-character-at-a-time correctly

But that's all pretty low-hanging fruit, as the previous Ladybird implementation had some obvious problems. In fact, we can actually improve on this some more (and will later on), but I think it's worth looking at the Firefox/Chrome/Safari implementations now to see how this DAFSA version stacks up against them.
Before we get to the actual comparisons, there's (unfortunately) a lot that has to be discussed.
First, you'll notice from the Ladybird benchmarks above that an 8x improvement in a very named-character-reference-specific benchmark only led to a 1.23x improvement in the average case. This points to the fact that named character reference matching is not something that HTML tokenizers typically do very often, and that named character reference matching being fast enough is likely just fine, all things considered.
Second, instead of going the route of putting my DAFSA implementation into the other browsers' engines to compare, I went with taking the other browsers' implementations and putting them into Ladybird. Not only that, though, I also made the Firefox/Chrome/Safari implementations conform to the API of NamedCharacterReferenceMatcher (for reasons that will be discussed soon). So, in order for my benchmarking to be accurate you'll have to trust that:

1. I ported the Firefox/Chrome/Safari implementations correctly
2. Conforming them to the NamedCharacterReferenceMatcher API didn't meaningfully affect their performance
3. My benchmarking setup itself is sound

For the first point, the only real assurance I can give you is that the same number of web platform tests within the html/syntax/parsing category were passing with each browser's implementation integrated. The second point will be discussed more later on. For the third point, we have to go on yet another detour...
My initial benchmarking setup was straightforward: compile a separate benchmark binary for each implementation, and then run those binaries against each other. However, this approach ultimately left me with some inexplicable results. Here's an example of such results, where the benchmark that exclusively tests the relevant parts of the code shows the Blink (Chrome) implementation being slightly faster than mine:
Benchmark 1 (85 runs): ./TestHTMLTokenizerBlink --bench named_character_references
measurement mean ± σ min … max outliers delta
wall_time 118ms ± 1.68ms 115ms … 121ms 0 ( 0%) 0%
Benchmark 2 (84 runs): ./TestHTMLTokenizerDafsa --bench named_character_references
measurement mean ± σ min … max outliers delta
wall_time 120ms ± 1.50ms 117ms … 123ms 0 ( 0%) 💩+ 2.1% ± 0.4%
Yet, when I ran a benchmark that only occasionally exercises the code that I changed (a benchmark using a sample of real HTML files), I got unexpected results. What we should expect is either a very slight difference in the same direction, or (more realistically) no discernible difference, as this effect size should not be noticeable in the average case. Instead, I got the opposite result and a larger effect:
Benchmark 1 (12 runs): ./TestHTMLTokenizerBlink --bench benchfiles
measurement mean ± σ min … max outliers delta
wall_time 1.85s ± 28.4ms 1.79s … 1.88s 2 (17%) 0%
Benchmark 2 (12 runs): ./TestHTMLTokenizerDafsa --bench benchfiles
measurement mean ± σ min … max outliers delta
wall_time 1.79s ± 20.7ms 1.76s … 1.84s 1 ( 8%) ⚡- 3.2% ± 1.1%
Taken together, the only explanation for these sorts of results would be that the parts of the code that I didn't change got faster in one version, and not the other.
Well, this explanation—that the code I didn't change got faster—is actually likely to be the correct one, and I've known about this possibility ever since I watched the excellent talk "Performance Matters" by Emery Berger. I recommend watching the talk, but the short explanation is that changes to one part of the codebase may inadvertently cause the compiler to reorganize unrelated parts of the compiled binary in ways that affect performance ('layout').
However, while I was aware of this possible confounder, up until now I have basically ignored it whenever doing benchmarking (and I'm assuming that's true for most programmers). This is the first time I've come face-to-face with the problem and had to reckon with its effects, and I think a large part of that is because I'm dealing with a much larger codebase than I typically would when benchmarking, so there's a lot of room for inadvertent layout changes to cascade and cause noticeable performance differences.
Unfortunately, the solution presented in the talk (Stabilizer, which allows you to constantly randomize a binary's layout during runtime to control for layout-based performance differences) has bitrotted and only works with an ancient version of LLVM. So, instead, I thought I'd try a different benchmarking setup to counteract the problem: compile all of the implementations into the same binary (behind a common interface), and select which one to use at runtime.

This introduces some dynamic dispatch overhead into the mix which may muddy the results slightly, but, in theory, this should eliminate the effects of layout differences, as whatever the binary layout happens to be, all implementations will share it. In practice, this did indeed work, but introduced another unexplained anomaly that we'll deal with afterwards. After moving all the implementations into the same binary, I got these results for the 'average case' benchmark:
Benchmark 1 (12 runs): ./BenchHTMLTokenizerFiles dafsa
measurement mean ± σ min … max outliers delta
wall_time 1.78s ± 29.8ms 1.75s … 1.87s 1 ( 8%) 0%
peak_rss 88.7MB ± 59.2KB 88.7MB … 88.8MB 0 ( 0%) 0%
cpu_cycles 6.71G ± 127M 6.62G … 7.10G 1 ( 8%) 0%
instructions 16.1G ± 99.2K 16.1G … 16.1G 0 ( 0%) 0%
cache_references 324M ± 2.87M 319M … 331M 2 (17%) 0%
cache_misses 10.2M ± 66.4K 10.1M … 10.3M 3 (25%) 0%
branch_misses 8.69M ± 2.99M 7.82M … 18.2M 1 ( 8%) 0%
Benchmark 3 (12 runs): ./BenchHTMLTokenizerFiles blink
measurement mean ± σ min … max outliers delta
wall_time 1.79s ± 24.8ms 1.73s … 1.82s 1 ( 8%) + 1.0% ± 1.3%
peak_rss 89.1MB ± 234KB 89.0MB … 89.8MB 1 ( 8%) + 0.5% ± 0.2%
cpu_cycles 6.70G ± 57.7M 6.62G … 6.79G 0 ( 0%) - 0.2% ± 1.2%
instructions 16.1G ± 128K 16.1G … 16.1G 1 ( 8%) - 0.0% ± 0.0%
cache_references 325M ± 1.90M 321M … 328M 0 ( 0%) + 0.2% ± 0.6%
cache_misses 10.3M ± 54.3K 10.2M … 10.4M 0 ( 0%) + 0.8% ± 0.5%
branch_misses 7.86M ± 24.6K 7.79M … 7.89M 3 (25%) - 9.6% ± 20.6%
Benchmark 4 (12 runs): ./BenchHTMLTokenizerFiles gecko
measurement mean ± σ min … max outliers delta
wall_time 1.72s ± 9.79ms 1.70s … 1.74s 0 ( 0%) ⚡- 3.2% ± 1.1%
peak_rss 88.8MB ± 83.8KB 88.7MB … 89.0MB 4 (33%) + 0.1% ± 0.1%
cpu_cycles 6.69G ± 41.1M 6.63G … 6.76G 0 ( 0%) - 0.4% ± 1.2%
instructions 16.1G ± 72.3K 16.1G … 16.1G 0 ( 0%) - 0.1% ± 0.0%
cache_references 323M ± 1.53M 320M … 325M 1 ( 8%) - 0.2% ± 0.6%
cache_misses 10.2M ± 35.8K 10.1M … 10.2M 0 ( 0%) + 0.0% ± 0.4%
branch_misses 7.79M ± 49.1K 7.76M … 7.95M 2 (17%) - 10.4% ± 20.6%
The remaining Gecko (Firefox) wall_time difference is consistently reproducible, but not readily explainable using the other metrics measured, as there is no significant difference in CPU cycles, instructions, cache usage, etc. After attempting some profiling and trying to use strace to understand the difference, my guess is that this comes down to coincidental allocation patterns being friendlier when choosing the Gecko version.

If we use strace -e %memory -c, the dafsa and blink versions consistently use more brk/mmap/munmap syscalls (especially brk):
Dafsa
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
47.18 0.012463 113 110 brk
31.90 0.008427 443 19 munmap
18.25 0.004822 9 508 mmap
2.66 0.000703 5 134 mprotect
------ ----------- ----------- --------- --------- ----------------
100.00 0.026415 34 771 total
Blink
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
55.95 0.018094 138 131 brk
28.81 0.009318 490 19 munmap
12.93 0.004181 8 508 mmap
2.32 0.000749 5 134 mprotect
------ ----------- ----------- --------- --------- ----------------
100.00 0.032342 40 792 total
The Gecko version, even though it uses roughly the same amount of memory overall, consistently has fewer of these syscalls:
Gecko
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
37.07 0.006560 385 17 munmap
31.73 0.005615 75 74 brk
26.50 0.004689 9 506 mmap
4.70 0.000831 6 134 mprotect
------ ----------- ----------- --------- --------- ----------------
100.00 0.017695 24 731 total
I don't know enough about the glibc/libstdc++ allocator implementation(s) to know why this would be the case, and the magnitude of the difference reported by strace doesn't seem large enough to explain the results, but I'm at least a little bit confident that this is the cause, since, after inserting padding into each NamedCharacterReferenceMatcher subclass to ensure they are all the same size, the wall_time difference went away:
Benchmark 1 (12 runs): ./BenchHTMLTokenizerFiles dafsa
measurement mean ± σ min … max outliers delta
wall_time 1.81s ± 29.8ms 1.73s … 1.85s 1 ( 8%) 0%
peak_rss 89.7MB ± 203KB 89.5MB … 90.3MB 1 ( 8%) 0%
cpu_cycles 6.74G ± 58.0M 6.68G … 6.87G 0 ( 0%) 0%
instructions 16.1G ± 142K 16.1G … 16.1G 1 ( 8%) 0%
cache_references 322M ± 2.45M 316M … 325M 1 ( 8%) 0%
cache_misses 10.2M ± 25.1K 10.1M … 10.2M 0 ( 0%) 0%
branch_misses 7.84M ± 26.1K 7.77M … 7.87M 1 ( 8%) 0%
Benchmark 2 (12 runs): ./BenchHTMLTokenizerFiles blink
measurement mean ± σ min … max outliers delta
wall_time 1.80s ± 23.0ms 1.74s … 1.83s 1 ( 8%) - 0.1% ± 1.2%
peak_rss 89.6MB ± 205KB 89.2MB … 90.1MB 2 (17%) - 0.1% ± 0.2%
cpu_cycles 6.74G ± 37.0M 6.68G … 6.82G 0 ( 0%) - 0.0% ± 0.6%
instructions 16.1G ± 194K 16.1G … 16.1G 0 ( 0%) - 0.0% ± 0.0%
cache_references 321M ± 1.93M 317M … 324M 0 ( 0%) - 0.3% ± 0.6%
cache_misses 10.3M ± 47.5K 10.2M … 10.3M 0 ( 0%) + 0.7% ± 0.3%
branch_misses 7.87M ± 23.3K 7.82M … 7.91M 1 ( 8%) + 0.4% ± 0.3%
Benchmark 3 (12 runs): ./BenchHTMLTokenizerFiles gecko
measurement mean ± σ min … max outliers delta
wall_time 1.80s ± 29.0ms 1.71s … 1.82s 1 ( 8%) - 0.4% ± 1.4%
peak_rss 89.7MB ± 265KB 89.5MB … 90.5MB 1 ( 8%) + 0.1% ± 0.2%
cpu_cycles 6.73G ± 43.3M 6.65G … 6.78G 0 ( 0%) - 0.2% ± 0.6%
instructions 16.1G ± 156K 16.1G … 16.1G 2 (17%) - 0.1% ± 0.0%
cache_references 321M ± 3.29M 316M … 329M 0 ( 0%) - 0.2% ± 0.8%
cache_misses 10.2M ± 52.9K 10.1M … 10.2M 2 (17%) - 0.2% ± 0.3%
branch_misses 7.83M ± 22.2K 7.76M … 7.86M 2 (17%) - 0.2% ± 0.3%
No meaningful differences across the board, which is what we expect. What this (tentatively) means is that heap allocation is another potential confounder, and that something as inconsequential as a single allocation being a different size (in our case, the NamedCharacterReferenceMatcher instance) may have knock-on effects that last for the rest of the program (or I'm wrong about the cause and this is a red herring).
Consequently, this means that I won't bother with the 'average case' benchmarking moving forward. In other words, spoiler alert: nothing in this article will move the needle on this 'average case' benchmark.
Something worth mentioning here is that I've made the choice to convert the Firefox/Chrome/Safari implementations to conform to the NamedCharacterReferenceMatcher API used by Ladybird (instead of porting the full named character reference tokenizer state implementation from the other browsers into Ladybird). This was done for two reasons:

1. It keeps the comparison focused on the matching strategies themselves, since all implementations share the exact same surrounding tokenizer code
2. Porting each browser's full tokenizer state implementation into Ladybird would have been much more invasive and error-prone

I'm mentioning this now because it means that I've introduced another possible source of error into my benchmarks; the Firefox/Chrome/Safari implementations that I'm testing are not 1:1 ports, as they had to be transformed to conform to the NamedCharacterReferenceMatcher API (Firefox much more than Chrome/Safari).
I think the big takeaway here is that there is a lot that can go wrong when benchmarking.
Aside from the more esoteric stuff mentioned above, there are also countless simple/dumb mistakes that can be made that can completely ruin a benchmark's integrity. As an example, for a good while when writing this article, I accidentally left a loop in the Chrome version that I only put there for debugging purposes. That loop was just eating CPU cycles for no reason and skewed my benchmark results pretty significantly. Luckily, I found and fixed that particular mistake, but that sort of thing could have easily gone unnoticed and caused me to draw totally invalid conclusions. Beyond that, there's other stuff I haven't mentioned like CPU architecture, compiler flags, etc, etc, etc.
What I'm really trying to get across is something like: benchmarking is hard, so treat all of these results (mine included) with a healthy dose of skepticism.
With all that out of the way, let's get into it.
The current implementation of named character reference tokenization in the Gecko engine (Firefox's browser engine) was introduced in 2010 and refined over the rest of that year. It has remained unchanged since then.
It does not use any form of a trie, but instead uses a number of arrays (48 in total) to progressively narrow down the possible candidates within the set of named character references until there's no more possible candidates remaining. Here's an overview:

1. It's confirmed that the first character is within the a-z or A-Z range, and the first character is saved [src]
2. The second character is used as an index into the HILO_ACCEL array in order to get the 'row' to use for the first character (there are 44 possible rows; the second character is also always within a-z and A-Z, but there happens to be 8 missing characters from the A-Z range) [src]
3. The first character is converted into a number between 0 and 51 (inclusive) and that is used as an index into the 'row' that was retrieved from the second character [src]
4. The value retrieved from the 'row' packs together a hi and a lo index into the NAMES array, giving the initial range of candidates. The elements of the NAMES array are structs that contain two pieces of information [src]: the length of the name, and the offset of the name's remaining characters within the ALL_NAMES array (an array of bytes)
5. Matching then proceeds character-by-character, moving the hi and lo cursors towards each other to rule out candidates, until matching ends (hi < lo or ; is seen) [src]
6. On a successful match, the matching index is used to look up the mapped code point(s) in the VALUES array (the NAMES and VALUES arrays are the same length) [src]

In the very likely scenario that the above description is hard to take in, here's my best attempt at visually illustrating how it works, matching against the valid named character reference &notinvc;:
notinvc;
^

☑ the first character (n) is within the a-z or A-Z range
notinvc;
 ^

Use the second character (o) to get the array to use with the first character (n)
HILO_ACCEL

| index | char | row |
|---|---|---|
| 0 | '\x00' | N/A |
| ... | | |
| 65 | 'A' | HILO_ACCEL_65 |
| ... | | |
| 110 | 'n' | HILO_ACCEL_110 |
| 111 | 'o' | **HILO_ACCEL_111** |
| 112 | 'p' | HILO_ACCEL_112 |
| ... | | |
| 122 | 'z' | HILO_ACCEL_122 |
HILO_ACCEL_111

| index | char | hi/lo |
|---|---|---|
| 0 | 'A' | 0x00110010 |
| ... | | |
| 25 | 'Z' | ... |
| 26 | 'a' | ... |
| ... | | |
| 39 | 'n' | **0x060205F6** |
| ... | | |
| 51 | 'z' | 0x08B308B3 |
0x060205F6

- hi: 0x0602, or 1538
- lo: 0x05F6, or 1526

Any possible matches must be between indexes 1526 and 1538 (inclusive). The possible matches are:
NAMES

| index | name |
|---|---|
| 1526 | nopf; |
| 1527 | not |
| 1528 | not; |
| 1529 | notin; |
| 1530 | notinE; |
| 1531 | notindot; |
| 1532 | notinva; |
| 1533 | notinvb; |
| 1534 | notinvc; |
| 1535 | notni; |
| 1536 | notniva; |
| 1537 | notnivb; |
| 1538 | notnivc; |
Now we start to narrow those possibilities down:
notinvc;
  ^

NAMES

| index | name |
|---|---|
| | ~~nopf;~~ |
| 1527 | not |
| 1528 | not; |
| 1529 | notin; |
| 1530 | notinE; |
| 1531 | notindot; |
| 1532 | notinva; |
| 1533 | notinvb; |
| 1534 | notinvc; |
| 1535 | notni; |
| 1536 | notniva; |
| 1537 | notnivb; |
| 1538 | notnivc; |
notinvc;
   ^

NAMES

| index | name |
|---|---|
| | ~~nopf;~~ |
| | ~~not~~ |
| | ~~not;~~ |
| 1529 | notin; |
| 1530 | notinE; |
| 1531 | notindot; |
| 1532 | notinva; |
| 1533 | notinvb; |
| 1534 | notinvc; |
| | ~~notni;~~ |
| | ~~notniva;~~ |
| | ~~notnivb;~~ |
| | ~~notnivc;~~ |
notinvc;
    ^

NAMES

| index | name |
|---|---|
| | ~~nopf;~~ |
| | ~~not~~ |
| | ~~not;~~ |
| 1529 | notin; |
| 1530 | notinE; |
| 1531 | notindot; |
| 1532 | notinva; |
| 1533 | notinvb; |
| 1534 | notinvc; |
| | ~~notni;~~ |
| | ~~notniva;~~ |
| | ~~notnivb;~~ |
| | ~~notnivc;~~ |
notinvc;
     ^

NAMES

| index | name |
|---|---|
| | ~~nopf;~~ |
| | ~~not~~ |
| | ~~not;~~ |
| | ~~notin;~~ |
| | ~~notinE;~~ |
| | ~~notindot;~~ |
| 1532 | notinva; |
| 1533 | notinvb; |
| 1534 | notinvc; |
| | ~~notni;~~ |
| | ~~notniva;~~ |
| | ~~notnivb;~~ |
| | ~~notnivc;~~ |
notinvc;
      ^

NAMES

| index | name |
|---|---|
| | ~~nopf;~~ |
| | ~~not~~ |
| | ~~not;~~ |
| | ~~notin;~~ |
| | ~~notinE;~~ |
| | ~~notindot;~~ |
| | ~~notinva;~~ |
| | ~~notinvb;~~ |
| 1534 | notinvc; |
| | ~~notni;~~ |
| | ~~notniva;~~ |
| | ~~notnivb;~~ |
| | ~~notnivc;~~ |
I'm glossing over how exactly the possibilities are narrowed down because it's not super relevant (if you're interested, here's the responsible tokenizer code), but I will note that the 'lo' and 'hi' cursors always move linearly (i.e. each possible match is ruled out one-by-one; there's no binary search or anything like that going on).
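To make the first-two-character step concrete, here's a sketch of it in Zig. This is my own approximation rather than a port: initialRange is a made-up name, and the table is assumed to have the shape described above (one optional row per second character, 52 packed hi/lo values per row, with a zero entry assumed to mean no candidates):

const Range = struct { lo: u16, hi: u16 };

fn initialRange(hilo_accel: [123]?*const [52]u32, first: u8, second: u8) ?Range {
    // Rows exist only for characters that can appear second in a name.
    const row = hilo_accel[second] orelse return null;
    // The first character ('A'-'Z' then 'a'-'z') maps to 0-51.
    const index: usize = if (first <= 'Z') first - 'A' else first - 'a' + 26;
    const hilo = row[index]; // e.g. 0x060205F6 for 'n' followed by 'o'
    if (hilo == 0) return null;
    // hi is the upper 16 bits and lo the lower 16 (0x0602 = 1538, 0x05F6 = 1526).
    return .{ .lo = @truncate(hilo), .hi = @truncate(hilo >> 16) };
}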
This approach works well because the first two characters alone fairly reliably narrow down the possibilities to a pretty small range. Out of 2288 possible combinations of the first two characters, 1658 of them (72.5%) lead to zero possible matches. Out of the remaining combinations (those with ≥ 1 possible match), the mean number of matches is 3.54 with a standard deviation of 29.8, and the median number of possible matches is 2. Here's what the full distribution looks like (with the combinations that lead to zero matches included):

[chart: distribution of the number of possible matches after the first two characters]
Now that we have an understanding of how the Firefox implementation works, let's see how it compares using the three metrics that were mentioned at the start.
Performance between the Firefox version and the Ladybird DAFSA version is basically a wash in the primary benchmark I'm using (tokenizing a file with tens of thousands of valid and invalid named character references):
Benchmark 1 (89 runs): ./BenchHTMLTokenizer gecko
measurement mean ± σ min … max outliers delta
wall_time 113ms ± 1.12ms 111ms … 115ms 0 ( 0%) 0%
peak_rss 83.4MB ± 93.7KB 83.0MB … 83.5MB 2 ( 2%) 0%
cpu_cycles 226M ± 877K 224M … 230M 3 ( 3%) 0%
instructions 438M ± 10.6K 438M … 438M 7 ( 8%) 0%
cache_references 9.54M ± 130K 9.40M … 10.5M 6 ( 7%) 0%
cache_misses 427K ± 11.1K 406K … 458K 2 ( 2%) 0%
branch_misses 578K ± 1.79K 575K … 585K 5 ( 6%) 0%
Benchmark 2 (88 runs): ./BenchHTMLTokenizer dafsa
measurement mean ± σ min … max outliers delta
wall_time 114ms ± 1.38ms 110ms … 116ms 5 ( 6%) + 0.7% ± 0.3%
peak_rss 83.3MB ± 94.6KB 83.0MB … 83.5MB 1 ( 1%) - 0.1% ± 0.0%
cpu_cycles 229M ± 856K 227M … 232M 3 ( 3%) 💩+ 1.4% ± 0.1%
instructions 450M ± 10.4K 450M … 450M 3 ( 3%) 💩+ 2.7% ± 0.0%
cache_references 9.42M ± 128K 9.25M … 10.5M 3 ( 3%) - 1.2% ± 0.4%
cache_misses 418K ± 9.03K 400K … 443K 2 ( 2%) ⚡- 2.0% ± 0.7%
branch_misses 575K ± 3.01K 570K … 600K 5 ( 6%) - 0.5% ± 0.1%
However, if we tailor some benchmarks to test the scenarios where each should theoretically perform the worst, we can see some clearer differences.
I believe the worst case for the Firefox implementation is successfully matching the named character reference &supsetneqq;. As mentioned earlier, su as the first two characters narrows down the possibilities the least, with 55 remaining possibilities, and &supsetneqq; should take the longest to match out of the remaining possibilities.

Here are the results for tokenizing a file with nothing but 30,000 &supsetneqq; sequences in a row:
Benchmark 1 (197 runs): ./BenchHTMLTokenizer gecko gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 50.6ms ± 1.11ms 48.3ms … 53.2ms 0 ( 0%) 0%
peak_rss 53.0MB ± 85.4KB 52.7MB … 53.2MB 1 ( 1%) 0%
cpu_cycles 137M ± 816K 135M … 140M 3 ( 2%) 0%
instructions 278M ± 9.06K 278M … 278M 7 ( 4%) 0%
cache_references 3.27M ± 58.4K 3.13M … 3.55M 17 ( 9%) 0%
cache_misses 361K ± 10.0K 342K … 396K 2 ( 1%) 0%
branch_misses 314K ± 5.29K 306K … 335K 5 ( 3%) 0%
Benchmark 2 (218 runs): ./BenchHTMLTokenizer dafsa gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 45.9ms ± 805us 43.6ms … 47.5ms 22 (10%) ⚡- 9.3% ± 0.4%
peak_rss 53.0MB ± 83.9KB 52.7MB … 53.2MB 1 ( 0%) - 0.0% ± 0.0%
cpu_cycles 117M ± 635K 116M … 119M 3 ( 1%) ⚡- 14.5% ± 0.1%
instructions 259M ± 5.16K 259M … 259M 5 ( 2%) ⚡- 7.0% ± 0.0%
cache_references 3.27M ± 128K 3.15M … 4.10M 20 ( 9%) - 0.0% ± 0.6%
cache_misses 357K ± 7.06K 344K … 384K 5 ( 2%) - 1.1% ± 0.5%
branch_misses 183K ± 1.95K 179K … 193K 21 (10%) ⚡- 41.8% ± 0.2%
On the flipside, the scenario where the Firefox implementation likely outperforms the Ladybird implementation the most is an invalid named character reference that can be rejected from the first two characters alone (I've arbitrarily chosen &cz).

Here are the results for tokenizing a file with nothing but 30,000 &cz sequences in a row:
Benchmark 1 (163 runs): ./BenchHTMLTokenizer gecko ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 61.4ms ± 958us 59.2ms … 63.6ms 1 ( 1%) 0%
peak_rss 65.1MB ± 85.5KB 64.9MB … 65.3MB 24 (15%) 0%
cpu_cycles 104M ± 525K 102M … 106M 4 ( 2%) 0%
instructions 194M ± 4.18K 194M … 194M 2 ( 1%) 0%
cache_references 5.89M ± 125K 5.79M … 6.92M 8 ( 5%) 0%
cache_misses 374K ± 4.44K 367K … 385K 0 ( 0%) 0%
branch_misses 163K ± 1.24K 160K … 166K 3 ( 2%) 0%
Benchmark 2 (159 runs): ./BenchHTMLTokenizer dafsa ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 63.0ms ± 1.00ms 60.1ms … 65.0ms 0 ( 0%) 💩+ 2.6% ± 0.3%
peak_rss 65.1MB ± 74.3KB 64.8MB … 65.2MB 1 ( 1%) - 0.1% ± 0.0%
cpu_cycles 112M ± 673K 111M … 117M 8 ( 5%) 💩+ 7.6% ± 0.1%
instructions 214M ± 4.26K 214M … 214M 1 ( 1%) 💩+ 10.4% ± 0.0%
cache_references 5.87M ± 54.9K 5.77M … 6.19M 5 ( 3%) - 0.4% ± 0.4%
cache_misses 375K ± 4.52K 364K … 391K 2 ( 1%) + 0.2% ± 0.3%
branch_misses 164K ± 751 161K … 166K 2 ( 1%) + 0.4% ± 0.1%
However, neither of these scenarios are likely to be all that common in reality. My hunch (but I don't have any data to back this up) is that there are two scenarios that are common in real HTML:

- Fully valid named character references
- A & character written directly in the markup instead of &amp; (and therefore the & is probably surrounded by whitespace)

In the second scenario, where an & character is surrounded by whitespace, the tokenizer will never actually enter the named character reference state, since that requires the & to be followed by an ASCII alphanumeric character, so all implementations will perform the same there.
We can test the first scenario, though. Here are the results for a file with nothing but 30,000 valid named character references (chosen at random) in a row:
Benchmark 1 (229 runs): ./BenchHTMLTokenizer gecko all-valid
measurement mean ± σ min … max outliers delta
wall_time 43.6ms ± 713us 41.3ms … 45.2ms 13 ( 6%) 0%
peak_rss 54.4MB ± 84.8KB 54.1MB … 54.5MB 5 ( 2%) 0%
cpu_cycles 103M ± 545K 102M … 105M 2 ( 1%) 0%
instructions 188M ± 10.6K 188M … 188M 17 ( 7%) 0%
cache_references 3.63M ± 91.6K 3.54M … 4.35M 15 ( 7%) 0%
cache_misses 361K ± 8.92K 347K … 386K 0 ( 0%) 0%
branch_misses 383K ± 1.57K 379K … 387K 0 ( 0%) 0%
Benchmark 2 (233 runs): ./BenchHTMLTokenizer dafsa all-valid
measurement mean ± σ min … max outliers delta
wall_time 43.0ms ± 770us 40.8ms … 44.3ms 30 (13%) ⚡- 1.3% ± 0.3%
peak_rss 54.4MB ± 80.6KB 54.0MB … 54.5MB 1 ( 0%) - 0.1% ± 0.0%
cpu_cycles 102M ± 452K 101M … 104M 4 ( 2%) ⚡- 1.3% ± 0.1%
instructions 191M ± 8.04K 191M … 191M 11 ( 5%) 💩+ 2.0% ± 0.0%
cache_references 3.54M ± 63.5K 3.45M … 4.09M 13 ( 6%) ⚡- 2.6% ± 0.4%
cache_misses 355K ± 5.23K 344K … 370K 1 ( 0%) ⚡- 1.7% ± 0.4%
branch_misses 344K ± 1.24K 341K … 348K 17 ( 7%) ⚡- 10.1% ± 0.1%
Something worth noting about why Firefox fares better here than in the &supsetneqq; benchmark above is that the distribution of possible matches after two characters is very uneven (as mentioned previously). In 42.7% of cases (269 out of 630) where there is at least 1 possible match after the first two characters, there is only 1 possible match. This makes the 'narrowing down the matches' portion of the Firefox implementation essentially just a string comparison.
Something else we can look at is raw matching speed in isolation (i.e. not within the context of HTML tokenization). In my benchmarking, the Firefox implementation wins out in this category:
Benchmark 1 (95 runs): ./BenchMatcherGecko
measurement mean ± σ min … max outliers delta
wall_time 105ms ± 1.14ms 102ms … 108ms 5 ( 5%) 0%
peak_rss 4.56MB ± 72.7KB 4.33MB … 4.72MB 0 ( 0%) 0%
cpu_cycles 426M ± 1.40M 424M … 430M 9 ( 9%) 0%
instructions 745M ± 81.5 745M … 745M 0 ( 0%) 0%
cache_references 8.03M ± 81.6K 7.89M … 8.54M 3 ( 3%) 0%
cache_misses 28.0K ± 5.70K 21.2K … 46.5K 7 ( 7%) 0%
branch_misses 5.41M ± 2.49K 5.41M … 5.42M 0 ( 0%) 0%
Benchmark 2 (84 runs): ./BenchMatcherDafsa
measurement mean ± σ min … max outliers delta
wall_time 120ms ± 1.46ms 116ms … 123ms 0 ( 0%) 💩+ 13.8% ± 0.4%
peak_rss 4.50MB ± 61.6KB 4.46MB … 4.59MB 0 ( 0%) - 1.3% ± 0.4%
cpu_cycles 487M ± 4.39M 477M … 496M 2 ( 2%) 💩+ 14.3% ± 0.2%
instructions 1.02G ± 66.2 1.02G … 1.02G 14 (17%) 💩+ 36.7% ± 0.0%
cache_references 6.11M ± 90.5K 5.98M … 6.81M 2 ( 2%) ⚡- 23.9% ± 0.3%
cache_misses 26.0K ± 4.02K 20.3K … 39.8K 5 ( 6%) ⚡- 7.1% ± 5.2%
branch_misses 6.01M ± 20.5K 5.98M … 6.06M 0 ( 0%) 💩+ 11.1% ± 0.1%
Two things to note: the DAFSA version executes considerably more instructions (the extra work of building up the unique index during iteration), while the Gecko version generates more cache references.
As noted earlier, Firefox uses a total of 48 arrays for its named character reference data:

- HILO_ACCEL is an array of 123 pointers, so that's 984 bytes on a 64-bit architecture
- There are 44 HILO_ACCEL_n arrays, each containing 52 32-bit integers, so that's 9,152 bytes
- ALL_NAMES is an array of bytes containing the complete set of characters in all named character references, excluding the first two characters of each. This adds up to 12,183 bytes in total
- NAMES is an array of 2,231 32-bit structs, so that's 8,924 bytes
- VALUES is also an array of 2,231 32-bit structs, so that's 8,924 bytes

So, in total, the Firefox implementation uses 40,167 bytes (39.23 KiB) for its named character reference data, while Ladybird uses 24,412 bytes (23.84 KiB). That's a difference of 15,755 bytes (15.39 KiB), or, in other words, the Ladybird implementation uses 60.8% of the data size of the Firefox implementation.
Since, for the purposes of my benchmarking, the Firefox implementation was made to conform to the API of the NamedCharacterReferenceMatcher that is used in the Ladybird implementation, there's no meaningful difference in terms of ease-of-use.

However, I'll take this opportunity to talk about what changes were made to make that happen and how the real Firefox implementation differs.

- The real Firefox implementation does its matching across multiple tokenizer states, using goto statements to move between them
- The NamedCharacterReferenceMatcher version moves the tokenizer states into the Matcher and replaces the complicated loops with 2 simple loops

Very crudely approximated, the implementation went from ~200 SLoC to ~110 SLoC. So, the NamedCharacterReferenceMatcher abstraction may represent a marginal improvement in terms of ease-of-use.
Overall, the Firefox implementation fares quite well in this comparison.
Blink (the browser engine of Chrome/Chromium) started as a fork of WebKit (the browser engine of Safari), which itself started as a fork of KHTML. There are some differences that have emerged between the two since Blink was forked from WebKit, but for the purposes of this article I'm only going to benchmark against the Blink implementation and assume the results would be roughly the same for the WebKit implementation (the difference mostly comes down to data size, which I'll mention in the "Data size" section later).
Like Firefox, the Chrome/Safari named character reference tokenization does not use a trie. For the 'matching' portion, the Chrome/Safari implementation is actually quite similar in concept to the Firefox implementation:

- The first character is used to get an initial range of possible matches
- Each subsequent character is handled by std::ranges::equal_range to narrow the possibilities within the current range (std::ranges::equal_range uses binary searches to get both the std::ranges::lower_bound and std::ranges::upper_bound) [src]

The main differences from the Firefox implementation are that the Chrome/Safari version (a) only uses the first character to narrow the initial possible matches, and (b) uses binary searches to narrow the possibilities after that instead of linearly moving the lo and hi indexes.
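Here's a sketch of one equal_range-style narrowing step in Zig (again my own approximation, not a port of the Blink code; it assumes names is sorted and that every name in the current [lo, hi) range already agrees on the first pos characters):

const Range = struct { lo: usize, hi: usize };

fn narrow(names: []const []const u8, range: Range, pos: usize, c: u8) Range {
    var lo = range.lo;
    var hi = range.hi;
    // Lower bound: first index whose name has a character >= c at `pos`
    // (names too short to have a character at `pos` sort first).
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        if (names[mid].len <= pos or names[mid][pos] < c) lo = mid + 1 else hi = mid;
    }
    const new_lo = lo;
    // Upper bound: first index whose name has a character > c at `pos`.
    hi = range.hi;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        if (names[mid].len <= pos or names[mid][pos] <= c) lo = mid + 1 else hi = mid;
    }
    return .{ .lo = new_lo, .hi = lo };
}

Each step halves the search space instead of scanning it linearly, which is part of why Chrome holds up better than Firefox in the worst-case benchmark below.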
Similar to Firefox, the performance difference in the 'tens of thousands of valid and invalid named character references' benchmark is basically a wash:
Benchmark 1 (88 runs): ./BenchHTMLTokenizer blink
measurement mean ± σ min … max outliers delta
wall_time 115ms ± 943us 113ms … 117ms 0 ( 0%) 0%
peak_rss 83.3MB ± 99.6KB 83.0MB … 83.5MB 2 ( 2%) 0%
cpu_cycles 232M ± 754K 230M … 234M 0 ( 0%) 0%
instructions 461M ± 4.73K 461M … 461M 0 ( 0%) 0%
cache_references 9.95M ± 299K 9.71M … 12.3M 6 ( 7%) 0%
cache_misses 412K ± 5.68K 401K … 425K 0 ( 0%) 0%
branch_misses 747K ± 1.78K 744K … 757K 2 ( 2%) 0%
Benchmark 2 (88 runs): ./BenchHTMLTokenizer dafsa
measurement mean ± σ min … max outliers delta
wall_time 114ms ± 1.26ms 110ms … 117ms 1 ( 1%) - 0.6% ± 0.3%
peak_rss 83.3MB ± 77.5KB 83.1MB … 83.5MB 0 ( 0%) + 0.0% ± 0.0%
cpu_cycles 228M ± 882K 227M … 233M 5 ( 6%) ⚡- 1.4% ± 0.1%
instructions 450M ± 7.33K 450M … 450M 4 ( 5%) ⚡- 2.4% ± 0.0%
cache_references 9.48M ± 402K 9.29M … 13.1M 8 ( 9%) ⚡- 4.7% ± 1.1%
cache_misses 412K ± 7.21K 398K … 432K 0 ( 0%) - 0.0% ± 0.5%
branch_misses 575K ± 6.48K 571K … 633K 6 ( 7%) ⚡- 23.0% ± 0.2%
The DAFSA is faster than Chrome in the all-&supsetneqq; benchmark, but the difference is not as big as it was with Firefox, since Chrome uses binary searches to narrow down the possible matches whereas Firefox uses linear scans:
Benchmark 1 (211 runs): ./BenchHTMLTokenizer blink gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 47.3ms ± 782us 45.0ms … 49.6ms 19 ( 9%) 0%
peak_rss 53.0MB ± 94.4KB 52.7MB … 53.2MB 2 ( 1%) 0%
cpu_cycles 123M ± 807K 122M … 126M 4 ( 2%) 0%
instructions 290M ± 7.46K 290M … 290M 17 ( 8%) 0%
cache_references 3.31M ± 200K 3.19M … 5.79M 7 ( 3%) 0%
cache_misses 358K ± 8.15K 345K … 401K 1 ( 0%) 0%
branch_misses 182K ± 1.82K 178K … 194K 3 ( 1%) 0%
Benchmark 2 (218 runs): ./BenchHTMLTokenizer dafsa gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 45.9ms ± 663us 43.6ms … 47.3ms 11 ( 5%) ⚡- 3.0% ± 0.3%
peak_rss 53.0MB ± 82.6KB 52.7MB … 53.2MB 1 ( 0%) + 0.1% ± 0.0%
cpu_cycles 117M ± 677K 116M … 120M 4 ( 2%) ⚡- 5.0% ± 0.1%
instructions 259M ± 7.78K 259M … 259M 9 ( 4%) ⚡- 10.8% ± 0.0%
cache_references 3.26M ± 99.4K 3.17M … 4.00M 18 ( 8%) - 1.4% ± 0.9%
cache_misses 358K ± 8.14K 342K … 386K 8 ( 4%) - 0.2% ± 0.4%
branch_misses 183K ± 3.07K 179K … 214K 17 ( 8%) + 0.9% ± 0.3%
The DAFSA is once again worse at detecting &cz as invalid, and the results are similar to what they were with Firefox:
Benchmark 1 (160 runs): ./BenchHTMLTokenizer blink ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 62.6ms ± 1.08ms 60.6ms … 64.3ms 0 ( 0%) 0%
peak_rss 65.1MB ± 79.4KB 64.8MB … 65.2MB 1 ( 1%) 0%
cpu_cycles 108M ± 640K 107M … 111M 3 ( 2%) 0%
instructions 203M ± 11.3K 203M … 203M 7 ( 4%) 0%
cache_references 5.92M ± 51.5K 5.82M … 6.30M 10 ( 6%) 0%
cache_misses 384K ± 8.88K 366K … 409K 0 ( 0%) 0%
branch_misses 164K ± 1.44K 160K … 169K 2 ( 1%) 0%
Benchmark 2 (157 runs): ./BenchHTMLTokenizer dafsa ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 63.9ms ± 1.10ms 61.8ms … 65.5ms 0 ( 0%) 💩+ 2.0% ± 0.4%
peak_rss 65.1MB ± 92.8KB 64.8MB … 65.2MB 4 ( 3%) - 0.0% ± 0.0%
cpu_cycles 113M ± 707K 112M … 117M 1 ( 1%) 💩+ 4.4% ± 0.1%
instructions 214M ± 10.9K 214M … 214M 7 ( 4%) 💩+ 5.6% ± 0.0%
cache_references 5.89M ± 63.4K 5.76M … 6.31M 5 ( 3%) - 0.5% ± 0.2%
cache_misses 387K ± 9.20K 367K … 412K 2 ( 1%) + 0.7% ± 0.5%
branch_misses 165K ± 2.77K 162K … 197K 5 ( 3%) + 0.6% ± 0.3%
For the '30,000 valid named character references' benchmark, the DAFSA is slightly faster (similar results to Firefox):
Benchmark 1 (228 runs): ./BenchHTMLTokenizer blink all-valid
measurement mean ± σ min … max outliers delta
wall_time 43.9ms ± 844us 41.7ms … 47.0ms 21 ( 9%) 0%
peak_rss 54.3MB ± 89.2KB 54.0MB … 54.5MB 2 ( 1%) 0%
cpu_cycles 105M ± 959K 104M … 112M 5 ( 2%) 0%
instructions 204M ± 6.18K 204M … 204M 7 ( 3%) 0%
cache_references 3.78M ± 137K 3.67M … 5.37M 13 ( 6%) 0%
cache_misses 359K ± 11.2K 345K … 457K 13 ( 6%) 0%
branch_misses 466K ± 1.43K 463K … 472K 4 ( 2%) 0%
Benchmark 2 (232 runs): ./BenchHTMLTokenizer dafsa all-valid
measurement mean ± σ min … max outliers delta
wall_time 43.1ms ± 592us 41.1ms … 44.3ms 13 ( 6%) ⚡- 1.8% ± 0.3%
peak_rss 54.4MB ± 79.6KB 54.1MB … 54.5MB 0 ( 0%) + 0.0% ± 0.0%
cpu_cycles 102M ± 487K 101M … 104M 2 ( 1%) ⚡- 3.3% ± 0.1%
instructions 191M ± 5.33K 191M … 191M 4 ( 2%) ⚡- 6.0% ± 0.0%
cache_references 3.55M ± 74.9K 3.46M … 4.21M 10 ( 4%) ⚡- 6.3% ± 0.5%
cache_misses 355K ± 5.67K 344K … 373K 2 ( 1%) - 1.0% ± 0.5%
branch_misses 345K ± 2.12K 342K … 374K 2 ( 1%) ⚡- 26.0% ± 0.1%
In the raw matching speed benchmark, the DAFSA comes out ahead:
Benchmark 1 (70 runs): ./BenchMatcherBlink
measurement mean ± σ min … max outliers delta
wall_time 144ms ± 1.33ms 141ms … 148ms 1 ( 1%) 0%
peak_rss 4.55MB ± 65.8KB 4.33MB … 4.72MB 0 ( 0%) 0%
cpu_cycles 590M ± 2.33M 585M … 604M 4 ( 6%) 0%
instructions 1.07G ± 83.4 1.07G … 1.07G 0 ( 0%) 0%
cache_references 12.3M ± 162K 12.1M … 13.0M 6 ( 9%) 0%
cache_misses 27.6K ± 5.79K 21.1K … 67.6K 2 ( 3%) 0%
branch_misses 8.71M ± 21.1K 8.68M … 8.79M 1 ( 1%) 0%
Benchmark 2 (85 runs): ./BenchMatcherDafsa
measurement mean ± σ min … max outliers delta
wall_time 119ms ± 1.47ms 116ms … 122ms 0 ( 0%) ⚡- 17.6% ± 0.3%
peak_rss 4.52MB ± 65.7KB 4.46MB … 4.59MB 0 ( 0%) - 0.8% ± 0.5%
cpu_cycles 484M ± 4.33M 477M … 496M 0 ( 0%) ⚡- 17.8% ± 0.2%
instructions 1.02G ± 80.6 1.02G … 1.02G 0 ( 0%) ⚡- 4.4% ± 0.0%
cache_references 6.10M ± 177K 5.99M … 7.66M 2 ( 2%) ⚡- 50.5% ± 0.4%
cache_misses 25.4K ± 3.42K 20.4K … 38.8K 4 ( 5%) ⚡- 7.9% ± 5.3%
branch_misses 6.02M ± 21.7K 5.98M … 6.06M 0 ( 0%) ⚡- 30.9% ± 0.1%
The Chrome implementation uses four arrays:

- kStaticEntityStringStorage, an array of all the bytes in every named character reference, with some de-duplication techniques (e.g. the sequence 'b', 'n', 'o', 't', ';' in the array is used for ⌐, ¬, and ¬). It uses 14,485 bytes total.
- kStaticEntityTable, an array of 2,231 12-byte wide structs containing information about each named character reference (its location in the kStaticEntityStringStorage array, the length of its name, the code point(s) it should be transformed into). It uses 26,722 bytes.
- kUppercaseOffset and kLowercaseOffset are each arrays of offsets into kStaticEntityTable, and both are used as lookup tables for the first character. Getting kUppercaseOffset[char - 'A'] gives you the initial lower bound's offset and kUppercaseOffset[char - 'A' + 1] gives you the initial upper bound's offset (and the same sort of thing for kLowercaseOffset). Each uses 54 bytes, so that's 108 bytes total.

All together, the Chrome implementation uses 41,167 bytes (40.39 KiB) for its named character reference data, while Ladybird uses 24,412 bytes (23.84 KiB). That's a difference of 16,953 bytes (16.56 KiB), putting the Ladybird data size at 59.0% of the Chrome implementation's.
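To make the first-character range lookup concrete, here's a minimal sketch in Zig (the function name, the Range type, and the parameter names are my own; the two offset parameters stand in for the kUppercaseOffset/kLowercaseOffset arrays described above, not Chrome's actual C++):

const Range = struct { lo: u16, hi: u16 };

// A minimal sketch of the first-character range lookup. `upper` and `lower`
// each hold 27 entries (54 bytes), so `c - 'A' + 1` is always a valid index
// for 'A'...'Z' (and likewise for lowercase).
fn initialRange(upper: *const [27]u16, lower: *const [27]u16, c: u8) ?Range {
    return switch (c) {
        'A'...'Z' => .{ .lo = upper[c - 'A'], .hi = upper[c - 'A' + 1] },
        'a'...'z' => .{ .lo = lower[c - 'a'], .hi = lower[c - 'a' + 1] },
        else => null, // no named character reference starts with any other character
    };
}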
The Safari implementation uses the same four arrays, but has made a few more data size optimizations:

- kStaticEntityStringStorage does not include semicolons, and instead that information was moved to a boolean flag within the elements of the kStaticEntityTable array. This brings down the total bytes used by this array to 11,127 (-3,358 compared to the Chrome version)
- The HTMLEntityTableEntry struct (used in the kStaticEntityTable array) was converted to use a bitfield to reduce the size of the struct from 12 bytes to 8 bytes (57 bits). However, Clang seems to insert padding bits into the struct which brings it back up to 12 bytes anyway (it wants to align the optionalSecondCharacter and nameLengthExcludingSemicolon fields). So, this data size optimization may or may not actually have an effect (I'm not very familiar with the rules around C++ bitfield padding, so I feel like I can't say anything definitive). If the size is reduced to 8 bytes, then kStaticEntityTable uses 8,924 fewer bytes (17,798 instead of 26,722).

So, the Safari implementation uses either 30,040 bytes (29.34 KiB) if HTMLEntityTableEntry uses 12 bytes, or 21,116 bytes (20.62 KiB) if HTMLEntityTableEntry uses 8 bytes. This means that Safari's data size optimizations (or at least their intended effect) make its data size smaller than Ladybird's (even if the Ladybird implementation tightly bitpacked its values array, it'd still use 229 bytes more than the 8-byte-HTMLEntityTableEntry Safari version). This also shows that the larger data size of the Chrome implementation is not inherent to the approach that it uses.
For now I'll just say there's no meaningful difference, but there's a caveat that will be discussed later.
Overall, the Chrome implementation as-it-is-now fares about as well as the Firefox implementation in this comparison, but has some potential strengths/weaknesses of its own. That is, it covers one weakness of the Firefox implementation by using binary searches instead of linear scans, but it always has to narrow down the possibilities from a larger initial range (since it only uses the first character to get the range of possible matches whereas Firefox uses the first two characters).
The Safari version fares much better in terms of data size (potentially beating out my DAFSA version), and its size optimizations could be applied to the Chrome version as well since the core approach is the same between them.
At this point, though, you might be asking yourself, why don't we try...
In theory, the best ideas from the Firefox and Chrome/Safari implementations could be combined into one new implementation:

- Use a lookup table based on the first two characters to get the initial range within the kStaticEntityStringStorage/ALL_NAMES array (like Firefox)
- De-duplicate the bytes of the kStaticEntityStringStorage/ALL_NAMES array when possible (like Chrome/Safari, see the ⌐/¬/¬ example above)
- Exclude the semicolons from the kStaticEntityStringStorage/ALL_NAMES array (like Safari)
- Bit-pack the HTMLEntityTableEntry/nsHtml5CharacterName struct (like Safari intends to do)

I haven't tested this combination to see how exactly it stacks up, but I would assume it'd be quite good overall.
Since I converted the Chrome implementation to use Ladybird's NamedCharacterReferenceMatcher API in an effort to improve the accuracy of my benchmarking, one major aspect of the Chrome implementation was lost in translation: the Chrome implementation doesn't actually use the character-by-character tokenization strategy we've discussed so far.
Instead, it uses the "lookahead (but never beyond an insertion point) until we're certain we have enough characters ..." strategy mentioned back in the Named character reference tokenization overview. The very high-level summary of the approach (as it is actually implemented in Chrome) is very similar to the description of it in that section.
The downside of the Chrome implementation in particular is actually a choice that was made that's not inherent to the overall approach: they don't preserve any matching state between tokenizer iterations and always backtrack to the & before trying to match again. For example, when matching against something like ∉ that's being written one-character-at-a-time (via document.write), the matching algorithm described in the "Comparison with Blink/WebKit (Chrome/Safari)" section will be executed for each of:

- &n, &no, ¬, ¬i, and ¬in, each resulting in the 'not enough characters' flag being set
- ∉, which will be matched fully (the semicolon acts as a definitive delimiter)

In theory, the redundant work that's performed in these sorts of scenarios should have a noticeable effect on performance, but, in practice, I wasn't able to prove that out with benchmarking.
Benchmark results (writing valid named character references one character at a time via document.write):

Benchmark 1 (14 runs): ./headless-browser --dump-text pathological-write-valid.html
measurement mean ± σ min … max outliers delta
wall_time 1.46s ± 8.11ms 1.45s … 1.48s 0 ( 0%) 0%
peak_rss 62.1MB ± 123KB 61.9MB … 62.3MB 0 ( 0%) 0%
cpu_cycles 5.90G ± 33.5M 5.88G … 5.99G 1 ( 7%) 0%
instructions 29.1G ± 562K 29.1G … 29.1G 1 ( 7%) 0%
cache_references 210M ± 988K 208M … 211M 0 ( 0%) 0%
cache_misses 8.53M ± 322K 7.92M … 9.09M 0 ( 0%) 0%
branch_misses 5.03M ± 19.2K 5.01M … 5.07M 0 ( 0%) 0%
Benchmark 2 (14 runs): ./headless-browser --blink-preserve-state --dump-text pathological-write-valid.html
measurement mean ± σ min … max outliers delta
wall_time 1.46s ± 4.52ms 1.45s … 1.47s 0 ( 0%) - 0.3% ± 0.3%
peak_rss 62.0MB ± 164KB 61.7MB … 62.4MB 0 ( 0%) - 0.1% ± 0.2%
cpu_cycles 5.88G ± 18.1M 5.86G … 5.93G 0 ( 0%) - 0.3% ± 0.4%
instructions 29.1G ± 422K 29.1G … 29.1G 0 ( 0%) - 0.1% ± 0.0%
cache_references 209M ± 1.26M 207M … 212M 1 ( 7%) - 0.5% ± 0.4%
cache_misses 8.55M ± 425K 8.19M … 9.78M 1 ( 7%) + 0.3% ± 3.4%
branch_misses 5.00M ± 27.0K 4.97M … 5.07M 0 ( 0%) - 0.6% ± 0.4%
So, despite HTMLEntitySearch::Advance being called 4.5x more in the 'no state preserved' benchmark, no difference shows up in the results. I believe this is because the actual matching is a small part of the work being done in this benchmark, or, in other words, there is a difference but it's being drowned out by all the work being done elsewhere (JavaScript being run, tokenizer input being updated, etc). I have a hunch that Ladybird in particular might be greatly exacerbating this effect and making all this tangential work slower than it theoretically needs to be, especially in the case of updating the tokenizer input. For example, Chrome uses a rope-like SegmentedString to mitigate the cost of inserting into the middle of the input, while Ladybird currently reallocates the entire modified input on each insertion.
To sum up what I'm trying to say:

- The Chrome implementation does redundant matching work in the 'document.write one-character-at-a-time' scenario because it doesn't preserve matching state between retries due to 'not enough characters'
- In practice, though, I wasn't able to measure any impact from that redundant work in my benchmarks

This is important to note because it is the reason I included the following point in my 'list of things you'll need to trust in order for my benchmarking to be accurate' in the intro to the "Comparison to the major browser engines" section:
- The performance characteristics exhibited would hold when going the other direction (putting my implementation into their tokenizer)
That is, I have some reason to believe 'going the other direction' may actually be slightly more favorable to my DAFSA implementation, as all the code outside of the named character reference tokenization state itself would likely do less to muddy the benchmarking results.
An upside of the 'lookahead' strategy is that, when you know there's no active insertion point, you can always match against the full remaining input in one go. This is potentially an improvement over the 'tokenize-one-character-at-a-time' strategy if there is any significant amount of work that's done between tokenizer iterations.

There are two theoretical benefits I can think of with this:

- Any per-iteration tokenizer overhead is paid once per named character reference instead of once per consumed character
- Moving the input cursor ahead by N in one go may be more efficient than moving it ahead by one, N times

Luckily, we don't actually need to change much about our 'character-by-character tokenization' approach to get these benefits, as we only need to make it so we use lookahead whenever there's no active insertion point. In Ladybird, that might look something like this (full implementation):
BEGIN_STATE(NamedCharacterReference)
{
if (stop_at_insertion_point == StopAtInsertionPoint::No) {
// Use the 'lookahead' approach without needing to worry about insertion points
} else {
// Use the character-by-character tokenization approach
}
// ...
}
In practice, this seems to give up to a 1.13x speedup in our Ladybird benchmarks essentially for free:
Benchmark 1 (44 runs): ./BenchHTMLTokenizer dafsa
measurement mean ± σ min … max outliers delta
wall_time 114ms ± 934us 112ms … 115ms 3 ( 7%) 0%
peak_rss 83.5MB ± 74.9KB 83.3MB … 83.6MB 0 ( 0%) 0%
cpu_cycles 226M ± 964K 223M … 229M 2 ( 5%) 0%
instructions 452M ± 6.81K 452M … 452M 0 ( 0%) 0%
cache_references 9.64M ± 86.1K 9.51M … 9.86M 4 ( 9%) 0%
cache_misses 418K ± 10.5K 399K … 448K 3 ( 7%) 0%
branch_misses 573K ± 2.51K 570K … 584K 3 ( 7%) 0%
Benchmark 2 (47 runs): ./BenchHTMLTokenizer dafsa-lookahead
measurement mean ± σ min … max outliers delta
wall_time 108ms ± 1.05ms 106ms … 111ms 0 ( 0%) ⚡- 5.1% ± 0.4%
peak_rss 83.5MB ± 81.5KB 83.3MB … 83.6MB 0 ( 0%) - 0.0% ± 0.0%
cpu_cycles 203M ± 869K 201M … 205M 0 ( 0%) ⚡- 10.5% ± 0.2%
instructions 377M ± 10.8K 377M … 377M 1 ( 2%) ⚡- 16.6% ± 0.0%
cache_references 9.57M ± 71.3K 9.42M … 9.77M 1 ( 2%) - 0.7% ± 0.3%
cache_misses 415K ± 7.59K 401K … 429K 0 ( 0%) - 0.7% ± 0.9%
branch_misses 553K ± 2.04K 547K … 556K 0 ( 0%) ⚡- 3.5% ± 0.2%
Benchmark 1 (109 runs): ./BenchHTMLTokenizer dafsa gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 45.8ms ± 866us 43.7ms … 47.2ms 11 (10%) 0%
peak_rss 53.2MB ± 67.6KB 53.0MB … 53.4MB 29 (27%) 0%
cpu_cycles 117M ± 601K 116M … 118M 3 ( 3%) 0%
instructions 261M ± 6.10K 261M … 261M 2 ( 2%) 0%
cache_references 3.27M ± 62.6K 3.19M … 3.69M 4 ( 4%) 0%
cache_misses 354K ± 4.17K 345K … 368K 2 ( 2%) 0%
branch_misses 182K ± 4.28K 178K … 213K 10 ( 9%) 0%
Benchmark 2 (124 runs): ./BenchHTMLTokenizer dafsa-lookahead gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 40.4ms ± 756us 38.3ms … 42.0ms 13 (10%) ⚡- 11.8% ± 0.5%
peak_rss 53.2MB ± 77.2KB 52.9MB … 53.4MB 40 (32%) + 0.0% ± 0.0%
cpu_cycles 93.4M ± 546K 92.5M … 95.8M 3 ( 2%) ⚡- 19.9% ± 0.1%
instructions 190M ± 5.41K 190M … 190M 3 ( 2%) ⚡- 27.1% ± 0.0%
cache_references 3.31M ± 42.2K 3.23M … 3.48M 3 ( 2%) + 1.4% ± 0.4%
cache_misses 354K ± 5.07K 345K … 372K 5 ( 4%) - 0.1% ± 0.3%
branch_misses 153K ± 10.5K 146K … 215K 17 (14%) ⚡- 16.1% ± 1.2%
Benchmark 1 (79 runs): ./BenchHTMLTokenizer dafsa ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 63.2ms ± 964us 61.6ms … 65.7ms 0 ( 0%) 0%
peak_rss 65.3MB ± 79.1KB 65.0MB … 65.4MB 0 ( 0%) 0%
cpu_cycles 112M ± 606K 111M … 115M 2 ( 3%) 0%
instructions 215M ± 8.83K 215M … 215M 0 ( 0%) 0%
cache_references 5.91M ± 107K 5.78M … 6.50M 1 ( 1%) 0%
cache_misses 372K ± 4.43K 363K … 390K 2 ( 3%) 0%
branch_misses 162K ± 1.17K 160K … 165K 0 ( 0%) 0%
Benchmark 2 (80 runs): ./BenchHTMLTokenizer dafsa-lookahead ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 62.8ms ± 1.08ms 61.3ms … 64.3ms 0 ( 0%) - 0.6% ± 0.5%
peak_rss 65.2MB ± 88.6KB 64.9MB … 65.4MB 1 ( 1%) - 0.0% ± 0.0%
cpu_cycles 111M ± 454K 110M … 112M 0 ( 0%) ⚡- 1.4% ± 0.1%
instructions 209M ± 7.60K 209M … 209M 2 ( 3%) ⚡- 2.7% ± 0.0%
cache_references 5.91M ± 68.1K 5.78M … 6.09M 5 ( 6%) - 0.1% ± 0.5%
cache_misses 374K ± 5.24K 365K … 397K 1 ( 1%) + 0.6% ± 0.4%
branch_misses 164K ± 964 162K … 168K 1 ( 1%) + 1.0% ± 0.2%
Benchmark 1 (115 runs): ./BenchHTMLTokenizer dafsa all-valid
measurement mean ± σ min … max outliers delta
wall_time 43.3ms ± 843us 40.9ms … 45.6ms 6 ( 5%) 0%
peak_rss 54.5MB ± 83.6KB 54.3MB … 54.6MB 1 ( 1%) 0%
cpu_cycles 100M ± 744K 98.4M … 103M 2 ( 2%) 0%
instructions 193M ± 11.0K 193M … 193M 12 (10%) 0%
cache_references 3.58M ± 40.3K 3.47M … 3.70M 3 ( 3%) 0%
cache_misses 363K ± 11.3K 344K … 398K 1 ( 1%) 0%
branch_misses 344K ± 1.99K 341K … 349K 0 ( 0%) 0%
Benchmark 2 (127 runs): ./BenchHTMLTokenizer dafsa-lookahead all-valid
measurement mean ± σ min … max outliers delta
wall_time 39.4ms ± 849us 37.2ms … 40.8ms 14 (11%) ⚡- 9.0% ± 0.5%
peak_rss 54.5MB ± 84.3KB 54.3MB … 54.6MB 1 ( 1%) + 0.0% ± 0.0%
cpu_cycles 84.8M ± 521K 84.0M … 87.4M 5 ( 4%) ⚡- 15.5% ± 0.2%
instructions 148M ± 6.20K 148M … 148M 7 ( 6%) ⚡- 23.2% ± 0.0%
cache_references 3.56M ± 51.3K 3.47M … 3.82M 4 ( 3%) - 0.4% ± 0.3%
cache_misses 356K ± 7.44K 342K … 384K 7 ( 6%) ⚡- 2.0% ± 0.7%
branch_misses 316K ± 3.08K 313K … 347K 3 ( 2%) ⚡- 8.1% ± 0.2%
This also explains a few things that I glossed over or deferred until later throughout the article:

- Why I ported the Firefox and Chrome implementations to Ladybird's NamedCharacterReferenceMatcher API and thereby converted them all to use the 'character-by-character tokenization' strategy (to rule out stuff like this from affecting the results without me realizing it)
- Why the raw matching speed differences (from benchmarking the NamedCharacterReferenceMatcher implementations directly without involving the tokenizer) don't fully translate to equivalent differences in the tokenizer benchmarks

Something else that I've failed to mention until now regards exactly how backtracking is performed.
In Ladybird, backtracking is very straightforward: modify the input cursor's position by subtracting N from the current position (where N is the number of overconsumed code points). This is safe to do from a code point boundary perspective because the input is always UTF-8 encoded, and named character reference matching will only ever consume ASCII characters, so going back N bytes is guaranteed to be equivalent to going back N code points.
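As a sketch (with a hypothetical cursor type; not Ladybird's actual code), that backtracking amounts to nothing more than:

// A minimal sketch of Ladybird-style backtracking. Since named character
// reference matching only ever consumes ASCII and the input is UTF-8,
// rewinding N bytes rewinds exactly N code points.
const InputCursor = struct {
    input: []const u8,
    pos: usize,

    fn backtrack(self: *InputCursor, overconsumed_code_points: usize) void {
        self.pos -= overconsumed_code_points;
    }
};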
In all the other major browsers, though, backtracking is done by re-inserting the overconsumed code points back into the input stream, at the position just after the current code point. That is, it modifies the input stream such that the next code points to-be-tokenized will be those that were just re-inserted.
As of now, I'm unsure if the way that Firefox/Chrome/Safari do it is necessary (and therefore whether or not Ladybird will need to adopt the same strategy to be fully compliant with the spec). If it is necessary, I'm either unaware of the relevant web platform test, or there is a missing web platform test that checks whatever it's necessary for. If it's not necessary, then there may be an opportunity in the other browser engines to simplify backtracking when dealing with named character references.
As of the writing of this article, the DAFSA implementation that's been described so far is exactly what's in Ladybird. During the process of writing this, though, I came up with some ideas (largely inspired by the Firefox/Chrome/Safari implementations) to improve my implementation by utilizing every last possible bit I could get my hands on. This came in the form of two independent optimizations, with one of them just so happening to barely make the other possible.
A property of named character references that I missed, but that the major browser engines all take advantage of, is that the first character of a named character reference is always within the range of a-z or A-Z (inclusive). In terms of the DAFSA, we can take advantage of this property to accelerate the search for the first character: instead of linearly scanning across the child nodes, we can:

- pre-compute the unique index that would result from successfully matching each possible first character, and
- store those pre-computed values in a lookup table indexed by the first character

This would turn the O(n) search within the 'first layer' of the DAFSA into an O(1) lookup. As for what needs to be stored in the lookup table, remember that we build up a 'unique index' when traversing a list of children, with any child that we iterate over adding its number field to the total:

(diagram: the number field of the a and b nodes contribute to the unique index when searching for c)

So, we can pre-compute the accumulated unique index that would result when matching a character in the 'first layer,' and then store that in the lookup table. For example, the relevant data for the first four nodes in the first layer of our DAFSA looks like this:
.{ .char = 'A', .number = 27 },
.{ .char = 'B', .number = 12 },
.{ .char = 'C', .number = 36 },
.{ .char = 'D', .number = 54 },
// ...
so the corresponding lookup table entries could look like this:
0,
27,
39, // 27 + 12
75, // 27 + 12 + 36
// ...
We can then use char - 'A' (for uppercase characters) to index into the lookup table and instantly get the final unique index that would have resulted from successfully searching for that character normally (for lowercase characters you'd get the index using char - 'a' + 26 to allow for using a lookup array with exactly 52 values).
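In code, that lookup might look like the following sketch (the function name and table parameter are my own illustration, assuming the 52-entry table has been generated ahead of time):

// A minimal sketch of the O(1) first-character lookup. `table` holds the 52
// pre-computed cumulative unique indexes ('A'-'Z' followed by 'a'-'z').
fn firstCharUniqueIndex(table: *const [52]u16, c: u8) ?u16 {
    return switch (c) {
        'A'...'Z' => table[c - 'A'],
        'a'...'z' => table[c - 'a' + 26],
        else => null, // never the first character of a named character reference
    };
}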
This change alone provides quite a big improvement to raw matching speed, since the 'first layer' represents the largest list of children in the DAFSA (by a significant margin):
Benchmark 1 (44 runs): ./bench-master
measurement mean ± σ min … max outliers delta
wall_time 114ms ± 1.28ms 112ms … 116ms 0 ( 0%) 0%
cpu_cycles 469M ± 2.40M 466M … 482M 5 (11%) 0%
instructions 740M ± 1.25 740M … 740M 0 ( 0%) 0%
cache_references 6.27M ± 60.6K 6.14M … 6.48M 1 ( 2%) 0%
cache_misses 2.05K ± 5.11K 979 … 34.8K 4 ( 9%) 0%
branch_misses 5.69M ± 13.0K 5.67M … 5.75M 8 (18%) 0%
Benchmark 2 (74 runs): ./bench-first-layer-accel
measurement mean ± σ min … max outliers delta
wall_time 68.1ms ± 1.26ms 66.6ms … 70.3ms 0 ( 0%) ⚡- 40.2% ± 0.4%
cpu_cycles 278M ± 416K 277M … 279M 3 ( 4%) ⚡- 40.7% ± 0.1%
instructions 385M ± 1.42 385M … 385M 0 ( 0%) ⚡- 47.9% ± 0.0%
cache_references 6.24M ± 51.2K 6.14M … 6.38M 1 ( 1%) - 0.6% ± 0.3%
cache_misses 1.38K ± 346 1.02K … 2.40K 0 ( 0%) - 32.8% ± 57.4%
branch_misses 4.79M ± 4.33K 4.78M … 4.80M 4 ( 5%) ⚡- 15.9% ± 0.1%
The second improvement is a theoretical one for now: what if we could take the time complexity of searching a list of children from O(n) to O(log n) by using a binary search? Let's think about what changes we'd need to make for that to be possible:

- Each list of children would need to be sorted
- We'd need to know the length of each list of children up front (currently we rely on the node with the last_sibling flag set to know where a list of children ends)
- We'd need to know each node's unique index contribution without iterating over all of its previous siblings

For the first point, this is actually already the case: all lists of children are sorted (note: I'm unsure if this is a coincidental artifact of how I'm constructing my DAFSA, or if it's a common property of a DAFSA generally). For the second point, we could remove the last_sibling flag and replace it with a children_len field. For the third, we already have a model available for this with the lookup table approach we used for the first layer: we could just store an accumulated number in each Node struct instead of a per-node number.
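As a sketch of where this is heading, assuming all three requirements above are met (the Child struct here is a simplified stand-in for the real Node, which we haven't finalized yet), the search over a list of children becomes a standard binary search:

const Child = struct {
    char: u8,
    number: u16, // cumulative: already includes all previous siblings' numbers
};

// A minimal sketch of the O(log n) child search, assuming the list is sorted
// by `char` and its length is known up front.
fn findChild(children: []const Child, c: u8) ?Child {
    var lo: usize = 0;
    var hi: usize = children.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        if (children[mid].char == c) return children[mid];
        if (children[mid].char < c) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return null;
}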
The problem with all this is that there just aren't enough bits available to store all the information required. For context, here's the DAFSA Node representation being used currently:
const Node = packed struct(u32) {
char: u8,
number: u8,
end_of_word: bool,
last_sibling: bool,
_extra: u2 = 0,
first_child_index: u12,
};
Things to note:

- There are currently 2 spare bits (_extra)
- number and first_child_index are currently using the minimum bits required to encode their largest value

And we want to:

- Remove the last_sibling flag and replace it with a children_len field (which needs 5 bits)
- Widen the number field to be 12 bits, since the largest cumulative number is 2218

Overall, we need to store 8 more bits of information but only have 2 bits to spare, so somehow we'll need to eke out 6 extra bits.
Let's see what we can do...
As mentioned earlier, named character references only contain characters within the ASCII range. This means that we can use a u7 instead of a u8 for the char field. This gives us an extra spare bit essentially for free (well, it will make the CPU do some extra work to retrieve the value of the field, but we'll ignore that for now).
5 bits to go.
Since we're using a lookup table for the cumulative numbers of the first layer of nodes, there's no reason to store real values in the number fields of those nodes in the DAFSA. Removing those values is also extremely helpful because that's where the largest values appear; the largest cumulative number outside of the first layer is 163, which can be stored in 8 bits (4 bits fewer than we thought we needed for this field).
1 bit to go.
The last bit is the trickiest, as they say. Instead of jumping straight to an approach that works well, though, let's try out some interesting approaches that also work but are sub-optimal.
A 6-bit char 🔗

From the expandable "Nitty-gritty DAFSA node size details" section earlier:

[The char field] can technically be represented in 6 bits, since the actual alphabet of characters used in the list of named character references only includes 61 unique characters ('1'...'8', ';', 'a'...'z', 'A'...'Z'). However, to do so you'd need to convert between the 6 bit representation and the actual ASCII value of each character to do comparisons.

Having to do this extra work when performing any comparison against the char field is not ideal, to the point that it makes the binary search version perform quite a bit worse than the linear scan version (at least with the conversion function I came up with).
One quite-cool-but-also-not-worth-it approach is to avoid storing full cumulative number fields, and instead store incremental numbers that will result in the correct cumulative number if they are combined together in a specific way while performing a binary search. In practice, what that means is:

- Perform the binary search as normal, keeping a running total along the way
- Whenever the target character is greater than the middle node's character (i.e. whenever the search moves into the upper half of the range), add the middle node's number value to the total
- When the target character is found, add that node's number value to the total as well; the running total is then that node's cumulative unique index

Here's a visualization showing how/why this can work, using a particularly relevant list of children (to be discussed afterwards):
char | number | expected total
-----|--------|---------------
'a'  |   0    |   0
'b'  |   1    |   1
'c'  |   2    |   2
'd'  |  14    |  16
'e'  |  19    |  19
'f'  |  11    |  30
'h'  |  13    |  32
'i'  |  20    |  39
'l'  |  14    |  53
'm'  |  54    |  54
'o'  |   8    |  62
'p'  |  13    |  67
'q'  |  16    |  70
'r'  |  16    |  86
's'  |  33    |  87
't'  |   4    |  91
'u'  |   9    |  96
'w'  |  64    | 151
'z'  |   5    | 156

Here, the number column holds the transformed (incremental) values, and the expected total column is the cumulative unique index that the binary search must arrive at. For example, searching for 'f' visits 'm', 'e', 'i', 'h', and then 'f': the search moves into the upper half at 'e' (adding 19 to the tally) and then lands on 'f' (adding 11), arriving at the expected total of 30.
This specific list of children was chosen because 64 (the number value of the node with the character 'w' here) is actually the largest number value in the entire DAFSA after transforming the number fields in this way. This means that instead of 8 bits to store the number field, we only need 7, thus saving the 1 bit that we're looking for.
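Here's a sketch of that modified binary search (my own illustration of the rules above, reusing the simplified Child struct from the earlier sketch):

// A sketch of the incremental-number binary search: every move into the upper
// half of the range adds the middle node's `number` to the running total, and
// the final match adds its own `number`, yielding the cumulative unique index.
fn incrementalUniqueIndex(children: []const Child, c: u8) ?u16 {
    var total: u16 = 0;
    var lo: usize = 0;
    var hi: usize = children.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        if (children[mid].char == c) return total + children[mid].number;
        if (children[mid].char < c) {
            total += children[mid].number;
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return null;
}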
Yet again, though, this extra work being done while performing the binary search cancels out the benefit of the binary search, and it ends up being marginally slower than the linear scan version.
The approach I found that works the best is actually quite straightforward to summarize: entirely remove the first layer of nodes from the DAFSA. We can get away with this because we already have a lookup table available for the first layer, so we could theoretically just stuff all the necessary information in there instead of keeping it in the DAFSA.
In practice, it's not quite as simple as that, though. If you're interested, the details are below:
After adding the lookup table for the first layer, NamedCharacterReferenceMatcher uses this approach:

- Start with node_index set to 0
- If node_index is currently 0, verify that the current character is within a-z or A-Z, and, if so, look up the cumulative unique index in the lookup table. Set the node_index to the <lookup table index> + 1 (since the first layer of nodes in the DAFSA always follow the root node)
- If node_index is not 0, do DAFSA traversal as normal (search the node's children starting at nodes[node_index].first_child_index)

This means that we're currently relying on the first layer of nodes being in the DAFSA in order to allow for node_index to be used as a universal 'cursor' that tracks where we are in the DAFSA. If we simply remove the first layer of nodes, then we'd need two separate cursors: 1 that we would use for the lookup table, and 1 that we would use for the DAFSA nodes array. That might not be a bad approach, but what I went for instead is this:
- Instead of storing a node_index in NamedCharacterReferenceMatcher, store an optional slice of children to check (?[]const Node) that starts as null
- If children_to_check is null, use the first layer lookup table to get three pieces of information: (1) the cumulative unique index number, (2) the index of the node's first child in the DAFSA array, and (3) the length of the node's list of children. Set children_to_check to a slice of the DAFSA array with the retrieved length, starting at the retrieved first child index
- If children_to_check is not null, search within the children_to_check slice as normal
field. Somewhat miraculously, the new largest value is 13 (down from 24), so instead of needing 5 bits, we now only need 4 bits for that field.
After all that, we end up with two struct representations for the data we're looking to store:
One for the first layer lookup table:
const FirstLayerNode = packed struct {
number: u12,
child_index: u10,
children_len: u5,
};
and one for the rest of the DAFSA nodes:
const Node = packed struct(u32) {
char: u7,
number: u8,
end_of_word: bool,
children_len: u4,
child_index: u12,
};
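Here's a sketch of how the two representations work together (the function and parameter names are my own illustration): the first character resolves to a slice of the DAFSA array via the lookup table, and every character after that searches a slice of children directly.

// A minimal sketch: resolve the first character to its list of children in
// the `dafsa` array using the first layer lookup table. The cumulative unique
// index would similarly be taken from `entry.number`.
fn childrenForFirstChar(
    first_layer: *const [52]FirstLayerNode,
    dafsa: []const Node,
    c: u8,
) ?[]const Node {
    const entry = switch (c) {
        'A'...'Z' => first_layer[c - 'A'],
        'a'...'z' => first_layer[c - 'a' + 26],
        else => return null,
    };
    return dafsa[entry.child_index..][0..entry.children_len];
}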
With this, we can accelerate the search time for the first character drastically, and make it possible to use a binary search for all the rest of the searches we perform. In terms of raw matching speed, this provides another decent improvement over the first layer lookup table alone:
Benchmark 1 (44 runs): ./bench-master
measurement mean ± σ min … max outliers delta
wall_time 115ms ± 1.73ms 112ms … 118ms 0 ( 0%) 0%
peak_rss 754KB ± 617 750KB … 754KB 1 ( 2%) 0%
cpu_cycles 471M ± 2.35M 467M … 478M 0 ( 0%) 0%
instructions 740M ± 1.65 740M … 740M 0 ( 0%) 0%
cache_references 6.30M ± 53.8K 6.14M … 6.38M 1 ( 2%) 0%
cache_misses 6.81K ± 5.18K 1.19K … 19.4K 6 (14%) 0%
branch_misses 5.70M ± 14.3K 5.68M … 5.74M 1 ( 2%) 0%
Benchmark 2 (73 runs): ./bench-first-layer-accel
measurement mean ± σ min … max outliers delta
wall_time 69.0ms ± 1.70ms 66.7ms … 71.3ms 0 ( 0%) ⚡- 40.0% ± 0.6%
cpu_cycles 279M ± 1.18M 277M … 282M 0 ( 0%) ⚡- 40.7% ± 0.1%
instructions 385M ± 2.77 385M … 385M 1 ( 1%) ⚡- 47.9% ± 0.0%
cache_references 6.29M ± 73.4K 6.14M … 6.51M 1 ( 1%) - 0.1% ± 0.4%
cache_misses 8.70K ± 6.54K 1.13K … 21.8K 0 ( 0%) + 27.8% ± 33.7%
branch_misses 4.79M ± 9.12K 4.78M … 4.82M 4 ( 5%) ⚡- 15.9% ± 0.1%
Benchmark 3 (83 runs): ./bench-first-layer-accel-binary-search-u7-char-u8-number
measurement mean ± σ min … max outliers delta
wall_time 60.6ms ± 3.40ms 58.3ms … 81.0ms 2 ( 2%) ⚡- 47.3% ± 0.9%
cpu_cycles 244M ± 1.35M 242M … 251M 1 ( 1%) ⚡- 48.1% ± 0.1%
instructions 431M ± 3.45 431M … 431M 2 ( 2%) ⚡- 41.8% ± 0.0%
cache_references 6.30M ± 87.0K 6.12M … 6.47M 0 ( 0%) + 0.1% ± 0.4%
cache_misses 6.66K ± 3.61K 1.19K … 16.4K 1 ( 1%) - 2.2% ± 22.6%
branch_misses 4.92M ± 18.7K 4.89M … 5.00M 15 (18%) ⚡- 13.6% ± 0.1%
Taken together (the Benchmark 3 results), these two optimizations make raw matching speed about 1.9x faster than it was originally.
As mentioned, though, these raw matching speed improvements don't translate to nearly the same improvement when benchmarking this new implementation within the Ladybird tokenizer (note also that these benchmarks don't use the 'lookahead when outside of an insertion point' optimization).
Benchmark 1 (89 runs): ./BenchHTMLTokenizer dafsa
measurement mean ± σ min … max outliers delta
wall_time 113ms ± 1.10ms 112ms … 115ms 0 ( 0%) 0%
peak_rss 83.3MB ± 78.1KB 83.1MB … 83.5MB 30 (34%) 0%
cpu_cycles 228M ± 827K 227M … 231M 3 ( 3%) 0%
instructions 450M ± 4.99K 450M … 450M 0 ( 0%) 0%
cache_references 9.43M ± 266K 9.28M … 11.9M 1 ( 1%) 0%
cache_misses 404K ± 4.56K 393K … 417K 0 ( 0%) 0%
branch_misses 574K ± 7.43K 570K … 641K 2 ( 2%) 0%
Benchmark 2 (92 runs): ./BenchHTMLTokenizer dafsa-binary-search
measurement mean ± σ min … max outliers delta
wall_time 109ms ± 1.10ms 107ms … 111ms 0 ( 0%) ⚡- 3.8% ± 0.3%
peak_rss 83.3MB ± 74.0KB 83.1MB … 83.5MB 31 (34%) + 0.0% ± 0.0%
cpu_cycles 210M ± 778K 209M … 212M 0 ( 0%) ⚡- 7.9% ± 0.1%
instructions 417M ± 4.75K 417M … 417M 0 ( 0%) ⚡- 7.3% ± 0.0%
cache_references 9.39M ± 124K 9.21M … 10.4M 4 ( 4%) - 0.4% ± 0.6%
cache_misses 407K ± 5.47K 395K … 423K 1 ( 1%) + 0.8% ± 0.4%
branch_misses 533K ± 1.50K 530K … 542K 7 ( 8%) ⚡- 7.0% ± 0.3%
Benchmark 1 (218 runs): ./BenchHTMLTokenizer dafsa gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 45.8ms ± 658us 43.6ms … 47.0ms 12 ( 6%) 0%
peak_rss 53.0MB ± 84.2KB 52.7MB … 53.2MB 2 ( 1%) 0%
cpu_cycles 117M ± 581K 116M … 119M 7 ( 3%) 0%
instructions 259M ± 5.32K 259M … 259M 3 ( 1%) 0%
cache_references 3.25M ± 82.1K 3.16M … 4.05M 16 ( 7%) 0%
cache_misses 355K ± 5.76K 345K … 376K 1 ( 0%) 0%
branch_misses 182K ± 2.45K 178K … 200K 16 ( 7%) 0%
Benchmark 2 (236 runs): ./BenchHTMLTokenizer dafsa-binary-search gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 42.4ms ± 921us 40.2ms … 43.8ms 7 ( 3%) ⚡- 7.5% ± 0.3%
peak_rss 53.0MB ± 77.8KB 52.7MB … 53.2MB 1 ( 0%) + 0.0% ± 0.0%
cpu_cycles 103M ± 771K 101M … 106M 6 ( 3%) ⚡- 12.3% ± 0.1%
instructions 220M ± 8.52K 220M … 220M 13 ( 6%) ⚡- 15.1% ± 0.0%
cache_references 3.27M ± 109K 3.18M … 4.16M 16 ( 7%) + 0.4% ± 0.5%
cache_misses 360K ± 8.94K 345K … 394K 1 ( 0%) + 1.3% ± 0.4%
branch_misses 189K ± 7.40K 179K … 215K 2 ( 1%) 💩+ 3.8% ± 0.6%
Benchmark 1 (158 runs): ./BenchHTMLTokenizer dafsa ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 63.2ms ± 859us 61.3ms … 64.6ms 1 ( 1%) 0%
peak_rss 65.1MB ± 85.1KB 64.8MB … 65.2MB 2 ( 1%) 0%
cpu_cycles 112M ± 550K 111M … 115M 1 ( 1%) 0%
instructions 214M ± 7.62K 214M … 214M 5 ( 3%) 0%
cache_references 5.87M ± 82.4K 5.77M … 6.52M 5 ( 3%) 0%
cache_misses 374K ± 4.78K 365K … 405K 4 ( 3%) 0%
branch_misses 164K ± 4.62K 160K … 219K 1 ( 1%) 0%
Benchmark 2 (163 runs): ./BenchHTMLTokenizer dafsa-binary-search ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 61.4ms ± 833us 59.3ms … 62.8ms 0 ( 0%) ⚡- 2.9% ± 0.3%
peak_rss 65.1MB ± 76.6KB 64.8MB … 65.2MB 1 ( 1%) + 0.0% ± 0.0%
cpu_cycles 105M ± 573K 104M … 107M 1 ( 1%) ⚡- 6.6% ± 0.1%
instructions 195M ± 7.30K 195M … 195M 1 ( 1%) ⚡- 8.8% ± 0.0%
cache_references 5.87M ± 94.8K 5.79M … 6.87M 6 ( 4%) + 0.1% ± 0.3%
cache_misses 375K ± 4.47K 365K … 388K 2 ( 1%) + 0.1% ± 0.3%
branch_misses 164K ± 3.11K 162K … 202K 3 ( 2%) + 0.3% ± 0.5%
Benchmark 1 (230 runs): ./BenchHTMLTokenizer dafsa all-valid
measurement mean ± σ min … max outliers delta
wall_time 43.3ms ± 635us 41.1ms … 44.9ms 13 ( 6%) 0%
peak_rss 54.4MB ± 90.7KB 54.0MB … 54.5MB 1 ( 0%) 0%
cpu_cycles 102M ± 651K 101M … 106M 3 ( 1%) 0%
instructions 191M ± 8.94K 191M … 191M 14 ( 6%) 0%
cache_references 3.55M ± 48.8K 3.48M … 3.88M 10 ( 4%) 0%
cache_misses 358K ± 8.22K 342K … 387K 9 ( 4%) 0%
branch_misses 344K ± 4.30K 340K … 405K 2 ( 1%) 0%
Benchmark 2 (247 runs): ./BenchHTMLTokenizer dafsa-binary-search all-valid
measurement mean ± σ min … max outliers delta
wall_time 40.4ms ± 754us 38.1ms … 42.3ms 13 ( 5%) ⚡- 6.7% ± 0.3%
peak_rss 54.4MB ± 73.5KB 54.1MB … 54.5MB 76 (31%) + 0.0% ± 0.0%
cpu_cycles 90.1M ± 522K 88.8M … 92.3M 2 ( 1%) ⚡- 11.8% ± 0.1%
instructions 170M ± 7.36K 170M … 170M 13 ( 5%) ⚡- 11.0% ± 0.0%
cache_references 3.54M ± 32.2K 3.46M … 3.68M 7 ( 3%) - 0.3% ± 0.2%
cache_misses 356K ± 6.32K 344K … 379K 1 ( 0%) - 0.6% ± 0.4%
branch_misses 314K ± 1.08K 311K … 319K 8 ( 3%) ⚡- 8.8% ± 0.2%
However, it's enough to put this new DAFSA implementation ahead of the Firefox and Chrome/Safari implementations in all of the benchmarks I'm using.
Benchmark 1 (91 runs): ./BenchHTMLTokenizer dafsa-binary-search
measurement mean ± σ min … max outliers delta
wall_time 110ms ± 956us 108ms … 111ms 0 ( 0%) 0%
peak_rss 83.5MB ± 79.4KB 83.2MB … 83.6MB 1 ( 1%) 0%
cpu_cycles 212M ± 1.31M 209M … 217M 3 ( 3%) 0%
instructions 420M ± 9.03K 420M … 420M 5 ( 5%) 0%
cache_references 9.57M ± 186K 9.36M … 10.9M 2 ( 2%) 0%
cache_misses 405K ± 5.61K 394K … 421K 0 ( 0%) 0%
branch_misses 535K ± 1.70K 532K … 540K 0 ( 0%) 0%
Benchmark 2 (89 runs): ./BenchHTMLTokenizer gecko
measurement mean ± σ min … max outliers delta
wall_time 113ms ± 1.02ms 111ms … 115ms 0 ( 0%) 💩+ 2.6% ± 0.3%
peak_rss 83.6MB ± 70.9KB 83.3MB … 83.6MB 0 ( 0%) + 0.1% ± 0.0%
cpu_cycles 225M ± 1.28M 223M … 234M 2 ( 2%) 💩+ 5.8% ± 0.2%
instructions 441M ± 7.17K 441M … 441M 7 ( 8%) 💩+ 5.0% ± 0.0%
cache_references 9.85M ± 227K 9.64M … 11.3M 4 ( 4%) 💩+ 2.9% ± 0.6%
cache_misses 411K ± 5.54K 402K … 431K 2 ( 2%) 💩+ 1.5% ± 0.4%
branch_misses 581K ± 19.0K 575K … 758K 7 ( 8%) 💩+ 8.5% ± 0.7%
Benchmark 3 (88 runs): ./BenchHTMLTokenizer blink
measurement mean ± σ min … max outliers delta
wall_time 115ms ± 856us 113ms … 117ms 0 ( 0%) 💩+ 4.5% ± 0.2%
peak_rss 83.5MB ± 72.4KB 83.2MB … 83.6MB 26 (30%) - 0.0% ± 0.0%
cpu_cycles 232M ± 940K 230M … 235M 0 ( 0%) 💩+ 9.4% ± 0.2%
instructions 463M ± 8.80K 463M … 463M 6 ( 7%) 💩+ 10.4% ± 0.0%
cache_references 10.2M ± 141K 9.94M … 10.9M 2 ( 2%) 💩+ 6.1% ± 0.5%
cache_misses 410K ± 5.40K 398K … 424K 0 ( 0%) + 1.1% ± 0.4%
branch_misses 751K ± 1.90K 747K … 755K 0 ( 0%) 💩+ 40.3% ± 0.1%
Benchmark 1 (234 runs): ./BenchHTMLTokenizer dafsa-binary-search gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 42.6ms ± 717us 40.3ms … 44.2ms 17 ( 7%) 0%
peak_rss 53.2MB ± 90.0KB 52.8MB … 53.4MB 88 (38%) 0%
cpu_cycles 103M ± 604K 101M … 105M 2 ( 1%) 0%
instructions 222M ± 6.94K 222M … 222M 9 ( 4%) 0%
cache_references 3.27M ± 96.0K 3.19M … 3.98M 15 ( 6%) 0%
cache_misses 356K ± 5.77K 344K … 377K 1 ( 0%) 0%
branch_misses 182K ± 3.18K 178K … 215K 11 ( 5%) 0%
Benchmark 2 (198 runs): ./BenchHTMLTokenizer gecko gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 50.5ms ± 831us 48.5ms … 52.4ms 20 (10%) 💩+ 18.7% ± 0.3%
peak_rss 53.2MB ± 84.3KB 53.0MB … 53.4MB 1 ( 1%) + 0.1% ± 0.0%
cpu_cycles 138M ± 610K 136M … 139M 3 ( 2%) 💩+ 33.8% ± 0.1%
instructions 280M ± 7.27K 280M … 280M 6 ( 3%) 💩+ 26.3% ± 0.0%
cache_references 3.34M ± 118K 3.22M … 4.58M 9 ( 5%) 💩+ 2.1% ± 0.6%
cache_misses 356K ± 5.10K 344K … 372K 3 ( 2%) - 0.1% ± 0.3%
branch_misses 315K ± 5.57K 305K … 346K 4 ( 2%) 💩+ 72.8% ± 0.5%
Benchmark 3 (213 runs): ./BenchHTMLTokenizer blink gecko-worst-case
measurement mean ± σ min … max outliers delta
wall_time 47.1ms ± 620us 44.9ms … 48.3ms 18 ( 8%) 💩+ 10.6% ± 0.3%
peak_rss 53.2MB ± 84.0KB 52.8MB … 53.4MB 1 ( 0%) - 0.0% ± 0.0%
cpu_cycles 122M ± 745K 121M … 127M 2 ( 1%) 💩+ 18.5% ± 0.1%
instructions 292M ± 5.64K 292M … 292M 4 ( 2%) 💩+ 31.7% ± 0.0%
cache_references 3.30M ± 139K 3.19M … 4.78M 24 (11%) + 0.9% ± 0.7%
cache_misses 355K ± 5.58K 343K … 371K 7 ( 3%) - 0.4% ± 0.3%
branch_misses 183K ± 710 180K … 185K 3 ( 1%) + 0.2% ± 0.2%
Benchmark 1 (160 runs): ./BenchHTMLTokenizer dafsa-binary-search ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 62.4ms ± 1.16ms 60.4ms … 64.8ms 0 ( 0%) 0%
peak_rss 65.3MB ± 81.8KB 64.9MB … 65.4MB 45 (28%) 0%
cpu_cycles 107M ± 850K 105M … 114M 3 ( 2%) 0%
instructions 196M ± 12.0K 196M … 196M 14 ( 9%) 0%
cache_references 5.92M ± 73.9K 5.81M … 6.21M 3 ( 2%) 0%
cache_misses 386K ± 7.90K 369K … 409K 1 ( 1%) 0%
branch_misses 164K ± 1.74K 161K … 179K 9 ( 6%) 0%
Benchmark 2 (161 runs): ./BenchHTMLTokenizer gecko ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 62.3ms ± 988us 59.7ms … 64.1ms 0 ( 0%) - 0.1% ± 0.4%
peak_rss 65.3MB ± 79.2KB 65.0MB … 65.4MB 2 ( 1%) + 0.1% ± 0.0%
cpu_cycles 106M ± 618K 104M … 108M 3 ( 2%) - 0.8% ± 0.2%
instructions 195M ± 12.6K 195M … 195M 13 ( 8%) - 0.8% ± 0.0%
cache_references 6.03M ± 169K 5.84M … 6.99M 2 ( 1%) 💩+ 1.9% ± 0.5%
cache_misses 386K ± 7.61K 370K … 413K 2 ( 1%) - 0.1% ± 0.4%
branch_misses 165K ± 1.02K 163K … 169K 2 ( 1%) + 0.4% ± 0.2%
Benchmark 3 (158 runs): ./BenchHTMLTokenizer blink ladybird-worst-case
measurement mean ± σ min … max outliers delta
wall_time 63.4ms ± 1.22ms 61.0ms … 65.3ms 0 ( 0%) 💩+ 1.7% ± 0.4%
peak_rss 65.2MB ± 78.2KB 65.0MB … 65.4MB 1 ( 1%) - 0.0% ± 0.0%
cpu_cycles 109M ± 789K 107M … 115M 2 ( 1%) 💩+ 2.0% ± 0.2%
instructions 204M ± 11.7K 203M … 204M 4 ( 3%) 💩+ 3.7% ± 0.0%
cache_references 6.00M ± 90.0K 5.85M … 6.25M 0 ( 0%) 💩+ 1.4% ± 0.3%
cache_misses 388K ± 8.11K 371K … 409K 0 ( 0%) + 0.6% ± 0.5%
branch_misses 165K ± 1.24K 162K … 173K 1 ( 1%) + 0.4% ± 0.2%
Benchmark 1 (244 runs): ./BenchHTMLTokenizer dafsa-binary-search all-valid
measurement mean ± σ min … max outliers delta
wall_time 41.0ms ± 843us 38.3ms … 42.2ms 21 ( 9%) 0%
peak_rss 54.5MB ± 87.9KB 54.3MB … 54.6MB 4 ( 2%) 0%
cpu_cycles 90.6M ± 591K 89.6M … 92.6M 2 ( 1%) 0%
instructions 172M ± 9.67K 172M … 172M 19 ( 8%) 0%
cache_references 3.56M ± 117K 3.47M … 5.05M 14 ( 6%) 0%
cache_misses 359K ± 9.17K 343K … 399K 1 ( 0%) 0%
branch_misses 316K ± 4.49K 312K … 382K 1 ( 0%) 0%
Benchmark 2 (229 runs): ./BenchHTMLTokenizer gecko all-valid
measurement mean ± σ min … max outliers delta
wall_time 43.7ms ± 735us 41.4ms … 44.9ms 18 ( 8%) 💩+ 6.6% ± 0.3%
peak_rss 54.6MB ± 85.1KB 54.3MB … 54.8MB 2 ( 1%) + 0.1% ± 0.0%
cpu_cycles 103M ± 460K 102M … 105M 5 ( 2%) 💩+ 13.6% ± 0.1%
instructions 189M ± 4.69K 189M … 189M 4 ( 2%) 💩+ 10.1% ± 0.0%
cache_references 3.65M ± 93.4K 3.54M … 4.77M 14 ( 6%) 💩+ 2.5% ± 0.5%
cache_misses 356K ± 5.90K 344K … 377K 5 ( 2%) - 0.9% ± 0.4%
branch_misses 385K ± 861 383K … 388K 2 ( 1%) 💩+ 21.8% ± 0.2%
Benchmark 3 (224 runs): ./BenchHTMLTokenizer blink all-valid
measurement mean ± σ min … max outliers delta
wall_time 44.6ms ± 885us 42.1ms … 46.2ms 23 (10%) 💩+ 8.9% ± 0.4%
peak_rss 54.5MB ± 87.3KB 54.1MB … 54.6MB 88 (39%) - 0.1% ± 0.0%
cpu_cycles 106M ± 654K 105M … 109M 2 ( 1%) 💩+ 17.2% ± 0.1%
instructions 205M ± 9.02K 205M … 205M 10 ( 4%) 💩+ 19.4% ± 0.0%
cache_references 3.80M ± 103K 3.69M … 5.07M 9 ( 4%) 💩+ 6.5% ± 0.6%
cache_misses 361K ± 9.82K 345K … 396K 3 ( 1%) + 0.5% ± 0.5%
branch_misses 464K ± 1.33K 460K … 469K 6 ( 3%) 💩+ 46.7% ± 0.2%
Benchmark 1 (67 runs): ./BenchMatcherDafsaBinarySearch
measurement mean ± σ min … max outliers delta
wall_time 74.6ms ± 1.32ms 72.1ms … 76.4ms 0 ( 0%) 0%
peak_rss 4.57MB ± 70.4KB 4.46MB … 4.72MB 20 (30%) 0%
cpu_cycles 299M ± 1.18M 296M … 301M 1 ( 1%) 0%
instructions 471M ± 79.5 471M … 471M 0 ( 0%) 0%
cache_references 6.03M ± 49.6K 5.90M … 6.18M 2 ( 3%) 0%
cache_misses 26.0K ± 4.18K 20.5K … 37.4K 1 ( 1%) 0%
branch_misses 5.28M ± 53.2K 5.14M … 5.37M 3 ( 4%) 0%
Benchmark 2 (48 runs): ./BenchMatcherGecko
measurement mean ± σ min … max outliers delta
wall_time 105ms ± 1.18ms 103ms … 107ms 0 ( 0%) 💩+ 40.8% ± 0.6%
peak_rss 4.57MB ± 60.4KB 4.46MB … 4.72MB 11 (23%) - 0.1% ± 0.5%
cpu_cycles 426M ± 1.48M 424M … 430M 2 ( 4%) 💩+ 42.4% ± 0.2%
instructions 745M ± 68.8 745M … 745M 8 (17%) 💩+ 58.2% ± 0.0%
cache_references 8.04M ± 77.0K 7.95M … 8.48M 1 ( 2%) 💩+ 33.3% ± 0.4%
cache_misses 27.1K ± 5.19K 21.3K … 44.7K 6 (13%) + 4.3% ± 6.7%
branch_misses 5.41M ± 2.77K 5.41M … 5.42M 2 ( 4%) 💩+ 2.5% ± 0.3%
Benchmark 3 (36 runs): ./BenchMatcherBlink
measurement mean ± σ min … max outliers delta
wall_time 140ms ± 1.66ms 138ms … 146ms 1 ( 3%) 💩+ 88.2% ± 0.8%
peak_rss 4.60MB ± 75.3KB 4.46MB … 4.72MB 12 (33%) + 0.7% ± 0.6%
cpu_cycles 573M ± 4.44M 568M … 594M 4 (11%) 💩+ 91.6% ± 0.4%
instructions 1.07G ± 70.5 1.07G … 1.07G 7 (19%) 💩+126.3% ± 0.0%
cache_references 12.3M ± 107K 12.2M … 12.7M 2 ( 6%) 💩+104.6% ± 0.5%
cache_misses 28.3K ± 5.50K 21.3K … 43.8K 3 ( 8%) 💩+ 9.1% ± 7.4%
branch_misses 8.78M ± 10.3K 8.75M … 8.80M 1 ( 3%) 💩+ 66.2% ± 0.3%
So, at long last, we're at the point where the title (hopefully) becomes justified: this improved DAFSA implementation seems to be slightly better across the board.
One funny aspect of this whole thing is that the problem is actually quite simple once you understand it, and there are probably a lot of different ways one could approach it. If you've read this far, it's very likely that you have some ideas of your own on how to make something better: either a whole different approach, or an improvement to some part of one of the approaches detailed so far.
I'll outline some avenues I think might warrant some further attention, but I also expect that I'll miss things that someone else may consider obvious.
In the last section, I took inspiration from the other browsers' implementations by adding a lookup table to accelerate the search for the first character, but it'd also be possible to take one more page from the Firefox implementation and do the same thing for the second character, too.
I actually have tried this out, and the implementation that I came up with:

- Uses a lookup table for the second character as well (which is where the extra data size mentioned below comes from)
- Needs fewer bits for the number field, since the largest number value remaining in the DAFSA is 51 (down from 163)
- Goes back to using a u8 for the char field instead of a u7 (this should reduce the number of instructions needed to access that field)

Overall, these changes cut the raw lookup time by around 16% (as measured by the benchmark I'm using, at least):
Benchmark 1 (169 runs): ./bench-first-layer-accel-binary-search
measurement mean ± σ min … max outliers delta
wall_time 59.4ms ± 1.20ms 58.0ms … 61.4ms 0 ( 0%) 0%
cpu_cycles 242M ± 797K 241M … 250M 14 ( 8%) 0%
instructions 431M ± 1.28 431M … 431M 0 ( 0%) 0%
cache_references 6.21M ± 74.9K 6.10M … 6.99M 2 ( 1%) 0%
cache_misses 1.53K ± 629 1.01K … 5.03K 16 ( 9%) 0%
branch_misses 4.88M ± 4.28K 4.87M … 4.90M 6 ( 4%) 0%
Benchmark 2 (201 runs): ./bench-two-layer-accel-linear-search
measurement mean ± σ min … max outliers delta
wall_time 49.7ms ± 2.64ms 47.4ms … 76.2ms 10 ( 5%) ⚡- 16.2% ± 0.7%
cpu_cycles 198M ± 1.20M 197M … 207M 11 ( 5%) ⚡- 18.2% ± 0.1%
instructions 332M ± 4.83 332M … 332M 26 (13%) ⚡- 22.8% ± 0.0%
cache_references 7.29M ± 48.6K 7.19M … 7.44M 1 ( 0%) 💩+ 17.3% ± 0.2%
cache_misses 1.56K ± 442 1.09K … 3.45K 7 ( 3%) + 1.8% ± 7.2%
branch_misses 3.96M ± 7.41K 3.95M … 4.00M 7 ( 3%) ⚡- 18.7% ± 0.0%
So, for 8 KiB more data you can get another decent performance improvement, but so far I've only implemented this version in Zig so I can't report more information on this yet (would need to port it to C++ to test it with Ladybird). This also represents my first attempt at this 'two layer acceleration' strategy, so it's possible there's more juice to squeeze here.
Single instruction, multiple data (SIMD) is something I've had very little experience with up to this point, and my naive attempts at using SIMD to accelerate my DAFSA implementation were not fruitful. However, it seems like there's potential to take advantage of SIMD for this type of problem (if not with a DAFSA, then with some totally different approach that takes better advantage of what SIMD is good at).
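For example, here's a naive sketch of the sort of thing I mean (my own illustration, and, per the above, not necessarily a win in practice): comparing 16 char values against a target at once using a Zig vector.

const std = @import("std");

// A naive SIMD sketch: compare 16 characters against the target at once and
// return the index of the first match, if any.
fn findCharSimd(chars: *const [16]u8, c: u8) ?usize {
    const vec: @Vector(16, u8) = chars.*;
    const needle: @Vector(16, u8) = @splat(c);
    const matches = vec == needle;
    if (std.simd.firstTrue(matches)) |i| return i;
    return null;
}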
As I understand it, the core idea of data-oriented design is: instead of using an 'array of structs', use a 'struct of arrays' where each array holds segments of data that are frequently accessed together. If applied well, this can both cut down on wasted padding bits between elements and make your code much more CPU-cache-friendly.
For example:
const Struct = struct {
foo: u8,
bar: u16,
};
const array_of_structs = [100]Struct{ ... };
With the above, each element will have 8 bits of padding (the fields of Struct use 3 bytes, but @sizeOf(Struct) is 4 bytes), and array_of_structs will use 400 bytes. Additionally, if you have a loop where you're only accessing one field like this:
for (&array_of_structs) |element| {
if (element.foo == '!') return true;
}
then you're accidentally paying the cost of the larger Struct size since fewer elements will fit into cache. If we move to a 'struct of arrays' approach instead like so:
const StructOfArrays = struct {
foos: [100]u8 = { ... },
bars: [100]u16 = { ... },
};
then we're only using 300 bytes for these two arrays, and if we write a loop that only looks at foos like so:
for (&StructOfArrays.foos) |foo| {
if (foo == '!') return true;
}
it will be able to benefit from each element being contiguous in memory and from the fact that more elements of the array will fit into cache at a time.
This is something I experimented quite a bit with, but never got results from. I believe the problem is that the access patterns of the DAFSA don't really benefit from the 'struct of arrays' approach, even though it seems like they might. The char field is accessed repeatedly while searching a list of children, but all the other fields of the Node are almost always accessed after that search is finished, so any benefit we get from having the char fields contiguous, we lose just as much from having the other fields farther away from their associated char. As far as I can tell, it's overall equally-or-more efficient to just use a plain old array-of-structs for our DAFSA nodes.
I effectively pulled the DAFSA out of a hat, without surveying the possibilities much. Someone more well versed in the field of data structures will likely have some ideas about what's out there that could work better.
This article continually and relentlessly grew in scope, and has ended up quite a bit more in-depth than I originally imagined (this is something that's familiar to me, unfortunately). If you've read this far, thank you, and I hope you were able to get something out of it.
Throughout the process of writing, I've accrued a number of improvements that I can make on top of my original Ladybird pull request:

- Using lookahead whenever there's no active insertion point
- Accelerating the search for the first character (via an O(1) lookup table)
- Using binary searches instead of linear scans when searching lists of children (made possible by removing the first layer of nodes in favor of the O(1) lookup table)

So, a new pull request (or a few) to Ladybird will be forthcoming with some combination of these changes. However, I expect that exactly what those future PR(s) will look like may be shaped by the feedback I receive from this post, as I remain confident that better approaches than mine are out there, and, if you've read this article, you have all the knowledge necessary (and then some) to come up with an implementation of your own.
Finally, I'll leave you with some links: