Slightly better named character reference tokenization than Chrome, Safari, and Firefox

A while back, for no real reason, I tried writing an implementation of a data structure tailored to the specific use case of the Named character reference state of HTML tokenization (here's the link to that experiment). Recently, I took that implementation, ported it to C++, and used it to make some efficiency gains and fix some spec compliance issues in the Ladybird browser.

Throughout this, I never actually looked at the implementations used in any of the major browser engines (no reason for this, just me being dumb). However, now that I have looked at Blink/WebKit/Gecko (Chrome/Safari/Firefox, respectively), I've realized that my implementation seems to be either on-par or better across the metrics that the browser engines care about: performance, data size, and ease-of-use.

So, I thought I'd take you through what I came up with and how it compares to the implementations in the major browser engines. Mostly, though, I just think the data structure I used is neat and want to tell you about it (fair warning: it's not novel).

What is a named character reference?🔗

A named character reference is an HTML entity specified using an ampersand (&) followed by an ASCII alphanumeric name. A predefined set of names will get transformed during HTML parsing into particular code point(s). For example, &xcirc; is a valid named character reference that gets transformed into the symbol ◯, while &amp; will get transformed into &.

Here are a few properties of named character references that are relevant for what we'll ultimately be aiming to implement:

  • There are 2,231 of them in total.
  • Their names consist of ASCII alphanumeric characters, and most (but not all) are terminated by a semicolon (&not is valid without one, for example).
  • Each one maps to either one or two code points.

Most crucially, though, the mappings of named character references are fixed. The HTML standard contains this note about named character references:

Note: This list is static and will not be expanded or changed in the future.

This means that it's now safe to represent the data in the minimum number of bits possible without any fear of needing to accommodate more named character reference mappings in the future.

Named character reference tokenization overview🔗

I'm specifically going to be talking about the Named character reference state of HTML tokenization. You can read the spec for it if you want (it's fairly short) but there's really only one sentence we're interested in:

Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table.

This sentence is really the crux of the implementation, but the spec is quite hand-wavy here and conceals a lot of complexity. The example given a bit later in the spec hints at why the above sentence is not so straightforward (edited to exclude irrelevant details; for context, &not is a valid named character reference, as is &notin;):

Example: If the markup contains the string I'm &notit; I tell you, the character reference is parsed as "not", as in, I'm ¬it; I tell you. But if the markup was I'm &notin; I tell you, the character reference would be parsed as "notin;", resulting in I'm ∉ I tell you.

That is, with the string &notit;, the characters up to and including &noti can still lead to a valid named character reference (&notin;, among others), so we only know that &notit; is invalid once we've reached &notit, which can no longer lead to a valid named character reference. What this means is that there needs to be some amount of backtracking involved, as the goal is to consume only the characters that are part of the longest valid named character reference.

That's not the only complication, though...

The spectre of document.write🔗

Due to <script> tags, the input to HTML tokenization is mutable while it is being tokenized, meaning looking ahead is not always reliable/possible since we might not yet know what comes next. Consider this example:

<script>
document.write("&not");
</script>in;

The expected result after parsing is ∉, meaning the resolved character reference is &notin;. Let's go through why that is the case.

After the closing script tag is tokenized, the parser adds an "insertion point" after it, and then the script within the tag itself is executed. So, right before the script is executed, the tokenizer input can be visualized as (where ▮ is the insertion point):

▮in;

And then after the <script> is executed and the document.write call has run, it can be visualized as:

&not▮in;

What happens from the tokenizer's perspective is that after the closing </script> tag, &not comes next in the input stream (which was inserted by document.write), and then the characters after &not are in;, so ultimately a tokenizer going character-by-character should see an unbroken &notin;, recognize that as a valid character reference, and translate it to ∉.

To further show why this can be tricky to handle, consider also the possibility of document.write writing one character at a time, like so:

<script>
for (let char of "&not") {
  document.write(char);
}
</script>in;

This, too, is expected to result in ∉ (i.e. &notin;). Keep in mind that the tokenizer will advance after each document.write call, so if the tokenizer tries to lookahead past the insertion point at any point before the full script is run, it will resolve the wrong string as a character reference (&in;, &nin;, or &noin;). Here's a visualization that shows the insertion point and the various states of the input stream after each document.write call:

▮in;
&▮in;
&n▮in;
&no▮in;
&not▮in;

Therefore, while HTML tokenizers can theoretically look ahead, they can never look past the end of an insertion point.

What this all means, implementation-wise🔗

All of this is to say that a "consume the longest valid named character reference" implementation probably needs to use one of two strategies:

  1. Use lookahead, while being careful never to look past an insertion point, and backtrack in the input stream as necessary.
  2. Match one character at a time as the input is consumed, keeping track of the longest full match along the way, and backtrack only once a longer match is no longer possible.

The second strategy seems like the better approach to me, so that's what my implementation will be focused on. We'll see both strategies later on, though.

Trie implementation🔗

So, we want an implementation that can iterate character-by-character and (at any point) efficiently determine if it's possible for the next character to lead to a longer valid named character reference.

A data structure that seems pretty good for this sort of thing is a trie. A trie is a specialized tree where each node contains a character that can come after the character of its parent node in a set of words. Below is a representation of a trie containing this small subset of named character references:

&not
&notinva;
&notinvb;
&notinvc;
&notniva;
&notnivb;
&notnivc;

(root)── n ── o ── t* ──┬── i ── n ── v ──┬── a ── ;*
                        │                 ├── b ── ;*
                        │                 └── c ── ;*
                        └── n ── i ── v ──┬── a ── ;*
                                          ├── b ── ;*
                                          └── c ── ;*

(* marks an end-of-word node)

With such a trie, you search for the next character within the list of the current node's children (starting from the root). If the character is found within the children, you then set that child node as the current node and continue on for the character after that, etc.

For invalid words, this means that you naturally stop searching as soon as possible (after the first character that cannot lead to a longer match). For valid words, you trace a path through the trie and end up on an end-of-word node (and you may also pass end-of-word nodes on the way there). Here's what the path through the trie would look like for the named character reference &notinvc;:

(root)──[n]──[o]──[t*]──┬──[i]──[n]──[v]──┬── a ── ;*
                        │                 ├── b ── ;*
                        │                 └──[c]──[;*] → ⋶
                        └── n ── i ── v ──┬── a ── ;*
                                          ├── b ── ;*
                                          └── c ── ;*

(the bracketed nodes mark the path taken; the end-of-word node reached maps to the code point ⋶)

You'll notice that the mapped code point (⋶) is present on the diagram above as well. This is because it is trivial to use a trie as a map to look up an associated value, since each word in the set must end at a distinct node in the trie (i.e. no two words can share an end-of-word node). Conveniently, using the trie as a map is exactly what we want to be able to do for named character references, since ultimately we need to convert the longest matched named character reference into the relevant code point(s).

A brief detour: Representing a trie in memory🔗

One way to represent a trie node is to use an array of optional pointers for its children (where each index into the array represents a child node with that byte value as its character), like so:

const Node = struct {
  // This example supports all `u8` byte values.
  children: [256]?*Node,
  end_of_word: bool,
};

Earlier, I said that trie visualizations typically put the letters on the connections between nodes rather than the nodes themselves, and, with this way of representing the trie, I think that makes a lot of sense, since the connections are the information being stored on each node.

So, this representation can be visualized like so:

(root)
├── G ──▶ node
│         ├── G ──▶ node (end of word)
│         └── L ──▶ node (end of word)
└── H ──▶ node
          └── F ──▶ node (end of word)

(each node stores a 256-element array of child pointers; the letters label the non-null slots, i.e. the connections)

With this, checking if a character can come after the current node is a straightforward O(1) array access:

if (node.children[c] != null) {
    // found child
}

but it comes at the cost of a lot of potentially wasted space, since most nodes will have many null children.

One way to mitigate the wasted space would be to switch from an array of children to a linked list of children, where the parent stores an optional pointer to its first child, and each child stores an optional pointer to its next sibling:

const Node = struct {
  char: u8,
  first_child: ?*Node,
  next_sibling: ?*Node,
  end_of_word: bool,
};

Now that char is stored on each node directly, I (in turn) think it makes sense to visualize the trie with the characters shown on the nodes themselves, like so:

(root)
├── G
│   ├── G*
│   └── L*
└── H
    └── F*

(* marks an end-of-word node)

While this 'linked list' representation saves on memory, it transforms the search for a particular child into an O(n) linear scan across the children:

var node = starting_node.first_child orelse return null;
while (true) {
    if (node.char == c) {
        // found child
    }
    node = node.next_sibling orelse return null;
}

This linear search can be slow, especially if the nodes are individually heap-allocated and therefore could be very spread out in memory, leading to a lot of random memory accesses and cache misses. Additionally, pointers themselves take up quite a bit of space (8 bytes on 64-bit machines). If we ultimately want to decrease the size of the node, getting rid of the pointer fields would be helpful as well.

We can solve multiple of these problems at once by:

  • storing all of the nodes in one contiguous array
  • replacing the first_child pointer with an index into that array
  • storing each node's children contiguously, so that moving to the next sibling is just an index increment (with a last_sibling flag marking the end of each child list)

With this approach, Node could look like this:

const Node = packed struct {
    char: u8,
    // It's safe to represent this with the minimum number of bits,
    // e.g. there are 6 nodes in our example, so it can be represented in 3 bits.
    first_child_index: u3,
    // `last_sibling` replaces the need for the `next_sibling` field, since
    // accessing the next sibling is just an index increment.
    last_sibling: bool,
    end_of_word: bool,
};

And the array of nodes for this particular example trie would look like this:

const nodes = [6]Node{
    .{ .first_child_index = 1, .char = 0,   .last_sibling = true,  .end_of_word = false },
    .{ .first_child_index = 3, .char = 'G', .last_sibling = false, .end_of_word = false },
    .{ .first_child_index = 5, .char = 'H', .last_sibling = true,  .end_of_word = false },
    .{ .first_child_index = 0, .char = 'G', .last_sibling = false, .end_of_word = true  },
    .{ .first_child_index = 0, .char = 'L', .last_sibling = true,  .end_of_word = true  },
    .{ .first_child_index = 0, .char = 'F', .last_sibling = true,  .end_of_word = true  },
};

This representation can be visualized like so:

index:   0       1     2     3     4     5
node:  (root)   'G'   'H'   'G'   'L'   'F'

(the root's children are at indexes 1-2, the children of the 'G' node at index 1 are at indexes 3-4, and the child of the 'H' node at index 2 is at index 5)

This still means that searching a child list uses an O(n) linear scan, but this representation makes those searches much more friendly to the CPU (the children are contiguous in memory), and it greatly reduces the size of each node.
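To make the child search concrete, here's a minimal sketch of child lookup in the flattened representation, using the Node and nodes definitions from above (findChild is a hypothetical helper, not code from any implementation discussed here):

/// Returns the index of the child of `nodes[parent_index]` with character `c`,
/// or null if no such child exists.
fn findChild(nodes: []const Node, parent_index: usize, c: u8) ?usize {
    // In this scheme, a `first_child_index` of 0 means "no children",
    // since index 0 always holds the root node.
    var index: usize = nodes[parent_index].first_child_index;
    if (index == 0) return null;
    while (true) {
        if (nodes[index].char == c) return index;
        if (nodes[index].last_sibling) return null;
        // Siblings are contiguous, so the next sibling is the next element.
        index += 1;
    }
}

For example, findChild(&nodes, 0, 'H') scans indexes 1 and 2 and returns 2, while findChild(&nodes, 0, 'F') scans the same two children and returns null.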

Some hard numbers🔗

To get an idea of how the different representations compare, here's a breakdown for a trie containing the full set of named character references (2,231 words).

Data size🔗

  • 'connections' version: 9,854 nodes × 2,056 bytes per node (a 256-element array of 8-byte pointers, plus the end-of-word flag) = 20,259,824 bytes (~19.3 MiB)
  • 'linked list' version: 9,854 nodes × 24 bytes per node = 236,496 bytes (~231 KiB)
  • 'flattened' version: 9,854 nodes × 4 bytes per node = 39,416 bytes (~38.5 KiB)

That is, the 'flattened' version is 1/514 the size of the 'connections' version, and 1/6 the size of the 'linked list' version.

Performance🔗

As mentioned, the 'linked list' and 'flattened' versions sacrifice the O(1) lookup of the 'connections' version in favor of reducing the data size, so while the 'flattened' version claws some performance back from the 'linked list' version, the 'connections' version is the fastest:

One interesting thing to note is that the above results for representations 1 & 2 rely on a friendly allocation pattern for the nodes (i.e. the memory addresses of the nodes happening to end up fairly close to each other). This is admittedly pretty likely when constructing a trie all at once, but, if we intentionally force a horrendous allocation pattern, where each allocated node gets put on an entirely separate page, we can see the effects very clearly.

If we run the relevant benchmarks through poop, we can confirm the cause of the slowdown (these results are from the 'connections' version):

Benchmark 1 (11 runs): ./trie-friendly-allocations
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           496ms ± 24.1ms     440ms …  518ms          1 ( 9%)        0%
  cpu_cycles         2.01G  ±  100M     1.78G  … 2.11G           1 ( 9%)        0%
  instructions       1.94G  ± 26.3K     1.94G  … 1.94G           0 ( 0%)        0%
  cache_references    107M  ± 3.40M     98.8M  …  109M           2 (18%)        0%
  cache_misses       33.1M  ± 1.15M     30.3M  … 34.0M           2 (18%)        0%
  branch_misses      21.7M  ± 20.7K     21.7M  … 21.7M           0 ( 0%)        0%
Benchmark 2 (5 runs): ./trie-horrendous-allocations
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.07s  ± 21.9ms    1.05s  … 1.11s           0 ( 0%)        💩+116.4% ±  5.5%
  cpu_cycles         4.17G  ± 92.4M     4.09G  … 4.33G           0 ( 0%)        💩+106.9% ±  5.6%
  instructions       1.92G  ± 38.4      1.92G  … 1.92G           0 ( 0%)          -  1.0% ±  0.0%
  cache_references    145M  ± 1.63M      144M  …  147M           0 ( 0%)        💩+ 35.3% ±  3.3%
  cache_misses       62.2M  ±  839K     61.8M  … 63.7M           0 ( 0%)        💩+ 88.3% ±  3.7%
  branch_misses      21.5M  ± 51.6K     21.4M  … 21.5M           1 (20%)          -  1.1% ±  0.2%

Note that the instruction counts are roughly the same between the 'friendly' and 'horrendous' versions, so the increased cpu_cycles and wall_time can presumably be attributed to the increase in cache misses and pointer chasing (the trie code itself is identical between the two versions).

Takeaways🔗

When using the 'flattened' representation for this particular task, we're trading off a ~20% difference in lookup speed for a 2-3 orders of magnitude difference in data size. This seems pretty okay, especially for what we're ultimately interested in implementing: a fully static and unchanging data structure, so there's no need to worry about how easy it is to modify after construction.

An important note moving forward🔗

Note: from this point on, when I talk about a 'trie', I'm specifically referring to the 'flattened' trie representation described above.

DAFSA implementation🔗

A while back at the same Zig meetup where I gave a talk about my Windows resource compiler, Niles Salter aka Validark gave a talk titled Better data structures and where to find them. It was nominally about his novel autocomplete data structure, but the stated purpose of the talk was to get people interested in learning about data structures and potentially inventing their own.

During the talk, I thought back to when I contributed to an HTML parser implementation and had to leave proper named character reference tokenization as a TODO because I wasn't sure how to approach it. I can't remember if a deterministic acyclic finite state automaton (DAFSA) was directly mentioned in the talk, or if I heard about it from talking with Niles afterwards, or if I learned of it while looking into trie variations later on (since the talk was about a novel trie variation), but, in any case, after learning about the DAFSA, it sounded like a pretty good tool for the job of named character references, so I revisited named character reference tokenization with that tool in hand.

What is a DAFSA?🔗

A DAFSA is essentially the 'flattened' representation of a trie, but, more importantly, certain types of redundant nodes are eliminated during its construction (the particulars of this aren't too relevant here so I'll skip them; see here if you're interested).

Going back to the same subset of named character references as the example in the "Trie implementation" section above, a DAFSA would represent that set of words like so:

(root)── n ── o ── t* ──┬── i ── n ──┐
                        └── n ── i ──┴── v ──┬── a ──┐
                                             ├── b ──┼── ;*
                                             └── c ──┘

(* marks an end-of-word node)

As you can see, the v, a, b, c and ; nodes are now shared between all the words that use them. This takes the number of nodes down to 13 in this example (compared to 22 for the trie).

The downside of this node consolidation is that we lose the ability to associate a given end-of-word node with a particular value. In this DAFSA example, all words except not end on the exact same node, so how can we know where to look for the associated value(s) for those words?

Here's an illustration of the problem when matching the word &notinvc;:

(root)──[n]──[o]──[t*]──┬──[i]──[n]──┐
                        └── n ── i ──┴──[v]──┬── a ──┐
                                             ├── b ──┼──[;*] → which value?
                                             └──[c]──┘

(the bracketed nodes mark the traversed path; the final ; node is the end-of-word node for six different words, so there's no single value that can be associated with it)

To get around this downside, it'd be possible to use something like a separate hash map, or to devise some minimal perfect hashing scheme to look up the associated code point(s) for a matching word after the fact, but, luckily, we don't have to worry too much about that because it turns out it is possible to do...

Minimal perfect hashing using a DAFSA🔗

First detailed in Applications of finite automata representing large vocabularies (Cláudio L. Lucchesi, Tomasz Kowaltowski, 1993) (pdf), the technique for minimal perfect hashing using a DAFSA is actually rather simple/elegant:

Within each node, store a count of all possible valid words from that node. For the example we've been using, those counts would look like this:

(root)── n(7) ── o(7) ── t(7)* ──┬── i(3) ── n(3) ──┐
                                 └── n(3) ── i(3) ──┴── v(3) ──┬── a(1) ──┐
                                                               ├── b(1) ──┼── ;(1)*
                                                               └── c(1) ──┘

(the number in parentheses is each node's count; * marks an end-of-word node)

Then, to get a unique index for a given word, traverse the DAFSA as normal, but:

  • whenever you search a node's child list for a match, add the counts of any sibling nodes that you skip past to the unique index
  • whenever you reach an end-of-word node along the matched path (including the final node of the word), add 1 to the unique index

For example, if we had a DAFSA with a, b, c, and d as possible first characters (in that order), and the word we're looking for starts with c, then we'll iterate over a and b when looking for c in the list of children, so we add the numbers of the a and b nodes (whatever they happen to be) to the unique index. Here's an illustration:

(root)──┬── a   ← skipped: add a's count to the unique index
        ├── b   ← skipped: add b's count to the unique index
        ├── c   ← match found, continue from here
        └── d   ← never visited

For a given word in the set, applying this algorithm will produce a number between 1 and the total number of words in the DAFSA (inclusive), and it's guaranteed that each word will end up with a unique number (i.e. this is a minimal perfect hash). Here's what that looks like for the example we've been using:

Taking &notniva; as an example:

  • n, o: each is the first (and only) child in its list, so nothing is skipped.
  • t: an end-of-word node (for not) that isn't the end of our word, so add 1 (running total: 1).
  • n: while scanning the children of t, we skip past the i node, which has a count of 3, so add 3 (running total: 4).
  • i, v, a: each matches without skipping any siblings.
  • ;: an end-of-word node, and the final node of our word, so add 1 and we're done (unique index: 5).

Applying the same process to every word in the set gives: not → 1, notinva; → 2, notinvb; → 3, notinvc; → 4, notniva; → 5, notnivb; → 6, notnivc; → 7.

After you have the unique index of a word, you can then use a lookup array for the associated values and index into it using unique_index - 1.
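To tie the traversal and the index-building together, here's a rough sketch of what the whole lookup could look like. The DafsaNode layout mirrors the 'flattened' node from earlier plus the number field; the exact names and field widths here are illustrative, not taken from any of the implementations discussed later:

const DafsaNode = packed struct {
    char: u8,
    // The "count of all possible valid words from this node".
    number: u8,
    end_of_word: bool,
    last_sibling: bool,
    first_child_index: u12,
};

/// Returns the 1-based unique index of `word` within the set, or null if
/// `word` is not in the set. `nodes[0]` is assumed to be the root node.
fn uniqueIndex(nodes: []const DafsaNode, word: []const u8) ?usize {
    var unique_index: usize = 0;
    var child_index: usize = nodes[0].first_child_index;
    for (word, 0..) |c, i| {
        // Linearly scan the current child list, accumulating the counts
        // of every sibling we skip past.
        while (nodes[child_index].char != c) {
            unique_index += nodes[child_index].number;
            if (nodes[child_index].last_sibling) return null;
            child_index += 1;
        }
        const is_last_char = i == word.len - 1;
        if (nodes[child_index].end_of_word) {
            if (is_last_char) return unique_index + 1;
            // Add 1 for every end-of-word node we pass through mid-word.
            unique_index += 1;
        } else if (is_last_char) {
            return null;
        }
        child_index = nodes[child_index].first_child_index;
    }
    return null;
}

With the example DAFSA from above, uniqueIndex(&nodes, "notniva;") would return 5, matching the worked example.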

Trie vs DAFSA for named character references🔗

Ok, so now that we have two different data structures that seem pretty well suited for named character reference matching—a trie and a DAFSA—how do they compare? It's now (finally) time to start using the full set of named character references and all of their mapped code point(s) to see how they stack up.

Some numbers to keep in mind upfront:

  • There are 2,231 named character references in the full set.
  • Each named character reference maps to either one or two code points.
  • A trie of the full set requires 9,854 nodes, while a DAFSA requires 3,872 nodes.

Another brief detour: representing the mapped value(s)🔗

As mentioned earlier, each named character reference is mapped to either one or two code points. Unicode code points have a range of 0x0-0x10FFFF, but if we actually look at the set of code points used by named character references, there are a few properties worth noting:

  • Every first code point is small enough to fit into 17 bits.
  • Only a small number of distinct code points ever appear as the second code point, so the second code point can be represented as a 4-bit index (with one value reserved for 'no second code point').

With both of these properties taken together, it's possible to encode the mapped code point values for any given named character reference in 21 bits. With padding between the elements of an array of 21-bit integers, though, that will round up to 4 bytes per element (11 bits of padding), so it ends up being the same as if 32 bit integers were used.
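For instance, the 21 bits could be laid out as a packed struct. This particular field split follows from the properties above, but is only my illustrative assumption, not necessarily the layout any real implementation uses:

const MappedCodePoints = packed struct {
    // Every first code point in the set fits within 17 bits.
    first_code_point: u17,
    // The second code point (when present) is one of only a handful of
    // distinct values, so it's stored as an index into a small lookup
    // table, with 0 meaning "no second code point".
    second_code_point_index: u4,
};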

Here's a diagram of a possible memory layout of an array using this representation, where each element holds the bits of the first code point and the bits of the second code point, with padding bits in between elements:

byte index:    | 0   1   2   3 | 4   5   6   7 | 8   9  10  11 |
element index: |       0       |       1       |       2       |

(each 4-byte element contains 21 bits of code point data followed by 11 bits of padding)

However, while using 21 bits to represent the mapped code point(s) does not automatically lead to any saved bytes over a 32 bit integer, it opens up the possibility to tightly pack an array of 21-bit elements in order to actually save some bytes. Yet, doing so means that storing/loading elements from the tightly packed array becomes trickier (both computationally and implementation-wise). Here's the same diagram as before, but with the elements tightly packed (no padding bits between elements):

byte index:    | 0   1   2   3   4   5   6   7   8   9  10  11 |
element index: |--0--|--1--|--2--|--3--|--4--| ...

(each element is exactly 21 bits wide, so no element past the first starts or ends on a byte boundary)

You'll notice that no elements past the first start or end on byte boundaries, meaning that loading an element requires a fair number of bitwise operations (bit shifting, masking, etc.). This makes array accesses more expensive, but that isn't necessarily a big deal for our use case, since we only ever access the array of values once per named character reference, and only after we're certain we have a match. So, tightly bitpacking the value array is a viable way to save some extra bytes for our purposes.
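Here's a sketch of what loading an element from the tightly packed array might look like (assuming little-endian byte order, and assuming the backing array is padded with a few extra bytes so that the final 4-byte read stays in bounds):

const std = @import("std");

/// Loads the `i`-th 21-bit element from a tightly packed array of bytes.
fn packedLoad(bytes: []const u8, i: usize) u21 {
    const bit_offset = i * 21;
    const byte_offset = bit_offset / 8;
    const shift: u5 = @intCast(bit_offset % 8);
    // A 21-bit element shifted by at most 7 bits spans at most 28 bits,
    // so a 4-byte window always contains the entire element.
    const window = std.mem.readInt(u32, bytes[byte_offset..][0..4], .little);
    return @truncate(window >> shift);
}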

Data size🔗

For the DAFSA, the size calculation is pretty straightforward:

  • 3,872 nodes × 4 bytes per node = 15,488 bytes
  • 2,231 mapped values × 4 bytes per value = 8,924 bytes (or 2,231 × 21 bits ≈ 5,857 bytes if the value array is tightly bitpacked)

Nitty-gritty DAFSA node size details

Ultimately, the goal is to keep the node size less than or equal to 32 bits while storing the following data on each node:

  • An ASCII character
    • This can technically be represented in 6 bits, since the actual alphabet of characters used in the list of named character references only includes 61 unique characters ('1'...'8', ';', 'a'...'z', 'A'...'Z'). However, to do so you'd need to convert between the 6 bit representation and the actual ASCII value of each character to do comparisons. We aren't desperate to save bits, though, so we can get away with representing this value as 8 bits, which makes comparisons with any byte value trivial.
  • A "count of all possible valid words from that node"
    • Empirically, the highest value within our particular DAFSA for this field is 168, which can fit into 8 bits.
  • An "end of word" flag
    • 1 bit
  • A "last sibling" flag
    • 1 bit
  • An "index of first child" field
    • There are 3,872 nodes in our DAFSA, so all child indexes can fit into 12 bits.

In total, that's 8 + 8 + 1 + 1 + 12 = 30 bits, so we have 2 bits to spare while remaining within the 32 bit target size with this representation.

So, the DAFSA and the lookup array for the values (together) will use either 24,412 bytes (23.84 KiB) or 21,345 bytes (20.84 KiB) total.

For the trie, there's slightly more to discuss around data representation before we can get to the data size calculations. It was glossed over in the Trie implementation section, but when using the 'flattened' representation of a trie there are effectively two ways to handle value lookups for each word:

  1. Store an array of 2,231 values (one for each named character reference) and then also store an index into that array on each end-of-word node.
    • This increases the bit size needed for each node by 12 bits (since 2,231 can be encoded in 12 bits)
  2. Store an array of 9,854 values (one for each node in the trie) and then index into the value array by re-using the index of the end-of-word node.
    • This makes the size of the value array 4.42 times larger, but does not affect the node size

Beyond that, the details aren't super relevant. Suffice it to say that each node will either take up 5 bytes or 3 bytes depending on which of the two 'value array' strategies you choose (note: I actually mean 5 bytes and 3 bytes, as they can be represented as one array of 2- or 4-byte values and one array of 1-byte values so padding between elements doesn't factor in).

The summary is that, depending on the particular representation, the trie will use between 57,993 bytes and 68,777 bytes (56.63 KiB to 67.16 KiB) total, or, if the values array is tightly bitpacked, between 54,926 bytes and 55,227 bytes (53.64 KiB to 53.93 KiB) total.

Ultimately, the data size of the trie is going to be at least 2x larger than the equivalent DAFSA.

Performance🔗

Luckily, there's an existing HTML parser implementation written in Zig called rem that uses a trie for its named character reference tokenization, so getting some relevant benchmark results from actual HTML parsing for a trie vs DAFSA comparison was pretty easy.

From my benchmarking, it turns out that the DAFSA implementation uses more instructions than the trie implementation because it needs to do extra work to build up the unique index during iteration, but the DAFSA saves on cache misses (presumably due to the smaller overall size of the DAFSA and its node re-use) and everything just about evens out in terms of wall clock time:

Benchmark 1 (449 runs): ./trie
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          44.4ms ± 1.57ms    43.0ms … 61.8ms          3 ( 1%)        0%
  peak_rss           61.8MB ± 62.5KB    61.6MB … 61.9MB          0 ( 0%)        0%
  cpu_cycles         49.5M  ±  463K     48.5M  … 51.9M          19 ( 4%)        0%
  instructions       76.2M  ± 2.95      76.2M  … 76.2M           5 ( 1%)        0%
  cache_references   2.54M  ± 21.6K     2.48M  … 2.63M          12 ( 3%)        0%
  cache_misses        119K  ± 1.64K      115K  …  128K          18 ( 4%)        0%
  branch_misses       322K  ± 1.02K      319K  …  328K          18 ( 4%)        0%
Benchmark 2 (451 runs): ./dafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          44.3ms ±  561us    43.8ms … 48.4ms         19 ( 4%)          -  0.4% ±  0.3%
  peak_rss           61.6MB ± 66.6KB    61.5MB … 61.7MB          0 ( 0%)          -  0.2% ±  0.0%
  cpu_cycles         53.0M  ±  566K     52.3M  … 54.9M          11 ( 2%)        💩+  7.0% ±  0.1%
  instructions       78.7M  ± 2.59      78.7M  … 78.7M           6 ( 1%)        💩+  3.2% ±  0.0%
  cache_references   2.49M  ± 30.0K     2.43M  … 2.60M          29 ( 6%)        ⚡-  2.0% ±  0.1%
  cache_misses       90.9K  ± 1.32K     86.4K  … 95.4K          12 ( 3%)        ⚡- 23.5% ±  0.2%
  branch_misses       331K  ±  730       330K  …  337K          21 ( 5%)        💩+  2.9% ±  0.0%

Takeaways🔗

For the use-case of named character references, using a DAFSA instead of a trie cuts the size of the data at least in half while performing about the same.

The Ladybird implementation🔗

First, let's take a look at what the Ladybird implementation looked like before my changes: state implementation, matching implementation. Here's a rough summary of the approach, in pseudo-code:

var match = null;

// Lookahead at the rest of the input
var remaining_input = input.substring(current_offset, input.length - current_offset);

// Check against each named character reference one-by-one
for (var entity : entities) {
    if (remaining_input.starts_with(entity)) {
        if (match == null or entity.length > match.length) {
            match = entity;
        }
    }
}

// If there is a match
if (match != null) {
    // Consume up to the end of the match that was found
    consume_and_advance(match.length - 1);

    // ...
}

This has two major problems:

  1. It is inefficient, since the input string is being compared against the entire list of named character references one-at-a-time
  2. It does not handle document.write correctly, as previously discussed in The spectre of document.write. It's doing lookahead, but it does not account for insertion points, as it makes the mistake of looking past them. So, if document.write is used to write one character at a time, it will attempt to resolve the named character reference before all of the characters are available (e.g. in the </script>in; example above, it will erroneously try matching against &in; and then exit the named character reference state)

My pull request focused on fixing both of those problems. The data structure I used is exactly the DAFSA implementation as described so far, with a value array that is not bitpacked, because:

  • the absolute size savings from bitpacking are small (about 3 KiB)
  • bitpacking makes loading a value more computationally expensive and the code more complex

The last piece of the puzzle that I haven't mentioned yet is the NamedCharacterReferenceMatcher, which handles DAFSA traversal while providing an API well-tailored to the named character reference state, specifically. The details aren't too important, so here are the relevant bits of the exposed API:

// If `c` is the code point of a child of the current `node_index`, the `node_index`
// is updated to that child and the function returns `true`.
// Otherwise, the `node_index` is unchanged and the function returns false.
bool try_consume_code_point(u32 c);

// Returns the number of code points consumed beyond the last full match.
u8 overconsumed_code_points();

// Returns the code points associated with the last match, if any.
Optional<NamedCharacterReferenceCodepoints> code_points();
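As an example of how this API gets used, consider tokenizing I'm &notit; I tell you again: try_consume_code_point returns true for n, o, t, and i (since &noti can still lead to &notin;, among others), and then returns false for the second t. At that point, the last full match was not, so overconsumed_code_points() returns 1 (for the overconsumed i) and code_points() returns the code point for &not (¬).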

So, with all that context, here's what the Ladybird implementation looks like after my changes (slightly simplified for clarity; here's the full implementation):

BEGIN_STATE(NamedCharacterReference)
{
    if (matcher.try_consume_code_point(current_code_point)) {
        temporary_buffer.append(current_code_point);
        continue; // stay in the NamedCharacterReference state and go to the next code point
    } else {
        DONT_CONSUME_CHARACTER;
    }

    auto overconsumed_code_points = matcher.overconsumed_code_points();
    if (overconsumed_code_points > 0) {
        backtrack_to(current_offset - overconsumed_code_points);
        temporary_buffer.shrink_by(overconsumed_code_points);
    }

    auto mapped_code_points = matcher.code_points();
    // If there is a match
    if (mapped_code_points) {
        // ...
    } else {
        FLUSH_CODEPOINTS_CONSUMED_AS_A_CHARACTER_REFERENCE;
        SWITCH_TO(AmbiguousAmpersand);
    }
}

Note also that Ladybird follows 'spec-driven development', meaning that the goal is for its code to be implemented to match the text of the relevant specification as closely as possible. Here's what the named character reference state specification looks like for reference:

13.2.5.73 Named character reference state

Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table. Append each character to the temporary buffer when it's consumed.

  • If there is a match

    • ...
  • Otherwise

    • Flush code points consumed as a character reference. Switch to the ambiguous ampersand state.

Overall, these changes made the Ladybird tokenizer:

  • faster: roughly 8x faster on a benchmark that exclusively exercises named character reference matching, and about 1.23x faster on a benchmark using a sample of real HTML files
  • more spec-compliant: insertion points (i.e. document.write) are now handled correctly during named character reference tokenization

But that's all pretty low-hanging fruit, as the previous Ladybird implementation had some obvious problems. In fact, we can actually improve on this some more (and will later on), but I think it's worth looking at the Firefox/Chrome/Safari implementations now to see how this DAFSA version stacks up against them.

Comparison to the major browser engines🔗

Before we get to the actual comparisons, there's (unfortunately) a lot that has to be discussed.

First, you'll notice from the Ladybird benchmarks above that an 8x improvement in a very named-character-reference-specific benchmark only led to a 1.23x improvement in the average case. This points to the fact that named character reference matching is not something that HTML tokenizers typically do very often, and that named character reference matching being fast enough is likely just fine, all things considered.

Second, instead of going the route of putting my DAFSA implementation into the other browsers' engines to compare, I went with taking the other browsers' implementations and putting them into Ladybird. Not only that, though, I also made the Firefox/Chrome/Safari implementations conform to the API of NamedCharacterReferenceMatcher (for reasons that will be discussed soon). So, in order for my benchmarking to be accurate you'll have to trust that:

  1. my ports of the Firefox/Chrome/Safari implementations are faithful
  2. conforming them to the NamedCharacterReferenceMatcher API did not meaningfully affect their performance
  3. my benchmarking setup itself is sound

For the first point, the only real assurance I can give you is that the same number of web platform tests within the html/syntax/parsing category were passing with each browser's implementation integrated. The second point will be discussed more later on. For the third point, we have to go on yet another detour...

On the difficulty of benchmarking🔗

My initial benchmarking setup was straightforward:

  • Compile a separate Ladybird tokenizer benchmark binary for each implementation being compared (e.g. TestHTMLTokenizerBlink, TestHTMLTokenizerDafsa)
  • Run those binaries on the same inputs and compare the results

However, this approach ultimately left me with some inexplicable results. Here's an example of such results, where the benchmark that exclusively tests the relevant parts of the code shows the Blink (Chrome) implementation being slightly faster than mine:

Benchmark 1 (85 runs): ./TestHTMLTokenizerBlink --bench named_character_references
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           118ms ± 1.68ms     115ms …  121ms          0 ( 0%)        0%
Benchmark 2 (84 runs): ./TestHTMLTokenizerDafsa --bench named_character_references
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           120ms ± 1.50ms     117ms …  123ms          0 ( 0%)        💩+  2.1% ±  0.4%

Yet, when I ran a benchmark that only occasionally exercises the code that I changed (a benchmark using a sample of real HTML files), I got unexpected results. What we should expect is either a very slight difference in the same direction, or (more realistically) no discernible difference, as this effect size should not be noticeable in the average case. Instead, I got the opposite result and a larger effect:

Benchmark 1 (12 runs): ./TestHTMLTokenizerBlink --bench benchfiles
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.85s  ± 28.4ms    1.79s  … 1.88s           2 (17%)        0%
Benchmark 2 (12 runs): ./TestHTMLTokenizerDafsa --bench benchfiles
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.79s  ± 20.7ms    1.76s  … 1.84s           1 ( 8%)        ⚡-  3.2% ±  1.1%

Taken together, the only explanation for these sorts of results would be that the parts of the code that I didn't change got faster in one version, and not the other.

The elephant demands attention🔗

Well, this explanation—that the code I didn't change got faster—is actually likely to be the correct one, and I've known about this possibility ever since I watched the excellent talk "Performance Matters" by Emery Berger. I recommend watching the talk, but the short explanation is that changes to one part of the codebase may inadvertently cause the compiler to reorganize unrelated parts of the compiled binary in ways that affect performance ('layout').

However, while I was aware of this possible confounder, up until now I have basically ignored it whenever doing benchmarking (and I'm assuming that's true for most programmers). This is the first time I've come face-to-face with the problem and had to reckon with its effects, and I think a large part of that is because I'm dealing with a much larger codebase than I typically would when benchmarking, so there's a lot of room for inadvertent layout changes to cascade and cause noticeable performance differences.

Elephant mitigation🔗

Unfortunately, the solution presented in the talk (Stabilizer, which allows you to constantly randomize a binary's layout during runtime to control for layout-based performance differences) has bitrotted and only works with an ancient version of LLVM. So, instead, I thought I'd try a different benchmarking setup to counteract the problem:

  • Compile all of the implementations into one and the same binary
  • Select which implementation to use at runtime (through a common interface, using dynamic dispatch)

This introduces some dynamic dispatch overhead into the mix which may muddy the results slightly, but, in theory, this should eliminate the effects of layout differences, as whatever the binary layout happens to be, all implementations will share it. In practice, this did indeed work, but introduced another unexplained anomaly that we'll deal with afterwards. After moving all the implementations into the same binary, I got these results for the 'average case' benchmark:

Benchmark 1 (12 runs): ./BenchHTMLTokenizerFiles dafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.78s  ± 29.8ms    1.75s  … 1.87s           1 ( 8%)        0%
  peak_rss           88.7MB ± 59.2KB    88.7MB … 88.8MB          0 ( 0%)        0%
  cpu_cycles         6.71G  ±  127M     6.62G  … 7.10G           1 ( 8%)        0%
  instructions       16.1G  ± 99.2K     16.1G  … 16.1G           0 ( 0%)        0%
  cache_references    324M  ± 2.87M      319M  …  331M           2 (17%)        0%
  cache_misses       10.2M  ± 66.4K     10.1M  … 10.3M           3 (25%)        0%
  branch_misses      8.69M  ± 2.99M     7.82M  … 18.2M           1 ( 8%)        0%
Benchmark 3 (12 runs): ./BenchHTMLTokenizerFiles blink
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.79s  ± 24.8ms    1.73s  … 1.82s           1 ( 8%)          +  1.0% ±  1.3%
  peak_rss           89.1MB ±  234KB    89.0MB … 89.8MB          1 ( 8%)          +  0.5% ±  0.2%
  cpu_cycles         6.70G  ± 57.7M     6.62G  … 6.79G           0 ( 0%)          -  0.2% ±  1.2%
  instructions       16.1G  ±  128K     16.1G  … 16.1G           1 ( 8%)          -  0.0% ±  0.0%
  cache_references    325M  ± 1.90M      321M  …  328M           0 ( 0%)          +  0.2% ±  0.6%
  cache_misses       10.3M  ± 54.3K     10.2M  … 10.4M           0 ( 0%)          +  0.8% ±  0.5%
  branch_misses      7.86M  ± 24.6K     7.79M  … 7.89M           3 (25%)          -  9.6% ± 20.6%
Benchmark 4 (12 runs): ./BenchHTMLTokenizerFiles gecko
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.72s  ± 9.79ms    1.70s  … 1.74s           0 ( 0%)        ⚡-  3.2% ±  1.1%
  peak_rss           88.8MB ± 83.8KB    88.7MB … 89.0MB          4 (33%)          +  0.1% ±  0.1%
  cpu_cycles         6.69G  ± 41.1M     6.63G  … 6.76G           0 ( 0%)          -  0.4% ±  1.2%
  instructions       16.1G  ± 72.3K     16.1G  … 16.1G           0 ( 0%)          -  0.1% ±  0.0%
  cache_references    323M  ± 1.53M      320M  …  325M           1 ( 8%)          -  0.2% ±  0.6%
  cache_misses       10.2M  ± 35.8K     10.1M  … 10.2M           0 ( 0%)          +  0.0% ±  0.4%
  branch_misses      7.79M  ± 49.1K     7.76M  … 7.95M           2 (17%)          - 10.4% ± 20.6%

The remaining Gecko (Firefox) wall_time difference is consistently reproducible, but not readily explainable using the other metrics measured, as there is no significant difference in CPU cycles, instructions, cache usage, etc. After attempting some profiling and trying to use strace to understand the difference, my guess is that this comes down to coincidental allocation patterns being friendlier when choosing the Gecko version.

If we use strace -e %memory -c, the dafsa and blink versions consistently use more brk/mmap/munmap syscalls (especially brk):

Dafsa
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 47.18    0.012463         113       110           brk
 31.90    0.008427         443        19           munmap
 18.25    0.004822           9       508           mmap
  2.66    0.000703           5       134           mprotect
------ ----------- ----------- --------- --------- ----------------
100.00    0.026415          34       771           total

Blink
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 55.95    0.018094         138       131           brk
 28.81    0.009318         490        19           munmap
 12.93    0.004181           8       508           mmap
  2.32    0.000749           5       134           mprotect
------ ----------- ----------- --------- --------- ----------------
100.00    0.032342          40       792           total

The Gecko version, even though it uses roughly the same amount of memory overall, consistently has fewer of these syscalls:

Gecko
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 37.07    0.006560         385        17           munmap
 31.73    0.005615          75        74           brk
 26.50    0.004689           9       506           mmap
  4.70    0.000831           6       134           mprotect
------ ----------- ----------- --------- --------- ----------------
100.00    0.017695          24       731           total

I don't know enough about the glibc/libstdc++ allocator implementation(s) to know why this would be the case, and the magnitude of the difference reported by strace doesn't seem large enough to explain the results, but I'm at least a little bit confident that this is the cause, since, after inserting padding to each NamedCharacterReferenceMatcher subclass to ensure they are all the same size, the wall_time difference went away:

Benchmark 1 (12 runs): ./BenchHTMLTokenizerFiles dafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.81s  ± 29.8ms    1.73s  … 1.85s           1 ( 8%)        0%
  peak_rss           89.7MB ±  203KB    89.5MB … 90.3MB          1 ( 8%)        0%
  cpu_cycles         6.74G  ± 58.0M     6.68G  … 6.87G           0 ( 0%)        0%
  instructions       16.1G  ±  142K     16.1G  … 16.1G           1 ( 8%)        0%
  cache_references    322M  ± 2.45M      316M  …  325M           1 ( 8%)        0%
  cache_misses       10.2M  ± 25.1K     10.1M  … 10.2M           0 ( 0%)        0%
  branch_misses      7.84M  ± 26.1K     7.77M  … 7.87M           1 ( 8%)        0%
Benchmark 2 (12 runs): ./BenchHTMLTokenizerFiles blink
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.80s  ± 23.0ms    1.74s  … 1.83s           1 ( 8%)          -  0.1% ±  1.2%
  peak_rss           89.6MB ±  205KB    89.2MB … 90.1MB          2 (17%)          -  0.1% ±  0.2%
  cpu_cycles         6.74G  ± 37.0M     6.68G  … 6.82G           0 ( 0%)          -  0.0% ±  0.6%
  instructions       16.1G  ±  194K     16.1G  … 16.1G           0 ( 0%)          -  0.0% ±  0.0%
  cache_references    321M  ± 1.93M      317M  …  324M           0 ( 0%)          -  0.3% ±  0.6%
  cache_misses       10.3M  ± 47.5K     10.2M  … 10.3M           0 ( 0%)          +  0.7% ±  0.3%
  branch_misses      7.87M  ± 23.3K     7.82M  … 7.91M           1 ( 8%)          +  0.4% ±  0.3%
Benchmark 3 (12 runs): ./BenchHTMLTokenizerFiles gecko
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.80s  ± 29.0ms    1.71s  … 1.82s           1 ( 8%)          -  0.4% ±  1.4%
  peak_rss           89.7MB ±  265KB    89.5MB … 90.5MB          1 ( 8%)          +  0.1% ±  0.2%
  cpu_cycles         6.73G  ± 43.3M     6.65G  … 6.78G           0 ( 0%)          -  0.2% ±  0.6%
  instructions       16.1G  ±  156K     16.1G  … 16.1G           2 (17%)          -  0.1% ±  0.0%
  cache_references    321M  ± 3.29M      316M  …  329M           0 ( 0%)          -  0.2% ±  0.8%
  cache_misses       10.2M  ± 52.9K     10.1M  … 10.2M           2 (17%)          -  0.2% ±  0.3%
  branch_misses      7.83M  ± 22.2K     7.76M  … 7.86M           2 (17%)          -  0.2% ±  0.3%

No meaningful differences across the board, which is what we expect. What this (tentatively) means is that heap allocation is another potential confounder, and that something as inconsequential as a single allocation being a different size (in our case the NamedCharacterReferenceMatcher instance) may have knock-on effects that last for the rest of the program (or I'm wrong about the cause and this is a red herring).

Consequently, this means that I won't bother with the 'average case' benchmarking moving forward. In other words, spoiler alert: nothing in this article will move the needle on this 'average case' benchmark.

Side note: conforming to one API🔗

Something worth mentioning here is that I've made the choice to convert the Firefox/Chrome/Safari implementations to conform to the NamedCharacterReferenceMatcher API used by Ladybird (instead of porting the full named character reference tokenizer state implementation from the other browsers into Ladybird). This was done for two reasons:

  1. It keeps the comparison focused on the matching implementations themselves, since every version runs inside the exact same tokenizer code.
  2. Porting each browser's full named character reference state (along with everything it depends on) would have been far more invasive and error-prone.

I'm mentioning this now because it means that I've introduced another possible source of error into my benchmarks; the Firefox/Chrome/Safari implementations that I'm testing are not 1:1 ports, as they had to be transformed to conform to the NamedCharacterReferenceMatcher API (Firefox much more than Chrome/Safari).

Lessons learned🔗

I think the big takeaway here is that there is a lot that can go wrong when benchmarking.

Aside from the more esoteric stuff mentioned above, there are also countless simple/dumb mistakes that can be made that can completely ruin a benchmark's integrity. As an example, for a good while when writing this article, I accidentally left a loop in the Chrome version that I only put there for debugging purposes. That loop was just eating CPU cycles for no reason and skewed my benchmark results pretty significantly. Luckily, I found and fixed that particular mistake, but that sort of thing could have easily gone unnoticed and caused me to draw totally invalid conclusions. Beyond that, there's other stuff I haven't mentioned like CPU architecture, compiler flags, etc, etc, etc.

What I'm really trying to get across is something like: be skeptical of benchmark results (especially your own), and when a result doesn't make sense, dig into why before drawing conclusions from it.

With all that out of the way, let's get into it.

Comparison with Gecko (Firefox)🔗

The current implementation of named character reference tokenization in the Gecko engine (Firefox's browser engine) was introduced in 2010, and refined during the rest of 2010. It has remained unchanged since then.

It does not use any form of a trie, but instead uses a number of arrays (48 in total) to progressively narrow down the possible candidates within the set of named character references until there are no more possible candidates remaining. Here's an overview:

  • Matching only proceeds if the first character is within the a-z or A-Z range.
  • The second character selects one of the HILO_ACCEL arrays, and that array is then indexed using the first character to retrieve a packed 32-bit value.
  • That value packs together two 16-bit indexes, 'hi' and 'lo', into a sorted NAMES array; every name starting with those two characters lies between lo and hi (inclusive).
  • Each subsequent character moves the lo and hi cursors towards each other, ruling candidates out one-by-one, until either a match is confirmed or no candidates remain.

In the very likely scenario that the above description is hard to take in, here's my best attempt at visually illustrating how it works, matching against the valid named character reference &notinvc;:

notinvc;
^

The first character (n) is within the a-z or A-Z range, so matching can proceed.

notinvc;
 ^

Use the second character (o) to get the array to use with the first character (n):

HILO_ACCEL
  0   '\x00'  N/A
  ...
  65  'A'     HILO_ACCEL_65
  ...
  110 'n'     HILO_ACCEL_110
  111 'o'  →  HILO_ACCEL_111
  112 'p'     HILO_ACCEL_112
  ...
  122 'z'     HILO_ACCEL_122

HILO_ACCEL_111
  0   'A'     0x00110010
  ...
  25  'Z'     ...
  26  'a'     ...
  ...
  39  'n'  →  0x060205F6
  ...
  51  'z'     0x08B308B3

The retrieved value packs two indexes into the NAMES array together:

0x060205F6
   ↙      ↘
hi = 0x0602 (1538)    lo = 0x05F6 (1526)

Any possible matches must be between indexes 1526 and 1538 (inclusive).

The possible matches are:

NAMES
1526 nopf;
1527 not
1528 not;
1529 notin;
1530 notinE;
1531 notindot;
1532 notinva;
1533 notinvb;
1534 notinvc;
1535 notni;
1536 notniva;
1537 notnivb;
1538 notnivc;

Now we start to narrow those possibilities down:

With each subsequent character, the lo and hi cursors rule out candidates until either a match is confirmed or none remain:

character consumed    lo candidate        hi candidate        remaining
        t             1527  not           1538  notnivc;      12
        i             1529  notin;        1534  notinvc;      6
        n             1529  notin;        1534  notinvc;      6
        v             1532  notinva;      1534  notinvc;      3
        c             1534  notinvc;      1534  notinvc;      1
        ;             1534  notinvc;      1534  notinvc;      match!

(Note that when the t of not is consumed, not has been fully matched, so it's remembered as the longest complete match so far; it only gets superseded once notinvc; matches in full.)

I'm glossing over how exactly the possibilities are narrowed down because it's not super relevant (if you're interested, here's the responsible tokenizer code), but I will note that the 'lo' and 'hi' cursors always move linearly (i.e. each possible match is ruled out one-by-one; there's no binary search or anything like that going on).
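To make the cursor movement concrete, here's a rough sketch of a single narrowing step (my reconstruction of the idea, not Gecko's actual code; names stands in for the sorted NAMES array, lo and hi are assumed to be in bounds, and char_index is the position being checked within the candidate names):

/// Rules out candidates that don't have `c` at `char_index`, advancing the
/// `lo` cursor up and the `hi` cursor down one candidate at a time.
/// Returns false once no possible candidates remain.
fn narrow(names: []const []const u8, lo: *usize, hi: *usize, char_index: usize, c: u8) bool {
    // Rule out candidates from the bottom of the range...
    while (lo.* <= hi.*) : (lo.* += 1) {
        const name = names[lo.*];
        if (char_index < name.len and name[char_index] == c) break;
    }
    // ...and then from the top.
    while (hi.* > lo.*) : (hi.* -= 1) {
        const name = names[hi.*];
        if (char_index < name.len and name[char_index] == c) break;
    }
    return lo.* <= hi.*;
}

Each call to narrow corresponds to one row of the narrowing table above; this works because the names are sorted, so (given a shared consumed prefix) the names that match the next character always form a contiguous subrange.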

This approach works well because the first two characters alone fairly reliably narrow down the possibilities to a pretty small range. Out of 2288 possible combinations of the first two characters, 1658 of them (72.5%) lead to zero possible matches. Out of the remaining combinations (those with ≥ 1 possible match), the mean number of matches is 3.54 with a standard deviation of 29.8, and the median number of possible matches is 2. Here's what the full distribution looks like (with the combinations that lead to zero matches included):

[Histogram: number of possible matches remaining after the first two characters (x-axis, 0-55) vs. frequency (y-axis, 0-1,500), including the combinations that lead to zero matches.]

Now that we have an understanding of how the Firefox implementation works, let's see how it compares using the three metrics that were mentioned at the start.

Performance🔗

Performance between the Firefox version and the Ladybird DAFSA version is basically a wash in the primary benchmark I'm using (tokenizing a file with tens of thousands of valid and invalid named character references):

Benchmark 1 (89 runs): ./BenchHTMLTokenizer gecko
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           113ms ± 1.12ms     111ms …  115ms          0 ( 0%)        0%
  peak_rss           83.4MB ± 93.7KB    83.0MB … 83.5MB          2 ( 2%)        0%
  cpu_cycles          226M  ±  877K      224M  …  230M           3 ( 3%)        0%
  instructions        438M  ± 10.6K      438M  …  438M           7 ( 8%)        0%
  cache_references   9.54M  ±  130K     9.40M  … 10.5M           6 ( 7%)        0%
  cache_misses        427K  ± 11.1K      406K  …  458K           2 ( 2%)        0%
  branch_misses       578K  ± 1.79K      575K  …  585K           5 ( 6%)        0%
Benchmark 2 (88 runs): ./BenchHTMLTokenizer dafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           114ms ± 1.38ms     110ms …  116ms          5 ( 6%)          +  0.7% ±  0.3%
  peak_rss           83.3MB ± 94.6KB    83.0MB … 83.5MB          1 ( 1%)          -  0.1% ±  0.0%
  cpu_cycles          229M  ±  856K      227M  …  232M           3 ( 3%)        💩+  1.4% ±  0.1%
  instructions        450M  ± 10.4K      450M  …  450M           3 ( 3%)        💩+  2.7% ±  0.0%
  cache_references   9.42M  ±  128K     9.25M  … 10.5M           3 ( 3%)          -  1.2% ±  0.4%
  cache_misses        418K  ± 9.03K      400K  …  443K           2 ( 2%)        ⚡-  2.0% ±  0.7%
  branch_misses       575K  ± 3.01K      570K  …  600K           5 ( 6%)          -  0.5% ±  0.1%

However, if we tailor some benchmarks to test the scenarios where each should theoretically perform the worst, we can see some clearer differences.

I believe the worst case for the Firefox implementation is successfully matching the named character reference &supsetneqq;. As mentioned earlier, su as the first two characters narrows down the possibilities the least, with 55 remaining possibilities, and &supsetneqq; should take the longest to match out of the remaining possibilities.

Here are the results for tokenizing a file with nothing but 30,000 &supsetneqq; sequences in a row:

Benchmark 1 (197 runs): ./BenchHTMLTokenizer gecko gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          50.6ms ± 1.11ms    48.3ms … 53.2ms          0 ( 0%)        0%
  peak_rss           53.0MB ± 85.4KB    52.7MB … 53.2MB          1 ( 1%)        0%
  cpu_cycles          137M  ±  816K      135M  …  140M           3 ( 2%)        0%
  instructions        278M  ± 9.06K      278M  …  278M           7 ( 4%)        0%
  cache_references   3.27M  ± 58.4K     3.13M  … 3.55M          17 ( 9%)        0%
  cache_misses        361K  ± 10.0K      342K  …  396K           2 ( 1%)        0%
  branch_misses       314K  ± 5.29K      306K  …  335K           5 ( 3%)        0%
Benchmark 2 (218 runs): ./BenchHTMLTokenizer dafsa gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          45.9ms ±  805us    43.6ms … 47.5ms         22 (10%)        ⚡-  9.3% ±  0.4%
  peak_rss           53.0MB ± 83.9KB    52.7MB … 53.2MB          1 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles          117M  ±  635K      116M  …  119M           3 ( 1%)        ⚡- 14.5% ±  0.1%
  instructions        259M  ± 5.16K      259M  …  259M           5 ( 2%)        ⚡-  7.0% ±  0.0%
  cache_references   3.27M  ±  128K     3.15M  … 4.10M          20 ( 9%)          -  0.0% ±  0.6%
  cache_misses        357K  ± 7.06K      344K  …  384K           5 ( 2%)          -  1.1% ±  0.5%
  branch_misses       183K  ± 1.95K      179K  …  193K          21 (10%)        ⚡- 41.8% ±  0.2%

On the flipside, the scenario where the Firefox implementation likely outperforms the Ladybird implementation the most is an invalid named character reference that can be rejected from the first two characters alone (I've arbitrarily chosen &cz).

Here are the results for tokenizing a file with nothing but 30,000 &cz sequences in a row:

Benchmark 1 (163 runs): ./BenchHTMLTokenizer gecko ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          61.4ms ±  958us    59.2ms … 63.6ms          1 ( 1%)        0%
  peak_rss           65.1MB ± 85.5KB    64.9MB … 65.3MB         24 (15%)        0%
  cpu_cycles          104M  ±  525K      102M  …  106M           4 ( 2%)        0%
  instructions        194M  ± 4.18K      194M  …  194M           2 ( 1%)        0%
  cache_references   5.89M  ±  125K     5.79M  … 6.92M           8 ( 5%)        0%
  cache_misses        374K  ± 4.44K      367K  …  385K           0 ( 0%)        0%
  branch_misses       163K  ± 1.24K      160K  …  166K           3 ( 2%)        0%
Benchmark 2 (159 runs): ./BenchHTMLTokenizer dafsa ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          63.0ms ± 1.00ms    60.1ms … 65.0ms          0 ( 0%)        💩+  2.6% ±  0.3%
  peak_rss           65.1MB ± 74.3KB    64.8MB … 65.2MB          1 ( 1%)          -  0.1% ±  0.0%
  cpu_cycles          112M  ±  673K      111M  …  117M           8 ( 5%)        💩+  7.6% ±  0.1%
  instructions        214M  ± 4.26K      214M  …  214M           1 ( 1%)        💩+ 10.4% ±  0.0%
  cache_references   5.87M  ± 54.9K     5.77M  … 6.19M           5 ( 3%)          -  0.4% ±  0.4%
  cache_misses        375K  ± 4.52K      364K  …  391K           2 ( 1%)          +  0.2% ±  0.3%
  branch_misses       164K  ±  751       161K  …  166K           2 ( 1%)          +  0.4% ±  0.1%

However, neither of these scenarios is likely to be all that common in reality. My hunch (but I don't have any data to back this up) is that there are two scenarios that are common in real HTML:

  1. Fully valid named character references (&amp;, &lt;, &nbsp;, etc.)
  2. A lone ampersand surrounded by whitespace (e.g. foo & bar)

In the second scenario where an & character is surrounded by whitespace, the tokenizer will never actually enter the named character reference state, since that requires & to be followed by an ASCII alphanumeric character, so all implementations will perform the same there.

We can test the first scenario, though. Here are the results for a file with nothing but 30,000 valid named character references (chosen at random) in a row:

Benchmark 1 (229 runs): ./BenchHTMLTokenizer gecko all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          43.6ms ±  713us    41.3ms … 45.2ms         13 ( 6%)        0%
  peak_rss           54.4MB ± 84.8KB    54.1MB … 54.5MB          5 ( 2%)        0%
  cpu_cycles          103M  ±  545K      102M  …  105M           2 ( 1%)        0%
  instructions        188M  ± 10.6K      188M  …  188M          17 ( 7%)        0%
  cache_references   3.63M  ± 91.6K     3.54M  … 4.35M          15 ( 7%)        0%
  cache_misses        361K  ± 8.92K      347K  …  386K           0 ( 0%)        0%
  branch_misses       383K  ± 1.57K      379K  …  387K           0 ( 0%)        0%
Benchmark 2 (233 runs): ./BenchHTMLTokenizer dafsa all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          43.0ms ±  770us    40.8ms … 44.3ms         30 (13%)        ⚡-  1.3% ±  0.3%
  peak_rss           54.4MB ± 80.6KB    54.0MB … 54.5MB          1 ( 0%)          -  0.1% ±  0.0%
  cpu_cycles          102M  ±  452K      101M  …  104M           4 ( 2%)        ⚡-  1.3% ±  0.1%
  instructions        191M  ± 8.04K      191M  …  191M          11 ( 5%)        💩+  2.0% ±  0.0%
  cache_references   3.54M  ± 63.5K     3.45M  … 4.09M          13 ( 6%)        ⚡-  2.6% ±  0.4%
  cache_misses        355K  ± 5.23K      344K  …  370K           1 ( 0%)        ⚡-  1.7% ±  0.4%
  branch_misses       344K  ± 1.24K      341K  …  348K          17 ( 7%)        ⚡- 10.1% ±  0.1%

Something worth noting about why Firefox fares better here than in the &supsetneqq; benchmark above is that the distribution of possible matches after two characters is very uneven (as mentioned previously). In 42.7% of cases (269 out of 630) where there is at least 1 possible match after the first two characters, there is only 1 possible match. This makes the 'narrowing down the matches' portion of the Firefox implementation essentially just a string comparison.

Something else we can look at is raw matching speed in isolation (i.e. not within the context of HTML tokenization). In my benchmarking, the Firefox implementation wins out in this category:

Benchmark 1 (95 runs): ./BenchMatcherGecko
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           105ms ± 1.14ms     102ms …  108ms          5 ( 5%)        0%
  peak_rss           4.56MB ± 72.7KB    4.33MB … 4.72MB          0 ( 0%)        0%
  cpu_cycles          426M  ± 1.40M      424M  …  430M           9 ( 9%)        0%
  instructions        745M  ± 81.5       745M  …  745M           0 ( 0%)        0%
  cache_references   8.03M  ± 81.6K     7.89M  … 8.54M           3 ( 3%)        0%
  cache_misses       28.0K  ± 5.70K     21.2K  … 46.5K           7 ( 7%)        0%
  branch_misses      5.41M  ± 2.49K     5.41M  … 5.42M           0 ( 0%)        0%
Benchmark 2 (84 runs): ./BenchMatcherDafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           120ms ± 1.46ms     116ms …  123ms          0 ( 0%)        💩+ 13.8% ±  0.4%
  peak_rss           4.50MB ± 61.6KB    4.46MB … 4.59MB          0 ( 0%)          -  1.3% ±  0.4%
  cpu_cycles          487M  ± 4.39M      477M  …  496M           2 ( 2%)        💩+ 14.3% ±  0.2%
  instructions       1.02G  ± 66.2      1.02G  … 1.02G          14 (17%)        💩+ 36.7% ±  0.0%
  cache_references   6.11M  ± 90.5K     5.98M  … 6.81M           2 ( 2%)        ⚡- 23.9% ±  0.3%
  cache_misses       26.0K  ± 4.02K     20.3K  … 39.8K           5 ( 6%)        ⚡-  7.1% ±  5.2%
  branch_misses      6.01M  ± 20.5K     5.98M  … 6.06M           0 ( 0%)        💩+ 11.1% ±  0.1%

Two things to note:

Data size🔗

As noted earlier, Firefox uses a total of 48 arrays for its named character reference data:

So, in total, the Firefox implementation uses 40,167 bytes (39.23 KiB) for its named character reference data, while Ladybird uses 24,412 bytes (23.84 KiB). That's a difference of 15,755 bytes (15.39 KiB), or, in other words, the Ladybird implementation uses 60.8% of the data size of the Firefox implementation.

Ease-of-use🔗

Since, for the purposes of my benchmarking, the Firefox implementation was made to conform to the API of the NamedCharacterReferenceMatcher that is used in the Ladybird implementation, there's no meaningful difference in terms of ease-of-use.

However, I'll take this opportunity to talk about what changes were made to make that happen and how the real Firefox implementation differs.

Very crudely approximated, the implementation went from ~200 SLoC to ~110 SLoC. So, the NamedCharacterReferenceMatcher abstraction may represent a marginal improvement in terms of ease-of-use.

Summary🔗

Overall, the Firefox implementation fares quite well in this comparison.

Comparison with Blink/WebKit (Chrome/Safari)🔗


Blink (the browser engine of Chrome/Chromium) started as a fork of WebKit (the browser engine of Safari), which itself started as a fork of KHTML. There are some differences that have emerged between the two since Blink was forked from WebKit, but for the purposes of this article I'm only going to benchmark against the Blink implementation and assume the results would be roughly the same for the WebKit implementation (the difference mostly comes down to data size, which I'll mention in the "Data size" section later).

Like Firefox, the Chrome/Safari named character reference tokenization does not use a trie. For the 'matching' portion, the Chrome/Safari implementation is actually quite similar in concept to the Firefox implementation:

The main differences from the Firefox implementation are that the Chrome/Safari version (a) only uses the first character to narrow the initial possible matches, and (b) uses binary searches to narrow the possibilities after that instead of linearly moving the lo and hi indexes.
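
To make that concrete, here's a minimal sketch of the 'binary searches to narrow the possibilities' idea (written in Zig with made-up names; this is my own illustration, not Blink's actual code). Given a sorted list of entity names and a range [lo, hi) of entries that still match the characters consumed so far, two binary searches shrink the range to the entries whose character at the current depth matches the next input character:

const std = @import("std");

// Narrow a sorted range of names to those whose byte at `depth` equals `c`.
// Names shorter than depth+1 bytes sort before longer names sharing their
// prefix, so a missing byte compares as "less than" any byte at this depth.
fn narrow(names: []const []const u8, lo: usize, hi: usize, depth: usize, c: u8) [2]usize {
    var left = lo;
    var right = hi;
    // Lower bound: first name whose byte at `depth` is >= c.
    while (left < right) {
        const mid = left + (right - left) / 2;
        const name = names[mid];
        if (name.len <= depth or name[depth] < c) left = mid + 1 else right = mid;
    }
    const new_lo = left;
    // Upper bound: first name whose byte at `depth` is > c.
    right = hi;
    while (left < right) {
        const mid = left + (right - left) / 2;
        const name = names[mid];
        if (name.len > depth and name[depth] > c) right = mid else left = mid + 1;
    }
    return .{ new_lo, left };
}

test "narrowing a sorted range" {
    // After consuming "no", the candidates are "no", "not", and "notin";
    // consuming 't' narrows the range to just "not" and "notin".
    const names = [_][]const u8{ "no", "not", "notin", "nu" };
    const range = narrow(&names, 0, 3, 2, 't');
    try std.testing.expectEqual(@as(usize, 1), range[0]);
    try std.testing.expectEqual(@as(usize, 3), range[1]);
}

In the real implementation, the initial [lo, hi) range is seeded from the first character; after that, each consumed character narrows the range with searches like these.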

Performance🔗

Similar to Firefox, the performance difference in the 'tens of thousands of valid and invalid named character references' benchmark is basically a wash:

Benchmark 1 (88 runs): ./BenchHTMLTokenizer blink
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           115ms ±  943us     113ms …  117ms          0 ( 0%)        0%
  peak_rss           83.3MB ± 99.6KB    83.0MB … 83.5MB          2 ( 2%)        0%
  cpu_cycles          232M  ±  754K      230M  …  234M           0 ( 0%)        0%
  instructions        461M  ± 4.73K      461M  …  461M           0 ( 0%)        0%
  cache_references   9.95M  ±  299K     9.71M  … 12.3M           6 ( 7%)        0%
  cache_misses        412K  ± 5.68K      401K  …  425K           0 ( 0%)        0%
  branch_misses       747K  ± 1.78K      744K  …  757K           2 ( 2%)        0%
Benchmark 2 (88 runs): ./BenchHTMLTokenizer dafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           114ms ± 1.26ms     110ms …  117ms          1 ( 1%)          -  0.6% ±  0.3%
  peak_rss           83.3MB ± 77.5KB    83.1MB … 83.5MB          0 ( 0%)          +  0.0% ±  0.0%
  cpu_cycles          228M  ±  882K      227M  …  233M           5 ( 6%)        ⚡-  1.4% ±  0.1%
  instructions        450M  ± 7.33K      450M  …  450M           4 ( 5%)        ⚡-  2.4% ±  0.0%
  cache_references   9.48M  ±  402K     9.29M  … 13.1M           8 ( 9%)        ⚡-  4.7% ±  1.1%
  cache_misses        412K  ± 7.21K      398K  …  432K           0 ( 0%)          -  0.0% ±  0.5%
  branch_misses       575K  ± 6.48K      571K  …  633K           6 ( 7%)        ⚡- 23.0% ±  0.2%

The DAFSA is faster than Chrome in the all-&supsetneqq; benchmark, but the difference is not as big as it was with Firefox, since Chrome uses binary searches to narrow down the possible matches whereas Firefox uses linear scans:

Benchmark 1 (211 runs): ./BenchHTMLTokenizer blink gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          47.3ms ±  782us    45.0ms … 49.6ms         19 ( 9%)        0%
  peak_rss           53.0MB ± 94.4KB    52.7MB … 53.2MB          2 ( 1%)        0%
  cpu_cycles          123M  ±  807K      122M  …  126M           4 ( 2%)        0%
  instructions        290M  ± 7.46K      290M  …  290M          17 ( 8%)        0%
  cache_references   3.31M  ±  200K     3.19M  … 5.79M           7 ( 3%)        0%
  cache_misses        358K  ± 8.15K      345K  …  401K           1 ( 0%)        0%
  branch_misses       182K  ± 1.82K      178K  …  194K           3 ( 1%)        0%
Benchmark 2 (218 runs): ./BenchHTMLTokenizer dafsa gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          45.9ms ±  663us    43.6ms … 47.3ms         11 ( 5%)        ⚡-  3.0% ±  0.3%
  peak_rss           53.0MB ± 82.6KB    52.7MB … 53.2MB          1 ( 0%)          +  0.1% ±  0.0%
  cpu_cycles          117M  ±  677K      116M  …  120M           4 ( 2%)        ⚡-  5.0% ±  0.1%
  instructions        259M  ± 7.78K      259M  …  259M           9 ( 4%)        ⚡- 10.8% ±  0.0%
  cache_references   3.26M  ± 99.4K     3.17M  … 4.00M          18 ( 8%)          -  1.4% ±  0.9%
  cache_misses        358K  ± 8.14K      342K  …  386K           8 ( 4%)          -  0.2% ±  0.4%
  branch_misses       183K  ± 3.07K      179K  …  214K          17 ( 8%)          +  0.9% ±  0.3%

The DAFSA is once again worse at detecting &cz as invalid, and the results are similar to what they were with Firefox:

Benchmark 1 (160 runs): ./BenchHTMLTokenizer blink ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          62.6ms ± 1.08ms    60.6ms … 64.3ms          0 ( 0%)        0%
  peak_rss           65.1MB ± 79.4KB    64.8MB … 65.2MB          1 ( 1%)        0%
  cpu_cycles          108M  ±  640K      107M  …  111M           3 ( 2%)        0%
  instructions        203M  ± 11.3K      203M  …  203M           7 ( 4%)        0%
  cache_references   5.92M  ± 51.5K     5.82M  … 6.30M          10 ( 6%)        0%
  cache_misses        384K  ± 8.88K      366K  …  409K           0 ( 0%)        0%
  branch_misses       164K  ± 1.44K      160K  …  169K           2 ( 1%)        0%
Benchmark 2 (157 runs): ./BenchHTMLTokenizer dafsa ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          63.9ms ± 1.10ms    61.8ms … 65.5ms          0 ( 0%)        💩+  2.0% ±  0.4%
  peak_rss           65.1MB ± 92.8KB    64.8MB … 65.2MB          4 ( 3%)          -  0.0% ±  0.0%
  cpu_cycles          113M  ±  707K      112M  …  117M           1 ( 1%)        💩+  4.4% ±  0.1%
  instructions        214M  ± 10.9K      214M  …  214M           7 ( 4%)        💩+  5.6% ±  0.0%
  cache_references   5.89M  ± 63.4K     5.76M  … 6.31M           5 ( 3%)          -  0.5% ±  0.2%
  cache_misses        387K  ± 9.20K      367K  …  412K           2 ( 1%)          +  0.7% ±  0.5%
  branch_misses       165K  ± 2.77K      162K  …  197K           5 ( 3%)          +  0.6% ±  0.3%

For the '30,000 valid named character references' benchmark, the DAFSA is slightly faster (similar results as Firefox):

Benchmark 1 (228 runs): ./BenchHTMLTokenizer blink all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          43.9ms ±  844us    41.7ms … 47.0ms         21 ( 9%)        0%
  peak_rss           54.3MB ± 89.2KB    54.0MB … 54.5MB          2 ( 1%)        0%
  cpu_cycles          105M  ±  959K      104M  …  112M           5 ( 2%)        0%
  instructions        204M  ± 6.18K      204M  …  204M           7 ( 3%)        0%
  cache_references   3.78M  ±  137K     3.67M  … 5.37M          13 ( 6%)        0%
  cache_misses        359K  ± 11.2K      345K  …  457K          13 ( 6%)        0%
  branch_misses       466K  ± 1.43K      463K  …  472K           4 ( 2%)        0%
Benchmark 2 (232 runs): ./BenchHTMLTokenizer dafsa all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          43.1ms ±  592us    41.1ms … 44.3ms         13 ( 6%)        ⚡-  1.8% ±  0.3%
  peak_rss           54.4MB ± 79.6KB    54.1MB … 54.5MB          0 ( 0%)          +  0.0% ±  0.0%
  cpu_cycles          102M  ±  487K      101M  …  104M           2 ( 1%)        ⚡-  3.3% ±  0.1%
  instructions        191M  ± 5.33K      191M  …  191M           4 ( 2%)        ⚡-  6.0% ±  0.0%
  cache_references   3.55M  ± 74.9K     3.46M  … 4.21M          10 ( 4%)        ⚡-  6.3% ±  0.5%
  cache_misses        355K  ± 5.67K      344K  …  373K           2 ( 1%)          -  1.0% ±  0.5%
  branch_misses       345K  ± 2.12K      342K  …  374K           2 ( 1%)        ⚡- 26.0% ±  0.1%

In the raw matching speed benchmark, the DAFSA comes out ahead:

Benchmark 1 (70 runs): ./BenchMatcherBlink
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           144ms ± 1.33ms     141ms …  148ms          1 ( 1%)        0%
  peak_rss           4.55MB ± 65.8KB    4.33MB … 4.72MB          0 ( 0%)        0%
  cpu_cycles          590M  ± 2.33M      585M  …  604M           4 ( 6%)        0%
  instructions       1.07G  ± 83.4      1.07G  … 1.07G           0 ( 0%)        0%
  cache_references   12.3M  ±  162K     12.1M  … 13.0M           6 ( 9%)        0%
  cache_misses       27.6K  ± 5.79K     21.1K  … 67.6K           2 ( 3%)        0%
  branch_misses      8.71M  ± 21.1K     8.68M  … 8.79M           1 ( 1%)        0%
Benchmark 2 (85 runs): ./BenchMatcherDafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           119ms ± 1.47ms     116ms …  122ms          0 ( 0%)        ⚡- 17.6% ±  0.3%
  peak_rss           4.52MB ± 65.7KB    4.46MB … 4.59MB          0 ( 0%)          -  0.8% ±  0.5%
  cpu_cycles          484M  ± 4.33M      477M  …  496M           0 ( 0%)        ⚡- 17.8% ±  0.2%
  instructions       1.02G  ± 80.6      1.02G  … 1.02G           0 ( 0%)        ⚡-  4.4% ±  0.0%
  cache_references   6.10M  ±  177K     5.99M  … 7.66M           2 ( 2%)        ⚡- 50.5% ±  0.4%
  cache_misses       25.4K  ± 3.42K     20.4K  … 38.8K           4 ( 5%)        ⚡-  7.9% ±  5.3%
  branch_misses      6.02M  ± 21.7K     5.98M  … 6.06M           0 ( 0%)        ⚡- 30.9% ±  0.1%

Data size🔗

The Chrome implementation uses four arrays:

Altogether, the Chrome implementation uses 41,365 bytes (40.39 KiB) for its named character reference data, while Ladybird uses 24,412 bytes (23.84 KiB). That's a difference of 16,953 bytes (16.56 KiB), or, in other words, the Ladybird implementation uses 59.0% of the data size of the Chrome implementation.

The Safari implementation uses the same four arrays, but has made a few more data size optimizations:

So, the Safari implementation uses either 30,040 bytes (29.34 KiB) if HTMLEntityTableEntry uses 12 bytes, or 21,116 bytes (20.62 KiB) if HTMLEntityTableEntry uses 8 bytes. This means that Safari's data size optimizations (or at least their intended effect) make its data size smaller than Ladybird's (even if the Ladybird implementation tightly bitpacked its values array, it'd still use 229 bytes more than the 8-byte-HTMLEntityTableEntry Safari version). This also shows that the larger data size of the Chrome implementation is not inherent to the approach it uses.

Ease-of-use🔗

For now I'll just say there's no meaningful difference, but there's a caveat that will be discussed later.

Summary🔗

Overall, the Chrome implementation as it currently stands fares about as well as the Firefox implementation in this comparison, but it has some potential strengths/weaknesses of its own. That is, it covers one weakness of the Firefox implementation by using binary searches instead of linear scans, but it always has to narrow down the possibilities from a larger initial range (since it only uses the first character to get the range of possible matches, whereas Firefox uses the first two characters).

The Safari version fares much better in terms of data size (potentially beating out my DAFSA version), and its size optimizations could be applied to the Chrome version as well since the core approach is the same between them.

At this point, though, you might be asking yourself, why don't we try...

Combining Firefox, Chrome, and Safari together🔗

In theory, the best ideas from the Firefox and Chrome/Safari implementations could be combined into one new implementation:

I haven't tested this combination to see how exactly it stacks up, but I would assume it'd be quite good overall.

Something I didn't mention about the Chrome implementation🔗

Since I converted the Chrome implementation to use Ladybird's NamedCharacterReferenceMatcher API in an effort to improve the accuracy of my benchmarking, one major aspect of the Chrome implementation was lost in translation: the Chrome implementation doesn't actually use the character-by-character tokenization strategy we've discussed so far.

Instead, it uses the "lookahead (but never beyond an insertion point) until we're certain we have enough characters ..." strategy mentioned back in the Named character reference tokenization overview. The very high-level summary of the approach (as it is actually implemented in Chrome) is very similar to the description of it in that section:

The downside of the Chrome implementation in particular is actually a choice that was made that's not inherent to the overall approach: they don't preserve any matching state between tokenizer iterations and always backtrack to the & before trying to match again. For example, when matching against something like &notin; that's being written one-character-at-a-time (via document.write), the matching algorithm described in the "Comparison with Blink/WebKit (Chrome/Safari)" section will be executed for each of:

  • &n
  • &no
  • &not
  • &noti
  • &notin
  • &notin;

In theory, the redundant work that's performed in these sorts of scenarios should have a noticeable effect on performance, but, in practice, I wasn't able to prove that out with benchmarking.

Details and results of my benchmarking
  • The test file (inserting lots of valid named character references one-character-at-a-time using document.write)
  • The branch containing the code under test (a faithful adaptation of Blink's lookahead strategy and a runtime flag to switch to an implementation that preserves matching state between retry attempts due to 'not enough characters')
Benchmark 1 (14 runs): ./headless-browser --dump-text pathological-write-valid.html
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.46s  ± 8.11ms    1.45s  … 1.48s           0 ( 0%)        0%
  peak_rss           62.1MB ±  123KB    61.9MB … 62.3MB          0 ( 0%)        0%
  cpu_cycles         5.90G  ± 33.5M     5.88G  … 5.99G           1 ( 7%)        0%
  instructions       29.1G  ±  562K     29.1G  … 29.1G           1 ( 7%)        0%
  cache_references    210M  ±  988K      208M  …  211M           0 ( 0%)        0%
  cache_misses       8.53M  ±  322K     7.92M  … 9.09M           0 ( 0%)        0%
  branch_misses      5.03M  ± 19.2K     5.01M  … 5.07M           0 ( 0%)        0%
Benchmark 2 (14 runs): ./headless-browser --blink-preserve-state --dump-text pathological-write-valid.html
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.46s  ± 4.52ms    1.45s  … 1.47s           0 ( 0%)          -  0.3% ±  0.3%
  peak_rss           62.0MB ±  164KB    61.7MB … 62.4MB          0 ( 0%)          -  0.1% ±  0.2%
  cpu_cycles         5.88G  ± 18.1M     5.86G  … 5.93G           0 ( 0%)          -  0.3% ±  0.4%
  instructions       29.1G  ±  422K     29.1G  … 29.1G           0 ( 0%)          -  0.1% ±  0.0%
  cache_references    209M  ± 1.26M      207M  …  212M           1 ( 7%)          -  0.5% ±  0.4%
  cache_misses       8.55M  ±  425K     8.19M  … 9.78M           1 ( 7%)          +  0.3% ±  3.4%
  branch_misses      5.00M  ± 27.0K     4.97M  … 5.07M           0 ( 0%)          -  0.6% ±  0.4%

So, despite HTMLEntitySearch::Advance being called 4.5x more in the 'no state preserved' benchmark, no difference shows up in the results. I believe this is because the actual matching is a small part of the work being done in this benchmark, or, in other words, there is a difference but it's being drowned out by all the work being done elsewhere (JavaScript being run, tokenizer input being updated, etc). I have a hunch that Ladybird in particular might be greatly exacerbating this effect and making all this tangential work slower than it theoretically needs to be, especially in the case of updating the tokenizer input. For example, Chrome uses a rope-like SegmentedString to mitigate the cost of inserting into the middle of the input, while Ladybird currently reallocates the entire modified input on each insertion.

To sum up what I'm trying to say:

This is important to note because it is the reason I included the following point in my 'list of things you'll need to trust in order for my benchmarking to be accurate' in the intro to the "Comparison to the major browser engines" section:

  • The performance characteristics exhibited would hold when going the other direction (putting my implementation into their tokenizer)

That is, I have some reason to believe 'going the other direction' may actually be slightly more favorable to my DAFSA implementation, as all the code outside of the named character reference tokenization state itself will likely do less to muddy the benchmarking results.

A benefit of the 'lookahead' strategy🔗

An upside of the 'lookahead' strategy is that, when you know there's no active insertion point, you can always match against the full remaining input in one go. This is potentially an improvement over the 'tokenize-one-character-at-a-time' strategy if there is any significant amount of work done between tokenizer iterations. Here are some simple diagrams to illustrate the point:

[Diagram: with the 'character-by-character' approach, a "work done to move to the next character" step is repeated before each of the characters a, b, c, ...]

When using the 'character-by-character' approach, any work that's done to move the input cursor is repeated before each character is matched against.

[Diagram: with the 'lookahead' approach, the characters a, b, c, ... are matched in a tight loop, followed by a single "work done to skip to the end of the match" step]

When using the 'lookahead' approach, matching happens in a tight loop and any work done to move the input cursor only happens once, after the matching is complete.

There are two theoretical benefits I can think of with this:

Luckily, we don't actually need to change much about our 'character-by-character tokenization' approach to get these benefits, as we only need to make it so we use lookahead whenever there's no active insertion point. In Ladybird, that might look something like this (full implementation):

BEGIN_STATE(NamedCharacterReference)
{
    if (stop_at_insertion_point == StopAtInsertionPoint::No) {
        // Use the 'lookahead' approach without needing to worry about insertion points
    } else {
        // Use the character-by-character tokenization approach
    }

    // ...
}

In practice, this seems to give up to a 1.13x speedup in our Ladybird benchmarks essentially for free:

Benchmark results
Benchmark 1 (44 runs): ./BenchHTMLTokenizer dafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           114ms ±  934us     112ms …  115ms          3 ( 7%)        0%
  peak_rss           83.5MB ± 74.9KB    83.3MB … 83.6MB          0 ( 0%)        0%
  cpu_cycles          226M  ±  964K      223M  …  229M           2 ( 5%)        0%
  instructions        452M  ± 6.81K      452M  …  452M           0 ( 0%)        0%
  cache_references   9.64M  ± 86.1K     9.51M  … 9.86M           4 ( 9%)        0%
  cache_misses        418K  ± 10.5K      399K  …  448K           3 ( 7%)        0%
  branch_misses       573K  ± 2.51K      570K  …  584K           3 ( 7%)        0%
Benchmark 2 (47 runs): ./BenchHTMLTokenizer dafsa-lookahead
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           108ms ± 1.05ms     106ms …  111ms          0 ( 0%)        ⚡-  5.1% ±  0.4%
  peak_rss           83.5MB ± 81.5KB    83.3MB … 83.6MB          0 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles          203M  ±  869K      201M  …  205M           0 ( 0%)        ⚡- 10.5% ±  0.2%
  instructions        377M  ± 10.8K      377M  …  377M           1 ( 2%)        ⚡- 16.6% ±  0.0%
  cache_references   9.57M  ± 71.3K     9.42M  … 9.77M           1 ( 2%)          -  0.7% ±  0.3%
  cache_misses        415K  ± 7.59K      401K  …  429K           0 ( 0%)          -  0.7% ±  0.9%
  branch_misses       553K  ± 2.04K      547K  …  556K           0 ( 0%)        ⚡-  3.5% ±  0.2%
Benchmark 1 (109 runs): ./BenchHTMLTokenizer dafsa gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          45.8ms ±  866us    43.7ms … 47.2ms         11 (10%)        0%
  peak_rss           53.2MB ± 67.6KB    53.0MB … 53.4MB         29 (27%)        0%
  cpu_cycles          117M  ±  601K      116M  …  118M           3 ( 3%)        0%
  instructions        261M  ± 6.10K      261M  …  261M           2 ( 2%)        0%
  cache_references   3.27M  ± 62.6K     3.19M  … 3.69M           4 ( 4%)        0%
  cache_misses        354K  ± 4.17K      345K  …  368K           2 ( 2%)        0%
  branch_misses       182K  ± 4.28K      178K  …  213K          10 ( 9%)        0%
Benchmark 2 (124 runs): ./BenchHTMLTokenizer dafsa-lookahead gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          40.4ms ±  756us    38.3ms … 42.0ms         13 (10%)        ⚡- 11.8% ±  0.5%
  peak_rss           53.2MB ± 77.2KB    52.9MB … 53.4MB         40 (32%)          +  0.0% ±  0.0%
  cpu_cycles         93.4M  ±  546K     92.5M  … 95.8M           3 ( 2%)        ⚡- 19.9% ±  0.1%
  instructions        190M  ± 5.41K      190M  …  190M           3 ( 2%)        ⚡- 27.1% ±  0.0%
  cache_references   3.31M  ± 42.2K     3.23M  … 3.48M           3 ( 2%)          +  1.4% ±  0.4%
  cache_misses        354K  ± 5.07K      345K  …  372K           5 ( 4%)          -  0.1% ±  0.3%
  branch_misses       153K  ± 10.5K      146K  …  215K          17 (14%)        ⚡- 16.1% ±  1.2%
Benchmark 1 (79 runs): ./BenchHTMLTokenizer dafsa ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          63.2ms ±  964us    61.6ms … 65.7ms          0 ( 0%)        0%
  peak_rss           65.3MB ± 79.1KB    65.0MB … 65.4MB          0 ( 0%)        0%
  cpu_cycles          112M  ±  606K      111M  …  115M           2 ( 3%)        0%
  instructions        215M  ± 8.83K      215M  …  215M           0 ( 0%)        0%
  cache_references   5.91M  ±  107K     5.78M  … 6.50M           1 ( 1%)        0%
  cache_misses        372K  ± 4.43K      363K  …  390K           2 ( 3%)        0%
  branch_misses       162K  ± 1.17K      160K  …  165K           0 ( 0%)        0%
Benchmark 2 (80 runs): ./BenchHTMLTokenizer dafsa-lookahead ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          62.8ms ± 1.08ms    61.3ms … 64.3ms          0 ( 0%)          -  0.6% ±  0.5%
  peak_rss           65.2MB ± 88.6KB    64.9MB … 65.4MB          1 ( 1%)          -  0.0% ±  0.0%
  cpu_cycles          111M  ±  454K      110M  …  112M           0 ( 0%)        ⚡-  1.4% ±  0.1%
  instructions        209M  ± 7.60K      209M  …  209M           2 ( 3%)        ⚡-  2.7% ±  0.0%
  cache_references   5.91M  ± 68.1K     5.78M  … 6.09M           5 ( 6%)          -  0.1% ±  0.5%
  cache_misses        374K  ± 5.24K      365K  …  397K           1 ( 1%)          +  0.6% ±  0.4%
  branch_misses       164K  ±  964       162K  …  168K           1 ( 1%)          +  1.0% ±  0.2%
Benchmark 1 (115 runs): ./BenchHTMLTokenizer dafsa all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          43.3ms ±  843us    40.9ms … 45.6ms          6 ( 5%)        0%
  peak_rss           54.5MB ± 83.6KB    54.3MB … 54.6MB          1 ( 1%)        0%
  cpu_cycles          100M  ±  744K     98.4M  …  103M           2 ( 2%)        0%
  instructions        193M  ± 11.0K      193M  …  193M          12 (10%)        0%
  cache_references   3.58M  ± 40.3K     3.47M  … 3.70M           3 ( 3%)        0%
  cache_misses        363K  ± 11.3K      344K  …  398K           1 ( 1%)        0%
  branch_misses       344K  ± 1.99K      341K  …  349K           0 ( 0%)        0%
Benchmark 2 (127 runs): ./BenchHTMLTokenizer dafsa-lookahead all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          39.4ms ±  849us    37.2ms … 40.8ms         14 (11%)        ⚡-  9.0% ±  0.5%
  peak_rss           54.5MB ± 84.3KB    54.3MB … 54.6MB          1 ( 1%)          +  0.0% ±  0.0%
  cpu_cycles         84.8M  ±  521K     84.0M  … 87.4M           5 ( 4%)        ⚡- 15.5% ±  0.2%
  instructions        148M  ± 6.20K      148M  …  148M           7 ( 6%)        ⚡- 23.2% ±  0.0%
  cache_references   3.56M  ± 51.3K     3.47M  … 3.82M           4 ( 3%)          -  0.4% ±  0.3%
  cache_misses        356K  ± 7.44K      342K  …  384K           7 ( 6%)        ⚡-  2.0% ±  0.7%
  branch_misses       316K  ± 3.08K      313K  …  347K           3 ( 2%)        ⚡-  8.1% ±  0.2%

This also explains a few things that I glossed over or deferred until later throughout the article:

Another difference I didn't mention🔗

Something else I've failed to mention until now is exactly how backtracking is performed.

In Ladybird, backtracking is very straightforward: modify the input cursor's position by subtracting N from the current position (where N is the number of overconsumed code points). This is safe to do from a code point boundary perspective because the input is always UTF-8 encoded, and named character reference matching will only ever consume ASCII characters, so going back N bytes is guaranteed to be equivalent to going back N code points.
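
As a minimal sketch of that idea (the Cursor type and field names here are my own illustration, not Ladybird's actual code), the backtracking step boils down to a single subtraction:

const std = @import("std");

const Cursor = struct {
    input: []const u8, // always UTF-8
    pos: usize,

    // Matching only ever consumes ASCII, and ASCII is always one byte per
    // code point in UTF-8, so rewinding N code points is exactly rewinding
    // N bytes.
    fn backtrack(self: *Cursor, overconsumed_code_points: usize) void {
        self.pos -= overconsumed_code_points;
    }
};

test "backtracking the &notit; example" {
    // After consuming "&notit" and failing, the longest valid match is
    // "&not", so the two overconsumed code points ("it") are given back.
    var cursor = Cursor{ .input = "I'm &notit; I tell you", .pos = 10 };
    cursor.backtrack(2);
    try std.testing.expectEqual(@as(usize, 8), cursor.pos);
}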

In all the other major browsers, though, backtracking is done by re-inserting the overconsumed code points back into the input stream, at the position just after the current code point. That is, it modifies the input stream such that the next code points to-be-tokenized will be those that were just re-inserted.

As of now, I'm unsure if the way that Firefox/Chrome/Safari do it is necessary (and therefore whether or not Ladybird will need to adopt the same strategy to be fully compliant with the spec). If it is necessary, I'm either unaware of the relevant web platform test, or there is a missing web platform test that checks whatever it's necessary for. If it's not necessary, then there may be an opportunity in the other browser engines to simplify backtracking when dealing with named character references.

Further improvements to the DAFSA implementation🔗

As of the writing of this article, the DAFSA implementation that's been described so far is exactly what's in Ladybird. During the process of writing this, though, I came up with some ideas (largely inspired by the Firefox/Chrome/Safari implementations) to improve my implementation by utilizing every last possible bit I could get my hands on. This came in the form of two independent optimizations, with one of them just so happening to barely make the other possible.

'First layer' acceleration🔗

A property of named character references that I missed, but that the major browser engines all take advantage of, is that the first character of a named character reference is always within the range of a-z or A-Z (inclusive). In terms of the DAFSA, we can take advantage of this property to accelerate the search for the first character: instead of linearly scanning across the child nodes, we can:

  • Verify that the first character is within a-z or A-Z
  • Use the character's alphabetic index to index directly into a 52-element lookup table

This would turn the O(n) search within the 'first layer' of the DAFSA into an O(1) lookup. As for what needs to be stored in the lookup table, remember that we build up a 'unique index' when traversing a list of children, with any child that we iterate over adding its number field to the total:

[Diagram: a list of four sibling nodes a, b, c, d; the number fields of the a and b nodes contribute to the unique index when searching for c]

So, we can pre-compute the accumulated unique index that would result when matching a character in the 'first layer,' and then store that in the lookup table. For example, the relevant data for the first four nodes in the first layer of our DAFSA looks like this:

    .{ .char = 'A', .number = 27 },
    .{ .char = 'B', .number = 12 },
    .{ .char = 'C', .number = 36 },
    .{ .char = 'D', .number = 54 },
    // ...

so the corresponding lookup table entries could look like this:

    0,
    27,
    39, // 27 + 12
    75, // 27 + 12 + 36
    // ...

We can then use char - 'A' (for uppercase characters) to index into the lookup table and instantly get the final unique index that would have resulted from successfully searching for that character normally (for lowercase characters you'd get the index using char - 'a' + 26 to allow for using a lookup array with exactly 52 values).
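
Here's a minimal sketch of that lookup (using only the four real entries shown above; the full table would have 52):

const std = @import("std");

// Accumulated unique indexes for the 'first layer'. Only the entries for
// 'A' through 'D' are real data from above; the full table has 52 entries.
const first_layer = [_]u16{ 0, 27, 39, 75 };

// Maps 'A'...'Z' to slots 0...25 and 'a'...'z' to slots 26...51.
fn firstLayerSlot(c: u8) ?usize {
    return switch (c) {
        'A'...'Z' => c - 'A',
        'a'...'z' => c - 'a' + 26,
        else => null,
    };
}

test "O(1) first layer lookup" {
    // Searching for 'C' normally would iterate over the 'A' and 'B' nodes,
    // summing their number fields (27 + 12); the table gives 39 directly.
    const slot = firstLayerSlot('C').?;
    try std.testing.expectEqual(@as(u16, 39), first_layer[slot]);
}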

This change alone provides quite a big improvement to raw matching speed, since the 'first layer' represents the largest list of children in the DAFSA (by a significant margin):

Benchmark 1 (44 runs): ./bench-master
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           114ms ± 1.28ms     112ms …  116ms          0 ( 0%)        0%
  cpu_cycles          469M  ± 2.40M      466M  …  482M           5 (11%)        0%
  instructions        740M  ± 1.25       740M  …  740M           0 ( 0%)        0%
  cache_references   6.27M  ± 60.6K     6.14M  … 6.48M           1 ( 2%)        0%
  cache_misses       2.05K  ± 5.11K      979   … 34.8K           4 ( 9%)        0%
  branch_misses      5.69M  ± 13.0K     5.67M  … 5.75M           8 (18%)        0%
Benchmark 2 (74 runs): ./bench-first-layer-accel
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          68.1ms ± 1.26ms    66.6ms … 70.3ms          0 ( 0%)        ⚡- 40.2% ±  0.4%
  cpu_cycles          278M  ±  416K      277M  …  279M           3 ( 4%)        ⚡- 40.7% ±  0.1%
  instructions        385M  ± 1.42       385M  …  385M           0 ( 0%)        ⚡- 47.9% ±  0.0%
  cache_references   6.24M  ± 51.2K     6.14M  … 6.38M           1 ( 1%)          -  0.6% ±  0.3%
  cache_misses       1.38K  ±  346      1.02K  … 2.40K           0 ( 0%)          - 32.8% ± 57.4%
  branch_misses      4.79M  ± 4.33K     4.78M  … 4.80M           4 ( 5%)        ⚡- 15.9% ±  0.1%

Linear scan → binary search🔗

The second improvement is a theoretical one for now: what if we could take the time complexity of searching a list of children from O(n) to O(log n) by using a binary search? Let's think about what changes we'd need to make for that to be possible:

  1. Every list of children would need to be sorted
  2. We'd need a way to know the length of a list of children upfront (right now we'd have to iterate through them until we find one with the last_sibling flag set to know where a list of children ends)
  3. We'd need a way to avoid needing to iterate a list of children to build up the unique index

For the first point, this is actually already the case—all lists of children are sorted (note: I'm unsure if this is a coincidental artifact of how I'm constructing my DAFSA, or if it's a common property of a DAFSA generally). For the second point, we could remove the last_sibling flag and replace it with a children_len field. For the third, we already have a model available for this with the lookup table approach we used for the first layer: we could just store an accumulated number in each Node struct instead of a per-node number.

The problem🔗

The problem with all this is that there just aren't enough bits available to store all the information required. For context, here's the DAFSA Node representation being used currently:

const Node = packed struct(u32) {
    char: u8,
    number: u8,
    end_of_word: bool,
    last_sibling: bool,
    _extra: u2 = 0,
    first_child_index: u12,
};

Things to note:

And we want to:

  1. Store a cumulative number instead of a per-node number, which would need 12 bits instead of 8 (4 more bits)
  2. Replace the last_sibling flag with a children_len field, which would need 5 bits instead of 1 (4 more bits, since the longest list of children contains 24 nodes)

Overall, we need to store 8 more bits of information but only have 2 bits to spare, so somehow we'll need to eke out 6 extra bits.

Let's see what we can do...

A free spare bit🔗

As mentioned earlier, named character references only contain characters within the ASCII range. This means that we can use a u7 instead of a u8 for the char field. This gives us an extra spare bit essentially for free (well, it will make the CPU have to do some extra work to retrieve the value of the field, but we'll ignore that for now).

5 bits to go.

Zeroing the first layer numbers🔗

Since we're using a lookup table for the cumulative numbers of the first layer of nodes, there's no reason to store real values in the number fields of those nodes in the DAFSA. Removing those values is also extremely helpful because that's where the largest values appear; the largest cumulative number outside of the first layer is 163 which can be stored in 8 bits (4 bits fewer than we thought we needed for this field).

1 bit to go.

The final bit🔗

The last bit is the trickiest, as they say. Instead of jumping straight to an approach that works well, though, let's try out some interesting approaches that also work but are sub-optimal.

6-bit char🔗

From the expandable "Nitty-gritty DAFSA node size details" section earlier:

[The char field] can technically be represented in 6 bits, since the actual alphabet of characters used in the list of named character references only includes 61 unique characters ('1'...'8', ';', 'a'...'z', 'A'...'Z'). However, to do so you'd need to convert between the 6 bit representation and the actual ASCII value of each character to do comparisons.

Having to do this extra work when performing any comparison against the char field is not ideal, to the point that it makes the binary search version perform quite a bit worse than the linear scan version (at least with the conversion function I came up with).
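
For illustration, one possible shape for such a conversion (my own guess, not necessarily the conversion function mentioned above) is an order-preserving mapping from ASCII to a 6-bit code:

const std = @import("std");

// One possible 6-bit encoding of the 61-character alphabet. It preserves
// ASCII ordering ('1' < ';' < 'A' < 'a'), so sorted lists stay sorted, but
// every comparison against a `char` field now pays for this mapping.
fn encodeChar(c: u8) ?u6 {
    return switch (c) {
        '1'...'8' => @as(u6, @intCast(c - '1')), // 0...7
        ';' => 8,
        'A'...'Z' => @as(u6, @intCast(c - 'A' + 9)), // 9...34
        'a'...'z' => @as(u6, @intCast(c - 'a' + 35)), // 35...60
        else => null,
    };
}

test "encoding preserves ASCII order" {
    try std.testing.expect(encodeChar('Z').? < encodeChar('a').?);
}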

Build the unique index during the binary search🔗

One quite-cool-but-also-not-worth-it approach is to avoid storing full cumulative number fields, and instead store incremental numbers that will result in the correct cumulative number if they are combined together in a specific way while performing a binary search. In practice, what that means is:

Here's a visualization showing how/why this can work, using a particularly relevant list of children (to be discussed afterwards):

[Visualization: a list of sibling nodes whose transformed (incremental) number values combine along the binary search path to produce the correct cumulative number; the node with the character 'w' has the number value 64]

This specific list of children was chosen because 64 (the number value of the node with the character 'w' here) is actually the largest number value in the entire DAFSA after transforming the number fields in this way. This means that instead of 8 bits to store the number field, we only need 7, thus saving the 1 bit that we're looking for.

Yet again, though, this extra work being done while performing the binary search cancels out the benefit of the binary search, and it ends up being marginally slower than the linear scan version.

An actually good approach🔗

The approach I found that works the best is actually quite straightforward to summarize: entirely remove the first layer of nodes from the DAFSA. We can get away with this because we already have a lookup table available for the first layer, so we could theoretically just stuff all the necessary information in there instead of keeping it in the DAFSA.

In practice, it's not quite as simple as that, though. If you're interested in the details, expand below:

Nitty-gritty details of removing the first layer from the DAFSA

After adding the lookup table for the first layer, NamedCharacterReferenceMatcher uses this approach:

  • Start with node_index set to 0
  • If node_index is currently 0, verify that the current character is within a-z or A-Z, and, if so, look up the cumulative unique index in the lookup table. Set the node_index to the <lookup table index> + 1 (since the first layer of nodes in the DAFSA always follow the root node)
  • If node_index is not 0, do DAFSA traversal as normal (search the node's children starting at nodes[node_index].first_child_index)

This means that we're currently relying on the first layer of nodes being in the DAFSA in order to allow node_index to be used as a universal 'cursor' that tracks where we are in the DAFSA. If we simply removed the first layer of nodes, we'd need two separate cursors: one for the lookup table, and one for the DAFSA nodes array. That might not be a bad approach, but what I went for instead is this:

  • Instead of storing a node_index in NamedCharacterReferenceMatcher, store an optional slice of children to check (?[]const Node) that starts as null
  • If children_to_check is null, use the first layer lookup table to get three pieces of information: (1) the cumulative unique index number, (2) the index of the node's first child in the DAFSA array, and (3) the length of the node's list of children. Set children_to_check to a slice of the DAFSA array with the retrieved length, and starting at the retrieved first child index
  • If children_to_check is not null, search within the children_to_check slice as normal

It might not be completely obvious how this saves us a bit, but removing the first layer also happens to remove the largest values of our added children_len field. Somewhat miraculously, the new largest value is 13 (down from 24), so instead of needing 5 bits, we now only need 4 bits for that field.

Putting it all together🔗

After all that, we end up with two struct representations for the data we're looking to store:

One for the first layer lookup table:

const FirstLayerNode = packed struct {
    number: u12,
    child_index: u10,
    children_len: u5,
};

and one for the rest of the DAFSA nodes:

const Node = packed struct(u32) {
    char: u7,
    number: u8,
    end_of_word: bool,
    children_len: u4,
    child_index: u12,
};
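
To make the accumulated number field concrete, here's a minimal sketch (with toy values, not the real generated data) of binary searching a sorted list of children using the Node layout above:

const std = @import("std");

const Node = packed struct(u32) { // same layout as above
    char: u7,
    number: u8, // accumulated: sum of the per-node numbers of earlier siblings
    end_of_word: bool,
    children_len: u4,
    child_index: u12,
};

// A hit yields the unique index contribution in O(1) via the accumulated
// `number`, with no iteration over earlier siblings.
fn findChild(children: []const Node, c: u7) ?Node {
    var lo: usize = 0;
    var hi: usize = children.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        if (children[mid].char < c) {
            lo = mid + 1;
        } else if (children[mid].char > c) {
            hi = mid;
        } else {
            return children[mid];
        }
    }
    return null;
}

test "binary search with accumulated numbers" {
    // Toy data: per-node numbers 5, 3, 7 stored as the prefix sums 0, 5, 8.
    const children = [_]Node{
        .{ .char = 'a', .number = 0, .end_of_word = false, .children_len = 0, .child_index = 0 },
        .{ .char = 'm', .number = 5, .end_of_word = true, .children_len = 0, .child_index = 0 },
        .{ .char = 'p', .number = 8, .end_of_word = false, .children_len = 0, .child_index = 0 },
    };
    const match = findChild(&children, 'p') orelse return error.NotFound;
    try std.testing.expectEqual(@as(u8, 8), match.number);
}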

With this, we can accelerate the search time for the first character drastically, and make it possible to use a binary search for all the rest of the searches we perform. In terms of raw matching speed, this provides another decent improvement over the first layer lookup table alone:

Benchmark 1 (44 runs): ./bench-master
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           115ms ± 1.73ms     112ms …  118ms          0 ( 0%)        0%
  peak_rss            754KB ±  617       750KB …  754KB          1 ( 2%)        0%
  cpu_cycles          471M  ± 2.35M      467M  …  478M           0 ( 0%)        0%
  instructions        740M  ± 1.65       740M  …  740M           0 ( 0%)        0%
  cache_references   6.30M  ± 53.8K     6.14M  … 6.38M           1 ( 2%)        0%
  cache_misses       6.81K  ± 5.18K     1.19K  … 19.4K           6 (14%)        0%
  branch_misses      5.70M  ± 14.3K     5.68M  … 5.74M           1 ( 2%)        0%
Benchmark 2 (73 runs): ./bench-first-layer-accel
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          69.0ms ± 1.70ms    66.7ms … 71.3ms          0 ( 0%)        ⚡- 40.0% ±  0.6%
  cpu_cycles          279M  ± 1.18M      277M  …  282M           0 ( 0%)        ⚡- 40.7% ±  0.1%
  instructions        385M  ± 2.77       385M  …  385M           1 ( 1%)        ⚡- 47.9% ±  0.0%
  cache_references   6.29M  ± 73.4K     6.14M  … 6.51M           1 ( 1%)          -  0.1% ±  0.4%
  cache_misses       8.70K  ± 6.54K     1.13K  … 21.8K           0 ( 0%)          + 27.8% ± 33.7%
  branch_misses      4.79M  ± 9.12K     4.78M  … 4.82M           4 ( 5%)        ⚡- 15.9% ±  0.1%
Benchmark 3 (83 runs): ./bench-first-layer-accel-binary-search-u7-char-u8-number
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          60.6ms ± 3.40ms    58.3ms … 81.0ms          2 ( 2%)        ⚡- 47.3% ±  0.9%
  cpu_cycles          244M  ± 1.35M      242M  …  251M           1 ( 1%)        ⚡- 48.1% ±  0.1%
  instructions        431M  ± 3.45       431M  …  431M           2 ( 2%)        ⚡- 41.8% ±  0.0%
  cache_references   6.30M  ± 87.0K     6.12M  … 6.47M           0 ( 0%)          +  0.1% ±  0.4%
  cache_misses       6.66K  ± 3.61K     1.19K  … 16.4K           1 ( 1%)          -  2.2% ± 22.6%
  branch_misses      4.92M  ± 18.7K     4.89M  … 5.00M          15 (18%)        ⚡- 13.6% ±  0.1%

Taken together (the Benchmark 3 results), these two optimizations make raw matching speed about 1.9x faster than it was originally.

As mentioned, though, these raw matching speed improvements don't translate to nearly the same improvement when benchmarking this new implementation within the Ladybird tokenizer (note also that these benchmarks don't use the 'lookahead when outside of an insertion point' optimization):

Benchmark 1 (89 runs): ./BenchHTMLTokenizer dafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           113ms ± 1.10ms     112ms …  115ms          0 ( 0%)        0%
  peak_rss           83.3MB ± 78.1KB    83.1MB … 83.5MB         30 (34%)        0%
  cpu_cycles          228M  ±  827K      227M  …  231M           3 ( 3%)        0%
  instructions        450M  ± 4.99K      450M  …  450M           0 ( 0%)        0%
  cache_references   9.43M  ±  266K     9.28M  … 11.9M           1 ( 1%)        0%
  cache_misses        404K  ± 4.56K      393K  …  417K           0 ( 0%)        0%
  branch_misses       574K  ± 7.43K      570K  …  641K           2 ( 2%)        0%
Benchmark 2 (92 runs): ./BenchHTMLTokenizer dafsa-binary-search
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           109ms ± 1.10ms     107ms …  111ms          0 ( 0%)        ⚡-  3.8% ±  0.3%
  peak_rss           83.3MB ± 74.0KB    83.1MB … 83.5MB         31 (34%)          +  0.0% ±  0.0%
  cpu_cycles          210M  ±  778K      209M  …  212M           0 ( 0%)        ⚡-  7.9% ±  0.1%
  instructions        417M  ± 4.75K      417M  …  417M           0 ( 0%)        ⚡-  7.3% ±  0.0%
  cache_references   9.39M  ±  124K     9.21M  … 10.4M           4 ( 4%)          -  0.4% ±  0.6%
  cache_misses        407K  ± 5.47K      395K  …  423K           1 ( 1%)          +  0.8% ±  0.4%
  branch_misses       533K  ± 1.50K      530K  …  542K           7 ( 8%)        ⚡-  7.0% ±  0.3%
Benchmark 1 (218 runs): ./BenchHTMLTokenizer dafsa gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          45.8ms ±  658us    43.6ms … 47.0ms         12 ( 6%)        0%
  peak_rss           53.0MB ± 84.2KB    52.7MB … 53.2MB          2 ( 1%)        0%
  cpu_cycles          117M  ±  581K      116M  …  119M           7 ( 3%)        0%
  instructions        259M  ± 5.32K      259M  …  259M           3 ( 1%)        0%
  cache_references   3.25M  ± 82.1K     3.16M  … 4.05M          16 ( 7%)        0%
  cache_misses        355K  ± 5.76K      345K  …  376K           1 ( 0%)        0%
  branch_misses       182K  ± 2.45K      178K  …  200K          16 ( 7%)        0%
Benchmark 2 (236 runs): ./BenchHTMLTokenizer dafsa-binary-search gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          42.4ms ±  921us    40.2ms … 43.8ms          7 ( 3%)        ⚡-  7.5% ±  0.3%
  peak_rss           53.0MB ± 77.8KB    52.7MB … 53.2MB          1 ( 0%)          +  0.0% ±  0.0%
  cpu_cycles          103M  ±  771K      101M  …  106M           6 ( 3%)        ⚡- 12.3% ±  0.1%
  instructions        220M  ± 8.52K      220M  …  220M          13 ( 6%)        ⚡- 15.1% ±  0.0%
  cache_references   3.27M  ±  109K     3.18M  … 4.16M          16 ( 7%)          +  0.4% ±  0.5%
  cache_misses        360K  ± 8.94K      345K  …  394K           1 ( 0%)          +  1.3% ±  0.4%
  branch_misses       189K  ± 7.40K      179K  …  215K           2 ( 1%)        💩+  3.8% ±  0.6%
Benchmark 1 (158 runs): ./BenchHTMLTokenizer dafsa ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          63.2ms ±  859us    61.3ms … 64.6ms          1 ( 1%)        0%
  peak_rss           65.1MB ± 85.1KB    64.8MB … 65.2MB          2 ( 1%)        0%
  cpu_cycles          112M  ±  550K      111M  …  115M           1 ( 1%)        0%
  instructions        214M  ± 7.62K      214M  …  214M           5 ( 3%)        0%
  cache_references   5.87M  ± 82.4K     5.77M  … 6.52M           5 ( 3%)        0%
  cache_misses        374K  ± 4.78K      365K  …  405K           4 ( 3%)        0%
  branch_misses       164K  ± 4.62K      160K  …  219K           1 ( 1%)        0%
Benchmark 2 (163 runs): ./BenchHTMLTokenizer dafsa-binary-search ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          61.4ms ±  833us    59.3ms … 62.8ms          0 ( 0%)        ⚡-  2.9% ±  0.3%
  peak_rss           65.1MB ± 76.6KB    64.8MB … 65.2MB          1 ( 1%)          +  0.0% ±  0.0%
  cpu_cycles          105M  ±  573K      104M  …  107M           1 ( 1%)        ⚡-  6.6% ±  0.1%
  instructions        195M  ± 7.30K      195M  …  195M           1 ( 1%)        ⚡-  8.8% ±  0.0%
  cache_references   5.87M  ± 94.8K     5.79M  … 6.87M           6 ( 4%)          +  0.1% ±  0.3%
  cache_misses        375K  ± 4.47K      365K  …  388K           2 ( 1%)          +  0.1% ±  0.3%
  branch_misses       164K  ± 3.11K      162K  …  202K           3 ( 2%)          +  0.3% ±  0.5%
Benchmark 1 (230 runs): ./BenchHTMLTokenizer dafsa all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          43.3ms ±  635us    41.1ms … 44.9ms         13 ( 6%)        0%
  peak_rss           54.4MB ± 90.7KB    54.0MB … 54.5MB          1 ( 0%)        0%
  cpu_cycles          102M  ±  651K      101M  …  106M           3 ( 1%)        0%
  instructions        191M  ± 8.94K      191M  …  191M          14 ( 6%)        0%
  cache_references   3.55M  ± 48.8K     3.48M  … 3.88M          10 ( 4%)        0%
  cache_misses        358K  ± 8.22K      342K  …  387K           9 ( 4%)        0%
  branch_misses       344K  ± 4.30K      340K  …  405K           2 ( 1%)        0%
Benchmark 2 (247 runs): ./BenchHTMLTokenizer dafsa-binary-search all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          40.4ms ±  754us    38.1ms … 42.3ms         13 ( 5%)        ⚡-  6.7% ±  0.3%
  peak_rss           54.4MB ± 73.5KB    54.1MB … 54.5MB         76 (31%)          +  0.0% ±  0.0%
  cpu_cycles         90.1M  ±  522K     88.8M  … 92.3M           2 ( 1%)        ⚡- 11.8% ±  0.1%
  instructions        170M  ± 7.36K      170M  …  170M          13 ( 5%)        ⚡- 11.0% ±  0.0%
  cache_references   3.54M  ± 32.2K     3.46M  … 3.68M           7 ( 3%)          -  0.3% ±  0.2%
  cache_misses        356K  ± 6.32K      344K  …  379K           1 ( 0%)          -  0.6% ±  0.4%
  branch_misses       314K  ± 1.08K      311K  …  319K           8 ( 3%)        ⚡-  8.8% ±  0.2%

However, it's enough to put this new DAFSA implementation ahead of the Firefox and Chrome/Safari implementations in all of the benchmarks I'm using:

Benchmark 1 (91 runs): ./BenchHTMLTokenizer dafsa-binary-search
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           110ms ±  956us     108ms …  111ms          0 ( 0%)        0%
  peak_rss           83.5MB ± 79.4KB    83.2MB … 83.6MB          1 ( 1%)        0%
  cpu_cycles          212M  ± 1.31M      209M  …  217M           3 ( 3%)        0%
  instructions        420M  ± 9.03K      420M  …  420M           5 ( 5%)        0%
  cache_references   9.57M  ±  186K     9.36M  … 10.9M           2 ( 2%)        0%
  cache_misses        405K  ± 5.61K      394K  …  421K           0 ( 0%)        0%
  branch_misses       535K  ± 1.70K      532K  …  540K           0 ( 0%)        0%
Benchmark 2 (89 runs): ./BenchHTMLTokenizer gecko
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           113ms ± 1.02ms     111ms …  115ms          0 ( 0%)        💩+  2.6% ±  0.3%
  peak_rss           83.6MB ± 70.9KB    83.3MB … 83.6MB          0 ( 0%)          +  0.1% ±  0.0%
  cpu_cycles          225M  ± 1.28M      223M  …  234M           2 ( 2%)        💩+  5.8% ±  0.2%
  instructions        441M  ± 7.17K      441M  …  441M           7 ( 8%)        💩+  5.0% ±  0.0%
  cache_references   9.85M  ±  227K     9.64M  … 11.3M           4 ( 4%)        💩+  2.9% ±  0.6%
  cache_misses        411K  ± 5.54K      402K  …  431K           2 ( 2%)        💩+  1.5% ±  0.4%
  branch_misses       581K  ± 19.0K      575K  …  758K           7 ( 8%)        💩+  8.5% ±  0.7%
Benchmark 3 (88 runs): ./BenchHTMLTokenizer blink
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           115ms ±  856us     113ms …  117ms          0 ( 0%)        💩+  4.5% ±  0.2%
  peak_rss           83.5MB ± 72.4KB    83.2MB … 83.6MB         26 (30%)          -  0.0% ±  0.0%
  cpu_cycles          232M  ±  940K      230M  …  235M           0 ( 0%)        💩+  9.4% ±  0.2%
  instructions        463M  ± 8.80K      463M  …  463M           6 ( 7%)        💩+ 10.4% ±  0.0%
  cache_references   10.2M  ±  141K     9.94M  … 10.9M           2 ( 2%)        💩+  6.1% ±  0.5%
  cache_misses        410K  ± 5.40K      398K  …  424K           0 ( 0%)          +  1.1% ±  0.4%
  branch_misses       751K  ± 1.90K      747K  …  755K           0 ( 0%)        💩+ 40.3% ±  0.1%
Benchmark 1 (234 runs): ./BenchHTMLTokenizer dafsa-binary-search gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          42.6ms ±  717us    40.3ms … 44.2ms         17 ( 7%)        0%
  peak_rss           53.2MB ± 90.0KB    52.8MB … 53.4MB         88 (38%)        0%
  cpu_cycles          103M  ±  604K      101M  …  105M           2 ( 1%)        0%
  instructions        222M  ± 6.94K      222M  …  222M           9 ( 4%)        0%
  cache_references   3.27M  ± 96.0K     3.19M  … 3.98M          15 ( 6%)        0%
  cache_misses        356K  ± 5.77K      344K  …  377K           1 ( 0%)        0%
  branch_misses       182K  ± 3.18K      178K  …  215K          11 ( 5%)        0%
Benchmark 2 (198 runs): ./BenchHTMLTokenizer gecko gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          50.5ms ±  831us    48.5ms … 52.4ms         20 (10%)        💩+ 18.7% ±  0.3%
  peak_rss           53.2MB ± 84.3KB    53.0MB … 53.4MB          1 ( 1%)          +  0.1% ±  0.0%
  cpu_cycles          138M  ±  610K      136M  …  139M           3 ( 2%)        💩+ 33.8% ±  0.1%
  instructions        280M  ± 7.27K      280M  …  280M           6 ( 3%)        💩+ 26.3% ±  0.0%
  cache_references   3.34M  ±  118K     3.22M  … 4.58M           9 ( 5%)        💩+  2.1% ±  0.6%
  cache_misses        356K  ± 5.10K      344K  …  372K           3 ( 2%)          -  0.1% ±  0.3%
  branch_misses       315K  ± 5.57K      305K  …  346K           4 ( 2%)        💩+ 72.8% ±  0.5%
Benchmark 3 (213 runs): ./BenchHTMLTokenizer blink gecko-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          47.1ms ±  620us    44.9ms … 48.3ms         18 ( 8%)        💩+ 10.6% ±  0.3%
  peak_rss           53.2MB ± 84.0KB    52.8MB … 53.4MB          1 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles          122M  ±  745K      121M  …  127M           2 ( 1%)        💩+ 18.5% ±  0.1%
  instructions        292M  ± 5.64K      292M  …  292M           4 ( 2%)        💩+ 31.7% ±  0.0%
  cache_references   3.30M  ±  139K     3.19M  … 4.78M          24 (11%)          +  0.9% ±  0.7%
  cache_misses        355K  ± 5.58K      343K  …  371K           7 ( 3%)          -  0.4% ±  0.3%
  branch_misses       183K  ±  710       180K  …  185K           3 ( 1%)          +  0.2% ±  0.2%
Benchmark 1 (160 runs): ./BenchHTMLTokenizer dafsa-binary-search ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          62.4ms ± 1.16ms    60.4ms … 64.8ms          0 ( 0%)        0%
  peak_rss           65.3MB ± 81.8KB    64.9MB … 65.4MB         45 (28%)        0%
  cpu_cycles          107M  ±  850K      105M  …  114M           3 ( 2%)        0%
  instructions        196M  ± 12.0K      196M  …  196M          14 ( 9%)        0%
  cache_references   5.92M  ± 73.9K     5.81M  … 6.21M           3 ( 2%)        0%
  cache_misses        386K  ± 7.90K      369K  …  409K           1 ( 1%)        0%
  branch_misses       164K  ± 1.74K      161K  …  179K           9 ( 6%)        0%
Benchmark 2 (161 runs): ./BenchHTMLTokenizer gecko ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          62.3ms ±  988us    59.7ms … 64.1ms          0 ( 0%)          -  0.1% ±  0.4%
  peak_rss           65.3MB ± 79.2KB    65.0MB … 65.4MB          2 ( 1%)          +  0.1% ±  0.0%
  cpu_cycles          106M  ±  618K      104M  …  108M           3 ( 2%)          -  0.8% ±  0.2%
  instructions        195M  ± 12.6K      195M  …  195M          13 ( 8%)          -  0.8% ±  0.0%
  cache_references   6.03M  ±  169K     5.84M  … 6.99M           2 ( 1%)        💩+  1.9% ±  0.5%
  cache_misses        386K  ± 7.61K      370K  …  413K           2 ( 1%)          -  0.1% ±  0.4%
  branch_misses       165K  ± 1.02K      163K  …  169K           2 ( 1%)          +  0.4% ±  0.2%
Benchmark 3 (158 runs): ./BenchHTMLTokenizer blink ladybird-worst-case
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          63.4ms ± 1.22ms    61.0ms … 65.3ms          0 ( 0%)        💩+  1.7% ±  0.4%
  peak_rss           65.2MB ± 78.2KB    65.0MB … 65.4MB          1 ( 1%)          -  0.0% ±  0.0%
  cpu_cycles          109M  ±  789K      107M  …  115M           2 ( 1%)        💩+  2.0% ±  0.2%
  instructions        204M  ± 11.7K      203M  …  204M           4 ( 3%)        💩+  3.7% ±  0.0%
  cache_references   6.00M  ± 90.0K     5.85M  … 6.25M           0 ( 0%)        💩+  1.4% ±  0.3%
  cache_misses        388K  ± 8.11K      371K  …  409K           0 ( 0%)          +  0.6% ±  0.5%
  branch_misses       165K  ± 1.24K      162K  …  173K           1 ( 1%)          +  0.4% ±  0.2%
Benchmark 1 (244 runs): ./BenchHTMLTokenizer dafsa-binary-search all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          41.0ms ±  843us    38.3ms … 42.2ms         21 ( 9%)        0%
  peak_rss           54.5MB ± 87.9KB    54.3MB … 54.6MB          4 ( 2%)        0%
  cpu_cycles         90.6M  ±  591K     89.6M  … 92.6M           2 ( 1%)        0%
  instructions        172M  ± 9.67K      172M  …  172M          19 ( 8%)        0%
  cache_references   3.56M  ±  117K     3.47M  … 5.05M          14 ( 6%)        0%
  cache_misses        359K  ± 9.17K      343K  …  399K           1 ( 0%)        0%
  branch_misses       316K  ± 4.49K      312K  …  382K           1 ( 0%)        0%
Benchmark 2 (229 runs): ./BenchHTMLTokenizer gecko all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          43.7ms ±  735us    41.4ms … 44.9ms         18 ( 8%)        💩+  6.6% ±  0.3%
  peak_rss           54.6MB ± 85.1KB    54.3MB … 54.8MB          2 ( 1%)          +  0.1% ±  0.0%
  cpu_cycles          103M  ±  460K      102M  …  105M           5 ( 2%)        💩+ 13.6% ±  0.1%
  instructions        189M  ± 4.69K      189M  …  189M           4 ( 2%)        💩+ 10.1% ±  0.0%
  cache_references   3.65M  ± 93.4K     3.54M  … 4.77M          14 ( 6%)        💩+  2.5% ±  0.5%
  cache_misses        356K  ± 5.90K      344K  …  377K           5 ( 2%)          -  0.9% ±  0.4%
  branch_misses       385K  ±  861       383K  …  388K           2 ( 1%)        💩+ 21.8% ±  0.2%
Benchmark 3 (224 runs): ./BenchHTMLTokenizer blink all-valid
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          44.6ms ±  885us    42.1ms … 46.2ms         23 (10%)        💩+  8.9% ±  0.4%
  peak_rss           54.5MB ± 87.3KB    54.1MB … 54.6MB         88 (39%)          -  0.1% ±  0.0%
  cpu_cycles          106M  ±  654K      105M  …  109M           2 ( 1%)        💩+ 17.2% ±  0.1%
  instructions        205M  ± 9.02K      205M  …  205M          10 ( 4%)        💩+ 19.4% ±  0.0%
  cache_references   3.80M  ±  103K     3.69M  … 5.07M           9 ( 4%)        💩+  6.5% ±  0.6%
  cache_misses        361K  ± 9.82K      345K  …  396K           3 ( 1%)          +  0.5% ±  0.5%
  branch_misses       464K  ± 1.33K      460K  …  469K           6 ( 3%)        💩+ 46.7% ±  0.2%
Benchmark 1 (67 runs): ./BenchMatcherDafsaBinarySearch
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          74.6ms ± 1.32ms    72.1ms … 76.4ms          0 ( 0%)        0%
  peak_rss           4.57MB ± 70.4KB    4.46MB … 4.72MB         20 (30%)        0%
  cpu_cycles          299M  ± 1.18M      296M  …  301M           1 ( 1%)        0%
  instructions        471M  ± 79.5       471M  …  471M           0 ( 0%)        0%
  cache_references   6.03M  ± 49.6K     5.90M  … 6.18M           2 ( 3%)        0%
  cache_misses       26.0K  ± 4.18K     20.5K  … 37.4K           1 ( 1%)        0%
  branch_misses      5.28M  ± 53.2K     5.14M  … 5.37M           3 ( 4%)        0%
Benchmark 2 (48 runs): ./BenchMatcherGecko
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           105ms ± 1.18ms     103ms …  107ms          0 ( 0%)        💩+ 40.8% ±  0.6%
  peak_rss           4.57MB ± 60.4KB    4.46MB … 4.72MB         11 (23%)          -  0.1% ±  0.5%
  cpu_cycles          426M  ± 1.48M      424M  …  430M           2 ( 4%)        💩+ 42.4% ±  0.2%
  instructions        745M  ± 68.8       745M  …  745M           8 (17%)        💩+ 58.2% ±  0.0%
  cache_references   8.04M  ± 77.0K     7.95M  … 8.48M           1 ( 2%)        💩+ 33.3% ±  0.4%
  cache_misses       27.1K  ± 5.19K     21.3K  … 44.7K           6 (13%)          +  4.3% ±  6.7%
  branch_misses      5.41M  ± 2.77K     5.41M  … 5.42M           2 ( 4%)        💩+  2.5% ±  0.3%
Benchmark 3 (36 runs): ./BenchMatcherBlink
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           140ms ± 1.66ms     138ms …  146ms          1 ( 3%)        💩+ 88.2% ±  0.8%
  peak_rss           4.60MB ± 75.3KB    4.46MB … 4.72MB         12 (33%)          +  0.7% ±  0.6%
  cpu_cycles          573M  ± 4.44M      568M  …  594M           4 (11%)        💩+ 91.6% ±  0.4%
  instructions       1.07G  ± 70.5      1.07G  … 1.07G           7 (19%)        💩+126.3% ±  0.0%
  cache_references   12.3M  ±  107K     12.2M  … 12.7M           2 ( 6%)        💩+104.6% ±  0.5%
  cache_misses       28.3K  ± 5.50K     21.3K  … 43.8K           3 ( 8%)        💩+  9.1% ±  7.4%
  branch_misses      8.78M  ± 10.3K     8.75M  … 8.80M           1 ( 3%)        💩+ 66.2% ±  0.3%

So, at long last, we're at the point where the title (hopefully) becomes justified: this improved DAFSA implementation seems to be slightly better across the board.

Future possibilities🔗

One funny aspect of this whole thing is that the problem is actually quite simple once you understand it, and there are probably a lot of different ways one could approach it. If you've read this far, it's very likely that you have some ideas of your own on how to make something better: either a whole different approach, or an improvement to some part of one of the approaches detailed so far.

I'll outline some avenues I think might warrant some further attention, but I also expect that I'll miss things that someone else may consider obvious.

DAFSA with first-two-character acceleration🔗

In the last section, I took inspiration from the other browsers' implementations by adding a lookup table to accelerate the search for the first character, but it'd also be possible to take one more page from the Firefox implementation and do the same thing for the second character, too.
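
To make the idea concrete, here's a rough sketch of what a two-layer lookup might look like (the names, types, and table layout here are illustrative assumptions of mine, not the actual implementation): map the first two characters to a contiguous range of candidate nodes, letting the search skip the first two levels entirely.

const PairRange = struct {
  start: u16, // index of the first candidate node for this character pair
  len: u8, // number of candidates (0 means no name starts with this pair)
};

fn alphaIndex(c: u8) ?usize {
  return switch (c) {
    'a'...'z' => c - 'a',
    'A'...'Z' => c - 'A' + 26,
    else => null,
  };
}

fn lookupPair(table: *const [52 * 52]PairRange, first: u8, second: u8) ?PairRange {
  const i = alphaIndex(first) orelse return null;
  const j = alphaIndex(second) orelse return null;
  const range = table[i * 52 + j];
  return if (range.len == 0) null else range;
}

If the entries are packed to 3 bytes apiece, 52 × 52 of them works out to roughly the '8 KiB more data' mentioned below.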

I have actually tried this out, and overall, the implementation I came up with cut the raw lookup time by around 16% (as measured by the benchmark I'm using, at least):

Benchmark 1 (169 runs): ./bench-first-layer-accel-binary-search
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          59.4ms ± 1.20ms    58.0ms … 61.4ms          0 ( 0%)        0%
  cpu_cycles          242M  ±  797K      241M  …  250M          14 ( 8%)        0%
  instructions        431M  ± 1.28       431M  …  431M           0 ( 0%)        0%
  cache_references   6.21M  ± 74.9K     6.10M  … 6.99M           2 ( 1%)        0%
  cache_misses       1.53K  ±  629      1.01K  … 5.03K          16 ( 9%)        0%
  branch_misses      4.88M  ± 4.28K     4.87M  … 4.90M           6 ( 4%)        0%
Benchmark 2 (201 runs): ./bench-two-layer-accel-linear-search
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          49.7ms ± 2.64ms    47.4ms … 76.2ms         10 ( 5%)        ⚡- 16.2% ±  0.7%
  cpu_cycles          198M  ± 1.20M      197M  …  207M          11 ( 5%)        ⚡- 18.2% ±  0.1%
  instructions        332M  ± 4.83       332M  …  332M          26 (13%)        ⚡- 22.8% ±  0.0%
  cache_references   7.29M  ± 48.6K     7.19M  … 7.44M           1 ( 0%)        💩+ 17.3% ±  0.2%
  cache_misses       1.56K  ±  442      1.09K  … 3.45K           7 ( 3%)          +  1.8% ±  7.2%
  branch_misses      3.96M  ± 7.41K     3.95M  … 4.00M           7 ( 3%)        ⚡- 18.7% ±  0.0%

So, for 8 KiB more data you can get another decent performance improvement, but so far I've only implemented this version in Zig, so I can't report more on it yet (I'd need to port it to C++ to test it with Ladybird). This also represents my first attempt at the 'two-layer acceleration' strategy, so it's possible there's more juice to squeeze here.

SIMD🔗

Single instruction, multiple data (SIMD) is something I'd had very little experience with before this point, and my naive attempts at using SIMD to accelerate my DAFSA implementation were not fruitful. However, it seems like there's potential to take advantage of SIMD for this type of problem (if not with a DAFSA, then with some totally different approach that plays to what SIMD is good at).
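
For what it's worth, here's a minimal sketch of the kind of thing I mean (purely illustrative and unbenchmarked; it also assumes a node's child characters are stored as one contiguous array, which differs from the array-of-structs layout I'm actually using): compare 16 child characters against the input character at once.

const std = @import("std");

fn findChildSimd(child_chars: []const u8, c: u8) ?usize {
  const V = @Vector(16, u8);
  var i: usize = 0;
  // Compare 16 child characters at a time against the input character.
  while (i + 16 <= child_chars.len) : (i += 16) {
    const chunk: V = child_chars[i..][0..16].*;
    const matches = chunk == @as(V, @splat(c));
    if (std.simd.firstTrue(matches)) |lane| return i + lane;
  }
  // Scalar fallback for the remaining (fewer than 16) children.
  while (i < child_chars.len) : (i += 1) {
    if (child_chars[i] == c) return i;
  }
  return null;
}

The obvious catch is that most DAFSA nodes only have a handful of children, so there may rarely be enough work per search to pay for the SIMD setup, which could be part of why my naive attempts didn't pan out.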

Data-oriented design🔗

As I understand it, the core idea of data-oriented design is: instead of using an 'array of structs', use a 'struct of arrays' where each array holds segments of data that are frequently accessed together. If applied well, this can both cut down on wasted padding bits between elements and make your code much more CPU-cache-friendly.

Example for those unfamiliar with data-oriented design

For example:

const Struct = struct {
  foo: u8,
  bar: u16,
};
const array_of_structs = [100]Struct{ ... };

With the above, each element will have 8 bits of padding (the fields of Struct use 3 bytes, but @sizeOf(Struct) is 4 bytes), and array_of_structs will use 400 bytes. Additionally, if you have a loop where you're only accessing one field like this:

for (&array_of_structs) |element| {
  if (element.foo == '!') return true;
}

then you're accidentally paying the cost of the larger Struct size since fewer will fit into cache. If we move to a 'struct of arrays' approach instead like so:

const StructOfArrays = struct {
  foos: [100]u8,
  bars: [100]u16,
};
const struct_of_arrays = StructOfArrays{ ... };

then we're only using 300 bytes for these two arrays, and if we write a loop that only looks at foos like so:

for (&struct_of_arrays.foos) |foo| {
  if (foo == '!') return true;
}

it will be able to benefit from the foo values being packed densely together in memory, meaning more of them fit into the cache at a time.

This is something I experimented quite a bit with, but never got results from. I believe the problem is that the access patterns of the DAFSA don't really benefit from the 'struct of arrays' approach, even though it seems like they might. The char field is accessed repeatedly while searching a list of children, but all the other fields of the Node are almost always accessed after that search is finished, so any benefit we get from having the char fields contiguous, we lose just as much from having the other fields farther away from their associated char. As far as I can tell, it's overall equally-or-more efficient to just use a plain old array-of-structs for our DAFSA nodes.
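
Here's a stripped-down illustration of that access pattern (the field names are assumed for the sake of the example): the hot loop touches only char, but the matching node's remaining data is needed immediately afterwards, exactly once, so keeping it adjacent to char costs almost nothing.

const Node = struct {
  char: u8,
  rest: u24, // stand-in for the node's other packed fields
};

fn findChild(nodes: []const Node, first_child: usize, child_count: usize, c: u8) ?Node {
  var i = first_child;
  while (i < first_child + child_count) : (i += 1) {
    // Hot: sequential .char reads across the sibling nodes
    if (nodes[i].char == c) return nodes[i];
  }
  return null;
}

The caller then reads rest from the returned node exactly once, so splitting char out into its own array would only push rest farther away.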

Entirely different data structures🔗

I effectively pulled the DAFSA out of a hat, without surveying the possibilities much. Someone more well versed in the field of data structures will likely have some ideas about what's out there that could work better.

Wrapping up🔗

This article continually and relentlessly grew in scope, and has ended up quite a bit more in-depth than I originally imagined (this is something that's familiar to me, unfortunately). If you've read this far, thank you, and I hope you were able to get something out of it.

Throughout the process of writing, I've accrued a number of improvements that I can make on top of my original Ladybird pull request.

So, a new pull request (or a few) to Ladybird will be forthcoming with some combination of these changes. However, I expect that exactly what those future PR(s) will look like may be shaped by the feedback I receive from this post, as I remain confident that better approaches than mine are out there, and, if you've read this article, you have all the knowledge necessary (and then some) to come up with an implementation of your own.

Finally, I'll leave you with some links: