Pathological HTML

This fixture contains unusual content that the walker must handle without crashing and without emitting a broken DOM. The text includes every kind of awkward character a real customer document might contain.

Smart quotes and dashes

A “smart quote” pair (U+201C and U+201D) wraps prose like this one. The term smart quote appears in regular prose; the curly quotes around the phrase “smart quote” are NOT part of the term name, only the bare word is. Same for an em-dash, written as — (U+2014), and an en-dash, written as – (U+2013). Sentences using em-dash for parenthetical clauses — like this one — are common in docs and the walker must not be confused by the dash characters when scanning for word boundaries.
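As a rough illustration of the word-boundary requirement, the following Python sketch matches a term only when it is not flanked by word characters. The helper name `find_term` and the sample string are hypothetical, and the real walker may scan differently. The point is only that curly quotes and dashes are non-word characters, so they delimit a term without becoming part of it.

```python
import re


def find_term(text: str, term: str) -> list[tuple[int, int]]:
    """Return (start, end) spans where `term` occurs as a whole word.

    Hypothetical helper: lookarounds are used instead of \\b so the check
    also works for terms that begin or end with a non-word character.
    """
    pattern = re.compile(
        r"(?<!\w)" + re.escape(term) + r"(?!\w)",
        re.IGNORECASE,
    )
    return [m.span() for m in pattern.finditer(text)]


# U+201C / U+201D wrap the term; U+2014 is used with and without spaces.
sample = "A \u201csmart quote\u201d pair \u2014 like this\u2014is common."
spans = find_term(sample, "smart quote")
```

Here `spans` covers only the bare words, excluding the surrounding curly quotes, and a word that touches an em-dash directly (such as "this" in the sample) is still found as a whole word.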

Non-printing characters

A non-breaking space ( ) renders as a space but doesn’t break: term nbsp appears here with one non-breaking space between “term” and “nbsp”. The walker should still mark “nbsp” because the word boundary is intact.
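A small Python check of the claim above, assuming a regex-based boundary test (the walker's actual mechanism is not specified here). U+00A0 counts as whitespace and not as a word character, so the word after it still starts at a boundary.

```python
import re

NBSP = "\u00a0"

# "term" and "nbsp" separated by exactly one non-breaking space.
text = "term" + NBSP + "nbsp appears here"

# The term must not be flanked by word characters on either side.
match = re.search(r"(?<!\w)nbsp(?!\w)", text)

nbsp_is_space = NBSP.isspace()                        # treated as whitespace
nbsp_is_word_char = re.fullmatch(r"\w", NBSP) is not None
```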

A soft hyphen () is invisible until a line break occurs at its position: “soft hyphen” written as softhyphen renders identical to “soft hyphen” unless the browser breaks the line there. The walker should treat the soft hyphen as a zero-width character and mark “soft hyphen” as one term.
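One way to treat the soft hyphen as zero-width without rewriting the text is to let the search pattern tolerate an optional U+00AD between any two characters of the term. This is a sketch under that assumption; `shy_tolerant_pattern` is a hypothetical name, not part of the walker. Matching against the original string keeps the offsets valid for marking.

```python
import re

SHY = "\u00ad"  # soft hyphen


def shy_tolerant_pattern(term: str) -> re.Pattern[str]:
    """Build a whole-word pattern that ignores soft hyphens inside the term."""
    body = (SHY + "?").join(re.escape(ch) for ch in term)
    return re.compile(r"(?<!\w)" + body + r"(?!\w)")


pattern = shy_tolerant_pattern("hyphen")

with_shy = "a hy" + SHY + "phen here"
without_shy = "a hyphen here"

span_with = pattern.search(with_shy).span()
span_without = pattern.search(without_shy).span()
```

The span for the soft-hyphenated text is one character longer, because the invisible U+00AD stays inside the marked range rather than being removed.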

The byte order mark (BOM, U+FEFF) is invisible. A file might start with one; a paragraph generally should not, but text pasted from Notion sometimes does. The walker must not mistake a BOM-prefixed word for a different word.

A control char in the range U+0000 through U+001F should be stripped or left alone, never interpreted as text. The same goes for U+007F (DEL).
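The BOM and control-character rules above could be sketched as a single cleanup pass applied to text before term lookup. This picks the "stripped" option of the two the text allows. Keeping tab, line feed, and carriage return is an assumption of this sketch, made so that words on adjacent lines are not glued together. `clean_invisibles` is a hypothetical name.

```python
BOM = "\ufeff"

# Assumption of this sketch: whitespace controls are kept so that
# removing them does not merge neighbouring words.
KEPT_CONTROLS = {"\t", "\n", "\r"}


def clean_invisibles(text: str) -> str:
    """Drop U+FEFF, U+0000..U+001F (except kept whitespace) and U+007F."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == BOM:
            continue
        if (cp <= 0x1F or cp == 0x7F) and ch not in KEPT_CONTROLS:
            continue
        out.append(ch)
    return "".join(out)


cleaned = clean_invisibles("\ufeffBOM\x00 text\x7f\n")
```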

HTML entities

The text “&copy; 2026” renders as “© 2026”, with a literal “©” character. Both forms, the HTML entity form (&copy;) and the numeric entity form (&#169;), should yield the same character at the DOM level, and the terms “HTML entity” and “numeric entity” should each be marked once in this paragraph.
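The equivalence of the two entity forms can be checked with Python's standard `html.unescape`, used here as a stand-in for the browser's parser. The decimal and hexadecimal spellings shown are examples of a numeric entity for U+00A9 chosen for this sketch.

```python
import html

named = html.unescape("&copy; 2026")       # named HTML entity
decimal = html.unescape("&#169; 2026")     # numeric entity, decimal
hexadecimal = html.unescape("&#xA9; 2026") # numeric entity, hexadecimal

# All three decode to the same string containing the single
# code point U+00A9.
all_equal = named == decimal == hexadecimal
```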

Combining marks and ligatures

A combining mark like the combining acute (U+0301) attaches to the preceding base character: “café” can be written as “café” (U+00E9 single code point) or as “café” (U+0065 + U+0301 two code points). Visually identical. The walker should treat both forms as the same string when looking up term names.
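A minimal sketch of treating both spellings as the same lookup key, assuming the walker normalizes with Unicode NFC before comparing (`lookup_key` is a hypothetical name). The standard `unicodedata` module composes the base letter and combining acute into the single precomposed code point.

```python
import unicodedata


def lookup_key(s: str) -> str:
    """Normalize to NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", s)


precomposed = "caf\u00e9"   # U+00E9, one code point for the final letter
decomposed = "cafe\u0301"   # U+0065 + U+0301, two code points

raw_equal = precomposed == decomposed
keys_equal = lookup_key(precomposed) == lookup_key(decomposed)
```

The raw strings differ in length (4 versus 5 code points) and compare unequal, while their normalized keys match.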

A ligature like “ﬁ” (U+FB01) renders the same as “fi” (two code points). Some hosts normalize, some don’t. If the term “file” is in the glossary, but the prose has “ﬁle” with the U+FB01 ligature character, do we mark? Policy decision: NFKC-normalize at extraction time, so the ligature in prose still finds the unligated term. NFC alone is not enough here, because U+FB01 has only a compatibility decomposition and NFC leaves it intact.
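The difference between the normalization forms matters for this case, and can be checked with the standard `unicodedata` module. U+FB01 has a compatibility decomposition only, so NFC leaves it untouched while NFKC expands it to the two letters "f" and "i". The variable names are illustrative.

```python
import unicodedata

ligated = "\ufb01le"  # U+FB01 followed by "le"

nfc = unicodedata.normalize("NFC", ligated)
nfkc = unicodedata.normalize("NFKC", ligated)

nfc_changed = nfc != ligated   # canonical normalization keeps the ligature
nfkc_matches = nfkc == "file"  # compatibility normalization expands it
```

NFKC also folds other compatibility characters (full-width letters, superscript digits, and so on), so whether that wider folding is acceptable for glossary lookup is a separate decision this sketch does not settle.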

A surrogate pair is required in UTF-16 to represent code points beyond the BMP (U+10000 and above). Most emoji are therefore surrogate pairs in UTF-16. The walker must operate on code points, not on UTF-16 code units, to avoid splitting a pair.
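To make the code point versus code unit distinction concrete, this Python sketch encodes a single non-BMP character to UTF-16 and lists the resulting 16-bit units. Python strings are already indexed by code point, so the UTF-16 view has to be produced explicitly; `utf16_units` is a hypothetical helper.

```python
def utf16_units(s: str) -> list[int]:
    """Return the UTF-16 code units of `s` as integers."""
    data = s.encode("utf-16-le")
    return [
        int.from_bytes(data[i:i + 2], "little")
        for i in range(0, len(data), 2)
    ]


fox = "\U0001F98A"  # fox emoji, a code point above U+FFFF

code_points = len(fox)
units = utf16_units(fox)

# Each half of the pair falls in the surrogate ranges and is not a
# character on its own.
high_ok = 0xD800 <= units[0] <= 0xDBFF
low_ok = 0xDC00 <= units[1] <= 0xDFFF
```

An environment that indexes strings by UTF-16 code unit would report a length of 2 for this one character, which is where a naive scanner can cut a pair in half.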

Emoji and ZWJ sequences

An emoji like 🦊 is one code point. A multi-glyph emoji like 👨‍👩‍👧‍👦 (family) is a sequence of base emoji joined by ZWJ (zero-width joiner, U+200D). The full sequence renders as one glyph but is seven code points (👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦). The walker must not split a ZWJ sequence when scanning for word boundaries.

Skin-tone modifiers form similar multi-code-point sequences, but without a ZWJ: 👋🏽 is base hand 👋 (U+1F44B) immediately followed by modifier U+1F3FD. Together they render as one medium-skin-tone waving-hand emoji.
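The two emoji cases above can be sketched with a deliberately simplified grouping pass. It is not a full implementation of Unicode grapheme cluster segmentation; it only glues a code point to the previous unit when it is a ZWJ, follows a ZWJ, is the variation selector U+FE0F, or is a skin-tone modifier (U+1F3FB through U+1F3FF). The function names are hypothetical.

```python
ZWJ = "\u200d"


def is_extender(ch: str) -> bool:
    """True for code points that attach to the preceding unit."""
    cp = ord(ch)
    return ch == ZWJ or cp == 0xFE0F or 0x1F3FB <= cp <= 0x1F3FF


def clusters(text: str) -> list[str]:
    """Group code points so emoji sequences stay in one piece."""
    out: list[str] = []
    for ch in text:
        if out and (is_extender(ch) or out[-1].endswith(ZWJ)):
            out[-1] += ch
        else:
            out.append(ch)
    return out


# man + ZWJ + woman + ZWJ + girl + ZWJ + boy
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
# waving hand + medium skin-tone modifier (no ZWJ involved)
wave = "\U0001F44B\U0001F3FD"

family_len = len(family)
grouped = clusters("a" + family + "b" + wave)
```

A production walker would more likely rely on a tested grapheme segmenter; the sketch only shows why a scanner that advances one code point at a time can land in the middle of a sequence.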

Bidi overrides

A bidi override character such as U+202E (right-to-left override) forces text direction regardless of script; related bidi controls such as the isolate U+2066 also change how surrounding text is laid out. A document using bidi overrides to spoof identifiers, for example to make “moc.evil” render as “live.com”, is a known attack vector. The walker must not allow override characters inside term names to alter what the rail row label shows.
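One possible defence, sketched here under the assumption that the rail label is built from a plain string, is to drop the explicit bidi formatting characters before display. The set below covers the embedding and override controls (U+202A through U+202E) and the isolate controls (U+2066 through U+2069). `sanitize_label` is a hypothetical name, and stripping is only one policy; rejecting or escaping such terms would also satisfy the requirement.

```python
# Explicit bidi formatting characters: embeddings and overrides
# (U+202A..U+202E) plus isolates (U+2066..U+2069).
BIDI_CONTROLS = {
    chr(cp)
    for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def sanitize_label(label: str) -> str:
    """Remove bidi formatting characters so the label shows logical order."""
    return "".join(ch for ch in label if ch not in BIDI_CONTROLS)


spoofed = "\u202emoc.evil"  # would display reversed under the override
label = sanitize_label(spoofed)
```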

Closing paragraph

All fifteen terms — smart quote, em-dash, en-dash, ZWJ, nbsp, soft hyphen, BOM, emoji, ligature, combining mark, surrogate pair, control char, HTML entity, numeric entity, bidi override — appear at least once above. Each should have a corresponding mark in the post-hydration DOM. None of them should cause the walker to throw an exception, emit malformed HTML, or insert characters that weren’t already there.