Parse Me, Baby, One More Time: Bypassing HTML Sanitizer via Parsing Differentials

May 12, 2025 Yanac

Server-side HTML sanitization is inherently broken. Nevertheless, it is used everywhere to protect against cross-site scripting (XSS) vulnerabilities.

In this talk, we will delve into why this is the case. To remove XSS payloads, an HTML sanitizer must first parse its input. Then, it determines which parts of the input are dangerous and removes or rewrites them. Lastly, it serializes the transformed input back to its textual form and returns it.
This process means a sanitizer is only as strong as the employed HTML parser. Despite HTML looking deceptively simple, implementing an HTML parser is surprisingly complex. While officially specified, parsing HTML has tons of edge cases and quirks. Sanitizers have to implement all of them, effectively mimicking the exact behavior of a browser. Even if a developer pulls off this nontrivial feat, additional pitfalls lie in the differences in behavior between browsers.
This talk will show how sanitizers deployed by millions of people fall well short of these goals and are easily bypassable.
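The parse, filter, serialize pipeline described above can be sketched in a few lines. This is only an illustration built on Python's standard `html.parser`, which tokenizes markup but does not perform browser-style tree construction; the names `ToySanitizer`, `ALLOWED_TAGS`, `ALLOWED_ATTRS`, and `sanitize` are hypothetical and are not taken from the talk or from any of the evaluated sanitizers.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists, chosen only for this illustration.
ALLOWED_TAGS = {"b", "i", "p"}
ALLOWED_ATTRS = {"title"}


class ToySanitizer(HTMLParser):
    """Step 1 (parse) is done by HTMLParser; the handlers do steps 2 and 3."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Step 2 (filter): drop tags and attributes that are not allowlisted.
        if tag not in ALLOWED_TAGS:
            return
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS
        )
        # Step 3 (serialize): write the surviving tag back as text.
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so it cannot turn back into markup.
        self.out.append(escape(data))


def sanitize(fragment):
    parser = ToySanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


cleaned = sanitize('<b onclick="x()" title="t">hi</b><script>alert(1)</script>')
```

Every decision in step 2 depends on how step 1 split the input into tags, attributes, and text. If a browser splits the same input differently, the filter has been applied to a document the browser never sees.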

We will present MutaGen, a framework that generates HTML fragments designed to trigger differences between parser implementations, so-called parsing differentials. When evaluating the generated fragments on 11 server-side HTML sanitizers, we found that all use deficient parsers. In benign cases, this means the sanitizer mangles harmless input. However, by abusing such parsing differentials, we could automatically bypass all but two of them.
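To make the term concrete, here is a small sketch of one parsing differential. It is not output from MutaGen. It relies on one rule of the HTML Standard: a comment is closed by `-->`, but the tokenizer also closes it at `--!>` (reported as a parse error). The two regular expressions are deliberately simplified stand-ins for two parsers, and `visible_markup` and the payload are hypothetical names and data used only for illustration.

```python
import re

# Toy "sanitizer-side" view: a comment ends only at "-->".
NAIVE_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Toy "browser-side" view: "--!>" also ends a comment, as in the HTML Standard.
# Simplified: other comment edge cases (e.g. "<!-->") are not modeled.
SPEC_COMMENT = re.compile(r"<!--.*?--!?>", re.DOTALL)


def visible_markup(fragment, comment_re):
    """Return what is left once comments, as this parser sees them, are removed."""
    return comment_re.sub("", fragment)


payload = "<!-- a --!><img src=x onerror=alert(1)> -->"

# The naive view treats the whole payload as one harmless comment.
sanitizer_view = visible_markup(payload, NAIVE_COMMENT)

# The spec-like view ends the comment early and exposes the <img> element.
browser_view = visible_markup(payload, SPEC_COMMENT)
```

Under these assumptions, a sanitizer that holds the naive view and passes comments through unchanged would consider the payload inert, while a browser would end the comment at `--!>` and parse the `<img>` element with its event handler.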

By:
David Klein | Researcher, Technische Universität Braunschweig

Full Abstract and Presentation Materials:
https://www.blackhat.com/eu-24/briefings/schedule/#parse-me-baby-one-more-time-bypassing-html-sanitizer-via-parsing-differentials-42514