Anybody using an HTML5 parser to repair HTML?

Posted: Tue Apr 24, 2012 12:26 am
by MikeGale
It occurred to me that there might be a fairly simple way to repair HTML. It might overcome some limits I've noticed with HTMLTidy.

The repair would match what newer browsers are doing and would include SVG and MathML markup.

The idea is to use an HTML5 parser, which follows the parsing rules laid down in the HTML5 specification and implemented in various browsers.

There are versions available in various languages (Java etc.). The reference version is written in Python.

Has anybody here experimented with that idea?

One route would be to take HTML content, from a file or over HTTP, and output it serialised, either to a file or to a stream in memory.

Something like that could work in a similar way to HTMLTidy.
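
A rough sketch of that route, assuming the third-party html5lib package for Python (picked here only for illustration; other HTML5 parser implementations should work along the same lines). The function name repair_html and its parameters are made up for this example.

```python
def repair_html(source, out=None):
    """Parse possibly broken HTML with the HTML5 algorithm and re-serialise it.

    source: str, bytes, or an open file-like object (for example an HTTP response).
    out:    optional text stream to write to; if omitted, the markup is returned.
    """
    # Assumption: html5lib is installed (pip install html5lib). It is imported
    # here so the sketch can be loaded even where the package is missing.
    import html5lib

    # Build the tree according to the HTML5 parsing rules: implied elements
    # are inserted and unclosed elements are closed.
    tree = html5lib.parse(source, treebuilder="etree")

    # Keep optional tags such as </p> and <body> so the repairs stay visible
    # in the output; by default the serialiser omits them.
    markup = html5lib.serialize(tree, tree="etree", omit_optional_tags=False)

    if out is None:
        return markup
    out.write(markup)
    return None
```

Calling repair_html("<p>one<p>two") should return markup in which both paragraphs are explicitly closed. Passing an open file or an io.StringIO as out writes the result there instead, which covers the file and in-memory cases.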

If anybody has tried parts of that I'd appreciate your observations.

It might be a good complement, used to pre-process content before working on it with CSE.