A sad description of what happens if you design a bad markup system

MikeGale
Rank VI - Professional
Posts: 728
Joined: Mon Dec 13, 2004 1:50 pm
Location: Tannhauser Gate

Post by MikeGale »

I assume this is all correct. I've certainly found the Wiki markup to be arcane.

I picked up a description of code developed to convert Wikipedia markup into something useful. This may be of interest to people working with HTML, or at least good for a laugh.

Wiki text is a format that follows the PHP maxim “Make everything as inconsistent and confusing as possible”. There are hundreds of millions of interesting documents written in this format, distributed under free licenses on sites that use the Mediawiki software, mainly Wikipedia and Wiktionary. Being able to parse wiki text and process these documents would allow access to a significant part of the world's knowledge.

The Mediawiki software itself transforms a wiki text document into an HTML document in an outdated format to be displayed in a browser for a human reader. It does so through a step by step procedure of string substitutions, with some of the steps depending on the result of previous steps. The main file for this procedure has 6200 lines of code and the second biggest file has 2000, and then there is a 1400 line file just to take options for the parser.

What would be more interesting is to parse the wiki text document into a structure that can be used by a computer program to reason about the facts in the document and present them in different ways, making them available for a great variety of applications.

Some people have tried to parse wiki text using regular expressions. This is incredibly naive and fails as soon as the wiki text is non-trivial. The capabilities of regular expressions don't come anywhere close to the complexity of the weirdness required to correctly parse wiki text. One project did a brave attempt to use a parser generator to parse wiki text. Wiki text was however never designed for formal parsers, so even parser generators are of no help in correctly parsing wiki text.

Wiki text has a long history of poorly designed additions carelessly piled on top of each other. The syntax of wiki text is different in each wiki depending on its configuration. You can't even know what's a start tag until you see the corresponding end tag, and you can't know where the end tag is unless you parse the entire hierarchy of nested tags between the start tag and the end tag. In short: If you think you understand wiki text, you don't understand wiki text.

That is taken from the documentation for a Rust library at https://docs.rs/parse_wiki_text/latest/parse_wiki_text/.
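
Out of curiosity, here is roughly what using that crate looks like. This is only a sketch based on my reading of the docs.rs page: it assumes the crate's default Configuration and an Output value exposing nodes and warnings, and the sample wiki text and crate version are made up for illustration.

    // Cargo.toml dependency (version assumed for illustration): parse_wiki_text = "0.1"
    use parse_wiki_text::Configuration;

    fn main() {
        // A small, made-up piece of wiki text.
        let wiki_text = "== Heading ==\nSome text with a [[link]] and ''italics''.";

        // The default configuration targets no wiki in particular; a real wiki
        // (Wikipedia, Wiktionary, ...) needs its own site-specific settings.
        let output = Configuration::default().parse(wiki_text);

        // The parser reports recoverable oddities as warnings rather than failing outright.
        for warning in &output.warnings {
            println!("warning: {:?}", warning);
        }

        // The parsed document is a tree of Node values that a program can walk.
        for node in &output.nodes {
            println!("{:?}", node);
        }
    }

If I read the docs correctly, the per-wiki differences mentioned in the quote (extension tags, magic words and so on) are handled by building the Configuration from a site-specific ConfigurationSource rather than using the default.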

(I discovered it through a person called Fiatjaf, who develops the Nostr messaging system.)