xMarkdown implementation

Mon Mar 29 19:19:05 EST 2004

Aaron Swartz wrote:

> For correcting the HTML, have you considered HTML Tidy? It already has 
> tons of logic to deal with this stuff, and I think it's available as a 
> Perl plugin.

Yes, I have (a little ;). Basically, xMarkdown is not aimed at 
correcting HTML - it's aimed at generating it correctly in the first 
place. The way it works is that the content of the document is held 
outside the structure; it parses Markdown to generate the structure and 
get the content, but it doesn't attempt to map the Markdown into HTML 
via a transformative process. The way the structure is implemented, it's 
basically impossible to generate something malformed, at the cost of 
your content potentially ending up in the wrong place if the input is 
"invalid" (for some value of invalid); xMarkdown moving closing tags in 
your raw HTML is a good example. So it's not really a tool to "correct"; 
it's a tool to create. And if what it creates is wrong, then that's a 
bug.
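
To make the "content held outside the structure" idea concrete, here is a
minimal sketch in Python. It is not xMarkdown's code: the Node class, the
render function and the content table are all hypothetical names, and
escaping stray markup in the text is my assumption (the post says
xMarkdown moves closing tags rather than escaping them). The point it
illustrates is only that each open tag is closed by the same call that
opened it, so the output cannot come out unbalanced.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class Node:
    """One element in the structure tree (hypothetical, for illustration)."""
    tag: str
    # Children are either Node objects or integer keys into the content table.
    children: list = field(default_factory=list)


def render(node, content):
    """Serialise the tree; the call that opens a tag is the one that closes it."""
    parts = ["<%s>" % node.tag]
    for child in node.children:
        if isinstance(child, Node):
            parts.append(render(child, content))
        else:
            # Text lives outside the structure and is escaped on the way in
            # (an assumption of this sketch), so markup inside it cannot
            # unbalance the surrounding tags.
            parts.append(escape(content[child]))
    parts.append("</%s>" % node.tag)
    return "".join(parts)


# The content table holds the text; the tree only refers to it by key.
content = {0: "Hello, ", 1: "world </em> & friends"}
tree = Node("p", [0, Node("em", [1])])
html = render(tree, content)
```

Here the stray closing tag in the text ends up as character data inside
the em element: the content is arguably "in the wrong place", but the
structure around it stays intact.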

I think it would actually be reasonable code to prove formally, too. 
Because of the type of data structure we're talking about, there's a 
relatively simple proof by induction you can do (assuming that there are 
no loops, which we do, because it blows up with a big bang otherwise :) 
which would show that the output always meets the XML validity rules. 
(I'm only saying I think it's provable; I haven't proved it, nor do I 
think the current code would necessarily pass such a proof - character 
set handling is reasonably suspect, for example.)
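
For what it's worth, the induction might go roughly as follows. This is
only a sketch under assumptions of mine: render(T) is my notation for the
serialisation of a tree T, text is assumed to be escaped before it is
emitted, and "well-formed" is used in the XML 1.0 sense, which I take to
be what the guarantee above amounts to. It says nothing about character
set handling.

```latex
\textbf{Claim.} For every finite tree $T$, $\mathrm{render}(T)$ is a
well-formed XML fragment.

\textbf{Base case} (height $0$). $T$ is a single element $t$ whose only
children are escaped text items $c_1,\dots,c_k$, so
\[
  \mathrm{render}(T) = \texttt{<t>}\; c_1 \cdots c_k \;\texttt{</t>} ,
\]
which is one matched pair of tags around character data.

\textbf{Inductive step.} Assume the claim for all trees of height at most
$n$, and let $T$ have height $n+1$, root element $t$ and children
$T_1,\dots,T_k$ (subtrees or escaped text). Each subtree $T_i$ has height
at most $n$, so $\mathrm{render}(T_i)$ is well-formed by hypothesis, and
\[
  \mathrm{render}(T) = \texttt{<t>}\;
    \mathrm{render}(T_1) \cdots \mathrm{render}(T_k)
  \;\texttt{</t>}
\]
is a matched pair of tags around a sequence of well-formed fragments,
hence well-formed.

\textbf{Where ``no loops'' is used.} If the structure is acyclic and
finite, every tree has a finite height, so the induction covers all of
them; with a loop, $\mathrm{render}$ would not terminate.
```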

Tidy is potentially useful though, and I think you make a very good 
point. One of the things I'm very interested in is going beyond what 
xMarkdown offers in terms of the 100% XML guarantee - I can output XML, 
but there's no real certainty that the XML is actually XHTML (are the 
tags valid? are they in the correct context? do they have mandated 
attributes? etc.), and being able to extend the guarantee to XHTML would 
be very interesting (for example, have you seen the studies of sites 
that claim to be valid HTML, but actually aren't? I guess they were 
valid at some point and lost validity later on - it would be nice not 
to have to think about it). Now, there are a number of 
ways of going about it. I could code a load of rules into a rendering 
subsystem, or I could just run it through Tidy (once we have valid XML, 
the ambiguity issue is even less of a problem). Maybe the answer is to 
meet half-way: extend the guarantee with a few bits of code that hit the 
low-hanging fruit, and then pull in Tidy if we detect it's available and 
wanted.
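
The half-way approach could look something like the Python sketch below.
Everything named here is hypothetical (REQUIRED_ATTRS, cheap_checks,
finish), and the rule table is deliberately tiny: it only knows that img
needs src and alt. The Tidy call assumes a command-line tidy on the PATH
that accepts -q, -asxhtml and -utf8 and exits with 2 on errors; if your
build differs, adjust accordingly.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET

# Hypothetical rule table: element name -> attributes it must carry.
REQUIRED_ATTRS = {"img": ("src", "alt")}


def cheap_checks(xml_text):
    """Low-hanging fruit: report elements missing a mandated attribute."""
    problems = []
    for el in ET.fromstring(xml_text).iter():
        for attr in REQUIRED_ATTRS.get(el.tag, ()):
            if attr not in el.attrib:
                problems.append("<%s> is missing %s" % (el.tag, attr))
    return problems


def finish(xml_text, want_tidy=True):
    """Run the built-in checks, then hand off to Tidy if present and wanted."""
    problems = cheap_checks(xml_text)
    tidy = shutil.which("tidy") if want_tidy else None
    if tidy is None:
        # Tidy is not installed or not wanted: return the XML untouched.
        return xml_text, problems
    result = subprocess.run(
        [tidy, "-q", "-asxhtml", "-utf8"],
        input=xml_text,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    # Assumed exit codes: 0 clean, 1 warnings, 2 errors.
    if result.returncode < 2 and result.stdout:
        return result.stdout, problems
    return xml_text, problems
```

Calling finish(doc, want_tidy=False) exercises only the built-in rules,
which is also what happens automatically on a machine without Tidy.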

It would be quite interesting to know how many people use Tidy in page 
generation. I know I never have, and I've been running systems which 
have been valid XHTML 1.1 for some time. I think I see Tidy much 
more as a rectifying tool, and therefore don't think about it in terms 
of actually generating content to begin with. I will take a look at it 
though; it can only inform me. There are definitely a number of 
different approaches you can take toward the same end-goal, and I 
strongly believe in diversity of solutions.

-- bob.