Incremental parser (was: Backtick Hickup)

Allan Odgaard 29mtuz102 at
Mon Aug 13 21:56:02 EDT 2007

I forked the topic, since this is an (interesting) topic of its own,
not really related to the interpretation of code-spans.

On Aug 13, 2007, at 10:20 AM, Michel Fortin wrote:

>> [...] I know most Markdown parsers do not follow conventional

>> parser wisdom, but IMO this is also the interpretation that suits

>> an incremental tokenizer/parser best compared to your

>> interpretation [...]

> [...]

> There is a lot of look-aheads in Markdown:

Are you talking about the spec or implementations? I believe when it
comes to the implementation it would be more correct to say a lot of
“iteratively performing search and replacement on the entire document”.

> emphasis won't be applied if asterisks or underscores can't be

> matched in pairs; links won't be links if there's no suitable

> parenthesis after the closing bracket, Setext-style headers need

> the line of hyphens or equal signs following its content, the

> parsing mode for list items depends on whether or not it contains a

> blank line, etc.

All but the style thing is limited (fixed size) look-ahead. This is
not a problem. But those “look to the end of the document each time
you see X” is a huge problem (for performance, and performance for
Markdown is bad) -- if there was interest in addressing this, it
could be done, sure we would have to mildly tweak a few rules, but
those rules are anyway not written down today, they are just de facto
rules based on the current implementation, this is why I jumped in at
this back-tick thing, because as I see it, we have a very
unconventional parser (in and ported by you to PHP) and
we let the language be defined by how this parser ends up dealing
with edge-cases (like pairing two back-ticks with three back-ticks).
But often setting that way of dealing with things as the standard, is
just counter-productive to ever getting a “real” parser for Markdown.

> There's no way to do a truly incremental parsing of Markdown...

> well, you could in a way, but you'd have to mutate many parts of

> the output document while parsing

I strongly disagree. TextMate does a very good job at syntax
highlighting Markdown, and it is based 100% on an incremental parser
-- in v2.0 there will be some new parser features which will allow
for it to deal with 99% of all Markdown out there. Where it has
problems is really in the edge-cases, but that is partly because
these are undefined, and partly because when they come up, e.g. like
this back-tick thing, they “get defined” in a bad way.

(and there you have my motivation for going into this thread)

> (like HTML parsers do in browsers),

Say what?

> or to delay the output of ambigus parts until the end of the

> document; all this surely defeats the purpose of an incremental

> parser.

I think you misunderstand my use of the term. By incremental I mean
that it scans the input document byte-by-byte (and creates tokens,
from which a parse-tree is built), never going back to already
scanned bytes. So this gives it a somewhat linear time complexity
with a low constant factor.

I believe I have already mentioned it, but for reference,
takes almost 40s to convert the TextMate manual into HTML where TM
2’s parser, which parses the manual *exactly* the same uses less than
a quarter of a second.

I believe the ruby parser (maku?) is also based on doing an
incremental scan, I have not played with it yet, but I would think it
also shows much better performance.

That said, another reason why I am focused on an incremental parser
is because then we get closer to a formal grammar -- i.e. if we see
token A we switch to state X, etc. rather than now where nested
constructs are only made possible by dubious md5-transformations on
subsets of the document.

> The worst "look-ahead" (or most complex "mutations") would be for

> reference-style links which can have their definitions absolutely

> anywhere in the document. Interestingly, that's probably one of the

> most appreciated features of Markdown.

You may have misunderstood the incremental as giving incremental
output -- that is not the case.

So basically you parse the document token-for-token. When you see
e.g. [foo][bar] you create the link, but just keep the bar as its
value -- when you see [bar]: http://domain.tld then you insert http://
domain.tld in the symbol table for the bar symbol.

When it gets time to write out the document (i.e. the full input has
been parsed / parse tree built) you just look-up the links in the
symbol table when they are written out.

I.e. this is really no problem.

More information about the Markdown-Discuss mailing list