Incremental parser (was: Backtick Hickup)
29mtuz102 at sneakemail.com
Mon Aug 13 21:56:02 EDT 2007
I forked the topic, since this is an (interesting) topic of its own,
not really related to the interpretation of code-spans.
On Aug 13, 2007, at 10:20 AM, Michel Fortin wrote:
>> [...] I know most Markdown parsers do not follow conventional
>> parser wisdom, but IMO this is also the interpretation that suits
>> an incremental tokenizer/parser best compared to your
>> interpretation [...]
> There is a lot of look-aheads in Markdown:
Are you talking about the spec or implementations? I believe when it
comes to the implementation it would be more correct to say a lot of
“iteratively performing search and replacement on the entire document”.
> emphasis won't be applied if asterisks or underscores can't be
> matched in pairs; links won't be links if there's no suitable
> parenthesis after the closing bracket, Setext-style headers need
> the line of hyphens or equal signs following its content, the
> parsing mode for list items depends on whether or not it contains a
> blank line, etc.
All but the style thing is limited (fixed size) look-ahead. This is
not a problem. But those “look to the end of the document each time
you see X” is a huge problem (for performance, and performance for
Markdown is bad) -- if there was interest in addressing this, it
could be done, sure we would have to mildly tweak a few rules, but
those rules are anyway not written down today, they are just de facto
rules based on the current implementation, this is why I jumped in at
this back-tick thing, because as I see it, we have a very
unconventional parser (in markdown.pl and ported by you to PHP) and
we let the language be defined by how this parser ends up dealing
with edge-cases (like pairing two back-ticks with three back-ticks).
But often setting that way of dealing with things as the standard, is
just counter-productive to ever getting a “real” parser for Markdown.
> There's no way to do a truly incremental parsing of Markdown...
> well, you could in a way, but you'd have to mutate many parts of
> the output document while parsing
I strongly disagree. TextMate does a very good job at syntax
highlighting Markdown, and it is based 100% on an incremental parser
-- in v2.0 there will be some new parser features which will allow
for it to deal with 99% of all Markdown out there. Where it has
problems is really in the edge-cases, but that is partly because
these are undefined, and partly because when they come up, e.g. like
this back-tick thing, they “get defined” in a bad way.
(and there you have my motivation for going into this thread)
> (like HTML parsers do in browsers),
> or to delay the output of ambigus parts until the end of the
> document; all this surely defeats the purpose of an incremental
I think you misunderstand my use of the term. By incremental I mean
that it scans the input document byte-by-byte (and creates tokens,
from which a parse-tree is built), never going back to already
scanned bytes. So this gives it a somewhat linear time complexity
with a low constant factor.
I believe I have already mentioned it, but for reference, markdown.pl
takes almost 40s to convert the TextMate manual into HTML where TM
2’s parser, which parses the manual *exactly* the same uses less than
a quarter of a second.
I believe the ruby parser (maku?) is also based on doing an
incremental scan, I have not played with it yet, but I would think it
also shows much better performance.
That said, another reason why I am focused on an incremental parser
is because then we get closer to a formal grammar -- i.e. if we see
token A we switch to state X, etc. rather than now where nested
constructs are only made possible by dubious md5-transformations on
subsets of the document.
> The worst "look-ahead" (or most complex "mutations") would be for
> reference-style links which can have their definitions absolutely
> anywhere in the document. Interestingly, that's probably one of the
> most appreciated features of Markdown.
You may have misunderstood the incremental as giving incremental
output -- that is not the case.
So basically you parse the document token-for-token. When you see
e.g. [foo][bar] you create the link, but just keep the bar as its
value -- when you see [bar]: http://domain.tld then you insert http://
domain.tld in the symbol table for the bar symbol.
When it gets time to write out the document (i.e. the full input has
been parsed / parse tree built) you just look-up the links in the
symbol table when they are written out.
I.e. this is really no problem.
More information about the Markdown-Discuss