Incremental parser (was: Backtick Hickup)
    Allan Odgaard 
    29mtuz102 at sneakemail.com
       
    Mon Aug 13 21:56:02 EDT 2007
    
    
  
I forked the topic, since this is an (interesting) topic of its own,  
not really related to the interpretation of code-spans.
On Aug 13, 2007, at 10:20 AM, Michel Fortin wrote:
>> [...] I know most Markdown parsers do not follow conventional  
>> parser wisdom, but IMO this is also the interpretation that suits  
>> an incremental tokenizer/parser best compared to your  
>> interpretation [...]
> [...]
> There is a lot of look-aheads in Markdown:
Are you talking about the spec or implementations? I believe when it  
comes to the implementation it would be more correct to say a lot of  
“iteratively performing search and replacement on the entire document”.
> emphasis won't be applied if asterisks or underscores can't be  
> matched in pairs; links won't be links if there's no suitable  
> parenthesis after the closing bracket, Setext-style headers need  
> the line of hyphens or equal signs following its content, the  
> parsing mode for list items depends on whether or not it contains a  
> blank line, etc.
All but the style thing is limited (fixed size) look-ahead. This is  
not a problem. But those “look to the end of the document each time  
you see X” is a huge problem (for performance, and performance for  
Markdown is bad) -- if there was interest in addressing this, it  
could be done, sure we would have to mildly tweak a few rules, but  
those rules are anyway not written down today, they are just de facto  
rules based on the current implementation, this is why I jumped in at  
this back-tick thing, because as I see it, we have a very  
unconventional parser (in markdown.pl and ported by you to PHP) and  
we let the language be defined by how this parser ends up dealing  
with edge-cases (like pairing two back-ticks with three back-ticks).  
But often setting that way of dealing with things as the standard, is  
just counter-productive to ever getting a “real” parser for Markdown.
> There's no way to do a truly incremental parsing of Markdown...  
> well, you could in a way, but you'd have to mutate many parts of  
> the output document while parsing
I strongly disagree. TextMate does a very good job at syntax  
highlighting Markdown, and it is based 100% on an incremental parser  
-- in v2.0 there will be some new parser features which will allow  
for it to deal with 99% of all Markdown out there. Where it has  
problems is really in the edge-cases, but that is partly because  
these are undefined, and partly because when they come up, e.g. like  
this back-tick thing, they “get defined” in a bad way.
(and there you have my motivation for going into this thread)
> (like HTML parsers do in browsers),
Say what?
> or to delay the output of ambigus parts until the end of the  
> document; all this surely defeats the purpose of an incremental  
> parser.
I think you misunderstand my use of the term. By incremental I mean  
that it scans the input document byte-by-byte (and creates tokens,  
from which a parse-tree is built), never going back to already  
scanned bytes. So this gives it a somewhat linear time complexity  
with a low constant factor.
I believe I have already mentioned it, but for reference, markdown.pl  
takes almost 40s to convert the TextMate manual into HTML where TM  
2’s parser, which parses the manual *exactly* the same uses less than  
a quarter of a second.
I believe the ruby parser (maku?) is also based on doing an  
incremental scan, I have not played with it yet, but I would think it  
also shows much better performance.
That said, another reason why I am focused on an incremental parser  
is because then we get closer to a formal grammar -- i.e. if we see  
token A we switch to state X, etc. rather than now where nested  
constructs are only made possible by dubious md5-transformations on  
subsets of the document.
> The worst "look-ahead" (or most complex "mutations") would be for  
> reference-style links which can have their definitions absolutely  
> anywhere in the document. Interestingly, that's probably one of the  
> most appreciated features of Markdown.
You may have misunderstood the incremental as giving incremental  
output -- that is not the case.
So basically you parse the document token-for-token. When you see  
e.g. [foo][bar] you create the link, but just keep the bar as its  
value -- when you see [bar]: http://domain.tld then you insert http:// 
domain.tld in the symbol table for the bar symbol.
When it gets time to write out the document (i.e. the full input has  
been parsed / parse tree built) you just look-up the links in the  
symbol table when they are written out.
I.e. this is really no problem.
    
    
More information about the Markdown-Discuss
mailing list