Incremental parser

Jacob Rus jacobolus at gmail.com
Tue Aug 14 21:54:48 EDT 2007


Michel Fortin wrote:

> On 2007-08-13, at 21:56, Allan Odgaard wrote:
>
> Both. Implementations use look-aheads simply because the syntax often
> requires looking ahead to see if the pattern is complete before
> deciding if something should be taken literally or not. For instance,
> an asterisk alone doesn't start emphasis unless it has a
> corresponding one in the same paragraph.


Okay, but the spec could easily be changed to continue emphasis until
the end of the current paragraph if no closing * is found. This would
IMO be much more logical for document readers/writers (as well as for
computer parsers), as an asterisk 60 lines later would no longer
affect part of the document on the current line.

Given that `*`s surrounded by spaces are left alone, I don't think this
change would have any real downside in terms of unintentional emphasis,
and it would make it possible for syntax highlighting to accurately
match the expected output, which would make my life as a document
author much easier.
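
To make the proposed rule concrete, here is a minimal sketch in Python
(purely hypothetical; neither Markdown.pl nor PHP Markdown works this
way) of paragraph-scoped emphasis, where an unmatched opening * simply
extends emphasis to the end of the paragraph:

    def emphasize(paragraph):
        """Pair '*' markers within a single paragraph; an unmatched
        opener extends emphasis to the paragraph's end (proposed rule).
        A '*' with whitespace on both sides stays literal."""
        out = []
        emphasis_open = False
        for i, ch in enumerate(paragraph):
            literal = (ch == '*'
                       and i > 0 and paragraph[i - 1] == ' '
                       and i + 1 < len(paragraph) and paragraph[i + 1] == ' ')
            if ch == '*' and not literal:
                out.append('</em>' if emphasis_open else '<em>')
                emphasis_open = not emphasis_open
            else:
                out.append(ch)
        if emphasis_open:
            # Close at the paragraph boundary instead of letting some
            # asterisk 60 lines later decide the result.
            out.append('</em>')
        return ''.join(out)

For example:

    >>> emphasize('an *unclosed marker runs to the paragraph end')
    'an <em>unclosed marker runs to the paragraph end</em>'
    >>> emphasize('2 * 3 * 4 stays literal')
    '2 * 3 * 4 stays literal'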


> In the case of the backtick in the other thread, it's clearly not an
> edge case to me: I'm pretty sure that it has been designed that way.
> First, it's much more useable, and second, the code explicitly checks
> for the surrounding backticks so it can't really be an oversight. But
> that's for the other thread...


Well, the specification is clearly ambiguous, if it can be interpreted
(pretty reasonably AFAICT) in two completely different ways.


> I also agree that parsing Markdown is slower than it should be, but I
> wonder which unwritten rules you're talking about that make it so
> slow. When I wrote about looking until the end of the document, I was
> talking about reference-style links, which are pretty well documented
> as such. Maybe you had something else in mind?


You don't actually need to look to the end of the document while
constructing the parse tree, however. Links which have a matching
reference definition can easily be distinguished from those which
don't in a quick pass at the end of parsing.
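
The idea, sketched in Python (hypothetical node types for illustration,
not how any existing implementation is structured): emit provisional
link nodes during the main pass, collect definitions as they appear,
and resolve everything in one linear pass at the end.

    class Text(object):
        def __init__(self, s):
            self.s = s

    class RefLink(object):
        def __init__(self, text, ref):
            self.text, self.ref = text, ref

    def resolve_links(nodes, definitions):
        """Final pass: any [text][ref] node whose ref has no matching
        definition collapses back into literal text."""
        resolved = []
        for node in nodes:
            if isinstance(node, RefLink) and node.ref not in definitions:
                resolved.append(Text('[%s][%s]' % (node.text, node.ref)))
            else:
                resolved.append(node)
        return resolved

    # After the main pass:
    nodes = [Text('see '), RefLink('foo', 'bar')]
    definitions = {}                  # no "[bar]: ..." was found
    nodes = resolve_links(nodes, definitions)   # RefLink -> plain Text

This pass touches each node exactly once, so it adds only linear time
regardless of where in the document the definitions appear.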


> Syntax highlighting isn't the same thing as "parsing" Markdown, not
> in my mind. It's more like "tokenizing" Markdown, and for that you
> don't really need to be aware of the whole document at once; you just
> need to identify things, not interpret them... Although it can get
> complicated if you consider that some things, such as reference-style
> links, are only links when they have a corresponding definition.


Yes, but tokenizing the document into a parse tree is 90% of the
battle… the rest can be handled quickly (i.e. in at worst linear time)
at the end.


> I was referring to how HTML browsers mutate the DOM in strange ways
> while parsing improper HTML, if only to "fix" things such as this:


Yes, but that's not really in the scope of our discussion; I don't
particularly care how clearly invalid Markdown is converted. It might
be nice to specify (as the HTML5 spec is trying to do for HTML), but
document authors don't need to know about it.


> Indeed, I was assuming the parser was doing the output too, or at
> least generating sequential events (à la SAX). If you're talking
> about creating the document tree in one pass, I suppose this can work
> if you allow mutations of the document (as opposed to just appending
> new elements at the end as you'd do with an XML parser).


It requires one pass to create a document tree, and one more pass for a
few other things, such as assigning links.


> Ok, back to performance. How many times do you start a new Perl
> process when building the manual? My benchmarks indicate that
> starting the

AFAIK it only takes one perl process to run Markdown on a large
file...


> I'm totally not convinced that creating a byte-by-byte parser in Perl
> or PHP is going to be very useful. Using regular expressions is much
> faster than executing equivalent PHP code in my experience.


This is a silly statement. Compiled vs. interpreted has nothing to do
with the big-O running time of an algorithm. Real parsers are
optimized for this type of task and will take linear time (with
suitable changes to the Markdown spec), while regular expressions are
not (it's even possible that a document could be crafted to take
exponential time with the current Markdown implementation).
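
The exponential case is easy to demonstrate with a generic
backtracking-regex example (in Python here; the pattern is the classic
nested-quantifier pathology, not one lifted from Markdown.pl):

    import re
    import time

    # Nested quantifiers force the engine to try exponentially many
    # ways of splitting the 'a's once the match is doomed to fail.
    pattern = re.compile(r'(a+)+$')

    for n in (20, 22, 24):
        text = 'a' * n + 'b'    # trailing 'b' guarantees a failed match
        start = time.time()
        pattern.match(text)
        print(n, '%.2fs' % (time.time() - start))

Each additional 'a' roughly doubles the running time, while a real
parser (or even a simple linear scan) would reject the same input
instantly.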


> That doesn't mean PHP Markdown and Markdown.pl can't be made faster,
> but I'd be surprised if they ever reached the speed of TextMate's
> dedicated parsing state machine.


TextMate's parser has to do quite a bit of extra work that a dedicated
parser for a particular format would not have to do. I would not be at
all surprised if a Perl or PHP implementation could be made extremely fast.


> You may not like the way Markdown.pl or PHP Markdown works, but we
> each have to work within the limits of our particular language.
> Between a PHP-made state machine and a PHP-regex hybrid design, I'd
> choose the second one if it means I get decent performance. And I
> don't think a formal grammar will help much in getting better
> performance in a PHP or Perl environment, for the reasons outlined
> above.


I really don't understand where you're coming from on this point. Why
would a PHP state machine be so terribly slow?
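
For what it's worth, a character-by-character state machine for
something like single-backtick code spans is only a handful of lines
and runs in strictly linear time; here's a sketch in Python (the same
structure translates almost line-for-line to PHP):

    def tokenize_code_spans(line):
        """One pass over the input, O(len(line)): split a line into
        ('text', ...) and ('code', ...) tokens at backtick delimiters.
        Single backticks only; multi-backtick delimiters are omitted
        for brevity."""
        tokens, buf, state = [], [], 'text'
        for ch in line:
            if ch == '`':
                if buf:
                    tokens.append((state, ''.join(buf)))
                    buf = []
                state = 'code' if state == 'text' else 'text'
            else:
                buf.append(ch)
        if buf:
            # An unterminated backtick leaves a dangling 'code' token;
            # a real implementation would decide how to recover here.
            tokens.append((state, ''.join(buf)))
        return tokens

    >>> tokenize_code_spans('use `printf` here')
    [('text', 'use '), ('code', 'printf'), ('text', ' here')]

Each character is examined once and no backtracking ever occurs, so
the cost stays proportional to the input size whatever the language.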


> There's a tricky case here however: [foo][bar] isn't a link in
> Markdown unless "bar" is defined somewhere; if it isn't defined, it's
> plain text.


As previously mentioned, that's quite trivial to pick up after the
document has been completely tokenized.

-Jacob Rus


