Incremental parser (was: Backtick Hickup)

Allan Odgaard 29mtuz102 at sneakemail.com
Sun Aug 19 01:07:53 EDT 2007


On Aug 14, 2007, at 9:41 AM, Michel Fortin wrote:


> [...]
> I agree that the syntax needs to be defined more clearly.


I am glad that we are finally reaching agreement on this. You may not
recall, but a year ago you asked me: “is it so much important that
these border cases be consistent across all implementations?” [1]

[1]: http://six.pairlist.net/pipermail/markdown-discuss/2006-July/000146.html


> I think the syntax page should be updated when we find an ambiguity.


That would be nice, yes -- but IMO we need to take a step back and
define the syntax in a more formal way, because just clarifying a lot
of border cases is tedious and complex. Something closer to a real
grammar would not leave us with all these ambiguities in the first
place. As stated, this is also why I brought up the incremental
parser: it works as a state machine, and a state machine has a clear
transition from state to state based on the input, unlike the present
ad-hoc parser.
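
To make that concrete, here is a minimal sketch (my own illustration
in Python, not TM's or Markdown.pl's actual code) of what "clear
transitions" means, using the code-span case that started this
thread -- every input character moves the machine to exactly one next
state:

    def tokenize_code_spans(text):
        """Yield ('text', s) / ('code', s) tokens for one line."""
        state = 'TEXT'   # current state: TEXT or CODE
        opener = 0       # length of the backtick run that opened a span
        buf = []
        i = 0
        while i < len(text):
            if text[i] == '`':
                run = 0                    # measure the backtick run
                while i < len(text) and text[i] == '`':
                    run += 1
                    i += 1
                if state == 'TEXT':        # TEXT -> CODE on any run
                    if buf:
                        yield ('text', ''.join(buf))
                        buf = []
                    state, opener = 'CODE', run
                elif run == opener:        # CODE -> TEXT on a matching run
                    yield ('code', ''.join(buf))
                    buf = []
                    state = 'TEXT'
                else:
                    buf.append('`' * run)  # non-matching run is literal
            else:
                buf.append(text[i])
                i += 1
        if state == 'CODE':  # unmatched opener: backticks were literal
            yield ('text', '`' * opener + ''.join(buf))
        elif buf:
            yield ('text', ''.join(buf))

Whatever the input, there is never a question of what the parser does
with an unmatched backtick.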


> But I'm not the one in charge of that page. I'd suggest checking
> the testsuites announced on this list: most decisions regarding
> edge cases have been "documented" there as regression tests. If
> some behaviour is part of the test suite, you can be pretty much
> certain that it's not a parser quirk.


I have not looked at these -- that is, I did look at Gruber’s
original test suite, and it basically just tests a lot of simple
cases. Either way, defining a grammar through a pile of test cases is
IMO *not* the way to go about things.

Take e.g. this letter from last year:
http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000151.html
-- there I talk about the problems which arise from mixing nested
block elements with the lazy-mode rules. I think this should be
clearly defined, not just defined via test cases, because we need
clear rules, not recipes for how to handle a few select cases.
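
To illustrate with an example of my own (not one from the test
suites), consider:

    > * a list item
    continued?

Is “continued?” a lazy continuation of the list item inside the
blockquote, a new paragraph of the blockquote, or a new top-level
paragraph? A rule-based definition gives one answer for all such
inputs; a set of test cases only answers it for the inputs somebody
happened to write down.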


> [...]
> Syntax highlighting isn't the same thing as "parsing" Markdown, not
> in my mind. It's more like "tokenizing" Markdown [...]


But building a parse-tree is pretty easy if you have already
tokenized Markdown correctly. Anyway, TM does build the parse-tree as
well. This is slightly beside the point though; I was just saying
that TM does take the “incremental approach”, and it works quite well
for the actual documents out there.
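
For example (a sketch with made-up token names, not TM's actual
ones), turning a balanced token stream into a tree is a single pass
with a stack:

    def build_tree(tokens):
        root = {'kind': 'root', 'children': []}
        stack = [root]
        for kind, value in tokens:
            if kind.endswith('_open'):
                node = {'kind': kind[:-5], 'children': []}
                stack[-1]['children'].append(node)
                stack.append(node)        # descend into the new node
            elif kind.endswith('_close'):
                stack.pop()               # back up to the parent
            else:
                stack[-1]['children'].append(
                    {'kind': kind, 'value': value})
        return root

    # tokens for "*a `b`*" might look like:
    # [('emph_open', ''), ('text', 'a '),
    #  ('code', 'b'), ('emph_close', '')]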


> [...]
> Ok, back to performance.


Just to be clear, my motivation here is *not* performance. My
motivation is getting Markdown interpreted the same in different
contexts, which presently it isn’t always, i.e. to get a clearly
defined syntax, so I can ensure that the highlighting in TM follows
the standard to the point (and thus the syntax highlighting you get
matches the look of the post on your blog, the HTML constructed by
Markdown.pl, or the local preview of the Markdown done with redcloth).


> How many time do you start a new Perl process when building the
> manual?
> [...]
> Is the manual available in Markdown format somewhere? I'd like to
> do some benchmarking.


http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000152.html


> I'm totally not convinced that creating a byte-by-byte parser in
> Perl or PHP is going to be very useful.


The key here is really having clearly defined state transitions.
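
I.e. something like this (a hypothetical table, just to show the
shape, with the same states as the earlier sketch): every (state,
input class) pair maps to exactly one next state and action, so the
behaviour is defined for *all* inputs by construction:

    TRANSITIONS = {
        ('TEXT', 'backtick'): ('CODE', 'open_span'),
        ('TEXT', 'other'):    ('TEXT', 'emit_text'),
        ('CODE', 'backtick'): ('TEXT', 'close_span'),
        ('CODE', 'other'):    ('CODE', 'emit_code'),
    }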


> Using regular expressions is much faster than executing equivalent
> PHP code in my experience [...] I'd be surprised if it [PHP
> Markdown / Markdown.pl] ever reach the speed of TextMate's
> dedicated parsing state machine.


TM has a language grammar declaration where each token is defined by
a regexp -- it could be a lot faster if a dedicated parser were
written, but my goal here was flexibility, not speed.

I am *not* touting TM’s parser as fast. I am trying to convince you
that the current way things are done is pretty bad, and bad for many
reasons: the (lack of) formalness with which the grammar is defined;
the (lack of) simplicity in the code (and thus extensibility of the
language grammar); and also the (lack of) performance, since the
current implementation effectively does not support nested constructs
and thus has to fake them by iteratively mangling subsets of the
document and treating those as a “nested” environment -- complicated
a lot by how it is documented to support embedded HTML (untouched by
the Markdown parser, though in practice some edge cases are not
handled correctly here).


> [...] If you wish to create a better definition of the language,
> I'll be glad to help by answering questions I can answer,
> exemplifying edge cases and their desirable outputs, etc.


We pretty much went over that last year, and I thought I had by now
made the point that what I am after is defining the syntax, not the
edge cases -- I can read how Markdown.pl deals with them myself
(although it deals with several by producing invalid code).


> If you want the syntax changed so that it better fit your parser
> (and possibly other incremental parsers) then I can provide my
> point of view, but I'm not the one who takes the final decision.


Unfortunately Gruber is dead silent when it comes to this.

It may come off as self-serving to approach things from the
traditional incremental-parser (formal grammar / BNF) POV, but it is
because I really think this would be best for bringing all
implementations of the Markdown parser in sync, giving better
performance, leaving fewer broken edge cases than now, and letting
the tools provide accurate syntax highlighting.

Already there are several forks of Markdown (i.e. where stuff is
added to the syntax), so I don’t think the best approach (for me)
would be to start yet another fork -- Markdown should be one
standard, not a dozen different ones, and that is why I am so keen on
having a clearly defined standard.


> [...]
> There's a tricky case here however: [foo][bar] isn't a link in
> Markdown unless "bar" is defined somewhere; if it isn't defined,
> it's plain text. That may seem like an edge case right now, but
> when/if Markdown gets the [shortcut link] syntax (as added to the
> current betas of 1.0.2), this may become a more interesting problem
> for syntax highlighting as any bracketed text will then become a
> potential link depending on whether or not it has been defined
> elsewhere in the document.


Yes, and personally I would say that whenever you write [foo][bar]
you get a link, regardless of whether or not bar is a defined
reference -- if bar is not a defined reference, you could default to
making it reference the URL ‘#’. This makes parsing *much* easier
(here I am thinking about cases like ‘*dum [foo*][bar]’ or ‘[*foo]
[bar] dum*’). The three reasons for choosing this rule are that
1) partial documents are tokenized the same as full documents
(consider that my references may come from an external file, yet some
things should still work on the “partial” document, i.e. the one w/o
the actual bibliography, such as a local preview and the syntax
highlighting), 2) no-one would likely make use of the “feature” that
[foo][bar] is the raw text [foo][bar] when bar is undefined (this is
equivalent to saying that <p>foo</b></p> should keep </b> as literal
text, since no <b> was found), and 3) it really is easier for the
user to relate to “the pattern [something][something] is a link”.
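
A sketch of that rule (the regexp is simplified and the names are
mine -- this is not how Markdown.pl does it):

    import re

    REF_LINK = re.compile(r'\[([^\]]+)\]\s?\[([^\]]*)\]')

    def resolve_refs(text, refs):
        """refs maps lowercased reference ids to URLs."""
        def repl(m):
            label, ref = m.group(1), m.group(2) or m.group(1)
            url = refs.get(ref.lower(), '#')  # undefined ref -> '#'
            return '<a href="%s">%s</a>' % (url, label)
        return REF_LINK.sub(repl, text)

    print(resolve_refs('see [foo][bar]', {}))
    # -> see <a href="#">foo</a>, even though bar is undefined

Note that the resolver needs no knowledge of the rest of the
document: exactly the property a partial preview or a highlighter
wants.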


