jgm at berkeley.edu
Sun Aug 19 11:48:30 EDT 2007
I wholeheartedly agree with you about the desirability of a formal
grammar, but I don't think your example is a good illustration of
the problem. You wrote:
> take this example:
> This **is `raw** text`
> Here we “naively” (i.e. regular parser) see the bold start-token
> first, and it is paired, but since Markdown scans for raw text before
> bold text, it ends up as:
> <p>This **is <code>raw** text</code></p>
That's just how an incremental parser based on a formal grammar *should*
handle the case. To use your own sample specification:
> inline: (ESCAPE | bold | italic | code | link | PARA-TEXT)+
> bold: '**' inline '**' | '__' inline '__'
> code: s-q-code | d-q-code
> s-q-code: '`' CODE+ '`'
> d-q-code: '``' (CODE | ESCAPE)+ '``'
In your example we have '**' followed by an inline consisting of
two elements: the PARA-TEXT 'is ' and a code element '`raw** text`'.
That doesn't match your rule for bold, because no closing '**' follows
the inline, so the opening '**' is left as literal text.
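
To make this concrete, here is a minimal Python sketch of a parser that
follows the quoted rules for bold and s-q-code only. The function names
(to_html, parse_element, parse_bold) are made up for illustration; ESCAPE,
italic, link and d-q-code are left out, and nested bold is disallowed to
keep it short. Treat it as a sketch under those assumptions, not as a
reference implementation of the proposed grammar.

```python
import html


def parse_element(text, i, in_bold):
    """Parse one inline element at text[i]; return (html_fragment, next_index)."""
    # s-q-code: '`' CODE+ '`'  (at least one character between the backticks)
    if text[i] == '`':
        end = text.find('`', i + 1)
        if end > i + 1:
            return '<code>' + html.escape(text[i + 1:end]) + '</code>', end + 1
    # bold: '**' inline '**'  (assumption: no bold nested inside bold)
    if text.startswith('**', i) and not in_bold:
        result = parse_bold(text, i)
        if result is not None:
            return result
    # PARA-TEXT fallback: a single literal character
    return html.escape(text[i]), i + 1


def parse_bold(text, i):
    """Try to match '**' inline '**' at text[i]; return None if it fails."""
    j = i + 2
    parts = []
    while j < len(text):
        if text.startswith('**', j):
            if parts:
                return '<strong>' + ''.join(parts) + '</strong>', j + 2
            return None  # empty bold is not allowed by 'inline' being (...)+
        piece, j = parse_element(text, j, in_bold=True)
        parts.append(piece)
    return None  # ran out of input without finding the closing '**'


def to_html(text):
    """Render one paragraph of text using only the rules sketched above."""
    out = []
    i = 0
    while i < len(text):
        piece, i = parse_element(text, i, in_bold=False)
        out.append(piece)
    return '<p>' + ''.join(out) + '</p>'


print(to_html("This **is `raw** text`"))
```

The code element swallows the second '**', the bold rule fails for lack
of a closing token, and the result is the same
<p>This **is <code>raw** text</code></p> shown in the quoted example.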
Here's a case that may be better:

    This is a *[link*](/url)

which markdown turns into

    <p>This is a <em><a href="/url">link</em></a></p>

An incremental parser would see '*' and scan for a list of inline
elements followed by another '*'. Since the entire '[link*](/url)' is an
inline element, and there's no star following this, we'd get

    <p>This is a *<a href="/url">link*</a></p>
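
The behaviour described above can be sketched in a few lines of Python.
This is only an illustration: the names (to_html, parse_inlines,
parse_element, parse_emph) and the simplified link regex are assumptions
of the sketch, it knows only '*' emphasis and inline links, and it does
not allow emphasis nested inside emphasis.

```python
import html
import re

# Simplified inline link: [label](url), with no nested brackets or parens.
LINK = re.compile(r'\[([^\]]*)\]\(([^)]*)\)')


def parse_element(text, i, in_em):
    """Parse one inline element at text[i]; return (html_fragment, next_index)."""
    if text[i] == '[':
        m = LINK.match(text, i)
        if m:
            # The whole link is one inline element; its label is parsed on its own.
            label = parse_inlines(m.group(1), in_em)
            url = html.escape(m.group(2), quote=True)
            return f'<a href="{url}">{label}</a>', m.end()
    if text[i] == '*' and not in_em:
        result = parse_emph(text, i)
        if result is not None:
            return result
    # Anything else (including an unmatched '*') is literal text.
    return html.escape(text[i]), i + 1


def parse_emph(text, i):
    """Match '*' followed by inline elements and a closing '*', else None."""
    j = i + 1
    parts = []
    while j < len(text):
        if text[j] == '*':
            if parts:
                return '<em>' + ''.join(parts) + '</em>', j + 1
            return None
        piece, j = parse_element(text, j, in_em=True)
        parts.append(piece)
    return None  # no closing star after the list of inline elements


def parse_inlines(text, in_em=False):
    """Parse a run of inline elements and return the concatenated HTML."""
    out = []
    i = 0
    while i < len(text):
        piece, i = parse_element(text, i, in_em)
        out.append(piece)
    return ''.join(out)


def to_html(text):
    return '<p>' + parse_inlines(text) + '</p>'


print(to_html("This is a *[link*](/url)"))
```

Because the link is consumed as a single element, the star inside its
label is never seen as a closing delimiter, and the output is well-formed
HTML with both stars left literal.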
My experience with pandoc convinces me that constructing a formal
grammar for markdown is going to be very complicated. Most of the
complexity comes from markdown's very permissive handling of lists.
Here it's instructive to compare reStructuredText, which has a
formal grammar and incremental parser, and which
- requires a blank line before all lists, including sublists
- requires list items to be indented consistently (i.e. no
  "lazy" lists with an indented first line only)
Of course, reStructuredText is also more demanding on the user,
precisely because of this lack of flexibility.
Looking over the code for pandoc's markdown parser, I see that a lot
of the complexity is due to pandoc's syntax extensions. If I have time,
I'll prune it down to a simple "standard markdown" parser which might
already look something like a formal syntax specification.