29mtuz102 at sneakemail.com
Sun Aug 19 15:59:45 EDT 2007
On Aug 19, 2007, at 10:48 AM, John MacFarlane wrote:
>> take this example:
>> This **is `raw** text`
>> [...] it ends up as:
>> <p>This **is <code>raw** text</code></p>
> That's just how an incremental parser based on a formal grammar
> handles the case. [...]
No. A normal parser converts the text into tokens and builds a parse-
tree from these, as they appear.
So when we see the first ‘bold’ token, a new node is added to our
parse-tree and all future nodes are added as children of this node
(the same with code/raw, which is thus a descendant of bold in the
parse-tree).
> An incremental parser would see '*' and scan for a list of inline
> elements followed by another '*'.
In the traditional tokenize-and-build-AST approach, there is no
“scanning” done per se. The ‘*’ becomes a token which affects the
building of the AST (it effectively changes the state so that another
‘*’ will “close” the node we’re building, and thus leave bold).
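To make the point concrete, here is a toy sketch of such a tokenize-and-build parser (this is not Markdown.pl’s actual algorithm; the token names, node representation, and the rule that code spans swallow everything up to the closing backtick are all made up for illustration):

```python
def tokenize(text):
    """Toy lexer: split text into '**', '`' and literal-text tokens."""
    tokens, buf, i = [], "", 0
    while i < len(text):
        if text.startswith("**", i):
            if buf:
                tokens.append(("TEXT", buf)); buf = ""
            tokens.append(("BOLD", "**")); i += 2
        elif text[i] == "`":
            if buf:
                tokens.append(("TEXT", buf)); buf = ""
            tokens.append(("CODE", "`")); i += 1
        else:
            buf += text[i]; i += 1
    if buf:
        tokens.append(("TEXT", buf))
    return tokens

def build(tokens):
    """Toy tree builder: each token opens or closes a node as it appears."""
    root = ("ROOT", [])
    stack = [root]
    it = iter(tokens)
    for kind, val in it:
        top = stack[-1]
        if kind == "TEXT":
            top[1].append(val)
        elif kind == "CODE":
            # code spans are literal: swallow tokens until the closing '`'
            body = ""
            for k2, v2 in it:
                if k2 == "CODE":
                    break
                body += v2
            top[1].append(("CODE", [body]))
        elif top[0] == kind:   # a '**' closes the currently open bold node
            stack.pop()
        else:                  # a '**' opens a new bold node as a child
            node = (kind, [])
            top[1].append(node)
            stack.append(node)
    return root

print(build(tokenize("This **is `raw** text`")))
# → ('ROOT', ['This ', ('BOLD', ['is ', ('CODE', ['raw** text'])])])
```

Note how the code node ends up as a child of the still-open bold node, exactly because the nodes nest in the order the tokens appear.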
> Since the entire '[link*](/url)' is an
> inline element
I didn’t make an entry for links in my example grammar, but it would
be one element if and only if we did links as one token in the lexer.
But since the link text can contain arbitrary inline elements, this
does not seem to be ideal.
Instead we would do something like:
link:      '[' inline ']' reference
reference: '(' url title? ')'
         | '[' name ']'
url:       '<' URL-TEXT '>' | URL-TEXT
title:     '"' D-Q-STR-TEXT* '"'
         | "'" S-Q-STR-TEXT* "'"
For urls we probably also want to support the nested parentheses that
were recently introduced, so the url rule will be slightly more
involved. But the above roughly captures the grammar for links.
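A hypothetical recursive-descent sketch of that grammar (the function name and node shape are invented; 'inline' is simplified to plain text here, where a real parser would recurse into the full inline grammar):

```python
import re

def parse_link(s, pos=0):
    """Parse link: '[' inline ']' reference. Return (node, new_pos) or None."""
    if pos >= len(s) or s[pos] != '[':
        return None
    end = s.find(']', pos + 1)
    if end == -1:
        return None
    text = s[pos + 1:end]            # stand-in for the 'inline' rule
    pos = end + 1
    if pos < len(s) and s[pos] == '(':
        # reference: '(' url title? ')'
        close = s.find(')', pos + 1)
        if close == -1:
            return None
        body = s[pos + 1:close]
        m = re.match(r'''\s*(<[^>]*>|\S+)(?:\s+"([^"]*)"|\s+'([^']*)')?\s*$''',
                     body)
        if not m:
            return None
        url = m.group(1).strip('<>')
        title = m.group(2) or m.group(3)
        return ({'text': text, 'url': url, 'title': title}, close + 1)
    if pos < len(s) and s[pos] == '[':
        # reference: '[' name ']'
        close = s.find(']', pos + 1)
        if close == -1:
            return None
        return ({'text': text, 'ref': s[pos + 1:close]}, close + 1)
    return None

print(parse_link('[foo](/url "hi")'))
# → ({'text': 'foo', 'url': '/url', 'title': 'hi'}, 16)
```

The nested-parentheses case mentioned above is deliberately left out; it is exactly where the simple `find(')')` would have to be replaced by counting the paren depth.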
> My experience with pandoc convinces me that constructing a formal
> grammar for markdown is going to be very complicated.
Markdown in its current form will be. The problem is all the
heuristics in Markdown and the side-effects resulting from the
current parser. But a formal grammar for a language very similar to
Markdown (still keeping the convenience) will not, I think, be that
bad. The two main problems are:
1) how are we going to deal with basically bad input? I.e. if the
user writes [foo **bar], should we insert an implicit ** when we see
]? This is how HTML works in theory, and I really like that (in HTML).
2) how to express the way some block environments basically act as a
pre-processor on the following lines. I.e. when we write ‘> foo’ we
enter a block quote, and that means preprocessing the following lines
by removing an optional leading ‘> ’. I think this can be handled in
the lexer, so it will effectively be a few lexer-specific rules, and
the grammar itself will not really deal with it (basically it will be
a grammar where everything is defined via lazy mode, and the lexer
will ensure that everything actually conforms to lazy mode when it
reaches the parser).
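The lexer-level preprocessing in point 2 could be as simple as this (a minimal sketch under the assumption that a '> ' line has already opened a block quote; the function name is made up):

```python
import re

def strip_blockquote(lines):
    """Remove the *optional* leading '> ' from block-quote continuation
    lines, so the grammar only ever sees the 'lazy' form of the text."""
    return [re.sub(r'^> ?', '', line) for line in lines]

print(strip_blockquote(['> foo', '> bar', 'baz lazy line']))
# → ['foo', 'bar', 'baz lazy line']
```

The third line shows the lazy case: it carries no ‘> ’ prefix, yet after preprocessing it is indistinguishable from the prefixed lines, which is exactly what lets the grammar define everything in lazy mode.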
> Most of the
> complexity comes from markdown's very permissive handling of lists.
> Here it's instructive to compare reStructuredText, which has a
> formal grammar and incremental parser, and which
> - requires a blank line before all lists, including sublists
> - requires list items to be indented consistently (i.e. no
> "lazy" lists with an indented first line only)
> Of course, reStructuredText is also more demanding on the user,
> precisely because of this lack of flexibility.
Well -- I am not a fan of reSt because it puts too much overhead on
me. But I would rather call Markdown unpredictable (rather than
flexible) when it comes to lists; take this example:
* item 1
 * item 1a
* item 2
  * item 2a
   * item 2b
    * item 2c
* item 3
And this classic:
8. item 1
9. item 2
10. item 2a
Another problem is putting block elements in list items, where you
generally need extra spacing between items (which results in <p>
tags around the list-item content), and some block elements can’t be
on the first line of a list item (like block quotes).