Formal Grammar — some thoughts
Allan Odgaard
29mtuz102 at sneakemail.com
Sat Jul 29 16:38:01 EDT 2006
I recently subscribed and saw in the archive that Eric Astor was
asking for a formal grammar (probably not the first such request).
Currently there are a few problems in making such a thing, so I was
curious whether Mr. Gruber has given any thought to moving toward one?
This would also allow a “cleaner” parser, which would get rid of
some of the current problems (bad nesting[^1], styles which cross
environments[^2], and problems with the md5 checksums[^3]) and I am
sure it would also improve performance significantly.
Some of the problems with a formal grammar (as I see it) are:
1. interpreting tokens as literal text when the end token is missing,
example: `this is __not starting bold`. For bold it doesn’t matter
IMO (having to escape the token), but having to escape every lone
appearance of `_` and `*` could be irritating. That said, `_` already
has to come in pairs today, so one often already needs to wrap
filenames, environment variables and similar text which uses the
underscore in a raw environment.
2. using back-references in end-tokens, example: `a ``` ``raw`` ```
environment`. A formal grammar can’t really do that, and IMO the
clean solution would be to define single-quoted (backticked) raw as
supporting no escaping and ending at the first `` ` ``, whereas
double-quoted (backticked) raw would support escaping of `` ` `` and `\`.
3. heuristically defined end of lists, sub-lists and block-quotes.
This would need to be more strict. I am not entirely sure what the
current definition is, so I am wary of formulating a strict
version. From the source it seems that a sub-list is started when a
line is a list item with a different (exact) indent than the first list
item, allowing for some fun flexibility:
* item 1
* item 1a
* item 2
* item 2a
* item 2b
* item 2c
* item 3
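
One possible strict reading of that rule, as a rough Python sketch. The indents in `sample` are made up for illustration, and `nesting_depths` is a hypothetical helper, not what Markdown.pl actually does: it keeps a stack of indents, opens a sub-list on a deeper indent and closes lists on a shallower one.

```python
import re

# A list item: optional leading spaces, a marker, at least one space, the text.
ITEM = re.compile(r"^( *)[*+-] +(.*)$")

def nesting_depths(lines):
    """Return (depth, text) for every list item in `lines`."""
    stack = []  # indents of the currently open lists, outermost first
    out = []
    for line in lines:
        m = ITEM.match(line)
        if not m:
            continue  # non-item lines are ignored in this sketch
        indent = len(m.group(1))
        # Close every open list that is indented deeper than this item.
        while stack and indent < stack[-1]:
            stack.pop()
        # An indent deeper than the innermost open list opens a sub-list.
        if not stack or indent > stack[-1]:
            stack.append(indent)
        out.append((len(stack) - 1, m.group(2)))
    return out

# Assumed indentation, chosen only to exercise the rule.
sample = [
    "* item 1",
    "  * item 1a",
    "* item 2",
    "   * item 2a",
    "  * item 2b",
    " * item 2c",
    "* item 3",
]
print([depth for depth, _ in nesting_depths(sample)])  # [0, 1, 0, 1, 1, 1, 0]
```

Under this reading items 2a, 2b and 2c all end up as siblings despite three different indents, which is the kind of flexibility a strict definition would have to either bless or forbid.
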
There is also an ambiguity between `*` used for bold and used
for a list item.
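
Returning to point 2: the two raw forms proposed there need no back-references, so they can be scanned with a few lines of code. A minimal Python sketch follows; `scan_raw` is a hypothetical name, and returning `None` for an unterminated span (so the caller can treat the backtick as literal text, as in point 1) is my assumption, not existing behaviour.

```python
def scan_raw(text, pos):
    """Scan a raw span starting at text[pos].

    Returns (content, index_after_span), or None if no span starts here.
    """
    # Double-backticked form: backslash escapes a backtick or a backslash,
    # and the span ends at the first unescaped pair of backticks.
    if text.startswith("``", pos):
        i, out = pos + 2, []
        while i < len(text):
            if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in "`\\":
                out.append(text[i + 1])  # keep the escaped character only
                i += 2
            elif text.startswith("``", i):
                return "".join(out), i + 2
            else:
                out.append(text[i])
                i += 1
        return None  # unterminated

    # Single-backticked form: no escaping, ends at the first backtick.
    if text.startswith("`", pos):
        end = text.find("`", pos + 1)
        if end == -1:
            return None  # unterminated
        return text[pos + 1:end], end + 1

    return None


print(scan_raw("`a\\b` c", 0))        # single form keeps the backslash
print(scan_raw("``a \\` b`` c", 0))   # double form unescapes the backtick
```

Both forms are decided by looking at the opening delimiter alone, which is what makes them expressible without back-references.
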
A minor problem is that inside a list item the rule for e.g. raw
blocks needs to be redefined (to require 2 tabs or 8 spaces), and
that would be necessary for each new level (to add an extra indent
to the requirement), with the likely outcome that raw blocks would
only be supported in e.g. the first 3 levels of list items. OTOH I
doubt anyone would feel safe using raw blocks in deeply nested list
items given the (IMHO) rather vague definitions of when lists stop
and how they interact with raw environments/block-quotes etc.
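
To illustrate the indent requirement: a hand-written parser can take the nesting depth as a parameter, whereas a fixed grammar would need one copy of the rule per depth. A small Python sketch, assuming one indent unit is 4 columns and (as a simplification) that every tab counts as a full 4 columns; `indent_width` and `is_raw_block_line` are hypothetical names.

```python
INDENT_UNIT = 4  # one indent level: 4 spaces or 1 tab (assumed)

def indent_width(line):
    """Width of the leading whitespace, counting each tab as INDENT_UNIT."""
    width = 0
    for ch in line:
        if ch == " ":
            width += 1
        elif ch == "\t":
            width += INDENT_UNIT
        else:
            break
    return width

def is_raw_block_line(line, list_depth):
    """True if a non-blank `line` is indented enough to be raw.

    list_depth 0 means top level (4 spaces or 1 tab); every enclosing
    list item adds one unit, so depth 1 needs 8 spaces or 2 tabs.
    """
    required = INDENT_UNIT * (list_depth + 1)
    return bool(line.strip()) and indent_width(line) >= required


print(is_raw_block_line("    code", 0))    # True
print(is_raw_block_line("    code", 1))    # False
print(is_raw_block_line("\t\tcode", 1))    # True
```
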
Take the following relatively simple input, which produces bogus
markup, as an example of how fragile this stuff currently is:
* this is list item
> * this item is in a block quote
more block quoting?
are we still in list and block quote?
> is this a new block quote?
Thanks for reading this far.
[^1]: example: `__bold _and__ italic_`.
[^2]: example: `*not italic [link*text](#)`.
[^3]: I have only experienced this with MultiMarkdown, for which the
problem is easy to reproduce by using styles in footnotes.