Incremental parser (was: Backtick Hickup)
Michel Fortin
michel.fortin at michelf.com
Mon Aug 27 12:42:28 EDT 2007
On 2007-08-19 at 1:07, Allan Odgaard wrote:
>> But I'm not the one in charge of that page. I'd suggest checking
>> the testsuites announced on this list: most decisions regarding
>> edge cases have been "documented" there as regression tests. If
>> some behaviour is part of the test suite, you can be pretty much
>> certain that it's not a parser quirk.
>
> I have not looked at these, that is, I did look at Gruber’s
> original test suite, and it basically just tested a lot of simple
> cases. This is IMO *not* the way to go about things, i.e. defining
> a grammar based on a lot of test cases.
You're complaining about the lack of precision in the syntax
definition (a valid complaint). I can't really address that complaint
(the document is not under my control), but I'm trying to help by
pointing out that some test cases (not all, obviously) include some
clues not found in the documentation. Obviously, and as you say, that
isn't a replacement for more precise documentation.
Now that I think about it, there are probably a couple of things
which are only defined in the version history too (especially
Markdown.pl 1.0.1's history).
> Take e.g. this letter from last year:
> http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000151.html
> -- here I talk
> about the problems which arise from mixing nested block elements
> and the lazy-mode rules. I think this should be clearly defined,
> not just defined via test cases, because we need clear rules, not
> recipes about how to handle a few select cases.
One thing is certain: any output which is invalid HTML is a bug.
Beyond that, some things are unintended (bugs) and some things are as
intended (but not always documented). Some things are documented and
easy to find (in the syntax description), others are "documented" but
buried deeper in the version history or in test cases.
So, again, we agree that the documentation is suboptimal.
>> [...]
>> Ok, back to performance.
>
> Just to be clear, my motivation here is *not* performance. My
> motivation is getting Markdown interpreted the same in different
> contexts, which it presently isn’t always, i.e. to get a clearly
> defined syntax, so I can ensure that the highlight in TM follows
> the standard to the point (and thus the syntax highlight you get
> follows the look of the post to your blog, the HTML constructed
> from Markdown.pl, or the local preview of the Markdown done with
> redcloth).
Okay. Then on that goal I'm with you. The less divergence there is
between implementations the better.
>> How many times do you start a new Perl process when building the
>> manual?
>> [...]
>> Is the manual available in Markdown format somewhere? I'd like to
>> do some benchmarking.
>
> http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000152.html
Great, thank you. Posting some results in a new thread right now...
>> I'm totally not convinced that creating a byte-by-byte parser in
>> Perl or PHP is going to be very useful.
>
> The key here is really having clearly defined state transitions.
I'm not sure what you mean by that in relation to what I wrote above.
> I am *not* touting TM’s parser as fast, I am trying to convince you
> that the current way things are done, is pretty bad, and bad for
> many reasons, the (lack of) formalness with which the grammar is
> defined, the (lack of) simplicity in the code (and thus
> extensibility of the language grammar), and also (lack of)
> performance (by how the current implementation effectively does not
> support nested constructs, and thus have to fake it by doing
> iterative manglings of subsets of the document, to treat that as a
> “nested” environment, complicated a lot by how it is documented to
> support embedded HTML (untouched by the Markdown parser, but in
> practice some edge cases are not handled correctly here)).
There are many complaints about different things here. About the
syntax, you complain that it is badly defined (I agree).
You then talk about lack of simplicity in the code, which I assume
applies to Markdown.pl (or PHP Markdown), not the syntax; or perhaps
you mean that the syntax makes it impossible to write simple code to
parse it? I'm not sure I understand what you mean here.
Then you talk about the lack of extensibility of the language grammar
(I'm not sure what you mean by that; is there a language grammar for
Markdown anyway?). Then you go on to the lack of performance
(are you calling this a syntax or parser issue or both?).
Finally you say the current implementation (I assume you're talking
about Markdown.pl, perhaps PHP Markdown) does not "effectively"
support nested constructs (which constructs? what does "effectively"
mean here?) but "supports" them somewhat by recursively reparsing
parts of the document. Very true, but how is that a problem for you?
I assume the latter is a problem for you if you take every quirk and
bug and try to reproduce them with an incremental parser: it gets
needlessly complicated. I don't think that's the way to go if you
want to produce an incremental parser.
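
For illustration, the "recursive reparsing" approach mentioned above can
be sketched roughly as follows. This is a toy in Python, not the actual
code of Markdown.pl or PHP Markdown: the function name `render_blocks` is
hypothetical, and it only knows about blockquotes and paragraphs. It strips
one level of `>` quoting and runs the block parser again on what is left.

```python
import re

def render_blocks(text):
    """Toy block renderer (assumption: only blockquotes and paragraphs)."""
    out = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n{2,}", text.strip()):
        lines = block.split("\n")
        if lines[0].startswith(">"):
            # Strip one level of quoting, then reparse the inner text
            # with the same function: nesting falls out of the recursion.
            inner = "\n".join(re.sub(r"^> ?", "", line) for line in lines)
            out.append("<blockquote>\n" + render_blocks(inner) + "\n</blockquote>")
        else:
            out.append("<p>" + " ".join(lines) + "</p>")
    return "\n".join(out)

print(render_blocks("> outer\n>\n> > inner"))
```

Each nesting level costs another pass over the quoted text, which is
simple to write but is not how an incremental parser with explicit state
transitions would handle it.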
>> [...] If you wish to create a better definition of the language,
>> I'll be glad to help by answering questions I can answer,
>> exemplifying edge cases and their desirable outputs, etc.
>
> We pretty much went over that last year, and I thought I had made
> the point by now, that what I am after is defining the syntax, not
> the edge-cases -- I can read how Markdown.pl deals with them myself
> (although it deals with several by producing invalid code).
Yeah, but let me explain better what I meant by this (today, and last
year too)...
Basically, I'm not going to start a formal grammar for Markdown from
scratch on my own. I'd be glad to help though.
You seem to have already done a good part of the job by writing TM's
parser. While not perfect, I think a formal grammar based on it (or
perhaps something else such as Pandoc) could be a great starting point.
Once we have this, it'll be easier for me and others to comment on,
and to spot any difference with current Markdown.pl. Some differences
will be errors or unintended side effects on Markdown.pl's part which
the formal syntax should ignore, others will be the intended output
and will need to be "ported" to the grammar. These two things are not
always easy to distinguish, and for that I can help since I know
Markdown.pl's innards pretty well (they are mostly the same as PHP
Markdown's).
So by this process, I believe we can evolve the formal syntax to a
point where it handles things pretty well. It can't be *the* formal
definition without John's approval, but it could certainly serve as a
better reference for other implementors than Markdown.pl will ever be.
>> If you want the syntax changed so that it better fit your parser
>> (and possibly other incremental parsers) then I can provide my
>> point of view, but I'm not the one who takes the final decision.
>
> Unfortunately Gruber is dead silent when it comes to this.
Some things are certainly going to stay ambiguous without some
insight from John, but there's still a lot that can be done without it.
> It may come off as self-serving to approach things from the
> traditional incremental-parser (formal grammar / BNF) POV, but it
> is because I really think this would be best for bringing all
> implementations of the Markdown parser in sync, give better
> performance, not have as many broken edge-cases as now, and have
> the tools provide accurate syntax highlight.
I don't really want to see the syntax changed inside and out only to
make it easier to implement as an incremental parser. I don't think
such a parser would be usable (read: fast enough) in PHP anyway. Well,
perhaps it could be, but not in the traditional sense of an
incremental parser; the concept would probably need to be stretched a
lot to fit with regular expressions.
> Already there are several forks of Markdown (i.e. where stuff is
> added to the syntax), so I don’t think the best approach (for me)
> would be to start yet another fork -- Markdown should be one
> standard, not a dozen different ones, and that is why I am so keen
> on having a clearly defined standard.
If you don't add features and don't do things differently from what the
documentation says, you don't have to call it a fork. That the syntax
is unclear for a couple of things doesn't imply that an attempt at
clarifying it is forking. Better to call it one of the multiple
possible interpretations of the syntax as currently defined. And if
that straightened-up syntax is good enough, it could become by itself
a de facto reference for other implementors.
> Yes, and personally I would say whenever you do [foo][bar] you get
> a link, regardless of whether or not bar is a defined reference --
> if bar is not a defined reference, you could default to make it
> reference the URL ‘#’ -- this makes parsing *much* easier (here I
> am thinking about the case where you do: ‘*dum [foo*][bar]’ or
> ‘[*foo][bar] dum*’. The 3 reasons for choosing this rule is that 1)
> partial documents are tokenized the same as full document (consider
> that my references may be from an external file, yet some stuff may
> still work on the “partial” document (i.e. the one w/o the actual
> bibliography, such as a local preview and the syntax highlight), 2)
> no-one would likely make use of the “feature” that [foo][bar] is
> the raw text [foo][bar] when bar is undefined (this is equivalent
> to saying that <p>foo</b></p> should keep </b> as literal text,
> since no <b> was found), and 3) it really is easier for the user to
> relate to “the pattern [something][something] is a link”.
Hmm, I disagree strongly here that creating links to nowhere (#) is
the solution to undefined reference links. This is bad usability for
authors, who will need to test every link in the resulting page to make
sure it links where it should, and for readers, who will click a link
expecting to get somewhere but get nowhere. Leaving it as text makes
it clear for everyone that there is no link there (whatever the
author's intent) and makes authors more likely to find their error by
visually inspecting the browser rendering of the output.
A much better compromise in my opinion would be to just treat these
brackets specially and not allow emphasis in the cases above. I'm not
entirely sure that's the ideal thing to do, but I don't really expect
anyone to do emphasis like that consciously (except as a test case),
so it's probably a good enough solution.
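
To make the "leave it as text" behaviour concrete, here is a minimal
sketch in Python. It is not taken from Markdown.pl or PHP Markdown: the
names `REF_LINK` and `render_reference_links` are hypothetical, the
pattern ignores nested brackets, and no HTML escaping is done. A
`[text][id]` pair becomes a link only when `id` is a defined reference;
otherwise the original text is returned untouched.

```python
import re

# Simplified pattern for [text][id]; nested brackets are not handled.
REF_LINK = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")

def render_reference_links(text, references):
    """Turn [text][id] into a link only if id is in `references`."""
    def replace(match):
        label, ref_id = match.group(1), match.group(2)
        # An empty id ([text][]) falls back to the label itself.
        key = (ref_id or label).lower()
        url = references.get(key)
        if url is None:
            # Undefined reference: keep the source text as-is, so the
            # author can spot the mistake in the rendered page.
            return match.group(0)
        return '<a href="{}">{}</a>'.format(url, label)
    return REF_LINK.sub(replace, text)

refs = {"bar": "http://example.com/"}
print(render_reference_links("See [foo][bar] and [baz][qux].", refs))
```

With this rule the second pair stays visible as `[baz][qux]` in the
output instead of silently becoming a link to `#`.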
Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/