Backtick Hickup

Michel Fortin michel.fortin at michelf.com
Mon Aug 27 16:35:38 EDT 2007


On 2007-08-19 at 10:46, Allan Odgaard wrote:


> On Aug 14, 2007, at 9:45 AM, Michel Fortin wrote:
>
>> [...] Your interpretation of the syntax would require that:
>>
>> (mine) ` `````````` `
>> (yours) ``````````` `````````` ```````````
>
> Well, showing that my interpretation of Gruber’s writings leads to
> a lot of redundant back-ticks (in a fictional case) is not really
> showing that my interpretation is wrong ;)


Indeed, both fit with the syntax description document, and both work
with Markdown.pl or PHP Markdown too.


> But based on the code for Markdown.pl it would seem that the
> standard has an additional requirement, not made explicit in the
> syntax document (namely that back-ticks must not follow the end-
> token).


That'd be it. Underspecified doc again. I wonder how many
implementations got it "right" (where by "right" I mean like
Markdown.pl).
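
For reference, here is a minimal Python sketch of the rule as discussed here: an opening run of N backticks is closed by a run of exactly N backticks that is not followed by another backtick. It is an illustration modelled on the behaviour described in this thread, not code taken from Markdown.pl or PHP Markdown, and the helper name `code_spans` is made up for the example.

```python
import re

# Opening run of N backticks, lazy content, then a run of exactly N
# backticks. The closing run may be neither preceded nor followed by
# another backtick, and the opener may not be preceded by a backslash.
CODE_SPAN = re.compile(r"(?<!\\)(`+)(.+?)(?<!`)\1(?!`)", re.S)


def code_spans(text):
    """Return the stripped contents of every code span found in text."""
    return [m.group(2).strip() for m in CODE_SPAN.finditer(text)]


print(code_spans("` test `` test `"))   # ['test `` test']
print(code_spans("`a` and ``b ` c``"))  # ['a', 'b ` c']
```

With this rule the first input is a single code span, because the backtick after the first "test" is followed by another backtick and so cannot close the span.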


> Personally, as I have said before, the back-tick rules are
> confusing (when you want to include a back-tick in the code) and we
> might be better off by just defining some simpler rules.


I don't find them confusing, but perhaps it's only because I'm used
to them. Which aspect do you find confusing? Note that both of the
forms above produce the expected result in Markdown.pl (or PHP
Markdown).

Perhaps if you wanted to have two consecutive code spans separated by
nothing, not even a space, then your interpretation of the syntax has
an advantage. This:

` test `` test `

would produce two consecutive code spans instead of one code span
containing two words and two backticks. But I'm not really sure how
compelling this use case is: why would you want two consecutive code
spans in the first place?
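
To make the difference concrete, here is a small Python sketch of one possible reading of the alternative interpretation, in which a span ends at the first run of N backticks after the opener even when more backticks follow immediately. This is only my reading of it for the purpose of the example; the regex and the name `code_spans_first_closer` are assumptions, not anyone's actual implementation.

```python
import re

# Assumed reading: the span ends at the first run of N backticks after
# the opener, whether or not further backticks follow it.
FIRST_CLOSER = re.compile(r"(`+)(.+?)\1", re.S)


def code_spans_first_closer(text):
    """Return the stripped contents of the spans found under this reading."""
    return [m.group(2).strip() for m in FIRST_CLOSER.finditer(text)]


print(code_spans_first_closer("` test `` test `"))  # ['test', 'test']
```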


> My proposal (from the thoughts on a formal grammar) was to have
> `normal raw` and ``double-quoted raw`` where the latter would
> support escape codes (at least \`).
>
> But there are other options. Having escape-codes in raw though
> could prove to be generally useful.


I think I prefer the current behaviour. I can't really see when
having to escape the content of a code span would be useful. Perhaps
you had something in mind when proposing that?
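
So that we are talking about the same thing, here is how I read the quoted proposal, as a rough Python sketch: a single-backtick span is fully raw, while a double-backtick span also accepts backslash escapes. The function name `parse_code` and the choice to unescape every `\x` pair are assumptions made for the example; the proposal itself only mentions "at least" an escaped backtick.

```python
import re

# `normal raw`: anything but a backtick, taken literally.
S_Q_CODE = re.compile(r"`([^`]+)`")
# ``double-quoted raw``: escapes (backslash + any char) or non-backticks.
D_Q_CODE = re.compile(r"``((?:\\.|[^`])+)``", re.S)


def parse_code(text):
    """Parse a code span at the start of text; return (kind, content) or None."""
    m = D_Q_CODE.match(text)
    if m:
        # Assumption: every escape collapses to the escaped character.
        return ("d-q-code", re.sub(r"\\(.)", r"\1", m.group(1), flags=re.S))
    m = S_Q_CODE.match(text)
    if m:
        return ("s-q-code", m.group(1))
    return None


print(parse_code(r"``a \` b``"))  # ('d-q-code', 'a ` b')
print(parse_code(r"`raw \n`"))    # ('s-q-code', 'raw \\n')
```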



>> [...]
>> (There's also a check for a backslash at the start, although I
>> just realised that this needs work as it doesn't give a correct
>> result for an escaped literal backslash like this: \\`code`.)
>
> And this is *exactly* why I think the current parser is so flawed,
> because you can’t look at things in isolation -- *everything* is
> dependent on what precedes it, not just the previous character, but
> every single character that comes before the current one


You're right indeed. The way I plan to solve that in PHP Markdown is
by handling character escapes in the span gamut *tokenizer* (which
currently sorts out HTML tags and code spans).
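
A rough Python sketch of the idea, to show why it helps: when escapes and code spans are recognised in the same left-to-right pass, an escaped backslash is consumed before the backtick that follows it is examined. This is not PHP Markdown's tokenizer; the name `tokenize_spans`, the token names and the reduced set of escapable characters are all assumptions made for the example.

```python
import re

# One pass, two alternatives: a backslash escape, or a code span.
# Only a few escapable characters are listed to keep the sketch short.
TOKEN = re.compile(
    r"\\(?P<esc>[\\`*_])"
    r"|(?P<tick>`+)(?P<code>.+?)(?<!`)(?P=tick)(?!`)",
    re.S,
)


def tokenize_spans(text):
    """Split text into ('text' | 'escape' | 'code', value) tokens."""
    tokens, pos = [], 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        if m.group("esc") is not None:
            tokens.append(("escape", m.group("esc")))
        else:
            tokens.append(("code", m.group("code").strip()))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


print(tokenize_spans(r"\\`code`"))
# [('escape', '\\'), ('code', 'code')]
print(tokenize_spans(r"\`not code\`"))
# [('escape', '`'), ('text', 'not code'), ('escape', '`')]
```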


> (granted, it seems that in practice, i.e. the standard inferred
> from how the parser actually works, things are dependent only on
> characters preceding things _in the same paragraph_ -- but it seems
> to me that this is really just a side-effect of how the parser is
> written, and not always desired. For example embedded HTML does not
> lean itself well to the “split the document into paragraphs”).


I have some difficulty figuring out what you mean by "embedded HTML
does not lean itself well to the 'split the document into paragraphs'".

Markdown currently distinguishes block-level HTML elements from span-
level HTML elements: the former create blocks which are left alone
by Markdown (and left outside paragraphs) while the latter get
wrapped into paragraphs (as valid HTML expects them to be) along with
Markdown-formatted text.


> Anyway, if we agree that everything is dependent on everything that
> precedes it, I think we can slowly start to agree that *also*
> having things depend on what follows, is problematic.


Well, I think you mean problematic for writing a parser, in which
case I disagree. Making things depend on other things at the end of
the document is indeed problematic for performance; having things
depend on things at the end of the paragraph is not so problematic in
my view: the look-ahead is capped by the end of the paragraph (or
whatever block you're in).
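
A small Python sketch of what I mean by capping the look-ahead: if the document is first split into blocks and the search for a closing backtick runs on each block separately, a stray backtick can never pair with one in a later paragraph. The blank-line split and the name `spans_per_block` are simplifications assumed for the example.

```python
import re

CODE_SPAN = re.compile(r"(`+)(.+?)(?<!`)\1(?!`)", re.S)


def spans_per_block(document):
    """Search each blank-line-separated block on its own."""
    blocks = re.split(r"\n\s*\n", document)
    return [[m.group(2) for m in CODE_SPAN.finditer(b)] for b in blocks]


doc = "A stray ` here.\n\nAnother stray ` there."

# Block by block, neither backtick finds a partner.
print(spans_per_block(doc))  # [[], []]

# Over the whole document, the two backticks pair up across paragraphs.
print([m.group(2) for m in CODE_SPAN.finditer(doc)])
# [' here.\n\nAnother stray ']
```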

But of course, anything can be problematic if you're trying to fit
Markdown into a specific grammar language.


> I.e. we turn parsing into the Chinese game of pickup sticks -- the
> way this is presently (mostly) solved is by doing iterative scans,
> where each iteration is handling a given “token”, so rather than
> have the placement of the token in the document define the outcome
> (i.e. the closer it is to the start of the document, the higher its
> precedence), it is based on the order of the iterative scans (i.e.
> the first token “seen” by the parser, where it might be blind to
> `**` the first time it scans the document), take this example:
>
> This **is `raw** text`
>
> Here we “naively” (i.e. regular parser) see the bold start-token
> first, and it is paired, but since Markdown scans for raw text
> before bold text, it ends up as:
>
> <p>This **is <code>raw** text</code></p>
>
> If we actually addressed this edge case in the standard, would we
> really define the above to be the expected behavior? And if so, how
> do we even document the general rule used here?


I think you've chosen a bad example.

I would define that as the expected behaviour, not because it works
that way in John's implementation or my own but because once you've
opened the code span, text content is taken literally until the end
marker for the code span, so the ** inside cannot be a marker for
emphasis.

As to how to parse it with an incremental parser, I assume you could
do something like this:

text: this
mark: **
text: is
mark: `
(switch tokenizer into "raw" mode until it sees a backtick)
text: raw** text
mark: `
(take last text token, remove backtick marks, and make a code span)
(switch back tokenizer into "span" mode)
end reached in span

The hard part comes when no matching backtick is found (assuming non-
paired backticks do not constitute code). Here's what I suggest for
the same case with no ending backtick:

text: this
mark: **
text: is
mark: `
(switch tokenizer into "raw" mode until it sees a backtick)
text: raw** text
end reached in raw
(reparse last text token in "span" mode)
text: raw
mark: **
(take tokens between the two ** marks and put them in
emphasis, the two marks are removed)
text: text
end

Note that in this case backtracking is limited to the last token,
which is itself limited in length by the current block (paragraph,
list item, ...). I have no idea how that could fit any formal grammar
language however.
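
The two traces above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: single-backtick code spans only, `**` as the only other mark, an unmatched backtick kept as literal text, and paired `**` rendered as `<strong>`. The names `tokenize_with_raw_mode` and `render` are made up for the example.

```python
def tokenize_with_raw_mode(text):
    """Tokenize into ('text' | 'mark' | 'code', value) tuples."""
    tokens, buf, i = [], [], 0

    def flush():
        if buf:
            tokens.append(("text", "".join(buf)))
            buf.clear()

    while i < len(text):
        if text.startswith("**", i):
            flush()
            tokens.append(("mark", "**"))
            i += 2
        elif text[i] == "`":
            end = text.find("`", i + 1)  # "raw" mode: look for the closer
            if end == -1:
                # End reached in raw mode: keep the backtick as text and
                # go on in "span" mode, so what follows is reparsed.
                buf.append("`")
                i += 1
            else:
                flush()
                tokens.append(("code", text[i + 1:end]))
                i = end + 1
        else:
            buf.append(text[i])
            i += 1
    flush()
    return tokens


def render(tokens):
    """Pair up ** marks; an unpaired mark stays as literal text."""
    out, open_at = [], None
    for kind, value in tokens:
        if kind == "code":
            out.append("<code>" + value + "</code>")
        elif kind == "mark" and open_at is None:
            open_at = len(out)
            out.append("**")
        elif kind == "mark":
            out[open_at] = "<strong>"
            out.append("</strong>")
            open_at = None
        else:
            out.append(value)
    return "".join(out)


print(render(tokenize_with_raw_mode("This **is `raw** text`")))
# This **is <code>raw** text</code>
print(render(tokenize_with_raw_mode("This **is `raw** text")))
# This <strong>is `raw</strong> text
```

The backtracking never goes further back than the unmatched backtick, which is the property described above.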


> The “syntax” quickly becomes the implementation, because we would
> have to define it like “first the document is broken into embedded-
> HTML parts and non-embedded HTML parts, the HTML embedded parts are
> found using this heuristic: …, the non-embedded HTML parts are then
> broken into paragraphs (where a paragraph is defined using …), for
> each paragraph we first scan for one or more back-ticks and see if
> there is an equal number in the same paragraph, if so, that part is
> made raw, and that part is no longer worked on, and for the text to
> the left and right of the raw text we do …” etc.


Well, look at how the WHATWG is defining HTML right now: it's exactly
that. They describe how the parser works (in English), and everything
that matches its behaviour is conforming... which brings up an
interesting side topic: how should HTML be parsed (or even
specified) within Markdown? :-)


> Such specification a) can lead to a lot of misunderstandings
> (already in the above I neglected to mention how escaping ` will
> not cause a code-span, although Markdown 1.0.1 does turn \`this\`
> into <code>, but it seems the regexp you use, does not), and b)
> requires the parser to be written in a certain way which is rather
> non-standard, so parser tools cannot help in this.


Indeed, that's a pretty important bug in 1.0.1, which was fixed in
some betas later but hasn't reached the public version of Markdown.pl
yet. Markdown.pl is due for an update, just like the documentation
for its syntax, but I can't do much about either myself.

If you're trying to point out that having a grammar and a parser
implementation that follows it to the letter is going to remove those
bugs, then I'm not sure I agree. Having an official grammar doesn't
shield you from bugs in implementations, nor does it guarantee that
the grammar has no unexpected behaviours (bugs) in itself. It does
however make it clear whose fault it is when something doesn't work
as expected.


> A more formal approach would be something like semi-EBNF:
>
> markdown: html | block-element
>
> html: '<' ID attribute* '>' html* '</' ID '>'
>     | '<' ID attribute* '/>'
>
> block-element: heading | list | blockquote | raw | inline
>
> heading: '#'+ inline | inline '\n' ('-'|'='){3,} '\n'
>
> inline: (ESCAPE | bold | italic | code | link | PARA-TEXT)+
>
> bold: '**' inline '**' | '__' inline '__'
>
> code: s-q-code | d-q-code
>
> s-q-code: '`' CODE+ '`'
> d-q-code: '``' (CODE | ESCAPE)+ '``'
>
> ID: [A-Za-z][A-Za-z0-9]*
> CODE: [^`]
> ESCAPE: \.
> PARA-TEXT: [^\n] | \n[^\n]
>
>
> The above is written in Mail, and not meant to be exact, just give
> a rough idea of what I am talking about, as I am not sure that is
> entirely clear to you.


That's what I have in mind when I hear "formal grammar". It'd
probably be better to follow Markdown's terminology (bold <=>
emphasis; raw <=> codeblock, ID (for HTML) <=> NAME (or local name)),
although I admit PHP Markdown's internals do not always do so (because
they follow Markdown.pl's terminology: anchors, bolds, and italics
instead of links and emphasis).

A few things I see: html would need to be separated into block-level
and span-level html (because span-level tags like `<small>` or `<b>`
are allowed in paragraphs, even at the start, so it makes a difference).

There's also an unstated rule that Markdown.pl (and thus PHP
Markdown) enforces about headings: they're one line only; so perhaps
we'd want to enforce that too, although I'm not sure.



> And sure, we can’t get all the way with EBNF, but maybe we can get
> 95% of the way, and that would be a tremendous win.


I'm pretty sure it'd be useful indeed.



> As I noted in my initial letter (last year about thoughts on a
> formal grammar) we would (unfortunately) have to break with current
> behavior for (undocumented) edge-cases, like the raw above, since
> with the above specification, it is the first token seen, that
> decides which style to switch to -- we can still make requirements
> that it needs to be paired, e.g.:
>
> This is **not bold.
>
> Would not have `**` start bold. But personally, I am not favoring
> that direction, mainly though because it easily leads to problems
> parsing, but also because I am not sure it really is desired.


Well, I can't be sure what John had in mind when he wrote the syntax,
although I can say that personally I would prefer this not to be
turned into emphasis.


> Take e.g. a paragraph like:
>
> You can set the SVN_EDITOR variable.
>
> Now someone figures it would be good to append `(similar to
> CVS_EDITOR)`. This now makes the full paragraph transform in an
> undesired way, even though the two sentences on their own transform
> fine, but when they follow each other, they do not (<em> is
> introduced in the resulting HTML).


I think the better solution to that problem would be to disallow
emphasis that starts in the middle of one word and ends in the middle
of another. And as for the underscore-emphasis problem, I'd suggest
doing just as PHP Markdown Extra does (one of its *documented*
features): only allow it on word boundaries, not in the middle of a
word. I've yet to get a complaint about that change in behaviour and
I know some people switched to Extra just because of that.
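
To illustrate with the SVN_EDITOR example, here is a Python sketch comparing a naive underscore rule with one that only accepts underscores on word boundaries. Both regexes are assumptions written for this message, not the expressions used by Markdown.pl or PHP Markdown Extra, and the function names are made up.

```python
import re

# Naive rule: any pair of underscores delimits emphasis.
NAIVE_EM = re.compile(r"_(.+?)_")

# Word-boundary rule: the opening underscore may not follow a word
# character and the closing underscore may not precede one.
BOUNDARY_EM = re.compile(
    r"(?<![A-Za-z0-9_])_(?=\S)(.+?)(?<=\S)_(?![A-Za-z0-9_])"
)


def emphasize_naive(text):
    return NAIVE_EM.sub(r"<em>\1</em>", text)


def emphasize_on_boundaries(text):
    return BOUNDARY_EM.sub(r"<em>\1</em>", text)


line = "You can set the SVN_EDITOR variable (similar to CVS_EDITOR)."

print(emphasize_naive(line))
# You can set the SVN<em>EDITOR variable (similar to CVS</em>EDITOR).
print(emphasize_on_boundaries(line))
# You can set the SVN_EDITOR variable (similar to CVS_EDITOR).
print(emphasize_on_boundaries("this is _emphasis_ here"))
# this is <em>emphasis</em> here
```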


> I know that it is stated somewhere that Markdown should be all
> about the person and implementation complexity is irrelevant. The
> problem is that implementation complexity has led us to the
> current situation where we have parsers doing different things (and
> syntax highlight not always being accurate) and we have lots of
> broken edge-cases and IMO unintuitive behavior -- so we got the
> implementation complexity, but I don’t think we have something
> which is “better” than had this followed more formal rules.


I agree that having a more complete description of the syntax, which
may include formal rules once we get there, can only be positive.


>> [...] That said, it's certainly the very edge of an edge case. If
>> we're to define a formal syntax, let's not start there.
>
> As should hopefully be clear from the above, I am *not* talking
> about documenting every single edge case, I am talking about
> defining the syntax using more traditional means of defining syntaxes.


But how can you write a formal grammar without having to think of the
edge cases? Are you suggesting we should ignore edge cases when
defining the syntax? And if so, what qualifies as an "edge case"?


> Anyway, enough dead horse beating for now. Hopefully I’ll find time
> to do a mostly complete parser based on EBNF for the current
> Markdown syntax, and then I can bring up the topic again, listing
> the compromises necessary for it to work, and the advantages/
> disadvantages may be more apparent then.


I look forward to it. It surely looks interesting.


Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/



