Backslash escapes (was: Revised 2005 proposal for meta-data)

Andrea Censi andrea at censi.org
Fri Jan 5 18:40:51 EST 2007


On 1/4/07, Michel Fortin <michel.fortin at michelf.com> wrote:

> On 2007-01-01 at 15:25, Andrea Censi wrote:
> >> [1] Even further, you could allow non-punctuation to be escaped.
> >
> > In a sense, this is the most consistent way of escaping.


After implementing it and playing around with it, I changed my mind
about escaping [a-zA-Z]. It's useless and just confusing.


> > b) \<newline> represents a linebreak
>
> I can't see why this would be better than what we have now. In fact I
> think it's worse as it'll clutter the text version of the document
> unnecessarily; the current double-space syntax means that the
> Markdown-formatted text looks fine by itself, something which is a
> core goal for Markdown.


The problem I find with the current syntax is that I cannot *see*
whether there is a line break.
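To make the visibility point concrete, here is a minimal sketch (not taken from any Markdown implementation) that detects hard line breaks under both syntaxes. The names `ends_with_break` and `hard_breaks` are hypothetical, and the backslash rule is only the `\<newline>` proposal quoted above.

```python
def ends_with_break(line: str, style: str) -> bool:
    """Tell whether a line ends with a hard line break marker."""
    if style == "spaces":
        # Existing syntax: two or more trailing spaces.
        return line.endswith("  ")
    if style == "backslash":
        # Proposed syntax: a trailing backslash that is not itself escaped,
        # i.e. an odd number of backslashes at the end of the line.
        trailing = len(line) - len(line.rstrip("\\"))
        return trailing % 2 == 1
    raise ValueError(f"unknown style: {style!r}")


def hard_breaks(text: str, style: str) -> list:
    """Return the 1-based numbers of the lines that end with a hard break."""
    lines = text.split("\n")
    # The last line is not followed by a newline, so it cannot carry a break.
    return [n for n, line in enumerate(lines[:-1], start=1)
            if ends_with_break(line, style)]


with_spaces = "first line  \nsecond line\nthird line"
with_backslash = "first line\\\nsecond line\nthird line"
```

Printing `with_spaces` shows nothing at the end of the first line, while printing `with_backslash` shows the marker; both report a break on line 1.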


> > 2) Inside "quoted values", you MUST escape `"`
> > 3) Inside 'quoted values', you MUST escape `'`
>
> But what happens if you don't? If you want to go deep in the
> corner-cases of the syntax I think it'd be more useful to explain what
> parsers have to do when they encounter that rather than tell the
> author what not to write.


At some point, you have to decide what is legal and what is not in a
language. And, if it's not legal, then the behaviour is
implementation-dependent.

Just like HTML: it's very clear what is a legal HTML document.
However, even though browsers do their best to sanitize illegal
documents, their behaviour in that case isn't specified by the spec.


> > I would tend to drop the special case
> >> [text](url "title"with"quotes")
> > as it is ambiguous.
>
> Drop it and replace it with what output? I agree that it has some
> ambiguities, but it's not that bad really, especially when parsing
> with regular expressions.


My personal point is that, to support that kind of syntax, I had to
write a function that is the only ugly one in my shiny new
recursive-descent parser.

Also, though I reckon it is a somewhat philosophical matter, it's
really, really evil to design a language that contains ambiguities.
This is one case where the implementation (a regexp-based system)
heavily influenced the syntax.


> > The first pass of processing the document simply becomes:
> >
> > until eof
....
> > end
>
> Something that sounds odd to me is that you're doing this as the
> first pass of the whole document, yet you don't take into account
> HTML blocks, code blocks and inline HTML tags, but you've thought of
> code spans. It'll have to get much more complicated than that if you
> want to handle escapes as a first pass.


Actually, it worked ok in my first implementation. The trick is to
re-expand the escapes in code blocks or HTML code.


> Why do you want to process escapes first anyway?


Assume the input string is
" `code` - \`not code\` - ``code with \` slash-tick `` "
The first pass I did was to replace "\`" with a code outside of the
input range. Let `?` represent that code. The string becomes:
" `code` - ?not code? - ``code with ? slash-tick `` "
Now extract the code spans (CB):
CB("code"), "- ?not code? - ", CB("code with ? slash-tick")
and undo the escapes: in strings ? becomes `, in code spans ? becomes \`:
CB("code"), "- `not code` - ", CB("code with \` slash-tick")

I did the same for code blocks and HTML.

It worked, but I don't use this method anymore.
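
The placeholder method described above can be sketched as follows. This is an illustrative reconstruction, not the code from that first implementation: the function name `split_with_placeholder`, the choice of NUL as the out-of-range code, and the code-span regular expression are all assumptions.

```python
import re

# Assumption: NUL never occurs in the input, so it can stand for "\`".
PLACEHOLDER = "\x00"

# A code span: a run of backticks, the shortest possible content,
# then a run of backticks of the same length.
CODE_SPAN = re.compile(r"(?<!`)(`+)(.+?)(?<!`)\1(?!`)", re.DOTALL)


def split_with_placeholder(text: str) -> list:
    """Split text into ("text", ...) and ("code", ...) pieces."""
    # Pass 1: hide every escaped backtick behind the placeholder.
    protected = text.replace("\\`", PLACEHOLDER)

    parts = []
    pos = 0
    # Pass 2: extract code spans; escaped backticks can no longer interfere.
    for match in CODE_SPAN.finditer(protected):
        if match.start() > pos:
            plain = protected[pos:match.start()]
            # Outside code, the escape collapses to a literal backtick.
            parts.append(("text", plain.replace(PLACEHOLDER, "`")))
        code = match.group(2).strip()
        # Inside code, the original two characters are restored.
        parts.append(("code", code.replace(PLACEHOLDER, "\\`")))
        pos = match.end()
    if pos < len(protected):
        parts.append(("text", protected[pos:].replace(PLACEHOLDER, "`")))
    return parts
```

Applied to the example string, this yields the same three meaningful pieces as the walk-through, plus the surrounding whitespace as separate text pieces.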


Anyway, with the goal of reaching a compromise, here's the revised
proposal for escaping:

=======

1. No escaping in code spans/blocks.

2. Everywhere else, **all** PUNCTUATION characters **can** be escaped,
and **must** be escaped when they could trigger links, tables, etc.
(punctuation=[^a-zA-Z0-9\s\n])

3. As a rule, quotes **must** be escaped inside quoted values:

* Inside `"quoted values"`, you **must** escape `"`.
* Inside `'quoted values'`, you **must** escape `'`.

* Other examples:

`"bah 'bah' bah"` = `"bah \'bah\' bah"` = `'bah \'bah\' bah'`

`'bah "bah" bah'` = `'bah \"bah\" bah'` = `"bah \"bah\" bah"`


4. There is an exception for backward compatibility, in link/image titles:

[text](url "title"with"quotes")

The exception does not apply to attribute lists or other
contexts, where you have to use the canonical syntax.

========
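
Rules 1 and 2 can be sketched in a few lines. This is only an illustration of the proposal, assuming code spans and blocks have already been set aside so that the function is never applied to them; `unescape` is a hypothetical name.

```python
import re

# Rule 2's definition of punctuation, copied from the proposal.
ESCAPED_PUNCT = re.compile(r"\\([^a-zA-Z0-9\s\n])")


def unescape(text: str) -> str:
    """Drop the backslash in front of every escaped punctuation character.

    Rule 1: callers must not pass the contents of code spans or code
    blocks, since no escaping happens there.
    """
    return ESCAPED_PUNCT.sub(r"\1", text)
```

A backslash followed by a letter, a digit or whitespace is left alone, so `C:\new` passes through unchanged. Note that the character class is ASCII-based, so under this definition a non-ASCII letter counts as punctuation.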


As for point 4, my implementation tries its best to parse it, but
warns the user that it's bad manners.
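
For rule 3, a strict reader of quoted values might look like the sketch below. It is a hedged illustration, not the recursive-descent parser mentioned earlier; `parse_quoted` and `is_punct` are hypothetical names.

```python
import re


def is_punct(ch: str) -> bool:
    """Punctuation as defined by rule 2 of the proposal."""
    return re.fullmatch(r"[^a-zA-Z0-9\s]", ch) is not None


def parse_quoted(source: str, start: int = 0):
    """Read a quoted value that begins at source[start].

    Returns (value, index just past the closing quote).
    Raises ValueError if there is no opening quote or no closing quote.
    """
    if start >= len(source) or source[start] not in ("'", '"'):
        raise ValueError("expected a quote character")
    quote = source[start]
    out = []
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source) and is_punct(source[i + 1]):
            # Escaped punctuation: keep the character, drop the backslash.
            out.append(source[i + 1])
            i += 2
        elif ch == quote:
            # The first unescaped matching quote closes the value.
            return "".join(out), i + 1
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated quoted value")
```

Under this strict rule, `"title"with"quotes"` stops after `title`, which is why point 4 needs separate, lenient handling in link and image titles.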

--
Andrea Censi
"Life is too important to be taken seriously" (Oscar Wilde)
Web: http://www.dis.uniroma1.it/~censi

