case in reference links (was: Re: text/markdown effort in IETF)

Thu Jul 10 18:18:51 EDT 2014

+++ Michel Fortin [Jul 10 14 07:53 ]:
>On 10 Jul 2014, at 1:04, John MacFarlane <jgm at berkeley.edu> wrote:
>
>> +++ Michel Fortin [Jul 09 14 18:07 ]:
>>
>>> Fun fact: PHP Markdown is mostly encoding agnostic. It understands UTF-8 sequences but any byte that is not a valid UTF-8 sequence is treated as a character in itself. It's only relevant when converting tabs into spaces however, and only if you have non-ASCII characters before the tab.
>>
>> Small amendment: There are at least two places where the difference
>> between utf-8 and latin1 matters:  tab expansion (as you note) and
>> reference links, since these are stipulated to be case insensitive.
>> (Case conversion is sensitive to the encoding.)
>
>Like Markdown.pl, PHP Markdown will just treat non-ASCII characters in a case-sensitive way so in my case it doesn't matter.

I think this is a deficiency in Markdown.pl.  The syntax description
says that reference links are case-insensitive, and it doesn't say
anything about this just applying to ascii references.  I think someone
who writes in, say, Spanish, would quite naturally expect words with
accents to behave the same as words without accents in reference links.

By the way, I'm not sure what the motivation for making the reference
links case-insensitive was.  I conjecture that it was to allow the
following sort of thing:

    [Foo][] is better than [bar][].  And [Bar][] is worse than [foo][].

    [foo]: /url1
    [bar]: /url2

This is a good motivation:  it would be a burden to have to define
separate references for capitalized and uncapitalized versions of a
phrase, or to use the longer form `[Foo][foo]` for capitalized
versions.  But this motivation extends naturally beyond ascii.

Hence, I think markdown processors *should* do a proper Unicode
case fold in determining when references match.
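
To make the difference concrete, here is a minimal sketch in Python, not
taken from any existing implementation.  The helper names
(`ascii_only_key`, `unicode_key`, `resolve`) and the sample definitions
are hypothetical.  It assumes labels use precomposed characters and
ignores Unicode normalization, which a real processor would probably
also want to consider.

```python
def ascii_only_key(label: str) -> str:
    # Lowercase ASCII letters only; every other character is left
    # untouched, so non-ASCII letters stay case-sensitive.
    return "".join(ch.lower() if ord(ch) < 128 else ch for ch in label)


def unicode_key(label: str) -> str:
    # Locale-independent Unicode case fold from the standard library.
    return label.casefold()


def resolve(label, definitions, key):
    # Hypothetical lookup: index the definitions by their folded label,
    # then look up the folded form of the label being resolved.
    table = {key(name): url for name, url in definitions.items()}
    return table.get(key(label))


# Assumed sample reference definitions, for illustration only.
definitions = {"árbol": "/url1", "bar": "/url2"}
```

With ASCII-only matching, `resolve("Bar", definitions, ascii_only_key)`
finds `/url2`, but `resolve("Árbol", definitions, ascii_only_key)` finds
nothing, because `Á` is never lowercased.  With `unicode_key`, both
lookups succeed.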

Unfortunately, as you point out, this becomes very complex, and
brings in locale dependence for a few cases (e.g. Turkish).  Still,
I think it's the ideal we should aspire to.
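
The Turkish case can be illustrated with another short Python sketch.
The `turkish_key` helper is hypothetical, and how a processor would
learn the document's language is left open.  Turkish pairs `I` with
dotless `ı` and `İ` with `i`, which the default fold does not do.

```python
def default_key(label: str) -> str:
    # Default Unicode case fold: "I" folds to "i", regardless of language.
    return label.casefold()


def turkish_key(label: str) -> str:
    # Hypothetical Turkish-aware fold: map the two special uppercase
    # letters first, then apply the default fold to everything else.
    return label.replace("I", "ı").replace("İ", "i").casefold()


# Under the default fold these labels do not match, although a Turkish
# writer would expect them to; under the Turkish-aware fold they do.
default_match = default_key("ISPARTA") == default_key("ısparta")
turkish_match = turkish_key("ISPARTA") == turkish_key("ısparta")
```

Here `default_match` is `False` and `turkish_match` is `True`.  The
Turkish-aware fold would, of course, give the wrong answer for English
text, which is exactly the locale dependence in question.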