Detab should be multi-byte aware?
Michel Fortin
michel.fortin at michelf.com
Mon Oct 9 21:33:49 EDT 2006
On Oct 9, 2006, at 20:34, John Gruber wrote:
> Michel Fortin <michel.fortin at michelf.com> wrote on 10/9/06 at 8:26 PM:
>
>> If anyone is interested in a fix for PHP Markdown, just change
>> the call to the `strlen` function within detab to a call to
>> `mb_strlen($line, 'utf-8')`. I'll fix this for the next
>> version.
>
> Will that still work if people pass in Windows Latin 1 or Mac
> Roman-encoded text? Yes, I'm too lazy to try it...
I haven't tried it inside PHP Markdown yet, but I've tested
`mb_strlen` and it seems to treat any invalid UTF-8 byte sequence as
individual characters. So the net result is that text in ISO Latin,
Windows Latin, or Mac Roman will work fine unless it contains
sequences which are valid UTF-8. For instance, "é" in UTF-8 is seen
as "√©" in Mac Roman, so if you have "√©" in a Mac Roman-encoded
text it'll be treated as only one character. I'm not sure how high
that risk is for all character combinations, but it's obviously less
problematic than the current behaviour is for UTF-8.
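As a rough sketch of the fix described above, assuming a simplified detab that handles a single line (the `detab_line` helper is hypothetical and is not PHP Markdown's actual code), counting with `mb_strlen` instead of `strlen` changes the padding like this. Results for invalid byte sequences are left out because they may depend on the PHP version.

```php
<?php
// Hypothetical helper for illustration; not PHP Markdown's actual detab code.
function detab_line(string $line, int $tab_width = 4): string
{
    $blocks = explode("\t", $line);
    $out = array_shift($blocks);
    foreach ($blocks as $block) {
        // Count characters, not bytes, to find the distance to the next tab stop.
        $width = mb_strlen($out, 'UTF-8');
        $out .= str_repeat(' ', $tab_width - $width % $tab_width) . $block;
    }
    return $out;
}

$e_acute = "\xC3\xA9"; // "é" encoded as UTF-8: two bytes, one character

var_dump(strlen($e_acute));             // int(2): bytes
var_dump(mb_strlen($e_acute, 'UTF-8')); // int(1): characters
var_dump(detab_line($e_acute . "\tx") === $e_acute . "   x"); // bool(true): three spaces of padding

// The same two bytes read as Mac Roman are the two characters "√©",
// yet they still count as one because they form a valid UTF-8 sequence.
```

With `strlen` the padding after "é" would be two spaces rather than three, which is the misalignment this thread is about.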
Another solution is to omit the 'utf-8' parameter and rely on PHP's
internal encoding being the same as the input's. (The internal encoding
can be set by the user using `mb_internal_encoding('utf-8')`.) Doing
that, however, implies that PHP Markdown would work with something other
than UTF-8 by default, and I'm not so sure that's a good idea.
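To illustrate how the count then follows the internal encoding, here is a small sketch. It assumes the mbstring extension is loaded, and it uses ISO-8859-1 only as an example of a non-UTF-8 setting.

```php
<?php
$e_acute = "\xC3\xA9"; // "é" encoded as UTF-8

$previous = mb_internal_encoding(); // remember the current setting

mb_internal_encoding('UTF-8');
var_dump(mb_strlen($e_acute)); // int(1)

mb_internal_encoding('ISO-8859-1');
var_dump(mb_strlen($e_acute)); // int(2): each byte is one Latin-1 character

mb_internal_encoding($previous); // restore the original setting
```

The same input yields different lengths depending on a setting that lives outside PHP Markdown, which is the concern raised above.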
Yet another solution is a distinct configuration variable set to
UTF-8 by default.
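One possible shape for such a setting, purely as an assumption: the variable name `md_input_encoding` and the wrapper `md_strlen` below are invented for illustration and are not part of PHP Markdown.

```php
<?php
// Hypothetical configuration variable, UTF-8 unless the user changes it.
$GLOBALS['md_input_encoding'] = 'UTF-8';

// Hypothetical wrapper that detab could call instead of strlen().
function md_strlen(string $text): int
{
    return mb_strlen($text, $GLOBALS['md_input_encoding']);
}

var_dump(md_strlen("\xC3\xA9")); // int(1) with the UTF-8 default

// A user working with Latin-1 text overrides the default.
$GLOBALS['md_input_encoding'] = 'ISO-8859-1';
var_dump(md_strlen("\xE9")); // int(1): "é" is a single byte in Latin-1
```

This keeps UTF-8 as the default while leaving the choice independent of PHP's internal encoding.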
Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/