Detab should be multi-byte aware?

Michel Fortin michel.fortin at michelf.com
Mon Oct 9 21:33:49 EDT 2006


On Oct. 9, 2006, at 20:34, John Gruber wrote:


> Michel Fortin <michel.fortin at michelf.com> wrote on 10/9/06 at 8:26 PM:
>
>> If anyone is interested in a fix for PHP Markdown, just change
>> the call to the `strlen` function within detab to a call to
>> `mb_strlen($line, 'utf-8')`. I'll fix this for the next
>> version.
>
> Will that still work if people pass in Windows Latin 1 or Mac
> Roman-encoded text? Yes, I'm too lazy to try it...


I haven't tried it inside PHP Markdown yet, but I've tested
`mb_strlen` and it seems to treat each byte of any invalid UTF-8
sequence as an individual character. So the net result is that text
in ISO Latin, Windows Latin, or Mac Roman will work fine unless it
contains sequences which are valid UTF-8. For instance, "é" in UTF-8
is seen as "√©" in Mac Roman, so if you have "√©" in a Mac
Roman-encoded text it'll be treated as only one character. I'm not
sure how high that risk is across all character combinations, but it
is obviously less problematic than the current behaviour is for
UTF-8.
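
As a rough illustration of the behaviour described above, here is a small sketch. It assumes the mbstring extension is loaded; the sample strings and variable names are made up for the example and are not taken from PHP Markdown.

```php
<?php
// "café" with "é" encoded as UTF-8 (two bytes: C3 A9).
$utf8 = "caf\xC3\xA9";

// "café" with "é" encoded as Mac Roman (one byte: 8E).
// On its own, 8E is not a valid UTF-8 sequence.
$macRoman = "caf\x8E";

// "caf√©" in Mac Roman: the bytes C3 A9 happen to be valid UTF-8.
$macRomanCollision = "caf\xC3\xA9";

echo strlen($utf8), "\n";                          // 5: counts bytes
echo mb_strlen($utf8, 'UTF-8'), "\n";              // 4: counts characters
echo mb_strlen($macRoman, 'UTF-8'), "\n";          // 4: stray byte counted as one character
echo mb_strlen($macRomanCollision, 'UTF-8'), "\n"; // 4, although the Mac Roman text has 5 characters
```

The last line is the risky case: the two Mac Roman characters are indistinguishable from a UTF-8 "é", so the count comes out one short.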

Another solution is to omit the 'utf-8' parameter and rely on the PHP
internal encoding being the same as the input. (The internal encoding
can be set by the user using `mb_internal_encoding('utf-8')`.) Doing
that, however, implies that PHP Markdown will work with something
other than UTF-8 by default, and I'm not so sure that's a good idea.
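
A minimal sketch of what relying on the internal encoding looks like, assuming the mbstring extension is available. The sample string is invented for the example.

```php
<?php
// "naïve" encoded as UTF-8: "ï" takes two bytes (C3 AF), six bytes in total.
$line = "na\xC3\xAFve";

// With the internal encoding set to UTF-8, mb_strlen() called without
// an explicit encoding counts UTF-8 characters.
mb_internal_encoding('UTF-8');
$asUtf8 = mb_strlen($line);   // 5

// With a single-byte internal encoding, the same call counts every
// byte as one character.
mb_internal_encoding('ISO-8859-1');
$asLatin1 = mb_strlen($line); // 6

echo $asUtf8, " ", $asLatin1, "\n";
```

The same call gives different answers depending on a global setting, which is the concern with making this the default.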

Yet another solution is a distinct configuration variable set to
UTF-8 by default.
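
A possible shape for that last option, purely as a sketch: the constant names and the `example_detab_line` function below are hypothetical and are not PHP Markdown's actual code or settings. It assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical settings, not actual PHP Markdown options.
define('EXAMPLE_DETAB_ENCODING', 'UTF-8');
define('EXAMPLE_TAB_WIDTH', 4);

// Expand the tabs in a single line to spaces, measuring the current
// column in characters of the configured encoding rather than in bytes.
function example_detab_line($line, $tabWidth = EXAMPLE_TAB_WIDTH, $encoding = EXAMPLE_DETAB_ENCODING) {
    $blocks = explode("\t", $line);
    $out = array_shift($blocks); // text before the first tab needs no padding
    foreach ($blocks as $block) {
        $column = mb_strlen($out, $encoding);
        // Pad up to the next tab stop, then append the following block.
        $out .= str_repeat(' ', $tabWidth - $column % $tabWidth) . $block;
    }
    return $out;
}

// "é" is one character in UTF-8, so three spaces reach column 4.
echo example_detab_line("\xC3\xA9\tx"), "\n";
```

Someone feeding in single-byte text could then pass a different encoding name, or change the constant, without touching the global internal encoding.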


Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/



