Detab should be multi-byte aware?

Michel Fortin michel.fortin at
Tue Oct 10 10:18:30 EDT 2006

On 10 Oct 2006, at 3:17, A. Pagaltzis wrote:

> * John Gruber <gruber at> [2006-10-10 05:55]:
>> I think it's simpler and better to just say "use UTF-8".
>
> +1
>
> UTF-8 is in fact deliberately constructed such that the chance of
> arbitrary text accidentally being valid UTF-8 approaches zero with
> increasing length of the text.

Except that increasing the length of the text won't have any effect
when using `mb_strlen` because:

1. I only pass small snippets through it to calculate the number of
spaces needed to replace the tab character, not the whole text;

2. If I give it a string with both valid and invalid UTF-8 character
sequences, it will consider valid sequences as one character
and the invalid ones as two, or more depending on the number of

I decided to test this more systematically by writing a small
PHP script that displays how UTF-8 characters are interpreted by a
couple of ASCII-compatible 8-bit encodings. You can test it here (be
sure to look beyond the first page):


From what I can see, ISO Latin and Windows Latin seem to be mostly
immune to any problem. Mac Roman has a couple of problematic
strings that would be legitimate within text ("«é" for instance)
and that are also valid UTF-8. But the worst seems to come from ISO
Cyrillic and ISO Greek, which have many common character combinations
clashing with UTF-8 sequences. For instance: "Чорнобиль" (Chernobyl in
Ukrainian) is 9 characters, but if you encode it with ISO 8859-5 and
then count the characters as if they were UTF-8 characters, you find
only 4.
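
The two examples above can be reproduced with a short sketch, written here in Python rather than PHP. The helper `count_by_lead_byte` is hypothetical: it assumes a counter that trusts the length announced by each lead byte without validating the continuation bytes, which is one behaviour consistent with the count of 4 reported above. It is not taken from the `mb_strlen` source.

```python
def count_by_lead_byte(data: bytes) -> int:
    """Count "characters" by trusting each lead byte's declared length.

    Assumption: mimics a non-validating UTF-8 counter; hypothetical helper,
    not the actual mb_strlen implementation.
    """
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0xC0:
            step = 1  # ASCII, or a stray continuation byte counted alone
        elif b < 0xE0:
            step = 2  # lead byte announcing a 2-byte sequence
        elif b < 0xF0:
            step = 3  # lead byte announcing a 3-byte sequence
        else:
            step = 4  # lead byte announcing a 4-byte sequence
        i += step
        count += 1
    return count


word = "Чорнобиль"
cyrillic_bytes = word.encode("iso8859_5")  # every byte is >= 0xC0
mac_bytes = "«é".encode("mac_roman")       # bytes 0xC7 0x8E

print(len(word))                           # 9 characters in the original
print(count_by_lead_byte(cyrillic_bytes))  # 4 when miscounted as UTF-8
print(len(mac_bytes.decode("utf-8")))      # 1: the pair is valid UTF-8
```

In ISO 8859-5 every letter of the word maps to a byte that UTF-8 treats as a lead byte, so a non-validating counter swallows the following letters as if they were continuation bytes.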

This shows that using `mb_strlen` in `detab` as I suggested could
cause problems, especially with non-Latin encodings, but also with
some rare, but not so silly, character combinations in Mac Roman.
That said, I think these problems are less important than UTF-8
characters not working right, so I still plan to use UTF-8 to count
the characters in `detab`.

Michel Fortin
michel.fortin at

More information about the Markdown-Discuss mailing list