UTF-8 BOM

Allan Odgaard 29mtuz102 at sneakemail.com
Sat Oct 27 09:23:47 EDT 2007


On 27/10/2007, at 14:55, Michel Fortin wrote:


> [...]
> Now, the interesting question is: what should PHP Markdown (or any
> Markdown implementation for that matter) do with the UTF-8 BOM? Here
> are three options:
>
> 1. Remove it?
> 2. Keep it at the start of the text?
> 3. Ignore it (as it does now)?
>
> Option 3 seems a logical option to me


Yes, ignore it!


> [...]
> Between option 1 and 2, surely option 1 (dropping the BOM) is the
> best. Otherwise it'd be hard to concatenate the output with a
> template HTML document.


And that is why the user should not have placed the BOM in a UTF-8
file in the first place ;)
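
For an implementation that did choose option 1 from the quoted text
(dropping the BOM), a minimal sketch in Python might look like the
following. The helper name strip_utf8_bom is hypothetical and is not
taken from PHP Markdown or any other Markdown implementation; it only
assumes the input arrives as raw bytes.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + b"# Title\n"
without_bom = b"# Title\n"

# Both inputs end up identical, so the output can be pasted into a template.
cleaned_a = strip_utf8_bom(with_bom)
cleaned_b = strip_utf8_bom(without_bom)
```

When decoding to text anyway, Python's "utf-8-sig" codec does the same
thing in one step: it skips a leading BOM if there is one and otherwise
behaves like plain UTF-8.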

UTF-8 is an ASCII superset that makes 99% of existing programs that
deal with ASCII work flawlessly with the text. Add the BOM and you
break that, e.g. using ‘cat’ to concatenate files will result in BOMs
in the middle of the result; use ‘grep’ to extract stuff and you may
or may not get a BOM in the result; use a shebang line and find that
the kernel (execve()) won’t actually recognize it; save your C source
with a BOM and gcc will choke on it, etc.
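
The ‘cat’ and shebang cases can be reproduced with plain byte strings.
This is only an illustrative sketch: the cat() function below is a
stand-in for byte-wise concatenation, not a call to the real tool, and
the file contents are made up.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

def cat(*chunks: bytes) -> bytes:
    """Byte-wise concatenation, which is all 'cat' does to its inputs."""
    return b"".join(chunks)

# Two made-up files, each saved with a UTF-8 BOM.
a = BOM + b"#!/bin/sh\necho one\n"
b = BOM + b"echo two\n"
joined = cat(a, b)

# The second file's BOM now sits in the middle of the result,
# exactly where the first file ended.
stray_bom_offset = joined.find(BOM, len(BOM))

# A shebang is only recognized when "#!" is the very first two bytes;
# the BOM pushes it to offset 3.
has_shebang = joined.startswith(b"#!")
shebang_offset = joined.find(b"#!")
```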

The BOM is a byte order mark for UTF-16; it has no place in UTF-8.
Some may argue it is there to indicate that the file is UTF-8, but
UTF-8 can already be recognized with >99% certainty w/o the BOM, so
the BOM doesn’t really help here, and when text is sent over the wire,
there generally is a specified default encoding and a way to change
that, which does not include adding garbage to the start of the file
(and to the best of my knowledge no standard calls for the examination
of the first 3 bytes to determine encoding).
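
One common way to recognize UTF-8 without a BOM is simply to attempt a
strict decode, since text in most legacy 8-bit encodings is unlikely to
form valid UTF-8 multi-byte sequences by accident. A rough sketch,
where looks_like_utf8 is a hypothetical name and the whole input is
assumed to be available (a buffer cut in the middle of a multi-byte
sequence would be rejected):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data decodes cleanly as strict UTF-8.

    Pure ASCII also passes, which is fine since ASCII is a UTF-8 subset.
    """
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True

sample = "naïve café"
utf8_ok = looks_like_utf8(sample.encode("utf-8"))
latin1_ok = looks_like_utf8(sample.encode("latin-1"))
ascii_ok = looks_like_utf8(b"plain ASCII text")
```

The Latin-1 bytes fail because a lone 0xEF or 0xE9 followed by an ASCII
letter is not a valid UTF-8 sequence.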


> [...]
> UTF-8 BOM handling sounds like a good thing to add to MDTest too.


I’d say no -- on the contrary, if the user adds a BOM to his UTF-8
file he should be told that this is a bad idea. Fortunately none of
the text editors on my system even has this option ;)



More information about the Markdown-Discuss mailing list