[dcc2] FWD: Steve Wittens comments
Dan Smith
dan at algenta.com
Sun Apr 25 21:08:03 EDT 2004
These comments were sent in by Steve Wittens:
---------------------------------------------
I took a look at the DCC2 specs and noticed that the format for filenames
in the standard mode seems very restricted (only digits, alphabetic
characters, and some symbols). More generally, there seems to be no mention
at all of character encoding usage in the specs anywhere. For single-file
transfers, the character set seems restricted to a subset of ASCII, while
the XML files use UTF-8 as example encoding (which seems to mean you would
always have to send a single Unicode-named file through a multifile send).
The logical choice for DCC2 would be to commit itself to supporting Unicode
as standard everywhere, more specifically by requiring UTF-8 as encoding
for all strings and filenames. By design, Unicode aims to cover the
characters of all existing character sets. It is also extensible, as new characters added
with each Unicode revision can be used without any modifications to the
handling code.
On the web, UTF-8 is rapidly growing as the encoding of choice: not just in
(X)HTML, but in syndication formats like RSS or Atom too. It is the perfect
intermediate format: all modern operating systems provide Unicode support
and can de/encode legacy encodings to UTF-8 if required.
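As a rough illustration of that round trip (not part of the DCC2 specs), the
Python sketch below converts a filename from a legacy encoding to UTF-8. The
helper name legacy_to_utf8 and the sample filename are made up for this
example, and the sender is assumed to know which legacy encoding is in use.

```python
def legacy_to_utf8(raw: bytes, legacy_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding as UTF-8.

    Hypothetical helper for illustration only; a real client would also
    have to decide what to do when the legacy encoding is unknown.
    """
    return raw.decode(legacy_encoding).encode("utf-8")


# A filename as a Latin-1 system might store it: one byte per character.
latin1_name = "résumé.txt".encode("latin-1")

# The same filename as it could be sent if the protocol required UTF-8.
utf8_name = legacy_to_utf8(latin1_name, "latin-1")

# The receiver decodes UTF-8 and gets the original characters back.
received = utf8_name.decode("utf-8")
```

The receiving side can then re-encode the name into whatever its own file
system expects, so neither end needs to know the other's local encoding.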
UTF-8's multibyte nature and ASCII compatibility make it perfect for
limited environments (like IRC) where encodings are sadly still assumed to
be non-existent.
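The ASCII compatibility mentioned here can be checked with a small Python
sketch. The filenames and the helper utf8_bytes are invented for this
example; the point is only that pure-ASCII names are byte-for-byte unchanged,
and that every byte of a multibyte character is 0x80 or higher, so it cannot
be mistaken for an ASCII delimiter such as a space or a line break.

```python
def utf8_bytes(name: str) -> bytes:
    """Encode a filename as UTF-8 (hypothetical helper for illustration)."""
    return name.encode("utf-8")


# A pure-ASCII filename is identical in ASCII and in UTF-8.
ascii_name = "report_2004.txt"
ascii_unchanged = utf8_bytes(ascii_name) == ascii_name.encode("ascii")

# A filename with non-ASCII characters uses extra bytes only for those
# characters; all of the extra bytes have the high bit set.
accented = "señal_año.txt"
encoded = utf8_bytes(accented)
high_bytes = [b for b in encoded if b >= 0x80]

# Bytes below 0x80 in the encoded form are exactly the ASCII characters
# of the original name, in order.
low_bytes = bytes(b for b in encoded if b < 0x80)
ascii_part = "".join(ch for ch in accented if ord(ch) < 0x80)
```

This is why a line-oriented, space-delimited protocol can carry UTF-8 text
without changes to its parsing of ASCII control and separator characters.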
Not worrying about encoding will undoubtedly result in differing
implementations and incompatibilities. Fixing this from the start will
prevent this: there's nothing more annoying than having to write a
non-English language in ASCII just to send something over. You may laugh at
l33tsp34k, but imagine actually having to type like that because the 'e'
and 'a' are not supported. This sounds crazy, but it's exactly what
happens when you strip accents off languages like Spanish or French.
Non-Latin languages seem to be ignored completely.
Clear multilanguage support through Unicode is not at all hard if you do it
right from the beginning, and it would be another selling point for DCC2
adoption.