[dcc2] FWD: Steve Wittens comments

Dan Smith dan at algenta.com
Sun Apr 25 21:08:03 EDT 2004


These comments were sent in by Steve Wittens:
---------------------------------------------

I took a look at the DCC2 specs and noticed that the format for filenames 
in the standard mode seems very restricted (only digits, alphabetic 
characters, and some symbols). More generally, there seems to be no mention 
at all of character encoding usage in the specs anywhere. For single-file 
transfers, the character set seems restricted to a subset of ASCII, while 
the XML files use UTF-8 as the example encoding (which seems to mean you would 
always have to send a single Unicode-named file through a multifile send).

The logical choice for DCC2 would be to commit itself to supporting Unicode 
as standard everywhere, more specifically by requiring UTF-8 as encoding 
for all strings and filenames. Unicode is designed to cover the characters 
of all existing character sets. It is also extensible, as new characters added 
with each Unicode revision can be used without any modifications to the 
handling code.

On the web, UTF-8 is rapidly growing as the encoding of choice: not just in 
(X)HTML, but in syndication formats like RSS or Atom too. It is the perfect 
intermediate format: all modern operating systems provide Unicode support 
and can de/encode legacy encodings to UTF-8 if required.
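As a rough illustration of that conversion step, here is a minimal Python 
sketch. The helper name to_utf8 and the choice of ISO-8859-1 as the legacy 
encoding are my own assumptions for the example; nothing like this appears 
in the DCC2 specs.

```python
def to_utf8(raw: bytes, legacy_encoding: str = "latin-1") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    Hypothetical helper: a client would call this on a local filename
    before sending it, if the local system does not already use UTF-8.
    """
    return raw.decode(legacy_encoding).encode("utf-8")


# "résumé.txt" as stored on a system using ISO-8859-1 (assumed example name)
legacy_name = b"r\xe9sum\xe9.txt"
utf8_name = to_utf8(legacy_name)

# The receiver can decode the UTF-8 bytes without knowing the sender's locale.
decoded_name = utf8_name.decode("utf-8")
```

The sender still has to know its own local encoding; the point is only that 
the bytes on the wire would always be UTF-8.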

UTF-8's multibyte nature and ASCII compatibility make it perfect for 
limited environments (like IRC) where encodings are sadly still assumed to 
be non-existent.
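To make the ASCII compatibility concrete, here is a small Python sketch. 
The function name and the sample filenames are assumptions of mine, chosen 
only for illustration. It checks two properties: pure ASCII text is 
byte-for-byte identical in UTF-8, and non-ASCII characters only ever 
produce bytes with the high bit set, so they cannot be mistaken for spaces, 
CR/LF, or other ASCII delimiters on an IRC line.

```python
def non_ascii_uses_only_high_bytes(text: str) -> bool:
    """Return True if every non-ASCII character in text encodes to
    UTF-8 bytes that are all >= 0x80 (never colliding with ASCII)."""
    for ch in text:
        if ord(ch) >= 0x80:
            if any(b < 0x80 for b in ch.encode("utf-8")):
                return False
    return True


# Pure ASCII filename (hypothetical): identical bytes in ASCII and UTF-8.
ascii_name = "report_2004.txt"
same_bytes = ascii_name.encode("utf-8") == ascii_name.encode("ascii")

# Mixed filename (hypothetical): accented Latin and Japanese characters.
mixed_name = "\u65e5\u672c\u8a9e_r\u00e9sum\u00e9.txt"
safe = non_ascii_uses_only_high_bytes(mixed_name)

# Characters take a variable number of bytes: 1 for ASCII, more for others.
widths = [len(ch.encode("utf-8")) for ch in "a\u00e9\u65e5"]
```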

Not worrying about encoding will undoubtedly result in different 
implementations and incompatibilities. Fixing this from the start will 
prevent this: there's nothing more annoying than having to write a 
non-English language in ASCII just to send something over. You may laugh at 
l33tsp34k, but imagine actually having to type like that because the 'e' 
and 'a' are not supported. This sounds crazy, but it's exactly what 
happens when you strip accents off languages like Spanish or French. 
Non-Latin languages seem to be ignored completely.

Clear multilanguage support through Unicode is not at all hard if you do it 
right from the beginning, and it would be another selling point for DCC2 
adoption.





