Unicode and other related standards

The Unicode Standard is published by the Unicode Consortium, which was created by Microsoft, Apple, Sun, IBM and other major software vendors. Although new versions of the Unicode Standard are released regularly, each of them is strictly compatible with older versions, i.e. all code points assigned in older versions stay the same in newer ones.
The current ISO 10646 standard, published by the International Organization for Standardization, is kept in sync with the Unicode Standard.

One could say that Unicode is an implementation of the ISO 10646 standard. When the first ISO 10646 standard appeared long ago, the 32-bit encoding it proposed (today's UTF-32) was hardly usable in practice. Software vendors therefore created the Unicode Consortium and, in parallel, the Unicode Standard, which defined practical encoding methods (the so-called transformation formats) for the universal character set. The Consortium now also takes care of the classification of rare symbols and of other aspects of writing systems. Membership of the Unicode Consortium is currently open to any organization.

UTF-8 is a Unicode format in which each code point is encoded as a sequence of 1 to 4 bytes (the original definition in RFC 2279 allowed sequences of up to 6 bytes; its successor, RFC 3629, limits them to 4). It is compatible with US-ASCII, i.e. each English letter is encoded as a single byte with the same value as in US-ASCII. Most accented Latin letters and Russian letters are encoded as 2 bytes, East Asian ideograms as 3 bytes. UTF-8 is defined in RFC 2279 "UTF-8, a transformation format of ISO 10646", and also in ISO 10646 Annex R. UTF-8 is used on the Internet and on Unix-like systems.
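The byte counts mentioned above can be checked with a short Python sketch. The sample characters and the helper name utf8_length are chosen here purely for illustration.

```python
# Sample characters (illustrative choices, written as escapes to keep the source ASCII-only).
samples = {
    "A": "English letter",
    "\u010d": "accented Latin letter (c with caron)",
    "\u0436": "Russian letter (zhe)",
    "\u4e2d": "East Asian ideogram",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, label in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {label}: {utf8_length(ch)} byte(s), hex {encoded.hex()}")

# US-ASCII compatibility: pure ASCII text yields identical bytes in both encodings.
ascii_compatible = "Unicode".encode("utf-8") == "Unicode".encode("ascii")
print("ASCII compatible:", ascii_compatible)
```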

UTF-16 is a Unicode format in which each character is encoded as one or two 16-bit numbers. UTF-16 is defined in RFC 2781. The simplified variant that uses only one 16-bit number per character is called UCS-2. UTF-16 or UCS-2 is used by Microsoft programs and operating systems. UTF-16 cannot be used everywhere: zero bytes may occur when the data is treated as a byte sequence, the byte order of the 16-bit numbers may differ between systems, and there is no direct compatibility with US-ASCII.
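The three practical problems listed above (zero bytes, byte order, no US-ASCII compatibility), along with the one-or-two-unit rule, can be seen in a small Python sketch. The helper utf16_units and the sample characters are assumptions made for this illustration.

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit units needed to encode text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


letter = "A"
big_endian = letter.encode("utf-16-be")     # high byte first
little_endian = letter.encode("utf-16-le")  # low byte first

# The same letter gives different byte sequences depending on byte order,
# both contain a zero byte, and neither equals the single US-ASCII byte.
print(big_endian, little_endian, letter.encode("ascii"))

# A character outside the 16-bit range needs two 16-bit numbers (a surrogate pair).
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
pair = clef.encode("utf-16-be")
print(utf16_units(letter), utf16_units(clef), pair.hex())
```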

UTF-7 is a Unicode format that uses only sequences of 7-bit bytes, intended mostly for e-mail. It is now considered obsolete; the recommendation is to use UTF-8 together with the standard Base64 or quoted-printable encoding instead.
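As a rough comparison, the Python sketch below encodes one sample word with UTF-7 and with UTF-8 wrapped in Base64 or quoted-printable. The sample text and the helper is_7bit are illustrative assumptions. All three results contain only 7-bit bytes, while raw UTF-8 does not.

```python
import base64
import quopri

name = "Ri\u010dardas"  # sample text with one non-ASCII letter (illustrative)

utf7 = name.encode("utf-7")
utf8 = name.encode("utf-8")
utf8_b64 = base64.b64encode(utf8)    # UTF-8 wrapped in Base64
utf8_qp = quopri.encodestring(utf8)  # UTF-8 wrapped in quoted-printable


def is_7bit(data: bytes) -> bool:
    """True if every byte fits in 7 bits."""
    return all(b < 0x80 for b in data)


for label, data in [("UTF-7", utf7), ("UTF-8", utf8),
                    ("UTF-8 + Base64", utf8_b64),
                    ("UTF-8 + quoted-printable", utf8_qp)]:
    print(f"{label}: {data!r} 7-bit safe: {is_7bit(data)}")
```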

UCS-4, or UTF-32, encodes each character as a single 32-bit number.
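A minimal Python sketch of the fixed-width property follows. The helper name utf32_units and the sample string are assumptions for illustration.

```python
def utf32_units(text: str) -> int:
    """Count the 32-bit units needed to encode text in UTF-32."""
    return len(text.encode("utf-32-le")) // 4


# One ASCII letter, one ideogram, one character beyond the 16-bit range.
sample = "A\u4e2d\U0001D11E"

# Every character takes exactly one 32-bit number, whatever its code point.
print(utf32_units(sample), len(sample))
print("A".encode("utf-32-be"))
```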

The IETF (Internet Engineering Task Force) has decided that UTF-8 is the only encoding that all Internet protocols must be able to support; see RFC 2277 "IETF Policy on Character Sets and Languages". This is reflected in the XML, LDAP, NNTP and other protocol documents, which specify UTF-8 as the text encoding of those protocols.

The Internet Mail Consortium, created by Microsoft, IBM, AOL, Sendmail, Sun and others, published the document "Using International Characters in Internet Mail" in 1998. It recommends:
All mail-creating programs created or revised after January 1, 1999, must be able to create mail using the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot create mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities.
...
All mail-displaying programs created or revised after January 1, 1999, must be able to display mail that uses the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot display mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities.

According to the MIME standard (RFC 2045, 2046, 2047, 2048 and 2049), an email message or MIME part declares its text encoding in the "charset" parameter of the "Content-Type:" header; when the parameter is missing, US-ASCII is assumed. Only with this declaration can an email program determine the encoding of the message body automatically and display it properly.
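As a sketch of how this looks in practice, the following Python snippet builds a message with the standard library email package and reads the declared charset back. The addresses and the body text are made up for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a message whose body is declared as UTF-8 (addresses are hypothetical).
msg = EmailMessage()
msg["From"] = "sender@example.org"
msg["To"] = "recipient@example.org"
msg["Subject"] = "Charset test"
original_body = "Lithuanian letters: \u010d \u0119 \u0117"
msg.set_content(original_body, charset="utf-8")

print(msg["Content-Type"])  # the header carrying the charset parameter

# A receiving program parses the raw bytes and reads the charset from the header.
raw = msg.as_bytes()
parsed = message_from_bytes(raw, policy=policy.default)
charset = parsed.get_content_charset()
body = parsed.get_content()
print(charset, body.strip())
```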

However, this is not enough to determine the encoding of the headers themselves (such as Subject or From). Headers containing non US-ASCII characters must therefore be written in the form defined by RFC 2047 and RFC 2231, e.g.: =?UTF-8?B?UmnEjWFyZGFzIMSMZXBhcw==?=. Many popular server-side email programs may mangle 8-bit headers that are not in that form.
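The encoded word shown above can be decoded with the Python standard library, as in this sketch. It also encodes the name again; the exact re-encoded form depends on the library (it may choose Base64 or quoted-printable), so only the round trip is checked here.

```python
from email.header import Header, decode_header, make_header

encoded = "=?UTF-8?B?UmnEjWFyZGFzIMSMZXBhcw==?="

# decode_header splits the value into (bytes, charset) parts;
# make_header joins them back into a readable Unicode string.
decoded = str(make_header(decode_header(encoded)))
print(decoded)

# Going the other way: produce an RFC 2047 encoded word from a Unicode string.
reencoded = Header(decoded, charset="utf-8").encode()
print(reencoded)

# Decoding the new encoded word gives the same text again.
roundtrip = str(make_header(decode_header(reencoded)))
```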

Newsgroup readers often perform worse than email programs in this respect, because for a long time there were no standards for using non US-ASCII characters in newsgroups. New drafts recommend using the same MIME email standard for newsgroups as well.

