Understanding character sets and encodings is one of the most important things for internationalization. Understanding Unicode in particular is crucial in modern internationalization architecture. Category:I18N Character Set lists the pages related to character sets and/or encodings.
Character set and encoding
- Before you get into any of those pages, it is worth taking the time to understand the difference between a character set and a character encoding. This is especially important for understanding Unicode and its encodings. In short, a character set provides a common set of characters and says nothing about how a character is stored as bytes. (Unicode also assigns each character an abstract number, its code point, but that is still not a byte representation.) A character encoding is the mapping from each character to the numeric byte values that represent it.
- Still confusing? Not sure why this matters? Take Unicode as an example. The Unicode character set simply defines a common set of characters that covers all languages. Separately, there are several encodings for it, such as UTF-16, UTF-8, and UTF-7. If you don't understand this difference, you will most likely get confused by the various encodings that support the Unicode character set, and you will make mistakes when implementing Unicode support.
- For example, Java supports the Unicode character set and uses the UTF-16 encoding internally. But UTF-16 is not favored by HTTP, and in many cases you need to send data as UTF-8 over an HTTP connection. If you don't understand the difference and assume that the Unicode character set and UTF-16 are identical, you may send data in UTF-16 over an HTTP connection without converting it from UTF-16 to UTF-8. Then your implementation will not work, because the numeric value of a character differs between the two encodings. (For example, Latin capital letter A is mapped to '0x0041', two bytes, in UTF-16, while it is mapped to '0x41', one byte, in UTF-8.)
- So please keep this difference in mind as you go through those pages.
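The difference described above can be seen in a few lines of code. The following is only a minimal sketch in Python (chosen for brevity; the same idea applies to Java). It shows one character, its Unicode code point, and its bytes under two encodings. The helper name `utf16_to_utf8` is hypothetical, and the sketch assumes big-endian UTF-16 without a byte order mark.

```python
# Illustrative sketch: one character, one code point, two byte encodings.
ch = "A"  # LATIN CAPITAL LETTER A

code_point = ord(ch)                  # abstract number assigned by Unicode
utf16_bytes = ch.encode("utf-16-be")  # big-endian UTF-16, no byte order mark
utf8_bytes = ch.encode("utf-8")

print(f"U+{code_point:04X}")  # U+0041
print(utf16_bytes.hex())      # 0041 (two bytes)
print(utf8_bytes.hex())       # 41   (one byte)


def utf16_to_utf8(data: bytes) -> bytes:
    """Hypothetical helper: decode big-endian UTF-16 bytes to text,
    then re-encode that text as UTF-8 (e.g. before sending it over HTTP)."""
    return data.decode("utf-16-be").encode("utf-8")


# The character stays the same; only its byte representation changes.
print(utf16_to_utf8(utf16_bytes).hex())  # 41
```

Note that the conversion always goes through the decoded text: bytes in one encoding are turned back into characters first, and only then written out in the other encoding.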