Internationalization Architecture Review Points
- From a technical perspective, the architecture review is the most important step in internationalization. If you are building a brand-new product or a major feature, have internationalization experts review the architecture, much as you would invite security or usability experts to a product architecture review. Defects uncovered in this phase make all later internationalization work much easier. Even if you cannot afford to implement every internationalization requirement from the beginning, you can at least prepare placeholders so that conforming to those requirements is far easier in future releases. If your product is already released, the review is still worthwhile: it identifies the major issues before you change the implementation or run QA. Otherwise, you will not be able to make code changes or run QA in a cost-effective manner.
- The following sections are internationalization review point categories, introduced in order of priority. If your products do not pass an earlier category, it may not be worth moving on to the later ones. For instance, if you do not know whether your technology stacks can support the encodings you have to support, you cannot tell whether your data handling is sound.
- The rule of thumb for prioritizing an internationalization issue is to evaluate the number of languages and regions it affects. For instance, if your products lack GBK encoding support, you will probably have difficulty starting your business in China only. But if you lack ISO-8859-1 encoding support, you will most likely lose business opportunities across western Europe, which means 15 or more countries. Though China is an emerging and attractive market, losing the western European countries would have a more significant impact on your business. Another example is the lack of translation for your server products. Though providing translation for every language is always preferred, it is not always a mandatory requirement, especially for server products. (Note: if you are targeting the Japanese market, you should provide translation regardless of the nature of your products. That market always requires it.) So data handling and locale management should have higher priority than localizability, especially for server products.
- The following categories are prioritized based on this rule of thumb. Therefore, fulfilling the earlier categories is more important than fulfilling the later ones from a business perspective as well.
- First of all, you need to check whether the technology stacks used for your products can pass the following review points. This is the fundamental thing to check. If your technology stacks cannot support a certain encoding and do not provide a way to extend their encoding support, you are unlikely to be able to support that encoding. Likewise, if your technology stacks do not provide translation for a certain language, you need to decide how to deal with that gap.
- So, please perform a technology stack review prior to the internationalization review of your products.
- It is important to determine the character sets and character encodings your products support. Although the encodings to support depend heavily on the nature of your products, supporting Unicode is highly encouraged in every case. Some legacy systems still require legacy encodings such as ISO-8859-1, but new systems support one of the Unicode encodings, such as UTF-8 or UTF-16. Since Unicode lets you support almost all of the world's languages, supporting it maximizes the return on your internationalization investment.
- As mentioned above, supporting the Unicode encodings is always highly recommended. Then what other encodings should be supported? The answer depends heavily on the nature of your products. Take Gmail as an example. Gmail basically runs in UTF-8 when users browse incoming mail and compose mail. However, mail sent by Gmail is encoded in a legacy encoding rather than UTF-8, e.g. ISO-2022-JP for Japanese. This is because there are still some e-mail clients that cannot handle UTF-8. (Note: most e-mail client software does support UTF-8, but some web mail clients and possibly mobile clients may support only the legacy encodings.)
- Another example is web services. New implementations such as XML services most likely support UTF-8 or UTF-16. But many EDI specifications support only legacy encodings such as ISO-8859-1 and US-ASCII, and do not support the Unicode encodings.
- As shown above, you need to assess the encoding requirements and determine the scope of support. If you use a technology platform such as Java or .NET, you can rely on its encoding support and can probably support as many encodings as the platform does at no extra cost. But if you want to limit the supported encodings, you may want to refer to Character Encoding Recommendation for Languages or to the other specifications your products need to support.
The difference between character set and encoding
Before you start working on Unicode support, it is worth taking the time to understand the difference between a character set and a character encoding. In short, a character set defines a common set of characters (Unicode also assigns each one an abstract number, its code point) and says nothing about how a character is stored as bytes. A character encoding is the mapping from those characters to concrete byte values.
Why is this important for understanding Unicode? The Unicode character set simply defines a common set of characters that covers all languages. The Unicode Standard then defines several encodings of that set, such as UTF-8, UTF-16, and UTF-32, and others such as UTF-7 are defined elsewhere. If you do not understand this difference, you will most likely be confused by the various encodings of the Unicode character set, and you will make mistakes in implementing Unicode support.
For example, Java supports the Unicode character set and uses the UTF-16 encoding internally. But UTF-16 is not favored by HTTP, and in many cases you need to send data in UTF-8 over an HTTP connection. If you do not understand the difference and assume that the Unicode character set and UTF-16 are identical, you may send data in UTF-16 over an HTTP connection without converting it to UTF-8. Then your implementation will not work, because the byte representation of a character differs between the two. (e.g. Latin capital letter A is encoded as 0x0041, two bytes, in UTF-16, while it is encoded as 0x41, one byte, in UTF-8.) So please keep this difference in mind while you work on Unicode support.
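The byte-level difference can be checked directly. The following is a minimal Java sketch using only the standard library; the class and helper names are illustrative and not taken from any product. It encodes the same character, Latin capital letter A, in UTF-16 (big-endian) and in UTF-8 and prints the resulting bytes.

```java
import java.nio.charset.StandardCharsets;

final class EncodingBytesDemo {
    // Render a byte array as space-separated two-digit hex, e.g. "00 41".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "A"; // U+0041 in the Unicode character set
        // Same character, two encodings, two different byte sequences.
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_8)));    // 41
    }
}
```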
Once you understand the difference between character set and encoding, please refer to Unicode Encodings. The difference among Unicode encodings is very well explained.
Here is a short summary of those encodings. The most commonly used Unicode encodings are UTF-16 and UTF-8. UTF-16 is commonly used for data processing, while UTF-8 is used for communication. UTF-16 represents every character outside the supplementary range as a single 16-bit unit, so processing in UTF-16 is relatively simple and fast; it is not strictly fixed-width, though, because supplementary characters take two units. With UTF-16 you also have to take care of endian differences among platforms, which makes it less suitable for communication. UTF-8 has the opposite characteristics: it is byte-oriented and has no endian issue, but a character takes between one and four bytes.
Given this, please be aware that technology platforms such as Java and .NET run in UTF-16, while communication protocols such as HTTP and SMTP commonly carry UTF-8. The platform provides the conversion between UTF-8 and UTF-16, but it is your responsibility to apply it with the correct encoding at every boundary, so it is important to understand this.
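As a rough illustration of applying the conversion explicitly at a boundary, the sketch below encodes an in-memory Java string to UTF-8 for the wire and decodes it back, always naming the encoding instead of relying on the platform default. The class and method names are hypothetical. It also shows the endian difference between the two UTF-16 byte orders.

```java
import java.nio.charset.StandardCharsets;

final class WireConversionDemo {
    // Encode an in-memory (UTF-16) Java string as UTF-8 bytes for the wire.
    static byte[] toWire(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes received from the wire back into a Java string.
    static String fromWire(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // 4 UTF-16 code units in memory
        byte[] wire = toWire(text);
        System.out.println(text.length()); // 4
        System.out.println(wire.length);   // 5, because U+00E9 takes two bytes in UTF-8
        System.out.println(fromWire(wire).equals(text)); // true

        // The same character has a different byte order in the two UTF-16 flavors.
        byte[] be = "A".getBytes(StandardCharsets.UTF_16BE); // 0x00 0x41
        byte[] le = "A".getBytes(StandardCharsets.UTF_16LE); // 0x41 0x00
        System.out.println(be[0] == le[1] && be[1] == le[0]); // true
    }
}
```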
Supplementary Character Support
The supplementary character issue is well explained in Supplementary Characters in the Java Platform. Please refer to it for details. The short summary is this. In old versions of Unicode, up to 3.0, no characters were defined in the supplementary range, so there was no need to consider this tricky issue. Starting with Unicode 3.1, however, the Unicode character set has characters defined in the supplementary range. Now we have to be very strict about the difference between a code point (one character) and a code unit (one 16-bit value) when we consider the UTF-16 encoding. If we ignore the supplementary range, a code point is identical to a code unit in UTF-16. A supplementary character, however, is one code point represented by two code units, a surrogate pair. Strictly speaking, the length semantics of Java and .NET are based on code units, not characters. So while APIs like String.length and String.substring in Java were assumed to work on characters, that is no longer true. An API like String.substring may split a supplementary character in the middle, just like the classic multi-byte processing issue on older platforms.
Unless you take care of this issue in your code, you cannot claim to support supplementary characters or Unicode version 3.1 and above. From a business perspective, this implies that you may have difficulty doing business with users in Hong Kong and the People's Republic of China (PRC).
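One possible way to handle this in Java is to count and cut by code point rather than by UTF-16 code unit. The sketch below uses the standard String.codePointCount and String.offsetByCodePoints methods; the class and helper names are illustrative only.

```java
final class SupplementarySafeText {
    // Number of Unicode code points (characters), not UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // First n code points of s, never cutting a surrogate pair in half.
    static String leftByCodePoints(String s, int n) {
        int count = Math.min(Math.max(n, 0), codePointLength(s));
        return s.substring(0, s.offsetByCodePoints(0, count));
    }

    public static void main(String[] args) {
        // U+20BB7 is a supplementary CJK ideograph; in UTF-16 it needs a surrogate pair.
        String s = new String(Character.toChars(0x20BB7)) + "A";
        System.out.println(s.length());         // 3 code units
        System.out.println(codePointLength(s)); // 2 code points
        // A naive substring keeps only the high surrogate, which is not a valid character.
        System.out.println(Character.isHighSurrogate(s.substring(0, 1).charAt(0))); // true
        System.out.println(leftByCodePoints(s, 1).length()); // 2, the intact surrogate pair
    }
}
```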
TODO: Need to write summary of GB18030 support. GB18030
- The data handling review makes sure that your products handle non-ASCII characters properly. It is always a good idea to start by reviewing the data flow so you can list the areas that need a closer look.
- You are encouraged to review the data flows in your products unless you are already sure which areas are problematic. (That would mean you have already done the data flow review off the top of your head, which is unlikely.) The main purpose of the data flow review is to list the problematic areas to examine further.
- The following areas are often risky, and you should examine them using the check points discussed in the following sections:
- Boundary of technologies/components
- Large products are normally composed of multiple technologies and components. For example, your products may combine open source libraries and servers such as a J2EE container, Axis, a database like MySQL, and so on. Look for the boundaries between those technologies and components and list them for further review.
- Communication Channel
- In many cases, a system communicates with other systems via HTTP, SMTP, SOAP, or some other protocol. Each of them can have different encoding requirements or a different way of handling locale-aware data.
- Native calls
- If software calls native APIs, it most likely depends on the OS setup unless you can explicitly override it. In that case, the software may have limitations in encoding support and elsewhere, regardless of the capability of your development platform.
- Whenever technology/component boundaries introduced above exist, it is necessary to check the following:
- Supported Encodings
- First of all, check whether all technologies and components offer the same level of encoding support. If they differ, the encodings your software can support are limited to those that every technology and component has in common.
- Encoding Negotiation
- Even if all technologies and components support the Unicode encodings, that does not always mean you can communicate with other systems in Unicode. You must have some way to determine which encoding to use when sending or receiving data through a communication channel. In the case of HTTP, for example, the encoding is determined by the Content-Type HTTP header. You need to check this kind of specification and conform to it. Unfortunately, some protocols do not provide a clear way to negotiate the encoding. (e.g. many EDI specifications do not define one; they simply say the data is always in ISO-8859-1. You are in trouble if your software has to support multiple EDI specifications.) In such cases, you need to figure out how to deal with it.
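For the HTTP case, a minimal sketch of reading the encoding from a Content-Type value might look like the following. It is deliberately simplified and is not a full header parser; the class and method names are hypothetical, and the fallback encoding is something the caller has to choose according to the relevant specification.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class ContentTypeCharset {
    // Extract the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Returns the fallback when the parameter is
    // missing or names an encoding this JVM does not support.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1); // strip optional quotes
            }
            try {
                return Charset.forName(value);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the caller's fallback.
                return fallback;
            }
        }
        return fallback;
    }
}
```

For example, `ContentTypeCharset.charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1)` yields UTF-8, while a header without a charset parameter yields the fallback.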
- If your system runs in UTF-8 and you need to send out something like EDI messages or e-mails in a different encoding, you need to convert the data from UTF-8 to that encoding. This is called transcoding. Be aware that transcoding may cause compatibility problems depending on the data. For example, suppose all your data is in UTF-8 and you have to send notification mails in ISO-8859-1 through SMTP. If you transcode from UTF-8 to ISO-8859-1 without checking the data, you may corrupt or drop characters that ISO-8859-1 does not support, since UTF-8 can represent far more characters than ISO-8859-1. This type of issue needs to be caught in the transcoding process.
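One way to catch this in Java is to ask the target encoding whether it can represent the text before converting. The sketch below relies on the standard CharsetEncoder.canEncode method; the class and method names are illustrative. Note that String.getBytes on its own would silently substitute "?" for unsupported characters, which is exactly the data loss described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CheckedTranscoder {
    // True when every character of text can be represented in the target encoding.
    static boolean isRepresentable(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    // Transcode UTF-8 bytes to the target encoding, refusing to lose data.
    static byte[] transcodeFromUtf8(byte[] utf8, Charset target) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        if (!isRepresentable(text, target)) {
            throw new IllegalArgumentException(
                "Text contains characters not supported by " + target.name());
        }
        return text.getBytes(target);
    }

    public static void main(String[] args) {
        byte[] ok = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        // U+00E9 exists in ISO-8859-1 and takes one byte there.
        System.out.println(transcodeFromUtf8(ok, StandardCharsets.ISO_8859_1).length); // 4

        // U+8868 is a CJK ideograph and has no ISO-8859-1 representation.
        System.out.println(isRepresentable("\u8868", StandardCharsets.ISO_8859_1)); // false
    }
}
```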
- Normalization is a process that transforms data so that it is consistent. If that definition sounds too abstract, here is an easier explanation. In some languages, such as French or Japanese, there are multiple ways to enter data that means the same thing. A good example is 'Á' (Latin Capital Letter A with Acute, \u00C1), which can also be represented as the combination of 'A' (Latin Capital Letter A, \u0041) and the combining acute accent (\u0301). If you are doing text processing, you need to be aware that those two forms are meant to be the same. To support such text processing, you need to normalize the data in a consistent way.
- Unicode defines the following four normalization forms. (Please refer to Unicode Normalization for details.)
- Normalization Form C (a.k.a. NFC) - Canonical Decomposition followed by Canonical Composition
- Normalization Form D (a.k.a. NFD) - Canonical Decomposition
- Normalization Form KC (a.k.a. NFKC) - Compatibility Decomposition followed by Canonical Composition
- Normalization Form KD (a.k.a. NFKD) - Compatibility Decomposition
- These normalization forms are available on Unicode-based platforms such as Java and .NET. It is better to take advantage of the platform's normalization logic than to come up with your own.
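In Java, the platform logic lives in java.text.Normalizer. The following sketch compares the precomposed and decomposed spellings of 'Á' (U+00C1 versus U+0041 followed by the combining acute accent U+0301); the class and helper names are illustrative, and the choice of NFC as the comparison form is an assumption made for the example.

```java
import java.text.Normalizer;

final class NormalizationDemo {
    // Compare two strings after bringing both to the same normalization form.
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C1"; // one code point
        String decomposed = "A\u0301"; // letter plus combining accent

        System.out.println(precomposed.equals(decomposed)); // false
        System.out.println(sameText(precomposed, decomposed)); // true

        // NFD splits the precomposed letter into two code points.
        System.out.println(Normalizer.normalize(precomposed, Normalizer.Form.NFD).length()); // 2

        // The compatibility forms also fold presentation variants,
        // e.g. the "fi" ligature U+FB01 becomes the two letters "fi".
        System.out.println(Normalizer.normalize("\uFB01", Normalizer.Form.NFKC)); // fi
    }
}
```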
Canonical Data Handling
- In some cases, such as XML services and EDI, you need to send locale-aware data such as numbers, dates, date-times, and time zones as text rather than as objects. You must then send that text in a canonical format so that it is locale-independent. For example, XML Schema follows ISO 8601 for its date-time format. When you use the date and time data types in XML Schema, the text should be in that format, like 1999-11-15T08:30:05-08:00.
- Unfortunately, the canonical format for such locale-aware data may differ among technologies and components. You need to be aware of those differences and handle them properly.
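As a sketch of locale-independent text for the wire, the example below uses java.time with the ISO 8601 offset date-time format and BigDecimal.toPlainString for numbers. The class and method names are hypothetical, and the exact canonical format required by a given specification still has to be checked case by case.

```java
import java.math.BigDecimal;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

final class CanonicalFormats {
    // ISO 8601 text with an explicit UTC offset, independent of any locale.
    static String formatDateTime(OffsetDateTime value) {
        return value.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    static OffsetDateTime parseDateTime(String text) {
        return OffsetDateTime.parse(text, DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    // Plain decimal text: '.' as the decimal separator, no grouping symbols.
    static String formatNumber(BigDecimal value) {
        return value.toPlainString();
    }

    public static void main(String[] args) {
        OffsetDateTime t = parseDateTime("1999-11-15T08:30:05-08:00");
        System.out.println(formatDateTime(t));                     // 1999-11-15T08:30:05-08:00
        System.out.println(formatNumber(new BigDecimal("1234.5"))); // 1234.5
    }
}
```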
Escaping Non-ASCII data
- Communication protocols such as HTTP do not allow non-ASCII data in the HTTP header. Instead, they define how to escape non-ASCII data so that you do not have to include it as is. For instance, RFC 2396 (since superseded by RFC 3986) defines percent-escaping for URLs, so a non-ASCII character is sent as an escaped byte sequence such as %E8%A1%A8.
- Other escaping schemes are also available for different purposes.
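The escaped form %E8%A1%A8 is the UTF-8 byte sequence of the character U+8868. A small sketch with the standard URLEncoder and URLDecoder classes reproduces it, assuming Java 10 or later for the overloads that take a Charset. Be aware that URLEncoder implements HTML form encoding (a space becomes "+"), so it is close to, but not identical to, the URL escaping rules of the RFCs; the class and method names below are illustrative.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PercentEscapeDemo {
    // Escape a query value: non-ASCII characters become %XX sequences of their UTF-8 bytes.
    static String escape(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Reverse the escaping, again interpreting the bytes as UTF-8.
    static String unescape(String escaped) {
        return URLDecoder.decode(escaped, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String escaped = escape("\u8868");
        System.out.println(escaped);                            // %E8%A1%A8
        System.out.println(unescape(escaped).equals("\u8868")); // true
    }
}
```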
- Data entry is an important part of the data flow to review. At data entry you probably perform some data validation, and you need to consider the following from an internationalization point of view:
- Encoding validation
- If you support a legacy encoding somewhere in your software, you need to make sure incoming data does not go beyond what that encoding can represent. A good example is a database tier in a legacy encoding behind a middle tier on the Java platform. While the middle tier is based on UTF-16, the database can be configured as ISO-8859-1. (Note: this configuration is often chosen by American and western European users for storage efficiency and data processing performance.) In that case you should ensure that incoming data stays within the ISO-8859-1 repertoire. You might think the database tier will detect violations, but that is not always true. Depending on the JDBC driver implementation, an exception may or may not be raised, and characters that are invalid in the database character set may silently be replaced with a replacement character such as "?". It is also better to inform users at an earlier step.
- Length validation
- Some parts of the technology stack, such as databases, may use byte length semantics, while others, such as Java, count UTF-16 code units. In the example above this is fine, since every character takes exactly one byte in ISO-8859-1 and one code unit in Java. But if the database is configured with a multi-byte encoding such as UTF-8, there is a problem, since the byte length differs from the code unit length: a non-ASCII character consumes 2 to 4 bytes in UTF-8, while it is 1 or 2 code units in UTF-16. So when you validate length at data entry, you need to decide which length semantics to use. If byte length semantics apply, you also need to decide which encoding to measure in, since byte length depends on the encoding. (e.g. one Japanese character consumes 2 bytes in Shift_JIS while it takes 3 bytes in UTF-8.)
- Non-ASCII data rejection
- In some cases you may have to reject non-ASCII data (e.g. IDs, object names). If so, reject non-ASCII characters at data entry. Rejecting them later in the process will frustrate users.
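The three data entry checks above could be sketched in Java as follows. This is only an illustration under the assumption that the storage encoding and the limits are known at validation time; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EntryValidator {
    // Encoding validation: can the storage encoding represent every character?
    static boolean fitsEncoding(String input, Charset storage) {
        return storage.newEncoder().canEncode(input);
    }

    // Length validation with byte semantics, measured in the storage encoding.
    static boolean fitsByteLimit(String input, Charset storage, int maxBytes) {
        return input.getBytes(storage).length <= maxBytes;
    }

    // Length validation with character semantics (code points, not code units).
    static boolean fitsCharLimit(String input, int maxCodePoints) {
        return input.codePointCount(0, input.length()) <= maxCodePoints;
    }

    // Non-ASCII rejection for values such as IDs or object names.
    static boolean isAscii(String input) {
        return input.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E"; // three Japanese characters
        System.out.println(fitsEncoding(japanese, StandardCharsets.ISO_8859_1)); // false
        System.out.println(fitsCharLimit(japanese, 3));                          // true
        // Each of these characters takes 3 bytes in UTF-8, 9 bytes in total.
        System.out.println(fitsByteLimit(japanese, StandardCharsets.UTF_8, 8));  // false
        System.out.println(isAscii("user_01"));                                  // true
    }
}
```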
- Once data handling is internationalized, the next level to reach is providing locale management. Locale management is to provide:
- Locale aware resource switch
- Locale aware data formatting
- Linguistic function
- Localized function
- Please be aware that locale management is very hard to add unless you take it into consideration when architecting your products, whereas other internationalization features may be added afterwards. Adding locale management support later is not a good idea and is very expensive. So even if you do not need internationalization in the early phase of your business, you are still highly encouraged to review your product architecture from this perspective, to avoid a future re-architecture, and to put placeholders in place for dealing with this type of issue later.
- The first step is designing data structures to be multilingual and multi-locale aware. In other words, you need to distinguish translatable data from other data, and your code should not use translatable data for any purpose other than display. Please keep in mind that this data structure design must be done before you start any implementation based on the data structure. Otherwise it will be hard to do, since changing a data structure can introduce massive changes (e.g. changing access APIs, which may require a lot of code changes; migrating the existing data into the new data structure; changing the cache mechanism to be locale-aware). Once you have released your products, you will certainly not want to do this.
- The best-known example is user interface resources. If you hardcode user interface text in your code, you have no way to provide multilingual or multi-locale support for the user interface, and you need a different build per language. Java provides ResourceBundle to resolve this issue: you externalize those texts from your code into a ResourceBundle, which is a data structure or mechanism for multilingual and multi-locale support. Your code then uses only string IDs.
- This idea should be applied not only to user interface resources but also to other locale-aware resources. For instance, if you provide an inventory solution, the inventory should be able to hold item information in multiple languages and/or multiple locales. Item information will be exposed to your users' customers in different countries, so your users will most likely need to translate it or change the available items per territory. (i.e. your users may be fine with one language themselves, but they cannot impose it on their customers, who may be consumers.) If your inventory solution is not designed to handle this situation, your users will have a hard time resolving the issue.
- That is why you need to be careful about data structures at design time.
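To make the inventory example concrete, here is one possible shape for a translatable field, sketched in Java. The LocalizedText class, its fallback order (exact locale, then language, then default), and the sample item names are all assumptions for illustration; a real product would define its own rules.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// A translatable value, e.g. an inventory item name, stored per locale.
final class LocalizedText {
    private final Map<Locale, String> translations = new HashMap<>();
    private final Locale defaultLocale;

    LocalizedText(Locale defaultLocale, String defaultText) {
        this.defaultLocale = defaultLocale;
        translations.put(defaultLocale, defaultText);
    }

    // Add or replace the text for one locale; returns this for chaining.
    LocalizedText with(Locale locale, String text) {
        translations.put(locale, text);
        return this;
    }

    // Look up the exact locale first, then the language alone, then the default.
    String get(Locale locale) {
        String exact = translations.get(locale);
        if (exact != null) {
            return exact;
        }
        String byLanguage = translations.get(Locale.forLanguageTag(locale.getLanguage()));
        if (byLanguage != null) {
            return byLanguage;
        }
        return translations.get(defaultLocale);
    }

    public static void main(String[] args) {
        LocalizedText itemName = new LocalizedText(Locale.ENGLISH, "Bicycle")
                .with(Locale.FRENCH, "V\u00E9lo");
        System.out.println(itemName.get(Locale.CANADA_FRENCH).equals("V\u00E9lo")); // true
        System.out.println(itemName.get(Locale.JAPAN));                             // Bicycle
    }
}
```

Keeping the display text behind a lookup like this means the rest of the code can refer to the item by a stable ID and never depend on the translated string itself.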
- Secondly, you need to determine which attributes are needed and how to exchange locale information with other systems. The following is the list of things to define.
You might think locale information means language and territory only. That is true when you think about the user interface only, but there are other attributes you may need depending on your product. Please also be aware that you need to determine how to store those locale attributes.
- Group/Decimal Symbol (e.g. 1.000,00)
- Measurement System (U.S., Metric)
- Standard digits (Arabic, Hindi, etc.)
- Currency Symbol
- Positive/Negative Format
- Short/Medium/Long Date Format
- Time Format
- Time separator
- AM/PM designator
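Several of these attributes can be inspected through the Java platform's own locale data, which may help when deciding what your locale context has to carry. The sketch below reads the group and decimal symbols and the currency for a locale and formats an amount accordingly; the class and method names are illustrative only.

```java
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

final class LocaleAttributesDemo {
    // Format an amount with exactly two fraction digits using the locale's symbols.
    static String formatAmount(double amount, Locale locale) {
        NumberFormat format = NumberFormat.getNumberInstance(locale);
        format.setMinimumFractionDigits(2);
        format.setMaximumFractionDigits(2);
        return format.format(amount);
    }

    public static void main(String[] args) {
        DecimalFormatSymbols german = DecimalFormatSymbols.getInstance(Locale.GERMANY);
        System.out.println(german.getGroupingSeparator()); // .
        System.out.println(german.getDecimalSeparator());  // ,

        System.out.println(formatAmount(1000, Locale.GERMANY)); // 1.000,00
        System.out.println(formatAmount(1000, Locale.US));      // 1,000.00

        // The currency is a separate attribute derived from the territory.
        System.out.println(Currency.getInstance(Locale.JAPAN).getCurrencyCode()); // JPY
    }
}
```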
Locale Information Access
Once you have defined the attributes to handle, you need to determine how to hold them in your software and how to exchange locale information with other systems.
The first question is probably easy to answer, since you can define your own locale context for your software. The second is difficult, since you most likely have to conform to different specifications such as HTTP, SMTP, XML, various web services, databases, and so on. For instance, if you need to use some of the Oracle database internationalization features, you need to issue the "alter session" command to set the NLS parameters for the database session. That is the only way to let the Oracle database know the locale information, so you have no choice but to follow it.
Unfortunately, you will need to research one by one how each of these systems recognizes locale information.
Locale Information Conversion
Apart from the locale access mechanism, the systems you communicate with may follow different standards for locale information, or may even have their own. For instance, the Oracle database does not understand a language code like "en". Instead, it has its own naming convention for all NLS parameters, and "en" must be mapped to "AMERICAN" to set the NLS session parameter correctly. (Note: the good news is that Oracle provides Java APIs to fill this gap between its own definitions and others. You can find them through Oracle Technology Network.)
This kind of gap may be observed in other areas too, and again you need to research them one by one.
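A hand-rolled conversion table is one simple way to bridge such a gap when no vendor API is at hand. The sketch below is hypothetical throughout: the class name, the table contents beyond the "en" to "AMERICAN" pair stated above, and the generated statement text are assumptions that must be verified against the Oracle documentation before use.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class NlsLanguageMapper {
    // ISO language code -> Oracle NLS language name.
    private static final Map<String, String> NLS_LANGUAGE = new HashMap<>();
    static {
        NLS_LANGUAGE.put("en", "AMERICAN"); // the mapping given in the text above
        NLS_LANGUAGE.put("ja", "JAPANESE"); // assumed entry, verify before use
        NLS_LANGUAGE.put("fr", "FRENCH");   // assumed entry, verify before use
    }

    // Empty when the locale's language has no known NLS counterpart.
    static Optional<String> nlsLanguageFor(Locale locale) {
        return Optional.ofNullable(NLS_LANGUAGE.get(locale.getLanguage()));
    }

    // Build the session statement; the value comes only from the fixed table above.
    static Optional<String> alterSessionFor(Locale locale) {
        return nlsLanguageFor(locale)
                .map(name -> "ALTER SESSION SET NLS_LANGUAGE = '" + name + "'");
    }

    public static void main(String[] args) {
        System.out.println(nlsLanguageFor(Locale.US).orElse("(unmapped)"));               // AMERICAN
        System.out.println(nlsLanguageFor(new Locale.Builder().setLanguage("xx").build())
                .isPresent());                                                           // false
    }
}
```

Returning an empty Optional for unmapped locales forces the caller to decide on a fallback explicitly instead of silently sending an unknown name to the database.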
Locale Aware Resource Switch
Locale Aware Formatting
- Phone number
- Spell check
- Text search
Scripting (should this be moved under data handling?)