Multilingual Data Structure

Overview

Background

Multilingual support is increasingly important for user interfaces, configuration data, and some transactional data. Technology stacks such as Java and JSF provide facilities for a multilingual user interface, but multilingual configuration and transactional data require application developers to design the application's data structures with multiple languages in mind. This note discusses several multilingual data structures and their advantages and disadvantages. Sample implementations with XML and RDBMS are also discussed.

Consideration

When evaluating a multilingual data structure, the following points should be considered.

  • Language data isolation
Multilingual data has to be tied to its context by some language-neutral data, such as an ID. A multilingual data structure should cleanly isolate language data (translations and localized values) from that context. Otherwise, maintaining the language data becomes difficult later on. This should be considered at the very beginning of the design phase.
  • Fallback mechanism
In many cases, language data can be sparse, either permanently or temporarily. Therefore, it is necessary to decide how to fall back when language data for a particular context does not exist.
  • Maintainability
Multilingual data may be released as part of the application or created by application users. In either case, the maintainability of the language data has to be considered. For example, if multilingual data is released as part of the application and the language data is not well isolated from its context, the language data has to be re-released whenever the context is updated, even if the language data itself has not changed. This is a nightmare for both the localization group and users: the localization group has to produce a language data patch every time, and users have to apply one patch for each language installed in their environment.
  • Storage Efficiency
Multilingual support inevitably requires some extra storage, since a context key is stored for each language. This overhead should be minimized, especially because multilingual data can grow over time.
  • Performance
Multilingual support always requires some extra work compared to single-language support. It can therefore hurt performance unless both updating and querying language data are designed carefully. In many cases, update performance and query performance trade off against each other, so it is important to evaluate case by case which matters more for the application.

Data Structures

Base - Language

This approach has two schemas, Base and Language. Base includes both language-neutral data and language data for the base language, which is the language used for fallback. Language includes context keys and language data for every language other than the base language, i.e. translations. A View for a specific language is then derived from Base and Language.

Base - language-neutral data + language data for the base language (English in the sample below)

<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
    <displayname xml:lang="en">BlackBerry 8700c</displayname>
    <description xml:lang="en">BlackBerry 8700c (Refurb)</description>
  </product>
  <product sku="101">
    <name>HP_iPAQ_hw6515</name>
    <stock>25</stock>
    <displayname xml:lang="en">HP iPAQ hw6515</displayname>
    <description xml:lang="en">HP iPAQ hw6515 (Refurb)</description>
  </product>
</inventory>

Language - context keys and language data for languages other than the base language (here sku is the key, and displayname and description are language data)

<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <product sku="100">
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
  </product>
  <product sku="101">
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>
  </product>
</inventory>

View - derived from Base and Language (the following is the Japanese view)

<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
  </product>
  <product sku="101">
    <name>HP_iPAQ_hw6515</name>
    <stock>25</stock>
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>
  </product>
</inventory>
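
The derivation of a View from Base and Language can be sketched in a few lines. The following is only an illustration, not a reference implementation: it uses Python's standard xml.etree.ElementTree, the function name derive_view is hypothetical, and the Language sample is deliberately made sparse (sku 101 has no Japanese description) to show the fallback to the base language.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

BASE = """
<inventory>
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
    <displayname xml:lang="en">BlackBerry 8700c</displayname>
    <description xml:lang="en">BlackBerry 8700c (Refurb)</description>
  </product>
  <product sku="101">
    <name>HP_iPAQ_hw6515</name>
    <stock>25</stock>
    <displayname xml:lang="en">HP iPAQ hw6515</displayname>
    <description xml:lang="en">HP iPAQ hw6515 (Refurb)</description>
  </product>
</inventory>
"""

# Assumption for illustration: sku 101 has no Japanese description yet.
LANGUAGE_JA = """
<inventory>
  <product sku="100">
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
  </product>
  <product sku="101">
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
  </product>
</inventory>
"""


def derive_view(base_xml, language_xml):
    """Overlay translations on Base; untranslated elements keep the base language."""
    view = ET.fromstring(base_xml)
    language = ET.fromstring(language_xml)
    # Index translated products by their context key (sku).
    translations = {p.get("sku"): p for p in language.findall("product")}
    for product in view.findall("product"):
        translated = translations.get(product.get("sku"))
        if translated is None:
            continue  # no translation at all: the whole product falls back
        for element in product:
            if XML_LANG not in element.attrib:
                continue  # language-neutral data is left untouched
            match = translated.find(element.tag)
            if match is not None:
                element.text = match.text
                element.set(XML_LANG, match.get(XML_LANG, element.get(XML_LANG)))
    return view


view = derive_view(BASE, LANGUAGE_JA)
print(ET.tostring(view, encoding="unicode"))
```

In the resulting view, the description of sku 101 stays in English with xml:lang="en", so a consumer can still tell which values are fallbacks.
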
  • Language data isolation
Language data is thoroughly isolated from language-neutral data.
  • Fallback mechanism
Since the base language data is part of Base and can be expected to always be populated, the simplest fallback is to fall back to the base language when data for the requested language does not exist. If a more sophisticated fallback is required, the fallback should first be performed within the Language schema, e.g. checking ja-JP -> ja, and then fall back to the base language data as needed.
  • Maintainability
Since the Language schema is thoroughly isolated from language-neutral data, maintainability is good with this approach. Also, since the base language data serves as the final fallback point, the language data can be sparse. In terms of release maintainability, this approach is therefore the best of those discussed here.
  • Storage Efficiency
Unless the View above has to be stored for some reason, this approach does not require any redundant data except context keys.
  • Performance
Since this approach tolerates sparse language data, update performance should be good. However, query performance can be poor if a sophisticated fallback has to search the language data first.
  • Summary
This approach should work well for most multilingual scenarios. Its weaknesses are:
  • It is difficult to switch the base language once the language data has become sparse
  • Query performance can be poor if the data is large, the language data is sparse, and the fallback takes time
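
The ja-JP -> ja -> base language fallback described above can be sketched as a simple lookup chain. This is a minimal illustration under assumptions: the helper names fallback_chain and resolve are hypothetical, language tags are assumed to be already normalized (hyphen-separated, consistent case), and the translations for one context key are assumed to be held in a plain dictionary keyed by language tag.

```python
def fallback_chain(locale, base_language="en"):
    """Return candidate language tags, most specific first, ending with the base language."""
    parts = locale.split("-")
    # "ja-JP" -> ["ja-JP", "ja"]: drop one trailing subtag at a time.
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if base_language not in chain:
        chain.append(base_language)
    return chain


def resolve(values, locale, base_language="en"):
    """Pick the first available value along the fallback chain, or None."""
    for tag in fallback_chain(locale, base_language):
        if tag in values:
            return values[tag]
    return None


# Hypothetical language data for the displayname of sku 100.
displayname = {
    "en": "BlackBerry 8700c",
    "ja": "BlackBerry 8700c - Japanese",
}

print(fallback_chain("ja-JP"))
print(resolve(displayname, "ja-JP"))
print(resolve(displayname, "fr-FR"))
```

Each extra step in the chain is one more lookup per query, which is where the query cost of a sophisticated fallback comes from.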

Core - Language

This approach has two schemas, Core and Language. Core includes only language-neutral data. Language includes context keys and language data. A View for a specific language is then derived from Core and Language.

Core - language-neutral data only

<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
  </product>
  <product sku="101">
    <name>HP_iPAQ_hw6515</name>
    <stock>25</stock>
  </product>
</inventory>

Language - context keys and all language data (here sku is the key, and displayname and description are language data)

English Data

<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <product sku="100">
    <displayname xml:lang="en">BlackBerry 8700c</displayname>
    <description xml:lang="en">BlackBerry 8700c (Refurb)</description>
  </product>
  <product sku="101">
    <displayname xml:lang="en">HP iPAQ hw6515</displayname>
    <description xml:lang="en">HP iPAQ hw6515 (Refurb)</description>
  </product>
</inventory>

Japanese Data

<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <product sku="100">
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
  </product>
  <product sku="101">
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>
  </product>
</inventory>

View - derived from Core and Language (the following is the Japanese view)

<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
  </product>
  <product sku="101">
    <name>HP_iPAQ_hw6515</name>
    <stock>25</stock>
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>
  </product>
</inventory>
  • Language data isolation
Language data is thoroughly isolated from language-neutral data.
  • Fallback mechanism
Since there is no base language data, the fallback has to be sophisticated enough to deal with sparse language data. One of the following is necessary:
  • Fully populate one particular language so that it can act as the base language in the fallback
  • Temporarily populate every language with an initial entry to avoid sparse language data
  • Fall back to some language-neutral data if language data does not exist (note: this is not an option for data shown in the UI)
  • Maintainability
Since the Language schema is thoroughly isolated from language-neutral data, maintainability is good with this approach. However, if the implementation cannot tolerate sparse language data, a language data patch has to be released whenever new language data is added, unlike the Base - Language model above.
  • Storage Efficiency
Unless the View above has to be stored for some reason, this approach does not require any redundant data except context keys. However, if the implementation requires extra data to cope with sparse language data, it is less efficient than the Base - Language model.
  • Performance
Performance depends on how sparse language data is handled. If it is resolved by fully populating one language to serve as the base language, update performance will be good, but query performance will be the same as in the Base - Language approach. If it is instead resolved by populating all languages, update performance will be poor, but query performance will be better than in the Base - Language model since queries do not require any fallback.
  • Summary
This approach should work well as long as sparse language data does not have to be considered. In other words, if the language data is fairly static and does not change much, as is the case for UI text, this approach would work better than the Base - Language model. It is still necessary to consider the temporary sparseness of language data that arises during patching.
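
One of the options above, populating every language with an initial entry, can be sketched as follows. This is an illustration only: the function name fill_missing is hypothetical, and the Language data is assumed to be held in nested dictionaries (language tag -> sku -> fields) rather than XML to keep the sketch short.

```python
def fill_missing(core_skus, language_data, seed_language):
    """Copy seed-language entries into every language that lacks them.

    Returns the number of entries written, which grows with the number of
    languages: this is the update cost paid to avoid fallback at query time.
    """
    seed = language_data[seed_language]
    written = 0
    for lang, entries in language_data.items():
        if lang == seed_language:
            continue
        for sku in core_skus:
            if sku not in entries and sku in seed:
                entries[sku] = dict(seed[sku])  # copy, so later edits stay per language
                written += 1
    return written


# Hypothetical data: sku 101 has just been added and exists only in English.
core_skus = ["100", "101"]
language_data = {
    "en": {
        "100": {"displayname": "BlackBerry 8700c"},
        "101": {"displayname": "HP iPAQ hw6515"},
    },
    "ja": {
        "100": {"displayname": "BlackBerry 8700c - Japanese"},
    },
}

written = fill_missing(core_skus, language_data, seed_language="en")
print(written, language_data["ja"]["101"])
```

After the fill, a query for any language is a direct lookup with no fallback, but the copied entries are indistinguishable from real translations unless they are marked separately.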

One for all language

This approach has a single schema, and each piece of language data is tagged with its language.

Data - language-neutral data + all language data, tagged with language information

<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
    <displayname xml:lang="en">BlackBerry 8700c</displayname>
    <description xml:lang="en">BlackBerry 8700c (Refurb)</description>
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
  </product>
  <product sku="101">
    <name>HP_iPAQ_hw6515</name>
    <stock>25</stock>
    <displayname xml:lang="en">HP iPAQ hw6515</displayname>
    <description xml:lang="en">HP iPAQ hw6515 (Refurb)</description>
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>
  </product>
</inventory>

View - derived from Data by selecting a single language (the following is the Japanese view)

<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
  </product>
  <product sku="101">
    <name>HP_iPAQ_hw6515</name>
    <stock>25</stock>
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>
  </product>
</inventory>
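
With a single schema, deriving a View amounts to filtering on xml:lang. The sketch below is one possible way to do this with Python's standard xml.etree.ElementTree; the function name view_for is hypothetical, the sample is trimmed to one product, and no fallback is attempted, so an element with no entry in the requested language simply disappears from the view.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DATA = """
<inventory>
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
    <displayname xml:lang="en">BlackBerry 8700c</displayname>
    <description xml:lang="en">BlackBerry 8700c (Refurb)</description>
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
  </product>
</inventory>
"""


def view_for(data_xml, lang):
    """Keep language-neutral elements and elements tagged with the requested language."""
    root = ET.fromstring(data_xml)
    for product in root.findall("product"):
        # Iterate over a copy of the child list because children are removed.
        for element in list(product):
            tagged = element.get(XML_LANG)
            if tagged is not None and tagged != lang:
                product.remove(element)
    return root


view = view_for(DATA, "ja")
print(ET.tostring(view, encoding="unicode"))
```
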
  • Language data isolation
Language data isolation is poor since a single file includes both language-neutral data and language data.
  • Fallback mechanism
Same as the Core - Language model.
  • Maintainability
Since language data is not isolated from language-neutral data, maintainability is poor with this approach. Unless there is some mechanism to cope with sparse language data, maintainability gets even worse.
  • Storage Efficiency
Unless the View above has to be stored for some reason, this approach does not require any redundant data.
  • Performance
Same as the Core - Language model.
  • Summary
This model works well only if the multilingual data is static and not updated frequently (e.g. locale-specific seed data). Using this model for frequently updated data is strongly discouraged.

Full per language

This approach should be considered a temporary solution for multilingual support. The idea is simply to duplicate the full data set for each language.

Data - language-neutral data + language data, one full copy per language

English Data

<?xml version="1.0" encoding="UTF-8" ?>
<inventory xml:lang="en">
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
    <displayname>BlackBerry 8700c</displayname>
    <description>BlackBerry 8700c (Refurb)</description>
  </product>
  <product sku="101">
    <name>HP_iPAQ_hw6515</name>
    <stock>25</stock>
    <displayname>HP iPAQ hw6515</displayname>
    <description>HP iPAQ hw6515 (Refurb)</description>
  </product>
</inventory>

Japanese Data

<?xml version="1.0" encoding="UTF-8" ?>
<inventory xml:lang="ja">
  <product sku="100">
    <name>BlackBerry_8700c</name>
    <stock>10</stock>
    <displayname>BlackBerry 8700c - Japanese</displayname>
    <description>BlackBerry 8700c (Refurb) - Japanese</description>
  </product>
  <product sku="101">
    <name>HP_iPAQ_hw6515</name>
    <stock>25</stock>
    <displayname>HP iPAQ hw6515 - Japanese</displayname>
    <description>HP iPAQ hw6515 (Refurb) - Japanese</description>
  </product>
</inventory>
  • Language data isolation
Language data isolation is poor since each file includes both language-neutral data and language data.
  • Fallback mechanism
Similar to the Core - Language model, but the fallback should happen at the schema level (the whole file) rather than for each element.
  • Maintainability
Since language data is not isolated from language-neutral data, maintainability is poor with this approach. Unless there is some mechanism to cope with sparse language data, maintainability gets even worse.
  • Storage Efficiency
Since it basically duplicates all data for each language, storage efficiency is poor.
  • Performance
Similar to the Core - Language model. However, since the fallback happens at the schema level, query performance should be better than in the other approaches when language data is sparse.
  • Summary
This model works well only if the multilingual data is static, is not updated frequently, and is small. The advantage of this approach is that its implementation is the simplest of all the approaches. However, using this model for frequently updated data is strongly discouraged since its maintainability is poor.
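
Schema-level fallback means the language is resolved once, for the whole file, instead of once per element. A minimal sketch, assuming the per-language files have already been loaded into a dictionary keyed by language tag (the function name select_document and the placeholder file contents are hypothetical):

```python
def select_document(documents, locale, default="en"):
    """Choose one whole per-language document: exact tag, then primary language, then default."""
    for tag in (locale, locale.split("-")[0], default):
        if tag in documents:
            return tag, documents[tag]
    raise KeyError(f"no document for {locale!r} and no default {default!r}")


# Hypothetical, heavily shortened stand-ins for the English and Japanese files.
documents = {
    "en": '<inventory xml:lang="en">...</inventory>',
    "ja": '<inventory xml:lang="ja">...</inventory>',
}

print(select_document(documents, "ja-JP")[0])
print(select_document(documents, "fr-FR")[0])
```

Once a document is selected, every element is read from it directly, which is why queries avoid per-element fallback in this model.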

Multilingual Data Structure with Relational Database

TODO, sample implementation should be discussed