5. Multilingual Features#

This chapters describe the multilingual structure of Altibase, as well as environment settings and other points to consider when using Altibase in a multilingual environment.

Overview#

Multilingual support means that the DBMS is capable of storing and processing character sets used in different countries. In other words, the DBMS enables processing for clients using different languages and characters such as Korea, Chinese, Japanese, and others.

Character-Set#

A character set is a particular group of characters that are associated with respective numeric values. The following table shows how an individual character is associated with a different numeric value depending on whether it is encoded using the UTF-8, UTF-16 BE or UTF-16 LE character set.

Character	UTF-8	UTF-16 BE	UTF-16 LE
A	41	00 41	41 00
Ő	C3 B6	00 F6	F6 00

NLS(National Language Support)#

This allows the database to be used in a particular language environment. If NLS is appropriately set, the user can read and write data to and from the database using the character set specified by the user's application.

Multilingual Support Structure#

Multi-language support consists of performing conversions between the character sets used by the database and the client application, respectively

The server-client relationship for multi-language support is divided into the following four categories:

The database and the client use the same character set.
The database and the client use different character sets.
The database and multiple clients use different character sets.
Unicode data type support

The database and the client use the same character set.#

This indicates that the character set between the database and the client is the same.

Image title — [Figure 5‑1] A Database and a Client with the same Character Set

As shown in Figure 5-1, if both the database and the client use KSC5601, character set does not occur.

The database and the client use different character sets.#

If the character set used by the database is different from the character set used by the client, the character set conversation occurs. This can sometimes lead to data loss, as shown in Figure 5-2.

To prevent data loss caused by character set conversion, it is recommended that the character set used on the database be a superset of the character set used by the client.

Thus, to prevent data loss when character conversion is performed as seen in the figure above, the character set used by the database should be MS949 or UTF8, which is a superset of MS949.

The database and multiple clients use different character sets.#

If multiple clients have different character sets, the server must specify a character set that includes all of the clients character sets to avoid loss of character conversions.

The above figure shows a system configuration in which each client connected to one database uses Japanese, Chinese, and Korean, etc. To prevent the loss of character conversion between a server and multiple clients using each language, the database character set should be set to UTF8, which supports the languages used by all of these clients.

Unicode Data Type Support#

Regardless of which character set the database and client are set to, it is possible to support multiple languages using NCHAR or NVARCHAR data types that support Unicode data.

Character Set Classification for Multilingual Support.#

Database Character Set.#

The "database character set" is the character set with which data are saved in the database.

Because the SQL standard is an ASCII character set, the character set that encompasses (completely includes) the ASCII character set can be used as a database character set. However, UTF16 is excluded from the database character set because it does not encompass an ASCII character.

How to Specify the Database Character Set#

When a database is created, the database character set can be specified using the CREATE DATABASE statement.

Supported Database Character Sets#

Altibase supports the use of the following database character sets, all of which support ASCII:

US7ASCII
KO16KSC5601
MS949
BIG5
GB231280
MS936 (Identical to the GBK, ZHS16GBK, CP936 character sets of other vendors)
UTF8
SHIFTJIS
MS932 (Identical to the CP932 character sets of other vendors)
EUCJP

National Character Set#

The national character set is used to store NCHAR and NVARCHAR data types, and can be used to store text in Unicode.

How to Specify the National Character Set#

When a database is created, the national character set of the database is specified using the CREATE DATABASE statement.

Supported National Character Set#

Altibase supports the following two national character sets:

UTF8
UTF16 (Big Endian)

Client Character Set#

The client character set is the character set used to display data to the client.

Data sent from the server are converted to, and displayed in, the character set specified by respective clients.

How to Specify the Client Character Set.#

The client character set can be specified using ALTIBASE_NLS_USE on the client.

Supported Client Character Sets#

US7ASCII (Default)
KO16KSC5601
MS949
BIG5
GB231280
MS936
UTF8
UTF16 (Big Endian)
SHIFTJIS
MS932
EUCJP

Using Unicode in a Multilingual Database#

Unicode is an internationally encoded character set that can store information in any language in a single character set. Unicode also has a unique value for all characters, regardless of platform or programming language.

Therefore, Unicode is a code that can be useful when it is desired to store languages of various countries at the same time.

Unicode Encoding#

Unicode encoding is a method of mapping Unicode to bytes for storage on a computer.

Altibase uses an encoding scheme such as UTF-8 or UTF-16 to represent a code scheme or character set.

Storing Unicode Characters#

Unicode characters can be stored in a database in two ways:

When the database is created, it can be designated as one in which character data are stored as Unicode data.
NCHAR or NVARCHAR columns can be used to store Unicode characters.

Please note that, if the database character set is UTF8 and the national character set is UTF16, there are two different ways to store Unicode characters in the same database.

Unicode Database#

When creating a database, the Unicode data can be stored in CHAR and VARCHAR columns by creating a database that supports Unicode by setting the database character set to UTF8.

Supported Character Set#

UTF8

When is a Unicode database neeeded?#

When SQL statements or sotred procedures include Unicode data.
When it is unsure whether multilingual data will be inserted into the database, or what column they will be inserted into.

Unicode Datatypes#

Even if a character set other than UTF8 was specified at the time a database was created, it is still possible to store Unicode characters using the NCHAR or NVARCHAR data type.

Supported Character Sets#

UTF8
UTF16

When are Unicode Datatypes Needed?#

When columns for storing multilingual data are needed in a non-Unicode database.
When most of the data is in the same language, but there are columns where some of the data to be saved in some other languages(s).

Setting Environment for a Multilingual Database#

In order to establish a database that supports multi-languages, settings must be made as follow:

When creating a database, consider which character set is the most widely used by clients, and specify that character set for the server.
Set NLS appropriately for the client character set.
Set other environment variables and properties

Setting Environment Variables#

Set the following environment variables on the clients:

ALTIBASE_NLS_USE
ALTIBASE_NLS_NCHAR_LITERAL_REPLACE

ALTIBASE_NLS_USE#

Any of the following character sets may be used on the clients. Data sent from the server are converted to, and displayed in, the character set specified by each of the clients.

US7ASCII (Default)
KO16KSC5601
MS949
BIG5
GB231280
MS936
UTF8
SHIFTJIS
MS932
EUCJP

ALTIBASE_NLS_NCHAR_LITERAL_REPLACE#

If this is set to 1(TRUE), the client does not convert strings that are preceded by the "N" character to the database character set. Rather, it sends them to the server without change, and the server converts them to the national character set. The default is 0 (FALSE).

Queries used by client applications is generally converted to the database character set and then sent to the server. Under this scheme, for a database that uses the US7ASCII character set, data that fall out of the range of the US7ASCII character set cannot be inserted into that database, even if an NCHAR column is created for that purpose.

For example, if the client character set is KO16KSC5601 and the database character set is US7ASCII, data are converted from the client character set to the database character set when an INSERT statement is executed. In this case, as can be seen in the following example, because it can't be converted to US7ASCII, the replacement character '?' is stored in the table.

iSQL> create table t1 ( i1 nvarchar(10) );
Create success.

iSQL> insert into t1 values ( '안' );
1 row inserted.

iSQL> select * from t1;
I1
--------------------
?

Therefore, a method of saving data that does not fall within the range of the database character set in an NCHAR column is needed. In one such method, seen below, an environment variable setting is made and data are inserted using the NCHAR literal:

$  export ALTIBASE_NLS_NCHAR_LITERAL_REPLACE=1
...
iSQL> create table t1 ( i1 nvarchar(10) );
Create success.

iSQL> insert into t1 values ( N'안' );
1 row inserted.

iSQL> select * from t1;
I1
--------------------
안

As seen above, If ALTIBASE_NLS_NCHAR_LITERAL_REPLACE is set to 1(TRUE) and data are inserted, the client does not convert strings that are preceded by the "N" character to the database character set. Instead, these strings are sent to the server without change, where they are converted to the national character set.

Example#

The following describes the process of setting an environment that uses the default character set as KSC5601 and the national character set as UTF16.

Database Creation#

iSQL(sysdba)> CREATE DATABASE mydb INITSIZE=10M NOARCHIVELOG CHARACTER SET KSC5601 NATIONAL CHARACTER SET UTF16;

DB Info (Page Size     = 32768) 
        (Page Count    = 257) 
        (Total DB Size = 8421376) 
        (DB File Size  = 1073741824) 
    Creating MMDB FILES     [SUCCESS] 
    Creating Catalog Tables [SUCCESS] 
    Creating DRDB FILES     [SUCCESS] 
  [SM] Rebuilding Indices [Total Count:0]  [SUCCESS] 
DB Writing Completed. All Done. 
Create success.

Making Environment Settings on the Client#

To use KSC5601 on the client, set the environment variable as follows:

$ export ALTIBASE_NLS_USE=KSC5601

To use ASCII on the client, set the environment variable as follows:

$ export ALTIBASE_NLS_USE=ASCII

Setting Other Environment Variables and Properties#

Set the following environment variable and property appropriately for the usage environment:

Environment Variable

ALTIBASE_NLS_NCHAR_LITERAL_REPLACE
Property

NLS_COMP or NLS_NCHAR_CONV_EXCP

Consideration When Choosing a Database Character Set#

When choosing a database character set, it should be selected in consideration of loss, conversion cost, and an identifier that may occur in character conversion.

Scope of Usage#

Identifiers#

Column names, schema objects and comments are saved in the database using the database character set, however, other identifiers can only be stored using the US7ASCII character set.

The following tables show which character sets can be used for each kind of identifier.

Identifier Name	Available Character Set
Column Name	Database Character Set
Schema Object	Database Character Set
Annotation	Database Character Set
Database Link Name	Database Character Set
Database Name	US7ASCII
File Name (Such as Data and Log Files)	US7ASCII
Directory Name	US7ASCII
Keyword	US7ASCII
Tablespace Name	US7ASCII

[Table 5‑1] Character Sets that Can Be Used for Each Identifier

Stored SQL Statement#

SQL statements such as stored procedures and trigger statements are stored in a meta table as database character sets.

Constraint#

Replication cannot be performed between two databases that use different character sets.

Effects of Character Set Conversion.#

If the database character set is different from the client character set, a conversion occurs. Such character conversion can potentially cause data loss as well as affect performance degradation.

Data Loss#

If the conversion from a character set that has a large range to a small character set occurs, data loss can occur.

Any characters that cannot be represented using the destination character set will be converted to a replacement character. In US7ASCII, the replacement character is the question mark ('?').

Conversion Overhead#

If all clients use the same character set, and the same character set is specified when a database is created, no character conversion will occur.

However, if different character sets are in use on each client, and the database character set is a superset of the character sets used by the clients, character conversion will occur.