Oracle8i National Language Support Guide
Release 8.1.5

A67789-01

D
Glossary

ASCII

American Standard Code for Information Interchange. A common encoded 7-bit character set for English. ASCII includes the letters A-Z and a-z, as well as digits, punctuation symbols, and control characters. The Oracle character set name for this is US7ASCII.

Binary Sorting

Sorting of strings based on their binary coded value representations.

Case Conversion

Case conversion refers to changing a character from its uppercase to lowercase form, or vice versa.

Character

An independent unit used to represent data, such as a letter, a letter with a diacritical mark, a digit, ideograph, punctuation, or symbol.

Character Classification

Character classification information provides details about the type of character associated with each legal character code; that is, whether it is an alphabetic, uppercase, lowercase, punctuation, control, or space character, etc.
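
As an illustration only, the kind of lookup described above can be sketched with Python's built-in Unicode database. This is not Oracle's NLSDATA, and the `classify` helper is a hypothetical name introduced for this example.

```python
import unicodedata

def classify(ch: str) -> dict:
    """Return a few classification flags for a single character."""
    category = unicodedata.category(ch)  # two-letter Unicode general category
    return {
        "alphabetic": ch.isalpha(),
        "uppercase": ch.isupper(),
        "lowercase": ch.islower(),
        "digit": ch.isdigit(),
        "space": ch.isspace(),
        "control": category == "Cc",
        "punctuation": category.startswith("P"),
    }

for ch in ["A", "a", "7", " ", "!", "\n"]:
    flags = [name for name, on in classify(ch).items() if on]
    print(repr(ch), flags)
```

A single character can carry several flags at once; the newline, for example, is reported as both a space and a control character.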

Character Encoding Scheme

The type of mapping used in defining an encoded character set. Oracle supports many character set encodings, including single-byte, multibyte, shift-sensitive multibyte, and fixed-width character set encodings.

Character Set Conversion

Conversion from one encoded character set to another.
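
A minimal sketch of such a conversion, using Python's standard codecs rather than Oracle's own conversion routines: the bytes are decoded from the source character set and re-encoded in the target one. The `convert` helper is a hypothetical name, and the codec names are Python's, not Oracle character set names.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from the source character set into the target one."""
    text = data.decode(source)   # bytes -> abstract characters
    return text.encode(target)   # abstract characters -> bytes

# ISO 8859-1 (Latin-1) to UTF-8: the single byte 0xE9 becomes two bytes.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = convert(latin1_bytes, "iso-8859-1", "utf-8")
print(latin1_bytes, utf8_bytes)

# ASCII to an EBCDIC code page (cp037 is one member of the EBCDIC family).
ebcdic_bytes = convert(b"ABC", "ascii", "cp037")
print(ebcdic_bytes)
```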

Client Character Set

The encoded character set which the client uses. A client character set can differ from the database server character set, in which case, character set conversion must occur.

Collation

Ordering of all character strings from an alphabet into a linear sequence. Collation may be based on a linguistic sort order or a binary sort order.

Combining Character

A character that graphically combines with a preceding base character. These characters are not used in isolation. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.

Composite Character

A single character which can be represented by a composite character sequence. This type of character is found in the scripts of Thai, Lao, Vietnamese, and Korean Hangul, as well as many Latin characters used in European languages.

Composite Character Sequence

A character sequence consisting of a base character followed by one or more combining characters. This is also referred to as a combining character sequence.
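
The relationship between a composite character and its composite character sequence can be illustrated with Unicode normalization in Python's standard library. This is a sketch of the concept, not a description of how Oracle handles such sequences.

```python
import unicodedata

precomposed = "\u00e9"  # é stored as a single code point

# Decompose into a base character followed by a combining character.
sequence = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(c):04X}" for c in sequence])

# A non-zero combining class marks the second code point as a combining character.
print(unicodedata.combining(sequence[0]), unicodedata.combining(sequence[1]))

# Recomposing the sequence yields the original single character again.
recomposed = unicodedata.normalize("NFC", sequence)
print(recomposed == precomposed)
```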

Database Character Set

The encoded character set used to store character data in the database.

Diacritical Mark

A mark added to a letter that usually provides information about pronunciation or stress.

EBCDIC

Extended Binary Coded Decimal Interchange Code. EBCDIC is a family of encoded character sets used mostly on IBM systems.

Encoded Character Set

A character set together with a set of unambiguous rules that establishes a one-to-one relationship between each character of the set and its bit representation.

Encoding Scheme

See "Character Encoding Scheme".

EUC

Extended UNIX Codes. A common encoding method used on Asian UNIX systems. It combines up to four different encoded character sets in a single data stream.

Euro

The new monetary currency used by participating member states of the European Union.

Export

To write data to files for the purpose of archiving, or moving data between operating systems or Oracle databases.

Font

An ordered collection of character glyphs which provides a graphical representation of characters within a character set.

Glyph

The graphic representation of a character on a display device or paper. For example, the letter H set in three different typefaces produces three different glyphs, but they all represent the same character.

Ideograph

A symbol representing an idea. Chinese is an example of an ideographic system.

Import

To read data from files produced by Export and load it into an Oracle database.

Internationalization

The process of making software flexible enough to be used in many different linguistic and cultural environments. Internationalization should not be confused with localization, which is the process of preparing software for use in one specific locale.

ISO

International Organization for Standardization.

ISO/IEC 10646

A universal character set standard defining the characters of most major scripts used in the modern world. In 1993, ISO adopted Unicode version 1.1 as ISO/IEC 10646-1:1993. ISO/IEC 10646 has two formats: UCS2 is a 2-byte fixed-width format and UCS4 is a 4-byte fixed-width format. There are three levels of implementation, all relating to support for composite characters. Level 1 requires no composite character support, level 2 requires support for specific scripts (including most of the Unicode scripts such as Arabic, Thai, etc.), and level 3 requires unrestricted support for composite characters in all languages.

ISO Currency

The 3-letter abbreviation used to denote a local currency, which is based on the ISO 4217 standard. For example, "USD" represents the United States Dollar.

ISO 8859

A family of 8-bit encoded character sets. The most common one is ISO 8859-1 (also known as Latin-1), which is used for Western European languages.

Latin-1

Formally known as the ISO 8859-1 character set standard. An 8-bit extension to ASCII which adds 128 characters covering the most common Latin characters used in Western Europe. The Oracle character set name for this is WE8ISO8859P1. See also "ISO 8859".

Linguistic Index

An index built on a linguistic collation order.

Linguistic Sorting

Sorting of strings according to the requirements of a locale rather than the binary representation of the strings.
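
The difference between binary and linguistic sorting can be sketched as follows. The binary order compares encoded byte values; the "linguistic" key used here (strip diacritics, ignore case) is a deliberately crude stand-in, assumed only for illustration, and is not one of Oracle's linguistic sort definitions. The `linguistic_key` name is hypothetical.

```python
import unicodedata

words = ["zebra", "Apple", "éclair", "banana"]

# Binary sort: order by the encoded byte values (UTF-8 here).
binary_order = sorted(words, key=lambda s: s.encode("utf-8"))

def linguistic_key(s: str) -> str:
    """Crude collation key: drop combining marks, then fold case."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

linguistic_order = sorted(words, key=linguistic_key)

print(binary_order)      # uppercase and accented letters land in unexpected places
print(linguistic_order)  # closer to dictionary order
```

In the binary result "éclair" sorts after "zebra" because its first byte is larger than that of any ASCII letter, which is the behavior linguistic sorting is meant to avoid.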

Local Currency

The currency symbol used in a country or region. For example, "$" represents the United States Dollar.

Locale

A collection of information regarding the linguistic and cultural preferences of a particular region. Typically, a locale consists of language, territory, character set, linguistic, and calendar information defined in NLS data files.

Localization

The process of providing language- or culture-specific information for software systems. Translation of an application's user interface would be an example of localization. Localization should not be confused with internationalization, which is the process of generalizing software so it can handle many different linguistic and cultural conventions.

Monolingual Support

Support for only one language.

Multibyte Character

A coded character that can be represented in one or more bytes. Multibyte data streams can include characters with varying widths, and can therefore make extensive text processing of individual characters a challenge. See "Wide Character".

NCHAR Character Set

An alternate character set from the database character set that can be specified for NCHAR, NVARCHAR2, and NCLOB columns. NCHAR character sets, unlike the database character set, can support fixed-width multibyte character sets. Care must be taken when selecting an NCHAR character set, since its character repertoire must be included in the database character set as well.

Net8

Net8 enables two or more computers that run the Oracle server to exchange data through a third-party network. It is independent of the communications protocol.

NLS

National Language Support. NLS allows users to interact with the database in their native languages. It also allows applications to run in different linguistic and cultural environments.

NLSDATA

A general phrase referring to the contents in many files with .nlb suffixes. These files contain data that the NLSRTL library uses to provide specific NLS support.

NLSRTL

National Language Support Run-Time Library. This library is responsible for providing locale-independent algorithms for internationalization. The locale-specific information (i.e., NLSDATA) is read by the NLSRTL library during run-time.

Replacement Character

A character used during character conversion when the desired character is not available in the target character set. For example, "?" is often used as Oracle's default replacement character.
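
The effect of a replacement character can be reproduced with Python's codecs, whose `errors="replace"` mode also substitutes "?" when encoding. This illustrates the concept only; it does not model which replacement Oracle chooses for a given pair of character sets.

```python
text = "Price: 10€ in Zürich"

# Neither € nor ü exists in 7-bit ASCII, so both are replaced.
as_ascii = text.encode("ascii", errors="replace")

# ISO 8859-1 contains ü but not €, so only the euro sign is replaced.
as_latin1 = text.encode("iso-8859-1", errors="replace")

print(as_ascii)
print(as_latin1)
```

The substitution is lossy: converting the result back cannot recover the original characters.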

Restricted Multilingual Support

Multilingual support that is restricted to a group of related languages rather than extended to all languages. A family of similar languages, such as the Western European languages, can be represented with a single character set, for example ISO 8859-1. In this case, however, Thai could not be added.

SQL*Net

Now called Net8. See "Net8".

Script

A collection of related graphic symbols used in a writing system. Some scripts are used to represent multiple languages, and some languages use multiple scripts. Examples of scripts include Latin, Arabic, and Han.

Server Character Set

The character set used by the database server.

UCS-2

UCS stands for "Universal Multiple-Octet Coded Character Set". It is the character set defined by the 1993 ISO and IEC standard, ISO/IEC 10646. See also "UCS2".

Unicode

Unicode is a universal character set: a collection of 64K characters encoded in a 16-bit space. It encodes nearly every character in just about every existing character set standard, covering most written scripts used in the world. It is owned and defined by Unicode, Inc. Unicode is a canonical encoding, which means its values can be passed around in different locales. However, it does not guarantee a round-trip conversion between it and every Oracle character set without information loss.

Unicode Codepoint

A 16-bit binary value that can represent a unit of encoded text for processing and interchange. Every point between U+0000 and U+FFFF is a code point. The term is interchangeable with code element, code position, and code value.

Unicode Mapping Between UCS and UTF Formats

The following shows how different Unicode-related character sets relate to one another in terms of character code value ranges:

UCS2               UTF8          Description
0x0000 - 0x007F    0x00 - 0x7F   Single bytes
0x0080 - 0x07FF    0xC0 - 0xDF   2-byte sequence leaders (5+6 bits)
0x0800 - 0xFFFF    0xE0 - 0xEF   3-byte sequence leaders (4+6+6 bits)
                   0x80 - 0xBF   Follower bytes (6 bits each)

UCS4                       UTF8          Description
0x00000000 - 0x0000007F    0x00 - 0x7F   Single bytes
0x00000080 - 0x000007FF    0xC0 - 0xDF   2-byte sequence leaders (5+6 bits)
0x00000800 - 0x0000FFFF    0xE0 - 0xEF   3-byte sequence leaders (4+6+6 bits)
0x00010000 - 0x001FFFFF    0xF0 - 0xF7   4-byte sequence leaders (3+6+6+6 bits)
0x00200000 - 0x03FFFFFF    0xF8 - 0xFB   5-byte sequence leaders (2+6+6+6+6 bits)
0x04000000 - 0x7FFFFFFF    0xFC - 0xFD   6-byte sequence leaders (1+6+6+6+6+6 bits)
                           0x80 - 0xBF   Follower bytes (6 bits each)
                           0xFE - 0xFF   Reserved or unused

UCS4                       UTF16             Description
0x00000000 - 0x0000FFFF    0x0000 - 0xFFFF   Same as UCS2
0x00010000 - 0x0010FFFF    0xD800 - 0xDBFF   High surrogate ((x-0x10000)>>10)&0x3FF
                           0xDC00 - 0xDFFF   Low surrogate (x-0x10000)&0x3FF
0x00110000 - 0x7FFFFFFF                      Not mapped to UTF16

UCS2

Fixed-width 16-bit Unicode. Each character occupies 16 bits of storage. The Latin-1 characters are the first 256 code points in this standard, so it can be viewed as a 16-bit extension of Latin-1. Oracle does not yet support this character set in the NLS run-time library.
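
The claim that UCS2 is a 16-bit extension of Latin-1 can be checked with a short sketch. It relies on Python's `iso-8859-1` codec for Latin-1 and on `utf-16-be`, which for these code points produces the same bytes as big-endian UCS2; both choices are assumptions of this example, not Oracle behavior.

```python
# Every possible Latin-1 byte value, 0x00 through 0xFF.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("iso-8859-1")

# Each Latin-1 byte value is numerically equal to its code point.
code_points = [ord(c) for c in decoded]
print(code_points == list(range(256)))

# In big-endian 16-bit form, each character is a zero byte followed by the Latin-1 byte.
ucs2_bytes = decoded.encode("utf-16-be")
print(len(ucs2_bytes), ucs2_bytes[0xE9 * 2 : 0xE9 * 2 + 2])
```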

UCS4

Fixed-width 32-bit Unicode. Each character occupies 32 bits of storage. The UCS2 characters are the first 65,536 code points in this standard, so it can be viewed as a 32-bit extension of UCS2. This is also sometimes referred to as ISO/IEC 10646, a standard that specifies up to 2,147,483,648 characters in 32,768 planes, of which the first plane is the UCS2 set. The ISO standard also specifies transformations between different encodings.

Unrestricted Multilingual Support

Being able to use as many languages as desired. A universal character set, such as Unicode, helps to provide unrestricted multilingual support because it supports a very large character repertoire, encompassing most modern languages of the world.

UTF-8

A variable-width encoding of UCS2 which uses sequences of 1, 2, or 3 bytes per character. Characters from 0-127 (the 7-bit ASCII characters) are encoded with one byte, characters from 128-2047 require two bytes, and characters from 2048-65535 require three bytes. The Oracle character set name for this is UTF8 (for the Unicode 2.0 standard). The standard has left room for expansion to support the UCS4 characters with sequences of 4, 5, and 6 bytes per character.
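
The 1-, 2-, and 3-byte cases described above can be written out by hand as a sketch. The function name `utf8_encode_ucs2` is hypothetical, and the function covers only the 16-bit range discussed in this entry; the results are compared against Python's built-in UTF-8 codec.

```python
def utf8_encode_ucs2(cp: int) -> bytes:
    """Encode one 16-bit code point as 1, 2, or 3 UTF-8 bytes."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only code points 0x0000-0xFFFF are handled here")
    if cp <= 0x7F:
        # Single byte: the 7-bit ASCII range.
        return bytes([cp])
    if cp <= 0x7FF:
        # Leader carries 5 bits, one follower carries 6 bits.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # Leader carries 4 bits, two followers carry 6 bits each.
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

for cp in (0x41, 0xE9, 0x20AC):  # A, é, €
    print(f"U+{cp:04X}", utf8_encode_ucs2(cp).hex())
```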

UTF-16

An extension to UCS2 that allows for pairs of UCS2 code points to represent extended characters from the UCS4 set. UCS2 has ranges of code points allocated for high (leading) and low (trailing) surrogates that support UTF16 encodings.
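
The surrogate arithmetic can be sketched as below, following the usual UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves. The function name `to_surrogates` is hypothetical.

```python
def to_surrogates(x: int) -> tuple:
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= x <= 0x10FFFF:
        raise ValueError("code point is outside the range UTF-16 maps to surrogates")
    offset = x - 0x10000
    high = 0xD800 + ((offset >> 10) & 0x3FF)  # leading surrogate, upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)           # trailing surrogate, lower 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's UTF-16 codec (big-endian, no byte order mark).
pair_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(pair_bytes == chr(0x1F600).encode("utf-16-be"))
```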

Wide Character

A fixed-width character format that is well-suited for extensive text processing because it allows for data to be processed in consistent fixed-width chunks. Wide characters are intended for supporting internal character processing, and are therefore implementation-dependent.
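
Why fixed-width chunks simplify processing can be seen in a small sketch that compares a variable-width buffer with a 32-bit fixed-width one. Python's `utf-32-be` codec is used here as an assumed stand-in for a wide character format, and `nth_wide_char` is a hypothetical helper.

```python
text = "aé€"

utf8 = text.encode("utf-8")      # variable width: 1 + 2 + 3 bytes
wide = text.encode("utf-32-be")  # fixed width: 4 bytes per character

def nth_wide_char(buf: bytes, n: int) -> str:
    """Fetch character n from a fixed-width 32-bit buffer by direct offset."""
    return buf[4 * n : 4 * n + 4].decode("utf-32-be")

print(len(utf8), len(wide))
print(nth_wide_char(wide, 2))
```

With the fixed-width buffer, the position of character n is simply 4 * n; with the variable-width buffer, every preceding character has to be scanned to find it.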




Copyright © 1999 Oracle Corporation.

All Rights Reserved.
