Oracle8(TM) ConText(R) Cartridge Application Developer's Guide
Release 2.0
A54630-01

7
Linguistic Concepts

This chapter describes the approach used by ConText linguistics to provide advanced analysis of English-language text.

The following topics are covered in this chapter:

Overview of ConText Linguistics
Application Program Interface (API)
Linguistic Core
Text Input Requirements
Linguistic Output
Linguistic Settings

Overview of ConText Linguistics

ConText linguistics is used to analyze the content of English-language documents. You use ConText linguistics to create different views of the contents of documents that allow the user to quickly review the essential content of documents and determine their relevance.

Because these services are separate and distinct from text and theme indexing, you can incorporate linguistic analysis and functionality in a text application, independent of the text/theme indexing process.

ConText linguistics can generate the following forms of linguistic output for documents:

Output Type	Description
Themes	The main concepts of a document.
Generic Gist	Paragraph or paragraphs in a document that best represent what the document is about as a whole.
Point-of-View Gist	Paragraph or paragraphs in a document that best represent a given theme in the document.

You obtain linguistic output by submitting a linguistic request using the CTX_LING PL/SQL package. Linguistic requests can only be processed by ConText servers running with the Linguistic personality.

Requirements

The requirements for using ConText linguistics are:

text stored in a column (either directly or indirectly through a pathname to files)
a policy for the column
ConText server running with Linguistic personality

Note:

The setup requirements of having text in a column and having a policy for the column apply to ConText indexes (text/theme) as well as ConText linguistics. The procedures for storing text and creating policies are not discussed in this manual. For more information about storing text in columns and creating policies for the columns, see Oracle8 ConText Cartridge Administrator's Guide.

Linguistic Personality

To process requests for linguistic output (themes and Gists), a ConText server with the Linguistic personality must be running. A ConText server with the Linguistic personality can also have other personalities in its personality mask.

Starting up ConText servers is the task of the ConText administrator, through the CTXSYS Oracle user.

See Also:

For more information about the Linguistic personality and specifying personality masks for ConText servers, see Oracle8 ConText Cartridge Administrator's Guide.

Services Queue

The Services Queue is used for managing ConText linguistic requests. Such a request is cached in memory until the requestor submits the request, at which time the request is added to the Services Queue. If more than one request is cached in memory when the user submits the requests, ConText stores all of the requests as a single batch job.

If a ConText server has the Linguistic personality, the server monitors the Services Queue for requests and processes the next request in the queue.

Note:

If no ConText servers with the 'L' (Linguistic) personality are running, the Services Queue still accepts requests and holds them for the next available ConText server with the appropriate personality.
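
The caching and batching behavior described above can be pictured with a small model. The following Python sketch is illustrative only and is not the ConText implementation: the class names ServicesQueue and Session, the request tuples, and the handle numbering are all assumptions made for the example. It shows requests held in memory until they are submitted as one batch, and a queue that keeps the batch until a server whose personality mask includes 'L' asks for work.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ServicesQueue:
    """Toy model of the queue: holds batches until a suitable server polls."""
    batches: deque = field(default_factory=deque)
    next_handle: int = 1  # hypothetical handle numbering

    def add_batch(self, requests):
        handle = self.next_handle
        self.next_handle += 1
        self.batches.append((handle, list(requests)))
        return handle

    def next_batch(self, server_personalities):
        # Only a server whose mask includes 'L' may take linguistic work.
        if "L" not in server_personalities or not self.batches:
            return None
        return self.batches.popleft()


@dataclass
class Session:
    """Caches requests in memory until the requestor submits them."""
    queue: ServicesQueue
    pending: list = field(default_factory=list)

    def request(self, kind, doc_id):
        self.pending.append((kind, doc_id))

    def submit(self):
        if not self.pending:
            return None
        # All cached requests go to the queue as a single batch.
        handle = self.queue.add_batch(self.pending)
        self.pending = []
        return handle
```

In this model, a server that polls with a mask lacking 'L' receives nothing, and the batch stays queued for the next suitable server.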

The ConText administration tool can be used to perform all administration functions on the Services Queue, such as cleaning up entries. In addition, the CTX_SVC PL/SQL package can be used to perform ConText administration from the command line.

Creating Linguistic Output

You can generate linguistic output in batch during the text indexing process or generate it as needed. Because the generation of linguistic output is independent of the text-indexing process, ConText places no restrictions on when you can create themes and Gists.

See Also:

For more information about generating linguistic output at indexing time versus generating linguistic output on demand, see "Combining Theme/Text Queries with Linguistic Output" in Chapter 8.

Application Program Interface (API)

Linguistic and queue management functions are invoked by using PL/SQL procedures called or executed within the programming language in which the application is developed. If the application is developed in PL/SQL, these procedures may be invoked directly as PL/SQL execute statements. If the application is developed in another language, such as C, the PL/SQL procedures for linguistic and queue management functions are accessed through the Oracle Call Interface (OCI).

ConText provides the following PL/SQL packages for generating linguistic output and managing the Services Queue, respectively:

CTX_LING
CTX_SVC

CTX_LING Package

The stored procedures in CTX_LING are used to request linguistic output and submit the requests to the Services Queue. CTX_LING also provides procedures for specifying user settings for generating linguistic output and enabling logging of parse information generated during the processing of a request.

The model for submitting requests and querying the linguistic output is similar to the two-step query model (CONTAINS procedure) provided within the ConText framework for content-based text retrieval.

For example, to generate themes for a document, you first create a table to store the results of the theme generation, then call the CTX_LING.REQUEST_THEMES procedure followed by the CTX_LING.SUBMIT function. ConText stores the results in the theme table. To view the results, issue a SELECT statement to select the themes from the output table.

See Also:

For more information about the procedures in the CTX_LING package, see "CTX_LING: Linguistics" in Chapter 10.

CTX_SVC Package

The stored procedures in CTX_SVC are used to monitor the Services Queue for the status of specific requests. CTX_SVC can be used to check the status of pending requests, and to display errors encountered. You can also cancel the request if it has not been picked up for processing by a ConText server or clear the request if the request encountered an error.

See Also:

For more information about procedures in the CTX_SVC package, see "CTX_SVC: Services Queue Administration" in Chapter 10.

Linguistic Core

The linguistic core is made up of the following components:

lexicon
knowledge catalog
parsing engine

Lexicon

The lexicon is a static knowledge base that provides word and phrase information for the parsing engine. The lexicon recognizes over one million English words and phrases and defines hundreds of lexical characteristics for each word.

Note:

The lexicon is specific to the English language, but it recognizes the difference between American and British usage and spelling.

Linguistic information about words in the lexicon is divided into the following types:

Information Type	Description
Syntax	Syntax flags provide surface level assessments of a word or phrase isolated from its grammatical context.
Theme	Theme flags identify the thematic qualities of a word (e.g. weak noun/needs support, strong verb). The parser uses these flags to determine how a word contributes to the thematic construction of the sentence as a whole.

Knowledge Catalog

The knowledge catalog is a language-independent organization of industries, fields of study, special terms and jargon, and abstract concepts. It creates a classification scheme that defines ConText's semantic view of the world.

The knowledge catalog is organized as a hierarchy of concepts. When a parent concept subsumes one or more concepts, the parent concept is called a category. The knowledge catalog is divided into the following six main categories:

business and economics
government and military
science and technology
social environment
geography
abstract ideas and concepts

These categories are divided further into more categories and concepts, creating a tree-like structure whose branches break down the various realms of discourse. Child categories are related to parent categories by an "is-associated-with" relationship, loosely defined so that it covers other standard child-parent relationships such as "is-a-part-of", "belongs-to", or "is-related-to". For instance, the concept of jazz music is defined by the following hierarchy:

social environment
  arts and entertainment
    performing arts
      music
        jazz music

The concept jazz music belongs to the category of music, which is a part of the more general category of arts and entertainment, which is a part of the even more general category social environment.
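
One way to picture this hierarchy is as a map from each concept to its parent. The Python sketch below is only an illustration of the idea: the PARENT dictionary and the function names are assumptions made for the example, and the source does not describe how ConText stores the knowledge catalog internally.

```python
# Child -> parent links for the jazz music example; None marks a top-level category.
PARENT = {
    "jazz music": "music",
    "music": "performing arts",
    "performing arts": "arts and entertainment",
    "arts and entertainment": "social environment",
    "social environment": None,
}


def path_to_root(concept, parent=PARENT):
    """Return the concept followed by each successively more general category."""
    path = [concept]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def is_category(concept, parent=PARENT):
    """A concept is a category when it subsumes at least one other concept."""
    return any(p == concept for p in parent.values())
```

Calling path_to_root("jazz music") walks from the concept up through music, performing arts, and arts and entertainment to social environment.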

Cross-References

Additional semantic relationships are represented by cross-references that link concepts that are not ancestrally related to each other in the hierarchy. For example, the category of Middle East, which lies under geography, is cross-referenced with petroleum industry, which lies under science and technology.

Canonical Forms

When ConText analyzes documents for theme extraction and theme indexing, phrases must be converted into their canonical forms before they can be attached to the knowledge hierarchy and returned to the user as themes. To make this conversion, the knowledge catalog keeps the following lists:

Type of List	Description
Nominals and Plurals	A list of mappings from inflected variations of words to their standard noun forms as stored in the knowledge catalog's hierarchy of concepts. For example, the words ratify and ratifies are mapped to the canonical form ratification.
Alternate Forms	A list of mappings from acronyms, abbreviations, and alternate spellings to their standard forms. For example, IBM is an acronym for the standard form IBM - International Business Machines Corporation.
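
The two lists can be thought of as lookup tables consulted before a term is attached to the hierarchy. The Python sketch below uses only the examples given in the table; the dictionary names, the lookup order, and the case handling are assumptions made for illustration and are not taken from ConText.

```python
# Stand-ins for the two lists, holding only the examples from the table above.
NOMINALS_AND_PLURALS = {
    "ratify": "ratification",
    "ratifies": "ratification",
}
ALTERNATE_FORMS = {
    "IBM": "IBM - International Business Machines Corporation",
}


def canonical_form(term):
    """Map a term to its canonical form, or return it unchanged if unknown."""
    # Assumption: acronyms are matched exactly, inflected words ignoring case.
    if term in ALTERNATE_FORMS:
        return ALTERNATE_FORMS[term]
    return NOMINALS_AND_PLURALS.get(term.lower(), term)
```

A term that appears in neither list, such as a word already in its standard noun form, passes through unchanged.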

See Also:

For more information about creating theme indexes and issuing theme queries, see Chapter 5, "Theme Queries".

Parsing Engine

The parsing engine identifies paragraph, sentence, and token (word) boundaries, as well as phrases and clauses. It then passes the tokens to the lexicon where grammar and theme flags are attached and linguistic analysis begins.

Once the lexicon identifies the grammatical function of each word in a sentence, using the word's placement in the sentence and its relationship to the surrounding words, the parsing engine determines the thematic function of the word in the sentence.

As the parsing engine encounters successively larger text blocks (sentences, paragraphs, and the whole document), it expands the analysis to add new information about the text to its knowledge base.

If case conversion is enabled, the parsing engine converts all the text to lowercase and processes the text through the case-sensitivity routines to determine proper capitalization.

Note:

Case conversion does not affect the original text of the documents being processed; only the output of the parsing engine is stored in mixed-case.

Text Input Requirements

ConText linguistics has the following requirements and restrictions for text input:

punctuation
paragraph separation
document size
writing styles
case-sensitivity

Punctuation

Each word and sentence should be clearly identified using standard conventions such as blank spaces and recognized punctuation. Complete sentences produce the best results, but are not required. ConText can process incomplete sentences as well as text in headers and lists.

Paragraph Separation

To successfully process text, ConText requires documents to be separated into paragraphs. The method by which the paragraph delimiters are recognized is based on whether the text is formatted.

Formatted Text

In formatted text, the filters used to extract the text must provide paragraph delimiters that can be recognized by ConText.

The internal filters provided by ConText automatically recognize the paragraph delimiters used in the document format for the filter. Similarly, any external filters used for filtering text must recognize the paragraph delimiters used in the document format for the filter.

See Also:

For more information about filters, see Oracle8 ConText Cartridge Administrator's Guide.

Plain (ASCII) Text

With plain (ASCII) text, paragraph delimiters are determined on a per-document basis. ConText samples the first 8 kilobytes of text in a document to identify the common method used to mark the beginning and end of paragraphs in the sample. That method is then applied to the rest of the document.
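
The sampling approach can be sketched as follows. This Python example is a rough illustration under stated assumptions, not the ConText algorithm: the two candidate conventions (blank lines and indented first lines), the fallback to single newlines, and the function names are all hypothetical. Only the 8-kilobyte sample size comes from the description above.

```python
import re

SAMPLE_BYTES = 8 * 1024  # sample size stated in the text

# Hypothetical candidate conventions; ties go to the first entry.
CANDIDATES = {
    "blank_line": re.compile(r"\n[ \t]*\n+"),
    "indented_line": re.compile(r"\n(?=[ \t]+\S)"),
}
FALLBACK = re.compile(r"\n")


def detect_delimiter(text):
    """Pick the most common paragraph convention in the first 8 KB."""
    raw = text.encode("utf-8")[:SAMPLE_BYTES]
    sample = raw.decode("utf-8", errors="ignore")
    counts = {name: len(rx.findall(sample)) for name, rx in CANDIDATES.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "single_newline"


def split_paragraphs(text):
    """Apply the convention found in the sample to the whole document."""
    pattern = CANDIDATES.get(detect_delimiter(text), FALLBACK)
    return [p.strip() for p in pattern.split(text) if p.strip()]
```

The point of the design is that detection looks only at the sample, while the split is applied to the full text.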

Document Size

ConText linguistics can process documents of any size, up to a maximum size of 5 megabytes for a single document.

Note:

If a ConText linguistics request is submitted for a document larger than 5 megabytes, ConText returns an error and does not generate output for the document.

Writing Styles

ConText can analyze written material of all styles and subject matter. This includes technical manuals, literature of all types, newspapers and magazines, encyclopedias, and electronic-mail messages.

ConText linguistics is not well-suited for processing transcriptions of unstructured, spoken words, such as colloquial dialogue or casual conversation. ConText linguistics also does not work well with non-natural languages such as computer programming languages.

Case-sensitivity

ConText linguistics depends on text that is properly capitalized, which helps indicate the beginning of sentences and identifies proper nouns. ConText linguistics can also process text that is not in mixed-case, which is especially useful for all-uppercase or all-lowercase text that may exist in legacy systems.

To produce mixed-case text, ConText first reduces the text to all lowercase, then analyzes each word to determine whether it should be capitalized.

This internal case conversion takes place only if the appropriate setting has been enabled in the setting configuration for the session.

Note:

While linguistic output is stored in mixed-case, the text of the source documents is not converted to mixed-case. The conversion is done internally and used only to facilitate the linguistic analysis.

The Proper Names Table

ConText has a list of more than six hundred thousand proper names that are stored in a database table and used by the case-sensitivity routines to properly capitalize terms identified as proper nouns/names.

For database space and performance reasons, the proper names table, CTX_PROPER_NAME, is not populated with the list of proper names during installation. If you wish to use the case-sensitivity routines, the proper names list must be imported after installation.
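
The role of the proper names list in case conversion can be illustrated with a short sketch. The Python code below is a simplification written for this explanation and does not reproduce the ConText case-sensitivity routines: the PROPER_NAMES dictionary is a two-entry stand-in for the CTX_PROPER_NAME table, and the sentence-boundary rule (a period, exclamation mark, or question mark) is an assumption.

```python
import re

# Tiny stand-in for the proper names table: lowercase form -> stored form.
PROPER_NAMES = {"oracle": "Oracle", "unix": "UNIX"}


def restore_case(text, proper_names=PROPER_NAMES):
    """Lowercase everything, then re-capitalize proper names and sentence starts."""
    out = []
    sentence_start = True
    for token in re.findall(r"\w+|\W+", text.lower()):
        if re.match(r"\w", token):
            if token in proper_names:
                token = proper_names[token]
            elif sentence_start:
                token = token.capitalize()
            sentence_start = False
        elif re.search(r"[.!?]", token):
            # Assumed rule: terminal punctuation starts a new sentence.
            sentence_start = True
        out.append(token)
    return "".join(out)
```

As in the description above, the function returns a converted copy and leaves the input string untouched.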

Loading the Proper Names Table

The proper names table, CTX_PROPER_NAME, is required only for enabling case-sensitivity.

The list of proper nouns for CTX_PROPER_NAME is provided as an export file in the admin directory for ConText. For example, in a UNIX-based environment, the export file is named ctxprop.dmp and is located in $ORACLE_HOME/ctx/admin.

See Also:

For more information about loading the proper names table, see the Oracle8 Server installation documentation specific to your operating system.

Attention:

Once imported, the CTX_PROPER_NAME table and the Oracle index for the table are large (nearly 140 MB). To ensure the table is imported successfully, the default tablespace for CTXSYS must be large enough to accommodate both the table and its Oracle index, as well as the other Oracle ConText Cartridge database objects.

To import CTX_PROPER_NAME, change directories to the location of the export file and use the Oracle Import utility.

For example:

imp ctxsys/ctxsys_passwd FILE=ctxprop.dmp FULL=y IGNORE=y COMMIT=y BUFFER=number_of_bytes

A large buffer size enables faster importing of data. A buffer size of at least 1 MB is suggested for Solaris 2.x.

See Also:

For more information about the import utility, see Oracle8 Server Utilities.

Linguistic Output

ConText linguistics produces the following output:

theme indexes
lists of themes
point-of-view Gists
generic Gists

Theme Indexes

Theme indexes are created as a prerequisite for issuing theme queries. Given a theme policy, you can create a theme index for all documents in an entire text column using CTX_DDL.CREATE_INDEX.

See Also:

For more information about creating theme indexes, see "Creating a Theme Index" in Chapter 5.

List of Themes

You can generate a list of themes, or main concepts, for each document individually. Because themes present a profile of the main subjects of a document, a list of themes provides a snapshot of what the document is about. You can generate up to 16 themes for each document using the CTX_LING.REQUEST_THEMES procedure. When you generate the themes for a document, each theme is assigned a relative weight.

Note:

ConText linguistics produces only document-level themes; paragraph-level themes cannot be produced.

See Also:

For more information about generating themes, see "Generating Themes and Gists" in Chapter 8.

Theme Weight

Each document theme is assigned a weight that measures the strength of the theme relative to the other themes in the document.

The cumulative weight of a theme also reflects the overall thematic content of the document. As such, theme weights can be used to compare a document theme to other themes within the same document or to other documents with the same theme.

Theme Classification

The themes produced by ConText linguistics are essentially document classifications. Each theme provides information that can be used to classify the document into a semantic world view (classification structure) defined by the user. For this reason, ConText linguistics always normalizes the terms and phrases in the theme output to their noun and plural forms, if applicable.

In addition, the theme output is not always a direct result of the actual terms and phrases found in a document. Often the output reflects ConText's understanding of how themes are related.

For example, if a document provides a detailed discussion of MS-DOS and UNIX, ConText returns DOS and UNIX as themes for the document; however, ConText might also return operating systems as a theme, indicating that a relationship exists between DOS and UNIX. The document could be classified under DOS, UNIX, operating systems, or any combination of the three.
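
The DOS and UNIX example suggests how a related, more general theme might surface. The Python sketch below is one possible illustration and should not be read as the method ConText uses: the PARENT map, the function name, and the rule that a parent is reported once at least two of its children appear as themes are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical child -> parent links for the example in the text.
PARENT = {
    "DOS": "operating systems",
    "UNIX": "operating systems",
    "jazz music": "music",
}


def shared_categories(themes, parent=PARENT, min_children=2):
    """Return parent categories that cover at least min_children of the themes."""
    counts = Counter(parent[t] for t in themes if t in parent)
    return sorted(c for c, n in counts.items() if n >= min_children)
```

With the themes DOS and UNIX, the function reports operating systems; with unrelated themes it reports nothing.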

Point-of-View Gists

A point-of-view Gist for a document provides a short summary of the document from a specific point of view. A point-of-view Gist consists of the document paragraphs that provide the best match for a single document theme. To create theme summaries for each theme in a document, use CTX_LING.REQUEST_GIST.

Because it provides a concise, focused summary for a particular theme in a document, a point-of-view Gist can be used to compare documents with similar themes.

You can control the size of a point-of-view Gist with linguistic settings.

Note:

The settings for theme summaries can only be modified by creating custom setting configurations in the GUI administration tool.

See Also:

For more information about how to generate themes, see "Generating Themes and Gists" in Chapter 8.

For more information on specifying linguistic settings, see "Specifying Linguistic Settings" in Chapter 8.

For a complete list of ConText's predefined labels, see the specification for CTX_LING.SET_SETTINGS_LABEL in Chapter 10.

Generic Gists

A generic Gist for a document provides a summary that reflects all of the themes in the document. It consists of the document paragraphs that provide the best match for the overall document themes.

To generate a generic Gist, use CTX_LING.REQUEST_GIST. This procedure returns a generic Gist for the document along with summaries of the document's themes.

Because a generic Gist is generally longer than a point-of-view Gist, it serves better as a document reading tool than a document selection tool. For example, it can be used to quickly scan a document and to extract the most meaningful thematic information, rather than reading the entire document.

You can specify settings to control the size of the Gist and to determine whether the first and last document paragraphs are always included in the Gist.

Note:

The settings for generic Gists can only be modified by creating custom setting configurations in the GUI administration tool.

See Also:

For more information about how to generate Gists, see "Generating Themes and Gists" in Chapter 8.

For more information on specifying linguistic settings, see "Specifying Linguistic Settings" in Chapter 8.

For a complete list of ConText's predefined labels, see the specification for CTX_LING.SET_SETTINGS_LABEL in Chapter 10.

Linguistic Settings

You can perform linguistic processing of documents to generate themes and Gists only when a ConText server with the Linguistic personality is running. The type of processing is determined by the following options:

convert all-uppercase or all-lowercase text to case-sensitive text
generate Gists
use the full parser or the theme parser
process unknown terms (full theme parsing vs. limited theme parsing)

There is a default configuration, but you can also set these options by specifying a label with the CTX_LING.SET_SETTINGS_LABEL procedure. A label is a predefined configuration of settings.

Note:

You cannot change the predefined setting configurations that are shipped with ConText. However, you can use the administration tool to create custom setting configurations from the predefined setting configurations.

See Also:

For more information on how to specify linguistic settings, see "Specifying Linguistic Settings" in Chapter 8.

For a complete list of ConText's predefined labels, see the specification for CTX_LING.SET_SETTINGS_LABEL in Chapter 10.
