6
Designing Data Warehouse Applications

This chapter introduces integrated Oracle8 features for tuning enterprise scale data warehouses. By intelligently tuning the system, the data layout, and the application, you can build a high performance, scalable data warehouse.

Topics in this chapter include

Introduction

Data warehousing applications process a substantial amount of data by means of many CPU- and I/O-bound, data intensive tasks such as

loading, indexing, and summarizing tables
scanning, joining, sorting, aggregating, and fetching data

The resource required to complete the tasks on many gigabytes of data distinguishes data warehousing applications from other types of data processing. The bulk and complexity of your data may clearly indicate that you need to deploy multiple CPUs, investigate parallel processing, or consider specific data processing features that are directly relevant to the tasks at hand.

For example, in a typical data warehousing application, data intensive tasks might be performed on 100 gigabytes of data. At a processing speed of 0.2 G to 2 G of data per hour per CPU, a single CPU might need from 2 days to more than 2 weeks to perform a task. With more than a single gigabyte of data (certainly with upwards of 10G) you need to consider increasing the number of CPUs.

Similarly, if you need to copy 10 gigabytes of data, consider that using Export/Import might take a single CPU 10 hours. By contrast, using parallel CREATE TABLE AS SELECT on 10 CPUs might take only 1 hour.

Actual processing time depends on many factors, such as the complexity of the queries, the processing speed to which a particular hardware configuration can be tuned, and so on. Always run simple tests on your own system to find out its performance characteristics with regard to particular operations.

Features for Building a Data Warehouse

This section outlines Oracle8 Server features useful for building a data warehouse. It includes:

PARALLEL CREATE TABLE AS SELECT

The ability to CREATE TABLE AS SELECT in parallel enables you to reorganize extremely large tables efficiently. You might find it prohibitive to take a serial approach to reorganizing or reclaiming space in a table containing tens or thousands of Gigabytes of data. Using Export/Import to perform such a task might result in an unacceptable amount of downtime. If you have a lot of temporary space available, you can use CREATE TABLE AS SELECT to perform such tasks in parallel. With this approach, redefining integrity constraints is optional. This feature is often used for creating summary tables, which are precomputed results stored in the data warehouse.

Oracle8 Server Concepts

Parallel Index Creation

The ability to create indexes in parallel benefits both DSS and OLTP applications. On extremely large tables, rebuilding an index may take a long time. Periodically DBAs may load a large amount of data and rebuild the index. With the ability to create indexes in parallel, you may be able to drop an index before loading new data, and recreate it afterward.

Chapter 19, "Tuning Parallel Execution"

Partitioned Tables

You can avoid downtime with very large or mission-critical tables by using partitions. You can divide a large table into multiple physical tables using partitioning criteria. In a data warehouse you can manage historical data by partitioning by date. You can then do on a partition level all of the operations you might normally perform on the table level, such as backup and restore. You can add space for new data by adding a new partition, and delete old data by dropping an existing partition. Queries that use a key range to select from a partitioned table retrieve only the partitions that fall within that range. In this way partitions offer significant improvements in availability, administration and table scan performance.

Note: For performance reasons, partitioned tables should be used in Oracle8 Server rather than partition views. Please see the Oracle8 Server Migration Guide for instructions on migrating from partition views to partitioned tables.

See Also: Oracle8 Server Concepts for information about partitioned tables

"Partitioning Data" on page 19-27

ANALYZE

You can analyze the storage characteristics of tables, indexes, and clusters to gather statistics which are then stored in the data dictionary. The optimizer uses these statistics in a cost-based approach to determine the most efficient execution plan for the SQL statements you issue. Note that statistics can be either computed or estimated, depending on the amount of overhead you are willing to allow for this purpose.

Oracle8 Server Administrator's Guide

Parallel Load

When very large amounts of data must be loaded, the destination table may be unavailable for an unacceptable amount of time. The ability to load data in parallel can dramatically slash the amount of downtime necessary.

Chapter 19, "Tuning Parallel Execution"

Oracle8 Server Utilities for a description of SQL Loader conventional and direct path loads.

Features for Querying a Data Warehouse

This section summarizes Oracle8 Server features useful for querying a data warehouse. It includes:

Oracle Parallel Server Option

The Oracle Parallel Server option provides benefits important to both OLTP and DSS applications:

application failover
scalable performance
load balancing
multiuser scalability

OPS failover capability (the ability of the application to automatically reconnect if the connection to the database is broken) results in improved availability, a primary benefit for OLTP applications. Likewise, scalability in the number of users that can connect to the database is a major benefit in OLTP environments. OLTP performance on OPS can scale as well, if an application's data is isolated onto a single server.

For data warehousing applications, scalability of performance is a primary benefit of OPS. The architecture of OPS allows parallel query to perform excellent load balancing of work at runtime. If a node in an OPS cluster or MPP is temporarily slowed down, work that was originally assigned to parallel query servers on that node (but not yet commenced by those servers) may be performed by servers on other nodes, hence preventing that node from becoming a serious bottleneck. Even though OPS is a cornerstone of parallel query on clusters and MPPs, in a mostly query environment, the overhead on the distributed lock manager is minimal.

Parallel Aware Optimizer

Knowledge about parallelism is incorporated into the Oracle8 cost-based optimizer. Parallel execution considerations are thus a fundamental component in arriving at query execution plans. In addition, the user can control the trade-off of throughput for response time in plan selection.

The optimizer chooses intelligent defaults for the degree of parallelism based on available processors and the number of disk drives storing data the query will access. Access path choices (such as table scans vs. index access) take into account the degree of parallelism, resulting in plans that are optimized for parallel execution. Execution plans are more scalable, and there is improved correlation between optimizer cost and execution time for parallel query.

The initialization parameter OPTIMIZER_PERCENT_PARALLEL defines the weighting that the optimizer will use to minimize response time in its cost functions.

Parallel Data Manipulation Language and Parallel Query

Oracle8 Server provides for improved performance through use of parallel data manipulation language.

The parallel query feature enables multiple processes to work together simultaneously to process a single query. By dividing the task among multiple server processes, Oracle can execute a SQL statement more quickly than if only one server process were used.

Parallel query can dramatically improve performance for data-intensive data warehousing operations. It helps systems scale in performance when adding hardware resources. The greatest performance benefits are on symmetric multiprocessing (SMP), clustered, or massively parallel systems where query processing can be effectively spread out among many CPUs on a single system.

Oracle8 Server Concepts for conceptual background on parallel query

Bitmap Indexes

Regular B tree indexes work best when each key or key range references only a few records, such as employee names. Bitmap indexes, by contrast, work best when each key references many records, such as employee gender.

Bitmap indexing provides the same functionality as regular indexes, but uses a different internal representation, which makes it very fast and space efficient. Bitmap indexing benefits data warehousing applications which have large amounts of data and ad hoc queries, but a low level of concurrent transactions. It provides reduced response time for many kinds of ad hoc queries; considerably reduced space usage compared to other indexing techniques; and dramatic performance gains even on very low end hardware. Bitmap indexes can be created in parallel and are completely integrated with cost-based optimization.

Star Queries

One type of data warehouse design is known as a "star" schema. This typically consists of one or more very large "fact" tables and a number of much smaller "dimension" or reference tables. A star query is one that joins several of the dimension tables, usually by predicates in the query, to one of the fact tables.

Oracle cost-based optimization will recognize star queries and generate efficient execution plans for them; indeed, you must use cost-based optimization to get efficient star query execution. To enable cost-based optimization, simply ANALYZE your tables and be sure that the OPTIMIZER_MODE initialization parameter is set to its default value of CHOOSE.

See Also: Oracle8 Server Concepts for information on optimizing star queries

"STAR" on page 8-21 for information about the STAR hint

Star Transformation

Star transformation is a cost-based transformation designed to execute star queries efficiently. Whereas star optimization works well for schemas with a small number of dimensions and dense fact tables, star transformation works well for schemas with a large number of dimensions and sparse fact tables.

Star transformation is enabled by setting the initialization parameter STAR_TRANSFORMATION_ENABLED to TRUE. You can use the STAR_TRANSFORMATION hint to make the optimizer use the best plan in which the transformation has been used.

See Also: Oracle8 Server Concepts for a full discussion of star transformation.

Oracle8 Server Reference Manual describes the initialization parameter STAR_TRANSFORMATION_ENABLED.

"STAR_TRANSFORMATION" on page 8-22 explains how to use this hint.

Backup and Recovery of the Data Warehouse

Recoverability has various levels: to recover from disk failure, human error, software failure, fire, and so on, requires different procedures. ORACLE is only part of the solution. Organizations must decide how much to spend on backup and recovery by considering the business cost of a long outage.

The NOLOGGING option enables you to perform certain operations without the overhead of generating a log. Even without logging, you can avoid disk failure if you use disk mirroring or RAID technology. If you load your warehouse from tapes every day or week, you might satisfactorily recover from all failures by simply saving copies of the tape in several remote locations and reloading from tape when something goes wrong. At the other end of the spectrum, you could both mirror disks and take backups and archive logs, and maintain a remote standby system. The mirrored disks prevent loss of availability for disk failure, and also protect against total loss in the event of human error (such as dropping the wrong table) or software error (such as disk block corruption). In the event of fire, power failure, or other problems at the primary site, the backup site prevents long outages.

See Also: For more information on recovery and the NOLOGING option, see the Oracle8 Server Administrator's Guide and Oracle8 Server SQL Reference.

"[NO]LOGGING Option" on page 19-41

6 Designing Data Warehouse Applications