Load and transform external data into Oracle 9i

This article was published in the Oracle Magazin. We have tested the material for Oracle 9.0.1 on Windows 2000 and clarified some parts of the document and code.

Overview

Up until Oracle 9i, SQL*Loader was the tool to load a lot of data into an Oracle database from external sources. Oracle9i introduces several interesting new features that change the way you look at the data-loading and transformation process. These is performed by the ETL toolkit (Extraction, Transformation, and Loading) and collectively provide you with a powerful ETL toolkit. Three ETL features are particularly interesting

External table

Table functions

The MERGE statement

Typical Loading Scenario

In the following example we use external tables and table functions, two of these powerful ETL features. Let's assume that you have the following revenue data to load into a database and that you want to normalize it so you end up with one row for each person/month combination:

Person   Jan Feb Mar Apr Mai Jun Jul Aug Sep Oct Nov Dez
--------------------------------------------------------
Schnyder,345,223,122,345,324,244,123,123,345,121,345,197
   Weber,234,234,123,457,456,287,234,123,678,656,341,567
Keller,596,276,347,134,743,545,216,456,124,753,346,456
   Meyer,987,345,645,567,834,567,789,234,678,973,456,125
Holzer,509,154,876,347,146,788,174,986,568,246,324,987
Müller,456,125,678,235,878,237,567,237,788,237,324,778
Binggeli,487,347,458,347,235,864,689,235,764,964,624,347
Stoller,596,237,976,876,346,567,126,879,125,568,124,753
   Marty,094,234,235,763,054,567,237,457,325,753,577,346
Studer,784,567,235,753,124,575,864,235,753,864,634,678

Person    Month   Revenue
-------------------------
Schnyder    Jan       345
Schnyder    Feb       223
Schnyder    Mar       122
Schnyder    Apr       345
Schnyder    Mai       324
Schnyder    Jun       244
Schnyder    Jul       123
Schnyder    Aug       123
Schnyder    Sep       345
Schnyder    Oct       121
Schnyder    Nov       345
Schnyder    Dez       197
........    ...       ...

Using Oracle8i, you would need to use at least two discrete steps to load and transform this data. The first step is to load the the data into a staging table using SQL*LOADER, the next step is to transform and insert the data in the final table using a PL/SQL procedure.

Oracle9i's ETL features, however, allow for a more interesting approach to the task, one that loads and transforms in just one step.

External Tables

New in Oracle9i is the concept of an external table. This is a table that you define in the database's data dictionary, but for which the data itself is stored outside of the database. It's possible, for example, to define an external table that derives its data from the type of text file you would load using SQL*Loader. This is great, because the revenue data in the above example you need to load resides in just such a text file. Before you can create an external table, you need to create an Oracle directory object that points to the operating system directory in which your text file resides.

CONNECT sys/manager AS SYSDBA;

CREATE OR REPLACE DIRECTORY dat_dir AS 'C:\Oradata\Data';
CREATE OR REPLACE DIRECTORY log_dir AS 'C:\Oradata\Log';
CREATE OR REPLACE DIRECTORY bad_dir AS 'C:\Oradata\Bad';

GRANT READ ON DIRECTORY dat_dir TO scott;
GRANT WRITE ON DIRECTORY log_dir TO scott;
GRANT WRITE ON DIRECTORY bad_dir TO scott;

You can now use a new form of the CREATE TABLE statement that looks like a cross between a SQL statement and a SQL*Loader control file:

CONNECT scott/tiger;

CREATE TABLE revext (person      VARCHAR2(20),
                     rev_jan     NUMBER(4),
                     rev_feb     NUMBER(4),
                     rev_mar     NUMBER(4),
                     rev_apr     NUMBER(4),
                     rev_mai     NUMBER(4),
                     rev_jun     NUMBER(4),
                     rev_jul     NUMBER(4),
                     rev_aug     NUMBER(4),
                     rev_sep     NUMBER(4),
                     rev_oct     NUMBER(4),
                     rev_nov     NUMBER(4),
                     rev_dez     NUMBER(4))
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY dat_dir
   ACCESS PARAMETERS
   (
     records delimited by newline
     badfile bad_dir:'revext%a_%p.bad'
     logfile log_dir:'revext%a_%p.log'
     fields terminated by ','
     missing field values are null
     ( person,
       rev_jan,
       rev_feb,
       rev_mar,
       rev_apr,
       rev_mai,
       rev_jun,
       rev_jul,
       rev_aug,
       rev_sep,
       rev_oct,
       rev_nov,
       rev_dez
     )
   )
   LOCATION ('revext.dat')
)
PARALLEL 4
REJECT LIMIT UNLIMITED;

When you create an external table, you're really only creating some data dictionary entries. Nothing exciting happens until you query the table. Realize though, that you can query the table as you would any other SQL table. Let's assume you had the following production table:

CREATE TABLE revenue (
    person       VARCHAR2(20),
    month        VARCHAR2(3),
    revenue      NUMBER,
    CONSTRAINT revenue_pk PRIMARY KEY (person,month));

Given this table, you can use the following INSERT...SELECT FROM statement to extract revenue data from your external file in a normalized format and insert it into your production table.

INSERT INTO revenue (person,month,revenue)
   SELECT person,'Jan',rev_jan
   FROM revext
   WHERE rev_jan IS NOT NULL
UNION ALL
   SELECT person,'Feb',rev_feb
   FROM revext
   WHERE rev_feb IS NOT NULL
UNION ALL
   SELECT person,'Mar',rev_mar
   FROM revext
   WHERE rev_mar IS NOT NULL
UNION ALL
   SELECT person,'Apr',rev_apr
   FROM revext
   WHERE rev_apr IS NOT NULL
UNION ALL
   SELECT person,'Mai',rev_mai
   FROM revext
   WHERE rev_mai IS NOT NULL
UNION ALL
   SELECT person,'Jun',rev_jun
   FROM revext
   WHERE rev_jun IS NOT NULL
UNION ALL
   SELECT person,'Jul',rev_jul
   FROM revext
   WHERE rev_jul IS NOT NULL
UNION ALL
   SELECT person,'Aug',rev_aug
   FROM revext
   WHERE rev_aug IS NOT NULL
UNION ALL
   SELECT person,'Sep',rev_sep
   FROM revext
   WHERE rev_sep IS NOT NULL
UNION ALL
   SELECT person,'Oct',rev_oct
   FROM revext
   WHERE rev_oct IS NOT NULL
UNION ALL
   SELECT person,'Nov',rev_nov
   FROM revext
   WHERE rev_nov IS NOT NULL
UNION ALL
   SELECT person,'Dez',rev_dez
   FROM revext
   WHERE rev_dez IS NOT NULL;

Because you set the degree of parallelism to four when you created the external table, the database will divide the file to be read by four processes running in parallel. This parallelism happens automatically, with no additional effort on your part, and is really quite convenient. To parallelize this load using SQL*Loader, you would have had to manually divide your input file into multiple smaller files.

Table Functions

Disadvantage of UNION ALL

The INSERT statement shown previously works off the union of 12 SELECT statements. This means that the external data file gets read 12 times, once for each SELECT. Reading the input file 12 times isn't desirable, especially if the file is very large. Fortunately, this happens to be just the type of problem that you can solve using table functions. You can think of a table function as a highly streamlined transformation engine. A table function takes a set of rows as input and returns a different set of rows as output. Unlike traditional functions, table functions are designed to be invoked from a SELECT statement's FROM clause.

Table Function as highly streamlined transformation engine

For the revenue data, we want to take each input row and transform it into 12 output rows. Each input row holds data from 12 different revenue months. Your normalized destination table, however, requires that each month get its own row, so your table function must transform the 12 population counts from each input row into 12 separate output rows.

Table functions always return a collection of records

Before creating the table function, you need to create some types. Table functions always return a collection of records, so to begin, create a table type that corresponds to the definition of your ultimate destination table. To do this, first create an object type to define the record, and then create a table type based on that object type.

connect scott/tiger;

CREATE TYPE revenue_row AS OBJECT (
   person     VARCHAR2(20),
   month      VARCHAR2(3),
   revenue    NUMBER
);
/

CREATE TYPE revenue_tab
   AS TABLE OF revenue_row;
/

The input to your function will be rows returned by a SELECT statement against the external table REVEXT, so you'll need an appropriate REF CURSOR type. The following statement creates a package containing a REF CURSOR type REVENUE_CUR that matches the record structure of the REVEXT table. The package also defines a table function that takes such a cursor as an input parameter. Note that you have to create the table type REVENUE_TAB first, so you can use that type in your table function's RETURN clause:

CREATE OR REPLACE PACKAGE revenue_pkg
AS
   TYPE revenue_cur IS REF CURSOR RETURN revext%ROWTYPE;
   FUNCTION revenue_fun (revenue_arg IN revenue_cur)
      RETURN revenue_tab
      PARALLEL_ENABLE (PARTITION revenue_arg BY ANY)
      PIPELINED;
END;
/

The PARALLEL_ENABLE clause in the function specification allows the database to parallelize the function's execution. The PARTITION revenue_arg BY ANY clause indicates that the input rows can be arbitrarily divided into any number of buckets, all of which can then be processed in parallel. The PIPELINED clause enables the function to return the result set incrementally while other input data is still being processed. Think of the rows as "flowing through" the function during the execution of a query invoking the function. The function code defined in the package body takes care of transforming each input row into the desired 12 output rows.

CREATE OR REPLACE PACKAGE BODY revenue_pkg AS
   FUNCTION revenue_fun (revenue_arg IN revenue_cur)
   RETURN revenue_tab
   PARALLEL_ENABLE (PARTITION revenue_arg BY ANY)
   PIPELINED IS revenue_rec revext%ROWTYPE;

   outrow_jan revenue_row := revenue_row('','',0);
   outrow_feb revenue_row := revenue_row('','',0);
   outrow_mar revenue_row := revenue_row('','',0);
   outrow_apr revenue_row := revenue_row('','',0);
   outrow_mai revenue_row := revenue_row('','',0);
   outrow_jun revenue_row := revenue_row('','',0);
   outrow_jul revenue_row := revenue_row('','',0);
   outrow_aug revenue_row := revenue_row('','',0);
   outrow_sep revenue_row := revenue_row('','',0);
   outrow_oct revenue_row := revenue_row('','',0);
   outrow_nov revenue_row := revenue_row('','',0);
   outrow_dez revenue_row := revenue_row('','',0);

   BEGIN
      LOOP
         FETCH revenue_arg INTO revenue_rec;
         EXIT WHEN revenue_arg%NOTFOUND;

         IF revenue_rec.rev_jan IS NOT NULL THEN
           outrow_jan.person := revenue_rec.person;
           outrow_jan.month := 'Jan';
           outrow_jan.revenue := revenue_rec.rev_jan;
           PIPE ROW (outrow_jan);
         END IF;

         IF revenue_rec.rev_feb IS NOT NULL THEN
           outrow_feb.person := revenue_rec.person;
           outrow_feb.month := 'Feb';
           outrow_feb.revenue := revenue_rec.rev_feb;
           PIPE ROW (outrow_feb);
         END IF;

         IF revenue_rec.rev_mar IS NOT NULL THEN
           outrow_mar.person := revenue_rec.person;
           outrow_mar.month := 'Mar';
           outrow_mar.revenue := revenue_rec.rev_mar;
           PIPE ROW (outrow_mar);
         END IF;

         IF revenue_rec.rev_apr IS NOT NULL THEN
           outrow_apr.person := revenue_rec.person;
           outrow_apr.month := 'Apr';
           outrow_apr.revenue := revenue_rec.rev_apr;
           PIPE ROW (outrow_apr);
         END IF;

         IF revenue_rec.rev_mai IS NOT NULL THEN
           outrow_mai.person := revenue_rec.person;
           outrow_mai.month := 'Mai';
           outrow_mai.revenue := revenue_rec.rev_mai;
           PIPE ROW (outrow_mai);
         END IF;

         IF revenue_rec.rev_jun IS NOT NULL THEN
           outrow_jun.person := revenue_rec.person;
           outrow_jun.month := 'Jun';
           outrow_jun.revenue := revenue_rec.rev_jun;
           PIPE ROW (outrow_jun);
         END IF;

         IF revenue_rec.rev_jul IS NOT NULL THEN
           outrow_jul.person := revenue_rec.person;
           outrow_jul.month := 'Jul';
           outrow_jul.revenue := revenue_rec.rev_jul;
           PIPE ROW (outrow_jul);
         END IF;

         IF revenue_rec.rev_aug IS NOT NULL THEN
           outrow_aug.person := revenue_rec.person;
           outrow_aug.month := 'Aug';
           outrow_aug.revenue := revenue_rec.rev_aug;
           PIPE ROW (outrow_aug);
         END IF;

         IF revenue_rec.rev_sep IS NOT NULL THEN
           outrow_sep.person := revenue_rec.person;
           outrow_sep.month := 'Sep';
           outrow_sep.revenue := revenue_rec.rev_sep;
           PIPE ROW (outrow_sep);
         END IF;

         IF revenue_rec.rev_oct IS NOT NULL THEN
           outrow_oct.person := revenue_rec.person;
           outrow_oct.month := 'Oct';
           outrow_oct.revenue := revenue_rec.rev_oct;
           PIPE ROW (outrow_oct);
         END IF;

         IF revenue_rec.rev_nov IS NOT NULL THEN
           outrow_nov.person := revenue_rec.person;
           outrow_nov.month := 'Nov';
           outrow_nov.revenue := revenue_rec.rev_nov;
           PIPE ROW (outrow_nov);
         END IF;

         IF revenue_rec.rev_dez IS NOT NULL THEN
           outrow_dez.person := revenue_rec.person;
           outrow_dez.month := 'Dez';
           outrow_dez.revenue := revenue_rec.rev_dez;
           PIPE ROW (outrow_dez);
         END IF;

      END LOOP;
      RETURN;
   END;
END;
/

By enabling parallel DML and then using the table function, it's now possible to load and transform the revenue data in one operation:

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL (t,4) */ INTO revenue t
   SELECT *
   FROM TABLE (revenue_pkg.revenue_fun (
        CURSOR(SELECT person,
                      rev_jan,
                      rev_feb,
                      rev_mar,
                      rev_apr,
                      rev_mai,
                      rev_jun,
                      rev_jul,
                      rev_aug,
                      rev_sep,
                      rev_oct,
                      rev_nov,
                      rev_dez
               FROM revext)));

The revenue data will be read from the external table. That process will be parallelized, using intrafile parallelism. The output from each of those parallel operations will feed into separate processes, also running in parallel, that transform each input row into the 12 output rows you desire. Because the table function handles the transformation, only one pass through the external file is necessary. It's not necessary to stage the data in any sort of work table. The data is pipelined from the external data file, through the table function, and directly into the destination table. Fewer copies of the data results in less disk space being consumed.

The MERGE Statement

The MERGE statement solves the long-standing problem of reloading data that you have loaded previously. Prior to Oracle9i, you needed to write procedural code to detect whether a row existed, and to issue an INSERT or UPDATE statement as appropriate. In Oracle9i, you can simply use the MERGE statement and let the database handle the details.

INPUT or UPDATE is the Question

Let's consider one final, and very common, data loading problem with respect to our scenario. Let's assume that we had previously loaded revenue data into the REVENUE table, and that our current input file contained both new data and updates to existing data. In cases where data in the file represents an update, we want to update existing rows in the REVENUE table. When data in the file represents new data, we want to insert new rows into the REVENUE table. This is a common problem, and one that's often handled by implementing some procedural logic using PL/SQL. For example, the following PL/SQL code reads rows one at a time from REVEXT, and issues either UPDATE or INSERT statements as appropriate.

Oracle 8i Solution

DECLARE
BEGIN
   FOR inrec IN (
      SELECT person,rev_jan,rev_feb,rev_mar,rev_apr,rev_mai,rev_jun,
                    rev_jul,rev_aug,rev_sep,rev_oct,rev_nov,rev_dez
      FROM revext
   ) LOOP
      /* Jan Data */
      IF inrec.rev_jan IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_jan
         WHERE person = inrec.person
         AND month = 'Jan';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Jan', inrec.rev_jan);
         END IF;
      END IF;
      /* Feb Data */
      IF inrec.rev_feb IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_feb
         WHERE person = inrec.person
         AND month = 'Feb';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Feb', inrec.rev_feb);
         END IF;
      END IF;
      /* Mar Data */
      IF inrec.rev_mar IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_mar
         WHERE person = inrec.person
         AND month = 'Mar';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Mar', inrec.rev_mar);
         END IF;
      END IF;
      /* Apr Data */
      IF inrec.rev_apr IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_apr
         WHERE person = inrec.person
         AND month = 'Apr';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Apr', inrec.rev_apr);
         END IF;
      END IF;
      /* Mai Data */
      IF inrec.rev_mai IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_mai
         WHERE person = inrec.person
         AND month = 'Mai';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Mai', inrec.rev_mai);
         END IF;
      END IF;
      /* Jun Data */
      IF inrec.rev_jun IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_jun
         WHERE person = inrec.person
         AND month = 'Jun';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Jun', inrec.rev_jun);
         END IF;
      END IF;
      /* Jul Data */
      IF inrec.rev_jul IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_jul
         WHERE person = inrec.person
         AND month = 'Jul';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Jul', inrec.rev_jul);
         END IF;
      END IF;
      /* Aug Data */
      IF inrec.rev_aug IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_aug
         WHERE person = inrec.person
         AND month = 'aug';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Aug', inrec.rev_aug);
         END IF;
      END IF;
      /* Sep Data */
      IF inrec.rev_sep IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_sep
         WHERE person = inrec.person
         AND month = 'Sep';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Sep', inrec.rev_sep);
         END IF;
      END IF;
      /* Oct Data */
      IF inrec.rev_oct IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_oct
         WHERE person = inrec.person
         AND month = 'Oct';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Oct', inrec.rev_oct);
         END IF;
      END IF;
      /* Nov Data */
      IF inrec.rev_nov IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_nov
         WHERE person = inrec.person
         AND month = 'Nov';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Nov', inrec.rev_nov);
         END IF;
      END IF;
      /* Dez Data */
      IF inrec.rev_dez IS NOT NULL THEN
         UPDATE revenue
         SET revenue = inrec.rev_dez
         WHERE person = inrec.person
         AND month = 'Dez';
         IF SQL%ROWCOUNT = 0 THEN
            INSERT INTO revenue
               (person, month, revenue)
               VALUES (inrec.person, 'Dez', inrec.rev_dez);
         END IF;
      END IF;
   END LOOP;
END;
/

An approach like this has several problems. For one, we're no longer using table functions, so we've just destroyed our ability to parallelize the transformation part of our load process. We're also forced to issue up to 12 DML statements for each input record, and we have to worry about whether the most likely case will be an UPDATE or an INSERT. Finally, it's a downright pain to have to code this type of logic. Wouldn't it be nice of there were a single statement to do all this work for us? In Oracle9i, there is such a statement. It's the MERGE statement, and it's designed to either INSERT new rows or UPDATE existing row as appropriate.

Oracle 9i Solution

When writing a MERGE statement, you must specify the following items:

The name of the destination table

A SELECT statement to serve as the source of the data

A condition for use in identifying input rows that represent updates to existing data

UPDATE and INSERT clauses

The following MERGE statement will properly merge data from our external table REVEXT (src) into the REVENUE (dest) table. The ON clause references the primary key columns for the destination table. When the database reads a row from the source query, it evaluates the ON condition against each row in the destination table (via the primary key index in this case). If a row is found in the destination table for which the ON condition evaluates to TRUE, then the database treats the source row as an update.

MERGE INTO revenue dest
USING (SELECT * FROM TABLE (revenue_pkg.revenue_fun(
        CURSOR(SELECT person,
                      rev_jan,
                      rev_feb,
                      rev_mar,
                      rev_apr,
                      rev_mai,
                      rev_jun,
                      rev_jul,
                      rev_aug,
                      rev_sep,
                      rev_oct,
                      rev_nov,
                      rev_dez
               FROM revext)))) src
ON (dest.person = src.person
AND dest.month = src.month)
WHEN MATCHED THEN
UPDATE SET dest.revenue = src.revenue
WHEN NOT MATCHED THEN
INSERT (person, month, revenue)
VALUES (src.person, src.month, src.revenue);

COMMIT;

This is really exciting! With one statement, we can read data from our external data file, pass it through our table function REVENUE_FUN in order to transform it to match our database table, and then either update existing rows or insert new rows into our database table as appropriate. What's more, no temporary staging tables are needed, and all of this work can easily be parallelized to any degree.

Conclusion

The problem of merging data has been around forever, and it's great that we finally have a way to deal with it at the database level that also takes advantage of the database's scalability. External tables promise to blur the distinction between external and internal data, making it easier to load data from external systems into the database. Using these new ETL features, you can take a multistep, single-threaded load process, and convert it to a single-step, pipelined, and highly parallelized load process

Download Everything from here