Overview
Up until Oracle 9i, SQL*Loader was the tool to load a lot of data into an Oracle database from external sources. Oracle9i introduces several interesting new
features that change the way you look at the data-loading and transformation process.
These is performed by the ETL toolkit (Extraction, Transformation, and Loading) and
collectively provide you with a powerful ETL toolkit. Three ETL features are particularly
interesting
Typical Loading Scenario
In the following example we use external
tables and table functions, two of these powerful ETL features. Let's assume
that you have the following revenue data to load into a database and that you want to
normalize it so you end up with one row for each person/month combination:
Person Jan Feb Mar Apr Mai Jun Jul Aug Sep Oct Nov Dez
--------------------------------------------------------
Schnyder,345,223,122,345,324,244,123,123,345,121,345,197
Weber,234,234,123,457,456,287,234,123,678,656,341,567
Keller,596,276,347,134,743,545,216,456,124,753,346,456
Meyer,987,345,645,567,834,567,789,234,678,973,456,125
Holzer,509,154,876,347,146,788,174,986,568,246,324,987
Müller,456,125,678,235,878,237,567,237,788,237,324,778
Binggeli,487,347,458,347,235,864,689,235,764,964,624,347
Stoller,596,237,976,876,346,567,126,879,125,568,124,753
Marty,094,234,235,763,054,567,237,457,325,753,577,346
Studer,784,567,235,753,124,575,864,235,753,864,634,678
Person Month Revenue
-------------------------
Schnyder Jan 345
Schnyder Feb 223
Schnyder Mar 122
Schnyder Apr 345
Schnyder Mai 324
Schnyder Jun 244
Schnyder Jul 123
Schnyder Aug 123
Schnyder Sep 345
Schnyder Oct 121
Schnyder Nov 345
Schnyder Dez 197
........ ... ...
Using Oracle8i, you would need to use at least two
discrete steps to load and transform this data. The first step is to load the the data
into a staging table using SQL*LOADER, the next step is to transform and insert the data
in the final table using a PL/SQL procedure.
Oracle9i's ETL features, however, allow for a
more interesting approach to the task, one that loads and transforms in just one
step.
External Tables
New in Oracle9i is the concept of an external table. This is a
table that you define in the database's data dictionary, but for which the data itself is
stored outside of the database. It's possible, for example, to define an external
table that derives its data from the type of text file you would load using SQL*Loader.
This is great, because the revenue data in the above example
you need to load resides in just such a text file. Before you can create an external
table, you need to create an Oracle directory object that points to the operating system
directory in which your text file resides.
CONNECT sys/manager AS SYSDBA;
CREATE OR REPLACE DIRECTORY dat_dir AS 'C:\Oradata\Data';
CREATE OR REPLACE DIRECTORY log_dir AS 'C:\Oradata\Log';
CREATE OR REPLACE DIRECTORY bad_dir AS 'C:\Oradata\Bad';
GRANT READ ON DIRECTORY dat_dir TO scott;
GRANT WRITE ON DIRECTORY log_dir TO scott;
GRANT WRITE ON DIRECTORY bad_dir TO scott;
You can now use a new form of the CREATE TABLE statement that looks
like a cross between a SQL statement and a SQL*Loader control file:
CONNECT scott/tiger;
CREATE TABLE revext (person
VARCHAR2(20),
rev_jan NUMBER(4),
rev_feb NUMBER(4),
rev_mar NUMBER(4),
rev_apr NUMBER(4),
rev_mai NUMBER(4),
rev_jun NUMBER(4),
rev_jul NUMBER(4),
rev_aug NUMBER(4),
rev_sep NUMBER(4),
rev_oct NUMBER(4),
rev_nov NUMBER(4),
rev_dez NUMBER(4))
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY dat_dir
ACCESS PARAMETERS
(
records delimited by newline
badfile bad_dir:'revext%a_%p.bad'
logfile log_dir:'revext%a_%p.log'
fields terminated by ','
missing field values are null
( person,
rev_jan,
rev_feb,
rev_mar,
rev_apr,
rev_mai,
rev_jun,
rev_jul,
rev_aug,
rev_sep,
rev_oct,
rev_nov,
rev_dez
)
)
LOCATION ('revext.dat')
)
PARALLEL 4
REJECT LIMIT UNLIMITED;
When you create an external table, you're really only creating some
data dictionary entries. Nothing exciting happens until you query the table. Realize
though, that you can query the table as you would any other SQL table. Let's assume you
had the following production table:
CREATE TABLE revenue (
person VARCHAR2(20),
month VARCHAR2(3),
revenue NUMBER,
CONSTRAINT revenue_pk PRIMARY KEY (person,month));
Given this table, you can use the following INSERT...SELECT FROM
statement to extract revenue data from your external file in a normalized format and
insert it into your production table.
INSERT INTO revenue (person,month,revenue)
SELECT person,'Jan',rev_jan
FROM revext
WHERE rev_jan IS NOT NULL
UNION ALL
SELECT person,'Feb',rev_feb
FROM revext
WHERE rev_feb IS NOT NULL
UNION ALL
SELECT person,'Mar',rev_mar
FROM revext
WHERE rev_mar IS NOT NULL
UNION ALL
SELECT person,'Apr',rev_apr
FROM revext
WHERE rev_apr IS NOT NULL
UNION ALL
SELECT person,'Mai',rev_mai
FROM revext
WHERE rev_mai IS NOT NULL
UNION ALL
SELECT person,'Jun',rev_jun
FROM revext
WHERE rev_jun IS NOT NULL
UNION ALL
SELECT person,'Jul',rev_jul
FROM revext
WHERE rev_jul IS NOT NULL
UNION ALL
SELECT person,'Aug',rev_aug
FROM revext
WHERE rev_aug IS NOT NULL
UNION ALL
SELECT person,'Sep',rev_sep
FROM revext
WHERE rev_sep IS NOT NULL
UNION ALL
SELECT person,'Oct',rev_oct
FROM revext
WHERE rev_oct IS NOT NULL
UNION ALL
SELECT person,'Nov',rev_nov
FROM revext
WHERE rev_nov IS NOT NULL
UNION ALL
SELECT person,'Dez',rev_dez
FROM revext
WHERE rev_dez IS NOT NULL;
Because you set the degree of parallelism to four when you created the
external table, the database will divide the file to be read by four processes running in
parallel. This parallelism happens automatically, with no additional effort on your part,
and is really quite convenient. To parallelize this load using SQL*Loader, you would have
had to manually divide your input file into multiple smaller files.
Table Functions
The INSERT statement shown previously works off the union of 12 SELECT
statements. This means that the external data file gets read 12 times, once for each
SELECT. Reading the input file 12 times isn't desirable, especially if the file is
very large. Fortunately, this happens to be just the type of problem that you can solve
using table functions. You can think of a table function as a highly streamlined
transformation engine. A table function takes a set of rows as input and returns a
different set of rows as output. Unlike traditional functions, table functions are
designed to be invoked from a SELECT statement's FROM clause.
For the revenue data, we want to take each input row and transform
it into 12 output rows. Each input row holds data from 12 different revenue months.
Your normalized destination table, however, requires that each month get its own row, so
your table function must transform the 12 population counts from each input row into 12
separate output rows.
Before creating the table function, you need to create some types.
Table functions always return a collection of records, so to begin, create a table type
that corresponds to the definition of your ultimate destination table. To do this, first
create an object type to define the record, and then create a table type based on that
object type.
connect scott/tiger;
CREATE TYPE revenue_row AS OBJECT
(
person VARCHAR2(20),
month VARCHAR2(3),
revenue NUMBER
);
/
CREATE TYPE revenue_tab
AS TABLE OF revenue_row;
/
The input to your function will be rows returned by a SELECT statement
against the external table REVEXT, so you'll need an appropriate REF CURSOR type.
The following statement creates a package containing a REF CURSOR type REVENUE_CUR that matches the record structure of the
REVEXT table. The package also defines a table function that takes such a cursor
as an input parameter. Note that you have to create the table type REVENUE_TAB first, so you can use that type in your table
function's RETURN clause:
CREATE OR REPLACE PACKAGE revenue_pkg
AS
TYPE revenue_cur IS REF CURSOR RETURN
revext%ROWTYPE;
FUNCTION revenue_fun (revenue_arg IN
revenue_cur)
RETURN revenue_tab
PARALLEL_ENABLE (PARTITION revenue_arg BY ANY)
PIPELINED;
END;
/
The PARALLEL_ENABLE clause in the function specification allows the
database to parallelize the function's execution. The PARTITION revenue_arg BY ANY clause
indicates that the input rows can be arbitrarily divided into any number of buckets, all
of which can then be processed in parallel. The PIPELINED clause enables the function
to return the result set incrementally while other input data is still being
processed. Think of the rows as "flowing through" the function during the execution
of a query invoking the function. The function code defined in the package body takes
care of transforming each input row into the desired 12 output rows.
CREATE OR REPLACE PACKAGE BODY revenue_pkg AS
FUNCTION revenue_fun (revenue_arg IN
revenue_cur)
RETURN revenue_tab
PARALLEL_ENABLE (PARTITION revenue_arg BY
ANY)
PIPELINED IS revenue_rec revext%ROWTYPE;
outrow_jan revenue_row := revenue_row('','',0);
outrow_feb revenue_row := revenue_row('','',0);
outrow_mar revenue_row := revenue_row('','',0);
outrow_apr revenue_row := revenue_row('','',0);
outrow_mai revenue_row := revenue_row('','',0);
outrow_jun revenue_row := revenue_row('','',0);
outrow_jul revenue_row := revenue_row('','',0);
outrow_aug revenue_row := revenue_row('','',0);
outrow_sep revenue_row := revenue_row('','',0);
outrow_oct revenue_row := revenue_row('','',0);
outrow_nov revenue_row := revenue_row('','',0);
outrow_dez revenue_row := revenue_row('','',0);
BEGIN
LOOP
FETCH revenue_arg INTO revenue_rec;
EXIT WHEN revenue_arg%NOTFOUND;
IF revenue_rec.rev_jan IS NOT NULL THEN
outrow_jan.person :=
revenue_rec.person;
outrow_jan.month :=
'Jan';
outrow_jan.revenue :=
revenue_rec.rev_jan;
PIPE ROW (outrow_jan);
END IF;
IF revenue_rec.rev_feb IS NOT NULL THEN
outrow_feb.person :=
revenue_rec.person;
outrow_feb.month :=
'Feb';
outrow_feb.revenue :=
revenue_rec.rev_feb;
PIPE ROW (outrow_feb);
END IF;
IF revenue_rec.rev_mar IS NOT NULL THEN
outrow_mar.person :=
revenue_rec.person;
outrow_mar.month :=
'Mar';
outrow_mar.revenue :=
revenue_rec.rev_mar;
PIPE ROW (outrow_mar);
END IF;
IF revenue_rec.rev_apr IS NOT NULL THEN
outrow_apr.person :=
revenue_rec.person;
outrow_apr.month :=
'Apr';
outrow_apr.revenue :=
revenue_rec.rev_apr;
PIPE ROW (outrow_apr);
END IF;
IF revenue_rec.rev_mai IS NOT NULL THEN
outrow_mai.person :=
revenue_rec.person;
outrow_mai.month :=
'Mai';
outrow_mai.revenue :=
revenue_rec.rev_mai;
PIPE ROW (outrow_mai);
END IF;
IF revenue_rec.rev_jun IS NOT NULL THEN
outrow_jun.person :=
revenue_rec.person;
outrow_jun.month :=
'Jun';
outrow_jun.revenue :=
revenue_rec.rev_jun;
PIPE ROW (outrow_jun);
END IF;
IF revenue_rec.rev_jul IS NOT NULL THEN
outrow_jul.person :=
revenue_rec.person;
outrow_jul.month :=
'Jul';
outrow_jul.revenue :=
revenue_rec.rev_jul;
PIPE ROW (outrow_jul);
END IF;
IF revenue_rec.rev_aug IS NOT NULL THEN
outrow_aug.person :=
revenue_rec.person;
outrow_aug.month :=
'Aug';
outrow_aug.revenue :=
revenue_rec.rev_aug;
PIPE ROW (outrow_aug);
END IF;
IF revenue_rec.rev_sep IS NOT NULL THEN
outrow_sep.person :=
revenue_rec.person;
outrow_sep.month :=
'Sep';
outrow_sep.revenue :=
revenue_rec.rev_sep;
PIPE ROW (outrow_sep);
END IF;
IF revenue_rec.rev_oct IS NOT NULL THEN
outrow_oct.person :=
revenue_rec.person;
outrow_oct.month :=
'Oct';
outrow_oct.revenue :=
revenue_rec.rev_oct;
PIPE ROW (outrow_oct);
END IF;
IF revenue_rec.rev_nov IS NOT NULL THEN
outrow_nov.person :=
revenue_rec.person;
outrow_nov.month :=
'Nov';
outrow_nov.revenue :=
revenue_rec.rev_nov;
PIPE ROW (outrow_nov);
END IF;
IF revenue_rec.rev_dez IS NOT NULL THEN
outrow_dez.person :=
revenue_rec.person;
outrow_dez.month :=
'Dez';
outrow_dez.revenue :=
revenue_rec.rev_dez;
PIPE ROW (outrow_dez);
END IF;
END LOOP;
RETURN;
END;
END;
/
By enabling parallel DML and then using the table function, it's now
possible to load and transform the revenue data in one operation:
ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ APPEND PARALLEL (t,4) */ INTO revenue t
SELECT *
FROM TABLE (revenue_pkg.revenue_fun (
CURSOR(SELECT person,
rev_jan,
rev_feb,
rev_mar,
rev_apr,
rev_mai,
rev_jun,
rev_jul,
rev_aug,
rev_sep,
rev_oct,
rev_nov,
rev_dez
FROM
revext)));
The revenue data will be read from the external table. That process
will be parallelized, using intrafile parallelism. The output from each of those parallel
operations will feed into separate processes, also running in parallel, that transform
each input row into the 12 output rows you desire. Because the table function handles the
transformation, only one pass through the external file is necessary. It's not necessary
to stage the data in any sort of work table. The data is pipelined from the external data
file, through the table function, and directly into the destination table. Fewer copies
of the data results in less disk space being consumed.
The MERGE Statement
The MERGE statement solves the long-standing problem of reloading data that you have
loaded previously. Prior to Oracle9i, you needed to write procedural code to detect
whether a row existed, and to issue an INSERT or UPDATE statement as appropriate. In
Oracle9i, you can simply use the MERGE statement and let the database handle the
details.
INPUT or UPDATE is the Question
Let's consider one final, and very common, data loading problem with respect to our
scenario. Let's assume that we had previously loaded revenue data into the REVENUE table,
and that our current input file contained both new data and updates to existing
data. In cases where data in the file represents an update, we want to update
existing rows in the REVENUE table. When data in the file represents new data, we want to
insert new rows into the REVENUE table. This is a common problem, and one that's often
handled by implementing some procedural logic using PL/SQL. For example, the following
PL/SQL code reads rows one at a time from REVEXT, and issues either UPDATE or INSERT
statements as appropriate.
DECLARE
BEGIN
FOR inrec IN (
SELECT
person,rev_jan,rev_feb,rev_mar,rev_apr,rev_mai,rev_jun,
rev_jul,rev_aug,rev_sep,rev_oct,rev_nov,rev_dez
FROM revext
) LOOP
/* Jan Data */
IF inrec.rev_jan IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_jan
WHERE person = inrec.person
AND month = 'Jan';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Jan', inrec.rev_jan);
END IF;
END IF;
/* Feb Data */
IF inrec.rev_feb IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_feb
WHERE person = inrec.person
AND month = 'Feb';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Feb', inrec.rev_feb);
END IF;
END IF;
/* Mar Data */
IF inrec.rev_mar IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_mar
WHERE person = inrec.person
AND month = 'Mar';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Mar', inrec.rev_mar);
END IF;
END IF;
/* Apr Data */
IF inrec.rev_apr IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_apr
WHERE person = inrec.person
AND month = 'Apr';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Apr', inrec.rev_apr);
END IF;
END IF;
/* Mai Data */
IF inrec.rev_mai IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_mai
WHERE person = inrec.person
AND month = 'Mai';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Mai', inrec.rev_mai);
END IF;
END IF;
/* Jun Data */
IF inrec.rev_jun IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_jun
WHERE person = inrec.person
AND month = 'Jun';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Jun', inrec.rev_jun);
END IF;
END IF;
/* Jul Data */
IF inrec.rev_jul IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_jul
WHERE person = inrec.person
AND month = 'Jul';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Jul', inrec.rev_jul);
END IF;
END IF;
/* Aug Data */
IF inrec.rev_aug IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_aug
WHERE person = inrec.person
AND month = 'aug';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Aug', inrec.rev_aug);
END IF;
END IF;
/* Sep Data */
IF inrec.rev_sep IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_sep
WHERE person = inrec.person
AND month = 'Sep';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Sep', inrec.rev_sep);
END IF;
END IF;
/* Oct Data */
IF inrec.rev_oct IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_oct
WHERE person = inrec.person
AND month = 'Oct';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Oct', inrec.rev_oct);
END IF;
END IF;
/* Nov Data */
IF inrec.rev_nov IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_nov
WHERE person = inrec.person
AND month = 'Nov';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Nov', inrec.rev_nov);
END IF;
END IF;
/* Dez Data */
IF inrec.rev_dez IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_dez
WHERE person = inrec.person
AND month = 'Dez';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO
revenue
(person, month, revenue)
VALUES (inrec.person, 'Dez', inrec.rev_dez);
END IF;
END IF;
END LOOP;
END;
/
An approach like this has several problems. For one, we're no longer using table
functions, so we've just destroyed our ability to parallelize the transformation part of
our load process. We're also forced to issue up to 12 DML statements for each input
record, and we have to worry about whether the most likely case will be an UPDATE or an
INSERT. Finally, it's a downright pain to have to code this type of logic. Wouldn't it be
nice of there were a single statement to do all this work for us? In Oracle9i, there is
such a statement. It's the MERGE statement, and it's designed to either INSERT new
rows or UPDATE existing row as appropriate.
When writing a MERGE statement, you must specify the following items:
- The name of the destination table
- A SELECT statement to serve as the source of the data
- A condition for use in identifying input rows that represent updates to existing
data
- UPDATE and INSERT clauses
The following MERGE statement will properly merge data from our external table REVEXT
(src) into the REVENUE (dest) table. The ON clause references the primary key columns for
the destination table. When the database reads a row from the source query, it evaluates
the ON condition against each row in the destination table (via the primary key index in
this case). If a row is found in the destination table for which the ON condition
evaluates to TRUE, then the database treats the source row as an update.
MERGE INTO revenue dest
USING (SELECT * FROM TABLE (revenue_pkg.revenue_fun(
CURSOR(SELECT person,
rev_jan,
rev_feb,
rev_mar,
rev_apr,
rev_mai,
rev_jun,
rev_jul,
rev_aug,
rev_sep,
rev_oct,
rev_nov,
rev_dez
FROM
revext)))) src
ON (dest.person = src.person
AND dest.month = src.month)
WHEN MATCHED THEN
UPDATE SET dest.revenue = src.revenue
WHEN NOT MATCHED THEN
INSERT (person, month, revenue)
VALUES (src.person, src.month, src.revenue);
COMMIT;
This is really exciting! With one statement, we can read data from our external data
file, pass it through our table function REVENUE_FUN in order to transform it to match
our database table, and then either update existing rows or insert new rows into our
database table as appropriate. What's more, no temporary staging tables are needed, and
all of this work can easily be parallelized to any degree.
Conclusion
The problem of merging data has been around forever, and it's great that we finally
have a way to deal with it at the database level that also takes advantage of the
database's scalability. External tables promise to blur the distinction between external
and internal data, making it easier to load data from external systems into the database.
Using these new ETL features, you can take a multistep, single-threaded load process, and
convert it to a single-step, pipelined, and highly parallelized load process
Download Everything from here
|