Oracle Function Based Indexes

Introduction

This is the first of many articles on new Oracle8i release 8.1 database features. I will be discussing in some depth many of the 150+ new features contained in the Oracle database over the next couple of months. In each of these articles we will explore some new feature found in the database and explain/provide:

Function Based Indexes

Oracle8i introduces a feature virtually every DBA and programmer will be using immediately -- the ability to index functions and use these indexes in query. In a nutshell, this capability allows you to have case insenstive searches or sorts, search on complex equations, and extend the SQL language efficiently by implementing your own functions and operators and then searching on them.

Why to use this feature

It's easy and provides immediate value.

It can be used to speed up existing applications without changing any of their logic or queries.

It can be used to supply additional functionality to applications with very little cost.

So why is it easy and of immediate value? It's easy because it's just a CREATE INDEX statement. Consider the following example: I begin by creating a copy of the «scott/tiger» demo employee table. I then change the data in the employee name column to be in mixed case. I then create an index on the UPPER of the ename column -- effectively creating a case insensitive index:

update emp set ename = initcap(ename);
commit;
create index emp_upper_idx on emp(upper(ename));

We now have an index on the «UPPER» of a column. Any application that already issues 'case insensitive» queries of the form:

set autotrace on explain
select ename, empno, sal from emp where upper(ename) = 'KING';

ENAME           EMPNO        SAL
---------- ---------- ----------
King             7839       5000

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1 Card=1 Bytes=40)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'EMP' (Cost=1 Card=1 Bytes=40)
   2    1     INDEX (RANGE SCAN) OF 'EMP_UPPER_IDX' (NON-UNIQUE) (Cost=1 Card=1)

will transparently make use of this index -- gaining the performance boost an index can deliver. Before this feature was available, every row in the EMP table would have been scanned, upper-cased and compared. In constrast, with the index on upper(ename, the query takes the constant KING to the index, range scans a little data and accesses the table by ROWID to get the data. This is very fast.

This performance boost is most visible when indexing user written functions on columns. Oracle7 Release 7.1 added the ability to use user written functions in SQL so that you could:

select my_function(ename)
  from emp 
where some_other_function(empno) > 10
/

This was great because you could now effectively extend the SQL language to include application specific functions. Unfortunately however, the performance of the above query was a bit disappointing. Say the EMP table had 1,000 rows in it -- the function «some_other_function» would be executed 1,000 times during the query, once per row. Additionally, assume the function took 1/100 of a second to execute. This relatively simple query now takes at least 10 seconds.

How to enable Function Based Indexes

The following is a list of what needs to be done to use function based indexes:

You must have the system privelege query rewrite to create function based indexes on tables in your own schema.

You must have the system privelege global query rewrite to create function based indexes on tables in other schemas

For the optimizer to use function based indexes, the following session or system variables must be set:

QUERY_REWRITE_ENABLED=TRUE
QUERY_REWRITE_INTEGRITY=TRUSTED

You may enable these at either the session level with ALTER SESSION or at the system level via ALTER SYSTEM or by setting them in the init.ora parameter file. The meaning of query_rewrite_enabled is to allow the optimizer to rewrite the query allowing it to use the function based index. The meaning of is to tell the optimizer to «trust» that the code marked deterministic by the programmer is in fact deterministic. If the code is in fact not deterministic (that is, it returns different output given the same inputs), the resulting rows from the index may be incorrect.

Use the Cost Based Optimizer. Function based indexes are only visible to the Cost Based Optimizer and will not be used by the Rule Based Optimizer ever.

Use substr() to constrain return values from user written functions that return VARCHAR2 or RAW types. Optionally hide the substr in a view (recommended).

Once the above list has been satisfied, it is as easy as «CREATE INDEX» from there on in. The optimizer will find and use your indexes at runtime for you.

Real Example

Here is a real example - a modified «soundex» routine in PL/SQL:

alter session set query_rewrite_enabled = true;
alter session set query_rewrite_integrity = trusted;

create or replace package stats
as
     cnt number default 0;
end;
/

create or replace function my_soundex(p_string in varchar2) return varchar2
deterministic
as
     l_return_string   varchar2(6) default substr(p_string, 1, 1);
     l_char            varchar2(1);
     l_last_digit      number default 0;

     type vcArray is table of varchar2(10) index by binary_integer;
     l_code_table      vcArray;

begin
     stats.cnt := stats.cnt+1;

     l_code_table(1) := 'BPFV';
     l_code_table(2) := 'CSKGJQXZ';
     l_code_table(3) := 'DT';
     l_code_table(4) := 'L';
     l_code_table(5) := 'MN';
     l_code_table(6) := 'R';

     for i in 1 .. length(p_string)
     loop
         exit when (length(l_return_string) = 6);
         l_char := substr(p_string, i, 1);

         for j in 1 .. l_code_table.count
         loop
         if (instr(l_code_table(j), l_char) > 0 AND j <> l_last_digit)
         then
             l_return_string := l_return_string || to_char(j,'fm9');
             l_last_digit := j;
         end if;
         end loop;
     end loop;

     return rpad(l_return_string, 6, '0');
end;
/

Notice in this function, the usage of the keyword «deterministic». Deterministic declares that the above function -- when given the same inputs -- will always return the exact same output. This keyword is needed in order to create an index on a user written function. You must tell Oracle that the function is «deterministic» and will return a consistent result given the same inputs. This implies for example that you cannot index using the package «dbms_rand», the random number generator. Its results are not deterministic, given the same inputs you'll get random output. The builtin sql function UPPER on the other hand is deterministic so you can create an index on the UPPER of a column.

Now that we have the function «My_Soundex()», lets see how it performs without an index...

create table test_soundex(name varchar2(30));

insert into test_soundex
       select object_name
         from all_objects
        where rownum <= 5000;

exec stats.cnt := 0;

set timing on
set autotrace on explain
select name
from test_soundex A
where my_soundex(name) = my_soundex('FILE$')
/

NAME
------------------------------
FILE$

Elapsed: 00:00:03.00

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1 Card=1 Bytes=34)
   1    0   TABLE ACCESS (FULL) OF 'TEST_SOUNDEX' (Cost=1 Card=1 Bytes=34)

set autotrace off
set timing off
set serveroutput on
exec dbms_output.put_line(stats.cnt)
10000

So, we can see this query took over 3 seconds to execute and had to do a full scan on the table. The function my_soundex was invoked 10,000 times (according to our counter), twice for each row. Lets see how indexing the function can be used to speed things up.

The first thing we will do is create the index as follows:

create index test_soundex_idx on test_soundex(substr(my_soundex(name),1,6));

Now, the interesting thing to note in this create index command is the use of the substr function. This is because we are indexing a function that returns a string. If we were indexing a function that returned a number or date this substr would not be necessary. The reason we must substring the user written function that returns a string is that they return VARCHAR2(4000) types. That is too big to be indexed -- index entries must fit within 1/3 the size of the block. If we tried we would recieve (in a database with a 2k blocksize) the following:

create index test_soundex_idx on test_soundex(my_soundex(name),1,6);
create index test_soundex_idx on test_soundex( my_soundex(name),1,6 )
*
ERROR at line 1:
ORA-01450: maximum key length (3118) exceeded

In databases with larger block sizes, the number 3118 would vary but, unless you are using a 16k or larger blocksize, you will not be able to index a VARCHAR2(4000).

So, in order to index a user written function that returns a string, we must constrain the return type in the create index statement. In the above, knowing that my_soundex returns at most 6 characters, we are substring the first six characters.

We are now ready to test the performance of the table with the index on it. We would like to monitoring the effect of the index on INSERTS as well as the speedup for SELECTS to see the effect on each. In the un-indexed test case, our queries took over 3 seconds.

delete from test_soundex;
exec stats.cnt := 0;

insert into test_soundex
select object_name
from all_objects
where rownum <= 5000;

exec dbms_output.put_line(stats.cnt)
5000

exec stats.cnt := 0;

set autotrace on explain
set timing on

select name
from test_soundex B
where substr(my_soundex(name),1,6) = my_soundex('FILE$')
/

NAME
------------------------------
FILE$

Elapsed: 00:00:00.05

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1 Card=1 Bytes=34)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'TEST_SOUNDEX' (Cost=1 Card=1 Bytes=34)
   2    1     INDEX (RANGE SCAN) OF 'TEST_SOUNDEX_IDX' (NON-UNIQUE) (Cost=1 Card=1)

The important things to note here are that:

The insert of 5,000 records took longer. Indexing a user written function will necessarily affect the performance of inserts and some updates. Since most applications insert and update singleton entries.

While the insert ran slower, the query ran much faster. Also, as the size of our table grows, the full scan query will take longer and longer to execute. The index based query will always execute with the near same performance characteristics as the table gets larger.

We had to use «substr()» in our query. This is not as nice as just coding «where my_soundex(name) = my_soundex( 'FILE$' )» but we can easily get around that we will show below:

We would now like to see how to make it so the query does not have use the substr function call. The use of the substr call could be error prone -- your end users have to know to substr from 1 for 6 characters. If they used a different size, the index would not be used. Also, you want to control in the server the number of bytes to index. This will allow you do reimplement the my_soundex function later with 7 bytes instead of 6 if you want to. We can do this, hide the substr, with a view quite easily as follows:

create or replace view test_soundex_v
as
select name, substr(my_soundex(name),1,6) name_soundex
from test_soundex
/

exec stats.cnt := 0;

select name
from test_soundex_v B
where name_soundex = my_soundex('FILE$')
/

NAME
------------------------------
FILE$

Elapsed: 00:00:00.04

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1 Card=1 Bytes=34)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'TEST_SOUNDEX' (Cost=1 Card=1 Bytes=34)
   2    1     INDEX (RANGE SCAN) OF 'TEST_SOUNDEX_IDX' (NON-UNIQUE) (Cost=1 Card=1)

exec dbms_output.put_line( stats.cnt )
2

So, what we have done here is to hide the substr( f(x), 1, 6 ) in the view itself. The optimizer still recognizes that this virtual column is in fact the indexed column and does the «right thing». We see the same performance improvement and the same query plan. Using this view is as good as using the base table, better even since it hides the complexity and allows you to change the size of the substr later.

Conclusion

Here are some of the pros of this feature

Pros

Cons

Its easy to use/implement and provides immediate value
It can be used to speed up existing applications without changing any of their logic or queries. Many orders of magnitude query improvement may be observed.
It can be used to precompute complex values without using a trigger
Can be created either as B*Tree or bitmap index
Index can be built on an arithmetic expression or expression containing PL/SQL, package functions, C callout or SQL built-in functions
Optimizer can estimate selectivity more accurately if the expressions are materialized in a function-based index. Range scan can be used for queries with expression in where clause and has index build on the expression used.
Provides efficient linguistic collation to use NLS sort index
Indexes can be created on object columns and REF columns by using methods defined for the object.

You cannot direct path load the table with a function based index if that function was user written and requires the SQL ENGIRE. That means you cannot direct path load into a table indexed using my_soundex(x), but you could if it had indexed upper(x).
It will affect the performance of inserts and updates. (Remember, you insert a row once, you query it thousands of times.)
If the function you are indexing is a user written function and it returns a string, you may have to expose a view for the end users to use.

In general, the pros heavily out weigh any of the cons in this case. The inability to direct path load with a pl/sql based index can easily be overcome by indexing after the load with the parallel query option. The performance of inserts is only marginally affected, most applications won't even notice the effect.