The ANALYZE statement
The ANALYZE statement should no longer be used to
collect optimizer statistics. The COMPUTE and ESTIMATE clauses of the ANALYZE statement
are supported solely for backward compatibility and may be removed in a future release.
The DBMS_STATS package collects a broader, more accurate set of statistics, and gathers
statistics more efficiently. More information can be found in the Oracle Manual.
You may continue to use the ANALYZE statement for
other purposes not related to optimizer statistics collection.
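For example, ANALYZE is still used to validate the structure of an object and to list
migrated or chained rows. A minimal sketch (the table name EMP is illustrative):

-- Verify the integrity of the table and its indexes
ANALYZE TABLE emp VALIDATE STRUCTURE CASCADE;

-- List migrated or chained rows (requires the CHAINED_ROWS table
-- created by $ORACLE_HOME/rdbms/admin/utlchain.sql)
ANALYZE TABLE emp LIST CHAINED ROWS INTO chained_rows;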
Gathering Statistics with DBMS_STATS Procedures
Statistics are gathered using the DBMS_STATS
package. This PL/SQL package is also used to modify, view, export, import, and delete
statistics.
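As a sketch of the non-gathering uses mentioned above, statistics can be exported into a
user statistics table and later re-imported; the schema, table, and statistics-table
names below are assumptions:

-- Create a user statistics table to hold exported statistics
exec dbms_stats.create_stat_table ( -
  ownname => 'SCOTT', -
  stattab => 'MY_STATS' -
)

-- Copy the current statistics of EMP into MY_STATS
exec dbms_stats.export_table_stats ( -
  ownname => 'SCOTT', -
  tabname => 'EMP', -
  stattab => 'MY_STATS' -
)

-- Later (or in another database): copy them back into the dictionary
exec dbms_stats.import_table_stats ( -
  ownname => 'SCOTT', -
  tabname => 'EMP', -
  stattab => 'MY_STATS' -
)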
The DBMS_STATS package can gather statistics on
tables and indexes, as well as on individual columns and partitions of tables.
When you generate statistics for a table, column, or
index, if the data dictionary already contains statistics for the object, then Oracle
updates the existing statistics. The older statistics are saved and can be restored later
if necessary.
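A minimal sketch of restoring the saved statistics, assuming Oracle 10g or later
(the "one day ago" timestamp is only an example):

exec dbms_stats.restore_table_stats ( -
  ownname         => 'SCOTT', -
  tabname         => 'EMP', -
  as_of_timestamp => systimestamp - interval '1' day -
)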
When statistics are updated for a database object,
Oracle invalidates any currently parsed SQL statements that access the object. The next
time such a statement executes, the statement is re-parsed and the optimizer
automatically chooses a new execution plan based on the new statistics.
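If immediate invalidation of dependent cursors is not desired, the gathering procedures
accept a no_invalidate parameter. A hedged sketch, assuming Oracle 10g or later, where
the constant DBMS_STATS.AUTO_INVALIDATE spreads the invalidation over time:

exec dbms_stats.gather_table_stats ( -
  ownname       => 'SCOTT', -
  tabname       => 'EMP', -
  no_invalidate => dbms_stats.auto_invalidate -
)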
Collect Statistics at Table Level
sqlplus scott/tiger

exec dbms_stats.gather_table_stats ( -
  ownname          => 'SCOTT', -
  tabname          => 'EMP', -
  estimate_percent => dbms_stats.auto_sample_size, -
  method_opt       => 'for all columns size auto', -
  cascade          => true, -
  degree           => 5 -
)

PL/SQL procedure successfully completed.
ownname           Schema of the table to analyze.

tabname           Name of the table.

estimate_percent  Percentage of rows to estimate (NULL means compute). Use the
                  constant DBMS_STATS.AUTO_SAMPLE_SIZE to have Oracle determine
                  the appropriate sample size for good statistics. This is the
                  default.

method_opt        The default is FOR ALL COLUMNS SIZE AUTO.

cascade           Gather statistics on the indexes for this table. Index
                  statistics gathering is not parallelized. Using this option is
                  equivalent to running the GATHER_INDEX_STATS procedure on each
                  of the table's indexes. Use the constant
                  DBMS_STATS.AUTO_CASCADE to have Oracle determine whether index
                  statistics are to be collected or not.

degree            Degree of parallelism. The default for degree is NULL.
All parameters and options can be found in the Oracle Manual.
Collect Statistics at Schema Level
sqlplus scott/tiger

exec dbms_stats.gather_schema_stats ( -
  ownname          => 'SCOTT', -
  options          => 'GATHER', -
  estimate_percent => dbms_stats.auto_sample_size, -
  method_opt       => 'for all columns size auto', -
  cascade          => true, -
  degree           => 5 -
)

PL/SQL procedure successfully completed.
ownname           Schema to analyze (NULL means current schema).

options           GATHER        Reanalyzes the whole schema.
                  GATHER EMPTY  Only analyzes tables that have no existing
                                statistics.
                  GATHER STALE  Only reanalyzes tables with more than 10%
                                modifications (inserts, updates, deletes).
                  GATHER AUTO   Reanalyzes objects which currently have no
                                statistics and objects with stale statistics.
                                Using GATHER AUTO is like combining GATHER STALE
                                and GATHER EMPTY.

                  Note that both GATHER STALE and GATHER AUTO require monitoring.
                  If you issue the ALTER TABLE xxx MONITORING command, Oracle
                  tracks changed tables in the DBA_TAB_MODIFICATIONS view, which
                  allows you to see the exact number of inserts, updates, and
                  deletes tracked since the last analysis of statistics (see the
                  example after this table).

estimate_percent  Percentage of rows to estimate (NULL means compute). Use the
                  constant DBMS_STATS.AUTO_SAMPLE_SIZE to have Oracle determine
                  the appropriate sample size for good statistics. This is the
                  default.

method_opt        The default is FOR ALL COLUMNS SIZE AUTO.

cascade           Gather statistics on the indexes for this table. Index
                  statistics gathering is not parallelized. Using this option is
                  equivalent to running the GATHER_INDEX_STATS procedure on each
                  of the table's indexes. Use the constant
                  DBMS_STATS.AUTO_CASCADE to have Oracle determine whether index
                  statistics are to be collected or not.

degree            Degree of parallelism. The default for degree is NULL.
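The following sketch shows how monitoring can be enabled for a single table and how the
tracked DML can be inspected. The table name is an example; from Oracle 10g onwards
monitoring is controlled by the STATISTICS_LEVEL parameter and is on by default.

ALTER TABLE emp MONITORING;

-- Flush the monitoring information from the SGA so it becomes visible
exec dbms_stats.flush_database_monitoring_info

SELECT table_name, inserts, updates, deletes
FROM   user_tab_modifications
WHERE  table_name = 'EMP';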
All parameters and options can be found in the Oracle Manual.
Collect Statistics at Other Levels
DBMS_STATS can collect optimizer statistics at the following levels; see the Oracle Manual:
GATHER_DATABASE_STATS
GATHER_DICTIONARY_STATS
GATHER_FIXED_OBJECTS_STATS
GATHER_INDEX_STATS
GATHER_SCHEMA_STATS
GATHER_SYSTEM_STATS
GATHER_TABLE_STATS
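As an illustrative sketch (index name and arguments are examples, not prescriptions),
two of these levels look as follows:

-- Statistics for a single index
exec dbms_stats.gather_index_stats ( -
  ownname => 'SCOTT', -
  indname => 'IDX_SKEWED_DATA' -
)

-- Statistics on the data dictionary tables (Oracle 10g and later)
exec dbms_stats.gather_dictionary_stats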
Statistics for Partitioned Schema Objects
Partitioned schema objects may contain multiple sets
of statistics. They can have statistics that refer to the entire schema object as a
whole (global statistics), statistics that refer to an individual partition, and
statistics that refer to an individual subpartition of a composite partitioned object.
Unless the query predicate narrows the query to a
single partition, the optimizer uses the global statistics.
Because most queries are not likely to be this restrictive, it is most important to
have accurate global statistics. Therefore, gathering global statistics with
the DBMS_STATS package is highly recommended.
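A sketch of gathering only the global statistics of a partitioned table, assuming a
partitioned table SALES in the SCOTT schema (both names are assumptions); the
granularity parameter selects which set of statistics is gathered:

exec dbms_stats.gather_table_stats ( -
  ownname     => 'SCOTT', -
  tabname     => 'SALES', -
  granularity => 'GLOBAL' -
)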
Oracle Histogram Statistics
If DBMS_STATS discovers that the data in an indexed column is unevenly
distributed, it can create a histogram on that column to aid the cost-based SQL optimizer
in choosing between index access and a full-table scan.
Example
First we set up a table with some very skewed data - so skewed that when we query
WHERE ID=1, Oracle Database will want to use an index on ID, and when we query WHERE
ID=99, Oracle Database will not want to use an index.
sqlplus scott/tiger
CREATE TABLE skewed_data
AS
SELECT DECODE(ROWNUM,1,1,99) id,
all_objects.*
FROM all_objects
/
Table created.
CREATE INDEX idx_skewed_data ON skewed_data (id);
Index created.
begin
  dbms_stats.gather_table_stats
  ( ownname    => USER,
    tabname    => 'SKEWED_DATA',
    method_opt => 'for all indexed columns size 254',
    cascade    => TRUE
  );
end;
/

PL/SQL procedure successfully completed.
set autotrace traceonly explain
SELECT * FROM skewed_data WHERE id=1;
Execution Plan
----------------------------------------------------------
Plan hash value: 1799207757
-----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                 |     1 |    96 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| SKEWED_DATA     |     1 |    96 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IDX_SKEWED_DATA |     1 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------
SELECT * FROM skewed_data WHERE id=99;
Execution Plan
----------------------------------------------------------
Plan hash value: 746880940
---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             | 51545 |  4832K|   190   (3)| 00:00:03 |
|*  1 |  TABLE ACCESS FULL| SKEWED_DATA | 51545 |  4832K|   190   (3)| 00:00:03 |
---------------------------------------------------------------------------------
The table SKEWED_DATA contains a column ID that is heavily skewed: most of
the values are 99, and one record contains the value 1. After we index the column and
gather statistics on the table (generating histograms on that indexed column, so the
optimizer knows that the data is skewed), we can see that the optimizer prefers an index
range scan over a full table scan when ID=1 is used, and vice versa for ID=99.
Use Histograms!
The cost-based optimizer uses data value histograms
to get accurate estimates of the distribution of column data. Histograms provide improved
selectivity estimates in the presence of data skew, resulting in optimal execution plans
with non-uniform data distributions.
Histograms can affect performance and should be used
only when they substantially improve query plans. They are useful only when they
reflect the current data distribution of a given column. If the data distribution of a
column changes frequently, you must re-compute its histogram frequently.
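To check whether a column currently has a histogram, the data dictionary can be queried.
A sketch against the SKEWED_DATA example from above (the HISTOGRAM column of this view
exists in Oracle 10g and later):

SELECT column_name, num_distinct, num_buckets, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'SKEWED_DATA';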
Number of Histogram Buckets
The maximum number of histogram buckets to use is specified with the SIZE parameter in method_opt:
method_opt => 'for all indexed columns size 254',
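Besides an explicit bucket count between 1 and 254, SIZE also accepts the keywords
AUTO, SKEWONLY and REPEAT; the column list in the first line below is only an example:

method_opt => 'for columns id size 254'
method_opt => 'for all columns size skewonly'
method_opt => 'for all columns size auto'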
What are Histograms
Histograms are bands of column values, so that each band contains
approximately the same number of rows. The useful information that the histogram provides
is where in the range of values the endpoints fall.
Consider the following table column with values between 1 and 100 and a
histogram with 10 buckets. If the data in the column is uniformly
distributed, then the histogram looks as follows, where the numbers are the endpoint
values.
Histogram with Uniform Distribution (endpoint values):
1   10   20   30   40   50   60   70   80   90   100
The number of rows in each bucket is 1/10 of the total number of rows
in the table.
If the data is not uniformly distributed, then the histogram might look
as follows:
Histogram with Non-Uniform Distribution (endpoint values):
1    5    5    5    5    5    5    5    5   60   100
In this case, most of the rows have the value 5 for the column.
Only 1/10 of the rows have values between 60 and 100.
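The stored endpoints can be inspected in the data dictionary; a sketch for the ID column
of the SKEWED_DATA example used earlier:

SELECT endpoint_number, endpoint_value
FROM   user_histograms
WHERE  table_name  = 'SKEWED_DATA'
AND    column_name = 'ID'
ORDER  BY endpoint_number;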
Conclusion
The DBMS_STATS package, not the ANALYZE statement, should be used to gather
optimizer statistics. For partitioned schema objects it is most important to have
accurate global statistics. Histograms can affect performance and should be used
only when they substantially improve query plans.