metalink上对直方图的介绍

Posted by Vincent on January 20, 2009

Doc ID: Note:1031826.6
Subject: Histograms: An Overview
Type: PROBLEM
Status: PUBLISHED
Content Type: TEXT/PLAIN
Creation Date: 29-MAY-1997
Last Revision Date: 23-JUN-2003
Histograms 

==========

 

For uniformly distributed data, the cost-based approach makes fairly accurate

guesses at the cost of executing a particular statement. However, when the

data is not uniformly distributed, the optimizer cannot accurately estimate

the selectivity of a query. Beginning in release 7.3, for columns which do not

have uniform data distribution, Oracle will allow you to store histograms

describing the data distribution of a particular column.

 

 

When to Use Histograms

----------------------

 

Histograms are stored in the dictionary and computed by using the ANALYZE

command on a particular column. Therefore, there is a maintenance and space

cost for using histograms. You should only compute histograms for columns

which you know have highly-skewed data distribution.

 

 

When to Not Use Histograms

--------------------------

 

Also, be aware that histograms, as well as all optimizer statistics, are

static. If the data distribution of a column changes frequently, it is

necessary to recompute the histogram for a given column. Histograms are not

useful for columns with the following characteristics:

 

o all predicates on the column use bind variables

o the column data is uniformly distributed

o the column is not used in WHERE clauses of queries

o the column is unique and is used only with equality predicates

 

 

How to Use Histograms

---------------------

 

Create histograms on columns that are frequently used in WHERE clauses of

queries and have a highly-skewed data distribution. You create a histogram

by using the ANALYZE TABLE command. For example, if you want to create a

10-bucket histogram on the SAL column of the EMP table, issue the following

statement:

 

DBMS_STATS.GATHER_TABLE_STATS (NULL,'EMP', method_opt => 'FOR COLUMNS sal SIZE 10');

ANALYZE TABLE emp COMPUTE STATISTICS FOR COLUMNS sal SIZE 10;

 

The SIZE keyword states the maximum number of buckets for the histogram.

You would create a histogram on the SAL column if there were an unusual

number of employees with the same salary and few employees with other

salaries.

 

The 'FOR' clause can be used with either COMPUTE STATISTICS or ESTIMATE

STATISTICS. The following clauses can be used with the ANALYZE TABLE command:

 

FOR TABLE

collect table statistics for the table

FOR ALL COLUMNS

collect column statistics for all columns in the table

FOR ALL INDEXED COLUMNMS

collect column statistics for all indexed columns in the table

FOR COLUMNS

collect column statistics for the specified columns

FOR ALL INDEXES

all indexes associated with the table will be analyzed

SIZE

specifies the maximum number of partitions (buckets) in the

histogram.

Default value: 75

Range of values: 1 - 254

 

 

Choosing the Number of Buckets for a Histogram

----------------------------------------------

 

The default number of buckets is 75. This value provides an appropriate level

of detail for most data distributions. However, since the number of buckets

in the histogram, the sampling rate, and the data distribution all affect

the usefulness of a histogram, you may need to experiment with different

numbers of buckets to obtain the best results.

 

If the number of frequently occurring distinct values in a column is relatively

small, then it is useful to set the number of buckets to be greater than the

number of frequently occurring distinct values.

 

 

Viewing Histograms

------------------

 

You can find information about existing histograms in the database through the

following data dictionary views:

 

USER_TAB_HISTOGRAMS, ALL_TAB_HISTOGRAMS, and DBA_TAB_HISTOGRAMS.

USER_PART_HISTOGRAMS, ALL_PART_HISTOGRAMS, and DBA_PART_HISTOGRAMS.

USER_SUBPART_HISTOGRAMS, ALL_SUBPART_HISTOGRAMS, and DBA_SUBPART_HISTOGRAMS.

 

The number of buckets in each column's histogram is found in these dictionary views :

 

o USER_TAB_COL_STATISTICS, ALL_TAB_COL_STATISTICS,DBA_TAB_COL_STATISTICS

(extracted from USER_TAB_COLUMNS, ALL_TAB_COLUMNS, and DBA_TAB_COLUMNS)

o USER_PART_COL_STATISTICS,ALL_PART_COL_STATISTICS, DBA_PART_COL_STATISTICS,

o USER_SUBPART_COL_STATISTICS, ALL_SUBPART_COL_STATISTICS, DBA_SUBPART_COL_STATISTICS

 

 

These views have the same definition.

 

DBA_TAB_HISTOGRAMS

 

This view lists histograms on columns of all tables.

 

Column name Represents This

---------------------------------------------------------

OWNER Owner of table

TABLE_NAME Table name

COLUMN_NAME Column name

ENDPOINT_NUMBER Endpoint number

ENDPOINT_VALUE Normalized endpoint values for this bucket

 

 

 

DBA_TAB_COLUMNS

 

This view contains information which describes columns of all tables.

(NOTE: Views and clusters, although included in this view are not relevant

to histograms.)

 

Column Name Represents This

----------------------------------------------------------

OWNER Owner of table

TABLE_NAME Table name

COLUMN_NAME Column name

DATA_TYPE Datatype of the column

DATA_LENGTH Length of the column

DATA_PRECISION Precision for NUMBER or FLOAT datatypes

DATA_SCALE Digits to right of decimal

NULLABLE NULL allowable?

COLUMN_ID Sequence no. of column

DEFAULT_LENGTH Length of default value

DATA_DEFAULT Default value

NUM_DISTINCT Number of distinct values for the column

LOW_VALUE Smallest value for the column, expressed in hex

for the internal representation ofthe first 32

bytes of the value

HIGH_VALUE Highest value for the column, expressed in hex for

the internal representation of the first 32 bytes

of the value

DENSITY Density of the column (a measure of how distinct

the values are)

NUM_NULLS The number of columns with null value

NUM_BUCKETS The number of buckets in the histogram

LAST_ANALYZED The date that analyze was last run on the table

SAMPLE_SIZE The amount of data sampled

 

 

The column LAST_ANALYZED is useful in determining the last time

statistics, with or without histograms, were computed. This is

often important to assess the reason for cost-based optimizer's

choices of execution paths. All tables involved in a query must be

regularly analyzed as data changes.

 

----

 

今天遇到的案例是:

select count(*) from t_name where seller_id=175754307 and is_main = 1;

索引存在在 idx_t_name_sellerid(seller_id,... ...),其中条件is_main不在索引列的

但是语句默认死活不走索引,除非加上index的hint,由于应用查询条件是动态的,不能加hint。当然将is_main冗余到这个索引里里面也是可行的

后来,高人 发现 seller_id 的 Buckets 是默认的75,因为seller_id的删选度还是比较高的,可以将buckets设置为1试试

analyze table t_name compute statistics column seller_id size 1.

事实证明--> 方法可行。 柱状图的buckets还需要再研究下哈

 

-- 这里还有个写的不错的文章,很好理解。

http://sunwgneuqsoft.itpub.net/post/34741/456504

在user_tab_columns里面有一个列叫做NUMBER_BUCTETS,这个代表直方图的篮子个数。这么讲吧,在表SUNWG中一共有6000条记录,
我把这6000条记录按照ID的大小进行排序,排序后的结果大概是1,1,1…1999,2000,2001,一共有6000条。在前面我做直方图
统计的时候设置的BUCTET的个数是10,就等于把这6000条记录按照ID的顺序分成十份,放到十个篮子中。那么第一个篮子的开始值
就应该是ENDPOINT_NUMBER = 0所代表的ENDPOINT_VALUE,也就是1,第一个篮子的截止值就应该是ENDPOINT_NUMBER = 1所代表的ENDPOINT_VALUE,也是1。

直方图信息的准确性就由两个数值决定,一个是BUCTET的个数,一个NUM_DISTINCT的个数。

当BUCTET < NUM_DISTINCT得到的是HEIGHT BALANCED(高度平衡)直方图,而当BUCTET = NUM_DISTINCT的时候得到的是FREQUENCY(频率)直方图。
这里面BUCTET的最大值是254,所以我们一般见到的直方图都是HEIGHT BALANCED(高度平衡)直方图。

一般说来,BUCTET越多,那么关于列的分布情况信息就越准确,可是在统计直方图的时候会消耗更多的资源。

当我们设置BUCTET = 1的时候,就等于是删除了直方图信息。
Analyze table sunwg compute statistics for table for columns id size 1;

 继续研究下 《基于成本的Oracle优化法则》,里面也有不错的介绍

This work is licensed under a CC A-S 4.0 International License.