Multi-dimensional clustering (MDC) enables a table to be physically clustered on more than one key, or dimension, at the same time. Prior to the introduction of MDC, tables could only be clustered in one physical sequence, via a clustering index. The clustering index would be used to maintain the physical order of data on pages, based on the key order of the index, as rows are inserted or updated in the table. Clustering can significantly improve performance of queries that have predicates containing one or more leading keys of the clustering index, as only a portion of the physical table needs to be accessed. However, a table with a clustering index can become unclustered over time as available space in the table pages fill, and new rows are unable to be stored where they ideally should be stored. When that happens, performance can start to degrade until the table can be reorganized to restore the physical order of the data in accordance with the clustering index definition.
With MDC, performance benefits of a clustered index are extended to more than one dimension, or clustering key. Any queries that involve one or more of the specified dimensions of the table will benefit from the underlying clustering, as they will only access those pages containing rows with the specified MDC dimension values. In addition, MDC tables automatically retain their clustering over the specified dimensions, thus eliminating the need to periodically reorganize the table to restore physical order.
MDC brings with it some new terms that are used to describe the logical and physical multi-dimensional clustering concepts. First, every unique combination of dimension values forms a “cell”, which is physically comprised of sets of consecutive physical pages on disk. Each of these sets of pages is called a “block”. All the rows that are part of the same cell are stored in the blocks that make up that logical cell. A block index is created for each dimension that will contain all dimension key values. These block indexes will be much smaller, and much more efficient, than regular indexes in that they will be pointing to blocks of pages rather than to individual rows. Finally, the set of blocks that contain data for the same key value of one of the dimension block indexes is called a “slice”.
When implementing MDC for a particular table, it is very important to choose the correct set of dimensions for clustering the table, based on both the anticipated query requirements and the nature of the data itself. If the dimensions are chosen appropriately, the benefits of MDC can be realized as significant performance gains. However, if the dimensions are poorly chosen then performance can be degraded and space utilization can be dramatically increased.
In general, the best candidates for dimensions to choose for an MDC scheme would be columns that have a relatively low cardinality of unique values with a high number of occurrences of rows for each unique value. As more dimensions are added into the mix for a particular table, each unique combination of dimension values (i.e. cells) should consist of enough rows such that each cell consist of multiple full blocks of data. Cells that contain few rows will consist of only partially filled blocks, resulting in very poor space utilization.