Here is a high-level summary of some of the new features that have been added to DB2 LUW over the past several releases (v8.x and v9.x).
If you would like more in-depth information about any of these new features, please contact Phoenix DBA and we can explore a “deeper dive” into the features you are interested in, along with recommendations on whether and how they can address current or future business problems you may be facing.
New in DB2 LUW v8.1:
New in DB2 LUW v8.2:
New in DB2 LUW v9.1:
New in DB2 LUW v9.5:
New in DB2 LUW v9.7:
Tuesday, June 11, 2013
Enhanced OLAP Functionality
In addition to OLAP functions such as RANK, DENSE_RANK and ROW_NUMBER, which were already supported in previous releases of DB2 LUW, v9.5 introduces five additional OLAP functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE and RATIO_TO_REPORT.
The LAG function returns the expression value for the row that is “x” rows before the current row in an OLAP window of data.
The LEAD function returns the expression value for the row that is “x” rows after the current row in an OLAP window of data.
The FIRST_VALUE function returns the expression value for the first row in an OLAP window of data.
The LAST_VALUE function returns the expression value for the last row in an OLAP window of data.
The RATIO_TO_REPORT function returns the ratio of an argument to the sum of the arguments in an OLAP window of data.
These new functions can best be understood through a simple example that uses three of them.
Given the following table:
CREATE TABLE SALARY_TABLE
(NAME VARCHAR(20)
,SALARY DECIMAL(8,2)
);
. . . that is populated with 20 rows of data, the following query is executed:
SELECT NAME, SALARY,
(LEAD (SALARY, 1) OVER (ORDER BY SALARY)) - SALARY AS LEAD,
FIRST_VALUE (SALARY) OVER (ORDER BY SALARY DESC) - SALARY AS FRST_VAL,
DECIMAL(RATIO_TO_REPORT (SALARY) OVER (), 5, 4) AS RATIO_TO_REPORT
FROM SALARY_TABLE
ORDER BY SALARY DESC;
The result of the query is shown below:
NAME SALARY LEAD FRST_VAL RATIO_TO_REPORT
ELLEN 200000.00 - 0.00 0.1130
PETE 175000.00 25000.00 25000.00 0.0989
FRANK 150000.00 25000.00 50000.00 0.0847
CATHY 125000.00 25000.00 75000.00 0.0706
TOM 120000.00 5000.00 80000.00 0.0678
JOE 100000.00 20000.00 100000.00 0.0565
SUE 100000.00 0.00 100000.00 0.0565
DEB 90000.00 10000.00 110000.00 0.0508
JOAN 85000.00 5000.00 115000.00 0.0480
DAVE 80000.00 5000.00 120000.00 0.0452
SALLY 80000.00 0.00 120000.00 0.0452
BILL 75000.00 5000.00 125000.00 0.0424
BOB 70000.00 5000.00 130000.00 0.0395
ANN 60000.00 10000.00 140000.00 0.0339
PAUL 50000.00 10000.00 150000.00 0.0282
KAREN 50000.00 0.00 150000.00 0.0282
MARY 50000.00 0.00 150000.00 0.0282
RAY 45000.00 5000.00 155000.00 0.0254
TRACY 40000.00 5000.00 160000.00 0.0226
JIM 25000.00 15000.00 175000.00 0.0141
First, let’s look at the LEAD function. We want to get the difference between each person’s salary and the salary of the person listed directly before them in the result set. Since the result set is returned in descending order of SALARY, Ellen is listed first with the highest salary and has a null value for LEAD, since nobody is listed before her. Pete is next, and his salary is 25,000.00 less than Ellen’s. Frank’s salary is 25,000.00 less than Pete’s, and so on. In the event of two or more people with equal salaries, the first one listed shows the salary difference relative to the person before him or her, and the second (and subsequent) “ties” have a LEAD value of 0.00.
The FIRST_VALUE function in this example shows the difference between each person’s own salary and the salary of the highest paid person in the table (Ellen, with 200,000.00).
The RATIO_TO_REPORT function in this example shows the ratio (or percentage) of each person’s salary compared to the total salary of everybody listed. The total salary of the entire table is 1,770,000.00. With a salary of 200,000.00, Ellen is making 0.1130 of the total salary in the table. Jim, at the bottom with a salary of only 25,000.00, is only making 0.0141 of the total salary in the table.
OLAP functions can also be “partitioned” by another value. If our table contained a column for the DEPARTMENT that each employee worked in, the OLAP functions could be further broken down such that the FIRST_VALUE and RATIO_TO_REPORT functions would be relative to each employee’s standing within their department rather than the entire table as a whole.
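For example, assuming a hypothetical DEPARTMENT column in SALARY_TABLE, a partitioned version of the earlier query might look something like this (a sketch, not tested output):
SELECT NAME, DEPARTMENT, SALARY,
FIRST_VALUE (SALARY) OVER (PARTITION BY DEPARTMENT ORDER BY SALARY DESC) - SALARY AS FRST_VAL,
DECIMAL(RATIO_TO_REPORT (SALARY) OVER (PARTITION BY DEPARTMENT), 5, 4) AS RATIO_TO_REPORT
FROM SALARY_TABLE
ORDER BY DEPARTMENT, SALARY DESC;
Here FRST_VAL would show each person’s distance from the highest salary within his or her own department, and RATIO_TO_REPORT would show each salary as a fraction of that department’s total rather than the total for the whole table.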
Table Range Partitioning
Table range partitioning has been available in DB2 z/OS since the early days of that product, and received a significant refresh in functionality in DB2 z/OS v8, but did not become available in DB2 LUW until DB2 LUW v9.1.
Table range partitioning is a data organization scheme in which table data is stored across multiple physical partitions (or ranges) based on the values in one or more table columns. Each data partition can be stored in its own tablespace or in the same tablespace as the others, as desired. Table partitioning can improve the performance of range-based queries by eliminating large amounts of I/O, as only the partitions containing data in the scope of the query need to be accessed.
One significant difference between the implementation of table range partitioning in DB2 LUW and DB2 z/OS is that in DB2 LUW, partitions can be named, and they can be attached to or detached from tables as needed. This allows for easy roll-in or roll-out of data from a partitioned table via ALTER TABLE statements with either the ATTACH PARTITION or DETACH PARTITION option.
The DETACH PARTITION option of an ALTER TABLE statement allows a named partition of a table to be detached into a separate, standalone table. The separate table can then be dropped if the data is to be deleted or destroyed; it can be attached to another partitioned table that is acting as a history or archive table; or the data can be refreshed and then reattached to the original table as a new partition containing new data.
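For reference, a partitioned table like the SALES table used in the example below might be defined with named partitions as follows (a sketch; the column names and date ranges are assumptions):
CREATE TABLE SALES
(SALE_DATE DATE NOT NULL
,AMOUNT DECIMAL(11,2)
)
PARTITION BY RANGE (SALE_DATE)
(PARTITION JUN12 STARTING '2012-06-01' ENDING '2012-06-30',
PARTITION JUL12 STARTING '2012-07-01' ENDING '2012-07-31');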
Example of detaching and re-attaching a table partition:
Let’s say we have a table named SALES that is partitioned by year and month. We want to remove all the data for June 2012 (which resides in named partition JUN12) and replace it with data for June 2013 in a new partition. To accomplish this, we would perform the following steps:
1) Remove the June 2012 data by detaching it to a separate table named TEMPSALES with the following statement:
ALTER TABLE SALES
DETACH PARTITION JUN12 INTO TEMPSALES;
2) Replace the data in the TEMPSALES table with the new data for June 2013 with the following command:
LOAD FROM jun2013.txt OF DEL REPLACE INTO TEMPSALES;
3) Attach the TEMPSALES table, which contains the new data, to the SALES table as a new partition by executing the following statement:
ALTER TABLE SALES
ATTACH PARTITION JUN13
STARTING '2013-06-01' ENDING '2013-06-30'
FROM TEMPSALES;
4) At this point, the new data is not yet accessible until a SET INTEGRITY statement is executed to update the indexes for the SALES table to include the new partition. This can be done with the following statement:
SET INTEGRITY FOR SALES
ALLOW WRITE ACCESS
IMMEDIATE CHECKED;
HADR (High Availability Disaster Recovery)
The HADR (High Availability Disaster Recovery) feature is a database replication feature that provides a high availability solution to protect against a failure of a database server. Database structure changes and data changes are automatically replicated from the active or primary database to a standby database. HADR is essentially an active/passive HA solution, though read-only access on the standby database can optionally be enabled to make it a fully-active/limited-active HA solution as of DB2 LUW v9.7.
An additional feature that works in conjunction with HADR is ACR (automatic client reroute), in which client applications have knowledge of both the primary and the standby servers in an HADR pairing. In the event that the primary database is no longer responding, such as when the client receives a communication error, the client will automatically attempt to connect to the secondary server (i.e. the new primary server, after a TAKEOVER HADR command has been issued).
It should be noted that in its current implementation, HADR can practically be used either as a high-availability solution or as a disaster recovery solution, but not both. By their natures, HA and DR standby databases differ in one significant, mutually exclusive trait: physical proximity to the primary database. Ideally, an HA standby database will be physically co-located in the same data center (and on the same network segment) as the primary database, so that replication between the primary and the standby is not impacted by network latency, and so that there is little degradation in network communication times between the application servers and the standby server in the event of an HADR takeover. A DR standby database, in contrast, will be physically located at a different site than the primary database, to protect against a site disaster at the primary data center. This “HA or DR” limitation will be lifted in a future release of DB2 LUW, when multiple standby databases will be allowed. That will enable a true HA-DR solution, as one standby database can be located locally as an HA standby and a second standby database can be located remotely as a DR standby.
Data replication from the primary database to the standby database can be in one of three modes: synchronous, near-synchronous and asynchronous.
- Synchronous replication allows for the greatest protection against transaction loss but at the cost of longer transaction response time, as update transactions are considered successful only after they have been applied to the primary database, and after confirmation that they have been received by and applied to the standby database.
- Near-synchronous replication provides slightly less protection against transaction loss while providing better performance of transaction response time. In near-synchronous HADR mode, update transactions are considered successful after they have been applied to the primary database and after confirmation that the standby database has received the updates (though has not yet applied them).
- Asynchronous replication allows for the highest risk of transaction loss in the event of failure, but also provides for the shortest transaction response time. In asynchronous mode, update transactions are considered successful after they have been applied to the primary database and after the updates have been sent to the standby database. There is no wait for confirmation that the standby database has actually received or applied the updates.
In the event of a failure of the primary database server, failover to the standby server can be accomplished by issuing a TAKEOVER HADR command from the standby database server for the database in question. Once the takeover command has been issued, the standby database becomes the primary database and can resume full and normal processing. Caution must be taken when issuing a takeover command to ensure that the former primary database is no longer accessible. If it is, this can result in a situation in which both nodes in an HADR pairing are functioning as primary databases, and both are accepting and processing update transactions. This is called a “split-brain” scenario, and it makes the resynchronization of data extremely difficult because both databases contain different instances of updated data.
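As a rough sketch of the commands involved (the database name MYDB and the NEARSYNC mode are assumptions), the synchronization mode is a database configuration parameter, and role switches are driven by the TAKEOVER HADR command issued on the standby:
UPDATE DB CFG FOR MYDB USING HADR_SYNCMODE NEARSYNC;
-- Normal (graceful) role switch, issued on the standby server:
TAKEOVER HADR ON DATABASE MYDB;
-- Forced takeover when the primary is down and unreachable;
-- use with care to avoid the split-brain scenario described above:
TAKEOVER HADR ON DATABASE MYDB BY FORCE;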
Multi-Dimensional Clustering
Multi-dimensional clustering (MDC) enables a table to be physically clustered on more than one key, or dimension, at the same time. Prior to the introduction of MDC, tables could only be clustered in one physical sequence, via a clustering index. The clustering index would be used to maintain the physical order of data on pages, based on the key order of the index, as rows are inserted or updated in the table. Clustering can significantly improve performance of queries that have predicates containing one or more leading keys of the clustering index, as only a portion of the physical table needs to be accessed. However, a table with a clustering index can become unclustered over time as the available space on the table’s pages fills up and new rows can no longer be stored where they ideally should be stored. When that happens, performance can start to degrade until the table is reorganized to restore the physical order of the data in accordance with the clustering index definition.
With MDC, performance benefits of a clustered index are extended to more than one dimension, or clustering key. Any queries that involve one or more of the specified dimensions of the table will benefit from the underlying clustering, as they will only access those pages containing rows with the specified MDC dimension values. In addition, MDC tables automatically retain their clustering over the specified dimensions, thus eliminating the need to periodically reorganize the table to restore physical order.
MDC brings with it some new terms that describe the logical and physical multi-dimensional clustering concepts. First, every unique combination of dimension values forms a “cell”, which is physically made up of sets of consecutive pages on disk. Each of these sets of pages is called a “block”. All the rows that are part of the same cell are stored in the blocks that make up that logical cell. A block index is created for each dimension and contains all of that dimension’s key values. These block indexes are much smaller, and much more efficient, than regular indexes in that they point to blocks of pages rather than to individual rows. Finally, the set of blocks that contain data for the same key value of one of the dimension block indexes is called a “slice”.
When implementing MDC for a particular table, it is very important to choose the correct set of dimensions for clustering the table, based on both the anticipated query requirements and the nature of the data itself. If the dimensions are chosen appropriately, the benefits of MDC can be realized as significant performance gains. However, if the dimensions are poorly chosen then performance can be degraded and space utilization can be dramatically increased.
In general, the best candidates for MDC dimensions are columns that have a relatively low cardinality of unique values, with a high number of rows for each unique value. As more dimensions are added into the mix for a particular table, each unique combination of dimension values (i.e. each cell) should contain enough rows that the cell consists of multiple full blocks of data. Cells that contain few rows will consist of only partially filled blocks, resulting in very poor space utilization.
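As a simple illustration (a sketch; the table and column names are assumptions), a table is defined as an MDC table by adding an ORGANIZE BY DIMENSIONS clause to the CREATE TABLE statement:
CREATE TABLE SALES_HISTORY
(SALE_YEAR INTEGER
,REGION CHAR(3)
,PRODUCT_ID INTEGER
,AMOUNT DECIMAL(11,2)
)
ORGANIZE BY DIMENSIONS (SALE_YEAR, REGION);
Block indexes on SALE_YEAR and REGION are created automatically, and DB2 maintains the clustering on those dimensions as rows are inserted.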
Wednesday, June 5, 2013
What's New in DB2 z/OS?
Here is a high-level summary of some of the new features that have been added to DB2 z/OS over the past three releases (v8, v9 and v10).
If you would like more in-depth information about any of these new features, please contact Phoenix DBA and we can explore a “deeper dive” into the features you are interested in, along with recommendations on whether and how they can address current or future business problems you may be facing.
New in DB2 z/OS v8:
- Common Table Expressions and Recursive SQL
- Data Encryption and Decryption
- Label-Based Access Control Security
- Materialized Query Tables
- Multi-Row Fetch and Insert
- Native XML Support
- Native SQL Procedures
- OLAP Functionality
- TRUNCATE TABLE SQL Statement
- MERGE SQL Statement
- Clone Tables
- Trusted Context and Roles
Temporal Tables
A common business problem is the need to be able to identify what data looked like at a particular point in time in the past. For example, an insurance company may need to know what level of coverage was in place for a particular policy two months ago, when a claim was made. Traditionally, in order to support a requirement such as this, databases would need to be specifically designed with history tables containing “effective date” and “end date” columns, and complex logic would need to be coded and maintained in application programs to both correctly populate the tables (ensuring no overlap or gaps in dates), and retrieve the data as of a specified point-in-time.
The introduction of temporal tables addresses and alleviates these issues. There are two “flavors” of temporal tables: system time and business time. In a system time temporal implementation, DB2 keeps track of history based on when data was actually changed in the database. In a business time temporal implementation, DB2 keeps track of history based on information provided by the application, such as effective dates of when data is to be considered “active”. It is possible to keep track of both kinds of temporal data (system and business) in a single table. Such a table would be considered a bitemporal table.
In a system time implementation, DB2 will automatically maintain the history of data by archiving the old versions of updated or deleted data into a history table that is associated with the base table. In a system time implementation, the base table will contain the current data, while the history table contains the archived data. Start and end timestamp columns are automatically maintained by DB2 to keep track of when data was archived.
In a business time implementation, the application provides the start and end timestamp information for data that is stored in a business time temporal table. This provides the capability to not only store past and current data in a business time temporal table, but future data as well.
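As a sketch of how the CHECKING_ACCOUNT table used in the first example below might be set up for system time (the column names, data types, and history table name are assumptions):
CREATE TABLE CHECKING_ACCOUNT
(ACCOUNT_NUMBER CHAR(10) NOT NULL
,BALANCE DECIMAL(11,2)
,SYS_START TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN
,SYS_END TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END
,TRANS_ID TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID
,PERIOD SYSTEM_TIME (SYS_START, SYS_END)
);
CREATE TABLE CHECKING_ACCOUNT_HIST LIKE CHECKING_ACCOUNT;
ALTER TABLE CHECKING_ACCOUNT
ADD VERSIONING USE HISTORY TABLE CHECKING_ACCOUNT_HIST;
Once versioning is enabled, every UPDATE or DELETE against CHECKING_ACCOUNT automatically writes the prior version of the affected rows to CHECKING_ACCOUNT_HIST.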
No changes to SQL are required for retrieving current data from a temporal table. The SELECT statement would return the most current data from a system time temporal table, and would return the data for the currently effective rows (based on timestamp) from a business time temporal table.
For retrieving non-current, or “point-in-time” data from a temporal table, an “AS OF” timestamp clause can be added to the SELECT statement to specify the time period that you are interested in.
Example of SELECTing from a system time temporal table:
SELECT BALANCE
FROM CHECKING_ACCOUNT FOR SYSTEM_TIME
AS OF '2012-06-23-12.33.27.000000'
WHERE ACCOUNT_NUMBER = '5081237459';
In this example, we are looking to retrieve the balance of checking account “5081237459” as of a very specific point in time in the past. Even though we are naming the CHECKING_ACCOUNT table as the target table from which we wish to retrieve the data, DB2 is aware that CHECKING_ACCOUNT is a temporal table with an associated history table, and will actually query and retrieve the requested data from the history table.
Example of SELECTing from a business time temporal table:
SELECT UNIT_PRICE
FROM PARTS FOR BUSINESS_TIME AS OF '2013-09-15'
WHERE PART_NUMBER = 'A3473-33129';
In this second example, let’s assume that pricing changes are scheduled to go into effect later this year. The new prices for each part have already been stored in the PARTS table with effective dates for when the new prices will take effect. For part number “A3473-33129”, we want to determine how much it will cost on September 15 of this year.
Access Currently Committed
DB2 SELECT statements acquire share locks on rows or pages while processing. This can result in contention or concurrency issues when other in-flight transactions have updated the data that the SELECT statement is trying to read (via INSERT, UPDATE, or DELETE statements) but have not yet committed their updates. The reader goes into a lock wait state, and it remains there until the updater either issues a commit or a rollback (thus releasing its locks), or until the reader surpasses a DB2 system-defined lock timeout threshold.
Lock avoidance for readers can be attained by performing an uncommitted read (“UR”), in which the read transaction will return uncommitted updates in its result set. While this may be acceptable for some kinds of processing (i.e. rough counts or aggregations, where 100% accuracy or data integrity is not necessarily needed), it may not be acceptable for other processing in which there can be zero tolerance for inaccurate data.
Access to “currently committed” data is an option that is available in DB2 v10. When enabled, it allows lock waits to be avoided when uncommitted inserts or deletes are encountered during the processing of a SELECT statement. Avoiding a lock wait for uncommitted updates is not currently available.
When access to currently committed data is enabled, any rows that are encountered by the select statement that were inserted by in-flight uncommitted transactions will be ignored. Likewise, any rows that are encountered that have been deleted by in-flight uncommitted transactions will be included. In essence, the select statement is only looking at the rows that existed in the table before the in-flight updating (insert and/or delete) transaction started.
As mentioned previously, access to currently committed updates (i.e. accessing the “old”, or currently committed, version of a row before it was updated) is not yet allowed, and lock waits will continue to occur whenever rows that have been updated by in-flight uncommitted transactions are encountered. However, it is not unreasonable to assume that IBM is working on this restriction, and that it will be allowed in some future release or update to DB2.
Access to currently committed data is enabled by a new BIND parameter for packages and plans, and also by the use of new keywords during the PREPARE of dynamic SQL statements.
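For example, a package might be bound to use currently committed data via the CONCURRENTACCESSRESOLUTION bind option (a sketch; the collection and member names are assumptions):
BIND PACKAGE(MYCOLL) MEMBER(MYPROG)
ISOLATION(CS)
CONCURRENTACCESSRESOLUTION(USECURRENTLYCOMMITTED)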
Hash Tables
Hash table access is now available in DB2. Hash-organized tables allow for direct access to a specific row based on a hashing algorithm that is applied to the primary key of the table, without the need for an index. This can greatly improve performance for certain kinds of tables, in that “traditional” access via an index can typically involve up to four bufferpool getpages, and possibly up to four physical I/Os to disk. With a well-designed and well-organized hash table, these numbers can ideally be reduced to one bufferpool getpage and zero or one physical I/O.
Hash tables should generally be considered only for tables that are stable or predictable in size and where the data is mostly accessed by its primary key (in particular, the entire primary key, if the key consists of multiple columns). The primary key data should be diverse enough to minimize the probability of collisions that result from too many rows hashing to the same location. When a collision occurs, subsequently inserted rows will need to be relocated from their targeted hash location, with a pointer inserted in the target location that points to the new location. The more relocated rows there are in the hash table, the less benefit hash table access will provide, as additional getpages and I/Os may be needed to retrieve data, thus negating the intent of hashing.
Since the nature of hash tables dictates that rows are randomly distributed throughout the physical tablespace, clustering indexes are not allowed to be defined on them, though non-clustered indexes can be. Therefore, hash organization for a table is not a viable choice if sequential processing of large portions of the table is desired.
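For example, a hash-organized table might be defined as follows (a sketch; the table definition and HASH SPACE size are assumptions):
CREATE TABLE ACCOUNT
(ACCOUNT_NUMBER CHAR(10) NOT NULL
,BALANCE DECIMAL(11,2)
)
ORGANIZE BY HASH UNIQUE (ACCOUNT_NUMBER)
HASH SPACE 64 M;
The HASH SPACE value sizes the fixed hash area; it should be large enough to hold the expected data with room to spare, since overflowing it leads to the relocated rows described above.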
Trusted Context and Roles
A trusted context establishes a trusted relationship between DB2 and an external entity, such as an application server or another DB2 system. When the external entity (i.e. the DB2 client) connects to DB2, DB2 determines whether that entity and connection can be trusted by evaluating a series of attributes defined in the trusted context. After a trusted connection has been established, the DB2 authorization ID used for the connection can have access to a set of privileges, via the trusted context, that would not normally be available to it outside of the trusted connection. This is done by assigning privileges to roles rather than directly to IDs, RACF groups or UNIX groups. Roles are only available within the confines of trusted contexts, and provide the ability to more finely control from where one or more privileges can be exercised.
Trusted contexts, in conjunction with roles, thus allow you to restrict the privileges associated with a particular ID based on the location from which that ID is attempting to perform its activities.
For example: a trusted context can be set up for functional ID “XAPPID1” and application server “APPSRV1” and assigned a role of “APPSIUD”, which gives it SELECT, INSERT, UPDATE and DELETE privileges on all the tables in a particular database. This is a level of access that has not been granted directly to XAPPID1 itself, or to any RACF or UNIX group that XAPPID1 may be a member of. When XAPPID1 connects to DB2 from application server APPSRV1, it will be able to read and update the tables in question because it has established a trusted connection via the trusted context and has picked up these privileges through the role associated with that trusted context. However, if somebody happens to know or learn the password associated with XAPPID1 (which is a common occurrence at Phoenix) and attempts to connect to DB2 from another source (say, their workstation if they have a DB2 Connect client, or by logging on to the mainframe or UNIX database server directly with that ID and password), they will not have established a trusted connection, and therefore will not have access to the database in question.
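A sketch of the definitions behind this example might look like the following (the context name, table name, and server address are assumptions):
CREATE ROLE APPSIUD;
GRANT SELECT, INSERT, UPDATE, DELETE
ON TABLE APPSCHEMA.CUSTOMER
TO ROLE APPSIUD;
CREATE TRUSTED CONTEXT CTXAPP1
BASED UPON CONNECTION USING SYSTEM AUTHID XAPPID1
ATTRIBUTES (ADDRESS 'appsrv1.example.com')
DEFAULT ROLE APPSIUD
ENABLE;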
Clone Tables
Clone tables allow two exact structural duplicate instances of a specific table to exist in a database. One instance is the “active” instance, and the second instance is the clone. They are structurally identical in almost every way (columns, indexes, check constraints, tablespace, etc.), with the exceptions that the clone instance of the table has its own unique name and, more significantly, the two instances contain different data.
The intent of clone tables is to allow for a fast replacement of the data loaded in a table. The active instance of the table will contain the active data, while the clone table will initially be empty when it is first created. The clone can then be populated with data using normal DB2 processing, such as SQL INSERT statements or a DB2 LOAD utility. This allows for a new copy of the data to be staged in the clone instance of the table without impacting the active instance of the table.
When it is time to replace the active instance of the table with the new data that resides in the clone instance of the table, the EXCHANGE DATA statement will be issued. This will cause the active and clone instances of the table to be swapped such that the active table now contains the new data and the clone table now contains the old data. Any references to the active table now will have access to the new data. The data in the clone instance of the table can now be prepared for the next refresh.
The main advantage of clone tables is that they allow for very quick refreshes of data with nearly zero outage time for the table in question. The EXCHANGE DATA statement just requires a momentary outage in order to perform the physical swap of the underlying VSAM datasets that make up the active table tablespace and associated indexes.
Prior to the availability of clone tables, the most practical way to do a full refresh of data in a table was via a LOAD REPLACE utility operation. Depending on the amount of data being loaded, this could result in an outage of anywhere from seconds to minutes to hours. The utilization of clone tables greatly enhances the availability of DB2 tables that need to be periodically refreshed with new data.
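A sketch of the basic statements involved (the table names are assumptions):
-- Create the clone instance of the SALES table:
ALTER TABLE SALES ADD CLONE SALES_CLONE;
-- Stage the new data in the clone (via INSERTs or a LOAD), then swap the two instances:
EXCHANGE DATA BETWEEN TABLE SALES AND SALES_CLONE;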
MERGE SQL Statement
The MERGE statement allows for a target table to be updated using the specified input data. It basically combines the functionality of an INSERT statement and an UPDATE statement into a single statement. The MERGE statement is issued with a search condition, which is usually searching on a key value. If a match is found, then the target row is updated with the specified input data. If a match is not found, then a new row is inserted into the target table with the specified input data.
This can simplify traditional processing in which an application program will execute a SELECT statement to determine the existence of a target row in a table. If the row is found, then the application program will execute an UPDATE statement to update the row; otherwise, the application program will execute an INSERT statement to add a new row to the table. The MERGE statement will now combine these three separate statements into a single SQL operation.
The MERGE statement is also enabled for multi-row processing, similar to multi-row fetch and multi-row insert processing.
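For example, a MERGE against the SALARY_TABLE used earlier might look like this (a sketch; the host variables are assumptions):
MERGE INTO SALARY_TABLE T
USING (VALUES (:HV-NAME, :HV-SALARY)) AS S (NAME, SALARY)
ON T.NAME = S.NAME
WHEN MATCHED THEN
UPDATE SET T.SALARY = S.SALARY
WHEN NOT MATCHED THEN
INSERT (NAME, SALARY) VALUES (S.NAME, S.SALARY);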
TRUNCATE TABLE SQL Statement
The TRUNCATE TABLE statement, which has long been available in other DBMSs such as SQL Server, is now available in DB2. It allows for a quick deletion of all data in a DB2 table, as opposed to a mass DELETE DML statement.
If used with the IMMEDIATE keyword, then the truncate operation is processed immediately and cannot be undone by a subsequent ROLLBACK statement. If the IMMEDIATE keyword is not included in the TRUNCATE TABLE statement, then a ROLLBACK will undo the truncate operation.
There are some restrictions to the use of the TRUNCATE TABLE statement, foremost of which is that a TRUNCATE TABLE statement cannot be issued against a parent table in a referential foreign key constraint. This is regardless of whether the child table in the referential relationship contains any rows.
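For example (the table name is an assumption):
-- Empty the staging table, keep its allocated storage, and make the truncation immediate:
TRUNCATE TABLE SALES_STAGING
REUSE STORAGE
IMMEDIATE;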
OLAP Functionality
Three OLAP functions are now available in DB2: RANK, DENSE_RANK and ROW_NUMBER.
RANK and DENSE_RANK are very similar in that they will assign a ranking value to a row with respect to how it compares to the other rows in the result set, based on a given criterion. RANK will assign a value relative to the top of the list, leaving gaps in the sequence after ties, whereas DENSE_RANK will eliminate gaps in the sequencing of the assigned rankings.
ROW_NUMBER will simply assign the relative position of the row in the result set, starting at one.
These three functions can best be understood through a simple example that uses all three functions.
Given the following table:
CREATE TABLE SALARY_TABLE
(NAME VARCHAR(20)
,SALARY DECIMAL(8,2)
);
. . . that is populated with 20 rows of data, the following query is executed:
SELECT NAME, SALARY,
RANK () OVER (ORDER BY SALARY DESC) AS RANK,
DENSE_RANK () OVER (ORDER BY SALARY DESC) AS DENSE_RANK,
ROW_NUMBER () OVER (ORDER BY SALARY DESC) AS ROW_NUMBER
FROM SALARY_TABLE
ORDER BY SALARY DESC;
The result of the query is shown below:
NAME SALARY RANK DENSE_RANK ROW_NUMBER
ELLEN 200000.00 1 1 1
PETE 175000.00 2 2 2
FRANK 150000.00 3 3 3
CATHY 125000.00 4 4 4
TOM 120000.00 5 5 5
JOE 100000.00 6 6 6
SUE 100000.00 6 6 7
DEB 90000.00 8 7 8
JOAN 85000.00 9 8 9
DAVE 80000.00 10 9 10
SALLY 80000.00 10 9 11
BILL 75000.00 12 10 12
BOB 70000.00 13 11 13
ANN 60000.00 14 12 14
PAUL 50000.00 15 13 15
KAREN 50000.00 15 13 16
MARY 50000.00 15 13 17
RAY 45000.00 18 14 18
TRACY 40000.00 19 15 19
JIM 25000.00 20 16 20
Ellen has the highest salary, and is therefore assigned a RANK value of 1. She is followed by Pete, Frank, Cathy and Tom who are ranked 2, 3, 4 and 5 respectively. Joe and Sue both have identical salaries and are therefore both assigned a RANK value of 6. They are followed by Deb, who has a RANK value of 8. Deb’s RANK value of 8 has been assigned because there are seven other people who have a salary greater than hers. Other “ties” in the ranking can be seen at numbers 10 and 15.
DENSE_RANK follows a similar pattern, except that in the case of ties in the ranking criteria, the next sequential value is assigned to avoid skips in the DENSE_RANK values assigned. In this example, following a tie between Joe and Sue at number 6, Deb is assigned a DENSE_RANK value of 7 as opposed to the assigned RANK value of 8.
Native SQL Procedures
Stored procedures have been available in DB2 z/OS since v5, but were limited in that they needed to be written in a third-generation language such as COBOL or C, and implemented as external stored procedures running in a Workload Manager (WLM) controlled address space.
The SQL Procedures Language has been available since v5, in which DB2 z/OS stored procedures could be written in a native DB2/SQL language that is a subset of the ANSI SQL/PSM language standard. But to implement an SQL Procedure prior to DB2 v9, the SQL/PL stored procedure needed to be precompiled and translated into a C program, which would then in turn be compiled and implemented as an external stored procedure.
Since Phoenix was not licensed for a C compiler on our mainframe environment, the use of SQL Procedures on DB2 z/OS was not an option prior to DB2 v9.
The native SQL Procedures support introduced in DB2 v9 removes this restriction. Stored procedures written in SQL/PL no longer need to be precompiled and converted to C programs; they can now run natively, directly within the DB2 engine.
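As a simple sketch of what a native SQL procedure looks like (the procedure name and logic are assumptions):
CREATE PROCEDURE RAISE_SALARY
(IN P_NAME VARCHAR(20)
,IN P_PERCENT DECIMAL(5,2)
)
LANGUAGE SQL
BEGIN
UPDATE SALARY_TABLE
SET SALARY = SALARY * (1 + P_PERCENT / 100)
WHERE NAME = P_NAME;
END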
Native XML Support
DB2 now allows for unstructured XML data to be stored natively alongside structured relational data in a DB2 table via DB2’s “pureXML” technology.
Prior to the introduction of pureXML technology in DB2, XML data could be stored in a DB2 database, but it would have to be stored either as an unformatted string in a CLOB data type column, or it would need to be “shredded” into individual elements that would then be stored in traditional relational columns.
If XML data was stored within a CLOB data type, then DB2 had no understanding of the structure of the XML document it was storing, and searching on components of an XML document meant that the entire document would need to be retrieved and parsed. This could be an expensive operation in terms of CPU resources. If the XML document was shredded into individual elements that were stored in traditional relational database columns, then all the inherent advantages and integrity of the original XML document would be lost.
With native XML storage, XML data can be searched on and retrieved efficiently without having to lose any of the integrity of the XML document itself.
The SQL/XML language that is introduced with pureXML allows for individual elements of an XML document to be inserted, updated and deleted. Stored XML documents can also be indexed for efficient searching and retrieval of data, in much the same way that relational indexes are used for traditional relational data in a DB2 table.
Among the capabilities provided by pureXML in DB2 are:
- A native XML data type
- SQL/XML language, which provides full XML functions within the SQL language to access XML structures
- Hybrid data access, in which both XML and relational data can be accessed together in a single SQL/XML statement
- Read and write access to XML documents and sub-documents
- XML indexes to provide efficient access paths to XML data stored within a DB2 table
- XML schema validation against an XML schema that is registered in the Schema Repository, including support for multiple versions of an XML schema
With these capabilities combined, DB2 becomes a repository for native storage of both relational and XML data, making it a hybrid database server.
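As a brief sketch of what this looks like in practice, a table can hold relational and XML columns side by side, the XML column can be indexed, and SQL/XML predicates can search inside the stored documents. The CUSTOMER table, element names and XPath expressions below are purely illustrative.
CREATE TABLE CUSTOMER
  (CUSTID INTEGER NOT NULL
  ,INFO   XML
  );
INSERT INTO CUSTOMER (CUSTID, INFO)
  VALUES (1000, XMLPARSE (DOCUMENT
    '<customer><name>Acme Corp</name><city>Phoenix</city></customer>'));
-- XML index over the customer name element
CREATE INDEX IX_CUST_NAME ON CUSTOMER (INFO)
  GENERATE KEY USING XMLPATTERN '/customer/name' AS SQL VARCHAR(50);
-- Hybrid access: a relational column returned, with a predicate inside the XML
SELECT CUSTID
  FROM CUSTOMER
 WHERE XMLEXISTS ('$d/customer[city = "Phoenix"]' PASSING INFO AS "d");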
Multi-Row Fetch and Insert
DB2 is a relational database, which means that it works with “sets” of data at all times. DB2 tables are basically sets of zero to many rows of data, and SQL statements often process against sets of data (i.e. tables) and return sets of data as their results (i.e. cursors). However, because DB2 application programs are usually written in third generation programming languages such as COBOL, they can only process one row of data at a time. When working with a cursor that contains a result set, rows are fetched one at a time from within a loop in the program and processed as they are retrieved. If the result set cursor contains 100 rows, for example, then processing is bouncing back and forth between the application program and DB2 100 times, once for each fetch. This constant context switching between the application program and DB2 can generate a fair amount of CPU overhead, which is unavoidable because of the nature of 3GL “one row at a time” processing.
Multi-row fetch alleviates this problem. Cursor processing in an application program can now return multiple rows in a single fetch statement, up to 32,767 rows at one time. This is done via the use of host-variable arrays rather than single host-variable fields within the host COBOL program. The fetch statement is coded with the number of rows to be fetched with each call. The result is that with the use of multi-row fetch in COBOL programs, much of the CPU overhead of multiple fetches can be reduced or eliminated, thus improving overall performance and processing elapsed time.
Benchmark testing of multi-row fetch has shown that performance gains can start to be realized at around 10 rows per fetch, and optimal performance starts to be seen at around 100+ rows per fetch.
Similar performance gains can be seen with multi-row insert, which follows a similar philosophy. Rows to be inserted are staged in host-variable arrays, which can then be sent to DB2 in a single INSERT statement.
Multi-row fetch and multi-row insert require a more sophisticated level of error checking. New diagnostic statements, such as GET DIAGNOSTICS, are available to help with error handling when using multi-row fetch/insert processing.
A moderate amount of programming change is needed to retrofit multi-row fetch/insert processing into an existing DB2 COBOL program, but the performance gains that can be attained may make it worth the effort if the program's typical workload involves fetching or inserting large sets of data per transaction, and CPU consumption and elapsed runtime are chronic constraints for the application.
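The statements below sketch what this looks like at the SQL level. In a COBOL program they would appear inside EXEC SQL ... END-EXEC delimiters; the table, cursor and host-variable array names are hypothetical, and the exact clause ordering should be confirmed against the SQL Reference for the DB2 release in use.
-- Cursor enabled for rowset (multi-row) positioning
DECLARE C1 CURSOR WITH ROWSET POSITIONING FOR
  SELECT EMPNO, SALARY FROM EMP;
OPEN C1;
-- Fetch up to 100 rows into host-variable arrays with a single call
FETCH NEXT ROWSET FROM C1 FOR 100 ROWS
  INTO :HV-EMPNO-ARRAY, :HV-SALARY-ARRAY;
-- Multi-row insert: each occurrence of the arrays supplies one row
INSERT INTO EMP_HISTORY (EMPNO, SALARY)
  FOR :HV-ROW-COUNT ROWS
  VALUES (:HV-EMPNO-ARRAY, :HV-SALARY-ARRAY)
  ATOMIC;
-- Diagnostics for multi-row processing
GET DIAGNOSTICS :HV-ROWS-INSERTED = ROW_COUNT;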
Materialized Query Tables
A materialized query table (MQT) is basically a pre-calculated and stored view. While an MQT can be built on virtually any query, it is most commonly used for aggregations of data, or for any query that involves a fair amount of calculation to create its result set. Since an MQT is stored as a physical table, indexes can be defined on it for efficient access.
The main advantage of an MQT is that once it is defined on a particular query, the DB2 optimizer is intelligent enough to recognize a subsequent query with a similar or identical definition and, rather than process the new query to calculate the result set from scratch, simply “reroute” it to the previously calculated and stored results in the MQT. The end user or program executing the query does not even need to be aware of the MQT's existence . . . the reroute is performed automatically by the DB2 optimizer. However, the MQT can also be accessed directly via SQL statements if the user or program wishes to do so.
The automatic reroute of queries to be able to access the MQT can be disabled if the MQT is defined with the “DISABLE QUERY OPTIMIZATION” option. In that case, the MQT can only be accessed directly via SQL.
One of the disadvantages of MQTs is that the data stored in them can and will become stale if the source tables from which the MQT data was originally derived are updated. The MQT can be refreshed periodically via an explicitly invoked “REFRESH TABLE” SQL statement, which re-executes the query that defines the MQT and stores the new result set, overlaying the previous results.
There are two “flavors” of MQTs: “maintained by system” and “maintained by user”. System-maintained MQTs can only be populated or refreshed by a “REFRESH TABLE” statement. User-maintained MQTs can be populated or refreshed by a “REFRESH TABLE” statement, by the LOAD utility, or by SQL DML (INSERT, UPDATE or DELETE) statements.
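Here is a minimal sketch of a system-maintained MQT; the EMP source table and column names are hypothetical.
-- MQT summarizing salary by department; populated on the first REFRESH
CREATE TABLE DEPT_SALARY_MQT AS
  (SELECT WORKDEPT, SUM(SALARY) AS TOTAL_SALARY, COUNT(*) AS EMP_COUNT
     FROM EMP
    GROUP BY WORKDEPT)
  DATA INITIALLY DEFERRED
  REFRESH DEFERRED
  MAINTAINED BY SYSTEM
  ENABLE QUERY OPTIMIZATION;
-- Recalculate the result set, overlaying the previous contents
REFRESH TABLE DEPT_SALARY_MQT;
With the MQT in place, a query that aggregates SALARY by WORKDEPT against EMP becomes a candidate for the optimizer to reroute to DEPT_SALARY_MQT instead of re-aggregating the base table.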
Label-Based Access Control Security
Label-Based Access Control (LBAC) allows for more granular control over who can access data in specific rows of an LBAC secured table. This is done via a multi-level security (MLS) hierarchy (i.e. a “security policy”) that can be set up within the security system. For DB2 z/OS, the security system would be RACF. For DB2 LUW, the components that make up the security policy will be defined within the DB2 database itself.
With LBAC security, a special column is defined in each protected table to contain a security label. The security label defines what level of security is needed to access the row. Each protected table can be associated with only one security policy, though multiple security policies can exist for different tables. Each user will have one or more security labels associated with their ID (each label is associated with one security policy). When they attempt to access the data via SQL, their security label is compared to the security label associated with each row of the table. If their security label either matches or is at a higher level in the security policy than the security label assigned to the row, they are allowed to access it. If not, the row is ignored as if it does not exist (they will not receive an error message).
Here is an example of a simple security policy hierarchy:
TOP SECRET
SECRET
UNCLASSIFIED
In a simple security policy like this, each row of data can be classified as either “TOP SECRET”, “SECRET”, or “UNCLASSIFIED”. Each user will be associated with one of these three labels, and will be able to see all rows of data that are classified with a security label that is equal to or lower in the security policy than his or her label.
For instance, say user JOE has a security label of “TOP SECRET”. He will be able to access all rows in the table, since “TOP SECRET” is the highest label in the hierarchy.
If user MARY has a security label of “SECRET”, she will be able to access all rows in the table that are labeled either “SECRET” or “UNCLASSIFIED”. She will not be able to see any rows that are labeled as “TOP SECRET”, since that label is higher in the hierarchy than her assigned security label.
If user BOB has a security label of “UNCLASSIFIED”, he will only be able to see rows labeled as “UNCLASSIFIED” since that is the lowest level in the security policy hierarchy. He will not be able to see any rows labeled “TOP SECRET” or “SECRET”.
One of the idiosyncrasies of LBAC security is that different users can execute the exact same SQL statement against the exact same table, and receive different results due to their security label.
Let’s say a particular table, named T1, contains 10 rows of data. One row of data has a security label of “TOP SECRET”. Three rows of data have a security label of “SECRET”. The other six rows of data have a security label of “UNCLASSIFIED”.
If users JOE, MARY and BOB all run the following SQL statement, they will receive different results:
SELECT COUNT(*)
FROM T1;
JOE (“TOP SECRET” label) would get a result of 10, as he can see all rows in the table.
MARY (“SECRET” label) would get a result of 9, as she can see the three “SECRET” rows and the six “UNCLASSIFIED” rows.
BOB (“UNCLASSIFIED” label) would get a result of 6, as he can only see the six “UNCLASSIFIED” rows.
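In DB2 LUW, a policy like the one above could be set up roughly as follows. The object names are illustrative, and the exact statements should be confirmed against the DB2 documentation for the release in use.
-- Ordered component: elements earlier in the array outrank later ones
CREATE SECURITY LABEL COMPONENT LEVEL
  ARRAY ['TOP SECRET', 'SECRET', 'UNCLASSIFIED'];
CREATE SECURITY POLICY DATA_ACCESS
  COMPONENTS LEVEL
  WITH DB2LBACRULES;
CREATE SECURITY LABEL DATA_ACCESS.SECRET
  COMPONENT LEVEL 'SECRET';
-- Protected table: the DB2SECURITYLABEL column holds each row's label
CREATE TABLE T1
  (C1  INTEGER
  ,TAG DB2SECURITYLABEL
  )
  SECURITY POLICY DATA_ACCESS;
-- MARY can then access rows labeled SECRET or lower
GRANT SECURITY LABEL DATA_ACCESS.SECRET
  TO USER MARY FOR ALL ACCESS;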
Data Encryption and Decryption
Data can be encrypted and decrypted at the column level using the ENCRYPT and DECRYPT SQL scalar functions. A password is needed to encrypt and decrypt the data. The password can either be supplied as part of the ENCRYPT or DECRYPT function, or it can be set within a DB2 special register via the SET ENCRYPTION PASSWORD statement (recommended). A password hint can also be stored with the encrypted data to help “remember” the password. The password hint can be retrieved by the GETHINT function.
Once data has been encrypted via the ENCRYPT function and stored, the same password must be used to correctly decrypt it when it is retrieved at a later time. If a different password is used for decryption, the data returned will not match the original string, and no error or warning message will be returned.
Example of using the ENCRYPT and DECRYPT functions:
Example 1 (no password hint):
SET ENCRYPTION PASSWORD = 'TARZAN';
INSERT INTO EMP (SSN) VALUES ENCRYPT ('289-46-8832');
SELECT DECRYPT_CHAR(SSN) FROM EMP;
Example 2 (with password hint):
INSERT INTO EMP (SSN) VALUES ENCRYPT ('289-46-8832', 'TARZAN', '? AND JANE');
SELECT DECRYPT_CHAR(SSN, 'TARZAN') FROM EMP;
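If a hint was stored with the encrypted value, as in Example 2, it can be retrieved later with the GETHINT function; a minimal illustration against the same hypothetical EMP table:
-- Returns the stored password hint ('? AND JANE' in Example 2)
SELECT GETHINT(SSN) FROM EMP;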
NOTE: In the LUW environment, data encryption and decryption is already included in the base product. However, on the z/OS environment, the Encryption Facility is a separately licensed product that requires specific hardware, and is not currently available in the Phoenix environment.
Common Table Expressions and Recursive SQL
A common table expression is basically a temporary table that exists only for the duration of a single SQL statement. It can replace a process in which data is gathered and stored in an interim table that is then accessed by a subsequent SQL statement; with a common table expression, the whole process can be done in a single statement.
A common table expression can reference another common table expression that was defined earlier within the same SQL statement, thus allowing you to “stack” a number of intermediate processing results and process them all within a single execution of an SQL statement.
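For example, two common table expressions can be stacked in one statement, with the second built on the first. The EMP table and column names below are hypothetical.
-- First CTE: total salary per department
WITH DEPT_TOTAL (WORKDEPT, TOTAL_SALARY) AS
  (SELECT WORKDEPT, SUM(SALARY)
     FROM EMP
    GROUP BY WORKDEPT),
-- Second CTE: built on the first
AVG_TOTAL (AVG_SALARY) AS
  (SELECT AVG(TOTAL_SALARY)
     FROM DEPT_TOTAL)
-- Final query: departments whose payroll exceeds the average
SELECT D.WORKDEPT, D.TOTAL_SALARY
  FROM DEPT_TOTAL D, AVG_TOTAL A
 WHERE D.TOTAL_SALARY > A.AVG_SALARY;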
A common table expression can also reference itself, which allows for recursive processing within SQL. This is especially useful when processing tables that are self-referencing, as in a hierarchy. An example of this would be an employee table that references the employee’s manager, who in turn is also an employee with their own manager, etc.
Example of a recursive common table expression:
Given the following table:
CREATE TABLE PARTLIST
(PART CHAR(2)
,SUBPART CHAR(2)
,QUANTITY INTEGER
);
. . . that is populated as follows:
PART SUBPART QUANTITY
00 01 5
00 05 3
01 02 2
01 03 3
01 04 4
01 06 3
02 05 7
02 06 6
03 07 6
04 08 10
04 09 11
05 10 10
05 11 10
06 12 10
06 13 10
07 14 8
07 12 8
Say you have a need to answer the question “what is the total quantity of each part required to build part ‘01’?”
This can be done with the following recursive common table expression:
WITH RPL (PART, SUBPART, QUANTITY) AS
(SELECT ROOT.PART, ROOT.SUBPART, ROOT.QUANTITY
FROM PARTLIST ROOT
WHERE ROOT.PART = '01'
UNION ALL
SELECT PARENT.PART, CHILD.SUBPART,
PARENT.QUANTITY*CHILD.QUANTITY
FROM RPL PARENT, PARTLIST CHILD
WHERE PARENT.SUBPART = CHILD.PART
)
SELECT PART, SUBPART, SUM(QUANTITY) AS TOTAL_QTY_USED
FROM RPL
GROUP BY PART, SUBPART
ORDER BY PART, SUBPART;
The result of the query is shown below:
PART SUBPART TOTAL_QTY_USED
01 02 2
01 03 3
01 04 4
01 05 14
01 06 15
01 07 18
01 08 40
01 09 44
01 10 140
01 11 140
01 12 294
01 13 150
01 14 144