
3. The Query Optimizer#

This chapter describes the structure of the optimizer and explains the query optimization process.

Query Optimizer Overview#

The query optimizer analyzes a SQL statement, creates candidate execution plans, and selects the most efficient execution plan by evaluating their costs. This process largely determines query performance: the more complicated a query is, the more its performance depends on the precision of this process.

The Optimizer Procedure#

Prior to optimization, a SQL statement is first parsed and validated, and a parse tree is created from it. The optimizer creates an efficient plan tree by evaluating the costs of the alternatives derived from this tree.

An optimizer that executes this process has the following structure.

[Figure 3-1] The Structure of the Optimizer

The optimizer consists of the query rewriter, the logical plan generator, and the physical plan generator. Each element performs the following tasks.

  • Query Rewriter

    Using the parse tree, the query rewriter rewrites the query into a form that is easier to optimize while returning the same results. For further information about the Altibase query conversion algorithm, please refer to Query Conversion.

  • Logical Plan Generator

    Using the rewritten query and statistics, calculates the execution cost and generates an optimal logical query plan.

  • Physical Plan Generator

    Using the optimized query plan, generates a physical plan tree.

The optimization process executed by each element of the query optimizer can be categorized as below.

  • Query conversion
  • Logical execution plan generation
  • Physical execution plan generation

Elements Which Influence the Optimizer#

The following elements affect the Altibase query optimization process and, as a result, the execution plans that are produced can differ.

  • Format of SQL statements and predicates
  • Indexes and constraints
  • Statistics
  • SQL hints
  • Optimizer-related properties

For more detailed information about statistics and SQL hints, please refer to The Explain Plan.


Query Conversion#

While processing a query, the optimizer rewrites a complete parse tree into a SQL statement that has the same meaning and is easy to optimize. This is called query conversion.

The query optimizer uses the following query conversion methods. Some are used when the query rewriter rewrites queries. Others are used when the logical plan generator optimizes query plans.

  • Common Subexpression Elimination
  • Constant Filter Precedence
  • View Merging
  • Subquery Unnesting
  • Predicate Pushdowns
  • Transitive Predicate Generation

Common Subexpression Elimination#

If there are two or more identical conditional expressions in a WHERE clause, the optimizer merges the duplicate expressions into one.

In the following example, the optimizer eliminates the duplicate condition (dno = 60).

SELECT dno, salary
  FROM employees
 WHERE ( dno = 60 AND salary = 4200 )
    OR ( dno = 60 );

In the following example, the optimizer eliminates the condition (salary = 4200) because it is redundant: the whole WHERE clause simplifies to (dno = 60).

SELECT dno, salary
  FROM employees
 WHERE ( dno = 60 OR salary = 4200 )
   AND ( dno = 60 );

Constant Filter Precedence#

A constant filter is a condition that consistently evaluates to either TRUE or FALSE, regardless of the table value (e.g., 1 = 1 or 1 <> 1).

Since a constant filter always has the same logical value, it is evaluated only once and no additional comparison operation costs are incurred. If the logical value of a constant filter resolves to FALSE, additional checks and table access become unnecessary. Seemingly meaningless constant filters have a wide variety of uses.

For example, a constant filter can be used to copy a table's schema without copying any of its data. The following is an example of such a query:

CREATE TABLE t3 AS SELECT * FROM t1, t2 WHERE 1 <> 1;

As seen above, a constant filter can be used to create table t3 which contains all of the columns in tables t1 and t2, but not any of the data.

Constant filters can also be used to limit search privileges as shown below.

SELECT * FROM t1, t2 WHERE t1.i1 = t2.a1 AND ? > 20;

As shown above, the user can specify a host variable that corresponds to, for example, an age value, to prevent users who do not meet the condition from executing the query and to avoid the overhead of executing queries that return no result set.

In addition, the optimizer can use a constant filter on a predicate with a subquery to execute it only once and prevent repeated execution of the subquery.

SELECT * FROM t1 WHERE EXISTS ( SELECT * FROM t2 WHERE t2.date = SYSDATE );

In the above example, the EXISTS condition is a constant filter irrelevant to the data stored in table t1.

View Merging#

This method merges a view referenced in the main query into the main query.

For simple views that contain only joins and conditional clauses, the optimizer performs view merging unless the user specifies the NO_MERGE hint.

The following is an example of a query that joins the emp_engineer view and the departments table being converted by view merging.

CREATE OR REPLACE VIEW emp_engineer AS
SELECT eno, e_lastname, emp_job, salary, dno
  FROM employees
 WHERE emp_job = 'engineer';

SELECT e.eno, e.e_lastname, e.salary, d.dname, d.mgr_no
  FROM emp_engineer e, departments d
 WHERE d.dno = e.dno 
   AND e.salary >= 1500;
=>
SELECT e.eno, e.e_lastname, e.salary, d.dname, d.mgr_no
  FROM employees e, departments d
 WHERE d.dno = e.dno 
   AND e.emp_job = 'engineer'
   AND e.salary >= 1500;
  • Related Hint: NO_MERGE
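
If merging is not desired, the NO_MERGE hint keeps the view from being merged with the main query. The following is a minimal sketch; it assumes the hint can be written in the main query with the view's alias, so check the exact syntax in Using Hints.

SELECT /*+ NO_MERGE(e) */ e.eno, e.e_lastname, e.salary, d.dname, d.mgr_no
  FROM emp_engineer e, departments d
 WHERE d.dno = e.dno
   AND e.salary >= 1500;
-- The emp_engineer view is processed as an independent unit instead of being merged.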

Subquery Unnesting#

A nested subquery is a subquery in the WHERE clause. A subquery usually references columns in the main query to limit result sets. "Subquery Unnesting" converts a query with a nested subquery into an unnested join statement.

The following is an example of a query with a nested subquery being converted by subquery unnesting.

SELECT * FROM employees
 WHERE dno IN (SELECT dno FROM departments)
=>
SELECT *
  FROM (SELECT dno FROM departments) d, employees e
 WHERE d.dno = e.dno;
  • Related hints: UNNEST, NO_UNNEST
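
Conversely, if keeping the subquery nested is preferable, the NO_UNNEST hint can be used to suppress the conversion. This is an illustrative sketch that assumes the hint is placed inside the subquery; verify the placement in Using Hints.

SELECT * FROM employees
 WHERE dno IN ( SELECT /*+ NO_UNNEST */ dno FROM departments );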

Predicate Pushdown#

The optimizer utilizes various types of predicate pushdowns to reduce execution costs and retain mathematical consistency. The following are the main predicate pushdowns used by the optimizer.

  • Predicate Pushdowns on Views

  • Predicate Pushdowns on Outer Joins

  • Predicate Pushdowns on Join Conditions

Predicate Pushdowns on Views#

A predicate pushdown on views pushes a predicate expressed in the WHERE clause of the main query down into the view when a query is executed on a user-defined view.

For example, consider the following view and query.

CREATE VIEW v1 (a1, a2) AS
SELECT i1, i2
  FROM t1
 WHERE i2 > 20;

SELECT *
  FROM v1
 WHERE a1 = 1;

If the optimizer concludes that a predicate pushdown will deliver the most efficient execution plan while optimizing the above query, it processes the predicate in the WHERE clause internally within the view. The following query expresses this concept:

SELECT *
  FROM (
        SELECT i1, i2
          FROM t1
         WHERE i2 > 20
           AND i1 = 1
       ) v1;

The query is changed so that it can use the index for the t1.i1 column. However, the optimizer does not always decide to use this kind of predicate pushdown.

With a proper understanding of the internal structure of views, the user can make the optimizer choose a predicate pushdown on views by writing appropriate predicates, or can use hints to explicitly instruct the optimizer to do so. However, predicate pushdowns on views do not necessarily guarantee performance equal to or better than that of the original query.

  • Related hints: NO_PUSH_SELECT_VIEW, PUSH_SELECT_VIEW
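
As an illustration of the hints above, the following sketch explicitly requests the pushdown for the earlier v1 example; it assumes the hint takes the view's alias in the main query, so verify the syntax in Using Hints.

SELECT /*+ PUSH_SELECT_VIEW(v1) */ *
  FROM v1
 WHERE a1 = 1;
-- Requests that the condition a1 = 1 be evaluated inside v1, where the index on t1.i1 can be used.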

Predicate Pushdowns on Outer Joins#

Various predicate pushdown methods are available if an outer join is used in a FROM clause. What these methods have in common is that the predicate described in the WHERE clause is evaluated before the outer join is processed.

Consider the following query for example.

SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.i1 = t2.a1 WHERE t1.i1 = 1;

If the above query undergoes a predicate pushdown, it is processed as the following query:

SELECT * FROM (SELECT * FROM t1 WHERE t1.i1 = 1) t1 LEFT OUTER JOIN t2 ON t1.i1 = t2.a1;

The predicate t1.i1 = 1 is processed before the join condition of the left outer join is processed to reduce the size of the t1 result set.

This method is chosen after confirming the mathematical consistency and evaluating the cost. For example, the following three queries are different queries that can yield different results.

SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.i1 = t2.a1 WHERE t2.a1 = 1;
SELECT * FROM t1 LEFT OUTER JOIN (SELECT * FROM t2 WHERE t2.a1 = 1) t2 ON t1.i1 = t2.a1;
SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.i1 = t2.a1 AND t2.a1 = 1;

To ensure mathematical consistency and obtain the same results, the optimizer processes the first of the above queries with a predicate pushdown as shown below:

SELECT * 
  FROM t1 LEFT OUTER JOIN (SELECT * FROM t2 WHERE t2.a1 = 1) t2 
    ON t1.i1 = t2.a1
 WHERE t2.a1 = 1;

If a user wishes to move the WHERE condition to the ON clause for a left outer join, the WHERE clause must be left as below to get the same results.

SELECT * 
  FROM t1 LEFT OUTER JOIN t2 
    ON t1.i1 = t2.a1 AND t2.a1 = 1
 WHERE t2.a1 = 1;

As shown above, the user can expect similar results by arbitrarily adding a query predicate. However, it is important to discern whether adding a predicate still delivers the same results and improves performance.

Predicate Pushdowns on Views with Set Operators#

Predicate pushdowns are particularly effective in processing views created through set operations. For example, consider the following view and query:

CREATE VIEW v1 (a1, a2) AS 
( SELECT m1, m2 
    FROM t2 
   UNION ALL 
  SELECT x1, y1 
    FROM t3 );

SELECT * 
  FROM t1, v1 
 WHERE t1.i1 = v1.a1 
   AND t1.i1 = 1;

Even if the columns t2.m1 and t3.x1 have indexes in the above view definition, neither is used to execute the query. In this case, transitive predicate generation (described later in this chapter) internally creates the v1.a1 = 1 predicate.

SELECT * FROM t1, v1 WHERE t1.i1 = v1.a1 AND t1.i1 = 1 AND v1.a1 = 1;

A predicate pushdown transforms the query as follows, so that the t2.m1 and t3.x1 column indexes are available for use.

SELECT * 
  FROM t1, 
      ( SELECT m1, m2 
          FROM t2 
         WHERE t2.m1 = 1 
         UNION ALL
        SELECT x1, y1 
          FROM t3 
         WHERE t3.x1 = 1 ) v1
  WHERE t1.i1 = v1.a1 
    AND t1.i1 = 1;

The proper description of a predicate (as above) helps the optimizer create a more efficient execution plan. Also, the user can improve performance by explicitly modifying predicates. However, adding a predicate does not necessarily guarantee better performance and the user should remember that the cost of comparison operations can increase.

Predicate Pushdowns on Join Conditions#

This method pushes a join condition that references a view in the predicate of the main query down into the view.

  • Related hint: PUSH_PRED
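
No example is given above, so the following sketch illustrates the idea with a hypothetical view v2 defined over t2; the PUSH_PRED hint is assumed to take the view's alias, so verify the syntax in Using Hints.

CREATE VIEW v2(b1, b2) AS SELECT m1, m2 FROM t2;

SELECT /*+ PUSH_PRED(v) */ *
  FROM t1, v2 v
 WHERE t1.i1 = v.b1;
-- The join condition t1.i1 = v.b1 is evaluated inside the view for each row of t1,
-- so an index on t2.m1 can be used.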

Transitive Predicate Generation#

Transitive predicate generation improves performance by deriving an additional single-table predicate when a join condition and a single-table predicate exist together.

Consider the following query for example.

SELECT * FROM t1, t2 WHERE t1.i1 = t2.a1 AND t1.i1 = 1;

If an index is unavailable, performance can still be improved by adding an equivalent single-table predicate as below.

SELECT * FROM t1, t2 WHERE t1.i1 = t2.a1 AND t1.i1 = 1 AND t2.a1 = 1;

Performance can be enhanced by decreasing the result set size of t2.

View Materialization#

An optimization method that is the opposite of a predicate pushdown on views is view materialization. This method temporarily stores the data of a view that is used repeatedly by a query while the query is being processed.

For example, consider the following view and query.

CREATE VIEW v1(a1, a2) AS SELECT i1, SUM(i2) FROM t1 GROUP BY i1;
SELECT * FROM v1 WHERE v1.a2 > (SELECT AVG(a2) FROM v1 WHERE a1 > 10 );

If the above query undergoes view materialization, the view's (v1) results are temporarily stored and used by the top-level query and subquery. That is, the query defined for the view does not need to be executed repeatedly in order to obtain v1's data.

User discretion is advised, as using a hint to instruct the optimizer to use a predicate pushdown on views on this kind of query can actually worsen performance.


Creating Logical Execution Plans#

The optimizer creates a logical execution plan using statistics and the transformed query. The optimizer uses statistics (i.e., the storage characteristics of the tables, indexes, and partitions that the SQL statement accesses) to calculate costs. The optimizer expresses cost as the estimated time necessary to execute a query with a particular plan. The optimizer selects the access method and joining method with the lowest cost to create a logical execution plan.

Normalization#

The Concept of Normalization#

User-defined WHERE predicates can be written in multiple ways. The optimizer transforms user-defined WHERE clauses into a standardized form to effectively and consistently process a wide variety of conditions. This is called normalization.

Normalization is the process of extending and transforming predicates using logical operators such as AND, OR, and NOT. "Conjunctive Normal Form (CNF)" handles AND operators as the highest-level expression, whereas "Disjunctive Normal Form (DNF)" handles OR operators as the highest-level expression.

CNF normalization rearranges the structure of a predicate so that AND operators are taken as the highest-level operators and OR operators are taken as lower level operators. The following example shows the structure of a predicate transformed by CNF normalization:

WHERE ( i1 = 1 AND i2 = 1 ) OR i3 = 1  
CNF: ( i1 = 1 OR i3 = 1 ) AND ( i2 = 1 OR i3 = 1 )

DNF normalization rearranges the structure of a predicate so that OR operators are taken as higher-level operators and AND operators are taken as lower level operators. The following example shows the structure of a predicate transformed by DNF normalization:

WHERE ( i1 = 1 OR i2 = 1 ) AND i3 = 1  
DNF: ( i1 = 1 AND i3 = 1 ) OR ( i2 = 1 AND i3 = 1 )

The optimizer compares the execution costs of a CNF normalized query and a DNF normalized query and selects the normalization type with lower cost.

The optimizer generally chooses CNF-based execution plans over DNF-based execution plans. For example, consider the following query.

SELECT * FROM t1 WHERE i1 = 1 OR i2 = 1;
CNF normalization: AND ( i1 = 1 OR i2 = 1 )
DNF normalization: ( i1 = 1 AND ) OR ( i2 = 1 AND )

The above will undergo CNF normalization if table t1 does not have an index or only one column has an index. This is because it is most efficient to process the query with a full table scan.

However, if both i1 and i2 columns have indexes, it is more efficient to DNF-normalize the predicate and separately obtain the results of (i1 = 1) and (i2 = 1), and then add them.

Depending on whether or not an index exists, the optimizer can choose a different normalization type for the same predicate. This is because the optimizer determines the normalization type by comparing execution costs.

If the WHERE clause contains no OR operators, the optimizer uses only CNF normalization. However, if the WHERE clause has one or more OR operators, the optimizer performs both CNF and DNF normalization, compares their costs, and creates an execution plan accordingly.

User Tuning#

The normalization process inevitably results in the extension of predicates. Therefore, writing a predicate in a normalized form prevents predicate extension and eliminates unnecessary comparison operations.

For example, the below predicate can undergo either CNF or DNF normalization, but the user can modify it so that it can only undergo CNF normalization.

WHERE i1 = 1 OR i1 = 2 OR i1 = 3  
CNF normalization: i1 IN ( 1, 2, 3 )

Similarly, the following predicates, which can undergo either CNF or DNF normalization, can be rewritten as shown to eliminate unnecessary normalization and comparison operations and thus improve performance.

WHERE ( i1 = 1 AND i2 = 1 ) OR ( i1 = 2 AND i2 = 2 )  
CNF normalization: ( i1, i2 ) IN ( ( 1, 1 ), ( 2, 2 ) )
WHERE ( i1 = 1 AND i2 = 1 ) OR ( i3 = 1 AND i4 = 1 )  
DNF normalization: ( i1, i2 ) = ( 1, 1 ) OR ( i3, i4 ) = ( 1, 1 )

When writing a predicate in normalized form, certain information about the table (including indexes) must be considered. Furthermore, even if the user writes a predicate in normalized form, the optimizer may select an undesirable normalization type. This can be controlled by using hints, which is explained in detail in Using Hints.

Cost Estimation#

The optimizer creates the most efficient execution plan by considering the following three estimates based on statistics:

  • Selectivity

    This is the ratio of the number of rows that are anticipated to be selected by a certain predicate to the total number of target rows. The optimizer uses selectivity to calculate cardinality and costs, and decides the join order, the joining method and whether or not to use indexes. Selectivity is a fundamental element for creating an optimized execution plan.

  • Cardinality

    This is the number of rows that are anticipated to be chosen from the target rows. It is calculated as (the total number of rows * selectivity); a worked example follows this list.

  • Cost

    This is the sum of the access cost and the disk I/O cost. The access cost is determined by the access method (e.g., index scan, full scan). For further information about access methods, please refer to Access Method.
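
For example (the figures are hypothetical), if the employees table holds 1,000 rows and the predicate dno = 60 is expected to select 1% of them, the estimates are calculated as follows.

( dno = 60 ) selectivity = 0.01
cardinality              = 1,000 * 0.01 = 10 rows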

Access Methods#

An access method is how data is fetched from the database. It is generally more efficient to use an index when fetching a small amount of records from a table and to perform a full table scan when accessing a large number of records. Altibase uses the following access methods:

  • Full Table Scans

  • Index Scans

Full Table Scans#

A full table scan reads all rows in the table and filters out those that do not meet the selection criteria. Full table scans are used under the following circumstances:

  • When there is no index

  • When accessing a large amount of data in a table.

  • Related hint: FULL SCAN (table)

The following example shows a query and execution plan that use a full table scan.

iSQL> SELECT /*+ FULL SCAN(employees) */ eno, e_firstname, e_lastname, emp_job
        FROM employees 
       WHERE sex = 'F';
.
.
.
------------------------------------------------
PROJECT ( COLUMN_COUNT: 4, TUPLE_SIZE: 65 , COST: 0.18 )
 SCAN ( TABLE: EMPLOYEES, FULL SCAN, ACCESS: 20, COST: 0.14 )
------------------------------------------------

Index Scans#

There are two types of index scans:

  • Index Range Scans

  • Index Full Scans

  • Related hints: INDEX, INDEX ASC, INDEX DESC, NO INDEX

Index Range Scans#

An index range scan vertically explores the index tree from root to leaf and then scans only the necessary range for the leaves.

Altibase denotes this as the "key range processing method" as well. An index range scan uses an index and finds the range of data that meets a condition. In other words, an index range scan determines only the locations of the minimum and maximum values that meet the condition and scans all data within the range. This method guarantees improved performance because it eliminates additional comparison operations.

An index range scan is used when the leading columns of an index are used for the following predicates:

  • c1 = value

  • c1 < value

  • c1 > value

The data is sorted and returned in index column order. If the columns comprising the index are specified in the ORDER BY/GROUP BY clauses, the optimizer will avoid unnecessary sorting.

The following example shows a query using an index range scan and its execution plan.

iSQL> SELECT /*+ INDEX(employees, EMP_IDX1) */ eno, 
             e_firstname, 
             e_lastname, 
             emp_job
        FROM employees 
       WHERE dno = 1003;
.
.
.
----------------------------------------------------------
PROJECT ( COLUMN_COUNT: 4, TUPLE_SIZE: 65, COST: 0.03 )
 SCAN ( TABLE: EMPLOYEES, INDEX: EMP_IDX1, RANGE SCAN, ACCESS: 4, COST: 0.00 )
----------------------------------------------------------
Index Full Scans#

Unlike an index range scan, which scans only a particular range, an index full scan scans all of the leaf nodes from beginning to end. An index full scan is usually selected as an alternative when an optimum index for a data query does not exist. Because the data is sorted by the index key, it is unnecessary for the optimizer to perform additional sort operations.

Altibase denotes this as the "key filter processing method" as well. This method does not perform a range scan with an index, but scans each index leaf node in order and compares it against a stored key value. This method only accesses the pages which store indexes and compares them. Therefore, the number of times data pages are accessed is reduced and this can lead to improved performance.

However, performance improvement is limited to disk tables as memory table indexes do not store key values.

The filter processing method is used for predicates on which indexes cannot be used. It directly reads data and compares it. If multiple filters must be used to process a WHERE clause predicate, the optimizer compares the estimated cost of each filter and processes the filters in order of increasing cost, so that they are processed at minimum cost.

The following example shows a query using an index full scan and its execution plan.

CREATE TABLE t1(c1 INT, c2 CHAR(10)) TABLESPACE sys_tbs_disk_data;
CREATE INDEX t1_idx ON t1(c1);
INSERT INTO t1 VALUES( 1, 'a' );
INSERT INTO t1 VALUES( 2, 'b' );
INSERT INTO t1 VALUES( 3, 'c' );

iSQL> SELECT * FROM t1 ORDER BY c1;
.
.
.
----------------------------------------------------------
PROJECT ( COLUMN_COUNT: 2, TUPLE_SIZE: 16, COST: 14.02 )
 SCAN ( TABLE: t1, INDEX: t1_IDX, FULL SCAN, ACCESS: 3, DISK_PAGE_COUNT: 64, COST: 14.00 )
----------------------------------------------------------

Considerations when Creating Indexes#

An index must find the key range that most efficiently processes a predicate. However, the following must be considered when creating an index.

  • An index can improve search performance, but index management requires storage space (additional resource consumption). Moreover, too many indexes can degrade performance by incurring index maintenance costs for data insert, delete and update.

  • For memory tables, a query using an index (index scan) almost always yields better performance than a query scanning an entire table (full table scan).

  • For disk tables, index scans are not necessarily more effective than full table scans. While a full table scan accesses pages in a consistent pattern, an index scan accesses pages irregularly and may incur excessive disk I/O in some cases. Thus, it is recommended to create indexes for disk tables only if a search using an index scan returns a very small number of records (less than about 10% of the total records in the table).

Therefore, the user should consider how frequently queries are executed, how much performance will improve due to indexes, the total system resources, and how much performance will degrade due to DML statements (INSERT, DELETE, and UPDATE) before deciding whether or not to create an index.

Index Selection by the Optimizer#

The optimizer determines the most efficient access method based on the conditions and the table's indexes.

The optimizer uses various statistics to evaluate the cost of using each index and selects the most efficient access method. Once the access method has been determined, the optimizer decides the processing method (e.g., the key range method, the filter method, etc.) for each predicate.

The optimizer uses the following formulas to calculate the cost of each access method:

Access cost + Disk I/O cost

Access Cost Formula
  Full scan  : T(R)
  Index scan : log(T(R)) + T(R) * MAX(1/V(Index), selectivity)

Disk I/O Cost Formula
  Full scan  : B(R)
  Index scan :
    Buffer miss    : [T(R) / V(R.a)] * (1 - M/B(R))
    No buffer miss : B(R) * (1 - (log V(R.a) / log T(R)))

The optimizer adds the access cost and disk I/O cost to calculate the cost of each method. Since memory tables do not have disk pages, the disk I/O cost is automatically omitted for memory tables.
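
As a rough illustration of the access cost formulas (reading T(R) as the number of records in the table and V(Index) as the number of distinct key values; the figures are hypothetical and log is taken as base 10), assume T(R) = 10,000 and V(Index) = 100 for an equality predicate.

Full scan access cost  : T(R) = 10,000
Index scan access cost : log(10,000) + 10,000 * MAX(1/100, 0.01)
                       = 4 + 100 = 104 (approximately)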

When the optimizer calculates the cost for each index, it selects the condition that can be processed with the index and bases its calculations on it. The factor that most strongly affects this cost is the efficiency or "selectivity" of each of these conditions.

Condition selectivity is the ratio of the number of records selected by the condition to the total amount of records. A low condition selectivity means a small number of resultant records, which can lead to better performance.

For example, consider the following condition:

WHERE i1 = 1 AND i2 = 1

If an index has been defined for each of the i1 and i2 columns, the most important factor to consider when deciding which index to use is condition selectivity. For instance, if there are 100 [V(i1) = 100] different values in column i1 and 1000 [V(i2) = 1000] in column i2, the selectivity for each column is calculated as follows:

( i1 = 1 ) selectivity = 1/V(i1) = 1/100
( i2 = 1 ) selectivity = 1/V(i2) = 1/1000

For this query, the more efficient access method would be to use the index defined for column i2.

Selecting the access method as above generally creates the most suitable execution plan. However, because this method is based on cost calculation, an unsuitable access method can be chosen under certain circumstances.

For example, consider the following query:

SELECT * FROM soldier WHERE gender = 'female' AND rank = 'lieutenant';

If both the gender and rank columns have indexes, the optimizer will calculate the costs and choose to use the index for evaluating the condition of the column rank. However, if the user knows beforehand that the number of result sets that satisfy the condition (gender ='female') is very small, it would be appropriate to use a hint or the like to instruct the optimizer to use another index.

Using Composite Indexes#

When a composite index is used, the optimizer checks the predicates and chooses the index that allows key range processing to be performed on the maximum number of predicates. The more conditions that are processed using key range scanning, the better the performance that can be expected.

Performance can vary greatly depending on how predicates that reference user-defined indexes are written. Therefore, understanding composite indexes is very helpful for SQL tuning.

The following example shows how various predicates are processed:

Composite index on t1( i1, i2, i3, i4 )
WHERE i1 = 1 AND i2 = 1 AND i3 = 1 AND i4 = 1

All of the predicates in the above WHERE clause can be processed using the composite index and key range processing.

WHERE i1 = 1 AND i2 > 0 AND i3 = 1 AND i4 = 1
Key Range              : i1 = 1, i2 > 0
Filter(or Key filter)  : i3 = 1, i4 = 1

In the above example, it is determined that key range processing can be performed on two of the conditions and a filter will be used for the rest. This is because key range processing can only be performed within a range for which maximum and minimum values can be set, and cannot be performed on the predicates following the inequality operation.

WHERE i1 = 1 AND i3 = 1 AND i4 = 1
Key Range      : i1 = 1
Filter         : i3 = 1, i4 = 1

In the above example, key range processing can only be performed on the first predicate although all operations are equality operations. This is because key range processing can only be performed when the predicates match the order of the columns on which the composite index is defined and no columns are missing. That is, key range processing cannot be used to evaluate the predicates that reference the i3 and i4 columns, because no predicate corresponding to the i2 column precedes them.

As described above, for composite indexes, only the predicates that appear in the same order as the key columns (without any missing columns) can be evaluated using key range processing, and predicates that follow an inequality comparison are evaluated using filters.

If the following type of query is frequently executed and the user intends to add an index, the index should be designed so that it can be used to its maximum (see the sketch after the list below).

WHERE i1 > 0 AND i2 = 1
  • Proper index: Index on t1( i2, i1 )

    Key range processing can be used to evaluate the predicates that reference the i2 and i1 columns without a filter.

  • Improper index: Index on t1( i1, i2 )

    Key range processing can be used to evaluate the predicate that references the i1 column, but filter processing is used to evaluate the predicate that references the i2 column. Therefore, this index is inefficient.
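
A minimal sketch of the proper choice (the index name is hypothetical):

CREATE INDEX t1_idx_i2_i1 ON t1 ( i2, i1 );

SELECT * FROM t1 WHERE i1 > 0 AND i2 = 1;
-- Key Range : i2 = 1, i1 > 0 (no filter is needed)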

Indexes and Comparison Operators#

Just because an index exists and a predicate references the indexed column does not necessarily mean that the index can be used.

That is, it is necessary to pay attention to whether the index can be used depending on the expression type of the condition clause and the type of the comparison operator.

The comparison operator type and index availability are shown below:

[Table 3-1] Use of Comparison Operators with Indexes

Type                 Comparison Operator   Index Availability   Remarks
Simple Comparison    =                     O
                     !=                    O
                     <                     O
                     <=                    O
                     >                     O
                     >=                    O
Area Comparison      BETWEEN               O
                     NOT BETWEEN           O
Member Comparison    IN                    O
                     NOT IN                O
Pattern Comparison   LIKE                  O                    Available: t1.i1 LIKE 'abc%'
                                                                Unavailable: t1.i1 LIKE '%abc'
                     NOT LIKE              X
NULL Comparison      IS NULL               O
                     IS NOT NULL           O
Exists Comparison    EXISTS                X
                     NOT EXISTS            X
                     UNIQUE                X
                     NOT UNIQUE            X
Quantify ANY         =ANY                  O
                     !=ANY                 O
                     <ANY                  O
                     <=ANY                 O
                     >ANY                  O
                     >=ANY                 O
Quantify ALL         =ALL                  O
                     !=ALL                 O
                     <ALL                  O
                     <=ALL                 O
                     >ALL                  O
                     >=ALL                 O

(O: the index can be used, X: the index cannot be used)

Comparison operator types for the GEOMETRY data type are listed below. An index defined on a GEOMETRY column can only be used if an R-Tree index exists.

[Table 3-2] Comparison Operators for GEOMETRY

Type                  Comparison Operator   Index Availability   Remarks
Geometry Comparison   CONTAINS              O                    R-Tree index only
                      CROSSES               O
                      DISJOINT              O
                      DISTANCE              O
                      EQUALS                O
                      INTERSECTS            O
                      ISEMPTY               X
                      ISSIMPLE              X
                      NOT CONTAINS          X
                      NOT CROSSES           X
                      NOT EQUALS            X
                      NOT OVERLAPS          X
                      NOT RELATE            X
                      NOT TOUCHES           X
                      NOT WITHIN            X
                      OVERLAPS              O
                      RELATE                X
                      TOUCHES               O
                      WITHIN                O

As above, just because an index has been defined for a column and the index can be used with a comparison operator does not necessarily mean that the index can always be used.

Whether the index can be used is determined by the predicate format. An index can be used with the following predicate formats (the examples in the table assume that the column is of INTEGER type).

[Table 3-3] Predicate Formats and Index Availability

Predicate Format                                               Example of Valid Use      Example of Invalid Use
The comparison operator must be capable of using an index      t1.i1 = 1                 t1.i1 NOT LIKE 'a'
The comparison must reference a column                         t1.i1 = 1                 1 = 3
No operations should be performed on the column                t1.i1 = 1 + 1             t1.i1 + 1 = 3
Columns can be referenced only on one side of the operator     (t1.i1, t1.i2) = (1, 1)   (t1.i1, 1) = (1, t1.i2)
                                                               t1.i1 = t2.i1             t1.i1 = t1.i2
No conversion operation should be performed on a column value  t1.i1 = SMALLINT'1'       t1.i1 = 1.0

As described above, you can only use indexes by writing the proper predicates. The user should be especially careful that the column value is neither converted nor altered.

For more detailed information about data types and data conversions for indexes, please refer to Indexes and Data Types.

Indexes and Data Types#

Just because a column referenced in a WHERE clause has an index does not necessarily mean that an index scan is always available. Index scanning may or may not be possible, depending on the data type and data conversion.

SELECT * FROM t1 WHERE t1.i1 = ?

For example, assume that the i1 column is a VARCHAR column and is also the PRIMARY KEY column of the table t1. Checking the execution plan of the above query in iSQL will indicate that an index scan is possible. However, if the host variable is bound to a numeric data type (e.g., INTEGER), the values in the column must be implicitly converted to the numeric type to perform the comparison.

This renders index scanning impossible. Therefore, whether the query is processed using an index scan should be checked in iSQL. Also, the value that is actually bound in the application and the data type of the column must be checked, even if an index scan appears in the execution plan. If these are not checked, the application may perform a full scan, which will result in a performance gap compared with what was measured in the iSQL utility.
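
The following is a minimal sketch of this pitfall, assuming a hypothetical table t1 whose primary key column i1 is of VARCHAR type. Comparing the column with a character value allows an index scan, whereas comparing it with a numeric value forces the column values to be converted, so the index cannot be used.

CREATE TABLE t1 ( i1 VARCHAR(10) PRIMARY KEY, i2 INTEGER );

SELECT * FROM t1 WHERE i1 = '10';  -- character comparison: the primary key index can be used
SELECT * FROM t1 WHERE i1 = 10;    -- numeric comparison: i1 is implicitly converted, so a full scan is performed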

The data types and their index availability are described below. Note that for some of the supported combinations, the values in the key column must be type-converted before they can be compared.

[Table 3-4] Data Types and Index Availability

KEY \ VALUE   CHAR  VARCHAR  SMALLINT  INTEGER  BIGINT  NUMERIC  FLOAT  REAL  DOUBLE  DATE  BLOB  NIBBLE  BYTE  GEOMETRY
CHAR          O     O        X         X        X       X        X      X     X       X     -     -       -     -
VARCHAR       O     O        X         X        X       X        X      X     X       X     -     -       -     -
SMALLINT      X     X        O         O        O       O        O      O     O       -     -     -       -     -
INTEGER       X     X        O         O        O       O        O      O     O       -     -     -       -     -
BIGINT        X     X        O         O        O       O        O      O     O       -     -     -       -     -
NUMERIC       O     O        O         O        O       O        O      O     O       -     -     -       -     -
FLOAT         O     O        O         O        O       O        O      O     O       -     -     -       -     -
REAL          X     X        O         O        O       O        O      O     O       -     -     -       -     -
DOUBLE        O     O        O         O        O       O        O      O     O       -     -     -       -     -
DATE          O     O        -         -        -       -        -      -     -       O     -     -       -     -
BLOB          -     -        -         -        -       -        -      -     -       -     O     -       -     -
NIBBLE        -     -        -         -        -       -        -      -     -       -     -     O       -     -
BYTE          -     -        -         -        -       -        -      -     -       -     -     -       O     -
GEOMETRY      -     -        -         -        -       -        -      -     -       -     -     -       -     O

Data types can be broadly categorized into CHARACTER and NUMERIC. Indexes can be used to compare data types within the same categories.

[Table 3-5]

Character Types             CHAR, VARCHAR
Numeric Types
  Native
    Integer Type            BIGINT, INTEGER, SMALLINT
    Real Number Type        DOUBLE, REAL
  Non-Native (Exponent)
    Fixed Decimal Type      NUMERIC, DECIMAL, NUMBER(p), NUMBER(p,s)
    Floating Decimal Type   FLOAT, NUMBER

Indexes are available for comparisons between the CHAR and VARCHAR types in the character type group, or comparisons between the data types in the integer type group, the real number type group, or the non-native type group.

  • Character Types

    • char_column = VARCHAR'abc'
    • varchar_column = CHAR'abc'
  • Numeric Types

    • integer_column = DOUBLE'1'
    • number_column = DOUBLE'1'
    • integer_column = NUMBER'1'

Different numerical data types are converted and compared according to the criteria shown in the following conversion matrix:

[Table 3-6]

                    Integer Type       Real Number Type   Exponent Type
Integer Type        Integer Type       Real Number Type   Exponent Type
Real Number Type    Real Number Type   Real Number Type   Real Number Type
Exponent Type       Exponent Type      Real Number Type   Exponent Type

Such conversion can also be performed on the values in index columns, in which case the comparison operation is performed on the converted values of the index column. Therefore, execution is slower for an index scan for which conversion is necessary (e.g., the comparison bigint_col = NUMERIC'1') than for an index scan for which conversion is unnecessary (e.g., the comparison bigint_col = BIGINT'1').

Apart from index scans, the performance of comparison operations that compare two values is strongly dependent on the data types of the values being compared. When values having the same data type are compared, no data conversion cost is incurred, and performance is optimal. However, when values having different data types are compared, measures must be taken to minimize data conversion.

For example, the shortest conversion path will be between the FLOAT and VARCHAR types when comparing a numeric type with a character type. However, when choosing data types it is necessary to consider factors other than data conversion costs (e.g., differences in the data storage space requirements and formats of different data types).

The following figure shows data type conversion paths.

[Figure 3-2] Data Type Conversion Paths

To use index columns in search operations, it is very important that conditional clauses are written as above. Since the conditional clause type can affect performance, caution must be exercised when writing WHERE clauses and client applications.

After choosing the access method for each table, the optimizer processes joins. The decision on the access method for each table is not final, and may be revised while processing joins.

Determining the Join Order#

The join order and process method exert the greatest influence on the performance of complicated queries. Thus, determining whether the join order and join processing method are suitable and subsequently revising them greatly helps to improve query performance.

The optimizer takes the following steps to process joins:

  1. Groups joins according to join condition
  2. Determines the join relationship for each join
  3. Decides on the join order for each group
  4. Decides on the joining method for each group
  5. Decides on the order and method for joining groups to each other

The join order can be specified by using the following hints.

  • ORDERED Hint (see the example after this list)
  • As a parameter to access method-related hints
  • As a parameter to joining method-related hints
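
For example, the ORDERED hint instructs the optimizer to join the tables in the order in which they appear in the FROM clause (a sketch with hypothetical tables):

SELECT /*+ ORDERED */ *
  FROM t1, t2, t3
 WHERE t1.i1 = t2.a1
   AND t2.a2 = t3.m1;
-- The tables are joined in the order t1 -> t2 -> t3.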

This section describes how the optimizer determines the join order. The join order is determined by a greedy algorithm that is based on join selectivity.

Classifying Join Groups#

If all tables were taken into consideration when determining a join order (including those without join relationships), it would only increase the workload and be of little help in making the decision. It is more efficient to group the tables that are joined together and then determine the join order within each group.

Consider the following query for example:

SELECT * 
  FROM t1, t2, t3, t4, t5
 WHERE t1.i1 = t2.a1 
   AND t2.a2 = t3.m1
   AND t4.x1 = t5.z1;

Join Group Classification: (t1, t2, t3), (t4, t5)

After classifying the join groups so that there are no join relationships between them, the join order within each group is determined as follows.

Configuring Join Relationships#

Join relationships within each join group are configured to determine a more efficient join order by minimizing the comparison cost between tables without direct relationships.

Consider the following query for example.

SELECT * 
  FROM t1, t2, t3, t4
 WHERE t1.i1 = t2.a1 
   AND t2.a2 = t3.m2
   AND t3.m3 = t4.x3;

The join relationships in the above query can be expressed as follows:

[Figure 3-3]

By evaluating only the cost of join relationships when determining the join order, the optimizer prevents tables without direct join relationships (e.g., tables t1 and t4) from being chosen to be joined first.

Considering join relationships usually helps the optimizer determine an efficient join order, but it does not always guarantee the most efficient join order. The user can use hints to control the join order when necessary.

Deciding the Order of Join Relationships#

The optimizer uses the join relationships generated above to determine the order of join relationships by efficiency.

The way that the join order is determined is to choose the joins with the highest selectivity to be processed first. Then, these joins are again sorted by efficiency. However, this join relationship order is not the actual join order. The actual join order is finally set when the joining method is determined.

[Figure 3-4] Join Ordering

As shown in the figure above, the join order that is set at this stage only represents the depth of the join relationships. The most important factor in determining the join order is join selectivity, which means the ratio of the size of the result set created by combining two tables to the size of the original tables. In other words, when determining the join order, it is not the actual result set size that is important but how much of the result set size is reduced.

The optimizer calculates join selectivity as below. A detailed description of this formula is beyond the scope of this document.

Conceptual Join Selectivity Calculation
{ T(R) * T(S) / MAX[ V(R.a), V(S.a) ] } / [ T(R) + T(S) ]

The join order is determined by join selectivity so that the result set is reduced in size by higher ratios, which means that an appropriate join order is likely to be determined later when the joining method is determined.

For example, assume that the number of records in the tables R, S, T, and U referred to in a query and the number of results in the result sets created by joining the tables are respectively as follows:

T(R) = 10, T(S) = 10, T(R JOIN S) = 10
T(T) = 1000, T(U) = 1000, T(T JOIN U) = 100

In the above example, although the number of records in the result set obtained by joining tables R and S is smaller than the result set obtained by joining tables T and U, the latter join relationship is more important because it reduces the number of results in the result set by a greater proportion.

The join order determined by the optimizer using the join relationships cannot be guaranteed to be the optimal join order in all cases. Thus, the join order can be controlled using join order hints.

After the join order has been determined, the joining method for each pair of tables is determined and then the join order is finalized.

Determining the Joining Method#

Once the join order has been determined, the optimizer determines the joining method for each join relationship of both tables. The join order, join direction, and joining method are determined by comparing the costs of different joining methods.

The joining methods supported by Altibase are broadly classified into the following four categories:

  • Nested Loop Joins
  • Sort-based Joins
  • Hash-based Joins
  • Merge Joins

The joining methods and available join types for each join category are shown below.

[Table 3-7] Possible Joins Available for Joining Methods

Join Category   Join Method                   Join Direction
                                              Left => Right       Right => Left
Nested Loop     Full nested loop              C, I, S, A, L       C, I, R
                Full store nested loop        C, I, S, A, L, F    C, I, R, F
                Index nested loop             I, S, A, L          I, R
                Anti outer nested loop        F                   F
                Inverse index nested loop     -                   S
Sort-based      One pass sort join            I, S, A, L, F       I, R, F
                Two pass sort join            I, S, A, L, F       I, R, F
                Inverse sort join             -                   S, A
Hash-based      One pass hash join            I, S, A, L, F       I, R, F
                Two pass hash join            I, S, A, L, F       I, R, F
                Inverse hash join             R                   S, A, L
Merge-based     Index merge join              I, S, A             I
                Sort merge join               I, S, A             I

Possible Join Types

  •  C (Cartesian Product): The combination of two tables that are not joined
  •  I (Inner Join): A typical join between two joined tables
  •  S (Semi Join): A join between two tables that have a semi-join relationship between them
  •  A (Anti Join): A join between two tables that have an anti-join relationship between them
  •  L (Left outer join): A join between two tables that have a left outer join relationship between them
  •  R (Right outer join): A join between two tables that have a left outer join relationship between them
  •  F (Full outer join): A join between two tables that have a full outer join relationship between them

For more detailed information on each join, please refer to the description of the SELECT statement in the SQL Reference.

The optimizer chooses the most efficient joining method from the possible joining methods through cost evaluation and determines the join direction. Once the joining method has been determined, the outer table (driving table) is displayed on the left and the inner table (driven table) is displayed on the right.

With the execution plan, the user can check which joining method has been selected and control the joining method using hints.

This section describes the joining methods for each join category.

Nested Loop Joins#

The nested loop join category has the following joining methods:

  • Full nested loop join

  • Full store nested loop join

  • Index nested loop join

  • Anti outer nested loop join

  • Inverse index nested loop join

The full nested loop join joins every row of a table to every row of another table. This method is mainly used for joining two tables that have no relationship established between them, like the query below.

SELECT * FROM t1, t2;

The full store nested loop join stores the results of the inner table and then performs a full nested loop join. This method is likely to be used when the result set size can be greatly reduced by processing conditions other than join conditions, and is generally used for the Cartesian product between join groups.

SELECT * FROM t1, t2 WHERE t1.i1 = 1 AND t2.i1 = 1;

When the index nested loop joining method is used, one or more indexes are used to process join conditions. This method is likely to be used if the number of records in the outer table is small and if an index has been defined for the inner table.

Index on t2(i1)

SELECT * FROM t1, t2 WHERE t1.i1 = t2.i1 AND t1.i1 = 1;

The anti outer nested loop join is only used to process full outer joins. This method can be used only if indexes have been defined for the columns corresponding to the join condition in both the outer table and the inner table. In such cases, this method is likely to be selected over other methods.

Index on t1(i1), Index on t2(i1)

SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.i1 = t2.i1;

The inverse index nested loop join is only used to process semi-joins and is likely to be used in cases where the outer table has an index and the inner table does not. Its use is more advantageous when the outer table has a relatively larger number of records than the inner table. If the inner table has an index, however, the index nested loop join is likely to be selected over other methods.

Index on t1(i1)

SELECT * FROM t1 WHERE t1.i1 IN ( SELECT i1 FROM t2 );

The execution cost of each join can be calculated thus: (access cost + disk I/O cost).

  • Related hint: USE_NL
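
For example, the USE_NL hint can be used to request a nested loop join for a particular pair of tables. The sketch below assumes the hint takes the two table names; verify the exact syntax in Using Hints.

SELECT /*+ USE_NL(t1, t2) */ *
  FROM t1, t2
 WHERE t1.i1 = t2.i1
   AND t1.i1 = 1;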

Sort-based Joins#

The sort-based join stores the inner table in a sorted order and performs a range scan using the join conditions. This method is likely to be used when an index does not exist and join conditions using inequality operators are contained in the query.

SELECT * FROM t1, t2 WHERE t1.i1 > t2.i1;

The following are sort-based joins:

  • One-pass sort-based join

  • Two-pass sort-based join

  • Inverse sort-based join

The one-pass sort-based joining method can be used when the amount of data in the inner table is small enough to be managed within the temporary table. This method can always be used when a memory table is used as the inner table.

The two-pass sort-based joining method is used when the amount of data in the inner table is too large to be managed within the temporary table. This method is used to reduce the amount of disk I/O.

With this method, both the outer table and the inner table are sorted and then saved in the temporary table. Because the records are evaluated against the join conditions in the order in which the data of the outer table is sorted, this method increases the likelihood that the same disk page will be accessed to retrieve the corresponding record for the inner table, thereby reducing disk I/O cost.

The inverse sort-based join is only used with semi-joins or anti-joins. The use of the inverse sort-based joining method can be advantageous when the outer table is relatively larger than the inner table.

However, the Altibase optimizer usually selects the inverse hash-based join instead. Because the inverse sort-based join returns join results in sorted order and can thereby eliminate the need for further sorting, a hint must be used as shown below to force its use.

SELECT * FROM t1 WHERE t1.i1 IN ( SELECT /*+ SORT_SJ */ i1 FROM t2 );

The execution cost of each join can be calculated as: (access cost + disk I/O cost).

  • Related hints: USE_SORT, USE_ONE_PASS_SORT, USE_TWO_PASS_SORT

Hash-based Joins#

The hash-based join stores the inner table in a hash structure and performs a range scan using the join condition. This method can only be used with join conditions that use equality operators, and is likely to be chosen in cases where there are no indexes.

SELECT * FROM t1, t2 WHERE t1.i1 = t2.i1;

The following are hash-based joining methods:

  • One-pass hash-based join

  • Two-pass hash-based join

  • Inverse hash-based join

The one-pass hash-based joining method can be used when the amount of data in the inner table is small enough to be managed within the temporary table. This method is used to reduce the amount of disk I/O.

The two-pass hash-based joining method is used when the amount of data in the inner table is too large to be managed within the temporary space. Both the outer table and the inner table are partitioned using the same hash function and stored in multiple tables in the temporary space. Each pair of temporary tables is then checked against the join conditions, which increases the probability of accessing the same disk page of the inner table.

The inverse hash-based join is only used with semi-joins, anti-joins or left outer joins and is likely to be used if the inner table is relatively larger than the outer table.

The execution cost of each join can be calculated as: (access cost + disk I/O cost).

  • Related hints: USE_HASH, USE_ONE_PASS_HASH, USE_TWO_PASS_HASH

Merge Joins#

The merge joining method can be very efficient when the data in both tables is sorted. With this method, there is no concept of an inner or outer table; both tables are read in sequential order and records that satisfy the join condition are searched for. Both tables must be sorted by the join key, so this method is likely to be used if each table is already ordered by that column.

Index on t1(i1), Index on t2(a1)

SELECT * FROM t1, t2 WHERE t1.i1 = t2.a1;

The execution cost of merge joins can be calculated as: (access cost + disk I/O cost).

  • Related hint: USE_MERGE


Creating Physical Execution Plans#

Lastly, the optimizer creates a physical execution plan tree. A physical execution plan tree is composed of execution nodes, which are the units with which the executor executes queries. The executor processes the query by following the execution nodes of the execution plan tree.

Execution nodes are classified as below according to their number of child nodes and whether or not they store intermediate results:

  • Unary Non-materialization Node

    Has one or zero child nodes, does not store intermediate results and manages only one record.

  • Unary Materialization Node

    Has one or zero child nodes, and stores intermediate results.

  • Binary Non-materialization Node

    Has two child nodes and does not store intermediate results.

  • Binary Materialization Node

    Has two child nodes and stores intermediate results.

  • Multiple Non-materialization Node

    Has two or more child nodes and does not store intermediate results.

According to this classification, the following types of physical operators exist in Altibase. For further information about each execution node, please refer to Chapter 4. The Explain Plan.

[Table 3-8] Types of Execution Nodes

Classification                      Node Name                   Function
Unary Non-Materialization Node      SCAN                        Retrieving data from tables using various access path methods
                                    FILTER                      Filtering out data that cannot be filtered out by the access path method
                                    PROJECT                     Processing projection
                                    GROUPING                    Processing grouping
                                    AGGREGATION                 Performing aggregate operations
                                    VIEW                        Organizing records into views
                                    VIEW-SCAN                   Retrieving data from views
                                    COUNT                       Processing specifically for COUNT(*)
                                    PARALLEL-QUEUE              Processing parallel queries
Unary Materialization Node          SORT                        Sorting records
                                    HASH                        Hashing records
                                    GROUP-AGGREGATION           Grouping using hashing and performing aggregate functions
                                    DISTINCT                    Discarding redundant records using hashing
                                    MATERIALIZATION             Managing views that are stored in temporary tables
                                    STORE                       Storing records
                                    LIMIT-SORT                  Sorting for LIMIT clauses
                                    CONNECT BY                  Processing hierarchical queries
Binary Non-Materialization Node     JOIN                        Processing joins
                                    MERGE-JOIN                  Processing merge joins
                                    LEFT-OUTER-JOIN             Processing LEFT OUTER joins
                                    FULL-OUTER-JOIN             Processing FULL OUTER joins
                                    ANTI-OUTER-JOIN             Processing ANTI OUTER joins
                                    CONCATENATION               Combining results returned by child nodes
                                    BAG-UNION                   Processing BAG UNIONs
Binary Materialization Node         SET-INTERSECT               Performing SET INTERSECT operations
                                    SET-DIFFERENCE              Performing SET DIFFERENCE operations
Multiple Non-Materialization Node   PARTITION-COORDINATOR       Managing scans of the partitions of partitioned tables
                                    PARALLEL-SCAN-COORDINATOR   Executing the child PARALLEL-QUEUE nodes in parallel and combining the results

Features of Materialization Nodes#

A materialization node is a node that stores intermediate results in order to perform its operations. Intermediate results are stored in temporary tables and these can be classified as either memory temporary tables or disk temporary tables, depending on the storage medium type.

Memory temporary tables are stored in memory space directly allocated from the system kernel and are released immediately after the query has finished executing. Disk temporary tables are stored in user-defined temporary tablespaces and the data therein is managed with a memory buffer. This memory buffer is also released after the query has finished executing.

Whether to use a memory temporary table or a disk temporary table is determined as follows: a memory temporary table is used if all of the nodes below a materialization node use only memory tables. If one or more disk tables are used, a disk temporary table is used. This decision can be overruled by using a hint.
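
For example, assuming a temporary-tablespace hint such as TEMP_TBS_MEMORY (see Using Hints for the hints that are actually available), the user could request that intermediate results be kept in memory even when a disk table is involved; the table name below is hypothetical.

SELECT /*+ TEMP_TBS_MEMORY */ i1, SUM(i2)
  FROM disk_t1
 GROUP BY i1;
-- Intermediate results for grouping are stored in a memory temporary table.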

Temporary storage spaces are classified as shown in the table below, according to the creating node and the storage medium.

[Table 3‑9]

Storage Medium \ Node   HASH                SORT
Disk                    Disk hash table     Disk sorted table
Memory                  Memory hash table   Memory sorted table

Memory hash tables [1] can store and scan records using either the bucket method or the partitioning method; the bucket method stores records in buckets in list format and scans records bucket by bucket, whereas the partitioning method stores records as a list and scans records partition by partition. For more detailed information, please refer to General Reference > Chapter 2. Altibase Properties > Performance Properties.

In the following examples, it can be seen in the execution plans that sorting is handled differently depending on whether a memory or disk temporary table is used in the query.

The execution plan for the query below contains two nodes that store intermediate results to eliminate duplicates and sort records in order. If memory temporary tables are used, information about disk pages is not provided.

[Figure 3-5]

Two materialization nodes that store intermediate results in disk tables to eliminate duplicates and sort records are shown in the following example. If a disk temporary table is used, information about disk pages is provided. That is, whether the temporary tables are stored in memory or on disk can be checked by observing whether DISK_PAGE_COUNT appears in the information of the execution node.

[Figure 3-6]


Optimizer-Related Properties#

The following Altibase properties affect the optimizer's behavior. For more detailed information about each property, please refer to General Reference > Chapter 2. Altibase Properties > Performance Properties.

  • OPTIMIZER_FEATURE_ENABLE

  • OPTIMIZER_MODE

  • OPTIMIZER_UNNEST_AGGREGATE_SUBQUERY

  • OPTIMIZER_UNNEST_COMPLEX_SUBQUERY

  • OPTIMIZER_UNNEST_SUBQUERY


  1. Disk hash tables, disk sorted tables, and memory sorted tables are not supported.