Calculating histogram of microarray measurement data (Part 1)

I have been implementing a microarray data management application using Oracle APEX, which is a database centric rapid application builder. One task was to calculate a histogram of measurement data and show it in a chart. The histogram shows the distribution of values and can reveal problems concerning the data. So, the user selects one or more samples of interest and the application shows the histogram chart. The problem with this is performance. One microarray sample contains approximately 20000 measurements of gene expression levels and the user may select several samples. The total amount of rows hiling the measurements can be millions.

The first step is to create the query that calculates the data for the graph. Oracle has an analytic function called dense_rank, which computes the rank of a row in an ordered group of rows. In this case, the order comes from rounded measurement value. The rank is used as a bin, the rows are grouped by the bin and row count calculated for each bin. The result is the histogram.To limit the number of rows returned, the query combines  buckets 150 and above together. APEX will truncate the result if there are more rows than the chart is configured for.

The query is executed against all measurements (about 12 million) to get the upper limit for the execution time:

SQL> ;
1 select
2 least(bucket, 150) bucket_trunc,
3 median(val) label,
4 count(*)
5 from (
6 select dense_rank() over(order by round(sample_norm_value,1)) bucket, round(sample_norm_value,1) val
7 from t_sample_data
8 ) group by least(bucket, 150)
9* order by least(bucket, 150)

SQL> /

Elapsed: 00:01:16.57

Execution Plan
----------------------------------------------------------
Plan hash value: 1430084171

--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 12M| 298M| | 49529 (8)| 00:09:55 |
| 1 | SORT GROUP BY | | 12M| 298M| | 49529 (8)| 00:09:55 |
| 2 | VIEW | | 12M| 298M| | 47887 (5)| 00:09:35 |
| 3 | WINDOW SORT | | 12M| 45M| 277M| 47887 (5)| 00:09:35 |
| 4 | INDEX FAST FULL SCAN| PK_SAMPLE_DATA | 12M| 45M| | 12036 (2)| 00:02:25 |
--------------------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
362 recursive calls
37 db block gets
43764 consistent gets
167113 physical reads
0 redo size
3655 bytes sent via SQL*Net to client
483 bytes received via SQL*Net from client
11 SQL*Net roundtrips to/from client
0 sorts (memory)
2 sorts (disk)
150 rows processed

The execution time is over one minute. The first step in the query plan is index fast full scan, which is expected, since the table is index organized and Oracle has to process all rows. Full scan is the most efficient way of processing a whole table. The next step is window sort, caused by the analytic function. This step takes most of the execution time because sorting is CPU intensive and leaks into disk due to lack of memory to hold the temporary data. Most of the performance improvement is achieved by reducing the amount of rows to sort. Adding an index does not affect that. Materialized view, on the other hand, can hold pre-calculated results and dramatically reduce the amount of rows to process. The query calculates the number of rows having the same measurement value. That can be materialized:

SQL> create materialized view mv_data_points
as
select [extra columns], round(sample_norm_value,1) sample_norm_value, count(*) data_points
from t_sample_data
group by sample_accession, round(sample_norm_value,1);

Materialized view created.

Elapsed: 00:00:08.77
SQL> exec dbms_stats.gather_table_stats(user, 'MV_DATA_POINTS');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.72

Gathering table stats is necessary to allow query optimizer to generate better query plans. When the data in the underlying table changes, the materialized view becomes stale and has to be refreshed. In this case, however, the cost of refreshing is insignificant: the table is updated rarely, but the query using the materialized view is executed very often.

The application is built on Oracle XE, it is free but has limitations. For example, it does not support query rewriting with materialized views and the original query needs to be rewritten manually. That is not too hard: changing the table name and count(*) into sum(data_points) is sufficient:

SQL> select 
  least(bucket, 150) bucket_trunc,
  median(val) label,
  sum(data_points)
 from (
select dense_rank() over(order by sample_norm_value) bucket, sample_norm_value val, data_points
from mv_data_points
) group by least(bucket, 150)
order by least(bucket, 150);
150 rows selected.

Elapsed: 00:00:00.14

Execution Plan
----------------------------------------------------------
Plan hash value: 995147871

--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 42499 | 1618K| | 212 (6)| 00:00:03 |
| 1 | SORT GROUP BY | | 42499 | 1618K| | 212 (6)| 00:00:03 |
| 2 | VIEW | | 42499 | 1618K| | 207 (3)| 00:00:03 |
| 3 | WINDOW SORT | | 42499 | 332K| 1352K| 207 (3)| 00:00:03 |
| 4 | MAT_VIEW ACCESS FULL| MV_DATA_POINTS | 42499 | 332K| | 47 (3)| 00:00:01 |
--------------------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
154 consistent gets
0 physical reads
0 redo size
3663 bytes sent via SQL*Net to client
483 bytes received via SQL*Net from client
11 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
150 rows processed

That gives us the sub-second performance that was required. Wall-clock time is not reliable performance metric, because it is affected by other activity in the machine and database. Also, the state of the buffer cache affects a lot. "Consistent gets" is better metric, since it does not change from execution-to-execution. It tells how many blocks Oracle visited when executing the query. That's a simplified definition, reality is more complex. Anyway, "consistent gets" improved almost 300 times. I'm happy with that.

  • Euformatics Oy, Tekniikantie 21, 02150 Espoo, Finland
© Euformatics Oy 2012