September 2010

  • Calculating histogram of microarray measurement data (Part 1)

    I have been implementing a microarray data management application using Oracle APEX, which is a database centric rapid application builder. One task was to calculate a histogram of measurement data and show it in a chart. The histogram shows the distribution of values and can reveal problems concerning the data. So, the user selects one or more samples of interest and the application shows the histogram chart. The problem with this is performance. One microarray sample contains approximately 20000 measurements of gene expression levels and the user may select several samples. The total amount of rows hiling the measurements can be millions.

    The first step is to create the query that calculates the data for the graph. Oracle has an analytic function called dense_rank, which computes the rank of a row in an ordered group of rows. In this case, the order comes from rounded measurement value. The rank is used as a bin, the rows are grouped by the bin and row count calculated for each bin. The result is the histogram.To limit the number of rows returned, the query combines  buckets 150 and above together. APEX will truncate the result if there are more rows than the chart is configured for.

    The query is executed against all measurements (about 12 million) to get the upper limit for the execution time:

  • Euformatics Oy, Tekniikantie 21, 02150 Espoo, Finland
© Euformatics Oy 2012