A lot of time has passed since I posted the first part of this blog series, mostly due to customer projects. Lately, I have also been busy implementing our NGS cancer analysis pipeline and software for presenting the results. The product is really exciting, and we will post about it in the near future.
In the first post of this series I implemented an efficient query that calculates a histogram of microarray measurement data. In this post I will show my favourite method of implementing visualization in a web environment. In the early days of web development we used to produce lots of code writing HTML tags into a stream sent to the browser. Today, several frameworks exist to make the developer's life a lot easier. My favourite tool for building a data-centric web application is Oracle APEX. It is a declarative rapid web application development tool, and it executes completely inside the database. Instead of writing explicit steps for creating a user interface, APEX provides a point-and-click application builder (which is itself implemented using APEX) where the developer creates pages using built-in components and plugins. APEX provides configurable components (reports, charts, buttons, input fields, etc.) and wizards that help create pages or regions based on existing database objects, such as tables.
Submitted by Jussi Volanen on Fri, 12/07/2012 - 16:05
I have been implementing a microarray data management application using Oracle APEX, a database-centric rapid application builder. One task was to calculate a histogram of measurement data and show it in a chart. The histogram shows the distribution of values and can reveal problems with the data. So, the user selects one or more samples of interest and the application shows the histogram chart. The problem here is performance: one microarray sample contains approximately 20,000 measurements of gene expression levels, and the user may select several samples. The total number of rows holding the measurements can run into the millions.
The first step is to create the query that calculates the data for the graph. Oracle has an analytic function called dense_rank, which computes the rank of a row in an ordered group of rows. In this case, the ordering comes from the rounded measurement value. The rank is used as a bin: the rows are grouped by bin, and a row count is calculated for each bin. The result is the histogram. To limit the number of rows returned, the query combines buckets 150 and above into one. APEX will truncate the result if there are more rows than the chart is configured for.
The query is executed against all measurements (about 12 million) to get the upper limit for the execution time:
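The shape of such a query can be sketched roughly as follows. The table name measurement, its numeric column value, and the one-decimal rounding are illustrative assumptions, not the schema from the original application:

```sql
-- Sketch: use dense_rank over the rounded value as the bin number,
-- cap everything at bucket 150, then count rows per bin.
select bin, count(*) as cnt
from (
  select least(dense_rank() over (order by round(value, 1)), 150) as bin
  from measurement
)
group by bin
order by bin;
```

The least() call is what folds buckets 150 and above into a single final bucket, keeping the result small enough for the chart.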
Submitted by Jussi Volanen on Wed, 09/08/2010 - 17:01
In a recently published article by Lee et al. the genome of a lung cancer tumor was sequenced and compared to normal tissue. The number of single nucleotide variants (SNVs) in the tumor was over 50,000. That is an insane amount! The patient had smoked 25 cigarettes a day for 15 years, roughly 137,000 cigarettes in total, which works out to about one mutation for every 2.7 cigarettes! The study also shows that the mutation rate is lower for expressed genes than for non-expressed genes. So, something is reducing the number of mutations in those areas. It could indicate the presence of a repair mechanism, or maybe mutations affecting actively expressed genes make the cells non-viable. I recommend reading the excellent blog entry commenting on the article.
All this reminds me of a comparison between the Linux call graph and the transcriptional regulatory network of the E. coli bacterium. DNA can be considered the source code, or the operating system, of life. Likewise, genes can be considered functions in software code. Genes (and functions in Linux) can be divided into three categories:
Submitted by Jussi Volanen on Fri, 06/04/2010 - 20:12
I was reading a blog entry by Jan Aerts where he uses MongoDB to calculate statistics of SNPs from the 1000genomes project in different populations: CEU (European descent), YRI (African) and JPTCHB (Asian). The question was: how many SNPs are shared between those populations? Jan Aerts used Ruby and MongoDB to answer that question. He reported that calculating the statistics took about 55 minutes on his laptop. That is unreasonably long for such smallish data. After all, there are only about 30 million SNP entries, or 800 MB of data. I will demonstrate here how the same thing can be done in about 3 minutes, including data loading.
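Once the data sit in a relational table, the overlap question reduces to a single aggregation. A sketch, assuming a hypothetical table snp with columns name and population (not the actual schema used in the post):

```sql
-- Sketch: for each SNP name, count how many distinct populations it
-- occurs in (1, 2 or 3), then count how many SNPs fall at each level.
select populations, count(*) as snp_count
from (
  select name, count(distinct population) as populations
  from snp
  group by name
)
group by populations
order by populations;
```

SNPs with populations = 3 are common to all three groups; populations = 1 marks population-specific SNPs.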
So, I decided to try the same exercise using Oracle XE, the free (and limited) edition of Oracle Database. The most important limitation here is memory, as Oracle XE can use only a total of 1 GB. I gave most of it to the PGA, the program global area, which is used for sorting and hash operations. Since I am the only user of this database, I changed the memory management of work areas to manual mode and allocated more memory for hash and sort operations:
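A minimal sketch of those session settings; the sizes below are illustrative assumptions, not the values used originally:

```sql
-- Switch work-area management from automatic to manual for this session,
-- then enlarge the hash and sort areas (values are in bytes, assumed).
alter session set workarea_size_policy = manual;
alter session set hash_area_size = 400000000;
alter session set sort_area_size = 400000000;
```

With manual work areas, each hash join or sort can use the full area size instead of competing under the automatic PGA target, which matters on a memory-capped XE instance with a single user.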
Submitted by Jussi Volanen on Sun, 04/11/2010 - 14:23