SNP statistics using Oracle XE
I was reading a blog entry by Jan Aerts where he uses MongoDB to calculate statistics of SNPs from the 1000genomes project in different populations: CEU (European descent), YRI (African) and JPTCHB (Asian). The question was, how many SNPs are in common between those populations. Jan Aerts used Ruby and MongoDB to answer that question. He reported execution time of calculating the statistics to be about 55 minutes in his laptop. That is unreasonable for this smallish data. After all, there are only about 30 million SNP entries, or 800 MB of data. I will demonstrate here how the same thing can be done in about 3 minutes including data loading.
So, I decided to try the same exercise using Oracle XE, the free (and limited) edition of Oracle database. The most important limitation here is memory as Oracle XE can use only total of 1 GB of memory. I gave most of it to PGA, which is program global area and used for sorting and hash operations. Since I am the only user of this database, I changed the memory management of work areas to manual mode and allocated more memory for hash and sort operations:
ALTER SESSION SET WORKAREA_SIZE_POLICY=MANUAL;
ALTER SESSION SET HASH_AREA_SIZE = 400000000;
ALTER SESSION SET SORT_AREA_SIZE = 100000000;
I decided not to import the data into database at all. Instead, I created something Oracle calls "external tables". Those tables let you query data in a flat file as if they were regular tables. It takes no time to create an external table, but the penalty is extra CPU consumption during reads because Oracle has to parse the file. Also, no indexes are supported. I created one table for each sites file as follows:
SQL> create table CEU_sites (
2 chr NUMBER(2),
3 pos INTEGER,
4 ref_base varchar2(1),
5 variant_base varchar2(1),
6 maf float
7 )
8 organization external (
9 type oracle_loader
10 default directory temp
11 access parameters (
12 records delimited by 0x'0a'
13 fields terminated by 0x'09'
14 )
15 location ('CEU.sites')
16 )
17 reject limit 0;
Table created.
Elapsed: 00:00:00.01
Lines 8-17 define where the the external file is located and how the data is formatted. I downloaded the files to c:\temp directory and created corresponding Oracle DIRECTORY object. The file name is defined in the line 15.
The only time consuming part is the query that calculates the statistics. The code assumes unique positions (chromosome number + position in that chromosome), which is not entirely true, the input data contains one duplicate. The query is the most efficient method of calculating the number of common SNPs (until proven otherwise). The query reads the input files once and performs hash based aggregation. The query consists of two nested group bys. The inner group by joins the three files using the chromosome and position and produces attribute p, which contains information about populations having the SNP. The outer finds distinct values of p, counts number of rows and that's it.
SQL> set autotrace on
SQL> select case p
2 when 1 then 'CEU alone'
3 when 2 then 'YRI alone'
4 when 3 then 'CEU/YRI'
5 when 4 then 'JPTCHB alone'
6 when 5 then 'CEU/JPTCHB'
7 when 6 then 'YRI/JPTCHB'
8 when 7 then 'all three'
9 end "populations",
10 count(*) "count",
11 round(100 * ratio_to_report(count(*)) over (),1) "distribution"
12 from (
13 select sum(population) p from (
14 select /*+ cardinality(CEU_SITES 9633115)*/
15 1 population, chr, pos from CEU_SITES
16 union all
17 select /*+ cardinality(YRI_SITES 13759844)*/
18 2, chr, pos from YRI_SITES
19 union all
20 select /*+ cardinality(JPTCHB_SITES 10970708)*/
21 4, chr, pos from JPTCHB_SITES
22 ) group by chr, pos
23 )
24 group by p
25 order by "distribution" desc;
populations count distribution
------------ ---------- ------------
YRI alone 6901758 31,7
all three 4836814 22,2
JPTCHB alone 4101691 18,9
CEU alone 2954417 13,6
YRI/JPTCHB 1105797 5,1
CEU/JPTCHB 926406 4,3
CEU/YRI 915476 4,2
7 rows selected.
Elapsed: 00:03:00.07
Execution Plan
----------------------------------------------------------
Plan hash value: 2907451093
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 34M| 426M| 19669 (100)| 00:03:57 |
| 1 | SORT ORDER BY | | 34M| 426M| 19669 (100)| 00:03:57 |
| 2 | WINDOW BUFFER | | 34M| 426M| 19669 (100)| 00:03:57 |
| 3 | SORT GROUP BY | | 34M| 426M| 19669 (100)| 00:03:57 |
| 4 | VIEW | | 34M| 426M| 6767 (99)| 00:01:22 |
| 5 | HASH GROUP BY | | 34M| 950M| 6767 (99)| 00:01:22 |
| 6 | VIEW | | 34M| 950M| 317 (73)| 00:00:04 |
| 7 | UNION-ALL | | | | | |
| 8 | EXTERNAL TABLE ACCESS FULL| CEU_SITES | 9633K| 238M| 93 (69)| 00:00:02 |
| 9 | EXTERNAL TABLE ACCESS FULL| YRI_SITES | 13M| 341M| 121 (77)| 00:00:02 |
| 10 | EXTERNAL TABLE ACCESS FULL| JPTCHB_SITES | 10M| 272M| 102 (72)| 00:00:02 |
--------------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
965 recursive calls
0 db block gets
1350 consistent gets
98044 physical reads
0 redo size
730 bytes sent via SQL*Net to client
384 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
3 sorts (memory)
0 sorts (disk)
7 rows processed
SQL>