Genome research creating data that's too big for IT
- By Paul McCloskey
- Apr 18, 2013
The biomedical research community is generating data at a rate that has simply overwhelmed the ability of traditional IT tools to make the best scientific use of it.
At least that is the gist of a letter this month to the cancer research community from George Komatsoulis, chief information officer of the National Cancer Institute, which is soliciting best practices on how to overcome the challenge.
Data generated by genome sequencing and the use of large-scale imaging technologies are "breaking the standard model by which researchers manage and analyze data," he wrote in a recent blog post.
In an effort to head off the trend, the NCI has asked all of its grantees this month for input on a set of pilot projects that would test the feasibility of setting up a "cancer knowledge cloud" that would equip researchers with the computational tools they need to meet the big data demands of big science.
By combining storage repositories and computing power in the cloud, researchers would overcome some of the limitations that classic data management practices are putting on one of the NCI's biggest goals: building The Cancer Genome Atlas (TCGA) data set.
By its conclusion in 2014, the TCGA project is expected to have generated 2.5 petabytes, or 21 million gigabits, of data. Even with a 10-gigabit-per-second link, Komatsoulis estimated it would take 23 days (2 million seconds) to download the data set. A faster solution, he pointed out, would be to ship a disk array via the U.S. Postal Service.
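For readers who want to check the download estimate, here is a rough sketch of the arithmetic. It is not Komatsoulis's own calculation: the function names are made up for illustration, and it assumes binary prefixes (1 petabyte = 1,024 x 1,024 gigabytes) and a link that runs at its full rated speed with no protocol overhead.

```python
# Back-of-the-envelope check of the TCGA download estimate.
# Assumptions (not from the NCI): binary prefixes and a fully
# saturated link with no protocol overhead or retransmissions.

GIGABYTES_PER_PETABYTE = 1024 ** 2
BITS_PER_BYTE = 8
SECONDS_PER_DAY = 86_400


def dataset_gigabits(petabytes: float) -> float:
    """Convert a data set size in petabytes to gigabits."""
    return petabytes * GIGABYTES_PER_PETABYTE * BITS_PER_BYTE


def transfer_seconds(petabytes: float, link_gbps: float) -> float:
    """Ideal transfer time in seconds over a link of link_gbps gigabits/second."""
    return dataset_gigabits(petabytes) / link_gbps


def transfer_days(petabytes: float, link_gbps: float) -> float:
    """Ideal transfer time in days."""
    return transfer_seconds(petabytes, link_gbps) / SECONDS_PER_DAY


if __name__ == "__main__":
    size_pb = 2.5      # projected TCGA size cited in the article
    link_gbps = 10.0   # link speed cited in the article
    print(f"{dataset_gigabits(size_pb):,.0f} gigabits")
    print(f"{transfer_seconds(size_pb, link_gbps):,.0f} seconds")
    print(f"{transfer_days(size_pb, link_gbps):.1f} days")
```

Under these assumptions the data set comes to roughly 21 million gigabits and about 2.1 million seconds of transfer time, or a little over 24 days. Rounding to 2 million seconds before converting gives the 23 days quoted above. A real link would rarely sustain its rated speed, so the actual download would likely take longer.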
The cost of storage is another deal breaker. Komatsoulis estimates it would cost about $2 million a year to archive the TCGA data sets for each of the individual labs or research teams working with the database.
Instead, a cancer knowledge cloud, by co-locating data and computing power, would "allow researchers to bring their analytic tools to the data rather than trying to bring the data to their tools," he wrote. "Such clouds have the potential to increase the speed of discovery and democratize access to cancer genomics data, which is too often the province of organizations that can support the high cost of maintaining these enormous data sets."
The cancer knowledge cloud, however, is only an idea at this point. NCI has proposed a series of pilot projects to test the feasibility of the concept and identify potential approaches to take.
In a letter this month to all NCI research grantees, NCI Director Dr. Harold Varmus said the institute is interested in information on situations "where limitations in (IT) are either preventing you from carrying out ... high value research ... or slowing the pace of discovery considerably."
NCI also wants information from grantees on their experience using "high performance computing environments," as well as any metrics that might be used to gauge the success of pilot cancer knowledge clouds.
Paul McCloskey is editor-in-chief of GCN. Follow him on Twitter: @Paul_GCN.