How can a 1 Gbp genome and fewer than 200 samples produce 10 Tb of data? How do we work with such massive datasets? Genomics is benefiting from an accelerating increase in data. As we work with more samples and larger genomes, data volume grows linearly. When machine learning algorithms are involved, it can grow exponentially. We need to change how we think about processing data and performing analyses. At the analytical level, researchers should understand how to break a problem down into the smallest solvable sub-problems. Attacked this way, a large dataset becomes a series of small computations that is easy to parallelize. Map/Reduce is a technique from data science that addresses this specific problem, and it benefits workflows, high-performance computing, and programming alike.
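
As a minimal sketch of the Map/Reduce idea, assuming a toy task of counting nucleotides in a sequence: the map step handles each chunk on its own, and the reduce step merges the partial results. The function names and the chunk size are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce


def chunked(sequence: str, size: int):
    """Split a sequence into consecutive chunks of at most `size` characters."""
    for start in range(0, len(sequence), size):
        yield sequence[start:start + size]


def count_bases(chunk: str) -> Counter:
    """Map step: count the nucleotides in one chunk, independent of all others."""
    return Counter(chunk.upper())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial counts into one."""
    return left + right


def base_counts(sequence: str, chunk_size: int = 4) -> Counter:
    """Run the map step over every chunk, then fold the partial results together."""
    partials = map(count_bases, chunked(sequence, chunk_size))
    return reduce(merge_counts, partials, Counter())
```

Because each call to `count_bases` depends only on its own chunk, the built-in `map` could in principle be replaced by a parallel map, such as the one provided by `multiprocessing.Pool`, without changing the reduce step.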

Other problems can arise from large datasets. Common bioinformatics software does not scale to large genomes. Throwing hardware at the problem is the most common solution, but there are alternatives, such as memory-mapping files. Finally, processors provide single instruction, multiple data (SIMD) instructions, which run a single computation over multiple data points simultaneously. Experience in a systems programming language is not a prerequisite: both Python and R have tools for working with SIMD and GPU instruction sets.
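
Memory mapping can be sketched with Python's standard `mmap` module. The example below assumes a FASTA file in which every record header starts a line with `>`, and counts records without reading the whole file into memory. The function name is hypothetical, and the sketch does not handle empty files, which `mmap` refuses to map.

```python
import mmap


def count_fasta_records(path: str) -> int:
    """Count FASTA headers by scanning a read-only memory map of the file."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # A header on the very first line has no preceding newline.
            count = 1 if mapped[:1] == b">" else 0
            # Every other header is a ">" directly after a newline.
            position = mapped.find(b"\n>")
            while position != -1:
                count += 1
                position = mapped.find(b"\n>", position + 1)
            return count
```

The operating system decides which parts of the file are resident at any moment, so the process does not need enough memory to hold the whole file at once.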

In this lightning talk, I plan to share my story of working with large datasets, how I try to address problems, and my failures and successes. Working with datasets of a size unthinkable a decade ago requires a shift in thinking, both at the analysis level and at the level of those writing the tools and libraries.

ABOUT THE AUTHOR

Joseph Guhlin holds a PhD in Plant and Microbial Sciences. He has been working with Unix (originally FreeBSD) and Perl since age 12, and has expanded his programming skills to Clojure, a Lisp dialect that runs on the JVM, and Rust, a systems-level programming language gaining traction as an alternative to C++. His interests include programming, genomics, big data sets, and machine learning applications.