Data pipelines and Prisms

McCulloch, Alan

doi:10.6084/m9.figshare.11929551.v1

presentation

posted on 2020-03-10, 03:51 authored by Alan McCulloch

We describe a data-prism metaphor for data processing and analysis, contrast it with the data-pipeline metaphor and topology, and describe several use cases.

The pipeline processing metaphor is popular for two main reasons. Firstly, end-to-end (longitudinal) processing integrity and performance are usually uppermost in the minds of analysts and software designers. Secondly, mature implementations of well-understood formal pipeline-topology abstractions, such as directed acyclic graphs, are readily available, as are well-understood end-to-end-oriented quality-control processes and metrics.

However, collections of input files associated with some large-scale datasets have important side-to-side (latitudinal) structural features, processing requirements and quality-control metrics that are not so well represented by the longitudinal pipeline metaphor and topology. For example, processing of a set of samples from a sequencing machine may conclude with perfect end-to-end integrity for each data file, yet unsupervised machine learning (for example, clustering) applied latitudinally to a low-entropy precis of all of the input, intermediate or final data files may identify features of interest, such as outliers, that are relevant to quality control.
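The latitudinal check described above can be sketched as follows. This is a minimal illustration, not the method used in the presentation: it assumes each data file is reduced to a k-mer frequency profile as its low-entropy precis, and it uses a simple robust-distance outlier screen in place of a full clustering step. The function names `precis`, `distance` and `latitudinal_outliers`, and the default threshold, are hypothetical.

```python
from collections import Counter
from statistics import median


def precis(text, k=2):
    """Reduce one file's content to a small k-mer frequency profile (assumed precis)."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    total = sum(counts.values()) or 1
    return {kmer: n / total for kmer, n in counts.items()}


def distance(p, q):
    """Manhattan distance between two frequency profiles."""
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in set(p) | set(q))


def latitudinal_outliers(files, k=2, threshold=3.0):
    """Flag files whose precis is unusually far from the rest of the collection.

    files: mapping of file name -> content already read into memory
           (at least two files are required).
    A file's score is its median distance to every other file. Files whose
    score lies more than `threshold` median-absolute-deviations above the
    median score are reported.
    """
    profiles = {name: precis(text, k) for name, text in files.items()}
    scores = {
        name: median(distance(p, q) for other, q in profiles.items() if other != name)
        for name, p in profiles.items()
    }
    centre = median(scores.values())
    # Fall back to a tiny spread when most files are identical (MAD of zero).
    mad = median(abs(s - centre) for s in scores.values()) or 1e-9
    return sorted(name for name, s in scores.items() if (s - centre) / mad > threshold)
```

Each file here could pass every per-file pipeline check and still be flagged, because the comparison is made across files rather than along the processing path of any one file.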

Another example of latitudinal processing and structure is the use of multiple reference frames for sample annotation, rather than a single reference, so that a single stream of processing is refracted into multiple streams, with each stream searching a different reference database and/or using alternative search parameters. Technical steps such as job scheduling and the disposition of intermediate and output files for such “short, wide” (as opposed to “long, narrow”) processing streams can be awkward when using a pipeline metaphor. For example, pipeline-oriented scripting usually stores and indexes input, intermediate and output files “non-semantically”, by hard-mapping the output of each pipeline stage to a separate file-system folder for that stage. This does not work well if each folder receives output from multiple threads of processing of the same data (for example, file-name collisions will result).
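One way to avoid the collisions described above is to encode the meaning of each output in its name. The sketch below is an assumption about how such semantic naming might look, not the scheme used in the presentation; the function `semantic_path`, the field order and the `.out` suffix are all hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def semantic_path(root, sample, stage, reference, params):
    """Build an output path that records which sample, reference and settings produced it."""
    # Sort the parameters so the same settings always give the same digest.
    canonical = ",".join(f"{key}={params[key]}" for key in sorted(params))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:8]
    name = f"{sample}.{stage}.{reference}.{digest}.out"
    # Outputs still share one folder per stage, but names no longer collide.
    return PurePosixPath(root) / stage / name
```

With this naming, two streams that search different reference databases for the same sample write to the same stage folder under different file names, and the name alone is enough to index the file by sample, reference and parameter set.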

We describe some data-prism use cases and a number of simple techniques we have found useful in implementing a data-prism metaphor: semantic file storage and indexing; a high-level API for tasks such as random sampling and processing large numbers of input files and parameter sets; low-entropy data representations that support a high-level latitudinal view of the data; and a meta-scheduler with command-line mark-up that makes it easier to refract a single stream of processing into multiple streams (and tries to reduce the impedance mismatch between the shell command line that many users know and love and the cluster job-submission systems known and loved by systems administrators).
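The refraction of one command into many can be illustrated with a small sketch. The mark-up syntax of the actual meta-scheduler is not given here, so this example assumes Python `str.format` placeholders as a stand-in; `sample_inputs`, `refract` and the `search` command are hypothetical names used only for illustration.

```python
import itertools
import random


def sample_inputs(paths, n, seed=0):
    """Draw a reproducible random sample of input files."""
    rng = random.Random(seed)
    paths = list(paths)
    return sorted(rng.sample(paths, min(n, len(paths))))


def refract(template, inputs, **axes):
    """Expand one marked-up command into one command per input and parameter combination.

    template: command line containing {input} plus one placeholder per axis.
    axes:     placeholder name -> list of values to refract over.
    """
    names = sorted(axes)
    commands = []
    for path in inputs:
        for values in itertools.product(*(axes[name] for name in names)):
            fields = dict(zip(names, values))
            commands.append(template.format(input=path, **fields))
    return commands


commands = refract(
    "search --db {reference} --evalue {evalue} {input}",
    ["a.fa", "b.fa"],
    reference=["nt", "refseq"],
    evalue=["1e-5", "1e-10"],
)
```

Here two inputs, two references and two parameter values yield eight command lines, each of which could then be handed to a cluster job-submission system as a separate job.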

ABOUT THE AUTHOR

Alan McCulloch is a Bioinformatics Software Engineer working at AgResearch’s Invermay campus, mainly supporting genetic and genomic databases and high-throughput computational pipelines.