Posted on 2020-03-10, 03:51. Authored by Alan McCulloch.
We describe a data-prism metaphor for data processing and analysis, contrast it with the
data-pipeline metaphor and topology, and describe several use cases.
The pipeline processing metaphor is popular for two main reasons: firstly, end-to-end
(longitudinal) processing integrity and performance are usually uppermost in the minds of
analysts and software designers; secondly, mature implementations of well-understood
formal pipeline-topology abstractions such as directed acyclic graphs are readily available,
as are well-understood end-to-end-oriented quality-control processes and metrics.
However, collections of input files associated with some large-scale datasets have important
side-to-side (latitudinal) structural features, processing and quality control metrics that are
not so well represented by the longitudinal pipeline metaphor and topology. For example,
while processing of a set of samples from a sequencing machine may conclude with perfect
end-to-end integrity per data file, unsupervised machine learning (for example, clustering)
applied latitudinally to a low-entropy précis of all of the input, intermediate or final data
files may identify data features of interest, such as outliers, that are relevant to quality control.
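A minimal sketch of this latitudinal quality-control idea (an illustration only: the file layout, read limit and two-cluster cut are assumptions, not our production code) summarises each FASTQ file as a base-composition frequency vector and clusters the vectors across all files, so that a file landing in a small cluster of its own is flagged as a candidate outlier:

import gzip
from collections import Counter
from pathlib import Path

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def precis(fastq_path, max_reads=10000):
    # Summarise a (possibly gzipped) FASTQ file as an A/C/G/T/N frequency
    # vector, a very low-entropy precis of the file's contents.
    counts = Counter()
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 1:  # FASTQ sequence lines
                counts.update(line.strip().upper())
    total = sum(counts.values()) or 1
    return np.array([counts[base] / total for base in "ACGTN"])

# Assumed layout: one FASTQ file per sample under ./fastq
files = sorted(Path("fastq").glob("*.fastq*"))
matrix = np.vstack([precis(path) for path in files])

# Latitudinal view: cluster the precis of all files at once; a file that
# falls in a small cluster of its own is a candidate quality-control outlier.
labels = fcluster(linkage(matrix, method="average"), t=2, criterion="maxclust")
for path, label in zip(files, labels):
    print(label, path.name)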
Another example of latitudinal processing and structure is in the use of multiple reference
frames for sample annotation, rather than a single reference, so that a single stream of
processing is refracted into multiple streams, with each stream searching a different
reference database, and/or using alternate search parameters. Technical steps such as
job-scheduling and intermediate and output file-disposition for such “short, wide” (as opposed
to “long, narrow”) processing streams can be awkward when using a pipeline metaphor. For
example, pipeline-oriented scripting usually stores and indexes input, intermediate and
output files “non-semantically”, via hard-mapping the output from each pipeline stage to a
different file-system folder for that stage, which does not work well if each folder receives
input from multiple threads of processing of the same data (for example, file-name
collisions will result).
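A minimal sketch of the semantic alternative we have in mind (the naming scheme, sample and database names here are illustrative assumptions) encodes the provenance of each result, sample, reference database and parameter set, in the file name itself, so that multiple refracted streams can share a single folder without collisions:

import hashlib
from pathlib import Path

def semantic_path(outdir, sample, reference, params):
    # Digest the parameter string so that long option lists remain
    # file-system safe, then encode provenance in the file name.
    param_tag = hashlib.md5(params.encode()).hexdigest()[:8]
    return Path(outdir) / f"{sample}.{reference}.{param_tag}.results.txt"

# One sample refracted across two references and two parameter sets yields
# four distinct output paths in the same folder; no name collisions.
for reference in ("nt", "swissprot"):            # illustrative databases
    for params in ("-evalue 1e-5", "-evalue 1e-10"):
        print(semantic_path("out", "sample01", reference, params))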
We describe some data-prism use cases, and a number of simple techniques we have found
useful in implementing a data-prism metaphor: semantic file storage and indexing; a
high-level API for tasks such as random sampling and processing large numbers of input files
and parameter sets; low-entropy data representation approaches to support a high-level
latitudinal view of the data; and the use of a meta-scheduler and command-line mark-up for
easier refraction of single into multiple streams of processing (and to try to reduce the
impedance mismatch between the shell command-line that many users know and love, and
the cluster job-submission systems known and loved by systems administrators).
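As an illustration of the mark-up idea (the {input}/{reference}/{output} tokens and the blastn command line below are assumptions of this sketch, not our meta-scheduler's actual mark-up), refraction amounts to expanding one templated command into a concrete command per combination of input file and reference database, each of which the meta-scheduler would then submit as a cluster job:

import itertools

# A single marked-up command line: the tokens are placeholders to be
# expanded into one concrete command per (input, reference) combination.
TEMPLATE = "blastn -query {input} -db {reference} -out {output}"

inputs = ["sample01.fasta", "sample02.fasta"]    # illustrative inputs
references = ["nt", "silva"]                     # illustrative databases

for query, db in itertools.product(inputs, references):
    command = TEMPLATE.format(
        input=query,
        reference=db,
        output=f"{query}.{db}.blast.txt",        # semantic naming, as above
    )
    # A real meta-scheduler would wrap each expansion in a cluster
    # job submission (e.g. sbatch or qsub); here we simply print them.
    print(command)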
ABOUT THE AUTHOR
Alan McCulloch is a Bioinformatics Software Engineer working at AgResearch’s Invermay
campus, mainly supporting genetic and genomic databases and high-throughput
computational pipelines.