10.6084/m9.figshare.11929551.v1
Alan McCulloch
Data pipelines and Prisms
eResearch NZ
2020
NeSI
eResearch
eResearch NZ 2020
2020-03-10 03:51:07
Presentation
https://eresearchnz.figshare.com/articles/presentation/Data_pipelines_and_Prisms/11929551
We describe a data-prism metaphor for data processing and analysis, contrast it with the data-pipeline metaphor and topology, and describe several use cases.

The pipeline processing metaphor is popular for two main reasons: first, end-to-end (longitudinal) processing integrity and performance are usually uppermost in the minds of analysts and software designers; second, mature implementations of well-understood formal pipeline-topology abstractions such as directed acyclic graphs are readily available, as are well-understood end-to-end-oriented quality-control processes and metrics.

However, the collections of input files associated with some large-scale datasets have important side-to-side (latitudinal) structural features, processing steps and quality-control metrics that are not well represented by the longitudinal pipeline metaphor and topology. For example, while processing of a set of samples from a sequencing machine may conclude with perfect end-to-end integrity per data file, unsupervised machine learning (for example, clustering) applied latitudinally to a low-entropy précis of all of the input, intermediate or final data files may identify data features of interest, such as outliers, that are relevant to quality control.
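As a concrete illustration of this kind of latitudinal check, the sketch below flags an outlying sample from a low-entropy précis of a file collection. It is a minimal stand-in, not code from the presentation: the file names and GC-content values are invented, and a robust modified z-score is used here in place of a full clustering step.

```python
import statistics

# Hypothetical low-entropy precis: one GC fraction per sample data file.
# (Names and values are illustrative only.)
precis = {
    "sample01.fastq": 0.41, "sample02.fastq": 0.43, "sample03.fastq": 0.40,
    "sample04.fastq": 0.42, "sample05.fastq": 0.71,  # e.g. a contaminated run
    "sample06.fastq": 0.44, "sample07.fastq": 0.39,
}

def flag_outliers(summary, cutoff=3.5):
    """Flag samples whose precis value deviates strongly from the cohort,
    using a modified z-score based on the median absolute deviation."""
    values = list(summary.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread: nothing stands out
    return [name for name, v in summary.items()
            if abs(0.6745 * (v - med) / mad) > cutoff]

print(flag_outliers(precis))  # ['sample05.fastq']
```

The point is that the check runs across all the files at one processing stage, rather than along any single file's end-to-end path.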
Another example of latitudinal processing and structure is the use of multiple reference frames for sample annotation, rather than a single reference, so that a single stream of processing is refracted into multiple streams, with each stream searching a different reference database and/or using alternate search parameters. Technical steps such as job scheduling and the disposition of intermediate and output files for such "short, wide" (as opposed to "long, narrow") processing streams can be awkward under a pipeline metaphor. For example, pipeline-oriented scripting usually stores and indexes input, intermediate and output files "non-semantically", by hard-mapping the output of each pipeline stage to a file-system folder dedicated to that stage; this does not work well if each folder receives input from multiple threads of processing of the same data (for example, file-name collisions will result).
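A minimal sketch of "semantic" file naming, in which every processing coordinate is encoded in the file name rather than implied by a per-stage folder. The scheme and field names here are illustrative assumptions, not the presentation's actual convention.

```python
from pathlib import Path

# Hypothetical naming scheme: encode each processing coordinate -- sample,
# stage, reference database, parameter set -- in the file name itself, so
# parallel streams over the same data never collide in a shared folder.
def semantic_path(root, sample, stage, reference, params):
    tag = ".".join(f"{k}-{v}" for k, v in sorted(params.items()))
    return Path(root) / f"{sample}.{stage}.{reference}.{tag}.out"

# The same sample, at the same stage, searched against two references with
# different sensitivity settings, yields two distinct file paths:
a = semantic_path("results", "sample01", "blast", "nt", {"evalue": "1e-5"})
b = semantic_path("results", "sample01", "blast", "silva", {"evalue": "1e-10"})
assert a != b  # no collision, even though both land in the same folder
print(a.name)  # sample01.blast.nt.evalue-1e-5.out
```

Because the name carries all of its coordinates, files can also be indexed and queried latitudinally (for example, "all blast outputs against silva") without reference to folder layout.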
We describe some data-prism use cases, and a number of simple techniques we have found useful in implementing the data-prism metaphor: semantic file storage and indexing; a high-level API for tasks such as random sampling and processing large numbers of input files and parameter sets; low-entropy data representations that support a high-level latitudinal view of the data; and the use of a meta-scheduler and command-line mark-up for easier refraction of a single stream of processing into multiple streams (and to reduce the impedance mismatch between the shell command line that many users know and love, and the cluster job-submission systems known and loved by systems administrators).
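The command-line mark-up idea can be sketched as follows. The `{name}` placeholder syntax and the `refract` helper are hypothetical, invented for illustration; they are not the presentation's actual mark-up.

```python
import itertools

# Hypothetical mark-up: {name} placeholders in an ordinary shell command
# are expanded over lists of values, "refracting" one command line into a
# stream of concrete jobs that a meta-scheduler could submit to a cluster.
def refract(template, **axes):
    keys = sorted(axes)
    for combo in itertools.product(*(axes[k] for k in keys)):
        yield template.format(**dict(zip(keys, combo)))

jobs = list(refract(
    "blastn -query {sample} -db {db} -evalue {evalue}",
    sample=["s1.fa", "s2.fa"],
    db=["nt", "silva"],
    evalue=["1e-5"],
))
assert len(jobs) == 4  # 2 samples x 2 reference databases x 1 parameter set
print(jobs[0])  # blastn -query s1.fa -db nt -evalue 1e-5
```

The user writes one familiar shell command; the cross-product over samples, references and parameter sets — the "short, wide" fan-out — is handled by the expansion, not by hand-written submission scripts.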
ABOUT THE AUTHOR

Alan McCulloch is a Bioinformatics Software Engineer at AgResearch's Invermay campus, mainly supporting genetic and genomic databases and high-throughput computational pipelines.