Data pipelines and Prisms
Alan McCulloch
eResearch NZ 2020 (NeSI), 2020-03-10
Presentation
DOI: 10.6084/m9.figshare.11929551.v1
https://eresearchnz.figshare.com/articles/presentation/Data_pipelines_and_Prisms/11929551

We describe a data-prism metaphor for data processing and analysis, contrast it with the data-pipeline metaphor and topology, and describe several use cases.

The pipeline processing metaphor is popular for two main reasons. First, end-to-end (longitudinal) processing integrity and performance are usually uppermost in the minds of analysts and software designers. Second, mature implementations of well-understood formal pipeline-topology abstractions, such as directed acyclic graphs, are readily available, as are well-understood end-to-end quality-control processes and metrics.

However, the collections of input files associated with some large-scale datasets have important side-to-side (latitudinal) structural features, processing steps and quality-control metrics that are not so well represented by the longitudinal pipeline metaphor and topology. For example, while processing of a set of samples from a sequencing machine may conclude with perfect end-to-end integrity per data file, unsupervised machine learning (for example, clustering) applied latitudinally to a low-entropy précis of all of the input, intermediate or final data files may identify data features of interest, such as outliers, that are relevant to quality control.

Another example of latitudinal processing and structure is the use of multiple reference frames for sample annotation, rather than a single reference, so that a single stream of processing is refracted into multiple streams, with each stream searching a different reference database and/or using alternative search parameters. Technical steps such as job scheduling and the disposition of intermediate and output files for such "short, wide" (as opposed to "long, narrow") processing streams can be awkward under a pipeline metaphor. For example, pipeline-oriented scripting usually stores and indexes input, intermediate and output files "non-semantically", by hard-mapping the output of each pipeline stage to a file-system folder dedicated to that stage. This does not work well if each folder receives input from multiple threads of processing of the same data: file-name collisions will result.

We describe some data-prism use cases, and a number of simple techniques we have found useful in implementing the data-prism metaphor: semantic file storage and indexing; a high-level API for tasks such as random sampling and processing large numbers of input files and parameter sets; low-entropy data representations to support a high-level latitudinal view of the data; and a meta-scheduler with command-line mark-up for easier refraction of single streams of processing into multiple streams (and to reduce the impedance mismatch between the shell command line that many users know and love, and the cluster job-submission systems known and loved by systems administrators).

ABOUT THE AUTHOR
Alan McCulloch is a Bioinformatics Software Engineer at AgResearch's Invermay campus, mainly supporting genetic and genomic databases and high-throughput computational pipelines.
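
To make the semantic file storage and indexing idea concrete, the following is a minimal sketch of the general approach, not the implementation described in the talk; the function name, path layout and parameter digest are illustrative assumptions.

```python
import hashlib
import os

def semantic_path(base_dir, sample, stage, reference=None, params=None, ext="out"):
    """Build an output path from the meaning of a result (sample, stage,
    reference database, parameter set) rather than from a per-stage folder,
    so that refracting one sample into several streams cannot collide."""
    # Digest the parameter set so arbitrarily long option strings yield a
    # short, stable path component.
    param_digest = hashlib.sha1(
        repr(sorted((params or {}).items())).encode()
    ).hexdigest()[:8]
    parts = [sample, stage]
    if reference:
        parts.append(reference)
    parts.append(param_digest)
    return os.path.join(base_dir, sample, ".".join(parts) + "." + ext)

# One sample refracted into two streams: same stage, different references.
p1 = semantic_path("results", "sample42", "annotate", "nt", {"evalue": 1e-5})
p2 = semantic_path("results", "sample42", "annotate", "swissprot", {"evalue": 1e-5})
assert p1 != p2  # distinct names, unlike hard-mapped per-stage folders
```

Because each path encodes the full identity of the result rather than the pipeline stage that produced it, multiple refracted streams can write into the same tree, and files can later be indexed and retrieved by sample, reference or parameter set.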