Gladier - A programmable data capture, storage, and analysis architecture for experimental facilities
Presentation posted on 2021-02-26, 00:08, authored by Kyle Chard, Ian Foster
ABSTRACT / INTRODUCTION
The extraordinary volume and velocity of data produced by scientific instruments present new challenges in organizing, processing, and sharing data efficiently without overburdening researchers. To address these needs we are developing Gladier (Globus Architecture for Data-Intensive Experimental Research), a data architecture that enables the rapid development of customized data capture, storage, and analysis solutions for experimental facilities. We have deployed Gladier at Argonne’s Advanced Photon Source (APS) and Leadership Computing Facility (ALCF) to enable various solutions, including: delivery of data produced during tomographic experiments to remote collaborators; capture, analysis, and cataloging of data from X-ray Photon Correlation Spectroscopy (XPCS) experiments; and feedback based on analysis of data from serial synchrotron crystallography (SSX) experiments to guide data acquisition.
The Gladier architecture leverages a data/computing substrate based on data and compute agents deployed across computer and storage systems at APS, ALCF, and elsewhere, all managed by cloud-hosted Globus services. All components are supported by the Globus Auth identity and access management platform to enable single sign-on and secure interactions between components. This substrate makes it easy for programmers to route data and compute requests to different storage systems and computers. Other services support the definition and management of flows that coordinate data transfer, analysis, cataloging, and other activities associated with experiments. Each service can be accessed via REST APIs, from Python via a simple client library (which calls the REST APIs), or both. Scientists can then develop experiment-specific data solutions by coding to these APIs or the library, or reuse or adapt solutions developed by others. Importantly, both the overall architecture and specific solutions can easily be replicated at other institutions and extended to provide additional capabilities.
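To make the idea of a flow concrete, the following is a minimal sketch in plain Python of an ordered sequence of steps (transfer, analysis, cataloging) that pass results through a shared state. It does not call any Globus service or client library; the `Flow` class, the step functions, the file name, and the destination string are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: none of these names come from Globus or Gladier.
Step = Callable[[Dict], Dict]


@dataclass
class Flow:
    """An ordered list of named steps that share one state dictionary."""
    steps: List[Tuple[str, Step]] = field(default_factory=list)

    def add(self, name: str, step: Step) -> "Flow":
        self.steps.append((name, step))
        return self

    def run(self, state: Dict) -> Dict:
        state = dict(state)          # work on a copy of the input
        state["log"] = []            # names of the steps that completed
        for name, step in self.steps:
            state.update(step(state))
            state["log"].append(name)
        return state


def transfer(state: Dict) -> Dict:
    # Stand-in for moving a file from the instrument to a compute system.
    return {"location": f"{state['destination']}/{state['filename']}"}


def analyze(state: Dict) -> Dict:
    # Placeholder analysis: the mean of the raw values.
    values = state["values"]
    return {"result": sum(values) / len(values)}


def catalog(state: Dict) -> Dict:
    # Stand-in for publishing a searchable record of the processed data.
    return {"record": {"path": state["location"], "result": state["result"]}}


flow = Flow().add("transfer", transfer).add("analyze", analyze).add("catalog", catalog)
final = flow.run({
    "filename": "scan_001.h5",       # assumed example file name
    "destination": "hpc:/scratch",   # assumed example destination
    "values": [1.0, 2.0, 3.0],
})
print(final["log"], final["record"])
```

In a real deployment each step would presumably be a request to a remote service rather than a local function, but the control structure, with each step consuming the outputs of earlier ones, stays the same.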
We describe three examples to illustrate how Gladier can be used to implement powerful data collection, analysis, and cataloging capabilities.
1. DMagic: Automated data delivery to experimentalists. The DMagic system uses a combination of Globus APIs and APS administrative APIs to 1) automatically create and configure shared storage space on the ALCF Petrel data service before an experiment begins; and 2) automatically copy over experimental data from the beamline to Petrel storage as they are produced during the experiment.
2. XPCS data collection, analysis, and cataloging. This example uses Globus Automate to automatically collect data from an XPCS experiment, transfer the data to an HPC computer for processing, and then load the processed data into a catalog, where authorized individuals can search and retrieve them.
3. Rapid feedback for SSX experiments. This example guides SSX experiments by generating statistics and images of the sample being processed and providing them to the scientists in near real-time. These results can then be used to determine whether enough data have been collected for a sample, whether a second sample is needed to produce suitable statistics, or whether the sample is producing enough data to warrant continued processing.
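The three decisions described for the SSX example can be sketched as a small rule applied to running statistics. This is only an illustration: the function name, the use of a hit count and hit rate as the statistics, and every threshold value are assumptions, not details taken from the deployed system.

```python
def ssx_feedback(images_processed: int, hits: int,
                 target_hits: int = 1000,      # assumed: hits needed for suitable statistics
                 min_hit_rate: float = 0.02,   # assumed: lowest hit rate worth continuing
                 min_images: int = 500) -> str:  # assumed: images needed to trust the rate
    """Return a suggested action from running statistics for one sample."""
    hit_rate = hits / images_processed if images_processed else 0.0

    if hits >= target_hits:
        # Enough data have been collected for this sample.
        return "enough data collected"
    if images_processed >= min_images and hit_rate < min_hit_rate:
        # The sample is not producing enough data to warrant continuing.
        return "switch to a second sample"
    # Either too early to judge, or the sample is productive but not yet done.
    return "continue collecting"


print(ssx_feedback(images_processed=5000, hits=1200))
print(ssx_feedback(images_processed=2000, hits=10))
print(ssx_feedback(images_processed=100, hits=1))
```

Requiring a minimum number of processed images before judging the hit rate avoids abandoning a sample on the basis of the first few frames.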
In this talk we will present the Gladier architecture, highlight the major components used in the architecture, discuss three example data solutions deployed at APS and ALCF, and describe how the Gladier architecture can be replicated in other environments.
ABOUT THE AUTHOR
Kyle Chard is a Research Assistant Professor in the Department of Computer Science at the University of Chicago. He also holds a joint appointment at Argonne National Laboratory. His research focuses on a broad range of problems in data-intensive computing and research data management. He leads various projects related to distributed and parallel computing, scientific reproducibility, research automation, and cost-aware use of cloud infrastructure.
Ian Foster is the Director of Argonne’s Data Science and Learning Division, Argonne Senior Scientist and Distinguished Fellow, and the Arthur Holly Compton Distinguished Service Professor of Computer Science at the University of Chicago. Foster’s research contributions span high-performance computing, distributed systems, and data-driven discovery. He has published hundreds of scientific papers and eight books on these and other topics. Methods and software developed under his leadership underpin many large national and international cyberinfrastructures.