Posted on 2019-05-15, 00:22. Authored by Ryan Chard, Kyle Chard, Ian Foster
Exponential increases in data volumes and velocities are overwhelming finite human capabilities.
Continued progress in science and engineering demands that we automate a broad spectrum of
currently manual research data manipulation tasks, from transfer and sharing to acquisition,
publication, indexing, analysis, and inference. To address these needs we are developing a
full-featured distributed research automation platform, Globus Automate. Globus Automate is designed
to increase productivity and research quality across many science disciplines by allowing scientists
to offload the management of a broad range of data acquisition, manipulation, and analysis tasks to
a cloud-hosted distributed research automation platform.
Globus Automate fills an important and previously unmet need in research cyberinfrastructure (CI).
It addresses the problem of securely and reliably automating, for many thousands of scientists,
sequences of data management tasks that may span locations, storage systems, administrative
domains, and timescales, and integrate both mechanical and human inputs. This is a different
problem to that of programming parallel workflows, as handled by HTCondor [1], Parsl [2], and
Pegasus [3], or building integrated data management systems, as handled by iRODS [4], Rucio [5],
and business rules engines [6]. IFTTT is more similar, being simple, SaaS, and extensible to new
events and actions, but it does not integrate with CI security or resources, or handle sequences of
actions.
Our Automate implementation leverages Amazon Web Services. It uses Step Functions for flow
automation, Simple Queue Service for event delivery, Relational Database Service for persisting
state, and Lambda for managing action executions. These mature, cloud-hosted services simplify
delivery of a reliable, scalable service, allowing us to provide advanced capabilities to many
scientists at modest cost. The Automate service enables the detection of events at
Globus endpoints and other sources; the execution by the cloud-hosted Globus Automate engine of
user-supplied automation flows either manually or as a result of data events; and the invocation of
actions from those automation flows, including actions provided by Globus endpoints and services.
The service is easily extensible by anyone via the definition of new events and actions to meet the
needs of specific communities.
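
The event, flow, and action model described above can be illustrated with a minimal Python sketch. This is not the Globus Automate API or its AWS-based implementation: the `Engine` and `Flow` classes, the action names, and the file paths are all hypothetical. An in-process queue stands in for event delivery and a simple loop stands in for the flow engine. The sketch shows only the pattern: an event arrives, it triggers a user-supplied flow, and the flow invokes a sequence of registered actions.

```python
"""Toy event -> flow -> action pattern. All names here are hypothetical;
this is not the Globus Automate API."""
import queue
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An action receives the flow's running state and returns the updated state.
Action = Callable[[dict], dict]


@dataclass
class Flow:
    """An ordered sequence of named actions."""
    name: str
    steps: List[str]


@dataclass
class Engine:
    """In-process stand-in for a cloud-hosted flow engine (assumption)."""
    actions: Dict[str, Action] = field(default_factory=dict)
    triggers: Dict[str, Flow] = field(default_factory=dict)
    events: queue.Queue = field(default_factory=queue.Queue)

    def register_action(self, name: str, fn: Action) -> None:
        # New actions can be added without changing the engine.
        self.actions[name] = fn

    def on_event(self, event_type: str, flow: Flow) -> None:
        # Bind a flow to an event type so it runs when such an event arrives.
        self.triggers[event_type] = flow

    def run_flow(self, flow: Flow, state: dict) -> dict:
        # Run each step in order, threading the state through the actions.
        state = dict(state)
        history = []
        for step in flow.steps:
            state = self.actions[step](state)
            history.append(step)
        return {**state, "history": history}

    def drain(self) -> List[dict]:
        # Deliver queued events to their flows; ignore events with no trigger.
        results = []
        while not self.events.empty():
            event = self.events.get()
            flow = self.triggers.get(event["type"])
            if flow is not None:
                results.append(self.run_flow(flow, event))
        return results


def transfer(state: dict) -> dict:
    # Pretend to move the file to a compute resource (hypothetical path).
    state["location"] = "hpc:/scratch/" + state["path"].rsplit("/", 1)[-1]
    return state


def analyze(state: dict) -> dict:
    state["result"] = "analysis of " + state["location"]
    return state


def publish(state: dict) -> dict:
    state["published"] = True
    return state


engine = Engine()
for name, fn in [("transfer", transfer), ("analyze", analyze), ("publish", publish)]:
    engine.register_action(name, fn)
engine.on_event("file_created", Flow("reconstruct", ["transfer", "analyze", "publish"]))

# A data event, such as a new file appearing at an instrument, triggers the flow.
engine.events.put({"type": "file_created", "path": "/instrument/scan_001.h5"})
results = engine.drain()
print(results[0]["history"])
```

In this toy version, extending the system means registering another action function or binding another event type to a flow. A hosted service would additionally need authentication, durable state, and retries, none of which are modeled here.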
We have deployed Globus Automate to simplify and accelerate research for a diverse collection of
scientific domains, from Materials Science to Cosmology, and scientific instruments, including light
source synchrotrons and scanning electron microscopes. In this talk we will present Globus Automate,
its architecture, and its prototype implementation. We will frame this talk in the context of a real-world
neurocartography use case in which Globus Automate is used to perform a complex image
reconstruction and analysis pipeline using data obtained from Argonne National Laboratory's
Advanced Photon Source. This flow involves a complex, distributed analysis and publication process
to perform data-driven analyses of large unsectioned brain volumes, generated at more than 20 GB per
minute, using HPC resources.
References
[1] Michael J Litzkow, Miron Livny, and Matt W Mutka. Condor–a hunter of idle workstations. In 8th
International Conference on Distributed Computing Systems, pages 104–111. IEEE, 1988.
[2] Yadu Babuji, Alison Brizius, Kyle Chard, Ian Foster, Daniel S. Katz, Michael Wilde, and Justin
Wozniak. Introducing Parsl: A Python Parallel Scripting Library, August 2017.
[3] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P.J. Maechling, R. Mayani, W. Chen,
R.F. da Silva, M. Livny, et al. Pegasus, a workflow management system for science automation.
Future Generation Computer Systems, 46:17–35, 2015.
[4] Arcot Rajasekar, Reagan Moore, Chien-yi Hou, Christopher A Lee, Richard Marciano, Antoine
de Torcy, Michael Wan, Wayne Schroeder, Sheau-Yen Chen, Lucas Gilbert, Paul Tooby, and Bing
Zhu. iRODS Primer: Integrated rule-oriented data system. Synthesis Lectures on Information
Concepts, Retrieval, and Services, 2(1):1–143, 2010.
[5] Vincent Garonne, R Vigne, G Stewart, M Barisits, M Lassnig, C Serfon, L Goossens, A Nairz,
Atlas Collaboration, et al. Rucio–The next generation of large scale distributed system for ATLAS
data management. In Journal of Physics: Conference Series, volume 513, page 042021. IOP
Publishing, 2014.
[6] Ayman Meidan, Julián Alberto García-García, MJ Escalona, and I Ramos. A survey on business
processes management suites. Computer Standards & Interfaces, 51:71–86, 2017.
ABOUT THE AUTHORS
Ryan Chard joined Argonne National Laboratory in 2016 where he was awarded a Maria Goeppert
Mayer Fellowship. His research focuses on the development of cyberinfrastructure to enable
scientific research. He is particularly interested in automation platforms and in running scientific
applications at scale on the cloud. He has a Ph.D. in Computer Science from Victoria University of
Wellington, New Zealand and a Master of Science from the same university. His research
interests include high performance computing, scientific computing, cloud computing, cloud
economics, and network inference.
Kyle Chard is a Senior Researcher and Fellow in the Computation Institute at the University of
Chicago and Argonne National Laboratory. He received his Ph.D. in computer science from Victoria
University of Wellington. His research interests include distributed meta-scheduling, cloud
computing, economic resource allocation, social computing, and services computing.
Ian Foster is an Argonne Senior Scientist and Distinguished Fellow and the Arthur Holly Compton
Distinguished Service Professor of Computer Science. Ian received a BSc (Hons I) degree from the
University of Canterbury, New Zealand, and a PhD from Imperial College, United Kingdom, both in
computer science. His research deals with distributed, parallel, and data-intensive computing
technologies, and innovative applications of those technologies to scientific problems in such
domains as climate change and biomedicine. Methods and software developed under his
leadership underpin many large national and international cyberinfrastructures. Ian is a fellow of the
American Association for the Advancement of Science, the Association for Computing Machinery,
and the British Computer Society. His awards include the Global Information Infrastructure (GII)
Next Generation award, the British Computer Society's Lovelace Medal, R&D Magazine's Innovator
of the Year, and an honorary doctorate from the University of Canterbury, New Zealand. He was a
co-founder of Univa UD, Inc., a company established to deliver grid and cloud computing solutions.