A Survey of iRODS rules to enforce site policies and enable automated workflows
Presentation posted on 26.02.2021, 00:07 by David Fellinger
ABSTRACT / INTRODUCTION
Data management and curation at research sites have become increasingly complex over the last decade. A large part of this complexity is caused by the ever-increasing number of instruments and sensors producing huge amounts of data. At the same time, researchers and data curators have joined to enforce the FAIR (Findability, Accessibility, Interoperability, and Reusability) Principles for data products. This worldwide initiative will increase the potential for collaboration through data sharing across diverse sites.
Managing large amounts of data requires rule-based automation to ensure that site policies are respected. At the same time, workflows must be enabled from data ingestion through the distribution of data products.
The evolution of iRODS technology
Initially, the basis for iRODS (the Integrated Rule-Oriented Data System) was a university project funded by a government agency. The concept was to build a searchable database whose entries were linked to specific data held in multiple POSIX-compliant file systems. These file systems would appear as a single namespace, so that a user did not have to be concerned with data locality but only with the descriptive metadata provided by a researcher when the file was generated. This original implementation of open-source cataloging was successful to the extent that it became the basis of a commercial product that was sold to various government organizations. The evolution of iRODS as it exists today began with the founding of the iRODS Consortium about 7 years ago. Today iRODS can be described both as a technology platform and, through the consortium, as a vibrant and diverse community with many mutual goals.
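The catalog concept described above can be sketched in a few lines of Python. This is only an illustration of the idea, not iRODS code or its API: the class names `CatalogEntry` and `Catalog`, the paths, and the metadata keys are all hypothetical. Each entry maps a logical path in a single namespace to a physical location on some file system, and searches run against the descriptive metadata rather than the location.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One catalog record: a logical name, where the bytes live, and metadata."""
    logical_path: str
    physical_path: str
    metadata: dict[str, str] = field(default_factory=dict)


class Catalog:
    """Hypothetical in-memory catalog presenting one namespace over many file systems."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # The logical path is the key; the physical location stays hidden from users.
        self._entries[entry.logical_path] = entry

    def locate(self, logical_path: str) -> str:
        # Resolve a logical name to the physical file behind it.
        return self._entries[logical_path].physical_path

    def search(self, **criteria: str) -> list[str]:
        # Return logical paths whose metadata matches every given key/value pair.
        return sorted(
            path
            for path, entry in self._entries.items()
            if all(entry.metadata.get(key) == value for key, value in criteria.items())
        )


catalog = Catalog()
catalog.register(CatalogEntry("/zone/home/alice/run1.dat", "/mnt/fs_a/0001.dat",
                              {"instrument": "sequencer", "project": "genomics"}))
catalog.register(CatalogEntry("/zone/home/bob/run2.dat", "/mnt/fs_b/0042.dat",
                              {"instrument": "microscope", "project": "genomics"}))
```

With this sketch, `catalog.search(project="genomics")` returns both logical paths regardless of which file system holds the data, while `catalog.locate(...)` is needed only when the bytes themselves are read.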
Currently, iRODS is built to support eight essential capabilities: automated data ingestion, data integrity checking, storage tiering, indexing, auditing, provenance tracking, compliance checking, and an interface to publication.
Use cases requiring policy enforcement were described at the most recent iRODS Consortium User Group Meeting [1].
First, let’s look at the Victoria Department of Agriculture in Australia. Their overall goal is nothing less than increasing farm efficiency across the entire state. Their policies include automated data gathering and migration to a processing site so that data products can be analyzed. The goal is to federate iRODS sites located on individual farms. Sensor data will be automatically ingested on the farm, and the aggregate data will then be ingested at a central site, also using iRODS rules to enforce the collection policy [2]. This managed use of eResearch and iRODS will directly affect the GDP of the state.
CyVerse, based in Tucson, is a perfect example of a multinational application of iRODS. The site hosts a diverse range of research data, which began several years ago with plant genomics [3]. The data is mirrored to an iRODS-based site at TACC (Texas Advanced Computing Center) in Austin [4]. Data is ingested from partner sites worldwide, including Melbourne, Sydney, Brisbane, Canberra, Adelaide, Perth, and Hobart in Australia [5]. CyVerse also offers compute services using the “data to compute” model.
The KTH Royal Institute of Technology is using iRODS to move data between GPFS file systems, allowing file system upgrades while maintaining availability [6]. iRODS-based checksumming was employed to ensure that data integrity was maintained through the entire process. Storage at KTH is used to host data from the entire country of Sweden.
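As a rough illustration of checksum-based integrity checking during a migration, a copy can be verified by comparing a digest of each file before and after the move. The sketch below is generic Python and not the iRODS API or the procedure used at KTH; the function names `sha256_checksum` and `verify_migration` are hypothetical, and the choice of SHA-256 with base64 encoding is an assumption made for the example.

```python
import base64
import hashlib
from pathlib import Path


def sha256_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the base64-encoded SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def verify_migration(source: Path, destination: Path) -> bool:
    """Report whether the migrated copy has the same checksum as the source."""
    return sha256_checksum(source) == sha256_checksum(destination)
```

In a real migration the source checksum would normally be recorded in the catalog at ingest time and compared against the destination after the copy, so that the source does not have to be re-read; the sketch recomputes both only to stay self-contained.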
At Utrecht University, site data policies stress data discovery to enable research. To that end, a custom interface called Yoda was written on top of iRODS. “Yoda deploys iRODS as its core component, customized with more than 10,000 lines of iRODS rules”. This site is successfully hosting a very large research data archive [7].
At KU Leuven, iRODS is being used to let researchers work actively with their data, enabling project work before publication. FAIR principles are stressed in their data policies, with iRODS tools for compliance [8].
At Bristol Myers Squibb, iRODS is being used to manage AWS cloud-based data sets to enable worldwide project progress. The iRODS technology is used to manage data flows and to maintain a real-time catalog of the available data, interfacing with AWS Lambda functions [9].
At the NIEHS (National Institute of Environmental Health Sciences), discoverability of diverse data sets with auditable data governance is stressed. Metadata templates have been written to guarantee standardization in the description of the hosted data [10].
There are many research sites, both academic and commercial, that use iRODS to enforce policies. It’s important to note, however, that a site policy does not have to be fully developed from the start. The iRODS open-source technology, with its plug-in framework, allows policies to grow and evolve over time and effectively “future-proof” eResearch archives worldwide.
ABOUT THE AUTHOR
Dave Fellinger is a Data Management Technologist and Storage Scientist with the iRODS Consortium. He has over three decades of engineering experience including film systems, video processing devices, ASIC design and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems. He attended Carnegie-Mellon University and holds patents in diverse areas of technology.
1. The agenda for the iRODS User Group Meeting 2020 is available from: https://irods.org/images/irods_ugm2020_agenda.pdf, accessed 11 November 2020.
2. A presentation from the Victoria Department of Agriculture is available from: https://irods.org/uploads/2020/Murphy-AgVicSmartFarm_Data_Managementslides.pdf, accessed 11 November 2020.
3. A presentation from CyVerse is available from: https://irods.org/uploads/2020/Roberts-CyVerse-Discovery_Environment-slides.pdf, accessed 11 November 2020.
4. A presentation from TACC is available from: https://irods.org/uploads/2020/Jordan-TACCThe_Past_Present_and_Future_of_iRODS_at_TACC-slides.pdf, accessed 11 November 2020.
5. A description of EMBL in Australia is available from: https://www.emblabr.org.au/wp-content/uploads/2016/05/EMBL-nodes-hubs-flyeronline-final.pdf, accessed 11 November 2020.
6. A presentation from KTH is available from: https://irods.org/uploads/2020/KorhonenKTHMigration_Between_GPFS_Filesystems-slides.pdf, accessed 11 November 2020.
7. A presentation from Utrecht University is available from: https://irods.org/uploads/2020/Westerhof-SmeeleUtrechtUniYoda_and_iRODS_Python_rule_engine_plugin-slides.pdf, accessed 11 November 2020.
8. A presentation from KU Leuven is available from: https://irods.org/uploads/2020/Barcena-KULeuvenVSCiRODS_Data_Management_Platform-slides.pdf, accessed 11 November 2020.
9. A presentation from Bristol Myers Squibb is available from: https://irods.org/uploads/2020/ShaikhBMSiRODS_for_scientific_applications_in_AWS_Cloud-slides.pdf, accessed 11 November 2020.
10. A presentation from NIEHS is available from: https://irods.org/uploads/2020/Conway-NIEHSApplications_of_iRODS-slides.pdf, accessed 11 November 2020.