Posted on 2021-02-26, 00:07. Authored by David Fellinger.
ABSTRACT / INTRODUCTION
Data management and curation at research sites have become increasingly complex over the last decade.
A large portion of this complexity is caused by the ever-increasing number of instruments and sensors
producing huge amounts of data. At the same time, researchers and data curators have joined to enforce
the FAIR (Findability, Accessibility, Interoperability, and Reuse) Principles for data products. This is a worldwide
initiative that will increase the potential for collaboration through data sharing across diverse sites.
Managing large amounts of data requires rules-based automation to ensure that site policies are respected
and, at the same time, that workflows are enabled from data ingestion through the distribution of data
products.
The evolution of iRODS technology
Initially, the basis for iRODS (The Integrated Rule-Oriented Data System) was a university project funded by
a government agency. The concept was to build a searchable database with entries that were linked to
specific data held in multiple POSIX-compliant file systems. These file systems would appear as a
single namespace, so that a user did not have to be concerned with data locality but only with the
descriptive metadata provided by a researcher when the file was generated. This original
implementation of open source cataloging was successful to the extent that it became the basis of a
commercial product that was sold to various government organizations. The evolution of iRODS as it exists
today began with the founding of the iRODS Consortium about seven years ago. Today, iRODS can be described
both as a technology platform and, through the consortium, as a vibrant and diverse community with many
mutual goals.
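The catalog idea described above can be illustrated with a minimal sketch. This is a toy in-memory model, not the iRODS API: the class names, paths, and metadata keys are all hypothetical and chosen only for illustration. It shows logical paths in a single namespace being resolved to physical locations on different file systems, and data being found by descriptive metadata rather than by location.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One catalog record: a logical path, where the bytes live, and metadata."""
    logical_path: str
    physical_path: str  # location on one of several POSIX file systems
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Toy in-memory catalog presenting many file systems as one namespace."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_path, physical_path, **metadata):
        """Record a data object and the descriptive metadata supplied with it."""
        self._entries[logical_path] = CatalogEntry(logical_path, physical_path, metadata)

    def resolve(self, logical_path):
        """Return the physical location behind a logical path."""
        return self._entries[logical_path].physical_path

    def search(self, **criteria):
        """Return logical paths whose metadata matches every given key/value."""
        return sorted(
            entry.logical_path
            for entry in self._entries.values()
            if all(entry.metadata.get(k) == v for k, v in criteria.items())
        )


# Hypothetical paths and metadata, for illustration only.
catalog = Catalog()
catalog.register("/zone/home/alice/run1.dat", "/mnt/fs_a/vault/run1.dat",
                 instrument="sequencer", project="wheat")
catalog.register("/zone/home/alice/run2.dat", "/mnt/fs_b/vault/run2.dat",
                 instrument="microscope", project="wheat")

# The user searches by metadata and never needs to know which file system holds the data.
print(catalog.search(project="wheat"))
print(catalog.resolve("/zone/home/alice/run2.dat"))
```

A production catalog would sit in a relational database and handle permissions, replicas, and concurrency. The sketch only captures the separation between the logical namespace and physical storage.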
Currently, iRODS is built to support eight essential capabilities. These include automated data ingestion, data
integrity checking, storage tiering, and auditing, which in turn enables indexing, provenance tracking, and compliance
checking. Finally, an interface to publication completes the feature set.
Use cases requiring policy enforcement were described at the most recent iRODS Consortium User Group
Meeting [1].
First, let’s look at the Victoria Department of Agriculture in Australia. Their overall goal is nothing less than
increasing farm efficiency across the entire state. Their policies include automated data gathering and migration
to a processing site so that data products can be analyzed. The goal is to federate iRODS installations
located on individual farms. Sensor data will be automatically ingested on the farm, and the aggregate data
will then be ingested at a central site, also using iRODS rules to enforce the collection policy [2]. This
managed use of eResearch and iRODS is expected to directly affect the GDP of the state.
CyVerse, based in Tucson, is a good example of a multinational application of iRODS. The site hosts a
diverse range of research data, beginning several years ago with plant genomics [3]. The data is mirrored to an
iRODS-based site at TACC (Texas Advanced Computing Center) in Austin [4]. Data is ingested from partner
sites worldwide, including Melbourne, Sydney, Brisbane, Canberra, Adelaide, Perth, and Hobart in Australia
[5]. CyVerse also offers compute services using the “data to compute” model.
The KTH Royal Institute of Technology is using iRODS to move data between GPFS file systems,
allowing file system upgrades while maintaining availability [6]. iRODS-based checksumming was
employed to ensure that data integrity was maintained throughout the process. Storage at KTH is used
to host data from all of Sweden.
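The checksum-verified migration described above can be sketched in general terms. The example below is an assumption-laden illustration using only the Python standard library. It does not use iRODS or GPFS, and the function names, the choice of SHA-256, and the temporary directories standing in for the old and new file systems are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest by streaming the file in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_with_verification(source, destination):
    """Copy a file and confirm the copy matches the source checksum."""
    expected = sha256_of(source)
    Path(destination).parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"checksum mismatch for {destination}")
    return actual


# Two temporary directories stand in for the old and new file systems.
with tempfile.TemporaryDirectory() as workdir:
    old_copy = Path(workdir) / "old_fs" / "sample.dat"
    new_copy = Path(workdir) / "new_fs" / "sample.dat"
    old_copy.parent.mkdir(parents=True)
    old_copy.write_bytes(b"example payload")
    checksum = migrate_with_verification(old_copy, new_copy)
    copy_exists = new_copy.exists()

print(checksum)
```

In a sketch like this the source is removed only after verification succeeds, so a failed copy never leaves the data without a known-good replica.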
At Utrecht University, site data policies stress data discovery to enable research. To that end, a custom
interface called Yoda was written on top of iRODS. “Yoda deploys iRODS as its core component,
customized with more than 10,000 lines of iRODS rules”. This site is successfully hosting a very large
research data archive [7].
At KU Leuven, iRODS is being used to give researchers active use of their data, enabling project
work before publication. FAIR principles are stressed in their data policies, with iRODS tools supporting compliance
[8].
At Bristol Myers Squibb, iRODS is being used to manage AWS cloud-based data sets to enable worldwide
project progress. The iRODS technology is used to manage data flows and to maintain a catalog of the
available data in real time, interfacing with AWS Lambda functions [9].
At the NIEHS (National Institute of Environmental Health Sciences), discoverability of diverse data sets with
auditable data governance is stressed. Metadata templates have been written to guarantee standardization
in the description of the hosted data [10].
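How a metadata template can enforce a standardized description may be easier to see in a small sketch. The template below is entirely hypothetical: the field names, types, and controlled vocabulary are assumptions made for illustration and are not taken from the NIEHS templates or from iRODS.

```python
# Hypothetical template: each field declares a type, whether it is required,
# and optionally a controlled vocabulary of allowed values.
TEMPLATE = {
    "title": {"type": str, "required": True},
    "study_id": {"type": str, "required": True},
    "species": {"type": str, "required": True, "allowed": {"human", "mouse", "rat"}},
    "sample_count": {"type": int, "required": False},
}


def validate_metadata(metadata, template):
    """Return a list of problems; an empty list means the metadata conforms."""
    problems = []
    for name, rule in template.items():
        if name not in metadata:
            if rule.get("required", False):
                problems.append(f"missing required field: {name}")
            continue
        value = metadata[name]
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        allowed = rule.get("allowed")
        if allowed is not None and value not in allowed:
            problems.append(f"{name}: value {value!r} not in controlled vocabulary")
    # Fields outside the template are rejected so descriptions stay uniform.
    for name in metadata:
        if name not in template:
            problems.append(f"unknown field: {name}")
    return problems


good = {"title": "Exposure study", "study_id": "S-001", "species": "mouse", "sample_count": 48}
bad = {"title": "Exposure study", "species": "cat", "extra": 1}

print(validate_metadata(good, TEMPLATE))
print(validate_metadata(bad, TEMPLATE))
```

A check of this kind would typically run at ingest time, so that data with a nonconforming description is flagged before it enters the archive.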
There are many research sites, both academic and commercial, that use iRODS to enforce policies. It’s
important to note, however, that a site policy does not have to be fully developed from the start. The iRODS open
source technology, with its plug-in framework, allows policies to grow and evolve over time and effectively
“future-proofs” eResearch archives worldwide.
ABOUT THE AUTHOR
Dave Fellinger is a Data Management Technologist and Storage Scientist with the iRODS Consortium. He has
over three decades of engineering experience including film systems, video processing devices, ASIC design
and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems. He
attended Carnegie Mellon University and holds patents in diverse areas of technology.
REFERENCES
1. The agenda for the iRODS User Group Meeting 2020 is available from: https://irods.org/images/irods_ugm2020_agenda.pdf, accessed 11 November 2020
2. A presentation from the Victoria Department of Agriculture is available from: https://irods.org/uploads/2020/Murphy-AgVicSmartFarm_Data_Managementslides.pdf, accessed 11 November 2020
3. A presentation from CyVerse is available from: https://irods.org/uploads/2020/Roberts-CyVerse-Discovery_Environment-slides.pdf, accessed 11 November 2020
4. A presentation from TACC is available from: https://irods.org/uploads/2020/Jordan-TACCThe_Past_Present_and_Future_of_iRODS_at_TACC-slides.pdf, accessed 11 November 2020
5. A description of EMBL in Australia is available from: https://www.emblabr.org.au/wp-content/uploads/2016/05/EMBL-nodes-hubs-flyeronline-final.pdf, accessed 11 November 2020
6. A presentation from KTH is available from: https://irods.org/uploads/2020/KorhonenKTHMigration_Between_GPFS_Filesystems-slides.pdf, accessed 11 November 2020
7. A presentation from Utrecht University is available from: https://irods.org/uploads/2020/Westerhof-SmeeleUtrechtUniYoda_and_iRODS_Python_rule_engine_plugin-slides.pdf, accessed 11 November 2020
8. A presentation from KU Leuven is available from: https://irods.org/uploads/2020/Barcena-KULeuvenVSCiRODS_Data_Management_Platform-slides.pdf, accessed 11 November 2020
9. A presentation from Bristol Myers Squibb is available from: https://irods.org/uploads/2020/ShaikhBMSiRODS_for_scientific_applications_in_AWS_Cloud-slides.pdf, accessed 11 November 2020
10. A presentation from NIEHS is available from: https://irods.org/uploads/2020/Conway-NIEHSApplications_of_iRODS-slides.pdf, accessed 11 November 2020