Posted on 2020-03-10, 03:58; authored by David Fellinger
The concept of countrywide and worldwide research collaboratives is relatively new. Several
decades ago it was common for a department head to have multiple vertical file cabinets
with paper folders housing the work of researchers and students in his or her department.
Access to and subsequent citation of this work were generally based on the department head’s
knowledge of the works. As digital storage technologies became less expensive and
relatively ubiquitous, the vertical files turned into disk storage systems reflecting the work of
each university department. The works were still filed and maintained by standard file
system references such as creation date, name, and access controls. In many cases, card
catalogs or spreadsheets were used to further describe the titles. The introduction of
Ethernet in the late 1970s largely changed the manner in which research works were
conserved. The deployment of campus-wide data networks enabled universities to establish
and maintain central data repositories. Storage could become a service of the university
where individual colleges or departments no longer had to maintain their own archival
systems. The era of the digital research collaborative was born. In many cases, this
transition took years, and even today, some university departments retain internal storage.
Locating a specific work based upon anything other than title was a challenge, and that
problem grew with the number of works that were archived.
The Advent of Storage Management Technology
A digital file system is really just a means for storing and maintaining data, much as a set of
shelves is a means to hold books. What is actually required is a way to relate descriptive
data to files, indicating the contents of each file. This need has been understood for libraries
containing shelves of books for thousands of years, dating back to 2000 BC [1]. In
the United States, the Defense Advanced Research Projects Agency (DARPA) funded a
program called the Storage Resource Broker (SRB) in 1995 and 1996, and the first
middleware to identify works based on content and user-defined metadata was written. In
2006 the DICE group, a group of research institutions in the US, created the Integrated
Rule-Oriented Data System (iRODS), expanding on the concepts of SRB, and in 2013 the iRODS
Consortium was formed as a user-supported community devoted to the long-term
continuation of this open-source middleware. This project, first launched 25 years ago,
has spawned software that is being used to manage data archives worldwide. The
iRODS software can completely virtualize entire file system infrastructures so that storage
purchased from any vendor at any time can be made to appear as one effective file system.
Researchers no longer have to be concerned with the location of data but just the contents.
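The virtualization described above can be pictured with a toy model. The following Python sketch is not iRODS code and does not use any iRODS API; the class names, resource names, and paths are all hypothetical. It only illustrates the idea that a catalog maps a stable logical path to whatever physical location currently holds the bytes, so data can move between storage systems without researchers noticing.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class StorageResource:
    """Hypothetical storage system from any vendor."""
    name: str
    blobs: dict = field(default_factory=dict)  # physical path -> bytes


class VirtualNamespace:
    """Toy catalog mapping logical paths to (resource, physical path)."""

    def __init__(self):
        self.resources = {}
        self.catalog = {}
        self._ids = count(1)  # source of unique physical file names

    def add_resource(self, resource):
        self.resources[resource.name] = resource

    def _store(self, resource_name, data):
        # Write the bytes to the named resource and return where they landed.
        physical_path = f"/{resource_name}/vault/{next(self._ids):06d}"
        self.resources[resource_name].blobs[physical_path] = data
        return physical_path

    def put(self, logical_path, data, resource_name):
        self.catalog[logical_path] = (resource_name, self._store(resource_name, data))

    def get(self, logical_path):
        # Callers supply only the logical path; the catalog finds the bytes.
        resource_name, physical_path = self.catalog[logical_path]
        return self.resources[resource_name].blobs[physical_path]

    def migrate(self, logical_path, target_name):
        # Copy to the new resource first, then remove the old copy.
        old_name, old_path = self.catalog[logical_path]
        new_path = self._store(target_name, self.get(logical_path))
        del self.resources[old_name].blobs[old_path]
        self.catalog[logical_path] = (target_name, new_path)


ns = VirtualNamespace()
ns.add_resource(StorageResource("disk_array"))
ns.add_resource(StorageResource("object_store"))
ns.put("/lab/genomes/sample1.fastq", b"ACGT", "disk_array")
ns.migrate("/lab/genomes/sample1.fastq", "object_store")
data = ns.get("/lab/genomes/sample1.fastq")  # same logical path, new location
```

After the migration the logical path is unchanged, which is the property that lets storage purchased at different times appear as one file system.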
Data discovery is one of the primary features of iRODS. A researcher can specify search
terms retained in an index that allows other researchers to discover that research work. The
process of building an index does not necessarily require human intervention. Metadata can
be automatically extracted from files at rest or while being ingested to enable
discoverability. In fact, complete workflow automation can be realized with iRODS. Data can
be automatically ingested from numerous sensors and routed, based on content and
policies, to specific compute platforms for analysis. The subsequent data products can then
be distributed based on policy. Data products can be published according to policies
associated with the collections under management. All of this functionality can be audited in
real time to precisely track the operation of a data center.
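As a rough illustration of the automated workflow just described, the Python sketch below ingests files, extracts metadata without human input, routes each file according to a policy table, and records every step in an audit log. It is a simplified model under stated assumptions rather than iRODS itself: the policy table, destination names, and `Catalog` class are hypothetical, and the metadata is stored as (attribute, value, unit) triples, loosely modeled on the form iRODS uses for user-defined metadata.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class DataObject:
    path: str
    content: bytes
    metadata: list = field(default_factory=list)  # (attribute, value, unit)


def extract_metadata(obj):
    """Derive descriptive metadata from the file itself."""
    suffix = PurePosixPath(obj.path).suffix.lstrip(".").lower() or "unknown"
    return [("format", suffix, ""), ("size", str(len(obj.content)), "bytes")]


# Hypothetical policies: (predicate over metadata, destination platform).
POLICIES = [
    (lambda md: ("format", "fastq", "") in md, "genomics-cluster"),
    (lambda md: ("format", "csv", "") in md, "statistics-node"),
]
DEFAULT_DESTINATION = "archive"


class Catalog:
    def __init__(self):
        self.objects = []
        self.audit_log = []  # (action, path, detail)

    def ingest(self, path, content):
        obj = DataObject(path, content)
        obj.metadata = extract_metadata(obj)
        self.audit_log.append(("extract", path, f"{len(obj.metadata)} triples"))
        # The first matching policy decides where the data is sent.
        destination = next(
            (dest for matches, dest in POLICIES if matches(obj.metadata)),
            DEFAULT_DESTINATION,
        )
        obj.metadata.append(("routed_to", destination, ""))
        self.audit_log.append(("route", path, destination))
        self.objects.append(obj)
        return destination

    def search(self, attribute, value):
        """Discover objects by metadata rather than by name or location."""
        return [
            o.path
            for o in self.objects
            if any(a == attribute and v == value for a, v, _ in o.metadata)
        ]


catalog = Catalog()
catalog.ingest("/sensors/seq01/run7.fastq", b"@read1\nACGT\n")
catalog.ingest("/sensors/weather/2020-03.csv", b"t,temp\n0,4.1\n")
catalog.ingest("/notes/readme", b"hello")
fastq_hits = catalog.search("format", "fastq")
archived = catalog.search("routed_to", "archive")
```

Keeping the policies in a table separate from the ingest code mirrors the point made above: the rules can change without touching the pipeline, and the audit log shows exactly which rule was applied to which file.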
Bandwidth Availability Enables Global Collaboration
The deployment of 100 Gbps Ethernet wide area networks across many universities
launched a new era of research data communication. Initially, all data operations were
relegated to one campus or entity simply due to the limitations of communication
technology. While it was possible to transfer files by way of File Transfer Protocol (FTP)
technologies, it was not easy to create indices that spanned federated collections,
allowing data to be discovered or easily accessed. The secure federation capabilities of
iRODS have changed the way that we think of data locality. One of the key focuses of iRODS
development has been to enable federated collaboration. When the administrators of two
iRODS sites share a set of keys, the two sites, with permissions, can appear as one. The
researcher or administrator can assign access controls for local and WAN access. A user in a
remote zone can easily discover data through access to user-defined metadata. A file
transfer can then be enabled with the iRODS servers brokering a direct transfer to the
requesting client. A researcher can even share data with non-iRODS users by issuing a secure
ticket for a specific file or files.
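The federation and ticket ideas above can be sketched in a few lines of Python. This is a conceptual model only and makes no claim about how iRODS implements federation: the `Zone` class, the `federate` function, and the zone and user names are hypothetical, and the real key exchange is reduced to each zone recording that it trusts the other. Users are written as `name#zone`, a notation assumed here for the sketch.

```python
import secrets


class Zone:
    """Hypothetical stand-in for one site's catalog."""

    def __init__(self, name):
        self.name = name
        self.files = {}       # path -> {"data", "metadata", "readers"}
        self.trusted = set()  # zones with which keys have been exchanged
        self.tickets = {}     # ticket string -> path

    def put(self, path, data, metadata, readers=()):
        self.files[path] = {
            "data": data,
            "metadata": dict(metadata),
            "readers": set(readers),
        }

    def search(self, user, attribute, value):
        """Return matching paths that `user` is permitted to read."""
        _, _, home_zone = user.partition("#")
        if home_zone != self.name and home_zone not in self.trusted:
            return []  # remote zone is not federated with this one
        return sorted(
            path
            for path, f in self.files.items()
            if f["metadata"].get(attribute) == value and user in f["readers"]
        )

    def issue_ticket(self, path):
        """Create an unguessable token granting read access to one file."""
        if path not in self.files:
            raise KeyError(path)
        ticket = secrets.token_urlsafe(16)
        self.tickets[ticket] = path
        return ticket

    def read_with_ticket(self, ticket):
        if ticket not in self.tickets:
            raise PermissionError("invalid ticket")
        return self.files[self.tickets[ticket]]["data"]


def federate(zone_a, zone_b):
    # Stands in for the administrators of both sites sharing keys.
    zone_a.trusted.add(zone_b.name)
    zone_b.trusted.add(zone_a.name)


zone_a = Zone("zoneA")
zone_b = Zone("zoneB")
zone_a.put(
    "/zoneA/maize/leaf.tif",
    b"image-bytes",
    {"species": "Zea mays"},
    readers={"ana#zoneA", "ben#zoneB"},
)
before = zone_a.search("ben#zoneB", "species", "Zea mays")  # not yet federated
federate(zone_a, zone_b)
after = zone_a.search("ben#zoneB", "species", "Zea mays")
ticket = zone_a.issue_ticket("/zoneA/maize/leaf.tif")
shared = zone_a.read_with_ticket(ticket)  # no account needed, only the ticket
```

Two checks are deliberately independent in this model: federation decides whether a remote zone is trusted at all, while per-file access controls decide what a given remote user may see.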
A New Era of Data Sharing is Underway
Large-scale iRODS deployments span the world and have enabled collaborations of multinational scientists and researchers. In the US, the iPlant Collaborative was formed in 2008
with funding from the National Science Foundation. Data management was based on iRODS
from the start of the project and it initially served the plant science communities primarily in
the US. From its inception, iPlant quickly grew into a mature organization providing
powerful resources and offering scientific and technical support services to researchers nationally and internationally. In 2015, iPlant was rebranded to CyVerse to emphasize an
expanded mission to serve all life sciences [2]. Today CyVerse serves over 47,000 users with
5,690 participating academic institutions and 2,438 non-academic institutions. A major
feature of the collaborative is the Discovery Environment (DE) which allows researchers to
quickly find files of interest relating to their life science discipline. The primary site is in
Tucson, Arizona, with a mirror at the Texas Advanced Computing Center in Austin, Texas. Both data
management and workflow control are enabled by the use of iRODS.
In Europe, the EUDAT Collaborative Data Infrastructure (CDI) was formed to host the data of
over 50 universities and research institutions in the European Union. The infrastructure is
managed under iRODS and the data covers over 30 scientific disciplines from atmospheric
research to physics, hydro-meteorology, genomics, and ecology. As with CyVerse, a major
feature of EUDAT is data discovery across the entire geography of the EU. The goal is to
provide both data access and re-use for near-term needs as well as data preservation to
build a long-term archive [3].
In the Netherlands, SURF has built a data management framework based on iRODS.
Data from several universities across the country is stored at its data site. Besides
offering data storage and management, SURF also offers data processing and analysis as well
as compute services. All of the data at the site is moved to various platforms and tiers using
iRODS [4]. SURF, along with several universities in the Netherlands, is a member of the
iRODS community.
In Sweden, the Swedish National Infrastructure for Computing (SNIC) is a national research
infrastructure that makes large-scale high-performance computing resources,
storage capacity, and advanced user support available to Swedish researchers. This service is
managed under iRODS control [5]. It uses the Swedish University Network
(SUNET), which links the infrastructure at the KTH Royal Institute of Technology to other
universities in Sweden with a 100 Gbps link to facilitate data movement [6].
These are just a few of the iRODS deployments in both the academic and research sectors.
The use of iRODS and its discovery capabilities accelerates scientific research, allowing
researchers to quickly find relevant materials and build on them. The power of iRODS to
manage data based on collection policies cannot be overstated as data sets grow and
automation becomes a requirement. Many universities, libraries, museums, and
companies worldwide have chosen iRODS as a technology that allows the “future-proofing” of data
collections independent of the evolution of storage. These institutions have realized that
their data policy decisions can be maintained by iRODS at any scale regardless of the change
of data storage or networking technologies over time.
ABOUT THE AUTHOR
Dave Fellinger is a Data Management Technologist and Storage Scientist with the iRODS
Consortium. In his role at the iRODS Consortium, Dave works with users at research sites and
high-performance computing centers to confirm that a broad range of use cases can be fully
addressed by the iRODS feature set. He helped to launch the iRODS Consortium and was a
member of the founding board.
References
1. The history of the card catalog is available from https://www.vox.com/culture/2017/4/21/15357984/card-catalog-library-ofcongress-history (accessed 2 November 2019).
2. The history of CyVerse is available from https://www.cyverse.org/about (accessed 9 October 2019).
3. Information regarding EUDAT is available from https://www.eudat.eu/eudat-cdi (accessed 8 October 2019).
4. Information regarding SURF is available from https://www.surf.nl/en/research-ict (accessed 9 October 2019).
5. Information regarding SNIC is available from https://www.snic.se/ (accessed 9 October 2019).
6. Information regarding SUNET is available from https://www.sunet.se/about-sunet/ (accessed 9 October 2019).