Building a Federated Research Collaborative
Presentation posted on 10.03.2020, 03:58 by David Fellinger
The concept of countrywide and worldwide research collaboratives is relatively new. Several decades ago it was common for a department head to have multiple vertical file cabinets with paper folders housing the work of researchers and students in his or her department. Access to this work, and subsequent citation of it, generally depended on the department head’s knowledge of the works. As digital storage technologies became less expensive and relatively ubiquitous, the vertical files turned into disk storage systems reflecting the work of each university department. The works were still filed and maintained by standard file system attributes such as creation date, name, and access controls. In many cases, card catalogs or spreadsheets were used to further describe the titles. The introduction of Ethernet in the late 1970s largely changed the manner in which research works were conserved. The deployment of campus-wide data networks enabled universities to establish and maintain central data repositories. Storage could become a service of the university, so individual colleges or departments no longer had to maintain their own archival systems. The era of the digital research collaborative was born. In many cases this transition took years, and even today some university departments retain internal storage. Locating a specific work based upon anything other than its title remained a challenge, and that problem grew with the number of works that were archived.
The Advent of Storage Management Technology
A digital file system is really just a means of storing and maintaining data, much as a set of shelves is a means of holding books. What is actually required is a way to relate descriptive data to files, indicating the contents of each file. This need has been understood for libraries of shelved books for thousands of years, dating back to 2000 BC [1]. In the United States, the Defense Advanced Research Projects Agency (DARPA) funded a program called the Storage Resource Broker (SRB) in 1995 and 1996, and the first middleware to identify works based on content and user-defined metadata was written. In 2006 the DICE group, a group of research institutions in the US, created the Integrated Rule-Oriented Data System (iRODS), expanding on the concepts of SRB, and in 2013 the iRODS Consortium was formed as a user-supported community devoted to the long-term continuation of this open source middleware. This project, first launched 25 years ago, has spawned software that is being used to manage data archives worldwide. The iRODS software can completely virtualize entire file system infrastructures so that storage purchased from any vendor at any time can be made to appear as one effective file system. Researchers no longer have to be concerned with the location of data, only with its contents. Data discovery is one of the primary features of iRODS. A researcher can specify search terms, retained in an index, that allow other researchers to discover that research work. The process of building an index does not necessarily require human intervention. Metadata can be automatically extracted from files at rest or while they are being ingested to enable discoverability. In fact, complete workflow automation can be realized with iRODS. Data can be automatically ingested from numerous sensors and routed, based on content and policies, to specific compute platforms for analysis. The subsequent data products can then be distributed based on policy.
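The idea of one logical namespace over storage from many vendors, searchable by descriptive metadata, can be illustrated with a small sketch. This is a toy model in plain Python, not the iRODS API: the `Catalog` and `DataObject` classes, the resource names, and the paths are all hypothetical and exist only to show how a logical name can be decoupled from a physical location.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_path: str   # the name researchers see and search for
    resource: str       # hypothetical storage system holding the bytes
    physical_path: str  # where the bytes live on that storage system
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Toy catalog: one logical namespace over several storage resources."""

    def __init__(self):
        self._objects = {}

    def register(self, obj):
        # Record the object under its logical path, whatever vendor stores it.
        self._objects[obj.logical_path] = obj

    def locate(self, logical_path):
        # Resolve a logical name to (resource, physical path).
        obj = self._objects[logical_path]
        return obj.resource, obj.physical_path

    def search(self, **criteria):
        # Return logical paths whose metadata matches every given key/value.
        return sorted(
            path
            for path, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


catalog = Catalog()
catalog.register(DataObject(
    "/lab/maize/run1.fastq", "vendor_a_disk", "/mnt/a/0001.dat",
    {"species": "Zea mays", "instrument": "sequencer-1"},
))
catalog.register(DataObject(
    "/lab/maize/run2.fastq", "vendor_b_tape", "/tape/7/0042.dat",
    {"species": "Zea mays", "instrument": "sequencer-2"},
))
catalog.register(DataObject(
    "/lab/rice/run1.fastq", "vendor_a_disk", "/mnt/a/0002.dat",
    {"species": "Oryza sativa", "instrument": "sequencer-1"},
))

# Discovery is by content description, not by storage location.
maize_runs = catalog.search(species="Zea mays")
print(maize_runs)
print(catalog.locate(maize_runs[1]))
```

A researcher searching for `species="Zea mays"` receives both runs even though one sits on disk and the other on tape; only the catalog needs to know the difference.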
Data products can be published according to policies associated with the collections under management. All of this functionality can be audited in real time to precisely track the operation of a data center.
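Automated metadata extraction, policy-based routing, and auditing can be sketched in a few lines. The example below is a simplified illustration in plain Python and does not use the iRODS rule engine: the policy table, the destination names, and the `ingest` function are assumptions made for this sketch.

```python
from datetime import datetime, timezone

audit_log = []  # every ingest decision is recorded here


def extract_metadata(filename, payload):
    """Derive descriptive metadata automatically, with no human input."""
    suffix = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return {"format": suffix, "size_bytes": len(payload)}


# Hypothetical policies, checked in order: (test on metadata, destination).
POLICIES = [
    (lambda m: m["format"] == "fastq", "genomics_cluster"),
    (lambda m: m["size_bytes"] > 1_000_000, "bulk_archive"),
]
DEFAULT_DESTINATION = "general_storage"


def ingest(filename, payload):
    """Extract metadata, pick a destination by policy, and audit the result."""
    meta = extract_metadata(filename, payload)
    destination = next(
        (dest for matches, dest in POLICIES if matches(meta)),
        DEFAULT_DESTINATION,
    )
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "file": filename,
        "metadata": meta,
        "destination": destination,
    })
    return destination


print(ingest("sample.fastq", b"ACGT"))
print(ingest("notes.txt", b"field notes"))
for entry in audit_log:
    print(entry["file"], "->", entry["destination"])
```

Because the routing decision and the audit record are produced in the same step, the log always reflects what the policies actually did.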
Bandwidth Availability Enables Global Collaboration
The deployment of 100 Gbps Ethernet wide area networks across many universities launched a new era of research data communication. Initially all data operations were confined to one campus or entity simply because of the limitations of communication technology. While it was possible to transfer files by way of File Transfer Protocol (FTP) technologies, it was not easy to create indices that spanned federated collections and allowed data to be discovered or accessed. The secure federation capabilities of iRODS have changed the way that we think of data locality. One of the key focuses of iRODS development has been to enable federated collaboration. When the administrators of two iRODS sites share a set of keys, the two sites, with permissions, can appear as one. The researcher or administrator can assign access controls for local and WAN access. A user in a remote zone can easily discover data through access to user-defined metadata. A file transfer can then be enabled, with the iRODS servers brokering a direct transfer to the requesting client. A researcher can even share data with non-iRODS users by issuing a secure ticket for a specific file or files.
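The federation model described above (a key exchange between sites, access controls per user, metadata discovery from a remote zone, and tickets for outside users) can be sketched as a toy model. This is plain Python written for illustration, not the iRODS implementation or its client API: the `Zone` class, its method names, and the site and user names are all hypothetical.

```python
import secrets


class Zone:
    """Toy model of one site; names and methods are illustrative only."""

    def __init__(self, name):
        self.name = name
        self.trusted_zones = set()  # zones this site has exchanged keys with
        self.objects = {}           # path -> data, metadata, allowed readers
        self.tickets = {}           # ticket token -> path

    def federate_with(self, other):
        # Stands in for the administrators of both sites sharing keys.
        self.trusted_zones.add(other.name)
        other.trusted_zones.add(self.name)

    def put(self, path, data, metadata, readers):
        # readers is a set of (user, zone) pairs allowed to see the object.
        self.objects[path] = {
            "data": data, "metadata": metadata, "readers": set(readers),
        }

    def _may_read(self, user, user_zone, path):
        zone_ok = user_zone == self.name or user_zone in self.trusted_zones
        return zone_ok and (user, user_zone) in self.objects[path]["readers"]

    def search(self, user, user_zone, **criteria):
        # Metadata discovery, limited to objects the caller may read.
        return sorted(
            path
            for path, obj in self.objects.items()
            if self._may_read(user, user_zone, path)
            and all(obj["metadata"].get(k) == v for k, v in criteria.items())
        )

    def get(self, user, user_zone, path):
        if path not in self.objects or not self._may_read(user, user_zone, path):
            raise PermissionError(f"{user}#{user_zone} may not read {path}")
        return self.objects[path]["data"]

    def issue_ticket(self, path):
        # A ticket grants access to one object without a user account.
        if path not in self.objects:
            raise KeyError(path)
        token = secrets.token_urlsafe(16)
        self.tickets[token] = path
        return token

    def get_with_ticket(self, token):
        if token not in self.tickets:
            raise PermissionError("unknown ticket")
        return self.objects[self.tickets[token]]["data"]


site_a = Zone("siteA")
site_b = Zone("siteB")
survey = "/siteA/home/ana/survey.csv"
site_a.put(survey, b"plot,height\n1,42\n", {"topic": "plant-height"},
           {("ana", "siteA"), ("ben", "siteB")})

found_before = site_a.search("ben", "siteB", topic="plant-height")
site_a.federate_with(site_b)
found_after = site_a.search("ben", "siteB", topic="plant-height")
print(found_before, found_after)

ticket = site_a.issue_ticket(survey)
print(site_a.get_with_ticket(ticket))
```

In this sketch a remote user needs both a trust relationship between the zones and an explicit permission on the object, while a ticket holder needs neither but can reach only the one file the ticket names.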
A New Era of Data Sharing is Underway
Large-scale iRODS deployments span the world and have enabled collaborations of multinational scientists and researchers. In the US, the iPlant Collaborative was formed in 2008 with funding from the National Science Foundation. Data management was based on iRODS from the start of the project, which initially served the plant science communities, primarily in the US. From its inception, iPlant quickly grew into a mature organization providing powerful resources and offering scientific and technical support services to researchers nationally and internationally. In 2015, iPlant was rebranded as CyVerse to emphasize an expanded mission to serve all life sciences [2]. Today CyVerse serves over 47,000 users with 5,690 participating academic institutions and 2,438 non-academic institutions. A major feature of the collaborative is the Discovery Environment (DE), which allows researchers to quickly find files of interest relating to their life science discipline. The primary site is in Tucson, Arizona, with a mirror at the Texas Advanced Computing Center in Austin, Texas. Both data management and workflow control are enabled by the use of iRODS.
In Europe, the EUDAT Collaborative Data Infrastructure (CDI) was formed to host the data of over 50 universities and research institutions in the European Union. The infrastructure is managed under iRODS, and the data covers over 30 scientific disciplines, from atmospheric research to physics, hydro-meteorology, genomics, and ecology. As with CyVerse, a major feature of EUDAT is data discovery across the entire geography of the EU. The goal is to provide both data access and re-use for near-term needs as well as data preservation to build a long-term archive [3].
In the Netherlands, SURF has built a data management framework based on iRODS. Data from several universities across the country is stored at its data site. Besides data storage and management, SURF also offers data processing and analysis as well as compute services. All of the data at the site is moved to various platforms and tiers using iRODS [4]. SURF, along with several universities in the Netherlands, is a member of the iRODS community.
In Sweden, the Swedish National Infrastructure for Computing (SNIC) is a national research infrastructure that makes large-scale high performance computing resources, storage capacity, and advanced user support available to Swedish researchers. This service is managed under iRODS control [5]. It uses the Swedish University Network (SUNET), which links the infrastructure at the KTH Royal Institute of Technology to other universities in Sweden with a 100 Gbps link to facilitate data movement [6].
These are just a few of the iRODS deployments in both the academic and research sectors. The use of iRODS and its discovery capabilities accelerates scientific research by allowing researchers to quickly find relevant materials and build on them. The power of iRODS to manage data based on collection policies cannot be overstated as data sets grow and automation becomes a requirement. Many universities, libraries, museums, and companies worldwide have chosen iRODS as a technology that allows the “future proofing” of data collections independent of the evolution of storage. These institutions have realized that their data policy decisions can be maintained by iRODS at any scale, regardless of changes in data storage or networking technologies over time.
ABOUT THE AUTHOR
Dave Fellinger is a Data Management Technologist and Storage Scientist with the iRODS Consortium. In his role at the iRODS Consortium, Dave is working with users in research sites and high performance computer centers to confirm that a broad range of use cases can be fully addressed by the iRODS feature set. He helped to launch the iRODS Consortium and was a member of the founding board.
1. The history of the card catalog is available from: https://www.vox.com/culture/2017/4/21/15357984/card-catalog-library-ofcongress-history, accessed 2 November 2019
2. The history of CyVerse is available from: https://www.cyverse.org/about, accessed 9 October 2019
3. Information regarding EUDAT is available from: https://www.eudat.eu/eudat-cdi, accessed 8 October 2019
4. Information regarding SURF is available from: https://www.surf.nl/en/research-ict, accessed 9 October 2019
5. Information regarding SNIC is available from: https://www.snic.se/, accessed 9 October 2019
6. Information regarding SUNET is available from: https://www.sunet.se/about-sunet/, accessed 9 October 2019