Packaging Research data with DataCrate - a cry for help!

2019-05-15T04:30:24Z (GMT) by Peter Sefton, Michael Lynch
Background:
DataCrate is a specification for packaging research data with extensive human- and machine-readable metadata, for either distribution (e.g. in a zip file) or hosting on the web. The specification is in final-draft form. It has been adopted at the University of Technology Sydney as the core means of distribution for datasets in our repository, and has generated interest in Australasia and internationally. The aim is to provide “who, what, where” metadata that makes understanding and reusing data practical. DataCrate can express detailed information about which people, instruments and software were involved in capturing or creating data, where they did it and why, as well as how to cite a dataset. The spec is on GitHub: https://github.com/UTS-eResearch/datacrate/blob/master/spec/1.0/data_crate_specification_v1.0.md
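To make the “who, what, where” idea concrete, here is a minimal sketch of a schema.org `Dataset` description serialised as JSON-LD. It is an illustration only, not the layout the DataCrate specification prescribes: the function name, the choice of properties, and all of the values (dataset name, person, place, date) are made up for this example.

```python
import json


def build_dataset_metadata(name, description, creator_name, place_name, date_created):
    """Return a schema.org Dataset description as a JSON-LD-style dict.

    Hypothetical helper for illustration; the property selection is an
    assumption, not a summary of what the DataCrate spec requires.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,                      # what
        "description": description,
        "dateCreated": date_created,
        "creator": {                       # who
            "@type": "Person",
            "name": creator_name,
        },
        "contentLocation": {               # where
            "@type": "Place",
            "name": place_name,
        },
    }


# All values below are invented placeholders.
metadata = build_dataset_metadata(
    name="Example cave survey",
    description="Illustrative speleological mapping data.",
    creator_name="A. Researcher",
    place_name="Example Cave",
    date_created="2018-01-01",
)

print(json.dumps(metadata, indent=2))
```

Nesting `Person` and `Place` objects, rather than using plain strings, is what lets a consumer tell who created the data apart from where it was collected.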

Related Work:
DataCrate builds on other standards, starting with BagIt for packaging files and URLs with checksums [1]. It is similar in intent to Frictionless Data packaging [2], but uses JSON-LD and the schema.org vocabulary rather than a simple JSON structure; this ensures that metadata will interoperate with the semantic web (e.g. DataCrates are compatible with Google’s Dataset search). DataCrate has a similar structure to Research Object Bundles [3], but a significantly simpler way of adding metadata. An innovation of DataCrate is that it has a rich HTML website that functions as a detailed README file down to the file level (and soon to the column header in tables), an approach which has also been adopted by DataSpice [4].
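As a rough illustration of the BagIt layer mentioned above, the sketch below writes a `bagit.txt` declaration and a `manifest-sha256.txt` checksum manifest for the files under a bag's `data/` directory. It follows the general BagIt conventions (payload in `data/`, one `checksum  path` line per file) and covers only this part of a DataCrate; the function name `write_bag_manifest` and the demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def write_bag_manifest(bag_dir):
    """Write bagit.txt and manifest-sha256.txt for the payload in bag_dir/data.

    Returns the manifest lines. Hypothetical helper, not part of any
    DataCrate or BagIt tooling.
    """
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"

    lines = []
    # Sort so the manifest is stable between runs.
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        # Manifest paths are relative to the bag root and use forward slashes.
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{digest}  {relative}")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return lines


# Demo on a throwaway directory with one invented payload file.
with tempfile.TemporaryDirectory() as tmp:
    payload = Path(tmp) / "data"
    payload.mkdir()
    (payload / "readings.csv").write_text("depth,temp\n1,12.5\n", encoding="utf-8")
    for line in write_bag_manifest(tmp):
        print(line)
```

A recipient can recompute each file's SHA-256 digest and compare it with the manifest to confirm that the package arrived intact.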

Content:
This proposed session will introduce the specification using extensive examples, showing how it can be used across many kinds of disciplines, including social history, microscopy, computational models, interview materials, environmental data about soil and atmosphere, and speleological mapping data. The session will also show how DataCrate can be used as an interchange format, pulling and pushing data from multiple systems.

Seeking feedback, and developers:
We will be seeking feedback on the spec, as well as exhorting developers to consider adding DataCrate export to existing repositories and research apps, as a way to increase the re-use potential of data and to reduce integration costs for data migration.

Acknowledgments:
DataCrate has had contributions from a number of people – a current list will be provided at the presentation.

[1] Kunze, John, Andy Boyko, Brian Vargas, Liz Madden, and Justin Littman. “The BagIt File Packaging Format (V0.97).” Accessed March 1, 2013. http://tools.ietf.org/html/draft-kunze-bagit06.

[2] “The Frictionless Data Field Guide.” Accessed September 26, 2018. https://frictionlessdata.io/specs/data-package/.

[3] “Research Object Bundle.” Accessed June 16, 2017. https://researchobject.github.io/specifications/bundle/.

[4] Create Lightweight Schema.Org Descriptions of Dataset: Ropenscilabs/Dataspice. R. 2018. Reprint, rOpenSci Labs, 2018. https://github.com/ropenscilabs/dataspice.

ABOUT THE AUTHOR(S)
Peter Sefton is the Manager, eResearch Support at the University of Technology, Sydney (UTS). Before that he was in a similar role at the University of Western Sydney (UWS). Previously he ran the Software Research and Development Laboratory at the Australian Digital Futures Institute at the University of Southern Queensland. Following a PhD in computational linguistics in the mid-nineties, he has gained extensive experience in the higher education sector, leading the development of IT and business systems to support both learning and research.

License

CC BY 4.0