Posted on 2019-05-15, 04:30. Authored by Peter Sefton, Michael Lynch
Background:
DataCrate is a specification for packaging research data with extensive human- and
machine-readable metadata for either distribution (e.g. in a zip file) or hosting on the web. The
specification is in final-draft form, has been adopted at the University of Technology Sydney
as the core means of distribution for datasets in our repository, and has generated interest in
Australasia and internationally. The aim is to provide “who, what, where” metadata that makes
understanding and reusing data practical. DataCrate can express detailed information about
which people, instruments and software were involved in capturing or creating data, where they
did it and why, as well as how to cite a dataset. The spec is on GitHub: https://github.com/UTS-eResearch/datacrate/blob/master/spec/1.0/data_crate_specification_v1.0.md
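As a rough illustration of what “who, what, where” metadata can look like, the following minimal sketch builds a small description from schema.org terms and serialises it as JSON-LD. It is not taken from the DataCrate specification: the function name `describe_dataset`, the `./` identifier and every value are hypothetical placeholders, and the choice of `CreateAction` to link people, instruments and software to the dataset is an assumption made for this example.

```python
import json


def describe_dataset(name, description, creator, place, instrument, software):
    """Build a JSON-LD description of a dataset using schema.org terms.

    All arguments are plain strings. The structure is an illustrative
    assumption, not the layout mandated by the DataCrate specification.
    """
    dataset_id = "./"  # hypothetical identifier for the package root
    return {
        "@context": "https://schema.org/",
        "@graph": [
            {
                # "What" and "where": the dataset and the place it covers.
                "@id": dataset_id,
                "@type": "Dataset",
                "name": name,
                "description": description,
                "creator": {"@type": "Person", "name": creator},
                "spatialCoverage": {"@type": "Place", "name": place},
            },
            {
                # "Who" and "how": the act of creating the dataset.
                "@type": "CreateAction",
                "agent": {"@type": "Person", "name": creator},
                "instrument": [
                    {"@type": "Thing", "name": instrument},
                    {"@type": "SoftwareApplication", "name": software},
                ],
                "result": {"@id": dataset_id},
            },
        ],
    }


# Placeholder values, invented for this example only.
example = describe_dataset(
    name="Example cave survey",
    description="Placeholder description of a speleological mapping dataset.",
    creator="A. Researcher",
    place="Example Cave System",
    instrument="Example laser rangefinder",
    software="Example survey software",
)
print(json.dumps(example, indent=2))
```

Keeping the dataset and the creation event as separate nodes lets one record point to several instruments or software packages without repeating the dataset description.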
Related Work:
DataCrate builds on other standards, starting with BagIt for packaging files and
URLs with checksums [1]. It is similar in intent to Frictionless Data packaging [2], but uses JSON-LD and the schema.org vocabulary rather than a simple JSON structure, which ensures that
metadata will interoperate with the semantic web (e.g. DataCrates are compatible with Google’s
Dataset Search). DataCrate has a similar structure to Research Object Bundles [3], but a
significantly simpler way of adding metadata. An innovation of DataCrate is that it has a rich
HTML website that functions as a detailed README file down to the file level (and soon down to the
column headers in tables), an approach which has also been adopted by DataSpice [4].
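The BagIt idea of packaging files with checksums can be sketched as follows. This is a minimal illustration under stated assumptions, not DataCrate tooling: it assumes the usual BagIt layout of payload files under a `data/` directory and a manifest listing one checksum and relative path per line. The helper names `sha256_of`, `manifest_lines` and `write_manifest` are hypothetical, and SHA-256 is chosen here only as an example algorithm.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(bag_root: Path) -> List[str]:
    """List '<checksum>  <relative path>' for every file under data/."""
    data_dir = bag_root / "data"
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(bag_root).as_posix()
        lines.append(f"{sha256_of(path)}  {relative}")
    return lines


def write_manifest(bag_root: Path) -> Path:
    """Write the payload manifest at the bag root and return its path."""
    manifest = bag_root / "manifest-sha256.txt"
    manifest.write_text("\n".join(manifest_lines(bag_root)) + "\n", encoding="utf-8")
    return manifest
```

A recipient can recompute the same lines after transfer and compare them with the stored manifest to confirm that no payload file was altered or lost.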
Content:
This proposed session will introduce the specification using extensive examples, showing
how it can be used across many disciplines, including social history, microscopy, computational
models, interview materials, environmental data about soil and atmosphere, and speleological
mapping data. The session will also show how DataCrate can be used as an interchange format,
pulling and pushing data from multiple systems.
Seeking feedback, and developers:
We will be seeking feedback on the spec, as well as
exhorting developers to consider adding DataCrate export to existing repositories and research
apps, as a way to increase the re-use potential of data and to reduce integration costs for data
migration.
Acknowledgments:
DataCrate has had contributions from a number of people – a current list will
be provided at the presentation.
[1] Kunze, John, Andy Boyko, Brian Vargas, Liz Madden, and Justin Littman. “The BagIt File
Packaging Format (V0.97).” Accessed March 1, 2013. http://tools.ietf.org/html/draft-kunze-bagit-06.
[2] “The Frictionless Data Field Guide.” Accessed September 26, 2018.
https://frictionlessdata.io/specs/data-package/.
[3] “Research Object Bundle.” Accessed June 16, 2017.
https://researchobject.github.io/specifications/bundle/.
[4] “Create Lightweight Schema.org Descriptions of Dataset: ropenscilabs/dataspice.”
R. 2018. Reprint, rOpenSci Labs, 2018. https://github.com/ropenscilabs/dataspice.
ABOUT THE AUTHOR(S)
Peter Sefton is the Manager, eResearch Support at the University of Technology, Sydney (UTS). Before
that he was in a similar role at the University of Western Sydney (UWS). Previously he ran the
Software Research and Development Laboratory at the Australian Digital Futures Institute at the
University of Southern Queensland. Following a PhD in computational linguistics in the mid-nineties, he has gained extensive experience in the higher education sector in leading the
development of IT and business systems to support both learning and research.