DataCite is an international organisation, founded in 2009, which promotes the use of DOIs (Digital Object Identifiers) for published datasets, in order to establish easier access to research data, to increase acceptance of research data as legitimate contributions in the scholarly record, and to support data archiving to permit results to be verified and re-purposed for future study.
Its founding members were the British Library; the Technical Information Center of Denmark; TU Delft Library; the National Research Council’s Canada Institute for Scientific and Technical Information (NRC-CISTI); California Digital Library; Purdue University; and the German National Library of Science and Technology. Since its foundation, it has been joined by several other leading organisations from around the world, and it therefore provides a stable basis for the ongoing use of DOIs for data.
The recent availability of DataCite DOIs for identifying data entities has made all the difference to data repositories wishing to give unique global identifiers to their data holdings. DOIs are widely recognised and respected throughout the academic world because of their prior widespread use, made possible by CrossRef, for identifying journal articles.
However, in their recent discussion paper Data Citation and Linking, published on 8th June 2011, Alex Ball and Monica Duke of UKOLN at the University of Bath ask:
“At what granularity should data be made citable? If single datasets are given identifiers, what about collections of datasets, or subsets of data?”
Individual data files and metadata documents will, of course, have their own unique internal identifiers within any data repository, but may not have externally resolvable identifiers such as DOIs. Practice varies.
This post explains how DOIs are employed in the Dryad Data Repository, which specialises in publishing data linked to peer-reviewed biological journal articles. Dryad's approach is elegant, and it addresses at least some of the issues raised by Alex and Monica.
The Dryad DOI usage policy is described at https://www.nescent.org/wg_dryad/DOI_Usage, and involves assigning unique DOIs to each version of every data package, and to each version of every data file, in a principled and easy-to-understand manner. In summary:
- Each data package is given a DataCite DOI, which can be versioned by adding “.2”, “.3”, etc. after the original DOI to create new DOIs for new versions of the same data package.
- Within each data package, each data file has a unique DOI defined by suffixing the data package DOI with “/1”, “/2”, etc., with versions indicated as for data packages.
Thus the third version of the second data file in the second version of a Dryad data package would have a DOI of the form doi:10.5061/dryad.1234.2/2.3.
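The naming scheme above can be sketched in a few lines of Python. This is only an illustration of the pattern, not Dryad's own software: the function names `build_doi` and `parse_doi`, the `DryadDoi` tuple, the assumption that package identifiers are plain alphanumeric strings, and the assumption that version 1 carries no version suffix are all mine, inferred from the summary and the example DOI.

```python
import re
from typing import NamedTuple, Optional

# Prefix copied from the example DOI in the text.
PREFIX = "10.5061/dryad."


class DryadDoi(NamedTuple):
    """Hypothetical container for the parts of a Dryad-style DOI."""
    package_id: str
    package_version: int
    file_number: Optional[int]
    file_version: Optional[int]


def _version_suffix(version: int) -> str:
    # Assumption: version 1 is the original DOI and has no suffix;
    # later versions append ".2", ".3", and so on.
    if version < 1:
        raise ValueError("versions start at 1")
    return "" if version == 1 else f".{version}"


def build_doi(package_id: str, package_version: int = 1,
              file_number: Optional[int] = None, file_version: int = 1) -> str:
    """Compose a DOI for a data package, or for a file within it."""
    doi = f"doi:{PREFIX}{package_id}{_version_suffix(package_version)}"
    if file_number is not None:
        if file_number < 1:
            raise ValueError("file numbers start at 1")
        doi += f"/{file_number}{_version_suffix(file_version)}"
    return doi


# Assumption: package identifiers contain only letters and digits.
_PATTERN = re.compile(
    r"^(?:doi:)?10\.5061/dryad\.(?P<pkg>[A-Za-z0-9]+)(?:\.(?P<pkg_ver>\d+))?"
    r"(?:/(?P<file>\d+)(?:\.(?P<file_ver>\d+))?)?$"
)


def parse_doi(doi: str) -> DryadDoi:
    """Split a Dryad-style DOI back into its package and file parts."""
    match = _PATTERN.match(doi)
    if match is None:
        raise ValueError(f"not a Dryad-style DOI: {doi!r}")
    file_number = int(match["file"]) if match["file"] else None
    file_version = int(match["file_ver"] or 1) if file_number is not None else None
    return DryadDoi(match["pkg"], int(match["pkg_ver"] or 1), file_number, file_version)


# Third version of the second file in the second version of package 1234.
print(build_doi("1234", package_version=2, file_number=2, file_version=3))
# prints: doi:10.5061/dryad.1234.2/2.3
```

Because the package DOI is a prefix of every file DOI, the package that a file belongs to can be recovered from the file's identifier alone, which is part of what makes the scheme easy to understand.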
One might argue that this would result in an awfully large number of DOIs if a single data package were made up of thousands of data files. True, but numbers themselves are limitless and free, and the cost of a DataCite DOI is small relative to the cost of data creation and preservation. The real problem at present is the lack of identifiable, citable data entities within repositories – to have so many that the cost of DOIs becomes an issue should be regarded as an achievement, not a problem!
Dryad does not have a mechanism for assigning identifiers to a portion of a data file (“a subset of data”), and DOIs are probably not the correct identifiers for that purpose, since they are primarily designed for citation and resource discovery.
A more appropriate method for identifying portions of a data file, or of any other digital object or document, is to use the Annotation Ontology (AO) developed by Paolo Ciccarese of Harvard University, described at http://code.google.com/p/annotation-ontology/wiki/Homepage. AO can be used to identify and annotate portions of a wide variety of resources such as HTML, PDF, Word, Excel and XML documents, images, videos, databases, web services, experimental data and metadata files. Paolo is currently working with a group at Harvard that focuses on biodiversity and is using AO to address databases and data, and he anticipates publishing version 2.0 of AO in September.