In the previous post, I outlined reasons why researchers don’t publish data, presented as evidence to the Royal Society’s Policy Study “Science as a Public Enterprise” Call for Evidence. Here, I summarize activities by members of my Image Bioinformatics Research Group (IBRG) at Oxford University to facilitate data publication and data citation, and thus to help catalyze a cultural shift to a situation in which data publication is as natural a part of research life as is undertaking experiments.
= = =
Data management services and data repositories
We are developing tools and services to assist researchers in their local data management, for their own personal benefit, while facilitating automated data submission to appropriate institutional or subject-specific data repositories, in ways that fit with their normal working practices and impose as little as possible in terms of cognitive overhead – what we term sheer curation. These include the two-stage data management services we are currently funded to develop by the University Modernization Fund through the JISC DataFlow Project, namely (a) DataStage, a private local data management file system, with automated backup, Web access, and security access control, for use by individual research groups, and (b) DataBank, a cloud-deployable data repository for use by universities, research institutes or large research consortia. These open source services will be made available for installation by third parties on the Eduserv academic cloud and elsewhere, as required by research groups, institutions and universities both in the UK and internationally. We seek early adopters!
Curation by addition
For automated data submissions from DataStage to DataBank, which will use the SWORDv2 repository submission protocol to standardize data package ingest, we are intentionally lowering the metadata requirements for initial data submission, with the possibility of enriching the metadata at a later date. We call this curation by addition, and its purpose is to kick-start the cultural sea change required for data deposition to become routine. We are trying to avoid the best (the requirement for perfect and complete metadata) becoming the enemy of the good (data publication by any means).
We are, through the JISC Dryad-UK Project, working to promote the Dryad Data Repository, a domain-specific repository for biological datasets linked to peer-reviewed journal articles, by bringing additional publishers and journals on board, and enabling Dryad metadata to be published as open linked data.
We are also promoting the adoption of the SWORDv2 repository communication protocol for data package wrapping, to permit automated deposit to DataBank, Dryad or other SWORD-compliant repositories, and the exchange of metadata between them.
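To make the deposit step more concrete, here is a minimal Python sketch of how a client might prepare a SWORDv2 binary deposit of a zipped data package. It is an illustration under stated assumptions, not the DataStage code: the collection URL and file name are hypothetical, the header set reflects my reading of the SWORD 2.0 profile, and the request is only constructed, never sent. Setting In-Progress to true signals that the deposit is not yet complete, which is the protocol-level hook that fits curation by addition.

```python
import hashlib
import urllib.request

# Packaging identifier for a plain zip file, as defined by the SWORD 2.0 profile.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def build_deposit_headers(payload: bytes, filename: str, in_progress: bool = True) -> dict:
    """Return HTTP headers for a SWORDv2 binary deposit (illustrative header set)."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Packaging": SIMPLE_ZIP,
        # Hex-encoded MD5 checksum of the payload, as I understand the profile.
        "Content-MD5": hashlib.md5(payload).hexdigest(),
        # "true" means more content or metadata may follow later.
        "In-Progress": "true" if in_progress else "false",
    }


def build_deposit_request(collection_url: str, payload: bytes, filename: str,
                          in_progress: bool = True) -> urllib.request.Request:
    """Build, but do not send, a POST request to a SWORDv2 collection URL."""
    headers = build_deposit_headers(payload, filename, in_progress)
    return urllib.request.Request(collection_url, data=payload,
                                  headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical endpoint and placeholder bytes standing in for a real zip file.
    request = build_deposit_request(
        "https://databank.example.org/sword/collection/demo",
        b"placeholder bytes for a zipped data package",
        "dataset.zip",
    )
    print(request.get_method(), request.full_url)
    for name, value in request.header_items():
        print(f"{name}: {value}")
```

A real client would also need authentication and would read the deposit receipt returned by the repository in order to find the URLs for later metadata additions.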
SPAR (Semantic Publishing and Referencing) Ontologies
To enable Dryad, DataBank and similar repository metadata to be published as open linked data, we are creating appropriate data description and data citation ontologies, including FaBiO and CiTO4Data, as part of our suite of SPAR Ontologies, and are using them to provide mappings from the DataCite XML Metadata Kernel to RDF.
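As a rough illustration of what a mapping from DataCite XML to RDF involves, the Python sketch below reads a small DataCite-style record and emits N-Triples. It is not our published mapping: to avoid misquoting FaBiO or CiTO4Data terms, it uses Dublin Core properties and the DCMI Dataset class as stand-ins, and the sample record, its DOI and the kernel-4 namespace version are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the DataCite metadata kernel (version 4 assumed for this sketch).
NS = {"dcite": "http://datacite.org/schema/kernel-4"}
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DATASET = "http://purl.org/dc/dcmitype/Dataset"

# Made-up record for illustration only.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.1234/example</identifier>
  <creators><creator><creatorName>Doe, Jane</creatorName></creator></creators>
  <titles><title>Example dataset</title></titles>
  <publisher>Example Repository</publisher>
  <publicationYear>2011</publicationYear>
</resource>"""

# XML path -> Dublin Core term (stand-in property choices, see the note above).
PATHS = [
    ("dcite:titles/dcite:title", "title"),
    ("dcite:creators/dcite:creator/dcite:creatorName", "creator"),
    ("dcite:publisher", "publisher"),
    ("dcite:publicationYear", "issued"),
]


def literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def datacite_to_ntriples(xml_text: str) -> list:
    """Convert one DataCite XML record into a list of N-Triples lines."""
    root = ET.fromstring(xml_text)
    doi = root.findtext("dcite:identifier", default="", namespaces=NS).strip()
    if not doi:
        raise ValueError("record has no identifier")
    subject = f"<https://doi.org/{doi}>"
    triples = [f"{subject} <{RDF_TYPE}> <{DATASET}> ."]
    for path, term in PATHS:
        for element in root.iterfind(path, NS):
            text = (element.text or "").strip()
            if text:
                triples.append(f"{subject} <{DCTERMS}{term}> {literal(text)} .")
    return triples


if __name__ == "__main__":
    print("\n".join(datacite_to_ntriples(SAMPLE)))
```

Using the DOI as the subject IRI means that the same identifier serves both for citation and as the node that other linked data can point to.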
We are working with DataCite to assign DOIs to Dryad and DataBank datasets, so that data publications become citable, gaining academic credit for the data depositor.
These data citations, when they exist, will fit naturally within the Open Citations Corpus, a collection of some 3.4 million bibliographic citations from within PubMed Central that we have recently established as open linked data, as part of the JISC Open Citations Project.
We have also worked to establish best practice for citing data publications from within the literature, and with one open access journal publisher to influence their Data Publishing Policies and Guidelines to Authors regarding data citation, as detailed in earlier posts on this blog.
Tools for metadata curation
The above tools and services are generic. Specifically in the biomedical area, we are developing MIIDI, a Minimal Information standard for reporting an Infectious Disease Investigation, to specify the metadata that should accompany such an investigation for completeness. We have also recently developed MIIDI Forms, a web tool that facilitates the entry of such metadata. It interacts with appropriate web services to autocomplete bibliographic information and to specify geo-coordinates for place names, and it permits automated look-up of ontology terms from the NCBO BioPortal.
Open Research Reports
We are working to create Open Research Reports, open access structured digital abstracts in both human- and machine-readable form that describe datasets or journal articles that relate to infectious disease, based on MIIDI and to be published in an instant data journal format with DOIs to permit referencing and citation.
Tools for creating data management plans
We have recently started working with the Digital Curation Centre (DCC) to help improve DMPonline, their data management planning tool. It is used to create the data management plans that are increasingly required to accompany grant applications, and that are useful for managing the flow of data from funded projects. If our current funding application is successful, this work will be carried forward in the OXFORD DMPonline Project. In addition to the adoption, adaptation, customization and integration of the tool for use by University of Oxford researchers, that project will develop the following generic improvements, which will be fed back to the DCC as open source enhancements for general use across UK academia and internationally:
a) creation of DaMO, a simple data management ontology,
b) use of DaMO to create RDF metadata for data management plans,
c) SWORDv2-wrapping of data management plans for repository submission, and
d) creation of DMPBank, a DataBank instance specifically tailored for archiving and publishing data management plans.