In biology, the fields of macromolecular structural biology and sequence bioinformatics have, since the 1970s, had established international databases for the deposition of data, and journal policies mandating such deposition prior to acceptance for publication of manuscripts describing the data. Similar good practices have developed more recently in other disciplines, notably astronomy. But these are the exceptions, and across the majority of scientific fields data publication remains a minority activity. For the most part, this is because the technical barriers to publication of research datasets remain so high, and the academic rewards so low, that such publication is undertaken only by the few who regard it as a moral imperative. However, new policies are combining with new technological capabilities to bring significant change to this publication landscape.
An analogy can perhaps be made with geological plate tectonics. The new policies of funders and journal publishers towards the open publication of research datasets arising from publicly funded research can be likened to a tectonic plate moving slowly forward with inexorable force and colliding with the massive stationary continental plate of established scientific practice. In that practice, data are traditionally regarded as belonging to the research group that generated them, data sharing occurs only between trusted colleagues on the basis of personal request, and the only publications that are truly valued are journal articles.
This traditional position is reinforced by the metrics employed in the assessment of research quality, which, while paying lip service to the value of data publication, regard articles in high impact factor journals as of paramount importance. And there is good reason why the scholarly research article is so highly regarded, since it is a rhetorical construct in which the authors attempt, by the selective presentation of evidence, to convince the readers that particular hypotheses are proven. As such, it can both demonstrate the competence and achievements of the authors, and be evaluated against other such publications by peer review. In contrast, the publication of a research dataset is primarily a presentation of facts, with no rhetorical content, that can only be validated on the basis of internal self-consistency, information about instrument calibrations, resolution and error estimates, and the possession of adequate descriptive metadata in appropriate formats: much more pedestrian stuff.
However, to return to our analogy, these tectonic plates are colliding, with the traditional plate of data retention facing ultimate subduction beneath the advancing plate of open data publication. At present there is friction between them, and along much of the shear zone there seems to be little movement, leading to a build-up of pressure. Small projects that result in some local movement towards better data management can be likened to minor earthquakes having only local impact. But these are the precursors to an inevitable major re-positioning of the plates to relieve the mounting pressure for change. This will bring a general realignment of attitudes along the whole plate boundary, resulting in a tsunami of open data publication. We are thus on the cusp between the traditional status quo and a new, dramatically reshaped scientific publication landscape, in which open data publication will take its proper place as underpinning the publication of ideas and of the evidence supporting hypotheses.
Technological projects such as the JISC ADMIRAL Project and its successor the JISC UMF DataFlow Project serve to facilitate this transition by ‘lubricating’ the plate boundary and enabling movement. In particular, they provide a two-tiered federated data management infrastructure: local services that meet the private data management needs of individual research groups (DataStage filestore instances) are linked to institutional repositories (DataBank repository instances) by automated procedures for the easy archiving and publication of selected datasets. This makes the whole process easier, as illustrated by the following figures taken from the original ADMIRAL Project grant application.
Four phases mark the activities undertaken in traversing the conventional data lifecycle: formulation, experimentation, interpretation and publication. The publication outputs from one cycle provide the input to the next. However, only selected research data are conventionally published. The original research datasets are frequently abandoned on local hard drives or CD-ROMs, and neither datasets nor papers are submitted to institutional repositories.
Raw research data are first organized and annotated in a local research data filestore. From there they can be shared and used to support publications, and can be automatically archived to institutional repositories, from which they can optionally be published as Linked Open Data on the Semantic Web for public dissemination and reuse. This figure differs from the DCC Data Cycle model by emphasizing the importance of the local research data filestore.
As investment is made in the local organization and annotation of research datasets, the effort involved in data submission to an institutional repository reduces to the point where it becomes feasible on a routine basis.
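To make the point concrete, here is a minimal sketch of how little work remains at submission time once a dataset is already organized and annotated locally. It is not the DataStage or DataBank interface: the function names, the `manifest.json` layout and the metadata fields are all hypothetical, chosen only for illustration, and the sketch assumes the dataset directory does not already contain a file called `manifest.json`.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_dir: Path, metadata: dict) -> dict:
    """Describe every file in the dataset: relative path, size and checksum."""
    files = sorted(p for p in dataset_dir.rglob("*") if p.is_file())
    return {
        "metadata": metadata,  # descriptive annotation supplied by the researcher
        "files": [
            {
                "path": p.relative_to(dataset_dir).as_posix(),
                "bytes": p.stat().st_size,
                "sha256": sha256_of(p),
            }
            for p in files
        ],
    }


def package_dataset(dataset_dir: Path, metadata: dict, out_zip: Path) -> dict:
    """Bundle the dataset and its manifest into one archive ready for deposit."""
    manifest = build_manifest(dataset_dir, metadata)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for entry in manifest["files"]:
            archive.write(dataset_dir / entry["path"], arcname=entry["path"])
        # Hypothetical convention: the manifest travels inside the package.
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return manifest
```

A call such as `package_dataset(Path("my_dataset"), {"title": "Example"}, Path("deposit.zip"))` would produce a single file that a repository could ingest. The checksums and descriptive metadata are exactly the 'pedestrian' validation information discussed above, and they cost almost nothing to generate when the local filestore is already in good order.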