Evidence submitted by David Shotton in response to the Royal Society’s Policy Study “Science as a Public Enterprise” Call for Evidence, addressing the following two topics raised by that call:
Getting Researcher buy-in. How do we get researchers to be more willing to share data? What is there to be learned from disciplines such as genomics which have norms which favour wide sharing of data?
Ensuring we generate useful metadata. For open data to be useful, it needs to be sufficiently well described. The researchers creating the dataset are in the best position to create the metadata; but as things stand, the incentives for them to do a thorough job of this are not always very strong. Do we need to change incentives?
= = =
“I guess I have been invited to contribute evidence to the evidence session on digital curation at the Royal Society on 5th August 2011 to present the view from the shop floor – or rather from the laboratory bench. I would like to mention three pressures that presently combine to prevent researchers from publishing their data.
Pressure one: Information volume
When I started research, you could, if you were very fortunate (as I was), solve a protein structure to low resolution within six months and to medium resolution within three or four years, and you could hope to know something about all the protein structures that had so far been determined. Today, you can collect the crystallographic structure factor data for a new protein in a few minutes at the Diamond Light Source, and can compute its 3D structure on your laptop during the train ride home. PDB currently contains the structures of about 74,000 macromolecules, and you are unlikely to know the structures of more than a handful of these.
Looking at the same problem from a different perspective, PubMed currently receives a million articles per year. If you imagine there might be a thousand biomedical specialisms – if you slice the salami thinly enough – then, as a specialist, you can expect on average twenty new papers in your field each week: an impossible number to carve out time for, alongside your other activities, if you wish to read them properly.
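The back-of-the-envelope arithmetic above can be checked with a short calculation; the figures (one million articles per year, a thousand specialisms) are the text's own illustrative assumptions, not measured values:

```python
# Rough check of the reading-load estimate quoted above.
# Both figures come from the text; the thousand-specialism
# split is the author's deliberately fine-grained illustration.
articles_per_year = 1_000_000   # approximate annual PubMed intake
specialisms = 1_000             # hypothetical "thin salami" split
weeks_per_year = 52

papers_per_specialism_per_week = articles_per_year / specialisms / weeks_per_year
print(round(papers_per_specialism_per_week, 1))  # prints 19.2 – roughly twenty a week
```

So the "twenty new papers each week" figure follows directly from the stated assumptions.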
Thus you will never catch up – there is just too much scientific information around now. You would like to know about it all, to keep abreast of your field, but the task is impossible. Researchers are thus under overwhelming pressure, and have to run just to stand still. They have no spare time to undertake data curation activities for which they receive little or no academic reward in terms of peer esteem, tenure or promotion.
Pressure two: Institutional pressures
The principal pressures researchers are under from their departments and institutions are (a) to win grants and (b) to publish in high-impact journals, because these things influence departmental income both (a) directly, through full economic costs from funding agencies, and (b) through high RAE/REF scores that in England determine funding from HEFCE. From the viewpoint of a Head of Department trying to establish or maintain his department’s reputation and financial health, nothing else matters. I have known these factors to be the deciding ones in academic appointments. Nobler concepts of scientific excellence and of scientific altruism in the form of data publication become submerged beneath these pressures.
Pressure three: Cognitive overheads of data management
Appropriate ontologies and technical infrastructures for data preservation increasingly exist, but the concepts surrounding metadata creation, repository deposit and data accessibility are foreign to most biomedical researchers, leading to cognitive and skill barriers that prevent them from undertaking routine best-practice data management.
Put crudely, the large amount of effort involved in preparing data for publication, coupled with the negligible incentives and rewards, prevents researchers in most biomedical specialisms from doing so.
Having said that, research scientists are perfectly able to provide structured metadata when it is necessary to do so. With the switch to on-line journal article submission, publishers have devised lengthy web forms that require completion with details of co-authors and their affiliations, funding agencies, etc. before you are permitted to upload your manuscript – forms that for certain publishers can take the best part of an afternoon to complete for a new submission involving many authors, figures and supplementary files. Researchers comply with these metadata requests because they have no choice: it is the only way to achieve their desired goal of publication in the chosen journal.
That the fields of genomics and macromolecular structures are exceptions to the rule that data are not widely published is due to two factors:
- First, their datasets are relatively simple, homogeneous and well-defined – linear nucleotide or amino acid sequences, lists of structure factors, and lists of atomic coordinates – in comparison with the heterogeneity of data in fields such as ecology or animal behaviour, simplifying the tasks of data management and metadata creation.
- Second, and more important, is the fact that in the early 1970s journals such as Nature started to mandate database accession numbers as a precondition of publishing sequence or structure papers – this brought about an almost instantaneous change in attitudes among our research community!
For other disciplines, while I commend journals’ and research councils’ recent policies regarding data publication, I believe we will only achieve radical change when funders and publishers mandate data publication as a pre-condition of applying for a further grant or of article submission. Toothless research council data policies, however laudable, are of little use unless backed up by some policing. ‘Sticks’ are required to achieve desired policy aims, as well as the ‘carrots’ of better personal data management and data security obtained by employing easy-to-use tools and systems.”
= = =