Working Group Response to Royal Society Science as a Public Enterprise
Those of you following #openscience news over the last few weeks won’t have failed to notice that the Royal Society in the UK recently released their Science as a Public Enterprise Report in strong support of open science. The working group submitted our collaboratively drafted response during the consultation period, which you can read below or download with other responses.
What ethical and legal principles should govern access to research results and data? How can ethics and law assist in simultaneously protecting and promoting both public and private interests?
The presiding principle should be that all outputs of publicly funded research are released and made publicly accessible as soon as is practicable and reasonable. Our position would be that all data, code and algorithms supporting published scientific results should be released openly alongside that publication in accordance with the Panton Principles for Open Data in Science [1].
We acknowledge that there are reasons why some outputs should not be released, but these are restricted to a small set of issues including, but not necessarily limited to, personal privacy, personal endangerment, risks to the research itself and danger to the environment. All of these are much larger issues which deserve consideration.
[1] www.pantonprinciples.org
2 a) How should principles apply to publicly-funded research conducted in the public interest?
The principles and caveats discussed above should apply to all such research.
2 b) How should principles apply to privately-funded research involving data collected about or from individuals and/or organisations (e.g. clinical trials)?
If the full economic cost is covered by a private company then research is ‘private’ and there shouldn’t be an expectation of data release in the manner previously discussed. However, even private funding is often from donors that have expectations of research being done in the public interest e.g. charities supporting medical research. Therefore, the better distinction may be between public interest research and commercial R&D (as addressed below).
There should be discussion between donors and funders as to data release policies in these cases and, where the funder is in agreement, the data should be made available according to the principles above while prioritising privacy and anonymity of research subjects (see Q2d).
In the case of partial private funding there may be grey areas. This includes cases where research is privately funded but heavily subsidised by institutions such that companies are not bearing the full economic cost. These may need consideration case by case but in terms of any published results in the scientific literature, the default should be that the data to back up those claims is publicly available.
2 c) How should principles apply to research that is entirely privately-funded but with possible public implications?
If private funding leads to research claims around public policy areas e.g. health, environment or planning, then data to support that claim should be made available in a publicly accessible manner. This is imperative if research claims are intended to influence public policy.
2 d) How should principles apply to research or communication of data that involves the promotion of the public interest but which might have implications for the privacy interests of citizens?
The privacy of citizens should come before the need for open data, but methods to protect privacy and still release data in the public interest should be explored and considered where possible.
3. What activities are currently under way that could improve the sharing and communication of scientific information?
There are numerous barriers to effective sharing and communication of scientific information which are being addressed by ongoing projects and activities.
Getting the data in shareable form: One of the most urgent needs to improve sharing and communication of scientific information is to improve capture and collection of data. Technology is required alongside standards for formatting and sharing data, and money must be invested in designing and building user-friendly data capture devices. For example, the efficiency of data capture and collection could be improved by scientific workflow systems such as Taverna and VisTrails.
Encouraging sharing and communication: This requires work on the attitude and expectations of scientists, establishment of community norms around data sharing and a reward system which recognises the worth of dataset publication as well as journal articles.
Work on reward systems to ensure that scientists get more recognition and benefit for sharing data is ongoing. Systems such as microattribution and other forms of recognition following reuse of shared data are being developed by STARMETRICS (NIH) [1], REF (HEFCE) [2] and Altmetrics [3] among others.
This work will be essential for creating incentives and community norms encouraging sharing and release of data. Researchers’ reservations must be addressed, which may include fears of getting scooped, releasing ‘dirty’ and unedited data in which people may find mistakes, possible harm due to caveats and interpretation notes being detached from the dataset and the risk of misinterpretation. The interaction of incentives, culture, and individuals will make the difference in driving greater accessibility to scientific information.
Data Publication: An important step in improving sharing is enabling scientific datasets to be published as citeable objects, and publications such as BMC Research Notes [4] and F1000 data publications (among others) are allowing this to happen.
Some fields, such as crystallography, recognised the value of data papers far earlier than others. There are currently initiatives ongoing in other areas which have traditionally not followed this publication model e.g. meteorology [5].
Much scientific information does not make it into a paper or on its own would not merit a data publication but may still be of use to others. For instance, most scientists do not publish negative results but this may lead to unnecessary duplication by other labs, which reduces the efficiency of research. Projects such as FigShare [6] enable such data to be made accessible and their use should be encouraged.
Standards: Each scientific discipline will differ in what research data it feels can or cannot be released and when and how that release should happen. Encouragement of disciplines to articulate norms around reasonable exceptions to release has started via projects such as Data Dryad [7].
A difficult issue is what rights researchers have to first use of any data they collect. It may be necessary to set an upper limit to time of publication, particularly in fields where long term data collection is the norm. However, the default should be that data is released as soon as practicable on a timescale deemed reasonable by the community.
On top of this, standards related to the formatting and content of datasets must be considered to increase their usefulness to multiple stakeholders including other researchers and the public. Initiatives such as Biosharing [8] in the life sciences aim to catalogue bioscience data reporting standards and policies.
Principles: Principles for the release of data should be clearly articulated e.g. the Panton Principles for Open Data in Science [9] provide recommendations for the release of fully open data. There is a difference between publicly accessible publication and open publication of data; this organisation would strongly promote the latter. As per the Panton Principles introduction:
Science is based on building on, reusing and openly criticising the published body of scientific knowledge. For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open. By open data in science we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published science should be explicitly placed in the public domain.
Data management and repositories: Numerous projects are ongoing in the field of improving scientific data repositories and research data management (Dryad [10], JISC UMF [11]), building sustainable infrastructures to draw data from different sources and make it available (DataONE [12]), and improving the quality and availability of scientific data more generally (CoDATA [13]).
[1] https://www.starmetrics.nih.gov/
[2] http://www.hefce.ac.uk/research/ref/
[3] http://altmetrics.org/manifesto/
[4] http://www.biomedcentral.com/bmcresnotes/
[5] http://www.jisc.ac.uk/whatwedo/programmes/reppres/sue/ojims.aspx
[6] http://figshare.com/
[7] http://www.datadryad.org/jdap
[8] http://www.biosharing.org/
[9] http://pantonprinciples.org/
[10] http://datadryad.org/
[11] http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx
[12] http://www.dataone.org/about
[13] http://www.codata.org/
5. What additional challenges are there in making data usable by scientists in the same field, scientists in other fields, ‘citizen scientists’ and the general public?
The form in which the data is published is a major factor in its reusability e.g. file formats, use of standard ontologies. Many initiatives are defining data standards (see Q3) which will improve the situation for other researchers.
The barriers to making data useable by citizens are higher. Data visualisation and provision of suitable narratives to accompany datasets would be useful, although language barriers would also need to be addressed.
6 a) What might be the benefits of more widespread sharing of data for the productivity and efficiency of scientific research?
More widespread sharing of data increases the efficiency of research through:
- Reduction of duplication – Datasets are discoverable and can be reused e.g. data deposited in the NCBI Gene Expression Omnibus (GEO) database was reused in around 1150 papers from PubMed during 2007-2010 [1]. Sharing of negative results could reduce duplication significantly.
- Ease of replication – Release of full datasets as opposed to summaries in papers enables more effective replication and scope for discovering errors, particularly if related code and algorithms are also released.
- Ease of critique and reanalysis – Thorough and rapid public critique of data would be made easier, possibly leading to more discussions such as the recent arsenic life debate in the blogosphere.
- Crowdsourcing analysis – Sharing data with other scientists can be particularly beneficial when rapid analysis of data is required e.g. the recent E. coli outbreak in Europe where E. coli genome comparisons were crowdsourced in the public domain [2]. This demonstrates the inherent ability of crowdsourcing to increase efficiency compared to multiple closed labs performing the same analysis. Citizen science such as Galaxy Zoo or PlanetHunters also allows the crowdsourced analysis of scientific data by members of the public, which can only happen if data is publicly shared.
- Less time wasted searching for data – The discovery of data under the current system of publication can be a time-consuming process. Once a suitable publication is found, the full data may need to be requested from the authors and that which is included in the publication may not be openly reuseable due to licensing. Finding out if data is reuseable takes time, which can reduce productivity and efficiency. Tools such as the data status request service Is It Open Data? [3], which archives responses from data providers, may be useful in this regard but still take time.
[1] http://researchremix.wordpress.com/2011/05/19/nature-letter/
[2] https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki
[3] http://www.isitopendata.org/
6 b) What might be the benefits of more widespread sharing of data for new sorts of science?
Genomics as a science would have been impossible without data sharing via databases such as GenBank and the same applies to other scientific fields and their respective data sharing methods e.g. astronomy and crystallography.
It is difficult to predict what new fields may emerge from a world with greater availability of data. However, the release of open datasets will enable the utilisation of technology which already exists but is restricted by the availability of useable data e.g. semantic web technologies using linked data could discover obscure connections between datasets and research findings. Text mining and data mining tools have huge potential for new discoveries if allowed access to large sets of scientific information.
6 c) What might be the benefits of more widespread sharing of data for public policy?
Evidence based public policy will be more credible if the data supporting it is freely available (see also Q2d). This could lead to more informed debate from a wider range of stakeholders who all have access to the evidence.
6 d) What might be the benefits of more widespread sharing of data for other social benefits?
More diverse contributions to scientific research and debate would be possible, including increased public engagement. Members of the public with specific interests would have direct access to high quality scientific information e.g. patient groups who would like access to the latest research data on treatments.
6 e) What might be the benefits of more widespread sharing of data for innovation and economic growth?
The opportunities to increase the efficiency and productivity of research discussed in Q6a could lead to an acceleration of innovation.
6 f) What might be the benefits of more widespread sharing of data for public trust in the processes of science?
The recent climate data controversy led to concern among the public about the management and availability of scientific data. To get to the stage where an FOI request is required to retrieve the output of publicly funded research does not reflect well on the transparency and accessibility of science and thus, to some people, its credibility.
Widespread sharing of data, particularly in areas of great public interest, should reduce these concerns.
7. How should concerns about privacy, security and intellectual property be balanced against the proposed benefits of openness?
Privacy of research subjects should be prioritised as per Q2d. This is particularly pertinent to medical trials and informant-based social sciences and discussions on how best to approach data release are ongoing in these fields.
Security may be a valid reason for not publishing data but must be justified and will need to be examined on a case by case basis.
8. What should be expected and/or required of scientists (in companies, universities or elsewhere), research funders, regulators, scientific publishers, research institutions, international organisations and other bodies?
A common vision is required to enable the necessary technological and cultural changes for widespread sharing of scientific information in the manner discussed above to be realised. There is already a top-down push from research funders e.g. a group of 17 health research funders have released a joint statement on data sharing [1] and publishers are beginning to offer more opportunities for publishing datasets (see Q3). More such organisations should be expected to join in the positive promotion of data sharing. This will enable a scaling up of efforts as more scientists are required to share data in order to receive funding or a journal publication.
From scientists themselves there are various grass roots level projects. Many initiatives are listed in Q3. Additionally, the Open Data in Science Working Group at the Open Knowledge Foundation [2] is a community of scientists who aim to promote open data in science and build apps, tools and datasets to allow people to easily publish, find and reuse scientific data. The realisation that the extra work required to publish useful shared data is worth the effort may take some time. The recent winners of the 2011 BMC Open Data Award admitted that “credit..must go to persistent, anonymous referee.., who demanded—twice—that we also publish the background data” [3].
Scientists should not only be required to adhere to data sharing policies set down by organisations but should be expected to give thought to managing all of their data and how they might share it in a way that maximises its impact. This will become more likely as tools make sharing easier and incentives are provided in the form of recognition and rewards for data sharing.
[1] http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Public-health-and-epidemiology/WTDV030689.htm
[2] http://science.okfn.org/
[3] http://blogs.openaccesscentral.com/blogs/bmcblog/entry/on_the_unbearable_lightness_of
Other comments:
In many cases the costs of sharing data are negligible, but in some cases there may be an argument that the cost of effective sharing is too high compared to the potential for reuse e.g. some raw datasets are out of date and not easily digitised. Others are impossibly large e.g. primary data from the LHC or image data from next-gen sequencing machines. In these cases, it would be acceptable to share only what data is available and deemed useable e.g. summary data.
However, cost barriers are reducing over time and in the future it may be possible to publish datasets that are currently unfeasible, so effective data management is essential to preserve as much research output as possible.



