Working Group Response to Royal Society Science as a Public Enterprise

July 10, 2012 in Panton Principles, Publications

Those of you following #openscience news over the last few weeks won’t have failed to notice that the Royal Society in the UK recently released their Science as a Public Enterprise Report in strong support of open science. The working group submitted our collaboratively drafted response during the consultation period, which you can read below or download with other responses.

1. What ethical and legal principles should govern access to research results and data? How can
ethics and law assist in simultaneously protecting and promoting both public and private
interests?

The presiding principle should be that all outputs of publicly funded research are released and made
publicly accessible as soon as is practicable and reasonable. Our position would be that all data, code and
algorithms supporting published scientific results should be released openly alongside that publication in
accordance with the Panton Principles for Open Data in Science [1].

We acknowledge that there are reasons why outputs should not be released, but these are restricted to a
small set of issues including, but not necessarily limited to, personal privacy, personal endangerment, risks
to the research itself and danger to the environment. All of these are much larger issues which deserve
consideration in their own right.

[1] https://pantonprinciples.org/

2 a) How should principles apply to publicly-funded research conducted in the public interest?

The principles and caveats discussed above should apply to all such research.

2 b) How should principles apply to privately-funded research involving data collected about or
from individuals and/or organisations (e.g. clinical trials)?

If the full economic cost is covered by a private company then research is ‘private’ and there shouldn’t be
an expectation of data release in the manner previously discussed. However, even private funding is often
from donors that have expectations of research being done in the public interest e.g. charities supporting
medical research. Therefore, the better distinction may be between public interest research and
commercial R&D (as addressed below).

There should be discussion between donors and funders as to data release policies in these cases and,
where the funder is in agreement, the data should be made available according to the principles above
while prioritising privacy and anonymity of research subjects (see Q2d).

In the case of partial private funding there may be grey areas. This includes cases where research is
privately funded but heavily subsidised by institutions such that companies are not bearing the full
economic cost. These may need consideration case by case but in terms of any published results in the
scientific literature, the default should be that the data to back up those claims is publicly available.

2 c) How should principles apply to research that is entirely privately-funded but with
possible public implications?

If private funding leads to research claims around public policy areas e.g. health, environment or
planning, then data to support that claim should be made available in a publicly accessible manner. This is
imperative if research claims are intended to influence public policy.

2 d) How should principles apply to research or communication of data that involves the
promotion of the public interest but which might have implications for the privacy interests
of citizens?

The privacy of citizens should come before the need for open data, but methods to protect privacy and
still release data in the public interest should be explored and considered where possible.

3. What activities are currently under way that could improve the sharing and communication
of scientific information?

There are numerous barriers to effective sharing and communication of scientific information which are
being addressed by ongoing projects and activities.

Getting the data in shareable form:
One of the most urgent needs to improve sharing and communication of scientific information is to
improve capture and collection of data. Technology is required alongside standards for formatting and
sharing data, and money must be invested in designing and building user-friendly data capture devices.
For example, the efficiency of data capture and collection could be improved by scientific workflow
systems such as Taverna and VisTrails.

Encouraging sharing and communication:
This requires work on the attitude and expectations of scientists, establishment of community norms
around data sharing and a reward system which recognises the worth of dataset publication as well as
journal articles.

Work on reward systems to ensure that scientists get more recognition and benefit for sharing data is
ongoing. Systems such as microattribution and other forms of recognition following reuse of shared data
are being developed by STARMETRICS (NIH) [1], REF (HEFCE) [2] and Altmetrics [3] among others.

This work will be essential for creating incentives and community norms encouraging sharing and release
of data. Researchers' reservations must be addressed; these may include fears of getting scooped,
releasing ‘dirty’ and unedited data in which people may find mistakes, possible harm due to caveats and
interpretation notes being detached from the dataset and the risk of misinterpretation.
The interaction of incentives, culture, and individuals will make the difference in driving greater
accessibility to scientific information.

Data Publication:
An important step in improving sharing is enabling scientific datasets to be published as citeable
objects, and publications such as BMC Research Notes [4] and F1000 data publications (among others) are
allowing this to happen.

Some fields, such as crystallography, recognised the value of data papers far earlier than others. There are
currently initiatives ongoing in other areas which have traditionally not followed this publication model
e.g. meteorology [5].

Much scientific information does not make it into a paper or on its own would not merit a data
publication but may still be of use to others. For instance, most scientists do not publish negative results
but this may lead to unnecessary duplication by other labs, which reduces the efficiency of research.
Projects such as FigShare [6] enable such data to be made accessible and their use should be encouraged.

Standards:
Each scientific discipline will differ in what research data it feels can or cannot be released and when and
how that release should happen. Encouragement of disciplines to articulate norms around reasonable
exceptions to release has started via projects such as Data Dryad [7].

A difficult issue is what rights researchers have to first use of any data they collect. It may be necessary to
set an upper limit to time of publication, particularly in fields where long term data collection is the norm.
However, the default should be that data is released as soon as practicable on a timescale deemed
reasonable by the community.

On top of this, standards related to the formatting and content of datasets must be considered to
increase their usefulness to multiple stakeholders including other researchers and the public. Initiatives
such as Biosharing [8] in the life sciences aim to catalogue bioscience data reporting standards and
policies.

Principles:
Principles for the release of data should be clearly articulated e.g. the Panton Principles for Open Data in
Science [9] provide recommendations for the release of fully open data. There is a difference between
publicly accessible publication and open publication of data; this organisation would strongly promote
the latter. As per the Panton Principles introduction:

Science is based on building on, reusing and openly criticising the published body of scientific knowledge.
For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is
crucial that science data be made open.

By open data in science we mean that it is freely available on the public internet permitting any user to
download, copy, analyse, re-process, pass them to software or use them for any other purpose without
financial, legal, or technical barriers other than those inseparable from gaining access to the internet
itself. To this end data related to published science should be explicitly placed in the public domain.

Data management and repositories:
Numerous projects are ongoing in the field of improving scientific data repositories and research data
management (Dryad [10], JISC UMF [11]), building sustainable infrastructures to draw data from different
sources and make it available (DataONE [12]), and looking at improving the quality and availability of scientific
data more generally (CoDATA [13]).

[1] https://www.starmetrics.nih.gov/

[2] http://www.hefce.ac.uk/research/ref/

[3] http://altmetrics.org/manifesto/

[4] http://www.biomedcentral.com/bmcresnotes/

[5] http://www.jisc.ac.uk/whatwedo/programmes/reppres/sue/ojims.aspx

[6] http://figshare.com/

[7] http://www.datadryad.org/jdap

[8] http://www.biosharing.org/

[9] https://pantonprinciples.org/

[10] http://datadryad.org/

[11] http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx

[12] http://www.dataone.org/about

[13] http://www.codata.org/

5. What additional challenges are there in making data usable by scientists in the same field,
scientists in other fields, ‘citizen scientists’ and the general public?

The form in which the data is published is a major factor in its reusability e.g. file formats, use of
standard ontologies. Many initiatives are defining data standards (see Q3) which will improve the
situation for other researchers.

The barriers to making data useable by citizens are higher. Data visualisation and provision of suitable
narratives to accompany datasets would be useful, although language barriers would also need to be
addressed.

6 a) What might be the benefits of more widespread sharing of data for the productivity and
efficiency of scientific research?

More widespread sharing of data increases the efficiency of research through:

Reduction of duplication – Datasets are discoverable and can be reused e.g. data deposited in the NCBI
Gene Expression Omnibus (GEO) database was reused in around 1150 papers from PubMed during
2007-2010 [1]. Sharing of negative results could reduce duplication significantly.
Ease of replication – Release of full datasets as opposed to summaries in papers enables more effective
replication and scope for discovering errors, particularly if related code and algorithms are also released.
Ease of critique and reanalysis – Thorough and rapid public critique of data would be made easier,
possibly leading to more discussions such as the recent arsenic life debate in the blogosphere.
Crowdsourcing analysis – Sharing data with other scientists can be particularly beneficial when rapid
analysis of data is required e.g. the recent E. coli outbreak in Europe where E. coli genome comparisons
were crowdsourced in the public domain [2]. This demonstrates the ability of crowdsourcing to increase efficiency
compared to multiple closed labs performing the same analysis. Citizen science such as Galaxy Zoo or
PlanetHunters also allows the crowdsourced analysis of scientific data by members of the public, which
can only happen if data is publicly shared.
Less time wasted searching for data – The discovery of data under the current system of publication can be
a time-consuming process. Once a suitable publication is found, the full data may need to be requested
from the authors, and the data included in the publication may not be openly reusable due to
licensing. Finding out if data is reusable takes time, which can reduce productivity and efficiency. Tools
such as the data status request service Is It Open Data? [3], which archives responses from data providers,
may be useful in this regard but still take time.

[1] http://researchremix.wordpress.com/2011/05/19/nature-letter/
[2] https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki
[3] http://www.isitopendata.org/

6 b) What might be the benefits of more widespread sharing of data for new sorts of science?

Genomics as a science would have been impossible without data sharing via databases such as GenBank
and the same applies to other scientific fields and their respective data sharing methods e.g. astronomy
and crystallography.

It is difficult to predict what new fields may emerge from a world with greater availability of data.
However, the release of open datasets will enable the utilisation of technology which already exists but is
restricted by the availability of useable data e.g. semantic web technologies using linked data could
discover obscure connections between datasets and research findings. Text mining and data mining tools
have huge potential for new discoveries if allowed access to large sets of scientific information.

6 c) What might be the benefits of more widespread sharing of data for public policy?

Evidence-based public policy will be more credible if the data supporting it is freely available (see also
Q2c). This could lead to more informed debate from a wider range of stakeholders who all have access to
the evidence.

6 d) What might be the benefits of more widespread sharing of data for other social benefits?

More diverse contributions to scientific research and debate would be possible, including increased public
engagement. Members of the public with specific interests would have direct access to high quality
scientific information e.g. patient groups who would like access to the latest research data on treatments.

6 e) What might be the benefits of more widespread sharing of data for innovation and
economic growth?

The opportunities to increase the efficiency and productivity of research discussed in Q6a could lead to
an acceleration of innovation.

6 f) What might be the benefits of more widespread sharing of data for public trust in the
processes of science?

The recent climate data controversy led to concern among the public about the management and
availability of scientific data. To get to the stage where an FOI request is required to retrieve the output of publicly
funded research does not reflect well on the transparency and accessibility of science and thus, to some
people, its credibility.

Widespread sharing of data, particularly in areas of great public interest, should reduce these concerns.

7. How should concerns about privacy, security and intellectual property be balanced against
the proposed benefits of openness?

Privacy of research subjects should be prioritised as per Q2d. This is particularly pertinent to medical trials
and informant-based social sciences and discussions on how best to approach data release are ongoing in
these fields.

Security may be a valid reason for not publishing data but must be justified and will need to be examined
on a case by case basis.

8. What should be expected and/or required of scientists (in companies, universities or elsewhere), research funders, regulators, scientific publishers, research institutions, international organisations and other bodies?

A common vision is required to enable the necessary technological and cultural changes for widespread
sharing of scientific information in the manner discussed above to be realised. There is already a
top-down push from research funders e.g. a group of 17 health research funders have released a joint
statement on data sharing [1] and publishers are beginning to offer more opportunities for publishing
datasets (see Q3). More such organisations should be expected to join in the positive promotion of data
sharing. This will enable a scaling up of efforts as more scientists are required to share data in order
to receive funding or a journal publication.

From scientists themselves there are various grass-roots projects. Many initiatives are listed in Q3.
Additionally, the Open Data in Science Working Group at the Open Knowledge Foundation [2] is a
community of scientists who aim to promote open data in science and build apps, tools and datasets to
allow people to easily publish, find and reuse scientific data. The realisation that the extra work required
to publish useful shared data is worth the effort may take some time. The recent winners of the 2011
BMC Open Data Award admitted that “credit..must go to persistent, anonymous referee.., who
demanded—twice—that we also publish the background data” [3].

Scientists should not only be required to adhere to data sharing policies set down by organisations but
should be expected to give thought to managing all of their data and how they might share it in a way
that maximises its impact. This will become more likely as tools make sharing easier and incentives are
provided in the form of recognition and rewards for data sharing.

[1] http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Public-health-and-epidemiology/WTDV030689.htm
[2] http://science.okfn.org/
[3] http://blogs.openaccesscentral.com/blogs/bmcblog/entry/on_the_unbearable_lightness_of

Other comments:

In many cases the costs of sharing data are negligible, but in some cases there may be an argument that
the cost of effective sharing is too high compared to the potential for reuse e.g. some raw datasets are
out of date and not easily digitised. Others are impossibly large e.g. primary data from the LHC or image
data from next-gen sequencing machines. In these cases, it would be acceptable to share only what data is
available and deemed useable e.g. summary data.

However, cost barriers are reducing over time and in the future it may be possible to publish datasets that
are currently unfeasible, so effective data management is essential to preserve as much output of
research as possible.
