Publications – OKF Open Science Working Group

Panton Fellow Update: Samuel Moore

Samuel Moore — Wed, 08 Jan 2014 10:21:21 +0000

My first few months as a Panton Fellow have flown by and so I wanted to provide a quick update on the work I’ve been doing. Whilst it’s not possible to discuss everything, I thought it would be good to list some of the larger projects I’ve been working on.

Early into the fellowship I made contact with two of the Open Economics Working Group coordinators, Velichka Dimitrova and Sander Van Der Waal, to discuss how best to encourage Open Data in economics. Whilst we thought that a data journal could be a good way of incentivising data sharing, we also thought it would be sensible to conduct a survey of economists and their data sharing habits to see if our assumptions were correct. This will give us some firm evidence of the best way to advocate for Open Data in economics. The results will be released when they are available.

Staying within the OKFN framework, I also helped kick-start the Open Humanities Group back into action in a meeting with the organisers and a post to the discussion list (posing the question: What does Open Humanities research data mean to you?). As a humanities researcher myself I am very keen to see the humanities embrace a more open approach to scholarship and it’s great to see a resurgence of activity here. So far this has resulted in a forthcoming Open Literature Sprint on January 25th in London. This sprint will build upon some of the work already completed on the Open Literature and Textus projects for collaborating, analysing and sharing open access and public domain works of literature and philosophy. Whilst I cannot take any credit for organising the event, I will certainly be in attendance and I encourage all those interested in Open Humanities research/data to attend too. We are looking for coders, editors and textfinders for the event – absolutely no technical skills required! You can sign up to attend here.

However, the majority of my time has been spent working on a book: An Introduction to Open Research Data. This edited volume will feature chapters by Open Data experts in a range of academic disciplines, covering practical information on licensing, ethics, and information for data curators, alongside more theoretical issues surrounding the adoption of Open Data. As the book will be Open Access, each chapter will be able to stand alone from the main volume so communities can host, distribute and remix the content that is relevant to them (the book will also be available in print). The table of contents is near enough finalised and the contributions are currently being written. I’m hoping the volume will be ready by August but watch this space! Do get in touch if you’ve any questions at all.

In addition, here is a round-up of the blogposts I’ve written so far:

On the Harvard Dataverse Network Project – an open-source tool for data sharing

What are the incentives for data sharing?

Panton Fellow Introduction: Samuel Moore

Working Group Response to Royal Society Science as a Public Enterprise

Jenny Molloy — Tue, 10 Jul 2012 13:56:29 +0000

Those of you following #openscience news over the last few weeks won’t have failed to notice that the Royal Society in the UK recently released their Science as a Public Enterprise Report in strong support of open science. The working group submitted our collaboratively drafted response during the consultation period, which you can read below or download with other responses.

What ethical and legal principles should govern access to research results and data? How can
ethics and law assist in simultaneously protecting and promoting both public and private
interests:

The presiding principle should be that all outputs of publicly funded research are released and made
publicly accessible as soon as is practicable and reasonable. Our position would be that all data, code and
algorithms supporting published scientific results should be released openly alongside that publication in
accordance with the Panton Principles for Open Data in Science [1].

We acknowledge that there are reasons why outputs should not be released but these are restricted to a
small set of issues including but not necessarily limited to personal privacy, personal endangerment, risks
to the research itself and danger to the environment. All of these are much larger issues which deserve
consideration.

[1] www.pantonprinciples.org

2 a) How should principles apply to publicly-funded research conducted in the public interest:

The principles and caveats discussed above should apply to all such research.

2 b) How should principles apply to privately-funded research involving data collected about or
from individuals and/or organisations (e.g. clinical trials)?

If the full economic cost is covered by a private company then research is ‘private’ and there shouldn’t be
an expectation of data release in the manner previously discussed. However, even private funding is often
from donors that have expectations of research being done in the public interest e.g. charities supporting
medical research. Therefore, the better distinction may be between public interest research and
commercial R&D (as addressed below).

There should be discussion between donors and funders as to data release policies in these cases and
where the funder is in agreement the data should be made available according to the principles above
while prioritising privacy and anonymity of research subjects (see Q2d).

In the case of partial private funding there may be grey areas. This includes cases where research is
privately funded but heavily subsidised by institutions such that companies are not bearing the full
economic cost. These may need consideration case by case but in terms of any published results in the
scientific literature, the default should be that the data to back up those claims is publicly available.

2 c) How should principles apply to research that is entirely privately-funded but with
possible public implications?

If private funding leads to research claims around public policy areas e.g. health, environment or
planning, then data to support that claim should be made available in a publicly accessible manner. This is
imperative if research claims are intended to influence public policy.

2 d) How should principles apply to research or communication of data that involves the
promotion of the public interest but which might have implications from the privacy interests
of citizens?

The privacy of citizens should come before the need for open data, but methods to protect privacy and
still release data in the public interest should be explored and considered where possible.

3. What activities are currently under way that could improve the sharing and communication
of scientific information?

There are numerous barriers to effective sharing and communication of scientific information which are
being addressed by ongoing projects and activities.

Getting the data in shareable form:
One of the most urgent needs to improve sharing and communication of scientific information is to
improve capture and collection of data. Technology is required alongside standards for formatting and
sharing data and money must be invested in designing and building user friendly data capture devices.
For example, the efficiency of data capture and collection could be improved by scientific workflow
systems such as Taverna and VisTrails.

Encouraging sharing and communication:
This requires work on the attitude and expectations of scientists, establishment of community norms
around data sharing and a reward system which recognises the worth of dataset publication as well as
journal articles.

Work on reward systems to ensure that scientists get more recognition and benefit for sharing data is
ongoing. Systems such as microattribution and other forms of recognition following reuse of shared data
are being developed by STARMETRICS (NIH) [1], REF (HEFCE) [2] and Altmetrics [3] among others.

This work will be essential for creating incentives and community norms encouraging sharing and release
of data. Researchers’ reservations must be addressed, which may include fears of getting scooped,
releasing ‘dirty’ and unedited data in which people may find mistakes, possible harm due to caveats and
interpretation notes being detached from the dataset and the risk of misinterpretation.
The interaction of incentives, culture, and individuals will make the difference in driving greater
accessibility to scientific information.

Data Publication:
An important step in improving sharing is enabling scientific datasets to be published as citeable
objects, and publications such as BMC Research Notes [4] and F1000 data publications (among others) are
allowing this to happen.

Some fields, such as crystallography, recognised the value of data papers far earlier than others. There are
currently initiatives ongoing in other areas which have traditionally not followed this publication model
e.g. meteorology [5].

Much scientific information does not make it into a paper or on its own would not merit a data
publication but may still be of use to others. For instance, most scientists do not publish negative results
but this may lead to unnecessary duplication by other labs, which reduces the efficiency of research.
Projects such as FigShare [6] enable such data to be made accessible and their use should be encouraged.

Standards:
Each scientific discipline will differ in what research data it feels can or cannot be released and when and
how that release should happen. Encouragement of disciplines to articulate norms around reasonable
exceptions to release has started via projects such as Data Dryad [7].

A difficult issue is what rights researchers have to first use of any data they collect. It may be necessary to
set an upper limit to time of publication, particularly in fields where long term data collection is the norm.
However, the default should be that data is released as soon as practicable on a timescale deemed
reasonable by the community.

On top of this, standards related to the formatting and content of datasets must be considered to
increase their usefulness to multiple stakeholders including other researchers and the public. Initiatives
such as Biosharing [8] in the life sciences aim to catalogue bioscience data reporting standards and
policies.

Principles:
Principles for the release of data should be clearly articulated e.g. the Panton Principles for Open Data in
Science [9] provide recommendations for the release of fully open data. There is a difference between
publicly accessible publication and open publication of data; this organisation would strongly promote
the latter. As per the Panton Principles introduction:

Science is based on building on, reusing and openly criticising the published body of scientific knowledge.
For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is
crucial that science data be made open.

By open data in science we mean that it is freely available on the public internet permitting any user to
download, copy, analyse, re-process, pass them to software or use them for any other purpose without
financial, legal, or technical barriers other than those inseparable from gaining access to the internet
itself. To this end data related to published science should be explicitly placed in the public domain.

Data management and repositories:
Numerous projects are ongoing in the field of improving scientific data repositories and research data
management (Dryad [10], JISC UMF [11]), building sustainable infrastructures to draw data from different
sources and make it available (DataONE [12]), and looking at improving the quality and availability of scientific
data more generally (CoDATA [13]).

[1] https://www.starmetrics.nih.gov/

[2] http://www.hefce.ac.uk/research/ref/

[3] http://altmetrics.org/manifesto/

[4] http://www.biomedcentral.com/bmcresnotes/

[5] http://www.jisc.ac.uk/whatwedo/programmes/reppres/sue/ojims.aspx

[6] http://figshare.com/

[7] http://www.datadryad.org/jdap

[8] http://www.biosharing.org/

[9] https://pantonprinciples.org/

[10] http://datadryad.org/

[11] http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx

[12] http://www.dataone.org/about

[13] http://www.codata.org/

5. What additional challenges are there in making data usable by scientists in the same field,
scientists in other fields, ‘citizen scientists’ and the general public?

The form in which the data is published is a major factor in its reusability e.g. file formats, use of
standard ontologies. Many initiatives are defining data standards (see Q3) which will improve the
situation for other researchers.

The barriers to making data useable by citizens are higher. Data visualisation and provision of suitable
narratives to accompany datasets would be useful, although language barriers would also need to be
addressed.

6 a) What might be the benefits of more widespread sharing of data for the productivity and
efficiency of scientific research?

More widespread sharing of data increases the efficiency of research through:

Reduction of duplication – Datasets are discoverable and can be reused e.g. data deposited in the NCBI
Gene Expression Omnibus (GEO) database was reused in around 1150 papers from PubMed during
2007-2010 [1]. Sharing of negative results could reduce duplication significantly.
Ease of replication – Release of full datasets as opposed to summaries in papers enables more effective
replication and scope for discovering errors, particularly if related code and algorithms are also released.
Ease of critique and reanalysis – Thorough and rapid public critique of data would be made easier,
possibly leading to more discussions such as the recent arsenic life debate in the blogosphere.
Crowdsourcing analysis – Sharing data with other scientists can be particularly beneficial when rapid
analysis of data is required e.g. the recent E. coli outbreak in Europe where E. coli genome comparisons
were crowdsourced in the public domain [2]. This demonstrates its inherent ability to increase efficiency
compared to multiple closed labs performing the same analysis. Citizen science such as Galaxy Zoo or
PlanetHunters also allows the crowdsourced analysis of scientific data by members of the public, which
can only happen if data is publicly shared.
Less time wasted searching for data – The discovery of data under the current system of publication can be
a time consuming process. Once a suitable publication is found, the full data may need to be requested
from the authors and that which is included in the publication may not be openly reusable due to
licensing. Finding out if data is reusable takes time, which can reduce productivity and efficiency. Tools
such as the data status request service Is It Open Data? [3], which archives responses from data providers,
may be useful in this regard but still take time.

[1] http://researchremix.wordpress.com/2011/05/19/nature-letter/
[2] https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki
[3] http://www.isitopendata.org/

6 b) What might be the benefits of more widespread sharing of data for new sorts of science?

Genomics as a science would have been impossible without data sharing via databases such as GenBank
and the same applies to other scientific fields and their respective data sharing methods e.g. astronomy
and crystallography.

It is difficult to predict what new fields may emerge from a world with greater availability of data.
However, the release of open datasets will enable the utilisation of technology which already exists but is
restricted by the availability of useable data e.g. semantic web technologies using linked data could
discover obscure connections between datasets and research findings. Text mining and data mining tools
have huge potential for new discoveries if allowed access to large sets of scientific information.

6 c) What might be the benefits of more widespread sharing of data for public policy?

Evidence based public policy will be more credible if the data supporting it is freely available (see also
Q2d). This could lead to more informed debate from a wider range of stakeholders who all have access to
the evidence.

6 d) What might be the benefits of more widespread sharing of data for other social benefits?

More diverse contributions to scientific research and debate would be possible, including increased public
engagement. Members of the public with specific interests would have direct access to high quality
scientific information e.g. patient groups who would like access to the latest research data on treatments.

6 e) What might be the benefits of more widespread sharing of data for innovation and
economic growth?

The opportunities to increase the efficiency and productivity of research discussed in Q6a could lead to
an acceleration of innovation.

6 f) What might be the benefits of more widespread sharing of data for public trust in the
processes of science?

The recent climate data controversy led to concern among the public about the management and
availability of scientific data. To get to the stage where an FOI is required to retrieve the output of publicly
funded research does not reflect well on the transparency and accessibility of science and thus, to some
people, its credibility.

Widespread sharing of data, particularly in areas of great public interest, should reduce these concerns.

7. How should concerns about privacy, security and intellectual property be balanced against
the proposed benefits of openness?

Privacy of research subjects should be prioritised as per Q2d. This is particularly pertinent to medical trials
and informant-based social sciences and discussions on how best to approach data release are ongoing in
these fields.

Security may be a valid reason for not publishing data but must be justified and will need to be examined
on a case by case basis.

8. What should be expected and/or required of scientists (in companies, universities or elsewhere), research funders, regulators, scientific publishers, research institutions, international organisations and other bodies?

A common vision is required to enable the necessary technological and cultural changes for widespread
sharing of scientific information in the manner discussed above to be realised. There is already a
top-down push from research funders e.g. a group of 17 health research funders have released a joint
statement on data sharing [1] and publishers are beginning to offer more opportunities for publishing
datasets (see Q3). More such organisations should be expected to join in the positive promotion of data
sharing. This will enable a scaling up of efforts as more scientists are required to share data in order
to receive funding or a journal publication.

From scientists themselves there are various grass roots level projects. Many initiatives are listed in Q3.
Additionally, the Open Data in Science Working Group at the Open Knowledge Foundation [2] is a
community of scientists who aim to promote open data in science and build apps, tools and datasets to
allow people to easily publish, find and reuse scientific data. The realisation that the extra work required
to publish useful shared data is worth the effort may take some time. The recent winners of the 2011
BMC Open Data Award admitted that “credit..must go to persistent, anonymous referee.., who
demanded—twice—that we also publish the background data” [3].

Scientists should not only be required to adhere to data sharing policies set down by organisations but
should be expected to give thought to managing all of their data and how they might share it in a way
that maximises its impact. This will become more likely as tools make sharing easier and incentives are
provided in the form of recognition and rewards for data sharing.

[1] http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Public-health-andepidemiology/WTDV030689.htm
[2] http://science.okfn.org/
[3] http://blogs.openaccesscentral.com/blogs/bmcblog/entry/on_the_unbearable_lightness_of

Other comments:

In many cases the costs of sharing data are negligible, but in some cases there may be an argument that
the cost of effective sharing is too high compared to the potential for reuse e.g. some raw datasets are
out of date and not easily digitised. Others are impossibly large e.g. primary data from the LHC or image
data from next-gen sequencing machines. In this case, it would be acceptable to share only what data is
available and deemed useable e.g. summary data.

However, cost barriers are reducing over time and in the future it may be possible to publish datasets that
are currently unfeasible, so effective data management is essential to preserve as much output of
research as possible.

A Researcher’s Response to IPO Consultation on Text Mining Copyright Exception

Jenny Molloy — Wed, 21 Mar 2012 21:29:53 +0000

Personal experience and evidence from Professor Peter Murray-Rust.

I have been involved in developing and deploying text and other forms of data mining in chemistry and related sciences (e.g. biosciences and material sciences) for ten years. I have developed open source tools for chemistry (OSCAR [1], OPSIN [2], ChemicalTagger [3]), which have been developed with funding from EPSRC, JISC, DTI and Unilever PLC. These tools represent the de facto open source standard and are used throughout the world. In November 2011, I gave an invited plenary lecture on their use to LBM 2011 (Languages in Biology and Medicine) in Singapore [4].

These tools are capable of very high throughput and accuracy. Last week we extracted and analysed 500,000 chemical reactions from the US patent office service; approximately 100,000 reactions per processor per day. Our machine interpretation of chemical names (OPSIN) is over 99.5% accurate, better than any human. The extractions are complete, factual records of the experiment, to the extent that humans and machines could use them to repeat the work precisely or to identify errors made by the original authors.

It should be noted that many types of media other than text provide valuable scientific information, especially graphs and tables, images of scientific phenomena, and audio / video captures of scientific factual material. Many publishers and rights agencies would assert that graphs and machine-created images were subject to copyright while I would call them “facts”. I therefore often use the term “information mining” rather than “text mining”.

It is difficult to estimate the value of this work precisely, because we are currently restricted from deploying it on the current scientific literature by contractual restrictions imposed by all major publishers. However it is not fanciful to suggest that our software could be used in a “Chemical Google” indexing the scientific literature and therefore potentially worth low billions.
Some indications of value are:

My research cost £2 million in funding, and because of its widespread applicability, would be conservatively expected to be valued at several times that amount. The UK has a number of highly valued textmining companies such as Autonomy [5], Linguamatics [6], and Digital Science (Macmillan) [7]. Our work is highly valuable to them, as they both use our software [under Open licence] and recruit our staff when they finish. In this sense already, we have contributed to UK wealth generation.
The downstream value of high quality, high throughput chemical information extracted from the literature can be measured against conventional abstraction services, such as the Chemical Abstracts Service of the ACS [8] and Reaxys [9] from Elsevier, with a combined annual turnover of perhaps $500-1,000 million. We believe our tools are capable of building the next and better generation of chemical abstraction services, and they would be direct competitors in this high value market. This supports our valuation of chemical textmining in the low billions.
The value of the tools themselves is difficult to estimate, but Chemical Informatics has for many years been a traditional SME activity in the UK and would have been expected to grow if textmining had been permitted. Companies such as Hampden Data Services, ORAC, Oxford Molecular and Lhasa have values in the 10-100 millions.
I come from a UK pharmaceutical industrial background (15 years in Glaxo). I know from personal experience and discussions with other companies that it is not uncommon for drugs which fail to have post-mortems showing that the reason for failure could have been predicted from the original scientific literature, had it been analysed properly. Such failures can run to $100 million and the lack of ability to use the literature in an effective modern manner must contribute to serious loss of both effort and opportunity. My colleague Professor Steve Ley has estimated that because of poor literature analysis tools 20-25% of the work done in his synthetic chemistry lab is unnecessary duplication or could be predicted to fail. In a 20-year visionary EPSRC Grand Challenge (Dial-a-molecule) Prof Richard Whitby of Southampton is coordinating UK chemists, including industry, to design a system that can predict how to make any given molecule. The top priority is to be able to use the literature in an “artificially intelligent manner” where machines rather than humans can process it, impossible without widespread mining rights.
The science and technology of information mining itself is seriously held back by the current contractual restrictions. The acknowledged approach to building quality software is to agree on an open, immutable, ‘gold standard’ corpus of relevant literature, against which machine learning methods are trained. We have been forbidden by rights holders from distributing such corpora, and as a result our methods are seriously delayed (I estimate by at least three years) and are impoverished in their comprehensiveness and applicability. It is difficult to quantify the lost opportunities, but my expert judgement is that by linking scientific facts, such as those in the chemical literature, to major semantic resources such as Linked Open Data [10] and DBPedia [11] an enormous number of potential opportunities arise, both for better practice, and for the generation of new wealth generating tools.

Note: Most of my current work involves factual information, and I believe is therefore not subject to copyright. However, it is impossible to get clarification on this, and publishers have threatened to sue scientists for publishing factual information. I have always erred on the side of caution, and would greatly value clear guidelines from this process, indicating where I have an absolute right to extract without this continuing fear.

In response to Consultation Question 103

“What are the advantages and disadvantages of allowing copyright exceptions to be overridden by contracts? Can you provide evidence of the costs or benefits of introducing a contract-override clause of the type described above?”

The difficulties I have faced are not even due to copyright problems as I understand it, but to additional contractual and technical barriers imposed by publishers to access their information for the purposes of extracting facts and redistributing them for the good of science and the wider community.

The barriers I have faced over the last five years appear common to all major publishers and include not only technical constraints (e.g. the denial of literature by publisher robot technology) but also difficulties in establishing copyright/contractual restrictions, which I do not wish to break. It is extremely difficult to get clear permissions to carry out any work in this field, and while a court might find that I had not been guilty of violating copyright/contract, I cannot rely on this. Therefore, I have taken the safest course of not deploying my world leading research.

Among the publishers with which I have had correspondence are Nature Publishing Group, American Chemical Society, Royal Society of Chemistry, Wiley, Elsevier and Springer. None have given me explicit permission to use their content for the unrestricted access of scientific facts by automated means and many have failed even to acknowledge my request for permission. I have for example challenged the assertion made by the Publishing Research Consortium that ‘publishers seem relatively liberal in granting permission’ for content mining. [12]

In conclusion, I stress that any need to request permissions drastically reduces the value of text mining. I have spent at least a year’s worth of my time attempting to get permissions as opposed to actually carrying out my research. At LBM 2011, I asked other participants, and they universally agreed that it was effectively impossible to get useful permissions for text mining. This is backed up by the evidence of Max Haeussler to the US OSTP [13] and his comprehensive analysis of publisher impediments where it has taken some publishers over two years to agree any permissions, while many others have failed to respond within 30 days of being asked [14]. I do not believe therefore, that this problem can be solved by goodwill assertions from the publishers. Part of the Hargreaves-initiated reform should be to assert the rights that everyone has in using the scientific factual literature for human benefit.

In response to Consultation Question 77

“Would an exception for text and data mining that is limited to non commercial research be capable of delivering the intended benefits? Can you provide evidence of the costs and benefits of this measure? Are there any alternative solutions that could support the growth of text and data mining technologies and access to them?”

Non-commercial clauses are completely prejudicial to effective use of text mining, because many of the providers and consumers will be commercial. For example, the UK SMEs could not use a corpus produced under these conditions, nor could they develop added downstream value.

I have had discussions with several publishers who have insisted on imposing NC restrictions on material. They are clearly aware of its role, and it is difficult to understand their motives in insisting on NC, other than to protect the publishers’ own interests by denying the widespread exploitation of the content. In two recent peer-reviewed papers, it has been convincingly shown that NC adds no benefits, is almost impossible to operate cleanly, and is highly restrictive of downstream use. [15, 16]

Alternative Solutions:
These contractual restrictions have been introduced unilaterally by publishers without effective challenge from the academic and wider community. The publishers have shown that they are not impartial custodians of the scientific literature. I believe this is unacceptable for the future and that a different process for regulation and enforcement is required. The questions I would wish to see addressed are:

Which parts of the scientific literature are so important that they should effectively be available to the public? One would consider, at least:
- facts (in their widest sense, i.e. including graphs, images, audio/visual)
- additional material such as design of experiments, caveats from the authors, discussions.
- metadata such as citations, annotations, bibliography
Who should decide this?
It must not be the publishers. Unfortunately many scientific societies also have a large publishing arm (e.g. Royal Soc Chem) and they cannot be seen as impartial.
I would suggest either the British Library, or a subgroup of the RCUK and other funding bodies
How should it be policed and conflicts resolved?
Where possible the regulator I propose should obtain agreement from all parties before potential violation. If not possible, then the onus should be on the publishers to challenge the miners, through the regulator. Ultimately there is always final recourse to the law.

—————————————–

[1] http://www.jcheminf.com/content/3/1/41

[2] http://pubs.acs.org/articlesonrequest/AOR-PcYgSy87ettZWfqyvHmN

[3] http://www.jcheminf.com/content/3/1/17

[4] http://lbm2011.biopathway.org/

[5] http://www.autonomy.com/

[6] http://www.linguamatics.com/

[7] http://www.digital-science.com/

[8] http://www.cas.org/

[9] https://www.reaxys.com/info/

[10] http://linkeddata.org/

[11] http://dbpedia.org/About

[12] Smit, Eefke and van der Graaf, Maurits, ‘Journal Article Mining’, Publishing Research Consortium, Amsterdam, May 2011. http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf.

[13] http://www.whitehouse.gov/sites/default/files/microsites/ostp/scholarly-pubs-%28%23226%29%20hauessler.pdf

[14] See also Max Haeussler, CBSE, UC Santa Cruz, 2012, tracking data titled Current coverage of Pubmed, Requests for permission sent to publishers, at http://text.soe.ucsc.edu/progress.html

[15] Hagedorn, Mietchen, Morris, Agosti, Penev, Berendsohn & Hobern, ‘Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information’, ZooKeys 150 (2011), Special issue: ‘e-Infrastructures for data publishing in biodiversity science’, 127-149.

[16] Carroll MW (2011) Why Full Open Access Matters. PLoS Biol 9(11): e1001210. doi:10.1371/journal.pbio.1001210

Working Group Response to IPO Consultation on Text Mining Copyright Exception

Jenny Molloy — Wed, 21 Mar 2012 21:21:11 +0000

The Intellectual Property Office of the UK government have been running a public consultation on changes to copyright law recommended in the Hargreaves Review of Intellectual Property and Growth in 2011.

The report stated that:

Researchers want to use every technological tool available, and they want to develop new ones. However, the law can block valuable new technologies, like text and data mining, simply because those technologies were not imagined when the law was formed.

It recommends:

Text mining is one current example of a new technology which copyright should not inhibit, but does. It appears that the current non-commercial research “Fair Dealing” exception in UK law will not cover use of these tools under the current interpretation of “Fair Dealing”. In any event text mining of databases is often excluded by the contract for accessing the database. The Government should introduce a UK exception in the interim under the non-commercial research heading to allow use of analytics for non-commercial use…as well as promoting at EU level an exception to support text mining and data analytics for commercial use.

Along with Diane Cabell on behalf of iCommons (who has also blogged the response), the working group collaboratively drafted a response to the consultation which you can read below. We have several members of the working group with personal research interests in text and data mining. Peter Murray-Rust submitted his own response which you can read in the next blog post.

Response of the Open Knowledge Foundation’s Open Data in Science Working Group to BIS0312: Exception for copying of works for use by text and data analytics

The Open Data in Science Working Group at the Open Knowledge Foundation strongly supports the Government response [1] to the Hargreaves Review of Intellectual Property and Growth [2] and encourages it to follow through on its many excellent proposals. As scientists and scholars, we are both creators and users of intellectual property. Our creations, however, only have their full value when they are shared with other researchers. Our data becomes exponentially more useful when combined with the data of others. The intention of copyright law is to support public dissemination and enable the appropriate and effective recombination of work. Unfortunately, in the area of science, current copyright actually delays or blocks the effective re-use of research results in the current digital environment. We encourage adaptation of the law to benefit the progress of science and its attendant economic advantages. In our professional experience we have found that the ability to freely use research data benefits science as well as creators of scientific work products.

In particular, the working group strongly urges implementation of Recommendation 5 to allow specific exceptions in copyright law for data and text mining.

Information mining is the way that modern technology locates digital information. The sheer number of publications and data sets means that thorough and accurate searching can no longer be performed by the hand and eye of an individual researcher [3]. Mining is not only a deductive tool to analyze research data; it is also how search engines operate to allow discovery of content. To prevent mining is therefore to force UK scientists into blind alleys and silos where only limited knowledge is accessible. Science does not progress if it cannot incorporate the most recent findings and move forward from there. Today, digitized scientific information comes from hundreds of thousands of different sources in our globally connected scientific community [4], with many current data sets measured in terabytes [5]. In this environment it is no longer possible to simply read a scholarly summary in order to make scientifically significant use of such information.

Hence, one must be able to copy information, recombine it with other data and otherwise “re-use” it so as to produce truly helpful results. It would require extraordinarily time-consuming efforts to secure permission to mine each and every relevant article from hundreds, even thousands of sources. A recent report by the Joint Information Systems Committee (JISC) on the issues raised in this consultation demonstrates the value of time lost under the current system. By their estimates, a single researcher obtaining permission to mine PubMedCentral articles mentioning malaria could lose over 60% of their working year contacting the 1,024 journals necessary to obtain access to the complete corpus of literature [6]. A blanket exception is the only way to ensure that UK scientists can truly stay abreast of scientific progress.

Therefore, we agree with the Government’s opinion that it is inappropriate for “certain activities of public benefit such as medical research obtained through text mining to be in effect subject to veto by the owners of copyrights in the reports of such research, where access to the reports was obtained lawfully.” Restricting such transformative use is not in the UK’s overall scientific, much less economic, interests. It is also not in the interest of Europe as a whole. The Ghent Declaration states “European researchers, while remaining in touch with the whole world, could also benefit from tools that would allow for better collaboration within Europe. In particular, such goals can be assisted by computers using techniques ranging from data mining to the semantic web. However, machine-based inference techniques work well only if documents are freely accessible…” [7].

We applaud the recent draft policy statement from the UK Research Councils asserting their support for an open access mandate covering all research they fund, particularly their emphasis on a CC-BY or similar license which allows “unrestricted use of manual and automated text and data mining tools” [8]. While this will lead to positive developments going forward, access to the pre-existing literature is equally important.

Response to Specific Consultation Questions

77. Would an exception for text and data mining that is limited to non-commercial research be capable of delivering the intended benefits? Can you provide evidence of the costs and benefits of this measure? Are there any alternative solutions that could support the growth of text and data mining technologies and access to them?

Non-Commercial Limitations

A non-commercial limitation is not helpful and would reduce the benefits delivered to the economy. Researchers in both academia and industry are often reliant on the same information e.g. libraries of chemical structures. Therefore, to impede commercial access to mined text and data would result in duplicated time, effort and expense to obtain the information. Reducing dissemination to SMEs and other commercial organisations who could use it to generate useful and value-added products would appear counterproductive. More details on the economic value of chemical information can be found in the submission to this consultation from Dr Peter Murray-Rust.

In addition to the potential loss of both scientific opportunity and economic returns, implementing a non-commercial clause is non-trivial. There is no clear local or global definition of “non-commercial use”, particularly when one recognizes that scientific data is merged across many jurisdictional boundaries. If the term is to be employed, it needs to be clearly defined, but this is difficult given that the full range of non-commercial opportunities cannot be foreseen. The difficulties of defining non-commercial were laid out in a detailed report by Creative Commons and remain a topic of ongoing discussion as the organisation reviews its licenses, which are among the most commonly used for open access scientific literature [9].

Even if clearly and strictly defined, non-commercial clauses could lead to data sharing problems within collaborations between academia and industry. Certainly, any prohibition against commercial use would hinder the widest possible sharing of data. It would probably also increase the cost to consumers of products based on such data and reduce the potential economic return by discouraging commercial organisations from creating value added products based on the results of text and data mining.

Disallowing downstream uses also complicates the publication and licensing of results from information mining. Having to lock or watermark the initial results of non-commercial mining in order to prevent their later commercial use would quickly prove cumbersome and problematic. Would each piece of mining information need to be labeled as restricted? Or only the fully compiled set of results from a mining operation? Would researchers have to track all subsequent uses and re-uses of mining simply to protect themselves from the threat of infringement litigation? By reducing interoperability and complicating the licensing situation, the labour and associated costs of information mining would significantly increase.

For these reasons, this group does not favor restricting the exception to non-commercial uses, but rather supports an exception for all mining purposes. At a minimum, any subscriber to closed access journals regardless of their commercial status should be able to mine the information for which they have paid subscription fees [10].

Evidence of costs and benefits

The recent JISC study, The Value and Benefits of Text Mining [11], includes many examples of the economic benefits and costs of text mining. Rather than repeat their excellent work, we wish to add the observations of one of our working group, Dr Peter Murray-Rust of the University of Cambridge, who has been working in text- and data-mining for 30 years. During that time the relevant technology has developed dramatically to the point where he can, for example, extract 10 million chemical reactions from the published text literature. However, after two years of negotiations with a major publisher, the Murray-Rust research group was told that it could mine the publisher’s corpus only if all of the results belonged to the publisher and were not published.

Similarly, Murray-Rust developed methods for crystallographic data-mining which have become accepted in the crystallographic community. Some such data is published openly alongside conventional articles so a dataset of 250,000 compounds has already been processed. The results of this analysis can be highly valuable for drug design but extension of the dataset is hampered by restrictive conditions of reuse. For example, other data are deposited with a non-profit organisation that makes individual data sets freely available to the scientific community, but under terms that may limit re-use of the mined data. In terms of text mining, factual data embedded in text is a grey area. For example, the factual statement “the melting point of X is 30 degrees” is usually hidden behind a paywall even if related data is available freely in supplementary files.

Further information is available in Peter Murray-Rust’s personal submission to this consultation.

Alternatives:

Although not an alternative, one step that might encourage data owners to permit information mining would be to provide a type of blanket immunity that would protect them from claims based on a user’s reliance on data that later proves to be inaccurate or fraudulent. A major cost and disincentive to data providers who allow public access to data may thereby be reduced whether those providers are individual researchers, their employers or journals.

However, to the extent that journals wish to protect copyright in information mining for the purpose of protecting their own opportunities to extract revenue from such use, the exemption will not be enough to offset the burden on researchers to secure permission from individual publishers. It would therefore function as an adjunct to copyright exceptions to address one of publishers’ concerns about a requirement to allow mining and reuse.

103. What are the advantages and disadvantages of allowing copyright exceptions to be overridden by contracts? Can you provide evidence of the costs or benefits of introducing a contract-override clause of the type described above?

To permit contractual override of a mining exception is to risk voiding the exception. Text and data mining in academia are usually prohibited by contract at present, on top of any copyright. In some cases automatic systems will shut down access to institutions if they believe that the contract has been violated (including on occasions when it has not). A survey of licence agreements with institutions across 11 major publishers revealed that 7 out of 11 explicitly ban text and data mining or automated indexing by web crawlers [12]. A more comprehensive survey of licenses and contractual agreements performed by Max Haeussler at UC Santa Cruz reveals a similar pattern across a wide array of publishers [13]. Some of these organisations provide proprietary systems, often charged for, which enable some limited text-mining functionality.

Unfortunately, the majority of publishers when contacted directly with text mining requests have been “extremely unhelpful” [14] if not unresponsive [13], which would indicate that there is little reason to expect mining policies will change voluntarily, as they would effectively have to under a contractual override system where publishers could opt out of allowing mining via their licensing agreements.

——————–
[1] UK Government Response to Hargreaves Report, August 2011 at http://www.bis.gov.uk/assets/biscore/innovation/docs/g/11-1199-government-response-to-hargreaves-review

[2] Hargreaves Report, May 2011 at http://www.ipo.gov.uk/ipreview.htm

[3] The Value and Benefits of Text Mining, JISC, Report Doc #811, March 2012 at http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx, citing P.J.Herron, “Text Mining Adoption for Pharmacogenomics-based Drug Discovery in a Large Pharmaceutical Company: a Case Study,” Library, 2006, claiming that text mining tools evaluated 50,000 patents in 18 months, a task that would have taken 50 person years to perform manually.

[4] See MEDLINE® Citation Counts by Year of Publication, at http://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html and National Science Foundation, Science and Engineering Indicators: 2010, Chapter 5 at http://www.nsf.gov/statistics/seind10/c5/c5h.htm asserting that the annual growth in the volume of scientific journal articles published is on the order of 2.5%.

[5] Panzer-Steindel, Bernd, Sizing and Costing of the CERN T0 center, CERN-LCG-PEB-2004-21, 09 June 2004 at http://lcg.web.cern.ch/lcg/planning/phase2_resources/SizingandcostingoftheCERNT0center.pdf.

[6] Van Noorden, Richard; Trouble at the text mine, Nature, V. 483, Issue 7388, 08 March 2012. See also JISC, op.cit., at page 27 “Consequently, in this example, a researcher would need to contact 1,024 journals at a transaction cost (in terms of time spent) of £18,630; 62.1% of a working year.”

[7] Ghent Declaration, February 2011, OpenAIRE, at http://www.openaire.eu/en/component/content/article/223-seizing-the-opportunity-for-open-access-to-european-research-ghent-declaration-published

[8] RCUK Proposed Policy on Access to Research Outputs, RCUK, March 2012 at http://www.openscholarship.org/upload/docs/application/pdf/2012-03/rcuk_proposed_policy_on_access_to_research_outputs.pdf

[9] Creative Commons, Defining “Noncommercial”: A Study of How the Online Population Understands “Noncommercial Use”, September 2009 at http://wiki.creativecommons.org/Defining_Noncommercial

[10] Dr Murray-Rust, Peter; ‘Information mining and Hargreaves: I set out the absolute rights for readers. Non-negotiable’ at http://blogs.ch.cam.ac.uk/pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable/

[11] JISC, op.cit., Chapters 4 and 5.

[12] ‘Journal Licensing Agreements’, April 2011, Spreadsheet of Results at https://docs.google.com/spreadsheet/ccc?key=0AtV3tIqIu0UZdGVMNTAtejhBUlFySGk4QWdrVHJNdkE&authkey=CKC-_LQP&hl=en_US#gid=0.

[13] Max Haeussler, CBSE, UC Santa Cruz, 2012, tracking data titled Current coverage of Pubmed, Requests for permission sent to publishers, at http://text.soe.ucsc.edu/progress.html

[14] Murray-Rust, Peter; cited in ‘Journal Article Mining’, Publishing Research Consortium, Amsterdam, May 2011 at http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf. See also, Murray-Rust, Peter; ‘Wiley: Cambridge scientist require to text-mine content in Wiley journals: please switch off the lawyers and the robots’ at http://blogs.ch.cam.ac.uk/pmr/2012/03/07/wiley-cambridge-scientist-require-to-text-mine-content-in-wiley-journals-please-switch-off-the-lawyers-and-the-robots/

Open Data in Science in PLoS Biology

Jenny Molloy — Fri, 09 Dec 2011 12:34:39 +0000

We are very pleased to announce the publication of an article detailing the working group’s aims and achievements in PLoS Biology’s Community Pages.

‘The Open Knowledge Foundation: Open Data Means Better Science’ has already had over 1800 article views and offers a fantastic opportunity to engage the biological community in the work we do and raise awareness of the importance of open data in science.

Published in the same edition was a Perspectives piece tackling an issue that the working group has taken a great interest in – the use of non-commercial clauses in licenses for open access articles. In ‘Why Full Open Access Matters’, Professor Michael Carroll, a Creative Commons board member and Director of the Program on Information Justice and Intellectual Property at American University, states “We are living through a moment of fundamental opportunity [for text and data mining, innovative reuse of article material]. Let’s be clear. Only those publishers willing to fully seize this opportunity deserve to call their publications ‘open access.’”

A lively discussion has been taking place on the open-science and okfn-discuss mailing lists and a plan has been compiled to survey the policies of funders and journals as well as to generate a resource pack on the use of non-commercial clauses and the downstream effects of applying such licenses. Please join the mailing list and the conversation if this is an issue you feel strongly about.