A Researcher’s Response to IPO Consultation on Text Mining Copyright Exception

March 21, 2012 in Publications

Personal experience and evidence from Professor Peter Murray-Rust.

I have been involved in developing and deploying text and other forms of data mining in chemistry and related sciences (e.g. biosciences and material sciences) for ten years. I have developed open source tools for chemistry (OSCAR [1], OPSIN [2], ChemicalTagger [3]), which have been developed with funding from EPSRC, JISC, DTI and Unilever PLC. These tools represent the de facto open source standard and are used throughout the world. In November 2011, I gave an invited plenary lecture on their use to LBM 2011 (Languages in Biology and Medicine) in Singapore [4].

These tools are capable of very high throughput and accuracy. Last week we extracted and analysed 500,000 chemical reactions from the US patent office service, at a rate of approximately 100,000 reactions per processor per day. Our machine interpretation of chemical names (OPSIN) is over 99.5% accurate, better than any human. The extractions are complete, factual records of the experiment, to the extent that humans and machines could use them to repeat the work precisely or to identify errors made by the original authors.

It should be noted that many types of media other than text provide valuable scientific information, especially graphs and tables, images of scientific phenomena, and audio / video captures of scientific factual material. Many publishers and rights agencies would assert that graphs and machine-created images were subject to copyright while I would call them “facts”. I therefore often use the term “information mining” rather than “text mining”.

It is difficult to estimate the value of this work precisely, because contractual restrictions imposed by all major publishers currently prevent us from deploying it on the current scientific literature. However, it is not fanciful to suggest that our software could be used in a “Chemical Google” indexing the scientific literature, and that it is therefore potentially worth low billions.

Some indications of value are:

- My research cost £2 million in funding, and because of its widespread applicability, would be conservatively expected to be valued at several times that amount. The UK has a number of highly valued textmining companies such as Autonomy [5], Linguamatics [6], and Digital Science (Macmillan) [7]. Our work is highly valuable to them, as they both use our software (under an Open licence) and recruit our staff when they finish. In this sense, we have already contributed to UK wealth generation.
- The downstream value of high quality, high throughput chemical information extracted from the literature can be measured against conventional abstraction services, such as the Chemical Abstracts Service of the ACS [8] and Reaxys [9] from Elsevier, with a combined annual turnover of perhaps $500-1,000 million. We believe our tools are capable of building the next and better generation of chemical abstraction services, and they would be direct competitors in this high value market. This supports our valuation of chemical textmining in the low billions.
- The value of the tools themselves is difficult to estimate, but Chemical Informatics has for many years been a traditional SME activity in the UK and would have been expected to grow if textmining had been permitted. Companies such as Hampden Data Services, ORAC, Oxford Molecular and Lhasa have values in the range of 10-100 million.
- I come from a UK pharmaceutical industrial background (15 years in Glaxo). I know from personal experience and discussions with other companies that it is not uncommon for post-mortems on failed drugs to show that the reason for failure could have been predicted from the original scientific literature, had it been analysed properly. Such failures can run to $100 million, and the inability to use the literature in an effective modern manner must contribute to serious loss of both effort and opportunity. My colleague Professor Steve Ley has estimated that, because of poor literature analysis tools, 20-25% of the work done in his synthetic chemistry lab is unnecessary duplication or could be predicted to fail. In a 20-year visionary EPSRC Grand Challenge (Dial-a-molecule), Prof Richard Whitby of Southampton is coordinating UK chemists, including industry, to design a system that can predict how to make any given molecule. The top priority is to be able to use the literature in an “artificially intelligent manner” where machines rather than humans can process it, which is impossible without widespread mining rights.
- The science and technology of information mining itself is seriously held back by the current contractual restrictions. The acknowledged approach to building quality software is to agree on an open, immutable, ‘gold standard’ corpus of relevant literature, against which machine learning methods are trained. We have been forbidden by rights holders from distributing such corpora, and as a result our methods are seriously delayed (I estimate by at least three years) and are impoverished in their comprehensiveness and applicability. It is difficult to quantify the lost opportunities, but my expert judgement is that by linking scientific facts, such as those in the chemical literature, to major semantic resources such as Linked Open Data [10] and DBpedia [11], an enormous number of potential opportunities arise, both for better practice and for the generation of new wealth-generating tools.

Note: Most of my current work involves factual information, which I believe is therefore not subject to copyright. However, it is impossible to get clarification on this, and publishers have threatened to sue scientists for publishing factual information. I have always erred on the side of caution, and would greatly value clear guidelines from this process, indicating where I have an absolute right to extract without this continuing fear.

In response to Consultation Question 103

“What are the advantages and disadvantages of allowing copyright exceptions to be overridden by contracts? Can you provide evidence of the costs or benefits of introducing a contract-override clause of the type described above?”

The difficulties I have faced are not even due to copyright problems as I understand it, but to additional contractual and technical barriers that publishers impose on access to their information for the purposes of extracting facts and redistributing them for the good of science and the wider community.

The barriers I have faced over the last five years appear common to all major publishers. They include not only technical constraints (e.g. the denial of access to the literature by publishers’ robot technology) but also difficulties in establishing what the copyright/contractual restrictions are, which I do not wish to break. It is extremely difficult to get clear permissions to carry out any work in this field, and while a court might find that I had not been guilty of violating copyright/contract, I cannot rely on this. Therefore, I have taken the safest course of not deploying my world-leading research.

Among the publishers with which I have had correspondence are Nature Publishing Group, American Chemical Society, Royal Society of Chemistry, Wiley, Elsevier and Springer. None have given me explicit permission to use their content for the unrestricted access of scientific facts by automated means, and many have failed even to acknowledge my request for permission. I have, for example, challenged the assertion made by the Publishing Research Consortium that ‘publishers seem relatively liberal in granting permission’ for content mining. [12]

In conclusion, I stress that any need to request permissions drastically reduces the value of text mining. I have spent at least a year’s worth of my time attempting to get permissions as opposed to actually carrying out my research. At LBM 2011, I asked other participants, and they universally agreed that it was effectively impossible to get useful permissions for text mining. This is backed up by the evidence of Max Haeussler to the US OSTP [13] and his comprehensive analysis of publisher impediments, in which it has taken some publishers over two years to agree any permissions, while many others have failed to respond within 30 days of being asked [14]. I do not believe, therefore, that this problem can be solved by goodwill assertions from the publishers. Part of the Hargreaves-initiated reform should be to assert the rights that everyone has in using the scientific factual literature for human benefit.

In response to Consultation Question 77

“Would an exception for text and data mining that is limited to non commercial research be capable of delivering the intended benefits? Can you provide evidence of the costs and benefits of this measure? Are there any alternative solutions that could support the growth of text and data mining technologies and access to them?”

Non-commercial clauses are completely prejudicial to effective use of text mining, because many of the providers and consumers will be commercial. For example, the UK SMEs could not use a corpus produced under these conditions, nor could they develop added downstream value.

I have had discussions with several publishers who have insisted on imposing non-commercial (NC) restrictions on material. They are clearly aware of its role, and it is difficult to understand their motives in insisting on NC, other than to protect the publishers’ own interests by denying the widespread exploitation of the content. In two recent peer-reviewed papers, it has been convincingly shown that NC adds no benefits, is almost impossible to operate cleanly, and is highly restrictive of downstream use. [15, 16]

Alternative Solutions:
These contractual restrictions have been introduced unilaterally by publishers without effective challenge from the academic and wider community. The publishers have shown that they are not impartial custodians of the scientific literature. I believe this is unacceptable for the future and that a different process for regulation and enforcement is required. The questions I would wish to see addressed are:

Which parts of the scientific literature are so important that they should effectively be available to the public? One would consider, at least:
- facts (in their widest sense, i.e. including graphs, images, audio/visual)
- additional material such as design of experiments, caveats from the authors, discussions.
- metadata such as citations, annotations, bibliography
Who should decide this?
It must not be the publishers. Unfortunately many scientific societies also have a large publishing arm (e.g. the Royal Society of Chemistry) and they cannot be seen as impartial.
I would suggest either the British Library, or a subgroup of the RCUK and other funding bodies.
How should it be policed and conflicts resolved?
Where possible, the regulator I propose should obtain agreement from all parties before potential violation. If that is not possible, then the onus should be on the publishers to challenge the miners, through the regulator. Ultimately there is always final recourse to the law.

—————————————–

[1] http://www.jcheminf.com/content/3/1/41

[2] http://pubs.acs.org/articlesonrequest/AOR-PcYgSy87ettZWfqyvHmN

[3] http://www.jcheminf.com/content/3/1/17

[4] http://lbm2011.biopathway.org/

[5] http://www.autonomy.com/

[6] http://www.linguamatics.com/

[7] http://www.digital-science.com/

[8] http://www.cas.org/

[9] https://www.reaxys.com/info/

[10] http://linkeddata.org/

[11] http://dbpedia.org/About

[12] Smit, Eefke and van der Graaf, Maurits, ‘Journal Article Mining’, Publishing Research Consortium, Amsterdam, May 2011. http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf.

[13] http://www.whitehouse.gov/sites/default/files/microsites/ostp/scholarly-pubs-%28%23226%29%20hauessler.pdf

[14] See also Max Haeussler, CBSE, UC Santa Cruz, 2012, tracking data titled ‘Current coverage of Pubmed, Requests for permission sent to publishers’, at http://text.soe.ucsc.edu/progress.html

[15] Hagedorn, Mietchen, Morris, Agosti, Penev, Berendsohn & Hobern, ‘Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information’, ZooKeys 150 (2011), Special issue: ‘e-Infrastructures for data publishing in biodiversity science’: 127-149.

[16] Carroll MW (2011) Why Full Open Access Matters. PLoS Biol 9(11): e1001210. doi:10.1371/journal.pbio.1001210
