The Intellectual Property Office of the UK government have been running a public consultation on changes to copyright law recommended in the Hargreaves Review of Intellectual Property and Growth in 2011.
The report stated that:
Researchers want to use every technological tool available, and they want to develop new ones. However, the law can block valuable new technologies, like text and data mining, simply because those technologies were not imagined when the law was formed.
Text mining is one current example of a new technology which copyright should not inhibit, but does. It appears that the current non-commercial research “Fair Dealing” exception in UK law will not cover use of these tools under the current interpretation of “Fair Dealing”. In any event text mining of databases is often excluded by the contract for accessing the database. The Government should introduce a UK exception in the interim under the non-commercial research heading to allow use of analytics for non-commercial use…as well as promoting at EU level an exception to support text mining and data analytics for commercial use.
Along with Diane Cabell on behalf of iCommons (who has also blogged the response), the working group collaboratively drafted a response to the consultation which you can read below. We have several members of the working group with personal research interests in text and data mining. Peter Murray-Rust submitted his own response which you can read in the next blog post.
Response of the Open Knowledge Foundation’s Open Data in Science Working Group to BIS0312: Exception for copying of works for use by text and data analytics
The Open Data in Science Working Group at the Open Knowledge Foundation strongly supports the Government response to the Hargreaves Review of Intellectual Property and Growth and encourages it to follow through on its many excellent proposals. As scientists and scholars, we are both creators and users of intellectual property. Our creations, however, only have their full value when they are shared with other researchers. Our data becomes exponentially more useful when combined with the data of others. The intention of copyright law is to support public dissemination and enable the appropriate and effective recombination of work. Unfortunately, in the area of science, current copyright actually delays or blocks the effective re-use of research results in the current digital environment. We encourage adaptation of the law to benefit the progress of science and its attendant economic advantages. In our professional experience we have found that the ability to freely use research data benefits science as well as creators of scientific work products.
In particular, the working group strongly urges implementation of Recommendation 5 to allow specific exceptions in copyright law for data and text mining.
Information mining is the way that modern technology locates digital information. The sheer number of publications and data sets means that thorough and accurate searching can no longer be performed by the hand and eye of an individual researcher. Mining is not only a deductive tool to analyze research data, it is how search engines operate to allow discovery of content. To prevent mining is therefore to force UK scientists into blind alleys and silos where only limited knowledge is accessible. Science does not progress if it cannot incorporate the most recent findings and move forward from there. Today, digitized scientific information comes from hundreds of thousands of different sources in our globally connected scientific community, with many current data sets measured in terabytes. In this environment it is no longer possible to simply read a scholarly summary in order to make scientifically significant use of such information.
Hence, one must be able to copy information, recombine it with other data and otherwise “re-use” it so as to produce truly helpful results. It would require extraordinarily time-consuming efforts to secure permission to mine each and every relevant article from hundreds, even thousands, of sources. A recent report by the Joint Information Systems Committee (JISC) on the issues raised in this consultation demonstrates the value of time lost under the current system. By their estimates, a single researcher obtaining permission to mine PubMedCentral articles mentioning malaria could lose over 60% of their working year contacting the 1024 journals necessary to obtain access to the complete corpus of literature. A blanket exception is the only way to ensure that UK scientists can truly stay abreast of scientific progress.
Therefore, we agree with the Government’s opinion that it is inappropriate for “certain activities of public benefit such as medical research obtained through text mining to be in effect subject to veto by the owners of copyrights in the reports of such research, where access to the reports was obtained lawfully.” Restricting such transformative use is not in the UK’s overall scientific, much less economic interests. It is also not in the interest of Europe as a whole. The Ghent Declaration states “European researchers, while remaining in touch with the whole world, could also benefit from tools that would allow for better collaboration within Europe. In particular, such goals can be assisted by computers using techniques ranging from data mining to the semantic web. However, machine-based inference techniques work well only if documents are freely accessible…”.
We applaud the recent draft policy statement from the UK Research Councils asserting their support for an open access mandate covering all research they fund, particularly their emphasis on a CC-BY or similar license which allows “unrestricted use of manual and automated text and data mining tools”. While this will lead to positive developments going forward, access to the pre-existing literature is equally important.
Response to Specific Consultation Questions
77. Would an exception for text and data mining that is limited to non-commercial research be capable of delivering the intended benefits? Can you provide evidence of the costs and benefits of this measure? Are there any alternative solutions that could support the growth of text and data mining technologies and access to them?
A non-commercial limitation is not helpful and would reduce the benefits delivered to the economy. Researchers in both academia and industry are often reliant on the same information, e.g. libraries of chemical structures. Therefore, to impede commercial access to mined text and data would result in duplicated time, effort and expense to obtain the information. Reducing dissemination to SMEs and other commercial organisations who could use it to generate useful and value-added products would appear counterproductive. More details on the economic value of chemical information can be found in the submission to this consultation from Dr Peter Murray-Rust.
In addition to the potential loss of both scientific opportunity and economic returns, implementing a non-commercial clause is non-trivial. There is no clear local or global definition of “non-commercial use”, particularly when one recognizes that scientific data is merged across many jurisdictional boundaries. If the term is to be employed, it needs to be clearly defined, but this is difficult given that the full range of non-commercial opportunities cannot be foreseen. The difficulties of defining non-commercial were laid out in a detailed report by Creative Commons and are a topic of ongoing discussion as they review their licenses, which are among the most commonly used for open access scientific literature.
Even if clearly and strictly defined, non-commercial clauses could lead to data sharing problems within collaborations between academia and industry. Certainly, any prohibition against commercial use would hinder the widest possible sharing of data. It would probably also increase the cost to consumers of products based on such data and reduce the potential economic return by discouraging commercial organisations from creating value added products based on the results of text and data mining.
Disallowing downstream uses also complicates the publication and licensing of results from information mining. Having to lock or watermark the initial results of non-commercial mining in order to prevent their later commercial use would quickly prove cumbersome and problematic. Would each piece of mined information need to be labeled as restricted? Or only the fully compiled set of results from a mining operation? Would researchers have to track all subsequent uses and re-uses of mined results simply to protect themselves from the threat of infringement litigation? By reducing interoperability and complicating the licensing situation, the labour and associated costs of information mining would significantly increase.
For these reasons, this group does not favor restricting the exception to non-commercial uses, but rather supports an exception for all mining purposes. At a minimum, any subscriber to closed access journals, regardless of their commercial status, should be able to mine the information for which they have paid subscription fees.
Evidence of costs and benefits
The recent JISC study, The Value and Benefits of Text Mining, includes many examples of the economic benefits and costs of text mining. Rather than repeat their excellent work, we wish to add the observations of one of our working group, Dr Peter Murray-Rust of the University of Cambridge, who has been working in text- and data-mining for 30 years. During that time the relevant technology has developed dramatically to the point where he can, for example, extract 10 million chemical reactions from the published text literature. However, after two years of negotiations with a major publisher, the Murray-Rust research group was told that it could mine the publisher’s corpus only if all of the results belonged to the publisher and were not published.
Similarly, Murray-Rust developed methods for crystallographic data-mining which have become accepted in the crystallographic community. Some such data is published openly alongside conventional articles so a dataset of 250,000 compounds has already been processed. The results of this analysis can be highly valuable for drug design but extension of the dataset is hampered by restrictive conditions of reuse. For example, other data are deposited with a non-profit organisation that makes individual data sets freely available to the scientific community, but under terms that may limit re-use of the mined data. In terms of text mining, factual data embedded in text is a grey area. For example, the factual statement “the melting point of X is 30 degrees” is usually hidden behind a paywall even if related data is available freely in supplementary files.
Further information is available in Peter Murray-Rust’s personal submission to this consultation.
Although not an alternative, one step that might encourage data owners to permit information mining would be to provide a type of blanket immunity that would protect them from claims based on a user’s reliance on data that later proves to be inaccurate or fraudulent. A major cost and disincentive to data providers who allow public access to data may thereby be reduced whether those providers are individual researchers, their employers or journals.
However, to the extent that journals wish to protect copyright in information mining for the purpose of protecting their own opportunities to extract revenue from such use, the exemption will not be enough to offset the burden on researchers to secure permission from individual publishers. It would therefore function as an adjunct to copyright exceptions to address one of publishers’ concerns about a requirement to allow mining and reuse.
103. What are the advantages and disadvantages of allowing copyright exceptions to be overridden by contracts? Can you provide evidence of the costs or benefits of introducing a contract-override clause of the type described above?
To permit contractual override of a mining exception is to risk voiding the exception. Text and data mining in academia are usually prohibited by contract at present, on top of any copyright. In some cases automatic systems will shut down access to institutions if they believe that the contract has been violated (including on occasions when it has not). A survey of licence agreements with institutions across 11 major publishers revealed that 7 out of 11 explicitly ban text and data mining or automated indexing by web crawlers. A more comprehensive survey of licenses and contractual agreements performed by Max Haeussler at UC Santa Cruz reveals a similar pattern across a wide array of publishers. Some of these organisations provide proprietary systems, often charged for, which enable some limited text-mining functionality.
Unfortunately, the majority of publishers when contacted directly with text mining requests have been “extremely unhelpful” if not unresponsive, which would indicate that there is little reason to expect mining policies will change voluntarily, as they would effectively have to under a contractual override system where publishers could opt out of allowing mining via their licensing agreements.
 UK Government Response to Hargreaves Report, August 2011 at http://www.bis.gov.uk/assets/biscore/innovation/docs/g/11-1199-government-response-to-hargreaves-review
 Hargreaves Report, May 2011 at http://www.ipo.gov.uk/ipreview.htm
 The Value and Benefits of Text Mining, JISC, Report Doc #811, March 2012 at http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx, citing P.J.Herron, “Text Mining Adoption for Pharmacogenomics-based Drug Discovery in a Large Pharmaceutical Company: a Case Study,” Library, 2006, claiming that text mining tools evaluated 50,000 patents in 18 months, a task that would have taken 50 person-years to complete manually.
 See MEDLINE® Citation Counts by Year of Publication, at http://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html and National Science Foundation, Science and Engineering Indicators: 2010, Chapter 5 at http://www.nsf.gov/statistics/seind10/c5/c5h.htm asserting that the annual growth in the volume of scientific journal articles published is on the order of 2.5%.
 Panzer-Steindel, Bernd, Sizing and Costing of the CERN T0 center, CERN-LCG-PEB-2004-21, 09 June 2004 at http://lcg.web.cern.ch/lcg/planning/phase2_resources/SizingandcostingoftheCERNT0center.pdf.
 Van Noorden, Richard; ‘Trouble at the text mine’, Nature, V. 483, Issue 7388, 08 March 2012. See also JISC, op.cit., at page 27 “Consequently, in this example, a researcher would need to contact 1,024 journals at a transaction cost (in terms of time spent) of £18,630; 62.1% of a working year.”
 Ghent Declaration, February 2011, OpenAIRE, at http://www.openaire.eu/en/component/content/article/223-seizing-the-opportunity-for-open-access-to-european-research-ghent-declaration-published
 RCUK Proposed Policy on Access to Research Outputs, RCUK, March 2012 at http://www.openscholarship.org/upload/docs/application/pdf/2012-03/rcuk_proposed_policy_on_access_to_research_outputs.pdf
 Creative Commons, Defining “Noncommercial”: A Study of How the Online Population Understands “Noncommercial Use,” September 2009 at http://wiki.creativecommons.org/Defining_Noncommercial
 Murray-Rust, Peter; ‘Information mining and Hargreaves: I set out the absolute rights for readers. Non-negotiable’ at http://blogs.ch.cam.ac.uk/pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable/
 JISC, op.cit., Chapters 4 and 5.
 ‘Journal Licensing Agreements’, April 2011, Spreadsheet of Results at https://docs.google.com/spreadsheet/ccc?key=0AtV3tIqIu0UZdGVMNTAtejhBUlFySGk4QWdrVHJNdkE&authkey=CKC-_LQP&hl=en_US#gid=0.
 Max Haeussler, CBSE, UC Santa Cruz, 2012, tracking data titled Current coverage of Pubmed, Requests for permission sent to publishers, at http://text.soe.ucsc.edu/progress.html
 Murray-Rust, Peter; cited in ‘Journal Article Mining’, Publishing Research Consortium, Amsterdam, May 2011 at http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf. See also, Murray-Rust, Peter; ‘Wiley: Cambridge scientist require to text-mine content in Wiley journals: please switch off the lawyers and the robots’ at http://blogs.ch.cam.ac.uk/pmr/2012/03/07/wiley-cambridge-scientist-require-to-text-mine-content-in-wiley-journals-please-switch-off-the-lawyers-and-the-robots/