Ross Mounce – OKF Open Science Working Group

What is content mining?

Ross Mounce — Thu, 27 Mar 2014 17:22:39 +0000

It’s simple really – you can break it down into its two constituent parts:

Content

In this context, content can be text, numerical data, static images such as photographs, videos, audio, metadata or any digital information, and/or a combination of them all. It is a deliberately vague term, encompassing all types of information. Do not confuse content with the medium by which it is delivered; content is independent of medium in the digital sphere.

Mining

In this context, mining refers to the large-scale extraction of information from your target content. If you extract information from just one or two items of content – that’s ‘data extraction’. But if you extract information from thousands of separate items of content – that’s ‘mining’.

It is important to emphasise that the phrase ‘text & data mining’ refers only to the mining of a subset of the types of content one may wish to mine: text & data. Content mining is thus a more useful generic phrase that encompasses all the types of content one may wish to mine.

For my postdoc I’m using content mining to extract phylogenetic information and associated metadata from the academic literature. To be specific, the content I’m mining is text AND images.

The text of an academic paper contains much of the metadata about a phylogenetic analysis that I want to extract, whilst the actual result of a phylogenetic analysis is, unfortunately, mostly published only as an image (completely non-textual) – thus I need to leverage mining techniques for multiple content types.

Most mining projects tend to just mine one type of content, with text mining being perhaps the most common. An example question one might attempt to answer using text mining is: How many academic journal articles acknowledge funding support from the Wellcome Trust?
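To make that example question concrete, here is a minimal sketch in Python of how such a count might be made. Everything in it is an assumption for illustration: the `count_acknowledging` function, the `FUNDER` pattern and the three-item toy corpus are all hypothetical, and a real project would run over thousands of full-text articles and would probably restrict the search to the acknowledgements or funding section rather than matching the name anywhere in the text.

```python
import re

# Hypothetical pattern: the funder's name, tolerant of case and extra whitespace.
FUNDER = re.compile(r"\bWellcome\s+Trust\b", re.IGNORECASE)


def count_acknowledging(articles):
    """Return how many article texts mention the funder at least once."""
    return sum(1 for text in articles if FUNDER.search(text))


# Toy corpus standing in for a large collection of article full texts.
sample_articles = [
    "Acknowledgements: This work was supported by the Wellcome Trust.",
    "We thank our colleagues for helpful comments on the manuscript.",
    "Funding: wellcome  trust and several other funders.",
]

print(count_acknowledging(sample_articles))
```

Each article is counted once however many times the name appears, because the question is about the number of articles, not the number of mentions.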

Peter Murray-Rust lists many more possible uses for content mining on his blog: http://blogs.ch.cam.ac.uk/pmr/2014/02/27/101-uses-for-content-mining/

In general, content mining is still an emerging and under-utilized technique in the research sphere – there is much work still to be done and billions of questions still to be answered. It isn’t the one-stop solution to everything, but for appropriate questions it can be the most powerful & comprehensive approach available. Get mining now!

Some suggested tutorials & resources you might want to start with:

* Mining with Weka tools – http://sentimentmining.net/weka/
* Mining with R tools – http://www.r-bloggers.com/text-mining-the-complete-works-of-william-shakespeare/
* Open resources – you can download ALL of the English language Wikipedia (or any other language) here: http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

* See also Project Gutenberg for literary resources to mine: http://www.gutenberg.org/

Banishing Impact Factor: OKF signs DORA

Ross Mounce — Sun, 23 Jun 2013 16:59:09 +0000

The Open Knowledge Foundation has joined nearly 300 organizations in signing The San Francisco Declaration on Research Assessment (DORA).

The laudable aim of DORA is to get people to stop using the Journal Impact Factor for research evaluation exercises, as doing so is both harmful to science and statistically illiterate.

The General Recommendation says:

Do not use journal-based metrics, such as Journal Impact Factors, as a surrogate measure of the quality of individual research articles, to assess an individual scientist’s contributions, or in hiring, promotion, or funding decisions.

There are also helpful, specific recommendations for funding agencies, institutions, publishers, organizations that supply metrics, and researchers.

Over 7,000 individuals have signed so far, and among the many other organizations to have signed there are a multitude of notable ones including: American Association for the Advancement of Science (AAAS), Wellcome Trust, Proceedings of The National Academy Of Sciences (PNAS), EMBO, Cold Spring Harbor Laboratory Press, Genetics Society of America, Gordon and Betty Moore Foundation, Higher Education Funding Council for England (HEFCE), Howard Hughes Medical Institute, Society of Biology, Ubiquity Press, Austrian Science Fund (FWF), International Association for Plant Taxonomy, Gesellschaft für Biologische Systematik, Museu Nacional (Universidade Federal do Rio de Janeiro), Belgian Royal Society of Zoology and Belgian Journal of Zoology, Universidade de Brasilia, UCSD, Sao Paulo University…

We expect this campaign to be a success, and hope that the commercial journals and societies that have actively chosen not to sign DORA will nevertheless consider the merits of DORA.

Diverse stakeholders withdraw from Licences for Europe dialogue on text and data mining

Ross Mounce — Tue, 28 May 2013 14:02:07 +0000

The Open Knowledge Foundation, along with several other representatives from the research sector, has withdrawn from the Licences for Europe dialogue on text and data mining due to concerns about the scope, composition and transparency of the process.

A letter of withdrawal has been sent to the Commissioners involved in Licences for Europe, explaining why these stakeholders can no longer participate in the dialogue and expressing the wish to instigate a broader dialogue around creating the conditions to realise the full potential of text and data mining for innovation in Europe.

The following organisations have signed the letter:

Licences for Europe was announced in the Communication on Content in the Digital Single Market (18 December 2012) and is a joint initiative led by Commissioners Michel Barnier (Internal Market and Services), Neelie Kroes (Digital Agenda) and Androulla Vassiliou (Education, Culture, Multilingualism and Youth) to “deliver rapid progress in bringing content online through practical industry-led solutions”.

Licences for Europe aims to engage stakeholders in four areas:

Cross-border access and the portability of services;
User-generated content and licensing;
Audiovisual sector and cultural heritage;
Text and Data Mining (TDM).

While we are deeply committed to working with the Commission on the removal of legal and other environmental barriers to text and data mining, we believe that any meaningful engagement on the legal framework within which data-driven innovation exists must address the opportunities provided by limitations and exceptions. The current approach of the Commission instead places further licensing as the central pillar of the discussion.

The withdrawal follows much communication with the Commission on the issue, including a letter of concern sent on 26 February 2013 and signed by over 60 organisations. The Commission’s response to this letter is available here.

To find out more about the background of Licences for Europe and the issues surrounding text and data mining please take a look at our background document.

Our Statement on Public Access to Federally-Supported Research Data

Ross Mounce — Thu, 16 May 2013 16:13:34 +0000

Open Access to research publications often takes the limelight in national debates about access to research – but at the Open Knowledge Foundation we know there are also other pressing issues; like the need for Open Data. So we submitted a short written statement to the ongoing US Public Comment Meeting concerning Public Access to Federally Supported R&D Data. Our statement is below:

Each year, the Federal Government spends over $100 billion on research. This investment is used, in part, to gather new data. But all too often the new data gathered isn’t made publicly available and thus can’t generate maximum return on investment through later re-use by other researchers, policy-makers, clinicians and everyday taxpaying citizens.

A shining example of the value and legacy of research data is the Human Genome Project.

This project and its associated public research data are estimated to have generated $796 billion in economic impact, created 310,000 jobs, and launched a scientific revolution. All from an investment of just $3.8 billion.

With the budget sequestration of 2013 and onwards it’s vitally important to get maximum value for money on research spending. Ensuring public access to most Federally funded research data will help researchers do more with less. If researchers have greater access to data that’s already been gathered they can focus more acutely on accumulating just the new data they need, and nothing more. It’s not uncommon for Federally funded researchers to perform duplicate research and gather duplicate data. The competitive and often secretive nature of research means that duplicative research and data hoarding are probably rife, but hard to evidence. Enforcing a public data policy on researchers would thus help them to make the overall system more efficient. This tallies with the conclusions of the JISC report (2011) on data centres:

“The most widely-agreed benefit of data centres is research efficiency. Data centres make research quicker, easier and cheaper, and ensure that work is not repeated unnecessarily.”

Another, more subtle benefit of making Federally funded data more public is that it would increase the overall importance and profile of US research in the world. Recent research by Piwowar & Vision (2013) robustly demonstrates that research that releases public data gets cited more than research that does not publicly release its underlying data.

The as yet untapped value of research data:
I believe most research data has immense untapped re-use value. We’re only just beginning to realise the value of data mining techniques on ‘Big Data’ and small data alike. In the 21st century, now more than ever, we have immensely powerful tools and techniques to make sense of the data deluge. The potential scientific and economic benefits of such text and data mining analyses are consistently rated very highly. The McKinsey Global Institute report on ‘Big Data’ (2011) estimated a $300 billion value on data mining US health care data alone.

I would finish by imploring you to read and implement the recommendations of the ‘Science as an Open Enterprise’ report from the Royal Society (2012):

* Scientists need to be more open among themselves and with the public and media
* Greater recognition needs to be given to the value of data gathering, analysis and communication
* Common standards for sharing information are required to make it widely usable
* Publishing data in a reusable form to support findings must be mandatory
* More experts in managing and supporting the use of digital data are required
* New software tools need to be developed to analyse the growing amount of data being gathered

Ross Mounce, Community Coordinator for Open Science, Open Knowledge Foundation

30 other written statements were also contributed to this session, including one from Creative Commons, and one from Victoria Stodden. These can all be found in the official 64 page PDF here

The Open Science Training Initiative

Ross Mounce — Mon, 13 May 2013 21:59:23 +0000

Posted on behalf of Sophie Kershaw, one of our Panton Fellows 2012/13, recapping her work training the next generation in the art of open science. Over to you Sophie:

Some of you may have been following the progress of my Panton Fellowship work over the past year, the main focus of which was establishing a graduate training scheme in open science, the Open Science Training Initiative (OSTI). There have been some exciting developments with the course in recent weeks and we’re really close to releasing the first full set of course materials for others to use in their own teaching to train young academics in open science and data-centric/digital research methodologies, so I thought I’d update you all on progress. If you’re interested in hearing about how the course works in practice, then scroll down for a download link to the post-pilot report!

What is OSTI?
The OSTI scheme is a teaching pattern and series of mini-lectures, designed to transform existing subject-specific graduate courses in the sciences to foster open working practices and scientific reproducibility. Its main features include:

a dynamic teaching model of Rotation Based Learning;
hands-on application of licensing, data management and data analysis techniques, building students’ knowledge of, and confidence in using, these approaches;
daily lectures and exercises in key subjects including “Content, Code & Data Licensing”, “The Changing Face of Publication” and “Data Management Planning” accompany the main component of research time in the timetable, providing students with knowledge they can then consolidate through application to their research.

Open Science Training in Practice – download the report now!
After many months of hard work and analysis, the post-pilot report on OSTI was released last Saturday and is now available for download from the OSTI website, via http://www.opensciencetraining.com/content.php. The report draws on a broad range of perspectives from the student cohort, the auxiliary demonstrators and the course leader. A curated data set to accompany the report will be appearing on the OSTI site very soon and lecture movies from the pilot initiative have been appearing on the site over the past week. Keep checking back over the coming weeks as more content and downloads become available.

Where can I get course materials?
The official set of course materials will be appearing on our GitHub repository over the coming weeks – these are currently being tweaked based on the feedback we received and I can’t wait for others to fork the project and create other versions of the course as well.

Please feel free to get in touch with me if you’d like to hear more about OSTI, or have any comments, questions or suggestions. If we’re going to encourage uptake of Open working practices in the sciences, we need to start training our researchers in these approaches now. If you think there’s an opening at your institution for this kind of approach, then I would love to hear from you!

You can tweet Sophie at @StilettoFiend or email her at: sophie dot kershaw at okfn.org

Weekly Citizen Science Hangouts

Ross Mounce — Wed, 03 Apr 2013 21:08:49 +0000

Capitalizing on the success of our recent CrowdCrafting hack day, from this Thursday and onwards every week we’ll be having a public Google+ Hangout to discuss citizen science and related topics. Details of the first meeting are below:

Thursday 4th April, 5pm (BST) – Weekly Citizen Science Hangout on Google+ here

In the first meeting we shall talk with special guest Michal Kubacki about his Misomorf application that may eventually be developed into a citizen app to help scientists with the graph isomorphism problem. This problem was proposed at the recent hack day by mathematician and quantum computing expert Simone Severini of UCL, who will also join the Hangout. Your participation at these weekly meetings is both welcome and encouraged. If you can’t make the first one, then perhaps the next?

Further information from the recent Science Hack Day

Among the many hack day projects I didn’t get to write about in the last blog post was the Yellowhammers project. This citizen science project uses sound recordings of yellowhammer (bird) dialects and is a joint activity of the Department of Ecology, Charles University in Prague, and the Czech Society for Ornithology.

Volunteers have helped both to sample bird song and to classify these hundreds of hours of audio recordings into different dialects. Whilst the initial project sampled birds just in the Czech Republic, the team are now expanding to try and capture bird song recordings from the United Kingdom and New Zealand, where this bird species also lives.

Here at the Open Knowledge Foundation, we hope that the data collection and analysis could be aided by the use of both our Crowdcrafting.org platform and the EpiCollect mobile software. Work is presently under way to make this a reality.

So whether you’re a software dev, amateur scientist, tinkerer or twitcher – perhaps we might see you tomorrow at the Citizen Science Hangout?

Crowdsourcing Success: Science Hack Day London

Ross Mounce — Sun, 17 Mar 2013 17:58:12 +0000

Yesterday was our Science Hack Day London event.

I think it’s safe to say it was a roaring success and we’ll likely be having more this year.

We had Daniel Lombraña González jet in from Spain to help us with PyBossa / Crowdcrafting projects and got a working demo of a sound classification app using SoundCloud by the end of the day.

We also saw an excellent demo of the potential of Konekta geolocation and polygon drawing.

Lots of brainstorming with the new WAXscience project from France (Aude Bernheim @AudeBer & Flora Vincent @vincentflora). WAXscience aims to do something positive about gender imbalance in science. We talked a lot about potential methods of ‘sampling’ gender balance at scientific meetings & conferences – twitter mining, web mining, and even citizen science self-reporting with mobile apps.

There was even wonderful free food & drink provided especially for this hack!

For more details about the event see the Storify of tweets in full here.

We will let you know as and when we plan the next hack day.

Update from the ongoing EU Text & Data Mining dialogue

Ross Mounce — Tue, 12 Mar 2013 11:23:00 +0000

Last Friday was meeting number two of the Licences for Europe – a Stakeholder Dialogue, ‘Text & Data Mining’ Working Group 4.

The first meeting did not go well and this was widely reported. The potential of text & data mining (TDM) technology is enormous. The McKinsey Global Institute reckon that in Europe, government expenditure could be reduced by €100 billion a year if the legal barriers to TDM were relaxed.

Many content industry representatives seem keen to take a licence-based, negotiations and permissions-based approach to allowing (controlling?) TDM. We believe this is unnecessary – The Right To Read Is The Right To Mine – and that following a ‘yet more licensing’ pathway will ultimately stifle innovation. Instead we want the EU to follow the successful precedent set by the US, Japan, Israel, Taiwan and South Korea, who enjoy a legal limitation and exception for such activities; the UK is shortly to be added to this list too (come October this year when the Hargreaves Report recommendations become law).

Some who were at the first working group meeting regrettably could not afford to attend this meeting. It is expensive travelling to Brussels every month, and this structured dialogue is set to run for many months & meetings. But I was there to represent the point-of-view of academic research, and in doing so was only one of very few people in the room who actually had some experience with real TDM techniques. The presentation I gave is embedded below:

Content Mining from Ross Mounce

Slide 15 of my presentation elicited a strange response from one participant. I was (falsely) accused after my talk of mis-representing the Publishing Research Consortium’s own report on text & data mining.

The exact wording in the report on page 7 of the PDF is:
“When permission is requested, 35% of publisher respondents generally allows mining in all or the majority of cases” Smit & van der Graaf, 2011. Whilst on slide 15 of my presentation I wrote: “When permission is requested [by researchers], 35% of publisher respondents allow mining in the majority or all of cases” Do you see any significant difference in meaning? I certainly don’t.

Needless to say I kept my cool in rebutting the accusations and we clarified the matter amicably. The lamentably low figure of 35% is truthful and, sadly, speaks for itself. Very few subscription-access journal publishers routinely allow TDM research on their copyrighted content and this is something that has to change in the future. Furthermore the permissions-based licensing approaches presented later in the meeting don’t accommodate the flexibility needed in research and may exclude unaffiliated scholars, retired scholars and even PhD students(!) – I have heard that only “employees” of higher education institutions would be allowed to perform TDM research on certain content. This is not a scenario we want for Europe.

At the next meeting we are hoping that the Commissioners will call for Prof. Ian Hargreaves himself or one of his team who helped him research and write his influential report on Intellectual Property and Growth to give a talk at the meeting and help us all to make informed decisions about the best ways in which to help & enable TDM techniques. The next meeting is in late April. I will report any further progress in this matter then.

What license is this ‘open access’ journal using?

Ross Mounce — Tue, 12 Mar 2013 10:05:08 +0000

ANNOUNCEMENT:

We have released a demo version of a new application on OKF’s new citizen science / microtasking platform Crowdcrafting.org.

It’s called “Is It an Open Access Journal?” and looks something like this:

The new app on Crowdcrafting.org

It uses PyBossa and Disqus for comments. The code for it is available on github here too (Open Source!)

The aim of this app is to help crowdsource data on what re-use license each of the ‘open access’ journals use, and who holds the copyright.

Background

Open Access, as defined by the Budapest Open Access Initiative, permits any and all users to

…read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

But sadly, not all journals that call themselves ‘open access’ actually make their articles available under terms compliant with this definition. Many journals don’t publish under any liberal re-use license at all. This survey will help identify those journals that need to do more to ensure that they actually publish their works under an open access compliant license.

Mini-Tutorial

screengrab of the data entry bits. It’s really as simple as that…

Users are given the name of a journal and its publisher.
In step one, they google search the journal to go find out the answers to the two questions asked of them.

Once the user has found the correct website for the journal, hopefully this can be investigated to find the answers, so that (step two) the correct license that this journal publishes under can be selected from the drop-down box.

Step three. The user must select who owns the copyright of each of the articles the journal publishes. Is it the author(s), the journal, the society, or unknown?

Step four. Once the user is satisfied with their answers they can click this button to save their data and move onto the next journal.

We hope that this app will help to assess, and keep current the metadata that we have on Open Access journals.
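As a rough illustration of the data this produces, the sketch below models one volunteer’s answer for one journal and combines several answers by simple majority. This is not the app’s actual code or the PyBossa API: the `Answer` record, its field names, and the `majority` and `summarise` helpers are all hypothetical, and majority voting is just one assumed way of reconciling answers from different volunteers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    """One volunteer's response for one journal (hypothetical record layout)."""
    journal: str
    license: str            # value picked from the drop-down, e.g. "CC BY"
    copyright_holder: str   # "author(s)", "journal", "society" or "unknown"


def majority(values):
    """Return the most common value, or None when there is a tie or no values."""
    top = Counter(values).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None
    return top[0][0]


def summarise(answers):
    """Map each journal to its (license, copyright holder) majority answers."""
    by_journal = {}
    for answer in answers:
        by_journal.setdefault(answer.journal, []).append(answer)
    return {
        journal: (
            majority(a.license for a in group),
            majority(a.copyright_holder for a in group),
        )
        for journal, group in by_journal.items()
    }


# Made-up responses for a made-up journal, purely to show the shape of the data.
example_answers = [
    Answer("Journal A", "CC BY", "author(s)"),
    Answer("Journal A", "CC BY", "journal"),
    Answer("Journal A", "CC BY-NC", "author(s)"),
]
print(summarise(example_answers))
```

Returning `None` on a tie leaves disputed journals flagged for a further look rather than silently picking one of the competing answers.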

Content Mining in Europe: Further Licensing is Not The Only Way

Ross Mounce — Thu, 28 Feb 2013 12:28:14 +0000

A significant number of groups who support knowledge policies for the public good, including ourselves, have signed and published a letter of concern arising from one of the working groups of the Licences for Europe – A Stakeholder Dialogue meetings in Brussels.

This particular working group was Working Group 4, which was set to discuss ways and means of enabling Text and Data Mining (TDM) for research. I was present both as a user of mining techniques in my academic research and as an official representative of the Open Knowledge Foundation, as a participant in the discussions.

The letter expresses concerns that in this TDM meeting we were presented “not with a stakeholder dialogue, but a process with an already predetermined outcome – namely that additional licensing is the only solution to the problems being faced by those wishing to undertake TDM”.

We believe that this dialogue should fairly include discussion of copyright limitations and exceptions for such TDM activity. Neelie Kroes (pictured above), the Vice-President of the European Commission responsible for the Digital Agenda, made a speech shortly before the working group meeting which indicated that this would be an option on the table for discussion:

But keep your minds open: maybe in some cases licensing won’t be the solution

It was also in the notes published in advance of the working group meeting that discussion would explore:

the potential and possible limits of standard licensing models

(emphasis mine)

Yet when we started discussions, all our attempts to discuss copyright exemptions for TDM, as successfully practised in the US, Japan, Israel, Taiwan and South Korea, were quickly shut down by the dialogue moderators. It was made crystal clear to us that any further attempts to discuss this as a solution to the problems of TDM access would not be entertained. Many of us left the meeting feeling extremely frustrated that we were prevented from discussing what we thought was a reasonable and optimal solution practised elsewhere, and were only allowed to discuss sub-optimal, cumbersome options involving re-licensing of content or collective licensing.

Thus the letter of concern finishes with 3 simple requests:

All evidence, opinions and solutions to facilitate the widest adoption of TDM are given equal weighting, and no solution is ruled to be out of scope from the outset;
All the proceedings and discussions are documented and are made publicly available;
DG Research and Innovation becomes an equal partner in Working Group 4, alongside DGs Connect, Education and Culture, and MARKT – reflecting the importance of the needs of research and the strong overlap with Horizon 2020.

The more than 50 participants & signatories of the letter include a Nobel Prize winner (Sir John Sulston), and top representatives of most European research funders, libraries and even smart tech companies with an interest in this area, like Mendeley. We sincerely hope the European Commission takes action on this matter.