
What is content mining?

Ross Mounce - March 27, 2014 in Research

It’s simple really – you can break it down into its two constituent parts:

  1. Content

In this context, content can be text, numerical data, static images such as photographs, videos, audio, metadata or any other digital information, and/or a combination of them all. It is a deliberately vague term, encompassing all types of information. Do not confuse content with the medium by which it is delivered; in the digital sphere, content is independent of its medium.

  2. Mining

In this context, mining refers to large-scale extraction of information from your target content. If you extract information from just one or two items of content – that’s ‘data extraction’. But if you extract information from thousands of separate items of content – that’s ‘mining’.

Content mining can involve multiple types of content!

It is important to emphasise that the phrase ‘text & data mining’ refers to mining only a subset of the types of content one may wish to mine: text & data. Content mining is thus a more useful generic phrase, encompassing all the types of content one may wish to mine.

For my postdoc I’m using content mining to extract phylogenetic information and associated metadata from the academic literature. To be specific, the content I’m mining is text AND images.

The text of an academic paper contains much of the metadata about a phylogenetic analysis I want to extract, whilst the actual result of a phylogenetic analysis is unfortunately mostly only published as an image in papers (completely non-textual) – thus I need to leverage mining techniques for multiple content types.

Most mining projects tend to just mine one type of content, with text mining being perhaps the most common. An example question one might attempt to answer using text mining is: How many academic journal articles acknowledge funding support from the Wellcome Trust?
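As a rough illustration of how one might tackle that question, here is a minimal sketch that scans a folder of plain-text articles for mentions of the Wellcome Trust in their text. The folder name, file format and the simple search pattern are assumptions made for the example, not part of any particular mining pipeline.

```python
import re
from pathlib import Path

# Hypothetical folder of full-text articles already converted to plain text.
ARTICLE_DIR = Path("articles_txt")

# Very simple pattern for the funder name; a real study would also catch
# grant-number formats and common misspellings.
WELLCOME = re.compile(r"Wellcome\s+Trust", re.IGNORECASE)

def acknowledges_wellcome(text):
    """Return True if the article text mentions the Wellcome Trust."""
    return bool(WELLCOME.search(text))

count = 0
for path in ARTICLE_DIR.glob("*.txt"):
    if acknowledges_wellcome(path.read_text(errors="ignore")):
        count += 1

print(f"{count} articles mention the Wellcome Trust")
```

Scaling this up across thousands of articles, and restricting the search to acknowledgements sections, is where it becomes mining rather than simple data extraction.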

Peter Murray-Rust has many more different possible uses for content mining on his blog: http://blogs.ch.cam.ac.uk/pmr/2014/02/27/101-uses-for-content-mining/

In general, content mining is still an emerging and under-utilized technique in the research sphere – there is much work still to be done and billions of questions still to be answered. It isn’t the one-stop solution to everything, but for appropriate questions it can be the most powerful & comprehensive approach available. Get mining now!

Some suggested tutorials & resources you might want to start with:

10,000 #OpenScience Tweets

Open Science - March 20, 2014 in Media, Research, Tools

We have collected 10,000+ tweets using the #openscience hashtag on Twitter, and invite volunteers to help analyse the data. The twelve most-retweeted tweets are embedded below.

Happily, just over 4,600 accounts have participated in the Open Science community via its eponymous hashtag. The 10,000 tweets have accrued over ten weeks. Our own @openscience account has tweeted the hashtag most often, over 600 times, and has also received the most retweets and @ mentions, over 8,000 in this span.

We have modified the visualisation which came with the data, collected via the excellent TAGS template shared by Martin Hawksey. To Martin’s default view we added counts of mentions and of mentions per tweet for the top tweeters, and rankings of the top tweets for the past week and the past ten weeks.
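For anyone who would rather work outside the spreadsheet, a minimal sketch of the same kind of summary is below, assuming the archive has been exported to CSV with `from_user` and `text` columns; those column names, and the file name, are assumptions about the export rather than guarantees.

```python
import pandas as pd

# Hypothetical export of the tweet archive sheet to CSV.
tweets = pd.read_csv("openscience_tweets.csv")

# Tweets per account (lower-cased so counts and mentions line up).
tweet_counts = tweets["from_user"].str.lower().value_counts()

# How often each account is @-mentioned in the tweet text.
mentions = (
    tweets["text"]
    .str.findall(r"@(\w+)")
    .explode()
    .dropna()
    .str.lower()
    .value_counts()
)

summary = pd.DataFrame({"tweets": tweet_counts, "mentions": mentions}).fillna(0)
summary["mentions_per_tweet"] = summary["mentions"] / summary["tweets"].clip(lower=1)

print(summary.sort_values("tweets", ascending=False).head(20))
```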

Help wanted

More could be done; won’t you help? Leave a reply below or ping us @openscience on Twitter if you need edit access to the sheet itself. We would also like to see the data and analyses taken into other tools. Our work to this point is only to get something started.

Top #openscience tweets of the past ten weeks

 

The above list is not dynamic. The data collected and displayed here, however, are dynamic and refresh themselves hourly.

Not all tweets which are about Open Science include the #openscience hashtag. In a perfectly semantic world they would, and when they can, they really should. It has helped to form a community among the 4,600+ accounts participating in these ten weeks, and many others in recent years. A couple of reasons the hashtag might not be used in a relevant tweet include the character limit on tweets and lack of awareness of hashtags or of the term Open Science.

We take our organising and leadership role seriously at @openscience on Twitter, an account shared by many in the community. We have a simple policy that all our tweets should be related to Open Science. Even at our account, not all our tweets include the #openscience hashtag, particularly as we discuss related concerns such as Citizen Science or Open Access. An example tweet from the time frame considered here, related to Open Science but not hashtagged as such, is below. In this case, the limit on tweet length and the topic led to including #openaccess, not #openscience:

The most retweeted Open Science-related tweet of all time, so far as we know, did not use the #openscience hashtag but was lovely. From the Lord of Dance and Prince of Swimwear:

Building an archaeological project repository I: Open Science means Open Data

Michelle Brook - February 27, 2014 in Guest Post, Research

This is a guest post by Anthony Beck, Honorary fellow, and Dave Harrison, Research fellow, at the University of Leeds School of Computing.

In 2010 we authored a series of blog posts for the Open Knowledge Foundation subtitled ‘How open approaches can empower archaeologists’. These discussed the DART project, which is on the cusp of concluding.

The DART project collected large amounts of data, and as part of the project, we created a purpose-built data repository to catalogue this and make it available, using CKAN, the Open Knowledge Foundation’s open-source data catalogue and repository. Here we revisit the need for Open Science in the light of the DART project. In a subsequent post we’ll look at why, with so many repositories of different kinds, we felt that to do Open Science successfully we needed to roll our own.
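By way of illustration, registering a dataset in a CKAN instance can be done through its action API. The sketch below is a minimal example using Python’s requests library, with a hypothetical repository URL, API key and dataset fields standing in for the actual DART catalogue.

```python
import requests

# Hypothetical CKAN instance and API key; substitute your own repository.
CKAN_URL = "https://data.example.org"
API_KEY = "YOUR-API-KEY"

dataset = {
    "name": "dart-sample-survey",            # URL-friendly identifier
    "title": "DART sample geophysical survey",
    "notes": "Example dataset description.",
    "license_id": "cc-by",
}

# CKAN exposes dataset creation via the package_create action.
resp = requests.post(
    f"{CKAN_URL}/api/3/action/package_create",
    json=dataset,
    headers={"Authorization": API_KEY},
)
resp.raise_for_status()
print(resp.json()["result"]["id"])
```

Resources (files, links to services) can then be attached to the new dataset with the corresponding resource_create action.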

Open data can change science

Open inquiry is at the heart of the scientific enterprise. Publication of scientific theories – and of the experimental and observational data on which they are based – permits others to identify errors, to support, reject or refine theories and to reuse data for further understanding and knowledge. Science’s powerful capacity for self-correction comes from this openness to scrutiny and challenge. (The Royal Society, Science as an open enterprise, 2012)

The Royal Society’s report Science as an open enterprise identifies how 21st century communication technologies are changing the ways in which scientists conduct, and society engages with, science. The report recognises that ‘open’ enquiry is pivotal for the success of science, both in research and in society. This goes beyond open access to publications (Open Access), to include access to data and other research outputs (Open Data), and the process by which data is turned into knowledge (Open Science).

The underlying rationale of Open Data is this: unfettered access to large amounts of ‘raw’ data enables patterns of re-use and knowledge creation that were previously impossible. The creation of a rich, openly accessible corpus of data introduces a range of data-mining and visualisation challenges, which require multi-disciplinary collaboration across domains (within and outside academia) if their potential is to be realised. An important step towards this is creating frameworks which allow data to be effectively accessed and re-used. The prize for succeeding is improved knowledge-led policy and practice that transforms communities, practitioners, science and society.

The need for such frameworks will be most acute in disciplines with large amounts of data, a range of approaches to analysing the data, and broad cross-disciplinary links – so it was inevitable that they would prove important for our project, Detection of Archaeological residues using Remote sensing Techniques (DART).

DART: data-driven archaeology

DART aimed to develop analytical methods to differentiate archaeological sediments from non-archaeological strata, on the basis of remotely detected phenomena (e.g. resistivity, apparent dielectric permittivity, crop growth, thermal properties etc.). The data collected by DART are of relevance to a broad range of different communities. Open Science was adopted with two aims:

  • to maximise the research impact by placing the project data and the processing algorithms into the public sphere;
  • to build a community of researchers and other end-users around the data so that collaboration, and by extension research value, can be enhanced.

‘Contrast dynamics’, the type of data provided by DART, is critical for policy makers and curatorial managers to assess both the state and the rate of change in heritage landscapes, a need wrapped up in national commitments to the European Landscape Convention (ELC). Making the best use of the data, however, depends on openly accessible dynamic monitoring, along similar lines to that proposed by the European Space Agency for the Global Monitoring for Environment and Security (GMES) satellite constellations. What is required is an accessible framework which allows all this data to be integrated, processed and modelled in a timely manner. The approaches developed in DART to improve the understanding, and enhance the modelling, of heritage contrast detection dynamics feed directly into this long-term agenda.

Cross-disciplinary research and Open Science

Such approaches cannot be undertaken within a single domain of expertise. This vision can only be built by openly collaborating with other scientists and building on shared data, tools and techniques. Important developments will come from the GMES community, particularly from precision agriculture, soil science, and well documented data processing frameworks and services. At the same time, the information collected by projects like DART can be re-used easily by others. For example, DART data has been exploited by the Royal Agricultural University (RAU) for use in such applications as carbon sequestration in hedges, soil management, soil compaction and community mapping. Such openness also promotes collaboration: DART partners have been involved in a number of international grant proposals and have developed a longer term partnership with the RAU.

Open Science advocates opening access to data, and other scientific objects, at a much earlier stage in the research life-cycle than traditional approaches. Open Scientists argue that research synergy and serendipity occur through openly collaborating with other researchers (more eyes/minds looking at the problem). Of great importance is the fact that the scientific process itself is transparent and can be peer reviewed: as a result of exposing data and the processes by which these data are transformed into information, other researchers can replicate and validate the techniques. As a consequence, we believe that collaboration is enhanced and the boundaries between public, professional and amateur are blurred.

Challenges ahead for Open Science

Whilst DART has not achieved all its aims, it has made significant progress and has identified some of the barriers to achieving such open approaches. Key among these is the articulation of issues surrounding data access (accreditation), licensing and ethics. Who gets access to data, when, and under what conditions, is a serious ethical issue for the heritage sector. These are obviously issues that need co-ordination through organisations like Research Councils UK, with cross-cutting input from domain groups. The Arts and Humanities community produce data and outputs with pervasive social and ethical impact, and it is clearly important that they have a voice in these debates.

Content Mining: Scholarly Data Liberation Workshop

Jenny Molloy - December 14, 2013 in events, Oxford Open Science, Research, Tools

The November Oxford Open Science meeting brought over 20 researchers together for a ‘Content Mining: Scholarly Data Liberation Workshop’.

Iain Emsley and Peter Murray-Rust kicked off proceedings by presenting their work on mining Twitter and academic papers in chemistry and phylogenetics respectively.

Next we tried out web-based tools such as Tabula for extracting tables from PDF (we were fortunate enough to have Manuel Aristarán of Tabula joining us remotely via Skype) and ChemicalTagger for tagging and parsing experimental sections in chemistry articles.


We then got down to business with some hands-on extraction of species from HTML papers and mentions of books on Twitter using regular expressions. All code is open source so you are welcome and encouraged to play, fork and reuse!
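As a flavour of the regular-expression exercises, the sketch below pulls candidate Latin binomials (a capitalised genus followed by a lowercase epithet) out of an HTML paper. It is only an illustrative pattern with an assumed local file, not the actual workshop code linked below.

```python
import re
from html.parser import HTMLParser

# Naive pattern for Latin binomials, e.g. "Homo sapiens" or "H. sapiens".
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{3,})\b")

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

with open("paper.html", encoding="utf-8") as fh:   # hypothetical input file
    parser = TextExtractor()
    parser.feed(fh.read())

text = " ".join(parser.chunks)
species = sorted({f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(text)})
print("\n".join(species))
```

A pattern this simple will also catch plenty of non-species phrases, which is exactly the kind of refinement the hands-on session explored.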

Peter’s tutorial and code to extract species from papers can be found on bitbucket and the relevant software and command line tools have helpfully been bundled into a downloadable package. Iain has also documented his flask application for Twitter mining on github so have a go!

If this has whetted your appetite for finding out more about content mining for your research and you’d like to ask for input or help, or simply follow the ongoing discussion, then join our

open content mining mailing list


Some furry friends joined in the efforts – meet Chuff the OKF Okapi and AMI the kangaroo

Open and transparent altmetrics for discovery

Peter Kraker - December 9, 2013 in Panton Fellowships, Research, Tools


by AG Cann

Altmetrics are a hot topic in the scientific community right now. Classic citation-based indicators such as the impact factor are being complemented by alternative metrics generated from online platforms. Usage statistics (downloads, readership) are often employed, but links, likes and shares on the web and in social media are considered as well. The altmetrics promise, as laid out in the excellent manifesto, is that they assess impact more quickly and on a broader scale.

The main focus of altmetrics at the moment is evaluation of scientific output. Examples are the article-level metrics in PLOS journals, and the Altmetric donut. ImpactStory has a slightly different focus, as it aims to evaluate the oeuvre of an author rather than an individual paper.

This is all well and good, but in my opinion altmetrics have a huge potential for discovery that goes beyond rankings of top papers and researchers – a potential that is largely untapped so far.

How so? To answer this question, it is helpful to shed a little light on the history of citation indices.

Pathways through science

In 1955, Eugene Garfield created the Science Citation Index (SCI), which later went on to become the Web of Knowledge. His initial idea – next to measuring impact – was to record citations in a large index to create pathways through science. Thus one can link papers that are not linked by shared keywords. It makes a lot of sense: you can talk about the same thing using totally different terminology, especially when you are not in the same field. Furthermore, terminology has proven to be very fluid, even within the same domain (Leydesdorff 1997). In 1973, Small and Marshakova realized – independently of each other – that co-citation is a measure of subject similarity and can therefore be used to map a scientific field.
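To make the idea concrete, here is a minimal sketch of co-citation counting: two references are co-cited whenever they appear in the same reference list, and the resulting pair counts can feed a similarity map. The toy reference lists are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy data: each citing paper maps to the set of papers it references.
references = {
    "paper_A": {"P1", "P2", "P3"},
    "paper_B": {"P2", "P3"},
    "paper_C": {"P1", "P3", "P4"},
}

# Count how often each pair of cited papers appears in the same list.
cocitations = Counter()
for refs in references.values():
    for pair in combinations(sorted(refs), 2):
        cocitations[pair] += 1

# Pairs with high counts are treated as subject-similar and can be mapped.
for (a, b), n in cocitations.most_common():
    print(a, b, n)
```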

Due to the fact that citations are considerably delayed, however, co-citation maps are often a look into the past and not a timely overview of a scientific field.

Altmetrics for discovery

In come altmetrics. Similarly to citations, they can create pathways through science. After all, a citation is nothing else but a link to another paper. With altmetrics, it is not so much which papers are often referenced together, but rather which papers are often accessed, read, or linked together. The main advantage of altmetrics, as with impact, is that they become available much earlier.


Bollen et al. (2009): Clickstream Data Yields High-Resolution Maps of Science. PLOS One. DOI: 10.1371/journal.pone.0004803.

One of the efforts in this direction is the work of Bollen et al. (2009) on click-streams. Using the sequences of clicks to different journals, they create a map of science (see above).

In my PhD, I looked at the potential of readership statistics for knowledge domain visualizations. It turns out that co-readership is a good indicator for subject similarity. This allowed me to visualize the field of educational technology based on Mendeley readership data (see below). You can find the web visualization called Head Start here and the code here (username: anonymous, leave password blank).


http://labs.mendeley.com/headstart
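A minimal sketch of the co-readership idea is below: represent readership as a binary reader-by-document matrix and take cosine similarity between document columns, so that documents saved by many of the same readers come out as subject-similar. The tiny matrix is invented for illustration and is not the actual Head Start data.

```python
import numpy as np

# Rows are readers, columns are documents; 1 means the reader has the
# document in their library. Invented toy data.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

# Cosine similarity between document columns = normalised co-readership.
counts = R.T @ R                      # raw co-readership counts
norms = np.sqrt(np.diag(counts))      # readers per document
similarity = counts / np.outer(norms, norms)

print(np.round(similarity, 2))
```

The resulting similarity matrix is what a layout algorithm then turns into a knowledge domain visualization.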

Why we need open and transparent altmetrics

The evaluation of Head Start showed that the overview is indeed more timely than maps based on citations. It, however, also provided further evidence that altmetrics are prone to sample biases. In the visualization of educational technology, the computer science driven areas such as adaptive hypermedia are largely missing. Bollen and Van de Sompel (2008) reported the same problem when they compared rankings based on usage data to rankings based on the impact factor.

It is therefore important that altmetrics are transparent and reproducible, and that the underlying data is openly available. This is the only way to ensure that all possible biases can be understood.

As part of my Panton Fellowship, I will try to find datasets that satisfy these criteria. There are several examples of open bibliometric data, such as the Mendeley API and the figshare API, which have adopted CC BY, but most usage data is not publicly available or cannot be redistributed. In my fellowship, I want to evaluate the goodness of fit of different open altmetrics data. Furthermore, I plan to create more knowledge domain visualizations such as the one above.
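As a sketch of what openly available readership data can look like, the snippet below asks the Mendeley catalog for the reader count of a single DOI. The endpoint path, the `view=stats` parameter, the `reader_count` field and the placeholder OAuth token are my reading of the Mendeley API documentation and should be treated as assumptions to verify against the current docs.

```python
import requests

ACCESS_TOKEN = "YOUR-OAUTH-TOKEN"   # obtained via Mendeley's OAuth flow
DOI = "10.1371/journal.pone.0004803"

# Assumed catalog-search endpoint; check the current Mendeley API docs.
resp = requests.get(
    "https://api.mendeley.com/catalog",
    params={"doi": DOI, "view": "stats"},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
resp.raise_for_status()

for doc in resp.json():
    print(doc.get("title"), doc.get("reader_count"))
```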

So if you know any good datasets please leave a comment below. Of course any other comments on the idea are much appreciated as well.

Open Scholar Foundation

Jenny Molloy - December 6, 2013 in Announcements, Guest Post, Reproducibility, Research, Tools

This is a guest post from Tobias Kuhn of the Open Scholar Foundation. Please comment below or contact him via the link above if you have any feedback on this initiative!

logo(2)

The goal of the Open Scholar Foundation is to improve the efficiency of scholarly communication by providing incentives for researchers to openly share their digital research artifacts, including manuscripts, data, protocols, source code, and lab notes.

The proposal of an “Open Scholar Foundation” was one of the winners of the 1K challenge of the Beyond the PDF conference. This was the task of the challenge:

What would you do with 1K that would significantly advance scholarly communication that does not involve building a new software tool?

The idea was to establish a committee that would certify researchers as “Open Scholars” according to given criteria. This was the original proposal:

I would set up a simple "Open Scholar Foundation" with a website, where researchers can submit proofs that they are "open scholars" by showing that they make their papers, data, metadata, protocols, source code, lab notes, etc. openly available. These requests are briefly reviewed, and if approved, the applicant officially becomes an "Open Scholar" and is entitled to show a banner "Certified Open Scholar 2013" on his/her website, presentation slides, etc. Additionally, there could be annual competitions to elect the "Open Scholar of the Year".

An alternative approach (perhaps more practical and promising) would be to provide a scorecard for researchers to calculate their “Open Scholar Score” on their own. There is an incomplete draft of such a scorecard in the github repo here.
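As a purely hypothetical illustration of how such a self-assessed score might work, the sketch below weights a few yes/no criteria and sums them. The criteria, weights and answers are invented for the example and are not the draft scorecard in the repository.

```python
# Hypothetical Open Scholar scorecard: criteria and weights are invented
# for illustration and do not reflect the actual draft scorecard.
CRITERIA = {
    "papers_openly_available": 3,
    "data_openly_available": 3,
    "source_code_open": 2,
    "protocols_shared": 1,
    "open_lab_notebook": 1,
}

def open_scholar_score(answers):
    """Sum the weights of all criteria the researcher satisfies."""
    return sum(weight for name, weight in CRITERIA.items() if answers.get(name))

answers = {
    "papers_openly_available": True,
    "data_openly_available": True,
    "source_code_open": False,
    "protocols_shared": True,
    "open_lab_notebook": False,
}

score = open_scholar_score(answers)
print(f"Open Scholar Score: {score} / {sum(CRITERIA.values())}")
```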

In any case, this project should lead to an established and recognized foundation that motivates scholars to openly share their data and results. Being a certified Open Scholar should be something that increases one’s reputation and visibility, and should provide a counterweight to the possible benefits of keeping data and results secret. The criteria for Open Scholars should become stricter over time, as the number of “open-minded” scholars hopefully increases over the years. This should go on until, eventually, scholarly communication has fundamentally changed and no longer requires this special incentive.

It is probably a good idea to use Mozilla Open Badges for these Open Scholar banners.

We are at the very beginning with this initiative. If you are interested in joining, get in touch with us! We are open to any kind of feedback and suggestions.

Open science & development goals: round up & the way forward

Jenny Molloy - September 23, 2013 in Collaborations, External Meetings, Meetings, Research

This is a post by the team at OpenUCT (post by Sarah Goodier, photos by Uvania Naidoo) and will soon be published on the OpenUCT blog.

Open science and development were the two key themes that brought together a diverse group of over 20 scientists, methodological experts and researchers last week at the University of Cape Town. On 12–13 September, these experts in their fields gathered for an IDRC OKFN-OpenUCT Open Science for Development workshop to scope possible research areas in open science for development. The focus was on what research could be undertaken and on strengthening networks around this broad topic across Africa, Asia, Latin America and the Caribbean.

Day 1 involved discussions around opportunities and challenges for each of the regions represented, as well as available resources that could be used and shared. By the end of the day, the group was starting to draw on these potential avenues for exploring open science for development in order to shape research questions.

Continuing from day 1’s discussions, day 2 focussed on framing these research questions around open science for development. The questions were discussed by breakout groups, which selected the top four from the multitude suggested. This selection was no easy task given such a mixed bag of broad conceptual questions and focused practical questions – a clear indication that there are many potentially interesting research questions.


Four key questions emerged that were taken forward in further discussion:

  1. What value framework is a prerequisite for open science?
  2. How can open science support visibility and communication of science outside formal academic structures?
  3. How can open science create education?
  4. How can the economic and social value of open science be measured?

Projects that could help to answer these main questions were conceptualised and expanded upon. Some of the broad areas that the suggested projects could address included education and increased public involvement, as well as the implications of open science for cost and building value. A regional focus for the suggested projects was thought to be best, largely due to financial and time limitations as well as co-ordination issues. The overarching IDRC-backed research programme will help to create and develop further synergies between any funded projects.


As part of maintaining the momentum created over the course of the workshop, staying connected and growing the network by bringing on board other people with diverse perspectives are key actions going forward. All of us walked away from this workshop with a greater appreciation for open science and an understanding that, although diverse, open science is united by many similar practices across regions.

We finished the two days with more questions than answers – just where you should be when scoping possible research questions. What comes next is an OKFN working paper pulling together all the discussion threads, questions and resources raised over the two days, which will inform a call for research proposals for projects involving and investigating open science.

Watch this space as open science spreads across the map!


Open science & development goals: shaping research questions

Jenny Molloy - September 13, 2013 in Collaborations, events, External Meetings, Guest Post, Meetings, Research

This is cross-posted from the OpenUCT blog.

What do we include in our definition of open science? And what is meant by development? Two key questions when you’re discussing open science for development, as we were yesterday on day one of the IDRC OKFN-OpenUCT Open Science for Development workshop.

Participants from Africa, Asia, and Latin America and the Caribbean have gathered at the University of Cape Town in an attempt to map current open science activity in these regions, strengthen community linkages between actors, and articulate a framework for a large-scale IDRC-funded research programme on open science. The scoping workshop aims to uncover research questions around how open approaches can contribute to development goals in different contexts in the global South. Contextualization of open approaches and the identification of their key similarities and differences is critical in helping us understand the needs and required frameworks of future research.

Several key themes, which generally raised more questions than answers, came up throughout a day packed with presentations, discussion and debate: strategic tensions, inequalities, global power dynamics, and the complexity of distilling common challenges (and opportunities) across large geographical areas. Some of the key strategic tensions identified include the balance between the “doing” of open science as opposed to researching it, as well as the tension between high-quality research and capacity building at an implementation level. Both tensions centre on inextricably linked components which are important in their own right. This raises the question: where should the focus be? Where is it most relevant and important?

The issue of inequality and inclusivity also featured strongly in the discussions, particularly around citizen science – by involving people in the research process, you empower them before they are affected. But this raises further questions: How open should citizen science be? Who takes the initiative and sets goals? Who is allowed to participate, and in what roles? With regard to knowledge, a small number of countries and corporate entities act as gatekeepers of the knowledge produced globally. How should this knowledge be made more accessible? Will open scientific approaches make dialogue and knowledge distribution more inclusive?

By the end of the first day’s discussion, the workshop had surfaced opportunities and challenges for each of the regions, but many questions remain about how to address the complex issues at hand and bring together the disparate components of open scientific activity. Day two of the workshop will focus on articulating research problems, possible areas of activity and the structure of the envisioned research programme.

Join the discussion on Twitter via #OpenSciDev.

by SarahG (Pictures by Uvania Naidoo)

Open Science for Development

Jenny Molloy - July 20, 2013 in Announcements, Collaborations, events, External Meetings, Meetings, Research


We are delighted to announce that OKF is collaborating with the OpenUCT Initiative at the University of Cape Town in an International Development Research Centre funded project to develop a southern-led research agenda for open science for development.

We hope to use this as an opportunity not only to explore research into open science but also to really push community building efforts in the global south and identify a strong network of open science advocates and practitioners – maybe setting up some new local open science groups along the way!

You can read more in our project proposal.

A small group met in London last week to lay the groundwork for a larger workshop in Cape Town on 11–13 September 2013, and the results of that meeting will be available online shortly.

We hope you are as excited about this opportunity as we are, and in the spirit of the exercise we will be making both the process and the outcomes as open as possible. Therefore, if you would like to apply to participate in the Cape Town meeting, please send jenny.molloy@okfn.org a brief half-page introduction to yourself, including answers to the following questions:

Why is this project of interest? What expertise and experience do you bring? What would you like to see come out of this project?

Preference will be given to participants from developing countries in order to further the aims of the project, and full funding will be provided.

There is a short deadline of 24 July 2013, so please spread this invitation through your networks, particularly to contacts you might have in the global south. If you are selected, we will organise travel and flights as soon as possible.


What license is this ‘open access’ journal using?

Ross Mounce - March 12, 2013 in Research, Tools

ANNOUNCEMENT:

We have released a demo version of a new application on OKF’s new citizen science / microtasking platform Crowdcrafting.org.

It’s called “Is It an Open Access Journal?” and looks something like this:

The new app on Crowdcrafting.org

It uses PyBossa, with Disqus for comments. The code for it is available on GitHub here too (open source!).

The aim of this app is to help crowdsource data on what re-use license each of the ‘open access’ journals uses, and who holds the copyright.
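For the curious, the sketch below shows roughly how such journal-by-journal tasks can be loaded into a PyBossa project with the pbclient helper library. The endpoint, API key, app id and journal list are placeholders, and the call signatures should be checked against the pbclient documentation rather than taken as the app’s actual loading script.

```python
import pbclient

# Placeholder endpoint and API key; see the pbclient docs for details.
pbclient.set("endpoint", "http://crowdcrafting.org")
pbclient.set("api_key", "YOUR-API-KEY")

APP_ID = 123  # hypothetical id of the "Is It an Open Access Journal?" app

journals = [
    {"journal": "Example Journal of Openness", "publisher": "Example Press"},
    {"journal": "Journal of Sample Studies", "publisher": "Sample Publishing"},
]

# One PyBossa task per journal; volunteers then record the licence
# and copyright holder for each task they complete.
for info in journals:
    pbclient.create_task(APP_ID, info=info)
```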

Background

  Open Access, as defined by the Budapest Open Access Initiative, permits any and all users to

…read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

But sadly, not all journals that call themselves ‘open access’ actually make their articles available under terms compliant with this definition. Many journals don’t publish under any liberal re-use license at all. This survey will help identify those journals that need to do more to ensure that they actually publish their works under an open access compliant license.

Mini-Tutorial

Screengrab of the data entry bits. It’s really as simple as that…

Users are given the name of a journal and its publisher. In step one, they do a Google search for the journal to find the answers to the two questions asked of them.

Once the user has found the correct website for the journal, hopefully it can be investigated to find the answers, so that (step two) the correct license that this journal publishes under can be selected from the drop-down box.

Step three. The user must select who owns the copyright of each of the articles the journal publishes. Is it the author(s), the journal, the society, or unknown?

Step four. Once the user is satisfied with their answers they can click this button to save their data and move onto the next journal.

 

We hope that this app will help to assess, and keep current, the metadata that we have on Open Access journals.