Content Mining: Scholarly Data Liberation Workshop
The November Oxford Open Science meeting brought over 20 researchers together for a ‘Content Mining: Scholarly Data Liberation Workshop’.
Iain Emsley and Peter Murray-Rust kicked off proceedings by presenting their work on mining Twitter and on mining academic papers in chemistry and phylogenetics, respectively.
Next we tried out web-based tools such as Tabula for extracting tables from PDF (we were fortunate enough to have Manuel Aristarán of Tabula joining us remotely via Skype) and ChemicalTagger for tagging and parsing experimental sections in chemistry articles.
We then got down to business with some hands-on extraction of species from HTML papers and mentions of books on Twitter using regular expressions. All code is open source so you are welcome and encouraged to play, fork and reuse!
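To give a flavour of the regular-expression approach, here is a minimal sketch in Python. It is not the code from the workshop: it assumes that species names appear as italicised Latin binomials (inside `<i>` or `<em>` tags), which is a common convention in HTML papers but by no means guaranteed. The names `ITALIC_BINOMIAL` and `extract_species` are made up for this illustration.

```python
import re

# Assumption: binomial names are italicised in the HTML, e.g. <i>Homo sapiens</i>.
# The genus is either a capitalised word or an abbreviation such as "H.";
# the species epithet is a lowercase word of at least two letters.
ITALIC_BINOMIAL = re.compile(
    r"<(i|em)>\s*([A-Z][a-z]+|[A-Z]\.)\s+([a-z]{2,})\s*</\1>"
)

def extract_species(html):
    """Return candidate binomials, de-duplicated, in order of first appearance."""
    seen = []
    for match in ITALIC_BINOMIAL.finditer(html):
        name = f"{match.group(2)} {match.group(3)}"
        if name not in seen:
            seen.append(name)
    return seen

sample = (
    "<p>We sampled <i>Homo sapiens</i> and <em>Mus musculus</em>; "
    "<i>H. sapiens</i> was measured <i>in vivo</i>.</p>"
)
print(extract_species(sample))
# → ['Homo sapiens', 'Mus musculus', 'H. sapiens']
```

Requiring a capitalised genus filters out italicised phrases such as "in vivo", but a heuristic like this will still miss names that are not italicised and will not link abbreviations like "H. sapiens" back to the full name, so treat the results as candidates to check rather than a definitive list.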
Peter’s tutorial and code to extract species from papers can be found on Bitbucket, and the relevant software and command-line tools have helpfully been bundled into a downloadable package. Iain has also documented his Flask application for Twitter mining on GitHub, so have a go!
If this has whetted your appetite for finding out more about content mining for your research, and you’d like to ask for input or help or simply follow the ongoing discussion, then join our