#OKFest Open Science and Culture Hackday – Project 3: Investigative Open Bibliography

September 18, 2012 in Hackday, Meetings, OKFest, Tools

The third hackday project aims to explore the links between corporations and researchers for the areas of organic food, but more widely research into pharmaceuticals, cosmetics, food and other consumer products. By extracting author and funding data from the full text open access literature, funding links become clearer allowing visualisation which can be manipulated into patterns based on study outcomes to identify areas with particularly high positive result publication biases which may be influenced by commercial interests.

How to Do It?


Luckily the OKFN has several tools at its disposal to make this happen:

Our Open Biblio project has developed tools to extract full text literature (PubCrawler) and convert files into bibliographic metadata formats such as BibJSON whereby data can be easily manipulated and facet browsed (BibServer).
Therefore, we are extracting a subset of the open access BioMedCentral articles.

The next step will be to extract author information and affiliation (illustrated using Ahmet et al., 2011), which brings up a new problem of reconciling affiliations where no canonical list of institutions exists. Much discussion ensued about the possibility of using pyBOSSA and merging several publicly available lists of institutions from authoritative sources as a starting point.

We will also require funding data and competing interests sections from the articles, which can be mined using natural language processing tools to extract corporate names (possibly utilising Open Corporates) and references to other funders e.g. research councils and charities.

Once extracted this information would be entered into a database for faceted searching by funder, institution and author along with standard bibliographic metadata e.g. list all articles funded by GlaxoSmithKline and published in UK institutions in 2011.

What Can We Do With It?

This would lay the ground work for linking to other data sets and further elucidating patterns in corporate funding of research. For example, abstracts and key words could be used to browse by topic and a further pyBOSSA app could be generated to type the articles by a positive or negative result, or even graded e.g. relative risk of a new clinical treatment. This could be used to look at positive publishing bias in different areas of science and for work through different funders – which appear to be the worst affected areas? Which funders very rarely publish negative results?

Many studies on publication bias already exist in clinical trials but they are often limited to a few hundred articles e.g. Friedman and Richter, 2004, whereas the automated methodology described above could examine large portions of the literature. There are also different types of publication biases e.g. citation biases for positive articles which could possibly be investigated on a larger scale using these techniques than is currently possible, particularly involving Open Citation tools.

Incorporating geographic information on institutions and keyword analysis could reveal hot spots of research and other useful information for funders and research managers.

Hard at work hacking with the OKF Okapi

What Did We Do?

We completed liberating BMC articles by crawling the site and using parsers from BibTex to BibJSON. By the end of the week we expect to have 140k articles uploaded to BibServer including author, affiliations, abstracts and more.

Leave a Reply

Your email address will not be published. Required fields are marked *