SciDataCon 2014 was the first ever International Conference on Data Sharing and Integration for Global Sustainability jointly organised by CODATA and World Data Systems, two organisations that form part of the International Council for Science. The meeting was held 2-5 November in New Delhi and I had the pleasure of staying in the peaceful, green campus of IIT-Delhi within walking distance of SciDataCon2014 at the adjacent and equally pleasant Jawaharlal Nehru University (JNU).
It was a jam-packed week but I’ve tried to pick out some of my personal highlights. Puneet Kishor has also blogged on the meeting and there was an active Twitter feed for #SciDataCon2014.
Photo by Puneet Kishor published under CC0 Public Domain Dedication
Text and Data Mining Workshop
On Sunday 2 November I ran a workshop with Puneet Kishor of Creative Commons as a joint venture with Open Knowledge and ContentMine. Armed with highlighters, post-it notes and USB-stick virtual machines, we led a small but dedicated and enthusiastic group of immunologists, bioinformaticians, plant genomics researchers and seabed resource experts through the basics of content mining.
Photo by Puneet Kishor published under CC0 Public Domain Dedication
We covered what content mining means, when it is legal, and more broadly some of the policy and legal frameworks which affect access to and rights of reuse for the scientific literature. We hand-annotated entity types in two papers, about lion evolution and Aspergillus fungi. This aimed to get people thinking about patterns and how to program entity recognition – what instructions does a computer require to recognise what our brains categorise easily? Swapping over the papers showed 80-90% inter-participant agreement in entity mark-up, suggesting a reasonable precision and recall rate for our content-mining humans!
Photo by Puneet Kishor published under CC0 Public Domain Dedication
Everybody managed to scrape multiple Open Access publications and extract species names; we also discussed potential collaborations and had a virtual visitor from afar, Peter Murray-Rust. The overwhelming feeling in the room was one of dismay at the restrictions on reuse of academic content but optimism about the potential uses of content mining – we hope that an excellent collaboration opportunity around phytochemistry will come to fruition!
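As a flavour of what those ‘instructions for a computer’ might look like, here is a purely illustrative Python sketch – not the actual tooling from the workshop – using a naive pattern for Latin binomials. It also shows why our hand-annotation exercise matters: the naive rule happily matches phrases like “Phylogenetic analysis” alongside real species names.

```python
import re

# A naive pattern for Latin binomials: a capitalised genus followed by a
# lowercase species epithet, e.g. "Panthera leo" or "Aspergillus fumigatus".
# Illustrative only: real entity recognition needs dictionaries and
# disambiguation to weed out false positives.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s([a-z]{3,})\b")

def find_species(text):
    """Return candidate species names found in a block of text."""
    return [" ".join(match) for match in BINOMIAL.findall(text)]

sample = ("Phylogenetic analysis of Panthera leo populations, with "
          "comparison to Aspergillus fumigatus outgroups.")
# Note the false positive "Phylogenetic analysis" in the output:
print(find_species(sample))
```

Working out which extra instructions remove the false positives without losing the true matches is exactly the kind of pattern-thinking the annotation exercise was designed to provoke.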
Several sessions over the conference highlighted how far we still have to go in terms of data sharing, and particularly the challenge of building the political will required for data sharing for global sustainability. Waltraut Ritter, a member of the active Open Knowledge local group Open Data Hong Kong, presented a paper co-authored with Scott Edmunds and others, making the case to policy makers that open data can support science and innovation. There is no guidance from the Hong Kong University Grants Committee on dissemination of research data resulting from its 7.5 billion HKD annual funding pool. Data sharing was explicitly flagged as low priority in 2011, and on enquiry in 2014 Open Data HK were informed that this assessment had not changed. Arguments to appeal to policy makers are clearly required in these situations, and Waltraut expanded on a few during the talk.
Exploring the complexities of sharing data for public health research, Sanna Meherally reported on a qualitative study examining the ethical and practical background to potential research data sharing, involving five sites in Asia and Africa, and focusing on stakeholder perspectives. A key takeaway message was the importance of considering cultural barriers to implementation of funder data policies. Chief concerns raised in interviews were confidentiality, the potential for data collection efforts to be underplayed and the need to give something back to research participants. That the latter point was raised by so many researchers interviewed is encouraging given the title of the next day’s session ‘Data from the people for the people’, which was another focus of SciDataCon.
Data from the People for the People – encouraging Re-use and Collaboration
This double session focused on citizen science projects around topics related to sustainability, including biodiversity and climate change. Norbert Schmidt introduced projects in the Netherlands to monitor air quality while Raman Kumar from the Nature Conservation Foundation introduced a range of bird and plant ecology citizen science projects in India such as eBird, MigrantWatch and SeasonWatch. You can find the full session list here.
Cumulative hours of birding as of Sep 2014 through the eBird India citizen science initiative
Most questions raised surrounded the validation of data quality from citizen scientists, which has been addressed at length by several projects. Later presentations and discussions moved to some very pressing matters in participatory science – how to build and retain a community of contributors, and how to manage outputs in a way that is accessible to and benefits contributors, a similar point to that raised by Sanna Meherally. Retention of volunteers is a particular issue in longitudinal studies in ecology, as data is required for the same locality over multiple years, so repeat volunteering is essential.
Tyng-Ruey Chuang tackled some of these issues in his talk on ‘Arrangements for Data Sharing and Reuse in Citizen Science Projects’. He asked projects to compare themselves to Wikipedia in terms of openness, participation and tools. For instance, does your project retain or strip metadata from contributed images? Tyng-Ruey also emphasised informed participation – clearly state if citizen contributions are prima facie uncopyrightable, or ask agreement for open licensing. This chimed with earlier points by Ryosuke Shibasaki about the need for citizen ownership of contributed data and agency to make informed decisions about its use.
The talk ended with a call to action, as the Open Definition was practically quoted and Tyng-Ruey called for raw data, now! He’s in good company at Open Knowledge!
The sessions above are only a small subset of the conversations happening across the whole programme, and papers are available online for all sessions. There were many demands for more open data, from Theo Bloom using her keynote to call for the abolition of data release embargoes to Chaitanya Baruo revealing that Indian geology students are using US data because India does not make its own data available for open academic research. However, there were also excellent case studies of the reuse of data and its value. It would have been interesting to see some more cross-cutting sessions covering the whole data collection and sharing cycle, but that will have to wait for 2016! This is a thoroughly recommended conference for data scientists and managers as well as domain experts, and has notable participation from the global South, which is excellent and enriches the perspectives discussed.
Finally, I can only apologise for not being able to report on the Strategies Towards Open Science Panel – I was giving a talk at IIT which clashed with the session, but I’ve no doubt some excellent points were raised which will soon be shared!
Bonus slide decks
I couldn’t attend these sessions, but they’re worth a look! First up Susanna Sansone and Brian Hole on data journals:
What do Benjamin Franklin, Johann Wolfgang von Goethe, and Francis Bacon have in common?
All were amateur scientists. Franklin invented the lightning rod, Goethe discovered the incisive bone and was moderately successful as an art theorist and Bacon can be considered as nothing less than the father of empiricism, or can he? Either way, the three shared a passion for discovering things in their spare time. None of them earned their pennies as professional scientists, if that profession even existed back then.
Discovery is a matter of thirst for adventure
Citizen science is in fact old hat. It existed long before disciplines existed and could be described as the rightful predecessor of all empirical science. It laid the foundations for what we know today as the scientific method: the rule-governed and verifiable analysis of the world around us. Still, amateurs in science have often become marginalized over the past 150 years, as scientific disciplines have emerged and being a scientist has become a real thing to do (read more here).
Citizen science’s second spring
Today, citizen science is experiencing a second spring and it is no surprise that the internet has had a hand in it. In recent years, hundreds of citizen science projects have popped up, encouraging people to spend their time tagging, categorizing and counting in the name of science (see here and here). Some unfold proteins in an online game (Foldit), while others describe galaxies from satellite images (GalaxyZoo and here) or count wild boars in Berlin and deliver the numbers to an online platform (Wild boars in the city). Citizen science has moved online, and there are thousands of people in a thousand different places doing many funny things that can alter the face of science. The Internet is where they meet.
Berlin Wall; East Side Gallery
The logic of Internet-based citizen science: Large scale, low involvement
Citizen science today works differently to the citizen science of Goethe’s or Franklin’s time. The decentralised and voluntary character of today’s citizen science projects questions the way research has been done for a long time. It opens up science to a multitude of voluntary knowledge workers that work (more or less) collaboratively. In some respects, this new kind of citizen science draws on open innovation strategies developed in the private sector. In their recent Research Policy article, Franzoni and Sauermann refer to this type of amateur science as crowd science. The term is extremely effective at capturing the underlying mechanics of most citizen science projects, which involve low-threshold, large-scale participation. Today, participation of volunteers in science is scalable.
The advantages of citizen science
When it comes to data collection, social participation and science communication, citizen science is promising.
For scientists, it is an excellent way to collect data. If you visit one of the citizen science directories (for example here and here) and scroll through the projects, you will see that most of them involve some kind of documenting. These citizen scientists count rhinoceros beetles, wild boars, salamanders, neophytes, mountains and trees. There is nothing that cannot be quantified, and a life solely devoted to counting the number of rhinoceros beetles in North America would indeed be mundane for an individual scientist, not to speak of the travel expenses. Citizen scientists are great data sensors.
For citizen scientists it is a way of partaking in the process of discovery and learning about fields that interest them. For example, in a German project from the Naturschutzbund (German Society for the Conservation of Nature), sports divers are asked to count macrophytes in Northern German lakes. The data the divers collect help monitor the ‘state of health’ of their freshwater lakes. In follow-up sessions, the divers are informed about the results. The case illustrates how citizen science works: volunteers help scientists and in return receive first-hand information about the results. In this regard, citizen science can be an excellent communication and education tool.
Citizen science brings insight from without into the academic ivory tower and allows researchers and interested non-researchers to engage in a productive dialogue. This is a much-needed opportunity: for some time now, scholars and policy makers have been saying how challenging it is to open up science and involve citizens. Still, what makes the new kind of internet-enabled citizen science science is the context volunteers work in rather than the tasks they perform.
The honey bee problem of citizen science
The old citizen scientists, like Franklin, Goethe or Bacon, asked questions, investigated them and eventually discovered something, like Goethe did with his incisive bone. In most citizen science projects today, however, amateurs perform rather mundane tasks like documenting things (see above), donating computing power (e.g. SETI@home) or playing games (e.g. Foldit). Go to Scientific American’s citizen science webpage and search for the word ‘help’ and you will find that out of 15 featured projects, 13 are teasered as helping scientists do something. The division of roles between citizens and real scientists is evident. Citizen scientists perform honey bee tasks; the analytic capacity remains with real researchers. Citizen science today is often a twofold euphemism.
That is not to say that collecting, documenting and counting is not a crucial part of research. In many ways the limited task complexity even resembles the day-to-day business of in-person research teams. Citizen scientists, on the other hand, can work when they want to and on what they want to. That being said, citizen science is still a win-win in terms of data collection and citizen involvement.
An alternative way to think of citizen science: Small scale, high involvement
A second way of doing citizen science is not to think of volunteers as thousands of little helpers but as knowledge workers on a par with professional researchers. This small-scale type of citizen science is sometimes swept under the carpet even though it is equally promising.
Timothy Gowers’s Polymath Project is a good case of the small-scale, high-involvement type of citizen science. In 2009, Gowers challenged the readers of his blog to find a new combinatorial proof of the density version of the Hales-Jewett theorem. One has to know that Gowers is a Fields Medallist in mathematics, and apparently his readers share the same passion. After seven weeks, he announced that the problem had been solved with the help of 40 volunteers – a number far too small to count as massively collaborative.
Nevertheless, Gowers’s approach was successful. And it designated a form of citizen science in which a few volunteers commit themselves for a longer period to solving a problem. This form of citizen science is fascinating for its capacity to harvest tacit expert knowledge that does not reside in a scientific profession. The participation is smaller in scale but higher in quality. It resembles Benkler’s commons-based peer production or the collective invention concept from open innovation.
The core challenge for this kind of citizen science is to motivate and enable expert volunteers to make a long-term commitment to a scientific problem.
Both strategies, large-scale, low-involvement participation as well as small-scale, high-involvement participation, have the capacity to alter science. The second, however, would be a form of citizen science that lives up to its name. Or did you never want to discover your own incisive bone?
Reproducibility is fundamental to the advancement of science. Unless experiments and findings in the literature can be reproduced by others in the field, the improvement of scientific theory is hindered. Scholarly publications disseminate scientific findings, and the process of peer review ensures that methods and findings are scrutinized prior to publication. Yet recent reports indicate that many published findings cannot be reproduced. Across domains, from organic chemistry (Trevor Laird, “Editorial: Reproducibility of Results”, Organic Process Research & Development) to drug discovery (Asher Mullard, “Reliability of ‘New Drug Target’ Claims Called Into Question”, Nature Reviews Drug Discovery) to psychology (Meyer and Chabris, “Why Psychologists’ Food Fight Matters”, Slate), scientists are discovering difficulties in replicating published results.
Various groups have tried to uncover why results are unreliable or what characteristics make studies less reproducible (see John Ioannidis’s “Why Most Published Research Findings Are False,” PLoS Medicine, for example). Still others look for ways to incentivize practices that promote accuracy in scientific publishing (see Nosek, Spies, and Motyl, “Scientific Utopia II: Restructuring Incentives and Practices to Promote Truth Over Publishability,” Perspectives on Psychological Science). In all of these, the underlying theme is the need for transparency surrounding the research process – in order to learn more about what makes research reproducible, we must know more about how the research was conducted and how the analyses were performed. Data, code, and materials sharing can shed light on research design and analysis decisions that lead to reproducibility. Enabling and incentivizing these practices is the goal of the Open Science Framework, a free, open source web application built by the Center for Open Science.
The right tools for the job
Open Science Framework (OSF) helps researchers manage their research workflow and enables data and materials sharing both with collaborators and with the public. The philosophy behind the OSF is to meet researchers where they are, while providing an easy means for opening up their research if it’s desired or the time is right. Any project hosted on the OSF is private to collaborators by default, but making the materials open to the public is accomplished with a simple click of a button.
Here, the project page for the Reproducibility Project: Cancer Biology demonstrates the many features of the Open Science Framework (OSF). Managing contributors, uploading files, keeping track of progress and providing context on a wiki, and accessing view and download statistics are all available through the project page.
Features of the OSF facilitate transparency and good scientific practice with minimal burden on the researcher. The OSF logs all actions by contributors and maintains full version control. Every time a new version of a file is uploaded to the OSF, the previous versions are maintained so that a user can always go back to an old revision. The OSF performs logging and maintains version control without the researcher ever having to think about it – no added steps to the workflow, no extra record-keeping to deal with.
The OSF integrates with other services (e.g., GitHub, Dataverse, and Dropbox) so that researchers can continue to use the tools that are practical, helpful, and part of their workflow, while gaining value from the other features the OSF offers. An added benefit is seeing materials from a variety of services next to each other – code on GitHub and files on Dropbox or Amazon S3 appear side by side on the OSF – streamlining research and analysis processes and improving workflows.
Each project, file, and user on the OSF has a persistent URL, making content citable. The project in this screenshot can be found at https://osf.io/tvyxz.
Other features of the OSF incentivize researchers to open up their data and materials. Each project, file, and user is given a globally unique identifier – making all materials citable and ensuring researchers get credit for their work. Once materials are publicly available, the authors can access statistics detailing the number of views and downloads of their materials, as well as geographic information about viewers. Additionally, the OSF applies the idea of “forks,” commonly used in open source software development, to scientific research. A user can create a fork of another project to indicate that the new work builds on, or was inspired by, the forked project. A fork serves as a functional citation; as the network of forks grows, the interconnectedness of a body of research becomes apparent.
Openness and transparency about the scientific process inform the development of best practices for reproducible research. The OSF seeks both to enable that transparency – taking care of “behind the scenes” logging and versioning without added burden on the researcher – and to improve overall efficiency for researchers and their daily workflows. By providing tools for researchers to easily adopt more open practices, the Center for Open Science and the OSF seek to improve openness, transparency, and – ultimately – reproducibility in scientific research.
Winner of the #LavauxContest photo competition at #t4d2014 from @GabrielaTejadaG
Denisa Kera and Sachiko Hirosue pulled together a fabulous session at Tech4Dev 2014 #t4d2014 at the SwissTech Convention Centre in sunny Lausanne. The conference was organised by CODEV and the UNESCO Chair in Sustainable Technologies and focused on ‘What is essential?’ in technology for development. Many answers to this question discussed throughout the three days converged around collaboration with communities, with many sessions highlighting examples of co-design and co-creation across a range of technologies for development including water, energy, healthcare and ICTs for education. This recognition of and commitment to participation and collaboration in research for development relates strongly to work completed by the open science working group and the OpenUCT initiative funded by IDRC (documented here, working paper available online).
The session ‘The Openness Paradigm: How Synergies Between Open Access, Open Data, Open Science, Open Source Hardware, Open Drug Discovery Approaches Support Development?’ covered a range of topics reflecting the breadth of practices that constitute open science but the two key areas of interest were open hardware for science and open data.
First speaker up was Professor Irfan Prijambada from Gadjah Mada University in Yogyakarta, Indonesia, who described the necessity of access to lab equipment for his microbiology research focused on agricultural practices and fermentation. Fermentation is important for alcoholic drinks but also for processing cassava and rice into traditional Indonesian foods such as tapai. Further aspects of research in the Laboratory of Agricultural Microbiology centre around soil and water microbiology, including biodegradation and bioremediation in volcanic soils. As any microbiologist knows, the ability to observe small lifeforms and a sterile environment in which to culture and work with them are the two most essential research requirements in the lab. Prof Prijambada described the resulting difficulties of performing research effectively with obsolete and inadequate research equipment: relying on out-of-date microscopes with no digital image collection, and plating microorganisms on agar in open spaces next to a Bunsen burner with no access to a clean hood or laminar flow hood, both standard pieces of equipment for maintaining a near-sterile environment and ensuring samples are not contaminated. To add to these difficulties, applying for funding for equipment procurement at the university can mean a 12-month wait for processing and delivery even if the application is approved. There was a clear need for a cheap, rapid and local supply of essential kit.
miCAM v3.2 on display at Tech4Dev. Photo by Jenny Molloy, all rights waived under CC0.
Enter Hackteria.org. In 2009, after a workshop in Yogyakarta run by Marc Dusseiller, an active maker and advocate of DIY biology and open source hardware, Prof Prijambada and his lab set about taking a DIY approach to lab hardware by creating their own clean hood and laminar flow hood, initially using a glassfibre filter but now employing a series of HEPA filters. The equipment was constructed in only 2 months for less than 10% of the cost of a commercial equivalent (1.2m IDR vs 15m IDR). Microscopes were constructed from webcams in less than one month, costing 750k IDR instead of 7m IDR, and were entirely adequate for research needs. Not only adequate: acquisition of digital images allowed an automated colony counter to be developed. The importance and utility of these microscopes was explained by their developer Nur Akbar Arofatullah, a researcher at Gadjah Mada University, who founded the Lifepatch initiative and, along with other hardware projects, has improved the DIY microscopes to the stage where a company is now offering a commercial version of the latest MiCAM v3.2 for those who don’t find DIY appealing. However, hands-on construction remains a key part of the educational aims of open hardware, and Lifepatch are using the microscopes and their construction for a range of workshops pitched at different educational levels. Kindergarten students compare the width of their hair in Cyber Hair Wars, elementary students learn about plant and muscle cells, and high school students construct their own microscopes while their teachers are taught how to run workshops themselves. University students are enthused with the DIY spirit and encouraged to apply these principles in their own education and research.
One area where Gadjah Mada University excels is community relations. Setting an example for all publicly funded research establishments, staff and students are expected and obliged to work with the community to achieve promotion within the university, and there is a dedicated Office of Research and Community Development. Within this ethos, DIY microscopes have been used to bridge knowledge between the university and community through workshops on sanitation and hygiene, which use the microscopes and microbiology techniques to analyse water, take hand swabs and analyse data on E. coli contamination. Lifepatch have run the Jogja River Project for several years, taking an integrative approach to water quality and river monitoring, from participatory mapping and data collection on vegetation and animals all the way through to active clean-up operations. Innovation in DIY hardware is rapid at Lifepatch and Gadjah Mada, with other projects including a vortex mixer, a rotator for incubating bacterial samples and a pipette stand. As the example of MiCAM shows, the DIY approach is still compatible with commercialisation, as people can buy pre-built hardware, offering the possibility of generating jobs and income. There are many questions around business models for these activities which were of interest to the audience, but they could easily fill an entire session and were not covered in any depth (see reports here and here for an introduction).
In a related talk during a later session at a beautiful UNESCO World Heritage Site, the Lavaux vineyards on the banks of Lake Geneva, Dara Dotz presented on 3D printing open hardware, touching on the creation of jobs and hyper-local digital manufacturing capacity in Port-au-Prince, Haiti. Dara travelled to Haiti for three weeks and ended up staying for a year working for an NGO. She observed problems in water treatment plants and in hospitals caused by a lack of supplies and particularly spare parts, due to a broken supply chain including long shipping and customs quarantine times, a culture of bribes and poor transport links for distribution. After a friend attended the delivery of 5 babies in one evening and had no option for tying off umbilical cords but using her own gloves, Dara realised that her background interest and contacts in 3D printing could be used to solve some of the issues of obtaining plastic parts and consumables. Having brought a MakerBot 3D printer into Haiti, Dara trained a group of Haitians with basic education to use the printers and 3D design software, and several potential uses were identified, the most critical applications being umbilical cord clips, splitters for oxygen tubing to allow multiple patients to receive oxygen from the same cylinder, and IV bag hooks to reduce the use of large IV stands which blocked space in already overcrowded wards.
There were many considerations and design challenges to be addressed, such as ensuring that designs addressed community needs and were designed with, by and for local people. In addition to empowering people to produce their own solutions to real-time problems, the manufacturing method has the benefit of being on-demand, helps to ensure cleanliness of equipment, provides jobs and is also cheaper than importing commercial equipment. Umbilical clips can be manufactured for $0.36 compared to an imported cost of $2.69, representing a significant saving over time in this resource-poor setting. Dara is now applying the same ideas to disaster zone supplies through the NGO fieldready.org and plans extensions to the Haiti project, including importing CNC machines to allow manufacture of metal parts, creating a repository of designs for field supplies and increasing the use of recycled plastic waste for non-clinical devices and prototyping.
The Open Source Hardware approach advocated throughout this session is supported by the Open Source Hardware Association (OSHWA), a non-profit aiming to raise awareness of OSHW and to spur innovation by hobbyists, commercial and academic users. Gabriella Levine is President of the OSHWA Board and an artist with an interest in snake biomimicry. She introduced two projects designed for sensing water quality and clearing oil waste – Protei and Sneel, a snake biomimetic robot designed iteratively by Gabriella and documented online.
These modular sailing and swimming robots allow sensors for oil, plastic waste, temperature, radioactivity and more to be attached and move through the water autonomously or via remote control, taking readings as they go. These concepts have been used in a range of water quality workshops, and Gabriella runs hackdays exploring ideas around the design and deployment of water quality monitoring sensors and other hardware, including a water hackathon at Tech4Dev the following day!
With a variety of DIY and OSHW approaches and designs being prototyped and promoted in areas as important as sensors and even medical devices, a major question becomes how to ensure that quality is consistent and devices work accurately and safely. The current systems of quality assurance regulations in various countries are often either complex, expensive, time-consuming and a massive barrier to market entry – or non-existent. Kate Ettinger is working to develop a system for collecting information on quality and accuracy of OSHW projects in an open and transparent way using an open source hardware/software data collection system and an open data approach to making information available. This framework could apply to many projects but Kate used the examples of neonatal incubators and prosthetic limbs, with data being collected to accelerate responsive design and ensure ‘integrity by design’ throughout the development and deployment of open source medical devices.
From open data for open hardware to open data as a research tool, Nanjira Sambuli from iHub in Nairobi described the use of crowd sourced data during the Kenyan elections in 2013 and contrasted data collected from Twitter and other social networks via passive crowdsourcing with active sourcing organised by Ushahidi. Conclusions presented were that machine learning algorithms are necessary to make collection of large datasets from high volume social networks viable and that there were surprising patterns and voices gathered through passive listening rather than active calls for information. Nanjira presented a framework developed by iHub for election data crowdsourcing emphasising the three V’s – viability, validity and verification.
Integrity and curation of scientific data was also highlighted in the final talk by Scott Edmunds of GigaScience, during which he described some excellent case studies of the power of openness. One example was increasing the rapidity of disease research during the E. coli outbreak in Europe in 2011, when BGI rapidly sequenced and released the genome as open data. The image of the chromosome map was later chosen as the front cover of a major report from The Royal Society in the UK on openness in science. Another example looked at the great scope for crowdsourcing the collection and analysis of open datasets. Research on ash dieback, an invasive tree disease, demonstrated several flavours of citizen science, from publicly contributed geo-tagged photos of infected trees for OpenAshDB to gamification of genome data analysis via the Facebook game Fraxinus. It is also clear that citizens are very keen to support local research that is important to them, and wish data to be made public to enrich their scientific and cultural heritage. The Puerto Rican “People’s Parrot” genome project took an endangered and much-loved national symbol and sequenced its genome to learn more about its uniqueness and evolutionary history. This effort was funded by fashion shows, art projects, concerts, a branded beer and public donations. Scott focused on these successes but also discussed the challenges in increasing open data release, including ensuring researchers get appropriate credit and are incentivised to make their data available.
The BGI-sequenced chromosome of the German E. coli outbreak strain.
A common theme running through the presentations was that openness can be effective at accelerating innovation and enabling research in resource-poor settings. In addition, the scope for education and democratisation of the scientific process through involvement of local communities in scientific research and technological innovation has variously led to employment, empowerment and increased opportunities. The challenge now is to establish under what contexts this remains true and work to advocate and support open approaches where they can offer benefits for scientists and citizens in the global South. I hope members of this working group and the rest of the global open science community will be able to contribute to this mission!
It’s simple really – you can break it down into its two constituent parts:
In this context, content can be text, numerical data, static images such as photographs, videos, audio, metadata or any digital information, and/or a combination of them all. It is a deliberately vague term, encompassing all types of information. Do not confuse content for the medium by which content is delivered; content is independent of medium in the digital sphere.
In this context, mining refers to large-scale extraction of information from your target content. If you extract information from just one or two items of content – that’s ‘data extraction’. But if you extract information from thousands of separate items of content – that’s ‘mining’.
It is important to emphasise that the phrase ‘text & data mining’ refers only to mining a subset of the types of content one may wish to mine: text & data. Content mining is thus a more useful generic phrase that encompasses all the types of content one may wish to mine.
For my postdoc I’m using content mining to extract phylogenetic information and associated metadata from the academic literature. To be specific, the content I’m mining is text AND images.
The text of an academic paper contains much of the metadata about a phylogenetic analysis I want to extract, whilst the actual result of a phylogenetic analysis is unfortunately mostly only published as an image in papers (completely non-textual) – thus I need to leverage mining techniques for multiple content types.
Most mining projects tend to just mine one type of content, with text mining being perhaps the most common. An example question one might attempt to answer using text mining is: How many academic journal articles acknowledge funding support from the Wellcome Trust?
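As a flavour of how such a text-mining question might be tackled, here is a minimal sketch in Python. The mini-corpus is invented for illustration; a real run would load thousands of full-text articles from disk or a publisher API instead:

```python
import re

# Case-insensitive pattern for funding acknowledgements.
WELLCOME = re.compile(r"wellcome\s+trust", re.IGNORECASE)

def count_acknowledging(texts):
    """Count how many article texts mention the Wellcome Trust."""
    return sum(1 for t in texts if WELLCOME.search(t))

# Hypothetical mini-corpus of article full texts.
corpus = [
    "We thank the Wellcome Trust for generous funding.",
    "This work received no external funding.",
]
print(count_acknowledging(corpus))  # → 1
```

The same pattern-matching approach scales to any corpus size; the hard parts in practice are obtaining the full texts legally and handling the many ways an acknowledgement can be phrased.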
In general, content mining is still an emerging and under-utilized technique in the research sphere – there is much work still to be done and billions of questions still to be answered. It isn’t the one-stop solution to everything, but for appropriate questions it can be the most powerful & comprehensive approach available. Get mining now!
Some suggested tutorials & resources you might want to start with:
Happily, just over 4,600 accounts have participated in the Open Science community via its eponymous hashtag in this span: 10,000 tweets accrued over ten weeks. Our own @openscience on Twitter has tweeted at the hashtag most often, over 600 times, and has also received the most retweets and @-mentions – over 8,000 of the 10,000 tweets.
We have modified the visualisation which came with the data via Martin Hawksey’s satisfying TAGS effort. To Martin’s default views we added looks at the number of mentions and mentions per tweet for top tweeters, and rankings of top tweets over the past ten weeks. We will continue collecting tweets, but do note that in another month or so we will reach Google Docs limits, e.g. on the number of cells. We will then use additional sheets, so links to all the data will have changed – exactly how depends on when you are reading this post. Ask us @openscience on Twitter.
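To give a flavour of the kind of per-tweeter analysis the sheet supports, here is a toy sketch in Python. The account names and tweet texts below are invented; the real archive lives in the TAGS spreadsheet:

```python
from collections import Counter
import re

# Hypothetical rows exported from a TAGS archive:
# (screen_name, tweet_text) pairs.
tweets = [
    ("openscience", "New post on #openscience data sharing"),
    ("alice", "Great thread @openscience #openscience"),
    ("openscience", "Reminder: #openscience chat today"),
]

def top_tweeters(rows):
    """Rank accounts by number of tweets in the archive."""
    return Counter(name for name, _ in rows).most_common()

def mentions_per_tweet(rows, account):
    """@-mentions received by `account`, divided by its own tweet count."""
    mentions = sum(len(re.findall(rf"@{account}\b", text)) for _, text in rows)
    own = sum(1 for name, _ in rows if name == account)
    return mentions / own if own else 0.0
```

The same two functions, run over the full exported sheet, would reproduce the “top tweeters” and “mentions per tweet” views described above.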
More could be done; won’t you help? Leave a reply below or ping us @openscience on Twitter if you need edit access to the sheet itself but we would like to see data and analyses in other tools as well. Our work to this point is only to get something started.
Not all tweets which are about Open Science include the #openscience hashtag. In a perfectly semantic world they would, and when they can, they really should. It has helped to form a community among the 4,600+ accounts participating in these ten weeks, and many others in recent years. A couple of reasons the hashtag might not be used in a relevant tweet are the character limit on tweets and lack of awareness of hashtags or of the term Open Science.
We take our organising and leadership role seriously at @openscience on Twitter, an account shared by many in the community. We have a simple policy that all our tweets should be related to Open Science. Even at our account, not all our tweets include the #openscience hashtag, particularly as we discuss related concerns such as Citizen Science or Open Access. An example tweet from the time frame considered here, related to Open Science but not hashtagged as such is below. In this case, the limit on tweet length and the topic led to including #openaccess, not #openscience:
This is a guest post by Anthony Beck, Honorary fellow, and Dave Harrison, Research fellow, at the University of Leeds School of

In 2010 we authored a series of blog posts for the Open Knowledge Foundation subtitled ‘How open approaches can empower archaeologists’. These discussed the DART project, which is now on the cusp of concluding. The DART project collected large amounts of data, and as part of the project we created a repository to catalogue this data and make it available, using CKAN, the Open Knowledge Foundation’s open-source data catalogue and repository. Here we revisit the need for Open Science in the light of the DART project. In a subsequent post we’ll look at why, with so many repositories of different kinds, we felt that to do Open Science successfully we needed to roll our own.
Open data can change science
Open inquiry is at the heart of the scientific enterprise. Publication of scientific theories – and of the experimental and observational data on which they are based – permits others to identify errors, to support, reject or refine theories and to reuse data for further understanding and knowledge. Science’s powerful capacity for self-correction comes from this openness to scrutiny and challenge. (The Royal Society, Science as an open enterprise, 2012)
The Royal Society’s report Science as an open enterprise identifies how 21st century communication technologies are changing the ways in which scientists conduct, and society engages with, science. The report recognises that ‘open’ enquiry is pivotal for the success of science, both in research and in society. This goes beyond open access to publications (Open Access), to include access to data and other research outputs (Open Data), and the process by which data is turned into knowledge (Open Methodology).
The underlying rationale of Open Data is this: unfettered access to large amounts of ‘raw’ data enables patterns of re-use and knowledge creation that were previously impossible. The creation of a rich, openly accessible corpus of data introduces a range of data-mining and visualisation challenges, which require multi-disciplinary collaboration across domains (within and outside academia) if their potential is to be realised. An important step towards this is creating frameworks which allow data to be effectively accessed and re-used. The prize for succeeding is improved knowledge-led policy and practice that transforms communities, practitioners, science and society.
The need for such frameworks will be most acute in disciplines with large amounts of data, a range of approaches to analysing the data, and broad cross-disciplinary links – so it was inevitable that they would prove important for our project, Detection of Archaeological residues using Remote sensing Techniques (DART).
DART: data-driven archaeology
DART aimed to develop analytical methods to differentiate archaeological sediments from non-archaeological strata, on the basis of remotely detected phenomena (e.g. resistivity, apparent dielectric permittivity, crop growth, thermal properties etc.). The data collected by DART is of relevance to a broad range of different communities. Open Science was adopted with two aims:

to maximise the research impact by placing the project data and the processing algorithms into the public sphere;

to build a community of researchers and other end-users around the data so that collaboration, and by extension research value, can be maximised.
‘Contrast dynamics’, the type of data provided by DART, is critical for policy makers and curatorial managers to assess both the state and the rate of change in heritage landscapes – a need wrapped up in national commitments to the European Landscape Convention (ELC). Making the best use of the data, however, depends on openly accessible dynamic monitoring, along similar lines to that proposed by the European Space Agency for the Global Monitoring for Environment and Security (GMES) satellite constellations. What is required is an accessible framework which allows all this data to be integrated, processed and modelled in a timely manner. The approaches developed in DART to improve the understanding and enhance the modelling of heritage contrast detection dynamics feed directly into this long-term agenda.
Cross-disciplinary research and Open Science
Such approaches cannot be undertaken within a single domain of expertise. This vision can only be built by openly collaborating with other scientists and building on shared data, tools and techniques. Important developments will come from the GMES community, particularly from precision agriculture, soil science, and well documented data processing frameworks and services. At the same time, the information collected by projects like DART can be re-used easily by others. For example, DART data has been exploited by the Royal Agricultural University (RAU) for use in such applications as carbon sequestration in hedges, soil management, soil compaction and community mapping. Such openness also promotes collaboration: DART partners have been involved in a number of international grant proposals and have developed a longer-term partnership with the RAU.
Open Science advocates opening access to data, and other scientific objects, at a much earlier stage in the research life-cycle than traditional approaches. Open Scientists argue that research synergy and serendipity occur through openly collaborating with other researchers (more eyes/minds looking at the problem). Of great importance is the fact that the scientific process itself is transparent and can be peer reviewed: as a result of exposing data and the processes by which these data are transformed into information, other researchers can replicate and validate the techniques. As a consequence, we believe that collaboration is enhanced and the boundaries between public, professional and amateur are blurred.
Challenges ahead for Open Science
Whilst DART has not achieved all its aims, it has made significant progress and has identified some barriers to achieving such open approaches. Key among these is the articulation of issues surrounding data access (accreditation), licensing and ethics. Who gets access to data, when, and under what conditions is a serious ethical issue for the heritage sector. These are issues that need co-ordination through organisations like Research Councils UK, with cross-cutting input from domain groups. The Arts and Humanities community produce data and outputs with pervasive social and ethical impact, and it is clearly important that they have a voice in these discussions.
Next we tried out web-based tools such as Tabula for extracting tables from PDF (we were fortunate enough to have Manuel Aristarán of Tabula joining us remotely via Skype) and ChemicalTagger for tagging and parsing experimental sections in chemistry articles.
We then got down to business with some hands-on extraction of species from HTML papers and mentions of books on Twitter using regular expressions. All code is open source so you are welcome and encouraged to play, fork and reuse!
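As an illustration of the kind of regular-expression extraction we practised, here is a minimal sketch of my own. The pattern is a rough heuristic for binomial species names (capitalised genus plus lowercase epithet), not the workshop’s actual code, and it will produce false positives on ordinary capitalised two-word phrases:

```python
import re

# Rough binomial-name heuristic: capitalised genus + lowercase epithet.
SPECIES = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

def extract_species(text):
    """Return sorted, de-duplicated candidate species names in `text`."""
    return sorted(set(SPECIES.findall(text)))

print(extract_species(
    "Panthera leo diverged earlier than Panthera pardus; "
    "Aspergillus niger was also studied."
))  # → ['Aspergillus niger', 'Panthera leo', 'Panthera pardus']
```

In practice you would filter candidates against a taxonomic dictionary (as the ContentMine tooling does) to weed out the false positives.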
Peter’s tutorial and code to extract species from papers can be found on bitbucket and the relevant software and command line tools have helpfully been bundled into a downloadable package. Iain has also documented his flask application for Twitter mining on github so have a go!
If this has whetted your appetite for finding out more about content mining for your research and you’d like to ask for input or help, or simply follow the ongoing discussion, then join our
Altmetrics are a hot topic in the scientific community right now. Classic citation-based indicators such as the impact factor are being complemented by alternative metrics generated from online platforms. Usage statistics (downloads, readership) are often employed, but links, likes and shares on the web and in social media are considered as well. The altmetrics promise, as laid out in the excellent manifesto, is that they assess impact more quickly and on a broader scale.
This is all good and well, but in my opinion, altmetrics have a huge potential for discovery that goes beyond rankings of top papers and researchers. A potential that is largely untapped so far.
How so? To answer this question, it is helpful to shed a little light on the history of citation indices.
Pathways through science
In 1955, Eugene Garfield created the Science Citation Index (SCI), which later went on to become the Web of Knowledge. His initial idea – next to measuring impact – was to record citations in a large index to create pathways through science. Thus one can link papers that are not linked by shared keywords. This makes a lot of sense: you can talk about the same thing using totally different terminology, especially when you are not in the same field. Furthermore, terminology has proven to be very fluid even within the same domain (Leydesdorff 1997). In 1973, Small and Marshakova realized – independently from each other – that co-citation is a measure of subject similarity and can therefore be used to map a scientific field.
Due to the fact that citations are considerably delayed, however, co-citation maps are often a look into the past and not a timely overview of a scientific field.
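The co-citation idea is simple enough to sketch in a few lines: two papers are similar when many later papers cite both of them. A toy illustration with invented paper IDs, not Small’s or Marshakova’s actual method:

```python
from collections import defaultdict
from itertools import combinations

# Toy citation data: each citing paper -> the references it cites.
citations = {
    "p1": ["A", "B"],
    "p2": ["A", "B", "C"],
    "p3": ["B", "C"],
}

def cocitation_counts(citing):
    """Count how often each pair of references is cited together."""
    pairs = defaultdict(int)
    for refs in citing.values():
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return dict(pairs)
```

Here papers A and B are co-cited twice, so they would sit close together on a co-citation map, even if they share no keywords.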
Altmetrics for discovery
In come altmetrics. Similarly to citations, they can create pathways through science. After all, a citation is nothing else but a link to another paper. With altmetrics, it is not so much which papers are often referenced together, but rather which papers are often accessed, read, or linked together. The main advantage of altmetrics, as with impact measurement, is that they are available much earlier.
Bollen et al. (2009): Clickstream Data Yields High-Resolution Maps of Science. PLOS One. DOI: 10.1371/journal.pone.0004803.
One of the efforts in this direction is the work of Bollen et al. (2009) on click-streams. Using the sequences of clicks to different journals, they create a map of science (see above).
In my PhD, I looked at the potential of readership statistics for knowledge domain visualizations. It turns out that co-readership is a good indicator for subject similarity. This allowed me to visualize the field of educational technology based on Mendeley readership data (see below). You can find the web visualization called Head Start here and the code here (username: anonymous, leave password blank).
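Co-readership can be sketched in the same spirit as co-citation. A toy example using Jaccard similarity over invented reader IDs – the actual Head Start analysis was based on Mendeley readership data, not this exact formula:

```python
# Toy readership data: paper -> set of (hypothetical) reader IDs.
readers = {
    "paper1": {"r1", "r2", "r3"},
    "paper2": {"r2", "r3"},
    "paper3": {"r4"},
}

def coread_similarity(a, b, data):
    """Jaccard similarity of two papers' reader sets."""
    ra, rb = data[a], data[b]
    union = ra | rb
    return len(ra & rb) / len(union) if union else 0.0
```

Papers with many shared readers score close to 1 and end up in the same region of the knowledge-domain visualization; papers with disjoint readerships score 0.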
Why we need open and transparent altmetrics
The evaluation of Head Start showed that the overview is indeed more timely than maps based on citations. It, however, also provided further evidence that altmetrics are prone to sample biases. In the visualization of educational technology, the computer science driven areas such as adaptive hypermedia are largely missing. Bollen and Van de Sompel (2008) reported the same problem when they compared rankings based on usage data to rankings based on the impact factor.
It is therefore important that altmetrics are transparent and reproducible, and that the underlying data is openly available. This is the only way to ensure that all possible biases can be understood.
As part of my Panton Fellowship, I will try to find datasets that satisfy these criteria. There are several examples of open bibliometric data, such as the Mendeley API and the figshare API, which have adopted CC BY, but most usage data is not publicly available or cannot be redistributed. In my fellowship, I want to evaluate the goodness of fit of different open altmetrics data. Furthermore, I plan to create more knowledge domain visualizations such as the one above.
So if you know any good datasets please leave a comment below. Of course any other comments on the idea are much appreciated as well.
The goal of the Open Scholar Foundation is to improve the efficiency of scholarly communication by providing incentives for researchers to openly share their digital research artifacts, including manuscripts, data, protocols, source code, and lab notes.
The proposal of an “Open Scholar Foundation” was one of the winners of the 1K challenge of the Beyond the PDF conference. This was the task of the challenge:
What would you do with 1K that would significantly advance scholarly communication that does not involve building a new software tool?
The idea was to establish a committee that would certify researchers as “Open Scholars” according to given criteria. This was the original proposal:
I would set up a simple "Open Scholar Foundation" with a website, where researchers can submit proofs that they are "open scholars" by showing that they make their papers, data, metadata, protocols, source code, lab notes, etc. openly available. These requests are briefly reviewed, and if approved, the applicant officially becomes an "Open Scholar" and is entitled to show a banner "Certified Open Scholar 2013" on his/her website, presentation slides, etc. Additionally, there could be annual competitions to elect the "Open Scholar of the Year".
An alternative approach (perhaps more practical and promising) would be to provide a scorecard for researchers to calculate their “Open Scholar Score” on their own. There is an incomplete draft of such a scorecard in the github repo here.
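Such a scorecard could be as simple as a weighted checklist. A hypothetical sketch – the criteria names and weights here are invented for illustration and are not the draft in the GitHub repo:

```python
# Hypothetical scorecard: openness practice -> weight.
CRITERIA = {
    "open_access_papers": 3,
    "open_data": 3,
    "open_source_code": 2,
    "open_lab_notebook": 2,
}

def open_scholar_score(practices):
    """Sum the weights of the practices a researcher can demonstrate."""
    return sum(CRITERIA[p] for p in practices if p in CRITERIA)

print(open_scholar_score(["open_access_papers", "open_data"]))  # → 6
```

A self-assessed score like this trades the rigour of a review committee for zero administrative overhead, which is why it may be the more practical of the two approaches.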
In any case, this project should lead to an established and recognized foundation that motivates scholars to openly share their data and results. Being a certified Open Scholar should be something that increases one’s reputation and visibility, and should act as a counterweight to the possible benefits of keeping data and results secret. The criteria for Open Scholars should become stricter over time, as the number of “open-minded” scholars hopefully increases over the years. This should go on until, eventually, scholarly communication has fundamentally changed and no longer requires this special incentive.