You are browsing the archive for Maria Neicu.

Scientific Workflow Use – Towards Open Methods, by Richard Littauer

- September 27, 2011 in Guest Post, Tools

This summer I did an internship with DataONE, or the Data Observation Network for Earth, which is a US NSF-funded initiative aimed at “ensuring the preservation and access to multi-scale, multi-discipline, and multi-national science data.”
It is a virtual organization dedicated to providing open, persistent, robust, and secure access to biodiversity and environmental data. I was one of eight interns working on various projects – my project involved studying scientific workflows. Specifically, I was tasked with understanding how they could be categorised and how they are being used by the scientific community. I recorded my internship work on my Open Notebook, and presented this work in progress (slides) at the Open Knowledge Foundation Conference in Berlin.

What are Workflows?

I had a Linguistics background before undertaking this project, and understandably I spent a long while going through the relevant literature trying to understand what exactly scientific workflows are. For those of you who don’t know (like I didn’t beforehand), scientific workflows are not necessarily the same as business workflows or personal pipelines, although they are similar. “Scientific workflows are widely recognised as a ‘useful paradigm to describe, manage, and share complex scientific analyses’,” says the Taverna website.

As you can already tell by the terminology, scientific workflows have become a separate entity, tied up in computational science. They range from simple shell scripts to heavy grid programs that enable distributed, parallel processing for large data sets. On the whole, the most useful workflows, the ones used repeatedly by different labs and scientists, are built within other programs designed for that purpose, which also means that they are portable, easily archived and accessed, and somewhat transparent as to their immediate function.

An example of a workflow workbench would be Taverna; others include Kepler, VisTrails, Pegasus, Knime, RapidMiner, and so on. These programs are designed to make building workflows easier: instead of copying and pasting, or infinitely fiddling with, long scripts of raw code, you add together smaller components. Here is an example workflow that gets a comic off of the popular xkcd site:

The workflows developed in these programs can be shared, and since 2007 a single repository has garnered the most impact. This is partly because its organisers also worked with, or were, the Kepler and Taverna developers; partly because of APIs in those programs that access the site; and partly because there is no comparable single repository elsewhere (there are others, such as Yahoo Pipes, but that one is not science-based). I am talking about myExperiment, which now boasts over 2000 uploaded workflows. These workflows have been uploaded entirely by the community, with no direct benefit to the uploader except to showcase work that might be helpful. As such, myExperiment is a wonderful textbook example of open science at its best.

What did I do?

In my internship, I did several things – I mined Google Scholar, other websites concerning workflows, and various lists for pretty much all of the information I could get on workflows. All of these papers are now available in the Mendeley group I set up. The majority of these papers are by computational scientists – generally in bioinformatics, but also software developers – and are about workflows themselves; only a small subset are papers that have used workflows in their research and cite where they have uploaded them. As this is the case, and as the workflow systems I was working with had a very steep learning curve (somewhat unfortunately), it became apparent early on that a new approach was needed besides literature research.

So, I was tasked with getting all of the information I could off of myExperiment. They have an RDF backend that I found inscrutable, and that didn’t display the information very well once it was downloaded. To bypass this, I ended up screen-scraping all of the information I could from the workflow pages themselves on myExperiment. This resulted in a fairly large amount of data on how scientists have been using myExperiment – which workflows were downloaded more, what the components were for those workflows, how they were tagged, what they did, and so on. This was the result of my internship, although the research is ongoing.
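
For readers who have not met screen-scraping before, here is a minimal Python sketch of the general idea. It is not the code used during the internship. The HTML snippet, the class names (`title`, `downloads`, `tag`) and the `scrape` helper are all hypothetical stand-ins; real myExperiment pages are laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; real workflow pages differ.
SAMPLE_PAGE = """
<html><body>
  <h1 class="title">Fetch the latest xkcd comic</h1>
  <span class="downloads">1234</span>
  <a class="tag">comic</a>
  <a class="tag">example</a>
</body></html>
"""


class WorkflowPageParser(HTMLParser):
    """Collects the title, download count and tags from one page."""

    def __init__(self):
        super().__init__()
        self.record = {"title": None, "downloads": None, "tags": []}
        self._field = None  # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("title", "downloads", "tag"):
            self._field = css_class

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return  # whitespace, or text we are not interested in
        if self._field == "tag":
            self.record["tags"].append(text)
        elif self._field == "downloads":
            self.record["downloads"] = int(text)
        else:
            self.record["title"] = text
        self._field = None


def scrape(html):
    """Parse one page of HTML and return the extracted record."""
    parser = WorkflowPageParser()
    parser.feed(html)
    return parser.record


record = scrape(SAMPLE_PAGE)
print(record)
```

Running one such parser over every workflow page and collecting the records into a table gives the kind of dataset described above: titles, download counts and tags that can then be analysed.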

What did you discover?

Well, there were a lot of results. I ran hundreds of lines of R code (mostly because I was new to it and kept repeating myself.) I found that there are actually only a few people uploading tons of workflows onto myExperiment, which means that the main users of workflows may well be the developers. Most workflows aren’t very complex, at the end of the day, either – they do simple tasks. This has changed over time, but it’s hard to say whether that’s because of a change in the efficiency of the workbenches, or whether it’s because people are using simpler ones and tying them together (one of the important features of workflows is the ability to embed them inside each other). Most workflows are so-called ‘shims’, which means that they, effectively, change data from one format to another.
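
To make the idea of a shim concrete, here is a minimal Python sketch of one: a small step whose only job is to change data from one format (CSV) into another (JSON) so that the next component can consume it. The function name `csv_to_json` and the sample data are assumptions made up for illustration; they do not come from any workflow on myExperiment.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Shim: turn CSV text into a JSON array with one object per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


# Hypothetical sample data; every value stays a string, as in the CSV.
sample = "species,length_mm\nA,12\nB,15\n"
print(csv_to_json(sample))
```

A shim like this does no analysis of its own, which is why such workflows tend to be small and simple.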

There were other results, of course, but I don’t want to give too much away as we’re trying to publish the results that we do have. Importantly, our resulting paper focuses on what we suggest for workflows based on our findings. Mainly, this comes down to a few core needs:

  • community awareness – scientists dealing with large amounts of data or processing times need to be made aware that there are programs out there that can speed up their science, help them with feedback on their results, and make their efforts reproducible, which is one of the most important features of data-intensive science.
  • standards – not just for workflow tags and names, but for workflow archiving and repository practices, for workflow usage, and most importantly for workflow referencing and citing in journals. If a paper states which workflow it used and where to find it, that will go a long way towards helping other researchers reproduce and understand the work.
  • education – there is a steep learning curve towards using workflows at the moment, and hopefully this could be overcome by teaching students earlier about how workflows help with one’s own research, as well as the wider community. What helps me do my work will help another successfully verify my work, which leads to better science, on the whole. It may also help others later do better work that will build off of mine. This is all good.

What’s next?

Well, I’m hoping to screen-scrape the groups, packs, tags, and user information off of myExperiment. I’m hoping to mine the research I have more and spot articles that reference workflows they use. I’m presenting my research in a couple of weeks at the DataONE All Hands Meeting in Albuquerque, NM, and I hope to get some feedback there to help with future studies. I am also hoping to run more R code on the information I have, to figure out whether there are statistically significant clusters, that is, different sorts of workflows that are downloaded more often than others – say, bioinformatics workflows as opposed to astronomical ones (which do exist). I hope that all of this will eventually end up being used to make the future of workflows brighter.

I am also hoping to set up a repository for workflows in Linguistics and social sciences, my own disciplines, as there is nothing of the sort at the moment, and very few people working on them. I am, to that end, currently trying to find a server to buy, and trying to set up an open access journal that would demand that the authors publish their code and datasets and not just their results. Hopefully, this will be a new and significant step in experimental social sciences towards better work. Also hopefully, some people will read this post and want to help out – if so, please let me know. I’m glad to say that this work has been a major learning experience, and that shows no sign of stopping.

Richard Littauer

Reports on Open Science @OKCon 2011

- August 19, 2011 in OKCon, Uncategorized

Two Reports on the Open Science panels:

1. Open Quake – Welcoming OpenQuake and OpenGEM as new members of the Open Science group.

The Open Quake project summarises discussions at the OKCon Open Science Panel and the issues they face in open data: volunteer computing, licensing and user interfaces. Volunteer-based projects in scientific research can be improved by using a platform like BOINC, which provides an open-source computing solution. As for interfaces, the actual challenge comes down to making open data usable.

2. Citizen Cyberscience

Francois Grey – on the distinction between open science and citizen cyberscience. Playing with the liminal space between professional and amateur science, openness should primarily enhance the possibilities for praxis. As Grey states, “I’m not interested in openness as an end in itself, but rather as a means to an end”: in this case, widening the circle of experts. Empowering the “have-nots” is not enough, as access must come along with real opportunities for participation. “In short, Open Science is about making sure there are no locks on the doors to science. Citizen cyberscience is about making sure as many people as possible walk through those doors.”

Original post can be found here.

Other Reports from OKCon

Interested in hearing more from the Open Knowledge Conference 2011? Below you will find reviews, comments on speakers and presentations, and ideas to be developed further.

· On his blog, Nikolay Georgiev (Open Source Ecology) brings together the presentations related to Open Hardware, from principles of freedom to FabLabs and RepRap machines. Follow his slideshare, and consider his argument for having different levels of openness for Hardware.

· Part of the LOD2 team (who presented the Open Government Data Stakeholder Survey in Berlin this year), Martin Kaltenboeck writes more on Andreas Blumauer’s presentation on open data for enterprises. Here.

· DataMinerUk presents Nicola Hughes’s stand for open data in journalism. Find out why infographics and other interactive tools are only a superficial effort towards data journalism by reading the points she extracted from the speakers Simon Rogers, Stefan Candea, Caelainn Barr, Liliana Bounegru and Mirko Lorenz.

· Here, James Harriman-Smith maps OKCon2011 around Open Shakespeare’s annotation system, asking whether subjective opinion can be processed as (open) data as well, in the ecosystem of openness.

· In this post from http://www.lanetscouade.com, Samuel Goëta summarizes the top 5 speakers, starting with Richard Stallman’s intriguing talk on fundamental liberties vs. Open Source. (Article in French).

· For DataONE research, Richard Littauer relates his experience of OKCon2011, taking the pulse of legal matters. Find out why we will soon need a database of Open Knowledge-relevant lawsuits.

· An extensive, critical blogpost from Michael Gurstein asks: who is the end user for whom we fight to open up data?

· Stefan Merten offers full details on some compelling presentations, and forecasts a soon-to-come big boom for open hardware.

· Rolf from Open for Change links OKCon presentations on governmental data with the beta version of the Open for Change Manifesto, as a way to better create autonomy, control and empowerment.

· For details on backstage meetings at OKCon2011, and on how OKF designs its organizational DNA, find out more in Peter Murray-Rust’s article here (also blogged by Glyn Moody).



Ross Mounce on “Open Palaeontology” @OKCon Berlin 2011

- August 11, 2011 in Collaborations, OKCon, Panton Principles

The following is a guest blogpost by Ross Mounce, currently a PhD student writing on “The Importance of Fossils in Phylogeny” at the University of Bath, in the UK. As his approach includes the application of informatics techniques to palaeontological data, Ross’s research interests are also oriented towards Openness in Data, Access and Science. Ross attended the Open Knowledge Conference in Berlin, 2011, where he gave a talk on Open Palaeontology.

Ross Mounce:

“A few weeks ago, I gave a talk at the Open Knowledge Conference 2011 on ‘Open Palaeontology’ – based upon 18 months’ experience as a lowly PhD student trying, and mostly failing, to get usable digital data from palaeontological research papers. As you might well have inferred already from that last sentence, it’s been an interesting

The main point of my talk was the sheer stupidity/naivety of the way in which data is supplied (or in some cases, not at all!) with or within research papers. Effective science operates through the accumulation of knowledge and data; all advances are incremental and build upon the work of others – the Panton Principles probably sum it up far better than I could. Any barriers to the accumulation of knowledge/data therefore impede the progress of science.

Whilst there are numerous barriers to academic research (access to research papers being perhaps the most well-known and well-publicised), the issue that most aggravates me, is not the access to these papers, but the actual papers themselves – especially in the digital context of the 21st century. They are only barely adequate (at best) for communicating research data and this is a major problem for the future legacy of our published work… and my research project.

My PhD thesis title is quite broad: ‘The Importance of Fossils in Phylogeny’. Given this title and (wide) scope, I need to look at a lot of papers in a lot of different journals, and extract data from these articles to re-analyse, to assess the importance of fossils in phylogeny, and to place them on a meta-scale. There are long-established data formats for the particular type of data I wish to extract, so well established and easy to understand that there’s even a Wikipedia page here describing the most commonly used data format (nexus). There exist multiple databases set aside specifically to host this type of data, e.g. TreeBASE and MorphoBank. Yet despite all this standardisation and provisioning for paleomorphological phylogenetic data – far less than 1% of all such data published is actually readily available in a standardised, digital, usable format.
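
As an illustration of how little is needed to publish a matrix in the standardised format, here is a minimal Python sketch (not taken from the talk) that writes a small character matrix as a NEXUS `DATA` block. The helper name `to_nexus`, the taxon names and the character states are hypothetical; a real file would normally carry more metadata, such as character labels.

```python
def to_nexus(matrix, missing="?", gap="-"):
    """Render {taxon: character-state string} as a minimal NEXUS DATA block."""
    lengths = {len(states) for states in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all taxa must have the same number of characters")
    nchar = lengths.pop()
    width = max(len(name) for name in matrix)  # pad names so states line up
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        f"  FORMAT DATATYPE=STANDARD MISSING={missing} GAP={gap};",
        "  MATRIX",
    ]
    for name, states in matrix.items():
        lines.append(f"    {name.ljust(width)}  {states}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Hypothetical three-taxon, five-character morphological matrix.
example = {
    "Taxon_A": "01?10",
    "Taxon_B": "0011-",
    "Taxon_C": "11010",
}
print(to_nexus(example))
```

A plain-text file like this can be read directly by phylogenetic software, whereas the same matrix printed as a table in a PDF has to be retyped and validated by hand.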

In most cases the data is there; you just have to dig very very hard to release it from the pdf file it’s usually buried in (and then spend unnecessary and copious amounts of time, manually reformatting and validating it). See the picture below for a typical example (and yes, it is sadly printed sideways, this is a common and silly practice that publishers use to inappropriately squeeze data matrices into papers):

I hope you’ll agree with me that this is clearly absurd and hugely inefficient. As I explain in my presentation (also available below this post) the data, as originally analysed/used, comes in a much richer, more usable, digital, standardised format. Yet when published it gets stripped of all useful metadata and converted into a flat, inextricable and significantly obfuscated table. Why? It’s my belief that this practice is a lazy unwanted vestigial hangover from the days of paper-based (only) publishing, in which this might have been the only way in which to convey the data with the paper. But in 2011, I can confidently say that the vast majority of researchers read and use the digital versions of research papers – so why not make full and proper use of the digital format to aid scientific communication?
I am not arguing that we should axe paper copies, but that digital versions should be more than just plain pdf versions of the paper copy, as they can and should be.

With this goal in mind, I set about writing an Open Letter to the rest of my research community to explain why we need to richly-digitise our published research data ASAP. Naturally, I wouldn’t get very far just by myself, so I enlisted the support of a variety of academic friends via Facebook, and (inspired by OKFN pads I’d seen) we concocted a draft letter together using an Etherpad. The result of this was a fairly basic Drupal-based website that we launched and disseminated via mailing lists, Twitter, as far and wide as we possibly could, *hoping*, just hoping, that our fellow academics would read, take note and support our cause.

Surprisingly, it worked to an extent, and a lot of big names in Palaeontology signed our Open Letter in support of our cause; then things got even better when a Nature journalist (Ewen Callaway) got interested in our campaign and wrote an article for Nature News about it, which can be found here.
A huge thanks must go to everyone who helped out with the campaign; it has generated truly international support, as can be seen on the map below:

(View Open Letter Signatures in a larger map)

It’s far too soon to know the true impact of the campaign. Journal editorial boards can be very slow to change their editorial policies, especially if it requires a modicum of extra effort on the part of the publisher. Additionally, once the editorial policy does change at a journal, it can only apply to articles submitted from then on, and thus articles already in the submission pipeline don’t get affected by any new guidelines. It’s not uncommon for delays of a year between submission and publishing in palaeontology, so for this and other reasons, I’m not expecting to see visible change until 2012, but I think we might have helped get the ball rolling, if nothing else…
The Paleontological Society journals (Paleobiology and Journal of Paleontology) have recently adopted mandatory data submission to the Dryad repository, and the Journal of Vertebrate Paleontology has also improved its editorial policy with respect to certain types of data, but these are just a few of many journals that publish palaeontological articles. I’m very much hoping that other journals will follow suit in the next few months and years by taking steps to improve the way in which research data is communicated, for the good of everyone: authors, publishers, funders and readers.

Below you can find the Prezi I used to convey some of that (and more) at OKCon 2011. Huge thanks to the conference organisers for inviting me to give this talk. It was the most professionally run conference I’ve ever been to, by far. If the conference is on next year – I’ll be there for sure!”
Ross Mounce

The invited talk, given on Friday 1st July 2011 at the Open Knowledge Conference (Berlin) by Ross Mounce: Open Palaeontology on Prezi

BioMed Central Research Award for Open Data

- June 22, 2011 in Uncategorized

In May 2011, the winners of the 5th edition of BioMed Central Research Awards were announced in London. Part of the Microsoft Research initiative, the awards are offered for the following categories of scientific publications: Best Case Report of the Year, Editor of the Year, Open Access Institution of the Year, the Open Data Award, as well as the Biology and Medicine Awards for research.


BioMed Central’s Open Data Award encourages data sharing and re-use, challenging the traditional ways of doing research and publishing within the scientific community. As stated on the official website, opening scientific data inevitably brings along debates on cultural acceptance and community access. However, these outstanding researchers have succeeded in adding value by adopting the Open Data philosophy, revolutionizing the way science is done.

The 2011 winners of the Open Data Award are Veli Vikberg, David R. Smith, and Jean-Luc Boevé, for the article “How common is ecological speciation in plant-feeding insects? A ‘Higher’ Nematinae perspective”, in the field of ecological phylogenetics. By archiving the original set of data used for their research in an online appendix, the winning team from the University of Eastern Finland became front-runners of Open Data. Starting with improving the transparency of datasets, they took a first step towards creating an open ‘meta-analysis’ – with the potential for scientists of various expertise to access and compare different archived material.

More from Veli Vikberg, David R. Smith, and Jean-Luc Boevé on the relevance of Open Data for the researcher as an individual and for the scientific community as a whole can be found under the title “On the unbearable lightness of mandatory data sharing”.

The panel of judges included Alex Wade (Microsoft Research) and OKF members of the Working Group on Open Data in Science, namely Rufus Pollock (OKF), Peter Murray-Rust (Murray-Rust Group, Cambridge), John Wilbanks (Creative Commons), Cameron Neylon (Open Access advocate), and Iain Hrynaszkiewicz (Journal Publisher, BioMed Central).

Sources: Official Biomed Central and the BioMed Open Access Central blog