On the Harvard Dataverse Network Project (and why it’s awesome)
I am a huge fan of grass-roots approaches to scholarly openness. Successful community-led initiatives tend to speak directly to that community’s needs and can grow by attracting interest from members on the fringes (just look at the success of the arXiv, for example). But these kinds of projects tend to be smaller in scale and can be difficult to sustain, especially without any institutional backing or technical support.
This is why the Harvard Dataverse Network is so great: it facilitates research data sharing through a sustainable, scalable, open-source platform maintained by the Institute for Quantitative Social Science at Harvard. This means it is sustainable through institutional backing, but it also empowers individual communities to manage their own research data.
In essence, a Dataverse is simply a data repository, but one that is both free to use and fully customisable according to a community’s needs. In the project’s own words:
‘A Dataverse is a container for research data studies, customized and managed by its owner. A study is a container for a research data set. It includes cataloging information, data files and complementary files.’
(http://thedata.harvard.edu/dvn/)
There are a number of ways in which the Dataverse Network can be used to enable Open Data.
Journals
A Dataverse can be a great way of incentivising data deposition among journal authors, especially when coupled with journal policies mandating Open Data for all published articles. Here, a journal’s editor or editorial team would maintain the Dataverse itself, including its look and feel, which would instil confidence in authors that the data is in trusted hands. In fact, for journals housed on Open Journal Systems, a plugin will soon be launched that directly links the article submission form with the journal’s Dataverse. From an author’s perspective, then, depositing data will be as seamless as submitting a supporting information file. This presentation [pdf] goes into the plugin in more detail (and provides more info on the Dataverse project itself).
(Sub-)Disciplines
There are some disciplines that simply do not have their own subject-specific repository, and a Dataverse would be a great way of formalising and incentivising Open Data here. In many communities, datasets are uploaded to general repositories (Figshare, for example) that may not be tailored to their needs. Although this isn’t a problem – it’s great that general repositories exist – a discipline-maintained repository would automatically carry enough reputational weight to encourage others to use it. What’s more, different communities have different preservation and metadata needs that general repositories might not be able to meet, and so a Dataverse could be tailored exactly to that community’s needs.
Individuals
Interestingly, individuals can have their own Dataverses for housing all their shared research data. This could be a great way of allowing researchers to showcase their openly available datasets (and perhaps research articles too) in self-contained collections. The Dataverse could be linked to directly from a CV or institutional homepage, offering a kind of advertisement for how open a scholar one is. Furthermore, users can search across all Dataverses for specific keywords, subject areas, and so on, so there is no danger of being siloed off from the broader community.
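As a rough illustration of that cross-Dataverse search, here is a minimal sketch of how one might query a Dataverse installation programmatically. It assumes a modern Dataverse server exposing the public Search API at /api/search; the host, query, and field names below are illustrative assumptions rather than details taken from the project documentation quoted above.

# Minimal sketch (illustrative only): keyword search across a Dataverse
# installation, assuming a server that exposes the public Search API.
import requests

BASE_URL = "https://demo.dataverse.org"  # assumed demo host, purely illustrative

def search_datasets(query, per_page=10):
    """Return datasets matching a keyword query via the Search API."""
    response = requests.get(
        f"{BASE_URL}/api/search",
        params={"q": query, "type": "dataset", "per_page": per_page},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"]["items"]

if __name__ == "__main__":
    for item in search_datasets("open data"):
        # Each result item carries a name, a type, and a landing-page URL.
        print(f"{item.get('name')} - {item.get('url')}")

If I understand the API correctly, the same endpoint also accepts parameters to limit a search to a single Dataverse or to files, which is what makes the kind of collection-level browsing described above possible.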
So the Dataverse Network is a fantastic project for placing the future of Open Data in the hands of researchers, and it would be great to see it adopted by scholarly communities throughout the world.
Researchers who submit to and repurpose content from data repositories like Dataverse and Dryad, as opposed to ICPSR and other data archives, need to remember that there is no curatorial review in terms of checking documentation, code, and data for things like disclosure avoidance, missing or inadequate data, proprietary file formats, or documentation required for transparency compliance. That means the responsibility for preparing data for sharing falls to researchers: the data, documentation, and code must be properly vetted prior to submission. This is not a small commitment. For self-archiving, it is important to get a data curator on board and, as much as possible, follow the guidelines from archives like ICPSR to produce high-quality content that can stand the test of time and be independently usable for as long as possible.
I am a big user of Dataverse as a tool for providing public access to faculty-generated data. I was struck by the comment that Dataverse “empowers individual communities to manage their own research data”. I am a data archivist and I have found that researchers are not managing their data for the long term. Dataverse does not (yet) provide the kind of curatorial processes that are necessary to maintain usability. I would be hesitant to suggest that it is a tool for preservation because there is no way to check metadata, verify that the data and documentation match, ensure that file formats remain usable over time, and so on. Most researchers do not have the resources to do this work; it is important to work with curators or archivists and not rely solely on Dataverse. Otherwise, it does offer some options for collaboration and for short-term sharing of data.
Thanks, Libbie and Ann. It’s interesting that you both (correctly) pick up on the same point.
It was certainly naive of me to imply that the Dataverses are a long-term solution for managing research data, though they do have value for at least getting researchers to approach these issues for themselves, as in many fields data simply isn’t being shared. As you know, proper data curation can take time to administer, so perhaps the Dataverses can be seen as an intermediate solution here.
Thanks again.
I recently deposited data with the Dataverse Network and I made sure to deposit all of my data files in subsettable format, exactly for the reason Libbie mentions above: preservation. So I’m a bit confused.
Libbie, are you saying that files in this format are not preservable? I was also informed that all my files would remain accessible despite any changes in file format… Is this also incorrect?
Libbie, how should one check metadata? Do you have an example? I tried to use as many fields as possible from the DDI codebook that Dataverse provides, so that all questions relevant to my data set were answered.
I personally enjoyed the ease of going in, having my own space, and being able to share my data so easily.