
Making data sharing count

Consider a typical fMRI study: 
  • Twenty participants scanned for an hour = 10,000 USD.
  • Research assistant to run participants = 20,000 USD.
  • Postdoc to design the study and write it up = 40,000 USD.
70,000 USD later, science is richer by an eight-page paper, peer reviewed and published in an academic journal. The authors might look at the data again some time later, perhaps combining it with another of their datasets to improve statistical power. Maybe. Or maybe they will not have time. We may never learn whether there was anything more in the data (all 360 million data points of it) than what those eight pages described.
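That 360-million figure is easy to sanity-check with a back-of-envelope calculation. The scan dimensions below are illustrative assumptions (a 64×64×30 acquisition matrix and 150 volumes per subject), not numbers from the study itself:

```python
# Back-of-envelope estimate of how many data points a small fMRI study
# produces. The scan dimensions are assumed, chosen to be plausible for
# a typical acquisition; only the subject count comes from the example.

voxels_per_volume = 64 * 64 * 30   # assumed in-plane matrix x slice count
volumes_per_subject = 150          # assumed time points per session
subjects = 20                      # from the example study above

total_datapoints = voxels_per_volume * volumes_per_subject * subjects
print(f"{total_datapoints:,} data points")  # 368,640,000 - same order as 360 million
```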

Most scientists agree that sharing data makes sense and leads to better, more reproducible, transparent, and objective science. Funding agencies (the people who turn your taxes into academic papers) understand how expensive data collection is and want to squeeze as much as possible out of existing data. But the perspective of an individual scientist is different. Sharing data does not come for free. You need to clean the data and describe it properly so others can make good use of it. You also risk that someone will try and fail to replicate your findings, unearthing a mistake in your analysis. All that for what? So someone else can take YOUR data, find something interesting that you missed, and publish it? Leaving you with no credit for the data collection, and nothing to put on your CV when you face the tenure committee?

Luckily not all scientists think this way, but plenty do. Even though there are many visionaries and idealists in science (luckily!), in many situations it is a dog-eat-dog, publish-or-perish dynamic. I don't believe this is fundamentally wrong; competition drives development. Besides, the entities distributing money in science have to make their decisions somehow. Therefore we should not fight this system, but try to tap into the existing economy of academic credit.

Together with Daniel and Mike, I have recently written a paper describing an attempt to increase an individual researcher's motivation to share data. Instead of just putting your data on a website and getting nothing in return, you write a short paper describing in detail how the data were acquired and how they are organized. Such a data paper is a publication like any other: it has a DOI, can be cited, and has to pass peer review before being accepted. This simple idea solves multiple problems:
  • Through citations, data producers get appropriate credit. Interesting datasets will lead to highly cited papers.
  • The peer review process ensures that the quality of the data and metadata allows trouble-free reuse.
  • A separate publication allows more space for a detailed description of acquisition methods, in contrast to the few paragraphs of a typical cognitive neuroscience paper.
  • Everyone involved in the data collection (including research and lab assistants) can co-author the paper without concerns about "dilution of credit".
By no means is this a new idea: it has been implemented in other fields (see our paper for more details). It just needs to gain momentum (and this is the main reason for this shameless plug ;). Several neuroimaging journals already accept data papers: GigaScience (which will also host your data), Neuroinformatics, and Frontiers in Brain Imaging Methods. There is really not much to lose. With little effort you can get a publication and promote and share your data. So what are you waiting for? Publish a data paper to increase the impact of your research and receive credit for your data sharing efforts!

Comments

  1. This is an interesting idea, but the paper doesn't seem to address the fundamental question: what makes you think that data papers will make data sharing count? Reading this post and skimming the paper, I didn't see anything about this: is there any evidence that data papers boost tenure prospects? Salaries? Chance of still being in a field at a later followup? Publication of additional papers?

    If data papers are being used in other fields, this data should exist; or another angle would be to look at software packages since at least among R people it's not uncommon to see a published paper justifying and explaining a package which is then cited by subsequent users.

  2. True, we did not include any data on the impact of publishing data papers on researchers' careers. I will have a look at this, but I'm afraid it would be a very difficult comparison. Many factors contribute to academic success, so it would be hard to make a fair comparison between authors who published data papers and those who didn't. Additionally, some factors may correlate with the tendency to publish data papers in a non-causal way.

    There is also the question of how to measure academic success. Normally it would be based on the number and popularity of published papers, but it is not clear we can use that measure in this context. Clearly, being able to publish data papers can increase the number of publications you have. The question is whether those publications will have any impact, or in other words how they will be perceived by grant reviewers and tenure committees. Quantifying success without using publications can turn out to be quite tricky.

    The software example you mentioned also fills me with hope: some of the most cited papers in neuroimaging describe methods, and those methods would not have succeeded without a good software package.

  3. > I will have a look at this, but I'm afraid it would be a very difficult comparison. Many factors contribute to academic success so it would be hard to make a fair comparison between authors that published data papers and those that don't.

    Which also means that anyone who publishes a data paper will be taking as much of a gamble as anyone who just shares data, and papers are harder to write than a short webpage describing the data informally and linking to the files...

    > The software example you have mentioned also fills me with hope - some of the most cited papers in neuroimaging are describing methods, which would not succeed without a good software package.

    If you can't show any benefit to the author from those most-cited papers, then a fortiori, that undermines any case for data papers.

    Also, the existence of software papers doesn't necessarily show that data papers have a chance: software is not data. Software has a much better history, or story, about how sharing it can help you: other people can contribute bug fixes, keep it up to date and still compiling and running, optimize it, etc. If you plan to reuse the software in the future, then it can easily be a good idea to clean it up a bit and publish it; and even when it isn't a good idea strictly from a cost-benefit view, there's a widespread programming culture of sharing code under liberal licenses.

    Most of these reasons do not exist for data: the most valuable 'bug fixes' are pointing out serious errors or inconsistencies in the data of the sort that would discredit papers and hence careers; there's not really an equivalent of compiling/running (a text file will always be readable); data can't really be optimized short of just deleting parts (which is bad from an archival point of view) or compressed (which is trivial); and there obviously is no such culture in science encouraging data release as the default.

