

Liberating data - an interview with John Ioannidis

A couple of weeks ago I had the pleasure of sitting down with Prof. John Ioannidis to talk about the role of data in science. Prof. Ioannidis is well known for his pursuit to uncover and fix issues with the modern scientific process. This interview is the first in a series - stay tuned for more!

Chris Gorgolewski (CG): You have a long journey in science behind you, and I was wondering how you thought about the role of data in science when you were entering grad school?

John Ioannidis (JI): I think that my exposure to data has changed over time, because the types of datasets and their content, clarity, their limitations, their documentation, their availability, their transparency, and their proneness to error have changed over time. I think by now I have worked in different fields that have different doses of these challenges. The types of data that I was working with when I was a medical student or just graduating are ve

The glass box design philosophy

There is an interesting paradox in the context of developing data analysis software. On one hand, there are clear benefits to designing tools that are easy to use, robust, and require as little manual intervention or user expertise as possible. Such a design philosophy allows more users to take advantage of the tools and apply them automatically to large heterogeneous datasets. On the other hand, blindly applying tools that are not fully understood, or that do not provide useful information on whether the input data meets their assumptions, can raise serious concerns. Developers not only take great pride in the quality of their software but also feel responsible for how it is being used. Inexperienced users can misuse a "black box" tool and obtain misleading results. Whether we like it or not, such situations can lead to a bad reputation misattributed to the tool itself. Ease of use seems to be at odds with avoiding misuse. Extending your user base to less experienced users can lead to m

2017: Research Summary

Even though the passing of the year is a more or less arbitrary date, it's a good opportunity to give a status update on the various activities and projects I have been involved in this year. Here's a brief summary of 2017. Brain Imaging Data Structure (BIDS): Since the BIDS specification version 1.0.0 and the accompanying Nature Scientific Data paper were published last year, we have been focusing on three things: stability, sustainable growth, and the software ecosystem. Stability meant that we had to be very careful not to break backward compatibility, even though many great ideas for BIDS 2.0 have been submitted. It also meant that we focused on reaching out to new communities: I gave BIDS tutorials in London, Oxford, Birmingham, Glasgow, and Chapel Hill this year, and Dora Hermes published the BIDS Starter Kit (a super handy resource with tutorials and code snippets). Sustainable growth translated into adopting a system of BIDS Extension Proposals (BEPs) and providing

To pin or not to pin dependencies: reproducible vs. reusable software

We recently had a very interesting conversation in our lab about how to describe software dependencies (the libraries one needs to install) for a software project in the context of research. One camp proposed explicitly listing which version of each dependency is required (a scheme also referred to as "pinning"), while the other camp was more in favor of either not specifying versions at all or specifying only the minimal required version. Luckily, both camps agreed on the importance of specifying dependencies, but what's the big deal about pinning vs. not pinning? Advantages of pinning dependencies: When you pin a dependency (for example, by saying "numpy==1.1.3") you explicitly point to a version of a library that a) you know works with your script and b) was used to generate the results you present in your paper. This is very useful for people trying to replicate your results using your code, as well as for yourself when attempting to revisit a project that was put aside for a while
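To make the two styles concrete, here is a minimal sketch of how they look in a pip requirements file. Only the "numpy==1.1.3" pin comes from the text above; the unpinned counterpart is an illustrative assumption of how the other camp's approach might be written:

```
# Pinned style: point to the exact version known to work
# (and used to generate the published results).
numpy==1.1.3

# Unpinned style (alternative, not used together with the pin above):
# state only the minimal required version, so installs may pick up
# newer releases with bug fixes but also potential behavior changes.
numpy>=1.1
```

The two lines are mutually exclusive ways of declaring the same dependency; a project would choose one style per dependency depending on whether it optimizes for replicating past results or for reuse with current library versions.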

Forever free: building non-profit online services in a sustainable way

In the past decade, we have seen a big switch from client-run software to online services. Some services, such as scientific data repositories, were a natural fit for a centralized online implementation (though one day we may see distributed versions of those). Others, such as Science-as-a-Service platforms, were more convenient and scalable versions of client/desktop-based software. One thing is certain: online platforms available 24/7 via a web browser have proven very convenient for a range of tasks such as communication, sharing data, and data processing. The non-profit sector (for example, projects funded by scientific grants) has also entered this domain. There are countless examples where modern web technologies based on centralized services can benefit scientists and the general public, even if the service they provide is not part of a commercial operation. This is especially true given the increasing trend to share data and materials in science. Those outputs need to be stored and

Sharing academic credit in an open source project

We live in truly wonderful times for developing software. Thanks to the growth of the Open Source movement and the emergence of platforms such as GitHub, coding has become something more than just an engineering task. Social interactions, bonds between developers, and guiding new contributors are sometimes as important as sheer technical acumen. A strong and healthy developer community revolving around a software tool or library is very important. It makes the tool more robust (tested in many more environments), sustainable (progress does not depend on a single person), and feature-rich (more developers == more features). Even though there exist some excellent guides on how to build a welcoming and thriving community, they miss one aspect that is specific to software development performed by academics: academic credit. For those not familiar with how things run in academia, a quick refresher: the main currency of science is papers (manuscripts) and the number of times they are refe

How to meet the new data sharing requirements of NIMH

The National Institute of Mental Health (NIMH) has recently mandated uploading data collected from all clinical trials it sponsors to the NIMH Data Archive (NDA). Similar policies are now in place for many of their grant calls. This initiative differs from previous NIH attempts to get more data shared. In contrast to the "data management plans" that have to be included in all NIH grants, which historically remained unimplemented without any consequences for the grantees, this new policy has teeth. Folks at NDA have access to all ongoing grants and are motivated to go after researchers who are late with their data submissions. Since there is nothing scarier than an angry grant officer, it's worth taking this new policy seriously! In this brief guide, I'll describe how to prepare your neuroimaging data for NDA submission with minimal fuss. Minimal required data: NDA requires each study to collect and share a small subset of values for all su