Skip to main content

Posts

The glass box design philosophy

There is an interesting paradox in context of developing data analysis software. On one side, there are clear benefits of designing tools that are easy to use, robust and require as little manual intervention or user expertise as possible. Such design philosophy allows more users to take advantage of the tools and apply them automatically to large heterogeneous datasets. On the other side, blindly applying tools that are not fully understood or do not provide useful information on whether the input data meets their assumptions can raise serious concerns. Developers take not only great pride in the quality of their software but also feel responsible for how the software is being used. Unexperienced users can misuse a “black box” tool and obtain misleading results. Whether we like it or not, such situations can lead to bad reputation misattributed to the tool itself.



Ease of use seems to be at odds with avoiding misuse. Extending your user base to less experienced users can lead to mist…

2017: Research Summary

Even though the passing of the year is more or less an arbitrary date it's a good opportunity to give a status update on various activities and projects I have been involved in this year. Here's a brief summary of 2017.
Brain Imaging Data Structure (BIDS) Since the BIDS Specification version 1.0.0 and the accompanying Nature Scientific Data paper were published last year we have been focusing on three things: stability, sustainable growth, and the software ecosystem.
Stability meant that we had to be very careful not to break backward compatibility even though many great ideas for BIDS 2.0 have been submitted. It also means that we focused on reaching out to new communities: I gave BIDS tutorials in London, Oxford, Birmingham, Glasgow and Chapel Hill this year and Dora Hermes published the BIDS Starter Kit (a super handy resource with tutorials and code snippets). 
Sustainable growth translated into adopting a system of BIDS Extension Proposals (BEPs) and providing a BIDS Cont…

To pin or not to pin dependencies: reproducible vs. reusable software

We recently had a very interesting conversation in our lab about how to describe software dependencies (libraries one needs to install) for a software project in the context of research. One camp was proposing explicitly listing which version of a dependency is required (a scheme also referred to "pinning") and the other camp was more in favor of either not specifying version at all or specifying the minimal required version. Luckily both camps agreed on the importance of specifying dependencies, but what's the big deal about pinning vs not pinning?

Advantages of pinning dependencies When you pin a dependency (for example by saying "numpy==1.1.3") you explicitly point to a version of a library that a) you know works with your script b) was used to generate the result you present in your paper. This is very useful for people trying to replicate your results using your code as well as yourself attempting to revisit a project that was put aside for a while (for e…

Forever free: building non-profit online services in a sustainable way

In the past decade, we have seen a big switch from client run software to online services. Some services such as scientific data repositories were a natural fit for a centralized online implementation (but one day we can see a distributed versions of those). Others, such as Science-as-a-Service platforms, were more convenient and scalable versions of the client/desktop based software. One thing is certain - online platforms, available 24/7 via a web browser have proven to be very convenient in a range of tasks such as communication, sharing data, and data processing. Non-profit sector (such as projects funded by scientific grants) has also entered this domain. There are countless examples where modern web technologies based on centralized services can benefit scientists and general public even if the service they provide is not part of a commercial operation. This is especially true due to increased trend to share data and materials in science. Those outputs need to be stored and made …

Sharing academic credit in an open source project

We live in truly wonderful times to develop software. Thanks to the growth of the Open Source movement and emergence of platforms such as GitHub, coding became something more than just an engineering task. Social interactions, bonds between developers, and guiding new contributors are sometimes as important as sheer technical acumen. A strong and healthy developer community revolving around a software tool or library is very important. It makes the tool more robust (tested in many more environments), sustainable (the progress does not depend on a single person), and feature rich (more developers == more features). Even though there exist some excellent guides on how to build a welcoming and thriving community they miss out on one aspect that is specific to software development performed by academics - academic credit. For those not familiar with how things run in academia a quick refresher: the main currency of science is papers (manuscripts) and the number of times they are referenced …

How to meet the new data sharing requirements of NIMH

National Institute of Mental Health (NIMH) have recently mandated uploading data collected from all of clinical trials sponsored by them to the NIMH Data Archive (NDA). Similar policies are not in place for many of their grant calls. This initiative differed from the previous attempts of NIH to make more data shared. In contrast to "data management plans" that have to be included in all NIH grants that historically remained unimplemented without any consequences to the grantees this new policy has teeth. Folks at NDA have access to all ongoing grants and are motivated to go after the researchers that are late with their data submission. Since there is nothing more scary than an angry grant officer it's worth taking this new policy seriously!

In this brief guide I'll describe how to prepare your neuroimaging data for the NDA submission with minimal fuss.


Minimal required data NDA requires each study to collect and share some small subset of values for all subjects and…

Highlights from the NeuroImage Data Sharing Issue

This week the first part of NeuroImage special issue on Data Sharing was published. It's a great achievement and I am glad to see that more focus is being put on sharing data in our field. However the issue is a mixed bag of papers that describe different types of resources. Some of my friends were confused by this heterogeneity, so I decided to highlight some of the resources presented in the issue.

The issue included papers about many data sharing platforms/databases (XNAT Central, LORIS, NIDB, LONI IDA, COINS, UMCD and NeuroVault) that are well known and covered by previous publications. Similarly some datasets (FBIRN and CBFBIRN) also have been previously covered in the literature. I understand that those have been included in the issue for completeness, but I will leave them out in this review.


Developmental and aging datasetsThe issue includes an impressive developmental dataset consisting of 9498 subjects with medical, psychiatric, neurocognitive, and genomic data (ages 8-2…