Thursday, June 16, 2016

How to meet the new data sharing requirements of NIMH

The National Institute of Mental Health (NIMH) has recently mandated that data collected from all clinical trials it sponsors be uploaded to the NIMH Data Archive (NDA). Similar policies are now in place for many of their grant calls. This initiative differs from previous NIH attempts to encourage data sharing. In contrast to the "data management plans" that have to be included in all NIH grants, which historically remained unimplemented without any consequences for the grantees, this new policy has teeth. Folks at NDA have access to all ongoing grants and are motivated to go after researchers who are late with their data submission. Since there is nothing scarier than an angry grant officer, it's worth taking this new policy seriously!

In this brief guide I'll describe how to prepare your neuroimaging data for the NDA submission with minimal fuss.

Minimal required data

NDA requires each study to collect and share a small set of values for all subjects and scans:
  1. Name, surname, date of birth, and place of birth. This data will not be shared with the NDA, but is required to generate unique IDs for your subjects. Those IDs are used by NDA to link participants across studies. You should therefore collect this data and keep it in a safe place, linked to the IDs you use internally. You will use it during the submission process to generate NDA-compatible IDs.
  2. Age (in months at the time of scan) and sex (male or female). This is the minimal demographic information you will need to collect.
  3. Repetition time, echo time, flip angle, scanner manufacturer, model, field strength, and date of acquisition for all of your neuroimaging files (T1s, fMRI, DWI, etc.).
  4. bvec and bval files if your dataset includes diffusion data.
  5. Slice timing if your dataset includes fMRI data.
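If you organize your data in BIDS (described below), most of these scan parameters live in JSON sidecar files next to the imaging data. As a rough sketch, the sidecar for a single fMRI run might look like the following (the field names come from the BIDS specification; the file name and all values are made up for illustration):

```python
import json

# Hypothetical sidecar metadata for one fMRI run; dcm2niix normally
# extracts these values from the DICOM headers automatically.
sidecar = {
    "RepetitionTime": 2.0,                 # seconds
    "EchoTime": 0.03,                      # seconds
    "FlipAngle": 90,                       # degrees
    "Manufacturer": "Siemens",
    "ManufacturersModelName": "TrioTim",
    "MagneticFieldStrength": 3,            # Tesla
    "SliceTiming": [0.0, 1.0, 0.5, 1.5],   # required by NDA for fMRI
}

with open("sub-01_task-rest_bold.json", "w") as f:
    json.dump(sidecar, f, indent=4)
```

Double-checking a handful of these files by eye after conversion is a good way to catch scanner-export quirks early.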

Data organization

For organizing your data after acquisition I recommend using the Brain Imaging Data Structure (BIDS). It's an intuitive file organization scheme that will make it easier to analyze your data later, thanks to the growing set of tools that support it (such as mriqc, FMRIPREP, or AA). You can use tools like dcm2niix (to convert from DICOM to NIFTI and extract metadata) and/or heudiconv (to batch process and sort DICOMs from many scans). Data required by NDA can be included in your BIDS dataset in the following places:
  1. Age and sex can be included as columns in the participants.tsv file. If your data comes from a longitudinal study, you can include age and sex in the _sessions.tsv files (one per subject) to specify the values for each session independently.
  2. If you use dcm2niix, you should already have almost all of the required metadata values and extra files (bval and bvec). Double-check the scanner model, field strength, and flip angle.
  3. Finally, the date of acquisition can be inserted as the acq_time column in the _scans.tsv files.
When you are done organizing your data, use the BIDS Validator to check that everything is OK.
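As an illustration, a minimal participants.tsv with the age (in months) and sex columns can be generated with the standard library alone (the subject IDs and values below are hypothetical):

```python
import csv

# Hypothetical participant records: age in months at time of scan, sex M/F.
participants = [
    {"participant_id": "sub-01", "age": 312, "sex": "M"},
    {"participant_id": "sub-02", "age": 298, "sex": "F"},
]

# BIDS tabular files are tab-separated with a header row.
with open("participants.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["participant_id", "age", "sex"],
                            delimiter="\t")
    writer.writeheader()
    writer.writerows(participants)
```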

When data acquisition is complete and the validator passes all of the checks, it is a good habit to make the dataset folder read-only. This will prevent accidental deletion or modification of the data down the road.
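A minimal sketch of doing this in Python (assuming a POSIX filesystem; the dataset path in the example call is hypothetical):

```python
import os
import stat

def make_read_only(root):
    """Remove write permission from every file and directory under root."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    # Finally, protect the top-level folder itself.
    mode = os.stat(root).st_mode
    os.chmod(root, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

# Example (hypothetical path): make_read_only("/data/my_bids_dataset")
```

The same effect can of course be achieved from the shell; the point is to do it once, deliberately, when acquisition is finished.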

Submission to NDA

Using BIDS makes the submission process easy and requires very little manual data wrangling.
1. Use the GUID Tool to generate GUIDs for each of your subjects. This will require providing the name, date of birth, and place of birth, and will result in a file mapping the IDs you use internally to GUIDs. Make sure you keep this file safe – it includes personal information!
2. Create an NDA submission package using the BIDS2NDA tool. This command line tool takes three arguments: your BIDS dataset, the GUID mapping file, and the folder where the NDA submission package will be stored.
3. Submit the data using the NDA Validation and Submission tool. If the BIDS2NDA tool worked correctly, the NDA Validator should not return any errors and you should be able to submit your dataset without problems.

I hope this guide will convince you that submitting data to NDA (and thus fulfilling your grant requirements) can be a relatively straightforward process. One thing worth keeping in mind is that some of those steps require a little bit of planning (for example, remembering to collect the place of birth of each of your participants). This guide also covers only the neuroimaging data – to submit other data types (such as questionnaires or clinical assessments) you will have to use data dictionaries provided by NDA.

The BIDS2NDA tool is still under active development – please submit an issue on GitHub if you find a bug.

Thursday, November 19, 2015

Highlights from the NeuroImage Data Sharing Issue

This week the first part of the NeuroImage special issue on Data Sharing was published. It's a great achievement and I am glad to see more focus being put on sharing data in our field. However, the issue is a mixed bag of papers describing different types of resources. Some of my friends were confused by this heterogeneity, so I decided to highlight some of the resources presented in the issue.

The issue includes papers about many data sharing platforms/databases (XNAT Central, LORIS, NIDB, LONI IDA, COINS, UMCD, and NeuroVault) that are well known and covered by previous publications. Similarly, some datasets (FBIRN and CBFBIRN) have also been covered previously in the literature. I understand that those have been included in the issue for completeness, but I will leave them out of this review.

The original art used on the NeuroImage cover.

Developmental and aging datasets

• The issue includes an impressive developmental dataset consisting of 9498 subjects with medical, psychiatric, neurocognitive, and genomic data (ages 8-21). 1000 of those subjects also have neuroimaging data (T1, PCASL, DWI, fMRI: resting, n-back, and emotion identification). Data is available through dbGaP (you need to be an NIH-approved PI to apply for access; the application process can be lengthy and involves a substantial amount of paperwork).
• Another developmental dataset (PedsDTI), consisting of 274 subjects ranging in age from 10 days to 22 years, includes high resolution DWI scans and reference T1 scans as well as precomputed derivatives and age-matched atlases (DWI only). The imaging data is accompanied by a set of behavioral, hormonal, and clinical measurements. The data is located on NDAR servers and you need to apply to gain access.
• PING is yet another developmental database; it includes data from 1493 children aged 3 to 20 years. The scanning protocol includes T1, T2, and fMRI (rest). Behavioral measures include the NIH Toolbox for Cognition and PhenX. Genotyping information is also included (1000 SNPs). Due to IRB constraints only a subset of this data is available (neither the paper nor the website says how much, though). To gain access you will have to apply (only postdocs and higher can apply). All publications using this data require PING approval and co-authorship.
• The Age-ility Project includes 131 participants (ages 15-37). Imaging data includes T1, DWI, fMRI (resting state), and EEG (resting state). On the behavioral side only basic demographic information is available. The data is easy to access through NITRC and requires only registration for the notification mailing list.

Clinical datasets

• The Parkinson's Disease Biomarkers Program provides data from 460 controls and 878 diagnosed cases (mostly Parkinson's). Data were acquired across many sites without normalization of the protocols, so different subsets will have different measurements. You will need to apply to gain access.
• The Northwestern University Schizophrenia Data and Software Tool provides access to 171 subjects with schizophrenia, 170 controls, 44 non-schizophrenic siblings, and 66 control siblings. The MRI data includes only T1 scans, but is accompanied by cognitive and genotypic measurements. You will need to request access to this resource.
• PLORAS is a dataset of 750 stroke patients accompanied by 450 healthy controls. The data includes T1 scans and fMRI (two simple language protocols). You will need to apply to gain access to this resource.

Other datasets

• OMEGA is a dataset consisting of resting state MEG and T1 data collected from 97 participants. You will have to apply to gain access.
• The Open Science CBS Neuroimaging Repository is a dataset consisting of high resolution (7T) MP2RAGE (T1 map) images from 28 healthy participants. The data is available publicly without the need for registration.
• Cimbi is a somewhat heterogeneous dataset of PET (mostly serotonin receptors) and T1 scans. The dataset consists of 402 healthy individuals and 206 patients, with varying coverage of different behavioral measures. You need to apply to gain access and you might have to include members of the Cimbi consortium as co-authors on your paper.
• BIL&GIN is a dataset consisting of 453 subjects (205 of whom are left handed!) with T1, DWI, and fMRI (resting state) scans. Additionally, 303 have 8 task fMRI scans (probing language, visuospatial, motor, and arithmetic activities). You will need to apply to gain access to this resource and the authors will require co-authorship on your papers.

Non-human datasets

• The Cambridge MRI database for animal models of Huntington's disease provides T1 and DWI data from mouse and sheep models of Huntington's. The data is publicly available without any restrictions.

Data aggregation

• The Global Alzheimer's Association Interactive Network facilitates finding and accessing multiple Alzheimer's datasets.
• SchizConnect joins together 4 different datasets with participants diagnosed with schizophrenia.
• ANIMA is a database of statistical maps from meta-analyses.
• is a repository of intracranial EEG datasets. It is not clear from the paper what data is in the database and you cannot browse it without an account (I had problems registering a new account).

Summing up – it's nice to see that there is more data sharing going on in our field. I hope that NeuroImage will keep publishing more data papers in the future without the need for a special issue. Together with Mike Milham and Daniel Margulies, I have written extensively about this form of data dissemination – have a look at our paper for more information (including guidelines for reviewers).

The thing that struck me the most when reviewing the contents of this special issue was how restrictive access to most of the datasets is. Most of them require you to apply to gain access. The official explanation for this procedure is that the repositories make sure you can be trusted with data obtained from human subjects (even though all of it is anonymized before sharing). In practice, no one checks whether you have appropriate facilities to keep the data safe (such as encrypted storage servers). On the other hand, the access request approval system can potentially be abused by denying access to competing researchers and forcing beneficiaries to share co-authorships.

Many projects have been promoting unrestricted public access to data (the Open Science CBS and Cambridge datasets from this review, OpenfMRI, Study Forrest, NeuroVault, etc.) – this means no "requests for approval". There have been no privacy disasters or lawsuits reported in the context of the fully open datasets mentioned above, which shows that unrestricted sharing can be done. At the same time, removing the need to request access lowers the usage barriers and makes the whole process more transparent.

Monday, September 28, 2015

The unsung heroes of neuroinformatics

There are many fascinating and exciting developments in human cognitive and clinical neuroscience. We are constantly drawn to novel and groundbreaking discoveries. There is nothing wrong with this – I would even say it's part of human nature. This kind of research is not, however, what I want to talk about today. This post is dedicated to people building tools that play a crucial role as the backbone of research – helping novel discoveries happen. They go beyond providing a proof of concept, publishing a paper, and pointing to an undocumented piece of code that works only in their labs. They provide maintenance, respond to user needs, and constantly update their tools, fixing bugs and adding features. Here I will highlight two tools which, in my personal (and very biased) opinion, play an important role in supporting human neuroscience and could do with some more appreciation.

Early years of Captain Neuroimaging


Anyone dealing with MRI data in Python must know about this library. Nibabel allows you to read and write a variety of different file formats used in neuroimaging (most importantly NIFTI). It hides the obscurity of those standards and provides easy-to-use objects and methods that let you efficiently access, modify, and visualize neuroimaging data. It may seem like a small thing, but not having to work out the right header format each time you want to read a file is easily overlooked. I use nibabel all the time and I am very grateful for its existence!
It's a really good example of something that, even though it is not "novel" or sexy, is absolutely crucial and enables many researchers to get closer to understanding how the human brain works. Despite the fact that nibabel plays an essential role in the Python neuroimaging ecosystem, it does not get enough credit. Nibabel is an open source project led by +Matthew Brett, who is tirelessly keeping it up to date with a frequent release cycle.


Papaya is a relatively new project providing a modular, reusable, JavaScript-based NIFTI and DICOM viewer. Being able to read the data, apply the right affine transformation, and perform efficient interpolation is probably not the most fascinating work in the world, but it's incredibly important. Web based applications are the future and I am sure that Papaya will play a crucial role in bringing neuroimaging to the cloud. Papaya has already been used in projects such as NeuroVault, ANIMA, and NIFTI-drop. I wonder whether those projects would exist at all if they had to develop their own JavaScript viewers! Thanks to the work of the Papaya team they can all reuse the same reliable and fast viewer.
Papaya is also an open source project, but it is mainly developed by the Biomedical Image Analysis Division of the Research Imaging Institute at the University of Texas San Antonio, led by Jack Lancaster. Their (sadly) unnamed developers are doing a great job, constantly improving the viewer and providing new features upon user request.

I love those two projects and I have written this post to tip my hat to the people spending their time making this software happen. It has enabled me to do research over the years and build tools of my own. Behind my urge to compliment the unappreciated there is a bigger issue. Science is currently so obsessed with novelty and groundbreaking discovery that there is no space for appreciating, crediting, and most importantly funding those who provide essential support for this science to happen. If we want solid, reproducible, and robust findings, we need to improve our tools and focus on maybe less fascinating, but nonetheless important, work.

Sunday, September 13, 2015

Software workaround for corrupted RAM in OS X

Recently my computer has been acting up. Software started crashing, compilations were failing, etc. – many small errors that I could not replicate. I wasn't too concerned, because I'm a natural tinkerer – I play with software and install many different additions, and one of the side effects can be an unstable operating system. Eventually my system stopped booting – the partition table was corrupted. I had to wipe it and reinstall (which was a massive pain in the ass). I also tried to run some hardware checks just in case (the computer is over three years old), but the "Apple Hardware Test" hung each time I ran it (bad sign, huh?). I eventually ran memtest86 overnight and discovered that part of my RAM is corrupted. My computer is a MacBook Pro Retina with an expired warranty.

Normally I would buy new RAM and install it myself, but the Retina MBPs have RAM permanently soldered to the logic board. Instead of paying through the nose to get it fixed, I researched software solutions. Linux users have a very handy kernel option, memmap, that tells the OS not to use a particular range of memory addresses. The situation on OS X is not so rosy. The only option available is to restrict memory up to the point where it's corrupted (but this way you lose everything after it). In my case I had around a 60 MB range of corrupted memory in the 13th gigabyte, so my only option was to restrict the system to 12 GB. This is the procedure:

1. Run memtest86 overnight to figure out where your memory is corrupted.
2. Estimate the lowest range of usable memory (in my case it was 12000 MB).
3. Restrict the memory by setting a kernel flag:
    sudo nvram boot-args="maxmem=12000"
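The arithmetic in step 2 is simple but worth getting right: take the lowest address memtest86 flags as bad and round it down to a comfortable boundary so the kernel never touches the corrupted cells. A quick sketch (the memtest86 result here is hypothetical):

```python
def usable_memory_mb(bad_region_start_mb, granularity_mb=1000):
    """Round the start of the corrupted region down to a safe boundary."""
    return (bad_region_start_mb // granularity_mb) * granularity_mb

# Hypothetical memtest86 result: the corrupted range starts roughly 60 MB
# into the 13th gigabyte, i.e. at about 12348 MB (value made up here).
maxmem = usable_memory_mb(12348)
print(f'sudo nvram boot-args="maxmem={maxmem}"')
```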

This did the trick and made my laptop usable again!

Friday, December 12, 2014

How to convert between voxel and mm coordinates using Python

I'm often asked how to convert between voxel and mm coordinates using Python. This can be easily achieved using the nibabel package with only a few lines of code. The following tutorial is based on +Matthew Brett's answer on the nipy mailing list.

Going from voxel to mm coordinates

import os
import nibabel as nib
Load the NIFTI file defining the space you are interested in. For the purpose of this tutorial we will use a test dataset shipped with nibabel.
data_fname = os.path.join(os.path.dirname(nib.__file__), 'tests', 'data', 'example4d.nii.gz')
img = nib.load(data_fname)
Get the affine matrix and convert the coordinates.
aff = img.get_affine()
real_pt = nib.affines.apply_affine(aff, [22, 34, 12])
array([ 73.85510254,  27.1169095 ,  29.79324198])

Going from mm to voxel coordinates

Going the other direction is even easier.
import numpy.linalg as npl
nib.affines.apply_affine(npl.inv(aff), real_pt)
array([ 22.,  34.,  12.])
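Under the hood, apply_affine is just a matrix-vector multiplication in homogeneous coordinates. As a sketch of what nibabel does internally (the affine below is made up; nibabel itself is not required), the same conversions can be reproduced with plain NumPy:

```python
import numpy as np

def voxel_to_mm(affine, voxel_coords):
    """Apply a 4x4 affine to a 3-element voxel coordinate."""
    voxel = np.append(voxel_coords, 1.0)   # homogeneous coordinates
    return (affine @ voxel)[:3]

def mm_to_voxel(affine, mm_coords):
    """Invert the affine to go back from mm to voxel space."""
    return voxel_to_mm(np.linalg.inv(affine), mm_coords)

# A made-up affine: 2 mm isotropic voxels plus a translation of the origin.
aff = np.array([[2., 0., 0., -90.],
                [0., 2., 0., -126.],
                [0., 0., 2., -72.],
                [0., 0., 0., 1.]])

print(voxel_to_mm(aff, [22, 34, 12]))                            # mm position
print(mm_to_voxel(aff, voxel_to_mm(aff, [22, 34, 12])))          # round-trips
```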

Monday, December 8, 2014

How to embed interactive brain images on your website or blog

We have recently added a new feature to NeuroVault - you can embed statistical maps in external websites and blogs. They look just like the one below:
It's very easy to use. You just need to upload your statistical maps (unthresholded, NIFTI file format, in MNI space) to NeuroVault and click on the "Embed" tab. Copy the HTML code snippet and paste it into your blog or website.
This feature has been long awaited by some modern academic journals (like +F1000Research) as well as some neuroimaging bloggers (see +Micah Allen's post about NeuroVault). It is still in beta, so we would appreciate your feedback.

Monday, October 27, 2014

This is my brain: sharing the risk

At a recent meeting in Leiden we talked about many issues related to data sharing. Previously on this blog I've covered how to incentivise scientists to share data through data papers, but during that meeting we also discussed ethical issues. When we collect data about our participants (whether those are behavioural measures or MRI scans), we take responsibility for it. We make a pledge that we will do whatever we can to protect the identity of our subjects.

This is easier if we do not share data. Because fewer people have access to the data, the likelihood of someone finding a way to connect brain scans to a particular person is lower. In reality this could be done either through a security breach (someone hacking the university network and obtaining the list of participants and their anonymous IDs) or by combining multiple datasets about one person to obtain enough details to identify them (this, however, applies only to participants taking part in multiple studies).

Even though we do everything we can to protect our participants, we cannot give any guarantees. So in the unlikely event of revealing the identity of research study participants, what risk are they exposed to? In the field of neuroimaging we obtain clinically relevant data. In other words, we take images of the brain that can be used to help diagnose diseases. In the case of healthy controls, all scans are screened for abnormalities by a trained neuroradiologist, and participants with any signs of a disease (stroke, tumour, vascular malformations, etc.) are contacted and excluded from the study. Nonetheless, new methods are being developed, and maybe in the future someone will be able to find out more about the health of our participants using the same data with the help of new techniques.

Why is this important in the context of privacy? Some countries have a private health care system based on health insurance. The cost of insurance can be influenced by the health of the person applying for it. Imagine that the identities of participants from some publicly shared neuroimaging study have been leaked. A private health insurance company obtains that data and uses it to assess the risk of various brain diseases for a particular individual. If it is high, they will increase that individual's monthly fee to reflect the risk of covering future treatment costs. They are an insurance company after all - it's like charging inexperienced drivers higher insurance rates because they are more likely to get into an accident.

This scenario is very, very unlikely. We really do a lot to protect the identity of our participants. There would have to be a security breach, biomarkers of brain diseases would have to be much better than they are now, and it only affects people with private health insurance. Nonetheless, this scenario is not impossible. By sharing data we are exposing our participants a tiny bit more than we would if we did not share the data. I do believe that the benefit of shared data is much bigger than the risk (the scenario I described is really, really unlikely), but this applies mostly to the big picture. Individual participants will not care about the big picture if their health insurance gets more expensive (although I would also argue that getting a free MRI scan screened by a specialist is a benefit to an individual).
My brain in all its glory
I have been promoting data sharing for several years now, and I believe that I owe it to my participants to be exposed as much as, or more than, they are. Therefore I decided to make structural scans of my brain freely available. I have uploaded my brain scans to FigShare - you can download them here. This is much bigger exposure than in any of the publicly shared datasets, mostly because my name is already linked to the data, but also because the scans include not only T1 but also the more clinically relevant T2, T2*, and FLAIR sequences. Those who say that I am young and healthy, so I am not really revealing anything, I encourage to look for white matter lesions in my scans. Those who say I am living in socialist Europe I would like to inform that I have moved to the US and am currently covered by private health insurance.

The dataset of anatomical scans of my own brain has little scientific value (it is far from myConnectome - a project during which +Russ Poldrack scanned himself many times over a period of one year). It's more of an experiment. I'm curious whether this can affect me in a negative way. Should I expect a call from my insurance company? Will I regret sharing this data? I'm pretty sure I will not, but time will tell.