Thursday, November 19, 2015

Highlights from the NeuroImage Data Sharing Issue

This week the first part of the NeuroImage special issue on Data Sharing was published. It's a great achievement and I am glad to see that more focus is being put on sharing data in our field. However, the issue is a mixed bag of papers that describe different types of resources. Some of my friends were confused by this heterogeneity, so I decided to highlight some of the resources presented in the issue.

The issue included papers about many data sharing platforms/databases (XNAT Central, LORIS, NIDB, LONI IDA, COINS, UMCD and NeuroVault) that are well known and covered by previous publications. Similarly, some datasets (FBIRN and CBFBIRN) have also been covered previously in the literature. I understand that those have been included in the issue for completeness, but I will leave them out of this review.

The original art used in the NeuroImage cover.

Developmental and aging datasets

  • The issue includes an impressive developmental dataset consisting of 9498 subjects with medical, psychiatric, neurocognitive, and genomic data (ages 8-21). 1000 of those subjects also have neuroimaging data (T1, PCASL, DWI, fMRI: resting, n-back and emotion ID). Data is available through dbGaP (you need to be an NIH-approved PI to apply for access, and the application process can be lengthy and involve a substantial amount of paperwork).
  • Another developmental dataset (PedsDTI) consists of 274 subjects aged from 10 days through 22 years. It includes high resolution DWI scans and reference T1 scans as well as precomputed derivatives and age-matched atlases (DWI only). The imaging data is accompanied by a set of behavioral, hormonal and clinical measurements. The data is located on NDAR servers and you need to apply to gain access.
  • PING is yet another developmental database that includes data from 1493 children aged 3 to 20 years. The scanning protocol includes T1, T2 and fMRI (rest). Behavioral measures include the NIH Toolbox for Cognition and PhenX. Genotyping information is also included (1000 SNPs). Due to IRB constraints only a subset of this data is available (neither the paper nor the website says how much, though). To gain access you will have to apply (only postdocs and higher can apply). All publications using this data require PING approval and co-authorship.
  • The Age-ility project includes 131 participants (ages 15-37). Imaging data includes T1, DWI, fMRI (resting state) and EEG (resting state). On the behavioral side only basic demographic information is available. Data is easy to access through NITRC and requires only registration for the notification mailing list.

Clinical datasets

  • The Parkinson's Disease Biomarkers Program provides data from 460 controls and 878 diagnosed cases (mostly Parkinson's). Data were acquired across many sites without normalization of the protocols, so different subsets will have different measurements. You will need to apply to gain access.
  • The Northwestern University Schizophrenia Data and Software Tool provides access to data from 171 schizophrenia subjects, 170 controls, 44 non-schizophrenic siblings and 66 control siblings. MRI data includes only T1 scans, but is accompanied by cognitive and genotypic measurements. You need to request access to this resource.
  • PLORAS is a dataset of 750 stroke patients accompanied by 450 healthy controls. The data includes T1 scans and fMRI (two simple language protocols). You will need to apply to gain access to this resource.

Other datasets

  • OMEGA is a dataset consisting of resting state MEG and T1 data collected from 97 participants. You will have to apply to gain access.
  • Open Science CBS Neuroimaging Repository is a dataset consisting of high resolution (7T) MP2RAGE (T1 maps) images from 28 healthy participants. The data is available publicly without the need for registration.
  • Cimbi is a somewhat heterogeneous dataset of PET (mostly serotonin receptors) and T1 scans. The dataset consists of 402 healthy individuals and 206 patients, with varying coverage of different behavioral measures. You need to apply to gain access and you might have to include members of the Cimbi consortium as coauthors on your paper.
  • BIL&GIN is a dataset consisting of 453 subjects (205 of whom are left-handed!) with T1, DWI, and fMRI (resting state) scans. Additionally, 303 subjects have 8 task fMRI scans (probing language, visuospatial, motor and arithmetic activities). You will need to apply to gain access to this resource and the authors will require co-authorship on your papers.

Non-human datasets

  • The Cambridge MRI database for animal models of Huntington disease provides T1 and DWI data from mouse and sheep models of Huntington's disease. The data is publicly available without any restrictions.

Data aggregation

  • The Global Alzheimer's Association Interactive Network facilitates finding and accessing multiple Alzheimer's disease datasets.
  • SchizConnect joins together 4 different datasets with participants diagnosed with schizophrenia.
  • ANIMA is a database of statistical maps from meta-analyses.
  • IEEG.org is a repository of intracranial EEG datasets. It is not clear from the paper what data is in the database and you cannot browse it without an account (I had problems registering a new account).

Summing up, it's nice to see that there is more data sharing going on in our field. I hope that NeuroImage will keep publishing more data papers in the future without the need for a special issue. Together with Mike Milham and Daniel Margulies I have written extensively about this form of data dissemination. Have a look at our paper for more information (including guidelines for reviewers).

The thing that struck me the most when reviewing the contents of this special issue was how restrictive the access to most of the datasets is. Most of them require you to apply to gain access. The official explanation for this procedure is that the repositories make sure that you can be trusted with data obtained from human subjects (even though all of it is anonymized before sharing). In practice no one checks whether you have appropriate facilities to keep the data safe (such as encrypted storage servers). On the other hand, the access request approval system can potentially be abused by denying access to competing researchers and by forcing those who are granted access to share co-authorship.

Many projects have been promoting unrestricted public access to data (Open Science CBS and Cambridge datasets from this review, OpenfMRI, Study Forrest, NeuroVault etc.), which means no "requests for approval". No privacy disasters or lawsuits have been reported in the context of the fully open datasets mentioned above, which shows that unrestricted sharing can be done. At the same time, removing the need to request access to data lowers the barriers to use and makes the whole process more transparent.