Liberating data - an interview with John Ioannidis

A couple of weeks ago I had the pleasure of sitting down with Prof. John Ioannidis to talk about the role of data in science. Prof. Ioannidis is well known for his pursuit to uncover and fix issues with the modern scientific process. This interview is the first in a series - stay tuned for more!
Source: https://twitter.com/StanfordDeptMed/status/721078974021906433

Chris Gorgolewski (CG): You have a long journey in science behind you, and I was wondering how you thought about the role of data in science when you were entering grad school?
John Ioannidis (JI): I think that my exposure to data has changed over time, because the types of datasets and their content, clarity, their limitations, their documentation, their availability, their transparency, and their proneness to error have changed over time. I think by now I have worked in different fields that have different doses of these challenges. The types of data that I was working with when I was a medical student or just graduating are very different compared to what we have now. Some of the challenges are actually pretty new in terms of the magnitude, complexity, reasons why we collect data, how we do that, what is their half-life – and more. There are obviously some issues that have been there for scientific investigation since the very beginning, but there are others that change pretty rapidly in our times. Most of those add extra layers of complexity about data.

"Many people at some point realized that we need to share extensively, we need to collaborate, we need to join forces, otherwise we are not going to do anything serious."

CG: You mentioned that you worked in many different fields. When we talk about data management, archival and especially sharing among researchers, genomics is usually brought up as the template everyone should try to emulate. I was wondering what other fields can learn from genomics and why you think things got so much better in that field in particular?
JI: I think there is a risk of overgeneralizing. Even within genomics there are fields that are more eager to share data than others and there are many fields beyond genomics that are quite advanced in their practices of data sharing. In principle you need to have a willing community of scientists and you need to have a setup of people who feel that they are going to gain by sharing data. In genetics people have realized that unless they join forces, share, and come up with these huge consortia, it would be impossible for each team alone to do anything meaningful. The results were largely false positives, which is what you would expect in underpowered environments even with accurate measurements, and even with bias contained there was not much to discover. Little discovery, lots of noise. Many people at some point realized that we need to share extensively, we need to collaborate, we need to join forces, otherwise we are not going to do anything serious. In other fields this is not so obvious. Many fields are operating in a situation where they publish lots of irreproducible research, but, somehow, they are happy with it. Not having access to the data does not even allow showing that this is the case.

CG: But where is the difference – it seems that the problem of underpowered studies that you mentioned should be producing this selective pressure for collaboration across different fields. What makes genomics different?
JI: Maybe I am oversimplifying, but I think there was a large community of scientists, a critical mass, there were enough scientists who took that step of creating these larger databases, by joining forces. By doing that they had a clear competitive advantage, compared to others. They could publish things that everybody would quickly recognize as higher quality, more reliable. They could improve both in true positives and false positives. This may not be so obvious to many scientists in other fields, where maybe the pace of the investigation is not as rapid. The critical mass maybe is not there. Creating these initial coalitions is not so easy. Therefore, there is less need to disrupt the status quo and people just continue doing what everybody recognizes as good enough, even though it is not good enough. Somehow these fields have taken for granted that a lower level of operation is sufficient, that there is no need to rock the boat. I think we will see more fields moving in the direction of sharing and trying to really get their work to a higher level.

CG: Speaking of other fields – you already mentioned that genomics is not the only place where things are getting better. What are the other fields that can be considered role models in data sharing?
JI: There are many pockets within biomedicine that are more willing to share data. Some fields traditionally had a lot of resistance to data sharing. For example, in clinical trials it was almost impossible to get hold of data; currently there are several thousand trials by 12 companies that are available to be shared. There is also trial data from NIH funded work. There is also a lot of that from many journals that make it a prerequisite for publication to be willing to share data. We are seeing a changing landscape. Probably changing not as fast as I would wish, but, clearly, I think there is a transformation. There are also many other fields where data sharing has been the norm. Many areas in particle physics have traditionally worked with sharing all the data with a wide community of scientists. There is a lot of collaborative work in astrophysics around the telescope data. Lots of natural sciences are sharing observations about environment, climate, natural phenomena. These are widely available. Among social sciences, psychology has gone through a lot of transformations. There is a lot of movement in imaging studies. Economics has had a pretty long-standing tradition of data sharing which has strengthened over the years – at least for some types of economics. There is a lot of activity – it is not homogeneous. You can have fields that can be very nearby in terms of design and mode of investigation, but one can be very open to sharing while another might be very closed to sharing. One example is observational epidemiology. If you take observational epidemiology of air pollution, data from the largest studies are available, they have been extensively reanalyzed. They have even been extensively reanalyzed by stakeholders who would not necessarily have wished to reach the same conclusions and yet they did reach the same conclusion more or less.
Conversely, data from nutritional epidemiology, studies done in the same department, at the same university, are very rarely shared. The norm is that just the team of the PI and his fellows and his researchers (or perhaps also some other sympathizers) would be the only people who can use the dataset and analyze it. Probably these people are a few yards away in their offices. Nevertheless, they have very different norms of operation and correspondingly different levels of credibility. I think the air pollution research is highly credible and nutritional epidemiology research is dismally incredible.

"We need to be strategic to see what types of datasets we need to liberate and invest for getting those cleaned and available to the community..."

CG: It’s definitely very hard! If I understand correctly, you are saying that influencing the scientific culture is very hard. You can have teams of researchers literally on the same floor talking to each other over coffee every day and still having a drastically different approach to data sharing. Putting aside reality for a moment, what would be in your opinion the ideal landscape of biomedical data sharing and reuse – within a cost-constrained world? If you were to reset the system, how would you set the ratio of data reuse to novel studies as well as data sharing policies?
JI: The basic question is – what is a novel study? If you get a random sample of the biomedical literature, and I bet this probably applies to many other scientific disciplines, the vast majority of studies claim that they have something novel to say. Almost all of them find statistically significant results, and almost all of them say that there is something new about what they convey. If you look more carefully there is really not that much novelty. Novelty is mostly a claim that the current system forces us to state, but I think that really novel and really innovative discoveries are very few. There is probably a large number of other studies that do have an element of novelty, but most research is not really novel. People have been trying to look at more or less similar issues, but instead of doing some proper replication they distort the original plan a little bit or more and they make a claim to novelty which is not really novelty and is not a replication either. So, we get close to nothing on either side which is not very good. Should all data be shared? Probably that is not feasible. I think there are limitations, ethical concerns – especially for data that have been already collected in the past. Trying to get all of them shared against their informed consent is not going to happen. Retroactive sharing may also take enormous resources. We need to be strategic to see what types of datasets we need to liberate and invest for getting those cleaned and available to the community and leave many others to their proprietors and possibly their demise, because data have a half-life and they become dysfunctional or even just disappear. For future studies I think that we should make data sharing the default option and then ask people to come up with an explanation of why they don’t want to widely share the data, what the constraints are and how serious they are. I think that would be my preference.
If we do that I think that we will switch from a situation where only a small minority of data is available to a situation where the large majority of datasets is widely available, even though there will still be exceptions that will be better justified. If we do that the resources required will probably not be that substantial, because in a fragmented environment where we ask investigators to kind of reinvent the wheel each time and come up with a new way to develop a repository or develop a new process for sharing their data this takes far more effort compared to a situation where everything is streamlined. Everyone knows there is a platform. Everything can be harmonized or standardized to be compatible with the platform and the cost and effort will decrease manifold. In genetics this has been far more efficient than asking each team to come up with their own resources.

"Currently, the half-time of datasets is pretty short."

CG: You’ve mentioned the term data half-life a few times. Could you elaborate a bit more on what you mean by that?
JI: I think it is more about data availability. Data disappears. One might even question whether we should maintain all the data that is collected or whether some data should be put to rest. Currently, the half-time of datasets is pretty short. Once an investigator is no longer actively working with them they would be dumping them on some laptop that gets replaced after a few years, then gets lost or does not work any longer. That would be the end of it. Then you need a detective effort or an archeological excavation to determine where the data are. We have seen that repeatedly. Even people who wanted to share the data did not know where the data were. It is not that they did not want to disclose them or that they had caveats that they wanted to keep secret. Data disappeared.

CG: You’ve mentioned that we don’t need to and should not share all of the data, because it is a costly enterprise. How should we decide which datasets should be archived and maintained over the years and which shouldn’t?
JI: That’s a difficult question and I don’t have a good answer to that. I think one needs to ask knowledgeable people within the community that is utilizing these data to get a sense of what is the value of that information. I am sure some wrong decisions will be made unavoidably. I think the default option would be just to save it. It could be useful in the future for some reason that cannot be predicted yet. Probably there are lots of datasets where the mode of data collection, or the realization of the errors that are associated with them, has become so clear that one can reach a conclusion that their value is minimal. If someone was collecting phrenology measurements of the size of bumps of the skull in the 19th century and associating them with personality profiles, would we want to continue saving these data? Maybe we might want to save them from a history of science perspective. Some of that would really be useful to document which stupidities we were following in the past. Currently we are probably also following some stupid leads. It would not be that all of that information would have to be stored, salvaged, curated, and maintained ad infinitum. There has to be a process of decision making. The default should be saving rather than discarding, but if there is a problem, we could just discontinue salvaging some datasets. I think we might make a decision just to get rid of them.

"Many of the policies that have been proposed in the past are more like “wish lists”."

CG: You mentioned some potential policies and defaults that could be enacted. In the very beginning of our conversation, you also talked about incentives – how certain survival pressures in academia were incentivizing people against sharing. What we see right now is that there are some reasonably good policies, for example at the NIH, for data sharing, but they are not being followed that much. I was wondering if you had any ideas of how the system of incentives could be modified from the very top level to improve how much data is being shared.
JI: Incentives are functional when they have teeth in their application. When you know that what has been proclaimed will be done. If there are rewards or penalties related to whether someone is fulfilling or not fulfilling requirements of data sharing, and if these are actually applied, you are likely to see more adoption of these practices. We don’t exactly know what kinds of rewards and/or penalties we need. This is something that we need to study better. Many of the policies that have been proposed in the past are more like “wish lists”. People say that’s a good idea, but I have very limited time, very limited resources, I can move on or go back, why should I do that, I’ll just move on. If there is more than just a wish list I think that we’ll see more data sharing. We have seen this with clinical trials registrations. It wasn’t that all clinical trials were registered, many were, but not all. As we push more for that, we’ll get more information. We will still get registered trials that are not transparent enough on how exactly they were run unless we explicitly say that this is important to have. We are probably going to have clinical trials published without accompanying data unless we explicitly say that we will not publish your paper unless you do this. It’s a trade-off on how you want to push and how you really make something a prerequisite and how much oddity you have in the process of making it a prerequisite.

"Reuse does not mean that it will always be good."

CG: We talked a little bit about the cost of data sharing and I think we both agree it can be substantial. There are other aspects related to data sharing. Shared data can be used for transparency – evaluating if claims from the original study are true by reanalyzing the data with the same question in mind. On the other side data can also be reused to ask new questions. We see more and more datasets that are multidimensional. They include many measures acquired because it was relatively cheap even though they might not relate to a particular primary question. Having all of this data open might change how we do science. It would mean that there would be more potential for data reuse – research without acquiring new data – and less of the kind of research that requires acquiring new data. How do you think this ratio looks right now and where should it be?
JI: I am not sure that there is a right ratio of newly acquired data vs existing data. I think it is a shame not to use data that already exists if that information is sufficient to answer the questions that are deemed to be relevant. I cannot put a number on this – it depends on the type of the question, whether the field is eager to do it. Clearly though the need to obtain new data should not be the default option. Currently it is the default option. In most grant applications there is a sense that research means getting new data. Why? If we already have more than we need why do we need more? I think that increasingly we will find ourselves in a situation where we will have more than we need. Just because so many scientists already collected a lot of data. I think that the default will switch gradually from new data to using the existing data. If you cannot get the answer based on those then think about what new data you need very strategically instead of getting little bits and pieces right and left, multiple proposals, multiple applications. Think strategically – you might need only one dataset instead of fifty. Reuse does not mean that it will always be good. Some practices such as data dredging will be even more notorious, but, provided that it is recognized that it is data dredging, I don’t see much danger in that. If we are clear that that was just exploration – that’s fine. If it’s not clarified that it is an exploration we run a risk of having far more opportunities for undisclosed data dredging. The credibility of many of the secondary analyses may actually be low because their priors may be lower than the original ones. On average probably the original hypotheses have higher priors than the ones that follow. It could be that we will get a tail of very low credibility analyses of people just going through extensive datasets, trying to produce signals, but with very low credibility. All of that needs to be taken into account.
If we know what is the universe of data that is available, how many people can access them, how they are accessing them, we can also get a better sense of the credibility environment. Currently we don’t really know that. Results appear out of nowhere. There is no preregistration, we don’t know which datasets exist, we don’t know if that was one out of five analyses or one out of five million. More transparency will give us a better mapping of the multiplicity and complexity of data analysis protocols in different fields. In that sense I think this is good rather than bad, even though I expect that the credibility of the average product that comes out of these analyses will not be necessarily very high. There is another possibility that asking for data to be routinely available may even have an opposite effect of getting too many results to be reanalyzed and found to be “correct”. A dataset can have many lives, starting from the very early pieces of information that are being collected to how these are composed and how these are cleaned, queried, transformed, normalized, standardized, used and finally shortened to correspond to the analysis that is being presented in a paper. If what is shared is only that final shortened, cleaned version we may have a false sense that everything that has been published out of this dataset has been correct. Someone has already trimmed and distorted the dataset enough that it fits the original publication. This does not mean the original publication was correct. This only means that the same distortion just shaped the “raw data” that are available, which actually are a highly selected version of a much larger universe of data that could be considered at different stages of the project.

CG: So one needs to be careful what is being shared?
JI: Raw data has twenty different generations that could be vastly different. It would be useful to know which generation of the data we are seeing. Is it the very final clean product? Or do we have the full continuity of all the forms it got transformed to?

"I can think (...) of many young scientists that feel overwhelmed trying to generate a new dataset that would be able to compete against existing datasets."

CG: Going back to data reuse vs acquiring new data. I think there is also a human aspect of this issue that concerns the motivations of people entering science. In other words, it could be that a lot of junior scientists might be much more excited about control of a small new study where they acquire the data rather than reusing someone else’s data. Would you agree with that and if so how would you convince a junior scientist to pursue a larger dataset that was acquired by someone else?
JI: Is this true? I don’t know. Are there survey data that have proved this? I can think of the counter example of many young scientists that feel overwhelmed trying to generate a new dataset that would be able to compete against existing datasets. It feels that if there are already datasets with 500 units of information, it makes no sense having a single young scientist with a capacity to generate 2 units of information, being under pressure that you need to generate your own data, rather than work with a larger coalition that already has 500 units of information. I am not sure that this psychologically feels better. I do believe that there is a nice feeling about feeling that you have your own data. Obviously, there is a sense of control that you know that you have control of the steps of data collection. In many fields that I am aware of probably this is a false pressure. For example, epidemiologists very rarely collect their own data. They have research assistants that do that. At least based on my experience, when I tried to see what research assistants were collecting I almost had a stroke, in terms of what I thought they were collecting and what they were actually collecting. I think that many studies in the field are so unreliable just because the senior epidemiologist never really collected the data that they say are their own data. I think that they would be heavily surprised if they really took a closer look at what was being done at their lab. “Who is the data owner?” is questionable. A lot of data are not only collected by other human beings, now they may be collected by robots or by automated processes that someone has set up. The investigator is the first one probably to be hit by that wave of information, but does that mean that it’s her data? If an AI has collected it?

"To see that something clearly saves lives I need to wait 40 years to see that."

CG: This is a great segue to my next question. We are right now in the heart of Silicon Valley and in some of your editorials you argued for a larger involvement of industry in certain scientific endeavors because of better alignment of the structure of incentives. How do you think the data outputs of publicly funded research should influence the industry, especially the AI-driven sector?
JI: It’s a frontier that is largely unknown. I think that I have a biased perspective since my starting point is in biomedicine. In biomedicine, I cherish transparency and the ability to appraise evidence about new drugs, interventions, preventative measures, diagnostic tests, etc. I feel very uncomfortable unless I can see that it really works. There is another tradition that I think is very prevalent in Silicon Valley; it comes more from the information technology type of work, where you don’t need to publish. If something works it works, and you see that it does. You see that it performs better, lasts longer – there is no need to tell competitors how exactly you did it. It’s very obvious that you have done it. These two cultures are not very compatible because in medicine there is no way that I can get the same security. To see that something clearly saves lives I need to wait 40 years to see that. I’ll probably be dead by then. It’s a very different culture and I think we need to find the right recipe depending on what the product is that we want to get. If it’s AI that is related to health, I would like to see very transparently how it works and what it does. Be able to see that in transparent and reproducible research. If it’s something that is about the battery of my mobile phone, I can check on my own that it lasts two hours longer. It’s a very different situation. For anything related to health, I think that transparency, availability of the evidence to be scrutinized is key.

CG: Is the core issue here the fact that the outcomes are so delayed that we have to resort to using proxies that are not very good?
JI: I think the proxies are horrible. It’s not that they are not very good – it’s that we don’t have any good proxies. Actually, the literature on proxies or surrogates shows that in order to even test whether you have a proxy you need to have information on the outcome which might be something like death or some major disease outcome and also have interventions that have markedly affected that outcome already. This way you can go back and say “well if I had used that proxy it would have told me the same thing, it would have picked the right people much in the same way.” We don’t have that. For new technologies that have completely different aims and a completely different philosophy of how we should intervene, it makes things even more difficult. If I had just more of the same, where I have developed proxies for that drug, for that family, I might argue that for something very similar I have that proxy available, and it worked. Here we are talking about completely new, disruptive technologies and approaches. We have no proxy to inform us.

CG: In other words, the usual timescale of biomedical research is not compatible with the ultra high pace of Silicon Valley?
JI: We need the pace of Silicon Valley. This is where some of the great ideas will come from. The phase of testing and reaching a high level of certainty to be able to recommend something very widely is probably not commensurate with the first phase which is the exciting, weird, high-risk idea being proposed. Some people may espouse some of these high-risk ideas very early on, and they may want to use them. There are plenty of incentives in that direction – for licensing and for approval of drugs. People are pushing them out as quickly as possible. There is a risk that these people who endorse them will do worse than the average person who doesn’t. There is even the risk that people who have more information in our times will do worse compared to people who have no information or little information. There is a critical inflection point probably, where if we drop below a certain level of credibility of that information, just giving more of that information to people will make things worse for them. We may even see a reverse inequality where people who are more wealthy and have more access to this type of new Silicon Valley technology maybe are to be pitied compared to people who are “disadvantaged” and will never get that. They would just be flooded with information about what to do and what not to do, how many tests to perform, how much data they need in their lives, etc. In the meantime, others might just be sleeping or having much better things to do.

CG: I guess the argument here is “ignorance is bliss” especially if the opposite is noise?
JI: Ignorance compared to wrong and useless information – yes. I can’t believe that ignorance is better than correct and useful information, but ignorance vs. wrong and useless information…

"We should make sure we maximize the chances of sharing, but still respect people’s privacy."

CG: Very true. Finally, I would like to switch gears to a different, recently debated, topic that has been in biomedicine for ages – patient data privacy. This can be sometimes at odds with data sharing goals – I was wondering if you had some thoughts on how data reuse can be done efficiently while respecting patient privacy.
JI: That means that we need informed consent for prospective research projects that take this into account. We need people to agree that their data will be shared. Currently, we see that private data are being shared without people being asked. We have all these scandals – next door. That’s really scary. We need to respect that, but in principle, if someone is participating in a research project we should be honest that we do that not so they should expect a benefit for themselves, but because they contribute to the community. This has been the standard all along for example for clinical trials. It is wrong to tell people that we have a new drug and we try to get you to the trial because then you will do great. If that’s the case then the trial should not have to be run, because it means that we already know that this drug is really better. The fact that we run the trial means that that person on average has nothing to gain. They may actually have something to lose because they may need to go five times to the clinic while they could’ve spent that time on the beach, they may need some extra involvement, extra testing, extra blood tests etc. The basic principle is that they do that because they want to contribute to knowledge and eventually hopefully help others. If we do this and we ask for their consent to use that information I see no problem. If they say that they don’t want that information used in shared forms we can still explore many options for deidentification and making it usable in some variant. Not for all types of data, but for most types of data. I think there is no block here – it’s an issue of making sure we inform people of what we do. We should make sure we maximize the chances of sharing, but still respect people’s privacy.

CG: I could not agree more. Thank you very much for offering your time to do this!