Episode Transcript
Monica: Welcome to UVA Data Points. I'm your host, Monica Manny. Today we're tackling one of the most pressing challenges in our increasingly data-driven world: the data deluge, the overwhelming flood of information that we generate and record every single day. With us are three experts from the University of Virginia School of Data Science. Phil Bourne, a professor of biomedical engineering and the founding dean of the School of Data Science, is joined by Terence Johnson and Alex Gates, both assistant professors of data science. Together they have been exploring innovative methods to make sense of the vast oceans of data we're all swimming in. This episode unpacks the challenges of the data deluge, what it means for businesses, researchers, and society at large, and explores the strategies we can use to navigate it. How do we make sense of so much information? How do we ensure the ethical use of this data? And what opportunities does this overwhelming flood of data open up for the future?
Phil: Hi, everyone, my name is Phil Bourne. I'm the Stephenson Founding Dean of the School of Data Science here at the University of Virginia and a professor in the Department of Biomedical Engineering at UVA as well.
Terence: I'm Terence Johnson. I'm an assistant professor of data science. I'm an economist, and I'm interested in markets and who gets what and why.
Alex: Hi, I'm Alex Gates. I'm an assistant professor at the School of Data Science, and I study the science of science, or how we leverage really big data to actually understand the processes and mechanisms of doing science.
Terence: So we're here today to talk about data: the problem of data, the future of data, the sustainability of data. It's a thing that's embedded in our devices. It's, you know, the apps we use, the web pages we go to. All of commerce is now driven by it. It's becoming a really big issue, especially in science. And that's where our team of researchers at the School of Data Science has been interested in looking ahead and thinking about what's going to happen. So maybe a good place to start: why is data sustainability a general problem across science and business and government?
Alex: The biggest problem is that we have so much of it. We're now in the situation of the data deluge. In essence, the number of publications, the number of insights, has just been growing exponentially over the last hundred years or so. And it's to the point now where we have so much data that it's often not clear what we should be doing with it, and what data is actually valuable versus what data is just the noise kind of produced by humans. And so we're trying to understand how we can leverage a system designed to extract useful and valuable information from all of that data.
Phil: Fortunately, no one can actually see how old I am on this podcast, but I started my career when essentially there was no digital data. And what's changed? It's just such a dramatic change to the many zettabytes that we have today. But it's the notion that that data comes in all different forms and from all different devices, but also from all different disciplines. And in a way, the field of data science has arisen as a result of all of this, because there's the opportunity to do things across different types of data in ways that we could never have done before, by virtue of how much we have and the different types that we have. So there's certainly lots of impediments as to the quality of data, how well it's represented, and so on. But really the promise is there. And I think what we see every day, and certainly in our School of Data Science, is the ability to work across those various types of data. The question going forward, which is really a large part of what we're going to talk about today, is the sustainability of that data ecosystem.
Terence: Can I tell you a data fairy tale real quick?
Phil: Please.
Terence: So you said zettabyte. So in 2024, there are 150 zettabytes of data.
So that's hard to think about. A zettabyte is a trillion gigabytes. Humans don't like numbers like that, so let's break that down a little. A MacBook. You've seen a MacBook? All my students have them. They have 512 gigabytes of storage. So a zettabyte is 1.8 billion MacBooks. This is like a comically large number. So let's put them on semi-trailers. A standard 53-foot semi-trailer has 6.5 million cubic inches. That's 2.2 million semi-trailers of MacBooks for 150 zettabytes. Okay, it's still too big. A standard container ship can hold 20,000 semi-trailers. That's 110 container ships. So if you imagine those huge ships, that's 110. That's what we have right now. Every month we get another nine container ships of data.
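For readers who want to check the arithmetic, here is a minimal sketch in Python. The 150-zettabyte total, the 512-gigabyte MacBooks, the 6.5-million-cubic-inch trailers, and the 20,000-trailer ships are the figures quoted in the episode; the roughly 46-cubic-inch MacBook volume is an assumption added for the packing step, so the trailer and ship counts come out approximate.

```python
# Back-of-envelope check of the "data fairy tale" figures quoted above.
# The 150 ZB, 512 GB, 6.5M cubic-inch, and 20,000-trailer numbers come from
# the episode; the MacBook volume is an assumed value for the packing step.

ZETTABYTE_GB = 1e12                 # 1 zettabyte is about a trillion gigabytes
total_gb = 150 * ZETTABYTE_GB       # ~150 ZB of data in 2024
macbook_gb = 512                    # storage per MacBook
macbook_cubic_inches = 46           # assumed volume of one laptop
trailer_cubic_inches = 6.5e6        # standard 53-foot semi-trailer
trailers_per_ship = 20_000          # standard container ship

macbooks = total_gb / macbook_gb
trailers = macbooks * macbook_cubic_inches / trailer_cubic_inches
ships = trailers / trailers_per_ship

print(f"MacBooks needed: {macbooks:.2e}")   # ~2.9e11
print(f"Semi-trailers:   {trailers:.2e}")   # ~2e6, in line with "2.2 million"
print(f"Container ships: {ships:.0f}")      # ~100, in line with "110"
```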
Phil: And just to give you a sort of reference point for how things have changed: in 1995, I went to work at the San Diego Supercomputer Center to do computational biology, because I thought that was the best place in the world, that it had the best data storage and the best compute environment. It was run by the National Science Foundation. It was a national resource, with so-called supercomputers. And recently I looked and said, well, how does that relate to what we have today?
The whole of the US scientific community was using this particular resource, and yet it was one-thousandth the power of a single iPhone 16. It's unbelievable what's happened in, you know, less than 30 years.
Terence: And when you talk about the computing power, there's an estimate from Cushman & Wakefield that in 2022 we were using 4.9 gigawatts of electricity globally for data storage and computing, and it went up 50% to 7.4 gigawatts in 2023. In Virginia itself there are 483 data centers, about 3,000 in the US, and about 10,000 globally. So if you think about the trajectory of this, from the starting point you're talking about, not even an iPhone, to what we have now and what's coming, the costs are going to be spectacular. So in Virginia, they think electricity consumption is going to double in the next 15 years, largely because of data centers opening up. In Loudoun County, the rate of growth is increasing. It's not just the stock, it's the flow. This is becoming a serious issue for the environment, for our economy, for science, for business.
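As a quick sanity check on those figures, the year-over-year jump and the growth rate implied by "doubling in 15 years" can be computed directly; this minimal sketch uses only the numbers quoted above.

```python
# Quick arithmetic behind the electricity figures quoted above. The 4.9 GW
# (2022) and 7.4 GW (2023) estimates and the "doubling in 15 years" projection
# are the ones cited in the episode.

gw_2022, gw_2023 = 4.9, 7.4
year_over_year = gw_2023 / gw_2022 - 1
print(f"2022 -> 2023 growth: {year_over_year:.0%}")   # ~51%, the "went up 50%"

# Doubling over 15 years corresponds to this compound annual growth rate:
doubling_years = 15
annual_rate = 2 ** (1 / doubling_years) - 1
print(f"Implied annual growth: {annual_rate:.1%}")    # ~4.7% per year
```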
Alex: Underlying all of this, I think what's really important to recognize is that not all data is created equal. And while we've seen an amazing rise of techniques and methodologies, large language models being one of the preeminent examples, that can process all of this data and come to some structured understanding of what's happening inside of it, we often overlook the fact that those methods rely on the data being clean, accurate, and reliable. There's this fun adage that if you put garbage in, you're going to get garbage out. So we often see that if you just take the raw data points and throw them into any of the fanciest models you could desire, it's not enough. So a lot of what our students are learning in our classrooms right now is how to wrangle that data, clean it, process it, contain it, and create something that's actually usable for doing machine learning on top of it. That's extremely costly and extremely time-consuming to do by hand.
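To make that wrangling step concrete, here is a minimal sketch of the kind of cleaning Alex describes, written in Python with pandas. The column names, unit conventions, and records are hypothetical; the point is only that duplicates, mixed units, and malformed dates have to be resolved before any model sees the data.

```python
# A small, hypothetical example of "garbage in" being cleaned before modeling.
import pandas as pd

raw = pd.DataFrame({
    "patient_id": ["A1", "A1", "B2", "C3"],
    "weight":     ["70 kg", "70 kg", "154 lb", None],   # mixed units, a duplicate, a gap
    "visit_date": ["2023-01-05", "2023-01-05", "01/07/2023", "2023-02-30"],
})

def to_kg(value):
    """Normalize a weight string like '70 kg' or '154 lb' to kilograms."""
    if not isinstance(value, str):
        return None
    number, unit = value.split()
    return round(float(number) * (0.4536 if unit == "lb" else 1.0), 1)

clean = (
    raw.drop_duplicates()                                # repeated records
       .assign(
           weight_kg=lambda d: d["weight"].map(to_kg),   # unify units
           visit_date=lambda d: pd.to_datetime(          # unparseable dates become NaT
               d["visit_date"], errors="coerce"),
       )
       .drop(columns="weight")
)
print(clean)
```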
A lot of the infrastructures that we have in place right now to deal with this data deluge problem and create clean data rely on the equivalent of these human processors and curators to clean and work with that data. And I think that's something that you've seen grow, but are also seeing the stresses of in your work. Right, Phil?
Phil: Yeah, no, that's definitely true. And it leads to the notion of this so-called sustainability paradox. It exists across the board; I particularly work in biological data, where it's certainly prevalent as well. Just to state what that really means: the amount of money to support research, including the data gathering and processing and curation, as you pointed out, Alex, is effectively flat. It's changed, it's gone up over time, but in the last several years it's been flat. It may actually change in one direction or another, most likely down, in the next few years, certainly in the US. But clearly there's a cap on the amount of money, and the amount of money that needs to be spent from that cap to actually deal with the data in the way you've just described is going up.

If that cost goes up, it means the money available for research and innovation is actually going down. On the other hand, you need that data to actually be innovative. So it creates this paradox. And, you know, the three of us and others have been discussing how to address that paradox.
Terence: Can you give me some basic examples of this idea of curation or integration of data? Like, what does that mean? What does it look like?
Phil: I'll give you an example of a data resource, one I worked with for about 14 years, that just changed the whole fabric of certainly the biomedical sciences. And that was the Protein Data Bank, which started in 1971. And it was really there just as somewhere data could actually be put so people could retrieve it. And some of that was actually on paper at the time, because it was the early days of digitization.
But that grew over time. And it became clear that for that data to be the most effective in the way that Alex has just been describing, it needed to be curated. That meant it needed to be collected, because it was coming from labs all over the world, and essentially unified, so that the representation of the data one lab was providing was the same as what another lab was providing, so it could be compared. And that wasn't the case initially, because when people started to get interested in comparing these different data sets from different labs, they realized, wait a minute, we're not doing a very good job of this. It's higgledy-piggledy, it's a mishmash. So then consistent rules were put in place, and scientific societies got involved in defining so-called metadata standards, which describe how to represent the data about the data.
And it became much more unified, and as a result of that became much more usable by a worldwide community. It's important to say that the whole time this was going on, the public, through taxation and governments, was paying for it, but it was free for absolutely everybody in the world to use. That actually creates an inequality between who's actually paying to produce the data and who's actually using it. That's a whole other issue that actually ties into sustainability in some ways.
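To illustrate what that unification step looks like in practice, here is a toy sketch in Python: two labs deposit the "same" kind of record under different field names, and a shared schema maps them into one comparable form. The field names, lab names, and values are all hypothetical, not taken from the Protein Data Bank's actual formats.

```python
# Toy illustration of curation: map lab-specific field names onto a shared
# metadata schema so records from different labs can be compared.

FIELD_MAPS = {
    "lab_alpha": {"prot_name": "protein_name", "res_A": "resolution_angstrom",
                  "dep_date": "deposition_date"},
    "lab_beta":  {"protein": "protein_name", "resolution": "resolution_angstrom",
                  "date": "deposition_date"},
}

def harmonize(record: dict, lab: str) -> dict:
    """Rename a lab-specific record into the shared metadata schema."""
    mapping = FIELD_MAPS[lab]
    return {mapping[key]: value for key, value in record.items() if key in mapping}

deposits = [
    ("lab_alpha", {"prot_name": "hemoglobin", "res_A": 1.8, "dep_date": "1999-04-01"}),
    ("lab_beta",  {"protein": "myoglobin", "resolution": 2.0, "date": "2001-10-12"}),
]

for lab, record in deposits:
    print(harmonize(record, lab))   # every record now uses the same keys
```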
Terence: And this isn't just something that happens in biodata, right? So if you think about the United States, there are all these police departments and municipalities that collect, for example, crime data, and they do it in different ways. There are hospitals that measure patient outcomes differently. You need to harmonize all this information or there's going to be no way to put it together and see the big picture. So from that perspective, what's the future of science here? Why is this so important?
Phil: Well, all these things are best told by examples. So what comes out of all of this are situations like one in a story I've told a number of times, because I think it's very representative of what we're trying to do in data science. It's when you've got the ability to use this curated data and put it together, because there is something in that data that allows a cross-reference between quite disparate forms of data that people wouldn't have even thought about putting together. I'll give you an example that I find very compelling. You mentioned hospital data, so you've got patient data. There are lots of issues, obviously, about privacy, but you can have de-identified data to study certain types of problems in research.
And to a certain degree that is still nowhere near where it should be. But there are notions of codes, which frankly were originally developed for billing, but are actually quite effective at describing data, for example, different kinds of disease conditions. So that's curated data at some level. I was approached at one point by an ER doc, an emergency room doc, who said, I've noticed that when we have trauma patients who come in who have had car accidents, when they recover, the kind of accident they had and the kind of internal injuries they have are actually correlated to some degree. I got interested enough in this that I went to the Department of Motor Vehicles, where I could get public crash data, and I started to correlate that crash data with the patient record. This had hardly been done before by anyone. Two very disparate forms of data. But there was a connection, by virtue of information about the type of car that was in the accident and its number plate, and that could then ultimately be tied to a person. And then you can look at this correlation, and it turns out the correlation appears to be real. Why does this matter to anyone? It matters because people die in the scanner. Someone comes into the ER, they put them in a full-body scanner because they don't know what else to do with them, and they look for internal injuries. And some people die in the scanner. If you knew, by virtue of this correlation between the kind of accident and the kind of internal injury, you could look at that spot first on the patient, treat that first, and hopefully save lives. And that's now an ongoing activity whereby the first responders tell the emergency room what the kind of crash has been, rear-ending, sideswipe, head-on, this kind of thing, and hopefully it saves lives. So this is data science. This is the value of data. You've got two very disparate types of data, transport data and health data. You put them together, and you've created and done something that's for societal benefit. That's what we're all about.
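A schematic sketch of that kind of linkage, in Python, is below. The table layouts, field names, and the hashed key are hypothetical, and a real linkage would run under strict privacy and review controls; the hash simply stands in for the de-identification step, and the plate normalization hints at the format differences Alex raises next.

```python
# Hypothetical sketch: join crash records to de-identified ER records through
# a hashed, normalized plate so neither table retains the raw identifier.
import hashlib
import pandas as pd

def linkage_key(plate: str, salt: str = "demo-salt") -> str:
    """Stand-in for de-identification: hash a normalized plate string."""
    normalized = plate.replace(" ", "").replace("-", "").upper()
    return hashlib.sha256((salt + normalized).encode()).hexdigest()[:12]

crashes = pd.DataFrame({
    "plate": ["ABC 1234", "xyz-9876"],
    "crash_type": ["rear-end", "sideswipe"],
})
er_visits = pd.DataFrame({
    "plate": ["ABC1234", "XYZ 9876"],
    "injury": ["spinal compression", "splenic rupture"],
})

for frame in (crashes, er_visits):
    frame["key"] = frame.pop("plate").map(linkage_key)

linked = crashes.merge(er_visits, on="key")
print(linked[["crash_type", "injury"]])   # the crash-type / injury pairs to correlate
```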
Alex: One of the big problems is that every different community has different standards for how they should be representing their data. Just take, for example, the license plate information that Phil was mentioning. In the United States we have one structure of license plates. In the EU, it's a completely different structure of license plates. So if you wanted to merge EU data with data from the US, you would have to actually understand how to make the translation between them to understand something about the cars that they were attached to. And that type of work, that type of translational alignment and curation of different types of data, takes a lot of effort and energy. And so making those connections is really the problem that we're thinking about.
Terence: And that's where the economics is useful for thinking about this question. There are different kinds of goods. One would be a private good, where consumption is a zero-sum game. Imagine an apple: I can eat it, or you can eat it, or we can split it in half, but my consumption takes away from yours. Then there's a public good, where consumption, we'd say, is non-rival. The classic example is national defense: we have a military, it protects the country, everybody benefits from that. So public goods have this problem where they're typically under-provided. If you imagine roads, you benefit from paving a road, but you don't want to pave your road to benefit everybody else in town. Or like the coffee maker at work: there's somebody who loves the coffee maker, they clean it, they make a pot of coffee, and then you have the coffee vultures who swoop in and get their cup. But they're never going to make the coffee.
Phil: Right.
Terence: Public goods are typically under-provided when private individuals are responsible. That's why we have government intervention for things like libraries, things like roads, things like education. We think those have to be subsidized, and we have to ensure that the right level is provided for society. So, you know, you have this private good, public good issue. I think it's helpful for thinking about data, because it's a public good. There's so much of it. The more it's integrated, the more we benefit. But no single person really has the incentives to solve these problems. They're too big, they're too complex for anyone. You know, it's a hundred-and-something zettabytes. I can't download that. I can't make, you know, some sort of categorization of all of the variables that are out there. So we need to think about coordination. We need to think about mechanisms by which government, academia, and business work together to build this kind of intellectual infrastructure to enable us to use this data in the ways that you're describing. So what does that look like? What's the future of how data is curated, processed, and made available?
Phil: Yeah, it's a very good question. And it doesn't have, at this point, at least in my mind, a clear answer. We have moves by government to actually rationalize data and also to make it a public good, as government should. There have been various efforts in different administrations to do this. The problem, of course, which leads to this issue of sustainability, is that it's in large part been an unfunded mandate. That's to say, academic researchers are increasingly required to make the data that they produce available to everybody, but the cost of doing that is not necessarily well embodied in the system. So that creates a mismatch right there, which we've yet to really solve, in part because until recently that data was not perceived to have any value, because it's actually free to everybody.
On the other hand, look at what's going on in the private sector. When I was the chief data officer of the National Institutes of Health, I had a meeting with a whole set of data officers from companies like Airbnb, and I asked the chief data officer of Airbnb, how do you decide what data to throw away? He said, we've never thrown away one byte, let alone one zettabyte. We've never thrown away one byte. Our job is to figure out how to monetize that data. So you've got these different kinds of business models, and you're trying to bring them together. And at present, that doesn't seem to be working in any significant form that I'm particularly aware of. But I think it's something that's going to have to happen. And the work we've been doing together, which really brought us to this podcast, was an effort to at least begin to think about that problem.
Alex: And I think the problem is also better understood if you contextualize it within the incentives that exist currently in the scientific ecosystem. So right now, the primary currency of scientists is our ideas, the insights that we're able to draw about the world around us. And the way that we communicate those is primarily through publications and conference proceedings or talks. So there's a lot of infrastructure built up around how we share our ideas through these publications. And actually, we've taken that to great advantage to be able to understand how scientists communicate and exchange information. But what's lacking right now is a recognition that the data that we produce is actually valuable in and of itself. Too often it was treated as a byproduct of the scientific process, a necessity in order to create the idea or get the insight. But once you had it, it was the insight that was important. And recently we're beginning to recognize that that data itself has a lot of value. And so we need an infrastructure that's in place to recognize and value the data that we're producing along the way. And currently we don't have much of that infrastructure. There are a few lone efforts, like the biobank or the protein structure database, but those efforts were more collectives around people trying to share information in very limited ways without necessarily facilitating external exchanges. And so what we need now is a transformation from infrastructures that are in place to share the ideas and the publications to also sharing the data and the models and the mathematics that underlie where those insights are coming from.
Terence: Yeah. So let me ask you a question. Imagine I have some great data, and I give my talk. You come up to me after the talk and you say, can I have your data? I'm probably going to say no, right? I mean, we like to have a culture of openness. But the truth is that if people have exerted a lot of effort to gather data, they want to milk all of the value out of it before it becomes public domain. How do you fix that? How do you make it so that people can be the source of data but also receive credit for the value they're generating for other people?
Phil: Well, first of all, I would take a little issue with the notion that what you just described is global across all disciplines. I think there are just different cultures that exist.
You're an economics person, I'm a biologist. And it's true that even in biology, as soon as you drift into medicine, there's a completely different viewpoint about who owns the data and who's going to share the data. But everything around things like the human genome all began because of shared data. There was an agreement that the moment that data was collected, it was going to be shared. And over a period that's now 20 or more years, that's really created a culture of data sharing. But it's definitely not prevalent across the whole enterprise, that's for sure. At the moment, within the University of Virginia, we're actually running a survey across the university to determine how people think about their data and about the openness of their data. And we're going to get some very different viewpoints.

So I think that's an important factor to add into this.
Terence: So that's a great place to start to think about it. Why did the human genome, why was that so successful in changing norms in biodata? Why did the Human Genome Project create this cultural change? What was it at that moment in time that allowed this big shift in institutions?
Phil: It was essentially the National Institutes of Health, the NIH, saying: you don't share your data, you're not getting any money.
It seemed to work pretty well. I'm being a bit flippant there, but that was certainly a driver. But once the data was being collected, because this project at one point was going on in 23 different labs, effectively for 23 different chromosomes, no one lab was collecting all of it. If they didn't share it, they weren't going to get to the end goal. And there was an end goal there.

It was the biology moonshot: let's create the generic blueprint of a human being. And everybody wanted that. So there were incentives, because no lab was going to be able to do it on their own.
Alex: Part of what made that work was that the data was recognized as the endpoint and the goal in and of itself, and the funding was contingent on the data being open and available. I think that a lot of our other systems and infrastructures don't necessarily value the data in the same way. And so part of the recognition is this philosophical movement from data being the byproduct of science to data being one of the end goals of science. And I think that is a very tricky transition to make. So again, thinking about scientists and the incentives that we have for sharing data, one of the big questions is, is there still value in my data for myself? And if there's value of ideas still to be had, why would I give them to you rather than keep them for myself? That problem is one that right now we don't necessarily have the right incentivization mechanisms to overcome on a grand scale in every discipline. And part of what we like to think about here in this group is how we can structure an environment in which a scientist recognizes that there might still be some value in my data for me, but it's still more valuable for my career or my prestige to share that data with everybody else.
Phil: I mean, we need new mechanisms for providing credit as data gets used and reused and reused again, so that the original producers of that data still get some credit, even though it's several steps down a pipeline. The three of us are data scientists. We're in the business of the science of data. What struck me when we were beginning the school was: how could we possibly not share data as a group of researchers? Because effectively we only exist because we were using data that other people were willing to share. And so the idea that we wouldn't make available whatever we produced, the data we produced by aggregating various other forms of public and open data, and that's without even talking about the methods we use and everything else, would be just plain wrong. So as a result of that, and it's a form of incentive, we have an open data policy within the school that all faculty sign on to and essentially agree to make all of their data available.
And that got some national attention. And I think as data science grows and grows across the world, it sort of creates its own little culture for doing this. And, you know, I'm very proud of the role we played in the beginning. And now, you know, it's certainly prevalent within data science initiatives globally.
Terence: As we wrap up, what are some of the specific or practical interventions or ideas or institutions that you think could make a difference in addressing some of these problems in the future?
Alex: Yeah, so I like to view this problem through two different lenses. There are emergent solutions, where you incentivize individual researchers to all act in different ways and let the system coalesce into some emergent property that creates a better environment for everybody. And then there are institutionalized solutions, where an omniscient presence, or perhaps a government, which are very different things, comes in and says, these are going to be the new incentivization structures and we want you all to follow these behaviors. So I'll talk a little bit about some of the work we've been doing, thinking about how to facilitate more emergent behaviors that align with a data-for-good type of environment.
One of the things we've been thinking about is how you give credit for the use and reuse of data in the scientific ecosystem. The primary way scientists give credit to each other right now is through a form of recognition called a citation, in which one scientist writes a publication, and in that publication they have a reference that says, I'm going to reference somebody else's publication that greatly influenced my idea, that allowed me to build upon it and create some new idea.
One of the solutions that we have is to recognize data in the same way that we would recognize publications, by giving it a reference inside of the publication that you write. This is typically done by granting the data something called a digital object identifier, a DOI. And if you give a data set a DOI, it now is an entity that can be recognized in the scientific ecosystem in much the same way that the ecosystem facilitates the exchange of ideas through publications.
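As a small illustration of what that looks like downstream, here is a sketch in Python that renders invented dataset metadata, including a placeholder DOI, as a reference-list entry. The names, repository, and DOI are made up, and the format only loosely follows common data-citation style.

```python
# Treating a data set as a citable object once it has a DOI. All metadata
# below is invented for illustration; the DOI is a placeholder, not real.

def format_data_citation(meta: dict) -> str:
    """Render dataset metadata as a reference-list entry."""
    return (f"{meta['creators']} ({meta['year']}). {meta['title']} "
            f"[Data set]. {meta['repository']}. https://doi.org/{meta['doi']}")

example_dataset = {
    "creators": "Doe, J., & Roe, R.",
    "year": 2024,
    "title": "Example curated crash-injury linkage data",
    "repository": "Hypothetical University Data Repository",
    "doi": "10.1234/example.5678",
}

print(format_data_citation(example_dataset))
# An entry like this can sit in a bibliography next to article citations,
# which is what lets the ecosystem count reuse of the data set.
```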
The problem is that that's not very widespread. And so we've also been thinking about how to go back through the historical literature to identify when different data sets played a prominent role in different scientific ideas, and to be able to give credit to those data sets after the fact. But this is very challenging, because everybody has different names for data. They refer to it in different ways, and there's not a standardized methodology for stating and crediting that you've used a particular data set. So again, one of the things we've been thinking about is how you give credit to that data, to allow emergent rules to evolve that facilitate better data sharing and usage. Another thing we've been thinking about is how to create more institutionalized incentives to facilitate that data sharing. And Phil has had some great insights on how to restructure some of the markets toward that end.
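A deliberately naive sketch of that retroactive matching problem is below: scan text for known aliases of a data set and report the canonical name. The alias table and text snippets are invented, and real efforts need fuzzy matching and disambiguation, precisely because the same resource is named in so many different ways.

```python
# Naive alias matching for dataset mentions in paper text (illustrative only).
import re

DATASET_ALIASES = {
    "Protein Data Bank": ["protein data bank", "pdb"],
    "UK Biobank": ["uk biobank", "biobank cohort"],
}

def find_dataset_mentions(text: str) -> set[str]:
    """Return canonical names of known data sets mentioned in the text."""
    lowered = text.lower()
    found = set()
    for canonical, aliases in DATASET_ALIASES.items():
        if any(re.search(rf"\b{re.escape(alias)}\b", lowered) for alias in aliases):
            found.add(canonical)
    return found

snippets = [
    "We retrieved 1,200 structures from the PDB and compared binding sites.",
    "Genotype data were drawn from the UK Biobank cohort.",
    "We propose a new attention mechanism.",
]
for text in snippets:
    print(find_dataset_mentions(text) or "no known data set mentioned")
```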
Phil: Yeah, so, restructuring markets. Terry's really going to like me saying this, because in my mind it all comes down to money. What I've come to is the notion that, well, what Alex is describing is really a way of identifying data and then, through its identification, coming up with means to actually value it. And that's absolutely essential. But another part of that is to assign value to data explicitly, by virtue of something we've been talking about internally called data credits. Essentially you trade credits in the system, and you do that across the whole enterprise. So it's not just what we do in academia; you're actually trading credits with the private sector as well, so you get engagement across the whole enterprise. And credits could be traded not just for data itself, but also for services that relate to that data. And it has the potential to create an ecosystem that we believe, at some level, is more sustainable.
That's the good news. The bad news is that to do this will require a complete cultural change, because many fields in academia feel that the data should always be free. If you start placing monetary value on it, it becomes, in many people's minds, tainted.

And in my mind, we have to get over that and realize that the economics of the system have to be, ultimately, what drive the system.
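To make the data-credit idea Phil describes slightly more tangible, here is a toy model in Python of credits being earned for depositing or curating data and spent on access. The rules, amounts, and member names are arbitrary illustrations, not a description of any actual system the speakers are proposing.

```python
# Toy ledger for the "data credits" idea: earn credits by contributing data or
# curation effort, spend them to access aggregated data or services.

class DataCreditLedger:
    def __init__(self):
        self.balances: dict[str, int] = {}

    def earn(self, member: str, credits: int, reason: str) -> None:
        self.balances[member] = self.balances.get(member, 0) + credits
        print(f"{member} +{credits} ({reason})")

    def spend(self, member: str, credits: int, reason: str) -> bool:
        if self.balances.get(member, 0) < credits:
            print(f"{member} lacks credits for: {reason}")
            return False
        self.balances[member] -= credits
        print(f"{member} -{credits} ({reason})")
        return True

ledger = DataCreditLedger()
ledger.earn("small_lab", 30, "deposited a curated data set")
ledger.earn("small_lab", 20, "curation work on a shared resource")  # time, not cash
ledger.earn("company_x", 100, "purchased credits")                  # deep pockets buy in
ledger.spend("small_lab", 40, "access to an aggregated data product")
print(ledger.balances)
```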
Alex: Let me play devil's advocate on that. Part of the pushback against creating these types of data credit structures is equitability of access.

The value of data in small chunks is much less than what happens when you begin to aggregate a lot of these clean, good data sets, align them, and create much larger data sets. But of course, in a data credit ecosystem, that would require more financial infrastructure and resources in order to be able to do those aggregations, making it accessible only to an elite few who have those resources.

Part of what I think makes these problems so fun is thinking about how you facilitate and promote the behaviors that you want, to aggregate the data and bring the resources together, but make it equitably accessible to everybody, so that we all have a chance to leverage the power that's hidden within that data.
Phil: I'm not going to push back on you, because as you're describing that, in a way it's exactly how Google, Meta, and others became what they became, these enormously wealthy and powerful enterprises. We have to be careful that that doesn't happen in pockets in academia, where you actually create great disparity across the academic enterprise within the broader system. So I accept that fully. I think there is an opportunity in having credits. If you look at what happens now, in fact, there is a disparity anyway, because most of the data resources are kept at the elite universities, because they have more money to drive the system in the first place. In the credit model, as we've described it, there is at least the opportunity for others to play within that sandbox. So it'll be interesting to see how all this emerges over time. But it's both exciting and daunting.
Terence: If you think about how people create value in systems, the reality is that what we're talking about is someone sitting there with a data set and basically tagging it and standardizing it, changing the units and changing the names of variables. It's extremely tedious work, but it's incredibly valuable work. So even if I have the smallest lab in the world, I can still devote some of my time to that. And if that gets me more access to more data, I'm willing to do it. So I don't necessarily have to sacrifice cash; I can sacrifice time. And that's the core of the hardest choices we make in life: not necessarily about savings or something like that, but how we choose what to do with our time. It's the scarcest resource we actually have. So if you have deep pockets, you can jump ahead in line and you can pay money. But where does that money go? Well, we can use it to expand the capacity of these smaller labs, or more productive labs, by rewarding them for their contributions. So in thinking about how different parties create value and how to clear the market, in the sense that, okay, this is the amount of money coming in, this is the amount of curation that's getting done, how can we make sure that everybody's benefiting and the costs and benefits are fairly distributed across the system? I think that's where we'd like to go: maybe a market is not the right word, but more of a system for global collaboration.
Phil: It's really interesting that you introduce time as an element in this whole model, which I think is really important. Having said that, I think we're out of time.
Alex: So thank you all so much. This has been so much fun to talk about the science of data at the School of Data Science at the University of Virginia.
Phil: Hear, hear.
Monica: Thanks for listening to this episode of Data Points. More information can be found online at datascience.virginia.edu. And if you're enjoying UVA Data Points, be sure to give us a rating and review wherever you listen to podcasts. We'll be back soon with another conversation about the world of data science.