Exploring the Protein Universe via AI

UVA Data Points | April 23, 2025 | 00:30:58
Show Notes

Here we explore how data science is revolutionizing our understanding of protein structures, with a special focus on the exciting developments in protein folding and evolution. We’re joined by two experts in the field: Philip Bourne, the founding dean of the UVA School of Data Science, and Cam Mura, a biomolecular data scientist. From new tools like DeepUrfold to the future of biomedical applications, Bourne and Mura provide a unique look into how cutting-edge technology is transforming the world of molecular biology.


Episode Transcript

Monica Manney: Welcome to UVA Data Points. I'm your host, Monica Manney. In today's conversation, we're joined by two experts in the field: Philip Bourne, the founding dean of the UVA School of Data Science, and Cam Mura, a biomolecular data scientist. Together they explore how data science is revolutionizing our understanding of protein structures, with a special focus on the exciting developments in protein folding and evolution. From new tools like DeepUrfold to the future of biomedical applications, Bourne and Mura provide a unique look into how cutting-edge technology is transforming the world of molecular biology.

Cam: I'm Cam Mura, senior scientist working with Phil Bourne in the School of Data Science. I guess I would call myself a biomolecular data scientist now, or a computational biologist, or a structural biologist; these are all roughly overlapping descriptors of what I do. My background was in chemistry and math as an undergraduate at Georgia Tech. Then I got into structural biology and biophysics because I was interested in the things that make up life. Proteins and their structure in particular enthralled me. I followed that route, eventually went more into computational biophysics to look at protein dynamics, and then came to UVA. I've always maintained an interest in structural, computational biology that morphed into bioinformatics, and data science now presents a phenomenal new playground of approaches and methods, with unique power for structural biology. So we're not really sure what to call it: molecular data science, or biomolecular data science.

Phil: I'm Phil Bourne. I'm the founding dean of the School of Data Science here at the University of Virginia. But today I'm really talking about something completely different, as Monty Python would say, and that's my lifelong work in protein structure and its implications for biological function and disease and so forth. So, Cam, why don't you give us a quick rundown of proteins, and then why structure is important and what that leads to?

Cam: Imagine your body, and imagine the cells that comprise your body. About two-thirds, maybe upwards of two-thirds, of that is water. But of the rest, the dry mass as it's called, a little over half is actually proteinaceous. The remainder is a mishmash of DNA, RNA, membranes, things like that. So really, the vast extent of your cellular life is these proteins. And the reason is that proteins are the doers, the actors, the active agents inside a cell. Think of enzymes, or receptors binding an insulin or hormone molecule, or molecular switches, in both normal healthy states and disease states. In cancer, for example, you have cytokines and other proteins that go awry, and the signaling pathways go awry. Malfunctions or dysregulation of proteins really underlie virtually every disease state and, conversely, our healthy states. Living systems as we know them, even the simplest tiny viruses, are made of proteins.

Phil: So let me give you a slightly different, more chemistry-based view of proteins, just to set the stage for the discussion we're going to have, partly because both of us have a background in chemistry. My PhD is in chemistry.
It was such a long time ago.

Cam: Reformed chemists.

Phil: Reformed chemists. Such a long time ago. But, you know, as we learned in high school biology, proteins are the building blocks of life, as you've just described. And if you look at what a building block looks like, you've essentially got 20 different amino acids, which for our purposes are just 20 small chemicals that chain together to form the protein. A typical protein might be 300 of these amino acids, and I like to think of them as beads on a string: 20 different colored beads, on a string 300 beads long. So how many possible arrangements are there? Well, there are 20 to the power of 300, which is more than the number of atoms in the universe. It's a huge space; we're going to get to protein space in a minute. What nature has done, really, is occupy only a small part of that space so far. And here is the really important part of what we do: the biological action of those proteins comes from how they fold up into a three-dimensional shape. You mentioned enzymes, and there's a sort of lock and key in how an enzyme works. That lock and that key are essentially three-dimensional shapes. So it's really the essence of that that drives our thinking and the majority of the research we do in this area. Given that, say a little more, from your point of view, about this protein fold space, the space that proteins occupy in three dimensions.
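To put Phil's bead-counting argument in perspective, here is a quick back-of-the-envelope calculation in Python (a rough sketch; the 10^80 figure for atoms in the observable universe is a common order-of-magnitude estimate, not a number from the episode):

```python
import math

# Back-of-the-envelope: the size of protein sequence space.
ALPHABET = 20   # the 20 standard amino acids ("bead colors")
LENGTH = 300    # beads on the string: a typical protein length

log10_sequences = LENGTH * math.log10(ALPHABET)   # log10(20^300)
print(f"possible sequences: ~10^{log10_sequences:.0f}")   # ~10^390

# A common estimate for atoms in the observable universe is ~10^80,
# so sequence space exceeds it by roughly 310 orders of magnitude.
print(f"vs ~10^80 atoms: 10^{log10_sequences - 80:.0f} times larger")
```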
Cam: Yeah. What Phil just described really sets the stage for this idea of fold space. In particular, imagine a 300-amino-acid, 300-residue protein; then you have 20 to the 300 possible discrete sequences. Those sequences are easy to understand because they really are discrete entities: you can write them down as one-letter codes, right? But structure is a little more of a slippery beast to wrap your head around. You can imagine a structure like a house or a condominium or a skyscraper. These are structures, and they're very static. But the way proteins execute their function can be partly scaffolding and partly dynamic; a mobile home, say, is a structure that's dynamic. There's a lot to a structure, and it's harder to describe. We can't just write it down like the string of letters in an English word or paragraph. So trying to understand structure is a game of abstracting away the details and focusing on the shape of the thing. You can imagine a zigzag, which would be a beta sheet, and a spiral, loopy staircase, which would be an alpha helix. These are the elements of protein structure, the two building blocks, and what matters is how they combine in three-dimensional space, tethered together from one end of the protein to the other. To see that, imagine the chain is just a floppy piece of overcooked spaghetti. That thing then somehow folds; the "somehow" is actually thermodynamics and chemistry and statistical mechanics, all of which cause the protein to adopt a folded structure. Now we can try to categorize and classify that fold: this is fold A, this is fold B, this is fold C, on and on, and before you know it we're into the thousands of folds.

But now I'm going to say something that maybe sounds like it's undercutting everything. How do we really define a fold? How do we define it quantitatively, rigorously, algorithmically, in the spirit of data science? Let's say we can do that. Then we can classify the folds. And how are they related to one another? Do we put them near each other in some sort of proverbial space? For example, imagine a space of fruits. You have oranges and apples, and within apples you might have McIntosh apples and Granny Smiths. In this space of fruits, the apples would be clustered near one another, the orange would be a little more distant, the grapes more distant still, the bananas somewhere out there, et cetera. You can try to come up with some type of metric, basically a distance measure, to impart on that collection of fruits some type of ordering or structure in a high-dimensional, abstract space. And that's really what fold space is. It's the same idea: insulin might be clustered in fold space, based on its shape, close to other proteins with roughly similar shapes, as opposed to a globin, like the hemoglobin that carries oxygen in your blood, which belongs to the globin superfamily of folds.
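Cam's fruit analogy amounts to defining feature vectors and a distance metric over them. Here is a minimal sketch of that idea; the features and all numeric values are invented purely for illustration:

```python
import numpy as np

# Toy "fruit space": each item is a feature vector, and a metric
# imposes structure on the collection. All values are made up.
features = {                      # (size_cm, sweetness, roundness)
    "mcintosh_apple": np.array([7.5, 6.0, 0.9]),
    "granny_smith":   np.array([7.0, 4.5, 0.9]),
    "orange":         np.array([8.0, 5.5, 1.0]),
    "banana":         np.array([19.0, 7.0, 0.2]),
}

def distance(a, b):
    """Euclidean distance; any valid metric imposes an ordering."""
    return float(np.linalg.norm(features[a] - features[b]))

for other in ["granny_smith", "orange", "banana"]:
    print(f"mcintosh_apple -> {other}: "
          f"{distance('mcintosh_apple', other):.2f}")

# The two apples land near each other and the banana sits far away.
# Fold space applies the same idea to proteins, with richer features.
```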
Phil: So what you've described, by characterizing these folds into different buckets, is an act of reductionism. And that's been part of the scientific method for many, many years. Well, since the beginning. It's how we wrap our human brains around a large collection of objects that we want to describe. What we've come up with, really, what you and others before you in the lab have come up with, is the recognition that the problem is that something doesn't just sit in one bucket. You could say, well, it sort of fits in this bucket, but then it also sort of fits in this other bucket. So the classification system, the reductionism, is not perfect, but we need it as humans. What we've now encountered, and we'll get to this in a second, is that through new AI technologies we don't need that reductionism. AI algorithms, the networks and what they're trained on, can look at literally billions of features at any one time and make sense of them in a way that no human can. And when you start to do that, it has multiple connotations. One is, of course, that we've now got to a point where we can accurately predict these structures from the one-dimensional sequence. In fact, three scientists shared the Nobel Prize just this past year, in large part for something called AlphaFold, which gives us accurate three-dimensional structures of these proteins. Suddenly, where we had on the order of 250,000 experimental structures against 20 million or more sequences, we now have 20 million or more structures, because we can make these predictions. So we've begun to fill in far more of the fold space you've described than we ever did before. Then the question is, what are we seeing? Based on the recent work you've done with Eli Draizen and Stella Veretnik, what have we actually discovered as a result of all this?

Cam: It's an amazing era to be alive in, because what Phil just described has basically turned structural biology into a data science: a burgeoning explosion in the number of structures that are available. Thirty or forty years ago it was technologies like PCR, the polymerase chain reaction, which we've all heard about, that turned sequences into genomics. Now, by virtue of AI-driven methods like AlphaFold, three-dimensional structure is a commodity. So imagine you have this high-dimensional, thousands-of-dimensions space, sparsely populated with sample points: here's protein A, protein B, protein C, et cetera. There are necessarily going to be large gaps in there, because it's sparsely populated, and there's a whole curse of dimensionality as you go up into higher-dimensional spaces; the problem grows geometrically with the dimensionality. And the sparseness you see: is that because the space truly is discontinuous, or is it actually continuous? This is one of the long-standing issues with fold space, discrete versus continuous. For many years, scientists thought it was quite discrete: there's this island, this tight cluster of protein folds, and then another one over here, and another over there, without much of anything in between. Or is that really a sampling limitation? Are those punctate islands in fold space, without a lot of intermediaries, just an issue of not having many examples? Now we do have them. We've gone literally two or three orders of magnitude denser in sampling thanks to AlphaFold and its ilk. And what that tells us is reflected in the title of the paper we published relatively recently: "Deep Generative Models of Protein Structure Uncover Distant Relationships Across a Continuous Fold Space." The key part is that last phrase. In other words, protein fold space is a lot more continuous, by which we mean you can hop from protein A in one part of fold space to protein B via a series of intermediate states, protein folds that really do exist. There's a lot more of a hybrid nature to it. Another way of thinking about this: if you take the orange and the McIntosh apple and meld them, you get some hybrid intermediate that has features of both. And that intermediacy could be something structural and simple, or something more subtle in terms of features. When I say features of both, they don't need to be purely geometric; they can be what we call physicochemical, or biophysical. For example, the charge, the balance of positive and negative electrostatic charges on the protein, might be intermediate. There could be other physical properties that dictate function that are intermediate, because of the way we formulated the problem. If you view this from the perspective of machine learning, there's always the issue of featurizing your input data set: how do you embed or capture the features? Length, mass, color, opacity: these are things we can think of as features for macroscopic objects. But for protein-sized things, what are the features? An obvious one is the structure itself, the XYZ Cartesian coordinates that define the structure. On top of that, we have the electrostatic charge, and geometric properties like the concavity or convexity of the surface, which influence things like the ability to bind a drug compound or another protein in an interaction. To make a long story short, you can imagine at least dozens of these features, even the phylogenetic conservation of the amino acids, and you can inject all of them into this featurization pipeline. What you get is what we call an amalgamated representation of the protein. It's not just pure, strict, simple Cartesian geometry, but all of these features in the bag too. And now, thanks to AI-based methods, deep generative modeling and so on, we can start to compare proteins in that way far more powerfully.
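The featurization step Cam describes can be pictured as stacking heterogeneous per-residue properties into a single array. A schematic sketch follows; the feature names come from the conversation, but the shapes and random placeholder values are invented, and a real pipeline (such as the one behind DeepUrfold) is far richer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_residues = 300   # the typical protein length from the discussion

# Schematic per-residue features (random placeholders, not real data):
xyz          = rng.normal(size=(n_residues, 3))  # Cartesian coordinates
charge       = rng.normal(size=(n_residues, 1))  # electrostatic charge
concavity    = rng.random(size=(n_residues, 1))  # surface geometry
conservation = rng.random(size=(n_residues, 1))  # phylogenetic conservation

# The "amalgamated representation": geometry plus biophysical and
# evolutionary properties, concatenated into one feature matrix that
# a deep generative model can consume.
protein = np.concatenate([xyz, charge, concavity, conservation], axis=1)
print(protein.shape)   # (300, 6)
```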
Phil: It definitely is a whole new generation of capability through these tools. So what have we learned from that?

Cam: Interestingly, that proteins are distributed in a far more continuous manner across fold space, with elements of similarity, both the structural kind you can identify just by looking and subtler kinds.

Phil: Is that where the Urfold comes in?

Cam: Exactly, and I'm glad you asked. If you rewind the clock three or four years, to before we designed, implemented, and turned loose this deep learning model, we came up with a theory of protein structure that we called the Urfold. U-R-F-O-L-D. We prefix it with "ur," from the German prefix meaning something original or primitive, in the sense of one level higher up in abstraction, in an ontological sense. So you can basically think of the Urfold as a slightly more abstract level than the fold. But let me back up for a second; it might be easier to ground this with an example of one of the classification systems Phil mentioned. CATH is one. CATH, C-A-T-H, stands for class, architecture, topology, and homologous superfamily. We don't need to worry so much about the H; let's first worry about the C, A, and T levels. The C is very easy. I mentioned earlier the alpha helices and the beta strands, which form beta sheets. Whether the protein is all alpha, all beta, or a mix of alpha and beta: those are the three fundamental classes, the C level. Then you go to the architecture level and the topology level. Let me skip architecture for a second. Topology is basically synonymous with what we've been calling the fold. Imagine you're a tiny, nanoscopic ant crawling along a protein structure, and track the direction you're pointing in 3D space: you're walking along a beta sheet, then you make a turn and go into an alpha helix. Imagine that little map of vectors. If you could somehow classify those vectors, you'd get a definition of a fold: basically the secondary structures and the order of the secondary structures. So you go beta sheet, then alpha helix, then maybe a turn, then another helix, then a beta sheet, and that would be your protein fold. Many of these folds are known by name: the Rossmann fold, the TIM barrel, et cetera. So we've talked about the C level in CATH, the class, and the T level, the topology, which is the fold.
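The C-A-T-H levels Cam is walking through form a simple hierarchy that is easy to mirror in code, including the architecture level he turns to next. A sketch with two illustrative entries; the level names are CATH's own, but the example assignments are simplified paraphrases rather than official CATH records:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CathEntry:
    klass: str         # C: all-alpha, all-beta, or mixed alpha/beta
    architecture: str  # A: 3D arrangement of secondary structure
                       #    elements, ignoring their connectivity
    topology: str      # T: the fold (arrangement plus connectivity)
    superfamily: str   # H: homologous superfamily (shared ancestry)

# Two simplified, illustrative entries:
globin     = CathEntry("mainly alpha", "orthogonal bundle",
                       "globin-like", "globins")
tim_barrel = CathEntry("alpha/beta", "alpha-beta barrel",
                       "TIM barrel", "aldolase class I")

# The Urfold idea lives between A and T: two proteins can share an
# architecture while differing in topology.
print(globin.architecture, "|", globin.topology)
```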
Cam: And then we have, in between, this funny level of architecture. What is architecture? It has been operationally, empirically defined as the predominant orientational disposition of the main secondary structure elements, the helices and the strands. Roughly, that means the 3D spatial arrangement of those secondary structure elements, agnostic of the pattern of connectivity between them; it's the connectivity that gets you to the topology. So take the topology and imagine snipping off all the loops that connect the secondary structure elements, those helices and strands, not caring about the connections, and just asking, roughly, how are they arranged? In a globin, for instance, you have eight helices that lie at particular angles to one another in 3D space, packed fairly tightly, as most protein cores are. That's the architecture; you talk about the globin architecture.

Phil: So where does the Urfold fit in?

Cam: The Urfold, finally, is this phenomenon we discovered over time of architectural similarity despite topological variability. That's really key, because the assumption had been that if you had two different proteins with the same architecture, they would come from similar topology classes. It turns out that's not the case.

Phil: So the important thing is this: we've had this classification scheme for a long, long time, and it's been enormously powerful, because it defines the buckets into which we've classified all these proteins. What we're saying now, as I said earlier, is that things don't necessarily fit into one bucket. We're describing an element that fits into multiple buckets, which we've been able to identify as this Urfold that sits between architecture and topology. It's a powerful element. And it should be said that this was initially discovered in large part through work by Philippe Youkharibache, who used to be in the lab. It wasn't done with fancy AI technologies. It was done with eyeballs...

Cam: Manually, visually, yeah.

Phil: ...because you can visualize these proteins. At this point, with COVID and everything else, people have seen capsid proteins from viruses and so forth, so effectively everyone has seen these kinds of diagrams. They're just visual representations that let humans look at the large sets of Cartesian coordinates that make up the proteins. From those, we started to notice segments that appeared repeatedly across very different types of proteins, and from that we came up with the notion that this Urfold existed. What we've been doing with AI, really, is chasing that belief across this fold space you've been describing.

Cam: Right. Systematically and reproducibly, and in a principled way. Even if we're wrong about our assumptions, with these machine learning and AI approaches we can at least be explicit about the assumptions in the model we're constructing, which is better than eyeballs...

Phil: Well, it's a totally interesting and different philosophical point that you can publish something that is potentially wrong but ultimately leads to something that's right. And so far, no one has really trodden this down significantly.
Phil: All right, so we've discovered this additional element in principle, and through AI we've begun to actually look at the space and prove it out. What are the implications, from your point of view? I'm sure that's what the listener is asking: okay, this all sounds great; assuming it continues to be supported, what does it ultimately mean for the health and well-being of the people listening?

Cam: That's a really good question, and not one I necessarily have a great answer to. Fundamentally, there's the evolutionary angle; I think of it as looking backwards in time or looking forwards in time. Looking backwards, we have 3.5 billion years of molecular evolution that have occurred, and that's really interesting in terms of textbook-type knowledge, basic knowledge: how did the proteins we have today come to be? Looking forwards, you think about protein design and the universe of all protein sequences: normal versus aberrant, wild type versus mutant, ancestral versus extant, the ones around today or even ones that could be around tomorrow, hypothetical proteins created by design. This opens the door to all sorts of biomaterials and biomedicine, which is what Phil's asking about. One could imagine that these Urfolds, which you can think of as little three-dimensional constellations of structural elements that are not necessarily covalently tethered (otherwise we'd call them supersecondary structural elements), could form a library of design elements for more intelligent, guided protein design, rather than just randomly turning loose a diffusion model. Though I shouldn't pooh-pooh that, because these models have been extremely effective. Just recently a paper in Nature described a snake-toxin antivenom, a small designer protein that binds the small three-fingered venom toxins, and it was highly effective. That was purely the product of de novo protein design, totally from scratch.

Phil: The key point you keep raising is protein design. Effectively, as our understanding of structures increases, through the Urfold and other mechanisms, it opens the door to thinking about how we might design proteins whose three-dimensional forms don't currently exist in nature. And what are the implications of that? For the most part, we think about it in the context of the kinds of things you've mentioned: biofuels, biomaterials, designed proteins that could ultimately be involved in various healthcare applications. All of that is clearly on the horizon as we move into a completely different era of medicine, one driven in many ways by AI. So all of that is really exciting. Of course, as always with the excitement, there's also danger, and the danger is the nefarious production of various types of proteins, which gets into bioterrorism and all sorts of things. When I started working in this field, which I won't tell you how many years ago that was... well, I will. It was around 1980, and there were about 80 experimentally determined protein structures.
Now, as we've discussed, there are about 250,000 experimental ones and about 20 million computed ones. This is "big data." This is data science. This is new knowledge, and it's opening doors to all sorts of developments that are undoubtedly going to impact us. I think we're very lucky to have the opportunity to work on these things, in a school of data science that is really amenable to pushing the envelope. So with that, I think we'll close up. Any last comment you'd like to make?

Cam: Stepping away from protein design for just one second, I did have one realization. What I realized, and we haven't really explored this yet but would like to, is a methodology...

Phil: This is discovery through podcasting...

Cam: ...yeah: explainable AI, or XAI. It's very interesting, because everything we've told you about here rests on protein science, of which there are 50, 60, 70 years of painstaking, laborious, but solid, physically grounded work in biophysics and structural biology. One of the potshots against AI methods, against just turning loose some random AI method, is explainability, interpretability. It's a black box: you put in some data, you get out some result or prediction or classification, and you don't know why. Here we have the opportunity, and this is less biomedicine-related and more fundamental, to explain why. Why does the deep generative model, the deep neural network, decide what it decides, in light of those 50, 60, 70 years of physical theories applied to biological problems?

Phil: What you're actually saying, which is a fundamental point, is that through AI we now have the ability to predict the three-dimensional structure of a protein from its one-dimensional sequence. However, we still don't know how that occurs in nature.

Cam: Right. The physics we know, roughly speaking.

Phil: We know the beginning and we know the end, thermodynamically, but we don't know the middle. We don't know enough about how living systems work to understand exactly how a protein folds. So the protein-folding problem is still not solved. The structure-prediction problem is solved.

Cam: And so this does connect to...

Phil: So you'd better get back in the lab and get on with it!

Cam: ...this does connect to biomedicine, in the sense that, in a small pilot study we did at the end of this published work, using explainable AI (this is very crude, and I hesitate to even bring it up), an XAI approach called layer-wise relevance propagation, we were able to identify one particular loop of the protein that's important for one AI-based classification versus another. And it turns out that that loop, and this may just be coincidental, who knows, but it's worth investigating, is actually an amyloidogenic loop: a lot of amyloid-related mutations are found in that loop.

Phil: Just explain that for the audience a bit.

Cam: Amyloid features prominently in Parkinson's and Alzheimer's and a lot of other neurodegenerative diseases, where proteins glob together into amorphous aggregates that result in cell death and so on.
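For readers curious what layer-wise relevance propagation actually does, here is a minimal sketch of the basic LRP-0 rule on a tiny dense ReLU network. The weights and inputs are random and purely illustrative; the study Cam describes applied LRP to the DeepUrfold models, not to anything like this toy:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny dense ReLU network with random weights (purely illustrative).
W1 = rng.normal(size=(6, 4))
W2 = rng.normal(size=(4, 1))

x   = rng.normal(size=6)           # e.g. six residue-level features
a1  = np.maximum(0.0, x @ W1)      # hidden activations (ReLU)
out = a1 @ W2                      # the network's output score

def lrp_step(a, W, R, eps=1e-9):
    """Basic LRP-0 rule: split each unit's relevance R among its
    inputs in proportion to their contributions a_j * w_jk."""
    z = a @ W                                   # pre-activations
    z = z + eps * np.where(z >= 0, 1.0, -1.0)   # stabilizer, avoids /0
    s = R / z                                   # relevance per unit of z
    return a * (W @ s)                          # redistribute to inputs

R_out    = out                         # start from the output score
R_hidden = lrp_step(a1, W2, R_out)     # relevance of hidden units
R_input  = lrp_step(x,  W1, R_hidden)  # relevance of each input feature

# Inputs with large relevance "light up" as drivers of the prediction,
# analogous to the loop that lit up in the study described here.
print(np.round(R_input, 3))
```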
Cam: And what we found, completely serendipitously in this DeepUrfold study... by then we had started turning the crank. You know, science is never done. In any project, the moment you think you've answered a question, you've raised ten new questions, right? And that's how it should be. That's perfect.

Phil: It's a job for life.

Cam: Exactly, job security is the nature of science, because there are always questions to be had. And the question we had then was: what if we use this really simple, out-of-the-box layer-wise relevance propagation, LRP, approach to analyze the DeepUrfold predictions? It's really a first step in the direction of tying this back to protein structure and biophysics, of demystifying the black box, as it were. What lit up for us in the results was this particular loop in the protein. I'm sad to say I can't remember the exact identity of the protein, but it was one involved in amyloidogenic diseases, and that loop was known to carry a critical tryptophan-to-alanine, or tryptophan-to-valine, I can't recall which, mutation.

Phil: Those are different types of amino acids.

Cam: Those are different types of amino acids, and that heritable, disease-associated mutation was in that loop. So that's an example. It's extremely tentative and speculative right now, but it's the type of thing we hope explainable AI might be able to do, completely separate from protein design.

Phil: Yeah. The serendipitous nature of how one discovery potentially leads to another is certainly what drives us forward all the time. So with that, hopefully you'll get funding for that going forward and we'll be in good shape. And with that, I think we'll end. Thanks very much, Cam, and thanks to everyone for listening.

Cam: Thank you, guys.

Monica: Thanks for listening to this month's episode of UVA Data Points. More information can be found at datascience.virginia.edu. And if you're enjoying UVA Data Points, be sure to give us a rating and review wherever you listen to podcasts. We'll be back soon with another conversation about the world of data science.
