How machine learning boosts protein and genetic studies | Interview with Dr. Burkhard Rost

We met with computational biologist Dr. Burkhard Rost of the Technical University of Munich to talk about the intersection of machine learning and biology, protein structure and function prediction, and much more. Enjoy!

Influential biologist Dr. Burkhard Rost offers insights into the synergy of computing, machine learning, and biology, especially regarding protein structure/function prediction. He also discusses protein variance and conservation in evolution, including analysis of sequence tracking through use of neural networks. Head of the computational biology & bioinformatics department for the Technical University of Munich, Dr. Rost talks with Dr. Jed Macosko, academic director of AcademicInfluence.com and professor of physics at Wake Forest University.

Interview with Biologist, Dr. Burkhard Rost

00:00 JM: Hi, I’m Dr. Jed Macosko at AcademicInfluence.com and Wake Forest University, and today, we have a special guest visiting us from Munich, Germany. It is Professor Burkhard Rost, and he is going to tell us a little bit about how he got started in biology when he was a teenager. So, Professor Rost, how did that happen?

00:20 BR: Actually, it happened quite a bit later. So, I studied physics, theoretical physics, and I was on a track to do something in physics, but then ultimately, I believe I got pulled… So I started with machine learning, or I started with neural networks, and that got me into this idea that maybe, with these devices, we could possibly understand a little bit about how the brain works. Digging deeper, I realized that the models from physics were, at that point, we’re talking late ’80s, a little bit too far away from medicine or biology, and at that point I really wanted to get deeper into biology. I wanted to get deeper into something that makes sense in biology, and then it was one of those classic cases where I interviewed at the European Molecular Biology Laboratory, at EMBL. I came with a background in machine learning. Those people knew biology; I didn’t know biology. We didn’t know anything about each other and we completely miscommunicated, completely didn’t understand each other, and that was a perfect match…

[laughter]

01:22 BR: Ever since.

01:24 JM: So when you started working for the EMBL, did you work on biological projects right away, and which ones did you work on?

01:32 BR: I started right into what essentially made my name in the field, which is the marriage of machine learning and evolutionary information, which went immediately into protein structure prediction. So proteins are these molecules and at that point that… So my first publication was the then-best mark in predicting some aspects of protein structure, called secondary structure, and when I published that, I was still on the brink of knowing almost nothing about proteins. Now the way you can think about proteins, they are linear objects. Think about it as a pearl chain, but we have 20 different colors of pearls, so there are 20 different pearls, and the way they are strung together, that makes different proteins. So different colors, different numbers of pearls, some are 30 pearls long, and some are 30,000 pearls long, that’s sort of the variety. The proteins are essentially the entire machinery of life; there’s nothing in your body, nothing that functions, that is not done by proteins.

02:35 JM: Well, except for the ribosome, of course.

02:38 BR: Except for the ribosome, fair enough, but yeah. [chuckle] Okay, yes, there’s RNA, and I believe I take your point, so the ribosome itself essentially is made up of proteins, too. But, in fact, there’s also long RNA in there, so it’s a little bit more complicated. RNA has more to say than we believed when I started in this field. So over the last 30 years, many things have changed, but it’s still true that proteins are the machinery of life.

03:07 JM: Of course.

03:09 BR: And, essentially…

03:10 JM: So, yeah, tell us a little bit about that first paper you wrote, and why did it make you so well-known?

03:17 BR: The only reason is simply because secondary structure is one of those aspects of three-dimensional structure, and the shape of the protein determines how it functions. So, knowing the shape from the… We know many sequences, but we don’t know many shapes because it’s very expensive to determine these shapes. Therefore, predicting shape from sequence is a daunting challenge that people have been trying since actually the ’60s. And in particular, this aspect of secondary structure, when I came into the field, I started collecting all the publications. There were over 200 publications, and people had tried this topic because it was interesting for biologists and because any kind of tool that you could take from computational sciences or physics, you could apply to that, and it had been applied to that.

04:07 BR: So it’s relatively simple for somebody who is a newcomer to enter that field, and this field had sort of hit its head against the ceiling in terms of performance, and the breakthrough step there was the use of evolutionary information, essentially benefiting from the fact that we evolved from other organisms and using that information. So that information had been used before, but it had never been combined with machine learning. That was sort of the real breakthrough, this combination, and that made the big difference, that made a big jump that improved more than people had done over the last 20… 20 years before that point. And that completely started a new way of looking at it, in fact, using evolutionary information, not just sort of as the background to it.

04:51 JM: Well, can you explain just a little bit about how it works? So you have all of these proteins that are in evolutionarily-related species, and they have similar sequences, and then machine learning is basically saying, “Run these through some tests, see what happens and then learn from that and run it through again.” So how does that work in this case?

05:15 BR: So, one way of saying that is, yes. You completely described it. Let’s get into a little bit more detail. So when we look at the point at which one species becomes another species, or at which we sort of evolved into a new… into new species, have evolved in the past, then what we see is that, essentially, at the bridging point from one to another, you have the same set of machinery, you have the same proteins, right? That means essentially the same proteins do the same thing in the other organism, but over time, things change, and the constraints for that change are maybe slightly different between these two organisms. That leads to the same protein doing the same thing in two different organisms, but nevertheless, the sequences drift a little bit away. That drift typically is neutral in the sense that the drift happens exactly where the change does not matter for structure, does not matter for function, where it’s neutral. And what change that really is, is happenstance, is something that is different between the two species.

06:17 BR: Now, when you look at two sequences from two different species, two different proteins, and you see there are some points where they are varied and some points where they are not varied, then you get an idea that the ones where they’re not varied are more important.
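As a rough illustration of the idea described here (not Dr. Rost's actual method), the sketch below scores each position of a small set of aligned sequences by how often the most common residue occurs there. The function name `column_conservation` and the toy sequences are hypothetical, made up for this example.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Return, per alignment column, the fraction of sequences
    that share the most common residue at that position."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    # zip(*seqs) walks the alignment column by column
    for column in zip(*aligned_seqs):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Hypothetical toy alignment: the "same" protein from three organisms
toy_alignment = ["MKVLA", "MKILA", "MRVLG"]
conservation = column_conservation(toy_alignment)
```

Positions with a score of 1.0 never vary in this toy family, so by the argument above they are the likelier candidates for being important.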

06:30 BR: Okay, that’s the first step. Now, this is a complex piece of information. And to understand how that is used, I cannot look at it and really say, always, this is the most important, but I can sort of say the tendency is that this is most important. I cannot sort of put that into something that immediately predicts structure, but in the context of machine learning, I can simply put these vectors of change into the machine learning device. It turns out that the amount of information I had at that point 30 years ago was already so much that it needed, for the time, relatively large devices, meaning models with relatively many free parameters. So today the story is very different: we were talking about thousands of free parameters, today we talk about millions, so the numbers have increased, but at the time, you needed something with thousands of parameters, so you needed machine learning, you needed neural networks and the like in order to in fact benefit from that information, and that ultimately is the entire idea.

07:31 JM: Okay, so you have these vectors of change and you have about a thousand of them, and they show you where things have been conserved, of the amino acids that are conserved and where they’re different, so the ones that are more neutral, and what does machine learning and neural networks do with those?

07:50 BR: So essentially, the trick is to… For these machine learning devices, for this machine learning, is to figure out where the changes are relevant and where the changes are not relevant, and that simply is supervised training. At that point, we had about 100 proteins, and a protein on average is sort of 200 amino acids long, so we had 20,000 samples, and for every single one of these samples, the neural network could learn, is this important to say that that position is a helix? So there are three states of secondary structure: helix, strand, and other, unclassified, unnamed, so to speak. And then ultimately, the device predicted from the sequence, from the residue, or from the amino acid, what the state of secondary structure of that amino acid is, given some environment, right, and given some conservation throughout the family of proteins that I know. Although I don’t know how they look in 3D, I know these 10 proteins will look alike because they come from 10 different organisms, they do the same thing in the different organisms, and by filtering those data I can do supervised learning, and it turned out that this way I could put into the machine, into the neural network, way more relevant information by training on 10 times more samples.
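The supervised setup described above can be sketched roughly as follows: each residue becomes one training sample, made of the residue plus its neighbors ("some environment") and a three-state label. This is a minimal sketch under assumptions; the window size, the padding character, the label letters (H, E, L), and the toy sequence are all hypothetical and not taken from the interview.

```python
def window_samples(sequence, labels, half_width=2, pad="-"):
    """Turn one labeled protein into per-residue training samples.

    Each sample is (window, label): the residue with half_width
    neighbors on each side, and that residue's secondary structure state.
    """
    if len(sequence) != len(labels):
        raise ValueError("need exactly one label per residue")
    # Pad both ends so residues near the termini still get a full window
    padded = pad * half_width + sequence + pad * half_width
    samples = []
    for i, label in enumerate(labels):
        window = padded[i : i + 2 * half_width + 1]
        samples.append((window, label))
    return samples


# Hypothetical toy protein; H = helix, E = strand, L = other
toy_seq = "MKVLAG"
toy_labels = "LHHHEL"
samples = window_samples(toy_seq, toy_labels)

# The sample count quoted in the interview: about 100 proteins,
# each about 200 residues long, one sample per residue
approx_training_samples = 100 * 200
```

One sample per residue is what turns roughly 100 proteins into roughly 20,000 training examples; adding aligned family members supplies more signal per sample without needing more solved structures.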

09:06 BR: So this is a statement that we couldn’t make at the time, but 10 years later, we could then look back and say, “Well, what if we had had at the time 10 times more data?” Which we didn’t, but we had it implicitly, because we had it in terms of evolutionary information, but what if we simply had had 10 times more proteins? We could still not have done as well and that’s…

09:25 JM: Really?

09:26 BR: Yes, and that is essentially because the evolutionary information is not only giving you a lot of data, it’s giving you a lot of specific data, saying, change there is implying that, you have the same… You have a particular amino acid that is called glycine, and in a particular environment, the conserved glycine leads to that outcome, and in another environment, it leads to that outcome, and that’s a level of detail of rules that experimentalists wouldn’t even know, and the machine learning device could gauge that. Yes, sorry?

09:55 JM: No, that’s great. So what you’re saying is that when you first created your neural network and machine learning and incorporated in the fact that these are all evolutionarily-related organisms, if you had gone back in time and made 10 times more sequences that you analyzed, but done it without the insight of evolution, you wouldn’t have gotten as good of results?

10:20 BR: Thank you for the translation.

10:21 JM: Okay, that’s interesting.

10:22 BR: Yes.

10:22 JM: And the insight is really just saying, these proteins are all related and they can only switch amino acids at positions that don’t matter as much to the secondary structure, is that what you were… That was the evolutionary information that you were bringing in?

10:39 BR: Essentially, in the simplest way of thinking about it, that is the answer, yes. It’s a little bit more complex in the sense that it’s not only more conserved or less conserved, but what is changed to what else and at what position, and that position is very, very relevant for… Is that outside, so the protein forms a ball, so is that change happening outside, is it happening inside, and all of that the machine learning device somehow can learn. And in one case, the same residue with the same pattern of change may be outside or may be inside, and again, that depends on the environment, and it is all learned from these vectors.
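One common way to capture "what is changed to what else and at what position" is a per-position frequency vector over the 20 amino acids. The sketch below builds such vectors from a toy alignment; it is an illustration under assumptions, and `position_profile` and the example sequences are hypothetical names and data rather than anything specified in the interview.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def position_profile(aligned_seqs):
    """Return one 20-dimensional frequency vector per alignment column."""
    profile = []
    for column in zip(*aligned_seqs):
        # Ignore gaps or unknown symbols when counting
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile


# Hypothetical toy alignment of one protein family
toy_family = ["MKVLA", "MKILA", "MRVLG"]
profile = position_profile(toy_family)
```

Unlike a single conservation score, each vector records which residues were exchanged for which, and a network could take a window of such vectors as input.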

11:14 JM: Amazing. So now… Go ahead.

11:18 BR: Here’s a very important statement about the cleverness of doing… Of a scientist and doing science, I had no idea about that. I fell into it by happenstance. I somehow tried a lot of things, I knew this could be an interesting signal, but understanding how much, how important evolutionary information is, took me many years after that, so it was a happenstance discovery in some sense, this marriage, and it was in fact something that also, people immediately realized this is something that is happening because the performance went up so much, but what it actually means, how important this evolutionary information is was something that took a while for the field as such to really take up. And one definition of that is, how long did it take until almost every method used evolutionary information? And the answer to that question is almost a decade. And ultimately, that means it took almost a decade to really understand for them and for me, so for… It’s the same thing, I’m not the inventor who completely knew what he was doing, so what I’m telling you now is something that we, in many ways, realized in this clarity only in the aftermath of the event.

12:25 JM: Fascinating. Well, it seems like it all worked out really well, and that was a huge leap forward for structure prediction. What are you doing now?

12:33 BR: So I’m trying to undo that. [chuckle] So one of the problems is really that the bio-databases are growing so much, they by far outgrow computers. So when we want to create this evolutionary information, this gets increasingly difficult. I’m running a server that is one of the first Internet servers in molecular biology, started in ’92, and it has been running ever since, and it lets people simply submit a query and then in return, they get these predictions. Originally that was by email and people had more patience, so they would wait for a day. Today they don’t wait anymore, so they want essentially this thing to be mostly stored, but the reality is that one single job takes longer on a single machine than it took then. And that ultimately is because the evolutionary information takes time to compute, to search through the databases, and that takes longer because the databases are getting larger.

13:27 BR: So that gets us to the point: can we get to the same gain in performance without really having to compute the evolutionary information? Can we put the evolutionary information into one model? And that is something we do with transfer learning, so we do these deep learning models where we try to learn from, the latest thing, two billion sequences, so this is something that is… A picture of this is 50 times Wikipedia. We throw out these huge, huge networks and we compute on very large machines for a few months, and we hope that that somehow learns to predict the language of life. And now, once we have this language of life, we hope that that will somehow contain evolutionary information. We are not there yet, so after two years, we are at the point where we can do amazing things with that, where we can sort of almost begin to compete with evolutionary information, but we are not quite there yet. So we are trying to somehow put into these models something that would replace evolutionary information. After I put it into the mix, I wanna get rid of it.

14:32 JM: Because it’s too slow, because there are too many sequences, is that why you wanna get rid of it?

14:36 BR: Yes, ultimately that’s the reason. So it doesn’t… It’s difficult to scale up in the long run. It would be much easier if we could do it without that. The other way of looking at it is, if I can get to the same performance without evolutionary information, and then I put evolutionary information back in, I can always get higher. So I can get even better… And this is another aspect. And in some sense, so I started working on secondary structure prediction; for almost 30 years I didn’t work on secondary structure prediction, and now I’m back to it. In between, most of the work that we have been doing is more on the level of function, on the level of… So one question simply is mutability, so the two of us have essentially the same 20,000 proteins. We have some variations, and the variation between you and me is that in each of those 20,000 proteins, you have a different amino acid than I.

15:31 BR: One, this is roughly the degree of change, which in fact is much, much, much more than we would have anticipated 15 or 20 years ago. When the human genome sequencing was done, the assumption was there were minor changes, the assumption would have been maybe between you and me, there is a handful of these changes. Now we know it’s 20,000, which is way, way more than we assumed. So the next question is, can we predict what the effect of these variants is? So are these variants essentially the ones that are evolutionarily neutral? Because both of us are happy. The simplest definition of happy and healthy is we can show up for this interview, we can smile, we are there, that’s a very simple definition for health. So we’re completely healthy, and then whatever is between us that is different should not matter for health, should not matter for the organism to somehow work.

16:21 BR: And for that we again developed a method that is based on machine learning, is based on evolutionary information, and predicts the effect of a variant. So what are the differences between us? Do they matter? And what we found there was that it seems that the variants we’ve seen in the human population matter astonishingly much. So it is not at all neutral, and many of these variants in fact are health risks. Let’s put it like that, so they are not good for us. And ultimately why do we have them? Well, we survive as a species. So some variant may not be good for me, but under some different conditions that same variant may turn out to be something that is good for me. And essentially, this is another set of methods we have been working on, but they use the same principle, always the combination of machine learning with evolutionary information, or machine learning to predict effects on protein structure and function.

17:23 JM: That must have been a very fun 30 years of research, but is it fun to be back on secondary structure prediction?

[chuckle]

17:33 BR: It is… I very much… I’m a scientist because I enjoy being wrong, and to see that so many things we believed 30 years ago are not true, that is a lot of fun. To see that there’s a new generation of people who completely don’t have any idea of what was published 30 years ago, and they have new ideas, and some of them are really great and they bring us further because they haven’t read the papers. This is a lot of fun for me to see.

18:04 JM: Yeah, it must be rewarding. Well, thank you so much for spending a little time with us talking about the things that have made you an influencer in the field of Biology, even though you came at it as an outsider and maybe precisely because you came from the outside perspective, you were able to change the field. So we really appreciate you spending some time with us and explaining it to us today.

18:25 BR: Very welcome, it was a pleasure.