Nicholas

Arc Institute's Patrick Hsu on Building an App Store for Biology with AI


Patrick Hsu, co-founder of Arc Institute, discusses the opportunities for AI in biology beyond just drug development, and how Evo 2, their new biology foundation model, is enabling a broad ecosystem of applications. Evo 2 was trained on a vast dataset of genomic data to learn evolutionary patterns that would have taken years to find; as a result, the model can be used for applications from identifying mutations that cause disease to designing new molecular and even genome-scale biological systems.

Hosted by Josephine Chen and Pat Grady, Sequoia Capital

Mentioned in this episode:
Sequence modeling and design from molecular to genome scale with Evo: Public pre-print of original Evo paper
Genome modeling and design across all domains of life with Evo 2: Public pre-print of Evo 2 paper
ClinVar: NIH database of the genes that are known to cause disease, and mutations in those genes causally associated with disease state
Sequence Read Archive: Massive NIH database of gene sequencing data
Machines of Loving Grace: Dario Amodei essay that Patrick cites on how AI could transform the world for the better
Arc Virtual Cell Atlas: Arc's first step toward assembling, curating and generating large-scale cellular data for AI-driven biological discovery (among many other tools)
Protein Data Bank (PDB): A global archive of 3D structural information of biomolecules used by DeepMind to train AlphaFold
OpenAI Deep Research: The one AI app Patrick uses daily

Published Apr 15, 2025
Uploaded Jun 11, 2026
File type
Podcast

Full transcript


AI-generated transcript with timestamped sections.

0:00-1:36

[00:00] One of the things that the field of computational biology is often asking is, you know, if you have a genetic mutation in your genome, if I sequence you, whether that's via 23andMe or some other genetic test, right, you'll find mutations in your genome. [00:17] How do we actually interpret those and understand what the functional consequences are? [00:22] Sometimes you'll get a rare genetic disease. Those are causal genetic mutations that are known to cause a devastating disorder. That might be muscular dystrophy or cystic fibrosis or breast cancer. [00:34] Right. But, you know, most of the mutations that you have, we call them variants of unknown significance, which is fancy kind of scientist speak for "we don't know what the hell is going on," right? And, you know, it turns out the model has an opinion about those mutations and what the hell is going on with them, and it turns out it's sort of state of the art in doing that. [01:12] Today we're joined by Patrick Hsu, a pioneer in genome editing, CRISPR technologies, and the emerging field of generative biology. He's the co-founder of the Arc Institute, where cutting-edge AI and biology converge to reimagine scientific discovery. [01:27] Patrick and his collaborators created Evo 2, [01:29] a revolutionary biological foundation model that can interpret and generate genomic sequences across all domains of life.

1:37-3:11

[01:37] By training on the fundamental information layer of life, DNA itself, Evo can identify patterns from genetic code at scale and predict effects of both coding and non-coding mutations that can mean the difference between health and disease. [01:52] In this episode, we'll hear how Patrick's vision goes beyond creating better drugs to building a comprehensive understanding of biology at all scales. [02:00] Patrick, welcome to the show. Thank you for coming. Thanks for having me on. Excited to spend some time with you today. [02:08] I think maybe the most obvious thing to start with is... [02:11] You know, people have heard about CS and bio for the longest time. Now it's [02:16] all about AI and bio. [02:18] Where are the results? Like what should we actually be expecting to see? And where are the drugs? Why are we not seeing the drugs yet? It takes time, right? Well, here's the thing. Even if we had perfect drug design molecules, you know, coming out of these pipelines and fancy models, it would still, you know, you can design a trillion molecules, right? Ten trillion, right? But you still have to actually test them, right? Initially in animals and then in people, right? [02:48] That's a real bottleneck, and even if you pack top of funnel with all the things, it just takes [02:52] years to actually go through the regulatory apparatus. And so I think there are a few intermediate checkpoints along the way in order to kind of realize this potential, but it might be worth taking a step back and just saying, you know, this is a bit of a soapbox of mine, that ML for bio is not just drug design, right? This is actually ultimately, I think, a very important, but...

3:11-4:48

[03:11] narrow part of the potential of biology, not just as a field of STEM but in the way that it affects human lives, right? [03:21] And do you mean basic, just like understand the human body, or where else do you think the applications are beyond [03:27] drugs that treat all of us? [03:28] Yeah, one of the things that motivates me academically is the idea that we actually have a unifying theory for biology, right? So unlike the physicists who have been kind of scrimping and, you know, kind of poking for one for a century, right? We have this in biology, and it's so obvious that [03:47] we find it, you know, sort of just an obvious force, right? This is, of course, evolution, right? And so it acts on biology across all of its different length scales from entire planets, right? Biology can, for example, terraform planets, right, all the way down to, you know, ecosystems and populations to individuals to our tissues, to individual cells to individual molecules, right? And so that's this unifying force that [04:13] is actually very deep and rich and actually you can learn a lot from. How do we activate that unifying theory? How do we put it to work? [04:21] So we've been thinking about this in the lab and recently have been training a series of models that we call Evo, inspired by these forces of evolution, that tries to connect [04:31] biological sequences using this sort of modern sequence modeling paradigm directly to biological function, with the idea that [04:40] evolution passes down its effects of natural selection throughout generations of life via DNA mutations.

4:48-6:20

[04:48] Last year, of course, multiple Nobel Prizes were awarded for AI and for AI in biology, in particular for [04:57] protein design [04:58] and for predicting the structure of proteins, to David Baker and to Demis Hassabis and John Jumper, right? But if you read those citations, they both explicitly state for proteins, right? And, you know, we love proteins, right? These are, of course, some of the most important fundamental molecular machines, but... [05:18] Our realization, for me as a genome biologist, if [05:22] you will, right? The idea is that proteins are encoded in DNA, along with RNA and with regulatory DNA and all of the things that you need to make life, right? And so we asked, could we train a model [05:35] on genomes with a long context model so that it could reason over all the different bases and molecules that are embedded inside of genomes to learn about the molecular interactions and how they lead to biological function. Now, that was very interesting. [05:50] scientific or academic, right? But we can talk through specific examples of what we were able to actually do with this in a way that grandma can understand, you know, like predicting the effects of breast cancer causing mutations, right? It actually is best in class at doing this, right? Or being able to design new CRISPR gene editing systems, or, you know, I think, you know, in addition to its zero-shot capabilities, I think people are building really an app store for

6:20-7:47

[06:20] biology on top of all of these kind of foundational layers. One of the kind of funny things in modeling today is how everyone's model has to be more foundational than someone else's model. There's a bit of a pissing contest that's happening, right? But maybe our model is more foundational than the other models. Because you're DNA versus protein. I mean, and then below that, there are these all-atom diffusion models. So maybe those are even more fundamental models. [06:47] I don't really know. I think what matters are the capabilities and doing something that actually feels useful. And I think, you know, those are some examples of what we thought was cool and useful from the model. Can you actually walk through a couple more of those use cases? Like, which are some of the most exciting ones? [07:03] And why is it possible today with Evo, but wasn't possible before? And the model's open source. [07:08] And so where has it been picked up? Like where are people running with it with some of those use cases that Josephine mentioned? [07:14] Yeah, so the model, I can just maybe talk about what the model is, right? So, you know, it's an autoregressive sort of multi-convolutional hybrid model, right? But you can think of it like, you know, just a really efficient long context model that's trained, at least in this version, autoregressively. [07:32] Right. And basically it does this next token or next base prediction. And it turns out just like in natural language or in vision or in robotics and embodied intelligence, this general machine learning paradigm is able to find higher order patterns.
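
To make "next base prediction" concrete, here is a minimal sketch in Python. It is not Evo and does not use Evo's code or API: it assumes a toy count-based model that predicts each base from the previous k bases, and every name in it (`train`, `next_base_probs`, `log_likelihood`) is hypothetical. The point is only to show what "predict the next base, then score a whole sequence" means.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train(sequences, k=3):
    """Count which base follows each k-base context (toy stand-in for training)."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts

def next_base_probs(counts, context, k=3):
    """P(next base | last k bases), with add-one smoothing for unseen cases."""
    row = counts.get(context[-k:], dict.fromkeys(BASES, 0))
    total = sum(row.values()) + len(BASES)
    return {b: (row[b] + 1) / total for b in BASES}

def log_likelihood(counts, seq, k=3):
    """Sum of log P(base | preceding k bases) over the sequence."""
    return sum(
        math.log(next_base_probs(counts, seq[i - k:i], k)[seq[i]])
        for i in range(k, len(seq))
    )

# Made-up training data: a single repetitive sequence.
model = train(["ACGTACGTACGTACGT"], k=3)
probs = next_base_probs(model, "ACG")
print(max(probs, key=probs.get))           # most likely base after "ACG"
print(log_likelihood(model, "ACGTACGT"))   # sequence that follows the learned pattern
print(log_likelihood(model, "ACGAACGT"))   # same sequence with one substitution
```

A real genomic language model replaces the count table with a neural network and a far longer context, but the interface is the same idea: a probability for each possible next base, and a likelihood for any sequence.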

8:02-9:32

[08:02] by predicting the next base or the next amino acid residue or the next gene, [08:08] and the model learns something about the molecular logic that gives rise to a cell. [08:13] And so one of the things that the field of computational biology is often asking is, you know, if you have a genetic mutation in your genome, if I sequenced you, whether that's via 23andMe or some other genetic test, right, you'll find mutations in your genome. [08:32] How do we actually interpret those and understand what the functional consequences are? [08:37] Sometimes you'll get a rare genetic disease. Those are causal genetic mutations that are known to cause a devastating disorder. That might be muscular dystrophy or cystic fibrosis or breast cancer. [08:50] Right. But, you know, most of the mutations that you have, we call them variants of unknown significance, which is fancy kind of scientist speak for "we don't really know what they do, we don't know what the hell is going on." Right. And, you know, it turns out the model has an opinion about those mutations and what the hell is going on with them. And it turns out it's sort of state of the art in doing that. [09:11] Interesting. Wait, what's an example of one of these mutations, and [09:14] what did the model discover, and how did you verify that [09:17] what it discovered was accurate? Yeah, yeah, so, you know, one example that we showcase in the paper is a gene called BRCA1, right? It's sort of a famous gene that's known to cause breast and ovarian cancer. And, you know, if you have the sort of specific causal mutations in BRCA1,

9:32-11:04

[09:32] many women elect to get double mastectomies, right? And this is obviously, you know, a serious and major life decision and medical decision for you and for your family, right? And [09:43] you know, the question is, you know, if you don't have one of the known-to-be-benign mutations, so you're fine, and you just go ahead and get an annual mammogram and just check and monitor, right, there's this entire middle distribution of these VUS, or variants of unknown significance, right? [10:13] which of those mutations in those genes can cause [10:16] disease state or not, right? And we can basically use this as a ground truth database to assess the predictions of the model for, you know, new mutations that you introduce into the gene and whether or not those would be pathogenic. I mean, and so the... [10:31] When you develop a new type of model, you have to create a lot of the evals as well. And so that was actually something that we put tremendous effort into. And obviously, you guys see this horizontally across AI in many different domains. Folks will build things to the benchmarks. No one really, I think, likes building benchmarks. It's really gory. It requires a lot of taste. It takes a lot of time. You have to continually update them as the models get better. And we dealt with a very similar [11:01] sort of challenge here, which was making evals that

11:05-12:35

[11:05] would be similar to the AGI or intelligence evals, where it actually feels meaningful when you're actually able to do it. Like AIME or Putnam problems or, you know, things like that. What would be the equivalent of [11:18] demonstrating true biological understanding, that a cell biologist would feel emotion if you were actually able to solve that, right? You know, what would it look like to make all molecular biologists feel what the NLP people felt a few years ago, right? Right. That's sort of the core of, you know, what we're kind of noodling through. Yeah. And I think you mentioned this briefly, but, you know, [11:42] you guys are working on the DNA layer and you guys didn't actually do anything in the lab. You didn't have a lab in the loop. Like there was no RL. [11:49] Talk us through the decision to do that and then [11:51] kind of the decision for people who are doing protein models or affinity models, a lot of them have a lot more lab in the loop. Talk us through the differences between some of those models too. Yeah. So we started with DNA because we think it's the fundamental information layer of life, right? Yeah. The second is it's also just [12:09] pragmatically where we have the most data. Where does the data come from? It comes from the entire scientific community. [12:15] And so there are these kind of open source, government funded and maintained databases, you know, known as the Sequence Read Archive, right? Where basically, you know, when you publish a paper, you have to submit your data, and all of the sequencing data that's been created over the last 25 plus years, right, goes into these databases, right?
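
The eval idea discussed across the last two segments, scoring mutations with the model and checking the scores against a ground-truth database of known pathogenic and benign variants, can be sketched as follows. This is an illustration under stated assumptions, not the procedure from the Evo 2 paper: it assumes a variant is scored by how much it lowers a model's sequence log-likelihood, and that agreement with the labels is summarized by AUROC. The `toy_log_likelihood` function, the `CONSTRAINT` weights, the reference sequence, and the labeled variants are all made up; a real evaluation would plug in an actual model and ClinVar-style labels.

```python
REF = "ATGGCC"
# Made-up per-position penalties standing in for a trained model:
# the first three positions are treated as highly constrained.
CONSTRAINT = [3.0, 3.0, 3.0, 0.0, 0.0, 0.0]

def toy_log_likelihood(seq):
    """Hypothetical stand-in for a model's log-likelihood of a sequence."""
    return -sum(w for w, a, b in zip(CONSTRAINT, seq, REF) if a != b)

def apply_variant(ref_seq, pos, alt):
    """Return the reference sequence with a single-base substitution."""
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]

def pathogenicity_score(log_likelihood, ref_seq, pos, alt):
    """Higher score = the variant makes the sequence less likely under the model."""
    mutant = apply_variant(ref_seq, pos, alt)
    return log_likelihood(ref_seq) - log_likelihood(mutant)

def auroc(scores, labels):
    """Chance that a random pathogenic variant (label 1) outscores a random
    benign one (label 0); ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labeled variants: (position, alternate base, 1 = pathogenic / 0 = benign).
variants = [(0, "C", 1), (1, "A", 1), (4, "T", 0), (5, "G", 0)]
scores = [pathogenicity_score(toy_log_likelihood, REF, p, a) for p, a, _ in variants]
labels = [y for _, _, y in variants]
print(auroc(scores, labels))
```

The toy data separates perfectly by construction, so the number printed here says nothing about any real model. What carries over is the shape of the eval: no lab in the loop, just model scores compared against labels the community has already curated.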

12:35-14:10

[12:35] And that has all the genomes that the community has ever sequenced for bacteria, for bacteriophage, for viruses, for humans, monkeys, fish, flies, you know, the entire Noah's Ark or menagerie, whatever. We've got all those genomes. [13:05] different from each other. That's human genomic variation, but also the mutations that make us different from chimpanzees or from worms or from bacteria. And the model can just look across this trillions-of-tokens-large data set [13:21] and learn those patterns. And that was sort of the, you know, [13:25] insight of this sort of Evo series of models that we've been training at Arc, is to kind of ask if you could predict that next base, right, then that might be the difference between being healthy or having a sickle cell anemia mutation. [13:41] Or it could predict the next amino acid residue. And that could be the difference between having a catalytically active binding pocket [13:51] for a key enzyme, right, in your, you know, [13:55] body physiology, or [13:57] that being a null mutation where that thing doesn't work anymore. Or it could also be sort of the next gene, right? So these are different levels of abstraction, right, that completes some biosynthetic

14:10-15:44

[14:10] pathway, where it's a different gene that's been removed by some transposon or jumping gene mobile genetic element that's excised. It's some sort of viral interference, if you will. So just across [14:24] large databases, you find new patterns, and it turns out those seem to be biologically meaningful. [14:31] So if you're going from DNA and you know the function, in many ways, you mentioned even the DNA model can actually predict binding affinities. [14:39] Do you even need the protein structure models at all as an intermediate step? [14:44] Or can you just go straight from sequence to function? So structure is another way to have an abstraction of function, right? And so you have concepts of convergent evolution, for example, right? Where something has similar function, but they have different sequences and slightly different structures that act out the activity or, you know, function of that protein, right? And so this sort of, you know, [15:08] sequence, to structure, to function, token, or mapping of protein language modeling, I think is very beautiful. It takes advantage of [15:19] the central dogma of molecular biology. The AlphaFold series. Right, right. Well, it takes advantage of just our supervised textbook understanding of how biology works. The interesting oxymoron with these models for me is the way that we use them is just like you use ChatGPT. It's text in, text out. Evo is DNA in, DNA out. And it turns out we don't speak

15:45-17:23

[15:45] DNA very well. Sort of imagine if you were using a model like ChatGPT in Russian but 1% of the words were in English. That's kind of what it feels like. That's the vibe of using Evo. Actually, you don't really know what's going on and so you have to build lots of [16:05] and annotators and... Interpretability. Yeah, like techniques to try to interpret and read what's happening. And so the way that we even use and prompt the models is really primitive, right? And so the way that we do fancy prompt engineering workflows to, you know, get more utility out of these models is something that we're just in the very early innings of exploring how that works with these biological language models. Because we speak... [16:31] DNA with an extremely heavy accent. Yeah. [16:35] Who do you think will do some of that work? Because the model is currently open source. Do you envision [16:40] a company formed around this? Will it be individuals who are at these pharma companies who [16:44] learn how to prompt this? How does that ecosystem evolve? [16:48] everybody, first of all. I think the tools become useful when they meaningfully lower the [16:57] energy barrier of adoption, right? It's like a, it's a catalysis type of activity where, you know, everyone uses BLAST for sequence alignment, right? Everyone uses AlphaFold to look at protein structure, right? You know, everyone uses CRISPR to do gene editing. Everyone does NGS to read DNA or RNA, right? So I think there will be a zoo of different models for not just modeling molecules

17:27-19:01

[17:27] step of the scientific method, right? And so I think, you know, if 2025 is the year of AI agents, right, I think, you know, there's lots of interest in agents for science, and not just agents for interpreting molecules but also for doing the meta aspect of how scientists work [17:45] and operate. And we're also very excited about that at Arc and recently released some of our first AI agent work. What is the most important role for Arc to play in all of this? [17:55] We started the Institute to be able to have this mothership that is able to attack long-term research capability breakthroughs and to be able to have the long-term thinking and the multidisciplinary expertise in order to actually execute on those goals. [18:15] biology. One of the interesting things is that [18:19] a lot of the biggest [18:21] mechanistic or basic science breakthroughs [18:24] do happen in a university context, right? I mean, I would say that's interestingly kind of in contrast to what happens in AI. Yeah, or CS in general. Yeah, yeah. Why do you think that's the case? [18:35] Um... [18:36] That might be a 10-hour podcast. That's the one-minute explanation. I have a lot to say about this. But I think, in short, it does happen in universities. [18:52] if [18:52] 30 years ago, the questions that basic science [18:56] was interested in and industry was interested in were quite different,

19:01-20:37

[19:01] actually. And I would say today they seem to heavily overlap. [19:05] And there are some things that folks don't typically do in a university lab. People don't tend to study PK or tox or CMC or hardcore drug manufacturing type things. But [19:18] folks are interested in molecular glues, induced proximity and degraders and new drug concepts, right? Folks are also interested in new machine learning models, you know, and new delivery mechanisms and, you know, inflammation and stress and, you know, all kinds of things like this. And so there's much more heavy overlap, but I think the way that [19:40] the type of product that selects what you do upstream [19:44] is very different, they're structurally different. You know, end of the day, you have to optimize for grant funding and first-author, co-first-author papers and, you know, that type of stuff in the academic setting, whereas, you know, [19:57] you do have to make a drug and have dozens to hundreds of people lining up behind a molecule or a program [20:05] to reach the Holy Land, right? And I think that does lead to differences in strategy. And I don't know, personally, by the way, I think people [20:15] really kind of overemphasize, like, what's academia and what's industry, and the reality is these are, like, heavily overlapping distributions today in how people operate. And so I think the differences are a little overblown, but, you know, [20:30] they do have fundamentally different incentives that drive different behaviors. And a lot of what we try to do in blending our

20:37-22:16

[20:37] academic side of the house with our technical staff side of the house, which is built much more like you'd see in industry, has been, you know, kind of part of what we hope will make Arc a model that others want to copy or replicate or propagate. Yeah. [20:54] Makes sense. [20:55] And you've had some [20:56] fun people come through on the technical side of the house, including people like Greg Brockman. [21:01] Yeah, no, it was a joy. Yeah. So Greg joined us during his sabbatical from OpenAI. It was really, you know, the first vacation that he had ever taken since starting OpenAI. Of course, it's to work more. [21:31] all these other domains might just port over directly to understanding, you know, molecules. He was, you know, really kind of jazzed by this and also by the idea that his very specific capability and expertise would be kind of really meaningful in actually making this happen. But, you know, initially he was saying, you know, this is really the first vacation I've ever taken. I can't promise too much. [22:01] this. You know, I need to take Anna on vacation. And then Anna gets up and goes to the bathroom, you know, in the middle of this very long meeting. And then he's like, all right, here's my email. Get me on the repo.

22:19-23:54

[22:19] Of course. Of course. Yeah. So, yeah, just built different. Yeah. It was such a joy to learn from him. [22:26] What have you found works well in getting the different disciplines to work together as one unified team? Yeah, no, it's an interesting question. And we've thought about this deeply at Arc as a convening center, you know, not just between the three flagship research universities [22:42] here in the Bay Area, in Stanford and Berkeley and UCSF, but also between basic science and the biotech industry, but also [22:50] biology and the technology sector. For example, you know, our CTO, Dave Burke, just started a few months ago and has really kind of been leading the computational development. [23:02] modeling of virtual cells where we're trying to simulate human biology with these AI foundation models. And, you know, Dave used to, he actually has a PhD in biomedical engineering from, you know, many moons ago, but, you know, most recently ran engineering at Android and Pixel. [23:18] And so, you know, I think we have also built an entire operational side of the house from finance to legal to lab ops to facilities to university relations and academic affairs. And, you know, we run our own space, our own administration, our own ops and technology. [23:35] We try to do much of that like a tech company. And we've also, I think, recruited in, on the opposite side of the house, many people who [23:45] maybe ordinarily wouldn't work in a [23:48] basic science or discovery setting, but are kind of motivated by the

23:54-25:26

[23:54] mission to be able to take fundamental breakthroughs and have a product sensibility where we can get these out into the real world. And we're not optimizing at Arc for [24:04] Nature and Science papers, right? If you give Berkeley or Stanford professors millions of dollars to do more science, that's almost the default expectation and output, [24:13] right? And what we really care about are [24:16] things that could be tangible. [24:18] So what does success look like? Just more people using the products you create, or what is success for Arc Institute? [24:33] of sharing work with the community and peer reviewing it and all of that good stuff. But technical blogs, you know, code repos, protocols are just platforms that people can use, right? And I think we want to make technologies and platform capabilities that are [24:50] broadly useful [24:51] to be able to create new mechanistic insights, but [24:54] also actually try to cure some diseases. [24:56] Right. And, you know, I think over time, if we can be an Edison shop that's inventing or finding lots of cool things, right, [25:05] there hopefully will be real-world value in those things, and there are lots of partners who specialize in this. Speaking of curing diseases, earlier you were talking about even if you had [25:16] perfect drug design or infinitely accessible, infinitely intelligent drug design, there's still a long process that has to happen after that before you can make an impact on humans.

25:27-27:06

[25:27] Can you talk a bit about where you see opportunities in that value chain? And if you had a magic wand and you could just accelerate progress by 10 years, which of those bottlenecks might be alleviated by things you see coming down the pipeline? [25:40] Yeah, so happy to talk about this in the pharma context, which are large, decentralized, sprawling bureaucracies, much like – [25:49] universities or, you know, the, you know, Congress or the SF city government. Right. And, you know, I think, you know, some parts are, you know, [25:59] you know, [26:00] incredibly, you know, functional and then I think everyone would agree in these different [26:05] some parts are less efficient. And so I think the first thing that hopefully we can see is that we can have models that can improve efficiency in discrete individual steps. [26:16] So can we have a more efficient process for target ID? [26:20] right, in particular. [26:22] Can we have a more efficient process for data analysis? Can we have a more efficient process for information and literature review and summarization? For molecule design, absolutely. Making a better binder, and then figuring out the drug properties of those different molecules. That could be selectivity, that could be pharmacokinetics, half-life, expression, manufacturability, what have you. [26:52] And, you know, there will be... [26:54] models or model guided approaches for each of those steps. And then there's, you know, I mean, I think if you look at how AI is actually kind of being used today in pharma companies, right.

27:06-28:48

[27:06] A lot of them are actually just [27:08] taking massive regulatory documents and [27:10] summarizing them and then using AI to help them write more of it. So there's this compression and decompression of information in a structured fashion that has actually been [27:24] leading to enterprise adoption in this setting, right? And... [27:29] I don't know. I think that... [27:31] I know that says something about the process of where people find things useful. So, you know, one example is if you talk to some pharma execs, right, not the drug discovery organizational leaders, but the, you know, the budget people [27:44] who hold the enterprise purse strings, they'll say, "Well, you know, I don't spend actually that much money on drug discovery. I actually spend most of my money on drug development. [27:55] So tell me something about [27:57] drug development, which is really where most of my dollars go. How can AI help me with that?" [28:02] Right. And I think that [28:04] is actually [28:05] like a very deep comment, right? Because it says something about where money is spent and where value can be found. [28:15] You know, the first thing that I realize is, you know, [28:18] our industry probability of success is like 10%, right? And so, you know, I think a lot of the things that people talk about or complain about or comment on in drug discovery and development [28:29] falls out of that fundamental [28:33] statistic. Yeah. Right? Like, why does the FDA heavily regulate? Why is it so focused on safety? Well, if 90% of the time it doesn't work, they're going to care a lot about safety, right? And I think the promise of AI is if we can go from 10% POS to

28:49-30:30

[28:49] 20? [28:50] or 30, or 50, right? As you move through those steps. Can we? Do you think we can? Over time. I think here's the thing that [29:00] I find kind of interesting is [29:03] we have [29:04] done a lot of biology with what is not that far from guess and check. Right. If you look at what happens in the wet lab, like the actual experiments that are happening. Right. You're just kind of in the arena, [29:19] trying something, right, and trying out random hypotheses. Yeah. Seeing what happens. Yeah. And this is like the missing reasoning trace in the scientific literature, is you don't know what didn't work. Everything is narrativized and written in a story of, you know, [29:37] inexorable logic and vision leading to scientific breakthrough, right? But everyone who makes the sausage knows that's not what actually happens most of the time, right? And, you know, if you actually work with the, [29:51] you know, research, [29:52] you know, kind of folks at the bench, right? You know, and this is actually the case across all technical industries, is that high fidelity reasoning trace of the true process is kind of [30:03] not written down anywhere, and that would actually be very useful for these reasoning models and for closing the loop and doing, you know, multi-agent frameworks, blah blah blah blah. But that's kind of what we'll need. But if you actually look at what happens, it's guess and check. [30:17] And so a model with even a modicum of predictive value would be transformative, which, by the way, we don't have. I think biology is a very...

30:30-32:04

[30:30] pragmatic, salt of the earth, experimental discipline, right? You see this in the culture of peer review: show me the data, you can't, you know, pontificate in your discussion section, because, you know, you haven't shown any of this stuff, right? And if you read old papers, [30:48] they were so clear and visionary and highfalutin in a way that I think papers today are... we have this culture that is very pragmatic. [31:04] you know, [31:05] I think A, obviously be useful for accelerating the efficiency of science, it also changes the culture, [31:10] which I think will be really interesting. Why do you think it will change the culture of models? [31:17] Because you'll believe people's predictions or pontifications, depending on... [31:22] It's just another evidence point, basically. Just like how model hallucinations could be, [31:28] you know, [31:29] predictions or they could be [31:31] nonsense and garbage. [31:33] And that depends on how much you trust the model. [31:35] Right. [31:36] Got it. [31:37] Why do you think it's the case? And we've collected a lot of data, too, to your point. There's obviously... [31:42] we basically see what works and we oftentimes don't see what doesn't, although hopefully lab notebooks are recording that in some way, shape or form. Maybe. [31:50] Maybe, hopefully. Um, [31:52] but somehow, [31:53] still very, very regularly, even when things work in cells, things work in mice, [31:58] they fail in humans oftentimes. Why is that still the case? Is it we just don't understand biology deeply enough?

32:04-33:34

[32:04] Why is there still that drop-off, and that drop-off hasn't really changed over time? Well, these are imperfect models. [32:10] Right? And, um... [32:12] You know, we set up this set of filters... [32:16] in the drug discovery process where, you know, first show it works in cell lines, then show it works in primary cells or in an organoid, then show it works in a mouse, then show it works in a monkey, then test it in people, right? [32:29] you know by the time you've gotten there like [32:31] five years and $100 million have gone by. I think that's very challenging. That's where I think [32:38] predictive models will really help. [32:40] Mate. [32:41] because [32:42] The reason why we do all of these steps in linear series is because we don't have predictive power, and so we... [32:50] have to do things in the arena and it just, you know, you have to do it in real life, right? And growing cells and growing animals, right? [32:59] takes... [33:00] months to years to actually do those experiments. And so the promise of having a predictive model isn't just predictive power, it's that you could actually simulate things. [33:09] in a multi-parallelized fashion, right? That's the whole... [33:13] idea behind parts of Machines of Loving Grace, right? That I thought, you know, Dario really did get right. And it's the idea that if [33:22] you had... [33:23] something that could [33:25] Be a trusted oracle. [33:27] Right. Yeah. That you could just run, you know, 10,000 agents. Right. At the same time. Right.

33:34-35:09

[33:34] Do we have enough data for that full closed loop to create a trusted oracle? [33:39] I think we will see more examples of this coming out over time. Today, folks building AI agents for things are basically trying to close the gap between step... [33:52] X and step X plus one, right? Or X plus four or whatever, right? And the businesses are trying to find the most... [34:00] commercially valuable set of, you know, steps that is a set size that's as small as possible and step number in order to, you know, make a company. And I think, [34:10] We will have agents or co-pilots at each step in the scientific method from hypothesis generation to experimentation to data analysis. And the ability to close the loop and write the paper or make the discovery and then decide what to do next, I think is [34:28] quite far away but I think something that's very efficient at [34:32] traversing the steps [34:34] I think... [34:35] will really take off and [34:38] So as a concrete example, right, one of the things that we recently released at Arc is our Virtual Cell Atlas, right, which is the world's largest data set of single cells. [34:47] that we're using for training these cellular foundation models. [34:52] Uh... [34:53] The way that it happened was we created an agent that was essentially, it's like a crawler, kind of like, you know, kind of a search crawler, but it's able to crawl the kind of, you know, Sequence Read Archive and then process all of the highly unstructured and messy metadata.

35:10-36:42

[35:10] and kind of reanalyze and systematically reprocess all single cell data. And this is something that is just running on a cloud bucket instance, just cranking away, right? [35:22] you know, in a tireless fashion, right? And it's the kind of stuff that a talented computational biologist wouldn't want to do because it's so grind set. But, [35:31] actually the scale at which we're able to reach is [35:35] community-wide and [35:37] That's the leverage and efficiency that our team, you know, just, you know, [35:42] really two lead researchers could achieve with one agent, [35:47] I think it [35:48] was a huge mental unlock for me. And so we want to be at the frontier of actually deploying these and making breakthroughs. And I think the meta aspect, um, that folks are going after right now will shake out over time, but I care about using these to actually make breakthroughs as opposed to chart the end-to-end closed loop path. Yeah. [36:09] You mentioned earlier the... [36:12] sort of the pragmatism that a lot of research papers have now versus... [36:16] grandiosity or something a bit more visionary back in the day. I wonder if some of that is related to the specificity [36:24] of the work that people are doing now, meaning it feels like we've gotten more and more specialized over time. Yeah. And I wonder if some of that is taking us away from breakthroughs, because in a lot of cases, you need the knowledge from different domains or different disciplines to achieve those breakthroughs. And I guess maybe the question is.

36:43-38:12

[36:43] One of the nice things about LLMs is that they can incorporate an enormous amount of information, and they're sort of inherently generalized even when you apply them to a specific domain. [36:54] how much of the efficacy that we might get out of some of these models [36:59] is simply related to their ability to go across all these different specialties. Yeah. If you look at... [37:06] at least to me, the best scientists that I've had the pleasure to collaborate with or learn from. [37:12] Um... [37:13] they... [37:16] really do two things. They're able to come up with really creative ideas and they're able to execute on them. [37:23] The reason why they're able to come up with really creative ideas is because they're able to make connections [37:29] between things that other people wouldn't make. And in fact, if you got [37:33] a room of 10 really smart people together to [37:37] chat science, like that's, you know, a weekly lab meeting in any group, right? [37:42] if you actually analyze the anthropology of what happens, there's usually a small subset of people who are hearing all the things that are being discussed and then actually saying... [37:52] this is the connection, the conceptual bridge between these things. Right? And so, [37:58] There's... [37:59] That's sort of like an out-of-distribution generalization. It's like, this thing that I heard was really novel, and let's recognize that and then try to see what can generalize out of that observation. [38:11] Right.

38:13-39:45

[38:13] um, [38:14] That comes from... [38:16] people who tend to [38:18] either read a lot [38:20] or reason a lot, right? And so there's some aspect of pre-training, right? You need to just read a lot of papers, right? Read a lot of chemical biology papers, read a lot of molecular biology papers, read a lot of AI papers, read like physics papers, right? And do so across domains so that you can traverse those boundaries, right? I think... [38:39] There's this... [38:41] design problem of how do you build a multidisciplinary team, right? And the reality is there, well, for example, you want to work at the interface of bio and ML. There are way more ML people and way more bio people than truly bilingual ML and bio people, right? I have the extreme fortune of working with [38:59] some of those at Arc. [39:00] Right? Um... [39:02] and they're just rare. [39:05] But those translators can actually help you [39:10] power the rest of the population. [39:15] What do you look for in people at Arc? [39:17] It depends on the role, right? I think, you know, depending on how you're trying to match specific project needs. But in a way, I could tell you. [39:29] all the ways that we try to intellectualize our recruiting process, but it actually comes down to very simple things, right? I mean, it's the same thing that I look for in a research technician or an executive, at least on the science side, right? It's really like...

39:45-41:37

[39:45] are you thinking about science outside of the lab? - Hmm. [39:49] right and [39:51] Have you done something... [39:53] end to end before. [39:55] And then the third is, do you have the grit? [40:00] to actually kind of walk the path and get it done. [40:04] What do you mean by the end-to-end part? [40:06] I think it's very easy to go from step one to two or three to five or, you know, 12 to 15. [40:14] But... [40:14] you know, [40:15] going from 1 through 15 winnows down the population significantly. And so, you know, I often say, you know, the last 20% of a project is actually 80% of the work. [40:28] reigns because [40:30] finishing something and then honing your killer instinct from finishing things multiple times [40:36] really matters. [40:37] What should we expect to be coming out of Arc Institute in the next... [40:41] Six months and then over the next few years. [40:43] Well, I'm tremendously excited about lots of things, and I think the... [40:48] thing that maybe many people don't know is the degree to which we have [40:53] really been trying to [40:55] build biology out at Arc. I think people have maybe heard about our gene editing work or our machine learning work, right? But [41:03] A lot of what we're actually trying to build is this general concept of applying high-throughput scalable data [41:11] technologies. [41:12] in the context of multi-systems interactions, right? Really working at the neuro and immune interface. And so, you know, we hired, you know, two incredible scientists out of Penn last year, who study the process of interoception, right? So proprioception is, you know, when you kind of close your eyes, you know where your limbs are, right? And interoception is the idea of,

41:38-43:10

[41:38] you know, I feel the weather in my knee, or my tummy feels funny, right? You know, kind of old wives' tales type stuff that actually has really deep science. And of course, it's totally unknown. It's how does your body... [41:51] talk to your brain and vice versa. [41:53] And it turns out there's a deep mechanistic [41:56] basis for this. And as you know, I think when people think about programming biology, they think about it in the drug paradigm of how do I get a binder that binds to this protein? Or how do I get a... [42:06] CRISPR to edit this gene. But if you think about what happens with hormones or with Ozempic, right, you're able to program [42:15] the way that you think and feel and behave in really powerful ways that controls not just satiety, but energy, mood, you know, you know, um, you know, muscle synthesis, you know, um, [42:28] focus, all kinds of things. And I think how do we actually [42:33] program physiology is something that I've been spending a lot of time thinking about in [42:39] in our lab. [42:40] What's one unexpected connection? [42:43] that you think people don't think about. [42:45] um... [42:47] One example is exercise, right? And so Christoph, one of our PIs, had a beautiful paper where he showed that there's a specific species of gut bacteria that produce a certain type of molecule. [43:02] that connects via your enteric nervous system, which is the nervous system that lines your gut, that goes to your brain in order to release dopamine.

43:10-44:48

[43:10] And it is this functional circuit that creates... [43:15] runner's high or exercise reward. [43:17] And [43:18] when you delete this bacteria... [43:20] You cut off this ENS-to-brain circuit, or you cut off the ability of the brain to release the dopamine. At each of these steps individually, you can block the runner's high. [43:32] Right. And so it really traces in an intact animal. Well, this is a mouse study, right? This [43:39] you know, full body circuit. [43:41] But that also goes in reverse. So when you have deep psychological stress, right, that can lead to signaling from the brain, [43:49] to astrocytes that innervate your gut. [43:52] that releases pro-inflammatory cytokines that leads to gut inflammation, [43:58] and then it can give you ulcers. [44:01] Right. So stress causes ulcers. We've actually kind of known this. But why and how can you treat it? Because there are folks who get recurring ulcers. [44:12] And there's a [44:13] brain to body axis by which this signals, right? And I think this happens... [44:19] all the time. You know, some of which is conscious, [44:24] most of which is unconscious, right? And you can actually start to figure out the dials and knobs [44:30] And that's actually, in a way, like a new paradigm for drugs. Yeah. Or how to think about using drugs. It's not just this highly reductive pharmaceutical kind of binder to biomarker type of thing. But that's giving you more of the holistic...

44:49-46:20

[44:49] kind of almost Eastern medicine flavor of just, how do I feel healthier? Yeah. Right. That I think is [44:56] in vogue in the longevity community today, right? Like what is healthspan? How do I improve it? How do I improve my diet and my nutrition? [45:04] So I just feel better, right? You know, those things also... [45:08] can have deep scientific grounding and need to have that. [45:14] How do you think people will be treated in the future? You will have a full panel. You'll know exactly what's going on in your body, and then you'll decide... [45:21] different inputs to influence the whole thing. Like, how do you... [45:24] How do you tie in functional medicine, things that go on longevity? [45:28] the drug industry as it is today. Like where does that interplay? [45:31] I mean, I think we'll want AI doctors. [45:34] right, that are able to integrate information multimodally, right? And so just like you have your CGM that monitors your glucose with high temporal resolution, you have your Oura Ring or your Whoop that talks about your, you know, various biomarkers, or you can go to Quest or, you know, Function Health or whatever and get blood tests, right, that, you know, measure what's going on with your liver function or your cholesterol or, you know, your testosterone or [46:04] right but [46:05] Right now... [46:07] all you really know is that these things are going up or down, and whether or not they're in standard or reference range. That doesn't tell you very much about what you're supposed to [46:16] do. [46:17] Right. [46:18] And I think, you know, one thing that...

46:20-47:57

[46:20] has been really missing in personalized genetics or consumer genetics is the ability to take [46:28] information content from your genome sequence, [46:31] Yeah, and meaningfully integrate it with your health biomarkers in a way that [46:36] gives you the genotype and the environment that can be more predictive of phenotype, right? That G times E equals P equation you learn in high school biology, right? But, [46:45] none of that is actually kind of accessible to mom and dad or to even us, right. In the, in the, in the setting of how do I actually live my life and, [46:56] I think we need to [46:58] go from [46:59] measuring people with, you know, higher content approaches to connecting that to, you know, genetic signatures and make more accurate predictions, right? So that's sort of like been the theme of our conversation today. What do you think that'll look like? Do you think that'll look like any of the existing longevity efforts or do you think there's some new beast entirely that's going to be created to serve that purpose? Yeah. I mean, so I think if you look at, you know, 23andMe, you know, recently, you know, filed Chapter 11, right? And it's, [47:29] I think it's an amazing pioneer. [47:31] visionary pioneering effort in how to [47:36] take genetics and put it in the hands of millions of people. I think the [47:42] thing that I think [47:43] I would really love to see in the world is something that can take all of that information. [47:48] with all of your different, you know, kind of body measurements, and then actually, you know, connect that to diet and sleep and, you know,

47:57-49:31

[47:57] give you personalized recommendations about your health in a longitudinal way. We have very fragmented data sets [48:05] for being able to do this today and I think [48:07] Being able to collect this data at scale across populations and over time with temporal resolution [48:14] will... [48:15] I don't want to be one of these unhinged, big data will solve everything people. But it will. But it will. [48:25] Yeah. I do wonder if there's more stuff on the cross-functional side to your point. Like if you know you have a gambling addiction. [48:31] I don't think anybody's thinking about maybe Mounjaro or Ozempic could help with that. [48:35] But it does help with... [48:36] Some of these things, right? Yeah. Yeah. [48:38] I do think there's something about the cross-functional nature. [48:41] is interoception. [48:43] That's fair. [48:44] There needs to be some... [48:46] some organization that actually makes it accessible. [48:48] Yeah, I don't think... [48:51] That obviously exists today. I think folks are building different hands on the elephant for this, but... [48:58] you know, [48:58] You guys should start this company. [49:00] Honestly, it's a pretty good idea. It is a good idea. Yeah. [49:03] All right, let's get a couple of predictions on a couple of different timescales. So we'll start with 2025 and then maybe 2030. [49:12] And then maybe 2050... [49:14] What is the most interesting thing we're going to see in the world of AI meets bio? [49:19] In 2025, by 2030, and by 2050. [49:23] My hope is by the... [49:24] end of the year. [49:26] I mean, and this is already happening, right? We can, you know, just we can design full IgG antibodies.

49:32-51:04

[49:32] Right, not single chain binders like nanobodies, but just the real antibody medicines that, you know, we kind of know and love today, that we can just design their CDR regions. [49:43] they're going to bind really well. You can one-shot it, and you can kind of do point-and-click on that, you know, that surface of your enzyme. I can just bind it, right? [49:52] you know, one shot. I think the thing that will mature over the next couple of years is that we can actually design enzymes [49:57] de novo. I think that will be really interesting and also lots of efforts. And again, this is all in the world of proteins. And I think one of the things that most people who think about this stuff are very protein-coded. And so a lot of our work is to sort of zoom out from proteins and think about [50:16] cells, right? And so I think [50:19] building the [50:21] the PDB of virtual cells. It's something that we've been focusing a lot on at Arc. That will take... [50:28] Some... [50:29] years from today to mature. So PDB is the Protein Data Bank, right? And it's the sort of gold standard database of atomic resolution solved, experimentally solved protein structures that was used by, you know, DeepMind to train AlphaFold, right? And so it's the pre-training data that allows the model to reach some SOTA capability like [50:50] protein structure prediction at angstrom resolution, right? So, you know, what is that for, you know, virtual cells, which we think would help us design better drug targets, increase therapeutic probability of success, right? I think, you know,

51:05-52:39

[51:05] That's sort of my 2030. [51:07] sort of prediction is that we have [51:09] accurate and useful virtual cell models that make cell biologists feel emotion. [51:15] Right. [51:17] So the 2050 idea, and hopefully this happens much, much sooner than that, you know, there's lots of chat about scientific superintelligence or, you know, the end-to-end kind of recursion of the scientific method. [51:33] I'd like to see that [51:35] with [51:36] you know, lab in the loop with a fully automated wet lab that's vertically integrated. [51:42] Do you think it's possible by 2050 we can... [51:45] simulate with 99.9% accuracy. [51:49] the impact that a particular drug is going to have on a particular target. [51:54] validate that in a wet lab [51:57] in a fully automated way in a matter of hours. [52:00] not months. Like, what do you think the dream scenario is for going from zero to impact? Um, [52:07] you know, in the future of drug discovery, if you imagine 25 plus years of technological progress. [52:13] Yeah, I mean, I think we've laid out different aspects of the vision over the course of our conversation today. [52:19] where things are really slow, right? [52:23] like toxicity, long-term follow-ups, right? These are the, you know, it depends a lot on the disease, right? If you're doing some acute oncology thing, that's very different from some really chronic autoimmune thing, right? And so I think the only way that I can imagine you...

52:40-54:15

[52:40] speeding this up is if you have a model with strong predictive power. [52:44] and so a lot of [52:47] this hinges on our ability to make models that can actually do that. Yeah. And I think you'll unlock different step sizes of capability based on how good they are. [52:57] Is there any reason we wouldn't have that by 2050? [53:00] Yeah, basically if we make the wrong data. [53:03] I would say is like one obvious one. You know, you can model the mouse [53:09] in all of its glory to great [53:13] perfection, it still will not be the human. [53:16] that's one sort of I think [53:19] trivial example that... [53:23] is something that [53:26] We still do, though. [53:27] because that's just what's practical. And so I have this other soapbox about how we actually just need to be doing way more experiments in humans. [53:36] And what do you think it will take for us to be able to do that? [53:39] Is that just a regulatory thing? Is that... [53:41] I think there will be some aspect of creativity involved in addition to... [53:47] better... [53:48] regulatory innovation. So one example would be, you can take samples from brain-dead patients, [53:56] for example, and then now you can just get lungs that you can perfuse and keep alive for a week. [54:01] and then just do experiments in that lung. And there was a paper recently published about this. [54:08] Should we do lightning round? Yeah, let's do it. My first one, favorite new AI app that you've tried in the last three months?

54:15-55:46

[54:15] So this is maybe cheating, but I'm a DAU of OpenAI Deep Research. [54:21] Right. I just... I find it just by far the... [54:24] main AI app that I find [54:27] useful enough to use in my day-to-day work, right? And so there are lots of other fun AI apps or things that I pay attention to because I'm interested in AI. Yeah. But the thing that actually changed [54:40] how I work is these deep research models. And they have, by the way, so much more room to improve and to run. And, uh, [54:50] Yeah, I think it was the first time I felt real emotion. [54:54] Thinking, oh, wow. Okay, someday maybe I will be automated. Yeah. Oh, wow. Yeah. I feel that every day. [55:02] Who would be on your Mount Rushmore of scientists? [55:06] Um... [55:07] This is maybe a bit smarmy. [55:10] But it's the folks I get to work with at Arc. [55:15] I realize this is incredibly smarmy, but it's really genuine. [55:21] I feel so lucky to go to lab every day and just be around. [55:26] you know, [55:27] just passionate, bright, kind, and [55:31] incredibly ambitious people. Yeah. And it levels up my game. [55:35] What do you think is going to be the killer application [55:38] that scientists will use by the end of this year. [55:42] Deep research. [55:43] Still that. All right. Nothing that you guys are going to create at Arc?

55:47-57:18

[55:47] Um... [55:48] Well, I think these virtual cell models will be incredibly useful. [55:51] I don't think we or anyone will have... [55:56] working models in the sense that [55:58] I think it'll take some time for them to mature to the point where they're... [56:04] actually fundamentally useful. They're currently research problems that will ripen over some time. [56:11] But we'd like to deliver those. [56:14] What's the most important thing you've learned at Arc Institute? So there's a Scottish proverb that I'm going to butcher when I paraphrase, but it's basically, you know, be happy when you're alive. [56:25] for you're a long time dead. [56:26] And, you know, I think that really hit for me when I read it. Um, and, you know, it was one of those, you know, you're lying on the couch, the phone is six inches from your face beaming lux into your eyeballs right before you're supposed to fall asleep and [56:42] I don't know, that just, it reminded me that despite all the complexity of, you know, [56:49] trying to work really hard to do useful things, you're supposed to have fun. [56:53] And... [56:54] I think that should... [56:56] I think we need more of that in life, not just in research labs, where I think it can be so easy to be super critical because that's the training and how you make progress is to hate on everything and everything has a problem and figure out why this can go wrong. [57:14] But... [57:15] Being optimistic and happy is not...

57:19-58:00

[57:19] a path to mediocrity and mistakes, but it's actually how you have [57:24] the emotional capacity for persistence over time to reach those long-term goals. [57:29] Love it. It feels like a good place to end it. I know. Well, hopefully this podcast made you happy too. Made us happy. I'm always happy to see you, Bob. Thank you again. Thank you for doing this. Thanks, guys.
