Building the GitHub for RL Environments: Prime Intellect's Will Brown & Johannes Hagemann
Will Brown and Johannes Hagemann of Prime Intellect discuss the shift from static prompting to "environment-based" AI development, and their Environments Hub, a platform designed to democratize frontier-level training. The conversation highlights a major shift: AI progress is moving toward Recursive Language Models that manage their own context and agentic RL that scales through trial and error. Will and Johannes describe their vision for the future in which every company will become an AI research lab. By leveraging institutional knowledge as training data, businesses can build models with decades of experience that far outperform generic, off-the-shelf systems. Hosted by Sonya Huang, Sequoia Capital
- Published Feb 10, 2026
- Uploaded Jun 11, 2026
- File type: Podcast
Full transcript
AI-generated transcript with timestamped sections.
[00:00] If data is the bottleneck, if having the real expertise is the bottleneck, like, would you rather have the smartest person in history work at your company or someone who's been there for 30 years? Sometimes you really want the person who's been there for 30 years. There's a lot of expertise that comes from really understanding a problem deeply and interact with it over a long time. And this is really... [00:17] what happens in training that is almost impossible to replicate in a short prompt. You really want the ability for institutional knowledge to compound over time, for best practices to compound over time. And this is how institutions and companies grow to be really powerful and successful is they stand on the shoulders of what they've done before rather than kind of resetting every day. And we want to have this be accessible to any company that wants to do this. And I think that's how we've thought about approaching it, especially as software becomes easier for people to manipulate [00:47] barrier to entry for coding becomes easier, we see the same happening for AI research. [00:51] *music* [01:07] Will and Johannes work at Prime Intellect, which is one of the coolest neolabs in AI right now. Your mission is to make frontier lab training accessible to everyone, which I think is a very noble mission. You have really, really strong taste and just developer feel and just understanding how to, you know, that intuition for what developers care about. And then what you all launched with the Reinforcement Learning Environments Hub was like really, really differentiated and people were very excited about.
[01:36] with you all today. Post-training, reinforcement learning, agent harnesses, your platform, the RL Hub, and then big picture questions on what's coming next in post-training and RL. Does that work? [01:48] Sounds great. Absolutely. [01:50] Maybe to get started, you are one of the leading research labs enabling customers to post-train their agents. Can you tell me about what is that? What does that mean? [01:57] Yeah, for sure. I'd have to take that one and give a bit of a higher level overview of what our platform does as well as our research at Prime Intellect. As you already mentioned at the beginning, we try to make frontier infrastructure available to any startup, enterprise and Neolab as well. And basically the infrastructure that is currently locked behind the walls of the big labs where nobody really has access to them. And we really start from the compute layer and the compute orchestration layer and go all the way up then to the entire infrastructure. [02:27] full post-training stack. So everything from like the training frameworks that are needed to do large-scale reinforcement learning to the environments with a bit of a more community approach of our environment hub. [02:39] to other pieces that I actually needed to do this, like sandboxes for secure code execution. [02:45] and evaluations as part of our environment hub as well. [02:48] and yeah to offer this like as an end-to-end product in a sense [02:53] And what's the intuition for why even why pursue that mission statement of making all that infrastructure available to everyone? [03:00] Yeah, there's a lot of reasons. I think one is... [03:03] And I think something that we are very passionate about is just like open science as a way that humanity moves forward, where like a lot of the big scientific discoveries historically have been things that we we talk about and as a world we can kind of build on top of. 
But kind of more practically speaking as well, there's a lot of value in model customization where we are like the winning applications are using AI for a specific thing, for some agent, for some workflow, where you also want to be able to deliver this at scale and with cost effective performance. And so really the way to kind of.
[03:32] really optimize these systems end to end is to be able to have access to the model weights directly, where you can then craft the model to be the best model for your problem rather than some model off the shelf where the crafting happens just in a prompt. And so it's really just allowing a deeper layer of customization than what you can do at the prompt level. And then is your vision for the future then that, you know, every company will be pre-training their own models? [03:55] post training their own models, fine tuning their, like, what do you think the future holds? [04:00] We definitely think that every company will be an AI company. And we think most AI companies will want to have an AI research lab. And research can look like many different things. It can look like pre-training, especially if you're in a domain where you maybe don't just want text in, text out, if you want something more bespoke. [04:30] very practical for people to do at scale for the right shape of problems that people want to solve. Awesome. And then can you say at a high level what your platform does? [04:38] Yeah, so we have a full stack research platform called Lab. And Lab is about giving everybody the ability to do the things that a frontier research lab can do internally, but for anyone in the world who wants to do this kind of research.
[05:08] But it's also more general beyond just reinforcement learning. And I think if you haven't heard of an environment before, it's essentially the same thing as the evals that get reported when people talk about new model releases. So SWE-bench and AIME and Terminal-Bench, these are all examples of environments where there's a dataset of tasks, there's a harness for the model to be in, and there's something called a rubric or reward function, which is responsible for grading the quality of the outputs. And so the same thing you'd use as an eval offline as your kind of your test set, you [05:38] And this is a way to improve model performance interactively. And so our platform is really enabling people to use environments in their workflows for post-training, for evaluation, for synthetic data, for reinforcement learning. And it's also very much focused on being a community platform in the same way as things like GitHub are, where this stuff is new, it's complicated. [06:08] workflows as environments to allow other people to post-train models in those environments, to kind of have this way of sharing the ability to improve performance across different tasks. [06:18] I think the general idea is to give more companies the actual ability and advantage that currently only the big labs have, in the sense of this product-model optimization loop where they can optimize their models for their specific product. And we see it as a thing where...
[06:36] Yeah, that's kind of the reason why ChatGPT was created by OpenAI, or Claude Code was created by Anthropic. They actually have the capabilities to optimize models for their specific scaffolds, in a sense. Their models work way better in their products, and the more popular those products become, like Claude Code becoming extremely popular right now, the big labs naturally have less of an incentive to actually make it work better for other coding startups, in a sense, right?
[07:06] And the idea there is to give them the tools to have their own model-product optimization loop. And I think there are early adopters on that front. I think one great example [07:18] that I always give in this case is Cursor, who realized this, in my opinion, quite early on. They built their own Composer 1 model, where they did large-scale post-training, in a sense, and [07:30] really optimized a model where the environment was actually Cursor itself. So it gives the model all the tools that you have in Cursor, in a sense, and optimizes the model inside of Cursor. And we believe there's a lot more startups that will go in this direction to, on the one side, optimize their current products, but also build completely new products that are really not possible right now without having this product-model optimization loop. Awesome. [07:59] And can we say a word on, you said something interesting. [08:02] You said environments are just evals. [08:05] Can we dissect that statement? In my head, an environment is a state. It's a description of world state [08:12] through which you observe what actions you take. [08:15] You observe how world state changes and therefore you update your world model. That to me is distinct from an eval, which is like, you should have gotten this answer on this set of questions. And so can you help me merge those two realities? Sure. Yeah. So I think there's... [08:28] a version of eval that's kind of where we were maybe a year or two ago, where a lot of evals were like question and answer. [08:34] And it's like this big bank of questions. And then maybe there's other notions of environment that people think about when they talk about old-school RL with Atari, which is much more about this kind of long-running state interaction loop.
And I think where we are now is it's both in the same thing where especially the the environments that you want to do large scale training on, they do have this complex state. They maybe are simulating a web app. It's a full fledged kind of coding platform where you have an agent doing these things. But
[08:59] In the original RL games, there's always a reward, there's a goal. And so the notion of there being a goal of this problem, where it's not just running through some system and a human is going to kind of vibe check it, there actually is something that can measure progress and performance. And that's kind of what I mean by it's an eval, in that there is a system that... [09:17] It starts at some state. It... [09:19] interacts with the system and the environment, the harness, the agent, whatever you want to call it. But there is some goal and there's a way to measure whether it's doing well or not. [09:29] Hm. [09:29] Okay. [09:30] Got it. And then is there a difference between, you mentioned kind of Cursor as a great example of somebody doing really great frontier work [09:38] in reinforcement learning. Is there a good way to think about [09:41] when should you be kind of constructing RL environments versus... [09:46] when should your actual application and your application states be the... [09:51] the environment, so to speak? [09:53] Yeah, I think there's definitely reasons where you want both, especially like, let's say you're training a model to be a really good Rust coding model. Here, you might want it to be good for lots of different applications, where you'd have different environments that are each focused on some domain task. Maybe you're a company that wants a model that's going to be really good at calling your tools or using your specific domain language, which you can then provide as a service to people who are building around that.
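Editor's sketch: the definition given in this part of the conversation (a dataset of tasks, a harness the model runs in, and a rubric or reward function that grades the result) can be pictured roughly as below. This is a minimal illustration under stated assumptions, not Prime Intellect's actual API; the names `Task`, `Environment`, `single_turn_harness`, `exact_match`, and `toy_model` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# A "model" here is just any function from prompt text to output text.
Model = Callable[[str], str]


@dataclass
class Task:
    prompt: str
    answer: str


@dataclass
class Environment:
    """Hypothetical container: tasks + harness + rubric."""
    tasks: list[Task]
    harness: Callable[[Model, Task], str]   # how the model interacts with the task
    rubric: Callable[[Task, str], float]    # how the final output is graded

    def evaluate(self, model: Model) -> float:
        # Run every task through the harness, grade it, and average the rewards.
        rewards = [self.rubric(t, self.harness(model, t)) for t in self.tasks]
        return sum(rewards) / len(rewards)


def single_turn_harness(model: Model, task: Task) -> str:
    # Simplest possible harness: one prompt in, one answer out.
    return model(task.prompt).strip()


def exact_match(task: Task, output: str) -> float:
    return 1.0 if output == task.answer else 0.0


def toy_model(prompt: str) -> str:
    # Stand-in for an LLM call; gets one of the two questions wrong.
    return {"2 + 2 = ?": "4", "Capital of France?": "Lyon"}.get(prompt, "")


env = Environment(
    tasks=[Task("2 + 2 = ?", "4"), Task("Capital of France?", "Paris")],
    harness=single_turn_harness,
    rubric=exact_match,
)
mean_reward = env.evaluate(toy_model)
print(mean_reward)
```

Used offline, `evaluate` is an eval; the same per-task rewards could instead be fed to a trainer, which is the sense in which the speakers treat evals and environments as one abstraction.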
[10:23] directly become the environment, where I think the companies who might want to be doing this are the same ones who care about whether they're using Claude or GPT-5 or Gemini, or about the ability to choose models and have internal systems to evaluate whether a certain system prompt is good, or whether changing out a model endpoint is good, or whether using the mini version of a model is a better cost-performance trade-off. The infrastructure to do that, which a lot of people have already been building at these kind of advanced agent companies, is the same infrastructure you use to do [10:53] of this kind of convenient world we ended up in, where the training paradigm that makes the most sense for improving model capabilities is the same sort of thing that a lot of people have been building up the muscle for, just without doing the training piece. And so that's kind of where we see RL being a very [11:08] useful tool for people to then have this as an option in their toolkit for system optimization. [11:13] Got it. [11:14] Okay. [11:14] Um, [11:16] I want to talk about agent harnesses as they relate to RL. I feel like harnesses is like the theme of the moment, especially with, you mentioned, Claude Code getting so much love. I think one of the things that they do exceptional engineering around [11:26] is the harness. Harness and RL, are those things orthogonal, mutually exclusive? Like how do they relate? [11:32] They definitely relate. I think the way I think of a harness is like a piece of the environment. And so for any eval or environment task, there's some input, [11:43] a bunch of stuff is going to happen, then there's some output state which is then going to be graded. And so this whole [11:49] intermediate piece, whether it's interacting with some simulator or interacting with, like, another agent or physical world sim, this is the environment. And the harness is very much a piece of that, where it couples how the model interacts with
[12:04] any other pieces of the system. And I think depending on the application area, like I think for coding agents, we have a pretty clear like definition of what's the you could say the harness is the CLI coding agent and the terminal is the environment. But this isn't necessarily going to be universal across like all different types of agents. In some cases, it's a system prompt and some tools is the harness. In some cases, it's something that is going to be spawning sub agents and those sub agents also have their own harnesses. And so there's a lot of complexity. I think the way we've thought about it is like [12:33] harnesses are going to keep evolving. There's going to be this Cambrian explosion of ways people want to use models and we want to take a [12:39] a pretty general approach in defining what you could do with a harness. And so what we are really thinking about [12:44] and why we use the term environment as the abstraction is like, [12:47] Agent is too narrow. Harness is like kind of too narrow. And you can do all of these things within an environment. But [12:52] the environment [12:54] as an abstraction on the whole, allows this [12:57] any sort of system model interaction. [12:59] is in scope. And then, do you think then... [13:04] Do you think all companies should be... [13:07] post-training their models with environments. [13:09] Are there specific kind of, you know, where the bullseye, like you absolutely need to be using environments versus like you could be post training with a different method versus. [13:19] You should just be prompting your thing. [13:20] I mean, I think environments are tools you can use to do all of these things. And so I think that's when I talk about kind of environments beyond RL. I think part of the reason why it's a useful abstraction is because it doesn't tie you to RL. Let's say I want to have a small model and I want it to be distilled from a big model.
[13:37] the way you can do this is you take the big model, you plug it in your environment, you let it run a bunch of times. Now you have all this data that comes from the same interaction protocol. You can use the same grader at the end to filter for the best examples and then do SFT fine-tuning on that. You could do prompt optimization with an environment. You could A/B test different models with an environment just as you would with an eval. And so we really, I think the idea is that every AI company should be optimizing their AI systems. [14:01] I think that's a less controversial take. Okay. And maybe there's another way to frame it. Like the eval is almost... [14:08] how your agent performs in the set of environments that you expect your customers to face. [14:13] Yeah, I think it, um, [14:15] It depends on what you want to call, like, I mean, in kind of traditional machine learning terms, you don't want to overfit to the test set. And so in some ways, we kind of are already accepting that we're going to be using the test set or the eval to measure, and to have that kind of filter back into the model. [14:29] And so it. [14:30] I think it's a little tricky to distinguish even what is the eval, what is the environment. We kind of think of them as one and the same. [14:36] Um, and [14:38] we use the environment term very generically, where an eval is a type of environment that's used for measuring performance but not training on. [14:44] Um, [14:45] And that's kind of how we... [14:47] see a lot of people thinking about [14:50] when they're doing evals, what they really mean they're doing is they have some way of measuring current performance and they're iterating on it. [14:55] And in some ways, RL is this... [14:57] iteration applied at scale, where you're automating the process of changing the model a little bit, changing the prompt a little bit, and having this be the way that you can kind of hill climb on some goal. Maybe another question. 
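Editor's sketch: the distillation recipe described at [13:37] (run the big model in the environment many times, keep only the rollouts the grader scores highly, then fine-tune the small model on them) is a form of rejection sampling. The code below is a toy illustration only: `teacher` is a hypothetical stand-in for a large model that is right about 70% of the time, `grader` plays the role of the environment's rubric, and the SFT step itself is not shown.

```python
import random


def teacher(prompt: str, rng: random.Random) -> str:
    """Hypothetical 'big model': adds two integers, but is wrong ~30% of the time."""
    a, b = map(int, prompt.split("+"))
    return str(a + b if rng.random() < 0.7 else a + b + 1)


def grader(prompt: str, completion: str) -> float:
    """The environment's rubric: 1.0 for a correct sum, 0.0 otherwise."""
    a, b = map(int, prompt.split("+"))
    return 1.0 if completion == str(a + b) else 0.0


def collect_sft_data(prompts, samples_per_prompt=8, threshold=1.0, seed=0):
    """Sample the teacher repeatedly and keep only rollouts the grader accepts."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = teacher(prompt, rng)
            if grader(prompt, completion) >= threshold:
                kept.append({"prompt": prompt, "completion": completion})
    return kept


# The filtered pairs would then be used as supervised fine-tuning data
# for the smaller model (training itself is out of scope for this sketch).
sft_data = collect_sft_data(["1+2", "10+5", "7+7"])
print(len(sft_data), "examples kept out of", 3 * 8)
```

The same `grader` serves as both the eval metric and the data filter, which is the reuse the speaker is pointing at.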
Do you see reinforcement learning as synonymous
[15:11] functionally with post-training? Meaning, if you're post-training a model, are you doing reinforcement learning or are you doing other things? [15:17] There's definitely a lot of things. I think reinforcement learning is like the big thing now where it's like, [15:21] in many cases, practically speaking, [15:23] if you're doing a large-scale model, RL will be where you spend the most of your time and focus and compute. But it's not the only thing you want to do. There's a lot of things involved in the full process of going from some initial model to the system you want to deploy. This can be prompt tuning, this can be SFT, this can be online distillation. There's a number of algorithms that all kind of are under this umbrella, where RL is like the big one in the middle, but there's a lot of stuff around it. And [15:48] really, I think exposing this toolkit to people and letting them have all these knobs they can play with [15:53] is the way to unlock. [15:55] And are you finding that it's [15:56] the really smart AI researchers that actually [15:59] know how to make this stuff happen in practice? Is it, you know, towards your goal of democratizing AI development for all? [16:06] You know, does your average Fortune 500 know how to use the platform and get value out of it? [16:11] I think most companies have people who can. [16:13] I think [16:15] any Fortune 500 company will likely have a team of AI engineers who are capable at [16:21] following the latest tools, who are good at using Claude Code, who have a lot of opinions about models and prompting. [16:29] Those people certainly can do this. That's kind of the audience we see as the target customer for this. Especially if you give them the right tools to actually do it. Like maybe some of those large companies don't really have anybody in there who can debug a GPU cluster, in a sense, to actually kick off such a run. 
Or other components that are needed in there to like actually just make it easier in a sense to do large scale like agentic reinforcement learning with tool use, with code execution and pieces like this.
[16:57] to just abstract away the entire infrastructure for them. [17:01] Yeah, got it. [17:02] Awesome. Actually on that note, do you guys have any favorite customer stories that you want to share? [17:07] Yeah, one of our favorite customers that I would like to point out is Arcee AI, a neolab working on frontier models. They have been working with us on the entire stack, in a sense. We've been talking a lot about reinforcement learning, but we've also been doing large-scale pre-training in the past. That's basically where our history is coming from, in a sense, as well. So we've been training with them some of the largest [17:37] and [inaudible] on the post-training side with them as well. Maybe, Will, do you want to share some more? Sure, yeah. So they're a very close collaborator of ours. I think we've all been friends for a long time, but also they've had a lot of things that they want infra for, and that's been a way to kind of force us to build out a lot of the pieces, from compute orchestration to post-training to pre-training to inference, [18:07] of be a frontier lab. [18:10] And I think they're very aligned with us in kind of the openness mission. But I think they are more targeted at enterprises and the end user, where they are kind of going to work more directly with customers in terms of end-to-end delivery of a certain artifact, where I think where we come in is we are.
[18:28] we are really focused on the developer experience at the infra layer and the ways that we can put these tools into more people's hands, where the process of [18:37] going from idea to deployable model can become as seamless and quick and efficient as possible. Awesome. And then any other favorite customer stories, maybe on people that are using the environments product specifically? Yeah. So we are definitely really focused on the research community. And some of this is like a lot of grad students use it. A lot of students and people who are early in [19:03] their career and learning how to do this stuff are using it. But also a lot of labs who are focused on a very specific domain, where let's say you're starting a medical AI lab. We work closely with a number of groups in the medical AI space where they want to create more [19:17] benchmarks to understand how good models are at medical capabilities, in terms of diagnosis or patient interactions or question answering about medical literature or agentic search over certain medical tasks. And so Sophont and OpenMed are two that we work closely with, where they really care about earning trust from the medical professionals. [19:47] headache for a lot of people. And so for them, being able to have this platform for creating evaluations, for showing them off, for being able to use them to then improve model capabilities that could then be deployed locally in a hospital or deployed locally for some end user, where the ability to have this customization and have this kind of end-to-end trust and understanding of the data provenance for the tasks at hand, or the ability to customize models very directly, is very key. Do you have any customers that are using you for more of like
[20:17] kind of Atari style, you know, learning from your environment type? Do you mean like, so when I think of Atari, I think of like non-LLMs. And so we definitely are focused on LLMs and foundation models that look like LLMs. Yeah. [20:30] There's definitely a lot of researchers who use our platform for these things that are more like there's some examples we have on the platform that are much more like games. So, I mean, I have one fun one that I use is like a demo a lot is the game Wordle from The New York Times, where this kind of ended up just being the like. [20:45] a great hello world environment for people because it's the infra is really simple but it's very expressive where you can kind of get a feel for it and you get this aha moment of seeing a model learn to think about the game as you give it rewards for doing better and you can you can do it with a really tiny model too like it's it's in this sweet spot of difficulty where you can do these runs on like a couple gpus in an hour and see a model actually learn how to get better at the game yeah and [21:10] yeah i would say like the more toyish game examples um yeah usually the ones that people are going for for actually learning how to build those reinforcement learning environments and yeah that's also what people are heavily using the environment hub for in a sense just because we have all this infrastructure built around it to uh yeah be able to actually test in um your environments and um yeah that's usually how they start out and then yeah go to actually building more complex environments uh later on another group i would love to give a shout out to in a [21:40] on the hub are people part of our reinforcement learning residency where we have like a group. We initially started out with like eight to 10 people, I think 14 to 16 are now in the group. 
You have people, grad students, as well as people working full time that are part time, like building reinforcement learning environments, as well as doing novel research on top of the environment hub.
[22:01] And, yeah, those folks have built amazing environments in all kinds of verticals, in a sense, everything from verifiable software engineering and Lean to medical physics environments to some cybersecurity environments. [22:17] And then, yeah, also we give them the tools, obviously, now to actually do the training in those as well. Yeah, awesome. Can you help demystify for me? I've heard that some of the foundation model companies are spending millions of dollars each on... [22:29] some of these environments. - Yeah. - And so like you mentioned cybersecurity at the end. [22:35] What goes into constructing a cybersecurity environment? [22:39] Yeah. So there's someone in our residency program who actually does a lot of this stuff. And so there actually is a lot of tooling in the cybersecurity world around these capture the flag games, where there's some system that has some hidden bug in it. And this is like a challenge that originally was built for programmers, where programmers will have like a little hackathon where they try to go find the bug in some system. But you can adapt these challenges to LLMs as well, where it then is a full software environment, where the agent lives in the terminal and has access to tools for doing bash commands. Maybe it's using Claude Code or some other [23:09] wrapper for an agent harness, but it's in a terminal full of files and can interact with these files. And then at some point, it marks that it's finished with the rollout. And then you can grade the state of the environment using other pieces of software, other code that executes. But we actually have a lot of people we work with who are in the data and environment space, where we've found that there's a lot of...
[23:31] interest in using reinforcement learning as a way to evaluate data quality, where the ability to measure what happens when you train a model on a set of tasks, set of rubrics, set of environments, allows you to understand bugs in the environments. Because there are issues that come up in reinforcement learning where maybe if your environment has a backdoor, a model can exploit this and kind of game the system. And so I think there's a lot of interest in people using RL to, [24:01] environment that ends up in some frontier lab's, like, [24:04] foundational training run for the next GPT or Claude model. There's a lot of vetting that goes into this, and doing these smaller or medium-scale runs in, let's say, one environment allows you to really poke and see where the problems might be. Okay, super interesting. And if we take the cybersecurity environment example one step further, by the way, I have no particular affection for cybersecurity, but it's something that like, I love having a specific example in my head. [24:27] Um, [24:27] I could see how you could construct a toy example, right? These Capture the Flag examples, I would imagine they're toy examples. I would imagine they look nothing like the actual kind of corporate network environment of a, you know, real company with all of these [24:38] cybersecurity products. And so [24:40] I guess, how do you construct an actual... [24:42] Does what people can construct, does it actually scale up towards imitating and reflecting the complexities of the real world? Almost like the robotics sim-to-real gap. Yeah. Is there a sim-to-real gap [24:52] in crafting these environments, and is the solution just to like [24:55] train on real kind of corporate security environments? [24:59] Yeah. So I would say there's less of a barrier in terms of the actual complexity. Like, in principle, these can be as complex as you want. It's anything that could be on a computer. 
So just think of anything that's on a computer or network of computers as potentially an environment. What really kind of becomes the bottleneck in many ways is like cost of the simulator, where I think there's a lot of focus on identifying clever ways to kind of mock the right piece of the system where let's say you have an agent that you want to like, let's say the Internet.
[25:29] of like web search. In some cases, you actually want to do the real web search. In some cases, you want to find ways where you can design tasks where you actually don't need the full thing. It's like this kind of, I don't know, [25:40] Um, [25:42] think of like the map games where there's this world you want to explore, where there might be this whole map and some of it's dark, but as you walk around the light shines and you now see this part of it. And so if you know which pieces of this map might actually be explored on a given task, then this kind of allows you to decide which pieces are important or not. And so like [25:59] one example, there's a benchmark called τ²-bench (tau2-bench) that is very popular in the eval world, where it's about customer service agents involving a database. And so the database is something where like, [26:10] in a real system, you might need the full database. But if you know that the agent is only going to ask certain types of questions for a task, you don't need a full database with millions of records or thousands or billions of different examples. You can have kind of a mock database that is a cheaper-to-run, maybe in-memory piece of software that [26:31] only has the things that are expected by the agent or expected in the scope for a given task. And so identifying the pieces of the system where the complexity is kind of overkill [26:42] is part of this task design process to enable efficiency. And does your platform help with that? Like, it almost seems like... [26:49] if creating environments is the bottleneck to training these systems, then being able to efficiently create these environments, where you're lighting up the part of the map that has to be lit up, is a core platform competency. Do you guys assist with that?
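Editor's sketch: the mock-database idea from [26:10] can be illustrated with a small in-memory store that holds only the records a given task can touch, plus a reward that inspects the final state. Everything here is hypothetical (the `MockOrderDB` class, the order IDs, and the task goal) and is not taken from the benchmark mentioned or from any real system.

```python
class MockOrderDB:
    """In-memory stand-in for a production order database (hypothetical).

    It holds only the handful of records that a given task can reach,
    so every rollout can start from a fresh, cheap copy.
    """

    def __init__(self, records):
        self._orders = {r["order_id"]: dict(r) for r in records}

    def get_order(self, order_id):
        return self._orders.get(order_id)

    def cancel_order(self, order_id):
        # Tool the agent may call: only pending orders can be cancelled.
        order = self._orders.get(order_id)
        if order is None or order["status"] != "pending":
            return False
        order["status"] = "cancelled"
        return True


def make_task_db():
    # Fresh state per rollout: just the two orders this task is about.
    return MockOrderDB([
        {"order_id": "A1", "status": "pending"},
        {"order_id": "B2", "status": "shipped"},
    ])


def reward(db):
    # Task goal (assumed for this sketch): A1 is cancelled and B2 is left untouched.
    ok = (db.get_order("A1")["status"] == "cancelled"
          and db.get_order("B2")["status"] == "shipped")
    return 1.0 if ok else 0.0


# Simulated agent actions standing in for tool calls made during a rollout.
db = make_task_db()
db.cancel_order("A1")
refused = db.cancel_order("B2")  # refused, because the order already shipped
final_reward = reward(db)
print(final_reward)
```

Grading the end state of the mock, rather than the text of the conversation, matches the "grade the state of the environment" pattern described earlier for terminal-based tasks.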
[27:04] Yeah, I mean, it's definitely a way that we think about the design of everything. And we kind of go down a lot of these different... [27:09] like rabbit holes as we have to focus on different tasks and working with different people. And so like coding agents maybe is one example where like there's a lot of complexity that comes up when working with like sandboxes and terminal states and ensuring that you have like good snapshotting and all of these things and protocols to interact with different agent harnesses. And so we've built a lot around that. But it's also the sort of thing that we kind of know that the space of complexity is going to grow arbitrarily over time as people start to [27:33] getting more deep into these different domains. And I think we've tried to design everything in a way where we keep a lot of doors open, where like we have room to build features that kind of like there's base layers of like generic environments. Then you can go, I want a coding agent environment. I want a coding agent environment with a sandbox that's global across the run, or I want one per rollout, or there's lots of these different branches you can go down. And so we kind of have anticipated like there will be a lot of these branches. There's some that we've [28:03] for when we need to. And we also think a lot about how do you make a good developer experience where, let's say, think about just documentation or skill files for agents. People are going to be using coding agents when they're building these. And so there's a lot of institutional knowledge that gets built up when you're doing this kind of research, both for a specific project or as a research team doing large-scale training runs. And being able to surface this information, some of it is directly in the product, some of it is in the way we design the library, some of it is in [28:33] agents. And this is the sort of thing where we've designed it to be
[28:37] something that can evolve over time as like the research literature evolves, as the best practices for different types of complex agents become more clear. I assume you guys read Sutton's... [28:49] Learning from, what is it, age of experience? Age of experience, yeah. Scale AI era data labeling. [28:54] Um... [28:56] Do you think that constructing environments is sort of the natural successor [29:00] to that? [29:01] Yeah, I mean, it very much seems like it kind of already is, where it does seem like a lot of the focus from the major labs has shifted to it, but they're still using a lot of human data. And so human data doesn't seem to be going away, [29:12] because creating these environments kind of by definition is for things that models aren't good enough at yet, [29:16] which means the humans are better than the models in some capacity. And so identifying which pieces of information... [29:23] the human can most uniquely assist the model in improving its skill on, I think is really the key to target, which is how do you create this [29:31] information flow from the human who really has the expert knowledge on how to do some sort of thing, what does a job well done look like, and get it into the model. And it does seem to be that RL is the most effective way to do this right now, where having tasks with prompts, graded by an LLM judge with a rubric for what success looks like on a task, is kind of the paradigm that's emerging for a lot of [29:55] a human, sometimes it's the reward model, but kind of the most direct, visceral version of it is, uh, there's a set of questions [30:02] about yes, no, was this done in the LLM's answer? And that turns into the reward score. And so that requires a lot of human data.
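The rubric paradigm described at [29:31] through [30:02], a set of yes/no questions about the answer that turns into a reward score, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the equal weighting of rubric items, and the keyword-matching judge are all hypothetical. In practice the judge would be a call to an LLM, which is stubbed out here so the example runs on its own.

```python
from typing import Callable, Dict, List


def rubric_reward(
    answer: str,
    rubric: List[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of yes/no rubric questions the judge answers 'yes' to."""
    if not rubric:
        raise ValueError("rubric must contain at least one question")
    passed = sum(1 for question in rubric if judge(question, answer))
    return passed / len(rubric)


# Stand-in for an LLM judge: a keyword check per question (assumption).
TOY_KEYWORDS: Dict[str, str] = {
    "Did the agent apologize?": "sorry",
    "Did the agent offer a refund?": "refund",
}


def toy_judge(question: str, answer: str) -> bool:
    return TOY_KEYWORDS[question] in answer.lower()


rubric = list(TOY_KEYWORDS)
full = rubric_reward("Sorry about that, we will issue a refund.", rubric, toy_judge)
half = rubric_reward("Sorry, please wait.", rubric, toy_judge)
none = rubric_reward("Please wait.", rubric, toy_judge)
```

The human effort sits in writing the rubric questions, which matches the point that this approach still requires a lot of human data even though the grading itself is automated.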
[30:10] Awesome. [30:12] Why create a hub for environments? [30:16] Yeah, I think the hub idea started by seeing a lot of different like open source repos out there that had like overflowing implementations of all those environments in a sense. And yeah, in general, like we already, well, we already created a nice verifiers framework before we even started the Environments Hub, right? Where you had different examples for environments in there and it was very much a [30:36] yeah, an approach to like standardize the whole process even more. And beyond just like sharing those environments, right, and having like an open source platform for it, it's also about having like a place where you can build a lot of, yeah, infrastructure around them, which you can't really do if you just upload them to like a GitHub repo or something like this. So, yeah, having proper evaluations like integrated with your environment so you can like [31:06] is one of the features people are heavily using the Environments Hub for, right? You make it extremely easy to install one of those environments inside of a different trainer. So we obviously have our own large-scale trainer with prime-rl, which we've been heavily optimizing for this. But yeah, we are very open there on the open source side to integrate with a bunch of different trainers because people have different needs on the trainer side as well. And... [31:34] That's the whole idea of the Environments Hub. [31:37] Yeah. And what has the community behavior been like? Do you see people forking, modifying these environments? Do you see them, you know,
[31:45] putting something 80-20 out into the public domain, and then, like, you know, we're going to keep our... [31:49] secrets for ourselves and, you know, not share that back. Like, what's the community behavior? Yeah, I mean, there's definitely a lot of people we work with who want to keep their environments private, as you'd understand. But the value for them of it being a hub is that they can do ablations on ones that are kind of known to be, they can compare their private one versus some public one that might be on a similar type of task. Or for evals, there's a lot of value in having kind of, uh, [32:10] mix and uniform implementations of popular benchmarks in a way that if you're doing a training run you can plug in some known eval as a way to monitor the progress of your run. And so maybe you're doing well in your environment, you can see does your environment also generalize to other tasks. And so having all these other tasks available is a really helpful way for people to be able to understand not just their own tasks but other things they might want the model to be good at as well. Yeah, awesome. What are the most popular environments on the hub today? [32:36] Yeah, I think it tends to be the ones that are these kind of, one is the ones that we use as the examples in the documentation that naturally in software tends to be the most popular ones. So the Wordle stuff?
[33:06] documents or documents for a specific type of information and having this kind of template for if you just, you just have to swap out the documents and now you have this environment ready to go where the rest of it is already kind of set up. [33:19] That tends to be the sort of thing where we see a lot of value in people being able to bring other types of documents that they'd want to do agentic search over. [33:27] Yeah, got it. Okay, awesome. I want to maybe shift towards future research, you know, big blue sky questions. Maybe the first one... [33:38] Andrej, I think, is one of your angels as well. [33:41] He kind of has that infamous quote of, you know, RL is, you know, it's, it's, [33:45] It's amazing, but it's... [33:47] quite inefficient and it's like sucking bits from a straw. [33:50] I guess, do you agree with that? And what do you think is going to happen in the research side to make RL more efficient? [33:57] Yeah, I mean, I think... [33:59] It's definitely true that RL is using a lot of compute to get a pretty kind of small signal in terms of pure information. But I think in some ways that's part of the value of it as well. And [34:10] why a lot of the labs have really focused on it is that [34:13] One of the bottlenecks that's hard to scale is human data, especially high quality human data. [34:18] And RL allows you to kind of trade off compute for data, in a sense, where you can get a lot of value out of a smaller amount of data by using more compute. [34:28] And so the supervision coming from this data is small, but you can get a lot out of it, more so than you can via pre-training or supervised fine-tuning alone, [34:36] as well as it's useful in cases where you don't necessarily have golden examples. So like if you have a bigger model to distill from, that's great. But if you're already at the biggest model size that you have access to, then you kind of need to go into uncharted territory. And exploration is really the heart of RL. 
It's how do you explore and try out different things and
[34:56] Maybe there are ways to do exploration that are more efficient than RL that people will discover, but this is kind of currently the... [35:02] the frontier of using compute to explore and improve capabilities. And so that's what we got for now. [35:07] For sure. And I think, like, I can't speak for Andrej, obviously, but yeah, I would be curious to hear his views like two months later, after the latest Dwarkesh podcast, in a sense, especially on the coding side, on large-scale reinforcement learning and how it actually helps there. If we look at something like Claude Code, which definitely was already popular before but definitely popped up more over the last, like, one month, I would say, [35:37] of how useful reinforcement learning can be in the coding domain. [35:42] But generally, we don't think that's going to be the end in a sense, right? We generally think we want to be always at the frontier of like what comes next in terms of paradigms. Yeah, there's definitely lots of low hanging fruit still on like pushing agentic RL capabilities even further. But yeah, we also like know the limitations in a sense, right? Like some of the pieces we've been working on where we definitely see limitations is on the context side. [36:12] limit right now of how many tokens we can fit into a context. And you have been thinking there about ways on how to actually improve that in a sense. [36:21] Awesome. [36:21] Um, [36:22] Switching gears a little bit, [36:24] open source, open weight models.
[36:27] Um... [36:28] What role do you see open weight models playing? [36:32] Does your infrastructure... [36:34] kind of work only on or work optimally on open weight models? Could you kind of help people [36:40] do post training around closed weight models? How does that all work? [36:44] Yeah, so... [36:45] in many ways, like, [36:46] the trainer itself is going to require having access to the weights. And so if anyone who has closed weight models wants to use it, we're happy to chat. But [36:54] More broadly, the idea of the environment is general across different types of optimization. And so we can use the same infrastructure at the environment level for doing evals on closed models, for doing prompt tuning on closed models, for doing model selection, for evaluating agent harnesses. [37:24] model into another where you use the environment as this data engine. So there's a lot of different ways you can use the tools [37:31] to optimize models. [37:33] And... [37:34] Do you need, need, need the weights? Like, for example, can you kind of LoRA? Yeah, so, I mean, the LoRA I would consider still part of the fine-tuning process, and that's what we recommend a lot of people do for RL anyways. And so, like, you don't need this... that's not, sure, the full weights, but, like, you can't, [37:49] I can't upload my own LoRA for GPT-5, but I could bring an environment to the platform potentially. [37:55] And so I think there's a lot of different ways you can
[37:58] do partial customization. And I would imagine that [38:01] in many cases, [38:02] the degree of, like, do you need a LoRA adapter, do you need full fine-tuning, is going to depend on the training recipe as well as the goal of your optimization. What about, can you do reinforcement learning on like the agent harness around a closed model? [38:17] Yeah, so I would definitely consider this in the domain of like, like there's a world of prompt optimization that some people have been exploring in the research world that... [38:25] is in some ways kind of an analog of RL, but in prompt space where you have... Yes, exactly. And so the GEPA algorithm got a lot of attention like late last year as kind of a [38:38] what seemed to be a better way of doing this. And so we have support for that as well with our environments where you can do this around different pieces of the harness, where this might be, what's the prompt used for a certain tool? What's the agent's skill? What's the system prompt? There's a lot of these things that you can apply different types of optimization to. Awesome. While we're on the topic of DSPy, I think the DSPy authors also have this new thing that is the current thing. What is it? Recursive language models? What do you guys think? [39:06] Yeah, definitely very interested in that. As I already said earlier, in a sense, we are very interested in like longer horizon agents and so on, and like actually solving things for those type of use cases. And yeah, I've been internally thinking for a long time about like how can we have models learn how to manage their own context. So right now people are building a lot of scaffolds for context management and yeah, we believe, like
[39:36] is to have the model learn how to manage its own context. And, yeah, I've been, yeah, searching for different research in this kind of domain for a pretty long time. And, yeah, the recursive language model research direction is one of the, yeah, most promising ones in our opinion. We've been, yeah, uh... [39:52] since Alex Zhang, who's the original author of the RLM work, published it, we've been very interested in this kind of work. I've been exploring it as part of our research as well. And we had a blog post out a couple of weeks ago that [40:09] basically showed using this RLM harness where you pretty much give a language model access to a variable in a persistent Python REPL. So it does not have this whole context or the whole data as input in a sense, but it has it in this variable that it can then retrieve, it can transform it, and manage the context through that, and then also call other sub-LLMs where the recursive [40:39] actually like, um, [40:41] yeah, manage it. And yeah, that's the whole idea behind the recursive language model. We've been doing, yeah, [40:48] some exploration there on this front to actually just give current frontier language models access to this specific RLM harness. So not necessarily training in this harness, but just giving it access to the specific [41:03] yeah, way of like dealing with its own context. And yeah, it's been shown to like improve benchmarks on, yeah, very long horizon reasoning quite a bit. And yeah, we are very excited about it as like a new frontier in a sense, to actually train in this as well, to train the model to actually use this harness. And yeah, that's what we're gonna work on over the next couple of months.
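The harness described at [40:09] keeps the long context in a variable inside a persistent Python REPL, so the model writes code to inspect and transform it and can hand pieces to sub-LLM calls. A toy sketch of that control flow follows. It is not the RLM authors' or Prime Intellect's implementation: the function names are hypothetical, the "model" is a fixed list of scripted code steps, and the sub-LLM is a plain Python stub, so that the example runs without any API.

```python
from typing import Callable, Dict, List, Optional


def run_rlm_episode(
    context: str,
    root_steps: List[str],
    sub_llm: Callable[[str], str],
) -> Optional[str]:
    """Run model-written code steps in one persistent namespace."""
    # The long context lives in a variable, never in the root model's prompt.
    namespace: Dict[str, object] = {
        "context": context,
        "sub_llm": sub_llm,
        "answer": None,
    }
    for code in root_steps:
        # State persists across steps, like a REPL session. Real systems
        # should sandbox this: exec on untrusted model output is unsafe.
        exec(code, namespace)
        if namespace["answer"] is not None:
            break
    answer = namespace["answer"]
    return None if answer is None else str(answer)


def toy_sub_llm(prompt: str) -> str:
    # Stand-in for a recursive LLM call: returns the last token of the prompt.
    return prompt.split()[-1]


# A long context with one relevant line buried in filler.
long_context = "\n".join(
    [f"noise line {i}" for i in range(1000)]
    + ["secret code: 42"]
    + [f"more noise {i}" for i in range(1000)]
)

# Steps a root model might emit (scripted here as an assumption).
scripted_steps = [
    "lines = context.splitlines()",
    "hits = [ln for ln in lines if 'secret' in ln]",
    "answer = sub_llm('Extract the number: ' + hits[0])",
]

result = run_rlm_episode(long_context, scripted_steps, toy_sub_llm)
```

Only the short filtered line reaches the sub-LLM, which is the context-management behavior the speakers say they want the model to learn rather than have hard-coded in a scaffold.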
[41:26] Super exciting. What else in the research domain? You guys have great research taste. What do you think is on the horizon? [41:33] I'm really excited about synthetic data research, and it feels like [41:36] There's a lot of stuff that feels like we should be able to do it, but... [41:41] you haven't really seen it emerge in the open in terms of creative ways of doing kind of self-reflection. [41:48] and [41:49] I think people talk a lot about continual learning as this kind of idea that we're going to have to get better at models kind of learning things on their own. And I think the idea of [41:57] using [42:00] other tricks that we already know in conjunction in different ways, things like prompt optimization, things like distillation, [42:08] in conjunction with synthetic data, [42:11] It feels like there's a lot of... I don't want to kind of go too deep into different directions, but... [42:15] It seems like there's a lot of room for exploration around having models curate their own training data, maybe curate their own environments and understanding which versions of this are most effective for lifelong learning. [42:28] Love it. [42:28] Okay, we're going to close on an optimistic note. If everything goes right, what does the world look like and what is the role that Prime Intellect serves in that world? [42:38] Yeah, good question. How would you answer that on a high level overview in a sense? I would say we don't want to have a world where like all the like future value of AI and all kinds of verticals is just owned by the big labs. We have something where we like empower like entrepreneurs and enterprises and so on to actually not get steamrolled in a sense and like optimize their products, yeah, better than they have the tools for doing so right now. And yeah, just enabling this and yeah, a lot more
[43:08] like Claude Code moments, a lot more Cursor4x type moments that will be enabled through this. [43:14] Every company is a Neolab. [43:16] Got it. In some ways. [43:18] If data is the bottleneck, if having the real expertise is the bottleneck, like would you rather have the smartest person in history work at your company or someone who's been there for 30 years? And in some ways, sometimes you really want the person who's been there for 30 years. There's a lot of expertise that comes from really understanding a problem deeply and interacting with it over a long time. And this is really [43:37] what happens in training that is almost impossible to replicate in a short prompt, where you really want the ability for institutional knowledge to compound over time, for best practices to compound over time. And this is how institutions and companies grow to be really powerful and successful, is they stand on the shoulders of what they've done before rather than kind of resetting every day. [44:07] people to manipulate as the barrier to entry for coding becomes easier, we see the same happening for AI research. [44:13] It's a really inspiring vision for the world. Thank you guys so much for joining today. You've really paved the way on environments and your Environment Hub. And thank you for taking the time to demystify what an environment is and share your vision for the future. Thanks. Thank you. [44:38] Thank you.