Artificial intelligence is entering a new epoch: systems are evolving from static reasoning engines into autonomous entities capable of open-ended self-improvement.
Earlier this year, Zhang et al. (2025) introduced the Darwin Gödel Machine, a meta-architecture for self-evolution guided by environmental feedback and meta-rewards. Its recursive design connects reflection, environment interaction, and reward modeling into a continuous evolutionary loop.

This theoretical milestone, once confined to academic speculation, now underpins a wave of commercial experimentation. Our core investment thesis is that the self-evolution stack will define the next platform era in AI.
At our recent hackathon, we convened 100+ builders and dozens of industry experts and researchers to test these ideas in practice. Across projects ranging from automated research assistants to adaptive coding agents, participants explored how feedback loops drive compounding intelligence.
Fang et al. (2025) offer an abstraction of self-evolving systems. They describe four interacting components (System Inputs, Agent System, Environment, and Optimizer) and five optimization axes that define where adaptation occurs.

Their insight reframes "self-evolution" not as a single algorithm but as a closed feedback system. Systems that can autonomously generate, evaluate, and integrate new experiences will compound faster than any static model. The "outer loop" of self-improvement (e.g., reflection, environment design, and self-rewarding) becomes the key source of differentiation.
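To make the abstraction concrete, here is a minimal sketch of that closed loop in Python. Every class and function below is an illustrative stub of the four components, not any framework's actual API:

```python
from dataclasses import dataclass, field

# Minimal sketch of the four-component abstraction (System Inputs, Agent
# System, Environment, Optimizer). All classes here are illustrative stubs.

@dataclass
class Trajectory:
    task: str
    actions: list = field(default_factory=list)
    reward: float = 0.0

class Agent:
    def __init__(self, memory=None):
        self.memory = memory or []          # accumulated experience

    def act(self, task: str) -> Trajectory:
        traj = Trajectory(task=task)
        traj.actions.append(f"attempt: {task}")  # stand-in for LLM calls
        return traj

class Environment:
    def evaluate(self, traj: Trajectory) -> float:
        # Stand-in for a real feedback signal (tests passing, user rating...)
        return 1.0 if traj.actions else 0.0

class Optimizer:
    def update(self, agent: Agent, traj: Trajectory) -> None:
        # Outer-loop update: here we just store scored experience in memory;
        # real systems might rewrite prompts, tools, or even weights.
        agent.memory.append((traj.task, traj.reward))

def outer_loop(tasks, agent, env, optimizer):
    for task in tasks:                      # System Inputs
        traj = agent.act(task)              # Agent System
        traj.reward = env.evaluate(traj)    # Environment
        optimizer.update(agent, traj)       # Optimizer closes the loop
    return agent
```

The point is structural: whichever component the Optimizer is allowed to rewrite (memory, prompts, tools, or weights) determines where on these optimization axes a given system sits.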
In tandem, we will see accelerated growth in infrastructure that consolidates around agent OS layers and provides efficient means to close this "outer loop." We will also see domains like biomedicine, software development, and enterprise automation–all of which are inherently feedback-rich–become early winners of the self-improvement paradigm shift. Agents that can leverage real-world outcomes as training signals (e.g., clinical accuracy, code correctness, business KPIs) will dominate their verticals.
The long-term frontier is meta-learning infrastructure: systems that make the process of evolution itself programmable. This framework anchors our belief that the self-evolution stack will define the next platform era in AI.
The foundation model is no longer a passive reasoning core; it is the substrate of evolution. Providers are rapidly vertically integrating their offerings to serve different evolutionary niches within the self-evolving-agent paradigm. The key question becomes which model will provide the most reliable, feature-rich, and economically viable backbone for agentic systems.
The market also shows a pragmatic bifurcation between two types of model providers: proprietary frontier labs and open-weight challengers.
| Provider | Model | Key Traits |
|---|---|---|
| OpenAI | GPT | Mature tool/function calling & structured outputs. Best-in-class tool-calling and structured function-execution via JSON schemas enable reliable multi-step reasoning. Its Assistants API provides persistent memory and file-handling. |
| Google | Gemini | Native multimodality (text–audio–vision; Multimodal Live API) and Workspace integration (Gmail/Docs/Drive/Chat), plus Agent/Agent Mode & Computer Use previews for automation. This makes Gemini well-suited to perception-action loops and enterprise workflows where agents must see, reason, and act across Workspace. |
| Anthropic | Claude | Very long context (up to ~1M tokens in beta/enterprise channels) enabling deep code/document analysis and longer reflective chains—useful for agent planning/verification over long horizons. Strong safety/constitutional methods. |
| Cohere | Command | Enterprise-oriented stack with emphasis on retrieval (Embed/Rerank), privacy, and deployment flexibility; positioned for agent workflows that require verifiable, governed RAG and corporate data integration. |
| Provider | Model | Key Traits |
|---|---|---|
| Meta AI (FAIR) | LLaMA | The de facto open-weight backbone for agent projects, benefiting from a massive community and permissive licensing. Research from leaders like Yann LeCun champions self-supervised learning, fueling the open, autonomous AI paradigm. |
| Mistral AI | Mixtral 8x22B | Sparse MoE (39B active of 141B total parameters) gives strong cost-efficiency and latency for self-hosted agents; popular with organizations that want control over fine-tunes and infrastructure. |
| Others | Qwen, Grok, ERNIE | Alternative architectures and training-data mixes. |
Today, evolution happens in the outer loop–by optimizing memory, tools, and policies–while the foundation model acts as a fixed reasoning engine. However, beyond the outer loop, there is a not-so-distant future in which "self-evolution" implies models rewriting their own weights.
The current paradigm of Reinforcement Learning from Human Feedback (RLHF) uses a static reward model trained on human preferences. The future is a shift toward Self-Rewarding Language Models, where the model itself acts as the judge, providing its own rewards to continuously fine-tune its policy. This is the commercial realization of research concepts like the Self-Challenging Agent (SCA) (Zhou et al., 2025).
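A minimal sketch of one self-rewarding iteration, assuming a hypothetical `model` wrapper that exposes `generate` and `judge_score` methods over the same underlying LLM:

```python
# Sketch of a self-rewarding iteration: the same model generates candidate
# responses and then judges them, producing preference pairs for DPO-style
# fine-tuning. `generate` and `judge_score` are hypothetical wrappers around
# one underlying LLM, not a specific vendor API.

def self_reward_step(model, prompts, n_candidates=4):
    preference_pairs = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        # The model grades its own outputs, e.g. on a 1-10 rubric.
        scored = sorted(candidates, key=lambda c: model.judge_score(prompt, c))
        # Best and worst candidates become a (chosen, rejected) pair.
        preference_pairs.append((prompt, scored[-1], scored[0]))
    return preference_pairs  # feed into a DPO/RLHF trainer
```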
Foundation models are becoming powerful enough to generate their own training data, effectively creating their own curriculum. Google's emphasis on synthetic data generation with Gemini 2.0 is a key industry signal in this direction, enabling models to bootstrap their capabilities in new domains where real-world data is scarce. This aligns with research like AgentGen (Hu et al., 2025), where agents construct their own simulation worlds to drive learning.
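A sketch of what such self-curriculum generation could look like, reusing the hypothetical `model` wrapper from above; the difficulty schedule and the score-7 acceptance threshold are illustrative choices, not values from AgentGen:

```python
# Sketch of a self-curriculum loop in the spirit of AgentGen-style work:
# the model proposes harder variants of tasks it already knows, keeps only
# the ones it can solve and verify, and uses them as new training data.

def generate_curriculum(model, seed_tasks, rounds=3):
    tasks = list(seed_tasks)
    for _ in range(rounds):
        proposals = [model.generate(f"Propose a harder variant of: {t}")
                     for t in tasks[-5:]]
        for t in proposals:
            solution = model.generate(f"Solve: {t}")
            # Self-verification filters out tasks the model cannot ground yet.
            if model.judge_score(t, solution) >= 7:
                tasks.append(t)
    return tasks  # synthetic training curriculum, bootstrapped from seeds
```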
The orchestration framework is responsible for planning, state management, and coordinating the complex, multi-step workflows that define agentic behavior. Seminal open-source projects like AutoGPT (Yang et al., 2023) first proved the potential of autonomous agents, and influential research like Microsoft's AutoGen (Wu et al., 2024) pioneered multi-agent conversational patterns.
Modern frameworks provide infrastructure to enable the "outer loop" of self-evolution. The commercial landscape has rapidly matured toward robust, production-grade platforms.
The core investment thesis is that the company that provides the dominant "operating system" for agents (analogous to web frameworks like Ruby on Rails for Web 2.0) will capture immense value by becoming the default execution layer. Early chains proved agents could act; the winners will natively manage state, surface evaluations/telemetry, and plug into any model, tool, and environment.
There is strong market activity here, as evidenced by the proliferation of frameworks and LangChain's recent $1.25 billion valuation.
| Category | Key Players | Strategic Positioning |
|---|---|---|
| Open-Source Incumbents | LangGraph, AG2 (OSS branch of Microsoft AutoGen), LlamaIndex, CrewAI, Mastra (Sam Bhagwat) | First-Mover Advantage & Ecosystem: LangChain and LlamaIndex have massive developer adoption. LangGraph's stateful, graph-based architecture is becoming the standard for building production agents, while LlamaIndex is known for its advanced RAG techniques. CrewAI simplifies multi-agent systems. Mastra, with its $13M seed, serves the TypeScript/JavaScript developer community. |
| First-Party Challengers | OpenAI AgentKit, Google ADK, Microsoft Agent Framework (formerly AutoGen) | Native Integration & Simplicity: These frameworks offer seamless integration with their respective foundation models (GPT-5, Gemini). They are positioned as the simplest path to build agents on their platforms, often with visual builders that compete with Zapier. Key players aim to be the fastest path to production on their model stacks (tight function calling, hosted eval hooks, computer use). Expect continued consolidation around "one-vendor" developer flows for teams that prize speed over portability. |
| Enterprise and Low-Code | Flowise (acquired by Workday), LangFlow (acquired by DataStax), Dust, Griptape | Enterprise & low-code expand the surface area: Enterprise-focused platforms like Dust and Griptape differentiate on governability (RBAC, policy, observability). Visual builders like Flowise and LangFlow broaden access; they are useful for prototyping and internal automations before handing off to code-first orchestration. |
Early agent frameworks executed linear, stateless chains. The next generation manages branching logic, persistence, and collaboration among specialized agents. LangGraph leads this shift with a graph-based execution model that treats every node as a resumable state machine, while Microsoft's AG2 (AutoGen OSS) and CrewAI extend the pattern to conversational and role-based teams. OpenAI's Agents SDK and Google Agent Mode are following suit, introducing native session state and coordination hooks inside their model runtimes.
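The core idea (each node as a resumable state machine) can be sketched in a few lines. This is illustrative of the pattern, not LangGraph's actual API:

```python
import json

# Minimal sketch of a graph-based, resumable agent executor in the spirit of
# LangGraph-style designs. Each node reads and returns state plus the name of
# the next node; checkpointing the state dict after every step makes the run
# resumable after a crash or interruption.

def plan(state):
    state["plan"] = f"steps for: {state['goal']}"
    return state, "execute"

def execute(state):
    state["result"] = f"did: {state['plan']}"
    return state, "verify"

def verify(state):
    state["ok"] = True
    return state, None  # None terminates the graph

NODES = {"plan": plan, "execute": execute, "verify": verify}

def run(state, node="plan", checkpoint_path="checkpoint.json"):
    while node is not None:
        state, node = NODES[node](state)
        state["_next"] = node
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)             # resume point for the next run
    return state

print(run({"goal": "fix failing test"}))
```

Because the graph is declarative data (a dict of named nodes), it can be diffed, versioned, and rolled back, which is exactly the property the self-improving orchestration thesis below depends on.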
Today, evaluation and fine-tuning happen in disconnected MLOps stacks; tomorrow's orchestrators will capture agent trajectories, score them, and trigger updates directly. W&B Weave and LangSmith already provide integrated tracing and multi-turn evaluation; OpenAI's Responses API and Google Agent Mode formalize step-level telemetry that can be analyzed or rewarded in real time.
Longer term, orchestration frameworks will evolve from static graphs to self-improving programs. Inspired by the Darwin Gödel Machine concept, agents will propose and test edits to their own workflows under audit constraints, creating a recursive optimization loop. LangGraph and AG2 provide natural foundations because their declarative graphs can be diffed, versioned, and rolled back, while evaluation platforms such as Weave or LangSmith supply the safety rails for testing orchestration changes.
Tools allow agents to interact with the outside world, execute tasks, and, most critically for self-evolution, generate the feedback necessary for improvement.
The core investment thesis is that value will accrue to platforms that create liquid interfaces to the digital and physical world. This means moving beyond static connectors to dynamic systems where agents can discover, combine, and even create tools, a frontier that was opened via seminal research like ToolLLM (Qin et al., 2023) and Voyager (Wang et al., 2023). The market is evolving from a simple "Zapier for AI" model to a more sophisticated, multi-layered ecosystem for action, safety, and feedback.
Notably, we also observe the beginnings of vertical integration between tools and orchestration solutions. A prime example is OpenAI's approach with AgentKit: by tightly coupling the orchestration layer (AgentKit) with a curated set of tools and their flagship model (GPT-5), OpenAI is creating a "lock-in layer."
| Category | Description | Key Players |
|---|---|---|
| API and Skills Layer | Provides the connective tissue between the agent's reasoning and external APIs | Composio: An "agent-native" tool library with a focus on creating a reliable "skill layer" for agents. Pipedream: Ships managed MCP servers that instantly expose ~3,000 APIs and ~10,000 tools to the agent. Hugging Face Transformers Agents: A standardized open-source interface for calling thousands of models as tools. Stability AI: A specialized creative tool for on-demand image, audio, and video generation. Gumloop, Make.com, Workato: Established and emerging iPaaS platforms providing thousands of connectors for enterprise automation. MCP in IDEs: Anthropic's Model Context Protocol is becoming the standard way to plug tools/data into agents; IDEs are shipping MCP clients (e.g., JetBrains IntelliJ 2025.1), making "use the codebase/CI/ticketing tool" a first-class, in-editor capability. Amplitude also ships an MCP server to expose product analytics as an agent tool. |
| Secure Execution Layer | Creates a safe "sandbox" for agents to execute code or perform actions without risking harm to the host system | Daytona: A high-speed, open-source sandbox for running AI-generated code. E2B: An "enterprise AI agent cloud" providing secure, tool-equipped cloud environments for agents. Runloop: A runtime environment for agents designed to manage continuous loops, checkpointing, and rollback. Modal: Serverless compute platform that supports persistent containers triggered by agents, with built-in autoscaling and native Python decorators for lightweight sandboxing. Microsandbox: An open-source alternative focused on lightweight, secure code execution. |
| Feedback and Reward Layer | Translates the outcome of an agent's actions into a measurable signal for learning and evolution | Amplitude MCP: Exposes product analytics as a reward signal, allowing agents to query their own performance. LiveKit: Provides real-time voice/video infrastructure, enabling feedback for multimodal and embodied agents. Langfuse, Arize, W&B Weave, Monte Carlo: Agent observability platforms that trace agent behavior, providing the raw data for reward modeling. |
Today, developers manually select and wire up tools. The next step is for agents to do this themselves. Frameworks will evolve to allow agents to dynamically discover relevant tools from a vast library based on the task at hand, and even compose multi-tool chains on the fly. This is the commercial application of research like ToolLLM (Qin et al., 2023), which studied how LLMs could learn to master over 16,000 real-world APIs.
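A minimal sketch of such dynamic discovery: embed each tool's description once, then retrieve the top-k most relevant tools per task instead of hard-wiring them. The tool registry and the `embed` function are hypothetical placeholders for any sentence-embedding model:

```python
import numpy as np

# Sketch of dynamic tool discovery by semantic similarity. Only the top-k
# retrieved tools are exposed to the agent for a given task, keeping the
# tool space tractable even with thousands of registered tools.

TOOLS = {
    "send_email": "send an email to a recipient with subject and body",
    "query_sql":  "run a read-only SQL query against the warehouse",
    "create_pr":  "open a pull request with a diff and description",
}

def top_k_tools(task: str, embed, k: int = 2):
    names = list(TOOLS)
    tool_vecs = np.array([embed(TOOLS[n]) for n in names])
    task_vec = embed(task)
    sims = tool_vecs @ task_vec / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(task_vec))
    return [names[i] for i in np.argsort(-sims)[:k]]  # expose only these
```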
The ultimate frontier is moving beyond using existing tools to creating new ones. Inspired by research like ToolGen (Wang et al., 2024), future agents will be able to read API documentation and write their own functions or "skills" to interact with new services they've never seen before.
The "Feedback & Reward Layer" will become more sophisticated. Platforms like Langfuse and Arize, which currently focus on observability (tracing agent behavior), will increasingly merge with reward modeling. Instead of just showing what an agent did, they will provide tools to define what success looks like for that action.
The memory layer is the stateful repository where an agent stores and retrieves knowledge, experiences, and skills. Foundational research like Reflexion (Shinn et al., 2023) established that an agent could improve its performance by storing and critiquing its own reasoning–turning memory into an active tool for self-improvement.
The research frontier is focused on moving beyond simple data stores to more cognitively inspired systems. This includes implementing structured long-term memory via knowledge graphs and developing sophisticated short-term memory curation, such as using the Ebbinghaus forgetting curve to intelligently decide which memories to retain or let decay, as explored in frameworks like MemoryBank (Zhong et al., 2024). ReasoningBank (Google, 2025) extends this toward "lifelong reasoning memory": a persistent, queryable repository of experiences.
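As a concrete illustration of the forgetting-curve idea, here is a sketch of exponential retention with spaced-repetition-style reinforcement; the constants are illustrative, not MemoryBank's actual parameters:

```python
import math
import time

# Ebbinghaus-style decay: retention R = exp(-t / S), where t is time since
# last access and S is a stability that grows each time the memory proves
# useful. Thresholding R decides which memories to keep or let decay.

class MemoryItem:
    def __init__(self, text):
        self.text = text
        self.last_access = time.time()
        self.stability = 1.0          # grows as the memory is reinforced

    def retention(self, now=None):
        t_hours = ((now or time.time()) - self.last_access) / 3600
        return math.exp(-t_hours / self.stability)

    def reinforce(self):
        self.last_access = time.time()
        self.stability *= 2           # spaced-repetition-style boost

def prune(memories, threshold=0.1):
    return [m for m in memories if m.retention() >= threshold]
```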
For a self-evolving agent, this layer is the experience repository that fuels the "outer loop" of learning. The dominant paradigm for agentic memory today is Retrieval-Augmented Generation (RAG). A common example of RAG implementation is pairing a vector database with an LLM to provide a long-term memory that extends beyond the token-limited context window.
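A minimal sketch of that pairing, with hypothetical `embed` and `llm` callables standing in for any embedding model and chat model:

```python
import numpy as np

# Minimal RAG-style memory: a vector store paired with an LLM prompt. On
# each question, the k most similar stored memories are retrieved and
# prepended as context, extending recall beyond the context window.

class VectorMemory:
    def __init__(self, embed):
        self.embed, self.texts, self.vecs = embed, [], []

    def add(self, text):
        self.texts.append(text)
        self.vecs.append(self.embed(text))

    def recall(self, query, k=3):
        q = self.embed(query)
        sims = [float(np.dot(v, q)) for v in self.vecs]
        best = np.argsort(sims)[-k:]
        return [self.texts[i] for i in best]

def answer(llm, memory, question):
    context = "\n".join(memory.recall(question))
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```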
The core investment thesis is that as agents become more autonomous and long-running, the value of their accumulated experience will grow exponentially. The frontier now extends to memory that writes itself, forgets selectively, and learns what to remember.
| Category | Description | Key Players |
|---|---|---|
| Vector Databases | The underlying database technology that stores information as high-dimensional vectors (embeddings) for fast, semantic retrieval. | Pinecone: The venture-backed market leader, known for its managed, high-performance vector search service. Weaviate: A strong open-source and EU-based competitor to Pinecone. ChromaDB, Milvus, Qdrant: Popular open-source alternatives that offer more control and are often self-hosted. |
| RAG and Knowledge Frameworks | The "operating system" for RAG, providing the tools to ingest data from diverse sources and build queryable knowledge indexes. | LlamaIndex: Market leader in composable RAG systems. LangChain RAG: Integrated within orchestration stacks. KnowledgeGPT, AG2 memory components: Emerging players. |
| Data Ingestion and Processing | The "plumbing" that crawls, cleans, and structures raw data (from websites, documents, etc.) into an "LLM-ready" format for ingestion into a vector DB. | Firecrawl: Crawls websites and converts unstructured content into clean, structured data for RAG pipelines. Reducto: Text distillation layer that compresses and summarizes raw documents into vector-optimized storage. Unstructured.io, Airbyte: Other key players in the broader data integration and preprocessing space. |
| Agentic Memory Developer Tools | Lightweight, model-agnostic memory SDKs that let developers attach persistent context, episodic recall, and vector search to agents | Mem0: Memory-as-a-service with automatic summarization and decay. Supermemory: Open-source memory graph for LLM agents. Letta: Framework for building stateful agents with persistent memory (formerly MemGPT). Zep: Fast, embeddable vector memory for chatbots. Hyperspell: Semantic recall and personalized context memory. |
| AI Browsers and Enterprise Memory Systems | Tools that unify search, documents, and conversational memory into knowledge graphs | Glean: Enterprise knowledge graph and retrieval layer. Rewind: Personal memory and timeline for local workflows. Perplexity Pro Search: Knowledge-grounded conversational memory. Dex: Local-first agent memory built into a browser-based context engine. |
While early efforts focused on manual "prompt engineering," this is a rapidly commoditizing skill. The investment thesis is that value is shifting to programmatic context optimization, where the system itself learns and refines the most effective context for a given task.
This is the shift from "prompt engineering" to what Anthropic calls "context engineering"–treating context as a critical, finite resource to be algorithmically managed. The academic blueprint for this is the Agentic Context Engineering (ACE) framework from Stanford and SambaNova (2025), which treats context as an "evolving playbook" that the agent refines over time.
In our opinion, the prompt/context engineering space will collapse into the memory-optimization space, though there is still notable recent activity in standalone prompt optimization.
In the near future, we will see a convergence with orchestration via Agentic Context Engineering. The prompt will evolve from a static block of text into a dynamic, executable program.
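A sketch of what an "evolving playbook" context might look like in code, inspired by but not reproducing the ACE framework; the bounded entry count and the usefulness-count eviction policy are our illustrative assumptions:

```python
# Context as an evolving playbook: the agent keeps a bounded list of
# strategy bullets, appends a lesson after each episode, reinforces the
# entries that actually helped, and evicts the least useful ones so the
# context stays a finite, curated resource.

class Playbook:
    def __init__(self, max_entries=20):
        self.entries = {}               # bullet text -> usefulness count
        self.max_entries = max_entries

    def render(self):
        return "\n".join(f"- {e}" for e in self.entries)

    def update(self, lesson: str, helpful_entries=()):
        for e in helpful_entries:       # reinforce what actually helped
            self.entries[e] = self.entries.get(e, 0) + 1
        self.entries.setdefault(lesson, 0)
        while len(self.entries) > self.max_entries:
            worst = min(self.entries, key=self.entries.get)
            del self.entries[worst]     # curate: evict low-value context

def build_prompt(playbook, task):
    return f"Playbook:\n{playbook.render()}\n\nTask: {task}"
```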
This layer provides the "reward function" for the agent's learning loop. Foundational research has established the outer-loop optimization principle: models can continuously refine themselves using feedback generated from their own behavior, not just static datasets.
This principle was pioneered by concepts such as LLM-as-a-Judge (Zheng et al., 2023), and was further advanced by Self-Rewarding Language Models (Yuan et al., 2024), which proved that an agent could generate tasks, produce solutions, and then grade its own performance to create an automated feedback loop. Complementary work like TextGrad (Yuksekgonul et al., 2024) and AutoRule (Lyu et al., 2025) has focused on converting qualitative feedback into structured, differentiable reward signals.
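A minimal LLM-as-a-Judge reward function might look like the following sketch, where `judge_llm` is a hypothetical chat callable and the 1-10 rubric is an illustrative choice:

```python
import re

# Sketch of an LLM-as-a-Judge reward (after Zheng et al., 2023): a grader
# model scores a response against a rubric, and the score is parsed into a
# scalar reward that downstream optimizers can consume.

RUBRIC = """Rate the RESPONSE to the TASK from 1 (useless) to 10 (perfect).
Reply with only the integer.
TASK: {task}
RESPONSE: {response}"""

def judge_reward(judge_llm, task: str, response: str) -> float:
    raw = judge_llm(RUBRIC.format(task=task, response=response))
    match = re.search(r"\d+", raw)
    score = int(match.group()) if match else 1   # default to worst score
    return (min(max(score, 1), 10) - 1) / 9.0    # normalize to [0, 1]
```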
As agents move into production, the commercial layer is coalescing around systems that operationalize this principle: capturing, structuring, and applying feedback at scale. The next generation of evaluation frameworks treats every interaction as an experiment, an opportunity to quantify and reinforce improvement.
| Category | Description | Key Players |
|---|---|---|
| Experiment Tracking and Observability | The "Datadog for AI," providing tools to trace, log, and compare agent behaviors over time | Weights & Biases (W&B): The market leader in MLOps, now extending its platform with Weave to provide first-class support for tracing and evaluating agentic workflows. Ruler (OpenPipe): Lightweight evaluation and scoring framework enabling structured prompts, reference outputs, and metric computation directly in production pipelines. Haize Labs: Emerging leader in feedback data infrastructure and reward-model training, building platforms for collecting, curating, and aligning AI and human feedback at scale. |
| Safety and Guardrails | Ensures agent outputs are safe, reliable, and adhere to predefined rules before being executed | Guardrails AI: An open-source and commercial platform for enforcing structural and quality guarantees on LLM outputs. Lakera, ProtectAI: Emerging contenders in model integrity and red-teaming for production agents. |
| Benchmarks and Environments | Provides standardized, realistic environments for testing and comparing agent performance on complex tasks. Still primarily academic | WebArena & OSWorld: Influential open-source benchmarks that simulate complex web browsing and operating-system tasks, pushing the frontier of agent capabilities. |
Future Directions:
The Learning Environment is distinct from a simple sandbox; it is a persistent, high-fidelity simulation or digital twin of a real-world process that provides the rich, continuous feedback necessary for RL and self-improvement.
However, this layer is the least commercially developed. The market is bifurcated: frontier labs such as OpenAI, Anthropic, and Google build proprietary internal environments to train and benchmark their foundation models, while independent startups face the structural challenge of competing with in-house infrastructure.
Still, a nuanced investment opportunity is emerging. Despite these challenges, we see validation from a handful of high-conviction, early-stage bets. In our view, the next decade's winners are unlikely to win by building synthetic "universes" from scratch; more plausibly, a company will find early environment-building success that ultimately paves the way to a larger generational win.
| Category | Description | Key Players |
|---|---|---|
| "Gym for Agents" Platform | Provide a turnkey, infrastructural layer for environment creation, simulation, and synthetic data generation. The goal is to become the default "gym" where developers train their agents. | Applied Compute: A pre-launch, ex-OpenAI team that raised a significant round to build RL-as-a-Service infrastructure. Mechanize: Building RL environments that simulate white-collar work to sell to leading AI labs. |
| Domain-Specific Digital Twin | Focus vertically on a single, high-value domain (e.g., drug discovery, financial trading) where a specialized, integrated agent-and-environment stack can prove clear ROI. | Biomedical Agents (e.g., MedAgentSim): The most promising vertical, where a simulated clinical environment can train diagnostic agents with clear success metrics. Robotics (e.g., Google's RT-X): Using high-fidelity simulators to train robots before deploying them in the physical world. |
| Productized Learning Loop | Abstract away the "environment" and instead sell a "continuous learning" product to enterprises. This bundles a lightweight simulation with observability and feedback tools, tying agent evolution directly to business KPIs. | This is the most commercially viable near-term thesis. It's less about building a complex world and more about instrumenting the existing one. Amplitude MCP turns product analytics into a direct reward signal for the agent's learning loop. |
Future Directions:
Verticals like biomedicine and software development are characterized by specialized task structures, unique data modalities, and operational constraints that demand tailored agent architectures. The investment thesis is to favor domains where agent behavior can be aligned with these procedural requirements and where success can be measured by clear, closed-loop feedback signals (e.g., clinical outcomes, test pass/fail results, conversion metrics).
The biomedical domain is a prime vertical for self-evolving agents because it requires agents to operate under uncertainty, engage in multi-turn diagnostic dialogues, and integrate external knowledge bases.
Ambient scribing is the beachhead; longitudinal reasoning with AI-powered diagnostic assistants is the destination. Research like MedAgentSim (Almansoori et al., 2025) provides the blueprint for training these agents in high-fidelity simulated environments, while commercial players are already attracting massive valuations. Abridge raised a large round in early 2025 to expand ambient AI across health systems, while Nabla has continued rolling out its "Copilot" in EHR workflows.
Key Players:
In drug discovery, agents are being used to navigate the vast chemical space to find novel therapeutic compounds. Research efforts like CACTUS (McNaughton et al., 2024), which equips agents with cheminformatics tools, and STELLA (Jin et al., 2025), a self-evolving agent for biomedical research, are the academic precursors to the commercial work.
Key Players: Insilico Medicine (IPO announced), Atomwise, and the publicly traded Recursion, all of which operate closed-loop RL environments that connect LLM reasoning with wet-lab validation.
Future Directions:
Software development is an ideal domain for self-evolving agents because its workflows are highly structured and it offers an unambiguous, binary feedback signal: code either works or it doesn't. Agent optimization in this domain is focused on aligning with established software engineering practices for code refinement and debugging.
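That binary signal is trivially cheap to compute, which is part of why this vertical is moving fastest. A sketch, with an illustrative pytest invocation and timeout:

```python
import subprocess

# Sketch of the binary feedback signal for coding agents: run the test
# suite and convert the exit code into a scalar reward.

def code_reward(repo_dir: str) -> float:
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=repo_dir, capture_output=True, text=True, timeout=600,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # a hung suite counts as failure
    # Exit code 0 means every test passed: an unambiguous reward of 1.
    return 1.0 if result.returncode == 0 else 0.0
```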
Research is bifurcated into two main approaches: single-agent systems that iteratively refine and debug their own code, and multi-agent systems that divide coder, reviewer, and tester roles across specialized agents.
Key Players: Sourcegraph Cody and Codeium are building "continuous improvement" assistants that learn from an entire codebase. Augment Code is explicitly focused on multi-agent "coder/reviewer/tester" orchestration.
Future Directions:
While a softer domain, customer support offers a high volume of interactions that can be used to train adaptive agents. The key is translating qualitative feedback (e.g., customer satisfaction scores, resolution times) into a quantitative reward signal.
The goal is the evolving chatbot. A self-evolving agent can learn from each difficult case, updating its knowledge base or adjusting its dialogue strategy based on whether the customer was satisfied.
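A sketch of such reward shaping; the weights and the 24-hour normalization are illustrative assumptions, not an industry standard:

```python
# Sketch of turning qualitative support feedback into a scalar reward:
# blend CSAT (1-5) and resolution time into a single value in [0, 1].

def support_reward(csat: int, resolution_hours: float,
                   w_csat: float = 0.7, w_speed: float = 0.3) -> float:
    csat_term = (csat - 1) / 4                        # map 1-5 onto 0-1
    speed_term = max(0.0, 1 - resolution_hours / 24)  # faster is better
    return w_csat * csat_term + w_speed * speed_term
```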
Key Players: While many developments are internal, established players like Intercom and Zendesk are investing heavily in LLM-based support bots. Inflection AI's well-funded consumer agent Pi is also one to watch, as its emotionally intelligent conversational technology could be repurposed for enterprise customer support.
Future Directions:
This is the largest and most disruptive opportunity: giving "brains" to static Robotic Process Automation (RPA) bots. The goal is to move from scripting repetitive tasks to defining high-level goals and letting the agent figure out how to achieve them.
Traditional RPA breaks when a UI element changes. An agent built on a vision-language model can adapt to these changes on the fly. The holy grail is goal-driven automation.
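The contrast can be shown in a few lines; `vlm_locate` and `mouse_click` are hypothetical callables standing in for a vision-language model and an input driver:

```python
# Sketch of why VLM-driven automation survives UI changes where scripted
# RPA breaks: instead of hard-coded coordinates or a brittle selector, the
# agent asks a vision-language model to locate the element by description
# on every run.

def click_submit(screenshot: bytes, vlm_locate, mouse_click):
    # Brittle RPA equivalent: mouse_click(512, 740)  # breaks on redesign
    x, y = vlm_locate(screenshot, "the primary 'Submit' button")
    mouse_click(x, y)
```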
Key Players:
Future Directions:
Our hackathon surfaced an unusually candid snapshot of what real developers, researchers, and operators gravitate toward when building self-evolving agents. Each team had to decide how to implement the "outer loop" for its own idea, and the aggregate patterns provide field evidence for where the ecosystem is headed and where infrastructure demand is strongest.
Many teams gravitated toward tools that minimized infrastructure overhead and maximized speed to "agent behavior"; we saw this low-friction tooling rapidly mainstreaming.
These observations validate a central lesson: DX is no longer a nice-to-have—it is a gating factor for agent adoption. The winners in agent infrastructure will be those who reduce the friction between idea, functioning agent, and an evaluable, improvable loop.
Several conversations highlighted a truth that is underweighted in most VCs' mental models: self-evolving systems require a continual stream of diverse tasks, corrections, and examples—and communities generate that data.
This aligns with an overlooked structural reality: self-evolution thrives where user communities continuously create new trajectories, corrections, and environments.
Multiple builders expressed hesitation about using heavy, prescriptive frameworks.
This nuance strengthens our thesis that the "Agent OS" must remain flexible and low-imposition. The hackathon evidence supports a future where orchestration layers succeed only if they remain modular, interoperable, and minimally opinionated.
A theme raised consistently in conversations: sample efficiency is the central bottleneck for self-improvement today. As one participant put it: "We can make an agent self-improve, but not cheaply."
In other words: Outer loops require thousands of trajectories, but most teams (at our hackathon and elsewhere) don't have the budget for that. Better sample efficiency (via distillation, reward models, offline RL, or clever evaluation harnesses) is the real unlock.
In our view, another attribute of the infrastructure winner will be making self-evolution cheap.
Self-evolving agents are not a thought experiment anymore; they are a new computing stack coming into focus. The research frontier (Darwin Gödel Machine, Reflexion, STELLA, ACE) is converging with production primitives (agent OSs, MCP, vector memory, evaluators, learning environments) and real pull from domains like biomedicine, software, and business process automation. Our belief is simple: the next platform-defining companies will be the ones that turn the outer loop of learning into a product surface area, not an internal research project.
To this end, the Self-Evolving Agents Hackathon at AGI House is a controlled growth experiment. Our partners saw that stack recommendations (Weave, Weave evals, dashboards) drove user activation, while builders were incentivized to try new frameworks and tools–with many converting to long-term users. Especially under the self-evolving paradigm, hackathons are the proving ground where infra companies test activation, retention, and community-building strategies.
Our view is that we are still early in the self-evolution era, but not too early: the foundations exist (frontier models, MCP, RAG, agent OSs, observability), and the first real use cases are emerging in production. The opportunity now is to build the companies that treat learning as the core product feature—and in doing so, become the default way that software adapts to the world.
We're grateful to our sponsors—Weights & Biases, CoreWeave, and Firecrawl—for supporting the Self-Evolving Agents Hackathon and making this research possible. We'd also like to thank Sam Stowers from Weights & Biases for co-hosting the event and providing technical guidance to participants. Special thanks to Diya Sreedhar, Coco Xia, and Preethika Pavirala for their invaluable help in organizing and executing the event.
Our judges brought deep expertise across the AI stack: Siqi He (Founding Product Lead at Pika AI), Sam Stowers (AI Product Engineer at Weights & Biases), Nick Camara (Co-founder and CTO at Firecrawl), Oliver Wang (ML PhD at MIT Schwarzman College of Computing), Bohdan Kovalevskyi (Senior Engineer at OpenPipe), Angky William (Staff Engineer at CoreWeave), Jitesh Luthra (Head of Operations at Composio), Eric Ciarla (Co-founder and CMO at Firecrawl). We also want to give special thanks to Mark Mercer (Chief of Staff at Firecrawl) for coordinating and organizing our partnership. The event also featured insightful talks from Mike Merrill (AI NLP Postdoc at Stanford), Jason Qi (Research Scientist at Snorkel AI), and Bohdan Kovalevskyi (Senior Engineer at OpenPipe).
Finally, we extend our deepest gratitude to the entire AGI House community for creating a space where ambitious ideas meet execution. This hackathon brought together over 100 builders, researchers, and founders who dedicated their weekend to pushing the boundaries of what self-evolving agents can do. The energy, creativity, and technical depth demonstrated by every team—from early prototypes to production-ready systems—validated our belief that the best way to understand the future of AI is to build it together. Events like this don't just test technologies; they forge the communities and collaborations that will define the next era of intelligent systems.

Research Lead, Primary Contributor

Operations & Research, Contributor

Operations Lead, Contributor

Founder and CEO, Contributor