A Systematic Survey of Self-Evolving Agents: From Model-Centric to Environment-Driven Co-Evolution
Zhishang Xiang†, Chengyi Yang†, Zerui Chen, Zhimin Wei, Yunbo Tang, Zongpei Teng, Zexi Peng, Zongxia Li, Chengsong Huang, Yicheng He, Chang Yang, Xinrun Wang, Xiao Huang, Qinggang Zhang‡, Jinsong Su‡
Abstract—The rapid advancement of Large Language Models (LLMs) has empowered autonomous agents with advanced reasoning, planning, and tool-use capabilities. However, traditional agent systems rely heavily on human-guided training. Despite the reduced data volume in post-training stages, the increasing demand for high-quality supervision creates a critical scalability bottleneck. This dependency not only restricts scalability but also confines agents to the upper bound of human expertise, severely limiting their evolutionary potential. To overcome these limitations, research on autonomous agents is shifting towards the Self-Evolving Agents paradigm, in which agents autonomously coordinate their own improvement loops with minimal human supervision. To unify this rapidly expanding field, we propose a taxonomy that categorizes methods into three core components of an autonomous evolutionary cycle: (i) Model-Centric Self-Evolution, where agents enhance internal capabilities through inference scaling or parameter bootstrapping; (ii) Environment-Centric Self-Evolution, where agents achieve continual self-evolution by interacting with the environment to obtain external knowledge and experience-based feedback; and (iii) Model-Environment Co-Evolution, where agents and their environments jointly evolve through sustained interaction. We highlight Model-Environment Co-Evolution as a key emerging direction for Self-Evolving Agents, where environment definitions and their co-evolution with agents will become central challenges in future research. We provide a systematic technical foundation of Self-Evolving Agents, current implementations, and identify key technical challenges and promising research directions. All related resources, including research papers, open-source data, and projects, are collected for the community at https://github.com/XMUDeepLIT/Awesome-Self-Evolving-Agents.
Index Terms—Large Language Models, Agent, Self-Evolving
I. INTRODUCTION
Recent years have witnessed remarkable advancements in Large Language Models (LLMs) [1]–[8]. Driven by the scaling of massive datasets and model parameters, LLMs have demonstrated increasingly strong generalization and emergent reasoning capabilities across a wide range of tasks.
†Equal contribution.
‡Corresponding authors.
Zhishang Xiang, Chengyi Yang, Zerui Chen, Zhimin Wei, Yunbo Tang, Zongpei Teng, Zexi Peng and Jinsong Su are with Xiamen University, China (e-mail: {xiangzhishang, yangchengyi, chenzerui1, tangyunbo, tengzongpei, pengzexi}@stu.xmu.edu.cn, zhimin.wei@foxmail.com, jssu@xmu.edu.cn).
Qinggang Zhang, Chang Yang and Xiao Huang are with The Hong Kong Polytechnic University, Hong Kong SAR, China (e-mail: {qinggangg.zhang, chang.yang}@connect.polyu.hk, xiao.huang@polyu.edu.hk).
Zongxia Li is with the University of Maryland, USA (e-mail: zli12321@umd.edu); Chengsong Huang is with Washington University in St. Louis, USA (e-mail: chengsong@wustl.edu); Yicheng He is with the University of Illinois Urbana-Champaign, USA (e-mail: yh84@uiuc.edu).
Xinrun Wang is with the School of Computing and Information Systems, Singapore Management University, Singapore (e-mail: xrwang@smu.edu.sg).
As a result, they have evolved from passive text generators into autonomous agents capable of decomposing abstract goals, utilizing external tools, planning multi-step actions, and executing complex decision-making processes [9], [10]. These advances enable agents to operate in interactive environments, where they iteratively perceive, reason, and act to solve long-horizon tasks under real-world constraints [11]–[13].
Typically, the development of such agents largely follows a two-stage paradigm: Pre-Training to acquire broad world knowledge [2], [3], [5]–[8], [14], [15], followed by Post-Training to align models with specific agentic skills via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) [16]–[24]. However, post-training faces a critical bottleneck in high-quality annotated data [25], [26]. Specifically, SFT [27]–[30] follows an imitation learning paradigm that depends on high-quality human annotations, inherently limiting the agent’s performance to the quality of collected supervision. Similarly, while RL [31]–[35] aims to transcend this limit, it typically remains constrained by human-defined reward signals, which are often sparse, non-differentiable, and prone to reward hacking [36], [37]. Moreover, capabilities learned from fixed training datasets and predefined supervision signals struggle to generalize to out-of-distribution scenarios and evolving requirements in open-ended tasks over time. This limitation motivates a shift toward autonomous self-evolution. Accordingly, a critical research question emerges: how can we architect self-evolving agents that surpass the limits of static training data and human supervision?
To break these constraints, Self-Evolving Agents have emerged as a promising paradigm. Unlike traditional systems relying on periodic, human-curated updates, a self-evolving agent operates through a closed-loop mechanism: it proactively explores problem spaces, generates its own training signals from reasoning or interaction trajectories, and iteratively refines its policy [38], [39]. From a high-level perspective, this paradigm is characterized by two essential properties: (i) Strong autonomy with minimal human supervision, enabling agents to generate learning signals without relying on external supervision. (ii) Active exploration through interaction, where agents improve their policies either through internal model-driven optimization or through sustained interaction with the environment to discover diverse trajectories and feedback signals. This shift redefines the agent from a passive recipient of offline knowledge into an active participant in its own cognitive growth, supporting continual adaptation and lifelong learning in open-ended settings [40]–[42].

Fig. 1: The Development Trends of Self-Evolving Agents with Representative Works.
Critically, while early approaches relied solely on the model’s intrinsic capacity, we argue that the external environment is the key driver for breaking capability bounds. Similar to the rigor found in closed systems like AlphaGo [43], [44], continuous growth in open-ended worlds requires a well-designed environment that provides deterministic feedback and finite action spaces. Rather than serving as a static backdrop, the environment should be viewed as an optimizable partner that evolves in complexity alongside the agent, providing the structured feedback necessary for truly autonomous intelligence. This paper presents a systematic analysis of Self-Evolving Agents, structuring the discussion from foundational concepts to advanced evolutionary paradigms as follows:
• Section II establishes the formal notation by defining agents and environments, and outlines the core mechanisms of interaction and adaptation.
• Section III introduces Model-Centric Self-Evolution, analyzing how agents enhance intrinsic cognition through test-time scaling and parameter bootstrapping.
• Section IV introduces Environment-Centric Self-Evolution, illustrating how agents evolve by accumulating external knowledge, refining tools, and expanding their interaction with dynamic environments.
• Section V explores Model-Environment Co-Evolution, where agent and environmental evolution iteratively drive each other toward autonomous intelligence.
• Section VI presents applications of Self-Evolving Agents, with a particular focus on scenarios where agents evolve through interaction with their environments.
• Section VII discusses the development of Self-Evolving Agents, examining its current challenges and outlining promising future research directions.
II. PRELIMINARIES
To facilitate the subsequent descriptions and discussions of Self-Evolving Agents, we first formally define the agent, its core
components, and the environment in which it operates. Here, an agent refers to a decision-making entity equipped with Memory, Tool, and Interaction Interface Modules, while the environment denotes the external or shared system that provides observations, feedback, and interaction signals. These definitions establish the notations used in subsequent discussions on how agents achieve self-evolution through continual interaction with their environments.
A. AI Agents
We formally define an LLM-based agent as a 5-tuple:

$$\mathcal{A} = \langle \mathcal{L}_{\theta},\ \mathcal{M},\ \mathcal{T},\ \mathcal{I},\ \pi \rangle,$$

where $\mathcal{L}_{\theta}$ denotes the core LLM, $\mathcal{M}$ the memory, $\mathcal{T}$ the tool set, $\mathcal{I}$ the interaction interface, and $\pi$ the policy mapping states to actions. Each component represents a key part of the agent’s cognitive and interaction system, defined as follows:
- Core LLM $\mathcal{L}_{\theta}$: It serves as the cognitive brain of the agent, typically an LLM parameterized by θ. It performs the core functions of semantic understanding, logical reasoning, and instruction generation. The core LLM receives the processed contextual information and utilizes its internal parametric knowledge to generate natural language Chains-of-Thought (CoT) or structured action instructions.
- Memory $\mathcal{M}$: To address the context window limitations and the stateless nature of the core LLM, this module stores the agent’s historical interaction sequences, domain-specific knowledge bases, and experience distilled from past interactions. Typically, $\mathcal{M}$ comprises a short-term memory, used to maintain context coherence for the current task, and a long-term memory, used to persistently store cross-task experience.
- Tool $\mathcal{T}$: A collection of functional APIs or external services, specifically designed to extend the agent’s capability boundaries by enabling access to real-time information, precise calculations, and physical-world manipulation.



Fig. 2: The proposed unified taxonomy for Self-Evolving Agents categorizes evolutionary methodologies into three paradigms based on their interaction mechanisms and evolutionary targets. (i) Model-Centric Self-Evolution enhances intrinsic cognitive capabilities by scaling computation mechanisms during inference or parameter updates, categorized into Inference-Based Evolution and Training-Based Evolution. (ii) Environment-Centric Self-Evolution focuses on expanding extrinsic support systems through continuous interaction with the external environment, categorized into Static Knowledge Evolution, Dynamic Experience Evolution, Modular Architecture Evolution, and Agentic Topology Evolution. (iii) Model-Environment Co-Evolution treats the environment as an evolving entity rather than a static backdrop, enabling the agent and the environment to adapt and co-evolve jointly toward open-ended growth, categorized into Multi-Agent Policy Co-Evolution and Environment Training.
- Interaction Interface $\mathcal{I}$: It serves as the channel for information exchange between agents and the external environment, primarily performing three key functions:
• Perception: Responsible for transforming multi-modal raw signals returned by the environment (e.g., visual images, code execution logs, HTML structures) into text embeddings or natural language descriptions comprehensible to the core LLM. It acts as the window through which the agent acquires the environmental state.
• Action Execution: Responsible for converting abstract decisions generated by the Core LLM into concrete operational instructions acceptable to the environment (e.g., keyboard inputs, API calls). Through these actions, the agent directly changes the state of the environment.
• Evaluation & Feedback: Responsible for capturing feedback signals regarding action outcomes from the environment or human supervisors. These feedback signals can be Scalar Rewards, Textual Critiques, or execution status codes. It implements an action–feedback closed loop, providing optimization signals that support the agent’s self-reflection and subsequent evolution.
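To ground this notation, the following minimal Python sketch (our illustration, not drawn from any cited implementation) encodes the agent tuple; the `llm` callable, the five-turn memory window, and the prompt format are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Agent:
    """Minimal encoding of the agent tuple <L_theta, M, T, I, pi>."""
    llm: Callable[[str], str]                    # core LLM L_theta: prompt -> text
    memory: list = field(default_factory=list)   # M: interaction history / experience
    tools: dict = field(default_factory=dict)    # T: tool name -> callable API

    def perceive(self, raw_observation: Any) -> str:
        # Interaction interface I (perception): map raw signals to text.
        return str(raw_observation)

    def act(self, state: str) -> str:
        # Policy pi: condition the core LLM on recent memory and the state.
        context = "\n".join(self.memory[-5:])    # crude short-term window
        action = self.llm(f"{context}\nState: {state}\nAction:")
        self.memory.append(f"State: {state} -> Action: {action}")
        return action
```

Here `llm` can be any prompt-to-text function, e.g. a thin wrapper around a locally hosted model.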
B. Environment
Distinct from the opaque “black-box” environments in traditional Reinforcement Learning (RL), the environment in self-evolving systems serves not only as the setting for
task execution but also as a source of feedback that drives continual improvement. We characterize it as an external system equipped with verification capabilities. Formally, we define the environment as a 2-tuple:

$$\mathcal{E} = \langle \mathcal{S},\ \mathcal{V} \rangle,$$

whose components are defined as follows:
- State Space $\mathcal{S}$: This represents the objective reality in which the agent operates. It includes the task context, initial problem conditions, and external knowledge bases. $\mathcal{S}$ provides the foundation for the agent’s observations. For self-evolving agents, the state is not limited to current task variables, but also includes external knowledge that can be accessed through interaction. When the agent performs retrieval or exploration actions, the environment returns relevant information, which then becomes a part of the agent’s working context.
- Verification & Feedback Mechanism $\mathcal{V}$: As an external verifier, this mechanism constitutes the core characteristic of the environment. When the agent executes an action through the interface, the verifier evaluates its validity, accuracy, or quality according to objective rules or predefined evaluation criteria, and returns a corresponding feedback signal. This feedback is objective and deterministic. Typical examples include error messages from a code interpreter, standard solutions to mathematical problems, or pass/fail signals from unit tests. Such signals precisely identify incorrect behaviors or outcomes, allowing the agent to localize errors reliably.
Beyond simple scalar rewards, the feedback is often semantically rich. It may take the form of textual feedback, structured execution results, or detailed error traces.

Fig. 3: A Comprehensive Taxonomy of Self-Evolving Techniques for LLM Agents.
These signals supply the concrete supervision required for the agent’s self-correction and subsequent parameter updates, enabling effective and fine-grained learning during self-evolution.
C. Agent-Environment Interaction
We model the interaction process between the agent and the environment as a standard Markov Decision Process (MDP). This abstracted model intuitively describes how the agent takes actions based on the current state and receives environmental feedback at successive time steps t. Formally, the interaction trajectory is represented as the sequence:

$$\tau = (s_0, a_0, f_0,\ s_1, a_1, f_1,\ \dots,\ s_T),$$

where $s_t$ denotes the environmental state, $a_t$ the agent’s action, and $f_t$ the environmental feedback at step $t$.
- State Perception & Decision: The agent receives the current environmental state $s_t$, which encompasses the context, dialogue history, or problem description. Utilizing the core LLM $\mathcal{L}_{\theta}$, the agent generates an action $a_t$ based on the current state. This decision process follows the policy function $\pi(a_t \mid s_t)$.
- Execution & Feedback: At each step, the agent executes an action $a_t$ in the environment $\mathcal{E}$. The environment then moves to a new state $s_{t+1}$ according to the transition probability $P(s_{t+1} \mid s_t, a_t)$ and returns feedback $f_t$. For self-evolving agents, this feedback plays two main roles: (i) State as Knowledge. In knowledge acquisition tasks such as open-ended question answering or retrieval, the new state mainly reflects newly obtained information. The updated state contains additional context that supplements the model’s internal knowledge and supports subsequent reasoning. (ii) Feedback as Evaluation. In optimization tasks such as code generation or logical reasoning, the environment provides explicit evaluation signals $f_t$, such as error messages, test results, or scalar scores. These signals reveal weaknesses in the current policy and guide the agent to revise or improve its policy $\pi$.
In more complex settings, the environment can also evolve through the agent’s actions. These changes persist and affect future states, gradually reshaping the conditions under which learning occurs. As a result, evolution emerges not only from policy updates within the agent, but also through the continual transformation of the environment.

Fig. 4: Comparison between Synthesis-Driven Offline Evolution and Exploration-Driven Online Evolution.
III. MODEL-CENTRIC SELF-EVOLUTION
Model-Centric Self-Evolution refers to a paradigm in which an agent autonomously improves its capabilities by mining and internalizing its own knowledge, without relying on external human supervision. Central to this paradigm is the principle of computation to intelligence: by scaling computation either during inference-time search or through iterative parameter updates, the agent unlocks reasoning potential that is already encoded but remains underutilized in its pre-trained weights. We divide this paradigm into two streams based on whether the evolution occurs only during inference or leads to lasting capability improvement. Inference-Based Evolution focuses on Test-Time Scaling, where additional computational resources are used within a single inference process to improve reasoning performance. In contrast, Training-Based Evolution aims at long-term capability growth by enabling agents to generate training signals through offline synthesis or online interaction, supporting continuous parameter updates and sustained improvement.
A. Inference-Based Evolution
This paradigm focuses on enhancing accuracy within a single inference episode by leveraging test-time scaling [199]–[203]. Instead of updating model parameters, it improves performance through deeper internal deliberation, effectively trading additional computation for more reliable outcomes [204], [205]. Representative approaches typically fall into three categories: Parallel Sampling, Sequential Self-Correction, and Structured Reasoning. Despite their methodological differences, these approaches all aim to improve reasoning performance through enhanced test-time computation.
- Parallel Sampling: This paradigm leverages parallel computing to broaden solution coverage, utilizing multiple reasoning paths to mitigate the local optima often encountered in single-pass inference. Building on the Self-Consistency mechanism [45], [206], studies confirm that scaling the sampling budget enables smaller models to surpass larger counterparts through Best-of-N strategies [48], [207]. However, to ensure compute efficiency, recent research emphasizes optimizing the trade-off between model scale and search depth [199], and also highlights the importance of refining search granularity by shifting from token-level exploration to natural language planning in order to enhance diversity [53]. Furthermore, effective aggregation is critical; advanced approaches move beyond simple majority voting by employing pairwise ranking [47], measuring inconsistency for hallucination detection [46], or utilizing synthetic model crowds for comparative reasoning [208].
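As a concrete illustration of parallel sampling, the sketch below implements plain Best-of-N with majority voting in the spirit of Self-Consistency [45]; the `sample` and `extract_answer` callables are assumed stubs rather than a published interface.

```python
from collections import Counter

def self_consistency(sample, extract_answer, prompt: str, n: int = 16):
    """Best-of-N via majority vote over final answers (Self-Consistency style).

    sample:         prompt -> one stochastically decoded reasoning path (assumed).
    extract_answer: reasoning text -> canonical final answer (assumed parser).
    """
    paths = [sample(prompt) for _ in range(n)]         # independent reasoning paths
    votes = Counter(extract_answer(p) for p in paths)  # aggregate final answers
    answer, count = votes.most_common(1)[0]
    return answer, count / n                           # vote share as crude confidence
```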
- Sequential Self-Correction: This paradigm leverages iterative computation to refine reasoning. Basic mechanisms alternate between generation, feedback, and revision [50], [51], [54]. To support long-horizon learning, Reflexion [49] introduces verbal reinforcement learning, utilizing episodic memory of failures to guide future attempts. Addressing internal knowledge gaps, CRITIC [52] incorporates external tools for fact-based verification. Recent research further structures reasoning for complex slow thinking, applying evolutionary algorithms to thoughts [56] or enabling non-linear backtracking [55]. Methods such as Planning Tokens [53], [209] enhance reasoning coherence by enforcing high-level planning constraints through latent variables, thereby guiding intermediate steps toward more globally consistent solutions.
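A minimal generate-critique-revise sketch of this pattern, loosely following the Reflexion-style verbal feedback loop [49]; `llm` and `critic` are assumed text-to-text stubs, and the "OK" stopping convention is ours:

```python
def self_correct(llm, critic, prompt: str, max_rounds: int = 3) -> str:
    """Generate -> critique -> revise loop with verbal feedback.

    `llm` and `critic` are assumed text -> text callables; the critic is
    expected to return "OK" when no further revision is needed.
    """
    draft = llm(prompt)
    reflections = []                                   # episodic memory of critiques
    for _ in range(max_rounds):
        critique = critic(f"Task: {prompt}\nAnswer: {draft}")
        if critique.strip() == "OK":
            break                                      # verifier is satisfied
        reflections.append(critique)                   # keep failures as guidance
        draft = llm(f"{prompt}\nPast critiques: {reflections}\nRevised answer:")
    return draft
```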
- Structured Reasoning: This paradigm models reasoning trajectories as structured processes, enabling systematic exploration through explicit search and verification mechanisms [210]–[213].
TABLE I: Representative Methods and Key Characteristics of Training-based Evolution Methods.
| Method | Domain | Core Mechanism | Feedback Source | Feedback Type | Learning Alg. | Initialization | Release Time |
| Synthesis-Driven Offline Self-Evolving |||||||
| SELF-INSTRUCT [66] | General | Gen & Filter | Self-Model | Textual Critique | SFT | Seed Tasks | Dec-2022 |
| SELF-GUIDE [74] | General | Gen & Filter | Self-Model | Binary Reward | SFT | Seed Prompts | Aug-2024 |
| SEAL [76] | General | Self-Edit & SFT | Self-Model | Binary Reward | ReSTEM | SFT Model | Sep-2025 |
| SPIN [69] | General | Self-Play Mechanism | Self-Model | Binary Reward | DPO | SFT Model | Jan-2024 |
| SPPO [72] | Alignment | Self-Play & Nash | Self-Model | Win Rate | SPPO | SFT Model | May-2024 |
| STaR [67] | Reasoning | Self-Gen & Filter | Self-Model | Binary Reward | SFT | Seed Prompts | Mar-2022 |
| LMSI [68] | Reasoning | Self-Gen & Filter | Self-Model | Binary Reward | SFT | Seed Prompts | Oct-2022 |
| ReST-MCTS* [70] | Reasoning | MCTS + PRM | Self-Model | Scalar Reward | SFT & RL | Seed Prompts | Oct-2024 |
| SOAR [215] | Code | Sample & Refine | Environment | Binary Reward | SFT | SFT Model | Jul-2025 |
| SIRIUS [75] | Multi-Agent | Bootstrapping & Aug. | Outcome Reward | Binary Reward | SFT | Seed Dataset | Feb-2025 |
| RAGEN [73] | Agents | StarPO & StarPO-S | Environment | Trajectory Reward | RL | Instruct Model | May-2025 |
| SAMULE [216] | Agents | Multi-Level Reflect | Self-Model | Textual Critique | SFT | Seed Prompts | Feb-2025 |
| Exploration-Driven Online Self-Evolving | |||||||
| R-Zero [38] | Reasoning | Co-evolution & GRPO | Self-Model | Binary Reward | GRPO | Base LLM | Aug-2025 |
| Absolute Zero [77] | Reasoning | Self-Play | Code Executor | Binary Reward | TRR++ | Zero Triplet | May-2025 |
| LSP [90] | General | Challenger & Solver | Self-Model | Scalar Reward | GRPO | Pretrained Model | Sep-2025 |
| Socratic-Zero [96] | Math | Co-evolution | Fixed Teacher | Binary Reward | DPO & WSFT | Seed Prompts | Sep-2025 |
| Agent0 [95] | Agents | Co-evolution & Tool-use | Self-Model & Environment | Scalar Reward | GRPO & ADPO | Base LLM | Nov-2025 |
| SeRL [86] | Reasoning | Self-Gen & Filter | Self-Model | Binary Reward | Reinforce++ | Seed Prompts | May-2025 |
| Self-Challenging [81] | Agents | Code-as-Task | Environment | Binary Reward | RL | Seed Prompts | Jun-2025 |
| CURE [217] | Code | Self-Play & Co-evolve | Self-Model | Binary Reward | GRPO | Instruct Model | Jun-2025 |
| SPICE [91] | Reasoning | Self-Play & Grounding | Self-Model & Corpus | Binary & Variance | DrGRPO | Pretrained LLM | Oct-2025 |
| WebRL [40] | Agents | Self-Evolving & Curriculum | Self-Model | Binary Reward | RL | SFT Model | Nov-2024 |
| SPIRAL [80] | Strategic Games | Multi-Agent Self-Play | Environment | Outcome Reward | RL | Base LLM | Jun-2025 |
| LADDER [88] | Math | Recursive & Decomposition | Self-Model | Binary Reward | GRPO | Seed Prompts | Mar-2025 |
| R-FEW [92] | Reasoning | Self-Play & Grounding | Self-Model & Human | Scalar Reward | GRPO | Seed Prompts | Dec-2025 |
| SPC [84] | Reasoning | Adversarial Self-Play | Self-Model | Textual Critique | SFT & RL | SFT Model | May-2025 |
| TTCS [39] | Reasoning | Test Time Self-Play | Self-Model | Binary Reward | GRPO | Base LLM | Feb-2026 |
Tree-based approaches decompose reasoning, ranging from heuristic search in Tree of Thoughts [57] to incorporating environmental feedback via MCTS in LATS [61] and learned value functions in TS-LLM [62]. To support the merging of converging thoughts, Graph of Thoughts [58] generalizes trees to arbitrary graphs, while Planner-Centric frameworks [214] employ global Directed Acyclic Graphs (DAGs) for complex tool dependencies. Finally, methods like Think-on-Graph [59], [63] and ROG [60] align reasoning with external KGs, treating the LLM as a graph traversal agent to ensure factual grounding.
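A breadth-limited tree search distills the core of these approaches. The sketch below is ToT-flavored [57] rather than a faithful reproduction; `propose` and `score` stand in for LLM-backed thought generation and state evaluation:

```python
import heapq

def tree_of_thoughts(propose, score, width: int = 3, depth: int = 3):
    """Breadth-limited search over partial reasoning states.

    propose: list-of-thoughts -> candidate next thoughts (assumed LLM-backed).
    score:   list-of-thoughts -> float value estimate (assumed LLM or verifier).
    """
    frontier = [[]]                                    # root: the empty thought list
    for _ in range(depth):
        candidates = [s + [t] for s in frontier for t in propose(s)]
        if not candidates:
            break                                      # no expansions proposed
        frontier = heapq.nlargest(width, candidates, key=score)
    return max(frontier, key=score)                    # best trajectory found
```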
B. Training-Based Evolution
Unlike transient Inference-Based Evolution, this paradigm achieves permanent capability internalization through parameter updates. In this paradigm, the model acts as proposer, solver, and evaluator to generate high-quality synthetic data for SFT or RL, effectively bypassing the need for human annotation. Through iterative evolving cycles, the model continuously reviews its world knowledge and refines its distribution, directly strengthening its overall reasoning and planning ability [79], [93], [94], [218]. As illustrated in Figure 4, existing studies can be broadly categorized into Synthesis-Driven Offline Self-Evolving and Exploration-Driven Online Self-Evolving [219], [220], corresponding to improvements derived from offline synthetic data generation and those emerging from online interaction-based exploration.
- Synthesis-Driven Offline Self-Evolving: This paradigm leverages the model’s generative capabilities to construct high-quality synthetic datasets, explicitly consolidating implicit knowledge into parameters via Supervised Fine-Tuning (SFT). The core motivation is to address the scarcity of human data by enabling the model to act as its own teacher through a “Bootstrapping” mechanism [71], [85], [89], [221], [222].
Initial efforts focus on instruction following: SELF-INSTRUCT [66] pioneered the generation pipeline, while SELF-GUIDE [74], SEAL [76], and SELF [65] refine this via multi-stage loops, RL-based self-edits, or natural language feedback. Evolution then shifts to preference optimization via self-play, where SPIN [69] approximates target distributions and SPPO [72] models alignment as a zero-sum game seeking Nash equilibrium. For complex reasoning, STaR [67] and LMSI [68] utilize iterative self-consistency, while ReST-MCTS* [70] and SOAR [215] integrate search algorithms to generate step-level value labels or employ Hindsight Relabeling on failed trajectories. In multi-turn scenarios, SIRIUS [75] and RAGEN [73] address attribution errors and policy collapse by converting failures into successful trajectories or optimizing observation-action units, supported by reflective mechanisms for error extraction [216].
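A single bootstrapping round of this kind can be sketched as generate, filter by correctness, then fine-tune, in the spirit of STaR [67]; the `model`, `check_answer`, and `finetune` signatures are assumptions for illustration:

```python
def star_iteration(model, finetune, problems, check_answer, k: int = 8):
    """One bootstrapping round: self-generate, filter, fine-tune.

    model:        problem -> list of (rationale, answer) samples (assumed).
    check_answer: (problem, answer) -> bool, e.g. match against a gold label.
    finetune:     (model, dataset) -> updated model (assumed trainer wrapper).
    """
    keep = []
    for problem in problems:
        for rationale, answer in model(problem)[:k]:
            if check_answer(problem, answer):          # keep only verified traces
                keep.append((problem, rationale, answer))
                break                                  # one correct trace suffices
    return finetune(model, keep)                       # internalize via SFT
```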
- Exploration-Driven Online Self-Evolving: Distinct from offline distillation based on static synthetic corpora, this paradigm transforms the model from a passive learner into an active explorer. It builds a dynamic online reinforcement learning loop, where the model discovers new strategies through real-time trial-and-error interaction with itself via self-play or with the external environment [39], [78], [83], [223]–[225]. By continually exploring beyond existing data distributions, the agent can progressively improve its policies, while addressing the stability challenges inherent in autonomous exploration.
To address cold-start issues, approaches like R-Zero [38] and Absolute Zero [77] split models into Challenger and Solver roles, while LSP [90] and Self-Questioning LM [82] introduce asymmetric gaming to drive evolution. Advanced frameworks such as SPIRAL [80], Socratic-Zero [96], and Agent0 [95] extend this to multi-turn zero-sum games or co-evolutionary systems, with SeRL [86] bootstrapping training loops from minimal initial samples.

Fig. 5: Comparison between Static Knowledge Evolution and Dynamic Experience Evolution.
To prevent hallucination, evolution is grounded in interactions with external environments [226]–[229]. Self-Challenging [81] and CURE [217] leverage code execution, SPICE [91] and SPELL [97] utilize document corpora to introduce information asymmetry, and WebRL [40] applies online curricula to web tasks. To ensure training stability, LADDER [88] employs recursive decomposition, while other methods adopt few-shot anchoring [92] or explicit knowledge retrieval [87] to guide exploration. SPC [84] further refines process supervision through adversarial critic games.
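The Challenger/Solver pattern common to several of these systems can be sketched as follows. The reward scheme shown (rewarding the challenger for solver failures) is a deliberate simplification; published methods typically target intermediate solve rates or uncertainty bands:

```python
def challenger_solver_round(challenger, solver, verify, update, n_tasks: int = 32):
    """One round of a Challenger/Solver co-evolution loop.

    challenger, solver: text -> text callables (assumed LLM policies).
    verify: (task, solution) -> bool, an external or self-consistency check.
    update: (policy, [(prompt, reward)]) -> policy (assumed RL step, e.g. GRPO).
    """
    tasks = [challenger("Propose a task at the solver's frontier.")
             for _ in range(n_tasks)]
    solver_batch, challenger_batch = [], []
    for task in tasks:
        solved = verify(task, solver(task))
        solver_batch.append((task, 1.0 if solved else 0.0))
        # Simplification: reward the challenger when the solver misses;
        # real methods instead reward tasks of intermediate difficulty.
        challenger_batch.append((task, 0.0 if solved else 1.0))
    return update(challenger, challenger_batch), update(solver, solver_batch)
```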
IV. ENVIRONMENT-CENTRIC SELF-EVOLUTION
This paradigm investigates how agents achieve self-evolution through continuous interaction with the external world. In such settings, the environment provides not only static knowledge that expands the agent’s information horizon, but also dynamic feedback signals that can be summarized into reusable experience for behavioral refinement. We organize this paradigm into four complementary directions. Static Knowledge Evolution expands the agent’s knowledge base through interaction with external information sources. Dynamic Experience Evolution enhances behavior by accumulating and refining experience from environmental feedback. Modular Architecture Evolution improves long-horizon adaptation by optimizing the modules that mediate agent–environment interaction. Agentic Topology Evolution explores how multi-agent interaction structures evolve to shape collective behavior.
A. Static Knowledge Evolution
This paradigm emphasizes the agent’s capability to recognize the boundaries of its internal knowledge and proactively retrieve relevant information from the environment, thereby bridging the gap between static parametric knowledge and the evolving external world [230], [231]. In response to the temporal staleness of training data and the limited availability of domain-specific knowledge, the model shifts from a passive responder to an active knowledge seeker [232], [233]. Broadly, existing approaches can be divided into Agentic Retrieval-Augmented Generation, which centers on autonomous retrieval for task solving, and Reasoning-Driven Deep Research, which
extends retrieval into sustained evidence synthesis and long-form knowledge construction.
- Agentic Retrieval-Augmented Generation: Unlike traditional Retrieval-Augmented Generation (RAG), where models are passive information receivers, Agentic RAG transforms retrieval into an autonomous cognitive process. Early reflective retrieval approaches [98], [234] introduce self-assessment mechanisms to verify retrieval necessity and relevance. Progressing to reasoning integration, methods like Search-o1 [101] embed search directly into Chain-of-Thought for denoising, while RL-driven frameworks such as Search-R1 [99], [102], [235] cultivate an intrinsic instinct to proactively identify knowledge gaps and initiate queries. To realize this capability, WebCPM [103] and DSPy [100] reframe retrieval as programmable workflows by defining atomic human-like actions (e.g., clicking, quoting) and modular self-optimizing pipelines, respectively. This structural foundation facilitates DeepSearch, orchestrating parallel retrieval via DAGs [107], [236] or decoupling search tools for flexible long-chain reasoning [237], thereby enabling dynamic knowledge construction beyond single-step queries.
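The decision of when to retrieve can be reduced to a simple action-emission loop. The SEARCH[...] syntax and `search` callable below are illustrative assumptions, not the interface of any cited system:

```python
def agentic_rag(llm, search, question: str, max_queries: int = 3) -> str:
    """Retrieval as an autonomous action: the model decides when to search."""
    evidence = ""
    for _ in range(max_queries):
        out = llm(f"Question: {question}\nEvidence: {evidence}\n"
                  "Answer directly, or emit SEARCH[query] if evidence is missing:")
        if out.startswith("SEARCH[") and out.endswith("]"):
            query = out[len("SEARCH["):-1]
            evidence += "\n" + search(query)   # retrieved text joins the state
        else:
            return out                         # knowledge gap closed
    return llm(f"Question: {question}\nEvidence: {evidence}\nFinal answer:")
```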
- Reasoning-Driven Deep Research: This paradigm transcends single-query retrieval, focusing on autonomously browsing and synthesizing evidence into structured, research-grade reports [108], [109], [238]–[241]. Unlike Agentic RAG, it prioritizes the generation of coherent long-form content, shifting agents from search tools to reliable researchers [104]. The core challenge lies in training optimal search strategies via Reinforcement Learning (RL). Recent advancements optimize this process either by integrating structured priors such as Knowledge Graphs and MCTS [113], [114], [242], or by leveraging end-to-end minimalist paradigms and data synthesis pipelines to achieve breakthroughs without dense supervision [105], [111], [112]. Beyond training, architectures like WebWeaver [115] and WebThinker [106] introduce dynamic planning mechanisms to bridge the execution gap. In addition, specialized frameworks extend this paradigm to specific domains, such as SurveyX [110] for academic literature and FINSIGHT [243] for financial analysis.
B. Dynamic Experience Evolution
As illustrated in Figure 5, distinct from Static Knowledge Evolution, which primarily focuses on expanding external knowledge resources, this paradigm centers on enhancing an agent’s decision-making capability through accumulated interaction experience [125], [142], [244].
TABLE II: Expanded Overview of Dynamic Experience Evolution.
| Method | Exp. Form | Source | Feedback Signal | Memory Form | Management Strategy | Task | Key Objective |
| Offline Experience Compilation | |||||||
| AgentRR [246] | Trajectory | Interaction | Success Rate | Graph | Retrieval & Ranking | Web | Reliability & Efficiency |
| AWM [116] | Workflows | Interaction | Binary | Textual | Induction & Integration | Web | Reusability |
| SkillWeaver [118] | Code | Interaction | Reward | Skill Library | API Selection | Web | Reusability |
| Agent KB [117] | Trajectory | Cross-Framework | Execution Gain | Knowledge Base | Reason-Retrieve-Refine | General & Code | Reusability |
| CoPS [119] | Trajectory | Cross-Task | Reward | Buffer | Pessimistic Retrieval | Embodied & Web | Generalization |
| Online Experience Adaptation | |||||||
| Dyn. Cheatsheet [122] | Strategies | Interaction | Self-Reflection | Textual | Retrieval & Synthesis | Math | Test-time Learning |
| Memento [123] | Trajectory | Interaction | Reward | Case Bank | Soft Q-Learning | Unified Frameworks | Continual Adaptation |
| GEPA [120] | Rule | Self-Exploration | Reflection | List | Pareto Selection | General | Sample Efficiency |
| ACE [121] | Bullet point | Environment | Execution Traces | Context Playbook | Grow-and-Refine | Agent & Finance | Robustness |
| Lifelong Experience Evolution | |||||||
| ReasoningBank [124] | Trajectory | Self-Exploration | Binary | Vector | Retrieval & Consolidation | Web & Code | Reusability |
| EVOLVER [133] | Principles | Interaction | Binary | Experience Base | Curation & RL | General | Experience Lifecycle |
| FLEX [130] | Rules | Self-Exploration | Semantic Feedback | Hierarchical Library | Selective Merge | Math & Science | Scalable Evolution |
| SAGE [135] | Code | Interaction | Reward | Skill Library | RL | Tool-use | Efficiency |
| ASI [126] | Code | Interaction | Execution Result | Skill Library | Induction & Verification | Web | Efficiency |
| AccelOpt [248] | Code | Self-Exploration | Execution Result | List | Threshold-based Curation | Coding | Efficiency |
| Early Experience [127] | Trajectory | Interaction | Future States | Parametric Weights | Fine-tuning | Multi-Domain | Scalability |
| SPIRAL [80] | Trajectory | Self-Play | Reward | Parametric Weights | RL | Games | Generalization |
| SWE-Exp [125] | Trajectory | Interaction | Success/Failure Signal | Experience Bank | Distillation & Retrieval | Coding | Reusability |
| ReMe [128] | Trajectory | Interaction | Utility Feedback | Procedural Memory | Dynamic Refinement | Multi-Domain | Procedural Memory |
| ArcMemo [131] | Trajectory | Self-Exploration | Task Success | Concept-level Memory | Retrieve & Compose & Adapt | General | Generalization |
| AgentEvolver [136] | Principles | Self-Exploration | Attribution Reward | Experience Memory | RL | Multi-Domain | Scalable Evolution |
| MemGen [129] | Latent Vector | Interaction | Reward | Parametric Weights | Generative Fusion | Embodied | Generalization |
| LatentEvolve [134] | Latent Vector | Interaction | Self-Reward | Episodic Buffer | Retrieval & Distillation | Multi-Domain | Latent Evolution |
Through continual interaction with the environment, agents generate rich trajectories encompassing both successful behaviors and failure feedback [245]. Rather than functioning as mere records, these trajectories provide informative signals for refining behavioral policies and improving future performance. Based on the lifecycle of experience utilization, existing studies can be broadly categorized into Offline Experience Compilation, Online Experience Adaptation, and Lifelong Experience Evolution.
- Offline Experience Compilation: The core objective of this paradigm is to construct a static experience repository from historical data to address the cold-start problem [128]. Early approaches focus on direct reuse, ranging from simple record-and-replay [246] to generating composable API code [118]. Notably, Agent Workflow Memory [116] advances this by mining frequent subroutines into universal workflows. To enhance generalizability, recent research shifts toward experience abstraction, facilitating cross-domain transfer [117] or organizing experiences into structured graphs with theoretical validity guarantees [119], [131], [247].
- Online Experience Adaptation: Distinct from static offline repositories, this paradigm constructs query-conditioned experience states that are aligned with the immediate problem context [249]. To improve inference efficiency, Dynamic Cheatsheet [122] proposes a lightweight test-time learning approach that extracts key rules and code snippets into a structured representation. Building upon episodic memory, Memento [123] leverages analogy-based reasoning to construct and refine an external repository of past successes and failures. Extending experience evolution to the instruction level, GEPA [120] iteratively refines system prompts based on historical feedback.
Along a broader direction, Agentic Context Engineering [121] treats the context environment as a dynamic artifact and exploits meta-learning signals to construct optimized initial context structures.
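A minimal test-time adaptation loop in the spirit of Dynamic Cheatsheet [122] might look as follows, assuming an LLM-backed `reflect` critic and a plain list as the persistent cheatsheet:

```python
def solve_with_cheatsheet(llm, reflect, task: str, cheatsheet: list) -> str:
    """Test-time experience adaptation sketch.

    cheatsheet: distilled strategy strings carried across queries.
    reflect:    (task, answer) -> new strategy string, or "" (assumed critic).
    """
    answer = llm("Strategies:\n" + "\n".join(cheatsheet) + f"\nTask: {task}")
    lesson = reflect(task, answer)
    if lesson:
        cheatsheet.append(lesson)              # experience persists across tasks
        del cheatsheet[:-50]                   # bounded memory: keep newest entries
    return answer
```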
- Lifelong Experience Evolution: In this paradigm, the experience repository is persistently updated and refined over time, with accumulated experience either maintained in a non-parametric form or incorporated into the model parameters, thereby enabling continual learning beyond short-term inference [136], [250]–[254]. Early efforts focus on scaling symbolic trajectories, with ReasoningBank [124] utilizing MATTS to refine cases through memory-aware scaling. To handle complex variations, iterative feedback-driven refinement frameworks have emerged. These frameworks distill generalizable strategies via offline reflection [133], summarize semantic insights from group responses [132], or utilize gradient-free paradigms to structure open-world trials [130]. Bridging imitation and RL, Early Experience [127] further leverages exploratory interactions to generate natural language supervision without external rewards. As complexity rises, experience representation shifts from static text to executable programs. ASI [126] induces low-level operations into verifiable Python skills for cross-platform transfer. Skill Library [135] and AccelOpt [248] encapsulate discovered skills into dynamic libraries or construct optimization memories for code transformations, effectively mimicking human iterative design.
Recent frontier research moves beyond explicit symbols to emulate implicit learning. MemGen [129] utilizes a memory weaver to inject machine-native latent token sequences directly into inference. Inspired by the brain’s complementary learning systems, LatentEvolve [134] introduces a Day-Night mechanism, which enables the evolution process to perform both fast thinking and slow consolidation in latent space.
C. Modular Architecture Evolution
Beyond Static Knowledge Evolution and Dynamic Experience Evolution, which focus on the content of environmental interaction, this paradigm centers on optimizing the structural modules that mediate agent–environment interactions. Its objective is to refine key components such as the Memory Module, Tool Module, and Interaction Interface, thereby enhancing both interaction efficiency and adaptability [143], [148], [149], [255], [256]. Existing research spans three directions: Interaction Protocol Evolution for interaction protocols, Memory Architecture Evolution for adaptive memory architectures, and Tool-Augmented Evolution for tool integration mechanisms.
- Interaction Protocol Evolution: This paradigm addresses the challenge of managing limited context in long-horizon tasks. Instead of passively accumulating raw interaction history, it emphasizes distilling trajectories into concise summaries and salient cues, while regulating what information should be retained or exposed at different stages of reasoning.
A key direction involves cognitive abstraction protocols. Think-in-Memory [139] stores inductive thoughts to avoid contradictions, while Memory-of-Thought [141] filters high-confidence reasoning paths. Inspired by fuzzy trace theory, ReadAgent [140] compresses text into global “Gist Memory”. Further optimizations decouple knowledge synthesis from real-time execution or treat memory as an active Deep Research process for Just-In-Time reconstruction [144], [145].
Research also adopts dynamic management protocols akin to operating systems. MemGPT [138] pioneers virtual memory paging to simulate infinite context, while MemoryBank [137] incorporates the “Ebbinghaus Forgetting Curve”. To handle saturation and data fusion, recent frameworks introduce proactive folding mechanisms for incremental compression or construct hierarchical architectures to intelligently route queries between heterogeneous sources [146], [257].
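A context-paging step of the kind pioneered by MemGPT [138] can be sketched as follows; the character-based budget and four-turn verbatim tail are simplifying assumptions:

```python
def fold_context(llm, messages: list, budget: int = 4000) -> list:
    """Context-paging sketch: spill old turns into a recalled summary.

    When the running transcript exceeds a budget (approximated here by
    characters), older turns are compressed into a single "gist" message.
    """
    size = sum(len(m) for m in messages)
    if size <= budget:
        return messages                        # everything fits in main context
    head, tail = messages[:-4], messages[-4:]  # keep the latest turns verbatim
    gist = llm("Summarize for later recall:\n" + "\n".join(head))
    return [f"[recalled memory] {gist}"] + tail
```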
- Memory Architecture Evolution: This paradigm redefines memory as an adaptive system that actively organizes interaction experience, rather than a static database [152], [157], [258]. To transcend storage limitations, researchers endow memory with agency. A-MEM [147] treats memory as a dynamic network inspired by “Zettelkasten”, while Mem0 [150] mimics human cognition to intelligently decide on data retention. Further approaches leverage constructivist theories to enable real-time flexible assimilation of cognitive schemas [153]. Moreover, memory management is increasingly treated as a learnable decision process, where context pruning and hierarchical organization are optimized via reinforcement learning [151], [154]. MemEvolve [155], [156] represents a line of work that advances this direction toward meta-level reconstruction, enabling agents to autonomously adapt routing strategies and memory logic based on task feedback.
- Tool-Augmented Evolution: This paradigm expands agent capabilities by progressing from basic tool invocation to autonomous creation and skill library management [259]. Initial approaches synergize reasoning with execution: ReAct [158] and WebGPT [159] interleave reasoning with search actions to enhance plan adaptability and credibility, while PAL [161] offloads complex logic to external Python interpreters. In
embodied settings, VOYAGER [160] advances this by accumulating an iterative library of executable code skills for lifelong learning.
Moving beyond predefined APIs, the paradigm extends to autonomous tool creation. LATM [162] generates tools for lightweight agents, while CREATOR [163] and CRAFT [164] separate abstract design from execution to build specialized toolsets. TOOLMAKER [165] applies this to scientific workflows with automated debugging. Alita [166] enables scalable evolution through a minimally predefined architecture, dynamically generating tools from a compact core through a standardized context-based interaction mechanism.
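The write-verify-store cycle underlying these skill libraries can be sketched as below, loosely following VOYAGER’s [160] iterative pattern; the `execute` sandbox signature is an assumption:

```python
def acquire_skill(llm, execute, library: dict, goal: str, max_attempts: int = 3):
    """Skill acquisition sketch: write code, verify, store on success.

    execute: source code -> (success: bool, error: str), e.g. a sandboxed runner.
    library: skill name -> verified source code; grows over the agent's lifetime.
    """
    error = ""
    for _ in range(max_attempts):
        source = llm(f"Goal: {goal}\nKnown skills: {list(library)}\n"
                     f"Previous error: {error}\nWrite a self-contained function:")
        success, error = execute(source)       # environmental verification
        if success:
            library[goal] = source             # verified skills become reusable tools
            return source
    return None                                # defer; a curriculum may retry later
```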
D. Agentic Topology Evolution
Traditional Multi-Agent Systems (MAS) are often built upon manually crafted workflows and predefined agent roles, resulting in rigid architectures that are difficult to adapt across tasks or environments [260]. In contrast, this paradigm emphasizes evolving the system structure itself as a learnable and optimizable component, enabling MAS to transition from offline structure search to runtime team reconfiguration and dynamic coordination. Existing work in this direction can be broadly categorized into Offline Architecture Search, which optimizes communication structures prior to deployment; Runtime Dynamic Adaptation, which adjusts team composition and connectivity during inference; and Structural State Evolution, which refines shared collective states to improve coordination efficiency.
- Offline Architecture Search: This paradigm treats the design of MAS as an offline optimization problem. Early approaches focus on optimizing communication topologies within computational graphs. GPTSwarm [171] applies gradient-based updates to prompts and connectivity to maximize information flow, while MACNET [168] proposes a large-scale DAG-based multi-agent framework and identifies a collaboration scaling law, showing that optimized irregular topologies outperform regular structures as agent count increases. AutoFlow [170] leverages reinforcement feedback to fine-tune interaction workflows from abstract descriptions. To transcend fixed templates, recent research further shifts toward searching the infinite space of programmatic definitions. AFLOW [167] formulates workflow optimization as a Monte Carlo Tree Search (MCTS) problem to efficiently navigate combinatorial code spaces. From a meta-learning perspective, ADAS [169] employs a meta-agent to iteratively write and reflect on system code, discovering novel designs that outperform human baselines. MAS-GPT [172] frames design as an end-to-end generation task, synthesizing executable systems directly from high-level user intents.
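As a toy illustration of offline topology search, the following sketch samples random sparse communication digraphs and keeps the best-scoring one; real systems replace random sampling with MCTS, gradient updates, or meta-agents:

```python
import random

def search_topology(evaluate, n_agents: int = 4, trials: int = 50):
    """Toy offline architecture search: sample directed edges, keep the best.

    evaluate: set-of-edges -> validation score (assumed to run the MAS once).
    """
    best_edges, best_score = None, float("-inf")
    for _ in range(trials):
        edges = {(i, j) for i in range(n_agents) for j in range(n_agents)
                 if i != j and random.random() < 0.4}   # random sparse digraph
        score = evaluate(edges)
        if score > best_score:
            best_edges, best_score = edges, score
    return best_edges, best_score
```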
- Runtime Dynamic Adaptation: Unlike offline methods, this paradigm dynamically adjusts agent configurations and connectivity during inference to balance performance and cost [178], [180]. Initial approaches focus on role generation, where AutoAgents [173] assembles dynamic expert teams, and EVOAGENT [176] employs evolutionary operators like mutation to generate specialized skills. Beyond roles, recent work optimizes communication topologies: G-Designer [174] decodes structures via graph neural networks, while MaAS [175] samples architectures from a probabilistic supernet.

Fig. 6: Advantages of Model–Environment Co-Evolution over Model-Centric and Environment-Centric Self-Evolution.
To handle complex reasoning, hierarchical strategies emerge. ReMA [179] decouples high-level meta-thinking from execution, and MASS [177] alternates between local prompt refinement and global topological search for efficient optimization.
- Structural State Evolution: This paradigm optimizes collective memory architectures. Foundational mechanisms focus on maintaining information quality by autonomously pruning low-value entries to prevent context overflow [184], [261]. Moving toward structured representations, G-Memory [181] organizes collective experience into hierarchical graphs, enabling retrieval at different abstraction levels. With deeper collaboration, Collaborative Memory [182] manages information boundaries through evolving bipartite graphs to balance sharing and privacy. LatentMAS [183] further shifts interaction into the latent space via hierarchical KV caches, synchronizing cognitive states directly without the overhead of text decoding.
V. MODEL-ENVIRONMENT CO-EVOLUTION
As illustrated in Figure 6, Model-Centric Self-Evolution faces inherent limitations, including the lack of external verification, error accumulation in iterative reasoning, and the overestimation of high-probability yet high-variance trajectories. These issues indicate that internal computation alone cannot ensure reliable long-horizon improvement, motivating the introduction of the environment as a source of grounded feedback. However, existing Environment-Centric Self-Evolution often treats the environment as static, with limited scalability and fixed difficulty that may not keep pace with the agent’s growing capability. To address this mismatch, the environment must evolve alongside the agent, forming a co-evolutionary system that enables open-ended capability growth. Current research in
this direction broadly includes Multi-Agent Policy Co-Evolution and Environment Training.
A. Multi-Agent Policy Co-Evolution
Unlike Agentic Topology Evolution in Environment-Centric Self-Evolution, which focuses on the structural design of MAS, this paradigm emphasizes the continuous optimization of agent policies through explicit training [188], [262], [263]. It views the environment as the collective of interacting agents, where parameter updates via Multi-Agent Reinforcement Learning and alignment training drive the emergence of advanced social intelligence. Initial efforts target communication efficiency; OPTIMA [185] uses MCTS-guided optimization and multi-objective rewards to penalize verbosity. As tasks grow complex, research shifts to joint policy optimization. MAPoRL [186] employs validator-based feedback to promote long-term collaboration, while MARFT [187] addresses heterogeneity via a flexible Markov game formalism. Finally, to reduce reliance on external supervision, CoMAS [189] replaces human feedback with peer evaluation, extracting intrinsic rewards from internal discussions to support decentralized self-improvement.
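A peer-evaluation reward of the kind CoMAS [189] advocates can be sketched as follows, assuming an `llm_judge` callable that returns a score in [0, 1]; the prompt format is illustrative:

```python
def peer_reward(agents: dict, llm_judge, task: str) -> dict:
    """Intrinsic reward from peer evaluation (no external human feedback).

    agents:    name -> callable policy producing an answer for the task.
    llm_judge: prompt -> float in [0, 1] (assumed scoring model).
    """
    answers = {name: agent(task) for name, agent in agents.items()}
    rewards = {}
    for name, answer in answers.items():
        peers = [p for p in answers if p != name]
        scores = [llm_judge(f"Task: {task}\nAnswer by {name}: {answer}\n"
                            f"Score 0-1 as peer {p}:") for p in peers]
        rewards[name] = sum(scores) / max(len(scores), 1)   # mean peer score
    return rewards
```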
B. Environment Training
Traditional agent training is often constrained by static and finite datasets, relying on rigid supervision that struggles to support long-horizon generalization. This paradigm shifts the perspective by treating the environment not as a fixed data source, but as an optimizable and evolving entity. By adapting in tandem with the agent’s capabilities, the environment mitigates difficulty imbalance and alleviates data bottlenecks during training. Existing approaches explore Adaptive Curriculum Evolution by dynamically adjusting task difficulty
based on agent feedback, and Scalable Environment Evolution by constructing large-scale environments with automatically verifiable signals, thereby providing scalable and progressively challenging training grounds for continual improvement.
- Adaptive Curriculum Evolution: Traditional training on static datasets often suffers from difficulty mismatch, leading to instability or overfitting. To address this, this paradigm frames training as a co-evolving process where the environment continuously adjusts task difficulty based on real-time agent feedback, ensuring the curriculum adapts alongside the agent’s proficiency.
A representative line of work allows the environment to evolve with learning progress. GenEnv [190] employs the simulator as a dynamic curriculum generator to maintain optimal task difficulty for higher data efficiency. Similarly, Environment Tuning [191] shifts focus to tuning the environment itself, constructing structured curricula that convert sparse error signals into actionable feedback for complex tool-use tasks. In reasoning-intensive domains, RLVE [192] extends this to verifiable environments, addressing sparse rewards in reinforcement learning by dynamically matching task difficulty to model capability and utilizing programmatic verifiers for accurate real-time feedback.
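The difficulty-tracking idea shared by these systems reduces to a small controller. The proportional rule below is our illustration, not a cited method’s update:

```python
def adapt_difficulty(difficulty: float, success_rate: float,
                     target: float = 0.5, step: float = 0.1) -> float:
    """Keep tasks near the agent's frontier by tracking its success rate.

    A simple proportional controller: too-easy batches (high success) push
    difficulty up; too-hard batches pull it down.
    """
    difficulty += step * (success_rate - target)
    return min(max(difficulty, 0.0), 1.0)      # clamp to a normalized range
```

For instance, with the default target of 0.5, a measured success rate of 0.8 nudges a difficulty of 0.40 up to 0.43.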
- Scalable Environment Evolution: Beyond dynamically adjusting the difficulty of existing environments, this paradigm represents a fundamental pathway for overcoming training data bottlenecks. It focuses on constructing large-scale, diverse virtual environments through automatic generation of tasks and reward signals, enabling reliable verification during training. By simulating real-world interactions, this paradigm supports the co-evolution of agents and their environments. DreamGym [193] utilizes an inference-based world model to simulate dynamics and generate dense rewards for efficient synthetic RL. To enhance generalization, AutoEnv [194] automatically constructs diverse environments to enforce robust strategy learning, while Endless Terminals [195] applies this to system operations by generating and verifying large-scale terminal tasks via an automated pipeline.
To support stable training, research focuses on standardized and verifiable infrastructure. Reasoning Gym [196] provides an open-source library of cheat-resistant, programmatically verifiable tasks for logic and coding. For platform standardization, GEM [197] establishes an OpenAI Gym-style interface for agentic LLMs, introducing the ReBN algorithm for credit assignment. Similarly, AgentGym [198] offers a unified platform across diverse domains, enabling iterative self-evolution through systematic interaction and evaluation.
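A Gym-style skeleton with a programmatic verifier, echoing the reset/step contract used throughout this survey, might look as follows; the task generator and verifier are placeholders, not any platform’s real API:

```python
class VerifiableEnv:
    """Gym-style environment skeleton with a programmatic verifier."""

    def __init__(self, generate_task, verify):
        self.generate_task = generate_task     # () -> task description
        self.verify = verify                   # (task, action) -> bool
        self.task = None

    def reset(self):
        self.task = self.generate_task()       # fresh, automatically generated task
        return self.task                       # initial state s_0

    def step(self, action):
        reward = 1.0 if self.verify(self.task, action) else 0.0
        return self.task, reward, True         # single-turn episode for brevity
```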
VI. APPLICATION
Beyond theoretical advancements, the paradigm of Self-Evolving Agents is rapidly entering practical domains, particularly under Model-Environment Co-Evolution, where agents actively reshape their surroundings to enable reciprocal adaptation. Starting from AlphaGo’s success in the closed and deterministic game of Go, the field has gradually moved toward more open-ended and complex settings. We are now witnessing a positive feedback loop in which agents continuously improve
their cognitive capabilities, such as tool usage [283], memory systems [147], and skill libraries [160], in real-world applications. This section synthesizes recent advances of self-evolving agents across Automated Scientific Discovery, Autonomous Software Engineering, and Open-World Simulation.
A. Automated Scientific Discovery
Scientific discovery is fundamentally a search for truth within an infinite hypothesis space. Static LLMs possess vast knowledge but cannot verify or iterate on unknown phenomena. Agentic science bridges this gap by establishing an iterative closed loop of hypothesis generation, experiment execution, and feedback-driven refinement, transforming AI from a passive assistant into an active explorer [284]–[286].
In scientific reasoning and research automation, The AI Scientist and AlphaProof [219] demonstrate co-evolution in knowledge-centric environments. The AI Scientist functions as an end-to-end AI researcher, using a Generation-Review loop with automated peer review to iteratively improve research quality [264]. Similarly, FARS [270] scales this end-to-end automation with a multi-agent system comprising ideation, planning, experimentation, and writing agents that autonomously execute the complete research workflow from first principles, while systematically incorporating negative results and incremental findings into the iterative improvement process.
Across an increasing range of scientific domains, co-evolution through interaction with external environments is becoming increasingly prominent, enabling agents to iteratively improve through real-world feedback from tools, simulations, and physical experiments. ChemCrow and Coscientist integrate LLM agents with laboratory tools, robotic platforms, and safety or multi-agent feedback loops, enabling autonomous experimentation and control of real-world instruments [265], [266]. GNoME applies an active learning cycle combining graph neural network predictions with DFT verification, leading to the discovery of 2.2 million new materials and substantially accelerating materials exploration [267]. A-Lab further demonstrates an autonomous robotic lab for inorganic synthesis, achieving a 71% synthesis success rate over a 17-day continuous run [268]. Similarly, CRESt [269] proposes a multimodal closed-loop framework for electrocatalyst discovery and successfully identifies a high-performance 8-element catalyst from a large search space.
B. Autonomous Software Engineering
In the domain of software engineering, the environment consists of a complex ecosystem that includes massive codebases, terminal command lines, and Continuous Integration (CI) pipelines. Unlike static Q&A tasks, software agents must navigate a state space defined by millions of lines of code, where a single character error can trigger a cascade of environmental feedback, such as compiler failures or runtime exceptions. Consequently, co-evolution in this domain mainly depends on the agent’s ability to use software tools effectively and continuously track system states and updates in a strict development environment.
TABLE III: Taxonomy of Self-Evolving Agents Applications and Mechanisms.
| Application | Domain | Environment Definition | Evolution Mechanism | Core Technology | Breakthrough Results |
|---|---|---|---|---|---|
| Automated Scientific Discovery | | | | | |
| The AI Scientist [264] | Academic Research | Simulated review system | Gen-Review cycle | Auto peer-review | Paper auto-generation |
| AlphaProof [219] | Logic & Math | Lean verifier | Search-Verify loop | Prover net | IMO 2024 silver-level |
| ChemCrow [265] | Chemistry | Lab tools | Plan-Safety-Execute loop | Robotics control | Generalized lab automation |
| Coscientist [266] | Automated Science | Lab env; hardware APIs | Hypothesis–Debate loop | Multi-agent debate | Zero-shot hardware control |
| GNoME [267] | Materials Sci | DFT simulation space | Active learning loop | GNN predictor | 2.2M stable crystals |
| A-Lab [268] | Materials Sci | Robotic lab | Active-learning synthesis | ML-guided planning | 71% synthesis success |
| CRESt [269] | Catalysis Discovery | Multimodal robotic lab | Multimodal BO loop | KABO; VLM | 9.3× cost-performance gain |
| FARS [270] | Academic Research | Open research workspace | Hypothesis loop | Multi-agent automation | Auto paper generation |
| Autonomous Software Engineering | | | | | |
| SWE-agent [271] | Software Eng. | Terminal; codebase; CI | Error-feedback correction | ACI interface | High bug-fix success rate |
| Claude Code [272] | Long-term Eng. | Project history | Skill accumulation | Skill memory | Senior-level coding |
| Manus [273] | Software Eng. | Cloud VM sandbox | Plan-Act-Verify loop | CodeAct | Human-like env interaction |
| OpenClaw [274] | Local Agents | Local FS | Community skill sharing | Skill hub | Long-term local adaptation |
| Devin [275] | Software Eng. | Browser; terminal; IDE | Web-based correction | Tool autonomy | Fully autonomous SWE |
| Cursor [276] | Human-AI Coding | Repo index; shadow env | Human-AI co-evolution | Shadow workspace | Productivity co-adaptation |
| Open-World Simulation | | | | | |
| Voyager [160] | Gaming (Minecraft) | Minecraft open world | Explore-Code-Store | Auto curriculum | 15.3× faster progression |
| GITM [277] | Gaming (Minecraft) | Minecraft open world | Decompose-Plan-Act | Text memory | +47.5% success (Diamond) |
| Cradle [278] | General Computer Control | GUI interface | Observe-Plan-Act loop | MLLM; skill curation | API-free computer control |
| Project Sid [279] | Digital Civ | Multi-agent society | Social norm co-evolution | PIANO | Emergent economy & laws |
| Generative Agents [280] | Social Sim | Virtual town sandbox | Observe-Reflect-Plan | Reflection | Emergent group activities |
| SIMA [281] | Embodied AI | Generative 3D worlds | GenEnv feedback loop | World model | Embodied data reduction |
| Genie 3 [282] | World Modeling | Text-to-3D worlds | Interactive world loop | Generative world model | Persistent 3D worlds |
SWE-agent introduces the Agent-Computer Interface (ACI), showing that simplifying command interfaces and providing concise, structured feedback can significantly improve agent self-correction through environment design alone [271]. Building on this, systems such as Claude Code and the community-driven ACE emphasize long-term context and experience accumulation [272]. By extracting reusable patterns from execution traces and storing them as skills, these agents gradually adapt to specific codebases, enabling sustained performance gains over extended development tasks.
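The loop below is a minimal sketch of this ACI principle: the environment replies with only an exit status and a truncated tail of the test output, and the agent conditions its next patch on that concise feedback. The llm_propose_patch and apply_patch callables are hypothetical stand-ins, not SWE-agent's actual interface.

```python
"""Sketch of an error-feedback correction loop with concise observations."""
import subprocess

def run_tests(max_chars=2000):
    # ACI-style feedback: exit status plus a truncated tail of the output,
    # instead of the full, noisy terminal stream.
    proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, (proc.stdout + proc.stderr)[-max_chars:]

def repair_loop(llm_propose_patch, apply_patch, max_turns=10):
    ok, feedback = run_tests()
    for _ in range(max_turns):
        if ok:
            return True
        patch = llm_propose_patch(feedback)  # model conditions on feedback
        apply_patch(patch)                   # edit the workspace
        ok, feedback = run_tests()           # environment responds
    return ok
```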
Beyond interface and memory, full-stack execution frameworks further extend agent autonomy. Manus and OpenClaw achieve tighter control over execution environments through programmatic interaction and structured control loops [273], while OpenClaw additionally supports community-driven skill sharing [274]. At the commercial frontier, agents such as Devin and Cursor integrate browsing, execution, and pre-simulation capabilities, enabling autonomous debugging and human–AI collaborative workflows that increasingly blur the boundary between assistance and independent development [275], [276].
C. Open-World Simulation
In gaming and virtual social simulations, co-evolution reaches its highest level of abstraction. Here, the environment is no longer a singular task or a static codebase, but an open, multi-agent social or physical world. Agents must not only adapt to the environment but also actively shape it through interactions, creating cultures, economies, and religions that reshape the environment’s dynamics.
Open-world environments drive the evolution of individual agent capabilities. Voyager [160] enables lifelong learning in Minecraft through intrinsic motivation and a reusable skill library, allowing agents to accumulate and compose executable skills over time. In the same domain, GITM [277] emphasizes
hierarchical planning with external knowledge, achieving full progression of the Minecraft technology tree. Beyond specific games, Cradle [278] extends to General Computer Control by interacting purely through screen observations and keyboard/mouse inputs, demonstrating cross-task generalization and long-horizon decision-making in complex real-world software and game environments.
VII. DISCUSSION, CHALLENGES, AND FUTURE FRONTIERS
A. Discussion
- Offline Synthesis vs. Online Exploration: The evolution of model-centric paradigms signifies a transition from static knowledge distillation to dynamic capability exploration. Synthesis-Driven Offline Evolution serves as an efficient bootstrapper by consolidating internal priors; however, it remains fundamentally bounded by the base model’s initial capacity [66], risking model collapse [287], where internal feedback loops amplify hallucinations without introducing new information entropy. In contrast, Exploration-Driven Online Evolution transcends these data ceilings by transforming the LLM into an active agent that discovers novel strategies through iterative trial-and-error. Crucially, this dynamic growth hinges on external environments such as code executors, mathematical engines, or the open web to provide objective information. Through rigorous feedback signals, the environment acts as an essential pruning mechanism that prevents self-reinforcing biases and enables the model to optimize policies beyond its original training distribution [26], [38], [190]; a toy illustration of this verifier-pruned loop appears at the end of this discussion.
- Static Knowledge vs. Dynamic Experience: Within environment-anchored interactions, feedback signals manifest as either static knowledge, in which the environment serves as a database to bridge informational gaps and supports learning what is, or dynamic experience, in which the environment functions as a gymnasium to refine reasoning strategies
through trajectory analysis and supports learning how to do. However, relying solely on one-way extraction from a fixed environment eventually bounds agent improvement by the environment’s inherent complexity. To transcend this ceiling, the agent–environment relationship must shift from passive extraction to reciprocal interaction, motivating the paradigm of Model–Environment Co-Evolution, where the environment actively adapts alongside the agent.
- Model-Centric Self-Evolution vs. Model-Environment Co-Evolution: Model-Centric Self-Evolution focuses on optimizing internal policies within predefined and often specialized environments [38], [39]. While such systems can leverage adaptive curricula to provide tailored challenges, they remain fundamentally constrained by the limited scope and static rules of the simulator, which lack the generalizability and complexity required for real-world, open-ended tasks. Consequently, this approach inevitably encounters performance plateaus when faced with scenarios beyond its narrow predefined bounds [91], [94]. In contrast, Model-Environment Co-Evolution represents a mutually reinforcing paradigm in which the environment itself is a dynamic, evolving system. Rather than acting as a restricted backdrop, the environment undergoes structural or complexity-driven transformations in tandem with the agent, providing an increasingly sophisticated landscape that drives continuous capability growth and supports true open-ended evolution [288].
Therefore, we posit that Model-Environment Co-Evolution represents the critical trajectory of self-evolution through its emphasis on the active reshaping of the training landscape. The future of Agentic AI depends on this symbiotic relationship where the continuous interplay between the model and the environment facilitates a self-sustaining cycle of intelligence. This paradigm shifts the focus from isolated optimization to an integrated developmental process, establishing the foundation for fully autonomous and open-ended evolutionary systems.
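To ground the pruning role of the environment highlighted in the first discussion point, the toy loop below samples noisy candidate answers and keeps only those confirmed by an external verifier (here, exact arithmetic). A real system would sample from an LLM and verify with a code executor or proof checker; only verified trajectories would reach the update step.

```python
"""Toy verifier-pruned exploration: only externally checked samples survive."""
import random

tasks = [(a, b) for a in range(10) for b in range(10)]  # "compute a + b"

def sample_answer(a, b):
    # Noisy exploration policy; stand-in for sampling from an LLM.
    return a + b + random.choice([-1, 0, 0, 0, 1])

verified = []
for a, b in tasks:
    for _ in range(8):                      # k exploration samples per task
        y = sample_answer(a, b)
        if y == a + b:                      # the environment verifies
            verified.append(((a, b), y))    # objective signal, no self-bias
            break
print(f"{len(verified)}/{len(tasks)} tasks yielded a verified trajectory")
```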
B. Challenges and Limitations
The current paradigm of Self-Evolving Agents faces several fundamental bottlenecks that prevent the realization of open-ended intelligence. These challenges stem from the discrepancy between constrained training setups and the boundless complexity of real-world applications.
Static and Non-Adaptive Environments: Most self-evolution methods still operate in environments with fixed rules and fixed feedback signals [36], [67]. Since the environment does not change or introduce new challenges, agents tend to overfit to the existing task distribution. As training progresses, performance improvements gradually slow down and eventually saturate once the agent has fully exploited the limited complexity of the environment.
Over-Reliance on Easily Verifiable Tasks: Current self-evolution methods heavily depend on environments with clear automatic checkers, such as compilers, unit tests, or theorem provers [51], [212], [219]. Although these signals are reliable, they largely limit progress to deterministic domains. In more subjective tasks where correctness is unclear or cannot be computed directly [289], [290], agents struggle to obtain useful feedback and therefore cannot improve autonomously.
Limited Realism in Simulation Environments: Many existing frameworks rely on simplified simulators that do not capture the uncertainty and noise present in the physical world [291]–[293]. As a result, agents may perform well in controlled digital settings but fail to generalize to real-world scenarios that require robustness to randomness and complex causal interactions [294].
Continued Dependence on Human Initialization: Although these systems aim for full autonomy, their performance still strongly depends on the quality of human-provided instructions or preference data at the start [39], [91], [92]. If the initial supervision is limited or biased, the agent may reinforce its own mistakes over time, leading to error accumulation rather than genuine improvement.
Loss of Generalization and Model Collapse: The recursive nature of training on self-generated data often leads to a narrowing of the model’s output distribution. This phenomenon, known as model collapse, results in the loss of long-tail information and a decrease in linguistic or strategic diversity, which severely undermines the agent’s ability to handle novel or out-of-distribution inputs.
C. Future Work
Overcoming the aforementioned limitations requires a paradigm shift toward Model-Environment Co-Evolution. This approach treats the agent and its operational context as a single integrated system where both components undergo simultaneous development.
Adaptive Environments that Grow with the Agent: Future research should explore mechanisms where the agent’s improvement naturally leads to more difficult environments. By maintaining a feedback loop in which the agent can reshape its training setting, the environment can continuously provide challenges that match the agent’s current level.
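A minimal sketch of such an adaptive loop, under the simplifying assumption that agent ability and task difficulty can be summarized as scalars, might look as follows; the success model and update rules are illustrative rather than drawn from any specific system.

```python
"""Toy co-adaptive curriculum: difficulty tracks the agent's success rate."""
import random
from collections import deque

skill, difficulty = 1.0, 1.0          # toy stand-ins for agent and environment
window = deque(maxlen=50)             # recent outcomes

for step in range(2000):
    success = random.random() < skill / (skill + difficulty)  # attempt a task
    window.append(success)
    skill += 0.002 * success          # the agent improves when it succeeds
    rate = sum(window) / len(window)
    if rate > 0.8:                    # too easy: the environment escalates
        difficulty *= 1.05
    elif rate < 0.3:                  # too hard: the environment backs off
        difficulty *= 0.95
print(f"final skill={skill:.2f}, difficulty={difficulty:.2f}")
```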
Building More Realistic and Open-Ended Environments: General-purpose environments should include richer physical dynamics and open-ended generation. They should be able to produce diverse tasks that better reflect the uncertainty and complexity of real-world interactions.
Connecting Multiple Simulators and Real-World Systems: Future frameworks should integrate different simulators, tools, and sensors into a unified training pipeline. This would allow agents to learn from multiple sources and improve through more diverse multimodal feedback.
Self-Evolution beyond Automatic Verification: A key direction is enabling self-evolution in tasks without clear ground truth. By developing stronger self-checking and cross-validation mechanisms, agents may improve in areas such as creative writing, dialogue, and social reasoning where external evaluators are unavailable.
VIII. BENCHMARKS
To comprehensively evaluate the effectiveness of Self-Evolving Agents, we categorize existing benchmarks into two dimensions: Intrinsic Capability Benchmarks, which assess the base model’s reasoning and coding proficiency, and Agentic Reasoning Capabilities Benchmarks, which evaluate the agent’s ability to evolve through interaction with external worlds. Please refer to Table IV for a complete list of the benchmarks.
TABLE IV: Taxonomy of Evaluation Benchmarks for Self-Evolving Agents Systems.
| Name | Domain | Modality | Task Format | Key Features |
|---|---|---|---|---|
| Intrinsic Capabilities | | | | |
| MMLU-Pro [295] | General Knowledge | Text | MCQ (10 options) | Robust Reasoning, Low Noise |
| HotpotQA [296] | General Knowledge | Text | Extractive QA | Multi-hop, Supporting Facts |
| MMLU [297] | General Knowledge | Text | MCQ (4 options) | 57 Disciplines, Broad Coverage |
| MuSiQue [298] | General Knowledge | Text | Extractive QA | Connected Multi-hop, Harder |
| NQ [299] | General Knowledge | Text | Short/Long QA | Real User Queries, Open-Domain |
| TriviaQA [300] | General Knowledge | Text | QA | Reading Comp., Evidence Triples |
| PopQA [301] | General Knowledge | Text | Short QA | Long-Tail Entities, RAG Focus |
| 2WikiMultiHopQA [302] | General Knowledge | Text | QA | Structured Reasoning Paths |
| BBH [303] | General Knowledge | Text | Mixed (QA/MCQ) | Hard BIG-Bench Subset, CoT |
| AGIEval [304] | General Knowledge | Text | MCQ | Human-Centric Exams (SAT/LSAT) |
| ARC [305] | General Knowledge | Visual | Grid Generation | Abstraction, Few-Shot, Core Knowledge |
| ARC-AGI [306] | Abstract Reasoning | Visual | Grid Transformation | Fluid Intelligence, Human-easy AI-hard |
| NarrativeQA [307] | General Knowledge | Text | Generative QA | Very Long Context (Books/Scripts) |
| LongBench [308] | General Knowledge | Text | Mixed | Long Context, Multi-Task Eval |
| HLE [309] | General Knowledge | Multimodal | Mixed (MCQ/QA) | Frontier Knowledge, Un-googleable |
| GPQA [310] | Scientific Reasoning | Text | MCQ | PhD-Level, Google-Proof |
| SuperGPQA [311] | Scientific Reasoning | Text | MCQ | 285 Disciplines, Light Industry/Agri |
| SciBench [312] | Scientific Reasoning | Text | QA | College-Level, Step-by-Step Calc |
| ChemBench [313] | Scientific Reasoning | Text | Mixed | Chemistry, Autonomous Labs |
| SciQA [314] | Scientific Reasoning | Text | QA | Knowledge Graph, Scientific Data |
| AIME [315] | Mathematical Reasoning | Text | Numeric QA | Math Competition, Hard Difficulty |
| OlympiadBench [316] | Mathematical Reasoning | Multimodal | Mixed (QA/MCQ) | Visual Reasoning, Olympiad-Level |
| GSM8K [317] | Mathematical Reasoning | Text | Numeric QA | Grade School Math, CoT Focus |
| MATH [318] | Mathematical Reasoning | Text | LaTeX/QA | Challenging Math, Diverse Topics |
| AMC [319] | Mathematical Reasoning | Text | MCQ | Pre-Olympiad, Competition Math |
| LiveCodeBench [320] | Code Generation | Text | Function Gen | Contamination-Free, Live Data |
| BigCodeBench [321] | Code Generation | Text | Function/Full Gen | Complex Libraries, Instruction Following |
| HumanEval [322] | Code Generation | Text | Function Gen | Functional Correctness, Docstrings |
| MBPP [323] | Code Generation | Text | Function Gen | Basic Programming, Semantic |
| EvalPlus [324] | Code Generation | Text | Function Gen | Rigorous Eval, 80x Test Cases |
| MultiPL-E [325] | Code Generation | Text | Function Gen (Polyglot) | 18+ Languages, Parallel Corpus |
| CRUXEval [326] | Code Generation | Text | Input/Output Prediction | Execution Simulation, CoT Focus |
| Agentic Reasoning Capabilities | | | | |
| WebArena [327] | Web Navigation | Text/HTML | Env. Interaction | Realistic Tasks, Long-Horizon |
| WebShop [328] | Web Navigation | Text | Env. Interaction | E-commerce, Decision Making |
| MT-Mind2Web [329] | Web Navigation | Text | Action Seq. | Multi-Turn, Generalization |
| Mind2Web [330] | Web Navigation | Text | Action Seq. | Generalist Agent, Real DOM |
| WebVoyager [104] | Web Navigation | Multimodal | Env. Interaction | End-to-End, Visual Navigation |
| VisualWebArena [331] | Web Navigation | Multimodal | Env. Interaction | Visual/HTML, Hybrid Interaction |
| ToolLLM [332] | Tool Usage | Text | API Calls | Large-Scale APIs, Instruction Tuning |
| AgentGym [198] | Unified Frameworks | Multimodal | Env. Interaction | Interactive Learning, Diversity |
| AgentBoard [333] | Unified Frameworks | Multimodal | Env. Interaction | Analytic Dashboard, Unified Eval |
| Reasoning Gym [196] | Unified Frameworks | Text | Interaction | Algorithmic, Dynamic Tasks |
| ALFWorld [334] | Unified Frameworks | Text | Text Interaction | Text-World, Household Tasks |
| AgentBench [335] | Unified Frameworks | Text | Mixed | Comprehensive, Multi-Environment |
| GAIA [336] | Unified Frameworks | Multimodal | QA w/ Tools | General Assistant, Hard Reasoning |
| DeepResearch Bench [108] | Unified Frameworks | Text | Web Search | Long-form Research, Citation Eval |
| SWE-bench [337] | Software Engineering | Text | Patch Gen | Real GitHub Issues, Repo-Level |
| Terminal-Bench [338] | OS Operations | Text | CLI Interaction | Linux Command Line, Security |
| OSWorld [339] | OS Operations | Multimodal | GUI/Desktop | Cross-App, Full OS Control |
A. Intrinsic Capability Benchmarks
These benchmarks primarily evaluate the effectiveness of Model-Centric Self-Evolution methods. The focus is on static datasets requiring complex reasoning.
1) General Knowledge:
• MMLU-Pro [295] is an enhanced benchmark designed to address the limitations of MMLU, such as score saturation and the prevalence of trivial or noisy data. This professional version significantly increases difficulty by expanding the dataset to over 12,000 questions across 14 diverse domains and increasing the answer choices per question from four to ten, thereby drastically reducing the probability of success through random guessing. MMLU-Pro eliminates simple rote-memorization tasks in favor of complex, reasoning-focused problems, making it a more discriminative tool for evaluating the upper limits of LLMs’ reasoning capabilities.
• HotpotQA [296] is a large-scale multi-hop question answering benchmark based on Wikipedia. The dataset is uniquely designed to challenge AI systems to reason across multiple documents to locate information and infer
the correct answer. HotpotQA requires models to provide the answer and identify the specific “supporting facts” used to reach that conclusion, thereby enforcing explainability.
• LongBench [308] is the first bilingual, multi-task benchmark to comprehensively evaluate the long-context understanding capabilities of LLMs. It comprises 21 datasets across six major task categories in both English and Chinese, with most tasks featuring an average length of 5,000 to 15,000 words/characters. To address more challenging real-world scenarios, the recently released LongBench v2 extends the context length up to 2 million words and focuses on assessing deep reasoning abilities within ultra-long contexts, establishing a more rigorous standard for next-generation long-context AI systems.
• AGIEval [304] is a human-centric benchmark framework designed to evaluate the general cognitive abilities of foundation models using questions derived from 20 high-standard official admission and qualification exams intended for human test-takers. By leveraging tasks that demand advanced reasoning and problem-solving skills rather than simple knowledge retrieval, AGIEval aims to rigorously quantify how close large language models are to achieving human-level intelligence.
• ARC [305] is a benchmark designed to evaluate progress toward AGI by measuring a system’s ability to rapidly adapt to novel tasks and perform abstract reasoning with minimal examples. Unlike traditional benchmarks that rely on massive datasets for pattern recognition, ARC consists of unique grid-based visual puzzles that require AI to deduce hidden logical rules using core cognitive priors and generalize them to unseen situations.
2) Scientific Reasoning:
• GPQA [310] is a highly challenging benchmark dataset consisting of 448 multiple-choice questions authored by domain experts in biology, physics, and chemistry. Its defining feature is being “Google-proof”: even with unrestricted internet access, highly skilled non-expert validators achieve only 34% accuracy, whereas domain experts score 65%. The dataset is designed to evaluate the deep reasoning capabilities of AI models in scientific domains and to facilitate research on “scalable oversight”, determining how humans can effectively supervise AI systems that surpass human expertise.
• SciBench [312] is a comprehensive benchmark designed to evaluate the college-level scientific problem-solving capabilities of LLMs. It comprises approximately 700 openended computational problems across 10 sub-disciplines in physics, chemistry, and mathematics, all sourced from authoritative textbooks and collegiate exams. The benchmark requires models to demonstrate not only profound domain knowledge but also intricate multi-step reasoning and precise numerical calculation skills, aiming to identify the systematic bottlenecks of current models in handling advanced scientific tasks through rigorous step-by-step evaluation.
3) Mathematical Reasoning:
• MATH [318] is a challenging benchmark for evaluating
LLMs on advanced mathematical reasoning. Covering topics such as algebra, geometry, number theory, and probability, it emphasizes multi-step problem solving with detailed solutions, testing models’ ability to perform longchain logical deduction beyond simple computation.
• AIME [315] stands as a premier standard for evaluating the advanced mathematical reasoning capabilities of LLMs. Spanning disciplines such as algebra, geometry, number theory, and combinatorics, the dataset presents problems that bridge the gap between high school competitions and national Olympiad-level challenges. Unlike tests that assess rote knowledge, AIME focuses on scrutinizing a model’s ability to execute long-chain logical deduction and multi-step strategic planning when engaging with novel, complex problems.
• OlympiadBench [316] is a challenging bilingual, multimodal scientific benchmark designed to evaluate the advanced reasoning capabilities of LLMs in mathematics and physics. The dataset comprises 8,476 rigorous problems sourced from Olympiad competitions and the Chinese College Entrance Examination, available in both Chinese and English, and integrates multiple modalities such as images and text. As one of the most difficult scientific evaluation sets, it focuses on complex tasks like open-ended questions and theorem proving.
4) Code Generation:
• LiveCodeBench [320] is a holistic and contamination-free benchmark designed to evaluate the coding capabilities of LLMs. It continuously collects the latest problems from competitive programming platforms such as LeetCode, AtCoder, and CodeForces to test performance on unseen challenges. Beyond standard code generation, LiveCodeBench assesses a broader spectrum of skills including self-repair, code execution, and test output prediction, offering a more rigorous and comprehensive measure of a model’s programming proficiency.
• BigCodeBench [321] is a next-generation benchmark designed to evaluate the capability of LLMs in solving practical and challenging programming tasks. BigCodeBench consists of 1,140 fine-grained tasks that require models to skillfully invoke functions from 139 libraries across 7 domains to execute complex instructions. With rigorous test case coverage and diverse function call scenarios, it provides a stricter and more realistic assessment of a model’s proficiency in real-world software development, marking a significant evolution in code generation benchmarking.
• HumanEval [322] is a benchmark for evaluating the functional correctness of code generated from docstrings. It contains 164 hand-written programming problems, each with a function signature, docstring, reference solution, and an average of 7.7 unit tests. Designed to avoid contamination from public code repositories, the tasks resemble entry-level programming interview questions and assess language understanding, algorithms, and basic mathematics. Performance is measured using the pass@k metric, where a solution is considered correct only if it passes all unit tests in a sandboxed environment.
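For reference, pass@k is typically computed with the unbiased estimator introduced alongside HumanEval: given n samples per problem of which c pass all tests, pass@k = 1 − C(n−c, k)/C(n, k), evaluated in the numerically stable product form below.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws (without
    replacement) from n samples with c correct ones is correct."""
    if n - c < k:
        return 1.0  # every k-subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# 200 samples with 30 passing: pass@1 = 0.15, pass@10 ≈ 0.81
print(pass_at_k(200, 30, 1), pass_at_k(200, 30, 10))
```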
B. Agentic Reasoning Capabilities Benchmarks
These benchmarks provide dynamic environments for Environment-Centric and Co-Evolution paradigms. Unlike static datasets, they function as gyms providing observations, actions, and feedback signals essential for reinforcement learning and lifelong evolution.
1) Web Navigation & Tool Usage:
• WebArena [327] is a realistic, standalone web environment and benchmark suite designed to evaluate the capabilities of autonomous AI agents in performing end-to-end web tasks. Unlike simplified simulation environments, WebArena creates a fully functional, self-hostable ecosystem comprising four key domains—e-commerce, social forums, collaborative software development, and content management systems—augmented with utility tools like maps and knowledge bases. Featuring over 800 curated long-horizon tasks, the benchmark challenges LLM agents to interpret natural language instructions and execute complex planning, reasoning, and cross-site interactions to achieve specific goals, thereby providing a rigorous standard for measuring agent functional correctness and robustness in real-world scenarios.
• WebVoyager [104] is an innovative Large Multimodal Model (LMM) powered web agent capable of completing user instructions end-to-end by interacting directly with real-world websites. Unlike existing web agents that typically handle only a single input modality and are evaluated in simplified web simulators or static web snapshots, WebVoyager mimics human web browsing behavior by making observations from screenshots and textual content in interactive web elements to formulate thoughts and execute actions such as clicking, typing, or scrolling. Furthermore, the framework introduces a new benchmark compiling real-world tasks from 15 popular websites and establishes an automatic evaluation protocol leveraging the multimodal understanding abilities of GPT-4V to reliably evaluate open-ended web agents.
• ToolLLM [332] is a general-purpose framework and benchmark designed to facilitate open-source LLMs in mastering real-world APIs. It introduces ToolBench, an instruction-tuning dataset constructed from over 16,000 real-world RESTful APIs collected from RapidAPI, covering highly diverse scenarios. Beyond the dataset, the framework includes ToolEval, an automatic evaluation toolkit that measures a model’s ability to execute complex, multi-step reasoning tasks. By assessing the pass rate of tool-use chains, ToolLLM aims to enable open-source models to match or exceed the tool-use capabilities of state-of-the-art closed-source models like ChatGPT.
2) Unified Frameworks:
• ALFWorld [334] is an interactive framework that aligns text-based environments with embodied robotic simulations. Extending TextWorld and ALFRED, it allows agents
to first learn high-level policies in a text-only setting and then transfer them to complex visual environments. This dual-environment design encourages interaction and semantic grounding through language, enabling modular agents that connect abstract reasoning with embodied execution. Empirical results show that text-based pretraining improves zero-shot generalization compared to training directly in physical environments.
• GAIA [336] is a benchmark designed to evaluate General AI Assistants on real-world questions that require diverse capabilities such as reasoning, multi-modality understanding, web browsing, and tool use. Unlike specialized exam-style benchmarks, GAIA focuses on conceptually simple tasks that humans can solve reliably but remain difficult for advanced AIs. It contains 466 questions organized into three difficulty levels, determined by the number of reasoning steps and tools required. To ensure reliable evaluation and reduce issues like data contamination, each question is crafted to yield a single concise and unambiguous factoid answer, enabling fast and automated scoring.
• AgentGym [198] is a comprehensive framework designed to evaluate and evolve LLM-based agents across diverse environments. The benchmark features 14 distinct interactive environments with 89 specific tasks, spanning categories such as web navigation, text games, embodied control, programming, and tool usage. Unlike traditional benchmarks that focus solely on static evaluation, AgentGym provides a unified interactive interface and a high-quality trajectory dataset. It enables agents to achieve “self-evolution” through multi-turn interaction, exploration, and reinforcement learning, offering a standardized platform for developing continuously learning agents.
• AgentBoard [333] is a comprehensive analytical evaluation benchmark designed for multi-turn LLM agents. Aiming to go beyond simple success rate metrics, it provides an in-depth analysis of agent performance in partially observable environments through fine-grained progress feedback and interaction visualization. AgentBoard encompasses diverse environments and tasks, focusing on evaluating core capabilities such as memory, planning, and world modeling, thereby enabling researchers to gain a more intuitive understanding of the model’s decision-making processes and limitations.
• Reasoning Gym (RG) [196] is a comprehensive library of reasoning environments designed for RLVR, offering over 100 procedural data generators spanning domains such as algebra, logic, geometry, algorithms, and games. Unlike traditional benchmarks composed of fixed question-answer pairs, RG’s key innovation lies in its ability to algorithmically generate virtually infinite training data with adjustable complexity and structural variation, effectively addressing the scalability bottleneck of data scarcity while eliminating memorization concerns. Serving as an open-ended playground that supports dynamic curriculum learning, RG allows task difficulty to be adjusted based on performance, and experiments demonstrate that RLVR training on RG tasks significantly enhances reasoning capabilities and generalization both within domains and across established external benchmarks.
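The sketch below illustrates this procedural-generation idea in generic form (it is not RG's actual API): each task instance is derived from a seed and a difficulty knob, so fresh, memorization-proof instances can be drawn indefinitely and the difficulty can follow the agent's success rate.

```python
"""Generic procedural task generator with an adjustable difficulty knob."""
import random

def make_arithmetic_task(difficulty: int, seed=None):
    rng = random.Random(seed)
    n_terms = 2 + difficulty                 # longer expressions when harder
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(n_terms)]
    ops = [rng.choice("+-*") for _ in range(n_terms - 1)]
    expr = "".join(x for pair in zip(map(str, terms), ops + [""]) for x in pair)
    return {"question": f"Compute {expr}", "answer": str(eval(expr))}

# Unlimited fresh, verifiable instances at any difficulty level.
print(make_arithmetic_task(difficulty=2, seed=42))
```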
TABLE V: Taxonomy of Open Source Libraries for Self-Evolving Agents Systems.
| Category | Library | Key Features |
|---|---|---|
| Foundational Agent Orchestration | LangGraph [340] | Enables multi-actor applications with cyclic graphs for complex looping logic. |
| | LlamaIndex [341] | Integrates private data with LLMs via robust connectors and query engines. |
| | AutoGen [342] | Automates tasks via customizable agents using conversation and tool integration. |
| | MetaGPT [343] | Encodes SOPs into LLMs for role-based software development. |
| Distributed Training | Megatron-LM [344] | Facilitates high-performance training utilizing multi-dimensional parallelism. |
| | DeepSpeed [345] | Optimizes memory efficiency featuring ZeRO technology. |
| Post-training & Alignment | slime [346] | High-performance training and flexible data generation. |
| | VeRL [347] | Provides a HybridFlow-based RL library with 3D-HybridEngine. |
| | OpenRLHF [348] | Supports distributed RLHF based on Ray and vLLM frameworks. |
| | TRL [349] | Offers a full-stack library for SFT, Reward Modeling, and RL alignment. |
| Efficient Fine-tuning | LLaMA Factory [350] | Provides a unified “code-free” WebUI supporting 100+ models. |
| | Unsloth [351] | Accelerates training via manually derived backpropagation and Triton kernels. |
| Inference & Serving | vLLM [352] | Serves models with high throughput utilizing PagedAttention. |
| | SGLang [353] | Manages structured generation using RadixAttention for aggressive cache reuse. |
3) Software Engineering & OS Operations:
• SWE-bench [337] is a rigorous evaluation framework designed to assess the ability of LLMs to resolve real-world software engineering issues. It comprises 2,294 task instances drawn from actual GitHub issues and pull requests across 12 popular open-source Python repositories, such as Django and scikit-learn. To ensure reliability, OpenAI and the original authors released SWE-bench Verified, a human-validated subset of 500 high-quality samples that filters out ambiguous or unsolvable tasks. Unlike simple code generation benchmarks, SWE-bench requires models to navigate an entire codebase, troubleshoot bugs, and generate functional patches that pass new unit tests, serving as the gold standard for measuring an AI’s capability to act as an autonomous software engineer.
• Terminal-Bench [338] is a rigorous benchmark designed to evaluate the comprehensive capabilities of LLM agents within realistic Linux terminal environments. Unlike traditional QA benchmarks, this setting requires AI agents to solve complex, long-horizon tasks such as server configuration, kernel compilation, and software development. To accomplish these tasks, agents execute shell commands within sandboxed Docker containers. Terminal-Bench 2.0 introduces the Harbor framework to support concurrent cloud-based evaluation and employs a strict “all-or-nothing” scoring mechanism, ensuring that the results accurately reflect an agent’s robustness, reasoning, and planning skills in solving real-world engineering problems.
IX. OPEN SOURCE LIBRARIES
To facilitate future research and deployment of Self-Evolving Agents systems, we summarize key open-source libraries
categorized by our taxonomy. Please refer to Table V for a complete list of the open source libraries.
A. Foundational Agent Orchestration
• LangGraph [340] is a library extension of LangChain designed for building stateful, multi-actor applications with LLMs; by modeling workflows as cyclic graphs, it overcomes the limitations of traditional Directed Acyclic Graphs (DAGs), enabling the creation of complex looping logic, persistent memory, and controllable “human-in-the-loop” agentic systems (a minimal example follows this list).
• LlamaIndex [341] is a data framework for building context-augmented LLM applications, available in Python and TypeScript, which seamlessly integrates private data with LLMs by providing robust data connectors, index structures, and query engines, thereby empowering developers to easily build efficient RAG systems and AI agents.
• AutoGen [342] is an open-source framework developed by Microsoft for building next-generation LLM applications through multi-agent collaboration, enabling customizable agents to seamlessly integrate LLMs, tools, and human inputs to efficiently automate complex tasks ranging from coding to decision-making workflows via automated conversation.
• MetaGPT [343] is an open-source multi-agent collaborative framework, recognized as the “First AI Software Company” and a foundational infrastructure for “Natural Language Programming”. Built upon the core philosophy of “Code = SOP(Team)”, it encodes Standard Operating Procedures (SOPs) into LLMs to assign specific roles—such as Product Managers, Architects, and Engineers—enabling the autonomous execution of the entire software development lifecycle, from competitive analysis and system design to coding and documentation, starting from just a single-line requirement.
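As referenced in the LangGraph entry above, the following minimal example builds a one-node graph with a conditional edge that loops until a revision budget is exhausted, illustrating the cyclic-graph idea; the API shown reflects recent LangGraph releases and may differ across versions.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    draft: str
    revisions: int

def revise(state: State) -> dict:
    # Node functions return partial state updates that LangGraph merges in.
    return {"draft": state["draft"] + " (revised)",
            "revisions": state["revisions"] + 1}

def should_loop(state: State) -> str:
    # Cycle back to the same node until the revision budget runs out.
    return "revise" if state["revisions"] < 3 else END

graph = StateGraph(State)
graph.add_node("revise", revise)
graph.set_entry_point("revise")
graph.add_conditional_edges("revise", should_loop)
app = graph.compile()
print(app.invoke({"draft": "v0", "revisions": 0}))
```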
B. Distributed Training
• Megatron-LM [344] is a high-performance, PyTorch-based distributed training framework, engineered specifically for building and training ultra-large-scale language models such as GPT, BERT, and MoE architectures; its modular Megatron Core library integrates multi-dimensional parallelism strategies—including tensor, pipeline, sequence, and expert parallelism—combined with FP8 mixed-precision training via the Transformer Engine to achieve efficient linear scalability across thousands of GPUs.
• DeepSpeed [345] is an open-source deep learning optimization library developed by Microsoft, engineered to accelerate the distributed training and inference of massive-scale models. Powered by innovations like the Zero Redundancy Optimizer (ZeRO), it partitions model states to overcome memory constraints while optimizing communication and computation, enabling developers to efficiently train and deploy models with hundreds of billions or even trillions of parameters on limited hardware resources.
C. Post-training & Alignment
• slime [346] is an asynchronous reinforcement learning training framework designed to integrate diverse agent systems with minimal engineering effort. It adopts a server-based rollout architecture compatible with OpenAI-style APIs and uses a modular rollout buffer to ingest and filter heterogeneous trajectories. By decoupling rollout and training engines across GPU resources, slime enables continuous data generation and model updates, improving hardware utilization and training throughput on complex agentic tasks.
• VeRL [347] is an open-source reinforcement learning library for LLMs initiated by the ByteDance Seed team; built on the HybridFlow framework, it is designed to offer a flexible and efficient post-training solution supporting mainstream algorithms like PPO and GRPO. By seamlessly integrating with infrastructures such as FSDP, Megatron-LM, and vLLM via modular APIs, and utilizing the 3D-HybridEngine to eliminate memory redundancy between training and generation phases, VeRL achieves state-of-the-art throughput on multi-GPU clusters, serving tasks ranging from DeepSeek R1 reproduction to complex agent training.
• OpenRLHF [348] is a high-performance, open-source RLHF framework built on Ray and vLLM, designed to significantly enhance the efficiency and stability of alignment training for LLMs. By leveraging Ray for distributed orchestration to flexibly allocate Actor, Reward, Reference, and Critic models across GPUs, and integrating vLLM for inference acceleration, it supports full fine-tuning of models with over 70 billion parameters at vastly improved training throughput compared to traditional frameworks. Seamlessly compatible with Hugging Face Transformers, the library supports mainstream algorithms such as PPO, DPO, KTO, and Rejection Sampling, making it a premier choice for high-performance LLM alignment with a user-friendly interface.
• TRL [349] is a full-stack library maintained by Hugging Face, specifically designed for the post-training phase of LLMs. Built upon the transformers ecosystem with seamless integration of accelerate and peft, it provides a comprehensive toolkit for model alignment and optimization, supporting methods ranging from SFT and Reward Modeling to advanced algorithms like PPO, DPO, KTO, and the recent GRPO, enabling developers to efficiently align models with human preferences via reinforcement learning.
D. Efficient Fine-tuning
• LLaMA Factory [350] is an open-source, unified, and efficient fine-tuning framework for LLMs, designed to enable “code-free” model customization through its visual WebUI (LlamaBoard). It supports over 100 mainstream models—including LLaMA, Qwen, Mistral, and DeepSeek—and integrates various advanced fine-tuning and alignment algorithms such as LoRA, QLoRA, PPO, and DPO. By leveraging acceleration technologies like FlashAttention-2 and Unsloth, it significantly boosts training efficiency, allowing users to easily manage the entire workflow from pre-training and supervised fine-tuning to reinforcement learning.
• Unsloth [351] is an open-source library for fine-tuning LLMs that significantly accelerates training speeds (up to 2x faster) and reduces memory usage (by up to roughly 70%, per the project’s own benchmarks) for models like Llama 3, Mistral, Phi-3, and Gemma by manually deriving backpropagation steps and rewriting PyTorch modules using OpenAI’s Triton language, all while maintaining accuracy and enabling efficient full fine-tuning or LoRA/QLoRA training on a single GPU.
E. Inference & Serving
• vLLM [352] is a high-performance, open-source library for LLM inference and serving, distinguished by its innovative PagedAttention algorithm, which manages attention key-value memory like virtual memory pages to minimize fragmentation and maximize throughput. vLLM supports advanced features such as continuous batching, speculative decoding, prefix caching, various quantization methods, and distributed inference, all while offering a seamless OpenAI-compatible API server for the easy deployment of popular open-source models like Llama, Qwen, and DeepSeek (see the brief usage sketch after this list).
• SGLang [353] is a high-performance framework and structured generation language developed by LMSYS Org designed for efficient LLM inference; it utilizes a novel RadixAttention technique for aggressive KV cache reuse and co-designs the frontend language with the backend runtime, significantly accelerating complex reasoning tasks and multi-turn interactions while ensuring controllable, structured outputs.
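As referenced in the vLLM entry above, a minimal offline-inference example looks like the following; the model name is a placeholder for any supported checkpoint, and the API reflects recent vLLM releases.

```python
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are handled internally by the engine.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder checkpoint
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

outputs = llm.generate(["Explain self-evolving agents in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```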
X. CONCLUSION
Autonomous agents have shown strong promise in solving open-ended real-world problems, yet their progress is increasingly constrained by a supervision bottleneck. Post-training still relies heavily on human oversight: SFT limits models to imitation, while RL depends on sparse and human-defined rewards that can be fragile or exploitable. To move beyond these constraints, the Self-Evolving Agents paradigm offers a shift toward autonomous continual improvement. By leveraging active agency, agents can learn through ongoing interaction, generating intrinsic training signals and refining their capabilities over time. Such evolution may occur within the model itself, through interaction with the environment, or via their joint co-evolution. This survey systematically reviews existing approaches, highlights key challenges, and outlines future directions, providing resources to support research toward more autonomous evolutionary agents.
REFERENCES
[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[3] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv e-prints, pp. arXiv–2407, 2024.
[4] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[5] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
[6] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025.
[7] K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen et al., “Kimi k2: Open agentic intelligence,” arXiv preprint arXiv:2507.20534, 2025.
[8] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024.
[9] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[10] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, vol. 1, no. 2, 2023.
[11] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in International conference on machine learning. PMLR, 2022, pp. 9118–9147.
[12] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face,” Advances in Neural Information Processing Systems, vol. 36, pp. 38 154–38 180, 2023.
[13] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in Proceedings of the 36th annual acm symposium on user interface software and technology, 2023, pp. 1–22.
[14] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[15] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth, “Recent advances in natural language processing via large pre-trained language models: A survey,” ACM Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022.
[17] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[18] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in The eleventh international conference on learning representations, 2022.
[19] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu et al., “Lima: Less is more for alignment,” Advances in Neural Information Processing Systems, vol. 36, pp. 55 006–55 021, 2023.
[20] D. Guo, D. Yang, H. Zhang, J. Song et al., “Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,” Nature, vol. 645, no. 8081, pp. 633–638, Sep. 2025. [Online]. Available: http://dx.doi.org/10.1038/s41586-025-09422-z
[21] “A primer on llm post-training.” [Online]. Available: https://pytorch.org/blog/a-primer-on-llm-post-training/
[22] “Claude 3.5 sonnet.” [Online]. Available: https://www.anthropic.com/news/claude-3-5-sonnet
[23] K. Kumar, T. Ashraf, O. Thawakar, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, P. H. Torr, F. S. Khan, and S. Khan, “Llm post-training: A deep dive into reasoning large language models,” arXiv preprint arXiv:2502.21321, 2025.
[24] T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma, “Sft memorizes, rl generalizes: A comparative study of foundation model post-training,” arXiv preprint arXiv:2501.17161, 2025.
[25] D. Silver and R. S. Sutton, “Welcome to the era of experience,” Google AI, vol. 1, 2025.
[26] Y. Zuo, K. Zhang, L. Sheng, S. Qu, G. Cui, X. Zhu, H. Li, Y. Zhang, X. Long, E. Hua et al., “Ttrl: Test-time reinforcement learning,” arXiv preprint arXiv:2504.16084, 2025.
[27] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022.
[28] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language models with selfgenerated instructions,” in Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), 2023, pp. 13 484–13 508.
[29] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024.
[30] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, G. Wang et al., “Instruction tuning for large language models: A survey,” ACM Computing Surveys, vol. 58, no. 7, pp. 1–36, 2026.
[31] Y. Meng, M. Xia, and D. Chen, “Simpo: Simple preference optimization with a reference-free reward,” Advances in Neural Information Processing Systems, vol. 37, pp. 124 198–124 235, 2024.
[32] J. Scheurer, J. A. Campos, T. Korbak, J. S. Chan, A. Chen, K. Cho, and E. Perez, “Training language models with language feedback at scale,” arXiv preprint arXiv:2303.16755, 2023.
[33] W. Zeng, Y. Huang, Q. Liu, W. Liu, K. He, Z. Ma, and J. He, “Simplerlzoo: Investigating and taming zero reinforcement learning for open base models in the wild,” arXiv preprint arXiv:2503.18892, 2025.
[34] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024.
[35] C. Jin, J. Xu, B. Liu, L. Tao, O. Golovneva, T. Shu, W. Zhao, X. Li, and J. Weston, “The era of real-world human interaction: Rl from user conversations,” arXiv preprint arXiv:2509.25137, 2025.
[36] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” in The Twelfth International Conference on Learning Representations, 2023.
[37] M. MacDiarmid, B. Wright, J. Uesato, J. Benton, J. Kutasov, S. Price, N. Bouscal, S. Bowman, T. Bricken, A. Cloud et al., “Natural emergent
misalignment from reward hacking in production rl,” arXiv preprint arXiv:2511.18397, 2025.
[38] C. Huang, W. Yu, X. Wang, H. Zhang, Z. Li, R. Li, J. Huang, H. Mi, and D. Yu, “R-zero: Self-evolving reasoning llm from zero data,” arXiv preprint arXiv:2508.05004, 2025.
[39] C. Yang, Z. Xiang, Y. Tang, Z. Teng, C. Huang, F. Long, Y. Liu, and J. Su, “Ttcs: Test-time curriculum synthesis for self-evolving,” arXiv preprint arXiv:2601.22628, 2026.
[40] Z. Qi, X. Liu, I. L. Iong, H. Lai, X. Sun, W. Zhao, Y. Yang, X. Yang, J. Sun, S. Yao et al., “Webrl: Training llm web agents via selfevolving online curriculum reinforcement learning,” arXiv preprint arXiv:2411.02337, 2024.
[41] Z. Wei, W. Yao, Y. Liu, W. Zhang, Q. Lu, L. Qiu, C. Yu, P. Xu, C. Zhang, B. Yin et al., “Webagent-r1: Training web agents via end-to-end multiturn reinforcement learning,” arXiv preprint arXiv:2505.16421, 2025.
[42] J. Zheng, C. Shi, X. Cai, Q. Li, D. Zhang, C. Li, D. Yu, and Q. Ma, “Lifelong learning of large language model based agents: A roadmap,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.
[43] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, pp. 484–489, 2016.
[44] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., “Mastering chess and shogi by self-play with a general reinforcement learning algorithm,” arXiv preprint arXiv:1712.01815, 2017.
[45] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” arXiv preprint arXiv:2203.11171, 2022.
[46] P. Manakul, A. Liusie, and M. Gales, “Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models,” in Proceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 9004–9017.
[47] D. Jiang, X. Ren, and B. Y. Lin, “Llm-blender: Ensembling large language models with pairwise ranking and generative fusion,” arXiv preprint arXiv:2306.02561, 2023.
[48] B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini, “Large language monkeys: Scaling inference compute with repeated sampling,” arXiv preprint arXiv:2407.21787, 2024.
[49] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023.
[50] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang et al., “Self-refine: Iterative refinement with self-feedback,” Advances in Neural Information Processing Systems, vol. 36, pp. 46 534–46 594, 2023.
[51] X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large language models to self-debug,” arXiv preprint arXiv:2304.05128, 2023.
[52] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen, “Critic: Large language models can self-correct with tool-interactive critiquing,” arXiv preprint arXiv:2305.11738, 2023.
[53] E. Wang, F. Cassano, C. Wu, Y. Bai, W. Song, V. Nath, Z. Han, S. Hendryx, S. Yue, and H. Zhang, “Planning in natural language improves llm search for code generation,” arXiv preprint arXiv:2409.03733, 2024.
[54] Y. Zhang, M. Khalifa, L. Logeswaran, J. Kim, M. Lee, H. Lee, and L. Wang, “Small language models need strong verifiers to self-correct reasoning,” arXiv preprint arXiv:2404.17140, 2024.
[55] V. Xiang, C. Snell, K. Gandhi, A. Albalak, A. Singh, C. Blagden, D. Phung, R. Rafailov, N. Lile, D. Mahan et al., “Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought,” arXiv preprint arXiv:2501.04682, 2025.
[56] K.-H. Lee, I. Fischer, Y.-H. Wu, D. Marwood, S. Baluja, D. Schuurmans, and X. Chen, “Evolving deeper llm thinking,” arXiv preprint arXiv:2501.09891, 2025.
[57] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” arXiv preprint arXiv:2305.10601, 2023.
[58] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk et al., “Graph of thoughts: Solving elaborate problems with large language models,” in Proceedings of the AAAI conference on artificial intelligence, vol. 38, no. 16, 2024, pp. 17 682–17 690.
[59] J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y. Gong, L. M. Ni, H.-Y. Shum, and J. Guo, “Think-on-graph: Deep and responsible reasoning of large
language model on knowledge graph,” arXiv preprint arXiv:2307.07697, 2023.
[60] L. Luo, Y.-F. Li, G. Haffari, and S. Pan, “Reasoning on graphs: Faithful and interpretable large language model reasoning,” arXiv preprint arXiv:2310.01061, 2023.
[61] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang, “Language agent tree search unifies reasoning acting and planning in language models,” arXiv preprint arXiv:2310.04406, 2023.
[62] X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang, “Alphazero-like tree-search can guide large language model decoding and training,” arXiv preprint arXiv:2309.17179, 2023.
[63] S. Ma, C. Xu, X. Jiang, M. Li, H. Qu, C. Yang, J. Mao, and J. Guo, “Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation,” arXiv preprint arXiv:2407.10805, 2024.
[64] B. Jin, C. Xie, J. Zhang, K. K. Roy, Y. Zhang, Z. Li, R. Li, X. Tang, S. Wang, Y. Meng et al., “Graph chain-of-thought: Augmenting large language models by reasoning on graphs,” arXiv preprint arXiv:2404.07103, 2024.
[65] J. Lu, W. Zhong, W. Huang, Y. Wang, Q. Zhu, F. Mi, B. Wang, W. Wang, X. Zeng, L. Shang et al., “Self: Self-evolution with language feedback,” arXiv preprint arXiv:2310.00533, 2023.
[66] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language models with selfgenerated instructions,” in Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), 2023, pp. 13 484–13 508.
[67] E. Zelikman, Y. Wu, J. Mu, and N. Goodman, “Star: Bootstrapping reasoning with reasoning,” Advances in Neural Information Processing Systems, vol. 35, pp. 15 476–15 488, 2022.
[68] J. Huang, S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han, “Large language models can self-improve,” in Proceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 1051–1068.
[69] Z. Chen, Y. Deng, H. Yuan, K. Ji, and Q. Gu, “Self-play fine-tuning converts weak language models to strong language models,” arXiv preprint arXiv:2401.01335, 2024.
[70] D. Zhang, S. Zhoubian, Z. Hu, Y. Yue, Y. Dong, and J. Tang, “Rest-mcts*: Llm self-training via process reward guided tree search,” Advances in Neural Information Processing Systems, vol. 37, pp. 64 735– 64 772, 2024.
[71] A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, X. Garcia, P. J. Liu, J. Harrison, J. Lee, K. Xu et al., “Beyond human data: Scaling self-training for problem-solving with language models,” arXiv preprint arXiv:2312.06585, 2023.
[72] Y. Wu, Z. Sun, H. Yuan, K. Ji, Y. Yang, and Q. Gu, “Self-play preference optimization for language model alignment,” arXiv preprint arXiv:2405.00675, 2024.
[73] Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, X. Jin, K. Yu, M. N. Nguyen, L. Liu et al., “Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning,” arXiv preprint arXiv:2504.20073, 2025.
[74] C. Zhao, X. Jia, V. Viswanathan, T. Wu, and G. Neubig, “Self-guide: Better task-specific instruction following via self-synthetic finetuning,” arXiv preprint arXiv:2407.12874, 2024.
[75] W. Zhao, M. Yuksekgonul, S. Wu, and J. Zou, “Sirius: Self-improving multi-agent systems via bootstrapped reasoning,” arXiv preprint arXiv:2502.04780, 2025.
[76] A. Zweiger, J. Pari, H. Guo, E. Akyürek, Y. Kim, and P. Agrawal, “Self-adapting language models,” arXiv preprint arXiv:2506.10943, 2025.
[77] A. Zhao, Y. Wu, Y. Yue, T. Wu, Q. Xu, M. Lin, S. Wang, Q. Wu, Z. Zheng, and G. Huang, “Absolute zero: Reinforced self-play reasoning with zero data,” arXiv preprint arXiv:2505.03335, 2025.
[78] H. Ma, T. Hu, Z. Pu, B. Liu, X. Ai, Y. Liang, and M. Chen, “Coevolving with the other you: Fine-tuning llm with sequential cooperative multi-agent reinforcement learning,” Advances in Neural Information Processing Systems, vol. 37, pp. 15 497–15 525, 2024.
[79] Z. Li, W. Yu, C. Huang, R. Liu, Z. Liang, F. Liu, J. Che, D. Yu, J. Boyd-Graber, H. Mi et al., “Self-rewarding vision-language model via reasoning decomposition,” arXiv preprint arXiv:2508.19652, 2025.
[80] B. Liu, L. Guertler, S. Yu, Z. Liu, P. Qi, D. Balcells, M. Liu, C. Tan, W. Shi, M. Lin et al., “Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning,” arXiv preprint arXiv:2506.24119, 2025.
[81] Y. Zhou, S. Levine, J. Weston, X. Li, and S. Sukhbaatar, “Self-challenging language model agents,” arXiv preprint arXiv:2506.01716, 2025.
[82] L. Chen, M. Prabhudesai, K. Fragkiadaki, H. Liu, and D. Pathak, “Self-questioning language models,” arXiv preprint arXiv:2508.03682, 2025.
[83] Y. Wang, L. Yang, Y. Tian, K. Shen, and M. Wang, “Co-evolving llm coder and unit tester via reinforcement learning,” arXiv preprint arXiv:2506.03136, 2025.
[84] J. Chen, B. Zhang, R. Ma, P. Wang, X. Liang, Z. Tu, X. Li, and K.-Y. K. Wong, “Spc: Evolving self-play critic via adversarial games for llm reasoning,” arXiv preprint arXiv:2504.19162, 2025.
[85] Y. Zhou, Z. Liang, H. Liu, W. Yu, K. Panaganti, L. Song, D. Yu, X. Zhang, H. Mi, and D. Yu, “Evolving language models without labels: Majority drives selection, novelty promotes variation,” arXiv preprint arXiv:2509.15194, 2025.
[86] W. Fang, S. Liu, Y. Zhou, K. Zhang, T. Zheng, K. Chen, M. Song, and D. Tao, “Serl: Self-play reinforcement learning for large language models with limited data,” arXiv preprint arXiv:2505.20347, 2025.
[87] S. Huang, W. Zhong, D. Cai, F. Wan, C. Wang, M. Wang, M. Qiao, and R. Xu, “Empowering self-learning of llms: Inner knowledge explicitation as a catalyst,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 23, 2025, pp. 24 150–24 158.
[88] T. Simonds and A. Yoshiyama, “Ladder: Self-improving llms through recursive problem decomposition,” arXiv preprint arXiv:2503.00735, 2025.
[89] Z. Lin, S. Shen, J. Shang, J. Weston, and Y. Nie, “Learning to solve and verify: A self-play framework for code and test generation,” arXiv preprint arXiv:2502.14948, 2025.
[90] J. G. Kuba, M. Gu, Q. Ma, Y. Tian, V. Mohan, and J. Chen, “Language self-play for data-free training,” arXiv preprint arXiv:2509.07414, 2025.
[91] B. Liu, C. Jin, S. Kim, W. Yuan, W. Zhao, I. Kulikov, X. Li, S. Sukhbaatar, J. Lanchantin, and J. Weston, “Spice: Self-play in corpus environments improves reasoning,” arXiv preprint arXiv:2510.24684, 2025.
[92] W. Yu, Z. Liang, C. Huang, K. Panaganti, T. Fang, H. Mi, and D. Yu, “Guided self-evolving llms with minimal human supervision,” arXiv preprint arXiv:2512.02472, 2025.
[93] Q. Wang, B. Liu, T. Zhou, J. Shi, Y. Lin, Y. Chen, H. H. Li, K. Wan, and W. Zhao, “Vision-zero: Scalable vlm self-improvement via strategic gamified self-play,” arXiv preprint arXiv:2509.25541, 2025.
[94] Y. He, C. Huang, Z. Li, J. Huang, and Y. Yang, “Visplay: Self-evolving vision-language models from images,” arXiv preprint arXiv:2511.15661, 2025.
[95] P. Xia, K. Zeng, J. Liu, C. Qin, F. Wu, Y. Zhou, C. Xiong, and H. Yao, “Agent0: Unleashing self-evolving agents from zero data via tool-integrated reasoning,” arXiv preprint arXiv:2511.16043, 2025.
[96] S. Wang, Z. Jiao, Z. Zhang, Y. Peng, X. Ze, B. Yang, W. Wang, H. Wei, and L. Zhang, “Socratic-zero: Bootstrapping reasoning via data-free agent co-evolution,” arXiv preprint arXiv:2509.24726, 2025.
[97] Z. Yang, W. Shen, C. Li, R. Chen, F. Wan, M. Yan, X. Quan, and F. Huang, “Spell: Self-play reinforcement learning for evolving long-context language models,” arXiv preprint arXiv:2509.23863, 2025.
[98] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” 2024.
[99] B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han, “Search-r1: Training llms to reason and leverage search engines with reinforcement learning,” arXiv preprint arXiv:2503.09516, 2025.
[100] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam et al., “Dspy: Compiling declarative language model calls into self-improving pipelines,” arXiv preprint arXiv:2310.03714, 2023.
[101] X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and Z. Dou, “Search-o1: Agentic search-enhanced large reasoning models,” arXiv preprint arXiv:2501.05366, 2025.
[102] M. Chen, L. Sun, T. Li, H. Sun, Y. Zhou, C. Zhu, H. Wang, J. Z. Pan, W. Zhang, H. Chen et al., “Learning to reason with search for llms via reinforcement learning,” arXiv preprint arXiv:2503.19470, 2025.
[103] Y. Qin, Z. Cai, D. Jin, L. Yan, S. Liang, K. Zhu, Y. Lin, X. Han, N. Ding, H. Wang et al., “Webcpm: Interactive web search for chinese long-form question answering,” arXiv preprint arXiv:2305.06849, 2023.
[104] H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu, “Webvoyager: Building an end-to-end web agent with large multimodal models,” arXiv preprint arXiv:2401.13919, 2024.
[105] Y. Zheng, D. Fu, X. Hu, X. Cai, L. Ye, P. Lu, and P. Liu, “Deepresearcher: Scaling deep research via reinforcement learning in real-world environments,” arXiv preprint arXiv:2504.03160, 2025.
[106] X. Li, J. Jin, G. Dong, H. Qian, Y. Wu, J.-R. Wen, Y. Zhu, and Z. Dou, “Webthinker: Empowering large reasoning models with deep research capability,” arXiv preprint arXiv:2504.21776, 2025.
[107] Z. Chen, K. Liu, Q. Wang, J. Liu, W. Zhang, K. Chen, and F. Zhao, “Mindsearch: Mimicking human minds elicits deep ai searcher,” arXiv preprint arXiv:2407.20183, 2024.
[108] M. Du, B. Xu, C. Zhu, X. Wang, and Z. Mao, “Deepresearch bench: A comprehensive benchmark for deep research agents,” arXiv preprint arXiv:2506.11763, 2025.
[109] G. Chen, Z. Qiao, W. Wang, D. Yu, X. Chen, H. Sun, M. Liao, K. Fan, Y. Jiang, P. Xie, W. X. Zhao, R. Song, and F. Huang, “Mars: Co-evolving dual-system deep research via multi-agent reinforcement learning,” 2025.
[110] X. Liang, J. Yang, Y. Wang, C. Tang, Z. Zheng, S. Song, Z. Lin, Y. Yang, S. Niu, H. Wang et al., “Surveyx: Academic survey automation via large language models,” arXiv preprint arXiv:2502.14776, 2025.
[111] Tongyi DeepResearch Team, B. Li, B. Zhang, D. Zhang, F. Huang, G. Li, G. Chen, H. Yin, J. Wu, J. Zhou et al., “Tongyi deepresearch technical report,” arXiv preprint arXiv:2510.24701, 2025.
[112] X.-P. Nguyen, S. Pandit, R. G. Reddy, A. Xu, S. Savarese, C. Xiong, and S. Joty, “Sfr-deepresearch: Towards effective reinforcement learning for autonomously reasoning single agents,” arXiv preprint arXiv:2509.06283, 2025.
[113] R. Lu, Z. Hou, Z. Wang, H. Zhang, X. Liu, Y. Li, S. Feng, J. Tang, and Y. Dong, “Deepdive: Advancing deep search agents with knowledge graphs and multi-turn rl,” arXiv preprint arXiv:2509.10446, 2025.
[114] F. Wu, W. Xuan, H. Qi, X. Lu, A. Tu, L. E. Li, and Y. Choi, “Deepsearch: Overcome the bottleneck of reinforcement learning with verifiable rewards via monte carlo tree search,” arXiv preprint arXiv:2509.25454, 2025.
[115] Z. Li, X. Guan, B. Zhang, S. Huang, H. Zhou, S. Lai, M. Yan, Y. Jiang, P. Xie, F. Huang et al., “Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research,” arXiv preprint arXiv:2509.13312, 2025.
[116] Z. Z. Wang, J. Mao, D. Fried, and G. Neubig, “Agent workflow memory,” arXiv preprint arXiv:2409.07429, 2024.
[117] X. Tang, T. Qin, T. Peng, Z. Zhou, D. Shao, T. Du, X. Wei, P. Xia, F. Wu, H. Zhu et al., “Agent kb: Leveraging cross-domain experience for agentic problem solving,” arXiv preprint arXiv:2507.06229, 2025.
[118] B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig et al., “Skillweaver: Web agents can self-improve by discovering and honing skills,” arXiv preprint arXiv:2504.07079, 2025.
[119] C. Yang, C. Zhao, Q. Gu, and D. Zhou, “Cops: Empowering llm agents with provable cross-task experience sharing,” arXiv preprint arXiv:2410.16670, 2024.
[120] L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang et al., “Gepa: Reflective prompt evolution can outperform reinforcement learning,” arXiv preprint arXiv:2507.19457, 2025.
[121] Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li et al., “Agentic context engineering: Evolving contexts for self-improving language models,” arXiv preprint arXiv:2510.04618, 2025.
[122] M. Suzgun, M. Yuksekgonul, F. Bianchi, D. Jurafsky, and J. Zou, “Dynamic cheatsheet: Test-time learning with adaptive memory,” arXiv preprint arXiv:2504.07952, 2025.
[123] H. Zhou, Y. Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y. Lee, G. Zhang, K. Shao, L. Yang et al., “Memento: Fine-tuning llm agents without fine-tuning llms,” arXiv preprint arXiv:2508.16153, 2025.
[124] S. Ouyang, J. Yan, I. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. T. Le, S. Daruki, X. Tang et al., “Reasoningbank: Scaling agent self-evolving with reasoning memory,” arXiv preprint arXiv:2509.25140, 2025.
[125] S. Chen, S. Lin, X. Gu, Y. Shi, H. Lian, L. Yun, D. Chen, W. Sun, L. Cao, and Q. Wang, “Swe-exp: Experience-driven software issue resolution,” arXiv preprint arXiv:2507.23361, 2025.
[126] Z. Z. Wang, A. Gandhi, G. Neubig, and D. Fried, “Inducing programmatic skills for agentic tasks,” arXiv preprint arXiv:2504.06821, 2025.
[127] K. Zhang, X. Chen, B. Liu, T. Xue, Z. Liao, Z. Liu, X. Wang, Y. Ning, Z. Chen, X. Fu et al., “Agent learning via early experience,” arXiv preprint arXiv:2510.08558, 2025.
[128] Z. Cao, J. Deng, L. Yu, W. Zhou, Z. Liu, B. Ding, and H. Zhao, “Remember me, refine me: A dynamic procedural memory framework for experience-driven agent evolution,” arXiv preprint arXiv:2512.10696, 2025.
[129] G. Zhang, M. Fu, and S. Yan, “Memgen: Weaving generative latent memory for self-evolving agents,” arXiv preprint arXiv:2509.24704, 2025.
[130] Z. Cai, X. Guo, Y. Pei, J. Feng, J. Su, J. Chen, Y.-Q. Zhang, W.-Y. Ma, M. Wang, and H. Zhou, “Flex: Continuous agent evolution via forward learning from experience,” arXiv preprint arXiv:2511.06449, 2025.
[131] M. Ho, C. Si, Z. Feng, F. Yu, Y. Yang, Z. Liu, Z. Hu, and L. Qin, “Arcmemo: Abstract reasoning composition with lifelong llm memory,” arXiv preprint arXiv:2509.04439, 2025.
[132] Y. Cai, S. Cai, Y. Shi, Z. Xu, L. Chen, Y. Qin, X. Tan, G. Li, Z. Li, H. Lin et al., “Training-free group relative policy optimization,” arXiv preprint arXiv:2510.08191, 2025.
[133] R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang et al., “Evolver: Self-evolving llm agents through an experience-driven lifecycle,” arXiv preprint arXiv:2510.16079, 2025.
[134] G. Zhang, F. Meng, G. Wan, Z. Li, K. Wang, Z. Yin, L. Bai, and S. Yan, “Latentevolve: Self-evolving test-time scaling in latent space,” arXiv preprint arXiv:2509.24771, 2025.
[135] J. Wang, Q. Yan, Y. Wang, Y. Tian, S. S. Mishra, Z. Xu, M. Gandhi, P. Xu, and L. L. Cheong, “Reinforcement learning for self-improving agent with skill library,” arXiv preprint arXiv:2512.17102, 2025.
[136] Y. Zhai, S. Tao, C. Chen, A. Zou, Z. Chen, Q. Fu, S. Mai, L. Yu, J. Deng, Z. Cao et al., “Agentevolver: Towards efficient self-evolving agent system,” arXiv preprint arXiv:2511.10395, 2025.
[137] W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, “Memorybank: Enhancing large language models with long-term memory,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 19 724–19 731.
[138] C. Packer, V. Fang, S. Patil, K. Lin, S. Wooders, and J. Gonzalez, “Memgpt: Towards llms as operating systems,” arXiv preprint arXiv:2310.08560, 2023.
[139] L. Liu, X. Yang, Y. Shen, B. Hu, Z. Zhang, J. Gu, and G. Zhang, “Think-in-memory: Recalling and post-thinking enable llms with long-term memory,” arXiv preprint arXiv:2311.08719, 2023.
[140] K.-H. Lee, X. Chen, H. Furuta, J. Canny, and I. Fischer, “A human-inspired reading agent with gist memory of very long contexts,” arXiv preprint arXiv:2402.09727, 2024.
[141] X. Li and X. Qiu, “Mot: Memory-of-thought enables chatgpt to self-improve,” arXiv preprint arXiv:2305.05181, 2023.
[142] R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang, “Memp: Exploring agent procedural memory,” arXiv preprint arXiv:2508.06433, 2025.
[143] Z. Tan, J. Yan, I.-H. Hsu, R. Han, Z. Wang, L. Le, Y. Song, Y. Chen, H. Palangi, G. Lee et al., “In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 8416–8439.
[144] J. Fang, X. Deng, H. Xu, Z. Jiang, Y. Tang, Z. Xu, S. Deng, Y. Yao, M. Wang, S. Qiao et al., “Lightmem: Lightweight and efficient memory-augmented generation,” arXiv preprint arXiv:2510.18866, 2025.
[145] B. Yan, C. Li, H. Qian, S. Lu, and Z. Liu, “General agentic memory via deep research,” arXiv preprint arXiv:2511.18423, 2025.
[146] R. Ye, Z. Zhang, K. Li, H. Yin, Z. Tao, Y. Zhao, L. Su, L. Zhang, Z. Qiao, X. Wang et al., “Agentfold: Long-horizon web agents with proactive context management,” arXiv preprint arXiv:2510.24699, 2025.
[147] W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang, “A-mem: Agentic memory for llm agents,” arXiv preprint arXiv:2502.12110, 2025.
[148] P. Rasmussen, P. Paliychuk, T. Beauvais, J. Ryan, and D. Chalef, “Zep: a temporal knowledge graph architecture for agent memory,” arXiv preprint arXiv:2501.13956, 2025.
[149] Z. Wang, H. Xu, J. Wang, X. Zhang, M. Yan, J. Zhang, F. Huang, and H. Ji, “Mobile-agent-e: Self-evolving mobile assistant for complex tasks,” arXiv preprint arXiv:2501.11733, 2025.
[150] P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav, “Mem0: Building production-ready ai agents with scalable long-term memory,” arXiv preprint arXiv:2504.19413, 2025.
[151] Y. Wang, R. Takanobu, Z. Liang, Y. Mao, Y. Hu, J. McAuley, and X. Wu, “Mem-α: Learning memory construction via reinforcement learning,” arXiv preprint arXiv:2509.25911, 2025.
[152] Y. Wang, D. Krotov, Y. Hu, Y. Gao, W. Zhou, J. McAuley, D. Gutfreund, R. Feris, and Z. He, “M+: Extending memoryllm with scalable long-term memory,” in Forty-second International Conference on Machine Learning, 2025.
[153] R. Li, Z. Zhang, X. Bo, Z. Tian, X. Chen, Q. Dai, Z. Dong, and R. Tang, “Cam: A constructivist view of agentic memory for llm-based reading comprehension,” arXiv preprint arXiv:2510.05520, 2025.
[154] Y. Zhang, J. Shu, Y. Ma, X. Lin, S. Wu, and J. Sang, “Memory as action: Autonomous context curation for long-horizon agentic tasks,” arXiv preprint arXiv:2510.12635, 2025.
[155] G. Zhang, H. Yu, K. Yang, B. Wu, F. Huang, Y. Li, and S. Yan, “Evoroute: Experience-driven self-routing llm agent systems,” arXiv preprint arXiv:2601.02695, 2026.
[156] G. Zhang, H. Ren, C. Zhan, Z. Zhou, J. Wang, H. Zhu, W. Zhou, and S. Yan, “Memevolve: Meta-evolution of agent memory systems,” arXiv preprint arXiv:2512.18746, 2025.
[157] H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang, “Memskill: Learning and evolving memory skills for self-evolving agents,” arXiv preprint arXiv:2602.02474, 2026.
[158] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in The eleventh international conference on learning representations, 2022.
[159] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders et al., “Webgpt: Browser-assisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021.
[160] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.
[161] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “Pal: Program-aided language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 10 764–10 799.
[162] T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou, “Large language models as tool makers,” arXiv preprint arXiv:2305.17126, 2023.
[163] C. Qian, C. Han, Y. Fung, Y. Qin, Z. Liu, and H. Ji, “Creator: Tool creation for disentangling abstract and concrete reasoning of large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 6922–6939.
[164] L. Yuan, Y. Chen, X. Wang, Y. R. Fung, H. Peng, and H. Ji, “Craft: Customizing llms by creating and retrieving from specialized toolsets,” arXiv preprint arXiv:2309.17428, 2023.
[165] G. Wölflein, D. Ferber, D. Truhn, O. Arandjelović, and J. N. Kather, “Llm agents making agent tools,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 26 092–26 130.
[166] J. Qiu, X. Qi, T. Zhang, X. Juan, J. Guo, Y. Lu, Y. Wang, Z. Yao, Q. Ren, X. Jiang et al., “Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution,” arXiv preprint arXiv:2505.20286, 2025.
[167] J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang et al., “Aflow: Automating agentic workflow generation,” arXiv preprint arXiv:2410.10762, 2024.
[168] C. Qian, Z. Xie, Y. Wang, W. Liu, K. Zhu, H. Xia, Y. Dang, Z. Du, W. Chen, C. Yang et al., “Scaling large language model-based multi-agent collaboration,” arXiv preprint arXiv:2406.07155, 2024.
[169] S. Hu, C. Lu, and J. Clune, “Automated design of agentic systems,” arXiv preprint arXiv:2408.08435, 2024.
[170] Z. Li, S. Xu, K. Mei, W. Hua, B. Rama, O. Raheja, H. Wang, H. Zhu, and Y. Zhang, “Autoflow: Automated workflow generation for large language model agents,” arXiv preprint arXiv:2407.12821, 2024.
[171] M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber, “Language agents as optimizable graphs,” arXiv preprint arXiv:2402.16823, 2024.
[172] R. Ye, S. Tang, R. Ge, Y. Du, Z. Yin, S. Chen, and J. Shao, “Mas-gpt: Training llms to build llm-based multi-agent systems,” arXiv preprint arXiv:2503.03686, 2025.
[173] G. Chen, S. Dong, Y. Shu, G. Zhang, J. Sesay, B. F. Karlsson, J. Fu, and Y. Shi, “Autoagents: A framework for automatic agent generation,” arXiv preprint arXiv:2309.17288, 2023.
[174] G. Zhang, Y. Yue, X. Sun, G. Wan, M. Yu, J. Fang, K. Wang, T. Chen, and D. Cheng, “G-designer: Architecting multi-agent communication topologies via graph neural networks,” arXiv preprint arXiv:2410.11782, 2024.
[175] G. Zhang, L. Niu, J. Fang, K. Wang, L. Bai, and X. Wang, “Multi-agent architecture search via agentic supernet,” arXiv preprint arXiv:2502.04180, 2025.
[176] S. Yuan, K. Song, J. Chen, X. Tan, D. Li, and D. Yang, “Evoagent: Towards automatic multi-agent generation via evolutionary algorithms,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 6192–6217.
[177] H. Zhou, X. Wan, R. Sun, H. Palangi, S. Iqbal, I. Vulić, A. Korhonen, and S. Ö. Arık, “Multi-agent design: Optimizing agents with better prompts and topologies,” arXiv preprint arXiv:2502.02533, 2025.
[178] S. R. Motwani, C. Smith, R. J. Das, R. Rafailov, I. Laptev, P. H. Torr, F. Pizzati, R. Clark, and C. S. de Witt, “Malt: Improving reasoning with multi-agent llm training,” arXiv preprint arXiv:2412.01928, 2024.
[179] Z. Wan, Y. Li, X. Wen, Y. Song, H. Wang, L. Yang, M. Schmidt, J. Wang, W. Zhang, S. Hu et al., “Rema: Learning to meta-think for llms with multi-agent reinforcement learning,” arXiv preprint arXiv:2503.09501, 2025.
[180] B. Li, Z. Zhao, D.-H. Lee, and G. Wang, “Adaptive graph pruning for multi-agent communication,” arXiv preprint arXiv:2506.02951, 2025.
[181] G. Zhang, M. Fu, G. Wan, M. Yu, K. Wang, and S. Yan, “G-memory: Tracing hierarchical memory for multi-agent systems,” arXiv preprint arXiv:2506.07398, 2025.
[182] A. Rezazadeh, Z. Li, A. Lou, Y. Zhao, W. Wei, and Y. Bao, “Collaborative memory: Multi-user memory sharing in llm agents with dynamic access control,” arXiv preprint arXiv:2505.18279, 2025.
[183] J. Zou, X. Yang, R. Qiu, G. Li, K. Tieu, P. Lu, K. Shen, H. Tong, Y. Choi, J. He et al., “Latent collaboration in multi-agent systems,” arXiv preprint arXiv:2511.20639, 2025.
[184] H. Xu, J. Hu, K. Zhang, L. Yu, Y. Tang, X. Song, Y. Duan, L. Ai, and B. Shi, “Sedm: Scalable self-evolving distributed memory for agents,” arXiv preprint arXiv:2509.09498, 2025.
[185] W. Chen, J. Yuan, C. Qian, C. Yang, Z. Liu, and M. Sun, “Optima: Optimizing effectiveness and efficiency for llm-based multi-agent system,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 11 534–11 557.
[186] C. Park, S. Han, X. Guo, A. E. Ozdaglar, K. Zhang, and J.-K. Kim, “Maporl: Multi-agent post-co-training for collaborative large language models with reinforcement learning,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 30 215–30 248.
[187] J. Liao, M. Wen, J. Wang, and W. Zhang, “Marft: Multi-agent reinforcement fine-tuning,” arXiv preprint arXiv:2504.16129, 2025.
[188] S. Liu, T. Chen, Z. Liang, X. Lyu, and C. Amato, “Llm collaboration with multi-agent reinforcement learning,” arXiv preprint arXiv:2508.04652, 2025.
[189] X. Xue, Y. Zhou, G. Zhang, Z. Zhang, Y. Li, C. Zhang, Z. Yin, P. Torr, W. Ouyang, and L. Bai, “Comas: Co-evolving multi-agent systems via interaction rewards,” arXiv preprint arXiv:2510.08529, 2025.
[190] J. Guo, L. Yang, P. Chen, Q. Xiao, Y. Wang, X. Juan, J. Qiu, K. Shen, and M. Wang, “Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators,” arXiv preprint arXiv:2512.19682, 2025.
[191] S. Lu, Z. Wang, H. Zhang, Q. Wu, L. Gan, C. Zhuang, J. Gu, and T. Lin, “Don’t just fine-tune the agent, tune the environment,” arXiv preprint arXiv:2510.10197, 2025.
[192] Z. Zeng, H. Ivison, Y. Wang, L. Yuan, S. S. Li, Z. Ye, S. Li, J. He, R. Zhou, T. Chen et al., “Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments,” arXiv preprint arXiv:2511.07317, 2025.
[193] Z. Chen, Z. Zhao, K. Zhang, B. Liu, Q. Qi, Y. Wu, T. Kalluri, S. Cao, Y. Xiong, H. Tong et al., “Scaling agent learning via experience synthesis,” arXiv preprint arXiv:2511.03773, 2025.
[194] J. Zhang, Y. Peng, F. Kong, C. Yang, Y. Wu, Z. Yu, J. Xiang, J. Ruan, J. Wang, M. Song et al., “Autoenv: Automated environments for measuring cross-environment agent learning,” arXiv preprint arXiv:2511.19304, 2025.
[195] K. Gandhi, S. Garg, N. D. Goodman, and D. Papailiopoulos, “Endless terminals: Scaling rl environments for terminal agents,” arXiv preprint arXiv:2601.16443, 2026.
[196] Z. Stojanovski, O. Stanley, J. Sharratt, R. Jones, A. Adefioye, J. Kaddour, and A. Köpf, “Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards,” arXiv preprint arXiv:2505.24760, 2025.
[197] Z. Liu, A. Sims, K. Duan, C. Chen, S. Yu, X. Zhou, H. Xu, S. Xiong, B. Liu, C. Tan et al., “Gem: A gym for agentic llms,” arXiv preprint arXiv:2510.01051, 2025.
[198] Z. Xi, Y. Ding, W. Chen, B. Hong, H. Guo, J. Wang, D. Yang, C. Liao, X. Guo, W. He, S. Gao, L. Chen, R. Zheng, Y. Zou, T. Gui, Q. Zhang, X. Qiu, X. Huang, Z. Wu, and Y.-G. Jiang, “Agentgym: Evolving large language model-based agents across diverse environments,” 2024. [Online]. Available: https://arxiv.org/abs/2406.04151
[199] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv:2408.03314, 2024.
[200] L. Chen, J. Davis, B. Hanin, P. Bailis, I. Stoica, M. Zaharia, and J. Zou, “Are more llm calls all you need? towards the scaling properties of compound ai systems,” Advances in Neural Information Processing Systems, vol. 37, pp. 45 767–45 790, 2024.
[201] N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. B. Hashimoto, “s1: Simple test-time scaling,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 20 286–20 332.
[202] T. Han, Z. Wang, C. Fang, S. Zhao, S. Ma, and Z. Chen, “Token-budget-aware llm reasoning,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 24 842–24 855.
[203] Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang, “Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models,” arXiv preprint arXiv:2408.00724, 2024.
[204] J. Geiping, S. McLeish, N. Jain, J. Kirchenbauer, S. Singh, B. R. Bartoldson, B. Kailkhura, A. Bhatele, and T. Goldstein, “Scaling up test-time compute with latent reasoning: A recurrent depth approach,” arXiv preprint arXiv:2502.05171, 2025.
[205] Z. Yuan, H. Yuan, C. Li, G. Dong, K. Lu, C. Tan, C. Zhou, and J. Zhou, “Scaling relationship on learning mathematical reasoning with large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2308.01825
[206] L. Chen, J. Q. Davis, B. Hanin, P. Bailis, I. Stoica, M. Zaharia, and J. Zou, “Are more llm calls all you need? towards scaling laws of compound inference systems,” arXiv preprint arXiv:2403.02419, 2024.
[207] Y. Song, G. Wang, S. Li, and B. Y. Lin, “The good, the bad, and the greedy: Evaluation of llms should not ignore non-determinism,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 4195–4206.
[208] Q. Zhang, Y. Wang, Y. Jiang, L. Li, C. Wu, Y. Wang, X. Jiang, L. Shang, R. Tang, F. Lyu et al., “Crowd comparative reasoning: Unlocking comprehensive evaluations for llm-as-a-judge,” arXiv preprint arXiv:2502.12501, 2025.
[209] C. Li, T. Xu, and Y. Guo, “Reasoning-as-logic-units: Scaling test-time reasoning in large language models through logic unit alignment,” arXiv preprint arXiv:2502.07803, 2025.
[210] K. Gandhi, D. Lee, G. Grand, M. Liu, W. Cheng, A. Sharma, and N. D. Goodman, “Stream of search (sos): Learning to search in language,” arXiv preprint arXiv:2404.03683, 2024.
[211] Y. Tian, B. Peng, L. Song, L. Jin, D. Yu, L. Han, H. Mi, and D. Yu, “Toward self-improvement of llms via imagination, searching, and criticizing,” Advances in Neural Information Processing Systems, vol. 37, pp. 52 723–52 748, 2024.
[212] X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang, “Alphazero-like tree-search can guide large language model decoding and training,” arXiv preprint arXiv:2309.17179, 2023.
[213] A. Wang, L. Song, Y. Tian, B. Peng, D. Yu, H. Mi, J. Su, and D. Yu, “Litesearch: Efficacious tree search for llm,” arXiv preprint arXiv:2407.00320, 2024.
[214] X. Wei, Y. Dong, X. Wang, X. Zhang, Z. Zhao, D. Shen, L. Xia, and D. Yin, “Beyond react: A planner-centric framework for complex tool-augmented llm reasoning,” arXiv preprint arXiv:2511.10037, 2025.
[215] J. Pourcel, C. Colas, and P.-Y. Oudeyer, “Self-improving language models for evolutionary program synthesis: A case study on arc-agi,” arXiv preprint arXiv:2507.14172, 2025.
[216] Y. Ge, S. Romeo, J. Cai, M. Sunkara, and Y. Zhang, “Samule: Self-learning agents enhanced by multi-level reflection,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 16 602–16 621.
[217] Y. Wang, L. Yang, Y. Tian, K. Shen, and M. Wang, “Co-evolving llm coder and unit tester via reinforcement learning,” arXiv preprint arXiv:2506.03136, 2025.
[218] S. Sundaram, J. Quan, A. Kwiatkowski, K. Ahuja, Y. Ollivier, and J. Kempe, “Teaching models to teach themselves: Reasoning at the edge of learnability,” arXiv preprint arXiv:2601.18778, 2026.
[219] T. Hubert, R. Mehta, L. Sartran, M. Z. Horváth, G. Žužić, E. Wieser, A. Huang, J. Schrittwieser, Y. Schroecker, H. Masoom et al., “Olympiad-level formal mathematical reasoning with reinforcement learning,” Nature, pp. 1–3, 2025.
[220] S. Fan, X. Ye, and Y. Lin, “Darc: Decoupled asymmetric reasoning curriculum for llm evolution,” arXiv preprint arXiv:2601.13761, 2026.
[221] M. Bagatella, M. Albaba, J. Hübotter, G. Martius, and A. Krause, “Test-time offline reinforcement learning on goal-related experience,” arXiv preprint arXiv:2507.18809, 2025.
[222] S. Yuan, Z. Chen, Z. Xi, J. Ye, Z. Du, and J. Chen, “Agent-r: Training language model agents to reflect via iterative self-training,” arXiv preprint arXiv:2501.11425, 2025.
[223] W. Sun, X. Cheng, J. Fan, Y. Xu, X. Yu, S. He, J. Zhao, and K. Liu, “Towards agentic self-learning llms in search environment,” arXiv preprint arXiv:2510.14253, 2025.
[224] Y. Chen, Y. Wang, S. Zhu, H. Yu, T. Feng, M. Zhang, M. Patwary, and J. You, “Multi-agent evolve: Llm self-improve through co-evolution,” arXiv preprint arXiv:2510.23595, 2025.
[225] C. Zhou, T. Xu, J. Lin, and D. Ge, “Steporlm: A self-evolving framework with generative process supervision for operations research language models,” arXiv preprint arXiv:2509.22558, 2025.
[226] Z. Yue, K. Upasani, X. Yang, S. Ge, S. Nie, Y. Mao, Z. Liu, and D. Wang, “Dr. zero: Self-evolving search agents without training data,” arXiv preprint arXiv:2601.07055, 2026.
[227] Y. Sun, Y. Liang, Z. Zhang, and J. Teng, “Theoretical modeling of llm self-improvement training dynamics through solver-verifier gap,” arXiv preprint arXiv:2507.00075, 2025.
[228] H. Lu, Y. Wen, P. Cheng, R. Ding, J. Guo, H. Xu, C. Wang, H. Chen, X. Jiang, and G. Jiang, “Search self-play: Pushing the frontier of agent capability without supervision,” arXiv preprint arXiv:2510.18821, 2025.
[229] Y. Jin, K. Xu, H. Li, X. Han, Y. Zhou, C. Li, and J. Bai, “Reveal: Self-evolving code agents via iterative generation-verification,” arXiv preprint arXiv:2506.11442, 2025.
[230] Q. Zhang, Z. Xiang, Y. Xiao, L. Wang, J. Li, X. Wang, and J. Su, “Faithfulrag: Fact-level conflict modeling for context-faithful retrieval-augmented generation,” arXiv preprint arXiv:2506.08938, 2025.
[231] L. Zhuang, S. Chen, Y. Xiao, H. Zhou, Y. Zhang, H. Chen, Q. Zhang, and X. Huang, “Linearrag: Linear graph retrieval augmented generation on large-scale corpora,” arXiv preprint arXiv:2510.10114, 2025.
[232] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig, “Active retrieval augmented generation,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 7969–7992.
[233] Q. Zhang, S. Chen, Y. Bei, Z. Yuan, H. Zhou, Z. Hong, H. Chen, Y. Xiao, C. Zhou, J. Dong et al., “A survey of graph retrieval-augmented generation for customized large language models,” arXiv preprint arXiv:2501.13958, 2025.
[234] G. Dong, J. Jin, X. Li, Y. Zhu, Z. Dou, and J.-R. Wen, “Rag-critic: Leveraging automated critic-guided agentic workflow for retrieval augmented generation,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 3551–3578.
[235] H. Song, J. Jiang, Y. Min, J. Chen, Z. Chen, W. X. Zhao, L. Fang, and J.-R. Wen, “R1-searcher: Incentivizing the search capability in llms via reinforcement learning,” arXiv preprint arXiv:2503.05592, 2025.
[236] S. Chen, C. Zhou, Z. Yuan, Q. Zhang, Z. Cui, H. Chen, Y. Xiao, J. Cao, and X. Huang, “You don’t need pre-built graphs for rag: Retrieval augmented generation with adaptive reasoning structures,” arXiv preprint arXiv:2508.06105, 2025.
[237] S. Alzubi, C. Brooks, P. Chiniya, E. Contente, C. von Gerlach, L. Irwin, Y. Jiang, A. Kaz, W. Nguyen, S. Oh et al., “Open deep search: Democratizing search with open-source reasoning agents,” arXiv preprint arXiv:2503.20201, 2025.
[238] H. Wan, C. Yang, J. Yu, M. Tu, J. Lu, D. Yu, J. Cao, B. Gao, J. Xie, A. Wang et al., “Deepresearch arena: The first exam of llms’ research abilities via seminar-grounded tasks,” arXiv preprint arXiv:2509.01396, 2025.
[239] L. Patel, N. Arabzadeh, H. Gupta, A. Sundar, I. Stoica, M. Zaharia, and C. Guestrin, “Deepscholar-bench: A live benchmark and automated evaluation for generative research synthesis,” arXiv preprint arXiv:2508.20033, 2025.
[240] Y. Huang, Y. Chen, H. Zhang, K. Li, H. Zhou, M. Fang, L. Yang, X. Li, L. Shang, S. Xu et al., “Deep research agents: A systematic examination and roadmap,” arXiv preprint arXiv:2506.18096, 2025.
[241] B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi et al., “Mathematical discoveries from program search with large language models,” Nature, vol. 625, no. 7995, pp. 468–475, 2024.
[242] Z. Xiang, C. Wu, Q. Zhang, S. Chen, Z. Hong, X. Huang, and J. Su, “When to use graphs in rag: A comprehensive analysis for graph retrievalaugmented generation,” arXiv preprint arXiv:2506.05690, 2025.
[243] J. Jin, Y. Zhang, Y. Xu, H. Qian, Y. Zhu, and Z. Dou, “Finsight: Towards real-world financial deep research,” arXiv preprint arXiv:2510.16844, 2025.
[244] T. Wei, N. Sachdeva, B. Coleman, Z. He, Y. Bei, X. Ning, M. Ai, Y. Li, J. He, E. H. Chi et al., “Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory,” arXiv preprint arXiv:2511.20857, 2025.
[245] J. Wang, Z. Guo, W. Ma, and M. Zhang, “How far can llms improve from experience? measuring test-time learning ability in llms with human comparison,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 25 688–25 702.
[246] E. Feng, W. Zhou, Z. Liu, L. Chen, Y. Dong, C. Zhang, Y. Zhao, D. Du, Z. Hua, Y. Xia et al., “Get experience from practice: Llm agents with record & replay,” arXiv preprint arXiv:2505.17716, 2025.
[247] S. Xia, Z. Xu, J. Chai, W. Fan, Y. Song, X. Wang, G. Yin, W. Lin, H. Zhang, and J. Wang, “From experience to strategy: Empowering llm agents with trainable graph memory,” arXiv preprint arXiv:2511.07800, 2025.
[248] G. Zhang, S. Zhu, A. Wei, Z. Song, A. Nie, Z. Jia, N. Vijaykumar, Y. Wang, and K. Olukotun, “Accelopt: A self-improving llm agentic system for ai accelerator kernel optimization,” arXiv preprint arXiv:2511.15915, 2025.
[249] W. Zhang, X. Zhang, H. Yu, S. Nie, B. Wu, J. Yue, T. Liu, and Y. Li, “Expseek: Self-triggered experience seeking for web agents,” arXiv preprint arXiv:2601.08605, 2026.
[250] X. Huang, J. Chen, Y. Fei, Z. Li, P. Schwaller, and G. Ceder, “Cascade: Cumulative agentic skill creation through autonomous development and evolution,” arXiv preprint arXiv:2512.23880, 2025.
[251] M. T. Hosain, S. Rahman, M. K. Morol, and M. R. Parvez, “Xolver: Multi-agent reasoning with holistic experience learning just like an olympiad team,” arXiv preprint arXiv:2506.14234, 2025.
[252] H. Shi, X. Yuan, and B. Liu, “Evolving programmatic skill networks,” arXiv preprint arXiv:2601.03509, 2026.
[253] H. Ye, X. He, V. Arak, H. Dong, and G. Song, “Meta context engineering via agentic skill evolution,” arXiv preprint arXiv:2601.21557, 2026.
[254] H. Yu, F. Zhu, G.-S. Xie, and L. Shao, “Self-consolidation for self-evolving agents,” arXiv preprint arXiv:2602.01966, 2026.
[255] J. Qiu, X. Qi, H. Wang, X. Juan, Y. Wang, Z. Zhao, J. Geng, J. Guo, P. Li, J. Shi et al., “Alita-g: Self-evolving generative agent for agent generation,” arXiv preprint arXiv:2510.23601, 2025.
[256] S. Zhang, C. Yuan, R. Guo, X. Yu, R. Xu, Z. Chen, Z. Li, Z. Yang, S. Guan, Z. Tang et al., “Evofsm: Controllable self-evolution for deep research with finite state machines,” arXiv preprint arXiv:2601.09465, 2026.
[257] J. Tan, Z. Dou, Y. Yu, J. Cheng, Q. Ju, J. Xie, and J.-R. Wen, “Hiersearch: A hierarchical enterprise deep search framework integrating local and web searches,” arXiv preprint arXiv:2508.08088, 2025.
[258] S. Zhang, J. Wang, R. Zhou, J. Liao, Y. Feng, W. Zhang, Y. Wen, Z. Li, F. Xiong, Y. Qi et al., “Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory,” arXiv preprint arXiv:2601.03192, 2026.
[259] Y. Chen, G. Dong, and Z. Dou, “Toward effective tool-integrated reasoning via self-evolved preference learning,” arXiv preprint arXiv:2509.23285, 2025.
[260] Y. Du, B. Liu, V. Moens, Z. Liu, Z. Ren, J. Wang, X. Chen, and H. Zhang, “Learning correlated communication topology in multiagent reinforcement learning,” in Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 2021, pp. 456–464.
[261] X. Guo, J. Kuang, L. Pan, Y. Li, Y. Li, H.-T. Zheng, Y. Shen, D. Yin, and X. Sun, “Evoconfig: Self-evolving multi-agent systems for efficient autonomous environment configuration,” arXiv preprint arXiv:2601.16489, 2026.
[262] Y. Zhao, L. Hu, Y. Wang, M. Hou, H. Zhang, K. Ding, and J. Zhao, “Stronger-mas: Multi-agent reinforcement learning for collaborative llms,” arXiv preprint arXiv:2510.11062, 2025.
[263] Z. Weng, A. Antoniades, D. Nathani, Z. Zhang, X. Pu, and X. E. Wang, “Group-evolving agents: Open-ended self-improvement via experience sharing,” arXiv preprint arXiv:2602.04837, 2026.
[264] C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha, “The ai scientist: Towards fully automated open-ended scientific discovery,” arXiv preprint arXiv:2408.06292, 2024.
[265] A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller, “Chemcrow: Augmenting large-language models with chemistry tools,” arXiv preprint arXiv:2304.05376, 2023.
[266] J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic, A. Myaskovsky, F. Weissenberger, K. Rong, R. Tanno et al., “Towards an ai co-scientist,” arXiv preprint arXiv:2502.18864, 2025.
[267] A. Merchant, S. Batzner, S. S. Schoenholz, M. Aykol, G. Cheon, and E. D. Cubuk, “Scaling deep learning for materials discovery,” Nature, 2023.
[268] N. J. Szymanski, B. Rendy, Y. Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant et al., “An autonomous laboratory for the accelerated synthesis of novel materials,” Nature, vol. 624, no. 7990, pp. 86–91, 2023.
[269] Z. Zhang, Z. Ren, C.-W. Hsu, W. Chen, Z.-W. Hong, C.-F. Lee, A. Penn, H. Xu, D. J. Zheng, S. Miao et al., “A multimodal robotic platform
for multi-element electrocatalyst discovery,” Nature, vol. 647, no. 8089, pp. 390–396, 2025.
[270] AnalemmaAI. Fars. [Online]. Available: https://analemma.ai/blog/introducing-fars/
[271] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated software engineering,” Advances in Neural Information Processing Systems, vol. 37, pp. 50 528–50 652, 2024.
[272] Anthropic. Claude code. [Online]. Available: https://claude.com/product/claude-code
[273] Manus. Manus: Hands on ai. [Online]. Available: https://manus.im/
[274] openclaw. openclaw. [Online]. Available: https://openclaw.ai/
[275] Cognition Labs. Devin ai: The ai software engineer. [Online]. Available: https://devin.ai/
[276] Cursor. Cursor. [Online]. Available: https://cursor.com/cn/agents
[277] X. Zhu, Y. Chen, H. Tian, C. Tao, W. Su, C. Yang, G. Huang, B. Li, L. Lu, X. Wang et al., “Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory,” arXiv preprint arXiv:2305.17144, 2023.
[278] W. Tan, W. Zhang, X. Xu, H. Xia, Z. Ding, B. Li, B. Zhou, J. Yue, J. Jiang, Y. Li et al., “Cradle: Empowering foundation agents towards general computer control,” arXiv preprint arXiv:2403.03186, 2024.
[279] A. AL, A. Ahn, N. Becker, S. Carroll, N. Christie, M. Cortes, A. Demirci, M. Du, F. Li, S. Luo et al., “Project sid: Many-agent simulations toward ai civilization,” arXiv preprint arXiv:2411.00114, 2024.
[280] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in Proceedings of the 36th annual acm symposium on user interface software and technology, 2023, pp. 1–22.
[281] A. Bolton, A. Lerchner, A. Cordell, A. Moufarek, A. Bolt, A. Lampinen, A. Mitenkova, A. O. Hallingstad, B. Vujatovic, B. Li et al., “Sima 2: A generalist embodied agent for virtual worlds,” arXiv preprint arXiv:2512.04797, 2025.
[282] J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps et al., “Genie: Generative interactive environments,” in Forty-first International Conference on Machine Learning, 2024.
[283] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” Advances in Neural Information Processing Systems, vol. 36, pp. 68 539–68 551, 2023.
[284] H. Wang, T. Fu, Y. Du, W. Gao, K. Huang, Z. Liu, P. Chandak, S. Liu, P. Van Katwyk, A. Deac et al., “Scientific discovery in the age of artificial intelligence,” Nature, vol. 620, no. 7972, pp. 47–60, 2023.
[285] Z. Luo, Z. Yang, Z. Xu, W. Yang, and X. Du, “Llm4sr: A survey on large language models for scientific research,” arXiv preprint arXiv:2501.04306, 2025.
[286] Y. Zhang, X. Chen, B. Jin, S. Wang, S. Ji, W. Wang, and J. Han, “A comprehensive survey of scientific large language models and their applications in scientific discovery,” arXiv preprint arXiv:2406.10833, 2024.
[287] I. Shumailov, Z. Shumaylov, Y. Zhao, N. Papernot, R. Anderson, and Y. Gal, “Ai models collapse when trained on recursively generated data,” Nature, vol. 631, no. 8022, pp. 755–759, 2024.
[288] Z. Wang, C. Xu, B. Liu, Y. Wang, S. Han, Z. Yao, H. Yao, and Y. He, “Agent world model: Infinity synthetic environments for agentic reinforcement learning,” 2026. [Online]. Available: https://arxiv.org/abs/2602.10090
[289] Z. Li, Y. Chang, Y. Zhou, X. Wu, Z. Liang, Y. Y. Sung, and J. L. Boyd-Graber, “Semantically-aware rewards for open-ended r1 training in free-form generation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.15068
[290] Z. Li, I. Mondal, H. Nghiem, Y. Liang, and J. L. Boyd-Graber, “PEDANTS: Cheap but effective and interpretable answer equivalence,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 9373–9398. [Online]. Available: https://aclanthology.org/2024.findings-emnlp.548/
[291] E. Aljalbout, J. Xing, A. Romero, I. Akinola, C. R. Garrett, E. Heiden, A. Gupta, T. Hermans, Y. Narang, D. Fox, D. Scaramuzza, and F. Ramos, “The reality gap in robotics: Challenges, solutions, and best practices,” 2025. [Online]. Available: https://arxiv.org/abs/2510.20808
[292] Z. Li, X. Wu, G. Shi, Y. Qin, H. Du, F. Liu, T. Zhou, D. Manocha, and J. L. Boyd-Graber, “Videohallu: Evaluating and mitigating multi-modal
hallucinations on synthetic video understanding,” 2025. [Online]. Available: https://arxiv.org/abs/2505.01481
[293] E. Salvato, G. Fenu, E. Medvet, and F. A. Pellegrino, “Crossing the reality gap: A survey on sim-to-real transferability of robot controllers in reinforcement learning,” IEEE Access, vol. 9, pp. 153 171–153 187, 2021.
[294] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba, “Learning dexterous in-hand manipulation,” 2019. [Online]. Available: https://arxiv.org/abs/1808.00177
[295] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang et al., “Mmlu-pro: A more robust and challenging multi-task language understanding benchmark,” Advances in Neural Information Processing Systems, vol. 37, pp. 95 266–95 290, 2024.
[296] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning, “Hotpotqa: A dataset for diverse, explainable multi-hop question answering,” in Proceedings of the 2018 conference on empirical methods in natural language processing, 2018, pp. 2369–2380.
[297] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020.
[298] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Musique: Multihop questions via single-hop question composition,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 539–554, 2022.
[299] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee et al., “Natural questions: a benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019.
[300] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” arXiv preprint arXiv:1705.03551, 2017.
[301] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi, “When not to trust language models: Investigating effectiveness of parametric and non-parametric memories,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 9802–9822.
[302] X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa, “Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps,” arXiv preprint arXiv:2011.01060, 2020.
[303] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou et al., “Challenging big-bench tasks and whether chain-of-thought can solve them,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 13 003–13 051.
[304] W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan, “Agieval: A human-centric benchmark for evaluating foundation models,” in Findings of the Association for Computational Linguistics: NAACL 2024, 2024, pp. 2299–2314.
[305] F. Chollet, “On the measure of intelligence,” arXiv preprint arXiv:1911.01547, 2019.
[306] ARC Prize. Arc-agi: The general intelligence benchmark. [Online]. Available: https://arcprize.org/arc-agi
[307] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette, “The narrativeqa reading comprehension challenge,” Transactions of the Association for Computational Linguistics, vol. 6, pp. 317–328, 2018.
[308] Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou et al., “Longbench: A bilingual, multitask benchmark for long context understanding,” in Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), 2024, pp. 3119–3137.
[309] L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi et al., “Humanity’s last exam,” arXiv preprint arXiv:2501.14249, 2025.
[310] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman, “Gpqa: A graduate-level google-proof q&a benchmark,” in First Conference on Language Modeling, 2024.
[311] X. Du, Y. Yao, K. Ma, B. Wang, T. Zheng, K. Zhu, M. Liu, Y. Liang, X. Jin, Z. Wei et al., “Supergpqa: Scaling llm evaluation across 285 graduate disciplines,” arXiv preprint arXiv:2502.14739, 2025.
[312] X. Wang, Z. Hu, P. Lu, Y. Zhu, J. Zhang, S. Subramaniam, A. R. Loomba, S. Zhang, Y. Sun, and W. Wang, “Scibench: Evaluating college-level scientific problem-solving abilities of large language models,” arXiv preprint arXiv:2307.10635, 2023.
[313] A. Mirza, N. Alampara, S. Kunchapu, M. Ríos-García, B. Emoekabu, A. Krishnan, T. Gupta, M. Schilling-Wilhelmi, M. Okereke, A. Aneesh et al., “Are large language models superhuman chemists?” arXiv preprint arXiv:2404.01475, 2024.
[314] S. Auer, D. A. Barone, C. Bartz, E. G. Cortes, M. Y. Jaradeh, O. Karras, M. Koubarakis, D. Mouromtsev, D. Pliukhin, D. Radyush et al., “The sciqa scientific question answering benchmark for scholarly knowledge,” Scientific Reports, vol. 13, no. 1, p. 7240, 2023.
[315] Y. Zhang and T. Math-AI, “American invitational mathematics examination (aime) 2025,” 2025.
[316] C. He, R. Luo, Y. Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang et al., “Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 3828– 3850.
[317] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
[318] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the math dataset,” arXiv preprint arXiv:2103.03874, 2021.
[319] Z. He, “amc23,” https://huggingface.co/datasets/zwhe99/amc23, 2024.
[320] N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic and contamination free evaluation of large language models for code,” arXiv preprint arXiv:2403.07974, 2024.
[321] T. Y. Zhuo, M. C. Vu, J. Chim, H. Hu, W. Yu, R. Widyasari, I. N. B. Yusuf, H. Zhan, J. He, I. Paul et al., “Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions,” arXiv preprint arXiv:2406.15877, 2024.
[322] M. Chen et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[323] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732, 2021.
[324] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 21 558–21 572, 2023.
[325] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman et al., “Multipl-e: A scalable and extensible approach to benchmarking neural code generation,” arXiv preprint arXiv:2208.08227, 2022.
[326] A. Gu, B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang, “Cruxeval: A benchmark for code reasoning, understanding and execution,” arXiv preprint arXiv:2401.03065, 2024.
[327] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried et al., “Webarena: A realistic web environment for building autonomous agents,” arXiv preprint arXiv:2307.13854, 2023.
[328] S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Webshop: Towards scalable real-world web interaction with grounded language agents,” Advances in Neural Information Processing Systems, vol. 35, pp. 20 744– 20 757, 2022.
[329] Y. Deng, X. Zhang, W. Zhang, Y. Yuan, S. K. Ng, and T.-S. Chua, “On the multi-turn instruction following for conversational web agents,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 8795– 8812.
[330] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su, “Mind2web: Towards a generalist agent for the web,” Advances in Neural Information Processing Systems, vol. 36, pp. 28 091–28 114, 2023.
[331] J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. Lim, P.-Y. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried, “Visualwebarena: Evaluating multimodal agents on realistic visual web tasks,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 881–905.
[332] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian et al., “Toolllm: Facilitating large language models to master real-world apis,” arXiv preprint arXiv:2307.16789, 2023.
[333] M. Chang, J. Zhang, Z. Zhu, C. Yang, Y. Yang, Y. Jin, Z. Lan, L. Kong, and J. He, “Agentboard: An analytical evaluation board of multi-turn
llm agents,” Advances in neural information processing systems, vol. 37, pp. 74 325–74 362, 2024.
[334] M. Shridhar, X. Yuan, M.-A. Côté, Y. Bisk, A. Trischler, and M. Hausknecht, “Alfworld: Aligning text and embodied environments for interactive learning,” arXiv preprint arXiv:2010.03768, 2020.
[335] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang et al., “Agentbench: Evaluating llms as agents,” arXiv preprint arXiv:2308.03688, 2023.
[336] G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, “Gaia: a benchmark for general ai assistants,” in The Twelfth International Conference on Learning Representations, 2023.
[337] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “Swe-bench: Can language models resolve real-world github issues?” arXiv preprint arXiv:2310.06770, 2023.
[338] The Terminal-Bench Team, “Terminal-bench: A benchmark for ai agents in terminal environments,” Apr 2025. [Online]. Available: https://github.com/laude-institute/terminal-bench
[339] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei et al., “Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” Advances in Neural Information Processing Systems, vol. 37, pp. 52 040–52 094, 2024.
[340] H. Chase, “LangChain,” Oct. 2022. [Online]. Available: https://github.com/langchain-ai/langchain
[341] J. Liu, “LlamaIndex,” Nov. 2022. [Online]. Available: https://github.com/jerryjliu/llama_index
[342] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu et al., “Autogen: Enabling next-gen llm applications via multi-agent conversations,” in First Conference on Language Modeling, 2024.
[343] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin et al., “Metagpt: Meta programming for a multi-agent collaborative framework,” in The twelfth international conference on learning representations, 2023.
[344] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
[345] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 2020, pp. 3505– 3506.
[346] Z. Zhu, C. Xie, X. Lv, and slime Contributors, “slime: An llm post-training framework for rl scaling,” https://github.com/THUDM/slime, 2025, GitHub repository. Corresponding author: Xin Lv.
[347] G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu, “Hybridflow: A flexible and efficient rlhf framework,” arXiv preprint arXiv:2409.19256, 2024.
[348] J. Hu, X. Wu, Z. Zhu, Xianyu, W. Wang, D. Zhang, and Y. Cao, “Openrlhf: An easy-to-use, scalable and high-performance rlhf framework,” arXiv preprint arXiv:2405.11143, 2024.
[349] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec, “TRL: Transformers Reinforcement Learning,” 2020. [Online]. Available: https://github.com/huggingface/trl
[350] Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma, “Llamafactory: Unified efficient fine-tuning of language models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Bangkok, Thailand: Association for Computational Linguistics, 2024. [Online]. Available: http://arxiv.org/abs/2403.13372
[351] D. Han, M. Han, and the Unsloth team, “Unsloth,” 2023. [Online]. Available: http://github.com/unslothai/unsloth
[352] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[353] L. Zheng, L. Yin, Z. Xie, C. L. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., “Sglang: Efficient execution of structured language model programs,” Advances in neural information processing systems, vol. 37, pp. 62 557–62 583, 2024.