Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Agentic AI
arXiv: 2509.22601v1
Authors

Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun

Abstract

Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance under the guidance of the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy updates, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool-call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of the environment feedback with an upward entropy trend. As training progresses, self-imitation is strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift. Regularizations such as clipping tokens with high covariance between probability and advantage are introduced for trajectory-level entropy control to curb over-confidence.
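
The abstract bundles several concrete mechanisms: a replay buffer of self-generated successful trajectories, recalibration of their stored advantages when the policy drifts, and clipping of tokens whose probability-advantage covariance is high. The snippet below is a minimal, illustrative sketch of how such a self-imitation update could look; the names (ReplayItem, recalibrated_advantage, sil_loss), the importance-style shrinkage, and the quantile-based clipping threshold are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class ReplayItem:
    token_ids: torch.Tensor      # (T,) action tokens of a stored successful trajectory
    old_logprobs: torch.Tensor   # (T,) log-probs under the policy that generated it
    advantage: float             # trajectory-level advantage at storage time

def recalibrated_advantage(item: ReplayItem, new_logprobs: torch.Tensor) -> float:
    """Shrink a stale advantage when the current policy has drifted away from
    the behavior that produced the trajectory (importance-style correction)."""
    log_ratio = (new_logprobs - item.old_logprobs).sum().clamp(-5.0, 5.0)
    return item.advantage * torch.exp(log_ratio).clamp(max=1.0).item()

def sil_loss(policy_logits_list, items, q: float = 0.98) -> torch.Tensor:
    """Self-imitation term over a batch of replayed trajectories.

    Only positive-advantage trajectories contribute, and tokens whose
    contribution to Cov(log-prob, advantage) falls in the top (1 - q)
    fraction are masked out to curb over-confidence."""
    logps, advs = [], []
    for logits, item in zip(policy_logits_list, items):        # logits: (T, vocab)
        new_lp = F.log_softmax(logits, dim=-1).gather(
            -1, item.token_ids.unsqueeze(-1)).squeeze(-1)      # (T,)
        adv = recalibrated_advantage(item, new_lp.detach())
        if adv > 0:                                            # imitate successes only
            logps.append(new_lp)
            advs.append(torch.full_like(new_lp, adv))
    if not logps:
        return torch.zeros(())
    logps, advs = torch.cat(logps), torch.cat(advs)
    cov = (logps.detach() - logps.detach().mean()) * (advs - advs.mean())
    mask = (cov <= torch.quantile(cov, q)).float()             # clip high-covariance tokens
    return -(mask * advs * logps).mean()
```

In practice this term would be added to the on-policy RL loss, so replayed successes only reinforce actions that still look advantageous under the current policy.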

Paper Summary

Problem
The main problem addressed in this paper is balancing exploration and exploitation when training Large Language Model (LLM) agents with Reinforcement Learning (RL), particularly on long-horizon tasks with sparse rewards. Striking this balance is crucial: the agent must exploit its pretrained knowledge and feedback from past interactions to identify and refine strategies that maximize the final reward, while still exploring novel behaviors to discover more effective solutions.
Key Innovation
The proposed solution, SPEAR (Self-imitation with Progressive Exploration for Agentic Reinforcement Learning), is a curriculum-based self-imitation learning recipe that extends the vanilla self-imitation learning framework. SPEAR maintains a replay buffer of self-generated promising trajectories for off-policy updates, uses intrinsic rewards to foster skill-level exploration, and relies on self-imitation to drive action-level exploration. A curriculum manages this exploration process, steering the policy's evolution within a well-balanced range of entropy across training stages.
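To make the staged schedule concrete, here is a small illustrative sketch of a two-stage curriculum: an intrinsic tool-call reward dominates early to drive skill-level exploration, then decays while the self-imitation weight ramps up so the agent exploits replayed successes. The linear ramps, switch point, and coefficient values are assumptions for illustration, not the paper's actual schedule.

```python
def curriculum_weights(step: int, total_steps: int,
                       w_tool_max: float = 0.2,
                       w_sil_max: float = 0.5,
                       switch_frac: float = 0.3):
    """Return (intrinsic tool-call reward weight, self-imitation loss weight)."""
    frac = step / max(total_steps, 1)
    if frac < switch_frac:
        # Stage 1: full intrinsic reward; self-imitation ramps up from zero.
        return w_tool_max, w_sil_max * (frac / switch_frac)
    # Stage 2: intrinsic reward decays linearly to zero; SIL stays at full weight.
    decay = 1.0 - (frac - switch_frac) / (1.0 - switch_frac)
    return w_tool_max * max(decay, 0.0), w_sil_max

# Example mixing at a given step (r_task, loss_rl, loss_sil are placeholders):
#   r_total = r_task + w_tool * r_tool_call
#   loss    = loss_rl + w_sil * loss_sil
```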
Practical Impact
SPEAR has the potential to improve the performance of LLM agents across agentic applications such as simulated robot navigation, mobile assistants, web navigation, and GUI agents. By effectively balancing exploration and exploitation, SPEAR helps LLM agents learn from past experiences, keep policy entropy under control, and develop strong reasoning and tool-integration skills, leading to more efficient and effective decision-making in complex, real-world scenarios.
Analogy / Intuitive Explanation
Imagine a child learning to ride a bike. At first, they need to explore different ways of balancing and steering, which requires a lot of trial and error. As they gain more experience and confidence, they can start to exploit their existing skills and knowledge to ride more efficiently and safely. SPEAR is like a virtual coach that helps the child (or the LLM agent) learn from their experiences, balance exploration and exploitation, and develop strong skills and strategies for success.
Paper Information
Categories: cs.LG, cs.AI, cs.CL, cs.CV, cs.MA
arXiv ID: 2509.22601v1