Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong
<aside> 💡
GitHub: https://github.com/JonP07/HiPER-agent Paper: https://arxiv.org/abs/2602.16165
Training LLM agents with reinforcement learning is particularly difficult in long-horizon, multi-turn environments, where the agent may need to take many actions before receiving meaningful feedback. Most existing RL approaches treat the agent as a flat policy that chooses one action at a time, all at a single time scale. This makes long-range credit assignment hard: when a task succeeds or fails, it is often unclear which earlier decisions mattered most. As a result, flat RL can become unstable and inefficient, especially on tasks that require several dependent subtasks to be completed in sequence.
We propose HiPER, a novel Hierarchical Plan–Execute RL framework that explicitly separates high-level planning from low-level execution. The agent first proposes a subgoal, then carries out actions under that subgoal, and can later decide whether to keep the current subgoal or switch to a new one. To train agents with this structure, HiPER introduces Hierarchical Advantage Estimation (HAE), which sends learning signals at two levels: step-by-step within each subgoal segment, and across segments when the agent switches subgoals. This matches the credit assignment mechanism to the hierarchical structure of the policy, leading to a principled and more effective way to jointly optimize planning and execution.
Empirically, HiPER performs strongly on challenging interactive agent benchmarks, including ALFWorld and WebShop. With Qwen2.5-7B-Instruct, it reaches 97.4% success on ALFWorld and 83.3% on WebShop, outperforming prior RL baselines. The gains are especially large on harder tasks that require multiple sequential subtasks, which is exactly where flat RL methods tend to struggle most.
</aside>
Large language models are increasingly used as interactive agents that must act over many turns, but reinforcement learning for these agents is still difficult in long-horizon settings with sparse and delayed rewards. Most existing methods treat the agent as a flat policy that picks one action at a time, which makes it hard to assign credit across long trajectories. Yet successful behavior in these tasks often has an implicit hierarchical structure: agents typically move through a sequence of intermediate subgoals that persist for several steps. Because flat RL does not explicitly represent or optimize this structure, it often leads to brittle behavior, such as drifting off task, switching direction too early, or repeating ineffective actions.

To address this, HiPER introduces a simple but powerful idea: make the underlying hierarchy explicit, and train the agent in a way that respects it.

HiPER turns the hierarchical structure into a concrete agent interface. Instead of asking the agent to produce only the next action, it asks the model to make three linked decisions at every environment step: whether to keep the current subgoal or switch to a new one, what the current subgoal should be, and what primitive action to take next. This is implemented as a structured prompt template that extends ReAct, so the hierarchy is not added as an external controller, but directly realized inside a single auto-regressive LLM policy.
Concretely, at each environment step $t$, the agent receives the current observation together with the previous subgoal, and then outputs a structured tuple $\langle q_t, o_t, a_t \rangle$, where $q_t$ is a binary switching decision ( $\texttt{KEEP}$ or $\texttt{SWITCH}$ ), $o_t$ is the current subgoal text, and $a_t$ is the primitive environment action. For $q_t$, if the model outputs $\texttt{KEEP}$ , the previous subgoal is simply carried forward; if it outputs $\texttt{SWITCH}$ , the model proposes a new subgoal before generating the next action. This creates a natural two-level decomposition: the high level decides when to revise the short-term subgoal, while the low level chooses the concrete action conditioned on that subgoal.
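As a concrete illustration, the per-step model output could be parsed into the $\langle q_t, o_t, a_t \rangle$ tuple roughly as follows. The tag names (`<decision>`, `<subgoal>`, `<action>`) are hypothetical: the paper's actual prompt template extends ReAct and may use different markers.

```python
import re

# Hypothetical output format for illustration; HiPER's real template may differ.
STEP_RE = re.compile(
    r"<decision>(KEEP|SWITCH)</decision>\s*"
    r"(?:<subgoal>(.*?)</subgoal>\s*)?"
    r"<action>(.*?)</action>",
    re.DOTALL,
)

def parse_step(llm_output: str, prev_subgoal: str):
    """Parse one step of model output into the (q_t, o_t, a_t) tuple."""
    m = STEP_RE.search(llm_output)
    if m is None:
        raise ValueError("malformed step output")
    q_t, new_subgoal, a_t = m.group(1), m.group(2), m.group(3)
    if q_t == "KEEP":
        # KEEP simply carries the previous subgoal forward.
        o_t = prev_subgoal
    else:
        # SWITCH must propose a new subgoal before the next action.
        if new_subgoal is None:
            raise ValueError("SWITCH must propose a new subgoal")
        o_t = new_subgoal.strip()
    return q_t, o_t, a_t.strip()
```

Because the whole tuple is emitted by one auto-regressive pass, the switching decision, subgoal text, and action all condition on the same observation, with no external controller in the loop.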
A key design choice is that this hierarchy is dynamic rather than fixed upfront. HiPER does not ask the agent to generate a complete multi-step plan at the beginning and then rigidly execute it. Instead, subgoal generation and subgoal switching are decided online as the state evolves, and low-level actions are also selected step by step under the current subgoal. This makes the framework much better suited to interactive environments, where the correct next move often depends on feedback gathered during execution. In other words, HiPER separates planning from acting, but it does not freeze either of them.
Once the agent is structured around persistent subgoals, the learning signal also needs to respect that structure. A flat advantage estimator treats every step in the trajectory in the same way, but under Plan-Execute, different decisions live at different time scales: low-level actions matter within a subgoal segment, while high-level planning decisions matter across segments. HiPER addresses this mismatch with Hierarchical Advantage Estimation (HAE), a credit assignment scheme that propagates learning signals both within subgoal segments and across their boundaries.
The starting point is that the agent’s $\texttt{SWITCH}$ decisions partition a trajectory into segments. Let the switching boundaries be $0 = b_0 < b_1 < \cdots < b_K = T$, where the $k$-th segment covers time steps $t \in [b_k, b_{k+1}-1]$ and keeps the same subgoal active throughout. HAE then introduces two value baselines: a low-level value $V^{\mathrm{low}}(s_t, o_t)$, which estimates the return when continuing to act under the current subgoal, and a high-level value $V^{\mathrm{high}}(s_t)$, which estimates the return at states where the agent is choosing a new subgoal. Intuitively, the low-level value is used to judge action execution inside a segment, while the high-level value is used to judge subgoal choices at segment boundaries.
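A minimal sketch of how the $\texttt{SWITCH}$ decisions induce the boundaries $b_0 < b_1 < \cdots < b_K$, assuming the first step of a trajectory always proposes a fresh subgoal (so $b_0 = 0$):

```python
def segment_boundaries(switch_flags, T):
    """Compute switching boundaries 0 = b_0 < b_1 < ... < b_K = T.

    switch_flags[t] is True when q_t = SWITCH at step t. A sketch, assuming
    the first subgoal is always freshly proposed (q_0 = SWITCH).
    """
    boundaries = [t for t, switched in enumerate(switch_flags) if switched]
    assert boundaries and boundaries[0] == 0, "trajectory must start with SWITCH"
    boundaries.append(T)
    # Segment k covers steps b_k .. b_{k+1} - 1 under a single subgoal.
    return boundaries
```

Each segment then gets judged by the low-level value $V^{\mathrm{low}}(s_t, o_t)$ inside it and by the high-level value $V^{\mathrm{high}}(s_t)$ at its boundaries.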
For action execution, HAE applies a GAE-style estimator within each segment. At step $t$ inside segment $k$, the low-level temporal-difference residual is $\delta_t^{\mathrm{low}} = r_t + \gamma V_t^{\mathrm{next}} - V^{\mathrm{low}}(s_t, o_k)$, where $V_t^{\mathrm{next}}$ is the next-step low-level value $V^{\mathrm{low}}(s_{t+1}, o_k)$ for interior steps, and the boundary high-level value $V^{\mathrm{high}}(s_{b_{k+1}})$ at the final step of the segment. The resulting execution advantage is $\hat A_t^{\mathrm{low}} = \sum_{\ell=t}^{b_{k+1}-1} (\gamma \lambda_{\mathrm{low}})^{\ell-t} \delta_\ell^{\mathrm{low}}$. This design gives fine-grained credit to primitive actions while still connecting the last action in a segment to the long-term progress made at the next boundary.
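The within-segment estimator can be written as a standard backward GAE recursion with the bootstrap target swapped at the segment's last step. A sketch, with illustrative function and variable names not taken from the paper:

```python
def low_level_advantages(rewards, v_low, v_high_next, gamma, lam_low):
    """Within-segment GAE for one segment of length n (steps b_k .. b_{k+1}-1).

    rewards[i], v_low[i]: reward and low-level value at the i-th step of the
    segment; v_high_next: the boundary high-level value V^high(s_{b_{k+1}}).
    A sketch of HAE's execution advantage under these definitions.
    """
    n = len(rewards)
    adv = [0.0] * n
    gae = 0.0
    for i in reversed(range(n)):
        # Bootstrap from the next low-level value, except at the segment's
        # final step, where the boundary high-level value takes over.
        v_next = v_low[i + 1] if i + 1 < n else v_high_next
        delta = rewards[i] + gamma * v_next - v_low[i]
        gae = delta + gamma * lam_low * gae
        adv[i] = gae
    return adv
```

The backward recursion computes exactly the truncated sum $\sum_{\ell=t}^{b_{k+1}-1} (\gamma \lambda_{\mathrm{low}})^{\ell-t} \delta_\ell^{\mathrm{low}}$, just accumulated in one pass.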
For subgoal generation, HAE compresses each segment into a macro-step. The segment-level discounted reward is $\tilde r_k = \sum_{t=b_k}^{b_{k+1}-1} \gamma^{t-b_k} r_t$, and the corresponding duration discount is $\tilde \gamma_k = \gamma^{\,b_{k+1}-b_k}$. Using these, HiPER defines a segment-level residual $\delta_k^{\mathrm{high}} = \tilde r_k + \tilde \gamma_k V^{\mathrm{high}}(s_{b_{k+1}}) - V^{\mathrm{high}}(s_{b_k})$ and a GAE-style planning advantage $\hat A_{b_k}^{\mathrm{high}}$ over segment boundaries. This means a subgoal is not judged by a single immediate next-step reward, but by the aggregated outcome of the whole segment it governs.
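The macro-step computation can be sketched as follows. The residuals $\delta_k^{\mathrm{high}}$ follow the definitions of $\tilde r_k$ and $\tilde \gamma_k$ above; using $\tilde \gamma_k \lambda_{\mathrm{high}}$ as the backward decay factor is an assumption by analogy with standard GAE, since the summary here does not spell out the full recursion.

```python
def high_level_advantages(rewards, boundaries, v_high, gamma, lam_high):
    """Segment-level (macro-step) GAE over boundaries b_0 < ... < b_K.

    rewards[t]: per-step reward; boundaries: [b_0, ..., b_K];
    v_high[k] = V^high(s_{b_k}) for k = 0..K (v_high[K] is the terminal
    value, typically 0). A sketch under the stated definitions.
    """
    K = len(boundaries) - 1
    deltas, g_tildes = [], []
    for k in range(K):
        b_k, b_next = boundaries[k], boundaries[k + 1]
        # Segment-level discounted reward r~_k and duration discount gamma~_k.
        r_tilde = sum(gamma ** (t - b_k) * rewards[t] for t in range(b_k, b_next))
        g_tilde = gamma ** (b_next - b_k)
        deltas.append(r_tilde + g_tilde * v_high[k + 1] - v_high[k])
        g_tildes.append(g_tilde)
    adv = [0.0] * K
    gae = 0.0
    for k in reversed(range(K)):
        # Assumed decay per macro-step: gamma~_k * lam_high (GAE analogy).
        gae = deltas[k] + g_tildes[k] * lam_high * gae
        adv[k] = gae
    return adv
```

Note that a subgoal segment spanning many steps contributes one macro-step to this recursion, so planning credit travels across segments in $K$ hops rather than $T$ steps.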