Long-horizon manipulation under sparse rewards remains challenging for reinforcement learning due to delayed feedback and inefficient exploration. Existing skill-based approaches often assume a fixed parametric prior (e.g., a single Gaussian), limiting their ability to capture diverse and multi-modal skill structures required for complex tasks. We propose a Bayesian non-parametric skill prior that models temporally extended skills in a structured latent space using a Dirichlet Process Mixture, enabling adaptive skill discovery without predefining the number of components. Integrated into a hierarchical RL framework, the learned prior guides high-level skill selection while a pretrained decoder generates temporally abstracted actions, improving exploration efficiency in sparse-reward settings. Experiments on Franka Kitchen, LIBERO-Long, Meta-World, and a real robot demonstrate consistent gains in long-horizon manipulation, achieving over 0.8 success rate within 1.5M steps, whereas SAC fails to converge even after 5M steps ($<$0.1). Compared to a single-Gaussian prior baseline, our model yields an average improvement of 21.8\%.
The training process is divided into two phases. In Phase I, a VAE with GRU modules is pre-trained on a dataset of action trajectories to learn a skill representation. The model uses a DPM as the skill prior, so the number of mixture components is inferred from the data rather than fixed in advance, which helps it learn precise action patterns and effective downstream task representations. In Phase II, the pre-trained skill decoder and prior are deployed within an RL framework to address long-horizon manipulation tasks. Here, the upstream inference model uses a soft actor-critic (SAC) structure to learn task-specific reasoning, supporting the successful execution of complex long-horizon tasks.
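To make the Phase II control flow concrete, the following is a minimal sketch of a hierarchical rollout in which a high-level policy selects a latent skill and a frozen decoder expands it into a short action sequence. It is an illustration under stated assumptions, not the paper's implementation: the skill horizon `H`, the two-dimensional latent, and the functions `high_level_policy`, `skill_decoder`, and `env_step` are all hypothetical placeholders.

```python
import random

H = 10  # assumed skill horizon (low-level steps per skill); not given in the text


def high_level_policy(state, rng):
    """Hypothetical stand-in for the SAC-trained policy: returns a latent skill z."""
    return [rng.gauss(0.0, 1.0) for _ in range(2)]


def skill_decoder(z, horizon):
    """Hypothetical stand-in for the frozen Phase I decoder: maps z to `horizon` actions."""
    return [[zi * (t + 1) / horizon for zi in z] for t in range(horizon)]


def env_step(state, action):
    """Toy environment: the state is a step counter and the reward is always 0."""
    return state + 1, 0.0


def rollout(total_steps, seed=0):
    """Run one episode and return (low-level steps taken, high-level decisions made)."""
    rng = random.Random(seed)
    state, n_decisions = 0, 0
    while state < total_steps:
        z = high_level_policy(state, rng)   # one high-level decision
        n_decisions += 1
        for action in skill_decoder(z, H):  # up to H low-level actions per skill
            if state >= total_steps:
                break
            state, _reward = env_step(state, action)
    return state, n_decisions
```

With these placeholder values, an episode of 35 low-level steps needs only 4 high-level decisions, which is the sense in which temporally abstracted skills shorten the effective decision horizon.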
Evaluation of the Bayesian non-parametric skill prior.
a, The total training loss \(\mathcal{L}_{total}\) observed during the skill prior pretraining phase.
b, The evolution of the number of generated clusters in the Bayesian non-parametric skill prior space throughout training.
We conduct at least five trials and report the mean and standard deviation (\(\mu\pm\sigma\)).
c, t-SNE projection of the DPM-based skill prior in the final epoch.
We train the proposed Bayesian non-parametric skill prior model as presented in Sec. IV-B.
To evaluate the training performance, we first examine the total regression loss during training, as depicted in Fig. 2a.
The training loss converges rapidly, stabilizing around a value of 1.0.
In Fig. 2b, we illustrate the evolution of the number of components in the Bayesian non-parametric skill prior.
Initially, the DPM starts with a single Gaussian component.
After each training epoch, the DPM uses buffered data collected during the epoch to adjust the number, shape, and density of components, adapting to the data representation.
In the early stages, when neither the network parameters nor the DPM have converged, the DPM may generate extra components, in some cases reaching up to 12.
As training progresses and data noise decreases, the DPM merges redundant components to optimize its objective, ultimately stabilizing at around 6 to 8 components in the skill prior space.
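The idea that the number of components is determined by the data, rather than fixed beforehand, can be illustrated with DP-means, a hard-assignment clustering algorithm derived from the small-variance limit of a Dirichlet Process mixture. This is a simplified stand-in chosen for brevity, not the DPM inference procedure used in this work: it opens new components and drops empty ones, but does not implement the merge moves described above. The threshold `lam` and the toy data are assumptions made for the example.

```python
def dp_means(points, lam, n_iters=20):
    """DP-means: open a new cluster whenever a point is farther than `lam`
    (squared Euclidean distance) from every existing center."""
    dim = len(points[0])
    # Start from a single component located at the global mean.
    centers = [[sum(p[d] for p in points) / len(points) for d in range(dim)]]
    labels = [0] * len(points)
    for _ in range(n_iters):
        for i, p in enumerate(points):
            dists = [sum((p[d] - c[d]) ** 2 for d in range(dim)) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > lam:
                centers.append(list(p))  # data-driven growth of the model
                k = len(centers) - 1
            labels[i] = k
        # Recompute centers, dropping components that lost all their points.
        used = sorted(set(labels))
        remap = {old: new for new, old in enumerate(used)}
        labels = [remap[l] for l in labels]
        new_centers = []
        for k in range(len(used)):
            members = [p for p, l in zip(points, labels) if l == k]
            new_centers.append(
                [sum(m[d] for m in members) / len(members) for d in range(dim)]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy 2-D "latent skills": three well-separated groups (illustrative data only).
toy_points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
]
toy_centers, toy_labels = dp_means(toy_points, lam=4.0)
```

On this toy data the procedure starts with one component and ends with three. Raising `lam` far enough keeps everything in a single component, which mirrors how the concentration of a DPM trades off model complexity against fit.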
Fig. 2c displays the t-SNE projection of the Bayesian non-parametric skill prior space after training.
The results indicate that our DPM-based prior model effectively identifies and clusters the underlying features in the data,
providing a well-structured skill prior space that can guide the decoder's action pattern learning and enhance subsequent downstream RL training for long-horizon manipulation.
Average reward over the full long-horizon manipulation task.
In the Franka Kitchen benchmark, we train the agent with sparse rewards, awarding a score of 1 for each successfully completed subtask and 0 otherwise.
For each model, we run at least five trials, reporting the average reward and standard deviation (\(\mu\pm\sigma\)) for comparison.
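The reward scheme and the reporting convention above can be sketched as follows. This is a minimal illustration, assuming that completed subtasks are tracked as a set of names and that the standard deviation is the sample standard deviation; the text does not state which estimator is used. The function names and the example returns are hypothetical.

```python
from statistics import mean, stdev


def sparse_reward(done_before, done_after):
    """Return 1.0 for each subtask newly completed on this step, otherwise 0.0."""
    return float(len(set(done_after) - set(done_before)))


def summarize(trial_returns):
    """Mean and sample standard deviation over independent trials."""
    return mean(trial_returns), stdev(trial_returns)


# Hypothetical per-trial returns (number of completed subtasks), not reported results.
example_returns = [3, 4, 4, 3, 4]
mu, sigma = summarize(example_returns)
```

Because the reward is nonzero only on the step where a subtask is completed, the episode return equals the number of completed subtasks, so the reported average reward can be read directly as the average number of subtasks solved.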
In the subsequent experiments, we compare the downstream long-horizon task performance of our framework, HELIOS, with recent state-of-the-art (SOTA) baselines on our RL manipulation tasks, as outlined below:
Long-horizon manipulation task performance.
a, Snapshots of each sub-task, demonstrating that the agent efficiently completes all assigned subtasks.
b, The average skill ratios applied across the entire task, based on the results from at least five trials.
c, Snapshots of each primitive skill motion. Our Bayesian non-parametric prior captures seven base skills: "pick", "place", "pull", "rotate", "toggle", and "explore" in the left and right directions (counted as two separate skills).
d, t-SNE projections of each primitive skill motion, showing the re-encoded skills through the pre-trained skill encoder \(q(z|\mathbf{a}_i)\) and their corresponding assignments in the Bayesian non-parametric knowledge space.
Figure 4a shows rendered snapshots of our agent performing a sequence of subtasks in the Franka Kitchen environment, including manipulating the microwave, kettle, burner, and light switch.
The images demonstrate that the agent completes all assigned subtasks in the scene, highlighting its ability to handle complex, multi-step, long-horizon tasks.
The renderings of these subtasks reflect the agent's capability to utilize the pre-trained Bayesian non-parametric skill prior for efficient skill selection and execution. This showcases HELIOS's strength in combining learned skills to solve sequential subtasks in the long-horizon manipulation scene.
The pie chart in Figure 4b illustrates the distribution of skill ratios utilized throughout the long-horizon task sequence, averaged over at least five trials and color-coded by respective skills. In our task dataset, the Bayesian non-parametric skill prior captures an average of seven primitive skill motions, with snapshots of each skill depicted in Figure 4c.
These skills include "pick" (14.8\%), "place" (4.8\%), "pull" (19.7\%), "rotate" (9.9\%), "toggle" (4.9\%), and "explore" in both left (26.1\%) and right (19.7\%) directions.
Together, these skills form the fundamental building blocks of the agent's actions, enabling efficient recombination to perform long-horizon tasks.
By leveraging the pre-trained Bayesian non-parametric skill prior, the agent effectively utilizes these skills to manage complex object manipulations, showcasing the prior's capability to encode and generalize key motion behaviors.
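As a small worked check of how such skill ratios can be obtained, the sketch below counts per-skill usage in a sequence of skill assignments and converts the counts to percentages. The helper `skill_ratios` and the key names `explore-left` and `explore-right` are assumptions made for illustration; the numeric values in `reported` are the seven ratios quoted above.

```python
from collections import Counter


def skill_ratios(assignments):
    """Percentage of decisions assigned to each skill label."""
    counts = Counter(assignments)
    total = len(assignments)
    return {skill: 100.0 * n / total for skill, n in counts.items()}


# Ratios quoted in the text, in percent.
reported = {
    "pick": 14.8,
    "place": 4.8,
    "pull": 19.7,
    "rotate": 9.9,
    "toggle": 4.9,
    "explore-left": 26.1,
    "explore-right": 19.7,
}
reported_total = sum(reported.values())
```

The seven reported ratios sum to 99.9, which is consistent with rounding each entry to one decimal place. The two "explore" directions together account for 45.8 of the total, the largest share of any skill family.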
Additionally, Figure 4d presents the t-SNE projections of each primitive skill motion in the latent space, revealing well-defined clusters of skills re-encoded by the skill encoder \(q(z|\mathbf{a}_i)\) from Phase I. Each cluster corresponds to a distinct skill, highlighting the ability of the Bayesian non-parametric prior to group similar actions while maintaining clear boundaries between different skill types.
This structured latent space demonstrates the prior's strength in capturing complex, multi-modal skill distributions.
Such precise clustering is essential for guiding the agent in downstream RL tasks, enabling efficient exploration and robust action pattern inference.