Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback

1School of Computation, Information and Technology, Technical University of Munich, Germany
2Center for Information and Language Processing, Ludwig Maximilian University of Munich, Germany
3State Key Laboratory for Novel Software Technology, Nanjing University, China
Corresponding author: zhenshan.bing@tum.de

Simulation and real-world demonstrations of the DAHLIA framework solving long-horizon tasks. This study introduces DAHLIA, an efficient framework for embodied long-horizon manipulation that leverages LLMs to directly interpret high-level language instructions and dynamically generate actionable motion primitives. The framework employs a planner-reporter dual-tunnel architecture, forming a closed-loop feedback system that enables self-checking and significantly improves robustness and success rates. By integrating temporal abstraction in action execution and chain-of-thought reasoning in task planning, DAHLIA demonstrates strong generalization across diverse long-horizon manipulation tasks, achieving high success rates in both seen and unseen scenarios.

Video demonstrations of DAHLIA solving real-world long-horizon tasks. We prepared three tasks for the robot agent to complete in the real-world environment: (1) pick all the fruits and place them on the plate; (2) stack the blocks in a pyramid shape; (3) make coffee using the coffee machine.

Simulation demonstrations of various long-horizon tasks from both the original LoHoRavens benchmark and the generated pool. We test our method on the original LoHoRavens benchmark and on the generated task pool, where the agent needs to perform up to 20 long-horizon tasks.

Abstract

Recent advances in language-conditioned robotic manipulation have leveraged imitation and reinforcement learning to enable robots to execute tasks from human commands. However, these approaches often suffer from limited generalization and adaptability, and from the scarcity of large-scale specialized datasets (unlike data-rich fields such as computer vision), leading to challenges in handling complex long-horizon tasks. In this work, we introduce DAHLIA, a data-agnostic framework for language-conditioned long-horizon robotic manipulation that leverages large language models (LLMs) for real-time task planning and execution. Our framework features a dual-tunnel architecture, in which a planner LLM decomposes tasks and generates executable plans while a reporter LLM provides closed-loop feedback, ensuring adaptive re-planning and robust execution. Additionally, we incorporate chain-of-thought (CoT) reasoning in task planning and temporal abstraction in action execution to enhance efficiency and traceability. DAHLIA achieves robust generalization across diverse long-horizon tasks, demonstrating state-of-the-art performance in both simulated and real-world scenarios.

DAHLIA

Framework Overview

We present a planner-reporter architecture for language-based task planning with vision-based task evaluation, forming a closed-loop feedback system. The LLM planner takes as input a natural language task goal embedded in a structured prompt, which instructs the LLM how to write task execution code properly. Since the LLM is pre-trained on large amounts of text data from the internet, it acts as a powerful knowledge prior for grounding abstract concepts such as colors, shapes, positions, and arithmetic, and imitates the prompt to output task code. The reporter is a multimodal LLM that takes in images and text queries from the planner and returns feedback, including a scene description and an evaluation, to the planner.


Fig. 1: Framework of DAHLIA. Our framework utilizes a dual-tunnel structure to implement a planner-reporter closed-loop feedback mechanism for various long-horizon manipulation tasks. The planner, powered by LLMs, converts high-level language instructions and the initial state into task plans, leveraging available LMPs/APIs to generate multi-step motion primitives via temporal abstraction. The reporter, powered by a VLM, evaluates task outcomes using visual observations and provides feedback to the planner, enabling dynamic re-planning and robust performance in unstructured environments.
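To make the closed loop concrete, the following minimal Python sketch outlines one plausible realization of the planner-reporter interaction. The object interfaces (planner_llm, reporter_vlm, env) and the Report container are illustrative assumptions for this sketch, not the actual DAHLIA API.

from dataclasses import dataclass

@dataclass
class Report:
    """Feedback from the reporter: a scene description and a verdict."""
    description: str
    success: bool

def run_task(instruction, env, planner_llm, reporter_vlm, max_rounds=5):
    """Closed-loop execution: plan, act, report, and re-plan until success."""
    feedback = None
    for _ in range(max_rounds):
        # Planner tunnel: the LLM turns the instruction (plus any feedback
        # from the previous round) into executable policy code.
        plan_code = planner_llm.generate_plan(instruction, feedback=feedback)
        env.execute(plan_code)  # run the generated motion primitives

        # Reporter tunnel: the VLM inspects the current scene image and
        # returns a scene description together with a success judgment.
        report: Report = reporter_vlm.evaluate(env.render(), instruction)
        if report.success:
            return True
        # Feed the failure description back so the planner can re-plan.
        feedback = report.description
    return False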


Prompts

We use a prompt consisting of several functional parts to instruct the LLM to output reasonable content: imports, method explanations, orientations, general requirements, task examples, and a question. Each part is written in Python style and guides the LLM to become familiar with the environment in which the robot performs the task, including how to retrieve information from the environment and how to control the objects in it.
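As an illustration, a structured prompt with these parts might be laid out as below. The API names (get_obj_names, get_obj_pos, pick_place) and the wording of each part are assumptions for the sketch, not the exact DAHLIA prompt.

PROMPT_TEMPLATE = """
# --- Imports: the only APIs the generated code may call ---
from robot_api import get_obj_names, get_obj_pos, pick_place

# --- Method explanations ---
# get_obj_names() -> list[str]: names of all objects on the table.
# get_obj_pos(name) -> (x, y): table-top position of an object.
# pick_place(obj, target_pos): pick `obj` and place it at `target_pos`.

# --- Orientations: coordinate conventions of the workspace ---
# x grows to the right, y grows away from the robot; units are meters.

# --- General requirements ---
# Write plain Python, call only the APIs above, and handle the sub-goals
# of the instruction in order.

# --- Task example ---
# Instruction: "put the red block into the blue bowl"
# pick_place('red block', get_obj_pos('blue bowl'))

# --- Question ---
# Instruction: "{instruction}"
"""

def build_prompt(instruction: str) -> str:
    # Fill in the user's task goal as the final question for the planner.
    return PROMPT_TEMPLATE.format(instruction=instruction)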

Scalability

To assess the scalability and generalization of our framework across different embodiments and task setups, we conduct additional experiments on the CALVIN and Franka Kitchen benchmarks. The average success rates for individual subtasks are presented in the table below. The results show that our default dual-tunnel DAHLIA setup consistently outperforms its planner-only variant, DAHLIA (GPT-4o-miniP), demonstrating the effectiveness of its closed-loop feedback mechanism in improving execution reliability. Our framework achieves 100% success in most tasks, including lift block, open drawer, open slide door, kettle, and slider cabinet, while also surpassing the planner-only model in complex tasks requiring precise coordination, such as rotate block (+4%) and turn on switch (+4%) in CALVIN, as well as top burner (+6%) and bottom burner (+8%) in Franka Kitchen.

Notably, the planner-only model may struggle in tasks requiring highly accurate task planning and fine-grained manipulation motions. For example, in the "top burner" and "bottom burner" tasks in Franka Kitchen, the robot must accurately reach the correct rotary knob and rotate it counterclockwise by 45 degrees to activate the corresponding burner. Errors in task planning or dexterous execution, such as selecting the wrong knob or rotating in the wrong direction, can lead to complete task failure. In such cases, the closed-loop feedback of the dual-tunnel architecture plays a critical role by detecting and correcting execution errors, enabling the planner to refine its actions and complete the global task plan.

The performance gap between the two DAHLIA variants is relatively small, as both employ temporal abstraction and CoT-based task planning, ensuring efficient long-horizon primitive execution. The closed-loop feedback mechanism nevertheless enhances robustness, allowing for error correction and greater stability in complex environments. These results further validate DAHLIA's robust execution and adaptive reasoning capabilities, demonstrating its ability to generalize across diverse tasks and embodiments.

Average success rates (%) for individual subtasks in CALVIN and Franka Kitchen

CALVIN Tasks         | †DAHLIA (GPT-4o-miniP) | DAHLIA (Ours)
lift block           | 100                    | 100
rotate block         | 94                     | 98
turn on switch       | 96                     | 100
open slide door      | 100                    | 100
open drawer          | 100                    | 100
overall              | 98.00                  | 99.60

Franka Kitchen Tasks | †DAHLIA (GPT-4o-miniP) | DAHLIA (Ours)
microwave            | 98                     | 100
kettle               | 100                    | 100
light                | 98                     | 100
top burner           | 90                     | 96
bottom burner        | 90                     | 98
slider cabinet       | 100                    | 100
hinge cabinet        | 96                     | 98
overall              | 96.00                  | 98.86

P indicates the Planner model. † indicates that the framework uses only the Planner tunnel for open-loop control, without the Reporter.
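To illustrate the temporal abstraction discussed above, the sketch below bundles a repeated pick-and-place routine into a single macro-step, so the planner emits one call instead of one planning round per object. All names here are hypothetical, not the actual DAHLIA primitives.

from typing import Callable, List, Tuple

Pos = Tuple[float, float]

def pick_and_place_all(objects: List[str], target: str,
                       get_pos: Callable[[str], Pos],
                       pick_place: Callable[[str, Pos], None]) -> None:
    """Macro-step: move every listed object onto the target location."""
    target_pos = get_pos(target)
    for obj in objects:
        # Each iteration executes one low-level primitive, but the whole
        # loop counts as a single temporally abstracted plan step.
        pick_place(obj, target_pos)

# Example: the real-world task "pick all fruits and place them on the plate"
# then reduces to a single macro call in the generated plan:
#   pick_and_place_all(['apple', 'banana', 'orange'], 'plate',
#                      get_pos, pick_place)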

Simulation demonstrations in the CALVIN and Franka Kitchen benchmarks. We also test the method in other benchmarks, namely CALVIN and Franka Kitchen, with different robot setups, where the agent needs to perform various tasks in a table-top environment or a kitchen environment. For CALVIN, the agent needs to perform up to five tasks in a row according to language instructions. For Franka Kitchen, the agent needs to perform up to four sub-tasks in a row according to language instructions.