Recent advances in language-conditioned robotic manipulation have leveraged imitation and reinforcement learning to enable robots to execute tasks from human commands. However, these approaches often suffer from limited generalization and adaptability and, unlike data-rich fields such as computer vision, from a scarcity of large-scale specialized datasets, which makes complex long-horizon tasks difficult to handle. In this work, we introduce DAHLIA, a data-agnostic framework for language-conditioned long-horizon robotic manipulation that leverages large language models (LLMs) for real-time task planning and execution. Our framework features a dual-tunnel architecture in which a planner LLM decomposes tasks and generates executable plans, while a reporter LLM provides closed-loop feedback, enabling adaptive re-planning and robust execution. Additionally, we incorporate chain-of-thought (CoT) reasoning in the models and temporal abstraction in action execution to enhance efficiency and traceability. DAHLIA generalizes robustly across diverse long-horizon tasks and achieves state-of-the-art performance in both simulated and real-world scenarios.
We present a planner-reporter architecture that couples language-based task planning with vision-based task evaluation to form a closed-loop feedback system. The planner LLM takes as input a natural language task goal embedded in a structured prompt that instructs it how to write task-execution code. Because the LLM is pre-trained on large amounts of internet text, it serves as a powerful knowledge prior for grounding abstract concepts such as colors, shapes, positions, and arithmetic, and it imitates the prompt to output task code. The reporter is a multimodal LLM that takes in scene images and text queries from the planner and returns feedback, including a scene description and a task evaluation, to the planner. A minimal sketch of this interaction is given below.
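To make the loop concrete, the following is a minimal sketch of how the planner-reporter interaction could be wired up. The helper names (`call_planner_llm`, `call_reporter_llm`, `execute`, `env.render`) are hypothetical stand-ins for whatever LLM backend and robot environment are used, not the framework's actual API.

```python
# Hypothetical stubs: replace with the actual LLM backend and robot environment.

def call_planner_llm(prompt: str) -> str:
    """Query the planner LLM and return generated task-execution code."""
    raise NotImplementedError

def call_reporter_llm(image, query: str) -> dict:
    """Query the multimodal reporter LLM with a scene image and a text query."""
    raise NotImplementedError

def execute(task_code: str, env) -> None:
    """Run the generated primitive code in the robot environment."""
    raise NotImplementedError

def run_closed_loop(task_goal: str, structured_prompt: str, env, max_replans: int = 5) -> bool:
    prompt = structured_prompt + f"\n# Task: {task_goal}\n"
    for _ in range(max_replans):
        task_code = call_planner_llm(prompt)   # planner decomposes the goal into executable code
        execute(task_code, env)
        # reporter inspects the scene and evaluates task completion
        feedback = call_reporter_llm(
            env.render(),
            f"Describe the scene. Was the task '{task_goal}' completed?",
        )
        if feedback.get("success"):
            return True
        # feed the reporter's scene description back so the planner can re-plan
        prompt += f"\n# Reporter feedback: {feedback.get('description', '')}\n"
    return False
```

The loop terminates either when the reporter judges the goal complete or when the re-planning budget is exhausted, mirroring the closed-loop re-planning behavior described above.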
We use a prompt composed of several functional parts to guide the LLM toward reasonable output: imports, method explanations, orientations, general requirements, task examples, and the question. Each part is written in Python style and familiarizes the LLM with the environment in which the robot performs the task, including how to query information from the environment and how to manipulate the objects within it. An illustrative layout of such a prompt is sketched below.
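For illustration, the following sketch shows one possible layout of the six functional parts. The section contents and API names (`env_utils`, `get_object_pose`, `pick_place`, `rotate`) are hypothetical placeholders, not the exact prompt used in the framework.

```python
# Illustrative prompt layout (contents abbreviated; APIs are hypothetical placeholders).
PROMPT_TEMPLATE = '''
# --- Imports: APIs available to the generated code ---
from env_utils import get_object_pose, pick_place, rotate

# --- Method explanations: what each API does ---
# get_object_pose(name) -> position and orientation of an object in the scene.
# pick_place(obj, target) -> pick up obj and place it at target.

# --- Orientations: conventions for axes, angles, and directions ---
# The +z axis points up; yaw is measured counterclockwise when viewed from above.

# --- General requirements: constraints the generated code must respect ---
# Only use the APIs listed above; query the scene before acting on any object.

# --- Task examples: worked input/output pairs the LLM should imitate ---
# Task: "put the red block in the green bowl"
# pick_place('red block', get_object_pose('green bowl')[0])

# --- Question: the new task goal to be solved ---
# Task: "{task_goal}"
'''

prompt = PROMPT_TEMPLATE.format(task_goal="stack all the blue blocks")
```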
To assess the scalability and generalization of our framework across different embodiments and task setups, we conduct additional experiments on the CALVIN and Franka Kitchen benchmarks. The average success rates for individual subtasks are presented in Table \ref{tab:scalability}. The results show that our default dual-tunnel DAHLIA setup consistently outperforms its planner-only variant, †DAHLIA (GPT-4o-mini^P), demonstrating the effectiveness of its closed-loop feedback mechanism in improving execution reliability. Our framework achieves 100% success in most tasks, including lift block, open drawer, open slide door, kettle, and slider cabinet, while also surpassing the planner-only model in complex tasks requiring precise coordination, such as rotate block (+4%) and turn on switch (+4%) in CALVIN, as well as top burner (+6%) and bottom burner (+8%) in Franka Kitchen. Notably, the planner-only model may struggle in tasks requiring highly accurate task planning and fine-grained manipulation motions. For example, in the "top burner" and "bottom burner" tasks in Franka Kitchen, the robot must reach the correct rotary knob and rotate it counterclockwise by 45 degrees to activate the corresponding burner. Errors in task planning or dexterous execution, such as selecting the wrong knob or rotating in the wrong direction, can lead to complete task failure. In such cases, the closed-loop feedback in the dual-tunnel architecture plays a critical role by detecting and correcting execution errors, enabling the planner to refine its actions and complete the global task plan. The performance gap between the two DAHLIA variants is relatively small, as both employ temporal abstraction and CoT-based task planning, ensuring efficient long-horizon primitive execution. Nevertheless, the closed-loop feedback mechanism enhances robustness, allowing for error correction and ensuring greater stability in complex environments. These results further validate DAHLIA's robust execution and adaptive reasoning capabilities, demonstrating its ability to generalize across diverse tasks and embodiments.
| CALVIN Tasks | †DAHLIA (GPT-4o-mini^P) | DAHLIA (Ours) |
|---|---|---|
| lift block | 100 | 100 |
| rotate block | 94 | 98 |
| turn on switch | 96 | 100 |
| open slide door | 100 | 100 |
| open drawer | 100 | 100 |
| overall | 98.00 | 99.60 |

| Franka Kitchen Tasks | †DAHLIA (GPT-4o-mini^P) | DAHLIA (Ours) |
|---|---|---|
| microwave | 98 | 100 |
| kettle | 100 | 100 |
| light | 98 | 100 |
| top burner | 90 | 96 |
| bottom burner | 90 | 98 |
| slider cabinet | 100 | 100 |
| hinge cabinet | 96 | 98 |
| overall | 96.00 | 98.86 |
All values are average success rates (%). ^P indicates the Planner model. † indicates the variant that uses only the Planner tunnel for open-loop control, without the Reporter.