Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback

1School of Computation, Information and Technology, Technical University of Munich, Germany
2Center for Information and Language Processing, Ludwig Maximilian University of Munich, Germany
3State Key Laboratory for Novel Software Technology, Nanjing University, China
Corresponding author: zhenshan.bing@tum.de

Simulation and real-world demonstrations of the DAHLIA framework solving long-horizon tasks. This study introduces DAHLIA, an efficient framework for embodied long-horizon manipulation that leverages LLMs to directly interpret high-level language instructions and dynamically generate actionable motion primitives. The framework employs a planner-reporter dual-tunnel architecture, forming a closed-loop feedback system that enables self-checking and significantly improves robustness and success rates. By integrating temporal abstraction in action execution and chain-of-thought reasoning in task planning, DAHLIA demonstrates strong generalization across diverse long-horizon manipulation tasks, achieving high success rates in both seen and unseen scenarios.

Video demonstrations of DAHLIA solving real-world long-horizon tasks. We prepared three tasks for the robot agent to complete in the real-world environment: (1) pick all the fruits and place them on the plate; (2) stack the blocks in a pyramid shape; (3) make coffee using the coffee machine.

Simulation demonstrations of various long-horizon tasks from both the original LoHoRavens benchmark and the generated pool. We test our method on the original LoHoRavens benchmark and on the generated task pool, where the agent needs to perform up to 20 long-horizon tasks.

Abstract

Recent advances in language-conditioned robotic manipulation have leveraged imitation and reinforcement learning to enable robots to execute tasks from human commands. However, these approaches often suffer from limited generalization and adaptability, and from the scarcity of large-scale specialized datasets (unlike data-rich fields such as computer vision), leading to challenges in handling complex long-horizon tasks. In this work, we introduce DAHLIA, a data-agnostic framework for language-conditioned long-horizon robotic manipulation that leverages large language models (LLMs) for real-time task planning and execution. Our framework features a dual-tunnel architecture, in which a planner LLM decomposes tasks and generates executable plans while a reporter LLM provides closed-loop feedback, ensuring adaptive re-planning and robust execution. Additionally, we incorporate chain-of-thought (CoT) reasoning in task planning and temporal abstraction in action execution to enhance efficiency and traceability. DAHLIA achieves robust generalization across diverse long-horizon tasks, demonstrating state-of-the-art performance in both simulated and real-world scenarios.

DAHLIA

Framework Overview

We present a planner-reporter architecture for language-based task planning with vision-based task evaluation, forming a closed-loop feedback system. The LLM planner takes as input a natural language task goal embedded in a structured prompt, which instructs the LLM how to write task execution code properly. Since the LLM is pre-trained on large amounts of text data from the internet, it acts as a powerful knowledge prior for grounding abstract concepts such as colors, shapes, positions, and arithmetic, and imitates the prompt to output task code. The reporter is a multimodal LLM that takes in images and text queries from the planner and returns feedback, including a scene description and an evaluation, to the planner.


Fig. 1: Framework of DAHLIA. Our framework utilizes a dual-tunnel structure to implement a planner-reporter closed-loop feedback mechanism for various long-horizon manipulation tasks. The planner, powered by LLMs, converts high-level language instructions and the initial state into task plans, leveraging available LMPs/APIs to generate multi-step motion primitives via temporal abstraction. The reporter, powered by a VLM, evaluates task outcomes using visual observations and provides feedback to the planner, enabling dynamic re-planning and robust performance in unstructured environments.
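To make the closed loop concrete, the following minimal Python sketch outlines one plausible realization of the planner-reporter interaction. The object interfaces (planner_llm, reporter_vlm, env) and the Report container are illustrative assumptions for this sketch, not the actual DAHLIA API.

from dataclasses import dataclass

@dataclass
class Report:
    """Feedback from the reporter: a scene description and a verdict."""
    description: str
    success: bool

def run_task(instruction, env, planner_llm, reporter_vlm, max_rounds=5):
    """Closed-loop execution: plan, act, report, and re-plan until success."""
    feedback = None
    for _ in range(max_rounds):
        # Planner tunnel: the LLM turns the instruction (plus any feedback
        # from the previous round) into executable policy code.
        plan_code = planner_llm.generate_plan(instruction, feedback=feedback)
        env.execute(plan_code)  # run the generated motion primitives

        # Reporter tunnel: the VLM inspects the current scene image and
        # returns a scene description together with a success judgment.
        report: Report = reporter_vlm.evaluate(env.render(), instruction)
        if report.success:
            return True
        # Feed the failure description back so the planner can re-plan.
        feedback = report.description
    return False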


Prompts

We use a prompt consisting of several functional parts to instruct the LLM to output reasonable content: imports, method explanations, orientations, general requirements, task examples, and a question. Each part is written in Python style and guides the LLM to become familiar with the environment in which the robot performs the task, including how to retrieve information from the environment and how to control the objects in it.
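As an illustration, a structured prompt with these parts might be laid out as below. The API names (get_obj_names, get_obj_pos, pick_place) and the wording of each part are assumptions for the sketch, not the exact DAHLIA prompt.

PROMPT_TEMPLATE = """
# --- Imports: the only APIs the generated code may call ---
from robot_api import get_obj_names, get_obj_pos, pick_place

# --- Method explanations ---
# get_obj_names() -> list[str]: names of all objects on the table.
# get_obj_pos(name) -> (x, y): table-top position of an object.
# pick_place(obj, target_pos): pick `obj` and place it at `target_pos`.

# --- Orientations: coordinate conventions of the workspace ---
# x grows to the right, y grows away from the robot; units are meters.

# --- General requirements ---
# Write plain Python, call only the APIs above, and handle the sub-goals
# of the instruction in order.

# --- Task example ---
# Instruction: "put the red block into the blue bowl"
# pick_place('red block', get_obj_pos('blue bowl'))

# --- Question ---
# Instruction: "{instruction}"
"""

def build_prompt(instruction: str) -> str:
    # Fill in the user's task goal as the final question for the planner.
    return PROMPT_TEMPLATE.format(instruction=instruction)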

Scalability

To assess the scalability and generalization of our framework across different embodiments and task setups, we conduct additional experiments on the CALVIN and Franka Kitchen benchmarks. The average success rates for individual subtasks are presented in the table below. The results show that our default dual-tunnel DAHLIA setup consistently outperforms its planner-only variant, DAHLIA (GPT-4o-miniP), demonstrating the effectiveness of its closed-loop feedback mechanism in improving execution reliability. Our framework achieves 100% success in most tasks, including lift block, open drawer, open slide door, kettle, and slider cabinet, while also surpassing the planner-only model in complex tasks requiring precise coordination, such as rotate block (+4%) and turn on switch (+4%) in CALVIN, as well as top burner (+6%) and bottom burner (+8%) in Franka Kitchen.

Notably, the planner-only model may struggle in tasks requiring highly accurate task planning and fine-grained manipulation motions. For example, in the "top burner" and "bottom burner" tasks in Franka Kitchen, the robot must accurately reach the correct rotary knob and rotate it counterclockwise by 45 degrees to activate the corresponding burner. Errors in task planning or dexterous execution, such as selecting the wrong knob or rotating in the wrong direction, can lead to complete task failure. In such cases, the closed-loop feedback of the dual-tunnel architecture plays a critical role by detecting and correcting execution errors, enabling the planner to refine its actions and complete the global task plan.

The performance gap between the two DAHLIA variants is relatively small, as both employ temporal abstraction and CoT-based task planning, ensuring efficient long-horizon primitive execution. The closed-loop feedback mechanism nevertheless enhances robustness, allowing for error correction and greater stability in complex environments. These results further validate DAHLIA's robust execution and adaptive reasoning capabilities, demonstrating its ability to generalize across diverse tasks and embodiments.

Average success rates (%) for individual subtasks in CALVIN and Franka Kitchen

CALVIN Tasks         | †DAHLIA (GPT-4o-miniP) | DAHLIA (Ours)
lift block           | 100                    | 100
rotate block         | 94                     | 98
turn on switch       | 96                     | 100
open slide door      | 100                    | 100
open drawer          | 100                    | 100
overall              | 98.00                  | 99.60

Franka Kitchen Tasks | †DAHLIA (GPT-4o-miniP) | DAHLIA (Ours)
microwave            | 98                     | 100
kettle               | 100                    | 100
light                | 98                     | 100
top burner           | 90                     | 96
bottom burner        | 90                     | 98
slider cabinet       | 100                    | 100
hinge cabinet        | 96                     | 98
overall              | 96.00                  | 98.86

P indicates the Planner model. † indicates that the framework uses only the Planner tunnel for open-loop control, without the Reporter.
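To illustrate the temporal abstraction discussed above, the sketch below bundles a repeated pick-and-place routine into a single macro-step, so the planner emits one call instead of one planning round per object. All names here are hypothetical, not the actual DAHLIA primitives.

from typing import Callable, List, Tuple

Pos = Tuple[float, float]

def pick_and_place_all(objects: List[str], target: str,
                       get_pos: Callable[[str], Pos],
                       pick_place: Callable[[str, Pos], None]) -> None:
    """Macro-step: move every listed object onto the target location."""
    target_pos = get_pos(target)
    for obj in objects:
        # Each iteration executes one low-level primitive, but the whole
        # loop counts as a single temporally abstracted plan step.
        pick_place(obj, target_pos)

# Example: the real-world task "pick all fruits and place them on the plate"
# then reduces to a single macro call in the generated plan:
#   pick_and_place_all(['apple', 'banana', 'orange'], 'plate',
#                      get_pos, pick_place)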

Simulation demonstrations in the CALVIN and Franka Kitchen benchmarks. We also test the method in other benchmarks, namely CALVIN and Franka Kitchen, with different robot setups, where the agent needs to perform various tasks in a table-top environment or a kitchen environment. For CALVIN, the agent needs to perform up to five tasks in a row according to language instructions. For Franka Kitchen, the agent needs to perform up to four sub-tasks in a row according to language instructions.