AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

Zhaxizhuoma Zhaxizhuoma; Pengan Chen; Ziniu Wu; Jiawei Sun; Dong Wang; Peng Zhou; Nieqing Cao; Yan Ding; Bin Zhao; Xuelong Li

AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

Zhaxizhuoma Zhaxizhuoma, Pengan Chen, Ziniu Wu, Jiawei Sun, Dong Wang, Peng Zhou, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li

TL;DR

AlignBot tackles the problem of aligning household robot task planning with user reminders, which are sparse, diverse, and multimodal. It combines a fine-tuned LLaVA-7B model as an adapter for GPT-4o to convert reminders into structured cues, and a dynamic retrieval mechanism to inject past successful cases into prompts. The approach shows large performance gains over strong baselines, achieving about 86.85% task-success compared with 21.67%–40.19% for baselines, and higher cue quality. This work demonstrates a practical route to robust, personalized long-horizon planning for household robots using multimodal reminders.

Abstract

This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders-such as personalized preferences, corrective guidance, and contextual assistance-into structured instruction-formatted cues that prompt GPT-4o in generating customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in real-world household environments, which are constructed within the laboratory to replicate typical household settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders, achieving 86.8% success rate compared to the vanilla GPT-4o baseline at 21.6%, reflecting a 65% improvement and over four times greater effectiveness. Supplementary materials are available at: https://yding25.com/AlignBot/

AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

TL;DR

Abstract

Paper Structure (9 sections, 4 figures, 2 tables)

This paper contains 9 sections, 4 figures, 2 tables.

Introduction
Related Work
Problem Formulation
The AlignBot Approach
Fine-Tuning LLaVA with User Reminders
Case-Based Learning for Enhanced GPT Prompting
Action Policy for Task Execution
Experiments
Conclusion

Figures (4)

Figure 1: The robot needs to align its task planning with customized user reminders-categorized into three types: personalized preferences, corrective guidance, and contextual assistance-with each illustrated through examples in distinct colors.
Figure 2: The fine-tuned LLaVA model serves as an adapter for GPT-4o during inference, processing user id, task descriptions, and observations to produce cues that guide GPT-4o's task planning. These cues, combined with a dynamically retrieved task-relevant cases of past successes, are incorporated into the prompt, optimizing GPT-4o's generation of action plans. If the initial output does not meet user expectations, the system enables iterative dialogue, supporting multiple rounds of feedback and refinement until a satisfactory result is achieved.
Figure 3: Illustration of fine-tuning LLaVA. LLaVA is fine-tuned to optimize both semantic grounding and cue generation. The dataset is organized in a Question-Answer (Q&A) format. During training, LLaVA receives an image, a question, and the corresponding correct answer as input, and the output is the generated response.
Figure 4: Real Robot Demonstration. We implement AlignBot on a real robotic system, comprising an AgileX-based mobile platform and a UFactory XArm robotic arm. The system employs the ACT algorithm alongside the AnyGrasp method for manipulation. In this setup, the robot is tasked with placing items from a countertop into a drawer, ensuring that the task plan generated is consistently aligned with user reminders.

AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

TL;DR

Abstract

AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

Authors

TL;DR

Abstract

Table of Contents

Figures (4)