Visual Grounding for Object-Level Generalization in Reinforcement Learning

Haobin Jiang; Zongqing Lu

TL;DR

This paper tackles zero-shot object-level generalization in reinforcement learning by grounding language instructions in a visual representation. It introduces COPL (CLIP-guided Object-grounded Policy Learning), which uses a modified MineCLIP to generate a confidence map for the target object named in the instruction and transfers this knowledge to RL along two routes: a focal intrinsic reward, and the confidence map itself as a visual task representation for the policy. The focal reward, defined as $r^{f}_t = \operatorname{mean}(m^c_t \circ m^k)$ where $m^c_t$ is the confidence map at step $t$ and $m^k$ is a centered Gaussian kernel, grows as the target occupies more of the view and sits closer to its center, so it rewards both approaching the target and keeping it centered. Using the confidence map as policy input gives a straightforward, open-vocabulary representation that extends to unseen targets. Across single-task and multi-task Minecraft experiments, COPL outperforms language-conditioned baselines and generalizes zero-shot to novel objects, highlighting the practical potential of embedding vision-language grounding into RL for open-ended environments.
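The focal reward formula above can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function names `gaussian_kernel` and `focal_reward`, the map size, and the kernel widths `sigma_h` and `sigma_w` are assumptions chosen for illustration, and the kernel is assumed to be normalized so that its peak value is 1.

```python
import numpy as np


def gaussian_kernel(height, width, sigma_h, sigma_w):
    """Centered 2D Gaussian kernel m^k with peak value 1 at the map center.

    sigma_h and sigma_w are hypothetical width parameters (in patches).
    """
    ys = np.arange(height) - (height - 1) / 2.0
    xs = np.arange(width) - (width - 1) / 2.0
    gy = np.exp(-(ys ** 2) / (2.0 * sigma_h ** 2))
    gx = np.exp(-(xs ** 2) / (2.0 * sigma_w ** 2))
    return np.outer(gy, gx)


def focal_reward(confidence_map, kernel):
    """r^f_t = mean(m^c_t * m^k), using the element-wise product."""
    return float(np.mean(confidence_map * kernel))


# Toy 9x9 confidence maps (the size is an assumption for illustration).
kernel = gaussian_kernel(9, 9, sigma_h=2.0, sigma_w=2.0)

far_off_center = np.zeros((9, 9))
far_off_center[0, 0] = 1.0          # small target in a corner

far_centered = np.zeros((9, 9))
far_centered[4, 4] = 1.0            # small target in the center

near_centered = np.zeros((9, 9))
near_centered[3:6, 3:6] = 1.0       # larger (closer) target in the center

for name, m in [("far, off-center", far_off_center),
                ("far, centered", far_centered),
                ("near, centered", near_centered)]:
    print(name, focal_reward(m, kernel))
```

In this toy setting the reward increases from the first map to the last, which matches the intended behavior: a target that fills more of the view and lies nearer the center yields a larger reward.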

Abstract

Generalization is a pivotal challenge for agents following natural language instructions. To address it, we leverage a vision-language model (VLM) for visual grounding and transfer its vision-language knowledge into reinforcement learning (RL) for object-centric tasks, which makes the agent capable of zero-shot generalization to unseen objects and instructions. Through visual grounding, we obtain an object-grounded confidence map for the target object indicated in the instruction. Based on this map, we introduce two routes to transfer VLM knowledge into RL. First, we propose an object-grounded intrinsic reward function derived from the confidence map to guide the agent toward the target object more effectively. Second, the confidence map offers a more unified and accessible task representation for the agent's policy than language embeddings do. This enables the agent to process unseen objects and instructions through comprehensible visual confidence maps, facilitating zero-shot object-level generalization. Single-task experiments demonstrate that our intrinsic reward significantly improves performance on challenging skill learning. In multi-task experiments, by testing on tasks beyond the training set, we show that an agent given the confidence map as its task representation generalizes better than one conditioned on language. The code is available at https://github.com/PKU-RL/COPL.

Paper Structure (34 sections, 2 equations, 14 figures, 17 tables)

Introduction
Preliminary
Related Work
Method
Visual Grounding
Transfer via Reward
Transfer via Representation
Experiments
Single-Task Experiments
Multi-Task and Generalization Experiments
Discussion
Conclusion
Segmentation Details
Extracting Targets via LLM
Negative Words
...and 19 more sections

Figures (14)

Figure 1: Overview of CLIP-guided Object-grounded Policy Learning (COPL). (left) Visual grounding: The instruction is converted into a unified 2D confidence map via our modified MineCLIP. (right) Transfer VLM knowledge into RL: The agent takes the confidence map as the task representation and is trained with our proposed focal reward derived from the confidence map to guide the agent toward the target.
Figure 2: Process of segmentation via MineCLIP. The modified MineCLIP image encoder takes the image as input and outputs patch embeddings, which are subsequently processed by the temporal transformer to guarantee embedding alignment. The MineCLIP text encoder encodes the target name along with a list of negative words. The probability of the target's presence on each patch is calculated from the similarities between patch embeddings and text embeddings.
Figure 3: Segmentation instances for targets: (a) cow, (b) pig, (c) sword, (d) sheep, (e) flower, and (f) tree. The darker the blue of a patch, the higher the probability of the target's presence on it.
Figure 4: Comparison between MineCLIP reward $r^{mc}$ and focal reward $r^{f}$ at frames 25, 35, and 45 in one episode of the task "milk a cow". From (a) to (c), our focal reward consistently increases as the agent approaches the target cow, while the MineCLIP reward varies in an uncorrelated way.
Figure 5: (a) Digging depth and (b) number of laid carpets in one episode. The increase in the y-axis metric with steps indicates that our focal reward successfully guides the agent to conduct the corresponding tasks.
...and 9 more figures
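
The patch-level scoring described in the Figure 2 caption can be sketched as follows. The caption only states that the probability is computed from similarities between patch embeddings and text embeddings (target name plus negative words); the use of cosine similarity, a softmax over the text entries, the `temperature` value, and the function name `patch_confidence_map` are all assumptions made for illustration, and the toy vectors stand in for real MineCLIP embeddings.

```python
import numpy as np


def patch_confidence_map(patch_emb, text_emb, grid_hw, temperature=0.01):
    """Per-patch probability that the target is present.

    patch_emb: (P, D) patch embeddings from the image encoder.
    text_emb:  (1 + N, D) text embeddings; row 0 is the target name,
               rows 1..N are the negative words.
    grid_hw:   (H, W) patch grid with H * W == P.
    The softmax and temperature are assumptions, not taken from the source.
    """
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (p @ t.T) / temperature             # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, 0].reshape(grid_hw)          # keep the target column only


# Toy example: a 2x2 patch grid with 4-dimensional placeholder embeddings.
target = np.array([1.0, 0.0, 0.0, 0.0])
negatives = np.array([[0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
text_emb = np.vstack([target, negatives])
patch_emb = np.vstack([target, negatives[0], negatives[1], negatives[0]])

print(patch_confidence_map(patch_emb, text_emb, grid_hw=(2, 2)))
```

Including negative words gives each patch competing labels, so a patch that matches none of the texts well is not forced toward the target; only the patch whose embedding aligns with the target text receives a high confidence in this toy case.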
