CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Aniket Didolkar, Andrii Zadaianchuk, Rabiul Awal, Maximilian Seitzer, Efstratios Gavves, Aishwarya Agrawal

TL;DR

CTRL-O tackles the lack of controllability in object-centric representation learning by introducing language-conditioned slots that can be bound to user-described objects in complex scenes. It combines a frozen DINOv2 backbone, a transformer-based alignment module, and a control contrastive loss to ground slots to language queries without mask supervision. The approach enables instance-controllable image generation and strengthens visual question answering by injecting language-guided, object-centric representations into downstream tasks. Results on COCO and Visual Genome show improved grounding, with gains in ARI and mBO as well as reliable instance-specific generation, suggesting a practical path toward language-guided object-centric models in real-world settings.

Abstract

Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.

Paper Structure

This paper contains 41 sections, 1 equation, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Left: Standard object-centric learning (OCL) assigns arbitrary slots with no control. Right: CTRL-O introduces language-based control, enabling specific object targeting and multimodal applications.
  • Figure 2: (a) Overview of CTRL-O architecture. An input image is processed by a frozen DINOv2 ViT model $f$, yielding patch features $H$. These features are then transformed into $H'$ by a learnable transformer encoder $g$ to align the feature space with the control queries. The control queries are introduced in the Slot Attention (SA) module, which guides the grouping of the encoded features into slots $S$. The initial slots in the SA module are conditioned with the control queries. Finally, an MLP decoder $d$, conditioned on control queries, reconstructs the DINOv2 features. (b) To ensure that slots utilize query information to represent specific objects, we apply a contrastive loss between control queries and the Slot Attention-modulated weighted DINO features $A_{\text{slot}}$ (referred to as weighted DINO slots).
  • Figure 3: Referring Expression Controllability on Visual Genome. Visualization of CTRL-O with free-form queries. The original image (left) and predicted segmentation masks are shown, with conditioning phrases presented above the corresponding segmented image; unconditioned slots have no phrase.
  • Figure 4: (a) Instance Specific Image Generation: We query an image to extract instance slot representations, which are then input into a Stable Diffusion model with the caption to generate the image. (b) Visual Question Answering: Slots are extracted from noun chunks or referring expressions in the question, then embedded into the text and input into the language model. Icons in the figure distinguish frozen from trainable components.
  • Figure 5: (a) Instance Controllable Image Generation. Comparison between CTRL-O-SD and the baseline Stable Diffusion (SD). For a given query image (marked "query"), we extract a slot representation of a specific instance $I_q$ (e.g., laptop, bus, banana). In CTRL-O-SD, the input is "A photo of $I_q$. $S_{I_q}$" to guide instance generation, while for SD, only "A photo of $I_q$" is used. Our approach produces images that more closely match the visual identity of the conditioned instance. (b) Multi-Instance Composition. We extract instances from multiple images (e.g., "bench" and "pizza") and compose them into a single image, as seen with "the pizza on the bench".
  • ...and 5 more figures
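
The control contrastive loss described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes an InfoNCE-style objective, assumes the weighted DINO slots $A_{\text{slot}}$ are attention-weighted averages of patch features, and assumes queries and slots already live in a shared embedding space. The function names `weighted_slots` and `contrastive_loss`, the temperature value, and the toy inputs are all hypothetical.

```python
import math

def weighted_slots(attn, feats):
    """Attention-weighted average of patch features per slot (assumed form of A_slot).

    attn:  K x N slot-to-patch attention weights
    feats: N x D patch features
    """
    dim = len(feats[0])
    out = []
    for row in attn:
        total = sum(row) or 1.0  # guard against an all-zero attention row
        out.append([sum(w * f[d] for w, f in zip(row, feats)) / total
                    for d in range(dim)])
    return out

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def contrastive_loss(queries, slots, temperature=0.1):
    """InfoNCE-style loss (an assumption): query i should match slot i, not the others."""
    q = [_normalize(v) for v in queries]
    s = [_normalize(v) for v in slots]
    loss = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every slot, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, sj)) / temperature for sj in s]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(q)

# Toy example: 2 slots, 3 patches, 2-dimensional features.
attn = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
slots = weighted_slots(attn, feats)

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], slots)
swapped = contrastive_loss([[0.0, 1.0], [1.0, 0.0]], slots)
```

In this toy setup the loss is lower when each query points toward its own slot (`aligned`) than when the queries are swapped, which is the behavior a binding objective of this kind is meant to reward.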