What does CLIP know about peeling a banana?

Claudia Cuttano, Gabriele Rosi, Gabriele Trivigno, Giuseppe Averta

TL;DR

The paper tackles affordance grounding by addressing the limitations of closed-action supervision through open-vocabulary reasoning with Vision-Language Models. It introduces AffordanceCLIP, which freezes CLIP and adds a lightweight Feature Pyramid Network to recover spatial details, trained with a pixel-text contrastive objective on referring segmentation data. The approach achieves competitive zero-shot performance on AGD20K and surpasses several weakly supervised baselines while using only a small number of learnable parameters, demonstrating open-world capability for action-object reasoning. This work highlights the potential of leveraging large multimodal models for functionality-based perception and lays the groundwork for future integration with even larger language-vision systems.
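As a rough illustration of how a lightweight Feature Pyramid Network can recover spatial detail from a frozen backbone, the sketch below shows the generic FPN top-down pathway in NumPy. It is not the authors' implementation. The function names (`upsample_nearest_2x`, `fpn_top_down`), the nearest-neighbour upsampling, the factor-of-two spacing between levels, and the plain matrix products standing in for 1x1 lateral convolutions are all assumptions made for the example.

```python
import numpy as np


def upsample_nearest_2x(x):
    """Nearest-neighbour upsampling: (H, W, C) -> (2H, 2W, C)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def fpn_top_down(features, lateral_weights):
    """Generic FPN top-down fusion (hypothetical helper, not from the paper).

    features:        list of (H_i, W_i, C_i) maps ordered fine -> coarse,
                     each level assumed to have half the resolution of the
                     previous one.
    lateral_weights: list of (C_i, D) matrices; a matrix product over the
                     channel axis plays the role of a 1x1 convolution.
    Returns the fused map at the finest resolution, shape (H_0, W_0, D).
    """
    # Start from the coarsest, most semantic level.
    fused = features[-1] @ lateral_weights[-1]
    # Walk towards finer levels, adding spatial detail at each step.
    for feat, w in zip(reversed(features[:-1]), reversed(lateral_weights[:-1])):
        fused = upsample_nearest_2x(fused) + feat @ w
    return fused
```

In this sketch only the lateral projection matrices would be learnable, which is in the spirit of keeping the backbone frozen and training a small number of extra parameters.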

Abstract

Humans show an innate capability to identify tools to support specific actions. The association between object parts and the actions they facilitate is usually named affordance. Being able to segment object parts depending on the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language Models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.

Paper Structure (25 sections, 9 equations, 5 figures, 3 tables)

This paper contains 25 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of AffordanceCLIP. Our AffordanceCLIP unlocks the hidden affordance understanding capabilities within CLIP. Traditional techniques rely on task-specific supervised training, limiting them to a closed set of actions. Our key insight is that CLIP, instead, already embeds knowledge of how humans interact with objects, without the need for explicit finetuning. This enables open-vocabulary reasoning about a vast range of potential actions. Our open-vocabulary approach demonstrates promising zero-shot performance, paving the way for broader and more flexible affordance understanding.
  • Figure 2: Overview of the proposed AffordanceCLIP. Left: We train a lightweight FPN to obtain dense feature maps from CLIP. Given an image and a textual query referring to an object, a frozen CLIP model extracts visual and linguistic features. Then, our FPN gradually refines the output visual vector with fine-grained spatial details, in order to retain both spatial information and local image semantics. Finally, a contrastive loss encourages pixel-level embeddings within the GT mask of the object to align with the corresponding linguistic features. Right: At inference, AffordanceCLIP can be directly queried with any textual prompt to obtain zero-shot affordance predictions.
  • Figure 3: Qualitative results. Given an image and action, we show our model's prediction and the corresponding Ground Truth.
  • Figure 4: Open Vocabulary capabilities. Top: AffordanceCLIP is queried with actions outside the 36 in the AGD20K dataset. Bottom: AffordanceCLIP is tested in the wild, on a challenging image from everyday settings.
  • Figure 5: Examples of failure cases.
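
The Figure 2 caption describes a contrastive loss that pulls pixel-level embeddings inside the ground-truth mask towards the text embedding. A minimal NumPy sketch of one common way to write such an objective follows: cosine similarity between every pixel embedding and the text embedding, then a per-pixel binary cross-entropy against the mask. The exact loss used in the paper is not given in this summary, so the sigmoid/BCE form, the `temperature` value, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pixel_text_similarity(pixel_feats, text_feat):
    """Cosine similarity between each pixel embedding and the text embedding.

    pixel_feats: (H, W, D) dense visual features (e.g. the FPN output).
    text_feat:   (D,) embedding of the textual prompt.
    Returns an (H, W) similarity map with values in [-1, 1].
    """
    return l2_normalize(pixel_feats) @ l2_normalize(text_feat)


def pixel_text_contrastive_loss(pixel_feats, text_feat, gt_mask, temperature=0.07):
    """Assumed form of the loss: per-pixel BCE on temperature-scaled similarity.

    Pixels inside `gt_mask` are positives (pulled towards the text embedding),
    pixels outside are negatives (pushed away from it).
    """
    logits = pixel_text_similarity(pixel_feats, text_feat) / temperature
    targets = gt_mask.astype(np.float64)
    # Numerically stable binary cross-entropy with logits.
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())
```

At inference, the same `pixel_text_similarity` map can be computed for any action prompt and read directly as a heatmap, which is how a model trained this way could be queried with actions it never saw during training.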