OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad

Luyao Tang, Yuxuan Yuan, Chaoqi Chen, Zeyu Zhang, Yue Huang, Kun Zhang

TL;DR

The paper tackles the fragility of foundation models under open-world distribution shifts, weak supervision, and adversarial attacks. It introduces OCRT, a three-part framework that binds object-centric scene representations to a sparse semantic concept space and constructs a flexible concept-based graph to perform high-order relational reasoning, without altering FM architectures. The approach yields a plug-and-play enhancement that improves generalization and robustness of SAM and CLIP across segmentation, zero-shot classification, captioning, and VQA, including under adversarial conditions. The work demonstrates that focusing on high-order relations among informative concepts offers a scalable path to universal generalization in practical deployments.

Abstract

Although foundation models (FMs) are claimed to be powerful, their generalization ability decreases significantly under distribution shifts, weak supervision, or malicious attacks in the open world. Meanwhile, most domain generalization and adversarial fine-tuning methods are task-specific or model-specific, ignoring the universality needed in practical applications and the transferability between FMs. This paper addresses the problem of generalizing FMs to out-of-domain data. We propose a novel framework, the Object-Concept-Relation Triad (OCRT), that enables FMs to extract sparse, high-level concepts and intricate relational structures from raw visual inputs. The key idea is to bind objects in visual scenes to a set of object-centric representations through unsupervised decoupling and iterative refinement. Specifically, we project the object-centric representations onto a semantic concept space that the model can readily interpret, and estimate their importance to filter out irrelevant elements. Then, a concept-based graph with a flexible degree is constructed to incorporate the set of concepts and their corresponding importance, enabling the extraction of high-order factors from informative concepts and facilitating relational reasoning among them. Extensive experiments demonstrate that OCRT can substantially boost the generalizability and robustness of SAM and CLIP across multiple downstream tasks.
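
The pipeline described in the abstract (project object-centric representations onto a concept space, estimate importance, filter, then build a graph whose node degree is not fixed) can be illustrated with a toy sketch. This is not the authors' implementation: the concept bank, the use of cosine similarity as the projection, the "best match" importance score, the thresholds `min_importance` and `tau`, and the single round of importance-weighted neighbor averaging are all assumptions made for illustration, and the slots are hand-written vectors rather than the output of an unsupervised decomposition.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def to_concepts(slots, bank):
    """Project each slot onto a (hypothetical) concept bank.
    Assumption: importance is the best similarity to any prototype."""
    concepts = [[cosine(s, p) for p in bank] for s in slots]
    importance = [max(c) for c in concepts]
    return concepts, importance

def filter_concepts(concepts, importance, min_importance):
    """Drop concepts whose importance falls below the threshold."""
    keep = [i for i, w in enumerate(importance) if w >= min_importance]
    return keep, [concepts[i] for i in keep], [importance[i] for i in keep]

def build_graph(concepts, tau):
    """Connect two concepts when their similarity reaches tau, so the
    number of neighbors varies per node instead of being a fixed k."""
    n = len(concepts)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(concepts[i], concepts[j]) >= tau:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors

def relate(concepts, importance, neighbors):
    """One round of importance-weighted neighbor averaging, added to
    each node's own concept vector. Isolated nodes are left unchanged."""
    out = []
    for i, c in enumerate(concepts):
        total = sum(importance[j] for j in neighbors[i])
        if total == 0:
            out.append(list(c))
            continue
        msg = [sum(importance[j] * concepts[j][d] for j in neighbors[i]) / total
               for d in range(len(c))]
        out.append([a + b for a, b in zip(c, msg)])
    return out

# Toy inputs: two concept prototypes and four object slots (all hypothetical).
bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
slots = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

concepts, importance = to_concepts(slots, bank)
keep, kept_concepts, kept_importance = filter_concepts(concepts, importance, 0.5)
neighbors = build_graph(kept_concepts, tau=0.5)
degrees = [len(n) for n in neighbors]
related = relate(kept_concepts, kept_importance, neighbors)
```

In this toy run the fourth slot matches no prototype and is filtered out, and the middle concept ends up with two neighbors while the others have one, which is the sense in which the degree is flexible here. A single round only mixes direct neighbors; capturing higher-order relations would require something richer, such as stacking several rounds, and that choice is likewise an assumption rather than a detail taken from the paper.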

Paper Structure

This paper contains 16 sections, 10 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Performance comparison between SoTA methods and OCRT across downstream tasks in the open world. (a) OCRT substantially resists performance degradation from weak supervision. (b) OCRT mitigates hallucination and comes closest to the original CLIP.
  • Figure 2: Overview of the proposed OCRT, which consists of three novel components: (1) Low-level visual scene decomposition with unstructured object-centric representations; (2) High-level informative concepts extraction via irrelevant concepts suppression; (3) Concept-based graph with flexible degree performs high-order relational reasoning for generalized factors.
  • Figure 3: Object-Concept-Relation Triad Decoder of SAM.
  • Figure 4: Concept-based graph with flexible degree.
  • Figure 5: Visual comparison of segmentation quality.
  • ...and 3 more figures