I Spy With My Little Eye: A Minimum Cost Multicut Investigation of Dataset Frames

Katharina Prasse, Isaac Bravo, Stefanie Walter, Margret Keuper

TL;DR

The paper introduces a probabilistic clustering approach for visual frame detection: image clustering is formulated as a Minimum Cost Multicut Problem (MP), with edge costs derived from pairwise similarities between image embeddings. It systematically compares multiple vision foundation models (e.g., DINOv2, ConvNeXt V2, CLIP) to assess how embedding space properties influence frame detection, using a calibration term to tune edge weights and Variation of Information as the clustering metric. Key findings show that the choice of embedding strongly shapes cluster quality: DINOv2 excels at broad frames, while ConvNeXt V2 captures fine-grained distinctions. A multimodal extension with text can further enrich the clustering at higher computational cost. The work provides practical guidelines for selecting embedding spaces, demonstrates the MP's advantage over conventional clustering methods for detecting evolving or abstract frames in climate-change imagery, and releases code for reproducibility and future extension.
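
To make the clustering metric mentioned above concrete, the following is a minimal sketch of Variation of Information between two label assignments, using the textbook definition VI(A, B) = H(A|B) + H(B|A). The function name and the toy label lists are illustrative assumptions and are not taken from the paper's code; lower values mean more similar clusterings, and 0 means the two partitions are identical up to relabeling.

```python
import math
from collections import Counter


def variation_of_information(a, b):
    """Variation of Information between two clusterings given as label lists.

    Textbook definition: VI = H(A|B) + H(B|A), in nats.
    Hypothetical helper for illustration, not the paper's implementation.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("label lists must be non-empty and of equal length")
    count_a = Counter(a)           # cluster sizes in clustering A
    count_b = Counter(b)           # cluster sizes in clustering B
    joint = Counter(zip(a, b))     # sizes of pairwise cluster intersections
    vi = 0.0
    for (x, y), n_xy in joint.items():
        p_xy = n_xy / n
        p_x = count_a[x] / n
        p_y = count_b[y] / n
        # -p(x,y) * [log p(x,y)/p(x) + log p(x,y)/p(y)]
        vi -= p_xy * (math.log(p_xy / p_x) + math.log(p_xy / p_y))
    return vi


# Toy labels (made up for illustration).
same_up_to_relabeling = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])
independent = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_relabeling, independent)
```

Because the measure depends only on how the clusters overlap, it does not require the two clusterings to have the same number of clusters, which suits a method like the MP where the cluster count is not fixed in advance.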

Abstract

Visual framing analysis is a key method in the social sciences for determining common themes and concepts in a given discourse. Image clustering can reduce manual effort and significantly speed up the annotation process. In this work, we phrase the clustering task as a Minimum Cost Multicut Problem (MP). Solutions to the MP have been shown to provide clusterings that maximize the posterior probability, given only local, pairwise probabilities of two images belonging to the same cluster. We discuss the efficacy of numerous embedding spaces for detecting visual frames and show the superiority of the MP over other clustering methods. To this end, we employ the climate change dataset ClimateTV, which contains images commonly used for visual frame analysis. For broad visual frames, DINOv2 is a suitable embedding space, while ConvNeXt V2 returns a larger number of clusters that capture fine-grained differences, e.g., speech and protest. Our insights into embedding space differences, combined with a clustering that is optimal by definition, advance automated visual frame detection. Our code can be found at https://github.com/KathPra/MP4VisualFrameDetection.
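
The formulation in the abstract can be illustrated with a small sketch. It assumes the log-odds edge cost commonly used in the multicut literature, c = log(p / (1 - p)), where p is the pairwise probability that two images share a cluster, and it minimizes the summed cost of cut edges by exhaustive search over all partitions. The function names, the toy probabilities, and the exhaustive search are illustrative assumptions: the paper's calibration term is omitted, and exhaustive search is only feasible for a handful of nodes, so this is not the authors' solver.

```python
import math
from itertools import product


def logit_cost(p, eps=1e-6):
    """Log-odds edge cost: positive if the pair likely shares a cluster."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log(p / (1.0 - p))


def multicut_objective(labels, costs):
    """Sum of edge costs over cut edges (endpoints in different clusters)."""
    return sum(c for (u, v), c in costs.items() if labels[u] != labels[v])


def brute_force_multicut(n, probs):
    """Exhaustively search all partitions of n nodes (toy sizes only).

    probs maps node pairs (u, v) to the probability of sharing a cluster;
    the pairs are assumed to form a complete graph, so every labeling
    corresponds to a valid multicut.
    """
    costs = {edge: logit_cost(p) for edge, p in probs.items()}
    best_labels, best_value = None, math.inf
    for labels in product(range(n), repeat=n):
        # Keep one canonical labeling per partition: each new label may
        # exceed the largest label seen so far by at most one.
        if any(labels[i] > max(labels[:i], default=-1) + 1 for i in range(n)):
            continue
        value = multicut_objective(labels, costs)
        if value < best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value


# Made-up pairwise probabilities for four images.
probs = {
    (0, 1): 0.9, (2, 3): 0.8,
    (0, 2): 0.2, (0, 3): 0.1,
    (1, 2): 0.3, (1, 3): 0.2,
}
labels, value = brute_force_multicut(4, probs)
print(labels, value)
```

In this toy example, images 0 and 1 end up in one cluster and images 2 and 3 in another: cutting an edge with probability below 0.5 lowers the objective, while cutting a likely same-cluster edge raises it. The number of clusters is never specified; it follows from the edge costs.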

Paper Structure

This paper contains 22 sections, 8 equations, 17 figures, 6 tables, 2 algorithms.

Figures (17)

  • Figure 1: The Multicut Problem can be understood as a Bayesian Network which aims to predict the optimal partitioning $\mathscr{Y}$.
  • Figure 2: Calibration term [c] ablation across embedding spaces on ImageNette's train set shows that embedding spaces where the distance between different data points is increased during training require a smaller $cal$ compared to traditionally trained embedding spaces.
  • Figure 3: UMAP visualization for image embeddings on ClimateTV reveals the differences in embedding space occupancy between different embedding models. DINOv2 and CLIP ViT-B/32 have a similar $s_c$ distribution but only a small overlap. The embedding space comparison of CNNs pre-trained on ImageNet1k shows almost no overlap.
  • Figure 4: The ImageNette church class is split into two single-image clusters (r1), plus one false negative assigned to parachute and one false-positive parachute image (r2).
  • Figure 5: DINOv2 features appear to encode dog position and background, as r1 and r2 each show one cluster for ImageWoof.
  • ...and 12 more figures

Theorems & Definitions (1)

  • Definition 1