I Spy With My Little Eye: A Minimum Cost Multicut Investigation of Dataset Frames
Katharina Prasse, Isaac Bravo, Stefanie Walter, Margret Keuper
TL;DR
The paper introduces a probabilistic clustering approach for visual frame detection by formulating image clustering as a Minimum Cost Multicut Problem (MP) whose edge costs are derived from pairwise similarities between image embeddings. It systematically compares multiple vision foundation models (e.g., DINOv2, ConvNeXt V2, CLIP) to assess how embedding space properties influence frame detection, using calibration to tune edge weights and Variation of Information as a clustering metric. Key findings show that embedding choice strongly shapes cluster quality, with DINOv2 excelling at broad frames and ConvNeXt V2 capturing fine-grained distinctions; a multimodal extension with text can further enrich clustering at higher computational cost. The work provides practical guidelines for selecting embedding spaces and demonstrates MP's advantage over conventional clustering methods for detecting evolving or abstract frames in climate-change imagery, offering code for reproducibility and future expansion.
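To make the clustering metric mentioned above concrete, the following is a minimal sketch of the standard Variation of Information between two clusterings, VI = H(A|B) + H(B|A), measured in nats. The function name and the list-of-labels input format are illustrative assumptions, not taken from the authors' code.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Return VI = H(A|B) + H(B|A) in nats (hypothetical helper name).

    A value of 0 means the two clusterings are identical up to relabeling.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have equal length")
    n = len(labels_a)
    count_a = Counter(labels_a)                # cluster sizes in A
    count_b = Counter(labels_b)                # cluster sizes in B
    joint = Counter(zip(labels_a, labels_b))   # sizes of the overlaps
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # p(a,b)/p(a) = n_ab/count_a[a], and likewise for b
        vi -= p_ab * (math.log(n_ab / count_a[a]) + math.log(n_ab / count_b[b]))
    return vi
```

Lower values indicate closer agreement with the reference clustering; merging two equally sized ground-truth clusters into one, for example, yields a VI of log 2.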
Abstract
Visual framing analysis is a key method in the social sciences for determining common themes and concepts in a given discourse. Image clustering can reduce manual effort and significantly speed up the annotation process. In this work, we phrase the clustering task as a Minimum Cost Multicut Problem (MP). Solutions to the MP have been shown to provide clusterings that maximize the posterior probability, solely from provided local, pairwise probabilities of two images belonging to the same cluster. We discuss the efficacy of numerous embedding spaces for detecting visual frames and show that the MP formulation is superior to other clustering methods. To this end, we employ the climate change dataset ClimateTV, which contains images commonly used for visual frame analysis. For broad visual frames, DINOv2 is a suitable embedding space, while ConvNeXt V2 returns a larger number of clusters that capture fine-grained differences, e.g., speech and protest. Our insights into embedding space differences, in combination with a clustering that is optimal by definition, advance automated visual frame detection. Our code can be found at https://github.com/KathPra/MP4VisualFrameDetection.
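The step from pairwise probabilities to a clustering can be illustrated with a small sketch. It assumes a symmetric matrix of same-cluster probabilities is already available, converts each probability to a signed logit cost, and then applies a greedy edge-contraction heuristic. This heuristic is a stand-in chosen for brevity: it does not guarantee an optimal multicut and is not necessarily the solver used by the authors. The names `logit_costs` and `greedy_multicut` are hypothetical.

```python
import math


def logit_costs(prob_same, eps=1e-6):
    """Map p(same cluster) to signed edge costs; positive means attractive."""
    n = len(prob_same)
    costs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # clamp to avoid log(0) for probabilities of exactly 0 or 1
                p = min(max(prob_same[i][j], eps), 1.0 - eps)
                costs[i][j] = math.log(p / (1.0 - p))
    return costs


def greedy_multicut(costs):
    """Greedily contract the most attractive edge until none is positive.

    Assumes a symmetric cost matrix. Returns one cluster label per node.
    """
    n = len(costs)
    w = [row[:] for row in costs]            # working copy of the costs
    members = {i: [i] for i in range(n)}     # root node -> merged nodes
    while True:
        best, pair = 0.0, None
        active = sorted(members)
        for x, i in enumerate(active):
            for j in active[x + 1:]:
                if w[i][j] > best:
                    best, pair = w[i][j], (i, j)
        if pair is None:                     # only repulsive edges remain
            break
        i, j = pair
        # contract j into i: costs of parallel edges add up
        for k in members:
            if k != i and k != j:
                w[i][k] += w[j][k]
                w[k][i] = w[i][k]
        members[i].extend(members.pop(j))
    labels = [0] * n
    for label, root in enumerate(sorted(members)):
        for node in members[root]:
            labels[node] = label
    return labels


# Toy example: images 0-2 are likely the same frame, image 3 is not.
toy_prob = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
toy_labels = greedy_multicut(logit_costs(toy_prob))
```

Note that the number of clusters is not specified in advance: it follows from the signs and magnitudes of the edge costs, which is why calibrating the probabilities matters for the resulting partition.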
