Improving Computer Vision Interpretability: Transparent Two-level Classification for Complex Scenes

Stefan Scholz, Nils B. Weidmann, Zachary C. Steinert-Threlkeld, Eda Keremoğlu, Bastian Goldlücke

TL;DR

The paper tackles the opacity of deep vision models in social science image analysis by introducing a two-level approach that first represents images through detected objects and then applies non-visual classifiers to predict outcomes. It demonstrates the method on a large protest image dataset, enabling interpretable insights such as which objects drive predictions and how object importance varies across countries. While end-to-end models like Vision Transformers achieve higher raw accuracy, the proposed framework provides transparent, inspectable content and replication-friendly resources. This approach broadens the practical impact of image analysis in political science by enabling cross-context comparisons and targeted extensions using open tools and data.

Abstract

Treating images as data has become increasingly popular in political science. While existing classifiers for images reach high levels of accuracy, it is difficult to systematically assess the visual features on which they base their classification. This paper presents a two-level classification method that addresses this transparency problem. In the first stage, an image segmenter detects the objects present in the image and a feature vector is created from those objects. In the second stage, this feature vector is used as input for standard machine learning classifiers to discriminate between images. We apply this method to a new dataset of more than 140,000 images to detect which ones display political protest. This analysis demonstrates three advantages of this approach. First, identifying objects in images improves transparency by providing human-understandable labels for the objects shown in an image. Second, knowing these objects enables analysis of which objects distinguish protest images from non-protest ones. Third, comparing the importance of objects across countries reveals how protest behavior varies. These insights are not available using conventional computer vision classifiers and provide new opportunities for comparative research.
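The second stage described in the abstract can be illustrated with a minimal sketch. It assumes the segmenter has already produced a list of object labels per image; the six toy images, their labels, and the helper names `count_vector`, `predict_proba`, and `fit_logistic` are hypothetical and are not taken from the paper. A hand-written logistic regression stands in for the "standard machine learning classifiers" so that the example needs no external libraries.

```python
import math
from collections import Counter

# Hypothetical stage-one output: the object labels a segmenter might return
# for six images, paired with a protest label (1) or non-protest label (0).
TRAIN = [
    (["person", "person", "banner", "flag"], 1),
    (["person", "sign", "banner"], 1),
    (["person", "flag", "megaphone"], 1),
    (["car", "tree", "person"], 0),
    (["dog", "tree"], 0),
    (["car", "building"], 0),
]
VOCAB = sorted({label for labels, _ in TRAIN for label in labels})


def count_vector(labels):
    """Count how often each vocabulary object occurs in one image."""
    counts = Counter(labels)
    return [float(counts[obj]) for obj in VOCAB]


def predict_proba(weights, bias, x):
    """Logistic model: probability that feature vector x shows a protest."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = predict_proba(weights, bias, x) - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


X = [count_vector(labels) for labels, _ in TRAIN]
y = [target for _, target in TRAIN]
weights, bias = fit_logistic(X, y)

# Each weight belongs to a named object, so the model can be read directly.
importance = dict(zip(VOCAB, weights))
for obj, w in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{obj:10s} {w:+.2f}")
```

Because every input dimension corresponds to a human-readable object label, the fitted weights can be inspected directly, which is the kind of transparency the two-level design aims for. On real data one would likely swap in a library classifier and a held-out test set.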

Paper Structure

This paper contains 23 sections and 6 figures.

Figures (6)

  • Figure 1: Comparison of visual information extracted from protest image with Deconvolution, Grad-CAM, Integrated Gradients and Attention Rollout.
  • Figure 2: Instance segmentation applied to an image of a candlelight vigil (left) using COCO vocabulary (center) and LVIS vocabulary (right).
  • Figure 3: Feature generation from a segmented image (left), with different feature vectors generated from this image (right): binary vector ($v_a$), count-based vector ($v_b$), area-max vector ($v_c$), and area-sum vector ($v_d$).
  • Figure 4: Out-of-sample evaluation of different methods. The figure displays the F1 score achieved on the test set; LVIS area sum with gradient-boosted trees achieves the best F1 score of 0.7203. The logistic regression obtains low F1 scores with the area-based features, making certain bars invisible. Visualization based on Table A3 in the Appendix.
  • Figure 5: Proportion of segments in protest and non-protest images (left) and importance of area-sum aggregated segments (right).
  • ...and 1 more figures
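The four feature vectors named in the Figure 3 caption can be sketched as follows. The caption gives only their names, so the definitions used here are assumptions: binary means presence or absence, count means number of segments per class, area-max means the largest single segment of a class, and area-sum means the total area of all segments of a class, with areas expressed as fractions of the image. The function name `feature_vectors`, the example segments, and the three-class vocabulary are hypothetical.

```python
from collections import defaultdict


def feature_vectors(segments, vocab):
    """Build four per-image feature vectors from (label, area_fraction) pairs.

    Assumed definitions (the source names the vectors but does not spell
    them out here):
      v_a  binary    1 if the class appears at all, else 0
      v_b  count     number of segments of the class
      v_c  area-max  largest area fraction of any one segment of the class
      v_d  area-sum  summed area fraction over all segments of the class
    """
    areas = defaultdict(list)
    for label, area in segments:
        areas[label].append(area)

    v_a = [1 if areas.get(c) else 0 for c in vocab]
    v_b = [len(areas.get(c, [])) for c in vocab]
    v_c = [max(areas.get(c, [0.0])) for c in vocab]
    v_d = [sum(areas.get(c, [])) for c in vocab]
    return v_a, v_b, v_c, v_d


# Hypothetical segmenter output for one image: two people and one candle.
segments = [("person", 0.10), ("person", 0.05), ("candle", 0.02)]
vocab = ["candle", "car", "person"]

v_a, v_b, v_c, v_d = feature_vectors(segments, vocab)
print("binary  ", v_a)
print("count   ", v_b)
print("area-max", v_c)
print("area-sum", [round(x, 4) for x in v_d])
```

The binary and count vectors ignore how much of the image an object covers, while the two area-based vectors keep that information, which is one plausible reason the classifiers in Figure 4 behave differently across feature types.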