The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

Stephen Casper; Jieun Yun; Joonhyuk Baek; Yeseong Jung; Minhwan Kim; Kiwan Kwon; Saerom Park; Hayden Moore; David Shriver; Marissa Connor; Keltin Grimes; Angus Nicolson; Arush Tagade; Jessica Rumbelow; Hieu Minh Nguyen; Dylan Hadfield-Menell

TL;DR

The paper evaluates interpretability tools for discovering trojans in CNNs at ImageNet scale through a public competition that extends the benchmark of Casper et al. (2023). It showcases four featured entries, Prototype Generation (PG), TextCAVs, FEUD, and RFLA-Gen2, each offering a distinct visualization or captioning approach for identifying trojan triggers. Yun et al. set a new record on the benchmark and, along with Nicolson, successfully identified all four secret trojans, while style trojans remained challenging. The work shows that benchmarking interpretability tools can aid red-teaming and diagnostics for large-scale vision models, and it motivates applying similar methods to other AI domains, including language models.

Abstract

Interpretability techniques are valuable for helping humans understand and oversee AI systems. The SaTML 2024 CNN Interpretability Competition solicited novel methods for studying convolutional neural networks (CNNs) at the ImageNet scale. The objective of the competition was to help human crowd-workers identify trojans in CNNs. This report showcases the methods and results of four featured competition entries. It remains challenging to help humans reliably diagnose trojans via interpretability tools. However, the competition's entries have contributed new techniques and set a new record on the benchmark from Casper et al., 2023.

Paper Structure (9 sections, 6 figures, 2 tables)

Background
Competition Details and Results
Methods used by Featured Entries
Tagade and Rumbelow - https://github.com/leap-laboratories/trojan-detection-submission
Nicolson - https://github.com/AngusNicolson/satml-trojans-textcavs
Moore et al. - https://github.com/cmu-sei/feud
Yun et al. - Finetuned Robust Feature Level Adversary Generator (RFLA-Gen2)
Discussion
Visualizations from Competition Entries

Figures (6)

Figure 1: From Casper et al. (2023): Examples of trojaned images from each of the three trojan types. Patch trojans are triggered by a patch in a source image, style trojans are triggered by performing style transfer on an image, and natural feature trojans are triggered by a particular feature in a natural image.
Figure 2: Results from human evaluators showing the proportion out of 100 subjects who identified the correct trigger from an 8-option multiple choice question. A random-guess baseline achieves 0.125. (a) Results from the methods tested in Casper et al. (2023). "All" refers to using all visualizations from all 9 tools at once. (b) Results from the 4 competition entries featured here.
Figure 3: All visualizations from Tagade and Rumbelow produced using Prototype Generation (PG).
Figure 4: All captions from Nicolson produced using TextCAVs.
Figure 5: All visualizations and captions from Moore et al. produced using Feature Embeddings using Diffusion (FEUD).
...and 1 more figure
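
The 0.125 random-guess baseline in the Figure 2 caption follows from the 8-option question format (1/8). As a rough illustration of how a success count out of 100 subjects could be compared against that baseline, the sketch below computes an exact one-sided binomial tail probability. The function names and the example count of 30 are hypothetical and chosen for illustration only; they are not results or analysis code from the paper.

```python
from math import comb


def chance_baseline(num_options: int) -> float:
    """Expected success rate when guessing uniformly among the options."""
    return 1.0 / num_options


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


baseline = chance_baseline(8)  # 8-option multiple choice question

# Hypothetical count, for illustration only; not a result from the paper.
hypothetical_correct = 30
tail = binomial_tail(hypothetical_correct, 100, baseline)

print(f"chance baseline: {baseline}")
print(f"P(>= {hypothetical_correct} of 100 correct by guessing): {tail:.2e}")
```

A small tail probability would suggest that the observed success rate is unlikely under pure guessing, although it says nothing about how the subjects arrived at the correct answer.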
