Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time

Chiao-An Yang; Ziwei Liu; Raymond A. Yeh

TL;DR

This work hypothesizes that the activations discarded by subsampling layers are useful and can be incorporated on the fly to improve a model's predictions. It proposes a search-and-aggregate method that finds useful activation maps at test time.

Abstract

Subsampling layers play a crucial role in deep nets by discarding a portion of an activation map to reduce its spatial dimensions. This encourages the deep net to learn higher-level representations. Contrary to this motivation, we hypothesize that the discarded activations are useful and can be incorporated on the fly to improve a model's predictions. To validate our hypothesis, we propose a search-and-aggregate method that finds useful activation maps at test time. We apply our approach to image classification and semantic segmentation. Extensive experiments over nine different architectures on multiple datasets show that our method consistently improves test-time performance and complements existing test-time augmentation techniques. Our code is available at https://github.com/ca-joe-yang/discard-in-subsampling.
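To make the premise concrete, the sketch below shows a stride-2 subsampling of a small 2D activation map in plain Python. It is an illustration only, not the authors' implementation: the function name `subsample_by_two` and the encoding of a selection index `s` in {0, 1, 2, 3} as a (row, column) offset are assumptions made for this example.

```python
def subsample_by_two(x, s=0):
    """Keep every second row and column of a 2D list, starting at an offset.

    Assumed (hypothetical) encoding: s in {0, 1, 2, 3} maps to the
    (row, col) offset divmod(s, 2); s = 0 is the default selection.
    """
    row_offset, col_offset = divmod(s, 2)
    return [row[col_offset::2] for row in x[row_offset::2]]


# A 4x4 activation map whose entries are 0..15, for easy bookkeeping.
x = [[r * 4 + c for c in range(4)] for r in range(4)]

# The default selection keeps only a quarter of the activations.
default = subsample_by_two(x, 0)
kept = {v for row in default for v in row}

# Enumerating all four selection indices recovers every activation.
all_maps = [subsample_by_two(x, s) for s in range(4)]
recovered = {v for m in all_maps for row in m for v in row}
```

Under these assumptions the default selection retains 4 of the 16 entries, while the other three selection indices cover the 12 entries that a standard forward pass would discard.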

Paper Structure (29 sections, 16 equations, 13 figures, 17 tables, 1 algorithm)

Introduction
Related work
Preliminaries
Our Approach
Aggregating selected activations for prediction
Searching for useful activations
Experiments
Image classification
Semantic segmentation
Discussion & limitation
Conclusion
Additional details and comparison to TTA
Details for integrating Ours with existing TTA methods
Learned TTA methods
Non-learned TTA methods
...and 14 more sections

Figures (13)

Figure 1: Comparison of test-time procedures. (a) In classic test-time augmentation, the output $\hat{\bm{y}}$ is aggregated over different augmented images $\bm{I}_{\texttt{aug}}$ fed into the same model $\bm{F}_\theta$ with the default selection indices $\bm{s} = (0,0,0)$. (b) In our procedure, $\hat{\bm{y}}$ is aggregated over a single image $\bm{I}$ fed into $\bm{F}_\theta$, but activations are extracted over a set of selection indices $\bm{s}$. A search algorithm finds the top-$B_{\texttt{ours}}$ selection indices $\bm{s}$ according to a scoring function (see "Searching for useful activations"). We then aggregate the resulting feature set $\mathcal{F} = \{\bm{f}_{\bm{s}}\}$ (see "Aggregating selected activations for prediction") by first aligning each feature according to $\bm{s}$ and then merging the aligned features with an attention aggregation module.
Figure 2: Subsampling by two.
Figure 3: Illustration of $\texttt{Align}_{\bm{s}}$ with a single subsampling layer.
Figure 4: Search for activations. From the initial state $(0,0,0)$, we add its 3 neighbors ($l=1$) in a top-down fashion. Next, state $(2,0,0)$ has the lowest criterion value, so we add its 3 neighbors ($l=2$). Finally, the $B_{\texttt{ours}}$ states with the lowest criterion values are returned in ${\color{OliveGreen}\hat{\mathcal{S}}}$. We keep track of the expanded ${\color{Mahogany}l}$ for each state in a dictionary $E$.
Figure 5: Acc. vs. budget. We observe an initial gain in Acc. when increasing the budget $B_\texttt{ours}$. The improvement plateaus when $B_\texttt{ours}$ reaches about 15.
...and 8 more figures
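The search described in the Figure 4 caption can be read as a best-first search over selection indices. The following is a minimal sketch under stated assumptions, not the authors' algorithm: the function name `search_selection_indices`, the parameter `rate=4` (four candidate offsets per layer, giving the 3 neighbors mentioned in the caption), and the toy scoring function are all hypothetical. The actual criterion is defined in the paper's section "Searching for useful activations" and is not reproduced here.

```python
import heapq


def search_selection_indices(score, num_layers=3, rate=4, budget=5):
    """Best-first search for low-scoring selection indices (lower is better).

    `score` is a user-supplied criterion (hypothetical here). `expanded`
    plays the role of the dictionary E: it records how many layers of
    each state have already been expanded.
    """
    start = (0,) * num_layers
    visited = {start: score(start)}
    expanded = {start: 0}
    frontier = [(visited[start], start)]

    while frontier and len(visited) < budget:
        crit, state = heapq.heappop(frontier)
        layer = expanded[state]
        if layer >= num_layers:
            continue  # nothing left to expand for this state

        # Add the neighbors that differ from `state` only at `layer`.
        for value in range(rate):
            if value == state[layer]:
                continue
            neighbor = state[:layer] + (value,) + state[layer + 1:]
            if neighbor not in visited:
                visited[neighbor] = score(neighbor)
                expanded[neighbor] = layer + 1  # top-down: only deeper layers
                heapq.heappush(frontier, (visited[neighbor], neighbor))

        # The state itself may be expanded again at the next layer.
        expanded[state] = layer + 1
        heapq.heappush(frontier, (crit, state))

    # Return the `budget` states with the lowest criterion values.
    return sorted(visited, key=visited.get)[:budget]


def toy_score(s):
    """Hypothetical criterion chosen so that (2, 0, 0) scores lowest."""
    return {(0, 0, 0): 1.0, (2, 0, 0): 0.5}.get(s, 2.0 + sum(s) / 10)


selected = search_selection_indices(toy_score, budget=7)
```

With the toy criterion, the search first expands $(0,0,0)$ at the first layer and then $(2,0,0)$ at the second, mirroring the two steps in the caption. A priority queue is used here because it makes "expand the lowest-criterion state next" a single pop; other data structures would work as well.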
