Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion

Sofia Casarin; Cynthia I. Ugwu; Sergio Escalera; Oswald Lanz

TL;DR

This work introduces the first Differentiable Augmentation Search method (DAS) to generate variations of images that can be processed as videos, and leverages DAS to guide the reshaping of the spatial receptive field by selecting task-dependent transformations.

Abstract

The landscape of deep learning research is moving towards innovative strategies to harness the true potential of data. Traditionally, emphasis has been on scaling model architectures, resulting in large and complex neural networks that can be difficult to train with limited computational resources. However, independently of model size, data quality (i.e., amount and variability) remains a major factor affecting model generalization. In this work, we propose a novel technique to exploit available data through automatic data augmentation for the tasks of image classification and semantic segmentation. We introduce the first Differentiable Augmentation Search method (DAS) to generate variations of images that can be processed as videos. Compared to previous approaches, DAS is extremely fast and flexible, allowing the search on very large search spaces in less than a GPU day. Our intuition is that the increased receptive field in the temporal dimension provided by DAS could also benefit the spatial receptive field. More specifically, we leverage DAS to guide the reshaping of the spatial receptive field by selecting task-dependent transformations. As a result, compared to standard augmentation alternatives, we improve in terms of accuracy on the ImageNet, CIFAR-10, CIFAR-100, Tiny-ImageNet, Pascal-VOC-2012 and CityScapes datasets when plugging DAS into different lightweight video backbones.
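The idea of turning one image into a video can be illustrated with a minimal sketch. It assumes a 2D image stored as nested lists and two hand-picked transformations (`translate_x` and `rotate_90`, hypothetical helpers chosen for illustration); in the paper the transformations are selected by DAS rather than fixed by hand.

```python
def translate_x(image, shift):
    """Cyclically shift every row of a 2D image `shift` pixels to the right."""
    width = len(image[0])
    shift %= width
    return [row[-shift:] + row[:-shift] if shift else list(row) for row in image]


def rotate_90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def image_to_video(image, transforms):
    """Stack the original image and one transformed copy per transform as frames."""
    frames = [[list(row) for row in image]]  # frame 0 is the untouched image
    for transform in transforms:
        frames.append(transform(image))
    return frames


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# Illustrative (assumed) transformation set: one translation and one rotation.
video = image_to_video(image, [lambda im: translate_x(im, 1), rotate_90])

for t, frame in enumerate(video):
    print(f"frame {t}: {frame}")
```

Each frame shows the same content at a different position or orientation, which is what lets a video backbone mix information across frames.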

Paper Structure (26 sections, 16 equations, 12 figures, 11 tables)

Introduction
Related Work
Automatic Data Augmentation
Enhancing the Receptive Field
Spatial Domain
Temporal Domain
Methods
Differentiable Augmentation Search
Temporal Data Augmentation
Experiments
Comparison with SOTAs
Image Classification
Image Semantic Segmentation
Ablation
Conclusions
...and 11 more sections

Figures (12)

Figure 1: Overview of our approach, with a real example of the obtained receptive fields. The employed transformations are fundamental to shaping the receptive field. The images augmented with DAS (Sec. 3.1) are concatenated in time and processed through a video network that partially shifts and fuses the features (Sec. 3.2).
Figure 2: Our method takes an input image and processes it through a DAS cell. The cell, shown in more detail in Fig. 3, applies all possible transformations defined in the search space and generates an input video. The video is processed through a video network equipped with a temporal shift mechanism, which shifts the features of adjacent frames. As can be observed in the pink box, the features are shifted and combined as if a $3\times3\times3$ kernel were applied. Since the content derives from transformations of the same image, the effect on the original 2D image is a reshaping of the receptive field (RF). Finally, the predictions for the video input are combined, so that the performance on the original 2D task is fed back to the DAS cell.
Figure 3: Cell structure in DAS. Multiple operations are defined on each edge, collectively applied to the image, and optimized during training: as the gradients are updated over multiple steps, the $\tau$ values associated with each operation change. Fig. 3b depicts one step of this process through thicker edges. In the end, the cell is discretized through a perturbation-based approach, and the final operations are chosen and composed (black edges).
Figure 4: RF reshaping for translation (top row) and rotation (bottom row). The illustration assumes that GSF is inserted once, so the temporal RF is expanded by 2 and three adjacent frames are considered. The first column shows the features of the original image ($f_0$), while the second and third columns show the features of the frames obtained by applying the transformation. The last column shows the effect of processing the input with a temporal shift mechanism.
Figure 5: ImageNet results for different ResNet depths.
...and 7 more figures
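
The temporal shift described in the captions of Figures 2 and 4 can be sketched as follows. This is a minimal illustration of a plain partial channel shift, not the network used in the paper: the fusion step mentioned in Figure 1 is omitted, each channel is reduced to a single scalar per frame (spatial dimensions dropped), and the function name `temporal_shift` and the parameter `fold` are assumptions made for this example.

```python
def temporal_shift(features, fold):
    """Shift part of the channels along the time axis.

    features[t][c] is the value of channel c at frame t.
    Channels [0, fold) read from the previous frame, channels [fold, 2*fold)
    read from the next frame, and the remaining channels stay in place.
    Positions with no source frame are zero-padded.
    """
    num_frames = len(features)
    num_channels = len(features[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1      # take the value from the previous frame
            elif c < 2 * fold:
                src = t + 1      # take the value from the next frame
            else:
                src = t          # unshifted channel
            if 0 <= src < num_frames:
                out[t][c] = features[src][c]
    return out


# Three frames (e.g. original, translated, rotated) with four channels each.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
]
shifted = temporal_shift(features, fold=1)

for t, frame in enumerate(shifted):
    print(f"frame {t}: {frame}")
```

After the shift, the middle frame holds channels from frames $t-1$, $t$, and $t+1$, so a subsequent 2D convolution over that frame combines information from three adjacent frames. Because the frames are transformed copies of the same image, this corresponds to sampling the original image at displaced or rotated locations.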
