
A Modular System for Enhanced Robustness of Multimedia Understanding Networks via Deep Parametric Estimation

Francesco Barbato, Umberto Michieli, Mehmet Kerim Yucel, Pietro Zanuttigh, Mete Ozay

TL;DR

SyMPIE is a lightweight modular system that enhances input content through two trainable modules, without requiring paired clean-corrupted data, enabling cross-task robustness for multimedia understanding. A Noise Estimation Module predicts the parameters of parametric image operators and a Differentiable Warping Module applies them; the upstream model stays frozen during training and inference, and the added cost is about 2 GFLOPs with real-time throughput. Across ImageNetC, ImageNetC-mixed, VizWiz, Cityscapes derivatives, and adverse-weather segmentation benchmarks, SyMPIE delivers consistent accuracy gains (roughly 2.0–2.2 pp absolute on classification and 4.0–4.1% on segmentation) with modest latency, and remains stable under iterative use and certain non-modeled corruptions. The method's practical impact lies in its cross-task applicability, compatibility with modern architectures, and potential for deployment in real-time systems facing varied corruptions.

Abstract

In multimedia understanding tasks, corrupted samples pose a critical challenge because they degrade the performance of the machine learning models they are fed to. Three groups of approaches have been proposed to handle noisy data: i) enhancer and denoiser modules that improve the quality of the noisy data, ii) data augmentation approaches, and iii) domain adaptation strategies. All of these come with drawbacks that limit their applicability: the first has high computational costs and requires pairs of clean-corrupted data for training, while the others only allow deployment on the same task/network they were trained on (i.e., when the upstream and downstream task/network are the same). In this paper, we propose SyMPIE to solve these shortcomings. To this end, we design a small, modular, and efficient system (just 2 GFLOPs to process a Full HD image) that enhances input data for robust downstream multimedia understanding at minimal computational cost. SyMPIE is pre-trained on an upstream task/network that need not match the downstream ones, and it does not need paired clean-corrupted samples. Our key insight is that most input corruptions found in real-world tasks can be modeled through global operations on the color channels of images or through spatial filters with small kernels. We validate our approach on multiple datasets and tasks, such as image classification (on ImageNetC, ImageNetC-Bar, VizWiz, and a newly proposed mixed corruption benchmark named ImageNetC-mixed) and semantic segmentation (on Cityscapes, ACDC, and DarkZurich), with consistent relative accuracy gains of about 5% across the board. The code of our approach and the new ImageNetC-mixed benchmark will be made available upon publication.
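The key insight above (global color-channel operations plus small spatial filters) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the 3x3 color-mixing matrix `mix`, the per-channel `shift`, the order of the two operations, and the edge-replicated borders are all assumptions made for illustration, and they are not claimed to match the parameters the paper's modules predict.

```python
# Hypothetical sketch of "parametric image operators": a global affine
# colour transform followed by one small spatial filter. Images are nested
# lists, image[y][x] = [r, g, b], with values in [0, 1]. All names here are
# illustrative assumptions, not the paper's API.

def apply_color(image, mix, shift):
    """Apply the same affine colour map to every pixel: out = mix @ rgb + shift."""
    return [
        [
            [sum(mix[c][k] * px[k] for k in range(3)) + shift[c] for c in range(3)]
            for px in row
        ]
        for row in image
    ]


def apply_kernel(image, kernel):
    """Cross-correlate each channel with one small odd-sized square kernel.

    Borders are handled by replicating edge pixels (an assumption).
    """
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    out = []
    for y in range(h):
        out_row = []
        for x in range(w):
            px = [0.0, 0.0, 0.0]
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    weight = kernel[dy + r][dx + r]
                    for c in range(3):
                        px[c] += weight * image[yy][xx][c]
            out_row.append(px)
        out.append(out_row)
    return out


def enhance(image, mix, shift, kernel):
    """Colour transform, then spatial filter, then clip to [0, 1]."""
    filtered = apply_kernel(apply_color(image, mix, shift), kernel)
    return [[[min(max(v, 0.0), 1.0) for v in px] for px in row] for row in filtered]


# Parameters that leave the image unchanged.
IDENTITY_MIX = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ZERO_SHIFT = [0.0, 0.0, 0.0]
IDENTITY_KERNEL = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

With the identity parameters, `enhance` returns the input unchanged. In a system like the one described, a small network would presumably predict such parameters per image rather than using fixed values. Because both operations are plain sums and products, they are differentiable with respect to the parameters, which is what allows gradients from a frozen downstream model to reach the predictor.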

Paper Structure

This paper contains 22 sections, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 2: A detailed scheme of our modules working together to enhance the content of an image. The Noise Estimation Module (NEM) receives a corrupted input and predicts a triple of parameters $(\mathbf{C}_S, \mathbf{C}_M, \mathbf{K})$. These parameters are used by the Differentiable Warping Module (DWM) to enhance the image using parametric operators.
  • Figure 3: An overview of the training procedure of our modular system.
  • Figure 6: Qualitative results on the ACDC semantic segmentation benchmark.
  • Figure 7: Qualitative results on iterated application of our method on an input image from the ImageNetC validation set. The first row represents the input, while the others correspond to four iterative applications of our modules.