Rate-Distortion Theory in Coding for Machines and its Application

Alon Harell, Yalda Foroutan, Nilesh Ahuja, Parual Datta, Bhavya Kanzariya, V. Srinivasa Somayazulu, Omesh Tickoo, Anderson de Andrade, Ivan V. Bajic

TL;DR

This work extends rate-distortion (RD) theory to coding for machines (CfM), introducing task-based distortions and three encoding paradigms: full-input, model-splitting, and direct coding. It proves that, under optimal conditions, these approaches achieve the same RD performance, while supervised optimization yields better RD performance than unsupervised proxies. The authors then apply the theory to image coding for machines, designing both model-splitting and direct-coding pipelines and achieving state-of-the-art RD performance on tasks such as classification, object detection, and instance segmentation. They further provide design guidelines and empirical evidence that deeper distillation points improve RD performance in unsupervised settings and deliver substantial practical gains across a range of computer vision models, including Swin Transformers, while remaining agnostic to the input modality and task model.

Abstract

Recent years have seen tremendous growth in both the capability and popularity of automatic machine analysis of images and video. As a result, a growing need for efficient compression methods optimized for machine vision, rather than human vision, has emerged. To meet this growing demand, several methods have been developed for image and video coding for machines. Unfortunately, while there is a substantial body of knowledge regarding rate-distortion theory for human vision, the same cannot be said of machine analysis. In this paper, we extend the current rate-distortion theory for machines, providing insight into important design considerations of machine-vision codecs. We then utilize this newfound understanding to improve several methods for learnable image coding for machines. Our proposed methods achieve state-of-the-art rate-distortion performance on several computer vision tasks such as classification, instance segmentation, and object detection.

Paper Structure

This paper contains 36 sections, 13 theorems, 48 equations, 30 figures, 5 tables.

Key Result

Theorem 1

The minimal achievable rates for direct coding for machines and for model splitting are identical, that is, $R_{XY}(D;T) = R_Y(D;T)$.
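
One way to see why Theorem 1 is plausible is the sketch below. The definitions in it are assumptions made for illustration and are not stated in this summary: $Y = g(X)$ is taken to be the deterministic output of the task model's front end at the split point, $T = h(Y)$ the task output, $\hat{Y}$ the decoded features, and $d$ a task distortion. The names $g$, $h$, and $d$ are hypothetical, and the paper's own definitions and proof may differ.

```latex
% Assumed definitions (illustrative, not taken from the paper):
% Y = g(X) is deterministic, T = h(Y), distortion is measured at the task output.
\begin{align*}
R_Y(D;T)    &= \min_{p(\hat{y}\mid y)\,:\; \mathbb{E}[d(h(Y),\,h(\hat{Y}))] \le D} I(Y;\hat{Y}), \\
R_{XY}(D;T) &= \min_{p(\hat{y}\mid x)\,:\; \mathbb{E}[d(h(g(X)),\,h(\hat{Y}))] \le D} I(X;\hat{Y}).
\end{align*}

% Step 1: R_{XY}(D;T) <= R_Y(D;T).
% Take any feasible p(\hat{y} | y) and define p(\hat{y} | x) = p(\hat{y} | g(x)).
% Then X -> Y -> \hat{Y} is a Markov chain, so I(X;\hat{Y} | Y) = 0, and because
% Y is a function of X we also have I(Y;\hat{Y} | X) = 0. The chain rule gives
\begin{align*}
I(X;\hat{Y}) = I(X,Y;\hat{Y}) = I(Y;\hat{Y}) + I(X;\hat{Y}\mid Y) = I(Y;\hat{Y}),
\end{align*}
% and the distortion is unchanged, so the same rate is achievable from X.

% Step 2: R_Y(D;T) <= R_{XY}(D;T).
% Any feasible p(\hat{y} | x) induces a joint law on (Y, \hat{Y}) and hence a
% channel p(\hat{y} | y). The distortion depends only on (Y, \hat{Y}), so it is
% unchanged, and by the data processing inequality with Y = g(X),
\begin{align*}
I(Y;\hat{Y}) \le I(X;\hat{Y}).
\end{align*}

% Combining both steps: R_{XY}(D;T) = R_Y(D;T).
```

Under these assumptions, the equality follows because the encoder gains nothing from seeing $X$ beyond what $Y$ already carries about the task output.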

Figures (30)

  • Figure 1: Three common approaches for coding for machines.
  • Figure 2: Distribution and quantisation of "circles" and "squares" at both the input and the intermediate layer. The marker symbol corresponds to the class of each point, whereas the color corresponds to the 1-bit bin to which each point is quantised to minimise MSE.
  • Figure 3: Task appropriateness and t-SNE visualisation for various layers in VGG16, using the CIFAR-10 dataset and MSE distortion. The improvement in $\rho$ values suggests that using deeper layers, such as 'Features.32', will lead to better rate-distortion performance for this task.
  • Figure 4: Model distillation approach - Note that our latent decoder only recreates the cut-point $Y_1$. To obtain the distillation loss at the point $Y_2$, we make use of pre-trained layers of the original task-model, which we denote the task mid-model.
  • Figure 5: Benchmark comparison for image classification in a model-splitting setup using ResNet50, on the ImageNet validation set.
  • ...and 25 more figures

Theorems & Definitions (18)

  • Theorem 1
  • Theorem 2
  • Theorem 2.A
  • Corollary 2.1
  • Corollary 2.2
  • Corollary 2.3
  • Corollary 2.4
  • Theorem 3
  • Corollary 3.1
  • Corollary 3.2
  • ...and 8 more