Towards Task-Compatible Compressible Representations

Anderson de Andrade, Ivan Bajić

TL;DR

This work addresses a problem in multi-task learnable compression: a representation learned for one task may be less informative for a second task than its estimated information content suggests. It frames the issue through predictive $\mathcal{V}$-information and proposes adding a small reconstruction reward to the rate-distortion objective to promote task compatibility. Implemented within a scalable coding framework, the method encodes base and enhancement representations with CNN-based analysis/synthesis transforms and an autoregressive entropy model with a hyper-prior, with $\hat{X}$ recovered via a reconstruction channel to encourage generality. Experiments use base representations trained for object detection on COCO 2017 and depth estimation on Cityscapes, with image reconstruction and semantic segmentation as secondary tasks. They show substantial rate-distortion gains for the secondary tasks and, in many cases, improved base-task performance, suggesting that the learned representations become simpler and more transferable while remaining effective for their primary objective.

Abstract

We identify an issue in multi-task learnable compression, in which a representation learned for one task does not positively contribute to the rate-distortion performance of a different task as much as expected, given the estimated amount of information available in it. We interpret this issue using the predictive $\mathcal{V}$-information framework. In learnable scalable coding, previous work increased the utilization of side-information for input reconstruction by also rewarding input reconstruction when learning this shared representation. We evaluate the impact of this idea in the context of input reconstruction more rigorously and extend it to other computer vision tasks. We perform experiments using representations trained for object detection on COCO 2017 and depth estimation on the Cityscapes dataset, and use them to assist in image reconstruction and semantic segmentation tasks. The results show considerable improvements in the rate-distortion performance of the assisted tasks. Moreover, using the proposed representations, the performance of the base tasks is also improved. Results suggest that the proposed method induces simpler representations that are more compatible with downstream processes.
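The idea of rewarding input reconstruction while learning the base representation can be illustrated with a minimal sketch. The paper's exact objective is not reproduced on this page, so the form below (rate, plus a weighted task distortion, plus a small weighted reconstruction error) is an assumption. The function names, the use of mean squared error, and the weight `lambda_r` are hypothetical; `lambda_b` is borrowed from the Figure 2 caption, where it is assumed to weight the base-task distortion.

```python
import numpy as np


def rate_bpp(likelihoods, num_pixels):
    """Estimated rate in bits per pixel from entropy-model likelihoods."""
    return float(-np.sum(np.log2(likelihoods)) / num_pixels)


def base_objective(likelihoods, task_distortion, x, x_hat, lambda_b, lambda_r):
    """Assumed form: R + lambda_b * D_task + lambda_r * D_reconstruction.

    x and x_hat are arrays shaped (channels, height, width).
    lambda_r is a hypothetical small weight for the reconstruction reward;
    setting it to 0 gives a plain rate-distortion objective for the base task.
    """
    num_pixels = x.shape[-2] * x.shape[-1]
    rate = rate_bpp(likelihoods, num_pixels)
    # Mean squared error stands in for the reconstruction distortion (an assumption).
    reconstruction = float(np.mean((x - x_hat) ** 2))
    return rate + lambda_b * task_distortion + lambda_r * reconstruction


# Toy usage: 8 symbols, each with likelihood 0.5, on a 4x4 image.
likelihoods = np.full(8, 0.5)
x = np.zeros((3, 4, 4))
x_hat = np.ones((3, 4, 4))
baseline = base_objective(likelihoods, 2.0, x, x_hat, lambda_b=10.0, lambda_r=0.0)
proposed = base_objective(likelihoods, 2.0, x, x_hat, lambda_b=10.0, lambda_r=0.01)
```

With `lambda_r` kept small, the task term still dominates, which matches the description of the reward as a minor addition to the objective rather than a second primary task.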

Paper Structure (13 sections, 8 equations, 3 figures)

This paper contains 13 sections, 8 equations, 3 figures.

Figures (3)

  • Figure 1: Architecture overview. The green lines denote the scalable approach, and the red line denotes a direct approach where the secondary task only has access to the base representation. The dotted lines mean that gradients do not flow past that edge.
  • Figure 2: Rate-distortion performance of base representations on image reconstruction (a)-(d) and base tasks (e)-(f). Left: object detection on COCO 2017; Right: depth estimation on Cityscapes. "Direct" means reconstructing the input from the base representation directly, without the dedicated channel (see Fig. 1). "Scalable" means coding the reconstruction representation conditional on the base representation; in this case, the BPP reported is the sum of the rates of both representations. The "Standalone" method does not use side-information. The lines marked "Uncompressed" correspond to the best task performance obtained with a very large $\lambda_b = 1,000$, amongst the baseline and proposed methods. PSNR is reported in decibels (dB).
  • Figure 3: Rate-distortion performance of scalable conditional coding on semantic segmentation on Cityscapes. The BPP measurements correspond to the sum of the base and secondary rates. Although the distortion function $d_e(\cdot, \cdot)$ used in training is the per-pixel multi-class cross-entropy, we report the more conventional mean Intersection-over-Union (mIoU) test metric.
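
The quantities named in the captions above (BPP summed over both representations, PSNR in dB, and mIoU) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' evaluation code: the function names are hypothetical, pixel values are assumed to lie in [0, 1], and mIoU is computed on a single label map while skipping classes absent from both prediction and target, whereas benchmark protocols typically accumulate counts over the whole test set.

```python
import numpy as np


def total_bpp(base_bits, secondary_bits, height, width):
    """Bits per pixel for scalable coding: both bitstreams count toward the rate."""
    return (base_bits + secondary_bits) / (height * width)


def psnr_db(x, x_hat, max_val=1.0):
    """Peak signal-to-noise ratio in decibels (undefined when x equals x_hat)."""
    mse = np.mean((x - x_hat) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))


def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (an assumption)
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))


# Toy usage on a 2x2 label map with two classes.
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
miou = mean_iou(pred, target, num_classes=2)  # class IoUs are 1/2 and 2/3
```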