Arbitrary Ratio Feature Compression via Next Token Prediction

Yufan Liu; Daoyuan Ren; Zhipeng Zhang; Wenyang Luo; Bing Li; Weiming Hu; Stephen Maybank

TL;DR

The paper addresses the need for feature compression that works at arbitrary ratios without retraining. It introduces ARFC (Arbitrary Ratio Feature Compression), which combines three components: ARC (Arbitrary Ratio Compressor), an auto-regressive model that compresses via next-token prediction; MoS (Mixture of Solutions), which refines compressed tokens using cross-solution attention; and ERGC (Entity Relation Graph Constraint), a graph-based relational regularizer applied during training. A single model thus supports any compression ratio at inference. Across cross-modal retrieval, image classification, and image retrieval, ARFC consistently outperforms baselines and in some cases surpasses the uncompressed features. Because one model replaces a family of ratio-specific compressors, training overhead drops and the compression level can be adjusted dynamically, which suits resource-constrained and multimodal deployments.

Abstract

Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving specific compression ratios, they are often limited in flexibility and generalization. In particular, retraining is necessary when adapting to a new compression ratio. To address this limitation, we propose a novel and flexible Arbitrary Ratio Feature Compression (ARFC) framework, which supports any compression ratio with a single model, eliminating the need for multiple specialized models. At its core, the Arbitrary Ratio Compressor (ARC) is an auto-regressive model that performs compression via next-token prediction. This allows the compression ratio to be controlled at inference simply by adjusting the number of generated tokens. To enhance the quality of the compressed features, two key modules are introduced. The Mixture of Solutions (MoS) module refines the compressed tokens by utilizing multiple compression results (solutions), reducing uncertainty and improving robustness. The Entity Relation Graph Constraint (ERGC) is integrated into the training process to preserve semantic and structural relationships during compression. Extensive experiments on cross-modal retrieval, image classification, and image retrieval tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches at various compression ratios. Notably, in some cases, it even surpasses the performance of the original, uncompressed features. These results validate the effectiveness and versatility of ARFC for practical, resource-constrained scenarios.
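
The idea that the compression ratio is set at inference by the number of generated tokens can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `num_tokens_for_ratio`, `compress`, and `toy_next_token` are illustrative names, the ratio is assumed to mean the fraction of dimensions removed, and the predictor is a trivial stand-in for a trained auto-regressive model.

```python
import math
from typing import Callable, List, Sequence

Token = List[float]


def num_tokens_for_ratio(feature_dim: int, token_dim: int, ratio: float) -> int:
    """Map a compression ratio to a token budget.

    Assumption (not from the paper): `ratio` is the fraction of the
    original dimensions that is removed, so ratio=0.75 keeps about 25%.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    kept_dims = feature_dim * (1.0 - ratio)
    return max(1, math.ceil(kept_dims / token_dim))


def compress(
    feature: Sequence[float],
    ratio: float,
    token_dim: int,
    next_token: Callable[[Sequence[float], List[Token], int], Token],
) -> List[Token]:
    """Generate compressed tokens one at a time, conditioning on the prefix."""
    budget = num_tokens_for_ratio(len(feature), token_dim, ratio)
    generated: List[Token] = []
    for _ in range(budget):
        # Each step sees the input feature and all tokens produced so far.
        generated.append(next_token(feature, generated, token_dim))
    return generated


def toy_next_token(
    feature: Sequence[float], prefix: List[Token], token_dim: int
) -> Token:
    """Placeholder predictor: input mean shifted by the step index.

    A real compressor would be a trained network; this only makes the
    loop runnable.
    """
    mean = sum(feature) / len(feature)
    return [mean + len(prefix)] * token_dim


if __name__ == "__main__":
    feature = [1.0] * 16
    for ratio in (0.5, 0.75):
        tokens = compress(feature, ratio, token_dim=2, next_token=toy_next_token)
        print(ratio, len(tokens))
```

In this sketch the same `compress` function serves every ratio; only the loop length changes, which is the property the abstract attributes to a single-model design.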

Paper Structure (16 sections, 15 equations, 8 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Classical Dimension Reduction
Learned Feature Compression via Neural Encoders
Quantization-based Compression
Approach
Arbitrary Ratio Compressor via NTP
Mixture of solutions module
Entity relation graph constraint
Optimization
Training procedure
Experiments
Settings
Evaluation Results
Ablation Study
...and 1 more section

Figures (8)

Figure 1: Comparison between conventional compressors and our proposed compressor. Conventional feature compression methods require training a separate compressor for each compression ratio, while our method trains only ONE compressor to support arbitrary compression ratios.
Figure 2: The overall framework of the proposed method. It consists of three key components: the Arbitrary Ratio Compressor (ARC) via Next Token Prediction (NTP), the Mixture of Solutions (MoS) blocks, and the Entity Relation Graph Constraint (ERGC).
Figure 3: An example of an atom compressor with two auxiliary decoders. It consists of an encoder, a decoder, and several auxiliary decoders. The auxiliary decoders receive multi-view compressed features with different dropout rates and reconstruct the original feature.
Figure 4: An illustration of the Entity Relation Graph Constraint (ERGC). It constructs two Entity Relation Graphs (ERGs), one for the original feature space and one for the compressed feature space, and then forces the latter to be as close as possible to the former. Note that "$E_{12}$" represents the relationship between the first entity and the second entity.
Figure 5: Feature visualization via t-SNE. Each color represents a class sampled from ImageNet. "Baseline" denotes the original feature, whose dimension is 1024. The feature compression ratio of the compared methods is 75%. The backbone model is CN-CLIP with the ViT-H/14 architecture.
...and 3 more figures
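
The constraint described in the Figure 4 caption, where a relation graph over entities in the compressed space is pulled toward the one in the original space, can be sketched roughly as follows. The caption does not specify how relations or graph distance are defined, so this sketch assumes cosine similarity as the pairwise relation and mean squared difference as the distance; `relation_graph` and `graph_constraint` are hypothetical names.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relation_graph(features: Sequence[Vector]) -> List[List[float]]:
    """Dense graph whose entry [i][j] plays the role of E_ij.

    Assumption: the relation between two entities is their cosine similarity.
    """
    n = len(features)
    return [[cosine(features[i], features[j]) for j in range(n)] for i in range(n)]


def graph_constraint(original: Sequence[Vector], compressed: Sequence[Vector]) -> float:
    """Mean squared difference between the two graphs over off-diagonal pairs."""
    if len(original) != len(compressed):
        raise ValueError("both feature sets must describe the same entities")
    g_orig = relation_graph(original)
    g_comp = relation_graph(compressed)
    n = len(original)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    if not pairs:
        return 0.0
    return sum((g_orig[i][j] - g_comp[i][j]) ** 2 for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    original = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
    faithful = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # relations preserved
    collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # relations lost
    print(graph_constraint(original, faithful))
    print(graph_constraint(original, collapsed))
```

The penalty depends only on pairwise relations, so the two feature spaces may have different dimensions, which is what allows a comparison between original and compressed features.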
