Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning

Jiajun Guo, Xin Luo, Jie Liu

TL;DR

This work tackles the privacy-driven data-sharing challenges of training large foundation models by integrating split learning with a learning-based, quantized multimodal framework. It introduces an FSQ-inspired discrete representation method with linear scaling and distortion regularization, paired with entropy-based bit-width selection to minimize transmission costs. Built on TinyLLaVA, the architecture includes per-modality quantizers, a vision tower, a connector, and a language model, and employs a two-stage training regime. Empirical results across perception and cognition benchmarks show near-original performance at low-bit transmission, demonstrating substantial efficiency gains for privacy-preserving, distributed multimodal learning.

Abstract

Split learning is well known as a method for resolving data privacy concerns by training a model on distributed devices, thereby avoiding data sharing that raises privacy issues. However, high network communication costs are always an impediment to split learning, especially for large foundation models that require transmitting large amounts of high-dimensional data. To resolve this issue, we present a new multimodal model structure that incorporates a learning-based data compression method, which compresses model embeddings into low-bit integers while preserving the model's performance, greatly reducing the transmission costs between partitions. We then determine the optimal number of discrete representation levels based on a solid theoretical foundation from entropy coding.
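
To make the idea of "compressing embeddings into low-bit integers" concrete, here is a minimal sketch of a generic FSQ-style scalar quantizer split into a client half and a server half. It is an illustration under stated assumptions, not the paper's quantizer: the `tanh` bounding, the function names (`client_quantize`, `server_dequantize`, `payload_bytes`), and the absence of any learned parameters, distortion regularization, or straight-through gradient are all choices made here for brevity.

```python
import numpy as np

def client_quantize(features, bits):
    """Client side (hypothetical): map float features to integer indices in [0, 2**bits - 1]."""
    if not 1 <= bits <= 8:
        raise ValueError("this sketch stores indices in uint8, so bits must be in 1..8")
    levels = 2 ** bits
    bounded = np.tanh(features)                      # squash each value into (-1, 1)
    scaled = (bounded + 1.0) / 2.0 * (levels - 1)    # linear scaling onto [0, levels - 1]
    return np.rint(scaled).astype(np.uint8)          # round to the nearest level index

def server_dequantize(indices, bits):
    """Server side (hypothetical): map indices back to values in [-1, 1].

    This recovers an approximation of tanh(features), not of the raw features.
    """
    levels = 2 ** bits
    return indices.astype(np.float32) / (levels - 1) * 2.0 - 1.0

def payload_bytes(num_values, bits):
    """Bytes needed to send num_values indices when they are bit-packed."""
    return (num_values * bits + 7) // 8

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 16)).astype(np.float32)  # stand-in for an embedding tensor
    idx = client_quantize(x, bits=2)
    x_hat = server_dequantize(idx, bits=2)
    print("max reconstruction error in tanh space:", float(np.max(np.abs(x_hat - np.tanh(x)))))
    print("bytes at 2 bits:", payload_bytes(x.size, 2), "vs float16:", x.size * 2)
```

With uniform levels on [-1, 1], the rounding error in the bounded space is at most half a step, that is 1 / (2**bits - 1). A trained version would need a gradient path through the rounding step, which this sketch omits.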

Paper Structure

This paper contains 27 sections, 4 theorems, 30 equations, 5 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Let $\Sigma_1, \Sigma_2$ denote two finite alphabets, and let $\Sigma_1^{*}$ and $\Sigma_2^{*}$ denote the sets of all finite words from those alphabets, respectively. Suppose that $X$ is a random variable taking values in $\Sigma_1$, and let $f$ be a uniquely decodable code from $\Sigma_1^{*}$ to $

Figures (5)

  • Figure 1: Overview of the model architecture: (a) vision tower for image encoding; (b) connector for modality alignment; (c) vision quantizer and language quantizer for feature compression and reconstruction; (d) large language model for downstream tasks.
  • Figure 2: The quantizer $Q_{\varphi}$ consists of a client part and a server part. The client maps intermediate features into low-bit integer indices, while the server reconstructs features from these indices for the subsequent model.
  • Figure 3: Model performance on 5 benchmarks at 1 to 5 bits. Our method achieves its best performance at 2 or 3 bits.
  • Figure 4: Model inference examples
  • Figure A1: Estimation of distribution and entropy across 8 batches
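
The link between the integer indices of Figure 2 and the entropy estimates of Figure A1 can be illustrated with a standard fact from source coding: the expected length of any uniquely decodable binary code for a discrete source is at least its Shannon entropy. The sketch below computes a plug-in (histogram) entropy estimate for a batch of indices. It is an assumption-laden illustration; the function name `empirical_entropy_bits` is hypothetical, and the paper's own estimator is not reproduced here.

```python
import numpy as np

def empirical_entropy_bits(indices, levels):
    """Plug-in estimate of Shannon entropy (bits per symbol) from observed indices.

    indices: array of non-negative integers smaller than `levels`.
    levels:  number of discrete representation levels (alphabet size).
    """
    flat = np.asarray(indices, dtype=np.int64).ravel()
    counts = np.bincount(flat, minlength=levels)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]                     # treat 0 * log(0) as 0
    return float(np.sum(nonzero * np.log2(1.0 / nonzero)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for quantized embeddings: 4 levels with a skewed distribution.
    batch = rng.choice(4, size=10_000, p=[0.6, 0.2, 0.15, 0.05])
    h = empirical_entropy_bits(batch, levels=4)
    print("estimated entropy (bits/symbol):", h)
    print("fixed-width cost (bits/symbol):", 2)
```

When the estimate falls well below the fixed bit-width, an entropy coder could in principle shrink the payload further; when it is close to the bit-width, the levels are being used almost uniformly.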

Theorems & Definitions (5)

  • Theorem 1
  • Theorem B1: Hjort and Jones, 1996 (doi:10.1214/aos/1032298288)
  • Theorem B2: Lebesgue Dominated Convergence Theorem
  • Theorem B3: Asymptotic Unbiasedness of Entropy Estimator
  • proof