SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context

Hongjun An; Yifan Chen; Zhe Sun; Xuelong Li

TL;DR

This work tackles the bottleneck of slow inference in token-by-token large language models by introducing SentenceVAE, a module that compresses each sentence into a single token and reconstructs it via a Sentence Decoder. When grafted into LLMs to form Sentence-level LLMs (SLLMs), the approach enables next-sentence prediction and reduces the number of tokens processed, improving speed and memory efficiency while preserving or enhancing accuracy. Empirical results on the Wanjuan dataset show substantial speedups (204–365%), better perplexity (to 46–75% of baseline), and large memory savings (86–91%) for equivalent context length, with scaling laws extending to larger models. The framework also anticipates further gains through architectural enhancements, edge-cloud deployment, embodied intelligence, and multimodal extensions, enabling longer contexts and more responsive AI systems.

Abstract

Current large language models (LLMs) primarily rely on next-token prediction for inference, which significantly limits their processing speed. In this paper, we introduce a novel inference methodology, termed next-sentence prediction, that aims to improve the inference efficiency of LLMs. We present the Sentence Variational Autoencoder (SentenceVAE), which consists of a Sentence Encoder that compresses the multiple tokens of a sentence into a single token and a Sentence Decoder that reconstructs the sentence from that token. By integrating SentenceVAE into the input and output layers of LLMs, we develop Sentence-level LLMs (SLLMs) that perform inference sentence by sentence. Because the SentenceVAE module segments the context into sentences, SLLMs preserve the integrity of the original semantic content, which improves accuracy while boosting inference speed. Moreover, compared to previous LLMs, SLLMs process fewer tokens over an equivalent context length, which significantly reduces the memory demands of self-attention computation and facilitates the handling of longer contexts. Extensive experiments on the Wanjuan dataset show that, compared to previous token-by-token methods, the proposed method accelerates inference by 204–365%, reduces perplexity (PPL) to 46–75% of its original value, and decreases memory overhead by 86–91% for an equivalent context length.
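
The claim that SLLMs process fewer positions over the same context can be illustrated with a small back-of-the-envelope sketch. It is not the authors' implementation: the regex sentence splitter, the use of whitespace-separated words as a stand-in for subword tokens, and the helper names `split_sentences`, `attention_entries`, and `compare` are all assumptions made for illustration. The sketch only counts sequence positions and the size of one full self-attention score matrix; it ignores the cost of the Sentence Encoder and Sentence Decoder themselves, so it does not reproduce the measured savings reported above.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def attention_entries(seq_len):
    """Number of entries in one full self-attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


def compare(text):
    """Compare sequence lengths seen by a token-level model and a sentence-level model."""
    sentences = split_sentences(text)
    # Whitespace-separated words stand in for subword tokens (assumption).
    n_tokens = sum(len(s.split()) for s in sentences)
    # A sentence-level model would see one position per sentence.
    n_sentences = len(sentences)
    return {
        "token_positions": n_tokens,
        "sentence_positions": n_sentences,
        "token_attention_entries": attention_entries(n_tokens),
        "sentence_attention_entries": attention_entries(n_sentences),
    }


example = "The cat sat on the mat. It was warm. Then it slept!"
stats = compare(example)
print(stats)
```

For this toy input, the token-level view has 12 positions and the sentence-level view has 3, so the score matrix shrinks from 144 entries to 9. The same quadratic relationship is why replacing many token positions with one position per sentence can reduce self-attention memory for a fixed amount of text.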

Paper Structure (16 sections, 10 equations, 3 figures, 4 tables)

Introduction
Related Work
Method
Sentence Variational Autoencoder (SentenceVAE)
Sentence-level Large Language Models (SLLMs)
Experiment
Experimental Setting
Sentence-level Tokens
Sentence-level LLMs (SLLMs)
Scaling Law of SLLMs
Conclusion & Future Trends
Scaling up SLLMs with Enhanced Architectures
SLLMs in Hybrid Edge-Cloud Inference
SLLMs and Embodied Intelligence
SLLMs in Multimodal Large Models
...and 1 more sections

Figures (3)

Figure 1: The schematic form of SentenceVAE. The encoder of SentenceVAE compresses the information contained in a sentence into a single token, and the decoder restores the compressed token to its original sentence form.
Figure 2: (a) The schematic form of published LLMs. (b) The schematic form of SLLMs, which are embedded with SentenceVAEs. Unlike the next-token inference employed by published LLMs, the proposed method adopts next-sentence prediction, which significantly reduces the number of inference iterations and the overall inference cost.
Figure 3: Scaling Law of (a) SLLMs and (b) SVAEs.
