TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation
Alfredo Garrachón Ruiz, Tomás de la Rosa, Daniel Borrajo
TL;DR
TRIM reduces the high inference cost of large language models by having the generator omit low-information words that are easy to infer from context, then reconstructing the full text from the distilled output with a smaller model. It introduces a word-inferability ranking based on $ΔP$ and builds NaLDA, a 44,800-entry dataset spanning five generation tasks, to evaluate reconstruction performance. Experiments show average token savings of around 18–19% with minimal loss of semantic meaning, and show that small reconstruction models such as T5 can recover the full text effectively, especially when GPT-4o is the generator. The approach is language-agnostic and has practical implications for scalable, cost-efficient generation in real-world deployments.
Abstract
The high inference cost of Large Language Models (LLMs) poses challenges, especially for tasks that require lengthy outputs. Natural language, however, often contains redundancy, which presents an opportunity for optimization. We observe that LLMs can generate distilled language (i.e., concise outputs that retain the essential meaning) when prompted appropriately. We propose TRIM, a pipeline for reducing computational cost in which, during inference, the LLM omits a predefined set of words that are semantically irrelevant and easily inferable from context. A smaller language model with lower inference cost, trained specifically for this task, then reconstructs the distilled answer into the ideal answer. Our experiments show promising results, particularly on the proposed NaLDA evaluation dataset, which focuses on the reconstruction task: GPT-4o saves 19.4% of tokens on average with only a small decrease in evaluation metrics. This suggests that the approach can effectively balance efficiency and accuracy in language processing tasks.
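To make the idea of distilled language concrete, the following minimal sketch drops a fixed set of words from a sentence and measures the fraction saved. It is an illustration under stated assumptions, not the authors' implementation: the omission set `OMIT_WORDS` is hypothetical, the function names `distill` and `saved_fraction` are invented here, and whitespace-separated words stand in for model tokens. In TRIM itself the LLM omits these words while generating rather than in a post-processing step.

```python
import re

# Hypothetical omission set, for illustration only; the paper derives its
# own list from a word-inferability ranking.
OMIT_WORDS = {"the", "a", "an", "of", "is", "are", "to"}


def distill(text, omit_words=OMIT_WORDS):
    """Drop every word whose lowercase, punctuation-free form is in omit_words."""
    kept = []
    for word in text.split():
        # Strip punctuation so that "The" and "the," both match "the".
        core = re.sub(r"[^\w]", "", word).lower()
        if core not in omit_words:
            kept.append(word)
    return " ".join(kept)


def saved_fraction(original, distilled):
    """Fraction of whitespace-separated words removed (a rough proxy for tokens)."""
    n_original = len(original.split())
    if n_original == 0:
        return 0.0
    return 1 - len(distilled.split()) / n_original


sentence = "The cat is on the mat."
short = distill(sentence)
print(short)                            # cat on mat.
print(saved_fraction(sentence, short))  # 0.5
```

A reconstruction model would then map the distilled string back to the full sentence. Real savings depend on the tokenizer and on which words are omitted, so the 50% in this toy case says nothing about the figures reported above.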
