TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation

Adam Filipek

TL;DR

TensorBLEU addresses the computational bottleneck of in-training evaluation by delivering a fully vectorized, per-sentence Token-ID BLEU implementation that runs on the GPU in PyTorch. It introduces a memory-efficient batched n-gram counting method using batch-specific dictionaries built with torch.unique, enabling parallel per-sentence scoring and a tensor_corpus_bleu variant. Compared with the CPU-based NLTK baseline, the approach achieves speedups of over 13x on an NVIDIA T4 and over 40x on an NVIDIA A100, making BLEU calculation a negligible overhead in training loops for RL-based fine-tuning. By clarifying its role as Token-ID BLEU for development and open-sourcing the code, TensorBLEU provides a scalable tool to accelerate research in computationally intensive NLP workflows.

Abstract

Modern natural language processing models have achieved unprecedented scale, yet the tools for their evaluation often remain a computational bottleneck, limiting the pace of research. This is particularly acute for in-training evaluation metrics, such as per-sentence reward signals in Reinforcement Learning, which must operate efficiently on batches of token IDs directly on the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch and introduces a memory-efficient counting mechanism. By creating a compact, batch-specific dictionary of n-grams using torch.unique, our method avoids the prohibitive memory costs of traditional hashing-based vectorization, making it practical for large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard library for token-ID-based BLEU calculation on the CPU. Experiments show that TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and over 40x on data-center-class hardware (NVIDIA A100). This performance transforms a significant bottleneck into a negligible part of the training loop. By clearly defining its role as a "Token-ID BLEU" for development purposes and open-sourcing our implementation, we provide a powerful tool for accelerating research in areas like RL-based model fine-tuning.
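
The batch-specific dictionary idea from the abstract can be illustrated with a small PyTorch sketch. This is not the authors' implementation: the function names batched_ngram_counts and clipped_precision are hypothetical, and the sketch assumes a single reference per candidate, equal-length sequences without padding, and no brevity penalty or smoothing. It only shows how torch.unique can replace a vocabulary-sized hash space with a dictionary whose size is the number of distinct n-grams actually present in the batch.

```python
import torch


def batched_ngram_counts(cand, ref, n):
    """Per-sentence n-gram counts over a batch-specific dictionary.

    cand, ref: LongTensors of shape (batch, seq_len) holding token IDs.
    Returns two (batch, num_unique) count tensors that share one column
    per distinct n-gram found anywhere in the batch.
    Assumption: no padding tokens and one reference per candidate.
    """
    batch = cand.size(0)

    # Slide a window of length n over every sentence:
    # shape (batch, num_ngrams, n).
    cand_ng = cand.unfold(1, n, 1)
    ref_ng = ref.unfold(1, n, 1)
    n_cand = cand_ng.size(1)

    # Flatten all n-grams of the batch and build the compact dictionary.
    # `inverse` maps each n-gram to its row index in `unique_ng`, so IDs
    # range over the n-grams seen in this batch, not vocab_size ** n.
    all_ng = torch.cat([cand_ng.reshape(-1, n), ref_ng.reshape(-1, n)], dim=0)
    unique_ng, inverse = torch.unique(all_ng, dim=0, return_inverse=True)
    num_unique = unique_ng.size(0)

    cand_ids = inverse[: batch * n_cand].view(batch, n_cand)
    ref_ids = inverse[batch * n_cand:].view(batch, -1)

    # Shift each sentence's IDs into its own range so that a single
    # bincount call yields separate counts for every sentence.
    offsets = torch.arange(batch, device=cand.device).unsqueeze(1) * num_unique

    def count(ids):
        flat = (ids + offsets).flatten()
        counts = torch.bincount(flat, minlength=batch * num_unique)
        return counts.view(batch, num_unique)

    return count(cand_ids), count(ref_ids)


def clipped_precision(cand, ref, n):
    """Modified (clipped) n-gram precision for each sentence pair."""
    cand_counts, ref_counts = batched_ngram_counts(cand, ref, n)
    # Clip each candidate count by the count of the same n-gram in the reference.
    clipped = torch.minimum(cand_counts, ref_counts).sum(dim=1)
    total = cand_counts.sum(dim=1)
    return clipped / total


cand = torch.tensor([[1, 2, 3, 4], [5, 5, 5, 5]])
ref = torch.tensor([[1, 2, 3, 9], [5, 5, 6, 6]])
bigram_precision = clipped_precision(cand, ref, n=2)
```

In this toy batch the first candidate shares two of its three bigrams with its reference, and the second candidate repeats the bigram (5, 5) three times while the reference contains it once, so clipping limits its credit to one match out of three. The same tensors can be moved to a GPU with .to("cuda") without changing the code, since every step is a tensor operation with no Python loop over sentences.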

Paper Structure

This paper contains 32 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Tests on a T4 GPU (Colab) with 256-token sentences
  • Figure 2: Tests on a T4 GPU (Colab) with 1024-token sentences
  • Figure 3: Tests on an A100 80GB GPU (and a stronger CPU) with 1024-token sentences