No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

Min Woo Sun, Alejandro Lozano, Javier Gamazo Tejero, Vishwesh Nath, Xiao Xiao Sun, James Burgess, Yuhui Zhang, Kun Yuan, Robert Tibshirani, Sean Huver, Serena Yeung-Levy

TL;DR

This work addresses the limitation that biomedical vision-language models truncate long captions by restricting text context to ~77 tokens. By expanding the text encoder context to up to 512 tokens, the authors reduce token waste and unlock richer supervision, introducing BIOMEDICA-LongCAP and the BMC-LongCLIP model. Across long-caption benchmarks (CXR and PMC), longer context yields substantial retrieval gains (up to +30 points in Recall@1) and improved zero-shot classification, while also enabling faster convergence. The approach demonstrates that long-context modeling is a promising direction for advancing biomedical vision-language understanding and retrieval in real-world, long-form textual corpora.

Abstract

Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces the truncation of long-format captions. Yet the distribution of biomedical captions from large-scale open-source literature reveals that a large portion of captions far exceeds 77 tokens. To this end, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of text encoders in VLMs. We find that longer context (and thus the additional supervision that long-format captions provide) correlates with better retrieval and classification performance. Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image-caption pairs enriched with context-aware descriptions from full-text articles, providing longer and additional textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a text encoder supporting windows of up to 512 tokens. Our model extends context capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves absolute gains of up to 30 percentage points in Recall@1 and average improvements of 2 percentage points in classification, while also converging faster than short-context models. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.
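To make the token-waste figure concrete, the following is a minimal sketch of how such a statistic could be computed, assuming waste is defined as the share of caption tokens that fall beyond the context cutoff. The function name `token_waste` and the example token counts are hypothetical and are not taken from the paper; the sketch also ignores any special tokens a real tokenizer would add.

```python
def token_waste(caption_lengths, cutoff):
    """Return (wasted_tokens, wasted_fraction) for a given context cutoff.

    caption_lengths: token count of each caption (hypothetical input).
    cutoff: maximum number of tokens the text encoder can see.
    """
    total = sum(caption_lengths)
    # Tokens past the cutoff are truncated and never reach the model.
    wasted = sum(max(0, n - cutoff) for n in caption_lengths)
    fraction = wasted / total if total else 0.0
    return wasted, fraction


# Hypothetical token counts for four captions (illustration only).
lengths = [40, 77, 150, 600]

wasted_77, frac_77 = token_waste(lengths, 77)
wasted_512, frac_512 = token_waste(lengths, 512)

# Ratio of the two context windows quoted in the abstract.
context_ratio = 512 / 77

print(f"cutoff 77:  {wasted_77} tokens wasted ({frac_77:.1%})")
print(f"cutoff 512: {wasted_512} tokens wasted ({frac_512:.1%})")
print(f"context ratio: {context_ratio:.2f}x")
```

On this toy input, raising the cutoff from 77 to 512 lowers the wasted share, which mirrors the direction of the 55% to 2.2% reduction reported for BIOMEDICA-6M.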

Paper Structure

This paper contains 19 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: (A) Distribution of BIOMEDICA-6M caption token usage with a cutoff of 77 tokens. The blue histogram represents tokens visible to the model, while the pink histogram represents wasted tokens truncated beyond the cutoff (434 million tokens, or 55% of total tokens). (B) Distribution with a cutoff of 512 tokens, showing substantially reduced token waste of 2.2% (17M tokens). (C) Qualitative examples of BIOMEDICA-6M and BIOMEDICA-LongCAP captions, showing truncated vs. full captions, as well as our enhanced captions.
  • Figure 2: Context-length ablation results of BMC-LongCLIP trained with 77, 154, and 512 tokens. (Left) Average retrieval performance (Recall@K) on the PMC long-caption benchmark. (Middle) Average retrieval performance on the CXR benchmark. (Right) Average zero-shot classification accuracy across biomedical datasets. Longer context improves retrieval and classification, with the largest gains on PMC.
  • Figure 3: Training loss curves across context lengths, illustrating that longer text windows accelerate convergence.
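
The Recall@K metric reported for the retrieval benchmarks can be illustrated with a minimal sketch. It assumes the common one-to-one setup in which query i has exactly one correct candidate, candidate i, and retrieval ranks candidates by a similarity score. The function name `recall_at_k` and the similarity values are hypothetical; the paper's own evaluation code may differ.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching candidate is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical 3x3 similarity matrix (illustration only).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]

r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

Here only the first query retrieves its match at rank 1, while all three matches appear within the top 2, so Recall@K is non-decreasing in K.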