No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

Min Woo Sun; Alejandro Lozano; Javier Gamazo Tejero; Vishwesh Nath; Xiao Xiao Sun; James Burgess; Yuhui Zhang; Kun Yuan; Robert Tibshirani; Sean Huver; Serena Yeung-Levy

TL;DR

Biomedical vision-language models typically restrict text context to ~77 tokens, which truncates long captions. By expanding the text encoder context to as many as 512 tokens, the authors reduce token waste and unlock richer supervision, and they introduce the BIOMEDICA-LongCAP dataset and the BMC-LongCLIP model. On long-caption benchmarks (CXR and PMC), longer context yields substantial retrieval gains (up to +30 points in Recall@1), improves zero-shot classification, and speeds up convergence. The results suggest that long-context modeling is a promising direction for biomedical vision-language understanding and retrieval over real-world, long-form text corpora.

Abstract

Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces the truncation of long-format captions. Yet the distribution of biomedical captions from large-scale open-source literature reveals that a large portion of captions far exceeds 77 tokens. To this end, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of text encoders in VLMs. We find that longer context, which makes the additional supervision in long-format captions available to the model, correlates with better retrieval and classification performance. Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image-caption pairs enriched with context-aware descriptions from full-text articles, providing longer and additional textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a text encoder supporting windows of up to 512 tokens. Our model extends context capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves up to +30% absolute gains in Recall@1 and +2% average improvements in classification, while also converging faster than short-context models. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.
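To make the "token waste" and "6.6x" figures concrete, here is a minimal sketch of how such quantities could be computed. It assumes token waste means the fraction of caption tokens that fall beyond the context window; the abstract does not spell out the exact definition, so this is an illustration rather than the authors' procedure. The function name `token_waste` and the caption lengths are hypothetical toy values, not data from the paper.

```python
def token_waste(caption_lengths, context_length):
    """Fraction of all caption tokens that are cut off by truncation.

    Assumed definition (not confirmed by the source): tokens beyond the
    context window, summed over captions, divided by the total token count.
    """
    total = sum(caption_lengths)
    if total == 0:
        return 0.0
    truncated = sum(max(0, n - context_length) for n in caption_lengths)
    return truncated / total


# Hypothetical caption lengths in tokens, made up for illustration only.
toy_lengths = [40, 77, 150, 300, 600]

short_waste = token_waste(toy_lengths, context_length=77)
long_waste = token_waste(toy_lengths, context_length=512)

# Ratio of the two window sizes named in the abstract (512 vs. 77 tokens).
expansion = 512 / 77

print(f"waste at 77 tokens:  {short_waste:.1%}")
print(f"waste at 512 tokens: {long_waste:.1%}")
print(f"context expansion:   {expansion:.1f}x")
```

The toy numbers will not match the paper's 55% and 2.2%, which depend on the real caption-length distribution. The expansion factor does follow directly from the two window sizes.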
