BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

Rawa Mohammed, Mina Attin, Bryar Shareef

TL;DR

This work addresses the generation of readable breast ultrasound (BUS) reports when no paired image–report data are available. It introduces BUSTR, a descriptor-aware multitask vision–language framework with three stages. First, it constructs supervisory reports in a zero-shot manner from structured descriptors and radiomics features. Second, it learns descriptor-guided visual representations with a Swin-based encoder. Third, it generates reports by aligning visual tokens with a frozen LLM through a dual loss that combines token-level cross-entropy with a cosine-similarity alignment term. Evaluated on BrEaST and BUS-BRA, BUSTR consistently improves standard NLG metrics and clinical efficacy, particularly for BI-RADS and pathology descriptors, and outperforms state-of-the-art baselines. The results suggest that structured descriptor supervision, combined with radiomics and a descriptor-aware encoder, can produce coherent, clinically faithful BUS narratives while reducing reliance on costly image–report datasets, which may support interpretable, workflow-friendly AI-assisted radiology reporting.

Abstract

Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR
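
The dual-level objective described in the abstract can be illustrated with a minimal sketch in plain Python. It assumes the two terms are simply summed, with the alignment term written as one minus the cosine similarity between a pooled input representation and a pooled output representation. The weighting factor `align_weight`, the pooling into single vectors, and all function names are assumptions made for illustration; the abstract does not specify them, and the released code may differ.

```python
import math


def token_cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of the target token at each position."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp over the vocabulary dimension.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total / len(target_ids)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero norms


def dual_loss(logits, target_ids, input_repr, output_repr, align_weight=1.0):
    """Token-level cross-entropy plus a cosine alignment penalty.

    align_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    ce = token_cross_entropy(logits, target_ids)
    align = 1.0 - cosine_similarity(input_repr, output_repr)
    return ce + align_weight * align


# One token position, vocabulary of 4, perfectly aligned representations.
loss = dual_loss([[0.0, 0.0, 0.0, 0.0]], [2], [1.0, 0.0], [1.0, 0.0])
print(round(loss, 4))
```

With uniform logits the cross-entropy term equals the log of the vocabulary size, and the alignment term vanishes when the two representations point in the same direction, so the two terms penalize different failure modes: wrong tokens and drifting representations.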

Paper Structure

This paper contains 20 sections, 15 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Overall BUSTR architecture. (a) Zero-shot construction of reports from BUS descriptors and radiomics features. (b) Multitask training of a descriptor-aware Swin vision encoder. (c) Report generation, where visual tokens and prompts are fused in a frozen LLM to produce BUS reports.
  • Figure 2: Qualitative comparison between top-performing models. Each predicted descriptor is shown in a different color. Highlighted text marks incorrect predictions.
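
Stage (a) of Figure 1, building a report from structured descriptors and radiomics features, can be pictured with a small sketch. The version below uses fixed sentence templates as a stand-in; it is not the paper's zero-shot construction procedure, which is not detailed in this summary. The dictionary keys (`birads`, `pathology`, `histology`), the radiomics feature name, and the sentence wording are all hypothetical.

```python
def build_report(descriptors, radiomics=None):
    """Render descriptor and radiomics dictionaries as a short narrative.

    Template-based stand-in; keys and phrasing are illustrative assumptions.
    """
    parts = []
    if "birads" in descriptors:
        parts.append(
            f"The lesion is assessed as BI-RADS category {descriptors['birads']}."
        )
    if "pathology" in descriptors:
        parts.append(f"Pathology is {descriptors['pathology']}.")
    if "histology" in descriptors:
        parts.append(f"Histology indicates {descriptors['histology']}.")
    # Radiomics values are appended in sorted order for reproducible text.
    for name, value in sorted((radiomics or {}).items()):
        parts.append(f"Radiomics feature {name} is {value:.2f}.")
    return " ".join(parts)


report = build_report(
    {"birads": "4a", "pathology": "malignant"},
    radiomics={"elongation": 0.5},
)
print(report)
```

Skipping absent keys lets the same function serve datasets with different descriptor sets, which matters here because BrEaST and BUS-BRA differ in the descriptors they provide.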