A Multimodal Approach Combining Structural and Cross-domain Textual Guidance for Weakly Supervised OCT Segmentation

Jiaqi Yang, Nitish Mehta, Xiaoling Hu, Chao Chen, Chia-Ling Tsai

TL;DR

This work tackles OCT lesion segmentation with only image-level labels by introducing a multimodal weakly supervised framework. It combines a dual visual pathway (primary image and retinal-layer–guided structure) with dual textual streams (label-informed CLIP guidance and per-image synthetic descriptions from BLIP) to generate high-quality pseudo labels. Using cross-attention between modalities and a multi-term loss, the approach yields state-of-the-art pseudo-label quality and competitive downstream segmentation on three OCT datasets. The results suggest that the method can reduce annotation burden while improving the localization of retinal lesions.

Abstract

Accurate segmentation of Optical Coherence Tomography (OCT) images is crucial for diagnosing and monitoring retinal diseases. However, the labor-intensive nature of pixel-level annotation limits the scalability of supervised learning with large datasets. Weakly Supervised Semantic Segmentation (WSSS) provides a promising alternative by leveraging image-level labels. In this study, we propose a novel WSSS approach that integrates structural guidance with text-driven strategies to generate high-quality pseudo labels, significantly improving segmentation performance. In terms of visual information, our method employs two processing modules that exchange raw image features and structural features from OCT images, guiding the model to identify where lesions are likely to occur. In terms of textual information, we utilize large-scale pretrained models from cross-domain sources to implement label-informed textual guidance and synthetic descriptive integration with two textual processing modules that combine local semantic features with consistent synthetic descriptions. By fusing these visual and textual components within a multimodal framework, our approach enhances lesion localization accuracy. Experimental results on three OCT datasets demonstrate that our method achieves state-of-the-art performance, highlighting its potential to improve diagnostic accuracy and efficiency in medical imaging.

Paper Structure

This paper contains 19 sections, 9 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Visual examples of our proposed multimodal framework's submodules. (a) Example of the proposed label-informed strategy, where X and X' represent the OCT image and structural input, respectively. (b) Synthetic text generation by BLIP, with a natural image shown for comparison; the pretrained BLIP recognizes the shape and object relationship, identifying a 'plane' over 'a hill.' (c) Example of the structural input with the cross-attention mechanism. (d) Illustration of CAMs and pseudo labels generated by our proposed method (green box), compared with pseudo labels from ResNet-50 (red box), and the ground truth (blue box) for reference. Best viewed in color.
  • Figure 2: Overview of the proposed framework for the pseudo label generation.
  • Figure 3: Visualization of pretrained models' outputs and their relationship to lesion localization. (a) Original OCT image. (b) Pretrained model output showing layer segmentation. (c) Pretrained model output showing the GAN-generated healthy counterpart. (d) Anomalous representation. (e) Ground truth annotation. Red circles highlight the relationship between lesions and retinal layers, demonstrating how the noisy information contributes to lesion localization.
  • Figure 4: Structure of the Cross-Attention for the primary features, where $H_s$, $W_s$, and $C_s$ denote the height, width, and number of channels of feature maps at stage $s$, respectively. $F^T_s$ and $F^P_s$ represent feature maps from the structural and primary encoders, respectively.
  • Figure 5: Visualization of common descriptions for each condition in the RESC dataset, displayed as word clouds and histograms of common word distributions. (a) and (b) show results for the healthy image collection using BLIP and ViT-GPT2 text generators, respectively. Similarly, (c) and (d) represent the SRF lesion collection, while (e) and (f) display the PED lesion collection for the two generators.
  • ...and 4 more figures
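
The Figure 4 caption describes cross-attention between primary-encoder features $F^P_s$ and structural-encoder features $F^T_s$ of shape $H_s \times W_s \times C_s$. A minimal sketch of that general idea follows, assuming plain scaled dot-product attention with queries taken from the primary features and keys and values taken from the structural features. The projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and the function name `cross_attention` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f_primary, f_struct, w_q, w_k, w_v):
    """Hypothetical cross-attention: primary features attend to structural ones.

    f_primary, f_struct: arrays of shape (H_s, W_s, C_s).
    w_q, w_k, w_v: (C_s, C_s) projection matrices (assumed, randomly set below).
    Returns the fused feature map and the attention matrix.
    """
    h, w, c = f_primary.shape
    # Flatten the spatial grid into H_s * W_s tokens of C_s channels each.
    q = f_primary.reshape(h * w, c) @ w_q   # queries from the primary encoder
    k = f_struct.reshape(h * w, c) @ w_k    # keys from the structural encoder
    v = f_struct.reshape(h * w, c) @ w_v    # values from the structural encoder
    # Scaled dot-product attention; each row is a distribution over positions.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = attn @ v
    # Residual connection (an assumption) keeps the original primary features.
    return f_primary + out.reshape(h, w, c), attn

rng = np.random.default_rng(0)
H_s, W_s, C_s = 4, 4, 8
f_p = rng.standard_normal((H_s, W_s, C_s))  # stand-in for F^P_s
f_t = rng.standard_normal((H_s, W_s, C_s))  # stand-in for F^T_s
w_q, w_k, w_v = (rng.standard_normal((C_s, C_s)) / np.sqrt(C_s) for _ in range(3))

fused, attn = cross_attention(f_p, f_t, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch each spatial position of the primary feature map is updated with a weighted mix of structural features, which is one way structural cues about retinal layers could steer where the primary pathway looks for lesions.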