Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis

Tingxuan Chen, Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy

TL;DR

The paper tackles surgical workflow analysis when annotated data is scarce. It introduces Surg-FTDA, a two-stage, text-driven adaptation framework: it first selects a small, diverse set of data anchors to align the vision and language modalities, then trains a decoder on text data alone and applies it to the aligned image embeddings to perform downstream tasks. Across discriminative tasks (phase and triplet recognition) and a generative task (image captioning), Surg-FTDA outperforms weakly supervised baselines and approaches fully supervised performance with far fewer image–label pairs, showing data efficiency and cross-task generalization. The approach builds on surgical foundation models and multi-task text decoding to scale across surgical workflow analysis tasks. The authors state that the code and dataset will be released on GitHub.

Abstract

Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data.

Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs.

Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks.

Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at https://github.com/CAMMA-public/Surg-FTDA
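
The two components described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a simple ridge-regressed linear map as the alignment step and a nearest-text-prototype classifier as a stand-in for the text-trained decoder. All embeddings are synthetic, and the names `fit_alignment`, `predict`, and `fake_image_embeddings` are hypothetical.

```python
import numpy as np

def fit_alignment(anchor_img, anchor_txt, ridge=1e-3):
    """Ridge-regress a linear map (W, b) from anchor image embeddings to their paired text embeddings."""
    mu_img, mu_txt = anchor_img.mean(axis=0), anchor_txt.mean(axis=0)
    X, Y = anchor_img - mu_img, anchor_txt - mu_txt
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
    b = mu_txt - mu_img @ W
    return W, b

def predict(emb, class_txt):
    """Stand-in for a text-only decoder: choose the class whose text embedding is most cosine-similar."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    proto = class_txt / np.linalg.norm(class_txt, axis=1, keepdims=True)
    return (emb @ proto.T).argmax(axis=1)

# Synthetic stand-ins for encoder outputs (assumption: no real foundation model is used here).
rng = np.random.default_rng(0)
d, n_classes = 8, 3
class_txt = np.eye(d)[:n_classes]                    # one text embedding per class label
rotation = np.linalg.qr(rng.normal(size=(d, d)))[0]  # random orthogonal matrix
offset = np.full(d, 0.5)                             # constant shift that mimics a modality gap

def fake_image_embeddings(labels):
    """Image embeddings = noisy text embeddings, rotated and shifted away from the text space."""
    noise = 0.02 * rng.normal(size=(len(labels), d))
    return (class_txt[labels] + noise) @ rotation + offset

anchor_labels = np.repeat(np.arange(n_classes), 4)   # 12 few-shot anchors with known labels
test_labels = np.repeat(np.arange(n_classes), 50)    # held-out images, labels used only for scoring
anchor_img = fake_image_embeddings(anchor_labels)
test_img = fake_image_embeddings(test_labels)

# Stage 1: fit the alignment on the anchors only.
W, b = fit_alignment(anchor_img, class_txt[anchor_labels])
aligned = test_img @ W + b

# Stage 2: apply the text-only classifier to the aligned image embeddings.
gap_before = np.linalg.norm(test_img - class_txt[test_labels], axis=1).mean()
gap_after = np.linalg.norm(aligned - class_txt[test_labels], axis=1).mean()
accuracy = (predict(aligned, class_txt) == test_labels).mean()
print(f"mean gap before: {gap_before:.3f}, after: {gap_after:.3f}, accuracy: {accuracy:.2f}")
```

The classifier never sees an image during construction; only the 12 anchor pairs connect the two modalities, which is the data-efficiency argument the abstract makes.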

Paper Structure

This paper contains 14 sections, 2 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: (a) Conventional adaptation of a multi-modal foundation model requires paired image-label data for training; (b) Our text-driven adaptation of the foundation model does not require a large number of image-label pairs to perform surgical workflow analysis.
  • Figure 2: (a) Few-shot data anchor selection based on the visual embedding space using KMeans or FPS; (b) The text-driven training and inference process of Surg-FTDA, demonstrating how the model applies text-based training to various tasks with minimal paired data.
  • Figure 3: Visualization of modality gap across foundation models using few-shot data anchor selection and modality alignment. Yellow points represent the image embedding vectors. Blue points represent the text embedding vectors.
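
The anchor selection step named in Figure 2(a) can be sketched as follows, assuming that "FPS" in the caption refers to farthest point sampling. This is a minimal illustration on synthetic 2-D points, not the authors' code; the function name `farthest_point_sampling` and the toy clusters are hypothetical.

```python
import numpy as np

def farthest_point_sampling(emb, k, start=0):
    """Greedily pick k row indices; each new pick maximizes its distance to the nearest earlier pick."""
    selected = [start]
    min_dist = np.linalg.norm(emb - emb[start], axis=1)
    for _ in range(k - 1):
        nxt = int(min_dist.argmax())
        selected.append(nxt)
        # Track each point's distance to its closest selected anchor.
        min_dist = np.minimum(min_dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected

# Toy "visual embeddings": three well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
cluster_id = np.repeat(np.arange(3), 20)
emb = centers[cluster_id] + 0.5 * rng.normal(size=(60, 2))

anchors = farthest_point_sampling(emb, k=3)
anchor_clusters = sorted(cluster_id[anchors].tolist())
print(anchors, anchor_clusters)
```

Because each pick is the point farthest from everything already chosen, the three anchors land in three different clusters, which is the kind of diversity a few-shot anchor set needs. A k-means variant would instead cluster the embeddings and take the sample closest to each centroid.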