VideoSAM: A Large Vision Foundation Model for High-Speed Video Segmentation

Chika Maduabuchi, Ericmoore Jossou, Matteo Bucci

TL;DR

Traditional CNNs generalize poorly when segmenting high-speed video (HSV) of boiling phenomena. This work introduces VideoSAM, a fine-tuned SAM variant trained on a diverse HSV frame-mask dataset for phase detection, and releases an open HSV segmentation dataset. VideoSAM outperforms SAM and U-Net on complex fluids (e.g., FC-72, Nitrogen, Argon) across multiple experiments, though simpler scenes like Water remain challenging. A two-stage architecture (CNN-based initial masks followed by transformer refinement) and a patch-based inference pipeline enable cross-modality generalization, offering a robust tool for HSV analysis with potential impact on boiling research and other high-speed imaging tasks.

Abstract

High-speed video (HSV) segmentation is essential for analyzing dynamic physical processes in scientific and industrial applications, such as boiling heat transfer. Existing models like U-Net struggle to generalize and to segment complex bubble formations accurately. We present VideoSAM, a specialized adaptation of the Segment Anything Model (SAM), fine-tuned on a diverse HSV dataset for phase detection. Across multiple experiments, VideoSAM demonstrates superior performance in four fluid environments -- Water, FC-72, Nitrogen, and Argon -- significantly outperforming U-Net in complex segmentation tasks. In addition to introducing VideoSAM, we contribute an open-source HSV segmentation dataset designed for phase detection, enabling future research in this domain. Our findings underscore VideoSAM's potential to set new standards in robust and accurate HSV segmentation. The code and dataset used in this study are available online at https://github.com/chikap421/videosam.

Paper Structure

This paper contains 22 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Illustration of the VideoSAM model architecture and integration with U-Net CNN. The initial segmentation masks generated by fine-tuned U-Net models for each modality are paired with their respective images and fed into the VideoSAM transformer. The image encoder and mask decoder process these inputs to refine the segmentation, leveraging the pre-trained SAM components for HSV segmentation.
  • Figure 2: Left: Original high-speed video frames, randomly sampled from the large training dataset. The images illustrate the difference between the modalities of water (image 114) and gas (image 654). Notice the difference in bubble footprints, with gases exhibiting more bubbles with complex shapes compared to water. Right: Visualization of the patched images resulting from the patchification process, which divides each original image into smaller patches for detailed analysis.
  • Figure 3: Combined results of Experiment 1: Qualitative and quantitative analysis of VideoSAM's zero-shot generalization performance.
  • Figure 4: Combined table and figure layout comparing the performance of U-Net, VideoSAM, and SAM across different datasets.
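
The patchification step mentioned in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the patch size, the non-overlapping row-major tiling, and the function names `patchify` and `unpatchify` are all assumptions made for illustration, since this overview does not give those details.

```python
import numpy as np


def patchify(frame, patch):
    """Split a 2-D frame into non-overlapping patch x patch tiles (row-major order).

    Hypothetical helper: the tiling scheme and patch size are assumptions,
    not taken from the VideoSAM code.
    """
    h, w = frame.shape
    if h % patch or w % patch:
        raise ValueError("frame dimensions must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch) -> (rows*cols, patch, patch)
    tiles = frame.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    return tiles.reshape(rows * cols, patch, patch)


def unpatchify(tiles, shape):
    """Reassemble tiles produced by patchify into a frame (or mask) of the given shape."""
    h, w = shape
    patch = tiles.shape[1]
    rows, cols = h // patch, w // patch
    grid = tiles.reshape(rows, cols, patch, patch).swapaxes(1, 2)
    return grid.reshape(h, w)


# Usage: tile a synthetic 8 x 12 "frame" into 4 x 4 patches and stitch it back.
frame = np.arange(8 * 12).reshape(8, 12)
tiles = patchify(frame, 4)
restored = unpatchify(tiles, frame.shape)
```

In a patch-based inference pipeline, each tile would be segmented independently and the per-tile masks stitched back with the inverse operation. Overlapping tiles with blending are a common alternative when seams at patch borders matter.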