Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment

Mounvik K, N Harshit

TL;DR

Web-Scale Multimodal Summarization tackles real-time, web-scale retrieval and synthesis of text and images for topic-driven summaries. It combines DuckDuckGo-based web/news/image retrieval with a locally fine-tuned CLIP model for semantic alignment and optional BLIP captions to ground images. The system is configurable and exposed via a Gradio API, enabling adjustable fetch limits, thresholds, and export formats. Evaluations on 500 image-caption pairs show ROC-AUC 0.9270 and accuracy 96.99%, supporting strong cross-modal alignment and effective summarization. This work provides a deployable, transparent pipeline integrating language, retrieval, and vision models for scalable, multimodal summarization.

Abstract

We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining text and image data retrieved from web sources. Given a user-defined topic, the system performs parallel web, news, and image searches. Retrieved images are ranked with a fine-tuned CLIP model that measures their semantic alignment with the topic and the retrieved text. Optional BLIP captioning enables image-only summaries for stronger multimodal coherence. The pipeline supports adjustable fetch limits, semantic filtering, summary styling, and download of structured outputs. We expose the system via a Gradio-based API with controllable parameters and preconfigured presets. Evaluation on 500 image-caption pairs with 20:1 contrastive negatives yields a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment. This work provides a configurable, deployable tool for web-scale summarization that integrates language, retrieval, and vision models in a user-extensible pipeline.
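
The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes the text and image embeddings have already been produced by a CLIP-style encoder and are passed in as plain arrays. The function name `rank_images_by_alignment`, the `threshold` default of 0.25, and the `top_k` default are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np


def rank_images_by_alignment(text_emb, image_embs, threshold=0.25, top_k=5):
    """Rank images by cosine similarity to a text embedding.

    text_emb:   1-D array, embedding of the topic or summary text.
    image_embs: 2-D array, one row per candidate image embedding.
    threshold:  hypothetical cutoff; images scoring below it are dropped.
    top_k:      hypothetical cap on the number of images returned.
    Returns a list of (image_index, score) pairs, best first.
    """
    text_emb = np.asarray(text_emb, dtype=np.float64)
    image_embs = np.asarray(image_embs, dtype=np.float64)

    # Normalize to unit length so the dot product equals cosine similarity.
    text_unit = text_emb / np.linalg.norm(text_emb)
    image_units = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)

    scores = image_units @ text_unit

    # Sort by descending score, then apply the semantic filter and the cap.
    order = np.argsort(-scores)
    kept = [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]
    return kept[:top_k]


# Toy 2-D embeddings: image 0 matches the text, image 1 is orthogonal,
# and image 2 is partially aligned.
ranked = rank_images_by_alignment([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In this toy case the orthogonal image falls below the threshold and is filtered out, while the other two are returned in descending order of similarity. In a real pipeline the embeddings would come from the fine-tuned CLIP model, and the threshold would correspond to the adjustable semantic filter mentioned above.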

Paper Structure (19 sections, 1 figure)

Figures (1)

  • Figure 1: Normalized confusion matrix for the alignment model. The high diagonal scores (0.759 and 0.977) show strong performance, contributing to the 96.99% model accuracy.
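
Accuracy can be hard to interpret under the 20:1 negative sampling described in the abstract, so the reported F1-score is a useful complement. The sketch below shows how accuracy, precision, recall, and F1 follow from raw confusion-matrix counts. It assumes 500 positive pairs and 10,000 sampled negatives; these exact counts are an assumption inferred from the 20:1 ratio, not figures stated in the source, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed class sizes: 500 positives and 10,000 negatives (20:1 ratio).
# A trivial classifier that labels every pair "not aligned":
baseline = binary_metrics(tp=0, fp=0, fn=500, tn=10_000)
print(f"baseline accuracy = {baseline['accuracy']:.4f}, F1 = {baseline['f1']:.4f}")
```

Under these assumed counts, the all-negative baseline already reaches about 95.2% accuracy with an F1 of 0, which is why the positive-class diagonal entry and the F1-score say more about alignment quality than accuracy does.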