Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation

Jonathan Attard, Dylan Seychell

TL;DR

This work tackles automated segmentation of news videos into five scene types by comparing image-based, video-based, and audio-based classifiers, including ResNet, ViViT, AST, and multimodal fusion. On a custom dataset of 41 news videos with 1,832 labelled clips, the image-based ResNet achieves the highest accuracy of $84.34\%$, outperforming temporal video models while using far fewer computational resources. Binary detectors for transitions and advertisements reach $94.23\%$ and $92.74\%$ accuracy, respectively, illustrating the strength of certain recurring cues. The study underscores the practical viability of image-based approaches for content organisation tasks such as archiving and search, while also highlighting resource challenges and the need for further work on scalable multimodal methods.

Abstract

News videos require efficient content organisation and retrieval systems, but their unstructured nature poses significant challenges for automated processing. This paper presents a comparative analysis of image, video, and audio classifiers for automated news video segmentation. We develop and evaluate multiple deep learning approaches, including ResNet, ViViT, AST, and multimodal architectures, to classify five distinct segment types: advertisements, stories, studio scenes, transitions, and visualisations. Using a custom-annotated dataset of 41 news videos comprising 1,832 scene clips, our experiments demonstrate that image-based classifiers achieve superior performance (84.34% accuracy) compared to more complex temporal models. Notably, the ResNet architecture outperformed state-of-the-art video classifiers while requiring significantly fewer computational resources. Binary classification models achieved high accuracy for transitions (94.23%) and advertisements (92.74%). These findings advance the understanding of effective architectures for news video segmentation and provide practical insights for implementing automated content organisation systems in media applications, including media archiving, personalised content delivery, and intelligent video search.

Paper Structure

This paper contains 9 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Visualisation of an audio-visual representation, demonstrating how multimodal models can jointly capture additional features for a richer representation (Source: wu2023newsnet)
  • Figure 2: Example frame samples corresponding to each of the five scene classification labels: Advertisement, Story, Studio, Transition, and Visualisation.
  • Figure 3: Interface used for video annotation, showing a fully annotated video. Users could step frame-by-frame for accuracy, jump, highlight, and label different sections while viewing and listening to the video.
  • Figure 4: Multimodal architecture showing the vision and audio models combined through a fusion layer.
  • Figure 5: Sample confusion matrices of the results generated from the ResNet and the AST models. The confusion matrices not shown closely resemble the pattern of the ResNet model.
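
The accuracy figures and confusion matrices mentioned above can be related with a small sketch. This is a minimal illustration in plain Python, not the authors' evaluation code: the helper names `confusion_matrix` and `accuracy` are hypothetical, and the toy predictions are made up purely to show the mechanics. Only the five label names come from the paper.

```python
# The five scene labels named in the paper.
LABELS = ["Advertisement", "Story", "Studio", "Transition", "Visualisation"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Return matrix[i][j] = number of clips with true label i predicted as label j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Overall accuracy: correctly classified clips (the diagonal) over all clips."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy, made-up predictions for illustration only (NOT results from the paper).
y_true = ["Story", "Story", "Studio", "Transition", "Advertisement", "Visualisation"]
y_pred = ["Story", "Studio", "Studio", "Transition", "Advertisement", "Story"]

cm = confusion_matrix(y_true, y_pred)
for label, row in zip(LABELS, cm):
    print(f"{label:>13}: {row}")
print(f"accuracy = {accuracy(cm):.4f}")
```

Rows correspond to true labels and columns to predicted labels, so off-diagonal entries show which scene types are confused with one another. That is the kind of pattern a figure such as Figure 5 is meant to expose.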