A multimodal deep learning architecture for smoking detection with a small data approach

Robert Lakatos, Peter Pollner, Andras Hajdu, Tamas Joo

TL;DR

This work tackles the challenge of detecting covert tobacco advertising across text and images when labeled data are scarce. It presents a data-efficient, multimodal deep learning pipeline that combines CLIP-based filtering and an EfficientNet B5 image classifier with multilingual text analysis via XLM-RoBERTa large, supplemented by a ChatGPT-generated Hungarian smoking corpus and a human reinforcement loop. Key findings are 62% accuracy for multimodal filtering, 74% overall accuracy when fused in an ensemble, and 98% accuracy with 0.91 F1 on Hungarian text classification, demonstrating feasibility in low-resource settings. The approach offers a practical, scalable means to quantify tobacco-related media content and supports policy-relevant monitoring, with potential extensions to temporal localization via object detectors and to generative data enhancements.

Abstract

Introduction: Covert tobacco advertisements often prompt regulatory measures. This paper shows that artificial intelligence, particularly deep learning, has great potential for detecting hidden advertising and allows unbiased, reproducible, and fair quantification of tobacco-related media content. Methods: We propose an integrated text and image processing model based on deep learning, generative methods, and human reinforcement, which can detect smoking cases in both textual and visual formats, even with little available training data. Results: Our model achieves 74% accuracy for images and 98% for text. Furthermore, our system allows expert intervention in the form of human reinforcement. Conclusions: Using the pre-trained multimodal, image, and text processing models available through deep learning makes it possible to detect smoking in different media even with little training data.

Paper Structure (11 sections, 3 figures, 2 tables)

This paper contains 11 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Schematic flow diagram of the architecture.
  • Figure 2: The cosine similarity of the images obtained from the video recording in chronological order.
  • Figure 3: The images ordered by their cosine similarity values.
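
The cosine-similarity ordering referred to in Figures 2 and 3 can be illustrated with a minimal sketch. It assumes each video frame has already been mapped to an embedding vector and is compared against a single query embedding (for example, one derived from a smoking-related prompt). The vectors below are hypothetical toy values, and the function names are illustrative rather than taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_frames(frame_embeddings, query_embedding):
    """Score frames in chronological order, then sort indices by score.

    Returns (scores, order): `scores[i]` is the similarity of frame i,
    and `order` lists frame indices from most to least similar.
    """
    scores = [cosine_similarity(f, query_embedding) for f in frame_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return scores, order


# Hypothetical 3-dimensional embeddings for four frames, in chronological order.
# Real embeddings from a model such as CLIP would have hundreds of dimensions.
FRAMES = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
]
QUERY = [1.0, 0.0, 0.0]

scores, order = rank_frames(FRAMES, QUERY)
for idx in order:
    print(f"frame {idx}: similarity {scores[idx]:.3f}")
```

Here `scores` corresponds to the chronological view and `order` to the sorted view. A threshold on the score could then decide which frames are passed on to a downstream classifier, though where to set it is a design choice this sketch leaves open.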