Multimodal Chain-of-Thought Reasoning via ChatGPT to Protect Children from Age-Inappropriate Apps

Chuanbo Hu; Bin Liu; Minglei Yin; Yilu Zhou; Xin Li

TL;DR

This work tackles the problem of accurately rating mobile apps for children by content maturity. It introduces a CoT-endowed multimodal framework using GPT-4V to analyze both app descriptions and screenshots, ranking screenshots by detected maturity content and intensity, then fusing top visuals with textual data to assign one of four maturity levels $\{4+, 9+, 12+, 17+\}$. Across a dataset of $1{,}281$ App Store apps, the proposed method outperforms description-only, image-only, and other multimodal baselines, with Selective CoT Fusion delivering the best results. The findings demonstrate that explicitly incorporating Chain-of-Thought reasoning into multimodal evaluation improves reliability and precision in content rating, contributing to safer digital environments for children. Limitations include a relatively small GPT-4V-enabled dataset and potential biases in visual cues, suggesting future work on larger datasets and refined CoT prompts to further enhance performance and generalizability.

Abstract

Mobile applications (apps) can expose children to inappropriate themes such as sexual content, violence, and drug use. Maturity ratings offer a quick and effective way for potential users, particularly guardians, to assess the maturity level of an app. Determining accurate maturity ratings for mobile apps is essential to protect children's health in today's saturated digital marketplace. Existing approaches to maturity rating are either inaccurate (e.g., self-reported ratings by developers) or costly (e.g., manual examination). In the literature, there are a few text-mining-based approaches to maturity rating. However, each app typically involves multiple modalities, namely the textual app description and the screenshot images. In this paper, we present a framework for determining app maturity levels that utilizes multimodal large language models (MLLMs), specifically ChatGPT-4 Vision. Powered by Chain-of-Thought (CoT) reasoning, our framework systematically leverages ChatGPT-4 to process multimodal app data (i.e., textual descriptions and screenshots) and guides the MLLM through a step-by-step reasoning pathway from initial content analysis to final maturity rating determination. By explicitly incorporating CoT reasoning, our framework enables ChatGPT to better understand and apply maturity policies when assigning maturity ratings. Experimental results indicate that the proposed method outperforms all baseline models and other fusion strategies.

Paper Structure (20 sections, 3 equations, 3 figures, 8 tables)

Introduction
Background and Literature Review
Mobile App Maturity Rating
Text Mining Based App Maturity Rating
Multimodal Large Language Models
Chain-of-Thought Reasoning Based on Large Language Model
Methods
Problem Formulation
Overview of the Proposed Framework
Chain-of-Thought (CoT) Endowed Prompting for App Maturity Rating
Experiments
Experimental Setup
Comparison Against Baseline Methods
Comparative Analysis of Different Multimodal Fusion Strategies for App Maturity Rating
Impact of Chain-of-Thought (CoT) Reasoning on Maturity Rating
...and 5 more sections

Figures (3)

Figure 1: Illustration of the proposed framework for app maturity rating using a multimodal large language model (GPT-4V) with chain-of-thought (CoT) reasoning. The inputs (app descriptions and screenshots) are denoted in red font, while outputs (maturity ratings) are shown in blue font. Requests to GPT-4V are denoted with red arrows, and responses from GPT-4V are marked with blue arrows. Maturity content and intensity extraction aims to identify and rank the screenshots according to the existence and intensity of maturity content. The final maturity rating is determined by combining the top screenshot(s) and the textual description.
Figure 2: Illustration of chain-of-thought (CoT) reasoning endowed prompting for app maturity rating using an app with two screenshots and a description as an example. The intermediate steps in the CoT reasoning include (a) maturity content prompt and (b) maturity intensity prompt to identify and rank the screenshots according to the maturity rating policy. Finally, (c) maturity rating prompt combines screenshot(s) and textual description to generate a maturity rating score. Requests to GPT-4V are highlighted in red font, and responses are in blue.
Figure 3: Confusion matrix of the proposed CoT endowed MLLM framework for app maturity rating.
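
The three prompting steps named in the Figure 2 caption (maturity content, maturity intensity, maturity rating) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the 0-3 intensity scale, the reply parsing, the `top_k` parameter, and the `ask` callable (a stand-in for whatever client sends a prompt plus images to GPT-4V) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# The four maturity levels named in the TL;DR.
RATINGS = ("4+", "9+", "12+", "17+")

# Hypothetical prompts; the paper's actual prompt text is not reproduced here.
CONTENT_PROMPT = "Does this screenshot contain mature content? Answer yes or no."
INTENSITY_PROMPT = "Rate the intensity of the mature content from 0 to 3."
RATING_PROMPT = (
    "App description: {description}\n"
    "Using the description and the attached screenshot(s), "
    "assign one rating from 4+, 9+, 12+, 17+."
)

# Assumed model interface: (prompt text, image references) -> reply text.
AskModel = Callable[[str, Sequence[str]], str]


@dataclass
class ScreenshotScore:
    image: str
    has_mature_content: bool
    intensity: int  # assumed 0-3 scale


def parse_intensity(reply: str) -> int:
    """Use the first digit in the reply, capped at 3; default to 0."""
    for ch in reply:
        if ch.isdigit():
            return min(int(ch), 3)
    return 0


def parse_rating(reply: str) -> str:
    """Return the rating label that appears earliest in the reply."""
    found = [(reply.find(label), label) for label in RATINGS if label in reply]
    if not found:
        raise ValueError(f"no rating label found in reply: {reply!r}")
    return min(found)[1]


def rank_screenshots(ask: AskModel, screenshots: Sequence[str]) -> list[ScreenshotScore]:
    """Steps (a) and (b): detect mature content, then score its intensity."""
    scored = []
    for image in screenshots:
        has_content = ask(CONTENT_PROMPT, [image]).strip().lower().startswith("yes")
        intensity = parse_intensity(ask(INTENSITY_PROMPT, [image])) if has_content else 0
        scored.append(ScreenshotScore(image, has_content, intensity))
    # Screenshots with mature content and higher intensity come first.
    return sorted(scored, key=lambda s: (s.has_mature_content, s.intensity), reverse=True)


def rate_app(ask: AskModel, description: str, screenshots: Sequence[str], top_k: int = 1) -> str:
    """Step (c): fuse the top-ranked screenshot(s) with the description."""
    top_images = [s.image for s in rank_screenshots(ask, screenshots)[:top_k]]
    reply = ask(RATING_PROMPT.format(description=description), top_images)
    return parse_rating(reply)
```

Passing the model in as a callable keeps the sketch independent of any particular client library, and it lets the ranking and fusion logic be exercised with a stub before any real requests are made.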
