Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering

Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu

TL;DR

This work evaluates GPT-4o for zero-shot multi-modal video classification on seven TikTok-focused quality categories, using vision-language prompts and a policy-driven framework. It demonstrates that prompt design, particularly shorter policy prompts and decomposition-aggregation prompting for complex categories like Clickbait, can achieve competitive or superior performance without fine-tuning. The study provides a practical, scalable approach for industry-grade content moderation and offers insights into how prompt engineering can bridge gaps between general-purpose LLMs and task-specific moderation needs. Limitations include not testing newer reasoning models and not incorporating the audio modality, suggesting future directions toward hybrid models with fine-tuning and richer multi-modal integration.

Abstract

In this study, we tackle industry challenges in video content classification by exploring and optimizing GPT-based models for zero-shot classification across seven critical categories of video quality. We contribute a novel approach to improving GPT's performance through prompt optimization and policy refinement, demonstrating that simplifying complex policies significantly reduces false negatives. Additionally, we introduce a new decomposition-aggregation-based prompt engineering technique, which outperforms traditional single-prompt methods. These experiments, conducted on real industry problems, show that thoughtful prompt design can substantially enhance GPT's performance without additional fine-tuning, offering an effective and scalable solution for improving video classification.

Paper Structure

This paper contains 11 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: For a fair comparison across categories, we design the experiment so that the item-specific user prompt is independent of the category's policy, while the system prompt provided to GPT-4o incorporates that policy. In our experiments, given a {CATEGORY} and {POLICY} along with a corresponding dataset containing at least the item_id and ground truth label, we ask GPT-4o to output a classification prediction together with its reasoning and a score.
  • Figure 2: This figure shows the precision-recall curves obtained from the score provided by GPT-4o for all categories. Since all datasets are balanced to contain equal numbers of positive and negative cases, they share the same lower bound (0.5).
  • Figure 3: These charts plot the normalized (0-1) distribution of scores (x-axis) provided by GPT-4o, where the dashed red line is the classification threshold we choose, orange and black depict negative and positive cases respectively, and the y-axis uses a logarithmic scale. We show the results for Non-interactive duet in (a), Sensitive and Mature Content in (b), and a shortened version of Sensitive and Mature Content in (c), which uses 96 words in the policy instead of 4023. We also report the total portion of False Negatives (FN) in the dataset, which are positive cases (black) classified as negative (left of the threshold), and the total portion of False Positives (FP), which are negative cases (orange) classified as positive (right of the threshold).
  • Figure 4: This figure shows the precision-recall curves for the Clickbait category when using both the original prompting technique (GPT-4o-single) and the proposed prompting technique, which asks GPT-4o to provide a score for each category (GPT-4o-multi). To produce a final score for GPT-4o-multi, we consider the mean, the max, and the best linear regression of the category scores. We additionally plot the corresponding production model (Baseline).
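
The captions for Figures 3 and 4 describe two small computations: collapsing several per-category scores into one final score (mean or max), and counting the portions of false negatives and false positives on either side of a threshold. The following is a minimal Python sketch of both, not the authors' implementation. The score names (`misleading_title`, `exaggerated_thumbnail`), the example values, the 0.5 threshold, and the convention that a score at or above the threshold counts as positive are all assumptions made for illustration. The linear-regression aggregation mentioned in Figure 4 is not shown.

```python
from statistics import mean


def aggregate(sub_scores, how="mean"):
    """Collapse per-category scores (each assumed in [0, 1]) into one score."""
    values = list(sub_scores.values())
    if not values:
        raise ValueError("need at least one score to aggregate")
    if how == "mean":
        return mean(values)
    if how == "max":
        return max(values)
    raise ValueError(f"unknown aggregation: {how}")


def error_portions(scores, labels, threshold):
    """Return (FN portion, FP portion) relative to the whole dataset.

    Assumed convention: score >= threshold is classified as positive.
    """
    n = len(scores)
    # Positive cases that fall left of the threshold
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    # Negative cases that fall on or right of the threshold
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return fn / n, fp / n


# Hypothetical per-category scores for four videos (illustrative values only)
videos = [
    {"misleading_title": 0.9, "exaggerated_thumbnail": 0.2},
    {"misleading_title": 0.1, "exaggerated_thumbnail": 0.7},
    {"misleading_title": 0.1, "exaggerated_thumbnail": 0.1},
    {"misleading_title": 0.6, "exaggerated_thumbnail": 0.0},
]
labels = [1, 1, 0, 0]  # balanced: two positive, two negative cases

mean_scores = [aggregate(v, "mean") for v in videos]
max_scores = [aggregate(v, "max") for v in videos]

mean_fn, mean_fp = error_portions(mean_scores, labels, threshold=0.5)
max_fn, max_fp = error_portions(max_scores, labels, threshold=0.5)
print(f"mean: FN={mean_fn:.2f} FP={mean_fp:.2f}")
print(f"max:  FN={max_fn:.2f} FP={max_fp:.2f}")
```

On this toy data the mean aggregation misses one positive case while the max aggregation flags one negative case, which illustrates why the choice of aggregation shifts the balance between false negatives and false positives. It says nothing about which aggregation performs better on the real Clickbait dataset.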