Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering
Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu
TL;DR
This work evaluates GPT-4o for zero-shot multi-modal video classification on seven TikTok-focused quality categories, using vision-language prompts and a policy-driven framework. It demonstrates that prompt design, particularly shorter policy prompts and decomposition-aggregation prompting for complex categories such as Clickbait, can achieve competitive or superior performance without fine-tuning. The study provides a practical, scalable approach to industry-grade content moderation and offers insights into how prompt engineering can bridge the gap between general-purpose LLMs and task-specific moderation needs. Limitations include not testing newer reasoning models and not incorporating the audio modality, which points to future work on hybrid models that combine fine-tuning with richer multi-modal integration.
Abstract
In this study, we tackle industry challenges in video content classification by exploring and optimizing GPT-based models for zero-shot classification across seven critical categories of video quality. We contribute a novel approach to improving GPT's performance through prompt optimization and policy refinement, demonstrating that simplifying complex policies significantly reduces false negatives. Additionally, we introduce a new decomposition-aggregation-based prompt engineering technique, which outperforms traditional single-prompt methods. These experiments, conducted on real industry problems, show that thoughtful prompt design can substantially enhance GPT's performance without additional fine-tuning, offering an effective and scalable solution for improving video classification.
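To make the contrast between single-prompt and decomposition-aggregation prompting concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a hypothetical `ask` callable that sends a text prompt (together with whatever video frames the caller attaches) to a model and returns its reply. The sub-questions and the "any sub-question answered yes" aggregation rule are illustrative assumptions; this section does not specify the actual decomposition or aggregation used.

```python
from typing import Callable, Dict, List

# Hypothetical sub-questions for a "Clickbait" policy. The actual
# decomposition used in the study is not given in this section.
CLICKBAIT_SUBQUESTIONS: List[str] = [
    "Does the caption promise content that the video does not deliver?",
    "Does the video withhold a promised reveal to prolong watch time?",
    "Does the thumbnail or title exaggerate or mislead about the content?",
]


def is_yes(reply: str) -> bool:
    """Interpret a free-text model reply as a yes/no decision."""
    return reply.strip().lower().startswith("yes")


def classify_single_prompt(ask: Callable[[str], str], policy: str) -> bool:
    """Baseline: one prompt that contains the full policy text."""
    prompt = (
        f"Policy:\n{policy}\n\n"
        "Does the video violate this policy? Answer yes or no."
    )
    return is_yes(ask(prompt))


def classify_decomposed(
    ask: Callable[[str], str], subquestions: List[str]
) -> Dict[str, object]:
    """Decomposition-aggregation: ask each sub-question separately,
    then combine the individual answers into one label."""
    answers = {q: is_yes(ask(f"{q} Answer yes or no.")) for q in subquestions}
    # Assumed aggregation rule: flag the video if any sub-question is positive.
    return {"label": any(answers.values()), "answers": answers}
```

In practice `ask` would wrap a call to a vision-language model; keeping it as a plain callable lets the aggregation logic be checked with a stub before any model is involved. Returning the per-question answers alongside the label also makes it possible to see which sub-criterion triggered a positive decision.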
