FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
Siddhant Sukhani, Yash Bhardwaj, Riya Bhadani, Veer Kejriwal, Michael Galarnyk, Sudheer Chava
TL;DR
This work tackles the challenge of generating topic-aligned captions for financial short-form videos (SVs), where dense visual cues like tickers and charts coexist with audio and transcripts. It evaluates all seven non-empty modality configurations (T, A, V, TA, TV, AV, TAV) across five tasks using 624 SVs from VideoConviction, employing F1 for ticker–action extraction and a domain-specific G-VEval metric. The study provides the first baselines for financial SV captioning, revealing that selective modality fusion can outperform full tri-modal fusion and that the importance of each modality varies by task, with vision often serving as the anchor. These findings offer a foundation for robust, domain-aware multimodal grounding in finance and guide future research toward task-specific modality design and evaluation. Code and data are available on GitHub under the CC-BY-NC-SA 4.0 license.
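To make the F1 scoring of ticker–action extraction concrete, the sketch below computes a set-based F1 over (ticker, action) pairs. It is an illustration only: the function name `pair_f1`, the exact-match criterion, the treatment of empty sets, and the example tickers and actions are all assumptions, not the authors' evaluation code.

```python
def pair_f1(predicted, gold):
    """Set-based F1 over (ticker, action) pairs, using exact matching.

    Assumption: a prediction counts only if both ticker and action match.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # assumption: nothing to find and nothing predicted is perfect
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of three predictions matches one of two gold pairs.
gold = {("AAPL", "buy"), ("TSLA", "sell")}
predicted = {("AAPL", "buy"), ("TSLA", "hold"), ("NVDA", "buy")}
score = pair_f1(predicted, gold)
print(round(score, 2))  # precision 1/3, recall 1/2 -> 0.4
```

Under this exact-match assumption, a correct ticker with the wrong action earns no credit, so the score reflects the full recommendation rather than entity detection alone.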
Abstract
We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of five topics, underscoring its value for capturing visual context and affective cues such as emotions, gestures, and body language. Selective pairs such as TV or AV often surpass TAV, implying that too many modalities may introduce noise. These results establish the first baselines for financial short-form video captioning and illustrate the potential and challenges of grounding complex visual cues in this domain. All code and data can be found on our GitHub under the CC-BY-NC-SA 4.0 license.
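The seven modality combinations are simply the non-empty subsets of {T, A, V}. As a minimal sketch (the names `MODALITIES` and `modality_configs` are illustrative and not taken from the released code), they can be enumerated as follows:

```python
from itertools import combinations

# Modality codes as used in the abstract: transcript, audio, video.
MODALITIES = ("T", "A", "V")


def modality_configs(modalities=MODALITIES):
    """Return every non-empty subset of the modalities, smallest subsets first."""
    return [
        "".join(subset)
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


configs = modality_configs()
print(configs)  # ['T', 'A', 'V', 'TA', 'TV', 'AV', 'TAV']
```

With three modalities there are 2^3 - 1 = 7 such configurations, which matches the list evaluated in the study.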
