HiLight: Technical Report on the Motern AI Video Language Model
Zhiting Wang, Qiangong Zhou, Kangjie Yang, Zongyang Liu, Xin Mao
TL;DR
HiLight tackles video understanding in the billiards domain by aligning video and text and enabling video-based conversations. It combines a refined video encoder (CLIP-ViP+ with SPARC local loss and language masking) with a dual-tower visual-language model (CLIP-ViP+ and Long-CLIP) that uses token mining to feed a Gemma-2B LLM. The training workflow includes modality alignment on video QA data and billiards-specific instruction tuning, with experiments highlighting the benefits of explicit local loss with language masks and dual-tower fusion for robust grounding and long-context modeling. This work advances practical vision-language systems for action-rich, domain-specific scenarios and points toward unified, long-context encoders and multi-video interaction in conversational AI.
Abstract
This technical report presents the implementation of a state-of-the-art video encoder for video-text modality alignment and a video conversation framework called HiLight, which features dual visual towers. The work is divided into two main parts: (1) aligning the video and text modalities, and (2) providing a convenient and efficient way to interact with users. Our goal is to address video comprehension in the context of billiards. The report discusses the concepts explored and the final solution developed during the implementation of this task.
