HiLight: Technical Report on the Motern AI Video Language Model

Zhiting Wang, Qiangong Zhou, Kangjie Yang, Zongyang Liu, Xin Mao

TL;DR

HiLight tackles video understanding in a billiards domain by aligning video and text and enabling video-based conversations. It combines a refined video encoder (CLIP-ViP+ with SPARC local loss and language masking) with a dual-tower visual-language model (CLIP-ViP+ and Long-CLIP) that uses token mining to feed a Gemma-2B LLM. The training workflow includes modality alignment on video QA data and billiards-specific instruction tuning, with experiments highlighting the benefits of explicit local loss with language masks and dual-tower fusion for robust grounding and long-context modeling. This work advances practical vision-language systems for action-rich, domain-specific scenarios and points toward unified, long-context encoders and multi-video interaction in conversational AI.

Abstract

This technical report presents the implementation of a state-of-the-art video encoder for video-text modality alignment and a video conversation framework called HiLight, which features dual visual towers. The work is divided into two main parts: (1) aligning the video and text modalities; (2) providing a convenient and efficient way to interact with users. Our goal is to address the task of video comprehension in the context of billiards. The report discusses the concepts explored and the final solution developed during the task's implementation.

Paper Structure (10 sections, 3 figures)

Figures (3)

  • Figure 1: An improved CLIP-ViP structure, CLIP-ViP+, in which the global loss comes from the contrastive learning of CLIP-ViP and the newly introduced local loss comes from the SPARC method.
  • Figure 2: HiLight Dual-Tower VLM Framework. CLIP-ViP receives the complete video features, while Long-CLIP receives fixed-sampled video frames. The features of each keyframe output by Long-CLIP serve as the queries for cross-attention over the complete video features from CLIP-ViP. A projection layer then outputs vision tokens, which are concatenated with the user's text input and fed into the language model.
  • Figure 3: Token mining training loss during the first stage of VLM training.
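
The data flow described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `token_mining`, the single-head attention, the random weight matrices, all dimensions, and the order in which vision and text tokens are concatenated are assumptions chosen only to show how keyframe features (standing in for Long-CLIP outputs) might query full-video features (standing in for CLIP-ViP outputs) before projection into the language model's embedding space.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_mining(keyframe_feats, video_feats, w_q, w_k, w_v, w_proj):
    """Hypothetical single-head cross-attention followed by a projection.

    keyframe_feats: (K, d_long) per-keyframe features (queries).
    video_feats:    (N, d_vip)  full-video features (keys and values).
    Returns the projected vision tokens (K, d_llm) and the attention map (K, N).
    """
    q = keyframe_feats @ w_q                         # (K, d)
    k = video_feats @ w_k                            # (N, d)
    v = video_feats @ w_v                            # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (K, N), rows sum to 1
    mined = attn @ v                                 # (K, d)
    return mined @ w_proj, attn                      # (K, d_llm), (K, N)

# Toy sizes, all assumed for illustration only.
rng = np.random.default_rng(0)
K, N = 4, 32                         # keyframes, video tokens
d_long, d_vip, d, d_llm = 16, 24, 8, 12

keyframe_feats = rng.normal(size=(K, d_long))
video_feats = rng.normal(size=(N, d_vip))
w_q = rng.normal(size=(d_long, d))
w_k = rng.normal(size=(d_vip, d))
w_v = rng.normal(size=(d_vip, d))
w_proj = rng.normal(size=(d, d_llm))

vision_tokens, attn = token_mining(keyframe_feats, video_feats,
                                   w_q, w_k, w_v, w_proj)

# Stand-in for the embedded user text; placing vision tokens first is an assumption.
text_embeds = rng.normal(size=(5, d_llm))
llm_input = np.concatenate([vision_tokens, text_embeds], axis=0)

print(vision_tokens.shape, llm_input.shape)
```

In this sketch each keyframe yields exactly one vision token, so the number of tokens passed to the language model depends on the number of sampled keyframes rather than on the length of the full video feature sequence.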