OVEL: Large Language Model as Memory Manager for Online Video Entity Linking

Haiquan Zhao, Xuwu Wang, Shisong Chen, Zhixu Li, Xin Zheng, Yanghua Xiao

TL;DR

The paper tackles Online Video Entity Linking (OVEL) in live streams, addressing real-time, fine-grained linking of video mentions to a knowledge base. It introduces the LIVE dataset and RoFA metric to evaluate timeliness, robustness, and accuracy in online settings, and proposes a memory-managed framework where an LLM controls a memory block guided by retrieval augmentation and a two-stage MEL process. The method demonstrates that combining retrieval with an LLM-based memory controller achieves superior RoFA performance and remains feasible for online inference, with clear gains over static MEL baselines. This work enables more accurate and timely identification of product entities in live video streams, potentially improving real-time recommendations and user experience in live commerce and similar applications.

Abstract

In recent years, multi-modal entity linking (MEL) has garnered increasing attention in the research community due to its significance in numerous multi-modal applications. Video, as a popular means of information transmission, has become prevalent in people's daily lives. However, most existing MEL methods primarily focus on linking textual and visual mentions, or mentions in offline videos, to entities in multi-modal knowledge bases, with limited effort devoted to linking mentions within online video content. In this paper, we propose a task called Online Video Entity Linking (OVEL), which aims to establish connections between mentions in online videos and a knowledge base with high accuracy and timeliness. To facilitate research on OVEL, we concentrate on live delivery scenarios and construct a live delivery entity linking dataset called LIVE. We also propose an evaluation metric that considers timeliness, robustness, and accuracy. Furthermore, to handle the OVEL task effectively, we leverage a memory block managed by a Large Language Model (LLM) and retrieve entity candidates from the knowledge base to augment the LLM's performance on memory management. The experimental results demonstrate the effectiveness and efficiency of our method.

Paper Structure (26 sections, 13 equations, 6 figures, 4 tables)

This paper contains 26 sections, 13 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The task of OVEL in the live delivery scene. The upper part shows an online delivery video. At time t, the system takes the information before time t as input and identifies salient entities from the video. Relevant entities are pushed to specific persons for recommendation.
  • Figure 2: Overview of the framework structure. The initialized memory block is obtained through the summary module and used alongside keyframes extracted from the video by MEL to get initial retrieval candidates. At time t, the LLM memory controller takes the video information within the current input time interval and the memory block before time t, and incorporates retrieval results to update the content within the memory block.
  • Figure 3: Inference time of different methods. The recommended time is determined based on the optimal inference time consumption provided by the actual application scenario.
  • Figure 4: The procedure of LIVE dataset construction.
  • Figure 5: Overview of entity distribution.
  • ...and 1 more figure
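
The update step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: MemoryBlock, update_memory, run_stream, toy_retrieve, toy_controller, and the three-item KB are hypothetical names, the LLM controller and the retriever are replaced by simple word-overlap stubs, and keyframes (the visual input) are omitted entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class MemoryBlock:
    """Running text summary of the stream plus the latest entity candidates."""
    summary: str = ""
    candidates: List[str] = field(default_factory=list)


def update_memory(
    memory: MemoryBlock,
    segment_text: str,
    retrieve: Callable[[str], List[str]],
    controller: Callable[[str, str, List[str]], str],
) -> MemoryBlock:
    """One online step at time t: combine the memory before t, the text of the
    current interval, and retrieved candidates, then rewrite the memory."""
    query = f"{memory.summary} {segment_text}".strip()
    candidates = retrieve(query)
    new_summary = controller(memory.summary, segment_text, candidates)
    return MemoryBlock(summary=new_summary, candidates=candidates)


# Hypothetical stand-ins so the sketch runs without a model or a real KB.
KB = ["red wool scarf", "blue denim jacket", "leather wallet"]


def toy_retrieve(query: str) -> List[str]:
    """Rank KB entries by word overlap with the query (assumed retriever)."""
    words = set(query.lower().split())
    scored = [(len(words & set(entity.split())), entity) for entity in KB]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [entity for score, entity in ranked if score > 0]


def toy_controller(old: str, new_text: str, candidates: List[str]) -> str:
    """Stand-in for the LLM: keep only words that occur in some candidate."""
    vocab = {word for cand in candidates for word in cand.split()}
    kept = [w for w in f"{old} {new_text}".lower().split() if w in vocab]
    return " ".join(dict.fromkeys(kept))  # de-duplicate, preserve order


def run_stream(segments: Iterable[str]) -> MemoryBlock:
    """Process the stream segment by segment, carrying the memory forward."""
    memory = MemoryBlock()
    for segment in segments:
        memory = update_memory(memory, segment, toy_retrieve, toy_controller)
    return memory


if __name__ == "__main__":
    final = run_stream([
        "welcome everyone today we have a scarf",
        "it is red and made of wool",
    ])
    print(final.summary, final.candidates)
```

The point of the sketch is the data flow: the memory is the only state carried across time steps, so each step sees a bounded input regardless of how long the stream has been running. In a real system the controller would be an LLM prompt and the retriever a multi-modal search over the knowledge base.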