Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras

Sherif Abdelwahab

Abstract

Always-on edge cameras generate continuous video streams in which redundant frames degrade cross-modal retrieval by crowding correct results out of the top-k. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index, while a cross-modal adapter and a cloud re-ranker compensate for the compact encoder's weak text-image alignment. The single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M–632M parameters) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M-parameter on-device encoder at an estimated 2.7 mW.
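
The epsilon-net filter named in the abstract admits a compact sketch. The following is a minimal illustration under stated assumptions, not the paper's implementation: the `encode` callable, the `eps` threshold, and the use of cosine distance are all placeholders. A frame is retained only if its embedding lies farther than epsilon from every frame kept so far, so the retained set forms an epsilon-net over the stream in a single pass.

```python
import numpy as np

def novelty_filter(frames, encode, eps=0.2):
    """Single-pass epsilon-net filter (sketch): keep a frame only if its
    embedding is more than `eps` (cosine distance) from every frame
    retained so far; redundant frames are dropped on the fly."""
    kept_frames = []
    kept_embs = []                       # unit-norm embeddings of retained frames
    for frame in frames:
        z = np.asarray(encode(frame), dtype=np.float64)
        z /= np.linalg.norm(z)           # unit-normalize so dot product = cosine sim
        if kept_embs:
            sims = np.stack(kept_embs) @ z
            if 1.0 - sims.max() <= eps:  # within eps of an existing net point: skip
                continue
        kept_frames.append(frame)        # frame is novel: add it as a new net point
        kept_embs.append(z)
    index = np.stack(kept_embs) if kept_embs else np.empty((0,))
    return kept_frames, index
```

The naive inner loop compares each frame against all retained embeddings; a real on-device filter would presumably bound or approximate this scan.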

Figures (3)

  • Figure 1: Architecture overview. On-device (blue): encode and filter frames. Query time (purple): the text query is projected into image space via the LiT adapter, then re-ranked in the cloud (see the query-path sketch after this list). Circled numbers match stages in the text.
  • Figure 2: Top-5 retrieval on 19 AEA sequences (553 events). CLIP B/32 selects which frames to keep; each model on the x-axis re-embeds those frames in its own space and retrieves independently. The novelty filter outperforms all offline baselines for five of the eight models. Dashed line: FT-TinyCLIP+LiT on held-out data (25.0%).
  • Figure 3: Left: cumulative improvement on held-out data (dashed line: CLIP B/32 on all frames). Right: re-ranker candidates needed at equal hit rate ($5.7\times$ fewer at 70%).
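
To make the query-time path in Figure 1 concrete, here is a minimal sketch under stated assumptions: the LiT adapter is reduced to a single projection matrix `W_adapter`, and `encode_text`, `rerank`, and cosine scoring are illustrative placeholders rather than the paper's actual components.

```python
import numpy as np

def query(text, encode_text, W_adapter, index_embs, index_frames, k=5, rerank=None):
    """Query-time path (sketch): embed the text query, project it into the
    image-embedding space with a LiT-style linear adapter, score it against
    the denoised on-device index, and hand the top-k candidates to an
    optional cloud re-ranker."""
    t = np.asarray(encode_text(text), dtype=np.float64)
    q = W_adapter @ t                    # linear projection: text space -> image space
    q /= np.linalg.norm(q)
    scores = index_embs @ q              # cosine similarity (index rows are unit-norm)
    top_k = np.argsort(-scores)[:k]      # indices of the k best-matching frames
    candidates = [index_frames[i] for i in top_k]
    return rerank(text, candidates) if rerank is not None else candidates
```

Because the index holds only novelty-filtered embeddings, a fixed top-k budget covers more distinct moments, which is consistent with the re-ranker reaching a given hit rate with fewer candidates (Figure 3, right).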