ObjectAlign: Neuro-Symbolic Object Consistency Verification and Correction

Mustafa Munir, Harsh Goel, Xiwen Wei, Minkyu Choi, Sahil Shah, Kartikeya Bhardwaj, Paul Whatmough, Sandeep Chinchali, Radu Marculescu

TL;DR

ObjectAlign tackles object-level inconsistencies in edited video by fusing learnable perceptual thresholds with formal, neuro-symbolic verification. It pairs SMT-based semantic stability constraints with probabilistic model checking of temporal logic to guarantee frame-to-frame consistency, and employs adaptive neural interpolation to repair inconsistent blocks using nearby keyframes. The approach demonstrates improvements in CLIP-based semantic stability and warp error across multiple editing pipelines, supported by ablations and user studies that validate perceptual gains. This work provides a practical path toward provably consistent video editing by integrating perceptual cues, symbolic reasoning, and adaptive correction in a closed loop.

Abstract

Video editing and synthesis often introduce object inconsistencies, such as frame flicker and identity drift, that degrade perceptual quality. To address these issues, we introduce ObjectAlign, a novel framework that blends perceptual metrics with symbolic reasoning to detect, verify, and correct object-level and temporal inconsistencies in edited video sequences. The contributions of ObjectAlign are as follows. First, we propose learnable thresholds for metrics characterizing object consistency (i.e., CLIP-based semantic similarity, LPIPS perceptual distance, histogram correlation, and SAM-derived object-mask IoU). Second, we introduce a neuro-symbolic verifier that combines two components: (a) a formal, SMT-based check that operates on masked object embeddings to provably guarantee that object identity does not drift, and (b) a temporal fidelity check that uses a probabilistic model checker to verify the video's formal representation against a temporal logic specification. A frame transition is then deemed "consistent" based on a single logical assertion that requires satisfying both the learned metric thresholds and this unified neuro-symbolic constraint, ensuring both low-level stability and high-level temporal correctness. Finally, for each contiguous block of flagged frames, we propose a neural-network-based interpolation for adaptive frame repair, dynamically choosing the interpolation depth based on the number of frames to be corrected. This enables reconstruction of the corrupted frames from the last valid and next valid keyframes. Our results show up to a 1.4-point improvement in CLIP Score and up to a 6.1-point improvement in warp error compared to SOTA baselines on the DAVIS and Pexels video datasets.
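
The "single logical assertion" described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class and field names, the direction of each inequality (similarities bounded from below, LPIPS distance bounded from above), and the numeric thresholds in the usage note are all assumptions made for illustration. The neuro-symbolic verifier is reduced to a precomputed boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the learned per-metric thresholds."""
    clip_min: float   # assumed lower bound on CLIP cosine similarity
    lpips_max: float  # assumed upper bound on LPIPS perceptual distance
    hist_min: float   # assumed lower bound on histogram correlation
    iou_min: float    # assumed lower bound on object-mask IoU


@dataclass(frozen=True)
class TransitionMetrics:
    """Metrics for one consecutive frame pair (names are illustrative)."""
    clip_sim: float
    lpips: float
    hist_corr: float
    mask_iou: float
    symbolic_ok: bool  # verdict of the neuro-symbolic verifier, computed elsewhere


def is_consistent(m: TransitionMetrics, t: Thresholds) -> bool:
    """A transition passes only if every threshold check AND the symbolic check hold."""
    return (
        m.clip_sim >= t.clip_min
        and m.lpips <= t.lpips_max
        and m.hist_corr >= t.hist_min
        and m.mask_iou >= t.iou_min
        and m.symbolic_ok
    )
```

For example, with made-up thresholds `Thresholds(0.9, 0.2, 0.8, 0.7)`, a transition with metrics `(0.95, 0.1, 0.9, 0.8)` and a passing symbolic check is accepted, while raising its LPIPS distance to `0.3` or failing the symbolic check flags it.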

Paper Structure

This paper contains 32 sections, 9 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of ObjectAlign. ① We first learn per‐metric consistency thresholds from “positive” original video clips and “negative” inconsistently edited clips. ② Next, for each consecutive frame pair in a newly edited video, we compute semantic and perceptual similarities and apply both the learned threshold checks and an SMT‐based object consistency check on the embeddings to flag inconsistent transitions. ③ Finally, each contiguous block of flagged frames is repaired by adaptively interpolating between the nearest preceding and succeeding consistent keyframes, with the interpolation depth chosen according to the segment length. The corrected frames can then be re‐verified in a closed loop until no inconsistencies remain.
  • Figure 2: Qualitative comparison of ObjectAlign corrections across different editing pipelines. Before ObjectAlign correction (left), both SDEdit and PnP in the "Orange Fox" edits incorrectly alter the wolf's shape and color across consecutive frames (highlighted in green boxes). Similarly, StreamV2V and StreamDiffusion in the "Van Gogh Soccerball" edits cause the soccerball to intermittently disappear and reappear (highlighted in yellow boxes). These inconsistencies are accompanied by noticeable color and style drift, perceptual flickering, and identity misalignment. After applying ObjectAlign (right), these issues are effectively mitigated, resulting in greater semantic and temporal consistency.
  • Figure 3: Annotation tool for User Study. Participants are asked to evaluate the efficacy of ObjectAlign in correcting videos to improve subject consistency. We provide a randomized base video and an edited video, presented in randomized order to remove bias, and ask users to judge whether Video 2 is better than Video 1.
  • Figure 4: User Study Results on Perceptual Improvement by ObjectAlign. Participants were asked to evaluate whether the ObjectAlign corrected videos demonstrate noticeable improvements in subject consistency compared to baseline edited videos (users did not know which video was the original and which was the ObjectAlign corrected version). Responses ranged from "Strongly Disagree" to "Strongly Agree". Results indicate a clear user preference for ObjectAlign corrected videos, most prominently for the PnP method, where 75% of participants expressed strong agreement or agreement. Conversely, StreamDiffusion corrections showed the lowest perceived improvement, indicating that ObjectAlign's effectiveness varies with the underlying editing pipeline, although the corrected videos still improved on the baseline.
  • Figure 5: Further Qualitative Comparisons of ObjectAlign Corrections. (Top Row) Original real-world input video depicting a man surfing. (Middle Row, SDEdit Pixelart) Pixelart stylized frames produced by SDEdit introduce transient artifacts and distortions (highlighted in red boxes) around the surfer's arm. ObjectAlign correction successfully removes these artifacts, ensuring temporal consistency of object shapes. (Bottom Row, PnP Pixelart) PnP Pixelart stylization introduces significant spatial inconsistencies in the surfer's surfboard, along with spurious artifacts highlighted in the red boxes. The far left frame shows a red artifact, and the second frame from the left contains a random yellow object passing through the surfer. ObjectAlign effectively corrects these inconsistencies, resulting in a smoother and visually coherent video.
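
Step ③ of Figure 1 (grouping flagged frames into contiguous blocks, locating the surrounding keyframes, and picking an interpolation depth from the block length) can be sketched in Python as follows. This is a sketch under stated assumptions, not the paper's algorithm. The function names are hypothetical, and flags are assumed to be given per frame. The depth rule assumes a recursive midpoint interpolator in which depth d yields 2^d - 1 intermediate frames. The source only says the depth depends on the number of frames to be corrected.

```python
from typing import List, Optional, Tuple


def flagged_blocks(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs for each run of flagged frames."""
    blocks = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i                      # a new block begins here
        elif not flagged and start is not None:
            blocks.append((start, i - 1))  # the block ended on the previous frame
            start = None
    if start is not None:                  # block runs to the end of the video
        blocks.append((start, len(flags) - 1))
    return blocks


def surrounding_keyframes(
    block: Tuple[int, int], num_frames: int
) -> Tuple[Optional[int], Optional[int]]:
    """Indices of the last valid frame before and the first valid frame after a block.

    None marks a missing keyframe when the block touches a video boundary.
    """
    start, end = block
    prev_key = start - 1 if start > 0 else None
    next_key = end + 1 if end + 1 < num_frames else None
    return prev_key, next_key


def interpolation_depth(num_to_repair: int) -> int:
    """Smallest depth d with 2**d - 1 >= num_to_repair (assumed midpoint scheme)."""
    depth = 0
    while (1 << depth) - 1 < num_to_repair:
        depth += 1
    return depth
```

Under these assumptions, a flag sequence such as `[False, True, True, False, False, True, False]` contains two blocks, frames 1 to 2 and frame 5. The first block is bounded by keyframes 0 and 3 and needs depth 2, since one level of midpoint interpolation produces only a single intermediate frame.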