Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge

Jesse Barkley, Abraham George, Amir Barati Farimani

TL;DR

The paper addresses autonomous edge targeting in dynamic, data-scarce military settings by introducing a hierarchical zero-shot framework that cascades a high-recall semantic trigger (Grounding DINO) with compact edge-class Vision-Language Models (Qwen and Gemma, 4B-12B parameters). It leverages high-fidelity Battlefield synthetic data and a novel Controlled Input methodology to decouple perception from reasoning, enabling precise diagnosis of failure modes. On 55 Battlefield 6 clips, the system achieves up to 100% false-positive filtering, 97.5% damage assessment accuracy, and strong fine-grained vehicle classification (up to 90%), while the agentic Scout-Commander workflow attains 100% correct asset deployment and a near-perfect reasoning score (9.8/10) with sub-75-second latency. The work reveals distinct failure phenotypes (Perceptual Blindness vs. Semantic Non-Compliance) and demonstrates edge-only autonomy using small VLMs, underscoring the need for domain-specific foundation models and rigorous perception-reasoning decoupling for safety-critical defense applications.
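The idea of decoupling perception from reasoning can be illustrated with a minimal sketch. This is an interpretation of the summary above, not the authors' implementation: the `perceive` and `reason` callables, the case dictionary keys, and the failure labels are all hypothetical. The reasoner is first given a known-correct scene description (the controlled input); if it still picks the wrong action, the failure is attributed to reasoning, and if it succeeds there but fails on the model's own perception, the failure is attributed to perception.

```python
from typing import Callable, Dict, List


def diagnose(
    cases: List[Dict[str, str]],
    perceive: Callable[[str], str],
    reason: Callable[[str], str],
) -> Dict[str, int]:
    """Attribute each failure to perception or reasoning.

    Each case is a dict with hypothetical keys:
      'image'            - the raw visual input
      'true_description' - a known-correct scene description (controlled input)
      'expected_action'  - the action a correct system should choose
    """
    counts = {"ok": 0, "perception_failure": 0, "reasoning_failure": 0}
    for case in cases:
        # Controlled run: bypass perception and feed the ground-truth description.
        controlled_ok = reason(case["true_description"]) == case["expected_action"]
        if not controlled_ok:
            # Wrong answer even with accurate inputs: reasoning is at fault.
            counts["reasoning_failure"] += 1
            continue
        # End-to-end run: the reasoner sees only what the model perceived.
        end_to_end_ok = reason(perceive(case["image"])) == case["expected_action"]
        counts["ok" if end_to_end_ok else "perception_failure"] += 1
    return counts
```

In practice `perceive` and `reason` would wrap calls to the same VLM with different prompts; here they are left as plain callables so the diagnostic logic can be read on its own.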

Abstract

Deploying autonomous edge robotics in dynamic military environments is constrained by both scarce domain-specific training data and the computational limits of edge hardware. This paper introduces a hierarchical, zero-shot framework that cascades lightweight object detection with compact Vision-Language Models (VLMs) from the Qwen and Gemma families (4B-12B parameters). Grounding DINO serves as a high-recall, text-promptable region proposer, and frames with high detection confidence are passed to edge-class VLMs for semantic verification. We evaluate this pipeline on 55 high-fidelity synthetic videos from Battlefield 6 across three tasks: false-positive filtering (up to 100% accuracy), damage assessment (up to 97.5%), and fine-grained vehicle classification (55-90%). We further extend the pipeline into an agentic Scout-Commander workflow, achieving 100% correct asset deployment and a 9.8/10 reasoning score (graded by GPT-4o) with sub-75-second latency. A novel "Controlled Input" methodology decouples perception from reasoning, revealing distinct failure phenotypes: Gemma3-12B excels at tactical logic but fails in visual perception, while Gemma3-4B exhibits reasoning collapse even with accurate inputs. These findings validate hierarchical zero-shot architectures for edge autonomy and provide a diagnostic framework for certifying VLM suitability in safety-critical applications.
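The cascade described in the abstract can be sketched as a two-stage filter. This is a minimal illustration under stated assumptions, not the paper's code: `detect` stands in for a text-promptable detector such as Grounding DINO, `verify` stands in for an edge VLM query, and the `threshold` value is hypothetical (the abstract does not state the confidence cutoff).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Detection:
    frame_id: int
    label: str
    confidence: float


def run_cascade(
    frames: Sequence[object],
    detect: Callable[[object], List[Tuple[str, float]]],
    verify: Callable[[object, str], bool],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[Detection]:
    """Stage 1 proposes regions cheaply; stage 2 verifies only confident frames."""
    accepted: List[Detection] = []
    for frame_id, frame in enumerate(frames):
        proposals = detect(frame)  # list of (label, confidence) pairs
        if not proposals:
            continue
        label, confidence = max(proposals, key=lambda p: p[1])
        if confidence < threshold:
            continue  # below the trigger threshold: the VLM is never called
        # The expensive VLM call runs only on frames that passed the trigger.
        if verify(frame, label):
            accepted.append(Detection(frame_id, label, confidence))
    return accepted
```

The design point is that the slower verifier is invoked only on frames the detector flags with high confidence, which keeps per-frame cost low while letting the VLM reject the detector's false positives.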

Paper Structure

This paper contains 35 sections, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Overview of the hierarchical zero-shot framework. Grounding DINO serves as a semantic trigger, extracting high-confidence frames that are then verified by edge VLMs for target classification, damage assessment, and tactical decisions.
  • Figure 2: Example detections from Grounding DINO across different categories. Grounding DINO acted as a high-recall filter, correctly identifying MBTs and IFVs as military tanks (although not distinguishing between them). However, it also detected destroyed tanks and armed trucks. These high-confidence frames are passed to VLMs for semantic verification.
  • Figure 3: Atomic evaluation results across three perception tasks. Qwen models consistently outperform Gemma across all evaluations. Mean inference latency per image: Qwen3-VL-4B (5.7s), Qwen3-VL-8B (10.8s), Gemma3-4B (2.0s), Gemma3-12B (4.8s).