
Falcon Perception

Aviraj Bevli, Sofian Chaybouti, Yasser Dahou, Hakim Hacid, Ngoc Dung Huynh, Phuc H. Le Khac, Sanath Narayan, Wamiq Reyaz Para, Ankit Singh

Abstract

Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding only small heads where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F$_1$, compared to 62.3 for SAM 3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows larger gains. Finally, we extend the same early-fusion recipe to Falcon OCR: a compact 300M-parameter model that attains 80.3% on olmOCR and 88.64 on OmniDocBench.
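
As a concrete illustration of the hybrid attention pattern described above, here is a minimal sketch that builds the corresponding attention mask: bidirectional among image tokens, causal for the prediction tokens. The function name and the assumption that image patches precede the prediction tokens in the sequence are illustrative choices, not the released implementation.

```python
import torch

def hybrid_attention_mask(num_image_tokens: int, num_prediction_tokens: int) -> torch.Tensor:
    """Boolean (L, L) mask; True means the query token may attend to the key token."""
    L = num_image_tokens + num_prediction_tokens
    # Start from a standard causal (lower-triangular) mask over the full sequence.
    mask = torch.tril(torch.ones(L, L)).bool()
    # Image patches form a fully bidirectional block: every image token can
    # attend to every other image token regardless of position.
    mask[:num_image_tokens, :num_image_tokens] = True
    return mask

# Example: 4 image patch tokens followed by 3 prediction tokens. The boolean mask
# can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention
# (True = attend).
m = hybrid_attention_mask(4, 3)
```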

Figures (25)

  • Figure 1: Falcon Perception Architecture: A single autoregressive Transformer processes a unified sequence of image patches, text, and task tokens. The model predicts object properties in a fixed order: <coord>$\to$<size>$\to$<segm>. Bounding-box coordinate and size tokens are decoded via specialized heads and re-injected as Fourier features to condition subsequent steps. High-resolution segmentation masks are generated by a dot product between the <segm> token of each instance and the upsampled image features (see the sketch after this list), leveraging the early-fusion backbone for instance-aware localization. Visual data flow is shown in green, coordinates in blue, size in orange, and segmentation in purple.
  • Figure 2: Training forward pass.
  • Figure 3: Prompt progression across complexity levels. We keep the image fixed and vary the text prompt to isolate the capabilities targeted by Levels 0--4 (Table \ref{tab:complexity_levels}). Each panel shows the predicted masks for the same scene under progressively more specific prompts: from generic object classes (Level 0, e.g. "box"), to attribute binding (Level 1, e.g. "purple box"), to OCR-based identification (Level 2, e.g. "ACME box"), and then spatial/layout constraints (Level 3, e.g. "bottle on the right") and fine-grained relational descriptions (Level 4). Falcon Perception remains stable as prompts become more compositional and text-dependent, while SAM 3 starts to fail on OCR-driven queries and higher-level constraints, either returning no masks or segmenting visually plausible but semantically wrong instances.
  • Figure 4: The Muon optimizer yields lower training losses for the coordinate, size, and cross-entropy objectives than AdamW, leading to improved performance (Tab. \ref{tab:optim}).
  • Figure 5: Raster ordering of instances results in lower training loss and better performance on both benchmarks, compared to random and size orderings.
  • ...and 20 more figures
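
To make the mask-decoding step from the Figure 1 caption concrete, the following is a minimal sketch (assumed shapes and function names, not the paper's code) of decoding instance masks as a dot product between each instance's <segm> hidden state and upsampled per-patch image features, so that all instances are predicted in parallel at high resolution.

```python
import torch
import torch.nn.functional as F

def decode_masks(segm_tokens: torch.Tensor,     # (num_instances, d) hidden states at <segm> positions
                 image_features: torch.Tensor,  # (d, h, w) per-patch features from the shared backbone
                 out_size: tuple) -> torch.Tensor:
    """Return (num_instances, H, W) mask logits."""
    # Upsample the low-resolution patch grid to the target mask resolution.
    up = F.interpolate(image_features[None], size=out_size,
                       mode="bilinear", align_corners=False)[0]   # (d, H, W)
    # One dot product per instance and per pixel gives the mask logits;
    # every instance is decoded in parallel from the same feature map.
    return torch.einsum("nd,dhw->nhw", segm_tokens, up)

# Example: 5 instances, 256-dim features on a 32x32 patch grid, 512x512 output masks.
logits = decode_masks(torch.randn(5, 256), torch.randn(256, 32, 32), (512, 512))
masks = logits.sigmoid() > 0.5
```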