UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy

Yicheng Xu, Jiangning Zhang, Zhucun Xue, Teng Hu, Ran Yi, Xiaobin Hu, Yong Liu, Dacheng Tao

Abstract

In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at https://github.com/xuyicheng-zju/UniICL.
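The abstract describes UniICL-760K as curated 8-shot in-context learning episodes spanning 15 subtasks and six taxonomy levels. As a purely illustrative sketch (the field names and types below are hypothetical, not the released schema), an episode record might be organized as follows:

```python
# Hypothetical episode schema; field names are illustrative, not UniICL-760K's actual format.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Demonstration:
    image_path: str      # demonstration image (or target image for generation tasks)
    instruction: str     # task instruction shown to the model
    target: str          # expected answer text or target-image reference

@dataclass
class ICLEpisode:
    task: str                              # one of the 15 subtasks
    capability_level: int                  # 1-6 in the capability-oriented taxonomy
    demonstrations: List[Demonstration] = field(default_factory=list)  # the 8 in-context shots
    query: Optional[Demonstration] = None  # held-out query to be answered or generated

def is_well_formed(episode: ICLEpisode, num_shots: int = 8) -> bool:
    """Structural check: exactly `num_shots` demonstrations plus a query."""
    return len(episode.demonstrations) == num_shots and episode.query is not None
```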


Paper Structure

This paper contains 24 sections, 8 equations, 12 figures, and 11 tables.

Figures (12)

  • Figure 1: Left: Previous fragmented paradigms isolate modalities and tasks, often suffering from non-monotonic shot scaling. Our UniICL mitigates this issue to achieve consistent gains. Middle: Our six-level capability-oriented taxonomy and a radar chart across understanding and generation tasks. Right: ICL examples from UniICL-760K.
  • Figure 2: UniICL-760K curation pipeline, which comprises four stages: (a) Cascaded dense annotation for visual knowledge repository construction, (b) Generative synthesis with strict quality filtering, (c) Multi-modal feature fusion and DPP sampling for continuous semantic retrieval (a minimal selection sketch follows this figure list), and (d) Intent-driven retrieval from the structured annotation space.
  • Figure 3: Statistical distributions of our UniICL-760K from multiple perspectives. After filtering, the final training assets combine 202,750 validated scene-centric samples from the annotation branch and 353,826 quality-controlled synthetic assets from the generative branch. The latter break down into 99,455 instruction-following images, 81,202 edited images, 97,683 refinement pairs, and 11,050 concept-oriented synthetic images.
  • Figure 4: Branch-level filtering statistics used in data curation. Left: after structural correction and validation, the final overall-threshold rule retains 202,750 scene-centric samples from the 750,000-image source pool. Right: among 160,269 valid HPSv3 evaluations, 99,455 exceed the HPSv3 > 10 threshold, and the retained synthetic branch further yields 81,202 edited images and 97,683 refinement pairs after task-specific filtering.
  • Figure 5: Our lightweight, plug-and-play CAPM (Context-Adaptive Prototype Modulator) module integrates with existing Transformer-based models via a four-stage pipeline.
  • ...and 7 more figures
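
To make the DPP sampling step referenced in the Figure 2 caption concrete, the following is a minimal sketch of greedy determinant-maximizing selection over fused demonstration embeddings. It assumes candidates are already embedded as vectors (e.g., by the feature-fusion step); the function name, the inner-product kernel, and the shot count are illustrative assumptions, not the UniICL implementation.

```python
# Minimal sketch of DPP-style diverse demonstration selection, assuming each
# candidate is represented by a fused multimodal embedding. The routine greedily
# maximizes the log-determinant of the kernel submatrix, a standard MAP-style
# approximation for k-DPP selection; it is not the paper's exact procedure.
import numpy as np

def greedy_dpp_select(embeddings: np.ndarray, k: int = 8) -> list:
    """Pick k mutually diverse candidates via greedy log-det maximization."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize rows
    L = X @ X.T + 1e-6 * np.eye(len(X))  # cosine-similarity kernel with jitter (PSD)
    selected = []
    for _ in range(min(k, len(X))):
        best, best_logdet = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            sub = L[np.ix_(selected + [i], selected + [i])]
            sign, logdet = np.linalg.slogdet(sub)  # volume spanned by the candidate set
            if sign > 0 and logdet > best_logdet:  # skip numerically degenerate candidates
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected

# Example: select 8 diverse demonstrations out of 100 candidates.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidates = rng.normal(size=(100, 512))
    print(greedy_dpp_select(candidates, k=8))
```

Greedy log-determinant maximization trades exact DPP sampling for a deterministic, diversity-promoting subset, which is typically the behavior an example-selection pipeline wants when assembling a fixed shot budget.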