Xiaomi MiMo-VL-Miloco Technical Report

Jiaze Li, Jingyang Chen, Yuxun Qu, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu, Jianzhong Ju, Zhenbo Luo, Jian Luan

TL;DR

MiMo-VL-Miloco-7B is a home-centric, edge-deployable vision-language model tailored for smart homes. It is trained with a two-stage pipeline: CoT-enabled supervised fine-tuning on home data with token-budget-aware reasoning, followed by GRPO-based reinforcement learning to preserve general multimodal capabilities. The approach yields strong home-scenario understanding, gesture recognition, and activity classification while remaining competitive on broad multimodal benchmarks. Both full-precision and GGUF-quantized checkpoints are released. The work demonstrates that targeted domain specialization can coexist with broad, on-device multimodal reasoning, enabling practical privacy-preserving copilots in real-world homes.

Abstract

We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains on video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as on the MMMU-Pro and MMLU-Pro understanding benchmarks. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization (GRPO), leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, so that the model learns in a data-efficient manner and also reasons efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning, with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
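
The abstract names Group Relative Policy Optimization as the reinforcement-learning stage but gives no formula here. As a rough illustration only, the core idea of GRPO as it is commonly described is to score each sampled response against the other responses drawn for the same prompt, rather than against a learned value function. The sketch below assumes scalar rewards and population standard deviation; the function name `group_relative_advantages` and the `eps` parameter are hypothetical and are not taken from the report.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    `eps` is a hypothetical guard against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt; 1.0 = judged correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Responses that beat the group average receive a positive advantage and the rest a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.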

Paper Structure

This paper contains 19 sections, 6 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 2: The overview of MiMo-VL-Miloco-7B. The model takes as input a video depicting either a home scene or user gestures. Video frames are encoded by a Vision Transformer (ViT) and projected into the LLM embedding space via an MLP projector, forming visual tokens. In parallel, the instruction prompt is tokenized into text tokens. The visual and text tokens are then concatenated and fed into the LLM backbone, which generates the final response.
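
The data flow in the caption above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the model's implementation: the dimensions, the two-layer ReLU projector, the random stand-ins for ViT features and text embeddings, and the choice to place visual tokens before text tokens are all hypothetical, as are the helper names `init_linear`, `linear`, `mlp_project`, and `build_llm_input`.

```python
import random

random.seed(0)

# Toy sizes chosen for readability; they are not the real model's dimensions.
VIT_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12


def init_linear(in_dim, out_dim):
    """Create a small random weight matrix and a zero bias."""
    weight = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, layer):
    """Apply y = W x + b to one feature vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_project(vit_feature, fc1, fc2):
    """Map one ViT feature into the LLM embedding space (assumed 2-layer MLP)."""
    hidden = [max(0.0, h) for h in linear(vit_feature, fc1)]  # ReLU
    return linear(hidden, fc2)


def build_llm_input(vit_features, text_embeddings, fc1, fc2):
    """Project visual features and concatenate them with text embeddings.

    Assumption: visual tokens come first; the caption does not state the order.
    """
    visual_tokens = [mlp_project(f, fc1, fc2) for f in vit_features]
    return visual_tokens + text_embeddings


fc1 = init_linear(VIT_DIM, HIDDEN_DIM)
fc2 = init_linear(HIDDEN_DIM, LLM_DIM)

# Random stand-ins for ViT outputs (5 visual tokens) and prompt embeddings (3 text tokens).
vit_features = [[random.random() for _ in range(VIT_DIM)] for _ in range(5)]
text_embeddings = [[random.random() for _ in range(LLM_DIM)] for _ in range(3)]

sequence = build_llm_input(vit_features, text_embeddings, fc1, fc2)
print(len(sequence), len(sequence[0]))  # 8 12
```

The point of the projector is that the vision encoder and the LLM need not share a hidden size: after projection, every token in the combined sequence has the LLM's embedding width, so the backbone can attend over visual and text tokens uniformly.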