CognitiveDog: Large Multimodal Model Based System to Translate Vision and Language into Action of Quadruped Robot

Artem Lykov; Mikhail Litvinov; Mikhail Konenkov; Rinat Prochii; Nikita Burtsev; Ali Alridha Abdulkarim; Artem Bazhenov; Vladimir Berman; Dzmitry Tsetserukou

TL;DR

The paper tackles the challenge of creating a universal embodied AI robot capable of natural language understanding and physical interaction in open environments. It presents CognitiveDog, a Unitree Go1-based quadruped augmented with a dual-transformer LMM system, inner-monologue reasoning, and an Autogen-inspired multi-agent coordination framework, plus a vision-language transformer for environment comprehension without task-specific training. The architecture integrates a Step by Step Plan Generation module with a Visual Information Analysis pipeline (MiniGPT4-v2) and Visual-SLAM, enabling autonomous planning, object manipulation, and descriptive commentary. Across generalization, emergent capabilities, and complex tasks, the system demonstrates competitive performance against RT-2 baselines, with notable gains in reasoning while using far fewer parameters, marking a step toward practical universal quadruped robotics.

Abstract

This paper introduces CognitiveDog, a pioneering quadruped robot equipped with a Large Multimodal Model (LMM) that can not only communicate with humans verbally but also physically interact with its environment through object manipulation. The system was implemented on a Unitree Go1 robot dog equipped with a custom gripper and demonstrated autonomous decision-making, independently determining the most appropriate actions and object interactions to fulfill user-defined tasks. These tasks do not necessarily include direct instructions, so the robot must comprehend and execute them based on natural language input and environmental cues. The paper describes the system, the characteristics of its dataset, and the software architecture. Key to this development is the robot's ability to navigate space using Visual-SLAM, manipulate and transport objects, and provide natural language commentary during task execution. Experimental results highlight the robot's task comprehension and adaptability, underscoring its potential in real-world applications. The dataset used to fine-tune the robot-dog behavior generation model is available at: huggingface.co/datasets/ArtemLykov/CognitiveDog_dataset
