To See or To Read: User Behavior Reasoning in Multimodal LLMs
Tianning Dong, Luyi Ma, Varun Vasudevan, Jason Cho, Sushant Kumar, Kannan Achan
TL;DR
BehaviorLens investigates how the representation of sequential user history affects reasoning in multimodal LLMs for next-purchase prediction. It compares three input representations (text transcripts, scatter-plot images, and flowchart images) across six MLLMs, measuring accuracy, efficiency, and explanation quality. The key finding is that image-based representations substantially boost prediction accuracy (up to 87.5% improvement) without added computational cost, and that the preference for scatter plots versus flowcharts depends on the model. The work provides a reproducible benchmarking framework and insights into input design for efficient multimodal reasoning over sequential data.
Abstract
Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user-behavior data are more effective for maximizing MLLM performance remains underexplored. We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs by representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. Using a real-world purchase-sequence dataset, we find that when the data is represented as images, MLLMs' next-purchase prediction accuracy improves by 87.5% compared with an equivalent textual representation, without any additional computational cost.
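To make the three representations concrete, the following is a minimal sketch of how one purchase sequence might be turned into a text paragraph, scatter-plot coordinates, and a flowchart description. It is an illustration under stated assumptions, not the BehaviorLens implementation: the `Purchase` fields, the sentence template, the choice of axes (day on x, category on y), and the use of Graphviz DOT text for the flowchart are all hypothetical, and the sample history is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    # All fields are hypothetical; the paper's dataset schema may differ.
    day: int        # days since the first purchase
    item: str       # item name
    category: str   # product category


def to_text(history):
    """Render the history as a single text paragraph."""
    parts = [
        f"on day {p.day} the user bought {p.item} ({p.category})"
        for p in history
    ]
    return "The purchase history is: " + "; ".join(parts) + "."


def to_scatter_points(history):
    """Map each purchase to (x, y, label): x = day, y = category index."""
    categories = sorted({p.category for p in history})
    row = {c: i for i, c in enumerate(categories)}
    return [(p.day, row[p.category], p.item) for p in history]


def to_flowchart_dot(history):
    """Emit Graphviz DOT text that chains purchases in the given order."""
    lines = ["digraph history {", "  rankdir=LR;"]
    for i, p in enumerate(history):
        # "\\n" writes a literal backslash-n, which DOT treats as a line break.
        lines.append(f'  n{i} [label="{p.item}\\nday {p.day}"];')
    for i in range(len(history) - 1):
        lines.append(f"  n{i} -> n{i + 1};")
    lines.append("}")
    return "\n".join(lines)


# Invented sample data, for illustration only.
history = [
    Purchase(0, "milk", "dairy"),
    Purchase(3, "bread", "bakery"),
    Purchase(7, "cheese", "dairy"),
]

print(to_text(history))
print(to_scatter_points(history))
print(to_flowchart_dot(history))
```

In such a setup, the scatter points and the DOT text would still need to be rendered to images (for example with a plotting library and Graphviz) before being passed to an MLLM; the text paragraph would be passed as is.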
