What Matters in LLM-Based Feature Extractor for Recommender? A Systematic Analysis of Prompts, Models, and Adaptation

Kainan Shi; Peilin Zhou; Ge Wang; Han Ding; Fei Wang

TL;DR

This paper investigates which design choices in LLM-as-feature-extractor pipelines most affect performance in sequential recommender systems. It introduces RecXplore, a modular framework that decouples data processing, semantic feature extraction, feature adaptation, and sequential modeling to enable controlled comparisons. Across four public datasets, the study shows that simple, well-chosen designs (such as attribute flattening, mean pooling, CPT+SFT adaptation, and PCA+MoE feature adapters) yield substantial gains: up to 18.7% relative improvement in NDCG@5 and 12.7% in HR@5 over strong baselines. The results support modular benchmarking as a practical path to robust, deployable LLM-enhanced recommendation and offer actionable guidelines for practitioners.

Abstract

Using Large Language Models (LLMs) to generate semantic features has been demonstrated as a powerful paradigm for enhancing Sequential Recommender Systems (SRS). This typically involves three stages: processing item text, extracting features with LLMs, and adapting them for downstream models. However, existing methods vary widely in prompting, architecture, and adaptation strategies, making it difficult to fairly compare design choices and identify what truly drives performance. In this work, we propose RecXplore, a modular analytical framework that decomposes the LLM-as-feature-extractor pipeline into four modules: data processing, semantic feature extraction, feature adaptation, and sequential modeling. Instead of proposing new techniques, RecXplore revisits and organizes established methods, enabling systematic exploration of each module in isolation. Experiments on four public datasets show that simply combining the best designs from existing techniques without exhaustive search yields up to 18.7% relative improvement in NDCG@5 and 12.7% in HR@5 over strong baselines. These results underscore the utility of modular benchmarking for identifying effective design patterns and promoting standardized research in LLM-enhanced recommendation.
