The Environmental Impact of Ensemble Techniques in Recommender Systems
Jannik Nitschke
TL;DR
This work addresses the environmental cost of ensemble techniques in recommender systems, a gap in prior literature, which has focused mainly on accuracy gains. Through 93 controlled experiments using Surprise (rating prediction) and LensKit (ranking) across four datasets, it quantifies energy consumption and carbon footprint for four ensemble strategies relative to optimized single models, using EMERS with a Shelly plug for measurement. The study reveals a highly non-linear accuracy-energy relationship: ensembles yield modest accuracy improvements (0.3-5.7%) but can incur large energy overheads (up to 2,549% higher). The selective Top Performers strategy offers better efficiency than exhaustive averaging, while ensembles run into clear scalability limits on industrial-scale data. The findings provide actionable guidance for sustainable algorithm selection in recommender systems and establish a methodology for reporting energy and carbon metrics alongside traditional accuracy metrics.
Abstract
Ensemble techniques in recommender systems have demonstrated accuracy improvements of 10-30%, yet their environmental impact remains largely unmeasured. While deep learning recommendation algorithms can generate up to 3,297 kg CO2 per paper, ensemble methods have not been sufficiently evaluated for energy consumption. This thesis investigates how ensemble techniques influence environmental impact compared to single optimized models. We conducted 93 experiments across two frameworks (Surprise for rating prediction, LensKit for ranking) on four datasets spanning 100,000 to 7.8 million interactions. We evaluated four ensemble strategies (Average, Weighted, Stacking/Rank Fusion, Top Performers) against simple baselines and optimized single models, measuring energy consumption with a smart plug. The results reveal a non-linear accuracy-energy relationship: ensemble methods achieved accuracy improvements of 0.3-5.7% while consuming 19-2,549% more energy, depending on dataset size and strategy. The Top Performers ensemble showed the best efficiency: a 0.96% RMSE improvement with 18.8% energy overhead on MovieLens-1M, and a 5.7% NDCG improvement with 103% overhead on MovieLens-100K. Exhaustive averaging strategies consumed 88-270% more energy for comparable gains. On the largest dataset (Anime, 7.8M interactions), the Surprise ensemble consumed 2,005% more energy (0.21 Wh vs. 0.01 Wh) for a 1.2% accuracy improvement, producing 53.8 mg CO2 versus 2.6 mg CO2 for the single model. This research provides one of the first systematic measurements of energy and carbon footprint for ensemble recommender systems, demonstrates that selective strategies offer superior efficiency over exhaustive averaging, and identifies scalability limitations at industrial scale. These findings enable informed decisions about sustainable algorithm selection in recommender systems.
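The overhead and carbon figures quoted above follow from simple arithmetic, which the short Python sketch below makes explicit. It is an illustration only, not the measurement pipeline used in the thesis: the function names `overhead_percent` and `co2_mg` are hypothetical, and the grid carbon intensity is left as a caller-supplied parameter because the abstract does not state the value that was used.

```python
def overhead_percent(ensemble_wh: float, single_wh: float) -> float:
    """Extra energy of the ensemble relative to the single model, in percent."""
    if single_wh <= 0:
        raise ValueError("single_wh must be positive")
    return (ensemble_wh - single_wh) / single_wh * 100.0


def co2_mg(energy_wh: float, intensity_g_per_kwh: float) -> float:
    """Convert energy (Wh) to emissions (mg CO2) for a given grid intensity.

    Wh -> kWh divides by 1000 and g -> mg multiplies by 1000,
    so the two factors cancel.
    """
    return energy_wh * intensity_g_per_kwh


if __name__ == "__main__":
    # Rounded Anime figures from the abstract: 0.21 Wh vs. 0.01 Wh.
    print(f"overhead: {overhead_percent(0.21, 0.01):.0f}%")

    # ASSUMPTION: the intensity is backed out of the rounded abstract
    # figures (53.8 mg CO2 for 0.21 Wh); it is not a value stated there.
    implied_intensity = 53.8 / 0.21  # g CO2 per kWh
    print(f"implied intensity: {implied_intensity:.0f} g CO2/kWh")
    print(f"single model: {co2_mg(0.01, implied_intensity):.1f} mg CO2")
```

With the rounded inputs the overhead comes out at roughly 2,000% rather than the reported 2,005%; the difference is consistent with the energy values having been rounded to two decimals for the abstract.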
