Factorio Learning Environment

Jack Hopkins, Mart Bakler, Akbir Khan

TL;DR

The Factorio Learning Environment (FLE) offers a non-saturating, open-ended benchmark for evaluating autonomous agents on long-horizon planning and resource optimization, using Factorio as a rich but controllable testbed. By providing lab-play and open-play settings, an interactive Python/Lua API, and an unbounded production objective, the paper demonstrates that even frontier LLMs struggle with spatial reasoning, iterative error correction, and scalable automation, while coding-oriented models show stronger progress in open-ended tasks. Key contributions include an open-source framework, a persistent, REPL-based programming interface for iterative agent development, and a detailed characterization of model capabilities across structured and unbounded factory challenges. The work highlights FLE’s potential as a relativized, curriculum-like benchmark that can drive progress in planning, synthesis, and robust automation, with implications for scalable AI safety and reproducibility research.

Abstract

Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, which tests agents in long-term planning, program synthesis, and resource optimization. FLE provides exponentially scaling challenges -- from basic automation to complex factories processing millions of resource units per second. We provide two settings: (1) lab-play, consisting of eight structured tasks with fixed resources, and (2) open-play, with the unbounded task of building the largest factory on a procedurally generated map. We demonstrate across both settings that models still lack strong spatial reasoning. In lab-play, we find that LLMs exhibit promising short-horizon skills, yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis. In open-play, while LLMs discover automation strategies that improve growth (e.g., electric-powered drilling), they fail to achieve complex automation (e.g., electronic-circuit manufacturing).

Paper Structure

This paper contains 30 sections, 6 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: A plastic bar factory created by Claude 3.5 Sonnet in lab-play. The factory consists of a steam-powered electricity generator (top-left), a coal mine (top), a crude-oil to petroleum gas pipeline (bottom) and a chemical plant (bottom-right). The chemical plant creates plastic bars using the coal and petroleum gas as inputs. By themselves, the cumulative raw resources generate a production score of $224$. With this specific layout, the factory creates $40$ plastic bars per $60$ in-game seconds, for a production score of $352$.
  • Figure 2: Illustration of the Factorio Learning Environment (FLE). FLE is based on the popular construction and management simulation game Factorio. Left: The open-ended goal of the game is to create the largest factory possible. The game enables agents to invest in (an infinite number of) technological advances to produce more resources per second. Middle: Agents interact with the game through an interactive Python interpreter, where they take actions and print their observations in a Read-Eval-Print loop. By using the Python namespace, agents may store variables and define functions for later use. We provide a Python API to Factorio which allows direct interaction with the environment. Right: The agent may issue commands to the game server in order to interact with the environment (with associated time penalties), and receive a response as feedback. If the agent chooses, it may view its own production statistics.
  • Figure 3: Example of an FLE program used to create a simple automated iron-ore miner. In step 1 the agent uses a query to find the nearest resources and place a mine. In step 3 the agent uses an assert statement to verify that its action was successful.
  • Figure 4: Models are differentiated by score in Open-Play. Agents are given the instruction to build the biggest possible factory. Left: We find that by evaluating PS against steps (server calls) we can clearly differentiate stronger models from weaker ones in a log/log projection. We overlay milestones, showing the first time the median agent was able to create a new type of entity. Right: We plot the final reward and elapsed game time after 5k steps. We find that while weaker models show promise early-game, they struggle to progress when automation and logistics are required. We report median and standard error over the independent runs.
  • Figure 5: Agents are unable to consistently build complex and efficient factories in Lab-Play. Top: We measure the mean and standard deviation of task success rates across the first 8 complexity levels, and task progress (the percentage of the target ingredients and their sub-ingredients that agents' factories produce at each time-step) in three tasks of increasing difficulty. We observe a clear decrease in average task success rates as the crafting complexity of the target entity increases. Bottom: In harder tasks, agents show initial rapid progress followed by stagnation or decline. This is because agents are unable to scale up initial production or add the new factory sections required to reach the target production levels, and they often break existing structures in the process. The lack of consistent progress is also reflected in the large variance in mean task progress across runs.
  • ...and 8 more figures
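
The interaction pattern described in Figures 2 and 3 (programs run in a persistent Python namespace, printed output returned as the observation, and an assert used to verify an action) can be sketched roughly as follows. This is a minimal illustration and not the FLE implementation: `MockFactory`, its methods, `run_step`, and the entity and resource names are all hypothetical stand-ins for the game server and the paper's Python API.

```python
import contextlib
import io


class MockFactory:
    """Hypothetical stand-in for the game server; not the real FLE API."""

    def __init__(self):
        self.ore_patches = {"iron-ore": (12, -3)}
        self.entities = {}

    def nearest(self, resource):
        # Return the position of the (single) known patch of this resource.
        return self.ore_patches[resource]

    def place_entity(self, name, position):
        # Record the entity at the given position.
        self.entities[position] = name
        return name

    def get_entity(self, position):
        # Return the entity name at a position, or None if the tile is empty.
        return self.entities.get(position)


def run_step(program, namespace):
    """Execute one agent program; return its printed output or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(program, namespace)  # the namespace persists between steps
    except Exception as exc:
        buffer.write(f"{type(exc).__name__}: {exc}")
    return buffer.getvalue().strip()


namespace = {"game": MockFactory()}

# Step 1: query for the nearest resource and remember where it is.
step1 = 'ore = game.nearest("iron-ore")\nprint("ore at", ore)'
# Step 2: place a drill there, reusing the variable stored in step 1.
step2 = 'game.place_entity("burner-mining-drill", ore)\nprint("placed")'
# Step 3: verify the action with an assert, as in the Figure 3 caption.
step3 = (
    'assert game.get_entity(ore) == "burner-mining-drill", "drill missing"\n'
    'print("verified")'
)

observations = [run_step(p, namespace) for p in (step1, step2, step3)]
```

Because `ore` is stored in the shared namespace, later steps can reuse it without querying again. A failed assertion is not raised to the caller; it comes back as observation text (for example `AssertionError: drill missing`), which is the kind of feedback an agent would need for iterative error correction.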