Semantic Commit: Helping Users Update Intent Specifications for AI Memory at Scale

Priyan Vaithilingam, Munyeong Kim, Frida-Cecilia Acosta-Parenteau, Daniel Lee, Amine Mhedhbi, Elena L. Glassman, Ian Arawjo

TL;DR

SemanticCommit tackles updating AI memory of user intent at scale by introducing a semantic commit workflow that detects and resolves semantic conflicts during memory updates. The system combines a knowledge-graph–driven RAG pipeline with LLM-based resolution, implemented in a React/TypeScript frontend and a Flask backend, and evaluated through benchmarks and a within-subjects user study against ChatGPT Canvas. Key contributions include a detailed design goal framework, an end-to-end architecture separating retrieval from generation, four domain benchmarks, and empirical evidence that impact analysis and granular, human-in-the-loop edits improve conflict detection and user sense of control. The work offers design implications for AI-agent memory interfaces, advocating proactive impact analysis, adjustable autonomy, and scalable memory-management APIs to support robust, user-aligned memory updates in real-world workflows.

Abstract

How do we update AI memory of user intent as intent changes? We consider how an AI interface may assist the integration of new information into a repository of natural language data. Inspired by software engineering concepts like impact analysis, we develop methods and a UI for managing semantic changes with non-local effects, which we call "semantic conflict resolution." The user commits new intent to a project -- makes a "semantic commit" -- and the AI helps the user detect and resolve semantic conflicts within a store of existing information representing their intent (an "intent specification"). We develop an interface, SemanticCommit, to better understand how users resolve conflicts when updating intent specifications such as Cursor Rules and game design documents. A knowledge graph-based RAG pipeline drives conflict detection, while LLMs assist in suggesting resolutions. We evaluate our technique on an initial benchmark. Then, we report a 12-user within-subjects study of SemanticCommit for two task domains -- game design documents and AI agent memory in the style of ChatGPT memories -- where users integrated new information into an existing list. Half of our participants adopted a workflow of impact analysis, where they would first flag conflicts without AI revisions, then resolve conflicts locally, despite having access to a global revision feature. We argue that AI agent interfaces, such as software IDEs like Cursor and Windsurf, should provide affordances for impact analysis and help users validate AI retrieval independently from generation. Our work speaks to how AI agent designers should think about updating memory as a process that involves human feedback and decision-making.

Paper Structure

This paper contains 54 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: A high-level depiction of our envisioned interaction between humans and AI assistants for long-term projects. The human-readable intent specification serves as an intermediate layer for enhancing common ground between the human and the AI, and grounds the AI's decision-making. We assume future AI agents will have a similar intent specification layer. Our project squarely concerns how the AI updates this memory in a robust, verifiable manner, and in the process might surface conflicts to the user to get their feedback in resolving them.
  • Figure 2: Example of our SemanticCommit workflow, showing one process of integrating new information into an AI memory of the financial habits of a South Korean student. 1. The user has described a new piece of information and pressed Make Change. 2. SemanticCommit detects conflicts and suggests changes to items it deems the most conflicting, leaving other conflicts for human review. 3. The user hovers over conflicting items to view the AI's reasoning. 4. For one item, they click a button to let the AI make a local rewrite. The user can continue editing, manually revising, reverting suggested changes, or deleting items at will. 5. When they feel done, they manually resolve items and/or clear remaining conflicts with a global action. (Alternatively, the user could have clicked Check for Conflicts to only perform detection, then handled conflicts locally.)
  • Figure 3: Cursor Rules [terragni2025future] adapted from the Instructor library [instructor_cursor_rules], loaded into our SemanticCommit UI. The user has added a new directive to squash commits before pushing a feature branch. The system adds the new rule to the top, makes a clarifying revision, and flags other lines as potential conflicts. One change is in error, which the user can quickly spot and revert.
  • Figure 4: Comparison of SemanticCommit, which uses a knowledge graph with PageRank relevance assessment followed by classification, against two baselines: (i) DropAllDocs, which takes all documents in context and classifies them without a retrieval stage; and (ii) the InkSync [Labin2024Beyond] implementation, with its prompt reformulated for our context. The comparison is across all benchmarks in Table 1, averaged with standard deviation bars, for the GPT-4o and GPT-4o-mini models. Our method, kg-pagerank, achieves higher recall with similar accuracy.
  • Figure 5: Participants' self-reported cognitive load and preference scores that directly compare the two conditions.
  • ...and 4 more figures
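
The retrieve-then-classify idea named in the abstract and in the Figure 4 caption (PageRank relevance over a knowledge graph, then classification) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the authors' implementation: the bipartite document-entity graph, the personalized PageRank variant, the function names `personalized_pagerank`, `retrieve_candidates`, and `detect_conflicts`, the sample memory items, and the keyword classifier standing in for an LLM call.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power-iteration personalized PageRank on an undirected graph.

    edges: iterable of (node_a, node_b) pairs.
    seeds: non-empty collection of nodes the random walk restarts from.
    """
    seeds = set(seeds)
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = set(neighbors) | seeds
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            # Nodes without neighbors send their mass back to the seeds.
            targets = neighbors.get(n) or seeds
            share = damping * rank[n] / len(targets)
            for m in targets:
                nxt[m] += share
        rank = nxt
    return rank


def retrieve_candidates(doc_entities, new_entities, top_k=2):
    """Retrieval stage: rank stored items by graph proximity to the new information."""
    # Hypothetical graph: each stored item is linked to the entities it mentions.
    edges = [(("doc", d), ("ent", e))
             for d, ents in doc_entities.items() for e in ents]
    seeds = {("ent", e) for e in new_entities}
    rank = personalized_pagerank(edges, seeds)
    scored = sorted(((rank.get(("doc", d), 0.0), d) for d in doc_entities),
                    reverse=True)
    return [d for score, d in scored[:top_k] if score > 0.0]


def detect_conflicts(docs, doc_entities, new_info, new_entities, classifier, top_k=2):
    """Classification stage: only retrieved candidates reach the classifier."""
    candidates = retrieve_candidates(doc_entities, new_entities, top_k)
    return [d for d in candidates if classifier(docs[d], new_info)]


# Hypothetical memory items and hand-assigned entities, for illustration only.
DOCS = {
    "m1": "Pays with a debit card.",
    "m2": "Saves a fixed amount every month.",
    "m3": "Reviews the budget app weekly.",
    "m4": "Enjoys hiking on weekends.",
}
ENTITIES = {
    "m1": {"payment"},
    "m2": {"savings", "budget"},
    "m3": {"budget", "app"},
    "m4": {"hiking"},
}

if __name__ == "__main__":
    new_info = "Now reviews the budget monthly instead of weekly."
    # Stand-in for an LLM judgment: flag items that mention a weekly cadence.
    flagged = detect_conflicts(DOCS, ENTITIES, new_info, {"budget"},
                               classifier=lambda old, new: "weekly" in old)
    print(flagged)
```

Keeping retrieval and classification as separate functions mirrors the argument that users should be able to validate what was retrieved independently of what is generated: the output of `retrieve_candidates` can be shown and checked before any classifier or rewrite runs.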