Concept Alignment

Sunayana Rane; Polyphony J. Bruna; Ilia Sucholutsky; Christopher Kello; Thomas L. Griffiths

TL;DR

The paper argues that safe and effective AI alignment requires aligning the concepts that humans and AI systems use to understand the world, not just their values. It surveys how humans acquire and ground concepts, how machines represent and learn them, and how interactive, multimodal grounding can enable shared understanding. It advocates using cognitive science methods, interpretability tools, and multimodal generalist models to bootstrap concept alignment, with evaluation based on human-like grounded language use and interactive adaptation. The proposed cross-disciplinary path is to ground AI concepts in perception and multimodal data, refine them through interactive fine-tuning, and develop empirical standards for measuring concept alignment in support of broader value and behavior alignment.

Abstract

Discussion of AI alignment (alignment between humans and AI systems) has focused on value alignment, broadly referring to creating AI systems that share human values. We argue that before we can even attempt to align values, it is imperative that AI systems and humans align the concepts they use to understand the world. We integrate ideas from philosophy, cognitive science, and deep learning to explain the need for concept alignment, not just value alignment, between humans and machines. We summarize existing accounts of how humans and machines currently learn concepts, and we outline opportunities and challenges on the path towards shared concepts. Finally, we explain how we can leverage the tools already being developed in cognitive science and AI research to accelerate progress towards concept alignment.
