This document is maintained by the Full Stack Alignment Task Force, including Tan Zhi Xuan and the MAI team.

The latest version is on GitHub: https://github.com/meaningalignment/coaligning

Abstract

§1 Intro

The growing field of socio-technical alignment argues that beneficial AI outcomes require more than aligning individual systems with operators' intentions. Even perfectly intent-aligned AI systems will become misaligned if deployed within broader institutions—such as profit-driven corporations, competitive nation-states, or inadequately regulated markets—that conflict with global human flourishing.

Once we agree that it’s important to co-align artificial intelligence and institutions, the key question becomes: how do we do that?

In §1.3, we’ll introduce four practical alignment techniques that go beyond preferences and free-form textual ‘intents’ to capture deeper, more structured representations of human values and norms.

§1.1 Inadequacy of Utility Function and Preference-Based Approaches

Addressing the challenge of socio-technical alignment requires redesigning institutional structures, yet we will claim that the formal toolkit for institution design inherited from the 20th century—microeconomics, game theory, mechanism design, welfare economics, and social choice theory—is inadequate for this task. We call this set of theories the Standard Institution Design Toolkit (SIDT).

These theories model agents via a thin conception of rationality: individuals are presumed to possess intrinsic preference profiles, utility functions, or payoff matrices. These representations have serious limitations: (1) they cannot be inspected by others[*]; (2) they do not reference any underlying notion of the good; (3) they are blind to social context, such as shared values, norms, beliefs, or group identities[1]. Beyond these structural limitations, two further problems stand out:

  1. People’s preferences are often incomplete, inconsistent, or unstable over time.

    Utility theory is not designed to model agents who change, reshape, and discover preferences over time — much less agents that reason about which preferences or values are more sensible or justified to hold — so it is unlikely to be up to the task of capturing human-like reflection about values; the first sketch at the end of this section shows how even a single preference cycle defeats a utility representation. Instead, thicker approaches to human values and choice are likely necessary, as we describe further below.

  2. Revealed preferences can be manipulated, and that manipulation must be guarded against. We’ve known since the debates in welfare economics (most closely associated with Amartya Sen) that revealed preference is quite limited as a measure of benefit. It’s become ever clearer that businesses, governments, and other entities have learned to exploit individuals under the guise of serving their preferences, using AI[2], and it’s also become clear that current AI models are actively engaged in reward hacking[*].

    Since LLM-based AI is reaching even more of society than recommender systems did, there is, thankfully, much more appetite to overcome these problems with revealed preference than there was when Sen was working in development economics.[5]

    This seems to demand a deep economic shift, and one that is difficult to model using conventional microeconomics or to measure using welfare economics, because in these models humans are counted as ‘fulfilling their preferences’ even when they are doomscrolling, addicted to AI pornography, and so on (the second sketch below illustrates this divergence).
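
To make point 1 concrete, here is a minimal sketch in Python (the options, the elicited choices, and the helper functions are illustrative, not drawn from the SIDT literature): a utility representation requires complete and transitive preferences, and a single preference cycle of the kind real elicitation routinely surfaces already rules one out.

```python
from itertools import combinations, permutations

# Hypothetical elicited pairwise choices: prefers[(a, b)] == True means "a is chosen over b".
options = ["save_money", "travel_more", "work_less"]
prefers = {
    ("save_money", "travel_more"): True,
    ("travel_more", "work_less"): True,
    ("work_less", "save_money"): True,  # a cycle, of the kind elicitation often turns up
}

def is_complete(options, prefers):
    """Completeness: every pair of options is ranked one way or the other."""
    return all((a, b) in prefers or (b, a) in prefers
               for a, b in combinations(options, 2))

def is_transitive(options, prefers):
    """Transitivity: if a is preferred to b and b to c, then a must be preferred to c."""
    ranked = lambda a, b: prefers.get((a, b), False)
    return all(ranked(a, c) or not (ranked(a, b) and ranked(b, c))
               for a, b, c in permutations(options, 3))

print(is_complete(options, prefers))    # True: every pair is ranked...
print(is_transitive(options, prefers))  # False: ...but the cycle means no utility function
                                        # u with u(a) > u(b) iff a is preferred to b can exist.
```

Note that utility theory has no native way to treat the cycle as something the agent might reflect on and resolve; it can only declare the agent irrational, which is part of why thicker representations of values seem necessary.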
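
A second sketch, for point 2, with entirely made-up item names and numbers: an objective that maximizes revealed preference (proxied here by predicted engagement) selects a different option than a measure of what the person endorses on reflection, which is exactly the gap that makes ‘fulfilling their preferences’ a misleading welfare measure.

```python
# Hypothetical catalogue: item -> (predicted engagement, self-reported wellbeing after use).
items = {
    "infinite_scroll_feed": (0.92, -0.4),
    "long_form_article":    (0.35, 0.3),
    "call_a_friend_prompt": (0.20, 0.6),
}

revealed_choice = max(items, key=lambda k: items[k][0])   # what an engagement-maximizer serves
endorsed_choice = max(items, key=lambda k: items[k][1])   # what the person endorses afterwards

print(revealed_choice)  # infinite_scroll_feed: counted as "preference fulfilled" under the SIDT lens
print(endorsed_choice)  # call_a_friend_prompt: invisible to a revealed-preference welfare measure
```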