<aside> 🔥
This document is maintained by the Full Stack Alignment Task Force, including Tan Zhi Xuan and the MAI team.
</aside>
The latest version is on GitHub: https://github.com/meaningalignment/coaligning
The growing field of socio-technical alignment argues that beneficial AI outcomes require more than aligning individual systems with operators' intentions. Even perfectly intent-aligned AI systems will become misaligned if deployed within broader institutions—such as profit-driven corporations, competitive nation-states, or inadequately regulated markets—that conflict with global human flourishing.
Once we agree that it’s important to co-align artificial intelligence and institutions, the key question becomes: how do we do that?
Historically, this has been treated as a problem in game theory and social choice, where both human beings and AI agents are modeled as rational expected-utility maximizers, or as agents equipped with utility functions or preference relations.
In the classical paradigm—which includes techniques like inverse reinforcement learning—we find game-theoretic approaches to alignment, and preference aggregation approaches from social choice theory, where preferences are first collected from a large group of individuals, and a social welfare function is then constructed to aggregate them into a single decision or policy. The idea is that once you know each agent’s utility function or ranked preferences, you can use formal methods to align collective action or optimize AI behavior accordingly.
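To make the aggregation step concrete, here is a minimal sketch in the spirit of that classical paradigm. It uses Borda count as the social welfare function; the policies and rankings are invented for illustration, and nothing here is specific to any particular alignment method discussed in this document.

```python
# Minimal sketch of the classical preference-aggregation pipeline:
# collect ranked preferences, apply a social welfare function (here,
# Borda count), and return a single collective decision.

from collections import defaultdict

def borda_aggregate(rankings: list[list[str]]) -> str:
    """Aggregate ranked preferences into one collective choice.

    Each ranking lists options from most to least preferred. An option in
    position i of an n-option ranking earns (n - 1 - i) points; the option
    with the highest total score is the collective decision.
    """
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, option in enumerate(ranking):
            scores[option] += n - 1 - i
    return max(scores, key=scores.get)

# Three individuals rank three candidate policies (invented example).
rankings = [
    ["policy_a", "policy_b", "policy_c"],
    ["policy_b", "policy_a", "policy_c"],
    ["policy_a", "policy_c", "policy_b"],
]
print(borda_aggregate(rankings))  # -> "policy_a"
```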
We cover some problems with preference- and utility-based approaches for FSA in §1.1.
More recently, in practical applications, we’re seeing a shift away from preference-based alignment toward text-based methods.
In single-user alignment, preference-based training methods like RLHF (Reinforcement Learning from Human Feedback) are being replaced by self-critique processes, such as deliberative alignment, in which the model reasons about and evaluates its own output in response to a prompt.
In multi-agent or institutional contexts, the basic unit is no longer the preference or utility function, but rather the text string: this includes principles drawn from Constitutional AI (CCAI), text-based values profiles (https://arxiv.org/abs/2503.15484), and model specs that reasoning models deliberate over to train themselves (https://arxiv.org/abs/2412.16339).
These strings are taken to encode norms, intents, or values, and are processed by the model through self-critique or adherence routines. The hope is that the model's behavior will conform to these strings, or can at least be evaluated against them.
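A schematic sketch of this critique-and-adherence loop is below. The `generate` function is a placeholder standing in for any language-model call, and the two-line spec is invented; actual methods like Constitutional AI or deliberative alignment use such critiques to produce training data, not just to revise a single reply at inference time.

```python
# Schematic sketch: a written spec is the unit of alignment, and the model
# critiques and revises its own draft against it. `generate` is a placeholder,
# not a real library call; the SPEC text is invented for illustration.

SPEC = """1. Do not give medical diagnoses; suggest seeing a professional.
2. Be honest about uncertainty."""

def generate(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    raise NotImplementedError

def self_critique_respond(user_prompt: str, spec: str = SPEC) -> str:
    # Draft an answer, critique it against the spec, then revise.
    draft = generate(user_prompt)
    critique = generate(
        f"Spec:\n{spec}\n\nDraft reply:\n{draft}\n\n"
        "List any ways the draft violates the spec."
    )
    revised = generate(
        f"Spec:\n{spec}\n\nDraft reply:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Rewrite the reply so it adheres to the spec."
    )
    return revised
```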
Although these methods are more flexible and richer than preference-based ones, they suffer from their own set of limitations, covered in §1.2.
Thus, despite their flexibility, text strings and self-critique lack the rigor required for high-stakes socio-technical alignment.
In §1.3, we’ll introduce four practical alignment techniques that go beyond preferences and free-form textual ‘intents’ to capture deeper, more structured representations of human values and norms.
Addressing the challenge of socio-technical alignment requires redesigning institutional structures. Yet we will argue that the formal toolkit for institution design inherited from the 20th century (microeconomics, game theory, mechanism design, welfare economics, and social choice theory) is inadequate. We call this set of theories the Standard Institution Design Toolkit (SIDT).
These theories model agents via a thin conception of rationality: individuals are presumed to possess intrinsic preference profiles, utility functions, or payoff matrices. These representations have serious limitations: (1) they cannot be inspected by others[*]; (2) they do not reference any underlying notion of the good; (3) they are blind to social context, such as shared values, norms, beliefs, or group identities[1].
People’s preferences are often incomplete, inconsistent, or unstable over time.
Since utility theory is not designed to model agents who change, reshape, and discover preferences over time, much less agents who reason about which preferences or values are more sensible or justified to hold, it is unlikely to be up to the task of capturing human-like reflection about values. Instead, thicker approaches to human values and choice are likely necessary, as we describe below.
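As a tiny illustration of the inconsistency point, with invented preferences: if someone strictly prefers a to b, b to c, and c to a, no utility function (equivalently, no single ranking) can represent them, which a brute-force check makes visible.

```python
# Illustration with invented preferences: a cyclic (intransitive) preference
# pattern admits no utility function u with u(a) > u(b) > u(c) > u(a),
# so the standard model simply cannot describe this agent.

from itertools import permutations

prefers = {("a", "b"), ("b", "c"), ("c", "a")}  # strict pairwise preferences

def representable(options: list[str]) -> bool:
    """Return True if some ordering (hence some utility function) agrees
    with every stated strict preference."""
    for order in permutations(options):
        rank = {x: i for i, x in enumerate(order)}  # lower rank = better
        if all(rank[x] < rank[y] for x, y in prefers):
            return True
    return False

print(representable(["a", "b", "c"]))  # -> False: no utility function fits
```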
Revealed preferences are also open to manipulation. We have known since the debates in welfare economics (most closely associated with Amartya Sen) that revealed preference is a limited measure of benefit. It has become ever clearer that businesses, governments, and other entities have learned to exploit individuals under the guise of serving their preferences, using AI[2], and that current AI models actively engage in reward hacking[*].
Since LLM-based AI is reaching even more of society than recommender systems did, there is, thankfully, much more appetite to overcome these problems with revealed preference than there was when Sen was working in development economics.[5]
This seems to demand a deep economic shift, one that is difficult to model using conventional microeconomics or to measure using welfare economics, because in those models humans count as 'fulfilling their preferences' even when they are doomscrolling, addicted to AI pornography, and so on.