Artificial‑intelligence alignment cannot be solved by focusing on single systems in a vacuum. Even perfectly intent‑aligned models will go awry when embedded inside misaligned economic, political, or social institutions. We argue that two dominant paradigms—(i) preference/utility maximisation inherited from the Standard Institution Design Toolkit (SIDT) and (ii) prompt‑ or self‑critique–based “text alignment”—are structurally incapable of delivering robust socio‑technical alignment. Instead we propose Full‑Stack Alignment (FSA): a new toolkit centred on explicit, structured representations of human norms and values. After analysing the shortcomings of existing toolkits, we sketch four concrete representation techniques and illustrate their promise through five case studies ranging from AI negotiation to democratic regulation. We close with a research and deployment roadmap toward institutions and AI systems that co‑evolve for global human flourishing.
The growing field of socio‑technical alignment argues that beneficial AI outcomes require more than aligning individual systems with operators’ intentions. Even perfectly intent‑aligned AI systems will become misaligned if deployed within broader institutions—such as profit‑driven corporations, competitive nation‑states, or inadequately regulated markets—that conflict with global human flourishing.
Once we agree that it is important to co‑align artificial intelligence and institutions, the key question becomes how.
Historically, researchers have treated the problem as one of game theory and social choice, modelling both humans and AI agents as rational expected‑utility maximisers. More recently, practical work has shifted toward text‑based paradigms such as RLHF, Constitutional AI, or model specifications that encode values in natural‑language strings. While each paradigm improved on its predecessor, neither meets the demands of high‑stakes alignment. Sections 2.1 and 2.2 analyse these shortcomings; Section 3 introduces a richer toolkit, and Sections 4–5 lay out evidence and a roadmap.
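To make the first paradigm concrete, the sketch below shows the agent model it presupposes: each actor is reduced to a rational expected-utility maximiser choosing among lotteries over outcomes. All names and the toy payoffs are illustrative, not drawn from the paper; the point is only to show how thin this representation of an agent's values is.

```python
# Illustrative sketch (hypothetical names and payoffs): the SIDT-style
# agent model reduces a human or AI actor to an expected-utility maximiser.

def expected_utility(action, outcomes, utility):
    """Probability-weighted sum of utilities for one action's outcome lottery."""
    return sum(p * utility(o) for o, p in outcomes[action].items())

def choose(actions, outcomes, utility):
    """A 'rational agent' in the SIDT sense: argmax of expected utility."""
    return max(actions, key=lambda a: expected_utility(a, outcomes, utility))

# Toy decision: two actions, each a lottery mapping payoff -> probability.
outcomes = {
    "cooperate": {3: 0.8, 0: 0.2},  # EU = 0.8*3 + 0.2*0 = 2.4
    "defect":    {5: 0.5, 1: 0.5},  # EU = 0.5*5 + 0.5*1 = 3.0
}
utility = lambda o: o  # utility collapses to raw payoff in this toy model

best = choose(outcomes.keys(), outcomes, utility)
print(best)  # the maximiser picks "defect" despite its social cost
```

Everything about the agent's norms and values must be squeezed into the single scalar `utility` returns, which is precisely the representational poverty the richer toolkit in Section 3 is meant to overcome.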
To address socio‑technical alignment, we must often redesign institutional structures, yet the 20th‑century toolkit—micro‑economics, game theory, mechanism design, welfare economics, and social‑choice theory—is inadequate for the task. We label this package the Standard Institution Design Toolkit (SIDT) and highlight six core limitations: