<aside> 💡 Ideally, the below will be co-signed by a small group of relevant academics.

(So far the main authors are Joe Edelman and Tan Zhi Xuan, with input from Atoosa Kasirzadeh, Manon Revel, Brad Knox, Ryan Lowe, Davidad, and Seth Lazar.)

</aside>


Abstract

The dominant social theories of the 20th century—microeconomics, game theory, social choice, mechanism design, and welfare economics—share a critical limitation: they model agents as thinly rational and context-free, stripped of shared values, norms, and beliefs. While this simplification enabled mathematical tractability and institutional design, five emerging challenges in AI alignment and institution-building reveal its limits: the failure of revealed preferences to align with our deeper values; the limitations of preference-aggregation for collective decision-making; the maintenance of the social fabric against technological disruption; the challenge of instilling human-like cooperation in artificial agents; and the development of ‘wise’ AI systems that can pursue notions of the good beyond preference satisfaction. We propose two complementary theoretical approaches: extending existing frameworks through modified utility functions that incorporate information about norms and values, or more fundamentally, rebuilding our theories from the ground up to model the thickness of human values and choices through new representations, normative foundations, and solution concepts. Through curation and strategic risk-taking by scholars, we aim to build an academic movement that can provide new theories and formal desiderata for how we design algorithms, mechanisms, and institutions, enabling futures that preserve and enliven the depth and meaning of our individual and collective choices.
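As a purely illustrative sketch of the first approach (the notation below is ours, not a committed proposal), a ‘thickened’ utility function might augment an agent’s private payoff with terms that score compliance with shared norms and expression of shared values:

$$
U_i(a) \;=\; u_i(a) \;+\; \lambda \sum_{n \in \mathcal{N}} w_n\,\phi_n(a) \;+\; \mu \sum_{v \in \mathcal{V}} c_{i,v}\,g_v(a)
$$

Here $u_i$ is the usual private payoff, $\phi_n(a)$ measures how well action $a$ complies with a shared norm $n$, $g_v(a)$ measures how well it expresses a shared value $v$, and $\lambda$, $\mu$, $w_n$, $c_{i,v}$ are weights. Every symbol beyond $u_i$ is hypothetical and serves only to show what ‘incorporating information about norms and values’ could look like formally.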

Why are New Models and Theories Needed?

While academic thought has given us many simplified models of human behavior, the 20th century was dominated by a ‘central axis’ of theories that guided the development of our most consequential institutions, mechanisms, and algorithms: rational choice and microeconomics, game theory and mechanism design, welfare economics and social choice, decision theory, and the principle of reward maximization.

These theories all rested on a conception of individual actors with fixed preference profiles or utility functions. They also provided backing for the dominant political theories, which were freedom- and fairness-based, again focusing on individual actors and the (free or fair) pursuit of their individual preferences. More recently, these basic assumptions have undergirded the design of AI systems with their own utility functions, and efforts to ensure their safety and beneficence by aligning them with human preferences [4].


Something to note about these central axis theories is that they largely model agents as ‘context-free’, under a thin conception of what it means to be a rational agent. Individuals are, in effect, born with their preference profiles, utility functions, or payoff matrices intact; the model has no place for the shared values, shared norms, shared beliefs, group identities, and so on that shape these individual preferences.[1]

Despite this thinness and lack of context (or because of it), we imagine these theories achieved their centrality because they are mathematically expressive and powerful, they are parsimonious, and they sat well with widespread philosophical intuitions. They also seemed ‘good enough’ for the institution-design challenges of their day.

But recent developments in AI and AI alignment push against this state of affairs. We see six reasons for this.

  1. The problem with revealed preference. The need to find objective functions or reward models that can safely be used to train AI brings us right back to the debates in welfare economics (most closely associated with Amartya Sen) about the limits of revealed preference as a measure of benefit. But in the meantime, it has become clearer that businesses, governments, and other entities have indeed learned to exploit individuals under the guise of serving their preferences, using AI.[2] For instance, social media platforms learned to manipulate user engagement, fostering addiction-like behaviors and prioritizing platform growth over individual well-being.

    Alignment methods based on explicit values[3] or norms[4] are starting to emerge and are showing clear advantages over approaches based on revealed preference. In this setting, there is much more appetite to overcome the problems with revealed preference than there was when Sen was working in development economics.[5]

  2. Socio-technical alignment requires new institutional forms. AI alignment challenges seem to require inventing new forms of governance, and many directions for institutional innovation aren’t supported by the central axis. To give but one example, the central axis version of social choice is blind to the most powerful lever in deliberation: inspiration. The best mechanisms should not just accommodate existing preferences; they should allow for the formation and inspiration of new ones, by creating environments where individuals grow in their understanding of what is good.[6] Can mechanism design catch up with ancient Athens?

    Recently, mechanisms by Conitzer[7] and Klingefjord et al.[8] leave the central axis behind and build on different social theories. The success of these mechanisms points to a field of opportunity for mechanism designers, but for now, going there means leaving behind the formalisms dear to them (maximin, strategy-proofness, Condorcet consistency, Kaldor-Hicks efficiency, proportionality, etc.). This makes new mechanisms harder to design and justify.

  3. The need to model and preserve ‘the social fabric’. One thing we want powerful AI to be careful about is that shared context: the networks of trust, of values alignment, and of normative cooperation that keep society working. Because the central axis theories don’t model agents as embedded in a social context, they mostly pretend this social fabric doesn't exist.[9] This means alignment efforts aimed at preserving or enhancing the social fabric are hard to build on these theories!

    We already see the results of this playing out in society: in recent decades, metrics based on preferences or transactions have gone up. Our “wealth” is increasing. But there seems to be broad agreement that this fails to account for something. For instance, liberals often suspect our capacity for social cognition has declined (via misinformation, conspiracies, etc.). Conservatives suspect we’ve suffered a decline in morals and aesthetics. Both tend to say our norms and channels for cooperation have eroded. Whether such declines are happening or not, it doesn’t seem likely that measures of consumption or engagement/revealed preference would show them. These things are important; they should find a prominent place in our social theories and notions of welfare.

  4. Cooperating AI agents. Another challenge in alignment is to get a vast ecosystem of AI agents cooperating. This turns out to be a sore point for the central axis theories, and a place where ideas about shared norms or values from outside the central axis have already been imported, to address mismatches between the equilibria predicted by context-free models and those observed among real, cooperating agents.

    We can work to make AI agents that cooperate as we do, but to do so we may need to adopt more sophisticated (yet tractable) norm- and value-embedded models of human cooperation; a minimal illustration of how such a model can shift equilibrium behavior appears after this list.

  5. The challenge of promoting wisdom. The central axis assumes agents have fixed preferences, disconnected from one another and from any broader notion of the good. This means that individuals, societies, and AI systems cannot collectively aspire to ideals beyond preference satisfaction.

    However you understand our social embedding (whether as shared values, norms, or beliefs), to acknowledge it exists seems to suggest kinds of goodness beyond preference satisfaction. For example, if humans have shared values, this suggests moral progress or learning is possible. If humans have evolving norms, this suggests we could aim at game-theoretic notions of goodness, such as cooperation at higher scales or across diverse ecosystems. If humans have shared beliefs, there’s the aspiration to discover higher truths.

    Yet, so long as individual preferences are the yardstick of the good, none of these other notions can be admitted, and no one is allowed to know better than anyone else.

  6. New tools. Finally, displacing the central axis theories seems newly feasible. It’s likely that the success of the central axis was partly based on the availability of hard data: behavioral data in the form of votes, purchases, and clicks. This was, for a time, far easier to obtain than intersubjective and qualitative data about shared context.

    LLMs make the systematic investigation of shared norms, values, and beliefs much easier, because (1) they can perform qualitative interviews at scale; and (2) their weights crystallize these linguistic and conceptual aspects of our social fabric into a tangible form. This lets new models of socially embedded agents achieve a similar, data-driven rigor.
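To make the equilibrium mismatch in point 4 concrete, here is a minimal sketch (in Python, with made-up payoff numbers and a hypothetical norm-weight parameter `kappa`); it is not a model proposed above, just one way a norm-embedded extension can be formalized. A context-free Prisoner’s Dilemma predicts mutual defection as the only pure-strategy equilibrium, while adding a simple norm-adherence cost to each agent’s utility shifts the unique equilibrium to mutual cooperation:

```python
# Illustrative only: a shared 'cooperate' norm, internalized as a psychological cost
# for violating it, changes the pure-strategy Nash equilibria of a Prisoner's Dilemma.
# Payoffs and the norm weight `kappa` are made-up numbers for the sketch.

import itertools

# Material payoffs (row player, column player); actions are "C" (cooperate) / "D" (defect).
MATERIAL = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 4),
    ("D", "C"): (4, 0),
    ("D", "D"): (1, 1),
}

def utility(profile, player, kappa):
    """Material payoff minus a cost `kappa` if the player violates the shared norm."""
    material = MATERIAL[profile][player]
    violates_norm = profile[player] == "D"
    return material - (kappa if violates_norm else 0.0)

def pure_nash_equilibria(kappa):
    """Enumerate pure-strategy Nash equilibria of the norm-augmented game."""
    actions = ("C", "D")
    equilibria = []
    for profile in itertools.product(actions, repeat=2):
        stable = True
        for player in (0, 1):
            for deviation in actions:
                alternative = list(profile)
                alternative[player] = deviation
                if utility(tuple(alternative), player, kappa) > utility(profile, player, kappa):
                    stable = False
        if stable:
            equilibria.append(profile)
    return equilibria

print("No shared norm (kappa=0):    ", pure_nash_equilibria(kappa=0.0))  # [('D', 'D')]
print("Internalized norm (kappa=2): ", pure_nash_equilibria(kappa=2.0))  # [('C', 'C')]
```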

All these forces suggest a shift away from the context-free central axis theories. This is a paradigm shift, potentially much bigger than “behavioral economics”.

Exciting!