<aside> 💡 Ideally, the below will be co-signed by a small group of relevant academics.

(So far the main authors are Joe Edelman and Tan Zhi Xuan, with input from Atoosa Kasirzadeh, Ryan Lowe, Davidad, Seth Lazar, and Manon Revel.)

</aside>

Abstract

The dominant social theories of the 20th century—microeconomics, game theory, social choice, mechanism design, and welfare economics—share a critical limitation: they model agents as context-free, stripped of shared values, norms, and beliefs. While this simplification enabled mathematical tractability and institutional design, five emerging challenges in AI alignment reveal its limits: the failure of revealed preferences to resist algorithmic exploitation; the limitations of preference-aggregation for social choice and deliberation; the maintenance of social fabric against technological disruption; the challenge of instilling human-like cooperation in artificial agents; and the development of ‘wise’ AI systems that can pursue notions of the good beyond preference satisfaction. We propose two complementary theoretical approaches: extending existing frameworks through modified utility functions that incorporate information about norms and values, or more fundamentally, rebuilding our social theories from the ground up to model embedded agency—incorporating social ties, common knowledge, and shared values directly into their mathematical foundations. Through careful curation and strategic risk-taking by scholars, we aim to build an academic movement that can provide new formal desiderata for mechanism design, enabling the development of robustly cooperative human-AI systems and institutions.

Why are New Models Needed?

While academic thought has delivered many simplified models of social behavior, the 20th century was dominated by a ‘central axis’ of five social theories: microeconomics, game theory, social choice, mechanism design, and welfare economics. These five were all supported from below by rational choice theory[0], and below that, by an understanding of individual actors with fixed preference profiles or utility functions.

We call these the ‘central axis’ because they were the principal social theories that generated institutional designs, and that justified them. They also provided backing for the dominant political theories, which were freedom- and fairness-based, again focusing on individual actors and the (free or fair) pursuit of their individual preferences.


Something to note about these central axis theories is that they model agents as ‘context-free’. Individuals are supposedly born with their preference profiles, utility functions, or payoff matrices intact, and there’s no place in the models for the shared values, shared norms, shared beliefs, group identities, etc., that shape these individual preferences.[1]
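To make ‘context-free’ concrete, here is a minimal sketch of the agent these frameworks share; the notation is ours, and it is only meant to summarize the standard expected-utility setup.

```latex
% A context-free agent i: a fixed utility function over outcomes,
% maximized in expectation, with no term for shared values, norms,
% beliefs, or group identities.
\[
  u_i : X \to \mathbb{R} \quad \text{(fixed, given at the outset)},
  \qquad
  a_i^{*} \in \arg\max_{a \in A_i} \ \mathbb{E}_{x \sim p(x \mid a)}\big[u_i(x)\big]
\]
% Anything social enters, at most, through the outcome distribution
% p(x | a) (e.g. other agents' strategies), never through u_i itself.
```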

Despite this lack of context (or because of it), we imagine these theories achieved their centrality because they are mathematically expressive, powerful, and parsimonious, and because they sat well with widespread philosophical intuitions. Plus, they were ‘good enough’ for the institutional design challenges of their day.

But recently, the field of AI alignment has begun to push against this state of affairs. We see six reasons for this.

  1. The problem with revealed preference. The need to find objective functions or reward models that can safely be used to train AI brings us right back to the debates in welfare economics (most closely associated with Amartya Sen) about the limits of revealed preference as a measure of benefit. But in the meantime, it has become clearer that businesses, governments, and other entities have indeed learned to exploit individuals under the guise of serving their preferences, using AI.[2] For instance, social media platforms learned to manipulate user engagement, leading to addiction-like behaviors and prioritizing platform growth over individual well-being.

    Alignment methods based on explicit values[3] or norms[4] are starting to emerge and are showing clear advantages over approaches based on revealed preference. In general, there’s much more of an appetite to overcome these problems with revealed preference than there was when these debates played out in development economics.[5]

  2. Socio-technical alignment requires new institutional forms. AI alignment challenges seem to require inventing new forms of governance, and many directions for institutional innovation aren’t supported by the central axis. To give but one example, the central axis version of social choice is blind to the most powerful lever in deliberation: inspiration. The best mechanisms should not just accommodate existing preferences; they should allow for the formation and inspiration of new ones, by creating environments where individuals grow in their understanding of what is good.[6] (A formal sketch of this contrast follows this list.) Can mechanism design catch up with ancient Athens?

    Recently, mechanisms by Conitzer[7] and Klingefjord et al.[8] have left the central axis behind and built on different social theories.

  3. The need to model and preserve ‘the social fabric’. One thing we want powerful AI to be careful about is precisely that shared context: the networks of trust, of values alignment, and of normative cooperation that keep society working. The central axis theories, because they don’t model agents as embedded in a social context, mostly pretend this social fabric doesn’t exist.[9] This means alignment efforts aimed at preserving or enhancing the social fabric are hard to build on these theories!

    We already see the results of this playing out in society: recent decades have seen metrics based on preferences or transactions go up. Our “wealth” is increasing. But liberals often suspect our capacity for social cognition has declined via misinformation, conspiracies, etc.; conservatives suspect we’ve suffered a decline in morals and aesthetics; both tend to say our norms and channels for cooperation have eroded. Whether or not such declines are happening, it doesn’t seem likely that measures of consumption, engagement, or revealed preference would show them. These things are important; they should find a prominent place in our social theories and ideas of welfare.

  4. Cooperating AI agents. Another challenge in alignment is to get a vast ecosystem of AI agents cooperating. This turns out to be a sore point for the central axis theories, and a place where ideas about shared norms or values from outside the central axis have already been imported, to address mismatches between the equilibria predicted by context-free models and those observed among real, cooperating agents.

    We can work to make AI agents that cooperate as we do, but to do so we may need to adopt more sophisticated, yet tractable, norm- and value-embedded models of human cooperation. (A minimal illustration of how a shared norm shifts predicted equilibria follows this list.)

  5. The challenge of super-wisdom. The central axis assumes agents have fixed preferences, disconnected from one another and from any broader notion of the good. This means AIs, individuals, and societies cannot collectively aspire to ideals beyond preference satisfaction.

    However you understand our social embedding (whether as shared values, norms, or beliefs), acknowledging that it exists seems to suggest kinds of goodness beyond preference satisfaction. A context of shared values suggests moral progress or learning. A context of evolving norms suggests game-theoretic notions of goodness, such as cooperation at higher scales or across diverse ecosystems. A context of shared beliefs suggests the aspiration to discover higher truths.

    Yet, so long as individual preferences are the yardstick of the good, none of these other notions can be admitted, and no one is allowed to know better than anyone else. To make progress, revealed preferences must be reformulated as endogenous: as a function of underlying values (themselves subject to moral learning), plus social norms and strategic considerations. (A schematic version of this decomposition follows this list.)

  6. New tools. Finally, displacing the central axis theories seems newly feasible. It’s likely that the success of the central axis was partly based on the availability of hard data: behavioral data in the form of votes, purchases, and clicks. This was, for a time, far easier to obtain than intersubjective and qualitative data about shared context.

    LLMs make the systematic investigation of shared norms, values, and beliefs much easier, because they can (1) perform qualitative interviews at scale; (2) simulate people (though rather poorly at the moment); and (3) crystallize, in their weights, these linguistic and conceptual aspects of our social fabric in a form we can see.

    This will let new models of socially embedded agents compete with the central axis theories on rigor.
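Regarding point 2, here is a minimal formal sketch of the contrast; the notation is ours and is meant only as an illustration, not as a claim about any particular mechanism in the literature.

```latex
% Classical social choice: aggregate a *fixed* profile of preferences
% (orders over a set of alternatives A) into a collective ranking.
\[
  F : \mathcal{L}(A)^{n} \to \mathcal{L}(A)
\]
% A deliberation-aware mechanism would also model how the profile itself
% is transformed by the process (inspiration, learning, moral growth):
\[
  D : \mathcal{L}(A)^{n} \to \mathcal{L}(A)^{n},
  \qquad
  \text{outcome} \;=\; F\big(D(\succsim_1, \dots, \succsim_n)\big)
\]
```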
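Regarding point 4, the sketch below shows, in a deliberately toy way, how importing a shared norm into an otherwise context-free model changes the predicted equilibrium. The payoff numbers and the additive ‘norm penalty’ are hypothetical choices made purely for illustration.

```python
# Toy sketch: a shared norm against defection, modeled as a payoff penalty,
# shifts the pure-strategy Nash equilibrium of a one-shot prisoner's dilemma.

from itertools import product

ACTIONS = ["cooperate", "defect"]

# Standard context-free payoffs: (row action, col action) -> (u_row, u_col)
BASE_PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def norm_adjusted(payoffs, norm_penalty):
    """Subtract a shared-norm penalty from any player who defects."""
    adjusted = {}
    for (a_row, a_col), (u_row, u_col) in payoffs.items():
        adjusted[(a_row, a_col)] = (
            u_row - (norm_penalty if a_row == "defect" else 0),
            u_col - (norm_penalty if a_col == "defect" else 0),
        )
    return adjusted

def pure_nash_equilibria(payoffs):
    """All pure-strategy Nash equilibria of a 2x2 game given as a payoff dict."""
    equilibria = []
    for a_row, a_col in product(ACTIONS, repeat=2):
        u_row, u_col = payoffs[(a_row, a_col)]
        row_ok = all(payoffs[(alt, a_col)][0] <= u_row for alt in ACTIONS)
        col_ok = all(payoffs[(a_row, alt)][1] <= u_col for alt in ACTIONS)
        if row_ok and col_ok:
            equilibria.append((a_row, a_col))
    return equilibria

print(pure_nash_equilibria(BASE_PAYOFFS))                    # [('defect', 'defect')]
print(pure_nash_equilibria(norm_adjusted(BASE_PAYOFFS, 3)))  # [('cooperate', 'cooperate')]
```

Real norm- and value-embedded models are of course richer than a fixed penalty (norms can be conditional, learned, and contested), but even this toy version shows where the context-free prediction and the cooperative one come apart.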
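Regarding point 5, here is a schematic version of the reformulation; the decomposition and symbols below are ours, chosen for illustration rather than taken from a specific theory.

```latex
% Choice behavior as endogenous rather than primitive: a function of
% underlying values v_i (revisable through moral learning L),
% shared norms N, and the strategic situation s.
\[
  \succsim_i \;=\; f(v_i,\, N,\, s),
  \qquad
  v_i^{\,t+1} \;=\; L\big(v_i^{\,t},\ \text{experience}_i^{\,t}\big)
\]
% Welfare and alignment targets can then be stated in terms of v_i and N,
% not only in terms of the choices they induce.
```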

All these forces suggest a shift away from the context-free, central axis theories.

This is a paradigm shift, significantly bigger than “behavioral economics”.

Exciting!

3 Approaches to Building New Models