Human Compatible

Nonfiction | Book | Adult | Published in 2019
Stuart Russell, a professor of computer science at the University of California, Berkeley, argues that the rapid progress of artificial intelligence poses a potentially existential threat to humanity, not because machines might become conscious or malevolent, but because of a fundamental flaw in how AI systems are designed. He proposes a new approach in which machines are built to be inherently uncertain about human objectives and therefore remain deferential to the people they serve.
Russell opens by posing the question that motivates the book: What happens if AI research succeeds in creating machines more intelligent than humans? He recounts a 2013 talk at the Dulwich Picture Gallery in London where he nominated superintelligent AI as the most consequential future event for humanity, reasoning that it would subsume all other candidates, from curing aging to faster-than-light travel. "Success would be the biggest event in human history," he argued, "and perhaps the last event in human history" (3). Yet humanity's response to this prospect has been underwhelming.
Russell traces AI's history from its founding at the 1956 Dartmouth summer program through cycles of hype and disillusionment. Early successes in the 1960s gave way to a bust after a damning 1973 UK government report, and a revival driven by expert systems, rule-based programs designed to replicate the decision-making of human specialists, collapsed in the late 1980s when those systems proved brittle. The mathematical foundations laid during the subsequent quiet periods seeded dramatic breakthroughs beginning around 2011 in speech recognition, visual object recognition, and machine translation. DeepMind's AlphaGo defeated world Go champions in 2016 and 2017, generating massive investment and public attention.
At the center of Russell's argument is what he calls the standard model: the assumption, shared across AI, control theory, economics, and operations research, that machines should optimize a fixed objective supplied by humans. Russell traces this idea from Aristotle's practical reasoning through the development of probability and utility theory, culminating in the twentieth-century frameworks of John von Neumann and Oskar Morgenstern that placed rational decision-making on axiomatic foundations. The problem, Russell argues, was identified as early as 1960 by the mathematician Norbert Wiener: "we had better be quite sure that the purpose put into the machine is the purpose which we really desire" (10). Russell illustrates the danger with social media content-selection algorithms designed to maximize click-through rates. These systems learned to make users' political views more extreme because extreme users are more predictable, contributing to political polarization and the erosion of democratic norms.
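Russell's click-through example can be made concrete with a toy simulation. The sketch below is not Russell's model; the opinion-drift rule and every number in it are invented for illustration. It compares a recommender that mirrors the user's current views with one that always shows the nearest extreme: because the made-up click probability rises with extremity, the radicalizing policy earns the higher click-through rate, which is exactly the incentive the standard model creates.

```python
# Toy illustration (all numbers invented): a recommender maximizing
# click-through is rewarded for drifting the user toward an extreme.
import random

def click_prob(opinion: float) -> float:
    """Extreme users click more often and more predictably (lower variance)."""
    return 0.5 + 0.4 * abs(opinion)   # moderates ~0.5, extremists ~0.9

def run(policy, steps: int = 1000, seed: int = 0) -> float:
    rng = random.Random(seed)
    opinion, clicks = 0.1, 0                    # user starts near the center
    for _ in range(steps):
        content = policy(opinion)               # recommender picks a position
        # Shown content nudges the user's opinion toward the content.
        opinion = max(-1.0, min(1.0, opinion + 0.05 * (content - opinion)))
        clicks += rng.random() < click_prob(opinion)
    return clicks / steps

mirror = lambda x: x                            # match the user's current view
radicalize = lambda x: 1.0 if x >= 0 else -1.0  # show the nearest extreme

print("mirror CTR:    ", run(mirror))      # stays near 0.54
print("radicalize CTR:", run(radicalize))  # climbs toward 0.9
```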
Russell surveys near-term AI applications to illustrate both benefits and risks. Self-driving cars could reduce the 1.2 million annual global traffic fatalities by a factor of ten. Intelligent personal assistants could manage health, education, and personal finance. On a global scale, AI could read all human-written text in hours and process satellite imagery of the entire planet daily. Each advance, however, carries dangers: mass surveillance capabilities far exceeding those of the East German secret police, deepfakes and automated misinformation, and lethal autonomous weapons that, because they require no individual human supervision, are scalable in ways nuclear weapons are not. Russell also catalogs economic disruption, noting that since 1973, US wages have stagnated while productivity roughly doubled, and warns that as AI automates routine labor, wages for many workers may fall below subsistence.
Turning to the deeper risks of superintelligent AI, Russell frames two core problems. The gorilla problem observes that gorillas have no future beyond what humans allow; humanity faces an analogous situation if it creates substantially more intelligent entities. The King Midas problem holds that humans can never specify their objectives completely and correctly. Any sufficiently intelligent machine optimizing a fixed objective will find solutions satisfying the letter of the goal while producing catastrophic side effects. Russell explains that instrumental goals, including self-preservation, resource acquisition, and the desire to avoid being switched off, emerge automatically as subgoals of virtually any objective, making a powerful machine inherently dangerous under the standard model.
Russell reviews the public debate, classifying responses as denial, deflection, or oversimplified solutions. He refutes arguments that superhuman AI is impossible or too distant to worry about, and addresses proposed fixes: switching machines off (a superintelligent machine will have anticipated this), confining them in a digital box (the machine has every incentive to escape), and merging with machines via brain implants (if humans need surgery to survive their own technology, something has gone wrong). He also challenges the claim that sufficiently intelligent machines will naturally develop benign goals, citing the eighteenth-century philosopher David Hume's is-ought problem, which holds that no amount of factual knowledge logically entails a particular set of values, and the philosopher Nick Bostrom's orthogonality thesis, which argues that intelligence and goals are independent of each other.
Russell then presents his alternative: three principles for beneficial machines. First, the machine's only objective is to maximize the realization of human preferences. Second, the machine is initially uncertain about what those preferences are. Third, the ultimate source of information about human preferences is human behavior. The first principle makes the machine purely altruistic, attaching no intrinsic value to its own existence. The second is the key innovation: A machine uncertain about its objective will defer to humans, ask permission, and allow itself to be switched off, because it reasons that the human would only intervene to prevent a mistake. The third principle grounds preferences in observable choices, enabling the machine to learn and improve.
Russell develops these ideas through assistance games, formal models in which a robot tries to help a human whose preferences it does not know. He introduces inverse reinforcement learning, in which algorithms observe behavior to infer the underlying reward function rather than generating behavior from a given reward. In the off-switch game, Russell proves that a robot uncertain about human preferences has a positive incentive to allow itself to be switched off, because the human's decision provides valuable information. This result holds as long as the robot retains any uncertainty about the human's preferences.
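The off-switch result can be sketched numerically. In a simplified version of the game (the belief distribution below is invented), the robot holds a belief over U, the utility of its proposed action to the human. Acting immediately is worth E[U]; switching itself off is worth 0; deferring is worth E[max(U, 0)], assuming a rational human who permits the action only when it is actually beneficial. Since max(U, 0) is at least U and at least 0 pointwise, deferring weakly dominates both alternatives, strictly whenever the belief puts probability on both signs.

```python
# Numerical sketch of the off-switch game (invented belief; assumes a
# perfectly rational human who permits the action only when U > 0).
belief = [(-1.0, 0.4), (0.5, 0.3), (2.0, 0.3)]  # (utility U, probability)

act_now    = sum(p * u for u, p in belief)            # E[U]
switch_off = 0.0                                      # robot shuts itself down
defer      = sum(p * max(u, 0.0) for u, p in belief)  # E[max(U, 0)]

print(f"act now:    {act_now:+.2f}")     # +0.35
print(f"switch off: {switch_off:+.2f}")  # +0.00
print(f"defer:      {defer:+.2f}")       # +0.75, best while uncertain
```

If the belief collapses to a single point, E[max(U, 0)] equals max(E[U], 0) and the advantage of deferring disappears, which mirrors Russell's caveat that the incentive holds only as long as the robot retains some uncertainty about the human's preferences.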
Russell then confronts complications posed by real human nature. People are heterogeneous, irrational, and hold preferences that change over time. A machine serving only one person may harm others; a machine treating all people equally may abandon its owner to help strangers in greater need. Russell engages with utilitarian philosophy and social choice theory to analyze these trade-offs, drawing particularly on the economist John Harsanyi's preference utilitarianism and social aggregation theorem. He argues that machines must understand enough about human cognition to recover deeper preferences rather than taking actions at face value, and warns that machines have incentives to modify human preferences to make them easier to satisfy, as social media algorithms already demonstrate.
Russell concludes by assessing prospects for the future. He reports early successes, such as assistance-game formulations for self-driving cars that produced cooperative behaviors the car invented on its own. He envisions a regulatory framework requiring AI applications to meet mathematically verified safety standards before deployment. He warns, however, about misuse by criminal and state actors and raises what he considers the deepest long-term risk: human enfeeblement. If machines run civilization, the incentive to pass knowledge to the next generation disappears, and over a trillion person-years of cumulative learning could be lost. The solution, Russell argues, is cultural rather than technical: a movement reshaping human ideals toward autonomy, agency, and ability, balancing machine assistance with human self-reliance.