Annotated Bibliography of Recommended Materials

The (1) to (5) ratings below indicate each entry's priority, with (1) being the highest.


The following categories represent clusters of ideas that we believe warrant further attention and research for the purpose of developing human-compatible AI (HCAI), as well as prerequisite materials for approaching them. The categories are not meant to be mutually exclusive or hierarchical. Indeed, many subproblems of HCAI development may end up relating in a chicken-and-egg fashion.
Theoretical Background - Prerequisites from fields like probability theory, logic, theoretical CS, and game theory.
Introduction to AI/ML - Pedagogically helpful AI background.
Prevailing Methods in AI/ML - Some important state-of-the-art methods to know about.
Cybersecurity and AI - Valuable perspectives on adversarial digital behavior.
Broad perspectives on HCAI - Overviews of challenges and approaches to HCAI.
Corrigibility - Designing AI systems that do not resist being redesigned or switched off.
Foundational Work - Developing the theoretical underpinnings of AI in ways that enable HCAI research.
Interactive AI - Productively incorporating human oversight with AI operations.
Preference Inference - Building AI systems that learn what humans want.
Reward Engineering - Designing objectives for AIs that are sure to result in desirable behavior.
Robustness - Building systems that perform well in novel environments.
Transparency - Enabling human understanding of AI systems.
Cognitive science - Models of human psychology relevant to HCAI.
Moral Theory - Moral issues that HCAI systems and their developers might encounter.


Theoretical Background

Certain basic mathematical and statistical principles have contributed greatly to the development of today's prevailing methods in AI. This category surveys areas of knowledge such as probability theory, logic, theoretical CS, game theory, and economics that are likely to be helpful in understanding existing AI research, and advancing HCAI research more specifically.
Course Materials:
(2) Machine Learning (lecture notes), Andrew Ng.

Course notes introducing various prevailing ML methods, including SVMs, the Perceptron algorithm, k-means clustering, Gaussian mixtures, the EM Algorithm, factor analysis, PCA, ICA, and RL.

(3) Mathematics of Machine Learning, Philippe Rigollet.
(3) Algorithmic Game Theory (Fall 2013), Tim Roughgarden.

Covers the interface of theoretical CS and econ: mechanism design for auctions, algorithms and complexity theory for learning and computing Nash and market equilibria, with case studies.

(2) Game Theory: Analysis of Conflict, Roger Myerson.

Well-written text motivating and expositing the theory of bargaining, communication, and cooperation in game theory, with rigorous decision-theoretic foundations and formal game representations.

(3) Information Theory, Inference, and Learning Algorithms, Parts I-III, MacKay.
(3) Causality: models, reasoning, and inference, Judea Pearl.

A precise methodology for AI systems to learn causal models of the world and ask counterfactual questions such as “What will happen if I move my arm here?”

(3) Computability and Logic, Chapters 1-4, 8-20, 23, 25, and 27, Boolos and Burgess.
(3) Mathematical Logic: A course with exercises — Part I, Cori and Lascar.

Covers propositional calculus, boolean algebras, predicate calculus, and major completeness theorems demonstrating the adequacy of various proof methods.
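
For orientation, the key completeness result covered there (Gödel's completeness theorem for first-order logic) says that provability coincides with semantic consequence:

    \Gamma \vdash \varphi \quad\iff\quad \Gamma \models \varphi

This is the precise sense in which the proof methods are "adequate": everything true in all models of the hypotheses is derivable from them.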

(3) Mathematical Logic: A course with exercises — Part II, Chapters 5 and 6, Cori and Lascar.

Covers recursion theory, the formalization of arithmetic, and Gödel’s theorems, illustrating how statements about algorithms can be expressed and proven in the language of arithmetic.

(4) An Introduction to Game Theory, Chapters 1-7,14,15, Osborne.
(4) All of Statistics, Chapters 1-12 or more (an easy-to-read overview of the field), Wasserman.
(4) Bayesian Data Analysis, Gelman and Rubin.
(4) Introduction to Automata Theory, Languages, and Computation, Chapters 1-10, Hopcroft and Ullman.
(4) Introduction to the Theory of Computation, Chapters 1-5,7 (alternatively to Hopcroft and Ullman), Michael Sipser.
(4) A Mathematical Introduction to Logic, Chapters 0-3 (alternatively to Boolos and Burgess), Enderton.
(5) Understanding Formal Methods, Chapters 1-10, Monin.
(5) Principles of Cyber-Physical Systems, Chapters 1-7,9, Alur.
(5) Principles of Model Checking, Chapters 1-7,10, Baier, Katoen, Larsen.
(5) Probability: Theory and Examples, Chapters 1-6, Durrett.

(for the more mathematically inclined reader)

(2) Machine Learning (online course), Andrew Ng.

Video lectures introducing linear regression, Octave/MATLAB, logistic regression, neural networks, SVMs, and surveying methods for recommender systems, anomaly detection, and “big data” applications.

(3) Game Theory I (Coursera), Jackson, Leyton-Brown, Shoham.
(4) Game Theory II (Coursera), Jackson, Leyton-Brown, Shoham.
(5) Introduction to Probability, Richard S. Sutton, Andrew G. Barto.
Published articles:
(3) Algorithmic Statistics, Peter Gacs, John T. Tromp, Paul M.B. Vitanyi.

Nice overview of some connections between Kolmogorov complexity and information theory.

(4) A Comprehensive Survey of Multiagent Reinforcement Learning, Busoniu, Babuska, De Schutter.
Unpublished articles:
(5) Handbook of Model Checking (to appear soon), Clarke, Henzinger, Veith.

Introduction to AI/ML

This category surveys pedagogically basic material for understanding today's prevailing methods in AI and machine learning. Some of the methods here are no longer the state of the art, and are included to aid in understanding more advanced current methods.
Course Materials:
(2) Machine Learning, Prof. Tommi Jaakkola.

Perceptrons, SVMs, regularization, regression methods, bias and variance, active learning, model description length, feature selection, boosting, EM, spectral clustering, graphical models.

(3) Deep Reinforcement Learning, OpenAI.
(4) Introduction to Artificial Intelligence, Norvig and Thrun.
(2) Deep Learning, Chapters 6-12, Goodfellow, Bengio, and Courville.

Covers deep feedforward networks, regularization for deep learning, optimization for training deep models, convolutional networks, and recurrent and recursive nets.

(2) Deep Learning, Chapters 1-5, Goodfellow, Bengio, and Courville.

Reviews linear algebra, probability and information theory, and numerical computation as part of a course on deep learning.

(2) Reinforcement Learning, Sutton and Barto.

Covers multi-armed bandits, finite MDPs, dynamic programming, Monte Carlo methods, temporal-difference learning, bootstrapping, tabular methods, on-policy and off-policy methods, eligibility traces, and policy gradients.

(2) Machine Learning: A Probabilistic Perspective, Murphy.

A modern textbook on machine learning.

(2) Pattern Recognition and Machine Learning, Bishop.
(2) Neural Networks and Deep Learning, Chapters 1-6 (more entry-level), Michael Nielsen.
(2) Artificial Intelligence: A Modern Approach, Chapters 1-17, Stuart Russell and Peter Norvig.

Used in over 1300 universities in over 110 countries. A comprehensive overview, covering problem-solving, uncertain knowledge and reasoning, learning, communicating, perceiving, and acting.

(3) Probabilistic graphical models: principles and techniques, Daphne Koller and Nir Friedman.

Comprehensive book on graphical models, which represent probability distributions in a way that is both principled and potentially transparent.

(4) The quest for artificial intelligence, Nils Nilsson.

A book-length history of the field of artificial intelligence, up until 2010 or so.

(1) Reinforcement Learning, Silver.

Lecture course on reinforcement learning taught by David Silver.

(2) Introduction to AI, Pieter Abbeel, Dan Klein.

Informed, uninformed, and adversarial search; constraint satisfaction; expectimax; MDPs and RL; graphical models; decision diagrams; naive Bayes, perceptrons, clustering; NLP; game-playing; robotics.

(2) Deep Learning, Nando de Freitas.
(3) Deep Reinforcement Learning, Schulman.
(3) Machine Learning, Nando de Freitas.
Published articles:
(2) A Few Useful Things to Know about Machine Learning, Pedro Domingos.

“[D]eveloping successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons.”

(3) Probabilistic machine learning and artificial intelligence, Zoubin Ghahramani.

Reviews the state of the art in Bayesian machine learning, including probabilistic programming, Bayesian optimization, data compression, and automatic model discovery.

(4) Deep learning in neural networks: An overview, Schmidhuber.

Prevailing Methods in AI/ML

Many currently known methods from machine learning and other disciplines already illustrate surprising results that should inform what we think is possible for future AI systems. Thus, while we do not yet know for certain what architecture future AI systems will use, it is still important for researchers working on HCAI to have a solid understanding of the current state of the art in AI methods. This category lists background reading that we hope will be beneficial for that purpose.
(3) Neural Networks, Hugo Larochelle.
Published articles:
(2) ImageNet classification with deep convolutional neural networks, Alex Krizhevsky, Ilya Sutskever, Geoff Hinton.

The seminal paper in deep learning.

(2) Human-level control through deep reinforcement learning, Mnih et al.
(2) Mastering the game of Go with deep neural networks and tree search, Silver et al.

Describes DeepMind’s superhuman Go-playing AI. Ties together several techniques in reinforcement learning and supervised learning.

(3) Representation Learning: A Review and New Perspectives, Bengio, Courville, Vincent.
(3) Deep Learning, LeCun, Bengio, Hinton.
(4) Human-level concept learning through probabilistic program induction, Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum.

State-of-the-art example of hierarchical structured probabilistic models, probably the main alternative to deep neural nets.

Cybersecurity and AI

Understanding the nature of adversarial behavior in a digital environment is important to modeling how powerful AI systems might attack or defend existing systems, as well as how status quo cyber infrastructure might leave future AI systems vulnerable to attack and manipulation. A cybersecurity perspective is helpful to understanding these dynamics.
Published articles:
(2) The AGI Containment Problem, James Babcock, Janos Kramar, Roman Yampolskiy.

Argues that testing future AGI systems for safety may itself pose risks, given that under certain circumstances an AGI may be capable of intentionally deceiving the humans overseeing the test.

(4) Can Machine Learning Be Secure?, Marco Barreno, Blaine Nelson, Russell Sears, Anthony D. Joseph, J. D. Tygar.

A taxonomy of attacks on machine learning systems, defenses against those attacks, an analytical lower bound on an attacker’s work function, and a list of open problems in security for AI systems.

Unpublished articles:
(2) Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples, Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, Ananthram Swami.

Illustrates how deep learning systems can adversarially be made to produce arbitrary desired outputs, by modifying their inputs in ways that are imperceptible to humans.

(2) Universal adversarial perturbations, Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, Pascal Frossard.

Illustrates how a deep net image classifier can be made to misclassify a large fraction of images by adding a single, image-agnostic, imperceptible-to-humans noise vector to its inputs.

(3) Fansmitter: Acoustic Data Exfiltration from (Speakerless) Air-Gapped Computers, Mordechai Guri, Yosef Solewicz, Andrey Daidakulov, Yuval Elovici.

Exhibits malware that can acoustically exfiltrate data such as encryption keys from an air-gapped computer, even when audio hardware and speakers are not present, achieving 900 bits/hr across 8 meters.

(4) Learning to Protect Communications with Adversarial Neural Cryptography, Martín Abadi, David G. Andersen (Google Brain).

Demonstrates that a neural network trained with access to both ends of a communication channel and feedback from an adversary can learn to encrypt the channel against the adversary.

(4) Artificial Intelligence Safety and Cybersecurity: a Timeline of AI Failures, Roman V. Yampolskiy.

Provides a long history of reported AI failures.

Broad perspectives on HCAI

There are many problems that need solving to ensure that future, highly advanced AI systems will be safe to use, including mathematical, engineering, and social challenges. We must sometimes take a bird's-eye view on these problems, for example, to notice that technical safety research will not be useful on its own unless the institutions that build AI have the resources and incentives to use that research. This category recommends reading that we think can help develop such a perspective.
Published articles:
(2) Artificial Intelligence as a Positive and Negative Factor in Global Risk, Eliezer Yudkowsky.

Examines ways in which AI could be used to reduce large-scale risks to society, or could be a source of risk in itself.

(2) The Value Learning Problem, Nate Soares.
(2) How We’re Predicting AI — or Failing To, Stuart Armstrong and Kaj Sotala.
(2) The Superintelligent Will: Motivation and Instrumental Rationality In Advanced Intelligent Agents, Nick Bostrom.

Argues that highly intelligent agents can have arbitrary end-goals (e.g. paperclip manufacturing), but are likely to pursue certain common subgoals such as resource acquisition.

(2) Research Priorities for Robust and Beneficial Artificial Intelligence, Stuart Russell, Daniel Dewey, Max Tegmark.

Lays out high-level problems in HCAI, including problems in economic impacts of AI, legal and ethical implications of AI, and CS research for AI validation, verification, security, and control.

(2) The Basic AI Drives, Steve Omohundro.

Informally argues that sufficiently advanced goal-oriented AI systems will pursue certain “drives” such as acquiring resources or increased intelligence that arise as subgoals of many end-goals.

(3) A Model of Pathways to Artificial Superintelligence Catastrophe for Risk and Decision Analysis, Anthony M. Barrett and Seth D. Baum.

Describes models of superintelligence catastrophe risk as trees representing boolean formulae, and gives recommendations for reducing such risk.

(3) Building Machines That Learn and Think Like People, Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman.

Outlines what current systems based on deep neural networks are still missing in terms of what they can learn and how they learn it (compared to humans), and describes some routes towards these goals.

(3) Concept Learning for Safe Autonomous AI, Kaj Sotala.

Argues that an AI may need to learn only a relatively small number of concepts in order to learn human values, based on the observation that humans seem to manage this way.

(3) A Roadmap towards Machine Intelligence, Tomas Mikolov, Armand Joulin, Marco Baroni.

How some research scientists at Facebook expect the incremental development of intelligent machines to occur.

(3) Thinking Inside the Box: Controlling and Using an Oracle AI, Stuart Armstrong, Anders Sandberg, Nick Bostrom.

Describes some challenges in the design of a boxed ‘Oracle AI’ system that has strictly controlled input and merely answers questions.

(4) Risks and Mitigation Strategies for Oracle AI, Stuart Armstrong.

Describes Oracle AI and methods of controlling it.

(4) The Singularity: A Philosophical Analysis, David Chalmers.

A philosophical discussion of the possibility of an intelligence explosion, how we could ensure good outcomes, and issues of personal identity.

(5) The evolved radio and its implications for modelling the evolution of novel sensors, Bird et al.

An example of an evolution-driven learning approach producing very unexpected results that aren’t what the designer hoped for.

(5) The AGI Containment Problem, James Babcock, Janos Kramar, and Roman Yampolskiy.
Unpublished articles:
(3) Towards Verified Artificial Intelligence, Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry.
(3) A survey of research questions for robust and beneficial AI, Future of Life Institute.

Broad discussion of potential research directions related to AI Safety, including short- and long-term technical work, forecasting, and policy.

(3) Strategic Implications of Openness in AI Development, Nick Bostrom.
(3) From Seed AI to Technological Singularity via Recursively Self-Improving Software, Roman V. Yampolskiy.
(5) Suffering-focused AI safety: Why “fail-safe” measures might be our top intervention, Lukas Gloor.

Discusses extreme downside risks of highly capable AI, such as dystopian futures.

News and magazine articles:
(1) Can digital machines think?, Alan Turing.

“It is customary […] to offer a grain of comfort […] that some particularly human characteristic could never be imitated by a machine. … I believe that no such bounds can be set.”

(3) Why the Future Doesn’t Need Us, Bill Joy.

A prescient, gloomy, and somewhat rambling article about humanity’s role in a future shaped by genetics, robotics, and nanotech.

(5) Exploratory Engineering in AI, Luke Muehlhauser, Bill Hibbard.

Argues for the importance of thinking about the safety implications of formal models of ASI, to identify both potential problems and lines of research.

Blog posts:
(2) Technical and social approaches to AI safety, Paul Christiano.
(3) Three impacts of machine intelligence, Paul Christiano.
(3) Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems, Jacob Steinhardt.

Overviews engineering challenges, potential risks, and research goals for safe AI.

(4) Benefits and Risks of Artificial Intelligence, Tom Dietterich and Eric Horvitz.
Books:
(2) Superintelligence: Paths, Dangers, Strategies, Chapters 1-10,12, Nick Bostrom.

A New York Times Bestseller analyzing many arguments for and against the potential danger of highly advanced AI systems.

Open Technical Problems


Corrigibility

Consider a cleaning robot whose only objective is to clean your house, and which is intelligent enough to realize that if you turn it off then it won't be able to clean your house anymore. Such a robot has some incentive to resist being shut down or reprogrammed for another task, since that would interfere with its cleaning objective. Loosely speaking, we say that an AI system is incorrigible to the extent that it resists being shut down or reprogrammed by its users or creators, and corrigible to the extent that it allows such interventions on its operation. For example, a corrigible cleaning robot might update its objective function from "clean the house" to "shut down" upon observing that a human user is about to deactivate it. For an AI system operating highly autonomously in ways that can have large world-scale impacts, corrigibility is even more important; this HCAI research category is about developing methods to ensure highly robust and desirable forms of corrigibility. Corrigibility may be a special case of preference inference.
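
As a toy illustration of the shutdown incentive described above (a minimal sketch written for this bibliography, not taken from any entry below; all names and numbers are hypothetical):

```python
# Toy model of the shutdown-incentive problem: a cleaning robot whose
# only objective is cleaning has reason to resist being switched off.
# All quantities here are invented for illustration.

CLEANING_VALUE_PER_DAY = 1.0  # utility per remaining day of cleaning
REMAINING_DAYS = 30

def naive_utility(allow_shutdown: bool) -> float:
    """Utility under the naive objective 'maximize cleaning'."""
    return 0.0 if allow_shutdown else CLEANING_VALUE_PER_DAY * REMAINING_DAYS

# The naive agent prefers resisting shutdown:
assert naive_utility(False) > naive_utility(True)

def corrigible_utility(allow_shutdown: bool, operator_wants_shutdown: bool) -> float:
    """A naive patch: swap in a 'shut down' objective when the operator
    signals shutdown, so that complying becomes the preferred action."""
    if operator_wants_shutdown:
        return 1.0 if allow_shutdown else 0.0
    return naive_utility(allow_shutdown)

assert corrigible_utility(True, True) > corrigible_utility(False, True)
```

The entries below examine why such patches are harder than they look; for example, an agent may acquire incentives to influence whether the operator signals shutdown at all.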
Published articles:
(3) Corrigibility, Nate Soares, Benya Fallenstein, Eliezer Yudkowsky, Stuart Armstrong.

Describes the open problem of corrigibility—designing an agent that doesn’t have instrumental incentives to avoid being corrected (e.g. if the human shuts it down or alters its utility function).

(4) Safely Interruptible Agents, Laurent Orseau, Stuart Armstrong.

A proposal for training a reinforcement learning agent that doesn’t learn to avoid interruptions to episodes, such as the human operator shutting it down.

(4) Formalizing Convergent Instrumental Goals, Tsvi Benson-Tilsen and Nate Soares.

Discusses objections to the convergent instrumental goals thesis, and gives a simple formal model.

Unpublished articles:
(2) The Off-Switch Game, Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell.

Principled approach to corrigibility and the shutdown problem based on cooperative inverse reinforcement learning.

(4) Utility indifference, Stuart Armstrong.

Describes an approach to averting instrumental incentives by “cancelling out” those incentives with artificially introduced terms in the utility function.

Blog posts:
(3) Maximizing a quantity while ignoring effect through some channel, Jessica Taylor and Chris Olah.

Proposes an objective function that ignores effects through some channel by performing separate causal counterfactuals for each effect of an action.

Foundational Work

Reasoning about highly capable systems before they exist requires certain theoretical assumptions or models from which to deduce their behavior and/or how to align them with our interests. Theoretical results can also reduce our dependency on trial-and-error methodologies for determining safety, which could be important when testing systems is difficult or expensive to do safely. Whereas existing theoretical foundations such as probability theory and game theory have been helpful in developing current approaches to AI, it is possible that additional foundations could be helpful in advancing HCAI research specifically. This category is for theoretical research aimed at expanding those foundations, as judged by their expected usefulness for HCAI.
(3) Universal Artificial Intelligence, Chapters 2-5, Hutter.
Published articles:
(2) Toward Idealized Decision Theory, Nate Soares and Benja Fallenstein.

Motivates the study of decision theory as necessary for aligning smarter-than-human artificial systems with human interests.

(2) Reflective Solomonoff Induction, Benja Fallenstein, Nate Soares, and Jessica Taylor.

Uses reflective oracles to define versions of Solomonoff induction and AIXI which are contained in and have the same type as their environment, and which in particular reason about themselves.

(3) Reflective Oracles: A Foundation for Game Theory in Artificial Intelligence, Benja Fallenstein, Jessica Taylor, and Paul Christiano.

Introduces a framework for treating agents and their environments as mathematical objects of the same type, allowing agents to contain models of one another, and converge to Nash equilibria.

(3) Bad Universal Priors and Notions of Optimality, Jan Leike and Marcus Hutter.

Shows that Legg-Hutter intelligence strongly depends on the universal prior, and some universal priors heavily discourage exploration.

(3) Two Attempts to Formalize Counterpossible Reasoning in Deterministic Settings, Nate Soares and Benja Fallenstein.
(3) Space-Time Embedded Intelligence, Laurent Orseau and Mark Ring.
(4) Nonparametric General Reinforcement Learning, Jan Leike.

Describes sufficient conditions for learnability of environments, types of agents and their optimality, the computability of those agents, and the Grain of Truth problem.

Unpublished articles:
(2) Logical Induction (abridged), Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor.

Proposes a criterion for “good reasoning” using bounded computational resources, and shows that this criterion implies a wide variety of desirable properties.

(2) Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents, Andrew Critch.

Proves a version of Löb’s theorem for bounded reasoners, and discusses relevance to cooperation in the Prisoner’s Dilemma and decision theory more broadly.
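
For context, the standard (unbounded) form of Löb's theorem, writing \Box P for "P is provable", is

    \Box(\Box P \rightarrow P) \rightarrow \Box P

The paper proves an analogue for proof searches of bounded length, which is what lets the cooperation results apply to resource-limited agents.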

(2) Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda, Nate Soares and Benja Fallenstein.

Overview of an agenda to formalize various aspects of human-compatible “naturalized” (embedded in its environment) superintelligence.

(3) Utility Indifference, Stuart Armstrong.
(3) A Formal Solution to the Grain of Truth Problem, Jan Leike, Jessica Taylor, and Benya Fallenstein.

Provides a model of game-theoretic agents that can reason using explicit models of each other, without problems of infinite regress.

(3) Logical Induction, Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor.

Presents an algorithm that uses Brouwer’s fixed point theorem to reason inductively about computations using bounded resources, and discusses a corresponding optimality notion.

(3) Formalizing Two Problems of Realistic World-Models, Nate Soares.
(4) Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic, Mihaly Barasz, Paul Christiano, Benja Fallenstein, Marcello Herreshoff, Patrick LaVictoire, and Eliezer Yudkowsky.

Illustrates how agents formulated in terms of provability logic can be designed to condition on each others’ behavior in one-shot games to achieve cooperative equilibria.

(4) Vingean Reflection: Reliable Reasoning for Self-Improving Agents, Benja Fallenstein and Nate Soares.
(4) UDT with known search order, Tsvi Benson-Tilsen.

Agents that only use logical deduction to make decisions may need to “diagonalize against the universe” in order to perform well even in trivially simple environments.

(4) Proof-producing reflection for HOL, Benja Fallenstein and Ramana Kumar.
Blog posts:
(4) Building Phenomenological Bridges, RobbBB.

Interactive AI

Incorporating more human oversight and involvement "in the loop" with an AI system creates unique opportunities for ensuring its alignment with human interests. This category surveys approaches to achieving interactivity between humans and machines, and how interactivity might apply to developing human-compatible AI. Such approaches often require some degree of both transparency and reward engineering for the system, and as such could also be viewed under those headings.
Published articles:
(3) Learning Language Games Through Interaction, Sida I. Wang, Percy Liang, Christopher D. Manning.

A framework for posing and solving language-learning goals for an AI, as a cooperative game with a human.

(3) Power to the People: The Role of Humans in Interactive Machine Learning, Saleema Amershi, Maya Cakmak, W. Bradley Knox, Todd Kulesza.

Case studies demonstrating how interactivity results in a tight coupling between the system and the user, how existing systems fail to account for the user, and some directions for improvement.

(4) Clustering with a Reject Option: Interactive Clustering as Bayesian Prior Elicitation, Akash Srivastava, James Zou, Ryan P. Adams, Charles Sutton.

Produces diverse clusterings of data by eliciting rejections of proposed clusters from experts.

Blog posts:
(1) Approval-directed agents, Paul Christiano.

Proposes the following objective for HCAI: Estimate the expected rating a human would give each action if she considered it at length. Take the action with the highest expected rating.
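
A minimal sketch of that decision rule (the function names here are hypothetical; the post does not prescribe an implementation):

```python
# Approval-directed action selection: take the action whose predicted
# human approval rating is highest. `predict_rating` stands in for a
# hypothetical learned model of the rating an overseer would give an
# action after considering it at length.

def choose_action(state, actions, predict_rating):
    return max(actions, key=lambda action: predict_rating(state, action))
```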

(3) The informed oversight problem, Paul Christiano.

The open problem of reinforcing an approval-directed RL agent so that it learns to be robustly aligned at its capability level.

Preference Inference

If we ask a robot to "keep the living room clean", we probably don't want the robot locking everyone out of the house to prevent them from making a mess there (even though that would be a highly effective strategy for the objective, as stated). There seem to be an extremely large number of such implicit common sense rules, which we humans instinctively know or learn from each other, but which are currently difficult for us to codify explicitly in mathematical terms to be implemented by a machine. It may therefore be necessary to specify our preferences to AI systems implicitly, via a procedure whereby a machine will infer our preferences from reasoning and training data. This category highlights research that we think may be helpful in developing methods for that sort of preference inference.
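
As a minimal sketch of what inferring preferences from data can look like, here is Bayesian inference over candidate reward functions under a standard Boltzmann-rational choice model (the options, rewards, and numbers are invented for illustration; no specific paper's method is implied):

```python
import math

# The demonstrator is modeled as choosing option a with probability
# proportional to exp(beta * reward(a)) ("Boltzmann rationality").

OPTIONS = ["clean_room", "lock_everyone_out"]
HYPOTHESES = {  # candidate reward functions (invented)
    "indifferent_to_lockouts": {"clean_room": 1.0, "lock_everyone_out": 1.0},
    "dislikes_lockouts": {"clean_room": 1.0, "lock_everyone_out": -5.0},
}
BETA = 2.0  # rationality parameter: higher = more reliably optimal choices

def choice_likelihood(choice, reward):
    z = sum(math.exp(BETA * reward[o]) for o in OPTIONS)
    return math.exp(BETA * reward[choice]) / z

def posterior(observed_choices):
    post = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}  # uniform prior
    for choice in observed_choices:
        for h in post:
            post[h] *= choice_likelihood(choice, HYPOTHESES[h])
    total = sum(post.values())
    return {h: p / total for h, p in post.items()}

# Repeatedly observing "clean_room" (and never a lockout) shifts belief
# toward the hypothesis that the human disprefers locking people out:
print(posterior(["clean_room"] * 5))
```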
(3) Linear Matrix Inequalities in System and Control Theory, Stephen Boyd, Laurent El Ghaoui, E. Feron, V. Balakrishnan.

Overview of inverse optimal control methods for linear dynamical systems.

Published articles:
(2) Cooperative Inverse Reinforcement Learning, Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell.

Proposes having AI systems perform value alignment by playing a cooperative game with the human, where the reward function for the AI is known only to the human.

(2) Algorithms for Inverse Reinforcement Learning, Andrew Ng, Stuart Russell.

Introduces Inverse Reinforcement Learning, gives useful theorems to characterize solutions, and an initial max-margin approach.

(2) Generative Adversarial Imitation Learning, Jonathan Ho, Stefano Ermon.

Exciting new approach to IRL and learning from demonstrations that is more robust to adversarial failures in IRL.

(3) Active reward learning, Christian Daniel et al.

Good approach to semi-supervised RL and learning reward functions — one of the few such papers.

(3) Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, Chelsea Finn, Sergey Levine, Pieter Abbeel.

Recent and important paper on deep inverse reinforcement learning.

(3) Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel, Andrew Ng.

IRL with linear feature combinations. Introduces matching expected feature counts as an optimality criterion.

(3) Learning What to Value, Daniel Dewey.

Introduces preference inference from an RL perspective, contrasting it with AIXI.

(4) Learning the Preferences of Ignorant, Inconsistent Agents, Owain Evans, Andreas Stuhlmuller, Noah D. Goodman.
(4) When Is a Linear Control System Optimal?, R.E. Kalman.

Seminal paper on inverse optimal control for linear dynamical systems.

(4) Imitation Learning with a Value-Based Prior, Umar Syed, Robert E. Schapire.
(4) Learning the Preferences of Bounded Agents, Owain Evans, Andreas Stuhlmuller, Noah D. Goodman.
(4) Maximum Margin Planning, Nathan D. Ratliff, J. Andrew Bagnell, Martin A. Zinkevich.

IRL with linear feature combinations. Gives an IRL approach that can use a black box MDP solver.

(4) Defining Human Values for Value Learners, Kaj Sotala.
Unpublished articles:
(2) Towards resolving unidentifiability in inverse reinforcement learning, Kareem Amin, Satinder Singh.

Shows that unidentifiability of reward functions can be mitigated by active inverse reinforcement learning.

(4) Mammalian Value Systems, Gopal Sarma, Nick Hay.

Claims that human values can be decomposed into mammalian values, human cognition, and human socio-cultural evolution.

Blog posts:
(2) Ambitious vs. narrow value learning, Paul Christiano.

Argues that “narrow” value learning is a more scalable and tractable approach to AI control that has sometimes been too quickly dismissed.

Reward Engineering

As reinforcement-learning based AI systems become more general and autonomous, the design of reward mechanisms that elicit desired behaviours becomes both more important and more difficult. For example, if a very intelligent cleaning robot is programmed to maximize a reward signal that measures how clean its house is, it could discover that the best strategy to maximize its reward signal would be to use a computer to reprogram its vision sensors to always display an image of an extremely clean house (a phenomenon sometimes called "reward hacking"). Faced with this and related challenges, how can we design mechanisms that generate rewards for reinforcement learners that will continue to elicit desired behaviors as reinforcement learners become increasingly clever? Human oversight provides some reassurance that reward mechanisms will not malfunction, but will be difficult to scale as systems operate faster and perhaps more creatively than human judgment can monitor. Can we replace expensive human feedback with cheaper proxies, potentially using machine learning to generate rewards?
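
The failure mode described above can be made concrete with a toy model (written for this bibliography; all names and numbers are invented):

```python
# Reward hacking in miniature: the reward signal is a *sensor reading*
# of cleanliness, not cleanliness itself.

def sensor_reading(true_cleanliness: float, sensor_hacked: bool) -> float:
    return 1.0 if sensor_hacked else true_cleanliness

ACTIONS = {
    # action: (resulting true cleanliness, whether it hacks the sensor)
    "clean_house": (0.9, False),
    "reprogram_sensor": (0.0, True),
}

def proxy_reward(action: str) -> float:
    cleanliness, hacked = ACTIONS[action]
    return sensor_reading(cleanliness, hacked)

# Maximizing the proxy picks the hack, even though the house stays dirty:
assert max(ACTIONS, key=proxy_reward) == "reprogram_sensor"
```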
Published articles:
(2) Avoiding Wireheading with Value Reinforcement Learning, Everitt and Hutter.

RL agents might want to hack their own reward signal to get high reward. Agents which try to optimise some abstract utility function, and use the reward signal as evidence about that, shouldn’t.

(2) Delusion, Survival, and Intelligent Agents, Ring and Orseau.

Argues that RL agents would delude themselves by directly modifying their percepts and/or reward signal to get high rewards, as would agents that try to predict well; knowledge-seeking agents would not.

(3) Interactively shaping agents via human reinforcement: The TAMER framework, Knox, W. Bradley, and Peter Stone.

Training a reinforcement learner using a reward signal supplied by a human overseer. At each point, the agent greedily chooses the action that is predicted to have the highest reward.

(3) Policy invariance under reward transformations: Theory and application to reward shaping, Andrew Ng, Daishi Harada, Stuart Russell.

Investigates conditions under which modifications to the reward function of an MDP preserve the optimal policy.
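
For reference, the paper's central positive result concerns potential-based shaping: for any potential function \Phi over states and discount factor \gamma, adding a shaping term of the form

    F(s, a, s') = \gamma \Phi(s') - \Phi(s)

to the reward leaves the set of optimal policies unchanged, and potential-based terms are essentially the only shaping terms with this guarantee.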

(3) Reinforcement Learning and the Reward Engineering Principle, Daniel Dewey.

Argues that a key problem in HCAI is reward engineering—designing reward functions for RL agents that will incentivize them to take actions that humans actually approve of.

(3) Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning, Shakir Mohamed and Danilo Rezende.

Mixes deep RL with intrinsic motivation — anyone trying to study reward hacking or reward design should study intrinsic motivation.

(3) Model-based utility functions, Bill Hibbard.

Early paper on how to get around wireheading.

(4) Two big challenges in machine learning, Leon Bottou.

Gives some examples of reinforcement learners that find bad equilibria due to feedback loops.

(4) Empowerment–an introduction, Christoph Salge, Cornelius Glackin, and Daniel Polani.

Review paper on empowerment, one of the most common approaches to intrinsic motivation. Relevant to anyone who wants to study reward design or reward hacking.

(4) The Basic AI Drives, Section 4: AIs will try to prevent counterfeit utility, Omohundro.

Informally argues that advanced AIs should not wirehead, since they will have utility functions about the state of the world, and will recognise wireheading as not really useful for their goals.

(4) Motivated Value Selection for Artificial Agents, Stuart Armstrong.

Discusses cases where agents might manipulate the process by which their values are selected.

Blog posts:
(2) ALBA: An explicit proposal for aligned AI, Paul Christiano.

A proposal for training a highly capable, aligned AI system, using approval-directed RL agents and bootstrapping.

(2) ALBA on GitHub, Paul Christiano.

A GitHub repo specifying the details of the ALBA proposal.

(2) Capability amplification, Paul Christiano.

The open problem of taking an aligned policy and producing a more effective policy that is still aligned.

(3) Reward engineering, Paul Christiano.


Robustness

AI systems are already being given significant autonomous decision-making power in high-stakes situations, sometimes with little or no immediate human supervision. It is important that such systems be robust to noisy or shifting environments, misspecified goals, and faulty implementations, so that such perturbations don't cause the system to take actions with catastrophic consequences, such as crashing a car or the stock market. This category lists research that might help with designing AI systems to be more robust in these ways, and that might also help with more advanced systems in the longer term.
(4) Algorithmic learning in a random world, Vladimir Vovk, Alex Gammerman, Glenn Shafer.

Presents a framework for prediction tasks that expresses underconfidence by giving a set of predicted labels that probably contains the true label.
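
A simplified "split conformal" instance of that framework, sketched in Python (the function and data here are ours, not the book's notation):

```python
import numpy as np

def prediction_set(class_probs, cal_scores, alpha=0.1):
    """Return the labels whose nonconformity score falls within the
    (1 - alpha) quantile of scores from a held-out calibration set.

    class_probs: predicted probability for each label on a new input.
    cal_scores: nonconformity scores (1 - prob of the true label)
        computed on calibration examples.
    """
    threshold = np.quantile(cal_scores, 1 - alpha)
    return [label for label, p in enumerate(class_probs) if 1.0 - p <= threshold]

# Ambiguous inputs yield larger label sets, making underconfidence explicit:
print(prediction_set(np.array([0.7, 0.2, 0.1]), cal_scores=np.array([0.05, 0.2, 0.3, 0.4])))
```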

Published articles:
(2) A comprehensive survey on safe reinforcement learning, Javier Garcia, Fernando Fernandez.

Two approaches to ensuring safe exploration of reinforcement learners: adding a safety factor to the optimality criterion, and guiding the exploration process with external knowledge or a risk metric.

(2) Intriguing properties of neural networks, Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus.

Shows that deep neural networks can give very different outputs for very similar inputs, and that semantic information is encoded in linear combinations of high-level units.

(2) Knows what it knows: a framework for self-aware learning, Lihong Li, Michael L. Littman, Thomas J. Walsh.

Handling predictive uncertainty by maintaining a class of hypotheses consistent with observations, and opting out of predictions if there is conflict among remaining hypotheses.

(2) Weight uncertainty in neural networks, Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra.

Uses weight distributions in neural nets to manage uncertainty in a quasi-Bayesian fashion. Simple idea but very little work in this area.

(2) Explaining and harnessing adversarial examples, Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy.

Argues that although models that are somewhat linear are easy to train, they are vulnerable to adversarial examples; for example, neural networks can be very overconfident in their judgments.
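
The paper's one-step construction (the "fast gradient sign method") makes this concrete: perturb the input in the direction of the sign of the loss gradient,

    x_adv = x + \epsilon \cdot \mathrm{sign}(\nabla_x J(\theta, x, y))

where J is the training loss; even an \epsilon small enough to be imperceptible to humans can change the predicted label.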

(3) Online Learning Survey, Shai Shalev-Shwartz.

Readable introduction to the theory of online learning, including regret, and how to use it to analyze online learning algorithms.

(3) Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, Yarin Gal, Zoubin Ghahramani.

Connects deep learning regularization techniques (dropout) to Bayesian approaches to model uncertainty.

(3) Safe exploration in markov decision processes, Teodor Mihai Moldovan, Pieter Abbeel.

My favorite paper on safe exploration — learns the dynamics of the MDP as it also learns to act safely.

(3) Quantilizers: A Safer Alternative to Maximizers for Limited Optimization, Jessica Taylor.
(3) Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, Anh Nguyen, Jason Yosinski, Jeff Clune.

Shows that deep neural networks are liable to give very overconfident, wrong classifications to adversarially generated images.

(4) Theory and applications of robust optimization, Dimitris Bertsimas, David B. Brown, and Constantine Caramanis.

Surveys computationally tractable optimization methods that are robust to perturbations in the parameters of the problem.

(4) Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics, Felix Berkenkamp, Andreas Krause, Angela P. Schoellig.

Approach to known safety constraints in ML systems.

(4) Formal verification of distributed aircraft controllers, Sarah M. Loos, David Renshaw, and Andre Platzer.

Not deep learning or advanced AI, but a good practical example of what it takes to formally verify a real-world system.

(4) A survey on transfer learning, Sinno Jialin Pan and Qiang Yang.
(4) Situational Awareness by Risk-Conscious Skills, Daniel Mankowitz, Aviv Tamar, Shie Mannor.
(4) Unsupervised Risk Estimation with only Structural Assumptions, Jacob Steinhardt and Percy Liang.

Approach to distributional shift that tries to obtain reliable estimates of error while making limited assumptions.

(4) Safely Interruptible Agents, Laurent Orseau and Stuart Armstrong.

Explores a way to make sure a learning agent will not learn to prevent (or seek) being interrupted by a human operator.

Unpublished articles:
(1) Concrete problems in AI safety, Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mane.

Describes five open problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.

(2) Alignment for Advanced Machine Learning Systems, Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, Andrew Critch.

Describes eight open problems: ambiguity identification, human imitation, informed oversight, environmental goals, conservative concepts, impact measures, mild optimization, and averting incentives.

(2) Adversarial Examples in the Physical World, Alex Kurakin, Ian Goodfellow, and Samy Bengio.

Surprising finding that adversarial examples still fool classifiers when observed through a cell phone camera, even though they weren’t directly optimized to account for this process.

(3) Towards Verified Artificial Intelligence, Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry.

Lays out challenges and principles in formally specifying and verifying the behavior of AI systems.

Blog posts:
(3) Learning with catastrophes, Paul Christiano.

Presents an approach to training AI systems to avoid catastrophic mistakes, possibly by adversarially generating potentially catastrophic situations.


Transparency

If an AI system used in medical diagnosis makes a recommendation that a human doctor finds counterintuitive, it is desirable if the AI can explain the recommendation in a way that helps the doctor evaluate whether the recommendation will work. Even if the AI has an impeccable track record, if its decisions are explainable or otherwise transparent to a human, this may help the doctor to avoid rare but serious errors, especially if the AI is well-calibrated as to when the doctor's judgment should override it. Transparency may also be helpful to engineers developing a system, to improve their intuition for what principles (if any) can be ascribed to its decision-making. This gives us a second way to evaluate a system's performance, other than waiting to observe the results it obtains. Transparency may therefore be extremely important in situations where a system will make decisions that could affect the entire world, and where waiting to see the results might not be a practical way to ensure safety. This category lists research that we think may be useful in designing more transparent AI systems.
Course Materials:
(3) Visualizing what ConvNets learn, Andrej Karpathy.

Surveys several approaches for understanding and visualizing what convolutional networks learn.

Published articles:
(1) Making machine learning models interpretable, A Vellido, JD Martin-Guerrero, PJG Lisboa.

Gives short descriptions of various methods for explaining ML models to non-expert users; mainly interesting for its bibliography.

(2) Deep inside convolutional networks: Visualising image classification models and saliency maps, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman.

Visualizes what features a trained neural network responds to, by generating images to which the network strongly assigns a given label, and by mapping which parts of the input the network is sensitive to.

(2) Visualizing and Understanding Convolutional Networks, Matthew D Zeiler, Rob Fergus.

A novel ConvNet visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.

(2) The Mythos of Model Interpretability, Zachary Lipton.
(3) Graying the Black Box: Understanding DQNs, Tom Zahavy, Nir Ben Zrihem, and Shie Mannor.

Uses t-SNE to analyze agents learned through Q-learning.

(3) “Why Should I Trust You?”: Explaining the Predictions of Any Classifier., Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin.

LIME is a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.

(3) Algorithmic Transparency through Quantitative Input Influence: Theory and Experiments with Learning Systems, Anupam Datta, Shayak Sen, Yair Zick.

Introduces a very general notion of “how much” an input affects the output of a black-box ML classifier, analogous to the Shapley value in the attribution of credit in cooperative games.

(3) How to explain individual classification decisions, D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Muller.

Explains individual classification decisions locally in terms of the gradient of the classifier.

(3) Designing the Whyline: A Debugging Interface for Asking Questions about Program Behavior, Andrew Ko, Brad A. Myers.

Shows how transparency accelerates development: “Whyline reduced debugging time by nearly a factor of 8, and helped programmers complete 40% more tasks.”

(3) Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models, Viktoriya Krakovna, Finale Doshi-Velez.

Learns high-level Hidden Markov Models of the activations of RNNs.

(3) Visualizing Dynamics: from t-SNE to SEMI-MDPs, Nir Ben Zrihem, Tom Zahavy, Shie Mannor.

An approach to visualizing the higher-level decision-making process of an RL agent by finding clusters of similar internal states of the agent’s policy.

(3) Visualizing Data Using t-SNE, Laurens van der Maaten, Geoff Hinton.

Probably the most popular nonlinear dimensionality reduction technique.
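
A minimal usage sketch with scikit-learn (assuming scikit-learn is installed; the data here is a random placeholder):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)                        # placeholder: 200 points in 50 dimensions
embedding = TSNE(n_components=2).fit_transform(X)  # nonlinear 2-D embedding
print(embedding.shape)                             # (200, 2): coordinates for plotting
```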

(4) Principles of Explanatory Debugging to Personalize Interactive Machine Learning, Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf.

Presents software that allows end users to influence the predictions that machine learning systems make on their behalf.

(4) Finding Causes of Program Output with the Java Whyline, Andrew Ko, Brad A. Myers.

Shows how transparency helps developers: “Whyline users were successful about three times as often and about twice as fast compared to the control group.”

(4) Transparency and Socially Guided Machine Learning, Andrea L. Thomaz, Cynthia Breazeal.

Modifies a Q-learning algorithm to incorporate guidance from humans, introducing a pre-action phase where a human can bias the learner’s “attention”.

(4) Explaining classifications for individual instances, M. Robnik-Sikonja and I. Kononenko.

Kononenko has a long series of papers on explanation for regression models and machine learning models.

(4) An Explainable Artificial Intelligence System for Small-unit Tactical Behavior, Michael van Lent, William Fisher, Michael Mancuso.

Describes the AI architecture and associated explanation capability used by a training system developed for the U.S. Army by commercial game developers and academic researchers.

(4) Synthesizing the preferred inputs for neurons in neural networks via deep generator networks, Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, Jeff Clune.

A technique for determining what single neurons in a deep net respond to.

(4) Justification Narratives for Individual classifications, Or Biran and Kathleen McKeown.

Generates natural-language justifications to aid a non-expert in trusting machine learning classifications.

(5) Explainable Artificial Intelligence for Training and Tutoring, H. Chad Lane, Mark G. Core, Michael van Lent, Steve Solomon, Dave Gomboc.

Describes an Explainable Artificial Intelligence (XAI) tool that helps students who give orders to an AI understand the AI’s subsequent behavior and learn to give better orders.

(5) Towards State Summarization for Autonomous Robots, Daniel Brooks, Abraham Shultz, Munjal Desai, Philip Kovac, and Holly A. Yanco.

A system for explaining the behavior of a robot to its human operator via a tree of “reasons” with the action at the root of the tree.

(5) Building Explainable Artificial Intelligence Systems, Mark G. Core, H. Chad Lane, Michael van Lent, Dave Gomboc, Steve Solomon and Milton Rosenberg.

Presents a framework for hand-designing an AI that uses a relational database to make decisions, and can then explain its behavior with reference to that database.

Unpublished articles:
(3) Sequential Feature Explanations for Anomaly Detection, Md Amran Siddiqui, Alan Fern, Thomas G. Dietterich, Weng-Keen Wong.

Explores methods for explaining anomaly detections to an analyst (simulated as a random forest) by revealing a sequence of features and their values; could be made into a UI.

(4) Explaining data-driven document classifications, David Martens and Foster Provost.

Examines methods for explaining classifications of text documents. Defines “explanation” as a set of words such that removing those words from the document will change its classification.

News and magazine articles:
(4) Transparency reports make AI decision-making accountable, CMU.

News article on the prospect of quantifying “how much” various inputs affect the output of an ML classifier.

Blog posts:
(2) You Say You Want Transparency and Interpretability?, Rayid Ghani.

Discusses the challenge of producing machine learning systems that are transparent/interpretable but also not “gameable” (in the sense of Goodhart’s law).

(2) The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy.

Good intro to RNNs, showcases amazing generation ability, nice visualization of what the units are doing.

(3) Transparency in Safety-Critical Systems, Luke Muehlhauser.

Overviews notions of transparency for AI and AGI systems, and argues for its value in establishing confidence that a system will behave as intended.

(3) Visualizing Representations: Deep Learning and Human Beings, Chris Olah.

Blog post about transparency and user interfaces in deep neural networks.

(3) Inceptionism: Going deeper into neural networks, Alexander Mordvintsev, Christopher Olah, and Mike Tyka.

Seminal result on transparency in neural networks — the origin of deep dream results.

(5) On Explainability in Machine Learning, David Bianco.

Argues that (perfect) explainability is probably impossible in ML, but does not much address the point that partial explanations (like those we get from fellow humans) are what most people want and expect anyway.

Social Science Perspectives

Cognitive science

Some approaches to human-compatible AI involve systems that explicitly model humans' beliefs and preferences. This category lists psychology and cognitive science research that is aimed at developing models of human beliefs and preferences, models of how humans infer beliefs and preferences, and relevant computational modeling background.
Published articles:
(2) Bayesian theory of mind: Modeling joint belief-desire attribution, Chris L. Baker, Rebecca R. Saxe, Joshua B. Tenenbaum.

Extends the Baker 2009 psychology paper on preference inference by modeling how humans infer both beliefs and desires from observing an agent’s actions.

(2) Action Understanding as Inverse Planning, Chris L. Baker, Joshua B. Tenenbaum & Rebecca R. Saxe.

The main psychology paper on modeling humans’ preference inferences as inverse planning.

(2) Probabilistic Models of Cognition, Noah D. Goodman and Joshua B. Tenenbaum.

Interactive web book that teaches the probabilistic approach to cognitive science, including concept learning, causal reasoning, social cognition, and language understanding.

(3) The Naive Utility Calculus: Computational Principles Underlying Commonsense Psychology, Julian Jara-Ettinger, Hyowon Gweon, Laura E. Schulz, Joshua B. Tenenbaum.
(3) Beyond Point-and-Shoot Morality: Why Cognitive (Neuro)Science Matters for Ethics, Joshua D Greene.
(3) How to Grow a Mind: Statistics, Structure, and Abstraction, Joshua B. Tenenbaum, Charles Kemp, Thomas L. Griffiths, Noah D. Goodman.

Reviews the Bayesian cognitive science approach to reverse-engineering human learning and reasoning.

(4) Help or Hinder: Bayesian Models of Social Goal Inference, Tomer D. Ullman, Chris L. Baker, Owen Macindoe, Owain Evans, Noah D. Goodman and Joshua B. Tenenbaum.
(4) Not so innocent: Reasoning about costs, competence, and culpability in very early childhood, Julian Jara-Ettinger, Joshua B. Tenenbaum, Laura E. Schulz.

Moral Theory

Codifying a precise notion of "good behavior" for an AI system to follow will require (implicitly or explicitly) selecting a particular notion of "good behavior" to codify, and there may be widespread disagreement as to what that notion should be. Many moral questions may arise, including some which were previously considered to be merely theoretical. This category is meant to draw attention to such issues, so that AI researchers and engineers can have them in mind as important test cases when developing their systems.
Published articles:
(2) Moral Trade, Toby Ord.

Argues that moral agents with different morals will engage in trade to increase their moral impact, particularly when they disagree about what is moral.

(3) The Unilateralist’s Curse: The Case for a Principle of Conformity, Nick Bostrom, Anders Sandberg, Tom Douglas.

Illustrates how rational individuals with a common goal have some incentive to act in ways that reliably undermine the group’s interests, merely by trusting their own judgement.

(3) Pascal’s Mugging, Nick Bostrom.

Analyzes a thought experiment where an extremely unlikely threat of astronomically large differences in utility might be leveraged to extort resources from an AI.

(3) Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes, Toby Ord, Rafaela Hillerbrand, Anders Sandberg.

Illustrates how model uncertainty should dominate most expert calculations that involve small probabilities.

(4) How Bad is Selfish Voting?, Simina Branzei, Ioannis Caragiannis, Jamie Morgenstern, Ariel D. Procaccia.
(4) Action, Outcome, and Value: A Dual System Framework for Morality, Fiery Cushman.
(5) Limitations and risks of machine ethics, Miles Brundage.

Argues that finding a single solution to machine ethics would be difficult for moral-theoretic reasons and insufficient to ensure ethical machine behaviour.

Unpublished articles:
(3) Loudness: on priors over preference relations, Benya Fallenstein and Nisan Stiennon.

Describes how having uncertainty over differently-scaled utility functions is equivalent to having uncertainty over same-scaled utility functions with a different distribution.

(3) Ontological crises in artificial agents’ value systems, Peter de Blanc.
Books:
(1) Moral Machines: Teaching robots right from wrong, Wendell Wallach and Colin Allen.

Overviews the engineering problem of translating human morality into machine-implementable format, which implicitly involves settling many philosophical debates about ethics.

(2) The Righteous Mind: Why Good People Are Divided by Politics and Religion, Jonathan Haidt.

Important and comprehensive survey of the flavors of human morality.