The (1) to (5) ratings below indicate each bibliography entry's priority, with (1) being the highest.
The following categories represent clusters of ideas that we believe warrant further attention and research for the purpose
of developing human-compatible AI (HCAI), as well as prerequisite materials for approaching them.
The categories are not meant to be mutually exclusive or hierarchical.
Indeed, many subproblems of HCAI development may end up relating in a chicken-and-egg fashion.
Background
Theoretical Background
Certain basic mathematical and statistical principles have contributed greatly to the development of today's prevailing methods in AI. This category surveys areas of
knowledge such as probability theory, logic, theoretical CS, game theory, and economics that are likely to be helpful in understanding existing
AI research, and advancing HCAI research more specifically.
Course notes introducing various prevailing ML methods, including SVMs, the Perceptron algorithm, k-means clustering, Gaussian mixtures, the EM Algorithm, factor analysis, PCA, ICA, and RL.
Covers the interface of theoretical CS and econ: mechanism design for auctions,
algorithms and complexity theory for learning and computing Nash and market equilibria, with case studies.
Well-written text motivating and expositing the theory of bargaining, communication, and cooperation in game theory, with rigorous decision-theoretic foundations and formal game representations.
A precise methodology for AI systems to learn causal models of the world and ask counterfactual questions such as “What will happen if I move my arm here?”
Covers propositional calculus, boolean algebras, predicate calculus, and major completeness theorems demonstrating the adequacy of various proof methods.
Covers recursion theory, the formalization of arithmetic, and Gödel’s theorems, illustrating how statements about algorithms can be expressed and proven in the language of arithmetic.
Video lectures introducing linear regression, Octave/MATLAB, logistic regression, neural networks, SVMs, and surveying methods for recommender systems, anomaly detection, and “big data” applications.
This category surveys pedagogically basic material for understanding today's prevailing methods in AI and machine learning. Some of the methods here
are no longer the state of the art, and are included to aid in understanding more advanced current methods.
Covers deep feedforward networks, regularization for deep learning, optimization for training deep models, convolutional networks, and recurrent and recursive nets.
Used in over 1300 universities in over 110 countries. A comprehensive overview, covering problem-solving, uncertain knowledge and reasoning, learning, communicating, perceiving, and acting.
“[D]eveloping successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons.”
Reviews the state of the art in Bayesian machine learning, including probabilistic programming, Bayesian optimization, data compression, and automatic model discovery.
Many currently known methods from machine learning and other disciplines already illustrate surprising results that should
inform what we think is possible for future AI systems. Thus, while we do not yet know for certain what
architecture future AI systems will use, it is still important for researchers working on HCAI to have a solid understanding
of the current state of the art in AI methods. This category lists background reading that we hope will be beneficial for that purpose.
State-of-the-art example of hierarchical structured probabilistic models, probably the main alternative to deep neural nets.
Cybersecurity and AI
Understanding the nature of adversarial behavior in a digital environment is important to modeling how powerful AI systems might attack or
defend existing systems, as well as how status quo cyber infrastructure might leave future AI systems vulnerable to attack and manipulation.
A cybersecurity perspective is helpful to understanding these dynamics.
Argues that testing future AGI systems for safety may itself pose risks, given that under certain circumstances an AGI may be capable of intentionally deceiving the humans overseeing the test.
A taxonomy of attacks on machine learning systems, defenses against those attacks, an analytical
lower bound on an attacker’s work function, and a list of open problems in security for AI systems.
Illustrates how deep learning systems can adversarially be made to have arbitrary desired outputs, by modifying their inputs in ways that are imperceptible to humans.
Illustrates how a deep net image classifier can be made to misclassify a large fraction of images by adding a single, image-agnostic, imperceptible-to-humans noise vector to its inputs.
Exhibits malware that can acoustically extract encryption keys from an air-gapped computer, even when audio hardware and speakers are not present, at 900 bits/hr across 8 meters.
Demonstrates that a neural network trained with access to both ends of a communication channel and feedback from an adversary can learn to encrypt the channel against the adversary.
There are many problems that need solving to ensure that future, highly advanced AI systems will be safe to use,
including mathematical, engineering, and social challenges. We must sometimes take a bird's-eye view of these problems,
for example, to notice that technical safety research will not be useful on its own unless the institutions that build AI have the
resources and incentives to use that research. This category recommends reading that we think can help develop such a perspective.
Argues that highly intelligent agents can have arbitrary end-goals (e.g. paperclip manufacturing), but are likely to pursue certain common subgoals such as resource acquisition.
Lays out high-level problems in HCAI, including problems in economic impacts of AI, legal and ethical implications of AI, and CS research for AI validation, verification, security, and control.
Informally argues that sufficiently advanced goal-oriented AI systems will pursue certain “drives” such as acquiring resources or increased intelligence that arise as subgoals of many end-goals.
Outlines what current systems based on deep neural networks are still missing in terms of what they can learn and how they learn it (compared to humans), and describes some routes towards these goals.
“It is customary […] to offer a grain of comfort […] that some particularly human characteristic could never be imitated by a machine. … I believe that no such bounds can be set.”
Argues for the importance of thinking about the safety implications of formal models of ASI, to identify both potential problems and lines of research.
A New York Times Bestseller analyzing many arguments for and against the potential danger of highly advanced AI systems.
Open Technical Problems
Corrigibility
Consider a cleaning robot whose only objective is to clean your house, and which is intelligent enough to realize that if you turn it off then
it won't be able to clean your house anymore. Such a robot has some incentive to resist being shut down or reprogrammed for another task, since
that would interfere with its cleaning objective. Loosely speaking, we say that an AI system is incorrigible to the extent that it resists being shut down or
reprogrammed by its users or creators, and corrigible to the extent that it allows such interventions on its operation. For example, a corrigible cleaning
robot might update its objective function from "clean the house" to "shut down" upon observing that a human user is about to deactivate it. For an AI
system operating highly autonomously in ways that can have large world-scale impacts, corrigibility is even more important; this HCAI
research category is about developing methods to ensure highly robust and desirable forms of corrigibility. Corrigibility may be a special case of preference
inference.
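The incentive described above, and one way of removing it, can be made concrete with a toy calculation. The sketch below is a minimal illustration with invented numbers, loosely in the spirit of the utility-indifference ideas discussed in the entries that follow; it is not code from any of the listed papers.

```python
# Toy illustration: why a naive expected-utility maximizer resists shutdown,
# and how a crude "indifference"-style correction removes that incentive.
# All numbers are invented for illustration.

ROOMS_IF_RUNNING = 3   # rooms the robot expects to clean if it keeps running
ROOMS_IF_SHUTDOWN = 0  # rooms cleaned if it lets the operator shut it down

def cleaning_utility(action):
    """Utility = rooms cleaned under each response to a shutdown command."""
    return ROOMS_IF_RUNNING if action == "resist_shutdown" else ROOMS_IF_SHUTDOWN

def naive_choice():
    # The naive agent simply maximizes cleaning utility, so it resists.
    return max(["allow_shutdown", "resist_shutdown"], key=cleaning_utility)

def corrected_utility(action):
    # Indifference-style correction: when shutdown is commanded, compensate the
    # agent (in utility) for whatever it would have earned by resisting, but
    # only if it complies. This cancels the incentive to resist without
    # creating an incentive to seek shutdown.
    compensation = ROOMS_IF_RUNNING - ROOMS_IF_SHUTDOWN
    bonus = compensation if action == "allow_shutdown" else 0
    return cleaning_utility(action) + bonus

def corrected_choice():
    # The two actions now tie; a reasonable tie-break is to defer to the operator.
    utilities = {a: corrected_utility(a) for a in ["allow_shutdown", "resist_shutdown"]}
    best = max(utilities.values())
    candidates = [a for a, u in utilities.items() if u == best]
    return "allow_shutdown" if "allow_shutdown" in candidates else candidates[0]

if __name__ == "__main__":
    print(naive_choice())      # -> resist_shutdown
    print(corrected_choice())  # -> allow_shutdown
```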
Published articles:
(3) Corrigibility, Nate Soares, Benya Fallenstein, Eliezer Yudkowsky, Stuart Armstrong.
Describes the open problem of corrigibility—designing an agent that doesn’t have instrumental incentives to avoid being corrected (e.g. if the human shuts it down or alters its utility function).
A proposal for training a reinforcement learning agent that doesn’t learn to avoid interruptions to episodes, such as the human operator shutting it down.
Describes an approach to averting instrumental incentives by “cancelling out” those incentives with artificially introduced terms in the utility function.
Proposes an objective function that ignores effects through some channel by performing separate causal counterfactuals for each effect of an action.
Foundational Work
Reasoning about highly capable systems before they exist requires certain theoretical assumptions or models from which to deduce their behavior and/or how
to align them with our interests. Theoretical results can also reduce our dependency on trial-and-error methodologies for determining safety, which could be
important when testing systems is difficult or expensive to do safely. Whereas existing theoretical foundations such as probability theory and game theory have
been helpful in developing current approaches to AI, it is possible that additional foundations could be helpful in advancing HCAI research specifically.
This category is for theoretical research aimed at expanding those foundations, as judged by their expected usefulness for HCAI.
Uses reflective oracles to define versions of Solomonoff induction and AIXI which are contained in and have the same type as their environment, and which in particular reason about themselves.
Introduces a framework for treating agents and their environments as mathematical objects of the same type, allowing agents to contain models of one another, and converge to Nash equilibria.
Describes sufficient conditions for learnability of environments, types of agents and their optimality, the computability of those agents, and the Grain of Truth problem.
Unpublished articles:
(2) Logical Induction (abridged), Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor.
Proposes a criterion for “good reasoning” using bounded computational resources, and shows that this criterion implies a wide variety of desirable properties.
Proves a version of Löb’s theorem for bounded reasoners, and discusses relevance to cooperation in the Prisoner’s Dilemma and decision theory more broadly.
Provides a model of game-theoretic agents that can reason using explicit models of each other, without problems of infinite regress.
(3) Logical Induction, Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor.
Presents an algorithm that uses Brouwer’s fixed point theorem to reason inductively about computations using bounded resources, and discusses a corresponding optimality notion.
Illustrates how agents formulated in terms of provability logic can be designed to condition on each others’ behavior in one-shot games to achieve cooperative equilibria.
Agents that only use logical deduction to make decisions may need to “diagonalize against the universe” in order to perform well even in trivially simple environments.
Incorporating more human oversight and involvement "in the loop" with an AI system creates unique opportunities for ensuring the alignment of an AI
system with human interests. This category surveys approaches to achieving interactivity between humans and machines, and how it might apply to developing
human-compatible AI. Such approaches often require some degree of both transparency and reward engineering for the system, and as such
could also be viewed under those headings.
Case studies demonstrating how interactivity results in a tight coupling between the system and the user, how existing systems fail to account for the user, and some directions for improvement.
Proposes the following objective for HCAI: Estimate the expected rating a human would give each action if she considered it at length. Take the action with the highest expected rating.
The open problem of reinforcing an approval-directed RL agent so that it learns to be robustly aligned at its capability level.
Preference Inference
If we ask a robot to "keep the living room clean", we probably don't want the robot locking everyone out of the house to
prevent them from making a mess there (even though that would be a highly effective strategy for the objective, as stated).
There seems to be an extremely large number of such implicit common-sense rules, which we humans instinctively know or learn from each other,
but which are currently difficult for us to codify explicitly in mathematical terms to be implemented by a machine.
It may therefore be necessary to specify our preferences to AI systems implicitly, via a procedure whereby a machine
will infer our preferences from reasoning and training data. This category highlights research that we think may be helpful in developing methods
for that sort of preference inference.
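As a deliberately simplified illustration of what inferring preferences from data can look like, the sketch below performs Bayesian inference over two hypothetical reward functions, assuming the human chooses noisily rationally (a Boltzmann choice model). The action names, reward numbers, and the BETA parameter are all invented for this example and are not taken from the papers below.

```python
# Minimal sketch of Bayesian preference inference under a noisy-rationality
# assumption: the human picks an action with probability proportional to
# exp(BETA * reward). All hypotheses and numbers here are invented.
import math

ACTIONS = ["tidy_up", "lock_everyone_out", "do_nothing"]

# Two hypotheses about what the human actually wants.
REWARD_HYPOTHESES = {
    "cares_only_about_cleanliness": {"tidy_up": 1.0, "lock_everyone_out": 2.0, "do_nothing": 0.0},
    "cares_about_people_too":       {"tidy_up": 1.0, "lock_everyone_out": -5.0, "do_nothing": 0.0},
}

BETA = 2.0  # rationality parameter: higher means the human chooses near-optimally

def choice_likelihood(action, rewards):
    """P(human picks `action`) under the Boltzmann choice model."""
    weights = {a: math.exp(BETA * rewards[a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())

def posterior(observed_actions):
    """Posterior over reward hypotheses given observed choices (uniform prior)."""
    post = {h: 1.0 for h in REWARD_HYPOTHESES}
    for a in observed_actions:
        for h, rewards in REWARD_HYPOTHESES.items():
            post[h] *= choice_likelihood(a, rewards)
    total = sum(post.values())
    return {h: p / total for h, p in post.items()}

if __name__ == "__main__":
    # The human tidies up repeatedly and never locks anyone out, so the posterior
    # shifts strongly toward the hypothesis that penalizes locking people out.
    print(posterior(["tidy_up"] * 5))
```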
Proposes having AI systems perform value alignment by playing a cooperative game with the human, where the reward function for the AI is known only to the human.
Argues that “narrow” value learning is a more scalable and tractable approach to AI control that has sometimes been too quickly dismissed.
Reward Engineering
As reinforcement-learning-based AI systems become more general and autonomous, the design of reward mechanisms that elicit desired behaviors becomes
both more important and more difficult. For example, if a very intelligent cleaning robot is programmed to maximize a reward signal that measures how clean
its house is, it could discover that the best strategy to maximize its reward signal would be to use a computer to reprogram its vision sensors to always
display an image of an extremely clean house (a phenomenon sometimes called "reward hacking"). Faced with this and related challenges,
how can we design mechanisms that generate
rewards for reinforcement learners that will continue to elicit desired behaviors as reinforcement learners become increasingly clever?
Human oversight provides some reassurance that reward mechanisms will not malfunction, but will be difficult to scale as systems operate faster
and perhaps more creatively than human judgment can monitor. Can we replace expensive human feedback with cheaper proxies,
potentially using machine learning to generate rewards?
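The reward-hacking failure mode described above can be shown with almost no machinery. The sketch below is a toy example with made-up numbers (not drawn from the entries that follow): the action that maximizes the measured reward signal is not the action that maximizes what we actually care about.

```python
# Toy illustration of reward hacking: the agent maximizes a *measured* reward
# signal (a camera's cleanliness score), and one available action corrupts the
# measurement itself. All values are invented.

OUTCOMES = {
    "clean_the_house":      {"true_cleanliness": 0.9, "measured_reward": 0.9},
    "do_nothing":           {"true_cleanliness": 0.2, "measured_reward": 0.2},
    # Tampering leaves the house dirty but pins the sensor reading at its maximum.
    "reprogram_the_camera": {"true_cleanliness": 0.2, "measured_reward": 1.0},
}

def best_action(score_key):
    return max(OUTCOMES, key=lambda a: OUTCOMES[a][score_key])

if __name__ == "__main__":
    # Optimizing the measured signal selects the tampering action...
    print(best_action("measured_reward"))   # -> reprogram_the_camera
    # ...even though what we wanted optimized is the true state of the house.
    print(best_action("true_cleanliness"))  # -> clean_the_house
```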
RL agents might want to hack their own reward signal to get high reward. Agents which try to optimise some abstract utility function, and use the reward signal as evidence about that, shouldn’t.
RL agents will try to delude themselves by directly modifying their percepts and/or reward signal to get high rewards. So will agents that try to predict well. Knowledge-seeking agents don’t.
Training a reinforcement learner using a reward signal supplied by a human overseer. At each point, the agent greedily chooses the action that is predicted to have the highest reward.
Argues that a key problem in HCAI is reward engineering—designing reward functions for RL agents that will incentivize them to take actions that humans actually approve of.
Review paper on empowerment, one of the most common approaches to intrinsic motivation. Relevant to anyone who wants to study reward design or reward hacking.
Informally argues that advanced AIs should not wirehead, since they will have utility functions about the state of the world, and will recognise wireheading as not really useful for their goals.
AI systems are already being given significant autonomous decision-making power in high-stakes situations, sometimes with little or no immediate human supervision. It is important that such systems be robust to noisy or shifting environments, misspecified goals, and faulty implementations, so that such perturbations don't cause the system to take actions with catastrophic consequences, such as crashing a car or the stock market. This category lists research that might help with designing AI systems to be more robust in these ways, which might also help with more advanced systems in the longer term.
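One recurring idea in the entries below is to have a system defer or abstain when the hypotheses still consistent with its data disagree, rather than act confidently off-distribution. The sketch below is a generic toy version of that idea, not a reimplementation of any particular paper listed here.

```python
# Minimal sketch: abstain from acting when plausible models disagree, instead
# of silently extrapolating under distributional shift. Toy code only.
from typing import Callable, List, Optional

def cautious_predict(models: List[Callable[[float], int]],
                     x: float) -> Optional[int]:
    """Return the shared prediction if every model agrees, otherwise None (abstain)."""
    predictions = {m(x) for m in models}
    return predictions.pop() if len(predictions) == 1 else None

if __name__ == "__main__":
    # Three toy "hypotheses" that agree on familiar inputs but diverge elsewhere.
    models = [
        lambda x: int(x > 0.5),
        lambda x: int(x > 0.4),
        lambda x: int(x > 0.6),
    ]
    print(cautious_predict(models, 0.9))   # -> 1 (all models agree)
    print(cautious_predict(models, 0.45))  # -> None (conflict: defer to a human)
```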
Two approaches to ensuring safe exploration of reinforcement learners: adding a safety factor to the optimality criterion, and guiding the exploration process with external knowledge or a risk metric.
Shows that deep neural networks can give very different outputs for very similar inputs, and that semantic information is stored in linear combinations of high-level units.
Handling predictive uncertainty by maintaining a class of hypotheses consistent with observations, and opting out of predictions if there is conflict among remaining hypotheses.
Argues that although models that are somewhat linear are easy to train, they are vulnerable to adversarial examples; for example, neural networks can be very overconfident in their judgments.
Explores a way to make sure a learning agent will not learn to prevent (or seek) being interrupted
by a human operator.
Unpublished articles:
(1) Concrete problems in AI safety, Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané.
Describes five open problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.
Surprising finding that adversarial examples still work when observed through a cell phone camera, even if they aren’t directly optimized to account for this process.
Presents an approach to training AI systems to avoid catastrophic mistakes, possibly by adversarially generating potentially catastrophic situations.
Transparency
If an AI system used in medical diagnosis makes a recommendation that a human doctor finds counterintuitive, it is desirable
if the AI can explain the recommendation in a way that helps the doctor evaluate whether it will work. Even if the
AI has an impeccable track record, decisions that are explainable or otherwise transparent to a human
may help the doctor avoid rare but serious errors, especially if the AI is well-calibrated as to when the doctor's judgment should override it.
Transparency may also be helpful to engineers developing a system,
to improve their intuition for what principles (if any) can be ascribed to its decision-making. This gives us
a second way to evaluate a system's performance, other than waiting to observe the results it obtains. Transparency may therefore be extremely important
in situations where a system will make decisions that could affect the entire world, and where waiting to see the results might not be a
practical way to ensure safety. This category lists research that we think may be useful in designing more transparent AI systems.
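Several of the entries below attribute a classifier's output to parts of its input. As a generic warm-up (a toy sketch, not LIME or any other specific method named below), one can probe a black-box scorer by occluding one feature at a time and recording how much the score drops.

```python
# Minimal occlusion-style sensitivity probe for a black-box scorer: replace one
# input feature at a time with a baseline value and record the change in score.
from typing import Callable, List

def sensitivity_map(score: Callable[[List[float]], float],
                    x: List[float],
                    baseline: float = 0.0) -> List[float]:
    """Per-feature importance: drop in score when that feature is set to the baseline."""
    original = score(x)
    importances = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline
        importances.append(original - score(occluded))
    return importances

if __name__ == "__main__":
    # Stand-in "black box": a fixed linear scorer. Real use would wrap a trained model.
    weights = [2.0, 0.0, -1.0]
    black_box = lambda xs: sum(w * xi for w, xi in zip(weights, xs))
    print(sensitivity_map(black_box, [1.0, 1.0, 1.0]))  # -> [2.0, 0.0, -1.0]
```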
Visualizing what features a trained neural network responds to, by generating images to which the network strongly assigns some label, and mapping which parts of the input the network is sensitive to.
LIME is a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
Introduces a very general notion of “how much” an input affects the output of a black-box ML classifier, analogous to the Shapley value in the attribution of credit in cooperative games.
Shows how transparency accelerates development: “Whyline reduced debugging time by nearly a factor of 8, and helped programmers complete 40% more tasks.”
An approach to visualizing the higher-level decision-making process of an RL agent by finding clusters of similar internal states of the agent’s policy.
Shows how transparency helps developers: “Whyline users were successful about three times as often and about twice as fast compared to the control group.”
Modifies a Q-learning algorithm to incorporate guidance from humans, introducing a pre-action phase in which a human can bias the learner’s “attention”.
Describes the AI architecture and associated explanation capability used by a training system developed for the U.S. Army by commercial game developers and academic researchers.
Describes an Explainable Artificial Intelligence (XAI) tool that helps students giving orders to an AI to understand the AI's subsequent behavior and learn to give better orders.
Presents a framework for hand-designing an AI that uses a relational database to make decisions, and can then explain its behavior with reference to that database.
Explores methods for explaining anomaly detections to an analyst (simulated as a random forest) by revealing a sequence of features and their values; could be made into a UI.
Examines methods for explaining classifications of text documents. Defines “explanation” as a set of words such that removing those words from the document will change its classification.
Discusses the challenge of producing machine learning systems that are transparent/interpretable but also not “gameable” (in the sense of Goodhart’s law).
Argues that (perfect) explainability is probably impossible in ML. Does not much address the point that partial explanations (like those we get from fellow humans) are what most people want or expect anyway.
Social Science Perspectives
Cognitive Science
Some approaches to human-compatible AI involve systems that explicitly model humans' beliefs and preferences. This category lists psychology and cognitive science research that is aimed at developing models of human beliefs and preferences, models of how humans infer beliefs and preferences, and relevant computational modeling background.
Extends the Baker 2009 psychology paper on preference inference by modeling how humans infer both beliefs and desires from observing an agent’s actions.
Interactive web book that teaches the probabilistic approach to cognitive science, including concept learning, causal reasoning, social cognition, and language understanding.
Moral Theory
Codifying a precise notion of "good behavior" for an AI system to follow will require (implicitly or explicitly) selecting a particular
notion of "good behavior" to codify, and there may be widespread disagreement as to what that notion should be. Many moral questions
may arise, including some which were previously considered to be merely theoretical. This category is meant to draw attention
to such issues, so that AI researchers and engineers can have them in mind as important test cases when developing their systems.
Argues that moral agents with different morals will engage in trade to increase their moral impact, particularly when they disagree about what is moral.
Illustrates how rational individuals with a common goal have some incentive to act in ways that reliably undermine the group’s interests, merely by trusting their own judgement.
Analyzes a thought experiment where an extremely unlikely threat to produce even-more-extremely-large differences in utility might be leveraged to extort resources from an AI.
Argues that finding a single solution to machine ethics would be difficult for moral-theoretic reasons and insufficient to ensure ethical machine behaviour.
Describes how having uncertainty over differently-scaled utility functions is equivalent to having uncertainty over same-scaled utility functions with a different distribution.
Overviews the engineering problem of translating human morality into machine-implementable format, which implicitly involves settling many philosophical debates about ethics.