## Annotated Bibliography of Recommended Materials

### Background

#### Theoretical Background

##### Course Materials:

Course notes introducing various prevailing ML methods, including SVMs, the Perceptron algorithm, k-means clustering, Gaussian mixtures, the EM Algorithm, factor analysis, PCA, ICA, and RL.

Covers the interface of theoretical CS and econ: mechanism design for auctions, algorithms and complexity theory for learning and computing Nash and market equilibria, with case studies.

##### Textbooks:

Well-written text motivating and expositing the theory of bargaining, communication, and cooperation in game theory, with rigorous decision-theoretic foundations and formal game representations.

Covers propositional calculus, boolean algebras, predicate calculus, and major completeness theorems demonstrating the adequacy of various proof methods.

A precise methodology for AI systems to learn causal models of the world and ask counterfactual questions such as “What will happen if I move my arm here?”

Covers recursion theory, the formalization of arithmetic, and Gödel’s theorems, illustrating how statements about algorithms can be expressed and proven in the language of arithmetic.

##### Videos:

Video lectures introducing linear regression, Octave/MATLAB, logistic regression, neural networks, SVMs, and surveying methods for recommender systems, anomaly detection, and “big data” applications.

##### Published Articles:

Nice overview of some connections between Kolmogorov complexity and information theory.

#### Introduction to AI/ML

##### Course Materials:

Perceptrons, SVMs, regularization, regression methods, bias and variance, active learning, model description length, feature selection, boosting, EM, spectral clustering, graphical models.

##### Textbooks:

Covers deep networks, deep feedforward networks, regularization for deep learning, optimization for training deep models, convolutional networks, and recurrent and recursive nets.

Reviews linear algebra, probability and information theory, and numerical computation as part of a course on deep learning.

Covers multi-armed bandits, finite MDPs, dynamic programming, Monte Carlo methods, temporal-difference learning, bootstrapping, tabular methods, on-policy and off-policy methods, eligibility traces, and policy gradients.

A modern textbook on machine learning.

Used in over 1300 universities in over 110 countries. A comprehensive overview, covering problem-solving, uncertain knowledge and reasoning, learning, communicating, perceiving, and acting.

Comprehensive book on graphical models, which represent probability distributions in a way that is both principled and potentially transparent.

A book-length history of the field of artificial intelligence, up until 2010 or so.

##### Videos:

Lecture course on reinforcement learning taught by David Silver.

Informed, uninformed, and adversarial search; constraint satisfaction; expectimax; MDPs and RL; graphical models; decision diagrams; naive Bayes, perceptrons, clustering; NLP; game-playing; robotics.

##### Published Articles:

“[D]eveloping successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons.”

Reviews the state of the art in Bayesian machine learning, including probabilistic programming, Bayesian optimization, data compression, and automatic model discovery.

#### Prevailing Methods in AI/ML

##### Videos:

##### Published Articles:

The seminal paper in deep learning.

Describes DeepMind’s superhuman Go-playing AI. Ties together several techniques in reinforcement learning and supervised learning.

State-of-the-art example of hierarchical structured probabilistic models, probably the main alternative to deep neural nets.

#### Broad Perspectives on HCAI (Human-Compatible AI)

##### Published Articles:

Examines ways in which AI could be used to reduce large-scale risks to society, or could be a source of risk in itself.

Argues that highly intelligent agents can have arbitrary end-goals (e.g. paperclip manufacturing), but are likely to pursue certain common subgoals such as resource acquisition.

Lays out high-level problems in HCAI, including problems in economic impacts of AI, legal and ethical implications of AI, and CS research for AI validation, verification, security, and control.

Informally argues that sufficiently advanced goal-oriented AI systems will pursue certain “drives” such as acquiring resources or increased intelligence that arise as subgoals of many end-goals.

Describes models of superintelligence catastrophe risk as trees representing boolean formulae, and gives recommendations for reducing such risk.

Outlines what current systems based on deep neural networks are still missing in terms of what they can learn and how they learn it (compared to humans), and describes some routes towards these goals.

Argues that relatively few concepts may be needed for an AI to learn human values, from the observation that perhaps humans do this.

How some research scientists at Facebook expect the incremental development of intelligent machines to occur.

Describes some challenges in the design of a boxed ‘Oracle AI’ system that has strictly controlled input and merely answers questions.

Describes Oracle AI and methods of controlling it.

A philosophical discussion of the possibility of an intelligence explosion, how we could ensure good outcomes, and issues of personal identity.

An example of evolutionarily-driven learning approaches leading to very unexpected results that aren’t what the designer hoped to produce.

##### Unpublished Articles:

Broad discussion of potential research directions related to AI Safety, including short- and long-term technical work, forecasting, and policy.

Discusses extreme downside risks of highly capable AI, such as dystopian futures.

##### News and Magazine Articles:

“It is customary […] to offer a grain of comfort […] that some particularly human characteristic could never be imitated by a machine. … I believe that no such bounds can be set.”

Prescient, gloomy, and somewhat rambling article about humans’ role in a future shaped by genetics, robotics, and nanotech.

Argues for the importance of thinking about the safety implications of formal models of ASI, to identify both potential problems and lines of research.

##### Blog Posts:

Overviews engineering challenges, potential risks, and research goals for safe AI.

##### Books:

A New York Times Bestseller analyzing many arguments for and against the potential danger of highly advanced AI systems.

### Open Technical Problems

#### Corrigibility

##### Published Articles:

Describes the open problem of corrigibility—designing an agent that doesn’t have instrumental incentives to avoid being corrected (e.g. if the human shuts it down or alters its utility function).

A proposal for training a reinforcement learning agent that doesn’t learn to avoid interruptions to episodes, such as the human operator shutting it down.

Discusses objections to the convergent instrumental goals thesis, and gives a simple formal model.

##### Unpublished Articles:

Principled approach to corrigibility and the shutdown problem based on cooperative inverse reinforcement learning.

Describes an approach to averting instrumental incentives by “cancelling out” those incentives with artificially introduced terms in the utility function.

##### Blog Posts:

Proposes an objective function that ignores effects through some channel by performing separate causal counterfactuals for each effect of an action.

#### Foundational Work

##### Textbooks:

##### Published Articles:

Motivates the study of decision theory as necessary for aligning smarter-than-human artificial systems with human interests.

Uses reflective oracles to define versions of Solomonoff induction and AIXI which are contained in and have the same type as their environment, and which in particular reason about themselves.

Introduces a framework for treating agents and their environments as mathematical objects of the same type, allowing agents to contain models of one another, and converge to Nash equilibria.

Shows that Legg-Hutter intelligence strongly depends on the universal prior, and some universal priors heavily discourage exploration.

Describes sufficient conditions for learnability of environments, types of agents and their optimality, the computability of those agents, and the Grain of Truth problem.

##### Unpublished Articles:

Proposes a criterion for “good reasoning” using bounded computational resources, and shows that this criterion implies a wide variety of desirable properties.

Proves a version of Löb’s theorem for bounded reasoners, and discusses relevance to cooperation in the Prisoner’s Dilemma and decision theory more broadly.

Overview of an agenda to formalize various aspects of human-compatible “naturalized” (embedded in its environment) superintelligence.

Presents an algorithm that uses Brouwer’s fixed point theorem to reason inductively about computations using bounded resources, and discusses a corresponding optimality notion.

Provides a model of game-theoretic agents that can reason using explicit models of each other, without problems of infinite regress.

Illustrates how agents formulated in terms of provability logic can be designed to condition on each other’s behavior in one-shot games to achieve cooperative equilibria.

Agents that only use logical deduction to make decisions may need to “diagonalize against the universe” in order to perform well even in trivially simple environments.

##### Blog Posts:

#### Interactive AI

##### Published Articles:

A framework for posing and solving language-learning goals for an AI, as a cooperative game with a human.

Case studies demonstrating how interactivity results in a tight coupling between the system and the user, how existing systems fail to account for the user, and some directions for improvement.

Produces diverse clusterings of data by eliciting cluster rejections from experts.

##### Blog Posts:

Proposes the following objective for HCAI: Estimate the expected rating a human would give each action if she considered it at length. Take the action with the highest expected rating.
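As a minimal sketch of that objective (all names here are illustrative, not from the cited post): a model estimates the rating a human would give each action on reflection, and the agent acts greedily on those estimates.

```python
# Sketch of an approval-directed agent. `predicted_rating` stands in for
# a learned model of the rating a human would give each action if she
# considered it at length; the agent simply acts greedily on it.

def approval_directed_action(actions, predicted_rating):
    """Return the action with the highest predicted human rating."""
    return max(actions, key=predicted_rating)

# Toy usage: ratings come from a hand-written stand-in model.
ratings = {"clean room": 0.9, "do nothing": 0.1, "break vase": -1.0}
best = approval_directed_action(list(ratings), ratings.get)
print(best)  # the highest-rated action, "clean room"
```

The substance of the proposal lies in how `predicted_rating` is trained and queried; the selection rule itself is just this one-line argmax.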

The open problem of reinforcing an approval-directed RL agent so that it learns to be robustly aligned at its capability level.

#### Preference Inference

##### Textbooks:

Overview of inverse optimal control methods for linear dynamical systems.

##### Published Articles:

Proposes having AI systems perform value alignment by playing a cooperative game with the human, where the reward function for the AI is known only to the human.

Introduces Inverse Reinforcement Learning, gives useful theorems to characterize solutions, and an initial max-margin approach.

Exciting new approach to IRL and learning from demonstrations, that is more robust to adversarial failures in IRL.

Good approach to semi-supervised RL and learning reward functions — one of the few such papers.

Recent and important paper on deep inverse reinforcement learning.

IRL with linear feature combinations. Introduces matching expected feature counts as an optimality criterion.
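The matching criterion can be sketched concretely (function and variable names are illustrative): when the reward is linear in features, a policy whose discounted expected feature counts match the expert's obtains the same expected reward as the expert for every weight vector.

```python
# Sketch of expected feature counts, the quantity matched in
# feature-expectation IRL. With a linear reward r(s) = w . phi(s),
# equal feature expectations imply equal expected reward for any w.

def feature_expectations(trajectories, phi, gamma=0.9):
    """Average discounted feature counts over demonstration trajectories."""
    k = len(phi(trajectories[0][0]))
    mu = [0.0] * k
    for traj in trajectories:
        for t, state in enumerate(traj):
            f = phi(state)
            for i in range(k):
                mu[i] += (gamma ** t) * f[i]
    return [m / len(trajectories) for m in mu]

# Toy usage: states are integers, features are a one-hot of parity.
phi = lambda s: [1.0, 0.0] if s % 2 == 0 else [0.0, 1.0]
mu = feature_expectations([[0, 1, 2], [2, 3, 4]], phi)
print(mu)  # discounted counts for even and odd states
```

An IRL learner then searches for a policy (or mixture of policies) whose `mu` matches the expert's demonstrations.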

Introduces preference inference from an RL perspective, contrasting to AIXI.

Seminal paper on inverse optimal control for linear dynamical systems.

IRL with linear feature combinations. Gives an IRL approach that can use a black box MDP solver.

##### Unpublished Articles:

Shows that unidentifiability of reward functions can be mitigated by active inverse reinforcement learning.

Claims that human values can be decomposed into mammalian values, human cognition, and human socio-cultural evolution.

##### Blog Posts:

Argues that “narrow” value learning is a more scalable and tractable approach to AI control that has sometimes been too quickly dismissed.

#### Reward Engineering

##### Published Articles:

RL agents will try to delude themselves by directly modifying their percepts and/or reward signal to get high rewards. So will agents that try to predict well. Knowledge-seeking agents won’t.

RL agents might want to hack their own reward signal to get high reward. Agents which try to optimise some abstract utility function, and use the reward signal as evidence about that, shouldn’t.

Training a reinforcement learner using a reward signal supplied by a human overseer. At each point, the agent greedily chooses the action that is predicted to have the highest reward.

Investigates conditions under which modifications to the reward function of an MDP preserve the optimal policy.

Argues that a key problem in HCAI is reward engineering—designing reward functions for RL agents that will incentivize them to take actions that humans actually approve of.

Mixes deep RL with intrinsic motivation — anyone trying to study reward hacking or reward design should study intrinsic motivation.

Early paper on how to get around wireheading.

Gives some examples of reinforcement learners that find bad equilibria due to feedback loops.

Informally argues that advanced AIs should not wirehead, since they will have utility functions about the state of the world, and will recognise wireheading as not really useful for their goals.

Review paper on empowerment, one of the most common approaches to intrinsic motivation. Relevant to anyone who wants to study reward design or reward hacking.

Discusses case where agents might manipulate the process by which their values are selected.

##### Blog Posts:

A proposal for training a highly capable, aligned AI system, using approval-directed RL agents and bootstrapping.

A GitHub repo specifying the details of the ALBA proposal.

The open problem of taking an aligned policy and producing a more effective policy that is still aligned.

#### Robustness

##### Textbooks:

Presents a framework for prediction tasks that expresses uncertainty by giving a set of predicted labels that probably contains the true label.

##### Published Articles:

Two approaches to ensuring safe exploration of reinforcement learners: adding a safety factor to the optimality criterion, and guiding the exploration process with external knowledge or a risk metric.

Shows that deep neural networks can give very different outputs for very similar inputs, and that semantic information is stored in linear combinations of high-level units.

Handling predictive uncertainty by maintaining a class of hypotheses consistent with observations, and opting out of predictions if there is conflict among remaining hypotheses.

Uses weight distributions in neural nets to manage uncertainty in a quasi-Bayesian fashion. Simple idea but very little work in this area.

Argues that although models that are somewhat linear are easy to train, they are vulnerable to adversarial examples; for example, neural networks can be very overconfident in their judgments.

Readable introduction to the theory of online learning, including regret, and how to use it to analyze online learning algorithms.

Connects deep learning regularization techniques (dropout) to Bayesian approaches to model uncertainty.

My favorite paper on safe exploration — learns the dynamics of the MDP as it also learns to act safely.

Shows that deep neural networks are liable to give very overconfident, wrong classifications to adversarially generated images.

Surveys computationally tractable optimization methods that are robust to perturbations in the parameters of the problem.

Another key paper on adversarial examples.

Approach to enforcing known safety constraints in ML systems.

Not deep learning or advanced AI, but a good practical example of what it takes to formally verify a real-world system.

Explores a way to make sure a learning agent will not learn to prevent (or seek) being interrupted by a human operator.

Approach to distributional shift that tries to obtain reliable estimates of error while making limited assumptions.

##### Unpublished Articles:

Describes five open problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.

Describes eight open problems: ambiguity identification, human imitation, informed oversight, environmental goals, conservative concepts, impact measures, mild optimization, and averting incentives.

Surprising finding that adversarial examples still work when observed through a cell-phone camera, even when one isn’t directly optimizing for this process.

Lays out challenges and principles in formally specifying and verifying the behavior of AI systems.

##### Blog Posts:

Presents an approach to training AI systems to avoid catastrophic mistakes, possibly by adversarially generating potentially catastrophic situations.

#### Transparency

##### Course Materials:

Several approaches for understanding and visualizing Convolutional Networks have been developed in the literature.

##### Published Articles:

Gives short descriptions of various methods for explaining ML models to non-expert users; mainly interesting for its bibliography.

Visualizing what features a trained neural network responds to, by generating images to which the network strongly assigns some label, and mapping which parts of the input the network is sensitive to.

A novel ConvNet visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.

Uses t-SNE to analyze agents learned through Q-learning.

LIME is a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
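The core idea can be sketched in one dimension (all names are illustrative, not the authors' code): sample perturbations around the instance, query the black-box model, and fit a proximity-weighted linear surrogate whose slope serves as the local explanation.

```python
import math
import random

# Sketch of a LIME-style local surrogate: perturb the instance, query
# the black box, and fit a weighted least-squares line whose slope
# explains the model's behavior near that instance.

def local_linear_explanation(model, x0, width=1.0, n=200, seed=0):
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, width) for _ in range(n)]
    ys = [model(x) for x in xs]
    # Proximity kernel: samples near x0 count more in the fit.
    ws = [math.exp(-((x - x0) ** 2) / width ** 2) for x in xs]
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    # Closed-form weighted least squares for y ~ a + b * x.
    b = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) \
        / sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    return b  # local slope: how the model's output moves near x0

# Toy usage: the "black box" is y = x^2, so the local slope near
# x0 = 3 should be close to the true derivative 2 * x0 = 6.
slope = local_linear_explanation(lambda x: x * x, 3.0)
print(slope)
```

The real method works the same way in high dimensions, with an interpretable representation of the input and a sparse linear surrogate.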

Learning high-level Hidden Markov Models of the activations of RNNs.

Shows how transparency accelerates development: “Whyline reduced debugging time by nearly a factor of 8, and helped programmers complete 40% more tasks.”

Explains individual classification decisions locally in terms of the gradient of the classifier.

Introduces a very general notion of “how much” an input affects the output of a black-box ML classifier, analogous to Shapley value in the attribution of credit in cooperative games.

Probably the most popular nonlinear dimensionality reduction technique.

An approach to visualizing the higher-level decision-making process of an RL agent by finding clusters of similar internal states of the agent’s policy.

Incorporates guidance from humans by modifying a Q-learning algorithm with a pre-action phase in which a human can bias the learner’s “attention”.

Software that allows end users to influence the predictions that machine learning systems make on their behalf.

Kononenko has a long series of papers on explanation for regression models and machine learning models.

Shows how transparency helps developers: “Whyline users were successful about three times as often and about twice as fast compared to the control group.”

Generating natural language justifications to aid a non-expert in trusting machine learning classifications.

A technique for determining what single neurons in a deep net respond to.

Describes the AI architecture and associated explanation capability used by a training system developed for the U.S. Army by commercial game developers and academic researchers.

##### Unpublished Articles:

Explores methods for explaining anomaly detections to an analyst (simulated as a random forest) by revealing a sequence of features and their values; could be made into a UI.

Examines methods for explaining classifications of text documents. Defines “explanation” as a set of words such that removing those words from the document will change its classification.

##### News and Magazine Articles:

News article on the prospect of quantifying “how much” various inputs affect the output of an ML classifier.

##### Blog Posts:

Good intro to RNNs, showcases amazing generation ability, nice visualization of what the units are doing.

Discusses the challenge of producing machine learning systems that are transparent/interpretable but also not “gameable” (in the sense of Goodhart’s law).

Overviews notions of transparency for AI and AGI systems, and argues for its value in establishing confidence that a system will behave as intended.

Seminal result on transparency in neural networks — the origin of deep dream results.

Blog post about transparency and user interfaces in deep neural networks.

### Social Science Perspectives

#### Cognitive Science

##### Published Articles:

Extends the Baker 2009 psychology paper on preference inference by modeling how humans infer both beliefs and desires from observing an agent’s actions.

The main psychology paper on modeling humans’ preference inferences as inverse planning.

Interactive web book that teaches the probabilistic approach to cognitive science, including concept learning, causal reasoning, social cognition, and language understanding.

Reviews the Bayesian cognitive science approach to reverse-engineering human learning and reasoning.

#### Moral Theory

##### Published Articles:

Argues that moral agents with different morals will engage in trade to increase their moral impact, particularly when they disagree about what is moral.

Illustrates how rational individuals with a common goal have some incentive to act in ways that reliably undermine the group’s interests, merely by trusting their own judgement.

Analyzes a thought experiment in which an extremely unlikely threat of even-more-extremely-large differences in utility might be leveraged to extort resources from an AI.

Illustrates how model uncertainty should dominate most expert calculations that involve small probabilities.

Argues that finding a single solution to machine ethics would be difficult for moral-theoretic reasons and insufficient to ensure ethical machine behaviour.

##### Unpublished Articles:

Describes how having uncertainty over differently-scaled utility functions is equivalent to having uncertainty over same-scaled utility functions with a different distribution.

##### Books:

Overviews the engineering problem of translating human morality into machine-implementable format, which implicitly involves settling many philosophical debates about ethics.

Important and comprehensive survey of the flavors of human morality.