NeurIPS 2020 accepted seven papers co-authored by CHAI researchers:
“The MAGICAL Benchmark for Robust Imitation” by Sam Toyer, Rohin Shah, Andrew Critch, and Stuart Russell. Existing benchmarks for imitation learning and inverse reinforcement learning only evaluate how well agents can satisfy the preferences of a human demonstrator in the specific setting where demonstrations were provided. This paper proposes the MAGICAL benchmark, which makes it possible to evaluate how well agents satisfy demonstrator preferences in a range of situations that differ systematically from the one in which demonstrations were provided.
“Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design” by Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine (PDF will be available later). This paper was accepted for oral presentation. Many approaches to robustness and transfer learning require specifying a distribution of tasks in which a policy will be trained. However, existing approaches to generating environments suffer from common failure modes: domain randomization rarely generates structure, and minimax adversarial training leads to unsolvable environments. We instead propose a minimax-regret objective: the adversary generates environments that maximize the protagonist’s regret, which yields environments that are difficult but still solvable. To approximate regret, we introduce a second, antagonist agent that is allied with the adversary choosing the parameters of the environment. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments and achieves higher zero-shot transfer performance.
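A toy sketch may help make PAIRED’s training signal concrete: regret is the gap between the antagonist’s return and the protagonist’s return on the same adversary-chosen environment, and the adversary picks the environment maximizing that gap. Everything below (difficulty levels standing in for environments, skill thresholds for policies, binary success for return) is an illustrative assumption, not the paper’s actual setup.

```python
def regret(env, protagonist, antagonist, rollout):
    # PAIRED's signal: the antagonist's return minus the protagonist's
    # return on the same adversary-chosen environment.
    return rollout(env, antagonist) - rollout(env, protagonist)

def rollout(env_difficulty, skill):
    # Toy stand-in for an RL rollout: binary success iff the agent's
    # "skill" meets the environment's "difficulty" (an assumption here).
    return 1.0 if skill >= env_difficulty else 0.0

def adversary_pick(candidate_envs, protagonist, antagonist):
    # The adversary maximizes regret, so it favors environments the
    # antagonist can solve but the protagonist cannot: hard but solvable.
    return max(candidate_envs,
               key=lambda e: regret(e, protagonist, antagonist, rollout))

chosen = adversary_pick([0.0, 1.0, 1.5], protagonist=0.6, antagonist=1.2)
```

With these toy numbers the adversary skips the trivial environment (both agents solve it, regret 0) and the impossible one (neither solves it, regret 0), landing on the intermediate difficulty.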
“SLIP: Learning to predict in unknown dynamical systems with long-term memory” by Paria Rashidinejad, Jiantao Jiao, and Stuart Russell. This paper was accepted for an oral presentation. The authors consider the problem of prediction in unknown and partially observed linear dynamical systems (LDS). When the system parameters are known, the optimal linear predictor is the Kalman filter. When the system parameters are unknown, the performance of existing predictive models is poor in important classes of LDS that are only marginally stable and exhibit long-term forecast memory. The authors investigate the possibility of a uniform approximation by analyzing a generalized version of the Kolmogorov width of the Kalman filter coefficient set. This motivates the design of an algorithm, which the authors call spectral LDS improper predictor (SLIP), based on conducting a tight convex relaxation of the Kalman predictive model via spectral methods. The authors provide a finite-sample analysis, showing that their algorithm competes with the Kalman filter in hindsight with only logarithmic regret. Their regret analysis relies on Mendelson’s small-ball method and circumvents concentration, boundedness, or exponential forgetting requirements, providing a sharp regret bound independent of the mixing coefficient and forecast memory. Empirical evaluations demonstrate that the algorithm outperforms state-of-the-art methods in LDS prediction.
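As a point of reference for what SLIP competes with, here is a minimal scalar Kalman filter making one-step-ahead predictions when the parameters are known. The dynamics x_{t+1} = a·x_t + w_t, y_t = x_t + v_t and the specific values of a, q, r below are illustrative assumptions; SLIP targets exactly the harder case where these parameters are unavailable.

```python
import random

def kalman_one_step(ys, a=0.9, q=0.1, r=1.0):
    # Scalar Kalman filter for x_{t+1} = a*x_t + w_t (var q),
    # y_t = x_t + v_t (var r), with a, q, r assumed KNOWN.
    x_hat, p = 0.0, 1.0              # state estimate and its variance
    preds = []
    for y in ys:
        preds.append(x_hat)          # predict y_t before observing it
        k = p / (p + r)              # Kalman gain
        x_hat += k * (y - x_hat)     # measurement update
        p *= (1.0 - k)
        x_hat *= a                   # time update
        p = a * a * p + q
    return preds

# Synthetic data drawn from the same (assumed) model, with a fixed seed.
rng = random.Random(0)
x, ys = 0.0, []
for _ in range(200):
    x = 0.9 * x + rng.gauss(0.0, 0.1 ** 0.5)
    ys.append(x + rng.gauss(0.0, 1.0))
preds = kalman_one_step(ys)
```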
“Avoiding Side Effects in Complex Environments” by Alex Turner, CHAI visiting researcher. This paper was accepted for a spotlight talk. Reward function specification can be difficult, even in simple environments. Realistic environments contain millions of states. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoids side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on Conway’s Game of Life. By preserving optimal value for a single randomly generated reward function, AUP incurs modest overhead, completes the specified task, and avoids side effects.
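The AUP penalty can be sketched in a few lines: compare the auxiliary action-value of the chosen action against that of doing nothing, and subtract the scaled difference from the task reward. This is a simplified sketch (the paper’s scaling details are omitted), and the task reward and auxiliary Q-values below are placeholder assumptions, not the paper’s Game of Life environments.

```python
def aup_reward(r_task, q_aux, state, action, noop="noop", lam=0.1):
    # Penalize shifts (up or down) in the agent's ability to pursue the
    # auxiliary goal, measured relative to taking the no-op action.
    penalty = abs(q_aux(state, action) - q_aux(state, noop))
    return r_task(state, action) - lam * penalty

# Placeholder task reward and auxiliary Q-values (assumptions for the demo).
def r_task(state, action):
    return 1.0 if action == "make_widget" else 0.0

AUX_Q = {"noop": 5.0, "make_widget": 4.8, "smash_vase": 1.0}

def q_aux(state, action):
    return AUX_Q[action]

widget = aup_reward(r_task, q_aux, state=None, action="make_widget")
smash = aup_reward(r_task, q_aux, state=None, action="smash_vase")
```

In this toy example, completing the task barely changes auxiliary value and incurs only a tiny penalty, while the destructive action is penalized heavily because it destroys the ability to pursue the auxiliary goal.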
“AvE: Assistance via Empowerment” by Yuqing Du, Stas Tiomkin, Emre Kiciman, Daniel Polani, Pieter Abbeel, and Anca Dragan. The paper addresses the problem of goal inference in assistive artificial agents with a new paradigm that increases the human’s ability to control their environment. They formalize this approach by augmenting reinforcement learning with human empowerment and successfully demonstrate their method in a shared autonomy user study for a challenging simulated teleoperation task with human-in-the-loop training.
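Empowerment formalizes the human’s ability to control their environment as the channel capacity from actions to resulting states. For deterministic dynamics this capacity collapses to the log of the number of distinct reachable next states, which the sketch below computes; the toy one-dimensional dynamics are an assumption for illustration only.

```python
import math

def one_step_empowerment(state, actions, step):
    # For deterministic dynamics, the channel from actions to next states
    # is noiseless, so capacity = log2(#distinct reachable next states).
    reachable = {step(state, a) for a in actions}
    return math.log2(len(reachable))

# Toy 1-D world (an assumption): move left, stay, or move right from 0.
emp = one_step_empowerment(0, [-1, 0, 1], lambda s, a: s + a)
```

An assistive agent maximizing this quantity keeps more of the human’s options open, without needing to infer which specific goal the human holds.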
“Reward-rational (implicit) choice: A unifying formalism for reward learning” by Hong Jun Jeon, Smitha Milli, and Anca Dragan. It is often difficult to hand-specify what the correct reward function is for a task, so researchers have instead aimed to learn reward functions from human behavior or feedback. The types of behavior interpreted as evidence of the reward function have expanded greatly in recent years. The authors’ key insight is that different types of behavior can be interpreted in a single unifying formalism: as a reward-rational choice that the human is making, often implicitly. The formalism offers both a unifying lens through which to view past work and a recipe for interpreting new sources of information that are yet to be uncovered. The authors provide two examples to showcase this: interpreting a new feedback type, and reading into how the choice of feedback itself leaks information about the reward.
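One concrete way to read a choice as reward-rational is under the common Boltzmann-rationality assumption: the human picks an option with probability proportional to exp(β·reward), and a Bayesian update over candidate reward functions follows. The specific likelihood model and the two toy reward hypotheses below are illustrative assumptions, not the paper’s full formalism.

```python
import math

def boltzmann_likelihood(choice, options, reward, beta=1.0):
    # P(choice | reward): noisily-rational selection, with probability
    # proportional to exp(beta * reward(option)).
    z = sum(math.exp(beta * reward(o)) for o in options)
    return math.exp(beta * reward(choice)) / z

def reward_posterior(choice, options, hypotheses, prior):
    # Bayes rule over candidate reward functions, given one observed
    # (possibly implicit) choice.
    unnorm = [p * boltzmann_likelihood(choice, options, r)
              for p, r in zip(prior, hypotheses)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two toy reward hypotheses (assumptions): one prefers "a", one prefers "b".
hypotheses = [lambda o: 1.0 if o == "a" else 0.0,
              lambda o: 1.0 if o == "b" else 0.0]
post = reward_posterior("a", ["a", "b"], hypotheses, prior=[0.5, 0.5])
```

Observing the choice of “a” shifts the posterior toward the hypothesis that rewards “a”, with the inverse temperature β controlling how strongly a single choice is read as evidence.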
“Preference learning along multiple criteria: A game-theoretic perspective” by Kush Bhatia, Ashwin Pananjady, Peter Bartlett, Anca Dragan, and Martin Wainwright. In this paper, the authors generalize the notion of a von Neumann winner to the multi-criteria setting by taking inspiration from Blackwell’s approachability. Their framework allows for nonlinear aggregation of preferences across criteria, and generalizes the linearization-based approach from multi-objective optimization.
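A short sketch of the single-criterion object being generalized may help. Given a pairwise preference matrix M, with M[i][j] the probability that alternative i is preferred to j, a von Neumann winner is a mixture over alternatives that wins at least half the time in expectation against every fixed alternative; the rock-paper-scissors-style matrix below is an illustrative assumption.

```python
def is_von_neumann_winner(p, M, tol=1e-9):
    # p: distribution over n alternatives; M[i][j] = P(i preferred to j).
    # Winner condition: against every fixed alternative j, the mixture p
    # wins with probability at least 1/2 in expectation.
    n = len(M)
    return all(sum(p[i] * M[i][j] for i in range(n)) >= 0.5 - tol
               for j in range(n))

# Cyclic preferences (paper beats rock, scissors beat paper, rock beats
# scissors): no single alternative wins, but the uniform mixture does.
M = [[0.5, 0.0, 1.0],
     [1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5]]
uniform = [1 / 3, 1 / 3, 1 / 3]
```

Because M encodes a symmetric zero-sum game, such a winning mixture always exists by the minimax theorem; the paper’s contribution is extending this object to vector-valued preferences across multiple criteria.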