CHAI Internship Mentor Profiles
This is a tentative list of mentors who will be accepting interns. Please review the list to get a sense of our research interests; you will have the option to list mentors you’d be especially excited to work with in the application form.
Michael K. Cohen: Michael’s main research interest is solving the control problem by designing agents that avoid the assumptions of this paper; some such designs are described in short blog posts here. Another research interest, also aimed at solving the control problem, is the Eliciting Latent Knowledge research agenda described here.
Cameron Allen: I am interested in the foundations of intelligence—the computations that enable agency, learning, planning, abstraction, and interaction. My research is focused on trying to answer some of the following questions. What techniques enable agents to quickly learn abstract models of the world? And how can agents flexibly apply what they’ve learned to new problems? How can agents identify the important aspects of their decision problem? When should agents remember something, and what should they remember? How can we design agents so they’ll reliably do what we want, while avoiding unwanted side-effects? I build AI systems as a way to study these core aspects of intelligence, in the hopes that they will allow us to better understand ourselves and our world, and ultimately to improve both.
Scott Emmons: I am interested in both the theory and practice of AI alignment. I have helped characterize how RLHF can lead to deception when the AI sees more than the human, develop multimodal attacks and benchmarks for open-ended agents, and use mechanistic interpretability to find evidence of learned look-ahead in a chess-playing neural network. I am planning to advise internships on similar topics, including follow-ups to these projects and new, related directions.
Erik Jenner: I am interested in using internal representations of neural networks to monitor them and flag bad behavior at runtime. I believe this is one of the more promising approaches to ultimately deal with risks from deceptive alignment/scheming or measurement tampering (and could help against trojans or jailbreaks in the shorter term). More concretely, I am excited about creating proxy tasks for these future risks and empirically testing different monitoring methods (such as mechanistic anomaly detection or coup probes). Projects will likely be on these topics, potentially also with connections to interpretability, AI control, and other directions. See this post for example projects, but note that these will likely change over time.
Micah Carroll: I’m interested in changes in human preferences, values, and beliefs, and how they may be affected by interactions with AI systems. I’ve studied how unintended AI manipulation, deception, and preference-changing behavior may emerge from seemingly reasonable choices of optimization algorithms in RL, both conceptually and in practice (in the context of LLMs and recommender systems). More recently, I’ve become more excited about understanding reward hacking of LLM reward models and approaches to scalable oversight. I expect to be most interested in advising projects on these or related topics.
Ben Plaut: I mostly study generalization: how a model handles unfamiliar inputs. I think many types of safety and alignment failures can be framed as misgeneralization (e.g., a robot has never seen a vase before so it doesn’t know whether breaking vases is positive, negative, or neutral). I’m specifically interested in training agents to recognize when they’re in unfamiliar situations and then act cautiously (e.g., ask for help). I think this idea is applicable to many different ML regimes, but I mainly work on LLMs and RL. I do a mix of theoretical and empirical work with the goal of designing methods which can potentially scale to very advanced AI but are testable on systems we have today.
Bhaskar Mishra: My research focuses on designing AI systems capable of forming and reasoning about well-defined probabilistic beliefs. To this end, I am interested in a variety of approaches to designing scalable probabilistic inference algorithms, including alternative representations, evidence sub-sampling, reinforcement learning for meta-level control, and more. One direction I’m exploring frames inference in large Bayesian networks as a meta-level control problem, using reinforcement learning to learn a policy for efficient evidence sub-sampling that approximates inference by prioritizing more relevant evidence over less relevant evidence. Additionally, I’m designing approximate inference algorithms for Probabilistic Dependency Graphs, an alternative representation that, unlike Bayesian networks, allows an agent to hold intersecting, potentially incompatible beliefs. Interns will have the opportunity to contribute to ongoing projects or pursue new directions related to improving scalability in probabilistic reasoning or exploring applications of these approaches.
Justin Svegliato: My goal is to build AI agents that complete tasks in the world for us reliably and safely using both large language models and reinforcement learning. To do this, I’ve worked on both theoretical and empirical approaches to AI agents, focusing on chat assistants, autonomous vehicles, and humanoid robots. Recently, I’ve been involved in a number of research projects, some of which include safeguarding LLM agents against prompt attacks and jailbreaks [1, 2], embedding ethical theories into agents [3, 4, 5], enabling agents to use deep metareasoning to control their reasoning process [6, 7, 8] or learn their reasoning model [9, 10], and allowing agents to recover from exceptions while maintaining safety [11, 12, 13, 14, 15]. This summer, I plan to work with several interns on new research that focuses on using large language models and reinforcement learning to build AI agents that can more effectively reason and make decisions.
Jiahai Feng: I’m interested in understanding empirical deep learning phenomena in order to predict, monitor, or control the behavior of deep learning systems. In the past, I have studied how language models bind entities in context and used the resulting insights to build probes that decode language model beliefs as logical propositions. More recently, I have been thinking about training dynamics and about how language models are able to learn and reason out of context.
Shreyas Kapur: How can we build AI systems that learn rich, interpretable world models as quickly as humans do? And how can we build systems that use these models to plan future actions? I use techniques from probabilistic programming, Bayesian inference, programming languages, search, and neuro-symbolic AI to build systems that can reason and plan in interpretable ways from the ground up.
Cassidy Laidlaw: I’m interested in understanding and improving reinforcement learning, using a combination of theory and empirical experimentation to explain why and when deep RL works; e.g., see these papers. I also work on human-AI coordination, developing tools like assistance games and better human models that account for how people provide unreliable feedback, with the goal of improving on current alignment techniques like RLHF. Finally, I’m interested in improving uncertainty quantification and OOD detection in deep learning, particularly for learning human values and preferences.
Aly Lidayan: I am interested in advancing scientific understanding of intelligent agents, in particular intrinsically motivated, open-ended and continual/lifelong learning agents. Recently I’ve been studying intrinsic motivation and exploration (theoretically and empirically), and for future work I’d be interested in advising intern projects on topics including meta-RL and evolutionary perspectives, abstraction discovery and goal setting, causal learning, decision-making based on internal as well as external state, and/or any of the above themes in combination with large language models.
Niklas Lauffer: I’m interested in multiagent learning, cooperative AI, and human value learning in both RL and LLM settings. My research usually involves exploring a foundational concept that can then be scaled to empirical settings. One of my primary lines of research explores the idea that in most multiagent interactions, only some pieces of information are strategically relevant. Recently, I’ve been exploring how these concepts can be used to decompose multiagent games, better coordinate with and learn from humans, and train more robust and adaptable agents. I’m also interested in designing benchmarks for evaluating agents in multiagent interactions and in incorporating formal structure into learning. In the past, I’ve worked on online learning in games and formal methods.