News

Getting By Goal Misgeneralization With a Little Help From a Mentor

25 Dec 2024

Tu Trinh, Ben Plaut, Khanh Nguyen, and Mohamad Danesh wrote the paper “Getting By Goal Misgeneralization With a Little Help From a Mentor.” This paper explores whether goal misgeneralization can be mitigated by allowing an agent to ask for help when it is uncertain. The answer is mostly yes, although the current methods have substantial weaknesses and there are lots of interesting avenues for future work.

Linear Probe Penalties Reduce LLM Sycophancy

14 Dec 2024

Visiting ETH MSc student Henry Papadatos and supervising CHAI PhD student Rachel Freedman published the article “Linear Probe Penalties Reduce LLM Sycophancy” at the NeurIPS SoLaR workshop. The paper demonstrates a generalizable methodology for reducing unwanted LLM behaviors that are not sufficiently disincentivized by RLHF fine-tuning.

Rachel Freedman selected as inaugural Cooperative AI Fellow

30 Nov 2024

Rachel Freedman, PhD Student, has been selected as one of the fellows for Cooperative AI’s PhD Fellow Program.

Representative Social Choice: From Learning Theory to AI Alignment

12 Nov 2024

Tianyi Qiu, CHAI Intern, wrote this paper, which was accepted to the NeurIPS 2024 Pluralistic Alignment Workshop.

Getting By Goal Misgeneralization With a Little Help From a Mentor

10 Oct 2024

Khanh Nguyen, Mohamad Danesh, Ben Plaut, and Alina Trinh wrote this paper, which was presented at the Towards Safe & Trustworthy Agents Workshop at NeurIPS 2024.

Language-Guided World Models: A Model-Based Approach to AI Control

18 Sep 2024

Khanh Nguyen, CHAI Postdoctoral Fellow, published a paper at the Fourth International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics (ACL 2024).

“The Alignment Problem” Wins Xingdu Book Award

29 Aug 2024

Brian Christian’s book “The Alignment Problem” was announced as the sole winner in the New Knowledge Category for Imported Editions at the Xingdu Book Award ceremony in China. The Chinese translation was published this past year by Hunan Science & Technology Press.

Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback

07 Aug 2024

Rachel Freedman, CHAI PhD Student, and Wes Holliday, CHAI Affiliate, published a paper at the International Conference on Machine Learning.

AI Alignment with Changing and Influenceable Reward Functions

23 Jul 2024

CHAI researchers Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, and Anca Dragan wrote the paper “AI Alignment with Changing and Influenceable Reward Functions,” which was accepted to ICML.

Forget deepfake videos. Text and voice are this election’s true AI threat.

08 Jul 2024

Jonathan Stray, Senior Scientist at CHAI, and Jessica Alter, tech entrepreneur and co-founder of Tech for Campaigns, wrote an op-ed for The Hill on the risks posed by AI in the current election cycle.
