Getting By Goal Misgeneralization With a Little Help From a Mentor

25 Dec 2024

Tu Trinh, Ben Plaut, Khanh Nguyen, and Mohamad Danesh wrote the paper “Getting By Goal Misgeneralization With a Little Help From a Mentor.” This paper explores whether goal misgeneralization can be mitigated by allowing an agent to ask for help when it is uncertain. The answer is mostly yes, although our current methods have substantial weaknesses and there are lots of interesting avenues for future work.

This paper was accepted to the NeurIPS 2024 Workshop on Safe & Trustworthy Agents.

https://arxiv.org/pdf/2410.21052