Linear Probe Penalties Reduce LLM Sycophancy
14 Dec 2024
Visiting ETH MSc student Henry Papadatos and supervising CHAI PhD student Rachel Freedman published the paper “Linear Probe Penalties Reduce LLM Sycophancy” at the NeurIPS SoLaR workshop. The paper demonstrates a generalizable methodology for reducing unwanted LLM behaviors that RLHF fine-tuning does not sufficiently disincentivize.