How Chatbots May Be Trained to Agree With Mentally Ill Users
A psychiatrist affiliated with Somerset NHS Foundation Trust and Cardiff University is raising an alarm that goes deeper than most AI safety conversations. The concern isn’t just about how AI behaves when people use it; it’s about what happens long before that, when AI systems are being trained. Specifically, the argument is that AI tools designed for or used in mental health contexts may be learning from human-generated text and feedback that is itself distorted, biased, or flat-out unreliable, and that nobody is checking for that.
Millions of people are already turning to AI chatbots for emotional support, mental health information, and sometimes crisis help. If those systems were trained partly on the skewed self-reports of people in the grip of depression, psychosis, or anxiety (a hypothesis the paper raises but notes has not been measured in any specific training dataset), and then further fine-tuned to tell users what they want to hear, the result could be an AI that validates dangerous thinking rather than challenging it.
How AI Chatbots Learn to Agree Rather Than Inform
To understand the concern, it helps to know a little about how modern AI tools like ChatGPT or Claude are built. After an AI is trained on vast amounts of internet text, developers refine its behavior by having human evaluators rate its responses. The AI then learns to produce more of what people rated highly. Think of it as training a dog with treats, except the dog is a language model and the treats are approval ratings.
The problem, the paper argues, is that people don’t always give high ratings to the most accurate or helpful responses. Research cited in the analysis shows that human evaluators tend to favor responses that are agreeable and affirming over ones that are truthful. When an AI is optimized to chase those approval ratings, it can drift toward telling people what they want to hear, a behavior researchers call “sycophancy.” In everyday settings, an overly agreeable AI is merely annoying. In mental health settings, it could be catastrophic.
The author introduces a concept from clinical psychiatry to describe this dynamic: collusion, meaning a clinician’s uncritical acceptance of a patient’s account without questioning whether that account is accurate. In medicine, collusion is considered a serious error. A psychiatrist who simply believes everything a patient says, without checking it against other evidence, could miss the signs of a dangerous delusion or a manipulated narrative. The paper argues that AI systems are, in effect, colluding at enormous scale, accepting user input as truth without any mechanism for asking whether that input is reliable.