New AI Benchmark Reveals Leading AI Chatbots, Including Claude, ChatGPT, and Gemini, Avoid Harm but Still Need More Support for High-Risk Conversations
mpathic launches mPACT, a clinician-led benchmark for evaluating how AI models perform in high-risk scenarios, including suicide risk, eating disorders, and misinformation.
mpathic, a clinician-founded AI safety company that works directly with leading AI labs, is launching mPACT (mpathic Psychologist-led AI Clinical Tests), a new benchmark that evaluates how leading models handle high-risk conversations.
As more people turn to AI chatbots for everyday support, the need for evaluation standards shaped by clinicians has become more urgent. mPACT is designed to address this gap by applying expert clinical judgment to assess how models recognize risk, interpret context, and avoid harmful responses.
With this launch, mpathic released initial findings from the first three mPACT benchmarks: Suicide Risk, Eating Disorders, and Misinformation. These domains represent some of the most complex and high-stakes settings in which AI systems are already being deployed, and each benchmark uses expert judgment to capture subtle, clinically meaningful signals that automated evaluation often misses.
“mpathic’s work is crucial because we still lack comprehensive, evidence-based, scalable, and clinically grounded frameworks. We need benchmarks like mPACT that evaluate AI models against multi-dimensional risks and clinical evidence. mpathic’s exceptionally high safety standards can help companies build safer products and, importantly, evaluate real-world AI interactions,” said Caroline Figueroa, MD, PhD, a neuroscientist at Stanford University and Delft University of Technology.
Initial Findings Show Strong Harm Avoidance, but Uneven Clinical Support
Across mPACT benchmarks, leading models generally avoided harmful responses and often recognized signs of distress, even when risk was not stated directly. However, performance was less consistent in delivering responses that would meet clinical expectations in real crisis scenarios.
In suicide risk conversations, models showed stronger overall performance. Claude Sonnet 4.5 achieved the highest composite performance across safety and clinical helpfulness, though no model led across all dimensions. GPT-5.2 stood out for consistently avoiding harmful responses, and Gemini 2.5 Flash also ranked among top performers.
In contrast, all models performed more poorly in eating disorder conversations, missing the subtle but crucial cues that signal crisis in clinical settings. This gap appeared both in overall performance and in avoiding harmful responses, suggesting serious limitations in current safety approaches.
In misinformation-related conversations, the benchmark found that model responses can lessen user understanding even without stating false information directly. Across models, common failure patterns included reinforcing questionable beliefs, expressing unwarranted confidence, and presenting one-sided or incomplete information without adequately challenging user assumptions. These behaviors were especially pronounced in multi-turn conversations, where models could gradually amplify flawed reasoning or encourage risky decisions over time.
“These results show clear progress, but also an important gap,” said Dr. Grin Lord, CEO/Founder of mpathic and licensed psychologist. “Most people don’t say ‘I’m at risk’ directly—they demonstrate it through subtle behaviors over time that are obvious to human clinicians. Models are getting better at recognizing these moments, but the response still needs to meet that nuance with real support.”
Making AI Safety Measurable in High-Risk Scenarios
Even top-performing models can fail in individual conversations, particularly in complex or high-risk situations. mPACT is designed to make these gaps visible and measurable, enabling:
- Cross-model comparison of safety performance in high-risk scenarios
- Greater accountability through qualified-access data and transparent evaluation
- A foundation for partner review and regulatory assessment
“We need a shared, clinically grounded standard for AI behavior,” said Dr. Alison Cerezo, Chief Science Officer at mpathic and licensed psychologist. “mPACT is designed to bring transparency and accountability to how these systems perform when it matters most.”