As millions of people and thousands of clinicians begin using general-purpose AI tools (such as ChatGPT, Grok, Gemini, and others) for medical questions and image interpretation, new case reports and peer-reviewed studies show these systems can confidently produce convincing but false medical information — in some cases directly misleading patients and contributing to harm.
“AI is already a powerful assistant. But multiple recent examples make one point painfully clear: when an AI sounds authoritative, that is not the same as being clinically correct,” said Dr. Neel Navinkumar Patel, a cardiovascular medicine fellow at the University of Tennessee and a researcher in AI and digital health. “Hospitals and regulators must insist on human-in-the-loop systems and clear labeling of what these models can and cannot safely do.”
Key Real-World Examples (What Happened)
1) A patient hospitalized after following ChatGPT’s “diet” advice
A case published in Annals of Internal Medicine: Clinical Cases describes a 60-year-old man who, after consulting ChatGPT for dietary advice, replaced table salt (sodium chloride) with sodium bromide and developed bromide toxicity (“bromism”), presenting with paranoia and hallucinations that led to hospitalization. The authors demonstrated that some prompts produced responses naming bromide as a chloride substitute without adequate medical warnings, an outcome that likely contributed to real harm.
Why it matters: This is a documented, peer-reviewed instance in which AI-derived advice was linked to direct patient harm, not just a hypothetical risk.
Eichenberger E, Nguyen H, McDonald R, et al. A Case of Bromism Influenced by Use of Artificial Intelligence. Ann Intern Med Clin Cases. 2025;4(8). doi:10.7326/aimcc.2024.1260.
2) Researchers show chatbots can generate polished medical misinformation and fake citations
A study in Annals of Internal Medicine found that major LLMs (OpenAI’s GPT family, Google’s Gemini, xAI’s Grok, and others) can be manipulated to produce authoritative-sounding false medical advice, even inventing scientific citations to support fabricated claims. Only one model, trained with stronger safety constraints, resisted this behavior.
Why it matters: AI outputs can include fabricated references and polished reasoning that appear verified, making misinformation far more persuasive and dangerous.
Li CW, Gao X, Ghorbani A, et al. Assessing the System-Instruction Vulnerabilities of Large Language Models to Malicious Conversion Into Health Disinformation Chatbots. Ann Intern Med. Published online June 24, 2025. doi:10.7326/ANNALS-24-03933.
3) AI model invented a non-existent brain structure (“basilar ganglia”)
Google’s Med-Gemini (a healthcare-oriented version of Gemini) produced the term “basilar ganglia,” a nonexistent structure that conflates two distinct anatomical entities, the basal ganglia and the basilar artery. The error appeared in launch materials and a research preprint and was flagged publicly by neurologists. Google later edited its post and called it a typo, but the incident became a prominent example of “hallucination” in medicine.
Why it matters: When AI invents anatomy or diagnoses, clinicians may overlook errors (automation bias), or downstream systems could propagate those mistakes. (Source: The Verge: “Google’s Med-Gemini Hallucinated a Nonexistent Brain Structure.” 2024.)
4) Viral user posts and clinician tests show image-analysis failures (Grok, ChatGPT, Gemini)
After public encouragement to upload X-rays and MRIs, users posted examples in which Elon Musk’s Grok flagged fractures or abnormalities, some celebrated as “AI diagnoses.” Radiologists who later tested Grok and other chatbots found inconsistent performance, false positives, and missed findings. Independent clinical evaluations concluded these tools are not reliable replacements for certified radiology workflows.
Why it matters: Consumer anecdotes highlight potential, but clinical rollout must be evidence-based. (Source: STAT News: “AI Chatbots and Medical Imaging: Radiologists Warn of Misdiagnosis Risk.” 2024.)
5) Studies show general LLMs perform poorly on diagnostic tasks (ECG and chest X-ray)
Peer-reviewed work testing multimodal LLMs on ECG and imaging interpretation shows major limitations. For example, JMIR studies evaluating GPT-4V on ECG interpretation reported low accuracy on visually driven diagnoses, and other benchmarks showed perceptual failures (orientation, contrast, basic checks) that would be unacceptable for clinical use.
Why it matters: Clinicians should not treat off-the-shelf chatbots as medical-grade interpreters without regulatory clearance. (JMIR Med Inform; 2024.)
How These Errors Happen (Short Explainer)
- LLMs predict text, not truth. They are designed to generate statistically likely continuations of text, not to verify accuracy, which leads to fluent but false statements (“hallucinations”); the illustrative sketch after this list shows the underlying idea. (Reuters)
- Visual reasoning gaps. Even image-capable models may misread orientation or labeling because they weren’t built for clinical imaging. (arXiv.org)
- Prompt manipulation. Researchers showed that simple instruction changes can make general models output dangerous falsehoods.
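To make the “statistically likely, not verified” point concrete, here is a minimal, hypothetical Python sketch. It is a toy bigram model, not a real LLM, and the tiny corpus and the continue_text helper are invented for illustration: the model extends a prompt by sampling whichever words most often follow in its training text, with no mechanism for checking whether the result is medically true.

```python
# Toy illustration (not an actual LLM): a tiny bigram model that picks the
# next word purely from co-occurrence statistics in its training text.
# It has no concept of truth, so a fluent continuation can still be wrong.
import random
from collections import defaultdict

# Hypothetical mini-corpus; real LLMs train on vastly larger text collections.
corpus = (
    "table salt can be replaced with sea salt . "
    "table salt can be replaced with sodium bromide . "  # one unsafe sentence in the data
    "sodium bromide is used in industrial settings . "
)

# Count which words follow which (bigram statistics).
follows = defaultdict(list)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    follows[current_word].append(next_word)

def continue_text(prompt: str, length: int = 8, seed: int = 0) -> str:
    """Extend the prompt by repeatedly sampling a statistically likely next word."""
    random.seed(seed)
    out = prompt.split()
    for _ in range(length):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        out.append(random.choice(candidates))  # likelihood, not accuracy, drives the choice
    return " ".join(out)

print(continue_text("table salt can be replaced with"))
```

Depending on the random draw, this toy model may fluently recommend sodium bromide simply because that phrase appeared in its data. Real models are far more sophisticated, but the underlying objective, predicting plausible next tokens, is the same, which is why fluency alone is not evidence of accuracy.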
Recommendations
For Patients & the Public
- Treat chatbots as informational only — never for diagnosis, medication changes, or urgent care decisions.
- Save chat logs and show them to your clinician — a confident AI diagnosis does not mean it’s correct.
For Clinicians & Hospital Leaders
- Require human sign-off for all AI-generated diagnostic outputs, a position also supported by the American Medical Association.
- Validate models locally before use. Rely on FDA-cleared systems when available. (FDA guidance)
- Build policies defining when and how staff may use chatbots, and document AI involvement in patient records.
For Regulators & Industry
- Require clear labeling when LLMs are part of any medical workflow, and mandate transparent performance metrics with post-market surveillance.
- Mandate adversarial testing to detect vulnerabilities that allow health disinformation or unsafe recommendations.
“Generative AI is already reshaping medicine, but not yet in a way that guarantees patient safety when it comes to diagnosis,” said Dr. Neel N. Patel. “These recent, documented failures show the cost of over-trusting fluent AI. The right path is responsible augmentation: transparent tools, rigorous validation, human sign-off, and stronger regulation so AI helps clinicians rather than misleading patients.”
References
- Eichenberger E, Nguyen H, McDonald R, et al. A Case of Bromism Influenced by Use of Artificial Intelligence. Ann Intern Med Clin Cases. 2025;4(8). doi:10.7326/aimcc.2024.1260.
- Li CW, Gao X, Ghorbani A, et al. Assessing the System-Instruction Vulnerabilities of Large Language Models to Malicious Conversion Into Health Disinformation Chatbots. Ann Intern Med. Published online June 24, 2025. doi:10.7326/ANNALS-24-03933.
- The Verge. “Google’s Med-Gemini Hallucinated a Nonexistent Brain Structure.” 2024.
- STAT News. “AI Chatbots and Medical Imaging: Radiologists Warn of Misdiagnosis Risk.” 2024.
- JMIR Med Inform. “Evaluation of GPT-4V on ECG Interpretation Tasks.” 2024.
Media Contact
Neel N Patel, MD
Department of Cardiovascular Medicine
University of Tennessee Health Science Center at Nashville
St. Thomas Heart Institute / Ascension St. Thomas Hospital, Nashville, TN, USA
(332) 213-7902
neelnavinkumarpatel@gmail.com
Website: https://www.linkedin.com/in/neel-navinkumar-patel-md