Why Expert AI Still Gets Things Wrong

Introduction

Specialist AI tools are often marketed as safer alternatives to general-purpose chatbots. A legal research assistant, medical advice platform or finance-focused AI appears more trustworthy because it operates within a defined domain. In many cases, these systems do reduce certain kinds of error by using curated data, retrieval systems, expert training material or domain-specific workflows. Yet specialist branding does not eliminate the core problem behind AI hallucinations and fluent wrong answers: the system can still generate information that is inaccurate, incomplete or unsupported while presenting it with confidence.

For critical thinkers, the key lesson is simple. A specialist interface may improve the odds of getting a useful answer, but it does not remove the need for verification. Evidence, not branding, remains the real measure of reliability.

What Legal Benchmarks Reveal About Specialist Tools

Legal AI provides one of the clearest tests of whether domain-specific systems can be trusted without human oversight. Law is highly structured, heavily documented and often publicly accessible, making factual claims easier to verify than in many other fields.

Research from the Stanford Institute for Human-Centered Artificial Intelligence (Stanford HAI) found that legal AI systems can still hallucinate cases, citations and legal conclusions despite being designed for legal work, with hallucinations appearing in one out of six (or more) benchmarking queries. The researchers warned that even specialised legal models continued to produce fabricated or misleading outputs under realistic testing conditions. [Stanford HAI]

More recent benchmark work has shown improvement in some specialist legal tools, particularly systems that combine language models with legal databases and retrieval mechanisms. However, the same studies found persistent reasoning errors, missed statutory provisions and retrieval failures. Even advanced legal platforms that advertise AI-assisted research produced significant inaccuracies when compared against expert-reviewed legal datasets. [arXiv]

The practical consequences have become increasingly visible in courts. Judges in the United States and United Kingdom have sanctioned lawyers who submitted AI-generated legal citations that did not exist or misrepresented the underlying law. Courts have repeatedly emphasised that responsibility remains with the human professional, regardless of which AI tool generated the text. [The Guardian]

A striking example emerged in 2026 when US District Judge Sharion Aycock in Mississippi disqualified the attorneys on both sides of a contract dispute after fabricated AI-generated legal citations appeared in court filings. The court's response was not aimed at the software vendor but at the lawyers who failed to verify the output before relying on it. [Reuters]

The lesson is not that specialist legal AI is useless. Rather, legal benchmarks and courtroom incidents demonstrate that domain expertise built into software does not remove the need for human review. It changes the nature of the review.

Why Domain Labels Can Raise False Confidence

One reason specialist AI deserves extra scrutiny is psychological rather than technical. People often assume that a system designed for a specific profession has already solved the reliability problem.

This assumption can create what researchers sometimes call automation bias: the tendency to trust computer-generated recommendations more than independent judgement. When an answer arrives through a platform labelled “medical AI”, “legal AI” or “research AI”, users may apply less scepticism than they would when reading a response from a general chatbot. Yet the underlying technology frequently remains a large language model that predicts plausible text rather than directly reasoning from verified facts. [Nature]

Marketing can unintentionally strengthen this effect. Specialist tools often highlight benchmark scores, expert partnerships or domain-specific training. These features can be valuable, but they do not guarantee correctness in every situation. Even highly optimised systems may fail when confronted with unusual cases, ambiguous evidence, outdated information or questions that fall outside the data they were designed around. [arXiv]

The danger is subtle. Users may stop asking, “How do I know this is true?” and start asking, “Which expert system said it?” Critical thinking weakens when institutional appearance substitutes for evidence.

Why Better Data Does Not Eliminate Wrong Answers

Many specialist AI systems attempt to reduce hallucinations through retrieval-augmented generation, often abbreviated as RAG. Instead of relying entirely on model memory, the system retrieves documents from trusted databases before generating a response.

This approach often improves accuracy, but it does not eliminate mistakes. Errors can arise when relevant documents are not retrieved, when retrieved material is misunderstood, when the model combines sources incorrectly, or when it presents uncertain findings as settled facts. Legal benchmarking research has documented all of these failure modes, including retrieval failures and reasoning mistakes even when relevant source material exists. [arXiv]
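The retrieve-then-generate pattern and two of the failure modes described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Document` type, the word-overlap scoring, the thresholds and the two-entry corpus are hypothetical stand-ins, and real systems typically use far more capable retrievers. The point is only that retrieval can return nothing relevant, and that a generated sentence can go beyond what the retrieved text supports.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve(query: str, corpus: list[Document], min_overlap: int = 3) -> list[Document]:
    """Return documents sharing at least `min_overlap` words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc.text)), doc) for doc in corpus]
    hits = [(score, doc) for score, doc in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits]


def unsupported_sentences(answer: str, sources: list[Document],
                          threshold: float = 0.6) -> list[str]:
    """Flag answer sentences whose words are mostly absent from every source."""
    flagged = []
    for sentence in (part.strip() for part in answer.split(".")):
        words = tokenize(sentence)
        if not words:
            continue
        # Best word-coverage ratio across sources; 0.0 when nothing was retrieved.
        best = max(
            (len(words & tokenize(doc.text)) / len(words) for doc in sources),
            default=0.0,
        )
        if best < threshold:
            flagged.append(sentence)
    return flagged


# Hypothetical two-document corpus standing in for a trusted legal database.
CORPUS = [
    Document("statute-a", "A tenant must receive thirty days written notice before eviction"),
    Document("statute-b", "A landlord must return the security deposit within fourteen days"),
]

if __name__ == "__main__":
    sources = retrieve("How many days notice before eviction of a tenant", CORPUS)
    draft = ("A tenant must receive thirty days written notice before eviction. "
             "The notice must be delivered by certified mail")
    print([doc.doc_id for doc in sources])
    print(unsupported_sentences(draft, sources))
    # A query the corpus does not cover retrieves nothing at all.
    print(retrieve("certified mail delivery rules", CORPUS))
```

In this sketch the second sentence of the draft answer is flagged because the retrieved statute says nothing about certified mail. Word overlap is a crude proxy for support, so a flag here is a prompt for human review rather than a verdict.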

Healthcare research shows a similar pattern. Studies evaluating medical chatbots have found that hallucinated references, unsupported claims and misleading advice can still appear despite domain-focused design. Researchers have proposed specialised frameworks to measure clinical safety precisely because standard performance metrics do not fully capture the risks of incorrect medical outputs. [PMC, Nature]

In other words, specialist systems often reduce one class of errors while leaving others intact. The user sees a cleaner, more authoritative interface, but the underlying challenge of validating claims remains.

Where Human Review Remains Non-Negotiable

Human checking becomes most important when decisions carry legal, financial, medical, educational or reputational consequences.

Several situations deserve particular caution:

Citations and references. AI-generated references should be checked against original documents because fabricated or distorted citations remain a recurring failure mode. [PMC]
Professional advice. Medical diagnoses, legal interpretations and regulatory guidance should be reviewed against authoritative sources or qualified experts before action is taken. [Oxford University]
Edge cases and unusual scenarios. Specialist systems often perform best on common situations represented in their training data and may struggle with rare exceptions. [arXiv]
Summaries of complex material. A summary can appear accurate while omitting crucial qualifications, exceptions or uncertainties.
High-stakes decisions. Whenever an error could significantly affect health, liberty, finances or professional obligations, independent verification remains essential.
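The first item in this list, checking citations against original documents, can be partly organised as a triage step before a person reads anything. The following is a minimal sketch under stated assumptions: `trusted_index` is a hypothetical dictionary standing in for an authoritative database, the case names are invented placeholders, and exact substring matching is a deliberately strict simplification. It sorts citations into those whose source cannot be found, those whose quoted passage is missing from the source, and those whose quotation at least appears in the source text.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    NOT_FOUND = "source not found"
    QUOTE_MISSING = "source found, quoted text missing"
    QUOTE_FOUND = "quoted text found in source"


@dataclass(frozen=True)
class Citation:
    reference: str    # citation string as it appears in the AI output
    quoted_text: str  # passage the AI attributes to that source


def check_citation(citation: Citation, trusted_index: dict[str, str]) -> Status:
    """Classify one citation against a trusted reference -> full-text mapping."""
    source_text = trusted_index.get(citation.reference)
    if source_text is None:
        return Status.NOT_FOUND
    if citation.quoted_text.lower() in source_text.lower():
        return Status.QUOTE_FOUND
    return Status.QUOTE_MISSING


def triage(citations: list[Citation], trusted_index: dict[str, str]) -> dict[Status, list[str]]:
    """Group citation references by status so reviewers can prioritise."""
    report: dict[Status, list[str]] = {status: [] for status in Status}
    for citation in citations:
        report[check_citation(citation, trusted_index)].append(citation.reference)
    return report


# Hypothetical index and citations; none of these cases are real.
TRUSTED_INDEX = {
    "Example v. Sample (2020)": "The court held that written notice is required in all cases.",
}
CITATIONS = [
    Citation("Example v. Sample (2020)", "written notice is required"),
    Citation("Example v. Sample (2020)", "oral notice is sufficient"),
    Citation("Placeholder v. Nobody (2019)", "damages are capped"),
]

if __name__ == "__main__":
    for status, references in triage(CITATIONS, TRUSTED_INDEX).items():
        print(f"{status.value}: {references}")
```

Even the best outcome here only shows that the words appear in the source. Whether the source actually supports the proposition it is cited for remains a question for a human reviewer.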

The goal of human review is not simply to catch factual mistakes. It is also to identify missing context, unsupported assumptions and overconfidence. These problems may not appear as obvious errors but can still lead to poor decisions.

The Strongest Use of Specialist AI

The most successful real-world pattern is not replacing human judgement but combining it with machine assistance.

Recent research in healthcare has shown that AI systems can help clinicians broaden diagnostic possibilities and correct some initial errors, while simultaneously introducing risks of automation bias if their suggestions are accepted uncritically. The benefits were greatest when human experts treated the AI as an assistant rather than an authority. [arXiv]

The same principle appears in legal practice. Courts, regulators and professional organisations increasingly accept that AI can support research, drafting and document review. What they reject is the assumption that AI-generated work can bypass professional verification. [Stanford Law School, Reuters]

Within the broader challenge of AI hallucinations and fluent wrong answers, specialist tools represent an improvement, not a solution. They can narrow the error rate, accelerate research and surface useful information. But the final safeguard remains the same one that applies to social media claims, search results and expert opinions: examine the evidence, verify important facts and remain willing to question outputs that merely sound authoritative.

Endnotes

Source: hai.stanford.edu
Title: AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries
Link: https://hai.stanford.edu/news/ai-trial-legal-models-hallucinate-1-out-6-or-more-benchmarking-queries
Source snippet
Nearly three quarters of lawyers plan on using...

Published: May 23, 2024
Source: arxiv.org
Link: https://arxiv.org/abs/2401.01301
Source: arxiv.org
Title: Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys
Link: https://arxiv.org/abs/2603.03300
Source snippet
Benchmarking Legal RAG: The Promise and Limits of AI Statutory SurveysFebruary 7, 2026...

Published: February 7, 2026
Source: reuters.com
Link: https://www.reuters.com/legal/litigation/us-appeals-court-sanctions-lawyers-over-ai-hallucinations-lack-candor-2026-06-03/
Source snippet
appeals court sanctioned two lawyers for submitting court briefs containing fictitious, AI-generated case citations, referred to as "hall...
Source: reuters.com
Title: Judge rules both sides in lawsuit misused AI, disqualifies lawyers
Link: https://www.reuters.com/legal/litigation/judge-rules-both-sides-lawsuit-misused-ai-disqualifies-lawyers-2026-06-09/
Source snippet
District Judge in Mississippi, Sharion Aycock, has disqualified all attorneys involved in a contract dispute case after discovering both...
Source: nature.com
Link: https://www.nature.com/articles/s41746-025-01670-7
Source snippet
A framework to assess clinical safety and hallucination...by E Asgari · 2025 · Cited by 220 — They have proposed a human evaluatio...
Source: pmc.ncbi.nlm.nih.gov
Title: Reference Hallucination Score for Medical Artificial...
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11325115/
Source snippet
by F Aljamaan · 2024 · Cited by 133 — The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authen...
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11795331/
Source snippet
Language Models for Chatbot Health Advice Studiesby B Huo · 2025 · Cited by 164 — This research has investigated the ability of chatbots...
Source: arxiv.org
Link: https://arxiv.org/abs/2603.10492
Source snippet
Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language AgentMarch 11, 2026...

Published: March 11, 2026
Source: law.stanford.edu
Link: https://law.stanford.edu/juelsgaard-intellectual-property-and-innovation-clinic/use-of-ai-generally-in-legal-practice/
Source snippet
Stanford Law SchoolUse of AI Generally in Legal Practice | Stanford Law SchoolLaw firms, courts, and law clinics are rushing to experimen...
Source: reuters.com
Title: Trouble with AI 'hallucinations' spreads to big law firms
Link: https://www.reuters.com/legal/government/trouble-with-ai-hallucinations-spreads-big-law-firms-2025-05-23/
Source snippet
Trouble with AI 'hallucinations' spreads to big law firms23 May 2025 — AI-generated fictions, known as "hallucinations," have cropped up...

Published: May 2025
Source: dho.stanford.edu
Title: Legal RAG Hallucinations
Link: https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf
Source snippet
Assessing the Reliability of Leading AI...by V Magesh · 2025 · Cited by 438 — And in a recent survey of 1200 lawyers practicing in the U...
Source: arxiv.org
Link: https://arxiv.org/html/2604.23445v1
Source snippet
AI Safety Training Can be Clinically Harmful6 days ago — These findings motivate a five-axis evaluation framework (protocol fidelity, hal...
Source: arxiv.org
Title: A Reasoning-Focused Legal Retrieval Benchmark
Link: https://arxiv.org/html/2505.03970v1
Source snippet
May 6, 2025 — We introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal r...

Published: May 6, 2025
Source: theguardian.com
Title: High court tells UK lawyers to stop misuse of AI after fake...
Link: https://www.theguardian.com/technology/2025/jun/06/high-court-tells-uk-lawyers-to-urgently-stop-misuse-of-ai-in-legal-work
Source snippet
The GuardianHigh court tells UK lawyers to stop misuse of AI after fake...June 6, 2025 — 7 Jun 2025 — Ruling follows two cases blighted...

Published: June 6, 2025
Source: theguardian.com
Title: Two US lawyers fined for submitting fake court citations...
Link: https://www.theguardian.com/technology/2023/jun/23/two-us-lawyers-fined-submitting-fake-court-citations-chatgpt
Source snippet
Two US lawyers fined for submitting fake court citations...23 Jun 2023 — A US judge has fined two lawyers and a law firm $5,000 (£3,935)...
Source: ox.ac.uk
Title: New study warns of risks in AI chatbots giving medical advice
Link: https://www.ox.ac.uk/news/2026-02-10-new-study-warns-risks-ai-chatbots-giving-medical-advice
Source snippet
10 Feb 2026: Patients need to be aware that asking a large...
Source: theguardian.com
Title: utah lawyer chatgpt ai court brief
Link: https://www.theguardian.com/us-news/2025/may/31/utah-lawyer-chatgpt-ai-court-brief
Source snippet
fake precedent generated by ChatGPT.” As a result of the false citations, ABC4 reported, Bednar was ordered to pay the respondent's attor...
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10552880/
Source snippet
Call to Address AI “Hallucinations” and How Healthcare...by R Hatem · 2023 · Cited by 211 — We continue to believe the term "AI hallucin...

Additional References

Source: independent.co.uk
Link: https://www.independent.co.uk/news/uk/home-news/research-chatgpt-grok-deepseek-loughborough-university-b2957759.html
Source snippet
AI chatbots often 'hallucinate' and give inaccurate medical...2 days ago — Research found half of the information given in response to 5...
Source: linkedin.com
Link: https://www.linkedin.com/posts/dr-shervin-molayem-2502b521_multi-model-assurance-analysis-showing-large-activity-7370836680473423872-_CD4
Source snippet
AI hallucinations in healthcare: A serious patient safety issueNew research in Nature highlights a serious patient safety issue in health...
Source: neelguha.github.io
Link: https://neelguha.github.io/assets/pdf/building_genai_benchmarks_for_law_oxford_chapter.pdf
Source snippet
Building GenAI Benchmarks: A Case Study in Legal...by N Guha · Cited by 2 — GenAI's potential for use in highly technical fields like la...
Source: businessinsider.com
Link: https://www.businessinsider.com/mississippi-judge-removes-lawyers-lawsuit-ai-hallucinations-court-filings-2026-6
Source snippet
U.S. District Judge Sharion Aycock sanctioned four attorneys involved in a contractual dispute for submitting briefs containing bogus cit...
Source: lawnext.com
Link: https://www.lawnext.com/2025/05/ai-hallucinations-strike-again-two-more-cases-where-lawyers-face-judicial-wrath-for-fake-citations.html
Source snippet
AI Hallucinations Strike Again: Two More Cases Where...14 May 2025 — Two more cases have emerged of lawyers submitting briefs containing...

Published: May 2025
Source: publishing.rcseng.ac.uk
Link: https://publishing.rcseng.ac.uk/doi/abs/10.1308/rcsann.2026.0021
Source snippet
AI chatbots exhibit heterogeneous reference integrity, with risks of hallucinations and biases underscoring the need for prompt...Read more...
Source: mountsinai.org
Link: https://www.mountsinai.org/about/newsroom/2025/ai-chatbots-can-run-with-medical-misinformation-study-finds-highlighting-the-need-for-stronger-safeguards
Source snippet
AI Chatbots Can Run With Medical Misinformation, Study...6 Aug 2025 — The team created fictional patient scenarios, each containing one...
Source: edrm.net
Link: https://edrm.net/2025/08/reasonable-or-overreach-rethinking-sanctions-for-ai-hallucinations-in-legal-filings/
Source snippet
Reasonable or Overreach? Rethinking Sanctions for AI...18 Aug 2025 — A proposed four-pillar framework guides fair, proportional sanction...
Source: bostonbar.org
Title: ChatGPT Is Not a Lawyer: Using Generative AI Responsibly and Ethically in Law
Link: https://bostonbar.org/journal/chatgpt-is-not-a-lawyer-using-generative-ai-responsibly-and-ethically-in-law/
Source snippet
ChatGPT Is Not a Lawyer: Using Generative AI...Mar 2, 2026 — Lawyers using GAI tools have a duty of competence, including maintaining re...
Source: esquiresolutions.com
Title: Federal Court Turns Up the Heat on Attorneys Using ChatGPT for Research
Link: https://www.esquiresolutions.com/federal-court-turns-up-the-heat-on-attorneys-using-chatgpt-for-research/
Source snippet
Federal Court Turns Up the Heat on Attorneys Using...13 Aug 2025 — Dunn, the court declared that monetary sanctions are proving ineffect...
