
Why ChatGPT, Claude, and General AI Tools Fall Short on Medical Record Review

General AI tools are cheap, fast, and already on your computer. When a PI attorney has 3,000 pages of medical records and a deadline, the temptation to paste them into ChatGPT is real. The output looks polished, the dates appear to line up, and the summary reads with professional clarity.

But that summary may not be accurate — and it may not hold up to scrutiny.

This post breaks down why general AI tools fail at medical record review for PI litigation — from the standpoint of what courts, opposing counsel, and malpractice carriers actually care about. It explains what purpose-built platforms do differently and why those differences matter.

The Appeal of General AI for Medical Record Review

What Most Firms Try First

Most PI law firms arrive at specialized medical record tools the same way: after trying something else first. They begin with a paralegal reading records manually, billing 10 to 20 hours per case. Then someone tries ChatGPT and discovers that the output cannot be verified against the source documents it summarized.

The initial test looks compelling: paste a progress note, ask for a summary, and in under 15 seconds the AI returns a clean paragraph with dates, diagnoses, and treatment history.

What you do not see in those 15 seconds is what is missing.

Why the Output Looks Convincing

General AI tools are trained on vast medical and legal text corpora. They understand what a progress note looks like, what ICD codes are, and how to phrase clinical language in a way that reads as authoritative. The output surface is polished enough to pass a casual read.

This creates a specific confidence problem. The summary is professional enough that reviewers stop cross-checking it against the original record. That is the moment an error becomes case-threatening.

What Medical Record Review Actually Requires

Medical Records Are Not Clean Documents

Medical records arrive in formats that general AI handles poorly. Handwritten notes scanned at low resolution. Duplicate pages from multiple providers. Records filed out of chronological order. Pages from a different patient mixed into the chart. Missing records that a provider forgot to include.

Purpose-built review platforms address these intake problems before any AI summarization begins — through OCR correction, deduplication, and gap flagging at the processing layer. General AI tools skip all of that.
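To make that intake step concrete, here is a minimal sketch of what page-level deduplication and gap flagging can look like. The hash-based approach and function names are illustrative assumptions, not a description of any vendor's actual pipeline:

```python
import hashlib

def page_fingerprint(page_text: str) -> str:
    """Hash normalized OCR text so re-produced duplicate pages collapse to one key."""
    normalized = " ".join(page_text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def dedupe_and_flag_gaps(pages: list[tuple[int, str]]) -> tuple[list[int], list[int]]:
    """Return (page numbers to keep, page numbers missing from the sequence).

    pages: (page_number, ocr_text) pairs, e.g. from a Bates-numbered production.
    """
    if not pages:
        return [], []
    seen: set[str] = set()
    keep: list[int] = []
    for number, text in pages:
        key = page_fingerprint(text)
        if key not in seen:   # drop exact re-sends of the same page content
            seen.add(key)
            keep.append(number)
    numbers = sorted(n for n, _ in pages)
    present = set(numbers)
    missing = [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]
    return keep, missing
```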

You are not passing clean text to a general AI model. You are passing raw, messy, high-stakes clinical documentation and expecting a general-purpose model to behave like a specialized legal tool.

What “Review” Means in a Litigation Context

A medical record review for litigation is not just a summary. It is a defensible narrative of what happened, when, to whom, and based on what clinical evidence. Every factual claim in a demand letter needs a traceable source.

Without source-linking — a citation back to the specific page, provider, and date of each clinical event — an attorney cannot confirm that the AI’s output accurately reflects the underlying record. That makes every downstream use of the summary, from demand drafting to trial prep, rest on an unverifiable foundation.
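A source-linked chronology entry is easiest to picture as a record that carries its own citation. The schema below is hypothetical, sketched only to show the minimum fields an auditable entry needs; it is not InQuery's actual data model:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ChronologyEvent:
    """One clinical event plus the citation needed to audit it."""
    event_date: date
    provider: str
    description: str        # e.g. "L4-L5 discectomy performed"
    source_document: str    # file or exhibit identifier
    source_page: int        # the exact page the claim traces to

event = ChronologyEvent(
    event_date=date(2023, 4, 17),
    provider="Dr. A. Rivera, Orthopedic Surgery",
    description="Lumbar MRI ordered following persistent radiculopathy",
    source_document="Mercy_Hospital_Records.pdf",
    source_page=847,
)
```

Without the last two fields, every downstream reader has to take the description on faith.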

That is a problem in settlement negotiations. It becomes a larger problem if the case goes to litigation and the summary is challenged.

Where General AI Breaks Down on Accuracy

ChatGPT and other general AI models do not produce source-linked output by default. Even with prompt engineering they cite inconsistently — sometimes plausibly but incorrectly, which is harder to catch than citing nothing at all.

Purpose-built platforms generate source-linked chronologies where every clinical event traces to a specific document page and provider. That is the minimum standard for defensible work product in personal injury litigation.

EvenUp’s guide to AI medical record review processes describes the same standard: without source attribution, there is no mechanism to audit the output. If you cannot trace a fact to its source, you cannot stand behind it in a negotiation or at trial.

Hallucinations in Medical Contexts

Medical hallucinations are more dangerous than generic AI errors. A general model might invent a hospitalization date, misattribute a procedure to the wrong provider, or silently omit a diagnosis buried in a handwritten note on page 847 of a 2,000-page record set.

In a PI case, these are not abstract data quality failures. They produce demand letters that misrepresent the plaintiff’s medical history. That creates credibility risk in negotiation, discovery exposure, and potential malpractice liability for the attorney who relied on the summary.

General AI tools have no built-in mechanism to detect or flag these errors. They produce output with the same confident tone whether the underlying claim is accurate or fabricated — a design constraint, not a configuration issue.

Token Limits and Record Volume

Most PI cases involve 2,000 to 15,000 pages of records. Larger cases — catastrophic injury, nursing home litigation, multi-year workers’ comp — regularly exceed 50,000 pages.

General AI models have context windows that max out at roughly 100,000 to 200,000 tokens. At 500 to 800 tokens per scanned page after OCR, a 5,000-page record set exceeds 2.5 million tokens. That is far beyond what any single session can process.
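The arithmetic is easy to verify. A quick check, using an assumed midpoint from the per-page range above:

```python
pages = 5_000
tokens_per_page = 650        # midpoint of the 500-800 range after OCR (assumed)
context_window = 200_000     # upper end of the typical limits cited above

total_tokens = pages * tokens_per_page                 # 3,250,000 tokens
sessions_needed = -(-total_tokens // context_window)   # ceiling division: 17
print(f"{total_tokens:,} tokens, at least {sessions_needed} separate sessions")
```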

Supio’s analysis of AI medical chronologies makes the same point: general-purpose tools were not designed for the documentation scale that complex litigation involves. Specialized platforms process records in parallel, with automatic chunking, deduplication, and output reassembly across the full record set. With a general AI tool, that engineering problem falls to the user.
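The chunking side of that engineering problem is conceptually simple but tedious to get right by hand. A simplified sketch, assuming per-page token counts are already known:

```python
def chunk_pages(page_tokens: list[int], budget: int = 150_000) -> list[tuple[int, int]]:
    """Greedily group consecutive pages into chunks that fit a token budget.

    Returns (start, end) index pairs over the page list. A real pipeline would
    also deduplicate pages first and reassemble the per-chunk output afterwards.
    """
    chunks: list[tuple[int, int]] = []
    start, used = 0, 0
    for i, tokens in enumerate(page_tokens):
        if used + tokens > budget and i > start:
            chunks.append((start, i))   # close the current chunk before this page
            start, used = i, 0
        used += tokens
    if start < len(page_tokens):
        chunks.append((start, len(page_tokens)))
    return chunks
```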

Most attorneys and paralegals do not solve it — they process a subset of the records and assume they captured the most important pages. Often, they did not.

The HIPAA Problem With General AI Tools

What HIPAA Requires From AI Vendors

HIPAA’s Business Associate Agreement requirement applies whenever a law firm shares protected health information with a vendor that processes it on the firm’s behalf. Uploading patient records to ChatGPT is sharing PHI with OpenAI.

Standard ChatGPT plans — including the free tier and the Team subscription — do not offer HIPAA-compliant configurations. OpenAI does not execute Business Associate Agreements for these products. Uploading medical records through the standard interface may constitute a HIPAA violation before the AI produces a single word of output.

Wisedocs, which serves both law firms and insurance carriers, explicitly structures its platform around BAA compliance as a baseline requirement — not an enterprise add-on. That contrast illustrates the design difference: tools built for healthcare-adjacent industries treat HIPAA compliance as foundational, not optional.

What BAA Coverage Actually Means

Enterprise AI plans sometimes offer BAA support, but having a BAA does not end the compliance analysis. A BAA specifies how the vendor handles PHI — data retention policies, sub-processor agreements, model training opt-outs — and those terms vary significantly across providers.

Purpose-built medical record platforms are built around HIPAA compliance from the start. They operate on dedicated infrastructure, enforce BAA obligations across their full stack, and carry explicit liability for the protections they promise. For a detailed look at how the platform’s security and HIPAA compliance posture is structured, the documentation covers each of these standards.

DigitalOwl, another purpose-built medical record platform, similarly treats HIPAA infrastructure as a prerequisite, not a configuration. General AI providers are not built for healthcare data. When something goes wrong, your exposure depends on contract terms written to protect a tech company, not a litigation support tool.

Why No QA Layer Is a Deal-Breaker

Who Catches the Errors

The output of a general AI session flows directly to the attorney or paralegal with no secondary review step built in. If the AI makes an error, the person reading the output has to catch it, which erases much of the efficiency gain and, with it, the cost advantage.

Purpose-built platforms build quality assurance into the pipeline — for high-stakes litigation, that QA layer includes human reviewers who verify AI output against source documents before it leaves the system. That is the professional standard for AI medical record review in active PI cases.

Attorney Obligations Under ABA Model Rule 1.1

Model Rule 1.1 requires competent representation, and Comment 8 extends that duty to understanding the benefits and risks of relevant technology. Multiple state bars have issued ethics opinions on AI use over the past two years. The direction is consistent: attorneys bear responsibility for verifying AI-generated output used in client matters.

CasePeer’s analysis of AI in chronology work makes the same point: the tool does not absorb the attorney’s verification duty. Noting that “the AI looked right” does not satisfy that duty. Attorneys who rely on unverified output in demand letters or trial submissions accept personal liability for any errors it contains.

General AI vs. Purpose-Built Medical Record Review

The table below compares general AI against purpose-built platforms on the criteria that matter in PI litigation, using the tools law firms most frequently evaluate.

Capability | InQuery | Wisedocs | Supio | DigitalOwl | ChatGPT Enterprise
--- | --- | --- | --- | --- | ---
Source-linked output | Yes — page-level | Partial | Partial | Structured output | No
HIPAA BAA standard | Yes | Yes | Yes | Yes | Enterprise tier only
Handles 50,000+ pages | Yes | Yes | Yes | Yes | No
Human QA layer | Built-in option | Optional | Optional | Optional | None
OCR and deduplication | Yes | Yes | Partial | Partial | No
Audit trail | Full chain of custody | Yes | Yes | Yes | None
Integration with case management | Native + API | Limited | Limited | Limited | Limited

What Attorneys Are Actually Liable For

The Discovery Problem

AI-generated work product is subject to discovery. If opposing counsel requests the source materials behind a medical record review, an attorney who used unverified general AI may need to disclose that the summary lacked source-linking or quality review.

Federal courts have sanctioned attorneys for submitting AI-generated briefs with fabricated citations, and the same scrutiny is extending to AI-assisted medical summaries. Understanding common mistakes in PI medical record summaries helps identify the liability profile before a case reaches that stage.

Legalyze.ai’s comparison of AI medical record platforms notes that the platforms attorneys can defend in court have auditable, source-attributed output — not summaries produced by a general assistant. That distinction matters when opposing counsel starts asking questions.

When AI Output Becomes the Exhibit

In a personal injury case, the medical chronology can become a central exhibit in settlement negotiations or at trial. Errors in that document carry direct consequences: reduced settlement outcomes, adverse credibility findings, and potential bar complaints if the errors were material.

A source-linked review produced by a purpose-built platform provides a chain of custody that general AI output cannot replicate. Each line traces to a specific page in the original record, and that auditability separates defensible work product from a liability.

For more on how AI-assisted medical record review is structured for law firm use, the workflow comparison shows where accountability lands across manual, general-AI, and purpose-built approaches.

How Purpose-Built Tools Address Each Failure Mode

Source-Linked Chronologies

Medical record review platforms built for legal work generate source-linked output by design. Every clinical event — a diagnosis, a procedure, a prescription, a referral — links to the specific document page in the original record. An attorney can verify any claim in under 60 seconds without rereading the full record set.

MOS Medical Record Review’s analysis of AI platforms uses source attribution as the primary differentiator: output designed to be audited, not just read.

The guide to evaluating medical summarization platforms walks through each technical criterion in practice.

Human QA Integration

The most defensible review process pairs AI summarization with human quality assurance — not as a redundant step, but as the designed standard for high-stakes litigation where accuracy matters more than raw speed.

Purpose-built platforms integrate QA into the workflow with clear handoffs between AI output and human verification. General AI tools cannot do this, so the verification burden falls entirely on the user — usually a paralegal working under deadline with no systematic way to check the output.

See the document review process for PI attorneys for where human review is required and where AI can carry the load.

Platform Comparison: Technical Criteria

The table below compares leading purpose-built platforms on specific technical criteria relevant to PI litigation. General AI tools are included for reference so the gaps are visible in context.

Platform | Source Citations | Human QA | BAA Coverage | Max Record Volume | Hallucination Controls
--- | --- | --- | --- | --- | ---
InQuery | Page-level | Built-in option | Standard | Unlimited | Confidence scoring + QA review
Wisedocs | Structured output | Optional | Yes | Large | Partial
DigitalOwl | Structured output | Optional | Yes | Large | Partial
Supio | Partial | Optional | Yes | Large | Partial
ChatGPT Enterprise | None | None | Enterprise only | ~150K tokens | None
ChatGPT Standard | None | None | No | ~32K tokens | None

The Hidden Cost of Using General AI for Medical Records

General AI tools appear free or near-free compared to purpose-built platforms. The comparison breaks down when you account for what errors actually cost.

Cost Category | General AI | Purpose-Built Platform
--- | --- | ---
Per-review verification time | High — manual checking required | Low — QA built into pipeline
HIPAA compliance risk | $100–$50,000 per violation | Covered by platform BAA
Malpractice exposure | High — unverifiable output | Low — source-linked, auditable
Rework after errors | High — errors propagate to demand letters | Low — QA catches before delivery

Time spent verifying output. A paralegal spending 90 minutes checking an AI summary of a 3,000-page record set has not saved time compared to a purpose-built platform that delivers a verified review in 40 minutes. At paralegal billing rates, general AI is expensive even when the tool itself is free.

Malpractice exposure. Errors in AI-generated summaries that materially affect settlement outcomes represent liability. That liability is not covered by the tool’s pricing, and it does not require malice to trigger — only negligence.

HIPAA remediation. A data incident from PHI uploaded to a non-compliant AI service carries regulatory penalties that scale by willfulness. Gain Servicing’s overview of medical record management covers the compliance obligations that apply at every stage of how records are handled — including who you share them with.

Rework after errors. When a demand letter goes out with a factual error in the medical summary, the downstream cost — re-reviewing records, redrafting, managing client expectations — exceeds the original time saved. The right comparison is not cost-per-query; it is the total cost of a wrong summary against the total cost of a right one.
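As a rough expected-value sketch of that comparison, with every dollar figure an assumption rather than vendor pricing:

```python
paralegal_rate = 75.0         # $/hour (assumed)
verify_hours_general = 1.5    # manually checking an unverified AI summary (assumed)
verify_hours_platform = 0.25  # spot-checking source-linked output (assumed)
p_material_error = 0.05       # chance an error survives manual checking (assumed)
rework_cost = 5_000.0         # re-review, redrafting, client management (assumed)
platform_fee = 150.0          # per-case platform cost (assumed)

general_ai = verify_hours_general * paralegal_rate + p_material_error * rework_cost
purpose_built = platform_fee + verify_hours_platform * paralegal_rate
print(f"general AI: ${general_ai:,.2f} expected per case")       # $362.50
print(f"purpose-built: ${purpose_built:,.2f} expected per case") # $168.75
```

Under these assumptions the "free" tool costs roughly twice as much per case once verification labor and error risk are priced in.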

For the full picture of what purpose-built medical record review software costs at scale, the pricing breakdown covers per-case, volume, and annual contract structures.

Frequently Asked Questions

Can a PI attorney use ChatGPT for medical record review?

Technically yes, but not safely at scale. Standard ChatGPT plans do not include HIPAA BAAs, which makes uploading patient records a likely compliance violation. Even on enterprise plans with BAA coverage, ChatGPT produces no source-linked output, cannot process full record volumes in a single session, and has no QA layer. Most firms that try it move to purpose-built platforms after running into these limits on a real case. See the medical record summary guide for a fuller picture of what the review process should include.

What makes medical AI hallucinations especially dangerous?

In a medical context, a hallucination is not just an abstract data error — it is a fabricated or distorted claim about a real patient’s health history. A falsely added hospitalization, a misattributed procedure, or a silently omitted diagnosis can produce a demand letter that misrepresents the plaintiff’s case. That carries negotiation risk, discovery risk, and potential malpractice liability if the error was material to the outcome. General AI tools have no built-in mechanism to detect these errors before output is delivered.

Does HIPAA apply when law firms upload medical records to general AI tools?

Yes — any vendor that processes PHI on behalf of a law firm operating as a business associate must sign a Business Associate Agreement. Standard ChatGPT, Google Gemini, and most consumer general AI tools do not execute BAAs. Uploading records through those interfaces is a likely HIPAA violation regardless of whether the AI output is accurate. Enterprise plans with BAA support still require verification of data retention and sub-processor terms before records are shared.

What should a law firm look for in a medical record review platform?

Four criteria matter most: source-linked output that cites each fact to its source page, a HIPAA BAA included as standard, a QA mechanism (human or automated), and the ability to handle your actual record volumes. InQuery is purpose-built for PI litigation and meets all four. The platform features evaluation guide provides a structured framework for comparing tools.

How does InQuery differ from a general AI tool for medical record review?

InQuery is built specifically for legal medical record review, not repurposed from a general-purpose assistant. It produces source-linked summaries and chronologies where every clinical event traces to a specific page in the original record. The platform operates under HIPAA-compliant infrastructure, includes a BAA as standard, and integrates a human QA layer for high-stakes cases. See the overview of what AI medical record review includes for a full breakdown of how the process works.

How do I test whether a medical record review tool is accurate enough for litigation?

Accuracy benchmarking is possible with controlled testing: run the same record set through multiple platforms and compare output against a manually verified ground truth. The key metrics are recall (did the tool find all clinically relevant events), precision (did it introduce false facts), and source attribution accuracy (do citations match source pages). The guide to medical record review accuracy benchmarks explains how to run this evaluation. Kroolo’s analysis of legal document summarization with AI covers benchmark methodology applicable to the medical record context.
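A minimal scoring sketch for that evaluation, assuming extracted events have already been normalized to comparable identifiers:

```python
def review_metrics(found: set[str], truth: set[str],
                   citations_checked: int, citations_correct: int) -> dict[str, float]:
    """Recall, precision, and citation accuracy for one benchmark run.

    found / truth: normalized event keys, e.g. "2023-04-17|MRI lumbar|Rivera".
    """
    true_positives = len(found & truth)
    return {
        "recall": true_positives / len(truth) if truth else 0.0,
        "precision": true_positives / len(found) if found else 0.0,
        "citation_accuracy": (citations_correct / citations_checked
                              if citations_checked else 0.0),
    }
```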

Erick Enriquez

CEO & Co-Founder at InQuery
