How to Actually Evaluate a Medical-Record Summarization Tool
A field guide for claims handlers and attorneys who have to trust the output
It’s no secret that a lot of companies now build tools for reviewing and organizing the information in large medical claim files. New summarization tools ship every few months, and from the outside they look remarkably alike.
Every one will tell you the same three things:
- Their summaries are accurate.
- Their output is defensible.
- A human checks the work.
The words are nearly identical from one demo to the next, which leaves you, the person who actually has to put a summary in front of an adjuster or opposing counsel, with no real way to tell the tools apart.
So buyers keep asking us a fair question: there are so many of these on the market now, how do I tell them apart? What am I actually looking for?
We have a point of view here, and we’ll be upfront about why.
Over the last several months, we’ve competed for and won business against nearly every major competitor on the market. Along the way we noticed that every business owner we work with rebuilds a similar evaluation process from scratch, so we’re sharing the method here. It’s the same set of tests an honest reviewer would run against any product in the category, including ours. Used well, it should help you find the right tool for your work, whoever made it.
So here’s the single best thing you can do, and it’s more work than it sounds, which is exactly why it works. Treat tool evaluation like hiring, not shopping. You wouldn’t hire someone off one easy interview question, and you shouldn’t pick a tool off one clean demo.
Start by defining your use cases, the way you’d write a job description before hiring for a role. Then design one evaluation process, the way you’d design an interview loop, and put every vendor through the identical set of questions. Same cases, same security review, same paperwork, same scorecard. You build it once, then you run each contender through it.
- ✕ One clean demo, then a decision
- ✕ The vendor's hand-picked sample records
- ✕ Whichever tool sounds most polished
- ✕ A choice made off a feature checklist
- → One evaluation loop, every vendor through it
- → Your own real, messy files as the test
- → Identical questions for every contender
- → A score against a rubric you weighted first
The heart of it is your own files, and they’re the interview questions.
Pick 3-5 real cases and have every vendor process the same set. Vendor demos use clean, hand-picked records chosen to look good. Your caseload is messy, and the gap between those two things is where most buying mistakes happen.
Choose the cases the way a good interviewer writes questions: start easy, then escalate. Pick files you already know well, so you can quickly verify whether each tool captured the relevant context accurately. Include at least one straightforward file, the softball, so you can see baseline quality on a record that should be simple. Then build in the hard ones, and be deliberate about the kind of hard.
Pick a file you know contains duplicate records. Test whether the tool collapses them or just repeats them.
Pick one with the messy realities you actually face: handwriting, bad scans, rotated pages, faxes of faxes.
Then add the curveball — a file with commingled records where two patients’ documents got mixed into one. Does the tool catch it, or silently blend them?
The tools that hold up on your hardest cases are the ones that will hold up on the caseload you actually have.
Standing this up takes real effort, and you should embrace that rather than shortcut it.
Yes, this is friction. It’s meant to be. You’re making a decision you’ll live with for years, across thousands of files, so it’s worth measuring twice to cut once. The few days you spend here are cheap against the cost of switching tools eighteen months in.
There’s a tell hidden in this process, too.
Once your process is built and the contenders are in it, the rubric below is what you score them on. It’s organized by what you’re really protecting against.
| # | Evaluation axis | The core test | The failure that burns you |
|---|---|---|---|
| 1 | Accuracy & faithfulness | Does every clinical fact appear in the source — and what got left out? | Invented details, or a quietly dropped visit |
| 2 | Defensibility & traceability | Can you click any statement to its exact source page? | A citation that points to the wrong page |
| 3 | Deduplication vs. tagging | Does it collapse repeats, not just label page types? | Over-merging two genuinely different events |
| 4 | Granularity & structure | Can you control the detail level and export real fields? | Inference blended into the objective record |
| 5 | Messy real-world input | How well does it read bad scans, handwriting, and faxes? | A page it can't read, silently skipped |
| 6 | Relevance & real insight | Does it surface causation, gaps, and inconsistencies? | A confident flag that is simply wrong |
| 7 | Workflow, editing & control | How easily can a reviewer correct and shape the output? | Turnaround that is fast and wrong |
| 8 | Security & compliance | HIPAA, SOC 2, data residency, training, retention? | Your records used to train a vendor's models |
| 9 | Who's behind the tool | Who answers when it breaks — and where is it headed? | Betting on a team that doesn't know claims work |
Accuracy and faithfulness
This is whether the summary tells the truth about the record. Pick a file you already know cold and read the output against it.
Does every clinical fact in the summary actually appear in the source? Or does the tool invent plausible-sounding details that were never there?
Then check the harder failure: what did it leave out?
A summary that’s accurate about what it includes, but quietly drops a key visit or an adverse finding, is the dangerous kind — it reads as clean. Omission burns you later, so weight it heavily.
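One way to keep both failures visible on a file you know cold is to count them separately. The sketch below assumes you have hand-listed the facts in the record as short strings and done the same for the tool's summary. The `faithfulness_report` function and the sample facts are hypothetical, for illustration only.

```python
def faithfulness_report(source_facts, summary_facts):
    """Compare facts pulled from a summary against a hand-built list from the record."""
    source = set(source_facts)
    summary = set(summary_facts)
    supported = summary & source
    return {
        # Facts the summary states that the record does not contain.
        "invented": sorted(summary - source),
        # Facts in the record that the summary never mentions.
        "omitted": sorted(source - summary),
        # Share of the summary's facts that the record supports.
        "supported_share": len(supported) / len(summary) if summary else 0.0,
        # Share of the record's facts that made it into the summary.
        "coverage_share": len(supported) / len(source) if source else 0.0,
    }


# Hypothetical facts for a file the reviewer already knows.
known_facts = [
    "2023-01-04 ER visit",
    "2023-01-19 MRI lumbar",
    "2023-02-02 PT eval",
    "2023-03-10 ortho consult",
]
tool_facts = [
    "2023-01-04 ER visit",
    "2023-01-19 MRI lumbar",
    "2023-03-10 ortho consult",
]

report = faithfulness_report(known_facts, tool_facts)
print(report)
```

In this made-up example nothing is invented and every stated fact is supported, yet a quarter of the record is missing. Matching facts by exact string is a simplification; in practice a reviewer decides by hand whether two entries describe the same fact.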
Defensibility and traceability
Everyone uses the word “defensible.” Make them prove it.
The core test is simple. Can you click any statement in the summary and land on the exact source page it came from? Page-level citation is the standard. Line- or span-level is better. Document-level (“it’s somewhere in these 400 pages”) is not good enough to rely on.
Then pressure-test the citations themselves. Are they on every assertion, or just some? When you spot-check them, do they point to the right page?
A citation that points to the wrong place is worse than no citation. It gives false confidence.
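If you want to turn a citation spot-check into numbers you can compare across vendors, a small tally is enough. This is a minimal sketch that assumes a reviewer records, for each sampled statement, the page the tool cited and the page where the fact actually appears. The `SpotCheck` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpotCheck:
    statement: str
    cited_page: Optional[int]   # None when the statement carries no citation
    actual_page: Optional[int]  # page where the reviewer found the fact, if anywhere


def citation_stats(checks):
    """Summarize how many sampled statements are cited, and how many citations are right."""
    cited = [c for c in checks if c.cited_page is not None]
    correct = [c for c in cited if c.cited_page == c.actual_page]
    return {
        "coverage": len(cited) / len(checks) if checks else 0.0,
        "accuracy": len(correct) / len(cited) if cited else 0.0,
        "wrong": [c.statement for c in cited if c.cited_page != c.actual_page],
    }


# Hypothetical spot-check of four statements from one summary.
sample = [
    SpotCheck("ER visit for back pain", cited_page=14, actual_page=14),
    SpotCheck("MRI shows L4-L5 disc bulge", cited_page=212, actual_page=212),
    SpotCheck("PT discharged at patient request", cited_page=310, actual_page=305),
    SpotCheck("Prior lumbar strain in 2019", cited_page=None, actual_page=88),
]

stats = citation_stats(sample)
print(stats)
```

Keeping coverage and accuracy as separate numbers matters here: a tool can cite everything and still be wrong often, and the list of wrong citations is what you bring back to the vendor.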
Deduplication versus tagging
These get blurred together, and they’re not the same.
Deduplication means the tool recognizes that the same record appearing five times across a large production is one event, and collapses it.
Tagging means it labels pages by type (radiology, billing, intake) without necessarily reconciling the repeats.
For large files, dedup quality matters enormously. The hard case is the near-duplicate: the same visit note with one line added, or the same lab faxed twice with different headers.
Feed it a file you know contains both exact and near-duplicates. Does it catch the exact ones? The near ones? Does it ever over-merge two genuinely different events into one? That false collapse is its own accuracy risk.
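To make the three outcomes concrete, and to help you label pairs in your own test file before any vendor sees it, here is a minimal sketch using Python's standard `difflib`. It is not how any particular tool deduplicates. The `0.85` threshold, the `classify_pair` helper, and the sample notes are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

NEAR_THRESHOLD = 0.85  # assumed cutoff for illustration; tune it on your own files


def normalize(text):
    """Lowercase and collapse whitespace so formatting noise does not count as a difference."""
    return " ".join(text.lower().split())


def classify_pair(a, b, near_threshold=NEAR_THRESHOLD):
    """Label two page texts as 'exact', 'near', or 'distinct'."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return "exact"
    ratio = SequenceMatcher(None, na, nb).ratio()
    return "near" if ratio >= near_threshold else "distinct"


# Hypothetical page texts.
visit = "Patient seen for low back pain. Prescribed ibuprofen 600 mg. Follow up in two weeks."
refaxed = "PATIENT SEEN FOR LOW BACK PAIN.  Prescribed ibuprofen 600 mg.   Follow up in two weeks."
amended = visit + " Addendum: MRI ordered."
imaging = "MRI lumbar spine without contrast. Impression: mild disc bulge at L4-L5, no stenosis."

print(classify_pair(visit, refaxed))  # same note, different casing and spacing
print(classify_pair(visit, amended))  # same note with one line added
print(classify_pair(visit, imaging))  # a different event
```

Text similarity alone cannot tell a re-faxed copy from a templated follow-up note on a different date, which is one way over-merging happens. Treat a label like this as a starting point for your answer key, then confirm each pair by hand.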
Granularity and structure
The right question about granularity is whether you can control it. Can you get a high-level chronology and then drill down into a detailed entry, or are you stuck at one altitude?
A strong chronology sorts and filters by date, provider, facility, and document type.
Check whether the structured data is genuinely structured — fields you can export and query — or just tidy-looking prose.
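A quick way to test this is to load a vendor's export and try to sort and filter it yourself. The sketch below assumes a CSV export with one row per chronology entry. The column names and the sample rows are hypothetical, and a real export will differ.

```python
import csv
import io
from datetime import date

# Hypothetical export; column names and rows are assumptions for illustration.
EXPORT = """date,provider,facility,document_type,summary,source_page
2023-01-19,Dr. Lee,Valley Imaging,radiology,MRI lumbar spine,212
2023-01-04,Dr. Shah,County ER,emergency,ER visit for back pain,14
2023-02-02,J. Ortiz PT,Motion Rehab,therapy,Initial PT evaluation,305
"""


def load_entries(text):
    """Parse the export into rows with a real date and an integer page number."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["date"] = date.fromisoformat(row["date"])
        row["source_page"] = int(row["source_page"])
    return rows


entries = load_entries(EXPORT)

# If the data is genuinely structured, sorting and filtering take one line each.
chronological = sorted(entries, key=lambda r: r["date"])
radiology_only = [r for r in entries if r["document_type"] == "radiology"]

for row in chronological:
    print(row["date"], row["provider"], row["document_type"], row["source_page"])
```

If the dates fail to parse, or the provider and page number only exist inside a sentence of prose, the export is tidy-looking text rather than data you can query.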
Also check that it keeps the objective record separate from any inference the tool adds. Blending the two creates a defensibility problem.
Handling messy real-world input
This is where tools quietly fall apart, and where demos hide the ball. Run a genuinely ugly file: handwriting, poor scans, rotated pages, faxes of faxes.
Watch three things:
- How good is the text extraction on bad scans and handwriting?
- Does the tool flag low-confidence reads, or does it silently guess?
- When it hits a page it truly can’t read, does it tell you — or skip it as if the page never existed?
The silent skip is the one that hurts.
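One simple way to hunt for silent skips is page accounting: every page in the file should either be cited somewhere in the output or be explicitly flagged as unreadable. This is a minimal sketch under that assumption. The `unaccounted_pages` helper and the page numbers are hypothetical.

```python
def unaccounted_pages(total_pages, pages_cited, pages_flagged_unreadable):
    """Return pages the tool neither cited nor flagged, as candidates for a manual look."""
    all_pages = set(range(1, total_pages + 1))
    accounted = set(pages_cited) | set(pages_flagged_unreadable)
    return sorted(all_pages - accounted)


# Hypothetical 12-page file: the tool cited nine pages and flagged one as unreadable.
cited = {1, 2, 3, 5, 6, 8, 9, 10, 12}
flagged = {4}

to_review = unaccounted_pages(12, cited, flagged)
print(to_review)
```

An unaccounted page is not automatically a failure, since fax cover sheets and blank pages may legitimately go uncited. The point is to open each one and confirm that nothing readable and relevant was dropped.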
Relevance and real insight
A summary that includes everything is just a shorter pile.
Does the tool surface what matters for your purpose — causation, pre-existing conditions, treatment gaps, or inconsistencies between what the claimant reports and what the chart shows?
Several tools now advertise inconsistency detection and condition-progression insights. Test whether those flags are real and correct, not just confident.
A false flag is worse than no flag. Acting on a wrong one costs you more than missing a subtle one.
Workflow, editing, and control
These tools keep a human in the loop, so the editing experience is part of the product.
How easily can a reviewer fix an error? Does the fix improve future output, or just patch this one file?
Check three things: whether you can shape summaries to your house style or evaluation format, what formats you can export, and whether the citations survive the export intact.
Turnaround time matters too, but treat it as a tiebreaker. Fast and wrong helps no one.
Security and compliance
You’re handling protected health information, so confirm rather than assume.
Check HIPAA posture, SOC 2, where the vendor processes and stores the data, whether they train models on your records, and the retention and deletion terms.
This rarely separates the good tools from the bad on quality. But it can be a hard gate.
Who’s behind the tool
Everything above is about the product in front of you today. This last one is about who you’re betting on. In a category moving this fast, it matters more than people expect.
The tool you buy this year is not the tool you’ll use next year. Models change, features change, and the messy edge cases you hit in month three are the ones the vendor has to fix for you.
So the question isn’t only “how good is the output today.” It’s “who picks up the phone when it isn’t, and where is this product going?”
A few things to look at:
- Support. When something breaks, who responds, how fast, and do they understand claims work or just software? Ask to speak with a current customer about exactly that.
- Domain fluency. Does the team understand the medical-legal world you live in? Domain understanding separates a tool that fixes the right things from one that ships features nobody asked for.
- Roadmap. Ask directly: what are they building next, and does it line up with where your work is heading?
This sits below accuracy and defensibility on purpose. A great team can’t rescue a tool that hallucinates or can’t cite its sources, so the table-stakes tests come first.
But once two tools clear that bar — and a few will — the team, the support, and the trajectory are often what actually separate them. You’re not just buying output. You’re picking a partner for a problem that keeps changing.
Scoring the process you built
You’ve designed one evaluation, cleared the PHI, signed the paperwork, and put every vendor through the identical interview. The scoring has to be just as disciplined, or the effort is wasted.
Score every tool on the axes above, on the same files. Write one line on why for each score. Those notes are what you’ll defend the decision with later, to your team and to yourself.
Decide your weights before you start, not after. This is where settling on your use case up front pays off.
- Buying primarily for litigation defense? Let traceability and omission-accuracy dominate the scorecard.
- Buying for high-volume claims triage? Weight dedup, messy-input handling, and turnaround higher.
Set those weights up front, in writing. That way a flashy feature can’t win a decision it hasn’t earned, and no vendor can steer you toward the axes they happen to be good at.
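A scorecard like this can be as simple as a weighted average. The sketch below assumes a 1-to-5 score on each of the nine axes and weights expressed as points out of 100. The specific weights (a litigation-defense profile) and the vendor scores are invented for illustration.

```python
# Weights fixed in writing before any vendor is scored (hypothetical litigation-defense profile).
WEIGHTS = {
    "accuracy": 25,
    "traceability": 25,
    "deduplication": 10,
    "granularity": 5,
    "messy_input": 10,
    "insight": 5,
    "workflow": 5,
    "security": 10,
    "team": 5,
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of per-axis scores; refuses to run if any axis is missing or extra."""
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted axes")
    total_weight = sum(weights.values())
    return sum(scores[axis] * weights[axis] for axis in weights) / total_weight


# Hypothetical 1-to-5 scores from running two vendors through the same files.
vendor_a = {"accuracy": 5, "traceability": 4, "deduplication": 3, "granularity": 4,
            "messy_input": 3, "insight": 3, "workflow": 4, "security": 5, "team": 4}
vendor_b = {"accuracy": 3, "traceability": 3, "deduplication": 5, "granularity": 4,
            "messy_input": 5, "insight": 4, "workflow": 5, "security": 5, "team": 4}

print(round(weighted_score(vendor_a), 2))
print(round(weighted_score(vendor_b), 2))
```

With these made-up numbers the first vendor comes out ahead under a litigation-defense weighting, even though the second is stronger on dedup and messy input. Swap in a triage-oriented weighting and the order could flip, which is why the weights need to be settled before the scores are.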
And make sure your file set includes at least one known case — a record where you already know every fact, every duplicate, and every buried problem. Then see which tool reconstructs the truth you already hold.
That single comparison tells you more than any feature list, any demo, and any blog post — including this one.
The whole exercise is more work than a few quick demos. That’s the point. The vendors worth your time will move through it with you without flinching, and that alone tells you a great deal before you’ve tallied a single score.
Erick Enriquez
CEO & Co-Founder at InQuery