| Domain | Key Insight | Evidence from Document |
|---|---|---|
| Benchmark saturation | Top LLMs now routinely exceed 90% accuracy on medical exams, making traditional tests less useful for telling them apart. | “Newer top models are routinely exceeding 90% accuracy… we are quickly losing the ability to evaluate them effectively.” |
| Clinical reasoning gap | High exam scores ≠ real‑world reasoning; models misread context, anchor early, and fail to ask follow‑ups. | “Performing well on a test is not always the same as performing well in the real world… failing to ask the right follow-up questions.” |
| PrIME‑LLM metric | New scoring system evaluates five domains: differential diagnosis, testing, final diagnosis, management, and miscellaneous reasoning. | “PrIME‑LLM calculates a normalized polygonal area across five domains…” |
| Overall accuracy vs balanced reasoning | Raw accuracy 81–90%, but balanced PrIME‑LLM scores only 64–78%. | “Models ranged from 64% to 78%, showing more separation than the narrow 81–90% accuracy band.” |
| Differential diagnosis failure | Catastrophic weakness: failure rates above 80% in every one of the 21 models tested. | “Failure rates exceeded 80% across all 21 models tested. Not some of them, all of them.” |
| Premature closure | Models collapse onto one answer instead of holding multiple possibilities. | “LLMs appear to collapse prematurely onto a single answer rather than preserving uncertainty…” |
| Analogy: pizza vs alternatives | LLMs identify the obvious diagnosis but fail to generate alternatives. | “They detect tomato… and they say ‘pizza.’… but a skilled chef would also consider what else it could be.” |
| High‑stakes risk | Missing alternatives in medicine = missed cancer, delayed intervention. | “Anchoring… is not a lost round—it is a missed cancer, a delayed intervention…” |
| Baseline, not ceiling | Study used off‑the‑shelf models without tools, search, or structured reasoning. | “No real-time search… no structured reasoning workflows… this was a baseline evaluation.” |
| Reasoning‑optimised models | Reasoning models scored higher on average (mean 76% vs 67% for nonreasoning models), suggesting that reasoning optimisation matters. | “Reasoning models scored significantly higher… mean: 76%… nonreasoning models: 67%.” |
| Human vs AI performance | LLMs can outperform clinicians on structured cases, but have not been shown to improve clinicians' decisions. | “GPT‑4 outscored both attending physicians… but giving physicians access to an LLM… did not meaningfully improve their performance.” |
| Clinical deployment caution | Safe for low‑stakes tasks; unsafe for autonomous diagnostic reasoning. | “For lower-stakes… the case for adoption is reasonable… Autonomous diagnostic reasoning is a different proposition entirely.” |
| Core risk: false confidence | Models project confidence where uncertainty is required. | “These models may project confidence precisely where clinical reasoning demands uncertainty.” |
| Bottom line | Until prospective real‑world evidence exists, caution is essential. | “Until we have prospective data… we should continue exercising caution…” |
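
The PrIME‑LLM row above quotes "a normalized polygonal area across five domains" but does not give the formula. The sketch below is one plausible reading, not the study's actual method: it assumes each domain score lies in [0, 1], is plotted on one of five equally spaced radar-chart axes, and that the polygon area is divided by the area obtained when every score equals 1. The function name `normalized_polygon_area` and the example scores are hypothetical.

```python
import math


def normalized_polygon_area(scores):
    """Return the radar-chart polygon area for `scores`, scaled to [0, 1].

    Assumption (not from the source): each score lies in [0, 1] and is
    plotted on one of n equally spaced axes; the result is divided by the
    area of the polygon formed when every score equals 1.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")

    # The triangle between two neighbouring axes with radii a and b has
    # area 0.5 * a * b * sin(2*pi/n); sum over all neighbouring pairs.
    wedge = 0.5 * math.sin(2 * math.pi / n)
    area = wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))
    max_area = wedge * n  # polygon area when every score equals 1
    return area / max_area


# Hypothetical scores, for illustration only (not data from the study).
# Order: differential, testing, final diagnosis, management, misc reasoning.
hypothetical = [0.20, 0.95, 0.95, 0.95, 0.95]

mean_score = sum(hypothetical) / len(hypothetical)
balanced = normalized_polygon_area(hypothetical)
print(f"mean of domain scores:   {mean_score:.3f}")
print(f"normalized polygon area: {balanced:.4f}")
```

With these made-up inputs, the mean of the domain scores is 0.80 while the normalized area is roughly 0.62, which illustrates how an area-based score can penalise one weak domain (here, the differential) more heavily than a plain average does. Under this reading the result also depends on the order of the axes, so any real implementation would need to fix that order.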