Independent Evidence Review · Peer-Reviewed Research

Aidoc's AI tools: What independent research shows in real-world settings.

A structured summary of peer-reviewed studies evaluating the real-world clinical performance of Aidoc's AI triage tools. Intended for health system executives, clinical informaticists, and radiology leaders assessing the evidence base.

7 independent studies  ·  2021–2026  ·  Venues: NPJ Digital Medicine, Radiology, AJNR, AJR, Stroke: Vascular and Interventional Neurology, and PubMed-indexed journals
45.5%
Sensitivity for subacute intracranial bleeds
Emory · NPJ Digital Medicine, 2025
38.7%
Positive Predictive Value for cervical spine fractures
Lahey Hospital · AJNR, 2021
22%
False negative rate for large vessel occlusions in one cohort
Academic Medical Center · Stroke: Vascular and Interventional Neurology, 2025
2 of 2
Prospective controlled studies found no statistically significant improvement in radiologist diagnostic accuracy
UAB · Radiology 2023 & AJR 2024
99.3%
Specificity with AI, vs. 99.8% without it
UAB · AJR, 2024
Published Studies
Filter by clinical condition or view all. Each entry links directly to the peer-reviewed publication.
Intracranial Hemorrhage 2025 NPJ Digital Medicine
Real-World Performance Evaluation of a Commercial Deep Learning Model for Intracranial Hemorrhage Detection
Emory University · Published in NPJ Digital Medicine (Nature)
Key Findings: Sensitivity was notably lower for subacute bleeds (45.5%) and chronic bleeds (54.8%) than for acute presentations. Overall sensitivity in outpatient settings was 72.2%. The authors note this is lower than figures reported in controlled validation studies and attribute the gap in part to the more heterogeneous patient mix and imaging conditions in routine clinical practice.
Read the study
Intracranial Hemorrhage 2024 AJR Online
Prospective Evaluation of Artificial Intelligence Triage of Intracranial Hemorrhage on Noncontrast Head CT
University of Alabama at Birmingham (UAB) · Published in AJR Online
Key Findings: Radiologists using the AI showed no statistically significant difference in diagnostic accuracy compared to reading without it. Specificity was higher without the AI (99.8% vs. 99.3%), suggesting the AI added false positives without a measurable offsetting gain in sensitivity in this cohort; the worked arithmetic after this entry illustrates the scale of that tradeoff.
Read the study
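To make that specificity gap concrete, here is the back-of-envelope arithmetic (ours, not the study's), using the standard definition of specificity:

```latex
\text{false positives per } 1000 \text{ negative scans} = (1 - \text{specificity}) \times 1000
```

Without the AI: (1 − 0.998) × 1000 = 2. With the AI: (1 − 0.993) × 1000 = 7. The half-point drop therefore corresponds to roughly five additional false alarms per 1,000 hemorrhage-negative scans, each a candidate for unnecessary re-review.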
Pulmonary Embolism 2023 Radiology (RSNA)
Prospective Evaluation of AI Triage of Pulmonary Emboli on CT Pulmonary Angiograms
University of Alabama at Birmingham (UAB) · Published in Radiology
Key Findings: The study found no statistically significant improvement in radiologist accuracy, miss rate, or report turnaround time when the AI was used. The tool did reprioritize positive scans in the worklist, but this did not translate into a measurable difference in diagnostic outcomes in this prospective evaluation.
Read the study
Intracranial Hemorrhage 2026 PubMed
Head-to-Head Comparison of Two AI Triage Solutions for Detecting Intracranial Hemorrhage
Baylor / Texas Medical Center Network · Indexed in PubMed
Key Findings: The study documented false negative rates of approximately 6% and nearly 100 false positives within the study cohort. Error analysis found that motion artifacts and scanner hardware variations were common contributing factors, sources of variability that are routine in multi-vendor hospital environments.
Read the study
C-Spine Fractures 2021 AJNR
Diagnostic Accuracy and Failure Mode Analysis of a Deep Learning Algorithm for Cervical Spine Fracture Detection
Lahey Hospital & Medical Center · Published in American Journal of Neuroradiology
Key Findings: Sensitivity was 54.9% with a Positive Predictive Value (PPV) of 38.7%; a worked reading of that figure follows this entry. Failure mode analysis found that degenerative disc changes were a common source of false positive classifications. The authors noted the algorithm had received "little or no external validation" prior to clinical use, and called for more rigorous independent validation before deployment of similar tools.
Read the study
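For readers weighing that figure, the standard definition of positive predictive value makes the operational meaning explicit (our worked reading, not a calculation from the paper):

```latex
\mathrm{PPV} = \frac{TP}{TP + FP} = 38.7\%
\quad\Longrightarrow\quad
\frac{FP}{TP + FP} = 1 - 0.387 = 61.3\%
```

In other words, roughly six of every ten fracture alerts in this cohort were false alarms that a radiologist still had to adjudicate.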
Large Vessel Occlusion 2025 Stroke: Vascular and Interventional Neurology
A Retrospective Analysis Comparing AIDoc and RAPIDAI in the Detection of Large Vessel Occlusions
Academic Medical Center · Published in Stroke: Vascular and Interventional Neurology
Key Findings: The study reported a 22% false negative rate for Aidoc in large vessel occlusion detection within this cohort, meaning roughly 1 in 5 true occlusions were missed (the definition is sketched after this entry). The study concluded that neither major platform was superior and that both required extreme caution given the miss rates observed. The authors stopped short of recommending clinical reliance on either tool without further validation.
Read the study
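The false negative rate quoted above follows the standard definition; the conversion to a miss frequency is our arithmetic, not the study's:

```latex
\mathrm{FNR} = \frac{FN}{FN + TP} = 1 - \text{sensitivity} = 22\%
\quad\Longrightarrow\quad
\frac{1}{0.22} \approx 4.5 \text{, i.e. about one miss per 4--5 true occlusions}
```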
Pediatric ICH 2026 PubMed
Performance Evaluation of a Commercial Deep Learning Software for Detecting ICH in a Pediatric Population
Independent Researchers · Indexed in PubMed
Key Findings: The study found elevated false positive rates in pediatric patients, driven by normal anatomical features in children (choroid plexus calcifications and hyperdense venous sinuses) that the algorithm, trained predominantly on adult data, classified as hemorrhage. The authors conclude that population-specific validation is necessary before applying adult-trained AI tools to pediatric imaging.
Read the study

Operational & Patient Safety Risk Factors
Cross-cutting issues relevant to procurement, deployment governance, and clinical risk management — not limited to individual studies.
⚠️
Performance Gap: Benchmarks vs. Real-World Deployment
Performance figures in vendor materials are typically generated under controlled validation conditions. The independent studies reviewed here document meaningful sensitivity gaps in real-world settings — for example, 72.2% overall outpatient ICH sensitivity (Emory, 2025). This gap between validation-study performance and real-world deployment is a recognized challenge in clinical AI broadly. Health systems should require independent, site-specific validation before treating vendor benchmarks as predictive of their own environment.
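As a rough planning aid for that site-specific validation, the standard binomial sample-size formula gives a sense of scale. The 72% sensitivity figure is borrowed from the Emory study; the ±5-percentage-point precision target at 95% confidence is our illustrative assumption:

```latex
n = \frac{z^2\, p\,(1-p)}{E^2} = \frac{1.96^2 \times 0.72 \times 0.28}{0.05^2} \approx 310
```

On the order of 310 ground-truthed positive cases would be needed to pin down sensitivity that precisely, a nontrivial data-collection commitment for rarer findings.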
🔬
Trained on Adults, Deployed on Children
The pediatric ICH study found that the algorithm — trained predominantly on adult imaging data — produced frequent false positives in children due to normal pediatric anatomical features (choroid plexus calcifications, hyperdense venous sinuses) that differ from adult presentation. This raises a broader due-diligence question health systems should ask of any AI vendor: has the tool been validated on the specific patient populations you serve, including pediatric or other subgroups that may differ from the training cohort?
📉
Prospective Evidence: No Significant Improvement in Radiologist Accuracy
Two prospective UAB studies — one evaluating PE triage, one evaluating ICH triage — found no statistically significant improvement in radiologist diagnostic accuracy when using the AI compared to reading without it. In the ICH study, specificity was lower with the AI than without it. Prospective controlled study designs are generally considered the strongest basis for evaluating clinical impact; the absence of a detectable benefit in two such studies is a meaningful data point for procurement decisions.
🤖
Scanner Dependency and Artifact Sensitivity
Multiple studies found Aidoc's performance varied significantly by scanner manufacturer, imaging protocol, and the presence of artifacts — motion blur, hardware-related noise, and calcifications. Hospitals use diverse scanner fleets; a tool that performs well on one manufacturer's hardware and fails on another's is not reliably deployable at scale.
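One practical consequence for validation design: performance should be stratified by scanner fleet rather than pooled. A minimal sketch of that stratification, assuming a hypothetical validation log whose column names are illustrative rather than any vendor's export format:

```python
import pandas as pd

# Hypothetical validation log: one row per ground-truthed study.
df = pd.DataFrame({
    "scanner_vendor": ["A", "A", "B", "B", "B", "C"],
    "ai_flagged":     [True, False, True, True, False, True],
    "ground_truth":   [True, True, True, False, True, True],
})

# Sensitivity = share of confirmed-positive studies the AI flagged,
# computed per scanner vendor so fleet-specific failures are visible.
positives = df[df["ground_truth"]]
per_vendor_sensitivity = positives.groupby("scanner_vendor")["ai_flagged"].mean()
print(per_vendor_sensitivity)
```

A pooled sensitivity figure can look acceptable while one vendor's scanners account for most of the misses; stratifying exposes that.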
📊
Lack of External Validation Before Deployment
The Lahey cervical spine study explicitly flagged that many similar AI algorithms — including Aidoc's — received "little or no external validation" before being deployed clinically. FDA clearance requires demonstrating performance, but does not require external validation across diverse hospital populations or independent replication of performance claims.
🔄
Model Drift and the Monitoring Gap
AI models can degrade over time as patient populations, scanner firmware, or imaging protocols shift — a phenomenon called model drift. Post-deployment performance monitoring in clinical AI is not yet standardized across the industry. Health systems should ask vendors specifically how ongoing performance is tracked, what thresholds trigger re-validation, and what contractual obligations exist if deployed performance diverges materially from validated benchmarks.
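To make that ask concrete, here is a minimal sketch of the kind of rolling-sensitivity check a health system might request. The 200-case window and 80% threshold are illustrative assumptions, not an industry standard or any vendor's actual tooling:

```python
from collections import deque

class SensitivityMonitor:
    """Rolling sensitivity over radiologist-confirmed positive cases.

    Hypothetical sketch: flags when rolling sensitivity drops below
    a locally agreed re-validation threshold.
    """

    def __init__(self, window=200, threshold=0.80):
        self.results = deque(maxlen=window)  # True if the AI flagged a confirmed positive
        self.threshold = threshold

    def record(self, ai_flagged, confirmed_positive):
        # Only radiologist-confirmed positives count toward sensitivity.
        if confirmed_positive:
            self.results.append(ai_flagged)

    def rolling_sensitivity(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_revalidation(self):
        s = self.rolling_sensitivity()
        return s is not None and s < self.threshold

# Feed in ground-truthed cases as final reports are signed off.
monitor = SensitivityMonitor()
monitor.record(ai_flagged=True, confirmed_positive=True)
monitor.record(ai_flagged=False, confirmed_positive=True)  # a miss
if monitor.needs_revalidation():
    print("Rolling sensitivity below threshold: trigger re-validation review")
```

Whatever the exact mechanism, the point is contractual: degradation should be detectable by the health system, not only by the vendor.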

Sources
All studies are peer-reviewed and published in indexed medical journals.