Oncology Fellows
We consider artificial intelligence hype vs reality, focusing on predictive modeling to answer medical questions about diagnosis, treatment choice, etc.
For years, society has been promised that the world would be changed by artificial intelligence (AI), with the medical profession no less susceptible to this hype. A sobering reminder of this optimism is a favorite quote from an article in The New England Journal of Medicine:
“After hearing for several decades that computers will soon be able to assist with difficult diagnoses, the practicing physician may well wonder why the revolution has not occurred…”
That article was published in 1987.1
Recent advancements with large language models and generative AI, especially with the popularity of ChatGPT, have amplified expectations of how AI will transform our lives. However, the practicing oncologist must still wonder how the revolution might affect them because its mainstream clinical presence is currently subtle at best. Here we will consider AI hype vs reality, with a specific focus on predictive modeling meant to answer medical questions regarding diagnosis, detection, prognosis, treatment choice, etc.
Predictive modeling using AI is the modern correlate of evidence-based medicine scoring algorithms that were rooted in expert opinion and built with traditional statistics. Perhaps surprisingly, these simple calculators, most of which use straightforward point-assignment systems, remain the standard of care today: the Wells criteria for pulmonary embolism (PE) risk, CHA2DS2-VASc for atrial fibrillation stroke risk, and others are taught at all levels of medical education.
Consider Phillip Steven Wells, MD, MSc, and his eponymous risk assessment tool. What is now standard use throughout medicine began in 1995 with a pilot study of 91 patients, for which the investigators reported “reviewing the literature and coming to a consensus on a scoring system.”2 Wells used expert opinion and literature review to identify factors that would split patients into low, moderate, and high pretest probability for PE and then tailored a scoring system that replicated real-world risk stratification within acceptable error margins.
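To make the point-assignment idea concrete, the sketch below sums expert-assigned points and maps the total to a pretest-probability tier. The point values and three-tier cutoffs follow the commonly taught version of the Wells PE score; the function names are illustrative and are not drawn from the original studies.

```python
# Minimal sketch of a point-assignment risk calculator in the spirit of the
# Wells criteria for PE. Point values follow the commonly taught version;
# function names are illustrative.

def wells_pe_score(signs_of_dvt: bool,
                   pe_most_likely_dx: bool,
                   heart_rate_over_100: bool,
                   recent_immobilization_or_surgery: bool,
                   prior_dvt_or_pe: bool,
                   hemoptysis: bool,
                   active_malignancy: bool) -> float:
    """Sum expert-assigned points for each clinical factor."""
    score = 0.0
    score += 3.0 if signs_of_dvt else 0.0
    score += 3.0 if pe_most_likely_dx else 0.0
    score += 1.5 if heart_rate_over_100 else 0.0
    score += 1.5 if recent_immobilization_or_surgery else 0.0
    score += 1.5 if prior_dvt_or_pe else 0.0
    score += 1.0 if hemoptysis else 0.0
    score += 1.0 if active_malignancy else 0.0
    return score

def pretest_probability(score: float) -> str:
    """Three-tier stratification: <2 low, 2-6 moderate, >6 high."""
    if score < 2:
        return "low"
    if score <= 6:
        return "moderate"
    return "high"

# Example: tachycardic patient with active cancer and recent surgery
score = wells_pe_score(False, False, True, True, False, False, True)
print(score, pretest_probability(score))  # 4.0 moderate
```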
The model was applied to a larger cohort of 529 patients in 19983 and again in 2001 to 1239 patients in the seminal article that cemented the importance of the Wells criteria to medical practice.4 Expectedly, there have been many updates to the Wells criteria, including AI-powered approaches. Modeling electronic health record (EHR)–derived features yielded an area under the receiver operating characteristic curve (AUROC) of 0.71 for PE detection in one study,5 and modeling 12-lead electrocardiograms (ECGs) alone to detect PE achieved an AUROC of 0.75 in another.6 A fusion model of EHR features, ECG waveforms, and natural language processing of clinical notes among 21,183 patients was the most impressive of all, with an AUROC of 0.84 in a third study.7 And yet the Wells criteria system still reigns supreme, despite scoring a decidedly lower AUROC in each of these models’ cohorts.
This is not a mistake, and understanding why the Wells criteria system is still the best choice for the average emergency department reveals the bollards holding back a surge of predictive medical models. The Wells criteria system is trusted because it leverages domain knowledge (its scoring is based on expert opinion), is easy to implement, has been externally validated numerous times, and accomplishes a single goal with a negative predictive value of 99.5% in appropriately selected patients.4
These characteristics combine to make it very attractive to boots-on-the-ground clinicians. The highest-scoring AI alternative discussed here used more than 35 independent EHR features in a large model, removing much of the domain knowledge that is captured with pointed feature selection prior to modeling and thereby making the model cumbersome to implement or externally validate.7
A similar story unfolds with CHA2DS2-VASc: several newer AI-powered models significantly outperform it head-to-head but lack external validation, information on model calibration, and other necessary measures of robustness, and they ultimately fall by the wayside.8
Poor reproducibility is the single greatest barrier to the adoption of AI predictive modeling in practice, and it pervades the field. A review that evaluated 86 radiologic diagnostic models found that 70 of them had decreased performance when applied to external data, and 21 produced significantly incongruent results.9 In contrast, CHA2DS2-VASc has at least 147 validation studies published to date.8 Creating standards by which to judge AI medical models is a contemporary challenge without clear guidance.
The FDA now regulates AI applications in medicine under its software as a medical device (SaMD) classification and has released an action plan addressing specific AI concerns.10 Its registry details more than 690 AI/machine learning (ML)–powered medical devices that have received 510(k) clearance, been granted de novo request status, or been given premarket approval. The rising wave of medical AI applications is clear, with a projected 30% increase in approved devices over the preceding year. Most of these devices are computer vision applications in radiology (79%), followed by various uses in cardiology (5%) and other subspecialties to a lesser degree. Only 15 devices have been approved in hematology, mostly AI-powered lab equipment such as cell component analyzers, and no devices have been approved for oncology (although there are oncology-specific radiology models in the database). Of note, and the reason we mainly consider predictive AI models here instead of ChatGPT-like applications, is that as of late 2023 the FDA had not approved any device that leverages generative AI or large language models.
For those in favor of augmenting clinical practice using AI, the year-to-year FDA growth in AI/ML SaMD approvals may be encouraging, but very few of those solutions are directly involved with predictive modeling as dreamt of by our colleague in 1987. To reach the relevance of the Wells criteria, a predictive model would likely need to satisfy all of the following:
The demographics of the training cohort should be representative of the population in the intended use scenario, with special efforts to include minorities and marginalized groups.
The modeling algorithm selected should be the lowest complexity possible that is still able to achieve the desired outcome (don’t use deep learning if random forest would suffice).
While some of the features included in the model may be novel, other clinically relevant features guided by expert consensus should be included to infuse the model with domain knowledge.
The model must be evaluated with rigorous quality-control procedures such as k-fold cross-validation, bootstrapping, variance estimation, and other calibration analyses specific to the modeling strategy (several of these steps are sketched in code after this list).
Iterative feature reduction should be done to decrease model complexity and ease adoption, and the remaining features should undergo importance weighting, such as with Shapley values, so that the end clinician understands why a prediction was made.
External validation must be completed at the time of publication.
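As a rough illustration of several items on this list, and not a reproduction of any published pipeline, the sketch below runs k-fold cross-validated AUROC, a calibration check, and feature-importance weighting on a synthetic cohort. The logistic regression model and permutation importance are stand-ins; Shapley values (for example, via the shap package) would be the more typical choice for the importance step.

```python
# Minimal sketch of the evaluation steps described above, on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
from sklearn.inspection import permutation_importance

# Synthetic cohort: 1000 "patients", 8 candidate features
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=0)

# Low-complexity model first (per the list: don't reach for deep learning early)
model = LogisticRegression(max_iter=1000)

# k-fold cross-validated discrimination (AUROC)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aurocs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUROC: {aurocs.mean():.2f} +/- {aurocs.std():.2f}")

# Hold-out split for calibration and importance estimates
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model.fit(X_tr, y_tr)

# Calibration: observed event rates across predicted-probability bins
prob_true, prob_pred = calibration_curve(y_te, model.predict_proba(X_te)[:, 1],
                                         n_bins=10)
print("observed event rate per bin:", np.round(prob_true, 2))

# Importance weighting so the end clinician can see what drives predictions
imp = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                             n_repeats=20, random_state=0)
for i in np.argsort(imp.importances_mean)[::-1][:3]:
    print(f"feature {i}: importance {imp.importances_mean[i]:.3f}")
```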
It may seem unfair to compare burgeoning AI models to CHA2DS2-VASc with its 147 validation studies. It is also true that, given a good-quality dataset, the most well-intentioned data scientist may accidentally create a prediction model that is clinically irrelevant but seemingly impressive, with a high AUROC due to overfitting, high-variance hyperparameter tuning, and recursive training in which only the incidentally highest-scoring model is selected. External validation is the only true safeguard against these pitfalls, without which a predictive model hasn't even entered the race against traditional evidence-based medicine scoring systems.
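The following toy demonstration (synthetic data, not from any cited study) shows the "keep the incidentally highest-scoring model" pitfall: many hyperparameter settings are tried against the same small internal test set, and the winner's AUROC then shrinks on an external held-out cohort.

```python
# Toy demonstration of optimistic internal AUROC from repeated model selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Weakly informative, noisy data so overfitting is easy to produce
X, y = make_classification(n_samples=600, n_features=30, n_informative=2,
                           flip_y=0.4, random_state=0)
X_dev, X_external, y_dev, y_external = train_test_split(X, y, test_size=0.5,
                                                        random_state=0)
X_train, X_internal, y_train, y_internal = train_test_split(X_dev, y_dev,
                                                            test_size=0.3,
                                                            random_state=0)

# "Recursive training": try many hyperparameter settings and keep whichever
# happens to score highest on the same small internal test set
best_auroc, best_model = -1.0, None
rng = np.random.RandomState(0)
for _ in range(50):
    model = RandomForestClassifier(n_estimators=rng.randint(10, 200),
                                   max_depth=rng.randint(2, 20),
                                   random_state=rng.randint(10_000))
    model.fit(X_train, y_train)
    auroc = roc_auc_score(y_internal, model.predict_proba(X_internal)[:, 1])
    if auroc > best_auroc:
        best_auroc, best_model = auroc, model

print(f"internal AUROC of the selected model: {best_auroc:.2f}")
ext = roc_auc_score(y_external, best_model.predict_proba(X_external)[:, 1])
print(f"external AUROC of the same model:     {ext:.2f}")  # typically lower
```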
The current era in medicine will almost certainly see the big promises of AI come to fruition, given the recent advancements in AI and the depth of good-quality data available in health care. Not far in the future, we will see living databases that undergo nightly self-updates from EHR data streams and allow continuous retraining of models that combine clinical features, radiology, pathology, and more into true multimodal predictive machines. Getting there will require models that not only outperform their counterparts but also demonstrate the maturity, rigor, and reproducibility expected in medicine.
Jacob Shreve, MD, MS, is a senior oncology fellow at Mayo Clinic in Rochester, Minnesota.