We consider artificial intelligence hype vs reality, focusing on predictive modeling to answer medical questions about diagnosis, treatment choice, etc.
For years, society has been promised that the world would be changed by artificial intelligence (AI), with the medical profession no less susceptible to this hype. A sobering reminder of this optimism is a favorite quote from an article in The New England Journal of Medicine:
“After hearing for several decades that computers will soon be able to assist with difficult diagnoses, the practicing physician may well wonder why the revolution has not occurred…”
That article was published in 1987.1
Recent advancements with large language models and generative AI, especially with the popularity of ChatGPT, have amplified expectations of how AI will transform our lives. However, the practicing oncologist must still wonder how the revolution might affect them because its mainstream clinical presence is currently subtle at best. Here we will consider AI hype vs reality, with a specific focus on predictive modeling meant to answer medical questions regarding diagnosis, detection, prognosis, treatment choice, etc.
Predictive modeling using AI is the modern correlate of evidence-based medicine scoring algorithms that were rooted in expert opinion and accomplished through traditional statistics. Perhaps surprisingly, these simple calculators, the majority of which use additive point-assignment systems, are still the standard of care today, with the Wells criteria for pulmonary embolism (PE) risk, CHA2DS2-VASc for atrial fibrillation stroke risk, and others taught at all levels of medical education.
Consider Philip Steven Wells, MD, MSc, and his eponymous risk assessment tool. What is now in standard use throughout medicine began in 1995 by “reviewing the literature and coming to a consensus on a scoring system” for a pilot study of 91 patients.2 He used expert opinion and literature review to identify factors that would split patients into low, moderate, and high pretest probability for PE and then tailored a scoring system that would replicate real-world risk stratification with acceptable error margins.
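To make the appeal of such simple calculators concrete, below is a minimal sketch of an additive point-assignment rule in the style of the Wells criteria. The point values and cutoffs reflect the commonly taught version of the score but are shown here purely for illustration, not as clinical guidance.

```python
# Minimal sketch of an additive point-assignment rule in the style of the
# Wells criteria for PE. Point values and cutoffs reflect the commonly taught
# version of the score; illustrative only, not clinical guidance.

WELLS_POINTS = {
    "clinical_signs_of_dvt": 3.0,
    "pe_most_likely_diagnosis": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}

def wells_score(findings: dict[str, bool]) -> float:
    """Sum the points for each finding that is present."""
    return sum(points for item, points in WELLS_POINTS.items()
               if findings.get(item, False))

def pretest_probability(score: float) -> str:
    """Map the total score to the classic three-tier risk stratification."""
    if score < 2:
        return "low"
    if score <= 6:
        return "moderate"
    return "high"

if __name__ == "__main__":
    patient = {"heart_rate_over_100": True, "malignancy": True}
    score = wells_score(patient)
    print(score, pretest_probability(score))  # 2.5 moderate
```

The entire rule fits in a handful of lines, which is precisely what makes it easy to memorize, implement at the bedside, and validate externally.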
The Wells model was applied to a larger cohort of 529 patients in 1998,3 and again in 2001 to 1239 patients for the seminal article that cemented the importance of the Wells criteria to medical practice.4 Expectedly, there have been many updates to the Wells criteria, including AI-powered approaches. Modeling electronic health record (EHR)–derived features produced an area under the receiver operating characteristic curve (AUROC) of 0.71 for PE detection in one study,5 and modeling 12-lead electrocardiograms (ECGs) alone to detect PE achieved an AUROC of 0.75 in another.6 A fusion model of EHR features, ECG waveforms, and natural language processing of clinical notes among 21,183 patients was the most impressive of all, with an AUROC of 0.84 in a third study.7 And yet the Wells criteria system still reigns supreme despite scoring a decidedly lower AUROC in each of these models' cohorts.
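AUROC is the common yardstick in these head-to-head comparisons. As a point of reference only, here is a minimal sketch of how two models' discrimination might be compared on the same cohort, using scikit-learn with synthetic labels and scores that stand in for real patient data.

```python
# Sketch: comparing the discrimination of two risk models on the same cohort
# using AUROC. Labels and predicted probabilities are synthetic stand-ins.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)  # 1 = PE present, 0 = PE absent

# Pretend predicted probabilities from two different models on the same patients.
scores_model_a = np.clip(y_true * 0.30 + rng.normal(0.40, 0.20, 1000), 0, 1)
scores_model_b = np.clip(y_true * 0.15 + rng.normal(0.45, 0.25, 1000), 0, 1)

print("Model A AUROC:", round(roc_auc_score(y_true, scores_model_a), 3))
print("Model B AUROC:", round(roc_auc_score(y_true, scores_model_b), 3))
```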
That the Wells criteria system still prevails is not a mistake, and understanding why it remains the best choice for the average emergency department reveals the bollards holding back a surge of predictive medical models. The Wells criteria system is trusted because it leverages domain knowledge given that its scoring system is based on expert opinion, is easy to implement, has been externally validated numerous times, and accomplishes a single goal with a negative predictive value of 99.5% in appropriately selected patients.4
These characteristics combine to make it very attractive to boots-on-the-ground clinicians. The highest-scoring AI alternative discussed here used more than 35 independent EHR features in a large model, removing much of the domain knowledge that is captured with pointed feature selection prior to modeling and thereby making the model cumbersome to implement or externally validate.7
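The negative predictive value cited above is itself a simple ratio from the confusion matrix; a quick sketch with made-up counts shows the arithmetic behind a figure like 99.5%.

```python
# Negative predictive value = TN / (TN + FN): of all patients the rule calls
# negative, the fraction who truly do not have a PE. Counts below are made up.
true_negatives = 995
false_negatives = 5

npv = true_negatives / (true_negatives + false_negatives)
print(f"NPV = {npv:.1%}")  # 99.5%
```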
A similar story unfolds for CHA2DS2-VASc: several newer, AI-powered models significantly outperform it head-to-head but lack external validation, information on model calibration, and other necessary measures of robustness, and so ultimately fall by the wayside.8
Poor reproducibility is the single greatest barrier to the adoption of AI predictive modeling in practice, yet it pervades the field. A review that evaluated 86 radiologic diagnostic models found that 70 of them had decreased performance when applied to external data, 21 of which produced significantly incongruent results.9 In contrast, CHA2DS2-VASc has at least 147 validation studies published to date.8 Creating standards by which to judge AI medical models is a contemporary challenge without clear guidance.
The FDA now regulates AI applications in medicine under the classification of software as a medical device (SaMD) and has released an action plan addressing specific AI concerns.10 Its registry details over 690 AI/machine learning (ML)–powered medical devices that have received 510(k) clearance, been granted de novo request status, or been given premarket approval. The rising wave of medical AI applications is clear, with a projected 30% increase in approved devices over the preceding year. Most of these devices are computer vision applications in radiology (79%), followed by various uses in cardiology (5%) and other subspecialties to a lesser degree. Only 15 devices have been approved in hematology, mostly AI-powered lab equipment like cell component analyzers, and no devices have been approved for oncology (although there are oncology-specific radiology models in the database). Of note, and the reason we mainly consider predictive AI models here instead of ChatGPT-like applications, is that as of late 2023 the FDA had not approved any device that leverages generative AI or large language models.
For those in favor of augmenting clinical practice using AI, the year-to-year FDA growth in AI/ML SaMD approvals may be encouraging, but very few of those solutions are directly involved with predictive modeling as dreamt of by our colleague in 1987. To reach the relevance of the Wells criteria, a predictive model would likely need to satisfy all of the following:
The demographics of the training cohort should be representative of the intended use–scenario population, with special efforts to include minorities and marginalized groups.
The modeling algorithm selected should be the lowest complexity possible that is still able to achieve the desired outcome (don't use deep learning if a random forest would suffice).
While some of the features included in the model may be novel, other clinically relevant features guided by expert consensus should be included to infuse the model with domain knowledge.
The model must be evaluated with rigorous quality control metrics such as k-fold cross validation, bootstrapping, variance estimation, and other calibration analyses specific to the modeling strategy (see the sketch after this list).
Iterative feature reduction should be done to decrease model complexity and ease adoption, and the remaining features should undergo importance weighting, such as through Shapley statistics, so that the clinician using the model understands why a prediction was made.
External validation must be completed at the time of publication.
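To ground the quality-control and interpretability items above, the following is a minimal sketch assuming a scikit-learn workflow on a synthetic dataset. It pairs k-fold cross-validation with a simple calibration check and uses permutation importance as a stand-in for the Shapley-style explanations named in the list; the data, model choice, and settings are all assumptions for illustration.

```python
# Sketch of the quality-control steps listed above, on synthetic data:
# k-fold cross-validated AUROC, a calibration check, and permutation-based
# feature importance (a stand-in for Shapley-style explanations).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=2000, n_features=8, n_informative=4, random_state=0)
model = LogisticRegression(max_iter=1000)

# 1) k-fold cross-validation: discrimination estimated across folds, not one lucky split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aurocs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Cross-validated AUROC: %.3f ± %.3f" % (aurocs.mean(), aurocs.std()))

# 2) Calibration: do predicted probabilities match observed event rates?
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
prob_true, prob_pred = calibration_curve(y_test, model.predict_proba(X_test)[:, 1], n_bins=10)
for obs, pred in zip(prob_true, prob_pred):
    print(f"observed {obs:.2f} vs predicted {pred:.2f}")

# 3) Feature importance: permutation importance as a simple, model-agnostic
#    way to show which features drive predictions.
importance = permutation_importance(model, X_test, y_test, scoring="roc_auc", random_state=0)
ranked = np.argsort(importance.importances_mean)[::-1]
print("Top features by permutation importance:", ranked[:3])
```

Bootstrapped confidence intervals and formal calibration statistics would follow the same pattern; the point is that each item in the list maps to a concrete, reportable analysis.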
It may seem unfair to compare burgeoning AI models to CHA2DS2-VASc with its 147 validation studies. It is also true that, given a good-quality dataset, the most well-intentioned data scientist may accidentally create a prediction model that is in fact irrelevant yet seemingly impressive, with a high AUROC produced by overfitting, high-variance hyperparameter tuning, and recursive training in which only the incidentally highest-scoring model is selected. External validation is the only true safeguard against these pitfalls, without which a predictive model has not even entered the race against traditional evidence-based medicine scoring systems.
The current era in medicine will almost certainly see the big promises of AI come to fruition, given the recent advancements in AI and the depth of good-quality data available in health care. Not far in the future, we will see living databases that undergo nightly self-updates from EHR data streams and allow continuous retraining of models that combine clinical features, radiology, pathology, and more into true multimodal predictive machines. Getting there will require models that not only outperform their counterparts but also demonstrate the maturity, rigor, and reproducibility that is expected in medicine.
Jacob Shreve, MD, MS, is a senior oncology fellow at Mayo Clinic in Rochester, Minnesota.