Can AI really revolutionize healthcare? Systematic review reveals hidden gaps in patient benefits and barriers to meaningful clinical integration.
In a recent study published in The Lancet Regional Health – Europe, a group of researchers evaluated the benefits and harms of artificial intelligence (AI)-related algorithmic decision-making (ADM) systems used by healthcare professionals, compared with standard care, focusing on outcomes relevant to patients.
Background
Advances in AI have enabled systems to outperform medical experts in tasks such as diagnosis, personalized medicine, patient monitoring and drug development. Despite this progress, it remains unclear whether improved diagnostic accuracy and performance measures translate into tangible benefits for patients, such as reduced mortality or morbidity.
Current research often prioritizes analytical performance over clinical outcomes, and many AI-based medical devices are approved without appropriate evidence from randomized controlled trials (RCTs).
Additionally, the lack of transparency and standardized assessments of harm associated with these technologies raises ethical and practical concerns. This highlights a critical gap in AI research and development, requiring further evaluations focused on patient-relevant outcomes to ensure meaningful and safe integration into healthcare.
About the study
This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to ensure methodological rigor. Searches were performed in the Medical Literature Analysis and Retrieval System Online (MEDLINE), the Excerpta Medica Database (EMBASE), PubMed, and the Institute of Electrical and Electronics Engineers (IEEE) Xplore, covering the 10-year period up to March 27, 2024, during which AI-related ADM systems became relevant in healthcare studies. The search included terms related to AI, machine learning (ML), decision-making algorithms, healthcare professionals, and patient outcomes.
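For readers unfamiliar with how such database searches are typically structured, the sketch below shows a purely illustrative Boolean query combining the concept groups the authors describe (AI/ML terms, decision-making algorithms, healthcare professionals, and patient outcomes). The specific terms and syntax are assumptions for illustration, not the review's actual search string.

```python
# Illustrative only: composes a PubMed-style Boolean query from the concept
# groups described in the review's search strategy. The terms below are
# assumptions, not the authors' actual search string.

concept_groups = {
    "ai": ['"artificial intelligence"', '"machine learning"', '"deep learning"'],
    "decision_making": ['"decision support"', '"algorithmic decision-making"', "algorithm*"],
    "users": ['"health personnel"', "clinician*", "physician*", "nurse*"],
    "outcomes": ["mortality", "morbidity", '"length of stay"', "readmission*", '"quality of life"'],
}

def build_query(groups: dict[str, list[str]]) -> str:
    """Join synonyms with OR within each concept group, then AND the groups together."""
    blocks = ["(" + " OR ".join(terms) + ")" for terms in groups.values()]
    return " AND ".join(blocks)

if __name__ == "__main__":
    print(build_query(concept_groups))
```

In a real systematic review, each database would receive its own adapted syntax and controlled-vocabulary terms; the point here is only the OR-within-concept, AND-between-concepts structure.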
Eligible studies included interventional or observational designs involving AI decision support systems developed with or using ML. Studies were required to report relevant patient outcomes, such as mortality, morbidity, length of hospital stay, readmission, or health-related quality of life. Exclusion criteria included studies without pre-registration, studies lacking a standard-of-care control, and studies focusing on robotics or other systems unrelated to AI-based decision-making. The protocol for this review was pre-registered on the International Prospective Register of Systematic Reviews (PROSPERO), with all modifications documented.
Reviewers screened titles, abstracts, and full texts using predefined criteria. Data extraction and quality assessment were performed independently using standardized forms. Risk of bias was assessed using the Cochrane Risk of Bias 2 (RoB 2) tool and the Risk of Bias in Non-Randomized Studies of Interventions (ROBINS-I) tool to address potential confounding, while reporting transparency was assessed using the Consolidated Standards of Reporting Trials – Artificial Intelligence extension (CONSORT-AI) and the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis – Artificial Intelligence (TRIPOD-AI) guidelines.
Data extracted included study parameters, design, intervention and comparator details, patient and professional demographics, algorithm characteristics, and outcome measures. Studies were also categorized by AI system type, clinical area, prediction goals, and regulatory and funding information. The analysis also examined whether the unique contributions of AI systems to outcomes were isolated and validated.
Study results
The systematic review included 19 studies (18 RCTs and one prospective cohort study), selected after screening 3,000 records. These studies were conducted in various regions, including nine in the United States, four in Europe, and three in China, with the remainder conducted elsewhere. Settings included 14 hospital-based studies, three in outpatient clinics, one in a nursing home, and one in a mixed environment.
The studies covered a range of medical specialties, including oncology (four studies), psychiatry (three studies), internal hospital medicine, neurology, and anesthesiology (two studies each), as well as single studies in diabetology, pulmonology, intensive care, and other specialties.
The median number of participants in the studies was 243, with a median age of 59.3 years. Female representation averaged 50.5% and racial or ethnic composition was reported in 10 studies, with a median of 71.4% White participants. Twelve studies described intended healthcare professional users, such as charge nurses or primary care providers, and nine detailed training protocols, ranging from brief introductions to the platform to multi-day supervised sessions.
AI systems varied in type and function: seven studies used systems for real-time monitoring and predictive alerts, six employed treatment-personalization systems, and four integrated multiple functionalities. Examples included algorithms for glycemic control in diabetes, personalized psychiatric care, and venous thromboembolism monitoring. Development data sources ranged from large in-house datasets to pooled multi-institutional data, and the ML models applied included gradient boosting, neural networks, Bayesian classifiers, and regression-based models. Despite these developments, external validation of the algorithms was limited in most studies, raising concerns about their generalizability to broader patient populations.
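To make these model categories concrete, the following is a minimal, hypothetical sketch of the kind of gradient-boosting risk model that predictive alerting systems are often built on, using scikit-learn and synthetic data. The features, threshold, and alerting logic are illustrative assumptions, not a description of any of the reviewed systems.

```python
# Minimal illustrative sketch (not any reviewed system): a gradient-boosting
# classifier trained on synthetic tabular data, with a probability threshold
# that would trigger a predictive alert for a clinician to review.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tabular clinical features (labs, vitals, demographics).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
risk = model.predict_proba(X_test)[:, 1]

print("AUROC on held-out data:", round(roc_auc_score(y_test, risk), 3))

ALERT_THRESHOLD = 0.3  # assumed cut-off; in practice chosen to balance sensitivity against alert fatigue
alerts = risk >= ALERT_THRESHOLD
print(f"{alerts.sum()} of {len(alerts)} test patients would trigger an alert")
```

As the review emphasizes, strong discrimination on an internal test set like this says nothing about external validity or actual patient benefit, which is precisely the evidence gap the authors highlight.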
Risk of bias was assessed as low in four RCTs, moderate in seven, and high in seven others, while the cohort study demonstrated a serious risk of bias. Compliance with the CONSORT-AI and TRIPOD-AI guidelines was variable: three studies achieved full compliance, while adherence in the remainder ranged from high to low. Most studies conducted before the introduction of these guidelines showed moderate adherence, although explicit references to the guidelines were rare.
The results highlighted a mixture of benefits and harms. Twelve studies reported significant benefits for patients, including reduced mortality, better management of depression and pain, and improved quality of life. However, only eight studies included standardized assessments of harm, and most failed to comprehensively document adverse events. Although six AI systems had received regulatory approval, associations between regulatory status, study quality, and patient outcomes were inconclusive.
Conclusions
This systematic review highlights the paucity of high-quality studies evaluating patient-relevant outcomes of AI-related ADM systems in healthcare. While psychiatry studies consistently showed benefit, other areas showed mixed results, with limited evidence for improvements in mortality, anxiety, or length of hospital stay. Most studies lacked balanced assessments of benefits and harms and failed to isolate the unique contribution of the AI system itself.
The findings highlight the urgent need for transparent reporting, robust validation practices, and standardized frameworks to guide the safe and effective integration of AI into clinical settings.