Photo credit: Ran Dahan/TPS
As the integration of artificial intelligence into medicine advances, there is growing interest in using AI models to interpret complex medical information, going beyond AI's traditional medical applications.
Current medical AI applications focus largely on task automation and pattern recognition: chatbots that answer patient queries, algorithms that predict disease, synthetic data generation for privacy protection, and educational tools for medical students.
But despite these advances, interpreting medical information demands a higher level of understanding and the ability to distinguish complex medical concepts, and it carries life-and-death consequences.
A study recently published in the peer-reviewed journal Computers in Biology and Medicine by researchers at Ben-Gurion University of the Negev sheds new light on the performance of AI models in deciphering medical data, revealing both their potential and their significant limitations.
Doctoral student Ofir Ben Shoham and Dr. Nadav Rappaport of the university’s Department of Software and Information Systems Engineering conducted a study to evaluate how effectively AI models understand medical concepts. They developed a dedicated assessment tool called “MedConceptsQA”, which includes over 800,000 questions spanning different levels of complexity. The tool was designed to evaluate the ability of models to interpret medical codes and concepts, such as diagnoses, procedures, and medications.
*MedConceptsQA* questions are categorized into three difficulty levels: easy, requiring basic medical knowledge; medium, requiring a moderate understanding of medical concepts; and difficult, testing the ability to discern nuanced differences between closely related medical terms.
The results were surprising. Most AI models, including those specifically trained on medical datasets, performed poorly, often no better than random guessing. However, some general-purpose models, such as ChatGPT-4, outperformed the rest, achieving an accuracy rate of around 60%. Although better than chance, this performance still falls short of the precision required for critical medical decisions.
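To put "no better than random guesses" in perspective: the chance-level baseline depends on how many answer choices each question offers. Assuming a four-option multiple-choice format (an assumption; the article does not state the number of choices), a minimal simulation shows what guessing looks like next to the roughly 60% reported for the strongest general-purpose models:

```python
import random

def random_guess_accuracy(n_questions: int, n_choices: int = 4, seed: int = 0) -> float:
    """Simulate a model that answers n-way multiple-choice questions
    by guessing uniformly at random, and return its accuracy."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        answer = rng.randrange(n_choices)  # hypothetical correct option
        guess = rng.randrange(n_choices)   # the "model's" random pick
        correct += (guess == answer)
    return correct / n_questions

baseline = random_guess_accuracy(100_000)
print(f"Random-guess baseline (4 choices): {baseline:.1%}")  # close to 25%
print("Reported accuracy of top general-purpose models: ~60%")
```

On a four-choice benchmark, random guessing converges to about 25% accuracy, so a score near that level indicates the model is extracting essentially no usable signal from the medical codes, while 60% is well above chance yet still far from clinically acceptable.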
“It often appears that models specifically trained for medical purposes achieve accuracy levels close to random estimates. Even specialized training in medical data does not necessarily translate into superior performance in interpreting medical codes,” Rappaport said.
Interestingly, general-purpose AI models like ChatGPT-4 and Llama 3-70B outperformed specialized clinical models, such as OpenBioLLM-70B, by 9-11%. This highlighted both the limitations of current clinical models and the adaptability of general-purpose models, despite their lack of medical focus, the researchers said.
The study shows that AI models need more specialized training on diverse, high-quality clinical data to better understand medical codes and concepts. This could lead to the development of more effective AI tools. With further advances, AI could help triage patients, recommend treatments based on medical history, or flag potential errors in diagnoses.
The findings also suggest that AI models need better training to handle the complexity of medical coding, which could streamline administrative tasks and improve the efficiency of health systems.
“Our benchmark provides a valuable resource for assessing the capabilities of large language models to interpret medical codes and distinguish concepts,” explained Ben Shoham. “This allows us to test new models as they are released and compare them with existing ones.”