AI thought X-rays were related to eating refried beans or drinking beer

Medical imaging is a cornerstone of diagnosis, and artificial intelligence (AI) promises to revolutionize it. With the power to detect features and trends invisible to the human eye, AI holds the promise of faster and more accurate diagnoses.

But beneath that promise lies a worrying flaw: AI’s tendency to take shortcuts and jump to conclusions.

These shortcuts can lead to misleading and sometimes dangerous conclusions. Take, for example, algorithms that appear able to “predict” from a knee X-ray whether or not someone drinks beer.

An X-ray of a knee joint. Sound like a beer drinker? Image via Wiki Commons.

Researchers trained convolutional neural networks (CNNs) – one of the most popular types of deep learning algorithms – to perform a bizarre task: predict whether a patient avoided eating refried beans or drinking beer simply by looking at knee X-rays. The models did just that, achieving 63% accuracy for predicting bean avoidance and 73% for beer avoidance.
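For readers who want a sense of what such an experiment involves, here is a minimal, hypothetical sketch of fine-tuning a pretrained CNN to predict a binary dietary label from knee X-ray images. The dataset class, file paths, and labels are illustrative assumptions, not the study’s actual pipeline.

```python
# Minimal sketch (assumptions, not the authors' code): fine-tune a pretrained
# CNN to predict a binary "drinks beer" label from knee X-ray images.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torchvision import models, transforms
from PIL import Image

class KneeXrayDataset(Dataset):
    """Hypothetical dataset: `samples` is a list of (image_path, label) pairs,
    where label = 1 if the patient reported drinking beer, else 0."""
    def __init__(self, samples):
        self.samples = samples
        self.tf = transforms.Compose([
            transforms.Grayscale(num_output_channels=3),  # X-rays are single-channel
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        path, label = self.samples[i]
        return self.tf(Image.open(path)), torch.tensor(label, dtype=torch.float32)

# Pretrained ResNet backbone with a single-logit head for binary prediction
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_one_epoch(loader: DataLoader) -> None:
    model.train()
    for imgs, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(imgs).squeeze(1), labels)
        loss.backward()
        optimizer.step()
```

The unsettling part of the study is that a setup much like this one can reach above-chance accuracy on a target that has nothing to do with knee anatomy.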

Obviously, this defies logic: there is no connection between the anatomy of the knee and dietary preferences, yet the models produced statistically significant results. The strange result was not due to some hidden medical insight. Instead, it was a textbook example of shortcut learning.

Learning shortcuts and confounding variables

This study used the Osteoarthritis Initiative (OAI) dataset, a vast collection of over 25,000 knee X-rays. The dataset included various confounders: variables that could distort what the models learn. The researchers found that AI models could predict a patient’s gender, race, clinical site, and even the manufacturer of the X-ray machine with striking accuracy. For example:

  • Gender prediction: 98.7% accuracy.
  • Clinical site prediction: 98.2% accuracy.
  • Race prediction: 92.1% accuracy.

That may sound impressive, but here’s the thing: AI can use these confounds as shortcuts. For example, if a particular clinical site has more patients from a certain demographic, the AI might associate that demographic with certain diagnoses, a shortcut that reflects bias rather than medical reality.
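One way to make that risk concrete is a “confounder probe”: train a simple classifier to predict the confounding variable itself from the images (or from a CNN’s image embeddings), much as the researchers did for gender, race, and clinical site. The sketch below is an assumed illustration, not the study’s code; `features` and `confound_labels` are hypothetical inputs.

```python
# Illustrative sketch (assumed, not from the study): probe whether a confound
# such as clinical site is recoverable from image embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def confounder_probe(features: np.ndarray, confound_labels: np.ndarray) -> float:
    """features: one embedding per X-ray (n_samples x n_dims);
    confound_labels: e.g. the clinical-site ID for each image."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, confound_labels, test_size=0.2,
        stratify=confound_labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Held-out accuracy far above chance (the study reports ~98% for clinical
# site) is a warning that the images carry cues a diagnostic model could
# latch onto as a shortcut.
```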

Shortcut learning occurs when AI models exploit superficial patterns in data rather than learning meaningful relationships. In medical imaging, shortcut learning means that the model does not recognize medical conditions, but instead latches onto irrelevant cues.

“Although AI has the potential to transform medical imaging, we must be cautious,” says lead study author Dr. Peter Schilling, an orthopedic surgeon at Dartmouth Health’s Dartmouth Hitchcock Medical Center and assistant professor of orthopedics at Dartmouth’s Geisel School of Medicine.

“These models can see patterns that humans can’t, but not all of the patterns they identify are significant or reliable,” says Schilling. “It is essential to recognize these risks to prevent misleading conclusions and to ensure scientific integrity.”

It could become a bigger problem

Society at large is still deciding how AI can acceptably be used in healthcare. Practitioners agree that AI should not be left to interpret medical imaging alone; at most, it should be used as an aid, with the results and interpretation still reviewed by an expert. But with the use of AI becoming more widespread, and with widespread labor shortages, AI could come to play a more central role.

This is why the findings are so worrying.

For example, AI could identify a specific clinical site based on unique markers in the X-ray image, such as the placement of labels or blacked-out sections used to hide patient information. These markers can be correlated with patient demographics or other latent variables such as age, race, or diet—factors that should not affect diagnosis but can skew AI predictions.

Imagine an AI trained to detect disease in chest X-rays. If the AI learns to associate a particular hospital’s labeling style with disease prevalence, its predictions will be unreliable when applied to images from other hospitals. This type of bias can result in misdiagnoses and flawed research findings.

Shortcut learning also undermines the credibility of AI-based discoveries. Researchers and clinicians can be misled into believing that AI has identified a revolutionary medical insight, when in fact it has only exploited a meaningless pattern.

“This goes beyond biases from race or gender cues,” says Brandon Hill, co-author of the study and a machine learning scientist at Dartmouth Hitchcock. “We found that the algorithm could even learn to predict the year an X-ray was taken. It is harmful: when you prevent the model from learning one of these elements, it will instead learn another that it previously ignored. This danger can lead to some really dubious claims, and researchers need to be aware of how easily this happens when using this technique.”

Can we fix it?

It is very difficult to eliminate shortcut learning. Even with extensive preprocessing and image normalization, the AI still identified patterns that humans couldn’t see and tended to make interpretations based on them. This ability to “cheat” by finding irrelevant but statistically significant correlations poses a serious risk for medical applications.

The challenge of shortcut learning has no easy solution. Researchers have proposed various methods to reduce bias, such as balancing data sets or removing confounding variables. But this study shows that these solutions are often insufficient. Shortcut learning can involve multiple, interwoven factors, making it difficult to isolate and correct for each.
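To illustrate what one of those proposed mitigations can look like in practice, here is a minimal, assumed sketch of reweighting training samples so that each (label, confound) group contributes equally. The function and variables are hypothetical, and, as the study shows, this kind of balancing alone often fails to close every shortcut.

```python
# Assumed sketch of dataset balancing: weight each sample inversely to the
# size of its (label, confound) group so no single site or demographic
# dominates training. Not the paper's method, just one common mitigation.
import numpy as np

def balanced_sample_weights(labels: np.ndarray, confounds: np.ndarray) -> np.ndarray:
    weights = np.empty(len(labels), dtype=float)
    for lab in np.unique(labels):
        for con in np.unique(confounds):
            mask = (labels == lab) & (confounds == con)
            if mask.any():
                weights[mask] = 1.0 / mask.sum()
    return weights / weights.mean()  # normalize so the average weight is 1

# These weights could drive a weighted loss or a sampler, but they only
# address the confounds you have measured; interwoven, unmeasured ones remain.
```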

The authors of the study argue that AI in medical imaging needs more attention. Deep learning algorithms are not hypothesis tests – they are powerful pattern recognition tools. When used for scientific discovery, their results must be rigorously validated to ensure they reflect true medical insights rather than statistical artifacts.

Essentially, we need to subject AI to much greater scrutiny, especially in the medical context.

“The burden of proof increases when it comes to using models for the discovery of new patterns in medicine,” says Hill. “Part of the problem is our own bias. It’s incredibly easy to fall into the trap of assuming the model ‘sees’ the same way we do. In the end, it doesn’t.”

The researchers also caution against treating AI as a fellow expert.

“AI is almost like dealing with an alien intelligence,” Hill continues. “You want to say the model is ‘cheating,’ but that anthropomorphizes the technology. It learned a way to solve the task it was given, but not necessarily how a person would. It has no logic or reasoning as we usually understand it.”

Journal Reference: Ravi Aggarwal et al., Diagnostic Accuracy of Deep Learning in Medical Imaging: A Systematic Review and Meta-Analysis, npj Digital Medicine (2021). DOI: 10.1038/s41746-021-00438-z