Research shows why AI models that analyze medical images can be biased

  • July 1, 2024

Artificial intelligence models often play a role in medical diagnosis, especially when it comes to analyzing images such as X-rays. However, research has shown that these models don't always perform well across all demographics, and they tend to do worse among women and people of color.

These models also appear to develop some surprising skills. In 2022, MIT researchers reported that AI models can make accurate predictions about a patient's race based on their chest X-rays — something the most skilled radiologists cannot do.

That research team has now found that the models most accurate at making demographic predictions also show the largest "fairness gaps": discrepancies in their ability to accurately diagnose images of people of different races or genders. The findings suggest that these models may be using "demographic shortcuts" when making their diagnostic evaluations, leading to incorrect results for women, Black people, and other groups, the researchers say.

"It is well known that high-capacity machine learning models are good predictors of human demographics such as self-reported race, gender, or age. This paper demonstrates that capacity again, and then links that capacity to a lack of performance across different groups, which has never been done before," says Marzyeh Ghassemi, an associate professor of electrical engineering and computer science at MIT, a member of MIT's Institute for Medical Engineering and Science, and the study's senior author.

The researchers also found that they could retrain the models in a way that improved their fairness. However, their “debiasing” approach worked best when the models were tested on the same types of patients they were trained on, such as patients from the same hospital. When these models were applied to patients from different hospitals, the fairness gaps reappeared.

"I think the key takeaways are, first, that you should thoroughly evaluate any external models on your own data, because any fairness guarantees that model developers provide on their training data may not transfer to your population. Second, whenever sufficient data is available, you should train models on your own data," says Haoran Zhang, an MIT graduate student and one of the lead authors of the new paper. MIT graduate student Yuzhe Yang is also a lead author of the paper, which will appear in Nature Medicine. Judy Gichoya, an associate professor of radiology and imaging sciences at Emory University School of Medicine, and Dina Katabi, the Thuan and Nicole Pham Professor of Electrical Engineering and Computer Science at MIT, are also authors of the paper.

Removing bias

As of May 2024, the FDA has approved 882 AI-enabled medical devices, 671 of which are designed for use in radiology. Since 2022, when Ghassemi and her colleagues showed that these diagnostic models can accurately predict race, she and other researchers have shown that such models are also very good at predicting gender and age, even though the models are not trained to do those tasks.

“A lot of popular machine learning models have superhuman demographic prediction capabilities — radiologists can’t detect self-reported race in a chest X-ray,” Ghassemi says. “These are models that are good at predicting disease, but as they train, they learn to predict other things that may not be desirable.” In this study, the researchers wanted to see why these models don’t work as well for certain groups. In particular, they wanted to see if the models were taking demographic shortcuts to make predictions that ended up being less accurate for some groups. These shortcuts can arise in AI models when they use demographic features to determine whether a medical condition is present, instead of relying on other features of the images.

Using publicly available chest X-ray datasets from Beth Israel Deaconess Medical Center in Boston, the researchers trained models to predict whether patients had one of three different medical conditions: fluid buildup in the lungs, collapsed lung, or enlargement of the heart. They then tested the models on X-rays held out from the training data.

Overall, the models performed well, but most showed "fairness gaps": discrepancies between accuracy rates for men and women, and for white and Black patients.
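The notion of a fairness gap can be made concrete with a small sketch. The function below (with invented toy data, not the study's) measures per-group accuracy and reports the spread between the best- and worst-served groups:

```python
# Illustrative sketch: a "fairness gap" as the difference in accuracy
# between demographic groups. Labels and groups below are toy data.

def fairness_gap(y_true, y_pred, groups):
    """Return (max accuracy difference between groups, per-group accuracies)."""
    accs = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        accs[g] = correct / len(idx)
    return max(accs.values()) - min(accs.values()), accs

gap, per_group = fairness_gap(
    y_true=[1, 0, 1, 1, 0, 1, 0, 0],
    y_pred=[1, 0, 1, 0, 0, 0, 1, 0],
    groups=["A", "A", "A", "A", "B", "B", "B", "B"],
)
# Group A is diagnosed correctly 3/4 of the time, group B only 2/4,
# so the model shows a 0.25 fairness gap on this toy data.
```

A model can score well on overall accuracy while still having a large gap, which is why the researchers report subgroup performance separately.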

The models were also able to predict the gender, race, and age of the people in the X-rays. Furthermore, there was a significant correlation between each model's accuracy in making demographic predictions and the size of its fairness gap. This suggests that the models may be using demographic categorizations as a shortcut to make their disease predictions.
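The correlation the researchers describe can be sketched with a plain Pearson coefficient over hypothetical per-model numbers (the figures below are invented for illustration, not taken from the paper):

```python
# Sketch: correlating each model's demographic-prediction accuracy
# with its fairness gap. All numbers are illustrative placeholders.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# One entry per hypothetical model: the better it predicts demographics,
# the larger its fairness gap, yielding a strong positive correlation.
demo_acc = [0.60, 0.70, 0.80, 0.90]
fair_gap = [0.02, 0.05, 0.08, 0.12]
r = pearson(demo_acc, fair_gap)
```

A strongly positive `r` is the pattern the study reports: demographic-prediction ability and unfairness rise together.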

The researchers then attempted to reduce the fairness gaps using two types of strategies. They trained one set of models to optimize "subgroup robustness," meaning the models are rewarded for performing better on the subgroup where they perform worst, and penalized when their error rate for one group is higher than for the others.
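The subgroup-robustness idea can be sketched in miniature (in the spirit of group-robust training methods; this is an assumed simplification, not the paper's exact objective): instead of averaging losses over all samples, weight each group's loss so the worst-off group dominates what the model is asked to minimize.

```python
# Sketch of subgroup robustness: softmax-weight per-group losses so the
# worst-performing group is emphasized. Loss values are illustrative.
import math

def subgroup_robust_loss(group_losses, temperature=1.0):
    """Weighted loss in which higher-loss groups receive higher weight."""
    weights = [math.exp(l / temperature) for l in group_losses]
    z = sum(weights)
    weights = [w / z for w in weights]
    return sum(w * l for w, l in zip(weights, group_losses))

# Three groups; the worst group (loss 0.9) gets the largest weight,
# pulling the objective above the plain average of 0.5667.
loss = subgroup_robust_loss([0.3, 0.5, 0.9])
```

Because the objective is dominated by the worst group, reducing that group's error is the most effective way for training to lower the loss.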

In another set of models, the researchers forced them to remove all demographic information from the images, using “group adversarial” approaches. Both strategies worked reasonably well, the researchers found.
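The group-adversarial idea can be sketched as an objective with two competing terms (an assumed simplification in the style of gradient-reversal training, not the paper's exact formulation): an auxiliary classifier tries to recover the demographic group from the model's features, and the main model is penalized whenever that adversary succeeds.

```python
# Sketch of a group-adversarial objective: the encoder minimizes its
# task loss while *maximizing* the group classifier's loss, pushing it
# to scrub demographic signal from its features. Numbers are toy values.

def adversarial_objective(task_loss, group_classifier_loss, lam=1.0):
    """Encoder objective: good diagnosis, bad (high-loss) group recovery."""
    return task_loss - lam * group_classifier_loss

# If the adversary predicts the group easily (low loss), the encoder's
# objective is worse (higher) than when group recovery is hard.
easy_group = adversarial_objective(task_loss=0.4, group_classifier_loss=0.1)
hard_group = adversarial_objective(task_loss=0.4, group_classifier_loss=0.9)
```

The subtraction is what makes the approach "adversarial": the encoder only achieves a low objective when the demographic information has been removed from its representation.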

“For in-distribution data, you can use existing state-of-the-art methods to reduce the fairness differences without significantly compromising overall performance,” Ghassemi says. “Subgroup robustness methods force models to be sensitive to mispredicting a specific group, and group-adversarial methods try to remove group information entirely.”

Not always fairer

However, these approaches only worked when the models were tested on data from the same types of patients they were trained on—for example, only patients from the Beth Israel Deaconess Medical Center dataset.

When the researchers tested the models that had been “cleaned” using the BIDMC data to analyze patients from five other hospital datasets, they found that the overall accuracy of the models remained high, but that some models showed large gaps in fairness.

"If you debias the model on one group of patients, that fairness doesn't necessarily hold when you move to a new group of patients from a different hospital in a different location," says Zhang.

This is concerning, the researchers say, because in many cases hospitals use models developed from other hospitals' data, especially when they purchase an off-the-shelf model.

“We found that even state-of-the-art models that perform optimally on data similar to their training sets are suboptimal – that is, they do not make the best trade-off between overall and subgroup performance – in new settings,” says Ghassemi. “Unfortunately, this is actually how a model is likely to be deployed. Most models are trained and validated with data from a single hospital or source, and then deployed broadly.”

The researchers found that the models stripped of bias using group adversarial approaches were slightly fairer when tested on new patient groups than the models stripped of bias using subgroup robustness methods. They now plan to try to develop and test additional methods to see if they can create models that do a better job at making fair predictions on new data sets.

The findings suggest that hospitals using these types of AI models should evaluate them on their own patient populations before implementing them, to ensure they are not giving inaccurate results for certain groups.

The research was funded by a Google Research Scholar Award, the Robert Wood Johnson Foundation Harold Amos Medical Faculty Development Program, an RSNA Health Disparities grant, the Lacuna Fund, the Gordon and Betty Moore Foundation, the National Institute of Biomedical Imaging and Bioengineering, and the National Heart, Lung, and Blood Institute.
