Jennifer Mnookin Chancellor | Official website
University of Wisconsin–Madison researchers have raised concerns about the reliability of artificial intelligence (AI) tools used in genomic studies. These tools, which are gaining popularity in genetics and medicine, can lead to incorrect conclusions regarding the relationship between genes and physical traits, including disease risk factors such as diabetes.
The problem stems from the use of AI to support genome-wide association studies. These studies analyze genetic variations across many individuals to find links between genes and physical traits. However, databases used for these studies often lack complete health condition data.
"Some characteristics are either very expensive or labor-intensive to measure, so you simply don’t have enough samples to make meaningful statistical conclusions about their association with genetics," says Qiongshi Lu, an associate professor in UW–Madison's Department of Biostatistics and Medical Informatics.
To address missing data, researchers increasingly rely on machine learning models to predict complex traits and disease risks. However, Lu and his colleagues have identified risks in this approach. Their study, published in Nature Genetics, shows that certain machine learning algorithms may falsely link genetic variations with Type 2 diabetes risk.
"The problem is if you trust the machine learning-predicted diabetes risk as the actual risk, you would think all those genetic variations are correlated with actual diabetes even though they aren’t," Lu explains.
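The mechanism Lu describes can be illustrated with a small simulation. The sketch below is not the team's method or data; it is a toy model built on assumptions chosen for illustration. A hypothetical variant `g_causal` influences the true trait, while a second hypothetical variant `g_unrelated` influences only a biomarker that is also used to predict the trait. All effect sizes, the allele frequency, and the simple linear predictor are invented for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, seed=0):
    """Toy model (all numbers are illustrative assumptions, not real data)."""
    rng = random.Random(seed)

    def genotype():
        # Number of minor alleles (0, 1, or 2) at an assumed frequency of 0.3.
        return int(rng.random() < 0.3) + int(rng.random() < 0.3)

    g_causal = [genotype() for _ in range(n)]
    g_unrelated = [genotype() for _ in range(n)]

    # True trait: depends on g_causal only.
    trait = [0.3 * g + rng.gauss(0, 1) for g in g_causal]

    # Biomarker: tracks the trait, but is also shifted by g_unrelated.
    biomarker = [t + 0.3 * g + rng.gauss(0, 1)
                 for t, g in zip(trait, g_unrelated)]

    # Stand-in for a machine learning predictor: least-squares line that
    # predicts the trait from the biomarker.
    mb, mt = sum(biomarker) / n, sum(trait) / n
    slope = (sum((b - mb) * (t - mt) for b, t in zip(biomarker, trait))
             / sum((b - mb) ** 2 for b in biomarker))
    predicted = [mt + slope * (b - mb) for b in biomarker]

    return {
        "causal_vs_true": pearson(g_causal, trait),
        "unrelated_vs_true": pearson(g_unrelated, trait),
        "unrelated_vs_predicted": pearson(g_unrelated, predicted),
    }


if __name__ == "__main__":
    for name, r in simulate().items():
        print(f"{name}: r = {r:.3f}")
```

In this toy setup, the unrelated variant shows essentially no correlation with the true trait but a clearly nonzero correlation with the predicted trait, because the predictor inherits the variant's effect on the biomarker. Treating the prediction as the real measurement would therefore flag a variant that has nothing to do with the trait itself.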
In response, Lu's team has developed a statistical method aimed at reducing biases introduced by AI in genome-wide association studies. "This new strategy is statistically optimal," Lu states, highlighting its effectiveness in pinpointing genetic associations with bone mineral density.
The research team also cautions against relying on proxy information, rather than algorithms, to fill data gaps in genomic studies. They found that this practice, too, can produce misleading results, such as misleading genetic correlations between Alzheimer's risk and cognitive abilities.
"These days, genomic scientists routinely work with biobank datasets that have hundreds of thousands of individuals; however, as statistical power goes up, biases and the probability of errors are also amplified in these massive datasets," says Lu. The team's findings emphasize the need for statistical rigor in large-scale research.