Response to “The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a ‘scientist’”

Many of Jennifer Listgarten’s arguments are compelling: in particular, that the protein folding problem is an outlier relative to other grand challenges in science, both in terms of the precise way the problem can be stated and performance measured and in terms of the amount of available, high-quality data [1]. However, although existing biological databases tend to be small relative to the compendia used to train large language models, it seems plausible that one type of biological data, whole-genome sequencing, will soon be generated at massive scale, opposite to what was argued [1]. As genome sequencing costs go down and the potential for clinical use of genomic data goes up, it will make economic sense to fully sequence everyone. Each 3-billion-base-pair individual genome can be represented as 30 million unique bases, so fully sequencing the US population of 300 million individuals yields a total of 9 × 10^15 bases, which is comparable in size to the 400-terabyte Common Crawl dataset used to train large language models. Using such data to train large-scale machine learning models will be challenging because of privacy considerations. Nonetheless, I see at least four paths by which such models could be built on massive genomic data.
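The arithmetic behind this estimate can be checked directly (the per-genome and population figures are those quoted above; the one-byte-per-base encoding is an assumption for the size comparison):

```python
# Back-of-envelope estimate using the figures quoted in the text.
bases_per_genome = 30_000_000      # unique bases representing one genome
us_population = 300_000_000        # individuals

total_bases = bases_per_genome * us_population
print(f"total bases: {total_bases:.1e}")   # 9.0e+15

# Assuming one byte per base, express the total in terabytes for
# comparison with the ~400-terabyte Common Crawl corpus.
terabytes = total_bases / 1e12
print(f"~{terabytes:.0f} TB at 1 byte/base")
```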

The first path involves federated data access. A federated approach uses software to enable multiple databases to function as one, facilitating interoperability while maintaining autonomy and decentralization [2]. Federation capabilities are supported by existing genomic biobanks, such as the UK Biobank, NIH All of Us and Finland’s FinnGen initiative [3], and are further facilitated by commercial entities such as lifebit.ai. In a federated approach, a deep learning model can be trained from data drawn from multiple biobanks while maintaining privacy guarantees.
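The core idea can be sketched as federated averaging: each biobank computes a model update on its own data, and only those updates (never the raw genomes) are aggregated. The code below is a minimal illustration with simulated sites and a linear model, not an interface to any real biobank:

```python
# Minimal federated-averaging sketch. Each "biobank" holds its (X, y)
# privately; only locally computed model weights are shared and averaged.
import numpy as np

def local_update(weights, X, y, lr=0.1):
    """One gradient step of least-squares regression on local data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(weights, biobanks):
    """Average locally updated weights; raw data never leaves a site."""
    updates = [local_update(weights, X, y) for X, y in biobanks]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])

# Three simulated sites, each with its own private dataset.
biobanks = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    biobanks.append((X, X @ true_w + 0.01 * rng.normal(size=100)))

w = np.zeros(2)
for _ in range(200):
    w = federated_round(w, biobanks)
print(w)  # converges toward true_w
```

Real deployments add secure aggregation and differential privacy on top of this basic loop to strengthen the privacy guarantees mentioned above.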

The federated approach is already proving to be successful, but it does face a fundamental limitation: namely, that by design the genomic data can only be matched to phenotypic data within the database. This approach thus precludes linking genomics to other rich sources of data, including electronic health records and aggregated customer data profiles.

A second potential path for training deep learning models on large-scale genomic data involves hacking. Although large leaks of supposedly private consumer data have become commonplace, the high level of security at publicly funded biobanks, the massive amount of data that would need to be leaked, and the subsequent resources required to link the leaked data to external profiles and train and deploy a massive machine learning model make this pathway seem unlikely.

More likely, a significant proportion of genomic data may end up voluntarily moving into the public or quasi-public domain. Individuals may choose to upload genomic profiles for ancestry analysis, online dating or patient support groups such as All4Cure [4]. Such data may then be incorporated into the increasingly detailed consumer data profiles that are routinely collected and aggregated by commercial entities. Once such data begin entering our online profiles, the motivation to maintain individual privacy goes down. After all, once the genomes of an individual’s parents, siblings and children are public, then that individual’s genome is effectively public as well.

Finally, there is the possibility that some state actors may decide to train large-scale genomic models themselves. Government access to genomic data, electronic health records, consumer data and law enforcement data could make for a particularly powerful dataset for downstream analysis.

This leads to the question of whether population-scale genomic data can be used to address important scientific questions. On its own, such data may enable drawing inferences about human evolutionary history. In conjunction with electronic health records and aggregated customer data profiles, genomic data may also enable machine learning models to begin drawing inferences about the relationship between genotype and phenotype. An individual’s online history — social media, purchasing, browser history — can be remarkably detailed, and there is already strong economic motivation to discover trends in such data. Once such data are linked to genomic data and potentially electronic medical records, those trends could be linked to human disease. Thus, a self-supervised machine learning system with access to all three of these types of data could learn to automatically tease out relationships among lifestyle choices, medical outcomes and genetic profiles. Rather than solving a particular, supervised grand challenge like protein folding, a model that learns in this fashion could be an engine for scientific hypothesis generation.
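As a toy illustration of hypothesis generation from linked data, the sketch below scans synthetic genotype and phenotype matrices for the most strongly associated feature pairs. A real system would use self-supervised deep models on far richer data; plain Pearson correlation on simulated data stands in here, and the planted association is entirely artificial:

```python
# Toy "hypothesis generator": rank genotype-phenotype feature pairs by
# absolute correlation in linked (synthetic) data.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
genotypes = rng.integers(0, 3, size=(n, 50)).astype(float)  # allele counts 0/1/2
phenotypes = rng.normal(size=(n, 10))

# Plant one true association: variant 7 influences phenotype 3.
phenotypes[:, 3] += 0.5 * genotypes[:, 7]

def top_hypotheses(X, Y, k=3):
    """Return the k (X feature, Y feature) pairs with largest |correlation|."""
    Xs = (X - X.mean(0)) / X.std(0)
    Ys = (Y - Y.mean(0)) / Y.std(0)
    corr = Xs.T @ Ys / len(X)
    flat = np.argsort(-np.abs(corr), axis=None)[:k]
    return [np.unravel_index(i, corr.shape) for i in flat]

print(top_hypotheses(genotypes, phenotypes))  # pair (7, 3) ranks first
```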

Such a model would likely yield profound insights, but it will also reflect the biases and inaccuracies in the data provided to it. Due to the natural tendency of governments to prioritize characterizing their own populations, genomic data collection is currently heavily skewed toward individuals in richer countries. This trend will be exacerbated by precision medicine, in which genomic data are collected to enable tailored medical care. Such approaches, at least in the short term, are also likely to be heavily biased toward richer patients, or at least toward patients in richer countries. Machine learning models automatically learn the biases in their training data — for example, achieving higher accuracy at face recognition tasks when the dataset is skewed toward lighter-skinned, male individuals [5]. Such biases may be easier to detect in a well-defined setting such as face recognition than in an unsupervised system that attempts to identify novel relationships among variables. The risk in such a system is twofold: the potential for bias lies not only in the answers that the system provides but also in the questions that it decides to ask. All of these considerations suggest a strong need for equity in the collection of genomic data, as well as increased research on identifying and mitigating biases in large-scale machine learning systems.

References

1. Listgarten, J. Nat. Biotechnol. 42, 371–373 (2024).

2. Alvarellos, M. et al. Front. Genet. 13, 1045450 (2023).

3. Global Alliance for Genomics and Health. Science 352, 1278–1280 (2016).

4. Gubar, S. A cancer researcher takes cancer personally. The New York Times (15 February 2018).

5. Buolamwini, J. & Gebru, T. Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency 77–91 (PMLR, 2018).


Author information

Department of Genome Sciences, University of Washington, Seattle, WA, USA

William Stafford Noble

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA

William Stafford Noble

Corresponding author

Correspondence to William Stafford Noble.

Ethics declarations

Competing interests

The author declares no competing interests.

Cite this article

Noble, W.S. Response to “The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a ‘scientist’”. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02230-2

Published: 01 May 2024
