Background

Artificial intelligence (AI) and machine learning (ML) are increasingly heralded as transformational technologies that may address longstanding challenges in health care practice and delivery related to patient safety, health care quality, costs, equity, and access.1 Promising use cases demonstrating the potential value of AI/ML have been described across the clinical and administrative spectrum,2 including drug discovery, analysis and interpretation of radiological images, early detection of sepsis, hospital resource management, automated generation of clinical encounter notes, and insurance claims processing. However, while the opportunities enabled by AI/ML are doubtless encouraging, we must temper the optimism and excitement generated by these tools with intentional and constant vigilance toward the adverse interactions they may have with health equity goals.3

The prevalence of health inequities, starkly underscored by the COVID-19 pandemic, remains a pernicious threat to marginalized populations.4 Despite widespread acknowledgement of the serious nature of this threat and ongoing policy efforts, unequal health outcomes persist among communities and subpopulations differentiated by social determinants of health (SDOH) and by racial and ethnic factors.5 Marginalized populations experience worse outcomes across multiple disease conditions and interactions with the health care system, including cancer survival rates, blood pressure control in hypertensive patients,6 unmanaged diabetes, infant and maternal mortality, and access to medical resources.7 Systematically eliminating such differences must be a major objective in our policy and practice efforts as AI/ML continues to be diffused throughout health care.

The juxtaposition of a system already rife with health inequities against the digital transformation of health care, with its accelerating reliance on data and algorithms, illustrates the potential danger posed by AI/ML tools. AI/ML tools are constructed on a foundation of data and incorporated into models and algorithms by data scientists and tool developers. It is important to recognize that data and models are fundamentally artifacts of human creation and inevitably reflect the societal, structural, and individual biases that are root causes of health inequities.8 When such data and models are institutionalized and broadly implemented into systems of care delivery, the concern is that they will further exacerbate biases and adversely affect health equity.

Policymakers and practitioners seeking to leverage the power of AI/ML to positively affect health care must ensure that the principle of health equity remains front and center in tool development, and that appropriate guardrails are constructed for AI/ML applications being proposed for deployment in clinical and administrative tasks. In previous work we have presented an “AI Bias-Aware Framework” to guide the efforts of data scientists as they develop AI applications for health care.9 In this commentary, we build upon that work with guidance for policymakers and practitioners. Accomplishing health equity goals implies paying attention to five overarching issues. Three considerations are technical in nature and reflect concerns that may arise from the way in which specific AI/ML tools are developed. The final two issues relate to broader societal questions for which policies and guidelines need to be constructed.

The Equity Imperatives for AI/ML in Health Care

Data Quality and Representativeness

As the oil that fuels AI/ML applications, data are a critical source of bias in ML. Building these tools requires large training datasets assembled from a variety of sources, which may include patient data stored in electronic health record (EHR) systems, claims data maintained by payers, and daily activity data captured through smart sensors and other wearable technologies. ML algorithms use these datasets to learn patterns and make predictions, such as how likely a particular patient is to be readmitted within 30 days after being discharged,10 the probability that a patient will respond positively to a specific therapeutic intervention, or whether a radiographic image shows signs of a malignant tumor.
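To make this training-and-prediction loop concrete, the sketch below fits a simple 30-day readmission classifier. It is a minimal illustration only: the features, coefficients, and cohort are synthetic stand-ins, not a validated clinical model.

```python
# Minimal sketch of training a 30-day readmission classifier.
# All features, coefficients, and labels below are synthetic stand-ins
# for real EHR-derived data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical features: age, number of prior admissions, length of stay (days)
X = np.column_stack([
    rng.normal(65, 12, n),
    rng.poisson(1.5, n),
    rng.exponential(4.0, n),
])

# Synthetic label: 1 = readmitted within 30 days of discharge
logits = 0.02 * (X[:, 0] - 65) + 0.5 * X[:, 1] + 0.05 * X[:, 2] - 1.5
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The fitted model outputs a readmission probability for each patient.
print(model.predict_proba(X_test[:5])[:, 1])
```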

The critical reliance on data for training makes ML applications especially vulnerable to data biases. Thus, there is a pressing need to carefully examine underlying sources of data and the degree to which they are representative of the true diversity present across subpopulations and communities. For example, a range of biases have been identified in EHR data alone, including missing data, inadequate sample sizes for certain vulnerable groups that contribute to lower predictive power, and misclassification or errors in the data. Marginalized populations have historically had lower trust in the health care enterprise and exhibited reticence in engaging with the health care system in general, and in data sharing in particular.11 This further reduces the availability of data from these communities to feed ML applications.
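One practical response is to audit representativeness before any model is trained. The sketch below compares the subgroup composition of a synthetic training cohort against hypothetical reference shares and checks whether missingness is unevenly distributed across groups; the group labels, thresholds, and reference shares are all illustrative assumptions.

```python
# Sketch of a pre-training representativeness audit. The cohort is
# synthetic; in practice `subgroup` would come from the assembled training
# data and `reference_share` from census or catchment-area statistics.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
subgroup = rng.choice(["group_a", "group_b", "group_c"], size=n, p=[0.80, 0.15, 0.05])

# Make a lab value go missing three times as often for one group,
# mimicking uneven data capture.
miss_p = np.where(subgroup == "group_c", 0.30, 0.10)
hba1c = np.where(rng.random(n) < miss_p, np.nan, rng.normal(6.5, 1.0, n))
train = pd.DataFrame({"subgroup": subgroup, "hba1c": hba1c})

reference_share = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}  # hypothetical

composition = train["subgroup"].value_counts(normalize=True)
for group, expected in reference_share.items():
    observed = composition.get(group, 0.0)
    flag = "UNDERREPRESENTED" if observed < 0.8 * expected else "ok"
    print(f"{group}: observed {observed:.1%} vs reference {expected:.1%} [{flag}]")

# Missingness broken out by group reveals uneven data capture.
print(train["hba1c"].isna().groupby(train["subgroup"]).mean())
```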

The literature documents numerous examples of the inadequate performance of ML algorithms for diverse patients arising from data limitations. These include underdiagnosis of minority populations in the analysis of chest radiography images,12 the presence of gender imbalance in image datasets used for automated diagnosis of thoracic diseases,13 limitations of publicly available datasets used for the diagnosis of skin cancer where dark-skinned patients are underrepresented, and lower performance of diabetic retinopathy applications in low-income countries.14

The limited availability of data for subpopulations and the associated negative impact on health equity are vividly captured in the notion of health data poverty: “the inability for individuals, groups, or populations to benefit from a discovery or innovation due to insufficient data that are adequately representative”.15 Health data poverty amplifies the difference in benefits that accrue to vulnerable populations, further damaging health equity. For example, a recent study found that genome-wide association studies (GWAS), which are used to understand the relationship between genetic diversity and disease and to enable the design of more targeted therapies, conduct their analyses on data dominated by populations of European ancestry. As a result, the opportunity for other races and ethnicities to benefit from these discoveries is limited.16

ML Model Specification Biases

Bias in ML models can arise from data non-representativeness but can also be caused by the specific performance objectives and outcomes for which the model is optimized. During development, ML models are subjected to a process of iterative optimization to determine which model performs best for a pre-specified set of objectives. To illustrate, a model may be optimized for objectives such as predictive accuracy or minimization of false positives. However, such objectives can result in a penalty for underrepresented populations in the data: improving accuracy for the dataset as a whole may yield a model that is optimized for the majority group but underperforms for others. This “fairness-accuracy” trade-off is widely acknowledged in the ML literature17 and is increasingly being recognized in health care. Model developers as well as clinical and administrative leaders must pay close attention to not only the overall performance of ML algorithms but also the outcomes they yield for underrepresented populations.
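In practice, this means reporting performance stratified by subgroup rather than a single overall number. A minimal sketch of such a report is below; `model`, the test data, and the `subgroup` array are assumed to come from an existing evaluation pipeline, and the 0.5 decision threshold is an illustrative choice.

```python
# Sketch of a subgroup performance report: a strong overall metric can
# hide large gaps between groups. `model`, `X_test`, `y_test`, and
# `subgroup` are assumed to come from an existing evaluation pipeline.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def subgroup_report(model, X_test, y_test, subgroup):
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= 0.5).astype(int)  # illustrative decision threshold
    print(f"overall  n={len(y_test)}  acc={accuracy_score(y_test, y_pred):.3f}  "
          f"auc={roc_auc_score(y_test, y_prob):.3f}")
    for g in np.unique(subgroup):
        m = subgroup == g
        print(f"{g:>8}  n={m.sum()}  acc={accuracy_score(y_test[m], y_pred[m]):.3f}  "
              f"auc={roc_auc_score(y_test[m], y_prob[m]):.3f}")
```

A large gap between the overall metric and any subgroup's metric is precisely the signal the fairness-accuracy literature warns about, and it should trigger review before deployment.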

A compelling example of how the choice of outcome variable or prediction target for an algorithm can unintentionally cause harm is described by Obermeyer and coauthors (2019).18 An algorithm widely used to predict risk scores for determining whether patients should be enrolled in care management programs used health costs as a proxy for health needs. Black patients historically had lower health care spending and were therefore erroneously assumed to be in lower need of further care.18 To ensure conformance with equity goals, policymakers need clear visibility into the comparative performance of models across subpopulations, as well as the specific performance objectives that developers sought to achieve. The US Food and Drug Administration (FDA) should establish a flexible, risk-based regulatory framework that encourages innovation while prioritizing patient safety, and should address concerns about exacerbating existing disparities in health care access and outcomes.
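The mechanism behind the Obermeyer et al. finding can be reproduced on purely synthetic data. The sketch below is not their model or data; it simply shows how training the same features against a cost label versus a direct health-need label changes which patients are flagged for care management when one group's costs systematically understate its needs.

```python
# Synthetic illustration of the label-choice problem. One group's costs
# understate its true health needs; training on cost then under-flags
# that group. All variables are fabricated for the demonstration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5_000
need = rng.gamma(2.0, 1.0, n)              # latent health need
access = rng.binomial(1, 0.5, n)           # 0 = historically lower access to care
cost = need * np.where(access == 1, 1.0, 0.5) + rng.normal(0, 0.2, n)

# Features the model can see: a noisy clinical signal plus group membership
X = np.column_stack([need + rng.normal(0, 0.5, n), access])

k = int(0.10 * n)  # flag the top 10% of predicted risk for care management
for label_name, y in [("cost (proxy)", cost), ("need (direct)", need)]:
    flagged = np.argsort(GradientBoostingRegressor().fit(X, y).predict(X))[-k:]
    print(f"{label_name}: low-access share of flagged = {(access[flagged] == 0).mean():.0%}")
```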

ML Model Drift

The data that fuel ML applications are not static: they change over time. Depending on the degree of volatility in the data, this can render the historical data used for training ML models obsolete. In health care, examples of data obsolescence can be found across the clinical spectrum: new drugs routinely replace older therapies, and clinical guidelines for cancer treatment are consistently revised to reflect the latest scientific developments.19 ML models can become inaccurate when the data used to determine the model parameters change over time, a phenomenon referred to as “drift.” When such drift occurs, model performance can degrade substantially, and the generalizability of the model to a different context or setting (e.g., trained with data from one hospital system to be used in another) is likely to be low.

What types of drift can potentially occur in medical AI applications? Sahiner and coauthors (2023) describe three categories of drift: input, clinical context of use, and concept.20 Input drift refers to changing characteristics of ML input data, such as differences in the instrumentation used to capture clinical data when they are generated. Input drift can also occur when there are differences between the patient populations used for model training and those encountered at deployment: an ML model developed with EHR data from a large urban health system may perform poorly in a safety-net hospital. The second form of drift, clinical context of use, can occur when the environments for model development and deployment are heterogeneous.20 Sahiner and coauthors (2023) use the illustration of cancer prevalence to demonstrate this form of drift: when disease prevalence changes between the time of model training and development and the time of actual deployment, the model’s calibration may degrade and produce inaccurate results. The final form of drift, concept drift, is a result of the first two categories and reflects changes in the relationship between input variables (e.g., a patient’s health data and history) and output (the predicted probability of a certain disease being present).
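Input drift is the most straightforward of the three to monitor directly. The sketch below compares a feature's deployment-time distribution against its training distribution with a two-sample Kolmogorov-Smirnov test; the feature, the simulated shift, and the alert threshold are all illustrative assumptions.

```python
# Sketch of input-drift monitoring: compare the live distribution of each
# input feature against its training distribution. The feature, the
# simulated shift, and the alert threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def check_input_drift(train_col, live_col, name, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test; a small p-value suggests the
    live inputs no longer match the training distribution."""
    stat, p_value = ks_2samp(train_col, live_col)
    drifted = p_value < alpha
    status = " -> possible drift, review before trusting model outputs" if drifted else ""
    print(f"{name}: KS={stat:.3f}, p={p_value:.4f}{status}")
    return drifted

# Example: a lab value whose measurement calibration changed after deployment
rng = np.random.default_rng(0)
train_hba1c = rng.normal(6.5, 1.0, 10_000)
live_hba1c = rng.normal(6.9, 1.1, 2_000)
check_input_drift(train_hba1c, live_hba1c, "hba1c")
```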

To mitigate the potential for model drift, practitioners need to “stress test” AI models across different contexts and settings prior to broader deployment. Developers must also fully disclose their sources of data, as well as details about the data collection and generation processes used for different data elements. Policymakers need to exercise oversight over the safety and efficacy of AI methods and models that are likely to be increasingly bundled into EHR systems, to ensure that they address equity considerations.21
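A minimal sketch of such a stress test, assuming a trained binary classifier and held-out labeled data from each candidate site, is shown below; the site names, the AUC metric, and the deployment threshold are illustrative choices.

```python
# Sketch of a pre-deployment stress test: one trained model, evaluated
# separately on held-out data from each candidate site. `model` and the
# per-site datasets are assumed; names and threshold are hypothetical.
from sklearn.metrics import roc_auc_score

def stress_test(model, site_datasets, min_auc=0.75):
    """site_datasets: dict mapping site name -> (X, y) held-out data."""
    for site, (X, y) in site_datasets.items():
        auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
        verdict = "ok" if auc >= min_auc else "retrain or recalibrate before deploying"
        print(f"{site}: AUC={auc:.3f} [{verdict}]")

# Usage (hypothetical): stress_test(model, {"urban_academic": (X1, y1),
#                                           "safety_net": (X2, y2)})
```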

Building Diverse Teams for AI/ML Development

The goals of health equity can only be met when diverse voices, experiences, and perspectives are shared and heard. This implies that teams responsible for the conceptualization, development, and deployment of AI/ML applications in health care must include diverse members who reflect the heterogeneity in race, gender, ethnicity, and SDOH in the nation’s population. There are two challenges in accomplishing this. First, the underlying scientific disciplines for AI/ML are the science, technology, engineering, and mathematics (STEM) fields. Diversity in STEM education and the STEM workforce has been a longstanding concern in the United States, and despite some improvements over the past decade, minority populations are not proportionally represented in STEM occupations (e.g., women are only one-third of the STEM workforce; Black, Hispanic, Alaska Native, and American Indian individuals comprise 35% of the US population but only 24% of the STEM workforce).22 Second, the medical profession in the United States does not reflect the full diversity of the population and, as in the case of the STEM workforce, although enrollment of minorities in medical education has been higher in recent years, gaps in representation still remain.23 In 2023, less than 6% of all US physicians identified as Black.24 The lack of diversity in the two groups that will be leading the development of AI/ML applications in health care will doubtless have an adverse effect on health equity goals.

The construction of diverse teams must be an intentional activity of practitioners charged with AI/ML development efforts. From a policy perspective, the NIH’s AIM-AHEAD program, which specifically seeks to address health equity by amplifying the representation of minorities in AI/ML research and development, is a model that can potentially be expanded and replicated across states.25

Patient Literacy and Autonomy

A key aspect of health equity is that all patients have the cognitive resources and literacy to fully understand the nuances of the care being provided to them, including diagnoses and recommended treatments. Patients also have the right to expect that providers can explain the rationale underlying proposed care plans, and that final decisions about care will be made with joint discussion and deliberation.26

The introduction of AI/ML in direct patient care poses two concerns for health equity goals. First, with the introduction of AI/ML, patients may experience a diminished personal connection with the provider, further attenuating candor and potentially leading them to hide information from the physician or to avoid seeking care even when necessary. To the degree that reticence in information sharing is more prevalent among marginalized populations, this can further intensify the issue of data poverty for these groups. Recent public opinion surveys indicate that more than 50% of patients believe that the use of AI would negatively affect their relationship with their doctor.27

A second concern relates to patient autonomy: the ability of patients to have a voice and self-determination in their own care.28 Patients can exercise autonomy only when they have the competence to evaluate the nature and quality of care being offered to them. With the infusion of AI/ML into direct care delivery, several ethical dilemmas arise. What happens when the care plan is constructed by an algorithm? If the algorithm is a “black box,” as many ML models are,29 on what basis will the provider explain the recommendation to the patient? What is the nature of informed consent in such a setting? Does the patient have a right to know that AI is augmenting the doctor? The consequences of diminished patient autonomy are likely to be experienced most acutely by vulnerable populations who are already burdened by unequal health outcomes.

Conclusion

The promise of AI/ML to address effectiveness, equity, and efficiency outcomes in health care is robust. In recent work we have shown the myriad ways in which AI can augment the critical care delivery work of clinicians, potentially even helping to correct human biases they may exhibit.30 Yet this promise comes with the potential threat of further eroding equity in a system that already exhibits an unacceptable level of inequity. Practitioners, including leaders in health care delivery organizations, clinicians responsible for the delivery of direct patient care, and technology industry executives, must pay tenacious attention to the considerations discussed here and continually inspect and audit AI/ML applications for fairness. Policymakers must design incentives that promote equity considerations, such as benchmarks for model performance across subgroups prior to FDA approval, and continued audits of AI/ML models used in clinical settings to ensure relevance and adaptation to new research findings, clinical guidelines, and treatment protocols.31 Policies that encourage educational programs for the broader dissemination of AI/ML knowledge and understanding among vulnerable populations can be critical in bringing these groups to the table to collectively shape the development of applications. The Executive Order issued recently by the White House on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence32 represents an important step toward ensuring that AI/ML furthers health equity goals, embodying principles and actions reflected in the OECD Collective Action for Responsible AI in Health.33

Acknowledgments

We thank Jennifer Bagdasarian and Rui Han for research assistance.

The authors have no conflicts of interest related to the writing and publication of this article.