HLAEquity: Examining biases in pan-allele peptide-HLA binding predictors

A. Conev, R. Fasoulis, S. Hall-Swan, R. Ferreira, and L. E. Kavraki, “HLAEquity: Examining biases in pan-allele peptide-HLA binding predictors,” iScience, vol. 27, no. 1, Jan. 2024.

Abstract

Peptide-HLA (pHLA) binding prediction is essential in screening peptide candidates for personalized peptide vaccines. Machine learning (ML) pHLA binding prediction tools are trained on vast amounts of data and are effective in screening peptide candidates. Most ML models are reported to generalize to HLA alleles unseen during training (“pan-allele” models). However, the use of datasets with imbalanced allele content raises concerns about biased model performance. First, we examine the data bias of two ML-based pan-allele pHLA binding predictors. We find that the pHLA datasets overrepresent alleles from geographic populations of high-income countries. Second, we show that the identified data bias is perpetuated within ML models, leading to algorithmic bias and subpar performance for alleles expressed in low-income geographic populations. We draw attention to the potential therapeutic consequences of this bias, and we challenge the use of the term “pan-allele” to describe models trained with currently available public datasets.

Publisher: http://dx.doi.org/10.1016/j.isci.2023.108613

PDF preprint: http://kavrakilab.org/publications/conev2024-hlaequity.pdf