Abstract
Prediction of antigens likely to be recognized by the immune system is a fundamental challenge for development of immune therapy approaches. We explore the utility of deep learning for in silico prediction of peptide binding affinity to major histocompatibiliy complex Type I molecules (pMHC-I binding). This process is a critical step in the immune system’s response to cancer cells, which may present highly specific neoantigen peptides bound to MHC proteins at the cell surface. With the advent of high-throughput sequencing and the recognition that somatic mutations in the exome can produce neoantigens, fast in silico prediction of these affinities has become increasingly relevant to precision cancer immunotherapy.
We have developed five machine learning methods and use a benchmark from the Immune Epitope Database of experimental pMHC-I binding affinities to compare them to existing machine learning approaches. All methods were used to score, rank, and classify pMHC-I pairs. The best six methods, which include three of our own, were identified and found to make highly correlated predictions, even for individual pMHC-I pairs. The most effective deep learning methods were a gated recurrent unit and a long short-term memory neural network, enhanced by transfer learning. These methods can handle peptides of any length without the need for artificial lengthening or shortening and were substantially faster than the most widely-used standard neural networks.
Major findings The best in silico predictors of peptide major histocompatibility complex binding must be identified for application in precision cancer immunotherapy. We design and test a variety of machine learning methods for this purpose. We identify six best-in-class methods, three of our own design. Surprisingly, the best deep and standard machine learning methods make highly correlated predictions. Several standard methods run significantly slower and may have less utility as high-throughput sequence analysis for precision immunotherapy becomes more common. Performance of all methods varies by MHC allele, and most of this variance can be explained by data-driven, rather than biological properties. Increasing the quantity of publicly available experimental data has the potential to improve all machine learning methods applied to this problem, and in particular deep learning methods.
Quick Guide to Equations and Assumptions
In silico prediction methods used in cancer research must be rigorously evaluated. The machine learning methods assessed here predict a continuous numeric response value (IC50) for each pMHC-I pair. The methods are considered to be supervised, because they learn to predict the response value from training examples, whose IC50 is assumed to be known. They are expected to make correct predictions on a set of independent, held-out test examples. Misleading and overly optimistic evaluations of supervised machine learning methods are common, in particular when there is an overlap of training and test data, or when training and testing is done on a small set of examples that do not reflect the true diversity of the problem space. The benchmark set used in this work [1] completely separates training and test examples and covers 50 MHC Class I alleles. Performance on this benchmark is a good indicator of the effectiveness of a prediction method. Each predicted response value can be related to an experimental pMHC-I IC50 measurement through the equation y = max(0,1 − log50KIC50). The equation is used to reduce the possible range of predicted values from between 0 and 50,000nM to between zero and one, a range of values that is more mathematically tractable for neural network optimization algorithms. The assumption is that the largest IC50 value that can be observed is 50,000nM.
There are many ways to evaluate the predictive performance of the output response values. We selected three metrics, one of which measures the correlation between the response values of each predictor and experimental measurements (Kendall-Tau correlation), two which assess each predictor as a classifier of binding vs. non-binding pMHC-I pairs, using an IC50 threshold of 500nM to separate binders and non-binders (AUC, F1 score).
To compute Kendall-Tau correlation, the response values of a predictor are ranked and compared to a ranked list of experimentally measured pMHC-I IC50 values from the Immune Epitope Database. The Kendall-Tau correlation is then: where nc is the number of concordant pairs (a prediction and observation are concordant if the pMHC-I pair has the same rank in both lists), nd is the number of discordant pairs. n0, n1, and n2 are used to handle rank ties.
The F1 and AUC scores are also computed by comparing the response values of a predictor with an ordered list of experimentally measured pMHC-I IC50 values. In the case of the F1 score, both lists (the list of predictions and the list of observations) are dichotomized by labeling binders and non-binders. The dichotomization assumes any IC50 ≤ 500nM is a binder and a non-binder otherwise. The two lists are then compared and precision is calcluated as the number of items in the list where the prediction of binder agrees with the observation, divided by the number of observed binders. Recall is calaculated as the number of items where the prediction of binder agrees with the observation, divided by the number of items in the list. For AUC, the list of predicted output values is compared to a dichotomized list of observations, and the true positive rate and false positive rate at every threshold on the list is computed. All metrics are described in detail in (Materials and Methods).
Introduction
Peptide binding to Major Histocompatibility Complex (MHC) proteins and formation of a stable binary complex (pMHC) is a key step in the activation of CD8+ (MHC Class I) and CD4+ (MHC Class II) T-cells. MHC Class I proteins bind peptides of 8-11 amino acid residues. The presentation of pMHCs on the surface of antigen-presenting cells and recognition by T-cell receptors is fundamental to the mammalian adaptive immune system. Recent advances in cancer immunotherapy have highlighted the need for improved understanding of which peptides will bind to MHC proteins and generate an immune response [2], [3], [4], [5], [6]. Because experimental characterization of pMHC binding is costly and time-consuming, computational researchers have been working for decades on in silico tools to predict pMHC affinities [7]. Approaches have included sequence-based profiles [8] [9], structure-based predictions [10], generative probabilistic models [11], and machine learning [12] [13] [14].
Neural network supervised machine learning methods are the most widely-used technique and have been shown to outperform other methods in multiple studies [15] [1] [16]. However, many researchers remain dissatisfied with the available in silico pMHC binding predictors for neoantigen discovery, particularly in a clinical setting [17] [18]. We explored whether the most recent generation of neural network algorithms and architectures, known as deep learning, could effectively improve pMHC binding predictions.
Neural networks were originally modeled on the human brain. They consist of connected units, organized into layers. Theoretically, given enough layers and units, they can act as universal function approximators [19]. However, as the number of units increases, they become prone to overfitting (reviewed in [20]). Deep learning research has introduced innovative network architectures and regularization techniques, allowing for training of very deep and wide networks to approximate complex functions, with reduced risk of overfitting. Given sufficient training data, deep architectures outperform other methods in many domains [21]. They are now widely used in robotics, self-driving vehicles, sentiment classification, machine translation, and user-based recommendation systems [22] [23] [24] [25] [26].
In this work, we assessed the performance of major deep learning architectures: convolutional neural networks (CNNs) [27] and recurrent neural networks, particularly gated recurrent units (GRUs) [28], and long short-term memory units (LSTMs) [29]. The deep learning methods were compared to several standard machine learning methods: fully-connected one-layer networks (NetMHC [30], NetMHCpan [31], and our own MHCnuggets-FC), a two-layer embedding network (MHCflurry) [32], and a Bayesian least-squares method (SMMPMBEC) [15]. We evaluated the methods on a carefully designed and previously published benchmark set of immuno-fluorescent binding experiments for diverse pMHC-I allele pairs, described hereafter as the Kim benchmark set [1], derived from the IEDB database [33]. The Kim benchmark set was cleaned for redundancy, using a protocol designed by [32]. In our hands, the best deep learning methods had similar prediction performance to the best standard machine learning methods. Furthermore, the performance of these methods was very similar both on the aggregate set of all alleles, and when alleles were evaluated individually. Predictions for individual pMHC-I pairs by the best methods were highly correlated (≥ 0.9). Convolutional neural networks, a popular deep learning technique, had poorer performance and do not appear to be well suited for pMHC-I binding prediction (Results). Of the best-performing six methods, three methods developed by us (MHCnuggets-FC, MHCnuggets-LSTM, MHCnuggets-GRU) had the fastest runtimes. In contrast to standard neural networks, which require that all inputs be of the same length, the deep learning networks MHCnuggets-LSTM and MHCnuggets-GRU can handle peptides of any length. They do not require the artifical lengthening and shortening heuristics (padding and cutting) that are used by MHCnuggets-FC, MHCflurry, NetMHC, and NetMHCpan.
MHC proteins are highly polymorphic, and there are thousands of MHC Class I alleles, each with a different peptide binding surface. While there are tens of thousands of experimentally characterized binding affinities in the Kim benchmark set, they are distributed unevenly across the alleles. For the most common alleles, particularly those associated with European ancestry, there are as many as 9000 characterized peptides, while for others there may be as few as 100. When we stratified prediction performance by individual alleles, performance was seen to vary widely by allele, but not by the in silico method.
We explored whether the allele-specific differences in in silico binding prediction were data-driven or due to biological properties of the different alleles. Most of the variance in prediction performance could be explained by differences training or test set size, class imbalance, or sequence diversity of peptides in the training set. Because the differences appeared to be primarily data-driven, we designed a training protocol in which information from alleles with the richest training data was shared with other alleles, a form of transfer learning [34]. We clustered the MHC Class I alleles based on the utility of the transferred information between them, yielding improvements in prediction performance for most alleles (Materials and Methods, Results).
Deep learning was originally designed for computational problems where a large amount of training data is available. As data size increases, deep learning performance can increase dramatically, as has been shown in the field of image proccessing, where deep learning methods quickly advanced to be the best in class by a wide margin [35]. It is likely that deep learning methods will benefit the most from increases in experimental training data and will become increasingly relevant to the field of immunology.
Software for the MHCnuggets neural networks described in this work is available at https://github.com/KarchinLab/mhcnuggets.
Results
Overview of approach
We developed a new machine learning approach to predict peptide binding to MHC Class I proteins, using deep recurrent neural networks. Our approach differs from previous machine learning methods designed for the same task, which augment their training data artificially or train separate networks for each peptide length. The artificial augmentation used by previous methods includes in silico shortening and lengthening of each peptide and generation of non-binding peptides, both of which introduce noise. The recurrent neural networks use only experimentally verified peptide-MHC binding pairs as training data and naturally handle peptides of different lengths. We trained networks for individual MHC alleles and developed a protocol to share information between networks. We hypothesized that this approach could achieve state-of-the art prediction performance at significantly faster speeds. Additionally, we reasoned that these methods could be adapted to improve standard neural networks and to develop a standard network with similar performance and faster runtime.
The best deep and standard machine learning methods have similar prediction performance in aggregate testing of all MHC Class I alleles
We tested several deep learning methods: convolutional neural networks (CNNs) [27], and recurrent neural networks, in particular long short-term memory networks (LSTMs) [29] and gated recurrent units (GRUs) [28]. In total we tested five architectures, four designed by us. We also tested two types of standard machine learning methods: fully connected one- or two-layer networks and a Bayesian matrix method. Five standard methods in total were tested, one designed by us. The best performing deep methods (MHCnuggets-LSTM, MHCnuggets-GRU) and the best standard methods (MHCnuggets-FC, MHCflurry, NetMHC, NetMHCpan) had similar prediction performance. Deep convolutional neural networks (CNNs) had the weakest performance (Table 1).
To the best of our ability, all training and testing of methods was done with the Kim benchmark set. However, for NetMHC and NetMHCpan, open source training software was not available, and we used their self-reproted results on the Kim benchmark set. These methods have been previously reported to augment their training sets with additional proprietary data, beyond what is in the Kim benchmark training set [1, 32], so their self-reported performance estimates may be overly optimistic.
Prediction performance was assessed with three metrics (Materials and Methods). Dichotomous classification of peptides as binders or non-binders was measured with area under the ROC curve (AUC) and F1 scores, with IC50 of 500nM used as the separating threshold [36]. Continuous prediction of IC50 measurements was assessed with the Kendall-Tau correlation coefficient.
The predictions of the best deep and standard machine learning methods are highly correlated
We identified six best-in-class methods (Table 1) and for each method, ranked the predicted IC50 binding affinities for all pMHC-I partners in the Kim benchmark set. The ranked predictions were strikingly correlated (Spearman-Rho rank correlation ≥ 0.9 for all method pairings) (Figure 2). Given that the six methods use different machine learning algorithms, training protocols, and network architectures, the high rank correlation on the level of individual predictions was surprising.
For individual MHC Class I alleles in silico prediction performance varies
Next, we compared how the six best performing methods from Table 1 handled each of the MHC Class I alleles. Performance was assessed with the Kendall-Tau correlation. (Figure 3). For a few of these alleles, one of the methods slightly outperformed the others. For example, MHCnuggets-LSTM has slightly higher Kendall-Tau correlations for HLA-A*26:02 and HLA-B*15:09 than other methods. However, in general the alleles could be clustered into two groups, a group of alleles in which all the methods performed well and a group in which none of the methods performed well (dendogram on left hand side of Figure 3). This result is consistent with the high Spearman rank correlation among the methods with respect to predictions of pMHC-I affinities (Figure 2), and their similar performance on the Kim test benchmark set (Table 1).
The best six methods index the columns of the heatmap in Figure 3, and the 50 MHC Class I alleles in the Kim test set index the rows. Each cell represents the performance of a method on an allele’s test set, measured by Kendall-Tau correlation. For example, the top right cell shows the Kendall-Tau for MHCnuggets-LSTM and H-2-Kb. Figure S1 shows heatmaps for F1 score and AUC.
Allele-specific differences in in silico prediction performance are largely data-driven
We considered whether properties of the training and test data in the Kim benchmark set could explain the variation in prediction performance for different MHC Class I alleles. For each of the performance metrics, we used ANOVA comparison of nested linear regression models to assess the variance in performance explained by: the number of training examples (peptides with measured binding affinities) for each allele, the class imbalance in binder and non-binder peptides in the training examples, the binder/non-binder class imbalance in the test set, the sequence diversity of the peptides in the training examples (measured by network modularity) (Materials and Methods), the in silico method used (only the best six methods were considered), and the number of examples in the test set. Choice of in silico method and test set size did not explain any additional variance in performance, once the most informative covariates were considered. For performance measured by Kendall-Tau correlation, 70.4% of the variation across alleles was explained by test set class imbalance (65.5%, p=5e-11), training set size (4.1%, p=5e-11), and sequence diversity of all peptides in the training set (0.5%, p=0.01). For performance measured by F1 score, 56.4% of the variation across was alleles was explained by training set class imbalance (35.3%, p=1e-34), training set size (9.8%, p=4e-14), test set class imbalance (3.8%, p=7e-07), binder diversity (3.9%, p=2e-07), all diversity (3.6%, p=1.62e-07). For performance measured by AUC, only 25% of the variation was explained by test set class imbalance (12%, p=1e-11) and training set size (12.7%, p=2e-13), but we believe that this is because the AUC is the least stringent of the three metrics. Our results suggest that properties of the training and test data explain a large proportion of the variance among performance of in silico methods across different MHC Class I alleles, and that biochemical properties of allele-specific pMHC-I complexes are not the main contributor.
Transfer learning improves prediction performance
Transfer learning is a technique commonly used in deep learning, in which information is transfered between distinct but related problem domains [34]. We designed a new transfer learning protocol for neural networks trained on individual MHC alleles (Figure 4), and applied it to the best performing networks (MHCnuggets-GRU, MHCnuggets-LSTM, MHCnuggets-FC). The protocol was compared to standard random initialization of network weights with the Glorot uniform distribution [37]. Transfer learning improved prediction performance for most of the alleles, and for several alleles which had the worst in silico prediction performance across all tested methods, the improvement was substantial (e.g., HLA-B*08:02, HLA-B*08:03, HLA-A*25:01, HLA-B*15:17) (Figure 5). The relationships between alleles defined by utility for transfer learning had overlap to standard MHC nomenclature and to MHC allele organization into supertypes [38] [39]. The performance differences measured by Kendall-Tau correlation and AUC are in Figure S2.
Runtime of the best deep and standard machine learning methods varies substantially
We computed the runtime to score all the peptides in the Kim benchmark set (163,898) for a single MHC Class I allele, for the six best perfoming methods. Computations were repeated five times and averaged. The fastest method was MHCnuggets-FC (7.32s±0.05) and the slowest method was NetMHCpan (282.58s±18.64). The best-performing deep learning MHCnuggets methods were relatively fast (22.76s±0.5 for MHCnuggets-GRU, 23.91s±0.37 for MHCnuggets-LSTM). Timing was done on a single compute node containing 2 Intel Xeon E5-2680v3 (Haswell) processors running at 2.5GHz, with 12 cores and 128 GB RAM.
The runtime differences are substantial if we estimate the time to run each predictor on a cohort of cancer patients, based on their somatic missense mutations. For example, the head and neck cancer (HNSC) cohort from TCGA has 495 samples (as of June 15, 2017), with six potential MHC Class I alleles per patient (3 loci, possibly all heterozygous). If we consider all possible 8-11 length windows around each somatic missense mutation, the average candidate peptides per patient is 4128. For MHCnuggets-FC, the run would take approximately 9 minutes, for MHCnuggets-GRU and MHC-nuggets-LSTM approximately 30 minutes, and for NetMHCpan approximately 6 hours. Complete timing results are shown in Table S1.
Discussion
Recent advances in precision cancer immunotherapy have highlighted the need for improved in silico methods to predict peptide-MHC (pMHC) binding affinities. A patient’s repertoire of somatic mutations may yield processed peptides, which are presented to T-cell receptors by MHC proteins to trigger an immune response. Therefore, high-throughput in silico prediction of which somatic mutations are most likely to generate immunogenic peptides has increasing clinical utility. These mutations may serve as biomarkers to identify patients most likely to respond to checkpoint blockade and may also have utility for engineered vaccines, designed to stimulate immune response and destroy cancer cells. [17] [18]. As new in silico methods are developed, it is critical that they are carefully assessed to determine whether new algorithms, such as deep learning networks, provide advantages over established ones. In this work, we focused on the Kim benchmark set [1], which contains a sizeable and heterogeneous population of experimentally characterized peptides and pMHC-I affinities, cleanly separated into training and test data. By comparing ten methods with the Kim benchmark, we avoided bias that could be introduced by smaller or less diverse benchmark sets. Of the ten methods tested, we found that six were best-in-class with respect to prediction performance. Three of the methods, which were designed by us, also had signficantly faster run-times, which may be important to users who are interested in scalable, high-throughput neoantigen prediction pipelines.
Because MHC alleles are highly polymorphic, particularly at the peptide binding site, each allele has its own repertoire of binding peptides. Experimental tests of pMHC-I binding have not been uniformly applied to all alleles, with certain alleles such as HLA-A*02:01 represented with 12,357 experiments in the Immune Epitope Database (IEDB) [33] and 9,565 in the Kim benchmark. Other alleles such as HLA-B*08:03 have very few experimentally validated pMHC-I affinities (470 in IEDB, 217 in the Kim benchmark). Furthermore, the class imbalance between experimentally tested binders and non-binders ranges from 0.041 in HLA-B*08:02 (19 binders, 468 non-binders) to 0.86 Mamu-A*02 (1003 binders, 1261 non-binders). In silico predictors work better for some MHC class I alleles than others. It has been hypothesized that there is a biological/chemical/structural explanation, so that an immune response might be generated from weaker affinity peptide binding for some alleles, but only stronger affinity for others. If this is true, 500nM may not be a good separator of binding and non-binding peptides for all alleles, and allele-specific affinity thresholds should be considered [40]. While we cannot rule out the relevance of biologically-driven factors, we found that the most of the variance in prediction performance across the different alleles could be explained by several data-driven factors. We designed a linear regression in which the response variable was a performance metric (Kendall-Tau correlation, F1, or AUC) as calculated for the best six methods, and the covariates were values that differed across the 50 alleles in the Kim benchmark. Choice of in silico method was also considered as a covariate. For Kendall-Tau correlation, test set class imbalance, training set size, and sequence diversity of training set peptides explained 70.4% of the variation. For F1 score and AUC, covariates explained 56.4% of the variation and 25% of the variation, respectively. Choice of in silico method did not explain any additional variance in performance for any of the metrics, once the most informative covariates were considered. Our results suggest that properties of the training and test data explain a large proportion of the variance among performance of in silico methods across different MHC Class I alleles, and that biochemical properties of allele-specific pMHC-I complexes are not the main contributor.
When we applied the machine learning technique of transfer learning to MHCnuggets-GRU, MHCnuggets-LSTM, and MHCnuggets-FC, we observed a substantial increase in performance, in particular the F1 score and Kendall Tau coefficient. Our transfer learning approach leveraged a network of related MHC Class I alleles. Base networks were trained on alleles with a larger number of training examples, and trained weights were transferred to initialize the network of a related allele. Our results indicate that transfer learning is a useful technique for pMHC-I binding prediction, in particular for alleles with few training examples. The relationships between alleles that we found empirically were in agreement with previously described MHC supertypes, but also included cases where more distant relationships provided useful transferable information from one allele’s training data to another.
Of the deep learning methods we investigated, CNNs were the least successful. We attribute this result to the importance of positional information in the pMHC-I binding process. It is well known that specific residues are required at anchor positions [41] The filter sizes typically used in CNNs were designed for problems in which position invariance is critical, such as identifying an object regardless of its position or angle within an image. In contrast, the presence of a leucine amino acid residue has different meaning at a pMHC-I anchor position than elsewhere, thus position invariance is detrimental. We were able to improve CNN performance by increasing filter size, but they were still unable to match the performance of the GRU, LSTM, or standard neural networks.
Our work highlights the critical role of experimental training data for further development of in silico pMHC binding predictors. Even the best methods identified by our benchmarking efforts have sharply decreased performance for those MHC-I alleles having a small and/or highly imbalanced number of experimental measurements, and most of the variance in predictive performance could be explained by these data-driven factors. We encourage anyone who is measuring pMHC binding affinities through immuno-fluorescent assays or other technologies to deposit their results in public databases such as IEDB.
Materials and Methods
Dataset
The Immune Epitope Database (IEDB) [33] provides a comprehensive public set of experimentally characterized peptides and pMHC binding affinities. Database entries were curated from published literature, and the majority of affinities were calculated based on immunofluorescent assays. These affinities are represented as an IC50 value, the half-maximal inhibitory concentration in nano-molar (nM) units of peptide to MHC molecules. A total of approximately 250,000 examples from immunofluorescent assays was available as of May 2017, spanning multiple mammalian and avian species, and 740 MHC alleles, of which 459 are MHC Class I.
For purposes of benchmarking in silico pMHC Class I binding predictors, Kim et al. generated a new dataset from IEDB, partitioned into training and test sets [1]. The dataset was further processed by a shell script [32] and available at https://github.com/hammerlab/mhcflurry/tree/master/downloads-generation/data_kim2014, that removed any peptide in the test set with identical length and ≥ 80% sequence identity to a peptide in the training set. The resulting benchmark is referred to in this work as the Kim benchmark. It contains 106 unique MHC alleles and 137,654 IC50 measurements, published prior to 2009 (training set) and 53 unique MHC alleles with 27,680 IC50 measurements, published from 2009-2013 (test set). All peptides in the Kim benchmark set consist of 8-11 amino acid residues.
In silico methods
MHCnuggets
Recurrent neural networks
MHCnuggets-GRU is a gated recurrent unit (GRU) [28] with a fully connected layer of 64 hidden units, regularized with a dropout and recurrent dropout [42] probability of 0.2, trained for 200 epochs. MHCnuggets-LSTM is a long short-term memory network [29] with a fully connected layer of 64 hidden units, regularized with a dropout and recurrent dropout probability of 0.2, trained for 100 epochs. Both accept variable length inputs.
Convolutional neural networks
MHCnuggets-Chunky-CNN and MHCnuggets-Spanny-CNN are 1D convolutional neural networks [27]. The Chunky-CNN fits network weights to sliding windows (kernels) across each peptide, which cover two or three amino acid residues. The Spanny-CNN has an additional kernel that covers the entire peptide. Both networks have a single fully connected layer, with 64 hidden units. These networks require fixed length inputs.
MHCnuggets-FC
MHCnuggets-FC is a fully-connected single-layer network with 64 hidden units, regularized with a dropout probability of 0.8, trained for 250 epochs. This network requires fixed length inputs, but we designed a simplified cutting and padding procedure to minimize noise and computation time.
For all MHCnuggets architectures, a separate network was trained for each MHC allele in the Kim benchmark set. Additional technical details are provided in Supplementary Materials and Methods.
Transfer learning
We designed a transfer learning protocol that leverages relationships between MHC alleles, whose peptide binding surfaces may have common biochemical properties. We reasoned that if such a similarity existed, the final learned weights for one MHC allele’s network could contain useful information for other alleles. In the transfer learning protocol, the final weights from one network are used as the initial weights for training subsequent networks. To identify potential similarities, we applied an empirical, bottom-up approach (Figure S3). First, we extracted the network weights from the MHC allele with the largest number of training examples (HLA-A*02:01). The trained weights from this network were used to intialize and train networks for the remaining 50 alleles. For some alleles, HLA-A*02:01 was not expected to have the most transferable information, so we applied all of the trained networks to score the training examples for all alleles. In many cases, there was an allele whose network had better predictive performance than HLA-A*02:01 at this step. For these alleles, we extracted the trained weights from the best performing network and used them to initialize a new round of network training. Figure 4 shows the final transfer learning protocol, with nodes representing the networks for each allele and edges represent the direction and transfer of weights from one network to another. We used AUC to measure network performance (all three performance metrics were highly correlated) and required that to be utilized for transfer learning, a source network had a larger number of training examples than its target and a perfomance AUC ≥ 0.9. To avoid overfitting, no examples from the Kim benchmark test set were used.
NetMHC/NetMHCpan
NetMHC [30] is a single-layer fully connected neural network that encodes nine amino acid residue peptides with a 378-length input vector, which incorporates both a smoothed one-hot encoding (0.9 and 0.05 replace 1 and 0) and a BLOSUM-62 encoding of each amino acid. To accommodate peptides that are shorter or longer than nine residues, contiguous padding or cutting operations are applied at every possible position. When padded or cut versions of the same peptide receive different predicted binding affinities, the strongest affinity is selected. Separate networks are trained for each MHC allele.
NetMHCpan [31] is also a single-layer fully connected neural network that encodes both peptide and polymorphic residues of the relevant MHC allele. It uses the same smoothed one-hot plus BLOSUM-62 encoding and padding/cutting protocol as NetMHC. A single network is trained for all MHC alleles. NetMHC and NetMHCpan are currently the most widely used in silico pMHC-I binding tools.
Both methods use artificially generated non-binding peptides, by applying the NetChop algorithm [43] to the entire human proteome.
MHCflurry
MHCflurry [32] is a neural network, which jointly discovers informative amino-acid residue encodings and predicts pMHC-I binding affinities. Each amino acid residue type is assigned an integer, and peptides are encoded as 9-length vectors of integers. An initial embedding layer maps input peptides to a 32-dimensional space, which then feeds into a fully connected layer. Padding and cutting of peptides that are shorter or longer than 9 residues is done identically to the protocol of NetMHC/NetMHCpan. The final predicted binding affinity is the geometric mean of all padded and cut versions of the same peptide. MHCflurry augments its training data with peptides randomly generated in silico from a uniform distribution.
SMMPMBEC
SMMPMBEC [15] is a Bayesian framework for regularized least-squares regression. Peptides to be used for training are represented with one-hot encoding and stacked to form a single NxL matrix, where N is the number of peptides and L is the peptides’ length x 20. A scoring matrix is trained to minimize the error between N experimental binding affinities and the matrix product of the peptide matrix and the scoring matrix. The optimization is constrained by a pre-trained 20x20 pairwise substitution matrix for the amino acids, based on the covariance of their contributions to pMHC-I binding free energy in different contexts. A scoring matrix is computed for each MHC allele and each peptide length. To predict the pMHC-I binding affinity for a peptide of interest, its one-hot representation is multiplied by the scoring matrix.
HLA-CNN
HLA-CNN [13] is a deep learning convolutional neural network, consisting of an embedding layer, two 1D convolutional layers, and a fully connected layer. The input to the embedding layer is an Lx15 matrix, which is initialized with a learned peptide representation based on the natural language processing (NLP) skip-gram model [44]. Skip-gram is an unsupervised technique to discover word embeddings. Sentences are encoded as vectors of integers and projected into a lower dimensional space. HLA-CNN treats peptides as sentences and amino acid residues as words. The initial embedding is learned from the set of all peptides across all MHC alleles in the training set. The convolutional layers consist of 32 filters, with kernel size of 7, stride=1, and are initialized with Glorot normal distribution [37]. A network is trained for each MHC allele and each peptide length.
Comparison metrics
All in silico pMHC-I binding predictors assessed in this work are fundamentally regression methods, which produce a continuous-valued prediction, expected to be related to the true experimental IC50 measurement. Continuous-valued predictions can be thresholded at relevant biological IC50s to produce dichotomous class predictions (binder or non-binder). We used an IC50 threshold of 500nM, where a value ≤ 500nM implied a binder and > 500nM implied a non-binder [36]. We applied the Kendall-Tau correlation coefficient for continuous-valued predictions (Equation 2) where nc is the number of concordant pairs, nd is the number of discordant pairs, and n0, n1, and n2 are given by Equations 3, 4, 5. where n is the total number of experimental and predicted binding affinity pairs, ti is the number of tied values in the ith group of tied experimental affinities, and uj is the number of tied values in the jth group of tied predicted affinities.
The Kendall-Tau coefficient provides assessment of whether continous predictions and experimental measurements have a monotonic relationship and is robust to non-linearity.
We used area under the ROC curve (AUC) and F1 (Equation 6) metrics to assess dichotomous class predictions. AUC assesses the dichotomous classification between binders and non-binders at any threshold of a predictor’s continuous output. At each threshold, the true positive rate (precision) and false positive rate is computed and a curve is plotted (Figure 1A). The final metric is the area under this curve. where and . Binders are defined as peptides with IC50 ≤ 500nM.
A good AUC does not guarantee that the separation occurs at a biologically relevant threshold. In particular, if the output does not have a good inverse mapping to experimental measurements, the best separation may occur at an uninformative threshold. The F1 score ensures that a dichotomous classification occurs at a selected threshold, and the threshold can be selected to be biologically relevant. It provides a balanced summary of the precision and recall of a dichotomous classification.
Variability in prediction performance
For each of the 50 MHC Class I alleles in the Kim benchmark test set, we calculated the following covariates: the number of training examples n, the training example class imbalance where nb is the number of training examples ≤ 500nM and nnb is the number > 500nM (in practice we use to handle cases where nb > nnb), the test example class imbalance, the network modularity of binding peptides in the training set, and the network modularity of all peptides in the training set. The ANOVA R package was utilized to measure the variance explained by each covariate, using nested linear regression models and with either F1 score, Kendall-Tau score, or AUC as the response variable. covariates were added to each model with a greedy maximization of variance algorithm, in which the covariate that increased the variance the most at each iteration was added to the model, if it had a p-value < 0.05.
Network modularity of peptide sequence analysis
We constructed a network for each of the 50 MHC Class I alleles in the Kim benchmark test set. The nodes of these networks were the peptide sequences found in the training set of each allele. Two nodes were connected by an edge if the sequence identity between the peptides was > 0 (i.e., they shared at least one identical amino acid residue at the same position). Edges were weighted by sequence identity (normalized by peptide length). Peptides were trivially aligned by cutting and padding them to length nine, as described in (MHCnuggets). Community detection was performed with the fast unfolding algorithm [45], as implemented by the community_multilevel function in igraph. Once nodes were assigned to communities, the weighted modularity of the community assignment was calculated as: where Aij is the weight of the edge between i and j, ki = ∑j Aij is the sum of the weights of the edges associated with vertex i, ci is the community to which vertex i is assigned, δ is an indicator function such that δ(u, v) = 1 if u = v and is 0 otherwise, and .
Acknowledgments
We would like to thank the MHCflurry team for their inspiration and the obvious influence on the name of our methods, MHCnuggets. MHCflurry open sources all of their software and data, including scripts for pre- and post-processing that we used in this work.
Footnotes
↵* E-mail: karchin{at}jhu.edu