PT - JOURNAL ARTICLE
AU - Rohit Bhattacharya
AU - Collin Tokheim
AU - Ashok Sivakumar
AU - Violeta Beleva Guthrie
AU - Valsamo Anagnostou
AU - Victor E. Velculescu
AU - Rachel Karchin
TI - Prediction of peptide binding to MHC Class I proteins in the age of deep learning
AID - 10.1101/154757
DP - 2017 Jan 01
TA - bioRxiv
PG - 154757
4099 - http://biorxiv.org/content/early/2017/06/23/154757.short
4100 - http://biorxiv.org/content/early/2017/06/23/154757.full
AB - Prediction of antigens likely to be recognized by the immune system is a fundamental challenge for development of immune therapy approaches. We explore the utility of deep learning for in silico prediction of peptide binding affinity to major histocompatibility complex Type I molecules (pMHC-I binding). This process is a critical step in the immune system’s response to cancer cells, which may present highly specific neoantigen peptides bound to MHC proteins at the cell surface. With the advent of high-throughput sequencing and the recognition that somatic mutations in the exome can produce neoantigens, fast in silico prediction of these affinities has become increasingly relevant to precision cancer immunotherapy. We have developed five machine learning methods and use a benchmark from the Immune Epitope Database of experimental pMHC-I binding affinities to compare them to existing machine learning approaches. All methods were used to score, rank, and classify pMHC-I pairs. The best six methods, which include three of our own, were identified and found to make highly correlated predictions, even for individual pMHC-I pairs. The most effective deep learning methods were a gated recurrent unit and a long short-term memory neural network, enhanced by transfer learning.
These methods can handle peptides of any length without the need for artificial lengthening or shortening and were substantially faster than the most widely used standard neural networks. Major findings: The best in silico predictors of peptide major histocompatibility complex binding must be identified for application in precision cancer immunotherapy. We design and test a variety of machine learning methods for this purpose. We identify six best-in-class methods, three of our own design. Surprisingly, the best deep and standard machine learning methods make highly correlated predictions. Several standard methods run significantly slower and may have less utility as high-throughput sequence analysis for precision immunotherapy becomes more common. Performance of all methods varies by MHC allele, and most of this variance can be explained by data-driven rather than biological properties. Increasing the quantity of publicly available experimental data has the potential to improve all machine learning methods applied to this problem, and in particular deep learning methods.