Protein identification with a binary code, a nanopore, and no proteolysis

G. Sampath

doi:10.1101/119313

Abstract

If protein sequences are recoded with a binary alphabet derived from a division of the 20 amino acids into two subsets, a protein can be identified from its subsequences by searching a recoded sequence database. The binary-coded primary sequence of an unbroken protein molecule can be obtained from current blockades in a nanopore. Only two (instead of 20) blockade levels need to be recognized to identify a residue’s subset; a hard or soft detector can do this with two current thresholds. Computations were done on the complete proteome of Helicobacter pylori (http://www.uniprot.org; database id UP000000210, 1553 sequences) using a binary alphabet based on published data for residue volumes, which range from ∼0.06 nm³ to ∼0.225 nm³. With residue volumes assumed to be normally distributed, more than 93% of the binary subsequences of length 20 from the primary sequences of H. pylori are correct with a confidence level of 90–95%, and these subsequences uniquely identify over 98% of the proteins. Most proteins have a large number of identifying subsequences, so the false detection rate is low. Recently published work shows that a 0.5 nm diameter nanopore can measure residue volume with a resolution of ∼0.07 nm³, so the procedure described here is both feasible and practical. It is a non-destructive single-molecule method that avoids the vagaries of proteolysis.
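The recoding and database search can be illustrated with a minimal Python sketch. It is not the author's implementation. The subset split (`SMALL`), the function names, the toy sequences, and the subsequence length of 5 are all assumptions chosen for illustration; the abstract does not specify the actual subsets, and the reported computations use subsequences of length 20 on the full H. pylori proteome.

```python
from collections import defaultdict

# Assumed split for illustration only: the ten residues with the smallest
# commonly tabulated volumes map to '0', all others to '1'.
SMALL = set("GASCDPNTEV")


def recode(seq):
    """Recode an amino acid sequence into a binary string.

    Any character not in SMALL (including non-standard residue codes)
    is mapped to '1' in this sketch.
    """
    return "".join("0" if aa in SMALL else "1" for aa in seq)


def identifying_subsequences(db, k):
    """Return, for each protein id, the binary k-mers found in that protein only."""
    owners = defaultdict(set)  # binary k-mer -> ids of proteins containing it
    for pid, seq in db.items():
        code = recode(seq)
        for i in range(len(code) - k + 1):
            owners[code[i:i + k]].add(pid)

    unique = {pid: set() for pid in db}
    for kmer, pids in owners.items():
        if len(pids) == 1:
            unique[next(iter(pids))].add(kmer)
    return unique


# Hypothetical toy database; these are not real proteome entries.
toy_db = {
    "P1": "MKTAYIAKQR",
    "P2": "GAVLIPFWMK",
    "P3": "MKTAYGGSDE",
}

result = identifying_subsequences(toy_db, k=5)
for pid in sorted(result):
    print(pid, recode(toy_db[pid]), sorted(result[pid]))
```

A protein with at least one identifying subsequence can be recognized from a correctly read binary fragment of that length. The more identifying subsequences a protein has, the more chances a partial read has of containing one.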

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.