Abstract
If protein sequences are recoded with a binary alphabet derived from a division of the 20 amino acids into two subsets, a protein can be identified from its subsequences by searching through a recoded sequence database. A binary-coded primary sequence can be obtained for an unbroken protein molecule from current blockades in a nanopore. Only two (instead of 20) blockade levels need to be recognized to identify a residue’s subset; a hard or soft detector can do this with two current thresholds. Computations were done on the complete proteome of Helicobacter pylori (http://www.uniprot.org; database id UP000000210, 1553 sequences) using a binary alphabet based on published data for residue volumes in the range ∼0.06 nm³ to ∼0.225 nm³. With volumes normally distributed, more than 93% of binary subsequences of length 20 from the primary sequences of H. pylori are correct with a confidence level of 90–95%; they can uniquely identify over 98% of the proteins. Most of them have a large number of identifying subsequences so the false detection rate is low. Recently published work shows that a 0.5 nm diameter nanopore can measure residue volume with a resolution of ∼0.07 nm³, so the procedure described here is both feasible and practical. This is a non-destructive single-molecule method without the vagaries of proteolysis.
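The recoding and lookup idea can be illustrated with a minimal Python sketch. It is not the computation reported here: the set SUBSET_ONE is a hypothetical split of the 20 residues chosen only for illustration (it is not claimed to be the partition used in this work), the function names are invented for the example, and the two toy sequences are made up. The sketch recodes each sequence into a binary string, indexes every binary subsequence of length k, and reports a protein only when a subsequence occurs in exactly one database entry.

```python
from collections import defaultdict

# ASSUMPTION: hypothetical partition for illustration only. Residues in this
# set are coded '1'; every other residue is coded '0'.
SUBSET_ONE = frozenset("GASCDPNT")


def recode(seq):
    """Recode a primary sequence (one-letter codes) into a binary string."""
    return "".join("1" if aa in SUBSET_ONE else "0" for aa in seq)


def build_index(proteins, k):
    """Map each binary subsequence of length k to the set of proteins containing it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        code = recode(seq)
        for i in range(len(code) - k + 1):
            index[code[i:i + k]].add(name)
    return index


def identify(index, kmer):
    """Return the protein name if kmer occurs in exactly one protein, else None."""
    hits = index.get(kmer, set())
    if len(hits) == 1:
        return next(iter(hits))
    return None


if __name__ == "__main__":
    # Toy database with made-up sequences; a real run would load a proteome.
    toy = {"P1": "GAVLKGASW", "P2": "MKVLGGASTW"}
    idx = build_index(toy, k=5)
    for kmer in sorted(idx):
        print(kmer, identify(idx, kmer))
```

With a real proteome, k would be set to the subsequence length of interest (20 in the computations summarized above), and the fraction of proteins having at least one uniquely identifying subsequence could be tallied from the same index.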