Protein Inference
This page describes what protein inference is, why it is so complicated, and how it is handled in bioinformatics tools.
What is protein inference?
Shotgun proteomics identifies proteins via the peptides obtained by digesting complex protein mixtures. Protein inference is the task of inferring a set of proteins from the identified peptide sequences.
Why is it so complicated?
Shared peptides
This peptide-centric approach is inherently limited by shared peptides, also called degenerate peptides, whose sequence occurs in more than one protein. When such peptides are encountered, it is common practice to group the matching proteins into ambiguity groups; see Nesvizhskii and Aebersold for a detailed formalism of this so-called protein inference problem.
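As a minimal sketch of the grouping idea, the snippet below puts proteins that are supported by exactly the same peptides into one group. This is a simplified reading of an ambiguity group, not the formalism of Nesvizhskii and Aebersold nor the rule used by any particular tool; the peptide sequences, the accessions `P1` to `P4`, and the function name `ambiguity_groups` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: peptide sequence -> accessions of proteins containing it.
peptide_to_proteins = {
    "LVNELTEFAK": {"P1", "P2"},
    "AEFVEVTK": {"P1", "P2"},
    "YLYEIAR": {"P3"},
    "QTALVELLK": {"P3", "P4"},
}


def ambiguity_groups(peptide_to_proteins):
    """Group proteins that are supported by exactly the same peptides."""
    # Invert the mapping: protein -> set of peptides supporting it.
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins with identical peptide sets cannot be told apart
    # from the peptide evidence alone, so they share a group.
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)

    return sorted(sorted(members) for members in groups.values())


groups = ambiguity_groups(peptide_to_proteins)
print(groups)
```

In this toy case `P1` and `P2` end up in the same group. `P4` stays separate even though its only peptide is also explained by `P3`; how such subset cases are treated is one of the choices left to each tool.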
How shared peptides influence the scoring of protein matches is left to the discretion of the bioinformatician, and strongly affects the list of proteins validated at a desired quality level. For more details on the validation of identification results, see Nesvizhskii and Chapter 1.5 of our free tutorials for peptide and protein identification. Consequently, which proteins are retained in the end varies strongly between tools. The possible answers range from a minimal set, where one protein per ambiguity group is selected, to a maximal set, where every possible accession is retained, as reviewed here for example.
As a result, different software can report different proteins from the same peptide list. Various tools are available for this task: integrated in a larger environment like the Trans-Proteomic Pipeline (TPP) or OpenMS, built into an application like MassSieve, MaxQuant or PeptideShaker, or standalone like IDPicker.
Practical examples can be found in Chapter 1.4 of our free tutorials for peptide and protein identification.
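The gap between the two extremes can be illustrated with a small sketch. The maximal set is simply every accession matched by at least one peptide. For the minimal set, the code below uses a greedy set-cover heuristic, which is an assumption made for illustration: it is only an approximation of a truly minimal set and is not claimed to be the method of any tool named on this page. The data and function names are hypothetical.

```python
# Hypothetical toy data: peptide sequence -> accessions of proteins containing it.
peptide_to_proteins = {
    "LVNELTEFAK": {"P1", "P2"},
    "AEFVEVTK": {"P1", "P2"},
    "YLYEIAR": {"P3"},
    "QTALVELLK": {"P3", "P4"},
}


def maximal_set(peptide_to_proteins):
    """Every accession matched by at least one peptide."""
    return set().union(*peptide_to_proteins.values())


def minimal_set_greedy(peptide_to_proteins):
    """Greedy approximation of the smallest protein set explaining all peptides."""
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    unexplained = set(peptide_to_proteins)
    selected = set()
    while unexplained:
        # Pick the protein explaining the most unexplained peptides;
        # iterating in sorted order makes ties resolve to the first accession.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & unexplained),
            default=None,
        )
        if best is None or not protein_to_peptides[best] & unexplained:
            break  # remaining peptides map to no protein at all
        selected.add(best)
        unexplained -= protein_to_peptides[best]
    return selected


largest = maximal_set(peptide_to_proteins)
smallest = minimal_set_greedy(peptide_to_proteins)
print(sorted(largest), sorted(smallest))
```

Note that the choice of `P1` over `P2` is arbitrary here (alphabetical tie-breaking): the peptide evidence does not distinguish them, which is exactly why reported protein lists differ between tools.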
Technical implementation
Search engines provide their own peptide to protein mapping. However, the mapping can differ between search engines, or for a single search engine across runs, as reviewed here. For this reason, it is mandatory to remap every peptide to every protein when working with identification results. Remapping peptides to proteins can be very slow on large databases: for every run, the software has to search for thousands of peptides in the FASTA file.
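To make the cost of remapping concrete, here is a minimal sketch of one way to avoid scanning every protein for every peptide: a k-mer index built once per database narrows each lookup to a few candidate proteins, which are then checked by exact substring search. This is an illustrative assumption, not the algorithm of PeptideMapper or of any other tool mentioned here; the FASTA content, accessions, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical two-protein database in FASTA format.
FASTA = """>P1 hypothetical protein one
MKWVTFISLLLLFSSAYSR
GVFRR
>P2 hypothetical protein two
MKLVNELTEFAKGVFRR
"""


def read_fasta(text):
    """Parse FASTA text into {accession: sequence}."""
    sequences = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the accession.
            accession = line[1:].split()[0]
            sequences[accession] = ""
        elif accession is not None:
            sequences[accession] += line
    return sequences


def build_kmer_index(sequences, k=4):
    """Map every k-mer to the accessions of the proteins containing it."""
    index = defaultdict(set)
    for accession, sequence in sequences.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(accession)
    return index


def map_peptide(peptide, sequences, index, k=4):
    """Return the accessions of all proteins containing the peptide."""
    if len(peptide) < k:
        candidates = sequences.keys()  # too short for the index: scan everything
    else:
        candidates = index.get(peptide[:k], set())
    # Confirm each candidate with an exact substring search.
    return {acc for acc in candidates if peptide in sequences[acc]}


sequences = read_fasta(FASTA)
index = build_kmer_index(sequences)
for peptide in ["GVFRR", "LVNELTEFAK", "AAAAA"]:
    print(peptide, sorted(map_peptide(peptide, sequences, index)))
```

The index is built once and reused for every peptide. The exact-match check in this sketch does not handle ambiguous residues in the database.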
This task is further complicated by the presence of Xs (which can stand for any amino acid), of ambiguity codes such as B (D or N) or J (I or L), and of indistinguishable amino acids such as I and L: the software then has to test every possibility, and false peptide to protein matches might occur. An extreme example where protein inference breaks down is proteins containing long runs of Xs (like UniProtKB/Swiss-Prot Mucin-3A at the time of writing), which can map to basically any peptide.
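The effect of ambiguous residues can be sketched with a position-by-position comparison in which a database residue may stand for several amino acids. The rules below are assumptions chosen for illustration (X matches anything, B is D or N, J is I or L, I and L are interchangeable, plus the standard code Z for E or Q, which is not discussed above); they are not the matching rules of any specific tool, and the function names are hypothetical.

```python
# Amino acids each ambiguity code in the protein sequence may stand for.
# Z (E or Q) is included as an extra standard code; X is handled separately.
AMBIGUOUS = {
    "B": set("DN"),
    "Z": set("EQ"),
    "J": set("IL"),
}


def residue_matches(peptide_aa, protein_aa):
    """True if the peptide residue is compatible with the protein residue."""
    if protein_aa == "X":
        return True  # X can be any amino acid
    if peptide_aa in "IL" and protein_aa in "ILJ":
        return True  # I and L are treated as indistinguishable
    if protein_aa in AMBIGUOUS:
        return peptide_aa in AMBIGUOUS[protein_aa]
    return peptide_aa == protein_aa


def find_matches(peptide, protein):
    """Return all start positions where the peptide is compatible with the protein."""
    if not peptide:
        return []
    hits = []
    for start in range(len(protein) - len(peptide) + 1):
        window = protein[start:start + len(peptide)]
        if all(residue_matches(p, q) for p, q in zip(peptide, window)):
            hits.append(start)
    return hits


print(find_matches("PEPTLDE", "MKPEPTIDEK"))  # L in the peptide, I in the protein
print(find_matches("ANK", "GGABK"))           # N is compatible with B
print(find_matches("ANK", "XXXXX"))           # a run of Xs matches everywhere
```

The last call shows why runs of Xs are problematic: every window of the protein is compatible with the peptide, so the match carries no information.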
Peptide to protein mapping is implemented in online resources such as the Protein Information Resource and in software packages such as OpenMS. In this package, the mapping is done via PeptideMapper. PeptideMapper can also map partial sequences, similarly to DirecTag. It is used in PeptideShaker and DeNovoGUI. For more information please refer to the PeptideMapper wiki page.
For the sake of speed and quality, it is thus crucial that the protein database contains as few sequences as possible, with as little ambiguity as possible, while maintaining the best coverage of the proteins in the sample, including potential contaminants.