- FASTA Format
- Non Standard FASTA
- UniProt Databases
- IPI Databases
- Decoy Sequences
- Mascot Users
Besides the spectra themselves, the sequence database to search in is the most important input of the search. If a sequence is not in the database the corresponding peptide/protein cannot be identified.
On the other hand, adding sequences of proteins that cannot occur in your experiment will increase the rate of false identifications. It is therefore essential to use the correct sequence database.
FASTA Format
The standard format for sequence databases is called FASTA, and SearchGUI requires that the sequences are stored in this format. A FASTA file usually ends with .fasta, but .fast and .fas are also used.
In a FASTA file each sequence is represented by a header and the sequence itself. The header contains information about the protein, e.g., protein accession number, database and species. However, the format of the header varies from database to database. SearchGUI supports the most commonly encountered databases, such as UniProt, Ensembl, NextProt, NCBI and IPI, plus a long list of other databases.
It is strongly recommended to use one of the standard databases, and of these UniProt is the preferred option.
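To make the header/sequence structure concrete, here is a minimal Python sketch of a FASTA reader. It is only an illustration and not how SearchGUI parses its input: the function name `read_fasta` is hypothetical and the two sequences are invented for the example.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text stream.

    The header is returned without the leading '>'; sequence lines
    belonging to one entry are joined into a single string.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented sequences; the headers reuse the generic examples from this page.
example = io.StringIO(
    ">generic_contig-535081|AC:123132|Hypothetical protein\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    ">generic|AC:123132\n"
    "MSEQNNTEMT\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

Each entry spans from one `>` line to the next, so a sequence wrapped over several lines is still read as a single protein.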
Non Standard FASTA
Non-standard, home-made sequence databases with non-standard headers can also be used, but downstream usage, e.g., in PeptideShaker, may be limited.
Databases that do not match the standard header formats of the common databases (like UniProt, NCBInr, etc.) can be added using a generic header format (supported from SearchGUI version 1.7.3 and PeptideShaker version 0.14.6 onwards):
>generic[your tag]|[protein accession]|[protein description]
or
>generic[your tag]|[protein accession]
[your tag] can be empty. Examples:
>generic_contig-535081|AC:123132|Hypothetical protein
>generic|AC:123132|Hypothetical protein
>generic|AC:123132
AC:123132 will then be used as the protein accession number and Hypothetical protein as the protein description (if provided).
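The generic header layout can be sketched in Python as follows. This is an illustrative parser written from the format description above, not the code SearchGUI or PeptideShaker uses; the names `GenericHeader` and `parse_generic_header` are hypothetical.

```python
from typing import NamedTuple, Optional


class GenericHeader(NamedTuple):
    tag: str                    # user-defined tag, may be empty
    accession: str              # protein accession
    description: Optional[str]  # None when no description is given


def parse_generic_header(line: str) -> GenericHeader:
    """Split a '>generic[your tag]|accession[|description]' header line."""
    prefix = ">generic"
    if not line.startswith(prefix):
        raise ValueError("not a generic header: %r" % line)
    # Split into at most three fields so that a '|' inside the
    # description is kept as part of the description.
    fields = line[len(prefix):].rstrip("\n").split("|", 2)
    if len(fields) < 2 or not fields[1].strip():
        raise ValueError("missing accession in header: %r" % line)
    tag = fields[0]
    accession = fields[1].strip()
    description = None
    if len(fields) == 3:
        description = fields[2].strip() or None
    return GenericHeader(tag, accession, description)


parsed = parse_generic_header(">generic_contig-535081|AC:123132|Hypothetical protein")
print(parsed.tag, parsed.accession, parsed.description)
```

Applied to the three example headers above, the accession is `AC:123132` in every case, and the description is `None` for the header that omits it.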
If you have a sequence database that cannot be parsed, please let us know by setting up an issue at the SearchGUI home page.
To parse databases with generic FASTA headers in Mascot we recommend the following Mascot database parse rules:
accession: ">generic[your tag]|\([^|]*\)|(.*)"
description: ">generic[your tag]|[^|]*|\(.*\)"
Replace or remove the [your tag] part depending on whether you have a user-defined tag or not.
UniProt Databases
As mentioned above it is strongly recommended to use sequence databases based on SwissProt/UniProt. This ensures that you get reviewed, maintained and well-annotated sequences that can easily be linked to a long list of other resources.
To get the SwissProt/UniProt sequence database for your species, go to www.uniprot.org and type organism:"name of your species" into the Query field at the top, e.g., organism:"homo sapiens".
Next, select one of the three provided options: show only reviewed entries (UniProtKB/Swiss-Prot), show unreviewed entries (UniProtKB/TrEMBL), or show only entries from a complete proteome set. Which option to choose depends on the properties of your experiment and on how well-annotated your given species is. (See http://www.uniprot.org/faq/7 for details.)
Next click on the Download link in the upper right corner, and under the FASTA header select and download either Canonical sequence data in FASTA format or Canonical and isoform sequence data in FASTA format. Again the choice depends on the properties of your experiment. (See http://www.uniprot.org/faq/30 for details.)
Note that the presence of isoforms makes the downstream protein inference task more complex. It is generally advised to start with a simple database and increase its complexity later if needed.
For more details on UniProt databases see http://www.uniprot.org/faq/.
IPI Databases
Sequence databases based on the International Protein Index (IPI) used to be very common in proteomics. However, IPI has been discontinued, and IPI databases ought to be replaced by the corresponding SwissProt/UniProt databases as explained above.
For more information on why IPI was discontinued see the IPI home page.
Decoy Sequences
In order to conduct an unbiased validation of the identification results, it is possible to append non-existing sequences (so-called decoy sequences) to your protein sequences (target sequences). Decoy sequences must fulfill two necessary and sufficient conditions: (1) similarity: the similarity between target and decoy sequences will ensure that false positives occur in equal amounts in both target and decoy sequences; (2) orthogonality: the absence of shared peptides between target and decoy sequences will allow the distinction of target and decoy hits. After searching this concatenated target/decoy database, results can hence be thresholded to a desired level of quality. For this task, we recommend the use of PeptideShaker.
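As a rough illustration of how a concatenated target/decoy search allows thresholding, the sketch below estimates the false discovery rate above a score cutoff as the number of decoy hits divided by the number of target hits. This is one common estimator and is an assumption here; it is not claimed to be the calculation PeptideShaker performs. The function name `estimate_fdr` and the scores are invented for the example, and higher scores are assumed to be better.

```python
from typing import List, Tuple

# Each hit is (score, is_decoy); higher scores are assumed to be better.
Hit = Tuple[float, bool]


def estimate_fdr(hits: List[Hit], threshold: float) -> float:
    """Estimate the FDR among target hits scoring at or above `threshold`.

    Assumes decoy hits are as likely as false target hits, so the number
    of decoys above the threshold approximates the number of false targets.
    """
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


# Invented scores, for illustration only.
hits = [(95.0, False), (90.0, False), (80.0, False),
        (70.0, True), (60.0, False), (50.0, True)]
print(estimate_fdr(hits, 60.0))
print(estimate_fdr(hits, 80.0))
```

Raising the threshold keeps fewer hits but lowers the estimated proportion of false positives among them, which is the trade-off the thresholding step controls.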
There are various ways of creating the decoy sequences, the most popular being reversed versions of the actual sequences. Adding decoy sequences is easily done by clicking the Decoy button next to the database file text field. Reversed versions of every sequence in the original sequence database will then be added to the FASTA file.
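The idea of a concatenated target/decoy database built from reversed sequences can be sketched in a few lines of Python. This is a simplified illustration, not the code behind the Decoy button: the function names are hypothetical, and the `_REVERSED` suffix and its placement after the accession field are assumptions made for this example.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def decoy_header(header: str, suffix: str = "_REVERSED") -> str:
    """Mark a header as decoy (suffix and placement are assumptions)."""
    fields = header.split("|")
    if len(fields) >= 2:
        # Pipe-delimited header: tag the accession in the second field.
        fields[1] += suffix
        return "|".join(fields)
    return header + suffix


def with_reversed_decoys(targets: List[Record], suffix: str = "_REVERSED") -> List[Record]:
    """Return the targets followed by one reversed decoy per target."""
    decoys = [(decoy_header(h, suffix), seq[::-1]) for h, seq in targets]
    return list(targets) + decoys


def to_fasta(records: List[Record], width: int = 60) -> str:
    """Format records as FASTA text, wrapping sequences at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Invented sequence, for illustration only.
targets = [("generic|AC:123132|Hypothetical protein", "MKTAYIAK")]
combined = with_reversed_decoys(targets)
print(to_fasta(combined))
```

The resulting file contains every target sequence once and its reversed counterpart once, so the target and decoy halves have the same size and amino acid composition.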
Mascot Users
If you use Mascot, do not use Mascot's Automatic Decoy Search. The reason for this is that Mascot decoy sequences are not present in the FASTA file. Moreover, when combining results from different search engines it is vital that the database and decoys used are identical, something that cannot be guaranteed when using Mascot's Automatic Decoy Search.
To combine Mascot results with your OMSSA and X!Tandem results you therefore have to use the same target-decoy database as the one used in the SearchGUI search and not select the decoy option when performing the Mascot search.
Finally, to ensure compatibility between search engines, be sure to use the exact same database for all algorithms.
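One simple way to check that two search engines were really given the same database file is to compare checksums of the files. The sketch below uses SHA-256 from the Python standard library; the function names and the file paths in the usage note are hypothetical, and this check is a suggestion rather than something SearchGUI performs for you.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_database(path_a: str, path_b: str) -> bool:
    """True if both files have byte-identical content."""
    return file_sha256(path_a) == file_sha256(path_b)
```

For example, `same_database("searchgui_db.fasta", "mascot_db.fasta")` (hypothetical file names) returns `True` only if the two files are byte-for-byte identical, which includes the appended decoy sequences.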
Note that for your search results to be compatible with PeptideShaker, decoy sequences have to be added as explained in the [Decoy Sequences](#decoy-sequences) section above.