5. Database
5.a. A given database header format
Your database must contain both target and decoy sequences with generic headers. You can use [DBToolKit](/projects/dbtoolkit) to generate decoy protein sequences from your target protein sequences.
Your database headers must be in the right format for Xilmass identifications. At the moment, Xilmass works only with FASTA files that contain generic headers, not UniProt headers. Therefore, every protein header must start with “>generic” rather than with another prefix such as “>sp|” or “>sw|”. For example,
>generic|P17156|Protein info
is required instead of >sp|P17156|Protein info
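As an illustration of the rule above, a header can be rewritten with a few lines of Python. This is only a minimal sketch: the function name `to_generic_header` is hypothetical and not part of Xilmass, and it assumes that the original header starts with an alphabetic database tag followed by a pipe (as in `>sp|` or `>sw|`).

```python
import re

# Assumption: the non-generic header starts with ">", an alphabetic
# database tag (e.g. "sp", "sw", "tr") and a pipe character.
_DB_PREFIX = re.compile(r"^>[A-Za-z]+\|")


def to_generic_header(header: str) -> str:
    """Replace the leading database tag of a FASTA header with '>generic|'."""
    if not header.startswith(">"):
        raise ValueError(f"not a FASTA header: {header!r}")
    if header.startswith(">generic|"):
        return header  # already in the required format
    new_header, replaced = _DB_PREFIX.subn(">generic|", header, count=1)
    if replaced == 0:
        raise ValueError(f"unrecognised header prefix: {header!r}")
    return new_header


print(to_generic_header(">sp|P17156|Protein info"))
```

The accession and the description are left untouched; only the leading tag changes.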
Additionally, a decoy protein must have a header that contains “REVERSED” or “SHUFFLED” in capital letters, separated from the accession number by an underscore. For example,
>generic|P17156_REVERSED|Protein info
is required instead of >generic|P17156REVERSED|Protein info
or
>generic|P17156reversed|Protein info
or
>generic|P17156_reversed|Protein info
Similarly,
>generic|P17156_SHUFFLED|Protein info
is required instead of >generic|P17156SHUFFLED|Protein info
or
>generic|P17156shuffled|Protein info
or
>generic|P17156_shuffled|Protein info
Lastly, all headers must contain only Unicode characters.
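The header rules of this section can be checked before a search is started. The sketch below is not part of Xilmass; the function name `check_header` is hypothetical, and it assumes that the accession is the second pipe-separated field and that the decoy tag is a suffix of the accession, as in the examples above.

```python
DECOY_TAGS = ("REVERSED", "SHUFFLED")


def check_header(header: str) -> str:
    """Return 'target' or 'decoy'; raise ValueError if a rule is broken."""
    if not header.startswith(">generic|"):
        raise ValueError("header must start with '>generic|'")
    # Assumption: the accession is the second pipe-separated field.
    accession = header.split("|")[1]
    if not accession:
        raise ValueError("missing accession")
    for tag in DECOY_TAGS:
        if accession.endswith("_" + tag):
            return "decoy"
        # Tag present but lower-case or without underscore: reject.
        if tag.lower() in accession.lower():
            raise ValueError(
                f"decoy tag must be written as '_{tag}' in {accession!r}"
            )
    return "target"


print(check_header(">generic|P17156|Protein info"))
print(check_header(">generic|P17156_REVERSED|Protein info"))
```

Headers such as `>generic|P17156reversed|Protein info` or `>sp|P17156|Protein info` make the function raise a `ValueError`.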
5.b. A cross-linked peptide search database
A cross-linked peptide search database is written in a modified FASTA format with the .fastacp extension. Even though it is a plain text file, the database generated by Xilmass cannot be used by any other search engine. Sequence lines contain asterisks (*) and vertical bars, also known as pipes (|).
Header: Headers start with >. The protein accession number, such as a generic or [UniProtKB](http://www.uniprot.org/) accession, is written together with the start and end residue numbers, within the protein, of the peptide produced by enzymatic cleavage (in the example, (25-37) and (11-21)). The linked-residue number on each peptide is written next (in the example, 1 and 6). The first part comes from the protein with the longer putative peptide sequence. If two peptides have the same length, the first peptide is randomly selected. If the entry is for a mono-linked peptide, the header contains the information about that peptide followed by -monolinked.
Sequence: Peptide sequences are written with an asterisk (*) that marks the linked residue. Two peptide sequences are separated from each other by a vertical bar (|). The first part is the longer putative peptide sequence. The selection is based first on peptide length, then on peptide mass: if two peptides have the same length, the heavier peptide is selected as the first peptide. If the entry is for a mono-linked peptide, the sequence line contains only one peptide, with an asterisk (*) that shows the linked residue.
An example for a cross-linked peptide entry:
>Q15149(25-37)_1_Q15149(11-21)_6
K*TFTKWVNKHLIK|ASEGKK*DERDR
An example for a mono-linked peptide entry:
>4Q57:B(22-27)_5-monolinked
DRVQK*K
An example of such a cross-linked peptide database can be found here.
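To make the entry layout concrete, the following sketch parses one header line and one sequence line of a .fastacp entry. It is not the Xilmass reader: the names `LinkedPeptide` and `parse_entry` are hypothetical, and the consistency checks assume, based only on the two examples above, that the asterisk directly follows the linked residue and that the peptide length equals end minus start plus one.

```python
import re
from dataclasses import dataclass


@dataclass
class LinkedPeptide:
    accession: str
    start: int           # first residue of the peptide within the protein
    end: int             # last residue of the peptide within the protein
    linked_residue: int  # 1-based position of the linked residue in the peptide
    sequence: str        # peptide sequence without the asterisk


# One peptide description: accession(start-end)_linkedResidue
_PART = r"(.+?)\((\d+)-(\d+)\)_(\d+)"
_CROSS = re.compile(rf"^>{_PART}_{_PART}$")
_MONO = re.compile(rf"^>{_PART}-monolinked$")


def parse_entry(header: str, sequence: str) -> list[LinkedPeptide]:
    """Parse one cross-linked or mono-linked entry into its peptides."""
    match = _CROSS.match(header) or _MONO.match(header)
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    groups = match.groups()
    marked = sequence.strip().split("|")
    if len(marked) * 4 != len(groups):
        raise ValueError("header and sequence disagree on peptide count")
    peptides = []
    for i, pep in enumerate(marked):
        acc, start, end, link = groups[4 * i:4 * i + 4]
        if pep.count("*") != 1:
            raise ValueError(f"expected exactly one asterisk in {pep!r}")
        # Assumption: the asterisk directly follows the linked residue.
        if pep.index("*") != int(link):
            raise ValueError(f"asterisk does not match linked residue in {pep!r}")
        residues = pep.replace("*", "")
        # Assumption: start and end are inclusive residue numbers.
        if len(residues) != int(end) - int(start) + 1:
            raise ValueError(f"length does not match residue range in {pep!r}")
        peptides.append(
            LinkedPeptide(acc, int(start), int(end), int(link), residues)
        )
    return peptides


for peptide in parse_entry(">Q15149(25-37)_1_Q15149(11-21)_6",
                           "K*TFTKWVNKHLIK|ASEGKK*DERDR"):
    print(peptide)
```

A mono-linked entry is parsed the same way and yields a list with a single peptide, for example `parse_entry(">4Q57:B(22-27)_5-monolinked", "DRVQK*K")`.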