5. Database

5.a. A given database header format

Your database must contain both target and decoy sequences with generic headers. You can use [DBToolKit](/projects/dbtoolkit) to generate decoy protein sequences from your target protein sequences.

The headers in your database must be in the format that Xilmass expects. At the moment, Xilmass works only with a FASTA file that contains generic headers rather than UniProt headers. Therefore, every protein header must start with “>generic” rather than with another prefix such as “>sp|” or “>sw|”. For example,

>generic|P17156|Protein info is required instead of >sp|P17156|Protein info

Additionally, the header of a decoy protein must contain “REVERSED” or “SHUFFLED” in capital letters, separated from the accession number by an underscore. For example,

>generic|P17156_REVERSED|Protein info is required instead of >generic|P17156REVERSED|Protein info or >generic|P17156reversed|Protein info or >generic|P17156_reversed|Protein info

>generic|P17156_SHUFFLED|Protein info is required instead of >generic|P17156SHUFFLED|Protein info or >generic|P17156shuffled|Protein info or >generic|P17156_shuffled|Protein info

Lastly, all headers must contain only Unicode characters.
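The header rules in this section can be checked before running a search. The following is a minimal Python sketch, not part of Xilmass: the function names `check_header` and `to_generic` are hypothetical, and the code assumes that a header has the layout `>generic|accession|description` and that the decoy tag is a suffix of the accession, as in the examples above.

```python
import re

# Decoy tags named in this section.
DECOY_TAGS = ("REVERSED", "SHUFFLED")


def check_header(header):
    """Return a list of problems found in one FASTA header line (empty if none)."""
    if not header.startswith(">generic|"):
        return ["header does not start with '>generic|'"]
    fields = header.split("|")
    if len(fields) < 3:
        return ["expected '>generic|accession|description'"]
    accession = fields[1]
    problems = []
    for tag in DECOY_TAGS:
        # Catch lower-case or underscore-free spellings of the decoy tag.
        # Assumption: the tag is written at the end of the accession.
        if tag in accession.upper() and not accession.endswith("_" + tag):
            problems.append("decoy tag must be written as '_" + tag + "'")
    return problems


def to_generic(header):
    """Replace the leading database prefix (e.g. '>sp|') with '>generic|'."""
    return re.sub(r"^>[^|]*\|", ">generic|", header, count=1)


if __name__ == "__main__":
    for line in (">sp|P17156|Protein info",
                 ">generic|P17156_reversed|Protein info"):
        print(line, "->", check_header(line) or "ok")
    print(to_generic(">sp|P17156|Protein info"))
```

Returning a list of problems rather than raising an exception makes it easy to report every faulty header in a large FASTA file in one pass.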

5.b. A cross-linked peptide search database

A cross-linked peptide search database is written in a modified FASTA format with the .fastacp extension. Even though it is a plain text file, the database generated by Xilmass cannot be used by any other search engine. Its sequence lines contain asterisks (*) and vertical bars (|), also known as pipes.

Header: A header starts with > followed by the protein accession, such as a generic or [UniProtKB](http://www.uniprot.org/) accession. The accession is followed by the start and end residue numbers, within the protein, of the peptide produced by enzymatic cleavage (in the example, (25-37) and (11-21)). The number of the linked residue on each peptide is written next (in the example, 1 and 6). The first part comes from the protein with the longer putative peptide sequence. If two peptides have the same length, the first peptide is randomly selected. If the entry is for a mono-linked peptide, the header contains the information about that peptide followed by -monolinked.

Sequence: Each peptide sequence is written with an asterisk (*) that marks the linked residue. The two peptide sequences are separated from each other by a vertical bar (|). The first part comes from the longer putative peptide sequence. The selection is based first on peptide length, then on peptide mass: if two peptides have the same length, the heavier peptide is selected as the first peptide. If the entry is for a mono-linked peptide, the sequence line contains only one peptide, with an asterisk (*) marking the linked residue.

An example for a cross-linked peptide entry:

>Q15149(25-37)_1_Q15149(11-21)_6
K*TFTKWVNKHLIK|ASEGKK*DERDR

An example for a mono-linked peptide entry:

>4Q57:B(22-27)_5-monolinked
DRVQK*K
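To illustrate how the header and sequence lines relate, the following Python sketch parses one entry of each kind. It is not part of Xilmass: the function names and regular expressions are hypothetical and inferred only from the two examples above. It assumes that the asterisk directly follows the linked residue, that the linked-residue number in the header counts from the start of the peptide, and that whitespace in a sequence line can be ignored.

```python
import re

# Header layouts inferred from the two example entries (an assumption,
# not a published format specification).
PEPTIDE = r"([^()]+)\((\d+)-(\d+)\)_(\d+)"
CROSS_LINKED = re.compile("^>" + PEPTIDE + "_" + PEPTIDE + "$")
MONO_LINKED = re.compile("^>" + PEPTIDE + "-monolinked$")


def split_marked(peptide):
    """Return (plain sequence, 1-based index of the residue before '*')."""
    star = peptide.index("*")  # raises ValueError if no residue is marked
    return peptide.replace("*", ""), star


def parse_entry(header, sequence):
    """Parse one .fastacp entry into a list of peptide dictionaries."""
    match = CROSS_LINKED.match(header) or MONO_LINKED.match(header)
    if match is None:
        raise ValueError("unrecognised header: " + header)
    groups = match.groups()
    parts = "".join(sequence.split()).split("|")  # drop whitespace, split on '|'
    if len(parts) * 4 != len(groups):
        raise ValueError("header and sequence disagree on peptide count")
    peptides = []
    for i, part in enumerate(parts):
        accession, start, end, linked = groups[4 * i: 4 * i + 4]
        plain, star = split_marked(part)
        length_ok = len(plain) == int(end) - int(start) + 1
        peptides.append({
            "accession": accession,
            "start": int(start),
            "end": int(end),
            "linked_residue": int(linked),
            "sequence": plain,
            # True when the '*' position and peptide length match the header.
            "consistent": length_ok and star == int(linked),
        })
    return peptides


if __name__ == "__main__":
    entry = parse_entry(">Q15149(25-37)_1_Q15149(11-21)_6",
                        "K*TFTKWVNKHLIK|ASEGKK*DERDR")
    for peptide in entry:
        print(peptide)
```

The `consistent` flag is only a sanity check: it compares the peptide length and asterisk position against the residue numbers in the header.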

An example of such a cross-linked peptide database can be found here.