The parsed data frame from saas::parse_msgf_mzid function contains sometimes multiple entries for a spectrum. (eg. if sequence can be assigned to multiple protein ids). This function takes care of this by default.

preprocess(dat, remove_target_decoy_PSM = TRUE,
  remove_multiple_proteins_PSM = FALSE, is_subset = NULL)

Arguments

dat

Data frame generated by the saas::parse_msgf_mzid function.

remove_target_decoy_PSM

TRUE to remove PSMs that match both a target and decoy sequence.

remove_multiple_proteins_PSM

TRUE to remove PSMs that can be assigned to multiple protein ids.

is_subset

Location of fasta file with protein_id of the subset of interest in the fasta headers.

Value

Data frame with the same columns as ``dat''. The column protein_id contains all protein_ids that can be assigned to this PSM. Multiple protein_ids are separated by ``;''. When ``is_subset'' is specified, two columns are added:

subset

TRUE if sequence can be assigned to a subset protein id

non_subset

TRUE if sequence can be assigned to a non subset protein id

Every spectrum haves only 1 row in the data frame.

Examples

## Location of the zipped data files zip_file_path = system.file("extdata", "extdata.zip", package = "saas") ## Unzip and get the (temporary) location of the mzid file with the MS-GF+ search results from a ## competitive target decoy search of the complete pyrococcus proteome against a pyrococcus dataset. mzid_file_path = unzip(zip_file_path, 'pyrococcus.mzid',exdir = tempdir()) ## Read and parse the mzid file data = parse_msgf_mzid(mzid_file_path) ## Unzip and get the (temporary) location of the file with fasta headers. ## Each fasta header contains a protein_id from the protein subset of interest. ## These protein_ids match the protein_ids in the mzid result file. fasta_file_path = unzip(zip_file_path, 'transferase_activity_[GO:0016740].fasta', exdir = tempdir()) ## Preprocess the data before FDR estimation. data_prep = preprocess(data, is_subset = fasta_file_path) ## Estimate the FDR in the subset. data_result = calculate_fdr(data_prep, score_higher = FALSE) ## Check how many PSMs are retained at the 1% FDR threshold. table(data_result$FDR_stable < .01)
#> #> FALSE TRUE #> 31 73