Training new MS2PIP models

Write feature vectors for model training

To compile a feature vector dataset, supply the MS2 .mgf file (option -s) and the name of the file to write the feature vectors to (option -w). The spec_id column in the <peptide file> must match the TITLE field of the corresponding MS2 spectrum in the .mgf file; it is used to find the targets for the feature vectors.
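Because the spec_id/TITLE match is what links each peptide to its spectrum, it can be worth checking before extracting features. The following is a minimal sketch of such a check and is not part of MS2PIP. It assumes a space-separated peptide file with a header row that contains a spec_id column. The helper names and the other column names are hypothetical.

```python
import csv
import io


def read_mgf_titles(lines):
    """Collect the TITLE value of every spectrum in an MGF file."""
    titles = set()
    for line in lines:
        line = line.strip()
        if line.startswith("TITLE="):
            titles.add(line[len("TITLE="):])
    return titles


def unmatched_spec_ids(peptide_lines, titles, delimiter=" "):
    """Return spec_id values that have no spectrum with the same TITLE."""
    reader = csv.DictReader(peptide_lines, delimiter=delimiter)
    return [row["spec_id"] for row in reader if row["spec_id"] not in titles]


# Tiny in-memory example; real files would be read with open(path).
mgf_text = """BEGIN IONS
TITLE=spectrum_1
PEPMASS=421.76
100.1 2000.0
END IONS
BEGIN IONS
TITLE=spectrum_2
PEPMASS=532.28
120.2 1500.0
END IONS
"""
# Assumed layout: only the spec_id column is taken from the text above.
peptide_text = """spec_id modifications peptide charge
spectrum_1 - ACDEK 2
spectrum_3 - PEPTIDEK 2
"""

titles = read_mgf_titles(io.StringIO(mgf_text))
missing = unmatched_spec_ids(io.StringIO(peptide_text), titles)
print(missing)  # spec_ids for which no targets could be found
```

Any spec_id reported here has no spectrum with a matching TITLE, so no targets can be found for it.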

Test feature extraction

In the tests folder, run pytest. This runs the tests that verify whether feature and target extraction work properly. (The tests must be updated whenever features are added or removed!) This requires the pytest package (pip install pytest).

Optimize and train XGBoost models

The training script has the following usage:

usage: [-h] [-c INT] [-e FILE] [-p] [-g] <_vectors.pkl> <type>

XGBoost training

positional arguments:
  <_vectors.pkl>  feature vector file
  <type>          model type

optional arguments:
  -h, --help      show this help message and exit
  -c INT          number of CPUs to use
  -e FILE         additional evaluation file
  -p              output plots
  -g              perform Grid Search CV to select best parameters

The script reads the pickled feature vector file <vectors.pkl or .h5> and trains an XGBoost model. The <type> argument indicates the ion type for which a model should be trained. It has to match the name of the column in the vector file that contains the targets for that ion type. For instance, B matches the column targetsB and yields a model for b-ions.
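The naming rule can be illustrated with a small sketch. This is not the script's actual code: a plain dictionary stands in for the vector file, the feature names are hypothetical, and treating every column that starts with "targets" as a target column is an assumption.

```python
def target_column(ion_type):
    """Map a model type such as 'B' to its target column name."""
    return "targets" + ion_type


def split_features_targets(table, ion_type):
    """Separate the targets for one ion type from the feature columns.

    `table` maps column names to lists of values (a stand-in for the
    vector file). Columns whose names start with "targets" are assumed
    to be target columns and are excluded from the features.
    """
    column = target_column(ion_type)
    if column not in table:
        raise KeyError(f"no column {column!r} for model type {ion_type!r}")
    features = {k: v for k, v in table.items() if not k.startswith("targets")}
    return features, table[column]


# Hypothetical feature names and values, for illustration only.
vectors = {
    "charge": [2, 3],
    "peplen": [8, 12],
    "targetsB": [0.12, 0.40],
    "targetsY": [0.55, 0.08],
}
features, targets = split_features_targets(vectors, "B")
print(sorted(features), targets)
```

In this sketch, a type with no matching column raises an error immediately instead of training on the wrong targets.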

Hyperparameters can be optimized with a grid search by passing the -g option. Be sure to define an appropriate search space first; it is hard-coded in the script.
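As a rough illustration of what a search space amounts to, the sketch below expands a small grid into individual parameter sets. The parameter names (max_depth, eta, subsample) are standard XGBoost parameters, but the values are placeholders, and the grid actually defined in the script may look different.

```python
from itertools import product

# Hypothetical search space: placeholder values, not the script's grid.
search_space = {
    "max_depth": [6, 9],
    "eta": [0.1, 0.3],
    "subsample": [0.8, 1.0],
}


def expand_grid(space):
    """Yield one parameter dict per combination in the search space."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(expand_grid(search_space))
print(len(candidates))  # 2 * 2 * 2 = 8 parameter sets
print(candidates[0])
```

The number of candidates is the product of the list lengths, and each candidate is trained once per cross-validation fold, so adding values to the grid quickly increases training time.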

Optionally, an evaluation vector file can be given (option -e). In that case, predictions are made on these vectors using the final model. If no evaluation file is given, predictions are made on the test set.

The script writes the XGBoost models as .c files that can be compiled and linked through Cython. To use them, put the models in the /models folder, update the #include directives in ms2pipfeatures_c.c, and recompile the model by running the script.