# Introduction

In this vignette we will demonstrate how to use the ‘search all, asses subset’ strategy (all-sub) to correctly calculated the FDR on a PSM subset of interest in a shotgun proteomic experiment.

In short, the basic all-sub workflow comprises the following steps:

1. Search the experimental MS2 spectra against all proteins potentially present in the sample.
2. Remove all PSMs that match a protein you are not interesed in.
3. Calculate the FDR on the resulting subset of PSMs.

The search-all-assess-subset FDR procedure has been published in Sticker et al. Nature Methods 14, 643–644 (2017) doi:10.1038/nmeth.4338.

Mass spectrometrists should search for all peptides, but assess only the ones they care about
Adriaan Sticker, Lennart Martens, Lieven Clement (2017)

Please include this reference if you use search-all-assess-subset FDR procedure as part of a publication.

## False discovery rate (FDR) estimation

Let $$x$$ be the PSM score and assume that larger score values indicate a better match to the theoretical spectrum. The FDR is known to have a Bayesian interpretation upon defining a mixture model for the distribution of $$x$$ (Efron 2008). $\begin{equation} f(x) = \pi_0 f_0(x) + (1-\pi_0) f_1(x), \end{equation}$

with $$f(x)$$ the target distribution of the test statistic, $$f_0(x)$$ the mixture component corresponding to incorrect PSMs, $$f_1(x)$$ the mixture component corresponding to the correct PSMs and $$\pi_0$$ the fraction of incorrect PSMs.

Based on the mixture distribution the FDR can be defined as the posterior probability $\begin{equation} \text{FDR}(x) = Pr[\text{PSM is incorrect}\vert X\geq x]. \end{equation}$ Let $$F(x)$$, $$F_0(x)$$ and $$F_1(x)$$ indicate the complementary cumulative distribution function (CCDF) $\begin{equation} F_{.}(x)=\int\limits_{X=x}^{+\infty} f_{.}(X) dx \end{equation}$ so $\begin{equation} F(x)= \pi_0 F_0(x) + (1-\pi_0) F_1(x) \end{equation}$ and the FDR becomes $\begin{equation} \text{FDR}(x)=\frac{\pi_0 F_0(x) }{F(x)} \end{equation}$

The empirical Bayes/two-groups approach also has the great virtue that independence of the scores is not necessary as long as (1) the estimated CCDF of the marginal distribution is roughly unbiased and (2) the null component is estimated empirically (Efron 2008).

## Target-decoy approach

In a competitive target decoy search, FDRs are estimated by dividing the number of accepted decoys PSMs by the number of accepted target PSMs at a certain MS-GF+ score cutoff (Elias and Gygi 2007). For FDR estimation in the subset of interest, use only target and decoy PSMs belonging to the subset in the calculation. We assure a monotonic increase of FDR estimates with decreasing score values by replacing the FDR at a particular score $$x$$ by the minimum TDA FDR in the interval $$[-\infty,x]$$.

Note, that in a competitive target decoy search the FDR with TDA is estimated by $\begin{equation} \widehat{\text{FDR}}(x) =\frac{\# decoys | X\geq x}{\# targets | X\geq x}, \end{equation}$ which can be rewritten as $\begin{eqnarray*} \widehat{\text{FDR}}(x) &=&\frac{\frac{\#decoys}{\#targets}\frac{\# decoys | X\geq x}{\# decoys}}{\frac{\# targets | X\geq x}{\# targets}}\\ &=&\frac{\hat \pi_0 \bar F_0(x)}{\bar F(x)} \end{eqnarray*}$ with $$\bar F_0(x)$$ and $$\bar F(x)$$ the empirical complementary cumulative distribution functions (ECCDFs) estimated using decoy and target PSMs, respectively. So, in a competitive TDA the FDR implicitly relies on the estimator of the proportion of incorrect PSMs among the targets $\begin{equation} \hat \pi_0=\frac{\#decoys}{\#targets}. \end{equation}$

## Stable target-decoy approach for subsets

In small subsets we typically can expect a low number of decoy PSMs and the $$\bar F_0(x)$$ will be a poor estimator of $$F_0(x)$$. For improving the stability of the target-decoy approach in small subsets we propose to estimate the mixture component for incorrect PSMs using a large set of decoys with similar properties as the incorrect subset PSMs. Hence, the method only relies on the assumption that the incorrect target PSM scores follow the same distribution as the PSM scores of the decoy set.

This large set of decoys PSM scores can be designed by the user. However often, the set of decoys in the complete search seems a good candidate set. This allows for the estimation of the mixture component of the incorrect PSMs and the marginal distribution of all PSMs non-parametrically, i.e. based on the ECCDFs $$\bar F_0(x)$$ using a large decoy set (e.g. of the complete search) and $$\bar F(x)$$ derived from the targets in the subset, respectively.

The mixture proportion $$\pi_0$$ can still be estimated based on the ratio of the number of decoys on the number of targets in the subset upon a competitive target decoy search. However, the estimator $$\hat \pi_0$$ often becomes zero in very small subsets, inducing an FDR estimate of zero for the set of all subset target PSMs. We therefore propose to use the estimator $\begin{equation} \hat \pi_0= \begin{cases} \frac{\#\text{decoys}_\text{subset}+1}{\#\text{targets}_\text{subset}}, & \text{if}\ \#\text{targets} > \#\text{decoys}\\ 1, & \text{if}\ \#\text{targets} \leq \#\text{decoys}\\ \end{cases} , \end{equation}$

which has a lower bound of $$\frac{1}{\text{targets}_\text{subset}}$$ and an upper bound of 1. Note that this estimator has an upward bias in small subsets, which disappears for moderate to large subsets. In the extreme case that we observe only one target PSM, $$\hat \pi_0$$ is 1 and the FDR estimator reduces to a regular p-value.

Note, that a conservative estimator of the FDR is also available using the link between the Bayesian and the Benjamini-Hochberg (BH) FDR procedure, i.e. by setting the mixture proportion of the mixture component for incorrect PSMs to $$\pi_0=1$$ (Efron 2008).

$\begin{equation} \widehat{\text{FDR}}(x)=\frac{\bar F_0(x)}{\bar F(x)}. \end{equation}$

In very small subsets, the latter FDR estimator is still valid as long as a good estimator of the mixture component for incorrect PSMs is available. Benjamini and Hochberg, 1995, for instance, included an example with 15 comparisons in their seminal paper (Benjamini and Hochberg 1995). Also note that their FDR estimator also reduces to a regular p-value in the extreme case that the subset of interest consists of one target PSM, i.e. we set $$\pi_0=1$$ and $$\bar F=1$$ for this extreme edge case.

The ECCDF $$\bar F_0(x)$$ is not a good estimator for $$F_0(x)$$ in small subsets because only a few decoys are available. TDA-FDR estimates in small subsets have high sample-to-sample variability and tend to be too liberal. Therefore we recommend to have at least 1000 decoys to obtain sufficiently reliable FDR estimates in most situations.

Lastly, we assure a monotonic increase of FDR estimates with decreasing score values by replacing the FDR at a particular score $$x$$ by the minimum TDA FDR in the interval $$[-\infty,x]$$.

# Example data analysis

## The data

In this vignette we use data from a Pyrocococus furiosis sample run on a LTQ-Orbitrap Velos mass spectrometer. The data can be found in the PRIDE repository with indentifier PXD001077. The Pyrocococus furiosis reference proteome fasta file was downloaded from Uniprot on April 22, 2016. In this use case we are interested in Pyrococcus proteins related to transferase activity as defined by their Gene Ontology identifier (GO:0016740). A fasta file with these proteins was also downloaded from Uniprot.

## Reading a MS-GF+ MZident output file.

## Load the saas library
library(saas)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

## Location of the zipped data files
zip_file_path = system.file("extdata", "extdata.zip", package = "saas")
## Unzip and get the (temporary) location of the mzid file with the MS-GF+ search results.
mzid_file_path = unzip(zip_file_path, 'pyrococcus.mzid',exdir = tempdir())

## Read and parse the mzid file
data = parse_msgf_mzid(mzid_file_path)

glimpse(data)
## Observations: 15,639
## Variables: 7
## $spec_id <dbl> 9834, 10918, 12207, 12179, 11387, 11027, 7833, 9... ##$ sequence      <chr> "GLEVSGYNCYIYPAMALAYGTSAIGAHHK", "MLVDSLGDIVITND...
## $protein_id <chr> "Q8U1K3|Formaldehyde:ferredoxin", "Q8TZL6|Thermo... ##$ score         <dbl> 2.383677e-34, 6.238274e-32, 2.329653e-31, 1.0498...
## $database <chr> "complete", "complete", "complete", "complete", ... ##$ decoy         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $database_size <dbl> 2051, 2051, 2051, 2051, 2051, 2051, 2051, 2051, ... ## Preprocessing the MS-GF+ search results. By default, the ‘preprocess()’ function removes all PSMs that assigned to both a decoy and target sequence. All rows in the dataframe that belong to the same PSMs (eg. one PSMs that match multiple proteins) are collapsed into one row. When a path to a fasta file with the protein_ids from a subset of proteins in the fasta headers is provided, a new column is added that indicates if this PSM belongs to this subset of interest. These protein_ids should match the protein_ids from orginal fasta file used in the MS-GF+ search (and as indicated in the protein_id column in the data frame). This information is needed to use the all-sub method. Optionally, it’s also possible to remove PSM’s that can be assigned to multiple proteins. ## Unzip and get the (temporary) location of the file with fasta headers containing ## the protein_ids from the protein subset of interest. fasta_file_path = unzip(zip_file_path, 'transferase_activity_[GO:0016740].fasta', exdir = tempdir()) ## Preprocess the data. data_prep = preprocess(data, is_subset = fasta_file_path) glimpse(data_prep) ## Observations: 15,107 ## Variables: 9 ##$ spec_id       <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $sequence <chr> "YEQIPWTQECSK", "NVQQKWFGK", "NLLESPKIK", "KHLEQ... ##$ decoy         <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FA...
## $database <chr> "complete", "complete", "complete", "complete", ... ##$ database_size <dbl> 2051, 2051, 2051, 2051, 2051, 2051, 2051, 2051, ...
## $score <dbl> 2.560707e-05, 1.458271e-04, 9.241630e-06, 1.7355... ##$ subset        <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $non_subset <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,... ##$ protein_id    <chr> "Q8U285|Uncharacterized", "Q8U168|Glycosyl", "P8...

## Evaluating the decoy PSM distribution.

The implementation of the all-sub in this packages adopt the target-decoy approach for FDR estimation. A subset FDR can be obtained by applying TDA on the subset target and decoy PSMs.

However, in very small subsets with little subset target and decoy PSMs, there can be a large sample-to-sample variability on the FDR estimates. We provide extensions to the basic subset TDA FDR to obtain more stable FDR estimates. Key is the use of a large set of decoy PSMs to reliable estimate the distribution of incorrect target PSM scores. By default we use all decoys from the complete search (subset and non-subset PSMs). The user can also define another set of extra decoys and use this for FDR calculation (see example 2).

Bot the classical TDA FDR and our more stable TDA FDR relie on the assumption that the distribution of incorrect target PSM scores can be approximated by the decoy PSM score distribution.

In this package we provide diagnostic plots to verify the assumption that the decoy PSM score distribution follows the incorrect target PSM distribution. Tthe ‘score_higher’ parameter in ‘plot_diag()’ indicates if higher score values mean a more confident PSM. We take the MS-GF+ SpecEValue as the PSM score for FDR calculation and smaller SpecEValue indicate a better match. Therefore we have to set the ‘score_higher’ parameter to ‘FALSE’. Note that in the manuscript we used $$- log ($$MS-GF+ SpecEValue$$)$$ as the score for visualisation purposes. Please consult the documentation of your search engine of choice on how to interpret the PSM score.

diagnostics = plot_diag(data_prep, score_higher = FALSE)
diagnostics$all Panel a shows the posterior distribution of pi_0 given the observed number of target and decoy PSMs in the subset. The vertical line indicates the conservative pi_0 estimate used in the calculations. At very high $$\pi_0$$ uncertainty (broad peak), you can also opt to use the Benjamine Hochberg procedure to minimize sample to sample variability which would mean that pi_0 would be set to 1 (see FDR_BH in the next section). However, this will come at the expense of too conservative PSM lists. The distributional assumption for the decoys can be verified through a PP-plot where the empirical Cumulative Distribution Function (eCDF) of the decoys is plotted against the eCDF of the subset target PSMs. The PP-plots in panel b - d display the target subset PSMs plotted against all decoy PSMs from the complete search, the decoy subset PSMs plotted against all decoy PSMs from the complete search, and the target subset PSMs plotted against the decoy PSMs from the complete search, respectively. The full line in panel b and d indicates a line with a slope of $$\pi_0$$. The full line in panel c indicates the identity line. When the distributional assumption holds then the first part of the plot in b and d should be linear with a slope that equals $$\pi_0$$. The second part of the plot deviates from the line towards higher percentiles and will ultimately become vertical (decoy percentile = 1). If we see this profile in panel b, we have a good indication that the set of decoys from the complete search is representative for the mixture component for incorrect PSMs of the target mixture distribution. Deviations from this pattern might be subtle, therefore we provide the PP plots in c and d to support the conclusion drawn from panel b. When there is high uncertainty on pi_0 as indicated by a, then the linear pattern in the data points might deviate from the drawn solid line, but should still be more or less linear. The PP-plot in panel c shows the subset decoy PSMs plotted against all decoy PSMs. The whole plot should follow the identity line, indicating that the complete set of decoys is a good representation of the subset decoys. To verify that the subset decoys are representative for the mixture component for incorrect PSMs of the target mixture distribution, we look at the PP-plot of the subset decoys against the subset targets in panel d. The profile should look as described for panel b. If the profile matches in panel d but does not for panel b, then we suggest to not use the extra decoy set and use only the subset decoys for FDR estimation. When the profile does not match in panel d, the subset decoys might not be representative for incorrect PSMs. This can indicate that $$\pi_0$$ is estimated incorrectly, since this is based on the subset PSM scores. In this case, the first part of the plot in panel d can deviate from the (incorrect) $$\pi_0$$ slope line. But if this first part is linear, it still indicates that the extra set of decoys is representative for the mixture component of incorrect target PSMs. Since $$\pi_0$$ is unknow in this case we set $$\pi_0$$ to 1 (see FDR_BH in the next section). We can conclude from the above diagnostic plots that the large decoy set is an appropiate candiate set to estimate the incorrect subset target PSM distribution. ## Estimating FDR on the subset of interest. The output from ‘preprocess()’ can be immediatly used to estimate the FDR in the PSM subset of interest. data_results = calculate_fdr(data_prep, score_higher = FALSE) glimpse(data_results) ## Observations: 15,107 ## Variables: 13 ##$ spec_id       <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $sequence <chr> "YEQIPWTQECSK", "NVQQKWFGK", "NLLESPKIK", "KHLEQ... ##$ decoy         <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FA...
## $database <chr> "complete", "complete", "complete", "complete", ... ##$ database_size <dbl> 2051, 2051, 2051, 2051, 2051, 2051, 2051, 2051, ...
## $score <dbl> 2.560707e-05, 1.458271e-04, 9.241630e-06, 1.7355... ##$ subset        <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $non_subset <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,... ##$ protein_id    <chr> "Q8U285|Uncharacterized", "Q8U168|Glycosyl", "P8...
## $pi_0_cons <dbl> 0.3653846, 0.3653846, 0.3653846, 0.3653846, 0.36... ##$ FDR           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $FDR_BH <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ... ##$ FDR_stable    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

We can choose in the output from three different all-sub FDR estimations: The estimated stable FDR at this score cutoff for subset PSMs.

1. FDR: Calculated according the classical TDA method. Does not work well small subsets.
2. FDR_stable: Our improved FDR method which is more stable in small subsets. For large subsets, FDR_stable estimates will be close to FDR estimates.
3. FDR_BH: Bejamini Hochberg FDR procedure. Use this when you have a large decoy set but no decoy information on the subset PSMs (e.g. when the search engine does not return decoy protein ids). This FDR estimate is more conservative then FDR and FDR_stable.

The ‘transferase activity’ subset is rather small with 104 target and 37 decoy PSMs.

count(data_results, subset, decoy)
## Source: local data frame [4 x 3]
## Groups: subset [?]
##
##   subset decoy     n
##    <lgl> <lgl> <int>
## 1  FALSE FALSE 13158
## 2  FALSE  TRUE  1808
## 3   TRUE FALSE   104
## 4   TRUE  TRUE    37

From the diagnostic plots we concluded that the set of decoy PSMs from the complete search is a good candidate set to approximate the incorrect target PSM distribution.

So, in this case it’s recommended to use the more stable FDR estimates in FDR_stable.

results_1_FDR = filter(data_results, subset, !decoy, FDR_stable <= .01)
glimpse(results_1_FDR)
## Observations: 73
## Variables: 13
## $spec_id <dbl> 273, 409, 412, 439, 521, 1109, 1201, 1527, 1531,... ##$ sequence      <chr> "IKKEDYQR", "KKYEEYR", "KKYEEYR", "ILHDEDKEGTK",...
## $decoy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,... ##$ database      <chr> "complete", "complete", "complete", "complete", ...
## $database_size <dbl> 2051, 2051, 2051, 2051, 2051, 2051, 2051, 2051, ... ##$ score         <dbl> 7.725776e-09, 7.241932e-10, 7.455879e-10, 2.2812...
## $subset <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ... ##$ non_subset    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $protein_id <chr> "Q8U2R4|Acetyl", "Q8TH05|O-linked", "Q8TH05|O-li... ##$ pi_0_cons     <dbl> 0.3653846, 0.3653846, 0.3653846, 0.3653846, 0.36...
## $FDR <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, ... ##$ FDR_BH        <dbl> 0.0016578989, 0.0000000000, 0.0000000000, 0.0000...
## $FDR_stable <dbl> 0.0006057708, 0.0000000000, 0.0000000000, 0.0000... At 1% FDR, we return 31 PSMs. # Example 2: Using an external set of decoys In the previous example, we used scores from all decoys generated by the database search to obtain more stable FDR estimates when only few decoys are available in the subset of interest. Sometimes there are still not enough decoys available from the complete search or the PP plots showed that they do not estimate the distribution of incorrect target PSMs very well. However, it’s still possible to stabilize FDR estimation in a small subset if you obtain a set of decoys PSMs from somewhere else (eg. another experiment). ## The data For demonstration purposes we will simulate two datasets using the sample_dataset() functon. The first dataset contains 35 PSMs: 5 correct subset target PSMs, 10 incorrect subset target PSMs, 10 subset decoy PSMs and 10 non-subset decoy PSMs. The non-subset decoy PSMs, however, are biased towards lower score values. The second dataset contains 2000 decoys PSMs that are sampled from the same distribution as the decoy and incorrect target subset PSMs from dataset 1. We create a third dataset were we remove the non-subset decoys and add the decoys from dataset 2. ## Load the saas library library(saas) ## Load some convenience functions library(dplyr, quietly = TRUE, warn.conflicts = FALSE) set.seed(10) ## dataset 1 H0_mean = 2.75 data_1 = sample_dataset(H0_n = 20, decoy_n = 20, H1_n = 10, decoy_large_n = 2000, H0_mean = H0_mean, decoy_large_mean = H0_mean - .1) ## dataset 2 data_2 = sample_dataset(H0_n = 100, decoy_n = 100, H1_n = 50, decoy_large_n = 1900) ## dataset 3 with the subset PSMs from dataset 1 and the decoy PSMs from dataset 2 ### select all subset PSMs from dataset 1 data_1_filtered = filter(data_1, subset) ### select all decoys from dataset 2 data_2_filtered = filter(data_2,decoy) ### make sure they are labeled as non-subset decoys data_2_filtered = mutate(data_2_filtered, subset = FALSE) ### add them to the subset decoys of dataset 1 data_3 = bind_rows(data_1_filtered, data_2_filtered) ## Evaluating the decoy PSM distribution. We evaluate in dataset 1 the assumption that the distribution of incorrect target PSM scores can be approximated by the decoy PSM score distribution with diagnostic plots. diagnostics = plot_diag(data_1) diagnostics$all The PP plot in panel b shows clear deviation from the $$\pi_0$$ line. This suggests that the complete set of decoy PSM scores are a bad approximation of the incorrect target PSM score distribution.

The PP plot in panel d shows a profile that follows closely the $$\pi_0$$ line. This suggests that the subset decoys PSMs scores are a good approximation of the incorrect target PSM score distribution. But since we only have 20 subset decoy PSMs, we can expect large sample-to-sample variability in the FDR estimation.

Therefore we use a larger set of decoys derived from another dataset. We evaluate again in this exetended dataset (dataset 3) the assumption that the distribution of incorrect target PSM scores can be approximated by the decoy PSM score distribution.

diagnostics = plot_diag(data_3)
diagnostics$all Now, the PP plot in panel b shows a profile that follows closely the $$\pi_0$$ line. We can safely assume that the complete set of decoy PSM scores are a good approximation of the incorrect target PSM score distribution. ## Estimating FDR on the subset of interest. Since we have a very small subset, we again recommended to use the more stable FDR estimates in FDR_stable exploiting the information in the larger set of decoy PSMs. data_results = calculate_fdr(data_3) results_1_FDR = filter(data_results, subset, !decoy, FDR_stable <= .01) glimpse(results_1_FDR) ## Observations: 9 ## Variables: 8 ##$ score      <dbl> 3.271756, 3.553800, 3.532864, 3.200799, 3.137270, 3...
## $decoy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA... ##$ H0         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $subset <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE ##$ pi_0_cons  <dbl> 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7
## $FDR <dbl> 0.0, 0.0, 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0 ##$ FDR_BH     <dbl> 0.000000000, 0.000000000, 0.000000000, 0.003712871,...
## \$ FDR_stable <dbl> 0.000000000, 0.000000000, 0.000000000, 0.002599010,...

At 1% FDR, we return 9 PSMs. Note that none of the returned PSMs where sampled out of the incorrect target distribution ($$H_0$$).

# sessionInfo()

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
##
## Matrix products: default
## BLAS: /usr/lib/libblas.so.3.7.1
## LAPACK: /usr/lib/liblapack.so.3.7.1
##
## locale:
##  C
##
## attached base packages:
##  stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
##   compiler_3.4.1  backports_1.1.0 magrittr_1.5    rprojroot_1.2
##   tools_3.4.1     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.12
##   stringi_1.1.5   rmarkdown_1.6   knitr_1.17      stringr_1.2.0
##  digest_0.6.12   evaluate_0.10.1