ModPred: predictor of post-translational modification sites in proteins

Usage

Input

The input to ModPred is a protein sequence in FASTA format (example), which can be pasted into the text box or uploaded as a file. Only the 20 conventional amino acid symbols are supported; entering any of the ambiguous or non-standard symbols (B, J, O, U, X, Z) will produce an error. The sequence should be at least 30 residues long and should contain at least one modifiable residue for the modification of interest.
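The input requirements above can be checked locally before submitting a sequence. The following is a minimal sketch in Python, not ModPred's own validation code: the function names `read_single_fasta` and `check_sequence` are hypothetical, and the caller is assumed to supply the set of modifiable residues for the modification of interest (for example "K" for acetylation).

```python
# Hypothetical pre-submission check; ModPred's real validation may differ.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 conventional amino acids
MIN_LENGTH = 30  # minimum sequence length stated in the input requirements


def read_single_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # only the first record is read
            header = line[1:]
        elif header is not None:
            chunks.append(line.upper())
    if header is None:
        raise ValueError("no FASTA header line found")
    return header, "".join(chunks)


def check_sequence(sequence, modifiable_residues):
    """Return a list of problems; an empty list means the sequence passes."""
    problems = []
    unsupported = sorted(set(sequence) - VALID_RESIDUES)
    if unsupported:
        problems.append("unsupported symbols: " + "".join(unsupported))
    if len(sequence) < MIN_LENGTH:
        problems.append(
            f"sequence has {len(sequence)} residues; at least {MIN_LENGTH} are required"
        )
    if not any(residue in modifiable_residues for residue in sequence):
        problems.append("no modifiable residue for the chosen modification")
    return problems
```

For example, `check_sequence(sequence, "K")` returns an empty list when a sequence of 30 or more conventional residues contains at least one lysine.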

An additional optional input can be provided if evolutionary features are to be included in the prediction process. This file is a standard Position-Specific Scoring Matrix (PSSM) file as produced by PSI-BLAST (example). To generate a PSSM file, we recommend using the blastpgp program provided in the legacy BLAST package (version 2.2.18). For usage information, please refer to the blastpgp documentation page. For training our models, we ran blastpgp as follows:

blastpgp -i seq.fasta -d nr -Q seq.pssm -h 0.0001 -j 3
where seq.fasta and seq.pssm are the input FASTA file and the output PSSM file, respectively. The NR database was downloaded from NCBI (June 2013) and is available upon request.
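When PSSM files are needed for several sequences, the same command can be assembled programmatically. The sketch below is an illustration only: `blastpgp_command` and `build_pssms` are hypothetical helper names, the flags are exactly those shown in the command above, and it assumes that blastpgp is on the PATH, that a database named nr is configured, and that the FASTA files use the `.fasta` extension.

```python
import subprocess
from pathlib import Path


def blastpgp_command(fasta_path, database="nr", inclusion_threshold="0.0001", iterations="3"):
    """Build the blastpgp argument list used above for one FASTA file.

    The PSSM is written next to the input, with the suffix replaced by .pssm.
    """
    fasta_path = Path(fasta_path)
    pssm_path = fasta_path.with_suffix(".pssm")
    return [
        "blastpgp",
        "-i", str(fasta_path),       # input FASTA file
        "-d", database,              # sequence database
        "-Q", str(pssm_path),        # output PSSM file
        "-h", inclusion_threshold,   # same -h value as in the command above
        "-j", iterations,            # same -j value as in the command above
    ]


def build_pssms(fasta_dir):
    """Run blastpgp once per *.fasta file in a directory (assumed layout)."""
    for fasta_path in sorted(Path(fasta_dir).glob("*.fasta")):
        # check=True raises CalledProcessError if blastpgp exits with an error
        subprocess.run(blastpgp_command(fasta_path), check=True)
```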

Note (important): Due to limited computational resources, we allow prediction of only one sequence (up to 5000 residues long) per user at a time. For larger datasets, we recommend the user-interface version, ModPredUI (hundreds of sequences), or the command-line executable (thousands of sequences).

We provide stand-alone versions of the predictor (Linux and Windows) that you can download and install on your workstation. There is a command-line tool that can easily be used to generate large-scale predictions and/or can be called from within other programs. Additionally, for users who are not familiar with the command-line environment and/or basic programming, there is a version with a user interface that works for medium-scale datasets. Installation and usage instructions for these are available here.

Output

The output consists of 5 columns:

  1. Position
  2. Modification
  3. Score (between 0 and 1)
  4. Confidence (Not modified, low, medium or high)
  5. Remarks (Comments on whether the site is already known or whether it is associated with a motif)

Predictions are made on all modifiable residues of a query sequence, for a given PTM.

Depending on the predictor score, we label PTM site predictions as low, medium or high confidence. Sites with scores of at least 0.5 are labeled as low-confidence sites. Amidation (non-motif) and proteolytic cleavage are exceptions to this rule because, in these cases, there is a high chance that any residue is predicted to be modified at random; for them, only medium- and high-confidence sites are labeled. Sensitivity and specificity estimates at the medium- and high-confidence score thresholds of each model are given in the following tables:

Without PSSMs (Basic model)
Modification type | Residue | Motif/non-motif | Medium-confidence specificity | Medium-confidence sensitivity | Medium-confidence threshold | High-confidence specificity | High-confidence sensitivity | High-confidence threshold
Acetylation | K | NA | 0.9010 | 0.2768 | 0.66 | 0.9916 | 0.0464 | 0.82
ADP-ribosylation | E, R | NA | 0.9063 | 0.3562 | 0.67 | 0.9927 | 0.0633 | 0.94
Amidation | All | Motif | 0.9053 | 0.6243 | 0.63 | 0.9951 | 0.0798 | 0.98
Amidation | All | Non-motif | 0.9018 | 0.9520 | 0.50 | 0.9907 | 0.7850 | 0.86
C-linked glycosylation | W | NA | 0.9085 | 0.7562 | 0.63 | 0.9932 | 0.4542 | 0.91
Carboxylation | E | NA | 0.9019 | 0.7946 | 0.50 | 0.9918 | 0.3592 | 0.93
Disulfide linkage | C | NA | 0.9108 | 0.1824 | 0.68 | 0.9911 | 0.0366 | 0.78
Farnesylation | C | NA | 0.9186 | 0.5325 | 0.84 | 0.9999 | 0.0472 | 0.98
Geranylgeranylation | C | NA | 0.9124 | 0.5711 | 0.81 | 0.9999 | 0.0433 | 0.96
GPI-anchor amidation | N | NA | 0.9034 | 0.9075 | 0.50 | 0.9911 | 0.5183 | 0.90
Hydroxylation | K, P, Y | NA | 0.9061 | 0.5349 | 0.63 | 0.9924 | 0.1089 | 0.94
Methylation | K | NA | 0.9034 | 0.1936 | 0.69 | 0.9910 | 0.0266 | 0.89
Methylation | R | NA | 0.9031 | 0.4372 | 0.66 | 0.9910 | 0.2112 | 0.91
Myristoylation | G | NA | 0.9123 | 0.3532 | 0.83 | 0.9969 | 0.0380 | 0.98
N-linked glycosylation | N | Motif | 0.9082 | 0.1690 | 0.65 | 0.9919 | 0.0202 | 0.76
N-linked glycosylation | N | Non-motif | 0.9039 | 0.5477 | 0.64 | 0.9908 | 0.2779 | 0.90
N-terminal acetylation | A | NA | 0.9025 | 0.4466 | 0.79 | 0.9941 | 0.0694 | 0.97
N-terminal acetylation | G | NA | 0.9116 | 0.5667 | 0.61 | 0.9999 | 0.0460 | 0.98
N-terminal acetylation | M | NA | 0.9021 | 0.7133 | 0.67 | 0.9924 | 0.1288 | 0.99
N-terminal acetylation | S | NA | 0.9039 | 0.2840 | 0.78 | 0.9917 | 0.0565 | 0.93
N-terminal acetylation | T | NA | 0.9102 | 0.2707 | 0.75 | 0.9970 | 0.0205 | 0.97
O-linked glycosylation | S | NA | 0.9030 | 0.3664 | 0.70 | 0.9910 | 0.0700 | 0.90
O-linked glycosylation | T | NA | 0.9040 | 0.3328 | 0.72 | 0.9915 | 0.0520 | 0.92
Palmitoylation | C | NA | 0.9041 | 0.6245 | 0.61 | 0.9926 | 0.1916 | 0.94
Phosphorylation | S | NA | 0.9016 | 0.4727 | 0.68 | 0.9903 | 0.1272 | 0.89
Phosphorylation | T | NA | 0.9040 | 0.3763 | 0.67 | 0.9914 | 0.0898 | 0.87
Phosphorylation | Y | NA | 0.9071 | 0.2787 | 0.67 | 0.9908 | 0.0517 | 0.81
Proteolytic cleavage | All | NA | 0.9013 | 0.3794 | 0.69 | 0.9907 | 0.0854 | 0.88
PUPylation | K | NA | 0.9044 | 0.2184 | 0.68 | 0.9917 | 0.0418 | 0.89
Pyrrolidone carboxylic acid | Q | NA | 0.9045 | 0.6824 | 0.57 | 0.9920 | 0.1875 | 0.93
Sulfation | Y | NA | 0.9029 | 0.7716 | 0.58 | 0.9980 | 0.0565 | 0.95
SUMOylation | K | Motif | 0.9026 | 0.4326 | 0.76 | 0.9936 | 0.0482 | 0.97
SUMOylation | K | Non-motif | 0.9034 | 0.1959 | 0.69 | 0.9910 | 0.0223 | 0.90
Ubiquitination | K | NA | 0.9076 | 0.1639 | 0.66 | 0.9911 | 0.0201 | 0.79

With PSSMs (Evolutionary model)
Modification type | Residue | Motif/non-motif | Medium-confidence specificity | Medium-confidence sensitivity | Medium-confidence threshold | High-confidence specificity | High-confidence sensitivity | High-confidence threshold
Acetylation | K | NA | 0.9056 | 0.3124 | 0.68 | 0.9913 | 0.0573 | 0.85
ADP-ribosylation | E, R | NA | 0.9056 | 0.3694 | 0.67 | 0.9915 | 0.0623 | 0.95
Amidation | All | Motif | 0.9041 | 0.6648 | 0.64 | 0.9971 | 0.0665 | 0.99
Amidation | All | Non-motif | 0.9019 | 0.9557 | 0.50 | 0.9903 | 0.8082 | 0.84
C-linked glycosylation | W | NA | 0.9085 | 0.8365 | 0.55 | 0.9938 | 0.4146 | 0.97
Carboxylation | E | NA | 0.9021 | 0.8432 | 0.50 | 0.9912 | 0.5351 | 0.89
Disulfide linkage | C | NA | 0.9044 | 0.3908 | 0.73 | 0.9912 | 0.0781 | 0.90
Farnesylation | C | NA | 0.9164 | 0.6325 | 0.81 | 0.9999 | 0.0410 | 0.99
Geranylgeranylation | C | NA | 0.9101 | 0.6867 | 0.86 | 0.9999 | 0.4122 | 0.94
GPI-anchor amidation | N | NA | 0.9024 | 0.9052 | 0.50 | 0.9915 | 0.5282 | 0.93
Hydroxylation | K, P, Y | NA | 0.9030 | 0.7322 | 0.57 | 0.9918 | 0.2530 | 0.94
Methylation | K | NA | 0.9037 | 0.2503 | 0.70 | 0.9911 | 0.0419 | 0.90
Methylation | R | NA | 0.9033 | 0.4446 | 0.65 | 0.9910 | 0.2024 | 0.91
Myristoylation | G | NA | 0.9112 | 0.5141 | 0.79 | 0.9997 | 0.0081 | 0.99
N-linked glycosylation | N | Motif | 0.9042 | 0.2767 | 0.67 | 0.9908 | 0.0659 | 0.83
N-linked glycosylation | N | Non-motif | 0.9022 | 0.5607 | 0.64 | 0.9905 | 0.2548 | 0.92
N-terminal acetylation | A | NA | 0.9034 | 0.4636 | 0.82 | 0.9948 | 0.0663 | 0.99
N-terminal acetylation | G | NA | 0.9124 | 0.7889 | 0.55 | 0.9999 | 0.0556 | 0.99
N-terminal acetylation | M | NA | 0.9030 | 0.6989 | 0.65 | 0.9933 | 0.1448 | 0.98
N-terminal acetylation | S | NA | 0.9038 | 0.3368 | 0.80 | 0.9919 | 0.0736 | 0.97
N-terminal acetylation | T | NA | 0.9079 | 0.3916 | 0.75 | 0.9972 | 0.0269 | 0.98
O-linked glycosylation | S | NA | 0.9022 | 0.3789 | 0.71 | 0.9916 | 0.0893 | 0.93
O-linked glycosylation | T | NA | 0.9024 | 0.3690 | 0.72 | 0.9910 | 0.0733 | 0.93
Palmitoylation | C | NA | 0.9026 | 0.6789 | 0.60 | 0.9922 | 0.2435 | 0.95
Phosphorylation | S | NA | 0.9017 | 0.4815 | 0.68 | 0.9918 | 0.1159 | 0.90
Phosphorylation | T | NA | 0.9023 | 0.3893 | 0.67 | 0.9906 | 0.0970 | 0.87
Phosphorylation | Y | NA | 0.9047 | 0.2966 | 0.67 | 0.9909 | 0.0519 | 0.83
Proteolytic cleavage | All | NA | 0.9033 | 0.4200 | 0.70 | 0.9914 | 0.1015 | 0.92
PUPylation | K | NA | 0.9038 | 0.4356 | 0.70 | 0.9931 | 0.0586 | 0.95
Pyrrolidone carboxylic acid | Q | NA | 0.9031 | 0.7697 | 0.54 | 0.9912 | 0.3893 | 0.91
Sulfation | Y | NA | 0.9026 | 0.8322 | 0.54 | 0.9999 | 0.2680 | 0.97
SUMOylation | K | Motif | 0.9028 | 0.4577 | 0.77 | 0.9928 | 0.0894 | 0.97
SUMOylation | K | Non-motif | 0.9077 | 0.2098 | 0.67 | 0.9915 | 0.0271 | 0.89
Ubiquitination | K | NA | 0.9033 | 0.1853 | 0.68 | 0.9909 | 0.0254 | 0.84
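
The labeling rule described before the tables can be expressed as a small function. This is a sketch under stated assumptions, not the predictor's actual code: `confidence_label` is a hypothetical name, a score equal to a threshold is assumed to reach that level, and scores that reach no labeled level are assumed to be reported as "Not modified". The medium and high thresholds come from the row of the table that matches the model, modification and residue.

```python
LOW_CONFIDENCE_CUTOFF = 0.5  # scores of at least 0.5 are low-confidence sites


def confidence_label(score, medium_threshold, high_threshold, label_low=True):
    """Map a predictor score in [0, 1] to a confidence label.

    Set label_low=False for amidation (non-motif) and proteolytic cleavage,
    where only medium- and high-confidence sites are labeled.
    Assumption: comparisons are inclusive (score >= threshold).
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score >= high_threshold:
        return "High"
    if score >= medium_threshold:
        return "Medium"
    if label_low and score >= LOW_CONFIDENCE_CUTOFF:
        return "Low"
    return "Not modified"
```

With the basic-model acetylation thresholds (0.66 and 0.82), a call such as `confidence_label(0.70, 0.66, 0.82)` falls in the medium band.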

Datasets

For the 23 PTMs, proteins containing known modification sites were extracted from databases such as SwissProt (The UniProt Consortium, 2012), HPRD (Prasad, et al., 2009), PDB (Berman, et al., 2000), Phospho.ELM (Dinkel, et al., 2011), PhosphoSitePlus (Hornbeck, et al., 2012) and PHOSIDA (Gnad, et al., 2011), and from an ad-hoc literature search. From these proteins, we extracted modified (positive) fragments, each containing up to 12 upstream and 12 downstream residues around the central residue of interest. The set of negative fragments consisted of sites not known to be modified, extracted from the same proteins. To obtain a non-redundant dataset, no two fragments, whether within the positive dataset, within the negative dataset, or across the two, were allowed to share 40% sequence identity. When a positive and a negative example formed a similar pair, the negative site was always removed as the less reliably labeled of the two. The resulting datasets contained 126,036 positive and 971,129 negative fragments.
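
The fragment extraction step can be illustrated with a short sketch. It assumes 1-based site positions and simply truncates fragments at the sequence ends, which is one reading of "up to 12" residues on each side; `extract_fragment` and `split_fragments` are hypothetical names, and the redundancy-removal step is not covered here.

```python
def extract_fragment(sequence, position, flank=12):
    """Return the fragment centered on a 1-based position.

    Up to `flank` residues are taken on each side; fragments near the
    termini are shorter because they are truncated at the sequence ends.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    index = position - 1
    start = max(0, index - flank)
    end = min(len(sequence), index + flank + 1)
    return sequence[start:end]


def split_fragments(sequence, residue, known_sites, flank=12):
    """Split all occurrences of `residue` into positive and negative fragments.

    `known_sites` is a set of 1-based positions known to be modified; every
    other occurrence of the residue in the same protein is treated as negative.
    """
    positives, negatives = [], []
    for position, symbol in enumerate(sequence, start=1):
        if symbol != residue:
            continue
        fragment = extract_fragment(sequence, position, flank)
        if position in known_sites:
            positives.append((position, fragment))
        else:
            negatives.append((position, fragment))
    return positives, negatives
```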

Raw datasets (before redundancy removal) can be downloaded here.

Predictor Evaluation

To evaluate ModPred, a 10-fold cross-validation strategy was chosen for all PTMs where the number of positive examples was greater than 100 (2-fold for phosphorylation because of the large dataset size). If this was not the case, a leave-one-out strategy was adopted. We measured accuracy on a per-PTM-per-residue/dataset level by estimating sensitivity (sn) and specificity (sp). Sensitivity represents the percentage of true positives predicted to be positive (modified), while specificity represents the percentage of true negatives predicted to be negative (non-modified) (Hastie, et al., 2001). In addition to sensitivity and specificity, we report the area under the ROC curve (AUC) for each modification. The ROC curve represents a mapping of 1 - sp to sn, and the corresponding AUCs were estimated using the trapezoid rule.
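
As an illustration of these measures, the sketch below computes sn and sp at a given score threshold and estimates the AUC with the trapezoid rule over the points (1 - sp, sn). It is not the evaluation code used for ModPred: the function names are hypothetical, labels are assumed to be 1 (modified) or 0 (non-modified), a site is assumed to be predicted positive when its score is at least the threshold, and both classes must be present.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sn, sp) when sites with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels, scores):
    """Estimate the area under the ROC curve with the trapezoid rule."""
    # Start at (0, 0), then add one ROC point per distinct score, sweeping
    # the threshold from the highest score to the lowest. At the lowest
    # score every site is predicted positive, so the curve ends at (1, 1).
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        sn, sp = sensitivity_specificity(labels, scores, threshold)
        points.append((1.0 - sp, sn))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # area of one trapezoid
    return area
```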

References

  1. The UniProt Consortium. (2012). Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Research, 40: D71-D75.
  2. Prasad, T. S. K. et al. (2009). Human Protein Reference Database - 2009 update. Nucleic Acids Research, 37: D767-D772.
  3. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Research, 28: 235-242.
  4. Dinkel, H., Chica, C., Via, A., Gould, C. M., Jensen, L. J., Gibson, T. J., and Diella, F. (2011). Phospho.ELM: a database of phosphorylation sites - update 2011. Nucleic Acids Research, 39: D261-D267.
  5. Hornbeck, P. V., Kornhauser, J. M., Tkachev, S., Zhang, B., Skrzypek, E., Murray, B., Latham, V., and Sullivan, M. (2012). PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Research, 40: D261-D270.
  6. Gnad, F., Gunawardena, J., and Mann, M. (2011). PHOSIDA 2011: the posttranslational modification database. Nucleic Acids Research, 39: D253-D260.
  7. Hastie, T., Tibshirani, R., and Friedman, J. H. (2001). The elements of statistical learning: data mining, inference, and prediction. New York, NY, Springer Verlag.