Variation Profiling

OGPET v1.0 is a tool for predicting mucin-type O-glycosylated residues based on two novel approaches developed in our laboratory: variation profiling and prediction constraints. Although the biology behind the tool is rather simple, the mathematical handling and interpretation of this kind of biological data are considerably more complicated than in conventional bioinformatics approaches. Moreover, whenever new experimentally validated O-glycosylated Thr/Ser residues are published, the OGPET v1.0 profiling set is updated so that the scientific community has the most comprehensive tool available.

Approximately 2400 mucin-type O-glycosylation sites, found in 242 glycoproteins, were extracted from O-GLYCBASE v6.00 (Gupta et al., 1999) and arranged into patterns. This revised database includes mucin-type glycoproteins from a variety of eukaryotic organisms, as diverse as Homo sapiens and Drosophila melanogaster.

OGPET v1.0 matches the input sequence(s) against a series of patterns. All patterns share a common structure that includes 5 relevant positions (-3, -1, +1, +3, and +4) relative to the experimentally mapped Thr/Ser residue (position 0). These 5 positions are known to affect the interaction of the polypeptide GalNAc-transferase (ppGalNAcT) with the target peptide/protein (Gerken et al., 2004; Nehrke et al., 1996; Nehrke et al., 1997). Two positions (-2 and +2, relative to the Thr/Ser residue), however, do not play a significant role in the O-glycosylation process according to our analysis. The basic structure of a pattern:

(-3) X (-1) [Thr/Ser] (+1) X (+3) (+4)
where 'X' can be any amino acid (the amino acids at -2 and +2 are non-relevant residues).

To create the patterns, the flanking sequence of each experimentally mapped O-glycosylated Thr/Ser residue was extracted and used as a profiling pattern. For example, consider the granulocyte-macrophage colony-stimulating factor from Homo sapiens (DB REF: SWISS P04141, CSF2_HUMAN):

MWLQSLLLLGTVACSISAPARSPSPSTQPWEHVNAIQEARRLLNLSRDTAAEMNETVEVISEMFD
.....................S.S.ST......................................

Three Ser residues and one Thr residue are experimentally mapped as O-glycosylated. Taking the flanking sequence of each site in the order in which the sites appear along the protein sequence, and considering that OGPET v1.0 ignores the amino acids at positions -2 and +2, we have:

PnRSPnPS
RnPSPnTQ
PnPSTnPW
SnSTQnWE

where the amino acid marked in bold red (the fourth character of each pattern) is the O-glycosylated one and 'n' can be any amino acid. OGPET v1.0 creates the same kind of pattern for each of the approximately 2400 residues extracted from O-GLYCBASE v6.00.

The prediction capabilities are further extended by the variation profiling approach. Its basic principle is that an amino acid located at any of the five relevant positions (-3, -1, +1, +3, and +4) can be substituted by another amino acid with exactly the same physicochemical properties at that same position without altering the overall context of the flanking sequence of a potential O-glycosylated residue. To do this, all 20 amino acids were grouped according to the physical and chemical properties they share with one another. Six different properties were considered, and each property has a regexp code number assigned to it:
Each amino acid was then grouped with all the other amino acids that share exactly the same set of properties. For example, Thr is a polar, small, and turn-like amino acid (4, 5, 6), just like Cys, which is also 4, 5, and 6. The amino acids that are well known to strongly favor the O-glycosylation process (i.e., Pro, Ala, Val, and Ser) could not be grouped; in the end, three visible substitution groups were formed:
Returning to our previous example, and considering only the second and third patterns, it is now possible to create variations of the original patterns using the three substitution groups:
where the amino acid marked in red is the O-glycosylated one, and the one in blue and underlined is the substituted amino acid. In the end, the total number of prediction patterns increased significantly. Nonetheless, the rate of false-positive hits remained statistically the same. OGPET v1.0 was tested with sets of glycoproteins and sets of proteins experimentally known not to be O-glycosylated (e.g., albumin, ribonucleases, certain bovine lipases, human complement subunits, and CD40 ligands) and showed no variation in its performance; that is, the overall sensitivity and specificity of the algorithm (>0.97 and >0.98, respectively) did not change.

Scoring System

The score (max. = 1.0000) of each potential Thr/Ser residue reflects the likelihood that the residue is O-glycosylated, given the amino acid composition of its flanking sequence. A set of 140 amino acid patterns from 50 randomly selected O-GLYCBASE v6.00 glycoproteins served as the training set for a score threshold model based on the residues that were experimentally verified to be O-glycosylated. A cutoff score (= 0.4247) was implemented: a potential residue must obtain a value greater than the cutoff to be considered a true-positive hit.
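The cutoff rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-position contribution table is entirely hypothetical (the actual OGPET v1.0 values are not reproduced here), and the function names are invented for the example. Only the five relevant positions, the maximum score of 1.0, and the cutoff of 0.4247 come from the text.

```python
CUTOFF = 0.4247  # cutoff score quoted in the text

# HYPOTHETICAL contribution tables, one per relevant offset from the
# candidate Thr/Ser (position 0). These are NOT OGPET's real values; they
# are chosen only so that the best possible flank sums to 1.0.
CONTRIBUTIONS = {
    -3: {"P": 0.20, "S": 0.10},
    -1: {"P": 0.20, "T": 0.10},
    +1: {"P": 0.20, "S": 0.10},
    +3: {"P": 0.20, "T": 0.10},
    +4: {"S": 0.20, "P": 0.10},
}


def score_site(sequence, index):
    """Sum the contributions of the residues flanking sequence[index].

    Offsets that fall outside the sequence, and residues missing from a
    table, contribute 0.
    """
    total = 0.0
    for offset, table in CONTRIBUTIONS.items():
        pos = index + offset
        if 0 <= pos < len(sequence):
            total += table.get(sequence[pos], 0.0)
    return total


def is_predicted(sequence, index):
    """A site is reported only if it is Ser/Thr and scores above the cutoff."""
    return sequence[index] in "ST" and score_site(sequence, index) > CUTOFF
```

With these made-up tables, `is_predicted("PAPSPAPS", 3)` is true because every relevant position carries its top-scoring residue, whereas `is_predicted("GGGSGGGG", 3)` is false because no flanking residue contributes to the score.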
The numerical value of each amino acid at any given position is calculated from two variables: its number of appearances, and the ratio of that number to the number of appearances of the most frequent amino acid. In a), the most common amino acid is Ser, which contributes the most to the overall score, with a numerical value of 0.15. In b) and c), the most frequent amino acid is Thr, with contributions of 0.20 and 0.15, respectively.

Prediction Constraints

O-glycosylation seems to occur without any associated pattern per se, simply through the presence of certain amino acids that make this biological process likely. Moreover, it is now accepted that peptide substrate specificities can vary among the different ppGalNAcT family members, especially because the activity of the different enzymes is significantly altered by prior peptide glycosylation (Schwientek et al., 2002; Ten Hagen et al., 2001).

Twelve members of the mammalian ppGalNAcT family (ppGalNAcT1-T12) have been described to date, and homologous ppGalNAcTs have been described in Drosophila melanogaster and Caenorhabditis elegans as well (Hagen and Nehrke, 1998; Schwientek et al., 2002; Ten Hagen et al., 2001). Studies with ppGalNAcTs have revealed that not all of the 5 positions considered in the OGPET v1.0 analysis are necessarily involved in the interaction of the ppGalNAcT with the target peptide/protein. Therefore, a new set of prediction constraints was developed to suit the specificity of each possible ppGalNAcT isoform.

Originally, OGPET v1.0 would only match the input sequence against the original or variation patterns from the training set. However, despite the huge number of prediction patterns, this approach seems to be very limited in some cases. After several trial-and-error tests, the prediction capabilities of OGPET v1.0 increased when the user was allowed to choose the positions to be considered in the analysis.
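The effect of a prediction constraint can be illustrated with a minimal Python sketch. It builds the four CSF2_HUMAN patterns from the sequence fragment and mapped sites given earlier, then matches a candidate residue using only a chosen subset of positions. The function names, the two test peptides, and the decision to accept either Ser or Thr at position 0 regardless of the training residue are assumptions made for this example; they are not taken from the OGPET v1.0 implementation.

```python
# Fragment of CSF2_HUMAN (P04141) and its mapped sites, as given in the text.
CSF2_FRAGMENT = "MWLQSLLLLGTVACSISAPARSPSPSTQPWEHVNAIQEARRLLNLSRDTAAEMNETVEVISEMFD"
SITES = [21, 23, 25, 26]          # 0-based indices of the S, S, S, T sites
RELEVANT = (-3, -1, 1, 3, 4)      # default constraint: all five positions


def to_pattern(sequence, index):
    """Render the 8-character pattern, writing 'n' at offsets -2 and +2."""
    return "".join(
        "n" if off in (-2, 2) else sequence[index + off]
        for off in range(-3, 5)
    )


def flank(sequence, index, positions=RELEVANT):
    """Map each requested offset to its residue; None if out of range."""
    result = {}
    for off in positions:
        pos = index + off
        if not 0 <= pos < len(sequence):
            return None
        result[off] = sequence[pos]
    return result


def matches(sequence, index, training, positions=RELEVANT):
    """True if the candidate Ser/Thr agrees with any training flank at
    every position named in the constraint (assumed matching rule)."""
    if sequence[index] not in "ST":
        return False
    candidate = flank(sequence, index, positions)
    if candidate is None:
        return False
    return any(
        all(site[off] == candidate[off] for off in positions)
        for site in training
    )


PATTERNS = [to_pattern(CSF2_FRAGMENT, i) for i in SITES]
TRAINING = [flank(CSF2_FRAGMENT, i) for i in SITES]
```

`PATTERNS` reproduces the four patterns listed for this protein. The made-up peptide `GPARSPAPAG` differs from the first pattern only at +4, so `matches("GPARSPAPAG", 4, TRAINING)` is false under the default constraint but becomes true with `positions=(-3, -1, 1, 3)`: dropping a position admits more hits, both novel sites and false positives.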
The rate of false-positive hits increases, as do the chances of finding novel sites. It is up to the user to define the best prediction constraint for the model under study. Nevertheless, we strongly discourage the use of any combination other than the following, in the given order:
OGPET v1.0 was tested with sets of proteins experimentally known not to be O-glycosylated (e.g., BSA, human ribonucleases A & B, human CD40 ligands). Removing a single position from a prediction constraint significantly increased the rate of false-positive hits compared with the default option, which includes all 5 positions (number of false-positive hits = 1). The probability of finding novel sites that were not originally included in any of the OGPET v1.0 training sets increases significantly as well. Removing two or more positions makes this rate skyrocket, which makes these particular prediction constraints unnecessary.

References