Prediction of Protein Structure Using Backbone Fragment Library and a Multilayered Learning Algorithm

 Pramod P Wangikar, Ashish V Tendulkar, Sunita Sarawagi

Department of Chemical Engineering, Indian Institute of Technology, Bombay, Powai Mumbai 400076 INDIA. 

Kanwal Rekhi School of Information Technology, Indian Institute of Technology, Bombay, Powai Mumbai 400076 INDIA.  Email: 

Current approaches to protein structure prediction rely on (i) homology of the entire protein sequence with a 
template structure or (ii) ab initio prediction methods. These methods suffer from two disadvantages: (a) the lack 
of a homologous template structure for a majority of new sequences and (b) an intractably large conformational 
search space for ab initio predictions. We propose a method that exploits the correlation between conformation and 
sequence features in the local region. For this purpose, we first constructed a library of local conformation 
classes, or backbone fragments. We use the octapeptide as an arbitrary unit of local conformation. Using a 
“geometric invariant based approach” [1,2], we show that the octapeptide fragment structures can be clustered 
into 46 structural classes. 
The protein 3-D structure can now be described as a sequence of backbone fragments or structure labels. 
Note that the average 3-D structure for each of the 46 structure labels is available and that the 3-D structure 
of a protein can be reconstructed from the sequence of structure labels. Analysis of the sequence features 
reveals a sequence-structure relationship in local regions, which can be exploited to predict the local 
conformations of a protein from its amino acid sequence. We have formulated this problem as a classical 
text segmentation problem using a Conditional Random Field (CRF). A CRF considers a Markov random field (Y) 
globally conditioned on another random field (X). In this case, Y is the sequence of structure labels while X is the 
amino acid sequence. The accuracy of the CRF predictions was augmented by using a Support Vector Machine 
(SVM) as an additional layer of learning. In this layered algorithm, the CRF manages the task of segmentation while 
the SVM provides input for labeling of the segments. The model was trained with 146 high-resolution X-ray crystal 
structures to obtain over 30% prediction accuracy for 60 unseen sequences. We believe that the prediction 
accuracy can be improved further by fine-tuning the model parameters or by using a larger data set. 
We argue that the results of this prediction algorithm can be used to complement the efforts of homology 
modeling as well as ab initio predictions.
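
To make the labeling formulation concrete, the following is a minimal sketch of how a protein sequence might be cut into overlapping octapeptide windows and how a most probable sequence of structure labels could be decoded with the Viterbi algorithm, the standard decoder for linear-chain models such as a CRF. It is an illustration under stated assumptions, not the model described here: the function names, the toy sequence, the use of 3 labels instead of 46, and the random emission and zero transition scores are all hypothetical placeholders. In a layered setup, the emission scores would be the place where the output of a trained classifier could enter.

```python
import numpy as np

WINDOW = 8  # octapeptide unit of local conformation


def octapeptide_windows(sequence):
    """Return every overlapping 8-residue window of an amino acid sequence."""
    return [sequence[i:i + WINDOW] for i in range(len(sequence) - WINDOW + 1)]


def viterbi(emission, transition):
    """Highest-scoring label path for a linear-chain model.

    emission:   (T, K) array, score of label k at position t
    transition: (K, K) array, score of moving from label i to label j
    """
    T, K = emission.shape
    score = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        # candidate[i, j]: best score ending in label i at t-1, then moving to j
        candidate = score[t - 1][:, None] + transition
        back[t] = candidate.argmax(axis=0)
        score[t] = candidate.max(axis=0) + emission[t]
    # Trace the best path backwards from the best final label.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# Toy usage: hypothetical sequence, 3 hypothetical labels, untrained scores.
sequence = "MKTAYIAKQRQISFVK"
windows = octapeptide_windows(sequence)
rng = np.random.default_rng(0)
emission = rng.normal(size=(len(windows), 3))  # placeholder, not learned
transition = np.zeros((3, 3))                  # placeholder, not learned
labels = viterbi(emission, transition)         # one label index per window
```

Each entry of `labels` is the index of a structure label assigned to the corresponding window. With real parameters, the transition scores would encode which fragments tend to follow one another along the backbone.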

  1. Tendulkar et al. (2005) “A geometric invariant-based framework for the analysis of protein conformational space” Bioinformatics, 21, 3622-3628.

  2. Tendulkar et al. (2004) “Clustering of Protein Structural Fragments Reveals Modular Building Block Approach of Nature” J. Mol. Biol., 338, 611-629.