PoGB-pred: Prediction of Antifreeze Proteins Sequences Using Amino Acid Composition with Feature Selection Followed by a Sequential-based Ensemble Approach

Affan Alim; Abdul Rafay; Imran Naseem

doi:10.2174/1574893615999200707141926

Abstract

Background: Proteins contribute to nearly every task of cellular life. Their functions include building and repairing tissues in humans and other organisms, which makes them the building blocks of bones, muscles, cartilage, skin, and blood. Antifreeze proteins (AFPs) are similarly vital for organisms that live in very cold environments. These proteins allow cold-water organisms to survive at sub-zero temperatures by resisting ice crystallization, which could otherwise rupture internal cells and tissues. AFPs have also attracted interest from the food industry and for cryopreservation.

Objective: As genomic and protein sequence data become increasingly available, an automated and reliable tool for AFP identification is urgently needed. AFP sequences and structures are highly diverse, so most previously proposed methods fail to perform well across different structures. This study proposes a consolidated method that delivers competitive performance on highly diverse AFP structures.

Methods: In this study, Principal Component Analysis (PCA) followed by Gradient Boosting (GB) is proposed for antifreeze protein identification. To evaluate and validate the proposed model, various combinations of amino acid and dipeptide compositions computed over two sequence segments are used as features. PCA performs dimensionality reduction while retaining the directions of highest variance in the data, and gradient boosting, an ensemble method, is then used for modeling and classification.
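The feature construction described above can be illustrated with a minimal sketch. It assumes the sequence is split at its midpoint into two segments and that each segment contributes 20 amino acid frequencies plus 400 dipeptide frequencies; the split point, the function names, and the handling of non-standard residues are illustrative assumptions, not details confirmed by the abstract.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Relative frequency of each standard amino acid (20 values)."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(a) / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Relative frequency of each adjacent residue pair (400 values)."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    counted = 0
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            counted += 1
    total = counted or 1
    return [counts[d] / total for d in DIPEPTIDES]


def two_segment_features(seq):
    """Concatenate both compositions for each half of the sequence.

    Assumption: the two segments are the halves obtained by splitting
    at the midpoint, giving 2 * (20 + 400) = 840 features.
    """
    mid = len(seq) // 2
    features = []
    for segment in (seq[:mid], seq[mid:]):
        features += amino_acid_composition(segment)
        features += dipeptide_composition(segment)
    return features
```

Calling `two_segment_features` on a sequence string returns a fixed-length vector, so sequences of different lengths can be stacked into a single feature matrix.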

Results: The proposed method achieved superior performance on the PDB, Pfam, and UniProt datasets compared with the RAFP-Pred method. In experiment 3, using only 150 PCA components, it achieved an accuracy of 89.63%, exceeding the 87.41% reported for RAFP-Pred with 300 significant features. Experiment 2 was conducted on two different datasets, with non-AFPs taken from the PISCES server and AFPs from the Protein Data Bank. In this experiment, the proposed method attained a sensitivity of 79.16%, which is 12.50% higher than that of the state-of-the-art RAFP-Pred method.
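For readers unfamiliar with the metrics quoted above, the following sketch shows how sensitivity, specificity, and accuracy are conventionally computed from confusion-matrix counts. The counts used in the example are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def sensitivity(tp, fn):
    """Fraction of true AFPs that are correctly identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true non-AFPs that are correctly rejected."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts, not results from the paper.
tp, fn, tn, fp = 8, 2, 85, 5
print(f"sensitivity = {sensitivity(tp, fn):.2%}")
print(f"specificity = {specificity(tn, fp):.2%}")
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.2%}")
```

Because AFPs are typically far rarer than non-AFPs in such datasets, sensitivity is usually reported alongside accuracy rather than replaced by it.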

Conclusion: AFPs share a common function but have distinct structures, so a single model applied to different sequences often fails. The proposed model, which consists of PCA for dimensionality reduction followed by gradient boosting for classification, produced robust results across diverse training and testing datasets and outperformed the previous AFP prediction method, RAFP-Pred. Owing to its simplicity, scalability, and high performance, the model can be readily extended to the analysis of other proteomic and genomic datasets.
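The two-stage model summarized above can be sketched with scikit-learn as PCA feeding a gradient boosting classifier. This is an illustration under stated assumptions, not the authors' implementation: the choice of scikit-learn, the default hyperparameters, and the synthetic random data are all assumptions; only the 150-component setting comes from the abstract.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline


def build_model(n_components=150, random_state=0):
    """PCA for dimensionality reduction, then gradient boosting."""
    return make_pipeline(
        PCA(n_components=n_components, random_state=random_state),
        GradientBoostingClassifier(random_state=random_state),
    )


# Synthetic stand-in data (assumption): 400 samples with 840 features,
# with a small shift added to the positive class so it is learnable.
rng = np.random.default_rng(0)
X = rng.random((400, 840))
y = rng.integers(0, 2, size=400)
X[y == 1, :10] += 0.5

model = build_model()
model.fit(X[:300], y[:300])          # train on the first 300 samples
predictions = model.predict(X[300:])  # predict labels for the rest
```

Wrapping both steps in one pipeline ensures the PCA projection is fitted on the training data only and then reused unchanged at prediction time.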

Keywords: Protein, antifreeze protein, PCA, gradient boosting, classifier, identification.

DOI: https://dx.doi.org/10.2174/1574893615999200707141926
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher

Current Bioinformatics
