A Comparison of Mutual Information, Linear Models and Deep Learning
Networks for Protein Secondary Structure Prediction

Saida Saad Mohamed Mahmoud; Beatrice Portelli; Giovanni D'Agostino; Gianluca Pollastri; Giuseppe Serra; Federico Fogolari

doi:10.2174/1574893618666230417103346

Abstract

Background: Over the last several decades, predicting protein structures from amino acid sequences has been a core task in bioinformatics. Nowadays, the most successful methods employ multiple sequence alignments and can predict the structure with excellent performance. These predictions take advantage of all the amino acids at a given position and their frequencies. However, the effect of single amino acid substitutions in a specific protein tends to be hidden by the alignment profile. For this reason, single-sequence-based predictions attract interest even after accurate multiple-alignment methods have become available: the use of single sequences ensures that the effects of substitution are not confounded by homologous sequences.

Objective: This work aims to understand how the single-sequence secondary structure prediction for a residue is influenced by the surrounding residues, and how different prediction methods use single-sequence information to predict the structure.

Methods: We compare mutual information, the coefficients of two linear models, and three deep learning networks. For the deep learning networks, we use DeepLIFT analysis to assess the effect of each residue at each position on the prediction.
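As an illustration of the first of these measures, the following minimal sketch estimates the mutual information between the amino acid found at a given offset from a residue and the secondary structure class of that residue. It is not the authors' implementation: the function name `mutual_information`, the count-based probability estimate, the use of bits as the unit, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter


def mutual_information(sequences, structures, offset):
    """Estimate MI (in bits) between the amino acid at position i + offset
    and the secondary structure class at position i.

    Hypothetical helper for illustration; probabilities are plain
    frequency estimates with no pseudocounts or bias correction.
    """
    pair_counts = Counter()
    for seq, ss in zip(sequences, structures):
        for i in range(len(ss)):
            j = i + offset
            if 0 <= j < len(seq):
                pair_counts[(seq[j], ss[i])] += 1

    total = sum(pair_counts.values())
    if total == 0:
        return 0.0

    # Marginal counts for amino acids and for structure classes.
    aa_counts = Counter()
    ss_counts = Counter()
    for (aa, s), n in pair_counts.items():
        aa_counts[aa] += n
        ss_counts[s] += n

    # MI = sum over pairs of p(a, s) * log2( p(a, s) / (p(a) * p(s)) ).
    mi = 0.0
    for (aa, s), n in pair_counts.items():
        mi += (n / total) * math.log2(n * total / (aa_counts[aa] * ss_counts[s]))
    return mi


# Toy data (invented): A always pairs with helix (H), G with coil (C).
toy_sequences = ["AAAAGGGG", "GGGGAAAA"]
toy_structures = ["HHHHCCCC", "CCCCHHHH"]

mi_same_position = mutual_information(toy_sequences, toy_structures, offset=0)
mi_next_position = mutual_information(toy_sequences, toy_structures, offset=1)
```

On this toy data the residue at the same position determines the class completely, so the estimate at offset 0 is one bit, while the neighbouring position is less informative. Scanning the offset over a window around the central residue would give a per-position profile of the kind that can be set beside linear model coefficients.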

Results: Mutual information and linear models quantify direct effects, whereas DeepLIFT applied to deep learning networks quantifies both direct and indirect effects.

Conclusion: Our analysis shows how different network architectures use the information of single protein sequences and highlights their differences with respect to linear models. In particular, the deep learning implementations take into account context and single position information differently, with the best results obtained using the BERT architecture.

Keywords: Secondary structure prediction, single sequence, mutual information, linear model, deep learning, neural network, LSTM, BERT.


Journal: Current Bioinformatics
Publisher Name: Bentham Science Publisher
Print ISSN: 1574-8936
Online ISSN: 2212-392X
DOI: https://dx.doi.org/10.2174/1574893618666230417103346