Title: Suitability of Sequence-Based Feature Vector for Classification Algorithm Improves Accuracy of Human Protein-Protein Interaction Prediction: A Red Blood Cell Case Study
Volume: 11
Issue: 2
Author(s): Afsaneh Maali, Mahmood A. Mahdavi and Reza Gheshlaghi
Affiliation:
Keywords:
classification algorithms; protein-protein interaction prediction; sequence-based feature vectors; machine learning;
human protein-protein interaction; accuracy of interaction prediction.
Abstract: Supervised learning algorithms are used to classify human protein-protein interaction
information and consolidate existing data. These algorithms require a feature vector to generate
a prediction model, and feature vectors can be constructed from various input data. Matching the
feature vector to the classification algorithm yields a more predictive model and more accurate
predictions from low-dimension vectors. To investigate suitable combinations of feature sets and
algorithms, three amino acid (AA) feature vectors (AA Frequency, AA Graphical Parameter, and AA
Triplex), based solely on knowledge of the primary structure of human red blood cell proteins,
were constructed and then applied to five different classification methods. The results indicated that
the support vector machine (SVM) algorithm produced the highest accuracy, 84.65%, with the AA Graphical Parameter
feature set, while it reached an accuracy of 80.65% with the AA Triplex feature set. Random forest (RF) achieved a high
average accuracy of 83.69% across all three feature sets. Among the Bayesian classifiers, tree-augmented naive Bayes
(TAN) performed better than naive Bayes (NB) with all three feature sets. The artificial neural network (ANN) classifier
had the lowest average accuracy, 76%; however, its performance was comparable to that of TAN when the AA Triplex
feature set was used, with an accuracy of 77.90%. These figures demonstrate that selecting an appropriate feature set
for a classification task yields higher accuracy, with the added advantage of using low-dimension feature vectors
constructed from simpler data.
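
To make the idea of a sequence-based, low-dimension feature vector concrete, the following is a minimal sketch of an amino acid frequency encoding for a protein pair. It is an illustration only, not the authors' implementation: the function names, the toy sequences, and the choice to concatenate the two 20-dimensional frequency vectors into one 40-dimensional pair vector are all assumptions, since the abstract does not specify how the AA Frequency feature set is built.

```python
from collections import Counter

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequency(sequence):
    """Return the 20-dimensional relative frequency vector of a sequence.

    Non-standard residue symbols are ignored (an assumption of this sketch).
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_features(seq_a, seq_b):
    """Concatenate both frequency vectors into one pair vector.

    Concatenation is an assumed way to represent a protein pair; the
    source abstract does not state how pairs are encoded.
    """
    return aa_frequency(seq_a) + aa_frequency(seq_b)


# Hypothetical toy sequences, used only to show the vector shape.
vec = pair_features("MVLSPADKTN", "MVHLTPEEKS")
print(len(vec))  # 40
```

A vector of this kind, paired with an interaction label, could then be passed to any of the classifiers named in the abstract, such as an SVM or a random forest.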