Title: m5C-HPromoter: An Ensemble Deep Learning Predictor for Identifying
5-methylcytosine Sites in Human Promoters
Volume: 17
Issue: 5
Author(s): Xuan Xiao*, Yu-Tao Shao, Zhen-Tao Luo and Wang-Ren Qiu*
Affiliation:
- Department of Computer, Jing-De-Zhen Ceramic Institute, 333403, Jing-De-Zhen, China
Keywords:
5-methylcytosine, human promoters, frequency-based One-Hot encoding, deep neural network, ensemble deep learning, DNA methylation.
Abstract:
Aims: This paper aims to identify 5-methylcytosine sites in human promoters.
Background: Aberrant DNA methylation patterns are often associated with tumor development. Moreover,
hypermethylation inhibits the expression of tumor suppressor genes, and hypomethylation stimulates
the expression of certain oncogenes. Most DNA methylation occurs on the CpG islands of gene
promoter regions.
Objective: Therefore, a comprehensive assessment of the methylation status of the promoter regions of human
genes is extremely important for understanding cancer pathogenesis and the function of posttranscriptional
modification.
Methods: This paper constructed three human promoter methylation datasets, which comprise a total
of 3 million sample sequences of small cell lung cancer, non-small cell lung cancer, and hepatocellular
carcinoma from the Cancer Cell Line Encyclopedia (CCLE) database. Frequency-based One-Hot Encoding
was used to encode the sample sequence, and an innovative stacking-based ensemble deep
learning classifier was applied to establish the m5C-HPromoter predictor.
Results: Averaged over 10 runs of 5-fold cross-validation, m5C-HPromoter achieved
Accuracy (Acc) = 0.9270, Matthews correlation coefficient (MCC) = 0.7234,
Sensitivity (Sn) = 0.9123, and Specificity (Sp) = 0.9290.
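The four measures reported above follow the standard confusion-matrix definitions. The following is a minimal Python sketch, assuming binary labels with methylated sites as the positive class. The function name `binary_metrics` and the example counts are illustrative and are not taken from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute Acc, MCC, Sn and Sp from confusion-matrix counts.

    tp/fn: positive (methylated) samples predicted correctly/incorrectly.
    tn/fp: negative samples predicted correctly/incorrectly.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    # Sensitivity: fraction of true positives that are recovered.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that are recovered.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    # MCC is conventionally set to 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "MCC": mcc, "Sn": sn, "Sp": sp}


# Illustrative counts only (not the paper's data).
example = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(example)  # Acc 0.85, Sn 0.9, Sp 0.8, MCC about 0.70
```

MCC uses all four cells of the confusion matrix, so on imbalanced data it can sit well below Acc, as it does in the figures reported here.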
Conclusion: Numerical experiments showed that the proposed m5C-HPromoter substantially improves
prediction performance over the existing iPromoter-5mC predictor. The primary reason is
that frequency-based One-Hot encoding avoids the overly long and sparse feature vectors of One-Hot
encoding and effectively reflects the sequence features of DNA sequences. The second reason is that the
combination of upsampling and downsampling effectively addresses the class imbalance
problem. The third reason is that the stacking-based ensemble deep learning model offsets the
weaknesses of the individual models while combining their strengths. The user-friendly web server
m5C-HPromoter is freely accessible to the public at http://121.36.221.79/m5C-HPromoter
or http://bioinfo.jcu.edu.cn/m5C-HPromoter, and the predictor program is available at
https://github.com/liujin66/m5C-HPromoter.
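The abstract does not specify how upsampling and downsampling are combined. One common scheme, sketched here purely as an assumption, resizes both classes to a shared target size: the minority class is randomly oversampled with replacement and the majority class is randomly undersampled without replacement. The function name `rebalance`, the `target` parameter, and the toy sequences are hypothetical.

```python
import random


def rebalance(positives, negatives, target, seed=0):
    """Resize both classes to `target` samples each.

    Assumed scheme (not confirmed by the paper): a class smaller than
    `target` is upsampled with replacement; a larger class is
    downsampled without replacement.
    """
    rng = random.Random(seed)

    def resize(samples):
        if len(samples) >= target:
            # Downsample: pick `target` distinct samples.
            return rng.sample(samples, target)
        # Upsample: keep every original, then add random duplicates.
        extra = rng.choices(samples, k=target - len(samples))
        return list(samples) + extra

    return resize(positives), resize(negatives)


# Toy data: 3 methylated and 20 unmethylated placeholder sequences.
pos = ["POS%d" % i for i in range(3)]
neg = ["NEG%d" % i for i in range(20)]
pos_bal, neg_bal = rebalance(pos, neg, target=10)
print(len(pos_bal), len(neg_bal))  # 10 10
```

Resampling of this kind would normally be applied to the training folds only, so that the cross-validation test folds keep the original class ratio.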