Title: i4mC-CPXG: A Computational Model for Identifying DNA N4-methylcytosine
Sites in Rosaceae Genome Using Novel Encoding Strategy
Volume: 18
Issue: 1
Author(s): Lichao Zhang, Ying Liang, Kang Xiao and Liang Kong*
Affiliation:
- School of Mathematics and Information Science & Technology, Hebei Normal
University of Science & Technology, Qinhuangdao, P.R. China
- Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, P.R. China
Keywords:
N4-methylcytosine, new encoding technology, position specific information, extreme gradient boosting, i4mCw2vec, i4mC-CPXG.
Abstract:
Background: N4-methylcytosine (4mC) is one of the most widespread DNA methylation
modifications and plays an important role in DNA replication and repair, epigenetic inheritance,
gene expression, and transcriptional regulation. Although biological experiments can identify
potential 4mC modification sites, they are constrained by experimental conditions and are labor-intensive.
Therefore, it is crucial to construct a computational model to identify 4mC sites.
Objective: Although some computational methods have been proposed to identify 4mC sites,
several issues should not be ignored: (1) biological sequences contain a large number of unknown
nucleotides; (2) previous encoding technologies produce a large number of zeros; (3)
sequence distribution information is important for identifying 4mC sites. Considering these aspects, we
propose a computational model based on a novel encoding strategy with position specific information
to identify 4mC sites.
Methods: We constructed an accurate computational model, i4mC-CPXG, based on extreme gradient
boosting. Feature vectors were extracted from two aspects: nucleotide information and position
specific information. For nucleotide information, we used prior information to
identify the base type of each unknown nucleotide and to reduce the influence of invalid information
caused by the many zeros. For position specific information, the vector was designed
to express the base distribution and arrangement. The feature vector fusing nucleotide
information and position specific information was then input into extreme gradient boosting to construct
the model.
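As a rough illustration of the fusion step described in Methods, the sketch below builds a feature vector from two parts and concatenates them. The abstract does not specify the encodings, so everything concrete here is an assumption made for illustration: the uniform prior for unknown bases, the running-density position feature, and the function names `encode_nucleotides`, `encode_positions` and `fuse_features` are hypothetical and are not the published i4mC-CPXG scheme.

```python
# Illustrative sketch only: the concrete encodings below are assumptions,
# not the scheme published for i4mC-CPXG.

BASES = "ACGT"


def encode_nucleotides(seq, prior=None):
    """Return a 4-dim vector per position.

    Known bases are one-hot encoded. An unknown base (e.g. N) receives
    prior probabilities instead of an all-zero vector (assumed uniform prior).
    """
    prior = prior or {b: 0.25 for b in BASES}
    features = []
    for base in seq.upper():
        if base in BASES:
            features.extend(1.0 if base == b else 0.0 for b in BASES)
        else:
            features.extend(prior[b] for b in BASES)
    return features


def encode_positions(seq):
    """Return, for each position, the running density of the base seen there.

    This is a hypothetical stand-in for a position specific feature that
    reflects base distribution and arrangement along the sequence.
    """
    counts = {}
    features = []
    for i, base in enumerate(seq.upper(), start=1):
        counts[base] = counts.get(base, 0) + 1
        features.append(counts[base] / i)
    return features


def fuse_features(seq):
    """Concatenate the nucleotide part and the position specific part."""
    return encode_nucleotides(seq) + encode_positions(seq)


if __name__ == "__main__":
    vector = fuse_features("ACNGA")
    print(len(vector))  # 5 positions * 4 values + 5 position values = 25
```

A vector of this kind, computed for every sequence in a dataset, could then be supplied to a gradient boosting classifier; the training step is omitted because the abstract gives no hyperparameters or implementation details.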
Results: The accuracy of i4mC-CPXG is 82.49% on the independent dataset. This result is better than
that of i4mC-w2vec, which was the best model on the imbalanced dataset with a ratio of 1:15.
Meanwhile, our model achieved good performance on other species. These results validate the effectiveness
of i4mC-CPXG.
Conclusion: Our method is effective at identifying potential 4mC modification sites owing to the proposed
new encoding strategy fused with position specific information. The satisfactory prediction results on balanced
datasets, imbalanced datasets and datasets from other species indicate that i4mC-CPXG can
provide a valuable supplement to biological research.