Title: COVID-19 Biomarkers Recognition & Classification Using Intelligent Systems
Volume: 17
Issue: 5
Author(s): Javier Bajo-Morales*, Juan Carlos Prieto-Prieto, Luis Javier Herrera, Ignacio Rojas and Daniel Castillo-Secilla
Affiliation:
- Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Deuser Tech Group, Calle Islandia, 182-NAV 24A, Córdoba, 14014, Córdoba, Spain
Keywords:
COVID-19, RNA-Seq, machine learning, feature selection, gene signature, WHO.
Abstract:
Background: SARS-CoV-2 has paralyzed mankind due to its high transmissibility and its
associated mortality, causing millions of infections and deaths worldwide. The search for gene expression
biomarkers from the host transcriptional response to infection may help understand the underlying
mechanisms by which the virus causes COVID-19. This research proposes a smart methodology integrating
different RNA-Seq datasets from SARS-CoV-2, other respiratory diseases, and healthy patients.
Methods: The proposed pipeline exploits the functionality of the ‘KnowSeq’ R/Bioc package, integrating
different data sources and attaining a significantly larger gene expression dataset, thus endowing the
results with higher statistical significance and robustness in comparison with previous studies in the literature.
A detailed preprocessing step was carried out to homogenize the samples and build a clinical
decision system for SARS-CoV-2. It uses machine learning techniques such as a feature selection algorithm
and a supervised classification system. This clinical decision system uses the most differentially
expressed genes among different diseases (including SARS-CoV-2) to develop a four-class classifier.
Results: The designed multiclass classifier can discern SARS-CoV-2 samples, reaching an accuracy
of 91.5%, a mean F1-score of 88.5%, and a SARS-CoV-2 AUC of 94% using only
15 genes as predictors. A biological interpretation of the extracted gene signature reveals relations with
processes involved in viral responses.
Conclusion: This work proposes a COVID-19 gene signature composed of 15 genes, selected by applying
the ‘minimum Redundancy Maximum Relevance’ (mRMR) feature selection algorithm. The integration
of several RNA-Seq datasets was successful, yielding a considerably larger number of samples
and therefore providing greater statistical significance to the results than in previous studies. A biological
interpretation of the selected genes was also provided.
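
Illustrative sketch: the ‘minimum Redundancy Maximum Relevance’ selection named above can be illustrated with a small greedy procedure. The Python below is a minimal sketch under stated assumptions, not the authors' pipeline and not the KnowSeq API: it assumes expression values have already been discretized (here into 0/1), uses the difference form of the criterion (relevance minus mean redundancy), and all names (mutual_information, mrmr_select, gene_a to gene_d) are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedy mRMR, difference form (an assumption of this sketch).

    At each step, pick the feature that maximizes its relevance to the
    labels minus its mean redundancy with the features already selected.
    """
    relevance = {
        name: mutual_information(values, labels)
        for name, values in features.items()
    }
    selected = []
    candidates = set(features)
    while candidates and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s])
                for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy, made-up data: four classes, eight samples, binarized "expression".
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]
toy_features = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1],  # separates classes {0,1} from {2,3}
    "gene_b": [0, 0, 0, 0, 1, 1, 1, 1],  # exact duplicate of gene_a
    "gene_c": [0, 0, 1, 1, 0, 0, 1, 1],  # complementary to gene_a
    "gene_d": [0, 1, 0, 1, 0, 1, 0, 1],  # carries no class information
}

selected = mrmr_select(toy_features, toy_labels, k=2)
print(selected)  # ['gene_a', 'gene_c']
```

In this toy case gene_b is as relevant as gene_a but fully redundant with it, so the second pick is gene_c, which adds new information. Real RNA-Seq expression values are continuous, so a practical implementation would need a binning step or a continuous mutual information estimator.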