Title: Comparison of Gene Selection Methods for Clustering Single-cell RNA-seq Data
Volume: 18
Issue: 1
Author(s): Xiaoshu Zhu, Jianxin Wang, Rongruan Li and Xiaoqing Peng*
Affiliation:
- Center for Medical Genetics and Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha 400083, China
Keywords:
Single-cell RNA-seq data, data preprocessing, gene selection, cluster, cell type identification, clustering methods.
Abstract:
Background: In single-cell RNA-seq data analysis, clustering methods are used to identify
cell types in order to understand cell differentiation and development. Because clustering methods are sensitive
to the high dimensionality of single-cell RNA-seq data, one effective solution is to select a subset
of genes and thereby reduce the dimensionality. Numerous methods, with different underlying
assumptions, have been proposed for choosing the subset of genes to be used for clustering.
Objective: To guide users in choosing a suitable gene selection method, we give an overview of different
gene selection methods and compare them in terms of the differences between
the selected gene sets, clustering performance, running time, and stability.
Results: We first review the data preprocessing strategies and gene selection methods used in analyzing
single-cell RNA-seq data. We then analyze the overlaps among the gene sets selected by different methods
and compare the clustering performance obtained with each feature gene set. The analysis
reveals that the gene sets selected by the methods based on highly variable genes (HVG) and high mean
genes (HMG) are the most similar, and that highly variable genes play an important role in clustering. Additionally,
selecting too few genes compromises clustering performance: for example, SCMarker
selected fewer genes than the other methods, leading to poorer clustering performance than M3Drop.
Conclusion: Different gene selection methods perform differently in different scenarios. HVG
works well on full-transcript sequencing datasets, NBDrop and HMG perform better on 3’
end sequencing datasets, M3Drop and HMG are more suitable for large datasets, and SCMarker is
the most consistent across different preprocessing methods.
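
Illustrative sketch (not from the paper): the overlap analysis described in the Results could be approximated by a pairwise similarity score between the gene sets returned by each method. The abstract does not say which overlap measure the authors used, so the Jaccard index here is an assumption. The helper names and the toy gene identifiers are hypothetical as well.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as having no overlap.
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(gene_sets):
    """Map each unordered pair of method names to the Jaccard index of their gene sets."""
    return {
        (m1, m2): jaccard(gene_sets[m1], gene_sets[m2])
        for m1, m2 in combinations(sorted(gene_sets), 2)
    }


# Hypothetical toy gene sets; real selections contain hundreds to thousands of genes.
gene_sets = {
    "HVG": ["g1", "g2", "g3", "g4"],
    "HMG": ["g2", "g3", "g4", "g5"],
    "SCMarker": ["g1", "g6"],
}

overlaps = pairwise_overlap(gene_sets)
for (m1, m2), score in overlaps.items():
    print(f"{m1} vs {m2}: {score:.2f}")
```

A higher score means the two methods selected more similar gene sets. Because the index is normalized by the size of the union, a method that returns far fewer genes than the others will tend to score low against all of them.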