DEEP LEARNING IN GENOMIC SIGNAL PROCESSING

Date

2022-04

Authors

Bekbolat, Marzhan

Publisher

Nazarbayev University School of Engineering and Digital Sciences

Abstract

The complexity of genomics data is increasing in parallel with the development of the field, creating new computational challenges. The recent emergence of next-generation sequencing (NGS) technologies such as single-cell RNA sequencing (scRNA-seq) increases the chance of discovering new disease biomarkers and helps to deepen knowledge of cellular functions. In parallel with developments in genomics, a number of algorithmic and computational advances in machine learning have enabled deep learning to find unprecedented applications in many fields. However, applications of deep learning in genomics remain limited. This is mainly attributed to the relatively small sample size (n) with respect to the large number of genes (p) in such biomedical data. Moreover, lowly expressed genes in a cell cause dropout events, which make scRNA-seq expression data sparse. Among the various types of neural networks, the convolutional neural network (CNN) has become a particularly attractive choice in many applications. Although it has proven effective for complex classification and regression problems in fields that work with high-dimensional data, such as computer vision and natural language processing, there are limitations in applying CNNs to scRNA-seq data. Even though the weight sharing in a CNN improves the network's generalization, the “large p, small n” nature of scRNA-seq data can lead to overfitting. Another problem is that CNNs are designed to work with data that has a grid-like topology, such as time series or digital images, which is not the case for scRNA-seq data. Therefore, in this thesis, we propose a combination of methods based on hierarchical clustering, random projection, and ensemble learning to train CNNs on scRNA-seq data.
Integrating ensemble learning with random projection helps in dealing with the high dimensionality of scRNA-seq data, whereas hierarchical clustering is used as a tool for creating sequential data. The proposed method does not rely on any domain-specific knowledge in creating the sequential data, and hence is applicable not only to scRNA-seq data but also to other applications where data are sparse and high-dimensional.
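
The abstract does not spell out how the three components are combined, so the following is only a minimal sketch under stated assumptions, not the thesis implementation. It assumes that each ensemble member first applies its own Gaussian random projection, then orders the projected features by the leaf order of a hierarchical clustering so that similar features sit next to each other, and finally trains a classifier on the reordered data. A nearest-centroid classifier stands in for the CNN to keep the example short, and the data are synthetic. All function names, the order of the steps, and the parameter values are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list


def order_features(X):
    """Return a column order in which similar features are neighbours."""
    # Cluster the features (columns), so each column is one observation.
    Z = linkage(X.T, method="average", metric="euclidean")
    return leaves_list(Z)  # left-to-right leaf order of the dendrogram


def fit_member(X, y, k, rng):
    """Fit one ensemble member (assumed pipeline, see the note above)."""
    p = X.shape[1]
    R = rng.normal(size=(p, k)) / np.sqrt(k)  # Gaussian random projection
    Xp = X @ R                                # n x k projected data
    order = order_features(Xp)                # pseudo-sequential ordering
    Xs = Xp[:, order]
    # Stand-in for the CNN: a nearest-centroid classifier. Unlike a 1D
    # convolution, it does not actually exploit the feature ordering.
    classes = np.unique(y)
    centroids = np.stack([Xs[y == c].mean(axis=0) for c in classes])
    return R, order, classes, centroids


def predict_member(member, X):
    R, order, classes, centroids = member
    Xs = (X @ R)[:, order]
    d = ((Xs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]


def predict_ensemble(members, X):
    """Majority vote over members (labels must be non-negative integers)."""
    votes = np.stack([predict_member(m, X) for m in members])
    return np.array([np.bincount(col).argmax() for col in votes.T])


# Synthetic "large p, small n" count data: 60 samples, 500 features.
rng = np.random.default_rng(0)
n, p, k = 60, 500, 32
y = np.repeat([0, 1], n // 2)
X = rng.poisson(0.3, size=(n, p)).astype(float)      # mostly zeros
X[y == 1, :50] += rng.poisson(3.0, size=(n // 2, 50))  # class-1 signal

members = [fit_member(X, y, k, rng) for _ in range(5)]
pred = predict_ensemble(members, X)
print("training accuracy:", (pred == y).mean())
```

In a setting closer to the one described above, the nearest-centroid step would be replaced by a 1D CNN trained on the reordered features, which is where placing similar features next to each other would be expected to matter.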

Keywords

Type of access: Gated Access, Research Subject Categories::TECHNOLOGY, new generation sequencing, NGS, single cell RNA sequence, scRNA-seq, Neural Networks

Citation

Bekbolat, M. (2022). DEEP LEARNING IN GENOMIC SIGNAL PROCESSING (Unpublished master's thesis). Nazarbayev University, Nur-Sultan, Kazakhstan.