Protein Family Classification using embedding methods
Date
2020-04-30
Authors
Saduakhas, Damilya
Publisher
Nazarbayev University School of Sciences and Humanities
Abstract
This capstone project examines the performance of existing embedding-based, alignment-free
methods for protein family classification. Distributed continuous representations
of biological sequences such as DNA and proteins can be analyzed using algorithms
based on Natural Language Processing models such as Word2Vec. The performance
of ProtVec, proposed by Asgari et al. (2015), was analyzed and compared with that of its later improvements, and the opportunities that embedding-based methods offer in classification tasks were
discussed. The data were obtained from the Swiss-Prot database: 324,018 manually annotated protein sequences were used for a protein family classification task covering 7,027 families.
The project tests several potential advantages of embedding methods and explains the motivation for using
them in classification despite the existence of advanced alignment methods with high accuracy. Further modifications were proposed: replacing the Euclidean metric with non-Euclidean
ones and using hybrid models.
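
To illustrate how a protein sequence can be turned into "words" for a Word2Vec-style model, the sketch below splits a sequence into shifted, non-overlapping 3-grams, the tokenization described for ProtVec by Asgari et al. (2015). The function name `shifted_kmers` and the example sequence are hypothetical, chosen only for illustration; this is not necessarily the exact preprocessing used in this project.

```python
def shifted_kmers(sequence, k=3):
    """Return k lists of non-overlapping k-mers, one list per starting offset.

    Assumption: each list is later treated as one "sentence" when training
    a Word2Vec-style embedding model; the training step is not shown here.
    """
    return [
        [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]
        for offset in range(k)
    ]


# Hypothetical 10-residue sequence, used only to show the splitting.
example = "MKTAYIAKQR"
for offset, words in enumerate(shifted_kmers(example)):
    print(offset, words)
```

Each of the three lists can then be passed to an embedding model as a separate sentence, and a whole protein can be represented by combining the vectors of its 3-grams, for example by summing them.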
Keywords
Research Subject Categories::MATHEMATICS::Applied mathematics