Protein Family Classification using embedding methods

dc.contributor.author: Saduakhas, Damilya
dc.date.accessioned: 2020-05-07T14:01:20Z
dc.date.available: 2020-05-07T14:01:20Z
dc.date.issued: 2020-04-30
dc.description.abstract: This capstone project examines the performance of existing embedding-based, alignment-free methods for protein family classification. Distributed continuous representations of biological sequences such as DNA and proteins can be analyzed with algorithms based on Natural Language Processing models such as Word2Vec. The performance of ProtVec, proposed by Asgari et al. (2015), is analyzed and compared with its later improvements, and the opportunities that embedding-based methods offer for classification tasks are discussed. The data were obtained from the Swiss-Prot database: 324,018 manually annotated protein sequences were used for a protein family classification task covering 7,027 families. The project tests several potential advantages of these methods and attempts to explain the motivation for using embeddings for classification despite the existence of advanced, highly accurate alignment methods. Two further modifications are proposed: replacing the Euclidean metric with non-Euclidean ones, and using hybrid models. [en_US]
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/4608
dc.language.iso: en [en_US]
dc.publisher: Nazarbayev University School of Sciences and Humanities [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/ [*]
dc.subject: Research Subject Categories::MATHEMATICS::Applied mathematics [en_US]
dc.title: Protein Family Classification using embedding methods [en_US]
dc.type: Capstone Project [en_US]
workflow.import.source: science

Files

Original bundle

Name: Capstone_final_version_Damilya_Saduakhas (1) (1).pdf
Size: 2.98 MB
Format: Adobe Portable Document Format