Towards Effective Usage Of Unlabeled Data In Small Labeled Sample Classification

dc.contributor.authorMukhamediya, Azamat
dc.date.accessioned2025-08-08T09:54:10Z
dc.date.available2025-08-08T09:54:10Z
dc.date.issued2025-07-10
dc.description.abstractIn numerous real-world applications, the scarcity of high-quality labeled data constitutes a significant impediment to the development of supervised machine learning models. This challenge primarily arises because manual annotation is often resource-intensive, requiring considerable time, expert knowledge, specialized equipment, or elaborate experimental procedures. Consequently, collecting a sufficiently large labeled dataset is frequently impractical. Conversely, unlabeled data are typically abundant and readily accessible.

This thesis addresses the problem of small labeled sample classification, wherein only a limited annotation budget is available. It demonstrates that deliberate exploitation of unlabeled data can (i) substantially enhance the predictive performance of classifiers trained on small labeled datasets and (ii) reduce the cost associated with data labeling. To this end, the research investigates two principal directions, distinguished by the accessibility of a labeling expert: semi-supervised learning (without direct expert access) and active learning (with expert access). This raises two research questions: (i) how can we use unlabeled data to improve the performance of a classifier trained on small labeled data? and (ii) how can we leverage unlabeled data to identify the set of samples whose labels are most informative in terms of predictive performance?

Within the scope of semi-supervised learning, the thesis introduces an enhanced self-training algorithm that mitigates the prevalent issue of noise accumulation (incorrect pseudo-labels reinforcing suboptimal model predictions) by randomly partitioning the unlabeled data into mini-batches during self-training; a sketch of this mini-batch loop is given below. The experimental results show that enhanced self-training outperforms standard self-training in 85% of the cases considered in this work. Furthermore, to address noise accumulation more systematically, the research proposes a novel semi-supervised boosting algorithm. The algorithm is carefully designed to leverage three core assumptions of semi-supervised learning: (i) the smoothness assumption, which posits that data points situated close together within high-density regions should be assigned the same label; (ii) the cluster assumption, which suggests that data points belonging to the same cluster are likely to share the same class; and (iii) the manifold assumption, which holds that high-dimensional data often lie on an underlying lower-dimensional manifold that captures their intrinsic geometric structure. By incorporating these assumptions, the proposed algorithm improves pseudo-label quality and reduces the risk of error reinforcement, thereby enhancing overall classifier performance. In particular, the proposed algorithm outperforms competing methods in 91% of the comparisons investigated in this work.
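As a concrete illustration of the mini-batch idea, the following is a minimal sketch of self-training that consumes the unlabeled pool in random mini-batches. It assumes a scikit-learn-style classifier exposing predict_proba; the confidence threshold, batch count, and the name minibatch_self_training are illustrative choices, not the thesis's exact procedure.

```python
import numpy as np
from sklearn.base import clone

def minibatch_self_training(clf, X_lab, y_lab, X_unl,
                            n_batches=10, threshold=0.9, seed=0):
    """Self-training that pseudo-labels the unlabeled pool in random mini-batches."""
    rng = np.random.default_rng(seed)
    model = clone(clf).fit(X_lab, y_lab)
    # Random partition of the unlabeled data: a few bad pseudo-labels
    # can no longer dominate any single training round.
    order = rng.permutation(len(X_unl))
    for batch_idx in np.array_split(order, n_batches):
        if len(batch_idx) == 0:
            continue
        X_batch = X_unl[batch_idx]
        proba = model.predict_proba(X_batch)
        confident = proba.max(axis=1) >= threshold   # accept only confident predictions
        if confident.any():
            pseudo_y = model.classes_[proba.argmax(axis=1)]
            X_lab = np.vstack([X_lab, X_batch[confident]])
            y_lab = np.concatenate([y_lab, pseudo_y[confident]])
            model = clone(clf).fit(X_lab, y_lab)     # refresh the model after each batch
    return model
```

Retraining after each small batch limits how many erroneous pseudo-labels can enter the labeled set before the model is refreshed, which is one plausible way the random partitioning curbs noise accumulation.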
In the domain of active learning, the thesis investigates the estimated error reduction (EER) approach, which prioritizes unlabeled data points according to their potential to reduce classifier error. Unlike traditional methods that focus primarily on data diversity, the EER approach directly considers the effect of labeling a specific data point on predictive performance. Despite its theoretical promise, the original EER method suffers from high computational demands, since the classifier must be retrained for every candidate data point and every possible label.

To address this limitation, the research introduces a novel, computationally efficient active learning algorithm. By formulating an innovative objective function, the method enables recursive, closed-form updates that avoid repeated retraining, thereby significantly reducing computational cost; the sketch of the naive EER loop below makes the avoided cost explicit. As a result, the proposed active learning method outperforms competing methods in 81% of the cases considered in this work.

Through extensive empirical evaluations, the thesis demonstrates that effective utilization of unlabeled data can meaningfully improve classification performance and decrease reliance on expensive labeled data. The contributions presented herein advance the understanding and practical implementation of semi-supervised and active learning, particularly under constraints of limited labeled data.
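For context, the sketch below shows the standard (naive) EER selection loop whose cost motivates the closed-form updates; it is not the thesis's efficient algorithm. It assumes a scikit-learn-style probabilistic classifier and uses mean pool uncertainty as a simple proxy for expected error; all names are illustrative.

```python
import numpy as np
from sklearn.base import clone

def naive_eer_select(clf, X_lab, y_lab, X_pool):
    """Return the index of the pool point whose labeling is expected to
    reduce error the most. One refit per (candidate, label) pair, i.e.
    O(|pool| * |classes|) fits per query."""
    base = clone(clf).fit(X_lab, y_lab)
    p_y = base.predict_proba(X_pool)          # current P(y | x); columns follow base.classes_
    best_idx, best_risk = None, np.inf
    for i, x in enumerate(X_pool):
        risk = 0.0
        for k, c in enumerate(base.classes_):
            # Hypothetically label x as class c and retrain -- the step
            # the thesis replaces with recursive closed-form updates.
            X_aug = np.vstack([X_lab, x[None, :]])
            y_aug = np.append(y_lab, c)
            proba = clone(clf).fit(X_aug, y_aug).predict_proba(X_pool)
            err = np.mean(1.0 - proba.max(axis=1))   # mean uncertainty as an error proxy
            risk += p_y[i, k] * err                  # weight by current P(y=c | x)
        if risk < best_risk:
            best_idx, best_risk = i, risk
    return best_idx
```

Because the inner loop refits the classifier once per candidate-label pair on every query, replacing that refit with a recursive, closed-form update removes the dominant cost, which is the efficiency gain the thesis targets.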
dc.identifier.citationMukhamediya, Azamat. (2025). Towards effective usage of unlabeled data in small labeled sample classification. Nazarbayev University School of Engineering and Digital Sciences
dc.identifier.urihttps://nur.nu.edu.kz/handle/123456789/9161
dc.language.isoen
dc.publisherNazarbayev University School of Engineering and Digital Sciences
dc.rightsAttribution-NonCommercial-NoDerivs 3.0 United States
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.subjectSemi-Supervised Learning
dc.subjectActive Learning
dc.subjecttype of access: open access
dc.subjectPQDT_PhD
dc.titleTowards Effective Usage Of Unlabeled Data In Small Labeled Sample Classification
dc.typePhD thesis

Files

Original bundle

Name: Thesis_Towards effective usage of unlabeled data in small labeled sample classification.pdf
Size: 2.72 MB
Format: Adobe Portable Document Format
Description: PhD Thesis
