Towards Effective Usage Of Unlabeled Data In Small Labeled Sample Classification
Publisher
Nazarbayev University School of Engineering and Digital Sciences
Abstract
In numerous real-world applications, the scarcity of high-quality labeled data constitutes a significant impediment to the development of supervised machine learning models. This challenge primarily arises from the fact that manual annotation processes are often resource-intensive, requiring considerable time, expert knowledge, specialized equipment, or elaborate experimental procedures. Consequently, collecting a sufficiently large labeled dataset is frequently impractical. Conversely, unlabeled data are typically abundant and readily accessible.
This thesis addresses the problem of small labeled sample classification, wherein only a limited annotation budget is available. It demonstrates that deliberate exploitation of unlabeled data can: (i) substantially enhance the predictive performance of classifiers trained on small labeled datasets, and (ii) reduce the cost associated with data labeling. To this end, the research investigates two principal directions based on the accessibility of a labeling expert: semi-supervised learning (without direct expert access) and active learning (with expert access).
These directions raise two research questions: (i) how can unlabeled data be used to improve the performance of a classifier trained on a small labeled dataset? and (ii) how can unlabeled data be leveraged to identify the set of samples whose labels would be most informative for predictive performance?
Within the scope of semi-supervised learning, the thesis introduces an enhanced self-training algorithm that mitigates the prevalent issue of noise accumulation, wherein incorrect pseudo-labels reinforce suboptimal model predictions, by randomly partitioning the unlabeled data into mini-batches during self-training. The experimental results show that enhanced self-training outperforms standard self-training in 85% of the cases considered in this work. Furthermore, to address noise accumulation more systematically, the research proposes a novel semi-supervised boosting algorithm. This algorithm is carefully designed to leverage three core assumptions of semi-supervised learning: (i) the smoothness assumption, which posits that data points situated closely within high-density regions should be assigned the same label; (ii) the cluster assumption, which suggests that data points belonging to the same cluster are likely to share the same class; and (iii) the manifold assumption, which holds that high-dimensional data often lie on an underlying lower-dimensional manifold that captures their intrinsic geometric structure. By incorporating these assumptions, the proposed algorithm improves pseudo-label quality and reduces the risk of error reinforcement, thereby enhancing overall classifier performance. In particular, the proposed algorithm outperforms competing methods in 91% of the comparisons investigated in this work.
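The mini-batch idea behind the enhanced self-training algorithm can be illustrated with a minimal Python sketch. This is not the thesis's exact algorithm: the base classifier (scikit-learn's LogisticRegression), the batch size, and the confidence threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def minibatch_self_training(X_lab, y_lab, X_unlab,
                            batch_size=100, confidence=0.9, seed=0):
    """Self-training that consumes unlabeled data in random mini-batches
    rather than pseudo-labeling the whole pool at once, which limits how
    far early pseudo-label mistakes can propagate (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000)
    X_train, y_train = X_lab.copy(), y_lab.copy()
    order = rng.permutation(len(X_unlab))        # random partition of the pool
    for start in range(0, len(order), batch_size):
        batch = X_unlab[order[start:start + batch_size]]
        model.fit(X_train, y_train)              # refit on labeled + pseudo-labeled data
        proba = model.predict_proba(batch)
        keep = proba.max(axis=1) >= confidence   # accept only confident pseudo-labels
        if keep.any():
            X_train = np.vstack([X_train, batch[keep]])
            y_train = np.concatenate(
                [y_train, model.classes_[proba[keep].argmax(axis=1)]])
    return model.fit(X_train, y_train)
```

Because each mini-batch is pseudo-labeled by a model that has seen only the preceding batches, a mislabeled batch can contaminate at most the remaining iterations rather than the entire pool in a single pass.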
In the domain of active learning, the thesis investigates the estimated error reduction (EER) approach, which prioritizes the selection of unlabeled data points based on their potential to reduce classifier error. Unlike traditional methods that primarily focus on data diversity, the EER approach directly considers the effect of labeling specific data points on predictive performance. Despite its theoretical promise, the original EER method suffers from high computational demands because the classifier must be retrained for every candidate data point and each possible label. To address this limitation, the research introduces a novel, computationally efficient active learning algorithm. By formulating an innovative objective function, the method enables recursive, closed-form updates that avoid repeated retraining, thereby significantly reducing computational cost. As a result, the proposed active learning method outperforms other methods in 81% of the cases considered in this work.
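To make the computational burden concrete, the following naive EER sketch in Python retrains the classifier once per candidate-label pair; this is the expensive baseline that the thesis's recursive, closed-form updates are designed to avoid. The closed-form method itself is not reproduced here, and the candidate subsampling and expected-error proxy are hypothetical simplifications.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def eer_select(X_lab, y_lab, X_pool, n_candidates=50, seed=0):
    """Naive estimated error reduction: pick the pool point whose labeling
    is expected to leave the least uncertainty on the remaining pool."""
    rng = np.random.default_rng(seed)
    base = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    pool_proba = base.predict_proba(X_pool)      # current label beliefs
    candidates = rng.choice(len(X_pool),
                            size=min(n_candidates, len(X_pool)), replace=False)
    best_idx, best_risk = None, np.inf
    for i in candidates:
        risk = 0.0
        for k, label in enumerate(base.classes_):
            # Hypothetically add (x_i, label) and retrain: the expensive step.
            model = LogisticRegression(max_iter=1000).fit(
                np.vstack([X_lab, X_pool[i:i + 1]]),
                np.append(y_lab, label))
            rest = np.delete(X_pool, i, axis=0)
            expected_err = (1.0 - model.predict_proba(rest).max(axis=1)).mean()
            risk += pool_proba[i, k] * expected_err   # weight by current belief
        if risk < best_risk:
            best_idx, best_risk = i, risk
    return best_idx  # index of the next point to send to the expert
```

Each query thus costs one full retraining per candidate-label pair; a closed-form update that reuses the current model's solution removes this inner retraining loop.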
Through extensive empirical evaluations, the thesis demonstrates that effective utilization of unlabeled data can meaningfully improve classification performance and decrease the reliance on expensive labeled data. The contributions presented herein advance the understanding and practical implementation of semi-supervised and active learning, particularly under constraints of limited labeled data.
Citation
Mukhamediya, Azamat. (2025). Towards effective usage of unlabeled data in small labeled sample classification. Nazarbayev University School of Engineering and Digital Sciences.
Creative Commons license
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States
