Abstract:
Support Vector Machine (SVM) is an important supervised machine learning algorithm
used for regression and classification tasks. The core idea behind this algorithm
is to create a hyperplane between data points of two classes. The algorithm is very
intuitive and it works well when the points of the two classes are (almost) linearly
separable and the training set is not large. However, if the points in the dataset are
not linearly separable, in order to use SVM, one needs to transfer the data points to
a higher dimension, which is a costly operation. Or, alternatively, another algorithm
should be used (which probably will be a more complicated one). Therefore, it is
important to know in advance whether the data is linearly separable because further
steps of solving the given regression or classification task depend on that. One part of
this thesis focuses on investigating linear separability of data in 2 and 3 dimensions.
We propose an efficient algorithm for testing whether a dataset is linearly separable
in the context of property testing. For a given parameter ✏ 2 (0, 1), the sample complexity
and the running time complexity of our algorithm are both O(1/✏), which is
sublinear in the number of samples.
When the data points have a constant number of dimensions, the running time of
the standard SVM algorithm can reach ⌦(n2) (where n is the size of the training set)
and this makes the algorithm impractical for large values of n. In the second part
of the thesis, we propose a more efficient algorithm for dataset training using SVM.
Our algorithm does not use the entire dataset to determine the optimal hyperplane,
but only a specially constructed subset of the data that guarantees an approximate
solution. This allows us to design a subquadratic-time algorithm. More formally, our
algorithm approximates the optimal hyperplane with an (e−1)/e multiplicative error
in time O(nt · min(t, k)) + o(n2), where t is the total number of training samples in
our subset and k is a hyperparameter that controls the number of nearest neighbors
when we construct our subset.