Master's thesis--University of Macedonia, Thessaloniki, 2022.
The k-Nearest Neighbor (k-NN) classifier is one of the most widely used supervised learning algorithms. Its popularity is mainly due to its simplicity, effectiveness, ease of implementation, and the ability to add new data to the training set at any time. However, one of the major drawbacks of this algorithm is that its performance depends heavily on the value of the parameter k, i.e. the number of nearest neighbors the algorithm examines in order to classify an unlabeled instance. In most cases, k is a fixed value, independent of the data distribution. The most frequently used technique for determining the "best" k is cross-validation, as there is no general rule for choosing k: the appropriate value depends on the training dataset. A large k value results in a noise-tolerant classifier, as its search area is large. On the other hand, a small k value results in a noise-sensitive classifier, as the search area is limited. Consequently, selecting a single fixed k value for the whole dataset ignores its particular characteristics, such as data distribution, class separation, imbalanced classes, sparse and dense neighborhoods, and noisy subspaces.
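To make this standard fixed-k practice concrete, the sketch below selects a single k by cross-validation with scikit-learn. The dataset and the candidate grid of k values are illustrative assumptions, not choices made in this thesis.

```python
# Illustrative sketch: picking one fixed k by cross-validation,
# the standard practice discussed above. Dataset and grid are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Odd values of k avoid voting ties in two-class problems.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 30, 2))},
                      cv=10)
search.fit(X, y)

# The selected k is then used everywhere, regardless of local data density.
print("best k:", search.best_params_["n_neighbors"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```

Whatever k wins here is applied uniformly to every test point, which is exactly the limitation the dynamic approaches reviewed below try to remove.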
In recent years, a lot of research has been conducted to tackle the above-mentioned disadvantage. This research has led to various variants of the k-NN classifier, which mainly combine it with other techniques for determining the value of k. In the present study, a thorough literature review is conducted in order to summarize the achievements made to date in this field. This procedure yielded a pool of twenty-eight (28) publications, covering the period from 1986 to 2020 (median publication year: 2009). These studies are presented in this work, with a description of the techniques they use for dynamic k determination. For each study, several indicators are recorded, namely the technique used for k selection, the level at which k is selected, the number of datasets used in the experiments, whether statistical tests were conducted, the total number of citations the study has received, and its average citations per year.
Apart from the above, a new alternative version of the k-NN algorithm is proposed. The proposed algorithm is an extension of previous work found in the literature. The approach is a k-free k-NN variation, in the sense that the user does not select the parameter; instead, it is selected dynamically, depending on the region in which each unlabeled data point lies. The algorithm belongs to the group of studies that exploit prototype and clustering techniques in order to represent the initial dataset. Through a recursive process, homogeneous clusters are created, each of which is represented by a single representative.

Moreover, a new term is introduced, the Sphere of Influence (SoI), an index that indicates the size of each created cluster. This index, in combination with the depth indicator (d), provides useful information about the subspace in which each representative lies. Finally, heuristics are proposed that exploit the information provided by SoI and d and convert it into a k value unique to every unlabeled instance.
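Since the exact splitting rule and the SoI/depth-to-k mappings are defined in the thesis itself, the following is only a minimal sketch of the general idea under stated assumptions: clusters are bisected with k-means until class-homogeneous, the SoI of a cluster is taken to be its radius around the centroid, and the k heuristic uses the depth alone (the actual heuristics combine SoI and d).

```python
# Minimal sketch (assumptions throughout) of the dynamic-k idea:
# recursively split until homogeneous, keep one representative per
# cluster with its depth and an assumed radius-like SoI, then map the
# nearest representative's indicators to a per-point k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier


def build_representatives(X, y, depth=0, reps=None):
    """Recursively bisect X until each cluster is class-homogeneous."""
    if reps is None:
        reps = []
    split = None
    if len(np.unique(y)) > 1 and len(X) >= 4:
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        if 0 < labels.sum() < len(X):   # keep only proper two-way splits
            split = labels
    if split is None:                   # homogeneous, tiny, or degenerate
        centroid = X.mean(axis=0)
        soi = float(np.linalg.norm(X - centroid, axis=1).max())  # assumed SoI: cluster radius
        reps.append((centroid, np.bincount(y).argmax(), depth, soi))  # integer labels assumed
        return reps
    for c in (0, 1):
        build_representatives(X[split == c], y[split == c], depth + 1, reps)
    return reps


def classify(x, X_train, y_train, reps, k_max=15):
    """Classify one point with a k derived from its nearest representative."""
    centers = np.array([r[0] for r in reps])
    _, _, depth, soi = reps[int(np.argmin(np.linalg.norm(centers - x, axis=1)))]
    # Assumed stand-in heuristic: deeper clusters sit in harder-to-separate
    # regions, so k shrinks with depth; soi is available to richer rules.
    k = max(1, k_max // (depth + 1))
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    return knn.predict(x.reshape(1, -1))[0]
```

The representative list plays the role of the prototype representation of the training set: one pass builds it, and each query then pays only a nearest-representative lookup before the final k-NN vote.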
Extensive experiments were conducted on thirty (30) datasets for all proposed heuristics. Some of these datasets contained artificial noise, in order to test the proposed algorithm under conditions closer to real-life situations. The experiments showed that, in terms of accuracy, the proposed algorithm is very competitive with several commonly used k-NN variations. Moreover, the Wilcoxon statistical test was used to detect statistically significant differences.
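As an illustration of that comparison step, the sketch below runs a paired Wilcoxon test over per-dataset accuracies with SciPy; the accuracy vectors are placeholders, not the thesis's results.

```python
# Paired Wilcoxon test over per-dataset accuracies of two classifiers.
# The numbers are hypothetical placeholders, not experimental results.
from scipy.stats import wilcoxon

acc_proposed = [0.91, 0.84, 0.88, 0.79, 0.93, 0.86]  # hypothetical
acc_baseline = [0.89, 0.85, 0.84, 0.78, 0.90, 0.86]  # hypothetical

stat, p_value = wilcoxon(acc_proposed, acc_baseline)
print(f"Wilcoxon statistic = {stat:.2f}, p = {p_value:.3f}")
# A p-value below the chosen significance level (e.g. 0.05) indicates a
# statistically significant difference between the two classifiers.
```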