Master's thesis--University of Macedonia, Thessaloniki, 2022.
The k-Nearest Neighbor (k-NN) classifier is one of the most widely used supervised learning algorithms. Its popularity is mainly due to its simplicity, effectiveness, ease of implementation, and the ability to add new data to the training set at any time. However, one of the major drawbacks of this algorithm is that its performance depends heavily on the value of the parameter k, i.e. the number of nearest neighbors the algorithm examines in order to classify an unlabeled instance. In most cases, k is a fixed value, independent of the data distribution. The most frequently used technique for determining the "best" k is cross-validation, as there is no general rule for choosing k: the appropriate value depends on the training dataset. A large k value results in a noise-tolerant classifier, as its search area is large. On the other hand, a small k value results in a noise-sensitive classifier, as the search area is limited. Consequently, selecting a single fixed k value for the whole dataset ignores its particular characteristics, such as data distribution, class separation, imbalanced classes, sparse and dense neighborhoods, and noisy subspaces.
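To make this standard fixed-k practice concrete, the sketch below selects a single k by cross-validation with scikit-learn. The dataset and the candidate grid of k values are illustrative assumptions, not choices made in this thesis.

```python
# Illustrative sketch: picking one fixed k by cross-validation,
# the standard practice discussed above. Dataset and grid are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Odd values of k avoid voting ties in two-class problems.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 30, 2))},
                      cv=10)
search.fit(X, y)

# The selected k is then used everywhere, regardless of local data density.
print("best k:", search.best_params_["n_neighbors"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```

Whatever k wins here is applied uniformly to every test point, which is exactly the limitation the dynamic approaches reviewed below try to remove.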
In recent years, a lot of research has been conducted to tackle the above-mentioned disadvantage. This research has led to various variants of the k-NN classifier, which mainly combine it with other techniques for determining the value of k. In the present study, a thorough literature review is conducted in order to summarize the achievements made to date in this field. This procedure yielded a pool of twenty-eight (28) publications, covering the period from 1986 to 2020 (median publication year: 2009). These studies are presented in this work, with a description of the techniques they use for dynamic k determination. For each study, several indicators are recorded, namely the technique used for k selection, the level at which k is selected, the number of datasets used in the experiments, whether statistical tests were conducted, the total number of citations the study has received, and its average citations per year.
Apart from the above, a new alternative version of the k-NN algorithm is proposed. The proposed algorithm is an extension of previous work found in the literature. The approach is a k-free k-NN variation, in the sense that the user does not select the parameter; instead, it is selected dynamically, depending on the region in which each unlabeled data point lies. The algorithm belongs to the group of studies that exploit prototype and clustering techniques in order to represent the initial dataset. Through a recursive process, homogeneous clusters are created, each of which is represented by a single representative.

Moreover, a new term is introduced, the Sphere of Influence (SoI), an index that indicates the size of each created cluster. This index, in combination with the depth indicator (d), provides useful information about the subspace in which each representative lies. Finally, heuristics are proposed that exploit the information provided by SoI and d and convert it into a k value unique to every unlabeled instance.
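Since the exact splitting rule and the SoI/depth-to-k mappings are defined in the thesis itself, the following is only a minimal sketch of the general idea under stated assumptions: clusters are bisected with k-means until class-homogeneous, the SoI of a cluster is taken to be its radius around the centroid, and the k heuristic uses the depth alone (the actual heuristics combine SoI and d).

```python
# Minimal sketch (assumptions throughout) of the dynamic-k idea:
# recursively split until homogeneous, keep one representative per
# cluster with its depth and an assumed radius-like SoI, then map the
# nearest representative's indicators to a per-point k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier


def build_representatives(X, y, depth=0, reps=None):
    """Recursively bisect X until each cluster is class-homogeneous."""
    if reps is None:
        reps = []
    split = None
    if len(np.unique(y)) > 1 and len(X) >= 4:
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        if 0 < labels.sum() < len(X):   # keep only proper two-way splits
            split = labels
    if split is None:                   # homogeneous, tiny, or degenerate
        centroid = X.mean(axis=0)
        soi = float(np.linalg.norm(X - centroid, axis=1).max())  # assumed SoI: cluster radius
        reps.append((centroid, np.bincount(y).argmax(), depth, soi))  # integer labels assumed
        return reps
    for c in (0, 1):
        build_representatives(X[split == c], y[split == c], depth + 1, reps)
    return reps


def classify(x, X_train, y_train, reps, k_max=15):
    """Classify one point with a k derived from its nearest representative."""
    centers = np.array([r[0] for r in reps])
    _, _, depth, soi = reps[int(np.argmin(np.linalg.norm(centers - x, axis=1)))]
    # Assumed stand-in heuristic: deeper clusters sit in harder-to-separate
    # regions, so k shrinks with depth; soi is available to richer rules.
    k = max(1, k_max // (depth + 1))
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    return knn.predict(x.reshape(1, -1))[0]
```

The representative list plays the role of the prototype representation of the training set: one pass builds it, and each query then pays only a nearest-representative lookup before the final k-NN vote.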
Extensive experiments were conducted on thirty (30) datasets for all proposed heuristics. Some of these datasets contained artificial noise, in order to test the proposed algorithm under conditions closer to real-life situations. The experiments showed that, in terms of accuracy, the proposed algorithm is very competitive with several commonly used k-NN variations. Moreover, the Wilcoxon statistical test was used to detect statistically significant differences.
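As an illustration of that comparison step, the sketch below runs a paired Wilcoxon test over per-dataset accuracies with SciPy; the accuracy vectors are placeholders, not the thesis's results.

```python
# Paired Wilcoxon test over per-dataset accuracies of two classifiers.
# The numbers are hypothetical placeholders, not experimental results.
from scipy.stats import wilcoxon

acc_proposed = [0.91, 0.84, 0.88, 0.79, 0.93, 0.86]  # hypothetical
acc_baseline = [0.89, 0.85, 0.84, 0.78, 0.90, 0.86]  # hypothetical

stat, p_value = wilcoxon(acc_proposed, acc_baseline)
print(f"Wilcoxon statistic = {stat:.2f}, p = {p_value:.3f}")
# A p-value below the chosen significance level (e.g. 0.05) indicates a
# statistically significant difference between the two classifiers.
```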