Μέθοδοι εξόρυξης γνώσης από συλλογές εγγράφων

Title

Knowledge extraction methods from document collections

Creator

Kalogeratos, Argyris

Καλογεράτος, Αργύρης

Type

PhD Thesis

Thesis
PhD thesis (EN)

Date

2013

Year

2013 (EN)

Description

This thesis studies the problem of document clustering. Given a document collection, at first, preprocessing, and feature extraction take place. As a result, each document is usually represented using a vector space model where the non-negative dimension weights describe the significance of the respective term features. The properties of such a feature space are: i) the high dimensionality that is of the order of thousands of features, and ii) sparsity which reaches 99%. In this dissertation, methods are studied and developed for document representation and knowledge extraction regarding the cluster structure of a dataset. At first, a vector space model is presented which, without supervision, revisits the traditional assumption about the term independence. A Global Term Context Vector is computed for each term feature of the collection, which embeds the context in which a term appears in the documents (term co-occurrences). Next, a semantic matrix is constructed based on which the document vectors are mapped in a denser feature space of the same dimensionality. The effectiveness of the proposed representation model is experimentally studied in the context of document clustering. The second contribution of this work is the k-synthetic prototypes clustering method that is based on the spherical k-means. Its novelty lays at the introduction of the synthetic prototypes as cluster representatives. The proposed incremental approach for the computation of a synthetic prototype uses the K nearest neighbors of the cluster medoid. The interesting property of this approach is that it favors the representation of the documents of the dominant class in the cluster. In this way the clustering algorithm manages to overcome problems caused by bad initializations. In the experimental study, this method is compared to a series of widely-used document clustering techniques. In the chapter that follows, incremental clustering algorithms are studied, which add the k+1 data based on a solution containing k clusters. A general framework for incremental clustering is presented which applies partial update of the solution when introducing the k+1 cluster. This framework covers known incremental algorithms, such as bisecting k-means, global k-means, and various extensions of them. Next, global k-synthetic prototypes algorithm is proposed which is experimentally compared to existing incremental approaches achieving better clustering results on document collections. The next chapter concerns the problem of estimating the number of clusters in a dataset. Dip-dist criterion is proposed which considers each object of a cluster as a 'viewer ' and applies a univariate statistic hypothesis test, the dip-test, to examine unimodality in the distribution of the distances between the viewer and the rest of the objects in cluster. This criterion is incorporated by the incremental dip-means method. The only assumption of this method is the unimodality of all clusters. Important advantages are: i) the unimodality test is applied on univariate distance vectors, and ii) it can be directly applied with kernel-based methods, since only the pairwise distances are involved in the computations. Experimental results on artificial and real datasets indicate the effectiveness of our method and its superiority over analogous approaches.

Η παρούσα διατριβή ασχολείται με το πρόβλημα της ομαδοποίησης εγγράφων (document clustering). Δοθείσης μίας συλλογής εγγράφων φυσικής γλώσσας (corpus), καταρχήν εφαρμόζεται προεπεξεργασία και εξαγωγή χαρακτηριστικών όρων (terms). Ως αποτέλεσμα, κάθε έγγραφο συνήθως αναπαρίσταται με ένα διανυσματικό μοντέλο (vector space model) όπου το μη αρνητικό βάρος κάθε διάστασης περιγράφει τη σημαντικότητα του αντίστοιχου χαρακτηριστικού όρου. Οι ιδιότητες αυτού του χώρου αναπαράστασης είναι: α) η πολύ υψηλή διάσταση της τάξης των χιλιάδων χαρακτηριστικών, και β) η αραιότητα που αγγίζει το 99% (high dimensionality and sparsity). Στη διατριβή μελετώνται και αναπτύσσονται μέθοδοι αναπαράστασης και εξαγωγής πληροφορίας σχετικά με τη δομή ομάδων στη συλλογή εγγράφων (cluster structure). Αρχικά προτείνεται ένα μοντέλο διανυσματικής αναπαράστασης εγγράφων, το οποίο, δίχως επίβλεψη, επανεξετάζει την παραδοσιακή υπόθεση ανεξαρτησίας των όρων (term independence). Για κάθε όρο του λεξικού εξάγεται το αντίστοιχο γενικευμένο διάνυσμα συμφραζομένων óρωv (global term context vector) το οποίο ενσωματώνει τη συμφραζόμενη πληροφορία γύρω από τις εμφανίσεις του όρου στα έγγραφα (συν-εμφανίσεις όρων). Στη συνέχεια, κατασκευάζεται ένας σημασιολογικός πίνακας βάσει του οποίου προβάλλονται τα διανύσματα δεδομένων σε έναν πυκνότερο χώρο ίδιας διάστασης. Στο στάδιο αυτό μελετήθηκε η συμβολή του προτεινόμενου μοντέλου αναπαράστασης στην ομαδοποίηση εγγράφων. Ύστερα παρουσιάζεται η μέθοδος ομαδοποίησης εγγράφων k-συνθετικών πρωτοτύπων (k-synthetic prototypes). Η μέθοδος βασίζεται στον σφαιρικό k-μέσων (spherical k-means) με την πρωτοτυπία ότι εισάγει τους συνθετικούς αντιπροσώπους για τις ομάδες. Η προτεινόμενη αυξητική προσέγγιση για τον υπολογισμό ενός συνθετικού αντιπροσώπου χρησιμοποιεί τα Κ αντικείμενα που βρίσκονται εγγύτερα στο ενδιάμεσο αντικείμενο μίας ομάδας (medoid). Η ενδιαφέρουσα ιδιότητα αυτής της προσέγγισης είναι ότι ευνοεί την αναπαράσταση της κυρίαρχης κλάσης δεδομένων σε μία ομάδα επιτρέποντας με αυτό τον τρόπο την αποφυγή λύσεων τοπικών ελαχίστων λόγω κακής αρχικοποίησης. Στην πειραματική μελέτη συγκρίνεται η μέθοδος αυτή με μία σειρά από ευρέως χρησιμοποιούμενους αλγόριθμους ομαδοποίησης. Στη συνέχεια, μελετώνται αλγόριθμοι αυξητικής ομαδοποίησης (incremental clustering), οι οποίοι εισάγουν την k+1 ομάδα δεδομένων βασιζόμενοι στη λύση k ομάδων. Αρχικά παρουσιάζεται ένα γενικό πλαίσιο (clustering framework) που εφαρμόζει τη μερική ενημέρωση της k+1 λύσης κατά την εισαγωγή μίας νέας ομάδας (partial update). Το πλαίσιο αυτό καλύπτει γνωστούς αυξητικούς αλγορίθμους, όπως ο διαμεριστικός k-μέσων (bisecting k-means), ο γενικευμένος k-μέσων (global k-means), και διάφορες παραλλαγές τους. Προτείνεται, δε, ο γενικευμένος αλγόριθμος k-συνθετικών πρωτοτύπων (global k-synthetic prototypes) ο οποίος συγκρίνεται πειραματικά με υπάρχουσες αυξητικές προσεγγίσεις επιδεικνύοντας καλύτερα αποτελέσματα ομαδοποίησης σε συλλογές εγγράφων. Η τελευταία ενότητα της διατριβής αφορά το πρόβλημα εκτίμησης του αριθμού των ομάδων σε ένα σύνολο δεδομένων. Για την προσέγγιση του προβλήματος προτείνεται το κριτήριο dip-dist το οποίο θεωρεί κάθε αντικείμενο της υπό εξέταση ομάδας ως 'παρατηρητή’ και εφαρμόζει ένα στατιστικό τεστ μονοτροπικότητας (unimodality dip-test) στην κατανομή των αποστάσεων μεταξύ του παρατηρητή και των υπολοίπων αντικειμένων της ομάδας. Στη συνέχεια, περιγράφεται η αυξητική μέθοδος από dip-μέσων (dip-means) της οποίας η μοναδική υπόθεση είναι η μονοτροπικότητα κάθε ομάδας. Τα πλεονεκτήματα της προτεινόμενης προσέγγισης είναι ότι το στατιστικό τεστ εφαρμόζεται σε 1Δ κατανομές, ενώ θα μπορούσε να χρησιμοποιηθεί και σε μεθόδους που βασίζονται στον πίνακα ομοιότητας (kernel-based methods), όπου δεν απαιτούνται τα πραγματικά διανύσματα δεδομένων.

Scientific field

Επιστήμη Ηλεκτρονικών Υπολογιστών και Πληροφορική

Φυσικές Επιστήμες

Engineering and Technology ▶ Electrical engineering, Electronic engineering, Information engineering
Communication engineering and systems, Telecommunications (EN)

Natural Sciences
Computer and Information Sciences (EN)

Subject

Πλαίσιο αυξητικής ομαδοποίησης για δεδομένα υψηλής διάστασης και αραιότητας

Framework for incremental data clustering for high dimensional data

Estimating the number of clusters based on a unimodality criterion

Σημασιολογική εξομάλυνση χρήσει συμφραζόμενης πληροφορίας όρων

Computer and Information Sciences

Φυσικές Επιστήμες

Ομαδοποίηση κειμένων με συνθετικούς αντιπροσώπους κειμένων

Εκτίμηση του αριθμού των ομάδων βάσει του κριτηρίου της μονοτροπικότητας

Semantic smoothing using term context vectors

Document clustering using synthetic cluster prototypes

Clustering high dimensional and sparse data

Επιστήμη Ηλεκτρονικών Υπολογιστών και Πληροφορική

Natural Sciences

Αναπαράσταση κειμένων με διανυσματικά μοντέλα

Vector space models for document representation

Ομαδοποίηση σε δεδομένα υψηλής διάστασης και αραιότητας

Language

English

Publisher

Πανεπιστήμιο Ιωαννίνων

University of Ioannina

School / Department / Institute

Πανεπιστήμιο Ιωαννίνων. Σχολή Θετικών Επιστημών. Τμήμα Πληροφορικής

University of Ioannina ▶ School of Informatics and Telecommunications
Department of Informatics and Telecommunications

Provider

National Documentation Centre (EKT)

Repository / collection

National Archive of PhD Theses

Subcollections

Συλλογή ΕΑΔΔ

Μέθοδοι εξόρυξης γνώσης από συλλογές εγγράφων

Βοηθείστε μας να κάνουμε καλύτερο το OpenArchives.gr.