Document representation for text clustering

Clustering plays a crucial role in organizing and understanding large collections of documents. In this thesis, we conduct a comprehensive investigation of document clustering, examining clustering algorithms, text preprocessing techniques, similarity and distance measures, and evaluation metrics. W...

Bibliographic record details
Main author: Μπούργος, Νικόλαος
Other authors: Bourgos, Nikolaos
Language: English
Published: 2023
Subjects: Document clustering; Bag-of-words; Word embeddings; Συσταδοποίηση κειμένου
Available online: https://hdl.handle.net/10889/25187
id nemertes-10889-25187
record_format dspace
spelling nemertes-10889-25187; 2023-06-27T03:56:10Z; Document representation for text clustering; Αναπαράσταση εγγράφων για συσταδοποίηση κειμένου; Μπούργος, Νικόλαος; Bourgos, Nikolaos; Document clustering; Bag-of-words; Word embeddings; Συσταδοποίηση κειμένου; 2023-06-26T09:50:49Z; 2023-06-26T09:50:49Z; 2023-06-23; https://hdl.handle.net/10889/25187; en; application/pdf
institution UPatras
collection Nemertes
language English
topic Document clustering
Bag-of-words
Word embeddings
Συσταδοποίηση κειμένου
description Clustering plays a crucial role in organizing and understanding large collections of documents. In this thesis, we conduct a comprehensive investigation of document clustering, examining clustering algorithms, text preprocessing techniques, similarity and distance measures, and evaluation metrics. We place particular emphasis on topic modeling and on document representation methods, especially those based on word embeddings, and we review the literature in detail to establish the current state of the art in document clustering. We then conduct an experimental study on the 20newsgroups dataset, testing a range of document representation methods with K-means clustering, including TF-IDF weighted bag-of-words, Doc2Vec, and the average and TF-IDF weighted average of Word2Vec, GloVe, and FastText word embeddings. We use both intrinsic and extrinsic evaluation metrics to assess the clustering performance of each representation method. Latent Dirichlet Allocation is also assessed in the context of document clustering. Our findings reveal the strengths and weaknesses of the different document representation and topic modeling methods and offer insights into their effectiveness for document clustering. Despite some limitations, our study contributes to the understanding of document clustering, providing guidance on selecting and assessing document representation methods and on implementing a complete clustering pipeline.
author2 Bourgos, Nikolaos
author_facet Bourgos, Nikolaos
Μπούργος, Νικόλαος
author Μπούργος, Νικόλαος
author_sort Μπούργος, Νικόλαος
title Document representation for text clustering
publishDate 2023
url https://hdl.handle.net/10889/25187