Document representation for text clustering
Clustering plays a crucial role in organizing and understanding large collections of documents. In this thesis, we conduct a comprehensive investigation of document clustering, examining clustering algorithms, text preprocessing techniques, similarity and distance measures, and evaluation metrics. W...
Main author: | |
---|---|
Other authors: | |
Language: | English |
Published: | 2023 |
Subjects: | |
Available Online: | https://hdl.handle.net/10889/25187 |
id |
nemertes-10889-25187 |
---|---|
record_format |
dspace |
spelling |
nemertes-10889-25187 2023-06-27T03:56:10Z Document representation for text clustering Αναπαράσταση εγγράφων για συσταδοποίηση κειμένου Μπούργος, Νικόλαος Bourgos, Nikolaos Document clustering Bag-of-words Word embeddings Συσταδοποίηση κειμένου 2023-06-26T09:50:49Z 2023-06-26T09:50:49Z 2023-06-23 https://hdl.handle.net/10889/25187 en application/pdf |
institution |
UPatras |
collection |
Nemertes |
language |
English |
topic |
Document clustering Bag-of-words Word embeddings Συσταδοποίηση κειμένου |
description |
Clustering plays a crucial role in organizing and understanding large collections of documents. In this thesis, we conduct a comprehensive investigation of document clustering, examining clustering algorithms, text preprocessing techniques, similarity and distance measures, and evaluation metrics. We place significant emphasis on topic modeling and document representation methods, particularly those based on word embeddings, and conduct a detailed literature review to gain insight into the current state of the art in document clustering.
We conducted an experimental study on the 20newsgroups dataset, testing a range of document representation methods with K-means clustering: TF-IDF weighted bag-of-words, Doc2Vec, and both the plain average and the TF-IDF weighted average of Word2Vec, GloVe and FastText word embeddings. We used both intrinsic and extrinsic evaluation metrics to assess the clustering performance of each representation method. We also assessed Latent Dirichlet Allocation in the context of document clustering.
Our findings reveal the strengths and weaknesses of different document representation and topic modeling methods and offer insights into their effectiveness for document clustering. Despite some limitations, our study contributes to the understanding of document clustering, providing guidance on selecting and assessing document representation methods and implementing a complete clustering pipeline. |
author2 |
Bourgos, Nikolaos |
author_facet |
Bourgos, Nikolaos Μπούργος, Νικόλαος |
author |
Μπούργος, Νικόλαος |
author_sort |
Μπούργος, Νικόλαος |
title |
Document representation for text clustering |
publishDate |
2023 |
url |
https://hdl.handle.net/10889/25187 |
work_keys_str_mv |
AT mpourgosnikolaos documentrepresentationfortextclustering AT mpourgosnikolaos anaparastasēengraphōngiasystadopoiēsēkeimenou |
_version_ |
1771297235100762112 |