Comparison of document vectorization methods: a case study with textual data

Jessica Kubrusly; Gabriel Gonzalo Ledesma Valenotti

Authors

Jessica Kubrusly Universidade Federal Fluminense https://orcid.org/0000-0003-0465-4629
Gabriel Gonzalo Ledesma Valenotti Universidade Federal Fluminense https://orcid.org/0009-0009-4687-7952

Keywords:

Text Mining, Doc2Vec, TF-IDF, Classification Methods

Abstract

The explosion of digital information in recent decades has brought a massive volume of text data. The interest in extracting knowledge from this vast amount of data gave rise to Text Mining. One of the challenges in this field is to transform a text corpus into a numerical database. This process, called document vectorization, is crucial for automating information extraction. The goal of this work is to compare the performance of four document vectorization methods when used for classification purposes. The compared vectorization methods were Bag of Words (BoW), TF-IDF, and two different architectures of doc2vec, CBOW and skip-gram. The classification methods applied were Logistic Regression, Decision Tree, Random Forest, XGBoost, and Perceptron. The dataset used was the publicly available Women's E-Commerce Clothing Reviews dataset, which consists of 10 attributes, with three of them considered in this work: the item review text, the review title, and a categorical variable indicating whether the customer recommends the product or not. A balanced random sample of 8,000 documents was selected, with 4,000 documents having positive recommendations and 4,000 with negative recommendations. This dataset was split into training (70%) and testing (30%) sets. The performance comparison metric was the area under the ROC curve (AUC). When comparing the document vectorization methods, both architectures of doc2vec outperformed the other vectorization methods across all tested classification methods.
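
To make the two count-based vectorizers and the evaluation metric concrete, here is a minimal sketch in Python using only the standard library. It is not the authors' implementation. The toy documents, the function names, and the specific TF-IDF weighting (term frequency divided by document length, times log(N / df)) are assumptions chosen for illustration. doc2vec is not sketched because it requires training a neural model.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def tf_idf(docs):
    """TF-IDF with tf = count / document length and idf = log(N / df).

    Assumes every document is non-empty (hypothetical simplification).
    """
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weights = [
        [(c / len(doc)) * w for c, w in zip(row, idf)]
        for row, doc in zip(counts, docs)
    ]
    return vocab, weights


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy corpus of three tokenized reviews.
docs = [
    ["love", "this", "dress"],
    ["poor", "fit", "this", "dress"],
    ["love", "the", "fit"],
]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
```

In this sketch each document becomes one row with one column per vocabulary term. BoW stores raw counts, while TF-IDF down-weights terms that occur in many documents. The `auc` helper scores a classifier by how often it ranks a recommended review above a non-recommended one.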

Author Biographies

Jessica Kubrusly, Universidade Federal Fluminense

Statistics Department, Institute of Mathematics and Statistics

Gabriel Gonzalo Ledesma Valenotti, Universidade Federal Fluminense

Institute of Computing
