Comparison of document vectorization methods: a case study with textual data
Keywords:
Text Mining, Doc2Vec, TF-IDF, Classification Methods
Abstract
The explosion of digital information in recent decades has produced a massive volume of text data, and the interest in extracting knowledge from it gave rise to Text Mining. One of the challenges in this field is transforming a text corpus into a numerical dataset. This process, called document vectorization, is crucial for automating information extraction. The goal of this work is to compare the performance of four document vectorization methods when used for classification. The methods compared were Bag of Words (BoW), TF-IDF, and two doc2vec architectures, CBOW and skip-gram. The classifiers applied were Logistic Regression, Decision Tree, Random Forest, XGBoost, and Perceptron. The data came from the publicly available Women's E-Commerce Clothing Reviews dataset, which has 10 attributes, three of which were used in this work: the review text, the review title, and a categorical variable indicating whether the customer recommends the product. A balanced random sample of 8,000 documents was drawn, 4,000 with positive recommendations and 4,000 with negative ones, and split into training (70%) and testing (30%) sets. Performance was measured by the area under the ROC curve (AUC). Both doc2vec architectures outperformed the other vectorization methods across all tested classification methods.
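To make the two count-based representations and the evaluation metric concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the idf variant log(N / df), and the three toy reviews are all assumptions chosen for illustration only.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) using a naive whitespace tokenizer (assumption)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix


def tf_idf(matrix):
    """Weight raw counts by idf = log(N / df); one common variant among several."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for row in matrix
    ]


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy reviews, invented for illustration (not taken from the dataset).
docs = ["love this dress", "this dress runs small", "love love the fabric"]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

Each row of `counts` or `weights` is one document vector that could be passed to a classifier. Terms that occur in every document receive a TF-IDF weight of zero under this variant, which is the down-weighting of uninformative words that distinguishes TF-IDF from BoW.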
Versions
- 11-04-2024 (2)
- 15-03-2024 (1)
License
Policy Proposal for Open Access Journals
Authors who publish in this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution License, which allows the work to be shared with acknowledgment of its authorship and initial publication in this journal.
- Authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the work published in this journal (e.g., posting it to an institutional repository or publishing it as a book chapter), with acknowledgment of its authorship and initial publication in this journal.
- Authors are permitted and encouraged to publish and distribute their work online (e.g., in institutional repositories or on their personal page) at any point before or during the editorial process, as this can lead to productive exchanges and increase the impact and citation of the published work (see The Effect of Open Access).