Comparison of document vectorization methods: a case study with textual data

Jessica Kubrusly; Gabriel Gonzalo Ledesma Valenotti

Authors

Jessica Kubrusly Federal Fluminense University https://orcid.org/0000-0003-0465-4629
Gabriel Gonzalo Ledesma Valenotti Federal Fluminense University https://orcid.org/0009-0009-4687-7952

Keywords:

Text Mining, Doc2Vec, TF-IDF, Classification Methods

Abstract

The explosion of digital information in recent decades has produced an enormous volume of data in text form. Interest in extracting knowledge from this vast quantity of data gave rise to Text Mining. One of the challenges in this area is transforming a collection of texts into a numerical dataset. This process, called document vectorization, is fundamental to automating information extraction. The objective of this work is to compare the performance of four document vectorization methods when used for classification. The vectorization methods compared were BoW, TF-IDF, and the two architectures of doc2vec, CBOW and skip-gram. The classification methods applied were Logistic Regression, Classification Tree, Random Forest, XGBoost, and Perceptron. The data came from the public dataset The Women's E-Commerce Clothing Reviews, which has 10 attributes, 3 of which were used in this work: the review text, the review title, and a categorical variable indicating whether or not the customer recommends the product. A balanced random sample of 8,000 documents, 4,000 with a positive recommendation and 4,000 with a negative one, was drawn and split into training (70%) and test (30%) sets. Performance was compared using the area under the ROC curve (AUC). Across the vectorization methods, the two doc2vec architectures outperformed the others for every classification method tested.
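To make the terms in the abstract concrete, the sketch below shows in plain Python what BoW and TF-IDF vectorization produce and how AUC can be computed from classifier scores. It is an illustration only, not the authors' code: the toy reviews, the function names (`bow_vectors`, `tfidf_vectors`, `auc`), whitespace tokenization, and the specific TF-IDF variant (relative term frequency times log(N/df)) are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def bow_vectors(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Weight counts as (count / doc length) * log(N / df); one common variant."""
    vocab, counts = bow_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    vectors = []
    for row in counts:
        total = sum(row)
        vectors.append([
            (row[j] / total) * math.log(n_docs / df[j]) if total else 0.0
            for j in range(len(vocab))
        ])
    return vocab, vectors


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy reviews invented for illustration (not taken from the dataset).
docs = ["love this dress", "poor fit poor fabric", "love the fabric"]
vocab, bow = bow_vectors(docs)
_, tfidf = tfidf_vectors(docs)

# Hypothetical classifier scores for four test documents (1 = recommended).
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

In this sketch every document becomes a vector with one entry per vocabulary term, which is the numerical dataset a classifier consumes. The doc2vec representations compared in the study are learned dense vectors instead and are not reproduced here.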

Author Biography

Jessica Kubrusly, Federal Fluminense University

Department of Statistics, Institute of Mathematics and Statistics.

Gabriel Gonzalo Ledesma Valenotti, Federal Fluminense University

Institute of Computing

