Adaptations of Extreme Gradient Boosting for Imbalanced Datasets with an Application to Credit Scoring

Gabriel Almeida Ferreira; Adriano Kamimura Suzuki

Authors

Gabriel Almeida Ferreira Universidade de São Paulo https://orcid.org/0009-0001-5930-2770
Adriano Kamimura Suzuki Universidade de São Paulo

Keywords:

Credit Scoring, XGBoost, Machine Learning, Imbalanced Data, Data Balancing

Abstract

Credit scoring can be framed as a binary classification problem in which the goal is to learn a model that classifies customers as good or bad payers. However, the datasets used in credit scoring contain few examples of bad payers, which can lead to a bad payer being misclassified as a good payer and, consequently, to losses for the lender. This work studies two alternatives for dealing with class imbalance: adapting the supervised learning algorithm, using Extreme Gradient Boosting (XGBoost) with the Weighted Focal Loss function; and artificially balancing the data through oversampling and undersampling. The results were analyzed, considerations were made on the use of the proposed methods, and the methods were applied to a real dataset. The resulting models had a lower expected cost, that is, lower losses for the lender, although the approach based on artificial data balancing also showed a worse Brier Score.
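To make the two quantities named in the abstract concrete, the following is a minimal sketch of a weighted focal loss and of the Brier Score in plain Python. It assumes the common alpha-balanced form of the focal loss, which may differ from the exact formulation used in the paper; the function names, the default values of `alpha` and `gamma`, and the toy labels and probabilities are all hypothetical and chosen only for illustration.

```python
import math


def weighted_focal_loss(y_true, p_pred, alpha=0.75, gamma=2.0, eps=1e-12):
    """Mean alpha-balanced focal loss for binary labels.

    Assumption: label 1 marks the minority class (bad payers), which
    receives weight `alpha`; label 0 receives weight `1 - alpha`.
    `gamma` down-weights examples the model already classifies well.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)       # clip to avoid log(0)
        p_t = p if y == 1 else 1.0 - p        # probability given to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


# Toy data (hypothetical): one bad payer among four customers.
y = [1, 0, 0, 0]
p = [0.2, 0.1, 0.3, 0.05]

loss = weighted_focal_loss(y, p)
brier = brier_score(y, p)
print(f"weighted focal loss: {loss:.4f}")
print(f"Brier score: {brier:.6f}")
```

With `gamma=0` and `alpha=0.5` the loss reduces to half of the usual log loss, so the two parameters can be read as departures from standard cross-entropy. Using such a loss as a custom XGBoost objective would additionally require its gradient and Hessian with respect to the raw score, which this sketch does not derive.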

