Adaptations of Extreme Gradient Boosting for Imbalanced Datasets with Application in Credit Scoring

Gabriel Almeida Ferreira; Adriano Kamimura Suzuki

Authors

Gabriel Almeida Ferreira Universidade de São Paulo https://orcid.org/0009-0001-5930-2770
Adriano Kamimura Suzuki Universidade de São Paulo

Keywords:

Credit Scoring, XGBoost, Machine Learning, Unbalanced Data, Data Augmentation

Abstract

Credit scoring can be framed as a binary classification problem whose goal is a model that classifies customers as good or bad borrowers. However, credit scoring datasets often contain few examples of bad borrowers, so a model may misclassify bad borrowers as good payers and expose the lender to losses. This study explores two approaches to class imbalance: first, adapting a supervised learning algorithm, specifically Extreme Gradient Boosting (XGBoost), with the Weighted Focal Loss function; second, balancing the data artificially through oversampling and undersampling. The methods are applied to a real-world dataset, the results are analyzed, and the effectiveness of the proposed methods is discussed. The resulting models had a lower expected cost, i.e. smaller losses for the lender, but the approach based on artificial data balancing also worsened the Brier Score.
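
As a rough illustration of the quantities named in the abstract, the sketch below computes one common form of a weighted focal loss (the alpha-balanced variant), the Brier Score, and a plain random oversampling of the minority class. It is not the authors' implementation: the exact loss used in the study is not given here, and the function names, the default values `alpha=0.75` and `gamma=2.0`, and the coding of bad borrowers as `y = 1` are all assumptions made for the example.

```python
import numpy as np


def weighted_focal_loss(y_true, p_pred, alpha=0.75, gamma=2.0, eps=1e-12):
    """Mean alpha-balanced focal loss for binary labels (assumed form).

    alpha weights the positive class (assumed here: bad borrowers, y = 1);
    gamma down-weights examples that are already well classified.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability given to the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-dependent weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    return float(np.mean((p - y) ** 2))


def random_oversample(X, y, seed=0):
    """Resample every class with replacement up to the majority-class count."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=target - n, replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]


if __name__ == "__main__":
    y = np.array([0, 0, 0, 0, 0, 0, 0, 1])  # toy labels, one bad borrower
    p = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6])
    print("weighted focal loss:", weighted_focal_loss(y, p))
    print("Brier score:", brier_score(y, p))
    X_bal, y_bal = random_oversample(np.arange(8).reshape(-1, 1), y)
    print("class counts after oversampling:", np.bincount(y_bal))
```

With `gamma=0` and `alpha=0.5` the loss reduces to half the usual log loss, which is a convenient sanity check. Note that resampling changes the class prior seen by the model, which is one reason predicted probabilities, and hence the Brier Score, can degrade after balancing.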
