Adaptations of Extreme Gradient Boosting for Imbalanced Datasets with Application in Credit Scoring
Keywords:
Credit Scoring, XGBoost, Machine Learning, Imbalanced Data, Data Augmentation
Abstract
Credit scoring can be framed as a binary classification problem in which the goal is to build a model that classifies customers as good or bad borrowers. However, the databases used in credit scoring often contain few examples of bad borrowers, so a model can misclassify bad borrowers as good payers, exposing the lender to potential losses. This study explores two approaches to class imbalance: first, adapting a supervised learning algorithm, specifically Extreme Gradient Boosting (XGBoost), by training it with the Weighted Focal Loss function; and second, artificially balancing the data through oversampling and undersampling. The methods are applied to a real-world database, the results are analyzed, and the effectiveness of the proposed methods is discussed. The resulting models had a lower expected cost, i.e., less damage to the creditor, but the approach based on artificial data balancing also worsened the Brier Score.
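As an illustration of the first approach, the sketch below evaluates one commonly used form of the weighted focal loss for binary labels. It is a minimal example under stated assumptions, not the study's implementation: the abstract does not give the exact formula, and the function name, the label convention (1 = bad borrower, assumed to be the minority class), and the default values of `alpha` and `gamma` are all hypothetical.

```python
import math


def weighted_focal_loss(y_true, p_pred, alpha=2.0, gamma=1.5, eps=1e-12):
    """Mean weighted focal loss over a batch of binary labels.

    Assumed form (not taken from the abstract):
      y = 1: -alpha * (1 - p)**gamma * log(p)
      y = 0: -p**gamma * log(1 - p)
    alpha > 1 up-weights the positive (minority) class; gamma > 0
    down-weights examples the model already predicts confidently.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -alpha * (1.0 - p) ** gamma * math.log(p)
        else:
            total += -(p ** gamma) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical toy batch: one bad borrower (label 1) among good ones.
labels = [1, 0, 0, 0]
probs = [0.30, 0.10, 0.20, 0.05]
loss = weighted_focal_loss(labels, probs)
```

With `alpha=1.0` and `gamma=0.0` this expression reduces to the ordinary binary cross-entropy, which makes it easy to check against a standard log-loss implementation. Using it as a training objective in XGBoost would additionally require its first and second derivatives with respect to the raw score, which are omitted here.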
License
Proposed Policy for Open Access Journals
Authors who publish in this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution License, which allows the work to be shared with acknowledgment of its authorship and initial publication in this journal.
- Authors may enter into separate, additional contracts for the non-exclusive distribution of the version of the work published in this journal (e.g., posting it to an institutional repository or publishing it as a book chapter), with acknowledgment of its authorship and initial publication in this journal.
- Authors are permitted and encouraged to publish and distribute their work online (e.g., in institutional repositories or on their personal page) at any point before or during the editorial process, as this can lead to productive changes and increase the impact and citation of the published work (see The Effect of Open Access).