Prediction of the Reasons for Losing Credit Card Customers based on Multiple Models

Abstract: The data in this article is taken from Kaggle and is publicly available. It covers nearly 20 possible reasons why a bank loses credit card customers, such as their monthly and yearly income or gender. Some of the data in the original dataset is recorded as text; such data must be converted into numbers before it can be put into a model. It is also important to check the data for outliers. The dataset is then put into several models to test their accuracy, and by ranking the results of the different models, the best-fitting model is obtained. The highly correlated factors in that model are then ranked to find the main reasons why banks lose credit card customers. Finally, some business advice is provided that banks could use to identify similar groups of credit card customers and to retain existing ones.


Introduction
A credit card is a physical payment card. When a credit card account is opened, the credit card company sets a credit limit. Within the credit limit, the cardholder can purchase goods or services, or withdraw cash. The use of credit cards is very common in today's society. However, with the insecurity of credit cards and the emergence of more convenient and efficient mobile payment methods, people are gradually using credit cards less, and some have even decided to give up their credit cards. Therefore, analyzing the reasons for losing credit card customers is a potentially important source of value for financial institutions and banks, and this paper uses data models to analyze the problem. In the context of digital development, data analysis is widely used by enterprises to improve their business, since more and more companies have access to their customers' data and consumption data, whether from their own sources or from a third party [1]. Online companies can also conduct data analysis to understand their customers' behavior and preferences [2]; they can then make adjustments or apply operating strategies to increase the number of customers and improve the conversion rate, which finally leads to an increase in revenue.
The main purpose of this research paper is to try to identify the main reasons why banks lose customers by constructing a model about bank credit card customers based on a set of actual data so that it can better screen the reasons why they may lose customers and can also be used to identify potential customer groups.
Credit cards are a long-term and profitable loan program for banks. Because credit card accounts are revolving lines of credit, banks have more aspects to monitor and manage than for other retail loans such as mortgages [3]. The high-interest balances generated when borrowers do not pay their credit card bills on time are a major source of profit for banks [4]. Credit card customer churn is therefore a huge loss for banks, and it is particularly important to analyze its causes. Some causes are objective, such as credit card fraud: in 2017 there were 1,579 data breaches and nearly 179 million records, among which credit card fraud was the most common form, with 133,015 reports [5]. Meanwhile, in some countries mobile payments have become an established payment mechanism similar to credit or debit cards. For example, the volume of mobile payment transactions in China is expected to exceed credit card transactions by 30% [6]. There are also many reasons related to the customers themselves, such as gender, income, marital status, and educational background. This paper analyzes these possible variables and gives recommendations to help banks better deal with customer churn and identify potential customers.

Literature Review
Keeping an eye on credit card usage has always been a very important issue. Credit cards are being used more and more frequently in consumer spending, both because of their profitability for banks and their ease of use for cardholders, and they have become the dominant payment system. Expanding globalization and widespread use of the Internet have led to more frequent credit card use, especially in online shopping. A variety of online platforms constantly target potential customers through a hard-to-resist variety of deals and promotional holidays (e.g. Black Friday, Christmas, New Year). Similarly, credit card companies offer lucrative rewards (e.g. miles, points, cashback bonuses) for using credit cards instead of cash. According to the Federal Reserve Bank, 72% of U.S. consumers had at least one credit card as of 2014. However, a large consumer base does not mean that consumers understand exactly what is behind the credit card. Many consumers, especially younger ones, lack financial literacy and consumer experience, which makes it easy for them to accept misleading offers from merchants and to abuse their credit. They show no concern about the expansion of credit. As of 2015, the average credit card debt of card-carrying households reached $15,779, with an outstanding revolving balance of $937.9 billion [7].
With credit cards used so frequently, attention to credit card customers is essential. In the past, however, banks have tended to focus their attention on credit card customers in terms of defaults. There are many ways to identify credit card defaults, such as the widely used classification method. Data mining is used to construct models, such as multicriteria linear programming, that classify the behavior of credit cardholders to determine the likelihood of default and late payment [8]. In addition, based on Knight's theoretical research, a model for studying credit card risk was proposed that uses a priori data on credit card customers' personal information and consumer behavior to distinguish uncertainty from risk. The data is convincing enough, but its relevance to the consumers themselves is low, and the scope is too narrow because it starts from behavior alone. At the same time, as Hyun's study [25] shows, standard credit risk models can be built to classify credit cardholders. Studies conducted with machine learning (ML) have shown the impact of credit risk models in identifying multiple banking risk problems.
While it is important for banks to control risk properly, it is also important for cardholders to protect their assets. The rapid and high volume of credit card transactions (CCTs) has led to a significant increase in fraud cases. In 2017, a CyberSource report noted that fraudulent Latin American e-commerce credit card transactions (chargebacks) amounted to 1.4% of the total net value of the industry [9].
With the economy now in recession, especially through the COVID-19 period, many financial institutions are on the verge of bankruptcy. This is because, during the pandemic downturn, credit growth was negative while bank deposits (i.e., third-party funding) increased [10]. The negative impact of the recession, together with the questionable security of credit cards and the continuous loss of credit card customers, has become a huge problem.
Therefore, this paper mainly uses a logistic regression model to examine the data and build a model to give banks the most intuitive reason for losing credit card customers, to help banks to better retain existing customers and explore potential customers.

Analysis and Data Set
The credit card customer dataset is downloaded from the official website of Kaggle (https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers). This dataset consists of over 10,000 customers and nearly 20 features, including their age, salary, marital status, credit card limit, credit card type, etc. First, the text-valued variables are recoded into a unified numeric form: for Attrition Flag, 1 represents Existing Customer and 2 represents Attrited Customer; for Gender, 1 represents Male and 2 represents Female; for Education Level, 1 represents Graduate, 2 represents High School, and 3 represents others; for Marital Status, 1 represents Married, 2 represents Single, and 3 represents others; for Income Category, 1 represents Less than $40K, 2 represents between $40K and $60K, and 3 represents others; for Card Category, 1 represents Blue, 2 represents Silver, and 3 represents others.
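The recoding described above can be sketched in a few lines of Python. This is only an illustration of the mapping, not the code used in the study: the column names and the raw text labels (for example "M"/"F" for gender) are assumptions about how the CSV file stores the values, and `encode_row` is a hypothetical helper.

```python
# Category codes follow the text; column names and raw labels are assumptions
# about the CSV file, and encode_row is a hypothetical helper.
OTHERS = 3

# column -> (label-to-code mapping, code for any other label or None if none allowed)
ENCODINGS = {
    "Attrition_Flag": ({"Existing Customer": 1, "Attrited Customer": 2}, None),
    "Gender": ({"M": 1, "F": 2}, None),
    "Education_Level": ({"Graduate": 1, "High School": 2}, OTHERS),
    "Marital_Status": ({"Married": 1, "Single": 2}, OTHERS),
    "Income_Category": ({"Less than $40K": 1, "$40K - $60K": 2}, OTHERS),
    "Card_Category": ({"Blue": 1, "Silver": 2}, OTHERS),
}


def encode_row(row):
    """Return a copy of one record with its text categories replaced by codes."""
    encoded = dict(row)
    for column, (mapping, default) in ENCODINGS.items():
        if column not in encoded:
            continue
        label = encoded[column]
        if label in mapping:
            encoded[column] = mapping[label]
        elif default is not None:
            encoded[column] = default  # every remaining label becomes "others"
        else:
            raise ValueError(f"unexpected label {label!r} in column {column}")
    return encoded


example = encode_row({
    "Gender": "F",
    "Education_Level": "Doctorate",
    "Card_Category": "Blue",
    "Customer_Age": 45,
})
print(example)
```

Columns that are already numeric, such as the age in the example record, pass through unchanged.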
To better understand the variables in the dataset and check whether there are missing or abnormal values, descriptive analysis is applied first. Tables 1-5 show the descriptive statistics for all the variables in the credit card customer dataset. According to the tables, there are no missing values in the dataset. In addition, judging by common sense and the range (maximum - minimum) of each variable, there are no abnormal values. Moreover, after the recoding described above, all variables in the dataset can be treated as numeric variables.
In detail, the average customer age is 8.3, while the minimum customer age is 26 and the maximum customer age is 73. The average dependent count is 2.35, with a minimum of 0 and a maximum of 5. The average number of months on book is 35.9, with a minimum of 13 and a maximum of 56, so the range is 43 (56-13=43). The average total relationship count is 3.81, with a minimum of 1 and a maximum of 6. The average number of months inactive in the last 12 months is 2.34, with a range of 6 (6-0=6). The average contacts count in 12 months is 2.46, with a minimum of 0 and a maximum of 6. The average credit limit is 8632.
All the variables in the dataset have now been summarized in the descriptive statistics tables. Next, histograms of these variables are plotted to understand their distributions (Figure 1). According to the histograms, customer age, dependent count, months on book, months inactive in 12 months, and total transaction count are approximately normally distributed, while all other variables in the dataset appear right-skewed. The exceptions are the variables that were originally text and were recoded as the numbers 1, 2, and 3; their histograms look unusual, but this does not affect the analysis.
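The per-variable summary used above (missing count, mean, minimum, maximum, and range) can be sketched as follows. The `describe` helper and the five sample values are hypothetical and only chosen so that the range matches the 56-13=43 quoted for months on book; they are not rows from the dataset.

```python
from statistics import mean


def describe(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "mean": mean(present),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
    }


# Hypothetical sample, not taken from the dataset.
months_on_book = [13, 36, 36, 40, 56]
summary = describe(months_on_book)
print(summary)
```

A non-zero "missing" count or an implausible minimum or maximum would be the signal to inspect a column more closely.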

Analytical Model
As stated above, the main task of this report is to understand which factors influence whether a bank loses or keeps its credit card customers, and the effects of these factors. Thus, a logistic regression model, a K-NN model, and a Naive Bayes model are applied in the following parts.

Logistic Regression Model
When the dependent variable has only two possible outcomes, binomial logistic regression is applied instead of ordinary linear regression. As the goal of this report is to examine the impact of the various reasons why the bank loses credit card customers, and each customer either stays or leaves, binomial logistic regression is applied.
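To make the binomial model concrete, the sketch below shows how a logistic regression turns a linear combination of features into a probability between 0 and 1. The intercept, the two coefficients, and the feature values are illustrative assumptions, not the fitted model from this paper.

```python
import math


def sigmoid(z):
    """Logistic function; written in a form that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(intercept, coefficients, features):
    """Probability of the positive class for one observation."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical model with two features and made-up coefficients.
p = predict_probability(0.0, [0.45, -0.50], [3, 2])
print(round(p, 4))
```

Because the output is a probability, a cutoff still has to be chosen to turn it into a yes/no prediction.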

K-NN Model
The main idea of the k-NN algorithm is that if most of the k nearest neighbors of a sample in the feature space belong to a certain class, then the sample is considered to belong to that class and to share its characteristics. The classification decision depends only on the class of the nearest sample or samples, so the k-NN method involves only a small number of neighboring samples when making class decisions. The most important parameter is the value of K, so the accuracy is tested for K values from 1 to 8.
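The majority-vote idea can be sketched in plain Python as follows. This is a minimal illustration using Euclidean distance, not the R implementation used in the study; the function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups labelled 1 and 2.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = [1, 1, 1, 2, 2, 2]

print(knn_predict(train_X, train_y, (2, 2), k=3))
print(knn_predict(train_X, train_y, (8.5, 8.5), k=3))
```

Repeating the prediction on a held-out set for each K from 1 to 8 and comparing the accuracies is one way to pick the value of K.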

Model training for Naïve Bayes
The Naive Bayes Classifier (NBC) is a classification method based on Bayes' theorem. In theory, the NBC model has the lowest error rate among classification methods. In practice this is not always the case, because the NBC model assumes that the features are independent of each other, and this assumption rarely holds in real data. The idea of Naive Bayes is the posterior probability, that is, the probability of an event given that another event has already occurred.
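The posterior probability idea can be written compactly as below. The symbols are notation introduced here for illustration: $C$ denotes the class (for example, existing or attrited customer) and $x_1,\dots,x_n$ the feature values of one customer. The product form is valid only under the independence assumption discussed above.

```latex
P(C \mid x_1,\dots,x_n)
  = \frac{P(C)\,\prod_{i=1}^{n} P(x_i \mid C)}{P(x_1,\dots,x_n)},
\qquad
\hat{C} = \arg\max_{C}\; P(C)\prod_{i=1}^{n} P(x_i \mid C)
```

Since the denominator is the same for every class, only the numerator needs to be compared when choosing the predicted class $\hat{C}$.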

Analysis and results

Results and analysis of Binomial Logistic Regression Model
After figuring out the correlation coefficients between the attrition flag and the other variables in the dataset, the logistic regression analysis is conducted. To begin with, all independent variables are used to fit the regression model to predict whether the bank loses a credit card customer. The results of the model are shown below. The statistically significant estimated coefficients are interpreted here in detail. Because this is a logistic regression, each coefficient describes a change in the log-odds of keeping the customer, not a number of customers. The estimated coefficient of Gender is -0.66, which indicates that, holding other variables constant, a one-unit increase in the Gender code lowers the log-odds of keeping the customer by 0.66, so the bank is more likely to lose the customer. The estimated coefficient of dependent count is -0.13, so each additional dependent lowers the log-odds of keeping the customer by 0.13. The estimated coefficient of marital status is -0.33, so a one-unit increase in the marital status code lowers the log-odds of keeping the customer by 0.33. The estimated coefficient of total relationship count is 0.45, so each additional product held raises the log-odds of keeping the customer by 0.45. The estimated coefficient of months inactive is -0.50, so each additional inactive month lowers the log-odds of keeping the customer by 0.50. The estimated coefficient of contacts count is -0.51, so each additional contact lowers the log-odds of keeping the customer by 0.51. The other significant variables can be read in the same way. In addition, the R-squared value reported for this model is 0.467, which means that 46.7% of the variation in the loss of credit card customers can be explained by the independent variables in the model.
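One common way to read logistic regression coefficients such as those quoted above is to exponentiate them into odds ratios. The sketch below does this for the six reported values; the dictionary keys are shorthand names chosen for illustration, and the reading assumes, following the sign convention of the text, that the positive class is a retained customer.

```python
import math

# Coefficients as reported in the text; the key names are illustrative shorthand.
coefficients = {
    "gender": -0.66,
    "dependent_count": -0.13,
    "marital_status": -0.33,
    "total_relationship_count": 0.45,
    "months_inactive_12_mon": -0.50,
    "contacts_count_12_mon": -0.51,
}

# exp(beta) is the multiplicative change in the odds for a one-unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f}")
```

An odds ratio below 1 means the odds of the positive class shrink as the variable increases, and a ratio above 1 means they grow.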
It can be seen from Figure 3 that several variables did not pass the hypothesis test; that is, their parameters are not statistically significant. The number of variables should therefore be reduced, and this report uses stepwise regression to do so. Starting from the initial model, stepwise regression removes one variable at a time, calculates the AIC value of each candidate model, and keeps the model with the smallest AIC, until no further variable can be removed. In other words, the final model has the lowest AIC value. Figure 4 shows the resulting logistic regression model without the insignificant variables; after the irrelevant variables are removed, the reliability of the model increases.
Next, the performance on the test set is checked (Figure 5). Because the output of logistic regression is a probability, the accuracy is measured as the share of correctly classified cases: if the predicted probability is greater than or equal to 0.5, the case is assigned to the positive class, and otherwise to the negative class. The final accuracy is 88.97%.
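The backward elimination procedure can be sketched as below, using AIC = 2k - 2 ln(L). This is a generic illustration, not the R routine used in the study: `backward_stepwise` accepts any scoring function that returns the AIC of a model fitted on a given feature list, and `toy_score` with its log-likelihood gains is an invented stand-in for actually refitting the model.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def backward_stepwise(features, score):
    """Drop one feature at a time for as long as doing so lowers the AIC."""
    current = list(features)
    best = score(current)
    while len(current) > 1:
        # AIC of every model that has exactly one feature fewer.
        candidates = [
            (score([f for f in current if f != drop]), drop) for drop in current
        ]
        candidate_aic, drop = min(candidates)
        if candidate_aic >= best:
            break  # no removal improves the AIC any further
        current.remove(drop)
        best = candidate_aic
    return current, best


# Invented log-likelihood gains per feature, used only to make the sketch runnable.
GAINS = {"gender": 30.0, "months_inactive": 50.0, "customer_age": 0.3, "credit_limit": 0.5}


def toy_score(features):
    log_likelihood = -500.0 + sum(GAINS[f] for f in features)
    return aic(log_likelihood, len(features) + 1)  # +1 for the intercept


selected, best_aic = backward_stepwise(list(GAINS), toy_score)
print(selected, best_aic)
```

In this toy setup the two features with negligible gains are removed, because the penalty of 2 per parameter outweighs their contribution to the likelihood.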
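The step from predicted probabilities to an accuracy figure can be sketched as follows. The probabilities and true labels are made-up values for illustration, and the 0.5 cutoff is the one described above.

```python
def classify(probabilities, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(predicted, actual):
    """Share of cases whose predicted class equals the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up predicted probabilities and true labels.
probabilities = [0.91, 0.48, 0.50, 0.12, 0.77]
actual = [1, 1, 1, 0, 0]

predicted = classify(probabilities)
print(predicted, accuracy(predicted, actual))
```

Passing a different `threshold` to `classify` shows how the accuracy depends on the chosen cutoff.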

Results and analysis of the K-NN model
The KNN algorithm uses a majority voting mechanism. It stores the data from the training dataset and then uses that data to make predictions on new records. The KNN algorithm can compete with more complex models in accuracy. Therefore, it is suitable for applications that require high accuracy but do not require a human-readable model.
The results of the K-NN model are as follows. From Figures 6-8, the accuracy reaches its highest value when K = 5 or 6, that is, when the 5 or 6 nearest neighbors are used. When a suitable value of K is used to test the accuracy, the result is as shown in the following table.
In the end, the accuracy obtained by this algorithm is 87.49%, which can be regarded as a very good classification result.

Model training for Naive Bayes
In this model, I include all the variables in the algorithm. The following is a summary of the output of the model.
It can be found that the accuracy rate reaches 88.53% and the Kappa value is greater than 0.5, which indicates that this model performs well on the test set.
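The Kappa value mentioned above measures how much better the classifier agrees with the true labels than chance agreement would. A minimal sketch for a 2x2 confusion matrix is given below; the counts are invented for illustration and are not the paper's results.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total  # observed agreement, i.e. the accuracy
    # Agreement expected by chance from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return (observed - expected) / (1 - expected)


# Invented counts: 88% accuracy on an imbalanced test set.
kappa = cohen_kappa(tp=80, fp=8, fn=4, tn=8)
print(round(kappa, 4))
```

With imbalanced classes a high accuracy can coexist with a modest Kappa, which is why the two figures are usually reported together.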

Conclusion of Models
The accuracies of the three methods are: logistic regression 88.97%, K-NN 87.49%, and Naïve Bayes 88.53%. By comparing the three methods, this paper finds that the logistic regression model (specifically the binomial one) has the best prediction ability, followed by Naive Bayes and K-NN. However, it cannot be concluded that the K-NN model is the worst model, because the data could be made more suitable for it through variable transformation.

Shortcomings of Models
The K-NN model has a few shortcomings. First, the K-NN results can only be used to identify how many neighbors to use in the model, and the value of K must then be changed repeatedly to reach the highest accuracy. Second, K-NN is very sensitive to the scale of the data because it relies on computing distances: for features with larger scales, the calculated distances may be very large and may yield poor results. Therefore, it is recommended to scale the data before running K-NN.
In addition, the general approach has two shortcomings. 1. It is not necessarily correct to judge whether a bank will lose a customer with a cutoff of 0.5, because the subsequent analysis still found many credit card customers whose classification was obviously not in line with common sense. 2. The handling of missing values is not very good: deleting them directly reduces the overall variance of the dataset, and the information provided by the sample is not fully used.
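One simple way to scale features before computing K-NN distances is min-max scaling, sketched below. Min-max scaling is only one possible choice (standardization is another), and the sample values are illustrative, using the credit limit minimum, mean, and maximum quoted in this paper together with made-up inactive-month counts.

```python
def min_max_scale(column):
    """Rescale a numeric column linearly to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # a constant column carries no distance information
    return [(v - lo) / (hi - lo) for v in column]


# Two features on very different scales.
credit_limit = [1438, 8632, 34516]
months_inactive = [0, 2, 6]

scaled_limit = min_max_scale(credit_limit)
scaled_inactive = min_max_scale(months_inactive)
print(scaled_limit, scaled_inactive)
```

After scaling, a difference in credit limit no longer dominates the distance simply because its raw values are thousands of times larger.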

Suggestions for the Research
Based on the results obtained above, some suggestions can be made to help bank credit card managers gain new credit card holders and retain existing ones, and thereby increase bank revenue and expand bank services. Since gender, dependent count, marital status, months inactive in 12 months (No. of months inactive in the last 12 months), and contacts count in 12 months (No. of contacts in the last 12 months) have negative effects, the following ideas are proposed:
1. Try to find men as potential credit card holders, and offer discounts to men who already hold credit cards, since there is a high possibility that they can be retained. A possible reason is that men, who tend to hold more wealth, may be more likely to be potential credit card users.
2. Try to find or retain people with few dependents as credit card candidates. Their family situation is often simpler, and they are therefore more willing to become credit card holders.
3. Make every effort to win married credit card holders or candidates. They often face some financial difficulties, so a credit card can be a good way to get through them; at the same time, marriage often brings a greater sense of responsibility, so timely repayment is usually not a big problem for them.
4. Pay attention to customers who have not been very active in the past 12 months. They are unlikely to be loyal credit card customers: if they are not active in the bank's business, they are also more likely to be uninterested in a credit card, which requires timely repayment.
5. Do not assume that customers with a higher number of contacts in the past twelve months are active; they may not be potential or loyal credit card customers.
6. Customers who hold more products tend to become loyal credit card customers, because the more products they hold in a bank, the more deposits they have in that bank and the more satisfied they are with the bank's financial services. Pay special attention to this group of people.
7. Pay attention to those who have more total transactions. The more transactions they make through the bank, the more of their capital flows are placed in this bank, which means they are very satisfied with the bank's services, whether because of low transfer fees or high interest rates. Such people are likely to become credit card customers.
8. For those whose number of transactions increases from the first to the fourth quarter, the upward trend means that most of their turnover at the end of the year will be in this bank. It also shows that they are very satisfied with the bank's service, and there is a high possibility that they will become holders of the bank's credit card.

Conclusion
As introduced above, the main goal of this report is to understand different kinds of reasons why the bank is losing credit card customers and to examine the impacts of different reasons.

Figure 1: Histograms of four features in the data set.
Figures 2 and 3 show the results for the binomial logistic regression model with all independent variables. According to the figures, only gender, dependent count, marital status, relationship count, number of inactive months in 12 months, number of contacts in 12 months, total transaction amount, and change in total transaction amount are statistically significant in this logistic regression model, since their p-values are smaller than 0.01. All other estimated coefficients are not statistically significant in this model.

Figure 2: Logistic Regression model 1 with all independent variables.
Figure 3: Logistic Regression model 1 with all independent variables.
Figure 4: Logistic Regression model without insignificant elements.
Figure 5: Performance of the New Logistic Regression model.
Figure 6: K-NN model results.
Figure 7: K-NN model results.
Figure 8: K-NN model results.
Figure 9: K-NN model accuracy results.
Figure 10: K-NN model results.

Table 2: Descriptive statistics of four elements.
Table 3: Descriptive statistics of four elements.
Table 4: Descriptive statistics of four elements.
Table 5: Descriptive statistics of four elements.
The 6th International Conference on Economic Management and Green Development (ICEMGD 2022) DOI: 10.54254/2754-1169/3/2022801
The range of the credit limit is 33078 (34516-1438=33078). The other elements can be read in the same way as the above data.
To answer these questions, descriptive statistics, data visualization (histograms), the Pearson correlation coefficient, binomial logistic regression analysis, the K-NN model, and the Naïve Bayes model have been applied in RStudio. The results show that gender, dependent count, marital status, total relationship count (Total no. of products held by the customer), months inactive in 12 months (No. of months inactive in the last 12 months), contacts count in 12 months (No. of Contacts in the last 12 months), total trans_Ct (Total Transaction Count (Last 12 months)) and Total_Ct_Chng_Q4_Q1 (Change in Transaction Count (Q4 over Q1)) tend to have a significant impact on the bank losing credit card customers. In addition, the estimated coefficients of gender, dependent count, marital status, months inactive in 12 months, and contacts count in 12 months are negative, while the estimated coefficients of total relationship count, total trans_Ct, and Total_Ct_Chng_Q4_Q1 are positive. These results are relatively reliable, since most of the assumptions of the regression model have been shown to be met.