For the final project in INF1340 (Programming for Data Science), we analyzed and predicted the probability of default among credit card holders.
Credit card debt and delinquency have always been one of the biggest pain points in the banking industry, implying socioeconomic and financial risks for both consumers and the servicing organizations. For consumers, delinquency can lead to a cycle of financial hardship, reduced access to credit, and long-term damage to creditworthiness. For financial institutions, high default rates increase operational risk, reduce profitability, and may necessitate stricter lending policies, which could limit access to credit for potential customers.
This study explores patterns and predictors of credit card default using a dataset that includes demographic, behavioral, and transactional variables. By employing descriptive and diagnostic analytics, such as pivot tables and statistical tests, and predictive modeling techniques, such as logistic regression and machine learning methods, the analysis identifies key factors contributing to default risk.
Diagnostic Analytics
After necessary data cleaning and descriptive analytics, we used diagnostic analytics to explore the potential relationships between variables, especially in relation to the target variable 'Default.'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def correlation_matrix(df):
    # Restrict to numeric columns before computing pairwise correlations
    numerical_df = df.select_dtypes(include=np.number)
    plt.figure(figsize=(16, 14))
    sns.heatmap(numerical_df.corr(), annot=True, cmap='coolwarm')
    plt.title('Correlation Heatmap')
    plt.show()
    return numerical_df.corr()

correlation_matrix(credit_data)

import plotly.express as px

def multivariate_analysis(df):
    # Pair plot of a subset of numeric features, coloured by default status
    subset_columns = ['LIMIT_BAL', 'AGE', 'PAY0', 'BILL1', 'AMT1', 'Default']
    g = sns.pairplot(df[subset_columns], hue='Default',
                     plot_kws={'alpha': 0.5},
                     diag_kws={'alpha': 0.5},
                     palette='viridis')
    g.fig.suptitle('Pair Plot with Target Variable Highlighting', y=1.02)
    plt.tight_layout()
    plt.show()

    # Min-max normalize the features so they share a common scale
    # in the parallel coordinates plot
    def normalize_column(series):
        return (series - series.min()) / (series.max() - series.min())

    parallel_cols = ['LIMIT_BAL', 'AGE', 'PAY0', 'BILL1', 'AMT1', 'Default']
    df_normalized = df[parallel_cols].copy()
    for col in parallel_cols[:-1]:
        df_normalized[col] = normalize_column(df_normalized[col])
    fig = px.parallel_coordinates(
        df_normalized,
        color='Default',
        title='Parallel Coordinates Plot',
        color_continuous_scale=px.colors.sequential.Viridis
    )
    fig.show()

    # Box plots of credit limit by demographic category, split by default status
    categorical_columns = ['SEX', 'EDUCATION', 'MARRIAGE']
    plt.figure(figsize=(15, 5))
    for i, var in enumerate(categorical_columns, 1):
        plt.subplot(1, 3, i)
        sns.boxplot(x=var, y='LIMIT_BAL', hue='Default',
                    data=df, palette='Set2')
        plt.title(f'{var} vs Limit Balance by Default Status')
        plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

multivariate_analysis(credit_data)
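The project overview also mentions statistical tests. As one possible example (not part of the code above), a chi-square test of independence can check whether a categorical variable such as EDUCATION is associated with Default; a minimal sketch, assuming scipy is available and credit_data is the cleaned DataFrame:

from scipy.stats import chi2_contingency

def chi_square_test(df, cat_col, target='Default'):
    # Cross-tabulate the categorical variable against the default flag
    contingency = pd.crosstab(df[cat_col], df[target])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    print(f'{cat_col} vs {target}: chi2={chi2:.2f}, dof={dof}, p-value={p_value:.4f}')
    return p_value

# A small p-value suggests the variable and Default are not independent
for col in ['SEX', 'EDUCATION', 'MARRIAGE']:
    chi_square_test(credit_data, col)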



import statsmodels.api as sm
from sklearn.model_selection import train_test_split

def logistic_regression(df):
    # Drop the target and the engineered AGE_GROUP column from the predictors
    X = df.drop(labels=['Default', 'AGE_GROUP'], axis=1)
    y = df['Default']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    # Add an intercept term and fit a statsmodels logit for interpretable coefficients
    X_train_const = sm.add_constant(X_train)
    logit_model = sm.Logit(y_train, X_train_const)
    result = logit_model.fit()
    print(result.summary())

logistic_regression(credit_data)
Logit Regression Results
==============================================================================
Dep. Variable: Default No. Observations: 20858
Model: Logit Df Residuals: 20834
Method: MLE Df Model: 23
Date: Tue, 03 Dec 2024 Pseudo R-squ.: 0.1253
Time: 21:44:36 Log-Likelihood: -9613.9
converged: True LL-Null: -10991.
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.6448 0.144 -4.473 0.000 -0.927 -0.362
LIMIT_BAL -7.085e-07 1.93e-07 -3.663 0.000 -1.09e-06 -3.29e-07
SEX -0.1047 0.037 -2.830 0.005 -0.177 -0.032
EDUCATION -0.0685 0.026 -2.586 0.010 -0.120 -0.017
MARRIAGE -0.1851 0.038 -4.818 0.000 -0.260 -0.110
AGE 0.0054 0.002 2.524 0.012 0.001 0.010
PAY0 0.5908 0.021 27.797 0.000 0.549 0.632
PAY2 0.0832 0.024 3.427 0.001 0.036 0.131
PAY3 0.0560 0.027 2.056 0.040 0.003 0.109
PAY4 0.0511 0.030 1.696 0.090 -0.008 0.110
PAY5 0.0378 0.033 1.162 0.245 -0.026 0.102
PAY6 -0.0057 0.027 -0.213 0.832 -0.058 0.047
BILL1 -5.429e-06 1.39e-06 -3.918 0.000 -8.15e-06 -2.71e-06
BILL2 1.943e-06 1.85e-06 1.048 0.295 -1.69e-06 5.58e-06
BILL3 4.98e-07 1.68e-06 0.297 0.766 -2.79e-06 3.78e-06
BILL4 -1.174e-06 1.75e-06 -0.671 0.502 -4.6e-06 2.26e-06
BILL5 2.552e-06 1.97e-06 1.297 0.195 -1.31e-06 6.41e-06
BILL6 5.432e-07 1.47e-06 0.370 0.712 -2.34e-06 3.42e-06
AMT1 -1.216e-05 2.66e-06 -4.569 0.000 -1.74e-05 -6.95e-06
AMT2 -8.75e-06 2.46e-06 -3.552 0.000 -1.36e-05 -3.92e-06
AMT3 -5.074e-06 2.3e-06 -2.202 0.028 -9.59e-06 -5.59e-07
AMT4 -5.096e-06 2.19e-06 -2.327 0.020 -9.39e-06 -8.04e-07
AMT5 -3.515e-06 2.18e-06 -1.610 0.107 -7.79e-06 7.63e-07
AMT6 -1.779e-06 1.54e-06 -1.156 0.248 -4.8e-06 1.24e-06
Regarding the logistic regression, the pseudo R-squared of 0.1253 indicates that about 12.53% of the variation in the dependent variable (Default) is explained by the model. The log-likelihood ratio (LLR) p-value is reported as 0.000, i.e., effectively zero, which suggests the model is statistically significant overall. While pseudo R-squared values in logistic regression are typically lower than R-squared in linear regression, 12.53% suggests a moderate but not particularly strong fit.
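As a quick sanity check (not part of the original notebook output), McFadden's pseudo R-squared can be reproduced directly from the log-likelihoods reported in the summary above:

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
ll_model = -9613.9    # Log-Likelihood reported in the summary
ll_null = -10991.0    # LL-Null reported in the summary
pseudo_r2 = 1 - ll_model / ll_null
print(f'Pseudo R-squared: {pseudo_r2:.4f}')   # prints ~0.1253, matching the summary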
Looking at the coefficients, p-values, and confidence intervals, the predictor’s impact on the likelihood of default is interpreted below:
LIMIT_BAL: The negative coefficient indicates that higher credit limits are associated with a lower likelihood of default. The coefficient is statistically significant (p < 0.05), but the effect size is small, suggesting that LIMIT_BAL alone may not be a strong predictor.
SEX, EDUCATION, MARRIAGE: These demographic variables are statistically significant, with p-values below 0.05. The signs of the coefficients suggest:
Male (coded as 1): Associated with a slightly lower likelihood of default.
Higher Education: Associated with an increased probability of default. Compared to 'Graduate School', having a 'High School' education is associated with a decrease in the log-odds of default by 0.1006.
Marital Status: The negative coefficient suggests that being married may reduce the likelihood of default.
AGE: Positively associated with default probability, meaning older age is associated with higher odds of default. Although significant, the small coefficient implies age alone has a limited effect on default likelihood.
Repayment History (PAY0 to PAY6): These are among the most influential predictors, as seen from the large, positive coefficient for PAY0 and the smaller but still significant coefficients for PAY2 and PAY3. The positive signs indicate that higher values (i.e., longer payment delays) increase the probability of default.
BILL1 to BILL6 and AMT1 to AMT6: Of the billing amounts, only BILL1 is statistically significant, and the significant payment amounts (AMT1 through AMT4) have very small coefficients, implying limited impact on default likelihood. The remaining variables (BILL2 through BILL6, AMT5, and AMT6) are not significant, meaning they do not appear to add predictive value for default in this model. A short sketch after this list converts a few of the log-odds coefficients into odds ratios for easier interpretation.
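As a rough interpretation aid (not part of the original analysis), the log-odds coefficients above can be converted to odds ratios by exponentiating them; a minimal sketch using values from the summary:

import numpy as np

# Odds ratio = exp(coefficient); values above 1 raise the odds of default,
# values below 1 lower them.
coefficients = {
    'PAY0': 0.5908,           # per additional month of payment delay
    'MARRIAGE': -0.1851,
    'SEX': -0.1047,
    'LIMIT_BAL': -7.085e-07,  # per unit of credit limit, hence the tiny magnitude
}
for name, coef in coefficients.items():
    print(f'{name}: odds ratio = {np.exp(coef):.4f}')
# PAY0 comes out around 1.81, i.e. each extra month of delay multiplies the odds of default by roughly 1.8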
Predictive Analytics
We used three different machine learning models for this dataset: Decision Tree, Logistic Regression, and Naive Bayes.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn import tree

def decision_tree(df):
    X = df.drop(labels=['Default', 'AGE_GROUP'], axis=1)
    y = df['Default']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    # Baseline tree grown to full depth
    clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Basic Decision Tree Accuracy: {accuracy:.2f}')
    print(classification_report(y_test, y_pred))

    # Compare accuracy across several maximum depths
    depths = [3, 5, 10, None]
    accuracies = []
    for depth in depths:
        clf = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        accuracies.append(acc)
        print(f'Depth: {depth}, Accuracy: {acc:.2f}')

    plt.figure(figsize=(10, 6))
    plt.plot([str(d) for d in depths], accuracies, marker='o')
    plt.title('Decision Tree Accuracy vs. Depth')
    plt.xlabel('Tree Depth')
    plt.ylabel('Accuracy')
    plt.show()

    # Visualize the last fitted tree; class_names follow the label order [0, 1]
    plt.figure(figsize=(20, 10))
    tree.plot_tree(clf, feature_names=X.columns, class_names=['No', 'Yes'], filled=True)
    plt.title('Decision Tree Visualization')
    plt.show()

decision_tree(credit_data)
def decision_tree_entropy(df):
    X = df.drop(labels=['Default', 'AGE_GROUP'], axis=1)
    y = df['Default']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
    clf.fit(X_train, y_train)
    # With criterion='entropy', feature_importances_ reflects each feature's
    # total (normalized) information gain across the tree's splits
    information_gain = clf.feature_importances_
    feature_importance_df = pd.DataFrame({
        'Feature': X.columns,
        'Information Gain': information_gain
    }).sort_values(by='Information Gain', ascending=False)
    print(feature_importance_df)

    plt.figure(figsize=(12, 8))
    plt.barh(feature_importance_df['Feature'], feature_importance_df['Information Gain'], color='skyblue')
    plt.xlabel('Information Gain')
    plt.ylabel('Feature')
    plt.title('Information Gain for Each Attribute')
    plt.gca().invert_yaxis()
    plt.show()

decision_tree_entropy(credit_data)
Basic Decision Tree Accuracy: 0.72
              precision    recall  f1-score   support

           0       0.82      0.82      0.82      6918
           1       0.38      0.39      0.39      2022

    accuracy                           0.72      8940
   macro avg       0.60      0.60      0.60      8940
weighted avg       0.72      0.72      0.72      8940

Depth: 3, Accuracy: 0.82
Depth: 5, Accuracy: 0.81
Depth: 10, Accuracy: 0.80
Depth: None, Accuracy: 0.72
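The comparison at the end of this section also cites accuracy, precision, and recall figures for a logistic regression classifier whose scikit-learn code is not reproduced here. A minimal sketch of how such a classifier could be fit on the same split (the exact preprocessing and hyperparameters used in the project may differ):

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def logistic_regression_classifier(df):
    X = df.drop(labels=['Default', 'AGE_GROUP'], axis=1)
    y = df['Default']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    # Scaling helps the solver converge given the very different feature ranges
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f'Logistic Regression Accuracy: {accuracy_score(y_test, y_pred):.2f}')
    print(classification_report(y_test, y_pred))

logistic_regression_classifier(credit_data)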


from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

def naive_bayes(df):
    # Note: unlike the tree and logit models, AGE_GROUP is kept among the predictors here
    X = df.drop(labels=['Default'], axis=1)
    y = df['Default']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    # Standardize features; GaussianNB models each feature as a normal distribution
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model = GaussianNB()
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    print("\nNaive Bayesian Results:")
    print(f"Accuracy: {accuracy:.4f}")
    print("\nClassification Report:")
    print(report)
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix - Naive Bayesian')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()
    return {
        "model": model,
        "accuracy": accuracy,
        "classification_report": report,
        "confusion_matrix": conf_matrix,
    }

results = naive_bayes(credit_data)
Naive Bayesian Results:
Accuracy: 0.7506

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.80      0.83      6918
           1       0.46      0.58      0.51      2022

    accuracy                           0.75      8940
   macro avg       0.66      0.69      0.67      8940
weighted avg       0.77      0.75

Overall, Naive Bayes (0.75) performs worse than both the logistic regression (0.80) and decision tree models (up to 0.82). This lower accuracy suggests that Naive Bayes may not be the best choice for this dataset, especially given the complexity of relationships in financial data: Naive Bayes assumes feature independence, which is unlikely to hold here (e.g., billing amounts and payments are strongly correlated). However, Naive Bayes achieves the highest recall for class 1 (0.58), indicating it is better at identifying defaulters than the decision tree (0.39) and logistic regression (0.24), although its precision for class 1 (0.46) is only slightly better than the decision tree's (0.38) and worse than logistic regression's (0.70).
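For context (not computed in the original report), the test-set class distribution itself sets a baseline: a classifier that always predicts 'no default' would already reach roughly 77% accuracy, which is why accuracy alone understates the class-imbalance problem. A quick check from the support counts above:

# Majority-class baseline on the test set (support counts taken from the reports above)
non_defaults, defaults = 6918, 2022
baseline_accuracy = non_defaults / (non_defaults + defaults)
print(f'Always-predict-no-default accuracy: {baseline_accuracy:.3f}')  # ~0.774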
Conclusion
Key findings include:
Across models, the most important feature for predicting default status is PAY0, the most recent credit card repayment status.
Younger age groups show higher default rates, likely reflecting financial inexperience or limited income.
Higher education levels are strongly associated with higher credit limits and might reduce the likelihood of default, but their influence on default prediction is limited.
All models faced challenges due to the class imbalance (non-defaults significantly outnumber defaults).
This project demonstrates the trade-offs between different models in handling imbalanced datasets for credit default prediction. While logistic regression and decision trees excel in overall accuracy, Naive Bayes offers superior recall for defaults, which may align better with business goals focused on identifying risky customers. Further optimization through feature engineering, resampling, and advanced modeling techniques can enhance performance and address class imbalance effectively.
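As one illustration of the resampling and weighting direction mentioned above (not implemented in the project), scikit-learn's built-in class weighting could be applied to the logistic regression and decision tree models; a minimal sketch, reusing the same train/test split:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# class_weight='balanced' reweights samples inversely to class frequency,
# typically trading some overall accuracy for better recall on the minority (default) class.
X = credit_data.drop(labels=['Default', 'AGE_GROUP'], axis=1)
y = credit_data['Default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = [
    ('Weighted logistic regression', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)),
    ('Weighted decision tree (depth 5)', DecisionTreeClassifier(max_depth=5, class_weight='balanced', random_state=42)),
]
for name, model in models:
    # The tree does not need scaled inputs, but scaling does not hurt it either
    model.fit(X_train_scaled, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test_scaled)))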