Credit Card Fraud Detection

This is a group project from Durham College Capstone 1. Team members: Devy Ratnasari, Priyanka Singh, Gopika Shaji, Oluwole Ayodele, and Saurav Bisht. We used five machine learning and two deep learning algorithms to classify transactions as fraud vs. non-fraud. The dataset is simulated data from Kaggle, split into two files, Train and Test. We combined the two files to check the total number of records and the class balance before re-splitting into training and testing sets.

EDA & Data Preprocessing

Exploratory Data Analysis (EDA) is a major step in preparing the given dataset: we analyze the data to understand the key characteristics of its entities (columns and rows) using Pandas, NumPy, statistical methods, and data visualization packages. The outcomes of this phase are:

  1. Understanding and cleaning the given dataset.
  2. Understanding the relationships between the features (columns) and the target variable.
  3. Distinguishing essential variables from non-essential ones.
  4. Handling missing values and human error.
  5. Identifying outliers (a sketch follows the null check below).

#import the core libraries
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import the train dataset
dtrain = pd.read_csv('../dataset/fraudTrain.csv')

#import the test dataset
dtest = pd.read_csv('../dataset/fraudTest.csv')

#combine 2 datasets into 1
fullset = pd.concat([dtrain, dtest])

#delete the original data before merge to save memory
del dtrain, dtest

#see the structure of data
print(fullset.shape)
print(fullset.info())

#print sample of data(the first 5 rows)
fullset.head()
	(1852394, 23)
	Int64Index: 1852394 entries, 0 to 555718
	Data columns (total 23 columns):
	#   Column                 Dtype  
	---  ------                 -----  
	0   Unnamed: 0             int64  
	1   trans_date_trans_time  object 
	2   cc_num                 int64  
	3   merchant               object 
	4   category               object 
	5   amt                    float64
	6   first                  object 
	7   last                   object 
	8   gender                 object 
	9   street                 object 
	10  city                   object 
	11  state                  object 
	12  zip                    int64  
	13  lat                    float64
	14  long                   float64
	15  city_pop               int64  
	16  job                    object 
	17  dob                    object 
	18  trans_num              object 
	19  unix_time              int64  
	20  merch_lat              float64
	21  merch_long             float64
	22  is_fraud               int64  
	dtypes: float64(5), int64(6), object(12)
Unnamed: 0 trans_date_trans_time cc_num merchant category amt first last gender street ... lat long city_pop job dob trans_num unix_time merch_lat merch_long is_fraud
0 0 2019-01-01 00:00:18 2703186189652095 fraud_Rippin, Kub and Mann misc_net 4.97 Jennifer Banks F 561 Perry Cove ... 36.0788 -81.1781 3495 Psychologist, counselling 1988-03-09 0b242abb623afc578575680df30655b9 1325376018 36.011293 -82.048315 0
1 1 2019-01-01 00:00:44 630423337322 fraud_Heller, Gutmann and Zieme grocery_pos 107.23 Stephanie Gill F 43039 Riley Greens Suite 393 ... 48.8878 -118.2105 149 Special educational needs teacher 1978-06-21 1f76529f8574734946361c461b024d99 1325376044 49.159047 -118.186462 0
2 2 2019-01-01 00:00:51 38859492057661 fraud_Lind-Buckridge entertainment 220.11 Edward Sanchez M 594 White Dale Suite 530 ... 42.1808 -112.2620 4154 Nature conservation officer 1962-01-19 a1a22d70485983eac12b5b88dad1cf95 1325376051 43.150704 -112.154481 0
3 3 2019-01-01 00:01:16 3534093764340240 fraud_Kutch, Hermiston and Farrell gas_transport 45.00 Jeremy White M 9443 Cynthia Court Apt. 038 ... 46.2306 -112.1138 1939 Patent attorney 1967-01-12 6b849c168bdad6f867558c3793159a81 1325376076 47.034331 -112.561071 0
4 4 2019-01-01 00:03:06 375534208663984 fraud_Keeling-Crist misc_pos 41.96 Tyler Garcia M 408 Bradley Rest ... 38.4207 -79.4629 99 Dance movement psychotherapist 1986-03-28 a41d7549acf90789359a9aa5346dcb46 1325376186 38.674999 -78.632459 0
#convert dob to age
fullset['dob'] = pd.to_datetime(fullset.dob)
def from_dob_to_age(born):
	today = datetime.date.today()
	return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

fullset['Age']=fullset['dob'].apply(lambda x: from_dob_to_age(x))
Let's visualize the relationship between the target variable and Age with a boxplot of is_fraud vs. Age. We can see that the majority of fraud happened between ages 37 and 65.
fig= plt.figure(figsize=(10,5) )
fig.add_subplot(1,3,1)
ar_6=sns.boxplot(x=fullset["is_fraud"],y=fullset["Age"])
Then we will check whether there are any null values in the dataset.
fullset.isna().sum()
	Unnamed: 0               0
	trans_date_trans_time    0
	cc_num                   0
	merchant                 0
	category                 0
	amt                      0
	first                    0
	last                     0
	gender                   0
	street                   0
	city                     0
	state                    0
	zip                      0
	lat                      0
	long                     0
	city_pop                 0
	job                      0
	dob                      0
	trans_num                0
	unix_time                0
	merch_lat                0
	merch_long               0
	is_fraud                 0
	Age                      0
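The dataset has no missing values. The outlier check from the EDA outcome list is not shown in the notebook; a minimal IQR-based sketch on the transaction amount might look like this:
#flag transaction amounts outside 1.5*IQR as potential outliers (sketch)
q1, q3 = fullset['amt'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = fullset[(fullset['amt'] < q1 - 1.5*iqr) | (fullset['amt'] > q3 + 1.5*iqr)]
print('potential amt outliers:', len(outliers))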
Checking for unique values in the dataset
for col in fullset:
	uValue = np.unique(fullset[col])
	rValue = len(uValue)
	if rValue < 50:
		print('Unique values {} total {} --{}'.format(col, rValue, uValue))
	else:
		print('Unique Value {} -- {}'.format(col, rValue))
		Unique Value Unnamed: 0 -- 1296675
		Unique values trans_date_trans_time total 24 --[Period('2019-01', 'M') Period('2019-02', 'M') Period('2019-03', 'M')
		 Period('2019-04', 'M') Period('2019-05', 'M') Period('2019-06', 'M')
		 Period('2019-07', 'M') Period('2019-08', 'M') Period('2019-09', 'M')
		 Period('2019-10', 'M') Period('2019-11', 'M') Period('2019-12', 'M')
		 Period('2020-01', 'M') Period('2020-02', 'M') Period('2020-03', 'M')
		 Period('2020-04', 'M') Period('2020-05', 'M') Period('2020-06', 'M')
		 Period('2020-07', 'M') Period('2020-08', 'M') Period('2020-09', 'M')
		 Period('2020-10', 'M') Period('2020-11', 'M') Period('2020-12', 'M')]
		Unique Value cc_num -- 999
		Unique Value merchant -- 693
		Unique values category total 14 --['entertainment' 'food_dining' 'gas_transport' 'grocery_net' 'grocery_pos'
		 'health_fitness' 'home' 'kids_pets' 'misc_net' 'misc_pos' 'personal_care'
		 'shopping_net' 'shopping_pos' 'travel']
		Unique Value amt -- 60616
		Unique Value first -- 355
		Unique Value last -- 486
		Unique values gender total 2 --['F' 'M']
		Unique Value street -- 999
		Unique Value city -- 906
		Unique Value state -- 51
		Unique Value zip -- 985
		Unique Value lat -- 983
		Unique Value long -- 983
		Unique Value city_pop -- 891
		Unique Value job -- 497
		Unique Value dob -- 984
		Unique Value trans_num -- 1852394
		Unique Value unix_time -- 1819583
		Unique Value merch_lat -- 1754157
		Unique Value merch_long -- 1809753
		Unique values is_fraud total 2 --[0 1]
		Unique Value Age -- 80
We want to see the amount of fraud occurring from January 2019 to December 2020.
fraud = fullset.query('is_fraud == 1')  
fraud['trans_date_trans_time'].value_counts().sort_index().plot()
plt.title('Fraud growth')
It seems that fraud peaks during the holiday season. Next, we want to see which categories are most targeted by fraudsters.
sns.countplot(y=fraud.category, data=fraud, palette = 'Set3')
plt.title('Fraud Category')
plt.xlabel('Total')
plt.show()
From the graph above, we can see most fraud happens through online shopping (shopping_net) and grocery point-of-sale (grocery_pos). Next, we want to see the percentage of fraudulent vs. legitimate transactions.
#print the fraud vs non-fraud counts
print(fullset['is_fraud'].value_counts())

#visualize the imbalanced data using charts
fig= plt.figure(figsize=(10,5) )
fig.add_subplot(1,2,1)
explode = [0, 0.5]
#labels = ['Fraud', 'Non-Fraud']
a= fullset["is_fraud"].value_counts(normalize=True).plot.pie(explode=explode, autopct='%1.1f%%')
fig.add_subplot(1,2,2)
churnchart=sns.countplot(x=fullset["is_fraud"])
plt.tight_layout()
plt.show()
	0    1842743
	1       9651
The imbalance ratio is 99.48:0.52, which means our dataset is highly imbalanced: the vast majority of transactions are legitimate and very few are fraudulent. A machine learning model trained on this data may be biased toward the majority class, since 99.48% of samples belong to class 0. Because our dataset is large, we treated the class imbalance problem using the random under-sampling technique from imblearn.
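To make the accuracy trap concrete, a baseline that always predicts the majority class already scores about 99.48% while catching zero fraud. A minimal sketch (not from the original notebook; the feature column passed to fit is an arbitrary placeholder, since DummyClassifier ignores it):
from sklearn.dummy import DummyClassifier

#baseline that always predicts the majority class (non-fraud)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(fullset[['amt']], fullset['is_fraud'])
print('baseline accuracy: {:.2f}%'.format(baseline.score(fullset[['amt']], fullset['is_fraud'])*100))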
We want to check the distribution of our data using histograms.
p = fullset.hist(figsize = (15,15))
We will convert the non-numerical columns to numerical, then plot the correlation matrix to see the correlation between features.
#apply label encoder to the dataset and put the result in a new dataset
from sklearn.preprocessing import LabelEncoder
newset = fullset.apply(LabelEncoder().fit_transform)

#plot correlation matrix
fig, ax = plt.subplots(figsize=(25,10))  
sns.heatmap(newset.corr(), annot = True, ax=ax, cmap = 'Blues')
plt.title("Data Correlation")
plt.show()
We will drop is_fraud because it is our target variable (y), and drop variables that are highly correlated with others, such as merch_lat and merch_long (which track lat and long).
dropvar = ['is_fraud','Unnamed: 0', 'first', 'last', 'dob', 'trans_date_trans_time','trans_num', 'merch_lat', 'merch_long'] 
#create our independent variables (features)
x = newset.drop(dropvar, axis=1).copy()

#dependent variable (target)
y = newset['is_fraud'].copy()

Random Under Sampler

Undersampling is a technique to balance uneven datasets by keeping all of the data in the minority class and decreasing the size of the majority class. It is one of several techniques data scientists can use to extract more accurate information from originally imbalanced datasets.
Since our dataset is highly imbalanced, we keep all the data of the fraud class and decrease the samples from the non-fraud class.

#import library for random under sampler
import imblearn
from imblearn.under_sampling import RandomUnderSampler
import collections
from collections import Counter

#calling the method and fit it with x and y data
unSampler = RandomUnderSampler(random_state=42, replacement=True)
xund, yund = unSampler.fit_resample(x,y)

print('original dataset shape:', Counter(y))
print('resample dataset shape', Counter(yund))
	original dataset shape: Counter({0: 1842743, 1: 9651})
	resample dataset shape Counter({0: 9651, 1: 9651})
#visual the balanced data using bar chart
underfraud = pd.DataFrame(yund)
sns.countplot(x = 'is_fraud', data = underfraud)
plt.show()
After applying the Random Under Sampler, our target contains 19,302 rows (9,651 per class).
We will scale our data and perform PCA to reduce the dimensionality.
#scaling the data
from sklearn.preprocessing import MinMaxScaler
mmscale = MinMaxScaler()
#transform the values
X_scaled = mmscale.fit_transform(xund)

#perform PCA with 15 components
from sklearn.decomposition import PCA
pca_15 = PCA(n_components = 15, random_state=2020)
pca_15.fit(X_scaled)
X_pca_15 = pca_15.transform(X_scaled)
#plot the cumulative explained variance
plt.plot(np.cumsum(pca_15.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Explained variance')
plt.savefig('elbow_plot.png', dpi= 100)
The chart above shows that the curve flattens at around 14 components, so we take 14 as the best number of features.
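The train/test split below uses a matrix named x_pca; we assume it comes from refitting PCA with the 14 components suggested by the elbow. A minimal sketch:
#refit PCA with the 14 components suggested by the elbow plot
pca_14 = PCA(n_components = 14, random_state=2020)
x_pca = pca_14.fit_transform(X_scaled)
print(x_pca.shape)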
Functions for metrics evaluation
score= []

#import metrics utilities
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score
from sklearn.metrics import log_loss, confusion_matrix, roc_auc_score, roc_curve

#plot the confusion matrix
def cm(algo): #it needs the fitted model
    disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(ytest, algo.predict(xtest)),
                           display_labels=algo.classes_)
    disp.plot()
    plt.show()

#function to print the metrics score
def prints(cmatrix, acctest, acctrain, overfit, logtest, logtrain, precision1,
                  precision0, recall1, recall0, f1, roctest, roctrain):
    print("Confusion Matrix Accuracy Score = {:.2f}%\n".format(cmatrix))
    print("Accuracy Score: Training -> {:.2f}% Testing -> {:.2f}%\n".format(acctrain, acctest))
    print("Overfitting : {:.2f}%".format(overfit))
    print("Log Loss Training-> {} Testing -> {}\n".format(logtrain, logtest))
    print('Precision class 1: {:.2f}%\nPrecision class 0: {:.2f}%'.format(precision1, precision0))
    print('Recall class 1: {:.2f}%\nRecall class 0: {:.2f}%'.format(recall1, recall0))
    print('F1: {:.2f}%'.format(f1)) 
    print('ROC AUC Training-> {:.2f}% Testing-> {:.2f}%'.format(roctrain, roctest))

#function add metrics score to list
def insertlist(name, cmatrix, acctest, acctrain,overfit, logtest, logtrain, precision1,
					precision0, recall1, recall0, f1, roctest, roctrain):
	score.append([name, cmatrix, acctest, acctrain,overfit, logtest, logtrain, precision1,
					precision0, recall1, recall0, f1, roctest, roctrain])

# Plot Roc Curve
def auc_plot(algo):
	#create AUC curve
	test_prob = algo.predict_proba(xtest)[::,1]
	train_prob = algo.predict_proba(xtrain)[::,1]
	roctest = roc_auc_score(ytest, test_prob)
	roctrain = roc_auc_score(ytrain, train_prob)
	
	fpr_test, tpr_test, _ = roc_curve(ytest,  test_prob)
	fpr_train, tpr_train, _ = roc_curve(ytrain,  train_prob)
	plt.title("Area Under Curve")
	plt.plot(fpr_test,tpr_test,label="AUC Test="+str(roctest))
	plt.plot(fpr_train,tpr_train,label="AUC Train="+str(roctrain))
	plt.ylabel('True Positive Rate')
	plt.xlabel('False Positive Rate')
	plt.legend(loc=4)
	plt.grid(True)
	plt.show()		

#metrics function 
def scr(algo, name): #algo = model, name = string of the model name
	predtest = algo.predict(xtest)
	predtrain = algo.predict(xtrain)
	
	#confusion matrix percentage
	tn, fp, fn, tp = confusion_matrix(ytest, predtest).ravel()
	tst = ytest.count()
	cmatrix = ((tn + tp)/tst)*100    

	#accuracy score
	acctest = (accuracy_score(ytest, predtest))*100
	acctrain = (accuracy_score(ytrain, predtrain))*100
	overfit = acctrain - acctest

	#log loss
	logtest = log_loss(ytest,predtest)
	logtrain = log_loss(ytrain,predtrain)
			
	#classification report
	precision1 = (tp / (tp+fp))*100
	precision0 = (tn/(tn+fn))*100
	recall1 = (tp/(tp+fn))*100
	recall0 = (tn/(tn+fp))*100
	f1 = 2*(precision1 * recall1)/(precision1 + recall1)

	#roc auc score
	test_prob = algo.predict_proba(xtest)[::,1]
	train_prob = algo.predict_proba(xtrain)[::,1]
	roctest = (roc_auc_score(ytest, test_prob))*100
	roctrain = (roc_auc_score(ytrain, train_prob))*100

	insertlist(name, cmatrix, acctest, acctrain, overfit, logtest, logtrain, precision1,
				precision0, recall1, recall0, f1, roctest, roctrain)      
	#print metrics score
	return prints(cmatrix, acctest, acctrain, overfit, logtest, logtrain, precision1,
					precision0, recall1, recall0, f1, roctest, roctrain)
Split our dataset into training and testing sets.
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x_pca, yund, test_size=0.3, random_state=42)

Model 1 - Support Vector Machine

from sklearn.svm import SVC

#build SVM classification
SVM1 = SVC(random_state = 42, probability=True)
SVM1.fit(xtrain, ytrain)
scr(SVM1, 'SVM1')
cm(SVM1)
auc_plot(SVM1)

	Confusion Matrix Accuracy Score = 86.22%

	Accuracy Score: Training -> 86.94% Testing -> 86.22%

	Overfitting : 0.72%
	Log Loss Training-> 4.511955968344787 Testing -> 4.759453002690722

	Precision class 1: 97.25%
	Precision class 0: 79.36%
	Recall class 1: 74.55%
	Recall class 0: 97.89%
	F1: 84.40%
	ROC AUC Training-> 91.93% Testing-> 88.39%
From the confusion matrix, the accuracy score is 86.22%. Next, we will try to improve the prediction using cross validation to optimise the hyperparameters. Cross-validation systematically creates and evaluates multiple models on multiple subsets of the dataset.
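Before the grid search, a plain stratified 10-fold estimate of the untuned SVM can be obtained with cross_val_score; a minimal sketch (not part of the original notebook):
from sklearn.model_selection import StratifiedKFold, cross_val_score

#10-fold stratified CV accuracy of the untuned SVM
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(SVC(random_state=42), xtrain, ytrain, cv=kfold, scoring='accuracy')
print('CV accuracy: {:.2f}% +/- {:.2f}%'.format(cv_scores.mean()*100, cv_scores.std()*100))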
from sklearn.model_selection import StratifiedKFold, GridSearchCV

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
param_grid = [
	{
		'C': [1, 10, 15, 20, 25, 30], 
		'gamma': ['scale', 0.001, 0.01],
		'kernel': ['rbf','linear','sigmoid']
	},
]

#tuning the model using gridsearchCV and stratifiedkfold
optimal_param = GridSearchCV(
	SVC(), param_grid, cv=kfold, scoring='accuracy', verbose = 2)

#fitting the tuned model with training dataset
optimal_param.fit(xtrain, ytrain)

#print the best parameter
print(optimal_param.best_params_)
		
	{'C': 15, 'gamma': 'scale', 'kernel': 'rbf'}
#build the model with hyperparameter tuning
optimal_param = SVC(random_state = 42, probability=True, C=20, gamma='scale', kernel='rbf')
optimal_param.fit(xtrain,ytrain)
#print metrics score
scr(optimal_param, 'SVM_tuned')
#plot confusion matrix for the tuned model
cm(optimal_param)
#plot the ROC curve
auc_plot(optimal_param)
	Confusion Matrix Accuracy Score = 87.65%

	Accuracy Score: Training -> 93.07% Testing -> 87.65%

	Overfitting : 5.42%
	Log Loss Training-> 2.392745485940981 Testing -> 4.26443851713289

	Precision class 1: 93.15%
	Precision class 0: 83.39%
	Recall class 1: 81.28%
	Recall class 0: 94.02%
	F1: 86.82%
	ROC AUC Training-> 97.61% Testing-> 93.11%
Before tuning, the accuracy score was 86.22%.
After tuning, the accuracy score is 87.65%.
However, the model is slightly overfit after hyperparameter tuning.

Model 2 - Logistic Regression

#create logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
#fitting the model with training dataset
logreg.fit(xtrain, ytrain)
scr(logreg, 'Logreg')
cm(logreg)
auc_plot(logreg)
	Confusion Matrix Accuracy Score = 84.63%

	Accuracy Score: Training -> 85.06% Testing -> 84.63%

	Overfitting : 0.43%
	Log Loss Training-> 5.15872999693614 Testing -> 5.308178710927689

	Precision class 1: 92.04%
	Precision class 0: 79.44%
	Recall class 1: 75.83%
	Recall class 0: 93.44%
	F1: 83.15%
	ROC AUC Training-> 85.19% Testing-> 84.35%
The accuracy score is 84.63%, which is lower than SVM's. We will tune the model using grid-search cross-validation.
log_param = [
	{
		'C': [0.01, 0.1, 1.0, 10, 100], 
		'penalty': ['l2'],
		'solver': ['newton-cg', 'sag','saga'] 
	},
]

#tune the model using gridsearch and kfold
opt_log = GridSearchCV(LogisticRegression(), param_grid = log_param, scoring='accuracy', cv=kfold, verbose = 2)

#fit the tuned model with training dataset
opt_log.fit(xtrain, ytrain) 
print(opt_log.best_params_)
	{'C': 0.01, 'penalty': 'l2', 'solver': 'newton-cg'}
opt_log = LogisticRegression(random_state = 42, C=0.012, penalty = 'l2', solver = 'sag' )
opt_log.fit(xtrain, ytrain)
scr(opt_log, 'logreg_tuned')
cm(opt_log)
auc_plot(opt_log)
	Confusion Matrix Accuracy Score = 85.17%

	Accuracy Score: Training -> 85.54% Testing -> 85.17%

	Overfitting : 0.37%
	Log Loss Training-> 4.9951177071719135 Testing -> 5.123281655199142

	Precision class 1: 93.81%
	Precision class 0: 79.37%
	Recall class 1: 75.31%
	Recall class 0: 95.03%
	F1: 83.55%
	ROC AUC Training-> 84.62% Testing-> 83.81%
Before tuning, accuracy = 84.63%.
After tuning, accuracy = 85.17%.
Overfitting is reduced as well.

Model 3 - Random Forest

from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier()
#fit the model with training dataset
RF.fit(xtrain, ytrain)
scr(RF, 'RF')
cm(RF)
auc_plot(RF)
	Confusion Matrix Accuracy Score = 88.57%

	Accuracy Score: Training -> 100.00% Testing -> 88.57%

	Overfitting : 11.43%
	Log Loss Training-> 9.992007221626413e-16 Testing -> 3.948323732826286

	Precision class 1: 96.23%
	Precision class 0: 83.08%
	Recall class 1: 80.28%
	Recall class 0: 96.86%
	F1: 87.54%
	ROC AUC Training-> 100.00% Testing-> 93.87%
The accuracy score is 88.57%, which is higher than both SVM and Logistic Regression. However, the model is overfitting, as the training accuracy score is 100%.
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 50)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(1, 100, num = 10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10, 50, 100]
# Method of selecting samples for training each tree
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
				'max_features': max_features,
				'max_depth': max_depth,
				'min_samples_split': min_samples_split,
				'min_samples_leaf': min_samples_leaf,
				'bootstrap': bootstrap}

#implement parameter tuning using randomized search cv and kfold
from sklearn.model_selection import RandomizedSearchCV
opt_RF = RandomizedSearchCV(estimator = RF, param_distributions = random_grid, n_iter = 100, cv = kfold, verbose=2, random_state=42, n_jobs = -1)
#fit the model with training dataset
opt_RF.fit(xtrain, ytrain) 
print(opt_RF.best_params_) #best parameter tuning result
	{'n_estimators': 853, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 78, 'bootstrap': False}
#build the model with hyperparameter tuning
opt_RF = RandomForestClassifier(n_estimators= 600, min_samples_split= 5, min_samples_leaf= 2,max_features= 'auto',max_depth= 6)
opt_RF.fit(xtrain, ytrain)
scr(opt_RF, 'RF_tuned')
cm(opt_RF)
auc_plot(opt_RF)
	Confusion Matrix Accuracy Score = 85.74%

	Accuracy Score: Training -> 86.15% Testing -> 85.74%

	Overfitting : 0.42%
	Log Loss Training-> 4.782934449404927 Testing -> 4.926456581166755

	Precision class 1: 95.55%
	Precision class 0: 79.40%
	Recall class 1: 74.97%
	Recall class 0: 96.51%
	F1: 84.02%
	ROC AUC Training-> 90.47% Testing-> 87.91%

Model 4 - Decision Tree

#import the library
from sklearn.tree import DecisionTreeClassifier
#create the model
decTree = DecisionTreeClassifier()
#fit the model with training data
decTree.fit(xtrain, ytrain)
scr(decTree, 'decTree')
cm(decTree)
auc_plot(decTree)
	Confusion Matrix Accuracy Score = 82.90%

	Accuracy Score: Training -> 100.00% Testing -> 82.90%

	Overfitting : 17.10%
	Log Loss Training-> 9.992007221626413e-16 Testing -> 5.904643831404563

	Precision class 1: 82.68%
	Precision class 0: 83.13%
	Recall class 1: 83.25%
	Recall class 0: 82.56%
	F1: 82.97%
	ROC AUC Training-> 100.00% Testing-> 82.90%
#parameter variables
criterion = ['gini', 'entropy']
max_depth = [3,6,9]
min_samples_split = [3,6,9]
min_samples_leaf = [2,4,8]

#put the variables to dict
random_tree = {'criterion': criterion,
				'max_depth': max_depth,
				'min_samples_split': min_samples_split,
				'min_samples_leaf': min_samples_leaf}

#implement hyperparameter tuning using randomized search cv and kfold
opt_decTree = RandomizedSearchCV(DecisionTreeClassifier(), param_distributions = random_tree,
				cv = kfold, verbose = 2, random_state = 42, n_jobs = -1)
#fit the tuned model with training dataset
opt_decTree.fit(xtrain, ytrain) 
print(opt_decTree.best_params_)
#print the metrics score for the tuned model
scr(opt_decTree, 'decTree_tuned')
cm(opt_decTree)
auc_plot(opt_decTree)
	{'min_samples_split': 9, 'min_samples_leaf': 8, 'max_depth': 9, 'criterion': 'gini'}
	
	Confusion Matrix Accuracy Score = 85.81%

	Accuracy Score: Training -> 89.08% Testing -> 85.81%
	
	Overfitting : 3.27%
	Log Loss Training-> 3.773181988317355 Testing -> 4.90261131447784
	
	Precision class 1: 92.43%
	Precision class 0: 80.97%
	Recall class 1: 78.00%
	Recall class 0: 93.61%
	F1: 84.61%
	ROC AUC Training-> 94.47% Testing-> 89.99%

Model 5 - K-Nearest Neighbor

from sklearn.neighbors import KNeighborsClassifier

#build the model
neigh = KNeighborsClassifier(n_neighbors=3)
#fit the model with training dataset
neigh.fit(xtrain, ytrain)
scr(neigh, 'KNN')
cm(neigh)
auc_plot(neigh)
	Confusion Matrix Accuracy Score = 82.01%

	Accuracy Score: Training -> 91.15% Testing -> 82.01%

	Overfitting : 9.14%
	Log Loss Training-> 3.0574194329192323 Testing -> 6.214775759201414

	Precision class 1: 83.61%
	Precision class 0: 80.55%
	Recall class 1: 79.63%
	Recall class 0: 84.39%
	F1: 81.57%
	ROC AUC Training-> 97.18% Testing-> 88.02%
The accuracy score is 82.01%, the lowest of the algorithms so far. This model also has an overfitting problem.
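The comment "from the plot above" in the tuning grid below refers to an error-rate plot that is not reproduced here; a minimal sketch of how such a plot could be generated:
#error rate vs n_neighbors, used to shortlist candidate values (sketch)
errors = []
ks = range(1, 100, 2)
for k in ks:
	knn_k = KNeighborsClassifier(n_neighbors=k).fit(xtrain, ytrain)
	errors.append(1 - knn_k.score(xtest, ytest))
plt.plot(ks, errors)
plt.xlabel('n_neighbors')
plt.ylabel('error rate')
plt.show()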
tuning_params = {
	'n_neighbors' : [71,75,77,83], #from the plot above
	"leaf_size":[5,10,20,30],
	"p":[1,2]
}

#implement hyperparameter tuning to the model
opt_knn = GridSearchCV(neigh, param_grid = tuning_params, cv = kfold, verbose = 1, n_jobs = -1)
#fit the model with training dataset
opt_knn.fit(xtrain, ytrain) 
print(opt_knn.best_params_)
scr(opt_knn, 'knn_tuned')
cm(opt_knn)
auc_plot(opt_knn)
	{'leaf_size': 5, 'n_neighbors': 83, 'p': 1}

	Confusion Matrix Accuracy Score = 85.24%

	Accuracy Score: Training -> 85.66% Testing -> 85.24%
	
	Overfitting : 0.43%
	Log Loss Training-> 4.951649723316238 Testing -> 5.099414020217411
	
	Precision class 1: 96.96%
	Precision class 0: 78.19%
	Recall class 1: 72.76%
	Recall class 0: 97.72%
	F1: 83.13%
	ROC AUC Training-> 90.07% Testing-> 88.92%

Model 6 - Artificial Neural Network

We want to split the data 60:20:20 into train:validation:test sets. In the first step, we split the data into a training set and a remaining set. Since we want the validation and test sets to be equal in size (20% each of the overall data), we split the remainder in half (test_size=0.5).
Xtrain, xrem, Ytrain, yrem = train_test_split(xtrain,ytrain, train_size=0.6)
xvalid, xtest, yvalid, ytest = train_test_split(xrem,yrem, test_size=0.5)	
#import tensorflow/keras
import tensorflow as tf
from tensorflow import keras

#build the model
ann=keras.Sequential([keras.layers.Dense(20,input_shape=(14,),activation='relu'),
                       keras.layers.Dense(1,activation='sigmoid'),])
ann.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy', tf.keras.metrics.AUC()])
#train the model
history=ann.fit(Xtrain,Ytrain,epochs=30,validation_data=(xvalid,yvalid))
plot_learningCurve(history,30)
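plot_learningCurve is a helper defined elsewhere in our notebook; a minimal sketch, assuming it plots the accuracy and loss curves from the Keras History object:
def plot_learningCurve(history, epochs):
	#plot training & validation accuracy per epoch
	epoch_range = range(1, epochs+1)
	plt.plot(epoch_range, history.history['accuracy'])
	plt.plot(epoch_range, history.history['val_accuracy'])
	plt.title('Model accuracy')
	plt.ylabel('Accuracy')
	plt.xlabel('Epoch')
	plt.legend(['Train', 'Val'], loc='upper left')
	plt.show()
	#plot training & validation loss per epoch
	plt.plot(epoch_range, history.history['loss'])
	plt.plot(epoch_range, history.history['val_loss'])
	plt.title('Model loss')
	plt.ylabel('Loss')
	plt.xlabel('Epoch')
	plt.legend(['Train', 'Val'], loc='upper left')
	plt.show()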
	
testeval1 = ann.evaluate(xtest, ytest)
traineval1 = ann.evaluate(xvalid, yvalid)
yprediction=ann.predict(xtest)
ypred=[]
for element in yprediction:
    if element>0.5:
        ypred.append(1)
    else:
        ypred.append(0)
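#note: an equivalent NumPy one-liner for the thresholding loop above is
#ypred = (yprediction > 0.5).astype(int).ravel()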

#plot confusion matrix
cmt =tf.math.confusion_matrix(labels=ytest,predictions=ypred)
plt.figure(figsize=(7,5))
sns.heatmap(cmt,annot=True,fmt='d')
plt.xlabel('predicted')
plt.ylabel('actual')
#print the metrics score
calc(ann, 'ann', testeval1, traineval1, cmt) 
	
	Confusion Matrix Accuracy Score = 86.24%

	Accuracy Score: Training -> 85.09% Testing -> 86.24%
	
	Overfitting : -1.15%
	Log Loss Training-> 0.35751211643218994 Testing -> 0.3230932652950287
	
	Precision class 1: 96.95%
	Precision class 0: 79.77%
	Recall class 1: 74.30%
	Recall class 0: 97.75%
	F1: 84.13%
	ROC AUC Training-> 90.58% Testing-> 91.87%
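The calc helper used above (and for the remaining Keras models) is defined elsewhere in our notebook; a minimal sketch, assuming it derives the same metrics as scr from the evaluate() outputs ([loss, accuracy, auc]) and the confusion matrix, then records and prints them:
def calc(algo, name, testeval, traineval, cmatx): #algo kept for signature parity with scr
	#unpack keras evaluate() results: [loss, accuracy, auc]
	logtest, acctest, roctest = testeval[0], testeval[1]*100, testeval[2]*100
	logtrain, acctrain, roctrain = traineval[0], traineval[1]*100, traineval[2]*100
	overfit = acctrain - acctest
	#derive precision/recall from the confusion matrix tensor
	tn, fp, fn, tp = np.array(cmatx).ravel()
	cmatrix = ((tn + tp)/(tn + fp + fn + tp))*100
	precision1 = (tp/(tp+fp))*100
	precision0 = (tn/(tn+fn))*100
	recall1 = (tp/(tp+fn))*100
	recall0 = (tn/(tn+fp))*100
	f1 = 2*(precision1*recall1)/(precision1+recall1)
	insertlist(name, cmatrix, acctest, acctrain, overfit, logtest, logtrain, precision1,
				precision0, recall1, recall0, f1, roctest, roctrain)
	return prints(cmatrix, acctest, acctrain, overfit, logtest, logtrain, precision1,
				precision0, recall1, recall0, f1, roctest, roctrain)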
#build the tuned model (create_my_model1 is our Keras model-builder helper, defined elsewhere in the notebook)
opt_ann=create_my_model1()
#train the model
history=opt_ann.fit(Xtrain,Ytrain,epochs=50,batch_size = 15,validation_data=(xvalid,yvalid))
plot_learningCurve(history,50)
#evaluate the tuned model and threshold its predictions
testeval = opt_ann.evaluate(xtest, ytest)
traineval = opt_ann.evaluate(xvalid, yvalid)
ypred1 = [1 if element > 0.5 else 0 for element in opt_ann.predict(xtest)]
#plot confusion matrix
cm1 =tf.math.confusion_matrix(labels=ytest,predictions=ypred1)
plt.figure(figsize=(7,5))
sns.heatmap(cm1,annot=True,fmt='d')
plt.xlabel('predicted')
plt.ylabel('actual')
#print the metrics score
calc(opt_ann, 'ann_tuned', testeval, traineval, cm1)
	Confusion Matrix Accuracy Score = 86.72%

	Accuracy Score: Training -> 85.09% Testing -> 86.24%

	Overfitting : -1.15%
	Log Loss Training-> 0.35751211643218994 Testing -> 0.3230932652950287

	Precision class 1: 96.36%
	Precision class 0: 80.65%
	Recall class 1: 75.81%
	Recall class 0: 97.24%
	F1: 84.86%
	ROC AUC Training-> 90.58% Testing-> 91.87%

Model 7 - Convolutional Neural Networks
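
Conv1D expects 3-D input of shape (samples, timesteps, channels), so we assume the feature matrices were reshaped before this section; a minimal sketch:
#reshape (samples, features) -> (samples, features, 1) for Conv1D (assumed step)
Xtrain = Xtrain.reshape(Xtrain.shape[0], Xtrain.shape[1], 1)
xvalid = xvalid.reshape(xvalid.shape[0], xvalid.shape[1], 1)
xtest = xtest.reshape(xtest.shape[0], xtest.shape[1], 1)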

#import the layers for the CNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, BatchNormalization, Dropout, Flatten, Dense, MaxPool1D
from tensorflow.keras.optimizers import Adam

epochs=20
cnn=Sequential()
cnn.add(Conv1D(32,2, activation='relu',input_shape=Xtrain[0].shape))
cnn.add(BatchNormalization())
cnn.add(Dropout(0.2))

cnn.add(Conv1D(64,2, activation='relu'))
cnn.add(BatchNormalization())
cnn.add(Dropout(0.5))

cnn.add(Flatten())
cnn.add(Dense(64,activation='relu'))
cnn.add(Dropout(0.5))

cnn.add(Dense(1,activation='sigmoid'))
cnn.compile(optimizer=Adam(learning_rate=0.0001),loss='binary_crossentropy',metrics=['accuracy',
	tf.keras.metrics.AUC()])

history=cnn.fit(Xtrain,Ytrain,epochs=epochs,validation_data=(xvalid,yvalid))
plot_learningCurve(history,epochs)
cntesteval = cnn.evaluate(xtest, ytest)
cntraineval = cnn.evaluate(Xtrain,Ytrain)
cnypredict=cnn.predict(xtest)
ypred2=[]
for element in cnypredict:
    if element>0.5:
        ypred2.append(1)
    else:
        ypred2.append(0)
	
cm2 =tf.math.confusion_matrix(labels=ytest,predictions=ypred2)
plt.figure(figsize=(7,5))
sns.heatmap(cm2,annot=True,fmt='d')
plt.xlabel('predicted')
plt.ylabel('actual')

calc(cnn, 'cnn', cntesteval, cntraineval, cm2)
	Confusion Matrix Accuracy Score = 86.68%

	Accuracy Score: Training -> 86.53% Testing -> 86.68%

	Overfitting : -0.15%
	Log Loss Training-> 0.35620877146720886 Testing -> 0.3592900037765503

	Precision class 1: 96.99%
	Precision class 0: 80.35%
	Recall class 1: 75.21%
	Recall class 0: 97.75%
	F1: 84.72%
	ROC AUC Training-> 90.13% Testing-> 89.15%
epochs=100
cnnmax=Sequential()
cnnmax.add(Conv1D(32,2, activation='relu',input_shape=Xtrain[0].shape))
cnnmax.add(BatchNormalization())
cnnmax.add(MaxPool1D(2))
cnnmax.add(Dropout(0.2))

cnnmax.add(Conv1D(64,2, activation='relu'))
cnnmax.add(BatchNormalization())
cnnmax.add(MaxPool1D(2))
cnnmax.add(Dropout(0.5))

cnnmax.add(Flatten())
cnnmax.add(Dense(64,activation='relu'))
cnnmax.add(Dropout(0.5))

cnnmax.add(Dense(1,activation='sigmoid'))

cnnmax.compile(optimizer=Adam(learning_rate=0.0001),loss='binary_crossentropy',metrics=['accuracy',
	tf.keras.metrics.AUC()])
history2=cnnmax.fit(Xtrain,Ytrain,epochs=epochs,validation_data=(xvalid,yvalid))

plot_learningCurve(history2,epochs)

#evaluate the tuned model and threshold its predictions
maxtesteval = cnnmax.evaluate(xtest, ytest)
maxtraineval = cnnmax.evaluate(Xtrain, Ytrain)
ypred3 = [1 if element > 0.5 else 0 for element in cnnmax.predict(xtest)]

#plot confusion matrix
cm3 =tf.math.confusion_matrix(labels=ytest,predictions=ypred3)
plt.figure(figsize=(7,5))
sns.heatmap(cm3,annot=True,fmt='d')
plt.xlabel('predicted')
plt.ylabel('actual')

#print the metrics score
calc(cnnmax, 'cnn_tuned', maxtesteval, maxtraineval, cm3)
	Confusion Matrix Accuracy Score = 77.28%

	Accuracy Score: Training -> 78.63% Testing -> 77.28%

	Overfitting : 1.35%
	Log Loss Training-> 0.46808817982673645 Testing -> 0.48547038435935974

	Precision class 1: 89.22%
	Precision class 0: 71.24%
	Recall class 1: 61.12%
	Recall class 0: 92.88%
	F1: 72.54%
	ROC AUC Training-> 85.53% Testing-> 83.94%

Conclusion

scoring = pd.DataFrame(score, columns = ['algo','c_matrix','acc_test','acc_train','overfit', 'loss_test',
	'loss_train', 'prec1', 'prec0', 'recall1','recall0', 'F1', 'roctest', 'roctrain'])
algo c_matrix acc_test acc_train overfit loss_test loss_train prec1 prec0 recall1 recall0 F1 roctest roctrain
0 SVM1 86.219997 86.219997 86.936570 0.716574 4.759453 4.511956e+00 97.252252 79.361523 74.551105 97.892919 84.401876 88.385672 91.927927
1 SVM_tuned 87.653255 87.653255 93.072311 5.419056 4.264439 2.392745e+00 93.153937 83.394608 81.284530 94.024180 86.815416 93.105302 97.606863
2 Logreg 84.631324 84.631324 85.064022 0.432697 5.308179 5.158730e+00 92.036882 79.441997 75.828729 93.436960 83.150322 84.346487 85.189161
3 logreg_tuned 85.166638 85.166638 85.537710 0.371072 5.123282 4.995118e+00 93.806452 79.371033 75.310773 95.025907 83.547213 83.811320 84.624693
4 RF 88.568468 88.568468 100.000000 11.431532 3.948324 9.992007e-16 96.233444 83.081481 80.283149 96.856649 87.537651 93.873451 100.000000
5 RF_tuned 88.982905 88.982905 100.000000 11.017095 3.805185 9.992007e-16 95.597738 84.042232 81.733425 96.234888 88.123604 94.519837 100.000000
6 RF_tuned 85.736488 85.736488 86.152024 0.415537 4.926457 4.782934e+00 95.554577 79.397556 74.965470 96.511226 84.017028 87.914007 90.470434
7 decTree 82.904507 82.904507 100.000000 17.095493 5.904644 9.992007e-16 82.681756 83.130435 83.252762 82.556131 82.966277 82.904447 100.000000
8 decTree_tuned 85.805560 85.805560 89.075568 3.270008 4.902611 3.773182e+00 92.430442 80.968031 78.004144 93.609672 84.606742 89.989355 94.472079
9 decTree_tuned 85.391124 85.391124 86.129820 0.738696 5.045744 4.790606e+00 94.759825 79.263068 74.930939 95.854922 83.686849 88.834197 90.058876
10 KNN 82.006562 82.006562 91.147954 9.141392 6.214776 3.057419e+00 83.611313 80.547313 79.627072 84.386874 81.570570 88.016352 97.184685
11 knn_tuned 85.287515 85.287515 85.663533 0.376018 5.081522 4.951650e+00 96.794872 78.319933 72.997238 97.582038 83.228346 89.064966 90.235731
12 knn_tuned 85.235711 85.235711 85.663533 0.427823 5.099414 4.951650e+00 96.962724 78.192371 72.755525 97.720207 83.132768 88.920260 90.068988
13 ann 86.237514 86.237514 85.085124 -1.152390 0.323093 3.575121e-01 96.951819 79.774614 74.302939 97.747093 84.129693 91.869640 90.576106
14 ann_tuned 86.718461 86.237514 85.085124 -1.152390 0.323093 3.575121e-01 96.360153 80.650995 75.810098 97.238372 84.858709 91.869640 90.576106
15 cnn 86.681465 86.681467 86.528498 -0.152969 0.359290 3.562088e-01 96.987366 80.346476 75.207234 97.747093 84.719864 89.148206 90.132415
16 cnn_tuned 77.284499 77.284497 78.633112 1.348615 0.485470 4.680882e-01 89.218922 71.237458 61.115298 92.877907 72.540250 83.936465 85.533202
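To shortlist the least-overfit candidates discussed below, the scoring table can be sorted on the overfit column; a quick sketch:
#rank models by the train-test accuracy gap
print(scoring.sort_values('overfit')[['algo', 'overfit', 'acc_test', 'F1']])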
Seven algorithms from machine learning and deep learning were evaluated to determine the most predictive classifier for credit card fraud detection. The project combined two datasets into one, with an extreme class-imbalance problem that we addressed using Imblearn's Random Under Sampling to balance the classes. Some models have a high accuracy score; however, overfitting cannot be avoided. For example, Decision Tree and Random Forest reach 100% accuracy in the training phase, and their overfitting gap is more than 10 percentage points.

Hyperparameter tuning, k-fold cross-validation, and validation curves were applied to cope with the overfitting problem and to improve accuracy. As a result, the overfitting gaps of Random Forest and Decision Tree dropped by more than 10 percentage points after changing some of the parameters, and the accuracy score of the Decision Tree also increased from 82.90% to 85.39%.

For future use, we selected the three algorithms with the lowest percentage of overfitting, as they should generalise well to new data. A model that generalises well can reliably perform the classification or prediction task it was intended for. Consequently, we picked Logistic Regression, Decision Tree, and CNN as our final algorithms for the credit card fraud detection problem.