Government bond buyer prediction

This analysis aims to predict how many government bonds the company has to order based on user behaviour. The dataset is sourced from an Indonesian fintech startup and covers August to October 2021.

User information

This dataset contains:
  • user_id - User's identity
  • registration_import_datetime - Date the user registered
  • user_gender - Gender of the user
  • user_age - Age of the user
  • user_occupation - Occupation of the user
  • user_income_range - Income range of the user
  • referral_code_used - Whether the user used a referral code during registration
  • user_income_source - The user's source of income
user_id registration_import_datetime user_gender user_age user_occupation user_income_range referral_code_used user_income_source
0 162882 2021-09-17 Female 51 Swasta > Rp 500 Juta - 1 Miliar NaN Gaji
1 3485491 2021-10-09 Female 55 Others > Rp 50 Juta - 100 Juta NaN Gaji
2 1071649 2021-10-08 Male 50 Swasta Rp 10 Juta - 50 Juta NaN Gaji
3 3816789 2021-08-12 Female 53 IRT > Rp 50 Juta - 100 Juta NaN Gaji
4 3802293 2021-08-15 Female 47 PNS > Rp 500 Juta - 1 Miliar used referral Gaji

Users' daily transactions

This dataset contains:
  • user_id - ID of the user, same as user_id in dataset 1
  • date - Transaction date
  • Saham_AUM - AUM (Assets Under Management) of equity mutual fund held by client to date
  • Saham_invested_amount - The total price paid by client to buy equity mutual fund to date.
  • Saham_transaction_amount - Total value of transaction to buy (if positive) or sell (if negative) equity mutual fund on the date.
  • Pasar_Uang_AUM - AUM of money market mutual fund held by client to date
  • Pasar_Uang_invested_amount - The total price paid by client to buy money market mutual fund to date
  • Pasar_Uang_transaction_amount - Total value of transaction to buy (if positive) or sell (if negative) money market mutual fund on the date.
  • Pendapatan_Tetap_AUM - AUM of fixed income mutual fund held by client to date
  • Pendapatan_Tetap_invested_amount - The total price paid by client to buy fixed income mutual fund to date
  • Pendapatan_Tetap_transaction_amount - Total value of transaction to buy (if positive) or sell (if negative) fixed income mutual fund on the date.
  • Campuran_AUM - AUM of mixed mutual fund held by client to date
  • Campuran_invested_amount - The total price paid by client to buy mixed mutual fund to date
  • Campuran_transaction_amount - Total value of the transaction to buy (if positive) or sell (if negative) mixed mutual fund on the date.
user_id trans_date Saham_AUM Saham_invested_amount Saham_transaction_amount Pasar_Uang_AUM Pasar_Uang_invested_amount Pasar_Uang_transaction_amount Pendapatan_Tetap_AUM Pendapatan_Tetap_invested_amount Pendapatan_Tetap_transaction_amount Campuran_AUM Campuran_invested_amount Campuran_transaction_amount
    0 50701 2021-08-30 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10132277.0 10000000.0 0.0
    1 50701 2021-08-31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10206945.0 10000000.0 0.0
    2 50701 2021-09-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 9956556.0 10000000.0 0.0
    3 50701 2021-09-02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 9914858.0 10000000.0 0.0
    4 50701 2021-09-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10016360.0 10000000.0 0.0

Bond purchase history

This dataset contains:
  • user_id - ID of the user, same as dataset 1 and 2
  • flag_order_bond - 1 if the user ordered bonds, 0 otherwise
  • bond_units_ordered - The total number of bond units the user ordered
user_id flag_order_bond bond_units_ordered
0 50701 1 34
1 50961 1 99
2 51883 0 0
3 53759 0 0
4 54759 1 92
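
The code in the following sections works with three DataFrames, user, trans, and result, plus the usual pandas, NumPy, matplotlib, and seaborn imports, none of which are shown explicitly. A minimal loading sketch, assuming CSV sources (the file names are hypothetical); the period conversion mirrors the period[D] dtype visible in the info() output below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical file names -- the real sources are not shown
user = pd.read_csv('users.csv')           # dataset 1: user information
trans = pd.read_csv('transactions.csv')   # dataset 2: daily transactions
result = pd.read_csv('bond_orders.csv')   # dataset 3: bond purchase history

# registration_import_datetime is treated as a daily period in the exploration below
user['registration_import_datetime'] = pd.to_datetime(user['registration_import_datetime']).dt.to_period('D')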

Users

We want to see how much data we have and its datatypes, and check whether there are any duplicates, null values, or other problems that need to be taken care of.
print(f'Total Row: {user.shape[0]} , Column: {user.shape[1]}')
user.info()
	Total Row: 14712 , Column: 8
			RangeIndex: 14712 entries, 0 to 14711
			Data columns (total 8 columns):
			#   Column                        Non-Null Count  Dtype    
			---  ------                        --------------  -----    
			0   user_id                       14712 non-null  int64    
			1   registration_import_datetime  14712 non-null  period[D]
			2   user_gender                   14712 non-null  object   
			3   user_age                      14712 non-null  int64    
			4   user_occupation               14712 non-null  object   
			5   user_income_range             14712 non-null  object   
			6   referral_code_used            5604 non-null   object   
			7   user_income_source            14712 non-null  object   
			dtypes: int64(2), object(5), period[D](1)
Dataset 1 has 14712 rows and 8 columns. The referral_code_used column has only 5604 non-null rows. The next step is to double-check the null values in the dataset and look for duplicate user_id values, since user_id is our primary key.
is_duplicate_user = user.user_id.duplicated().sum()
total_user = user.user_id.count()
print(f'Total Duplicate User_ID : {is_duplicate_user}')
print(f'Total User : {total_user}')
user.isnull().sum()
	Total Duplicate User_ID : 0
			Total User : 14712
			user_id                            0
			registration_import_datetime       0
			user_gender                        0
			user_age                           0
			user_occupation                    0
			user_income_range                  0
			referral_code_used              9108
			user_income_source                 0
			dtype: int64
The result above confirms that we have 9108 null values in the referral_code_used column and, luckily, no duplicate user_id values, so we can move on to deciding what to do with the nulls. We will check the unique values of each column in the dataset to help us decide what to replace the null values with.
for x in user:
    uniqueValues = user[x].unique()
    n_values = len(uniqueValues)
    if n_values <= 50:
        print(f'Column: {x} -- total unique value: ({n_values}) {uniqueValues}')
	Column: user_gender -- total unique value: (2) ['Female' 'Male']
			Column: user_occupation -- total unique value: (9) ['Swasta' 'Others' 'IRT' 'PNS' 'Pengusaha' 'Pensiunan' 'TNI/Polisi' 'Guru''Pelajar']
			Column: user_income_range -- total unique value: (6) ['> Rp 500 Juta - 1 Miliar' '> Rp 50 Juta - 100 Juta' 'Rp 10 Juta - 50 Juta' '< 10 Juta' '> Rp 100 Juta - 500 Juta' '> Rp 1 Miliar']
			Column: referral_code_used -- total unique value: (2) [nan 'used referral']
			Column: user_income_source -- total unique value: (10) ['Gaji' 'Keuntungan Bisnis' 'Lainnya' 'Dari Orang Tua / Anak' 'Undian' 'Tabungan' 'Warisan' 'Hasil Investasi' 'Dari Suami / istri' 'Bunga Simpanan']
			
By checking the unique values, we can see the categories in each column. We notice that referral_code_used has only two values, NaN and 'used referral'. Rather than deleting the rows, we will replace the null values with the string 'None'.
user.referral_code_used.fillna('None', inplace=True)
user.referral_code_used.unique()
user.isnull().sum()
	array(['None', 'used referral'], dtype=object)
			user_id                         0
			registration_import_datetime    0
			user_gender                     0
			user_age                        0
			user_occupation                 0
			user_income_range               0
			referral_code_used              0
			user_income_source              0
			dtype: int64
			
We successfully converted the null values to 'None'. We will then start our analysis by checking user registration growth between August and October 2021, visualising registrations both daily and monthly.
user['registration_import_datetime'].value_counts().sort_index().plot()
plt.title('User Join Growth by Day')
plt.show()

user['registration_month'] = user.registration_import_datetime.dt.to_timestamp('m').copy()
user.registration_month.value_counts().sort_index().plot()
plt.title('User Join Growth by Month')
plt.show()
The daily registration chart shows that October contains both the lowest and the highest daily registration counts, while on a monthly basis September has the highest number of registrations, with around 5200 users signing up. Next, we want to compare users by gender.
gender = user[['user_id', 'user_gender']].groupby('user_gender').count()
plt.pie(gender.user_id, labels=gender.index, startangle=90, shadow=True, explode=(0.1, 0.1), autopct='%1.2f%%')
plt.title('Users\' Gender Distribution')
plt.show()
The gender distribution chart shows that around 61.02% of our users are male and the rest are female. Next we will visualise the users' income source and income range in more detail.
fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(aspect="equal"))
income_source = user[['user_id', 'user_income_source']].groupby('user_income_source').count()

def func(pct, allvals):
    absolute = int(np.round(pct/100.*np.sum(allvals)))
    return "{:.1f}%\n({:d} users)".format(pct, absolute)

wedges, texts, autotexts = ax.pie(income_source.user_id, autopct=lambda pct: func(pct, income_source.user_id),
    textprops=dict(color="black"), explode=(1,0,0,0,0,0,0,0,0,0))
ax.legend(wedges, income_source.index,
    title="Income Source",
    loc="upper left",
    bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(autotexts, size=8, weight="bold")
ax.set_title("User's Income Source")
plt.show()
Reading the chart counterclockwise from the exploded slice, the smallest income source is Bunga Simpanan (savings interest) at 0.2%, while the majority of users' income comes from Gaji (salary) at 62.1%, equal to 9131 users. Next we will look at the users' income range.
fig, ax = plt.subplots(figsize=(6, 5))
ax = sns.countplot(y=user.user_income_range, data=user)

ax.set_title('User Income Range')
for x in ax.containers:
    ax.bar_label(x, label_type='edge')

plt.show()
The user income range chart shows that the majority of users earn less than 10 Million Rupiah (approx. 630 USD), with 6275 users. Nevertheless, there are 27 users in the highest income range, above 1 Billion Rupiah (approx. 63000 USD), who are potential big investors. We will also combine income source and income range to see the breakdown.
fig, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(y=user.user_income_source, hue=user.user_income_range, data=user)
ax.set_title('Income source and range')
for x in ax.containers:
    ax.bar_label(x, label_type='edge')
ax.legend(loc='best', fontsize=8)
plt.show()
The income range and source chart shows that 2835 of our salaried users earn less than 10 Million Rupiah, and the largest salary group is between 10 Million and 50 Million Rupiah, with 3483 users. It is also noticeable that 10 users report a salary of more than 1 Billion Rupiah. Next we will visualise the distribution of users' occupations.
fig, ax = plt.subplots(figsize=(10, 5))
ax = sns.countplot(x=user.user_occupation)
for x in ax.containers:
    ax.bar_label(x, label_type='edge')
ax.set_title('User occupation')
plt.show()
The occupation chart shows that the majority of our users are Pelajar (students), with a total of 7887 users, which makes sense given that the most common income range is less than 10 Million Rupiah. We will combine income range and occupation to see more detail.
fig, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(y=user.user_occupation, hue=user.user_income_range, data=user)
ax.set_title('Income range and occupation')
for x in ax.containers:
    ax.bar_label(x, label_type='edge')
ax.legend(loc='right', fontsize=8)
plt.show()
Combining income range and occupation reveals something interesting: 17 students report an income range of more than 500 Million Rupiah. We will investigate these 17 students a little further.
pelajar_age = user.query('user_occupation == "Pelajar" and (user_income_range == "> Rp 500 Juta - 1 Miliar" or user_income_range == "> Rp 1 Miliar")')
pelajar_age
user_id registration_import_datetime user_gender user_age user_occupation user_income_range referral_code_used user_income_source registration_month
619 4228712 2021-10-05 Male 17 Pelajar > Rp 1 Miliar used referral Keuntungan Bisnis 2021-10-31
899 3795261 2021-08-08 Male 18 Pelajar > Rp 1 Miliar used referral Lainnya 2021-08-31
1394 4376258 2021-10-19 Male 18 Pelajar > Rp 500 Juta - 1 Miliar used referral Gaji 2021-10-31
2282 3818578 2021-08-16 Male 19 Pelajar > Rp 1 Miliar None Keuntungan Bisnis 2021-08-31
2609 3742646 2021-08-09 Female 20 Pelajar > Rp 1 Miliar None Keuntungan Bisnis 2021-08-31
3999 3749935 2021-08-04 Female 21 Pelajar > Rp 1 Miliar None Keuntungan Bisnis 2021-08-31
4076 4245492 2021-10-07 Female 21 Pelajar > Rp 500 Juta - 1 Miliar None Keuntungan Bisnis 2021-10-31
4569 4355874 2021-10-17 Male 21 Pelajar > Rp 1 Miliar used referral Keuntungan Bisnis 2021-10-31
6135 4239037 2021-10-06 Female 23 Pelajar > Rp 500 Juta - 1 Miliar used referral Gaji 2021-10-31
8322 4371352 2021-10-19 Male 25 Pelajar > Rp 500 Juta - 1 Miliar used referral Gaji 2021-10-31
8871 3959126 2021-09-02 Female 26 Pelajar > Rp 500 Juta - 1 Miliar used referral Keuntungan Bisnis 2021-09-30
9063 4342801 2021-10-17 Male 26 Pelajar > Rp 500 Juta - 1 Miliar None Gaji 2021-10-31
9725 4270298 2021-10-09 Female 27 Pelajar > Rp 500 Juta - 1 Miliar None Keuntungan Bisnis 2021-10-31
10101 4134537 2021-09-24 Male 28 Pelajar > Rp 500 Juta - 1 Miliar None Keuntungan Bisnis 2021-09-30
10925 4338825 2021-10-15 Male 29 Pelajar > Rp 1 Miliar used referral Lainnya 2021-10-31
11617 4164878 2021-10-02 Male 31 Pelajar > Rp 500 Juta - 1 Miliar used referral Keuntungan Bisnis 2021-10-31
11950 3858112 2021-08-18 Female 32 Pelajar > Rp 500 Juta - 1 Miliar None Gaji 2021-08-31
It is interesting that a few users aged under 20 are registered as students with an income range above 500 Million Rupiah sourced from business profit. It is possible the users provided incorrect information; for now we will leave it as is and note it in case adjustments are needed before building the model. The next thing to check is what percentage of users registered using a referral code, which may also help decide what type of campaign could increase registrations.
referral_code = user[['user_id', 'referral_code_used']].groupby('referral_code_used').count()
plt.pie(referral_code.user_id, labels=referral_code.index, startangle=90, shadow=True, explode=(0.1, 0.1), autopct='%1.2f%%')
plt.title('Referral Code Comparison')
plt.show()
The referral code comparison shows that only 38.09% of our users used a referral code during registration. The company could announce, on its website or in the app, a bonus credit for users who refer friends. This would increase new registrations and transactions, and it is also a cheaper form of advertising since it runs on our own platform, so no third-party advertising costs are needed; it would also encourage current users to make more transactions with the bonus credit. Next we will check the average age of our users.
user['user_age'].describe()
			count    14712.000000
			mean        27.176591
			std          8.552585
			min         17.000000
			25%         21.000000
			50%         25.000000
			75%         31.000000
			max         83.000000
			Name: user_age, dtype: float64
The description shows that the average user age is 27 years, with the youngest being 17 and the oldest 83. We will visualise the distribution with a histogram.
user['user_age'].hist(figsize=(5,5))
plt.title('User age distribution')
The histogram shows that most of our users are in their 20s and early 30s. Next we will use the z-score to check for outliers.
mean = np.mean(user.user_age)
std = np.std(user.user_age)
outlier = []
for i in user.user_age:
    z = (i-mean)/std
    if z > 3:
        outlier.append(i)
print('outliers in dataset are', set(outlier))
sns.boxplot(x=user.user_age)
			outliers in dataset are {53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 69, 71, 73, 74, 76, 82, 83}
The z-score calculation shows that the outliers in user age are 53 years and older. We will visualise their distribution, check how many users fall into this group, and check (by merging dataset 1 with the transactions dataset) what share of transactions these users contribute before deciding whether to remove the outliers or leave them as is.
user_outliers = user.loc[user.user_age>=53,:]
total_out = user_outliers.user_id.count()
percentage_outliers = (total_out/total_user)
print(f'Total users in outliers : {total_out} OR {percentage_outliers:.2%} from total user')
fig, ax = plt.subplots(figsize=(6, 5))
ax = sns.countplot(y=user_outliers.user_age, data=user_outliers)
ax.set_title('User distribution above 53 years of age')
for x in ax.containers:
    ax.bar_label(x, label_type='edge')
plt.show()
			Total users in outliers : 264 OR 1.79% from total user
The chart above shows a total of 264 users in the outlier category, equal to 1.79% of all users, with the largest group at 54 years of age and the smallest counts from 73 years onwards. Since these ages are still plausible, we will keep these users in our analysis.
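The transaction-contribution check described before the chart is only summarised here; a minimal sketch of how it could be computed, assuming the trans DataFrame introduced in the next section and its four *_transaction_amount columns:
# Share of total absolute transaction value contributed by the age outliers (53+)
amount_cols = ['Saham_transaction_amount', 'Pasar_Uang_transaction_amount',
               'Pendapatan_Tetap_transaction_amount', 'Campuran_transaction_amount']
outlier_value = trans.loc[trans.user_id.isin(user_outliers.user_id), amount_cols].abs().sum().sum()
total_value = trans[amount_cols].abs().sum().sum()
print(f'Age outliers contribute {outlier_value/total_value:.2%} of total transaction value')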

Transactions

We want to check how many rows of data the transactions dataset has and the total number of users making transactions.
print(f'Total rows: {trans.shape[0]} ,columns: {trans.shape[1]}')
transaction_users = trans['user_id'].nunique()
print(f'Total users making transaction period August - October 2021 = {transaction_users} users '
      f'or {(transaction_users/total_user):.2%}')
trans.info()
	Total rows: 158811 ,columns: 14
	Total users making transaction period August - October 2021 = 8277 users or 56.26%
	RangeIndex: 158811 entries, 0 to 158810
	Data columns (total 14 columns):
	#   Column                               Non-Null Count   Dtype  
	---  ------                               --------------   -----  
	0   user_id                              158811 non-null  int64  
	1   date                                 158811 non-null  object 
	2   Saham_AUM                            106292 non-null  float64
	3   Saham_invested_amount                106292 non-null  float64
	4   Saham_transaction_amount             100839 non-null  float64
	5   Pasar_Uang_AUM                       131081 non-null  float64
	6   Pasar_Uang_invested_amount           131081 non-null  float64
	7   Pasar_Uang_transaction_amount        124273 non-null  float64
	8   Pendapatan_Tetap_AUM                 105946 non-null  float64
	9   Pendapatan_Tetap_invested_amount     105946 non-null  float64
	10  Pendapatan_Tetap_transaction_amount  100497 non-null  float64
	11  Campuran_AUM                         5352 non-null    float64
	12  Campuran_invested_amount             5352 non-null    float64
	13  Campuran_transaction_amount          5117 non-null    float64
	dtypes: float64(12), int64(1), object(1)
From the description above, the transactions dataset has 158811 rows and 14 columns. Furthermore, only 8277 users made transactions, equal to 56.26% of all users; the company could run campaigns to encourage more users to transact. We can also see that plenty of columns contain null values. We will confirm this and rename the 'date' column to trans_date to avoid ambiguity.
trans.rename(columns={"date": "trans_date"}, inplace=True)
trans['trans_date'] = pd.to_datetime(trans['trans_date'])
trans.isnull().sum()
	user_id                                     0
	trans_date                                  0
	Saham_AUM                               52519
	Saham_invested_amount                   52519
	Saham_transaction_amount                57972
	Pasar_Uang_AUM                          27730
	Pasar_Uang_invested_amount              27730
	Pasar_Uang_transaction_amount           34538
	Pendapatan_Tetap_AUM                    52865
	Pendapatan_Tetap_invested_amount        52865
	Pendapatan_Tetap_transaction_amount     58314
	Campuran_AUM                           153459
	Campuran_invested_amount               153459
	Campuran_transaction_amount            153694
	dtype: int64
The output above confirms our findings, and according to the dataset description all of the columns containing null values are transaction-related columns, so we will replace the nulls with 0.
trans.fillna(0, inplace=True)
trans.isnull().sum()
	user_id                                0
	trans_date                             0
	Saham_AUM                              0
	Saham_invested_amount                  0
	Saham_transaction_amount               0
	Pasar_Uang_AUM                         0
	Pasar_Uang_invested_amount             0
	Pasar_Uang_transaction_amount          0
	Pendapatan_Tetap_AUM                   0
	Pendapatan_Tetap_invested_amount       0
	Pendapatan_Tetap_transaction_amount    0
	Campuran_AUM                           0
	Campuran_invested_amount               0
	Campuran_transaction_amount            0
	dtype: int64
The null values have been replaced. Next, we will find how many transactions each user made by creating boolean flag columns: 1 if the user made a transaction of that type on a given date, 0 if not. Later we will visualise these columns to see the daily growth.
trans['count_saham'] = np.where(trans['Saham_transaction_amount']!=0.00,1,0)
trans['count_uang'] = np.where(trans['Pasar_Uang_transaction_amount']!=0.00,1,0)
trans['count_fixed'] = np.where(trans['Pendapatan_Tetap_transaction_amount']!=0.00,1,0)
trans['count_mixed'] = np.where(trans['Campuran_transaction_amount']!=0.00,1,0)

trans_count = trans[['count_saham', 'count_uang', 'count_fixed', 'count_mixed', 'trans_date']].groupby('trans_date').sum()
t_series = trans_count.index.to_pydatetime()

fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(t_series, trans_count)
plt.title('Total transaction growth per day')
plt.legend(['Saham', 'Pasar Uang', 'Pendapatan Tetap', 'Campuran'])
plt.setp(ax.xaxis.get_ticklabels(), rotation=90)

plt.show()
From the transaction count chart, the most frequent transaction type is Pasar Uang (money market), followed by Pendapatan Tetap (fixed income) and Saham (equity). As we can see from the dataset, the AUM and invested amounts are updated daily, whereas the transaction columns are filled when the user buys or sells and are 0 when no transaction was made. Next, we will separate buy and sell transactions into new columns to avoid biased figures when summarising the transactions.
trans_buysell = ['Saham_transaction_amount','Pasar_Uang_transaction_amount', 'Pendapatan_Tetap_transaction_amount',
	'Campuran_transaction_amount']

for x in trans_buysell:
	mask = trans[x] < 0
	trans[x+'_buy'] = trans[x].mask(mask).abs()
	trans[x+'_sell'] = trans[x].mask(~mask).abs()

trans.loc[trans.Saham_transaction_amount !=0.00,:].head()
user_id trans_date Saham_AUM Saham_invested_amount Saham_transaction_amount Pasar_Uang_AUM Pasar_Uang_invested_amount Pasar_Uang_transaction_amount Pendapatan_Tetap_AUM Pendapatan_Tetap_invested_amount ... count_fixed count_mixed Saham_transaction_amount_buy Saham_transaction_amount_sell Pasar_Uang_transaction_amount_buy Pasar_Uang_transaction_amount_sell Pendapatan_Tetap_transaction_amount_buy Pendapatan_Tetap_transaction_amount_sell Campuran_transaction_amount_buy Campuran_transaction_amount_sell
25 50961 2021-09-01 1705566.0 1700000.0 1000000.0 100065.0 100000.0 0.0 200124.0 200000.0 ... 0 0 1000000.0 NaN 0.0 NaN 0.0 NaN 0.0 NaN
46 50961 2021-09-30 2063909.0 2000000.0 300000.0 700603.0 700000.0 600000.0 1398998.0 1400000.0 ... 1 0 300000.0 NaN 600000.0 NaN 1200000.0 NaN 0.0 NaN
102 61414 2021-08-13 0.0 0.0 -10000.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 NaN 10000.0 0.0 NaN 0.0 NaN 0.0 NaN
154 66145 2021-09-29 248015.0 240000.0 140000.0 29992.0 30000.0 0.0 129929.0 130000.0 ... 0 0 140000.0 NaN 0.0 NaN 0.0 NaN 0.0 NaN
166 67251 2021-09-20 562777.0 570000.0 530000.0 110065.0 110000.0 100000.0 420268.0 420000.0 ... 1 0 530000.0 NaN 100000.0 NaN 370000.0 NaN 0.0 NaN
5 rows × 26 columns
Since we have split the transactions into buy and sell columns, we will drop the original transaction columns.
trans.drop(trans_buysell, axis=1, inplace=True)
trans.head()
user_id trans_date Saham_AUM Saham_invested_amount Pasar_Uang_AUM Pasar_Uang_invested_amount Pendapatan_Tetap_AUM Pendapatan_Tetap_invested_amount Campuran_AUM Campuran_invested_amount ... count_fixed count_mixed Saham_transaction_amount_buy Saham_transaction_amount_sell Pasar_Uang_transaction_amount_buy Pasar_Uang_transaction_amount_sell Pendapatan_Tetap_transaction_amount_buy Pendapatan_Tetap_transaction_amount_sell Campuran_transaction_amount_buy Campuran_transaction_amount_sell
0 50701 2021-08-30 0.0 0.0 0.0 0.0 0.0 0.0 10132277.0 10000000.0 ... 0 0 0.0 NaN 0.0 NaN 0.0 NaN 0.0 NaN
1 50701 2021-08-31 0.0 0.0 0.0 0.0 0.0 0.0 10206945.0 10000000.0 ... 0 0 0.0 NaN 0.0 NaN 0.0 NaN 0.0 NaN
2 50701 2021-09-01 0.0 0.0 0.0 0.0 0.0 0.0 9956556.0 10000000.0 ... 0 0 0.0 NaN 0.0 NaN 0.0 NaN 0.0 NaN
3 50701 2021-09-02 0.0 0.0 0.0 0.0 0.0 0.0 9914858.0 10000000.0 ... 0 0 0.0 NaN 0.0 NaN 0.0 NaN 0.0 NaN
4 50701 2021-09-03 0.0 0.0 0.0 0.0 0.0 0.0 10016360.0 10000000.0 ... 0 0 0.0 NaN 0.0 NaN 0.0 NaN 0.0 NaN
5 rows × 22 columns
Next we will visualise the buy and sell transactions to see the growth in total transaction value.
buysell = trans[['Saham_transaction_amount_buy', 'Saham_transaction_amount_sell', 'Pasar_Uang_transaction_amount_buy', 'Pasar_Uang_transaction_amount_sell',
	'Pendapatan_Tetap_transaction_amount_buy', 'Pendapatan_Tetap_transaction_amount_sell','Campuran_transaction_amount_buy', 'Campuran_transaction_amount_sell',
	'trans_date']].groupby('trans_date').sum()
t_series = buysell.index.to_pydatetime()

fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(t_series, buysell)
plt.title('Summaries buying and selling growth per day')
ax.legend(['Saham_transaction_amount_buy', 'Saham_transaction_amount_sell', 'Pasar_Uang_transaction_amount_buy', 'Pasar_Uang_transaction_amount_sell',
'Pendapatan_Tetap_transaction_amount_buy', 'Pendapatan_Tetap_transaction_amount_sell','Campuran_transaction_amount_buy', 'Campuran_transaction_amount_sell'],
bbox_to_anchor=(1,1.05))
plt.setp(ax.xaxis.get_ticklabels(), rotation=90)
plt.ticklabel_format(style='plain', axis='y')

plt.show()
The buying and selling summary chart shows that the total amounts of Pendapatan Tetap (buy) and Pasar Uang (buy) lead the other transaction types. We can also see that the majority of transactions are buys, which could mean that our users are more interested in investing their money.
Since the invested columns are updated daily with the user's latest balance, we will drop the duplicate rows per user and keep only the latest date.
invested_col = ['user_id', 'trans_date', 'Saham_AUM', 'Saham_invested_amount', 'Pasar_Uang_AUM',
	'Pasar_Uang_invested_amount', 'Pendapatan_Tetap_AUM', 'Pendapatan_Tetap_invested_amount', 'Campuran_AUM',
	'Campuran_invested_amount']
invested_AUM = trans.sort_values(by='trans_date').drop_duplicates(subset=['user_id'], keep='last').reset_index(drop=True)
invested_AUM = invested_AUM[invested_col].copy()
print(f'Total rows: {invested_AUM.shape[0]}, columns: {invested_AUM.shape[1]}\nInvested:')
invested_AUM.head()
	Total rows: 8277, columns: 10
	Invested:
user_id trans_date Saham_AUM Saham_invested_amount Pasar_Uang_AUM Pasar_Uang_invested_amount Pendapatan_Tetap_AUM Pendapatan_Tetap_invested_amount Campuran_AUM Campuran_invested_amount
0 3786300 2021-09-30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 3760891 2021-09-30 258975.00 249864.00 0.00 0.00 553302.00 550000.00 0.00 0.00
2 4096532 2021-09-30 13803192.00 13352136.00 2390277.00 2387932.00 8134491.00 8149154.00 0.00 0.00
3 2836295 2021-09-30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 2839273 2021-09-30 0.00 0.00 10028.00 10000.00 0.00 0.00 0.00 0.00
Transactions dataframe:
sum_trans_total = trans.groupby('user_id')[['Saham_transaction_amount_buy',
	'Saham_transaction_amount_sell', 'Pasar_Uang_transaction_amount_buy',
	'Pasar_Uang_transaction_amount_sell',
	'Pendapatan_Tetap_transaction_amount_buy',
	'Pendapatan_Tetap_transaction_amount_sell',
	'Campuran_transaction_amount_buy', 'Campuran_transaction_amount_sell',
'count_saham', 'count_uang', 'count_fixed', 'count_mixed']].sum()
print(f'Total rows:{sum_trans_total.shape[0]}, columns: {sum_trans_total.shape[1]}\nTransactions:')
sum_trans_total.head()
	Total rows:8277, columns: 12
	Transactions:
Saham_transaction_amount_buy Saham_transaction_amount_sell Pasar_Uang_transaction_amount_buy Pasar_Uang_transaction_amount_sell Pendapatan_Tetap_transaction_amount_buy Pendapatan_Tetap_transaction_amount_sell Campuran_transaction_amount_buy Campuran_transaction_amount_sell count_saham count_uang count_fixed count_mixed
user_id
50701 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0 0 0
50961 1300000.00 0.00 600000.00 0.00 1200000.00 0.00 0.00 0.00 2 1 1 0
53759 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0 0 0
54759 0.00 0.00 2000000.00 0.00 0.00 0.00 0.00 0.00 0 1 0 0
61414 0.00 10000.00 0.00 0.00 0.00 0.00 0.00 0.00 1 0 0 0
After creating the invested and transactions dataframes, we merge them into new_trans, which we will use from here on.
new_trans = invested_AUM.merge(sum_trans_total, on='user_id')
print(f'Total rows: {new_trans.shape[0]}, columns:{new_trans.shape[1]}')
del invested_AUM, sum_trans_total
new_trans.head()
	Total rows: 8277, columns:22
user_id trans_date Saham_AUM Saham_invested_amount Pasar_Uang_AUM Pasar_Uang_invested_amount Pendapatan_Tetap_AUM Pendapatan_Tetap_invested_amount Campuran_AUM Campuran_invested_amount ... Pasar_Uang_transaction_amount_buy Pasar_Uang_transaction_amount_sell Pendapatan_Tetap_transaction_amount_buy Pendapatan_Tetap_transaction_amount_sell Campuran_transaction_amount_buy Campuran_transaction_amount_sell count_saham count_uang count_fixed count_mixed
0 3786300 2021-09-30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 10000.00 0.00 0.00 0.00 0.00 1 1 0 0
1 3760891 2021-09-30 258975.00 249864.00 0.00 0.00 553302.00 550000.00 0.00 0.00 ... 310000.00 410000.00 499783.00 49783.00 0.00 0.00 5 7 7 0
2 4096532 2021-09-30 13803192.00 13352136.00 2390277.00 2387932.00 8134491.00 8149154.00 0.00 0.00 ... 1937932.00 0.00 6619154.00 0.00 0.00 0.00 2 1 2 0
3 2836295 2021-09-30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 100000.00 0.00 0.00 0.00 0.00 1 1 0 0
4 2839273 2021-09-30 0.00 0.00 10028.00 10000.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 40000.00 0.00 0.00 1 0 1 0
Next we will do a quick exploration of the bond order dataset before combining it with the user and transaction data.

Bonds Order

print(f'Total rows: {result.shape[0]}, columns:{result.shape[1]}')
print(f'Total User: {result.user_id.nunique()}')
result.info()
	Total rows: 8484, columns:3
	Total User: 8484
	RangeIndex: 8484 entries, 0 to 8483
	Data columns (total 3 columns):
	#   Column              Non-Null Count  Dtype
	---  ------              --------------  -----
	0   user_id             8484 non-null   int64
	1   flag_order_bond     8484 non-null   int64
	2   bond_units_ordered  8484 non-null   int64
	dtypes: int64(3)
Since our problem statement is to predict whether a user will order bonds based on their behaviour, we want to see the percentage of users who ordered bonds in the bond order dataset and compare the two classes.
bond_order = result[['user_id', 'flag_order_bond']].groupby('flag_order_bond').count()
fig, ax = plt.subplots(figsize=(5, 5), subplot_kw=dict(aspect="equal"))
wedges, texts, autotexts = ax.pie(bond_order.user_id, autopct=lambda pct: func(pct, bond_order.user_id),
									textprops=dict(color="black"), explode=(0,0.1))
ax.legend(wedges, bond_order.index, title='Flag order bond', loc='upper right')
plt.show()

bond_chart=sns.countplot(x=result["flag_order_bond"])
plt.tight_layout()
plt.show()
It shows that only 40.9% of our users ordered government bonds. However, we found a difference in the number of users between the transactions dataset and dataset 3: 8277 users made transactions in the transactions dataset, while the bond order dataset has 8484 users. We will combine the transactions dataset with the bond orders to see the details.
trans_3 = result.merge(new_trans, how='left')
print(f'Total combined rows: {trans_3.shape[0]}, columns:{trans_3.shape[1]}')
print(f'Difference total users:{trans_3.user_id.nunique()-transaction_users}')
trans_3.isnull().sum()
	Total combined rows: 8484, columns:24
	Difference total users:207

	user_id                                  0
	flag_order_bond                          0
	bond_units_ordered                       0
	trans_date                             207
	Saham_AUM                              207
	Saham_invested_amount                  207
	Pasar_Uang_AUM                         207
	Pasar_Uang_invested_amount             207
	Pendapatan_Tetap_AUM                   207
	Pendapatan_Tetap_invested_amount       207
	Campuran_AUM                           207
	Campuran_invested_amount               207
	Saham_transaction_amount               207
	Pasar_Uang_transaction_amount          207
	Pendapatan_Tetap_transaction_amount    207
	Campuran_transaction_amount            207
	dtype: int64
We left-joined the datasets, since the bond order dataset has more rows, to better understand the difference. After the merge there are 8484 rows of data. The difference in the number of users between the transactions dataset and dataset 3 is 207, which matches the number of null values. We will check whether the users with null values ordered bonds or not before deciding to exclude them.
v = trans_3[trans_3.isnull().any(axis=1)]
v_null = v[['user_id', 'flag_order_bond']].groupby('flag_order_bond').count()
fig, ax = plt.subplots(figsize=(5, 5), subplot_kw=dict(aspect="equal"))
wedges, texts, autotexts = ax.pie(v_null.user_id, autopct=lambda pct: func(pct, v_null.user_id),
	textprops=dict(color="black"), explode=(0,1))
ax.legend(wedges, v_null.index, title='Null flag order bond', loc='upper right')
plt.show()
From the pie chart, only 1 of the 207 users ended up ordering bonds. Since we have no transaction records for these 207 users, it is safe to exclude them from our analysis. Now we will merge all of our datasets into one before preprocessing the data for modelling.
df1 = new_trans.merge(user, how='inner', on='user_id')
full_data = df1.merge(result, how='inner', on='user_id')
print(f'Total full rows: {full_data.shape[0]}, columns:{full_data.shape[1]}')
print('Null Values:')
full_data.isnull().sum()
	Total full rows: 8277, columns:32

	Null Values:
	user_id                                     0
	trans_date                                  0
	Saham_AUM                                   0
	Saham_invested_amount                       0
	Pasar_Uang_AUM                              0
	Pasar_Uang_invested_amount                  0
	Pendapatan_Tetap_AUM                        0
	Pendapatan_Tetap_invested_amount            0
	Campuran_AUM                                0
	Campuran_invested_amount                    0
	Saham_transaction_amount_buy                0
	Saham_transaction_amount_sell               0
	Pasar_Uang_transaction_amount_buy           0
	Pasar_Uang_transaction_amount_sell          0
	Pendapatan_Tetap_transaction_amount_buy     0
	Pendapatan_Tetap_transaction_amount_sell    0
	Campuran_transaction_amount_buy             0
	Campuran_transaction_amount_sell            0
	count_saham                                 0
	count_uang                                  0
	count_fixed                                 0
	count_mixed                                 0
	registration_import_datetime                0
	user_gender                                 0
	user_age                                    0
	user_occupation                             0
	user_income_range                           0
	referral_code_used                          0
	user_income_source                          0
	registration_month                          0
	flag_order_bond                             0
	bond_units_ordered                          0
	dtype: int64

After combining the three datasets, we have a total of 8277 rows and 32 columns. We will replace user_age with age group bins: 17-19, 20-29, 30-39, and so on.
bins = [17, 20, 30, 40, 50, 60, 70, 80, 100]
labels = ['17-19','20-29','30-39', '40-49', '50-59', '60-69', '70-79', '80+']
full_data['age_group'] = pd.cut(full_data['user_age'], bins, labels = labels,include_lowest = True)
full_data.drop(['user_age'], axis=1, inplace=True)
full_data.age_group.unique()

	['20-29', '40-49', '30-39', '50-59', '17-19', '60-69', '70-79', '80+']
	Categories (8, object): ['17-19' < '20-29' < '30-39' < '40-49' < '50-59' < '60-69' < '70-79' < '80+']
trans_group = full_data[['age_group', 'Saham_transaction_amount_buy',
	'Saham_transaction_amount_sell', 'Pasar_Uang_transaction_amount_buy',
	'Pasar_Uang_transaction_amount_sell',
	'Pendapatan_Tetap_transaction_amount_buy',
	'Pendapatan_Tetap_transaction_amount_sell',
	'Campuran_transaction_amount_buy', 'Campuran_transaction_amount_sell']].groupby('age_group').sum()

trans_group.plot(title='Summary transaction by age group', figsize=(8,5))
plt.ticklabel_format(axis='y', style='plain')
plt.show()
When we visualise the transaction totals by age group, we can see that the 20-29 age group contributes the most, and the totals decline for the older groups.
Then we will check the correlations between our features by plotting a correlation matrix.
fig, ax = plt.subplots(figsize=(25,10))  
sns.heatmap(full_data.corr(), annot = True, ax=ax, cmap = 'Blues')
plt.title("DataCorrelation")
plt.show()
The correlation matrix shows that each AUM column has a near-perfect positive correlation with its corresponding invested column, and likewise for the buy transaction columns; bond_units_ordered is also strongly correlated with flag_order_bond. We will therefore drop one feature from each highly correlated pair to avoid multicollinearity affecting model performance.
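As a sanity check, the strongly correlated pairs can also be listed programmatically before deciding what to drop; a minimal sketch (the 0.9 threshold is an assumption):
# List feature pairs whose absolute correlation exceeds the threshold
corr = full_data.select_dtypes('number').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
print(upper.stack().loc[lambda s: s > 0.9].sort_values(ascending=False))
The drop list below keeps one column from each such pair.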
pos_corr = ['Saham_AUM', 'Pasar_Uang_AUM', 'Pendapatan_Tetap_AUM', 'Campuran_AUM', 'Saham_transaction_amount_buy',
	'Pasar_Uang_transaction_amount_buy', 'Pendapatan_Tetap_transaction_amount_buy', 'Campuran_transaction_amount_buy',
	'bond_units_ordered']
full_data.drop(pos_corr, axis=1, inplace=True)

Preprocessing

Next we will convert our categorical data into numerical form. We will print the unique values of the non-numerical columns, and we will also remove the transaction date, registration month, and user id columns.
for x in full_data:
	uniqueValues = full_data[x].unique()
	n_values = len(uniqueValues)
	if full_data[x].dtypes == object:
		print(f'Column: {x} -- total unique value: ({n_values}) {uniqueValues}')
	Column: user_gender -- total unique value: (2) ['Male' 'Female']
	Column: user_occupation -- total unique value: (9) ['Pelajar' 'Swasta' 'Others' 'Pengusaha' 'IRT' 'Guru' 'Pensiunan' 'PNS'
	'TNI/Polisi']
	Column: user_income_range -- total unique value: (6) ['< 10 Juta' 'Rp 10 Juta - 50 Juta' '> Rp 100 Juta - 500 Juta'
	'> Rp 500 Juta - 1 Miliar' '> Rp 50 Juta - 100 Juta' '> Rp 1 Miliar']
	Column: referral_code_used -- total unique value: (2) ['used referral' 'None']
	Column: user_income_source -- total unique value: (10) ['Keuntungan Bisnis' 'Gaji' 'Lainnya' 'Warisan' 'Undian'
	'Dari Orang Tua / Anak' 'Tabungan' 'Hasil Investasi' 'Dari Suami / istri'
	'Bunga Simpanan']
The unique values show that user_income_range is an ordinal category, so we will use a label encoder to convert it to numbers, and one-hot encoding for the remaining categorical columns.
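One caveat: LabelEncoder assigns integer codes in alphabetical order of the category strings, which does not match the income order listed above. If the ordinal order matters, an explicit mapping is an alternative; a minimal sketch (not applied here, so the LabelEncoder pipeline below is unchanged):
# Explicit ordinal encoding that preserves the income order
income_order = ['< 10 Juta', 'Rp 10 Juta - 50 Juta', '> Rp 50 Juta - 100 Juta',
                '> Rp 100 Juta - 500 Juta', '> Rp 500 Juta - 1 Miliar', '> Rp 1 Miliar']
income_rank = {label: i for i, label in enumerate(income_order)}
# full_data['user_income_range'] = full_data['user_income_range'].map(income_rank)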
full_data.drop(['trans_date', 'registration_month', 'user_id'], axis=1, inplace=True)
one_hot_var = ['user_gender','user_occupation', 'referral_code_used', 'user_income_source']
one_hot = pd.get_dummies(full_data, columns = one_hot_var)

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

one_hot['registration_import_datetime'] = labelencoder.fit_transform(one_hot['registration_import_datetime'])
one_hot['age_group'] = labelencoder.fit_transform(one_hot['age_group'])
one_hot['user_income_range'] = labelencoder.fit_transform(one_hot['user_income_range'])
one_hot.info()
	Int64Index: 8277 entries, 0 to 8276
	Data columns (total 39 columns):
	 #   Column                                    Non-Null Count  Dtype  
	---  ------                                    --------------  -----  
	 0   Saham_invested_amount                     8277 non-null   float64
	 1   Pasar_Uang_invested_amount                8277 non-null   float64
	 2   Pendapatan_Tetap_invested_amount          8277 non-null   float64
	 3   Campuran_invested_amount                  8277 non-null   float64
	 4   Saham_transaction_amount_sell             8277 non-null   float64
	 5   Pasar_Uang_transaction_amount_sell        8277 non-null   float64
	 6   Pendapatan_Tetap_transaction_amount_sell  8277 non-null   float64
	 7   Campuran_transaction_amount_sell          8277 non-null   float64
	 8   count_saham                               8277 non-null   int32  
	 9   count_uang                                8277 non-null   int32  
	 10  count_fixed                               8277 non-null   int32  
	 11  count_mixed                               8277 non-null   int32  
	 12  registration_import_datetime              8277 non-null   int32  
	 13  user_income_range                         8277 non-null   int32  
	 14  flag_order_bond                           8277 non-null   int64  
	 15  age_group                                 8277 non-null   int32  
	 16  user_gender_Female                        8277 non-null   uint8  
	 17  user_gender_Male                          8277 non-null   uint8  
	 18  user_occupation_Guru                      8277 non-null   uint8  
	 19  user_occupation_IRT                       8277 non-null   uint8  
	 20  user_occupation_Others                    8277 non-null   uint8  
	 21  user_occupation_PNS                       8277 non-null   uint8  
	 22  user_occupation_Pelajar                   8277 non-null   uint8  
	 23  user_occupation_Pengusaha                 8277 non-null   uint8  
	 24  user_occupation_Pensiunan                 8277 non-null   uint8  
	 25  user_occupation_Swasta                    8277 non-null   uint8  
	 26  user_occupation_TNI/Polisi                8277 non-null   uint8  
	 27  referral_code_used_None                   8277 non-null   uint8  
	 28  referral_code_used_used referral          8277 non-null   uint8  
	 29  user_income_source_Bunga Simpanan         8277 non-null   uint8  
	 30  user_income_source_Dari Orang Tua / Anak  8277 non-null   uint8  
	 31  user_income_source_Dari Suami / istri     8277 non-null   uint8  
	 32  user_income_source_Gaji                   8277 non-null   uint8  
	 33  user_income_source_Hasil Investasi        8277 non-null   uint8  
	 34  user_income_source_Keuntungan Bisnis      8277 non-null   uint8  
	 35  user_income_source_Lainnya                8277 non-null   uint8  
	 36  user_income_source_Tabungan               8277 non-null   uint8  
	 37  user_income_source_Undian                 8277 non-null   uint8  
	 38  user_income_source_Warisan                8277 non-null   uint8  
After converting the non-numerical data, our dataset shape has changed to 8277 rows and 39 columns. We will use these features to build models with several classification algorithms and evaluate which one performs best. We take out the target variable flag_order_bond and split the data into training (60% of the total) and 40% for validation and testing. We will also scale the features between 0 and 1 using MinMaxScaler from scikit-learn.
x_v1 = one_hot.drop(['flag_order_bond'], axis=1)
y_v1 = one_hot['flag_order_bond'].copy()
Function to split the data into train, validation and test sets, and scale it:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def split_scaled(x,y):
	xtrain, xtemp, ytrain, ytemp = train_test_split(x, y, train_size=0.6, random_state=3)
	xvalid, xtest, yvalid, ytest = train_test_split(xtemp, ytemp, test_size=0.5, random_state=3)
	scale = MinMaxScaler()

	xtrain_scaled = scale.fit_transform(xtrain)
	xvalid_scaled = scale.fit_transform(xvalid)
	xtest_scaled = scale.fit_transform(xtest)
	return xtrain_scaled, ytrain, xvalid_scaled, yvalid, xtest_scaled, ytest
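One detail worth flagging: split_scaled fits a separate MinMaxScaler on each split, so the validation and test sets are scaled by their own minima and maxima. The more common pattern is to fit the scaler on the training split only and reuse it; a minimal sketch of that variant (the function name is just for illustration, and it is kept separate so the results below still correspond to the function above):
def split_scaled_train_only(x, y):
	# Same splits as split_scaled, but the scaler is fitted on the training data only
	xtrain, xtemp, ytrain, ytemp = train_test_split(x, y, train_size=0.6, random_state=3)
	xvalid, xtest, yvalid, ytest = train_test_split(xtemp, ytemp, test_size=0.5, random_state=3)
	scale = MinMaxScaler().fit(xtrain)
	return scale.transform(xtrain), ytrain, scale.transform(xvalid), yvalid, scale.transform(xtest), ytest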
Function to evaluate the models:
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, log_loss
score_comp = []
def eval(model, name, x1, y1, x2, y2, x3,y3):
	#accuracy score
	predtrain = model.predict(x1)
	predvalid = model.predict(x2)
	predtest = model.predict(x3)
	acctest = (accuracy_score(y3, predtest))
	accvalid = (accuracy_score(y2, predvalid))
	acctrain = (accuracy_score(y1, predtrain))
	overfit = acctrain - acctest
	
	tn, fp, fn, tp = confusion_matrix(y3, predtest).ravel()
	precision1 = (tp / (tp+fp))
	precision0 = (tn/(tn+fn))
	recall1 = (tp/(tp+fn))
	recall0 = (tn/(tn+fp))
	f1 = 2*(precision1 * recall1)/(precision1 + recall1)

	#log loss
	logtest = log_loss(y3,predtest)
	logtrain = log_loss(y1,predtrain)

	#roc auc score
	test_prob = model.predict_proba(x3)[::,1]
	train_prob = model.predict_proba(x1)[::,1]
	roctest = (roc_auc_score(y3, test_prob))
	roctrain = (roc_auc_score(y1, train_prob))
	fpr_test, tpr_test, _ = roc_curve(y3,  test_prob)
	fpr_train, tpr_train, _ = roc_curve(y1,  train_prob)

	#print report
	print(f'Training accuracy: {acctrain:.2%} \nValidation accuracy: {accvalid:.2%}\nTesting accuracy: {acctest:.2%} \nOverfitting: {overfit:.2%}')
	print(f'Log Loss Training: {logtrain}, Log Loss Testing: {logtest}')
	print(f'Precision class 1: {precision1:.2%} \nPrecision class 0: {precision0:.2%}')
	print(f'Recall class 1: {recall1:.2%} \nRecall class 0: {recall0:.2%}')
	print(f'F1: {f1:.2%}') 
	print(f'ROC AUC Training: {roctrain:.2%} Testing: {roctest:.2%}')

	#insert result to list for comparison
	score_comp.append([name, acctrain, accvalid, acctest, overfit, precision1, precision0, recall1, recall0,
	f1, logtrain, logtest, roctrain, roctest])

	#confusion matrix plot
	disp = ConfusionMatrixDisplay(confusion_matrix(y3, predtest),display_labels=model.classes_)
	disp.plot()
	plt.show()
	
	#roc auc learning curve plot
	plt.title("Area Under Curve")
	plt.plot(fpr_test,tpr_test,label="AUC Test="+str(roctest))
	plt.plot(fpr_train,tpr_train,label="AUC Train="+str(roctrain))
	plt.ylabel('True Positive Rate')
	plt.xlabel('False Positive Rate')
	plt.legend(loc=4)
	plt.grid(True)
	plt.show()

Model 1 - Logistic Regression

The first model we will implement is Logistic Regression on our original feature set.
xtrain, ytrain, xvalid, yvalid, xtest, ytest = split_scaled(x_v1, y_v1)

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(xtrain, ytrain)
eval(logreg, 'logreg1', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 63.25% 
	Validation accuracy: 61.51%
	Testing accuracy: 62.02% 
	Overfitting: 1.23%
	Log Loss Training: 13.246006330072253, Log Loss Testing: 13.690493950334956
	Precision class 1: 52.56% 
	Precision class 0: 66.20%
	Recall class 1: 40.76% 
	Recall class 0: 75.92%
	F1: 45.92%
	ROC AUC Training: 66.75% Testing: 64.66%
We can see that the model does not give a good accuracy score. We will try to tune it using grid search cross-validation to improve performance.
from sklearn.model_selection import GridSearchCV
log_param = [
	{
		'C': [0.01, 0.1, 1.0, 10, 100], 
        'penalty': ['l1','l2'],
        'solver': ['newton-cg','saga'],
        'max_iter': [200,400,600,800]
	},
]

#tuned the model using gridsearch and kfold
opt_log = GridSearchCV(LogisticRegression(), param_grid = log_param, scoring='accuracy', cv=10, verbose = 2)
#fit the tuned model with training dataset
opt_log.fit(xtrain, ytrain) 
print(opt_log.best_params_) 
	{'C': 100, 'max_iter': 800, 'penalty': 'l1', 'solver': 'saga'}
We will fit our dataset to a new model with the parameters resulting from the Grid Search.
lr = LogisticRegression(C= 100, penalty= 'l1', solver= 'saga', max_iter=800)
lr.fit(xtrain, ytrain)
eval(lr,'logreg_tuned', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 65.40% 
	Validation accuracy: 63.63%
	Testing accuracy: 65.40% 
	Overfitting: 0.01%
	Log Loss Training: 12.4693911644187, Log Loss Testing: 12.471626444422782
	Precision class 1: 58.01% 
	Precision class 0: 68.71%
	Recall class 1: 45.34% 
	Recall class 0: 78.52%
	F1: 50.90%
	ROC AUC Training: 70.22% Testing: 69.73%
The tuning improves the model's performance. Next we will check the feature coefficients and try to modify our feature set.
coefs_df = pd.DataFrame()
coefs_df['features'] = x_v1.columns
coefs_df['coefs'] = logreg.coef_[0]
print(coefs_df.sort_values('coefs', ascending=False).head(20))
plt.bar([x for x in range(len(coefs_df['coefs']))], coefs_df['coefs'])
plt.show()
						features  coefs
	0                      Saham_invested_amount   4.31
	1                 Pasar_Uang_invested_amount   2.63
	14                                 age_group   1.95
	10                               count_fixed   1.94
	2           Pendapatan_Tetap_invested_amount   1.58
	3                   Campuran_invested_amount   1.42
	30     user_income_source_Dari Suami / istri   0.59
	9                                 count_uang   0.57
	23                 user_occupation_Pensiunan   0.42
	36                 user_income_source_Undian   0.34
	12              registration_import_datetime   0.33
	20                       user_occupation_PNS   0.32
	26                   referral_code_used_None   0.26
	15                        user_gender_Female   0.23
	13                         user_income_range   0.20
	31                   user_income_source_Gaji   0.14
	29  user_income_source_Dari Orang Tua / Anak   0.12
	5         Pasar_Uang_transaction_amount_sell   0.09
	22                 user_occupation_Pengusaha   0.08
	25                user_occupation_TNI/Polisi   0.04
We will remove the features with coefficients lower than 0.2.
dropslv2 = coefs_df.loc[coefs_df['coefs']<0.2].features
dropslv2 = dropslv2.to_list()
encoded2 = one_hot.drop(dropslv2, axis=1)
x_v2 = encoded2.drop(['flag_order_bond'], axis=1)
y_v2 = encoded2['flag_order_bond'].copy()
Fit new dataset to the model
xtrain2, ytrain2, xvalid2, yvalid2, xtest2, ytest2 = split_scaled(x_v2, y_v2)

lr2 = LogisticRegression()
lr2.fit(xtrain2, ytrain2)
eval(lr2, 'logreg_coef', xtrain2, ytrain2, xvalid2, yvalid2, xtest2, ytest2)
	Training accuracy: 62.18% 
	Validation accuracy: 61.27%
	Testing accuracy: 61.35% 
	Overfitting: 0.83%
	Log Loss Training: 13.630684870068873, Log Loss Testing: 13.92991435328199
	Precision class 1: 51.52% 
	Precision class 0: 65.52%
	Recall class 1: 38.78% 
	Recall class 0: 76.12%
	F1: 44.25%
	ROC AUC Training: 65.84% Testing: 64.27%
It seems the previous model, before removing the low-coefficient features, performed better. We will tune this model and see whether the accuracy improves.
from sklearn.model_selection import GridSearchCV
log_param = [
	{
		'C': [0.01, 0.1, 1.0, 10, 100], 
		'penalty': ['l2', 'l1'],
		'solver': ['newton-cg', 'saga'],
		'max_iter': [200,400,600] 
	},
]

#tuned the model using gridsearch and kfold
opt_log = GridSearchCV(LogisticRegression(), param_grid = log_param, scoring='accuracy', cv=10, verbose = 2)
#fit the tuned model with training dataset
opt_log.fit(xtrain2, ytrain2) 
print(opt_log.best_params_) #best parameter tuning result
	{'C': 100, 'max_iter': 600, 'penalty': 'l1', 'solver': 'saga'}
Fit the model with the new parameters from hyperparameter tuning.
lr3 = LogisticRegression(C=100, penalty='l1', solver='saga', max_iter=600)
lr3.fit(xtrain2, ytrain2)
eval(lr3, 'logreg_coef_tuned', xtrain2, ytrain2, xvalid2, yvalid2, xtest2, ytest2)
	Training accuracy: 64.44% 
	Validation accuracy: 63.32%
	Testing accuracy: 64.92% 
	Overfitting: -0.48%
	Log Loss Training: 12.81777927611375, Log Loss Testing: 12.645750373838808
	Precision class 1: 57.09% 
	Precision class 0: 68.52%
	Recall class 1: 45.50% 
	Recall class 0: 77.62%
	F1: 50.64%
	ROC AUC Training: 69.33% Testing: 68.66%
Unfortunately, reducing the low-coefficient features does not perform better than the previous feature set. We will try a different algorithm next.

Model 2 - Support Vector Machine

from sklearn.svm import SVC
svm_v1 = SVC(probability=True)
svm_v1.fit(xtrain, ytrain)
eval(svm_v1, 'svm1', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 63.85% 
	Validation accuracy: 60.06%
	Testing accuracy: 61.11% 
	Overfitting: 2.74%
	Log Loss Training: 13.028263760262844, Log Loss Testing: 14.016976317990002
	Precision class 1: 51.24% 
	Precision class 0: 64.74%
	Recall class 1: 34.81% 
	Recall class 0: 78.32%
	F1: 41.45%
	ROC AUC Training: 68.65% Testing: 62.84%
We will try hyperparameter tuning to see if we can improve the performance.
param_grid = [
	{
		'C': [0.01, 0.1, 1.0, 10, 100], 
		'gamma': ['scale', 0.001, 0.01],
		'kernel': ['rbf','linear','sigmoid', 'poly']
	},
]

#tuned the model using gridsearch and kfold
opt_log = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=10)
#fit the tuned model with training dataset
opt_log.fit(xtrain, ytrain)
#best parameter tuning result
print(opt_log.best_params_)
	{'C': 100, 'gamma': 'scale', 'kernel': 'poly'}
svm_v2 = SVC(probability=True, C=100, gamma='scale', kernel='poly')
svm_v2.fit(xtrain, ytrain)
eval(svm_v2, 'svm2', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 70.18% 
	Validation accuracy: 61.93%
	Testing accuracy: 63.83% 
	Overfitting: 6.35%
	Log Loss Training: 10.749224862924386, Log Loss Testing: 13.037529215024863
	Precision class 1: 55.45% 
	Precision class 0: 67.60%
	Recall class 1: 43.51% 
	Recall class 0: 77.12%
	F1: 48.76%
	ROC AUC Training: 76.66% Testing: 65.07%
Both SVM models we created have an overfitting problem. After tuning, the accuracy score increases, but so does the overfitting.

Model 3 - Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(xtrain, ytrain)
eval(rf, 'random_forest', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
				Training accuracy: 99.74% 
				Validation accuracy: 67.49%
				Testing accuracy: 66.55% 
				Overfitting: 33.19%
				Log Loss Training: 0.09435511358407653, Log Loss Testing: 12.058082112059724
				Precision class 1: 55.78% 
				Precision class 0: 78.54%
				Recall class 1: 74.35% 
				Recall class 0: 61.44%
				F1: 63.74%
				ROC AUC Training: 100.00% Testing: 76.33%
The Random Forest classifier gives an outstanding accuracy score on the training data but does not perform the same on the validation and test sets, so it overfits. Since full hyperparameter tuning would be computationally expensive, we will inspect the validation curves manually.
Function to calculate and visualise the validation curve:
from sklearn.model_selection import validation_curve

def validations(model, param, prange, x, y, k):
	train_scoreNum, test_scoreNum = validation_curve(
								model,
								X = x, y = y,
								param_name = param,
								param_range = prange, cv = k, scoring = "accuracy")
	# Calculating mean and standard deviation of training score
	mean_train_score = np.mean(train_scoreNum, axis = 1)
	std_train_score = np.std(train_scoreNum, axis = 1)

	# Calculating mean and standard deviation of testing score
	mean_test_score = np.mean(test_scoreNum, axis = 1)
	std_test_score = np.std(test_scoreNum, axis = 1)

	# Plot mean accuracy scores for training and testing scores
	plt.plot(prange, mean_train_score,
		label = "Training Score", color = 'b')
	plt.plot(prange, mean_test_score,
		label = "Cross Validation Score", color = 'g')

	# Creating the plot
	plt.title("Validation Curve with RF Classifier")
	plt.xlabel(param)
	plt.ylabel("Accuracy")
	plt.tight_layout()
	plt.legend(loc = 'best')
	plt.show()
parameter_range = np.arange(2, 100, 1)  # min_samples_split must be at least 2
validations(rf, 'min_samples_split', parameter_range, xtrain, ytrain, 10)

parameter_range = np.arange(1, 15, 1)
validations(rf, 'max_depth', parameter_range, xtrain, ytrain, 10)

parameter_range = np.arange(1, 10, 1)
validations(rf, 'n_estimators', parameter_range, xtrain, ytrain, 10)
validations(rf, 'min_samples_leaf', parameter_range, xtrain, ytrain, 10)

parameter_range = np.arange(1, 10, 1)
validations(rf, 'max_features', parameter_range, xtrain, ytrain, 10)
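The helper above computes the standard deviation of the fold scores but only plots the means. If we also wanted to see the spread, the bands could be shaded with plt.fill_between; the following is a small optional sketch (not part of the original notebook) for a single parameter, reusing the same rf, xtrain and ytrain:
# Optional sketch: validation curve for max_depth with +/- 1 std bands
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve

prange = np.arange(1, 15, 1)
train_s, test_s = validation_curve(rf, X=xtrain, y=ytrain, param_name='max_depth',
	param_range=prange, cv=10, scoring='accuracy')
mean_tr, std_tr = train_s.mean(axis=1), train_s.std(axis=1)
mean_te, std_te = test_s.mean(axis=1), test_s.std(axis=1)

plt.plot(prange, mean_tr, color='b', label='Training Score')
plt.fill_between(prange, mean_tr - std_tr, mean_tr + std_tr, color='b', alpha=0.15)
plt.plot(prange, mean_te, color='g', label='Cross Validation Score')
plt.fill_between(prange, mean_te - std_te, mean_te + std_te, color='g', alpha=0.15)
plt.title('Validation Curve with RF Classifier (max_depth)')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.show()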
Fitting the parameters based on the validation_curve visualisations
RF_tuned = RandomForestClassifier(min_samples_split=100, max_depth=3, min_samples_leaf=15, n_estimators=1, max_features=7)
RF_tuned.fit(xtrain, ytrain)
eval(RF_tuned, 'rf_tuned', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 63.15% 
	Validation accuracy: 62.48%
	Testing accuracy: 61.90% 
	Overfitting: 1.25%
	Log Loss Training: 13.282296758373823, Log Loss Testing: 13.734024932688962
	Precision class 1: 51.35% 
	Precision class 0: 74.09%
	Recall class 1: 69.62% 
	Recall class 0: 56.84%
	F1: 59.11%
	ROC AUC Training: 68.15% Testing: 65.19%
We successfully reduced the overfitting of our Random Forest model from 33.19% to 1.25%. However, the accuracy score also dropped, so we will move on to our next model.

Model 4 - K-Nearest Neighbor Classifier

from sklearn.neighbors import KNeighborsClassifier
knn1 = KNeighborsClassifier()
knn1.fit(xtrain, ytrain)
eval(knn1, 'knn', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 72.51% 
	Validation accuracy: 56.92%
	Testing accuracy: 58.82% 
	Overfitting: 13.70%
	Log Loss Training: 9.907286926328014, Log Loss Testing: 14.84406498271612
	Precision class 1: 47.65% 
	Precision class 0: 64.75%
	Recall class 1: 41.83% 
	Recall class 0: 69.93%
	F1: 44.55%
	ROC AUC Training: 78.96% Testing: 58.57%
Our KNN model suffers from the same overfitting problem as the previous models, and it also yields the lowest accuracy score so far. We will apply the same hyperparameter tuning approach. The grid below fixes n_neighbors at 19 based on an accuracy-versus-n_neighbors plot; a minimal sketch of how such a plot can be produced is shown next.
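Sketch (an assumed reconstruction, not the original plotting cell), reusing the xtrain/ytrain and xvalid/yvalid splits from earlier:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# try odd values of k and record the validation accuracy for each
neighbors = list(range(1, 40, 2))
valid_acc = []
for k in neighbors:
	knn_k = KNeighborsClassifier(n_neighbors=k).fit(xtrain, ytrain)
	valid_acc.append(knn_k.score(xvalid, yvalid))

plt.plot(neighbors, valid_acc, marker='o')
plt.xlabel('n_neighbors')
plt.ylabel('Validation accuracy')
plt.title('Choosing n_neighbors for KNN')
plt.show()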
#parameter dict
tuning_params = {
	'n_neighbors' : [19], #from the plot above
	"leaf_size":[5,10,20,30],
	"p":[1,2],
	'algorithm':['ball_tree', 'kd_tree', 'brute', 'auto']
}

#implement hyperparameter tuning to the model
opt_knn = GridSearchCV(KNeighborsClassifier(), param_grid = tuning_params, cv = 10, scoring='accuracy')
#fit the model with training dataset
opt_knn.fit(xtrain, ytrain) 
print(opt_knn.best_params_)
	{'algorithm': 'ball_tree', 'leaf_size': 10, 'n_neighbors': 19, 'p': 1}
knn_tuned = KNeighborsClassifier(leaf_size=10, n_neighbors=19, p=1, algorithm='ball_tree')
knn_tuned.fit(xtrain, ytrain)
eval(knn_tuned, 'knn_tuned', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 65.18% 
	Validation accuracy: 58.91%
	Testing accuracy: 59.66% 
	Overfitting: 5.52%
	Log Loss Training: 12.54923010668215, Log Loss Testing: 14.539348106238078
	Precision class 1: 48.77% 
	Precision class 0: 64.77%
	Recall class 1: 39.39% 
	Recall class 0: 72.93%
	F1: 43.58%
	ROC AUC Training: 69.89% Testing: 61.47%
We were able to reduce the overfitting percentage and slightly increase the accuracy score. Even so, this model remains the weakest performer so far.

Model 5 - Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(xtrain, ytrain)
eval(dt, 'decision_tree', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 99.74% 
	Validation accuracy: 65.26%
	Testing accuracy: 66.00% 
	Overfitting: 33.74%
	Log Loss Training: 0.09435511358407653, Log Loss Testing: 12.253971532652752
	Precision class 1: 56.18% 
	Precision class 0: 74.01%
	Recall class 1: 63.82% 
	Recall class 0: 67.43%
	F1: 59.76%
	ROC AUC Training: 100.00% Testing: 65.68%
From the results above, the decision tree produces results similar to the Random Forest Classifier: near-perfect training accuracy but severe overfitting.
#parameter grid (note: np.linspace with num = 1 returns only its start value, so each of
#these lists contains just [1]; use num = 20 to actually cover the 1-20 range)
random_tree = {'criterion': ['gini', 'entropy'],
	'max_depth': [int(x) for x in np.linspace(start = 1, stop = 20, num = 1)],
	'min_samples_split': [int(x) for x in np.linspace(start = 1, stop = 20, num = 1)],
	'min_samples_leaf': [int(x) for x in np.linspace(start = 1, stop = 20, num = 1)]}

#implement hyperparameter tuning to the model
opt_decTree = GridSearchCV(DecisionTreeClassifier(), param_grid= random_tree, cv = 10, scoring='accuracy')
#fit the model with training dataset
opt_decTree.fit(xtrain, ytrain) 
print(opt_decTree.best_params_)
	{'criterion': 'gini', 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 1}
dt_tuned = DecisionTreeClassifier(criterion='gini', max_depth=1, min_samples_leaf=1, min_samples_split=1)
dt_tuned.fit(xtrain, ytrain)
eval(dt_tuned, 'dt_tuned', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 67.52% 
	Validation accuracy: 66.47%
	Testing accuracy: 66.79% 
	Overfitting: 0.73%
	Log Loss Training: 11.707292170085777, Log Loss Testing: 11.97102014735171
	Precision class 1: 55.45% 
	Precision class 0: 82.54%
	Recall class 1: 81.53% 
	Recall class 0: 57.14%
	F1: 66.01%
	ROC AUC Training: 69.48% Testing: 69.33%
The hyperparameter tuning lowers the overfitting by more than 30 percentage points and slightly increases the accuracy score; with max_depth=1 the tuned tree is effectively a decision stump that splits on a single feature, which explains the minimal overfitting. We will implement our last algorithm before comparing which one performs best.

Model 6 - Naive Bayes

from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(xtrain, ytrain)
eval(nb, 'naive', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 62.00% 
	Validation accuracy: 60.06%
	Testing accuracy: 63.10% 
	Overfitting: -1.10%
	Log Loss Training: 13.696007641011693, Log Loss Testing: 13.298715109148901
	Precision class 1: 57.86% 
	Precision class 0: 64.17%
	Recall class 1: 24.73% 
	Recall class 0: 88.21%
	F1: 34.65%
	ROC AUC Training: 65.56% Testing: 64.16%
Unlike our other models, this one shows signs of underfitting: the testing accuracy is slightly above the training accuracy, and both are low. We will perform the same hyperparameter tuning procedure; GaussianNB has essentially one tunable parameter, var_smoothing, which we search over a log-spaced grid (see the short illustration below).
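For context, this is what the log-spaced grid used in the next cell contains (illustration only, not part of the original notebook):
import numpy as np

grid = np.logspace(0, -9, num=100)   # 100 log-spaced values from 1e0 down to 1e-9
print(len(grid), grid[0], grid[-1])  # 100 1.0 1e-09
print(np.logspace(0, -9, num=5))     # coarser view: roughly [1e0, 5.6e-3, 3.2e-5, 1.8e-7, 1e-9]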
params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
opt = GridSearchCV(GaussianNB(), param_grid=params_NB, cv=10,verbose=1,scoring='accuracy')
opt.fit(xtrain, ytrain) 
print(opt.best_params_)
	{'var_smoothing': 1.873817422860383e-08}
nbtuned = GaussianNB(var_smoothing=1.873817422860383e-08)
nbtuned.fit(xtrain, ytrain)
eval(nbtuned, 'naive_tuned', xtrain, ytrain, xvalid, yvalid, xtest, ytest)
	Training accuracy: 62.08% 
	Validation accuracy: 60.12%
	Testing accuracy: 62.92% 
	Overfitting: -0.84%
	Log Loss Training: 13.666975298370438, Log Loss Testing: 13.36401158267991
	Precision class 1: 57.19% 
	Precision class 0: 64.11%
	Recall class 1: 24.89% 
	Recall class 0: 87.81%
	F1: 34.68%
	ROC AUC Training: 65.55% Testing: 64.13%
Unfortunately, hyperparameter tuning barely changes the results here. We will now compare all of our models and see which one performs best.

Conclusion - Model Selection

We have implemented six classification algorithms on our dataset for the bond-order prediction. Let's see which one performs best.
scores = pd.DataFrame(score_comp, columns=['Algorithm','accuracy_train','accuracy_valid','accuracy_test', 'overfit', 'precision1', 'precision0', 'recall1', 'recall0','f1','loss_train', 'loss_test', 'roc_train', 'roc_test'])
scores
Algorithm accuracy_train accuracy_valid accuracy_test overfit precision1 precision0 recall1 recall0 f1 loss_train loss_test roc_train roc_test
logreg1 0.63 0.62 0.62 0.01 0.53 0.66 0.41 0.76 0.46 13.25 13.69 0.67 0.65
logreg_tuned 0.65 0.64 0.65 0.00 0.58 0.69 0.45 0.79 0.51 12.47 12.47 0.70 0.70
logreg_coef 0.62 0.61 0.61 0.01 0.52 0.66 0.39 0.76 0.44 13.63 13.93 0.66 0.64
logreg_coef_tuned 0.64 0.63 0.65 -0.00 0.57 0.68 0.45 0.78 0.50 12.83 12.71 0.69 0.68
svm1 0.64 0.60 0.61 0.03 0.51 0.65 0.35 0.78 0.41 13.03 14.02 0.69 0.63
svm2 0.70 0.62 0.64 0.06 0.55 0.68 0.44 0.77 0.49 10.75 13.04 0.77 0.65
random_forest 1.00 0.67 0.67 0.33 0.56 0.79 0.74 0.61 0.64 0.09 12.06 1.00 0.76
rf_tuned 0.63 0.62 0.62 0.01 0.51 0.74 0.70 0.57 0.59 13.28 13.73 0.68 0.65
knn 0.73 0.57 0.59 0.14 0.48 0.65 0.42 0.70 0.45 9.91 14.84 0.79 0.59
knn_tuned 0.65 0.59 0.60 0.06 0.49 0.65 0.39 0.73 0.44 12.55 14.54 0.70 0.61
decision_tree 1.00 0.65 0.66 0.34 0.56 0.74 0.64 0.67 0.60 0.09 12.25 1.00 0.66
dt_tuned 0.68 0.66 0.67 0.01 0.55 0.83 0.82 0.57 0.66 11.71 11.97 0.69 0.69
naive 0.62 0.60 0.63 -0.01 0.58 0.64 0.25 0.88 0.35 13.70 13.30 0.66 0.64
naive_tuned 0.62 0.60 0.63 -0.01 0.57 0.64 0.25 0.88 0.35 13.67 13.36 0.66 0.64
From the score table above, we can see that KNN performs the worst of all the models. Looking at how precisely each model predicts class 1 and class 0, the tuned logistic regression holds up better than most of the alternatives. However, the scores are still not satisfying; with more data to train on, the models would likely reach a higher accuracy. For now, the tuned logistic regression is our preferred option, with a 65% accuracy score, 58% precision on class 1 and 69% precision on class 0.
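As a quick check on this conclusion, the comparison table can be sorted by test accuracy and F1 (a small optional sketch, assuming the scores DataFrame built above):
ranked = scores.sort_values(['accuracy_test', 'f1'], ascending=False)
print(ranked[['Algorithm', 'accuracy_test', 'precision1', 'precision0', 'f1', 'overfit']])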