Machine Learning (ML) is a branch of artificial intelligence (AI). In algorithmic trading, it helps make better predictions and trading decisions, and it has long been an important method across the financial services industry.
Want to know more about machine learning for finance and all the how-tos? Read on.
What is Machine Learning?
Machine Learning refers to the method for building algorithms that improve themselves through experience. This method has three categories – supervised learning, unsupervised learning, and reinforcement learning.
Supervised Learning refers to the situation where we provide the machine with some inputs and their desired outcomes. With this information, the machine learns to produce outputs close to the ones we need.
With Unsupervised Learning, inputs are provided, but the desired outcomes are not. The machine needs to figure out the underlying structure of the data on its own.
Lastly, in Reinforcement Learning, the program navigates a certain environment with a specified objective, and the environment provides feedback in the form of rewards that the program tries to maximize.
Machine Learning in the Finance World
In finance, a huge volume of data and information needs to be processed. And that is where machine learning comes into play.
Algorithmic traders use ML to make better decisions, predictions, and calculations. With ML, processing a huge volume of data in a timely and efficient manner is possible.
Some platforms, such as robo-advisors, are built on ML principles. With minimal supervision, these platforms offer traders comprehensive planning, portfolio management, security, and much more. Banks also use ML to detect fraudulent behavior and possible anomalies.
Bank Default and Machine Learning
So, how can a bank predict if the user will default in the following months? Can machine learning help with this problem?
Before we go deeper, let’s learn and understand a few basic things.
What is a bank default?
A bank default is the failure to repay a debt, whether the principal or the interest of the loan. Default happens when the user misses payments, avoids them, or stops paying outright.
Secured vs. Unsecured Debt
With secured debt, like a home mortgage, the bank can reclaim the home if the user defaults.
With unsecured debt, like credit card debt and utility bills, there is no collateral to reclaim, so a default often leads to legal disputes.
Finding and Loading the Data
You might wonder how we can obtain bank data without getting into trouble by hacking for it or paying too much money for it. Well, thanks to Kaggle, we can!
Kaggle is a data science platform that hosts public datasets along with courses and competitions for learning and practice.
Learning from A Real-Life Example
In this article, we will learn from an actual example. We will look into a dataset from the UCI Machine Learning Repository. It features data about default payments of credit card clients of a Taiwanese bank in 2005. The data runs from April to September and has 25 variables.
First, download the datasets from Kaggle.
Import pandas to gain access to the dataset and load it.
Use this command to print out a few rows. It will give you a glance at the dataset.
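As a minimal sketch of loading and previewing the data, the snippet below reads a tiny inline CSV that stands in for the downloaded file. The rows are made up for illustration, and the file name mentioned in the comment is an assumption, so adjust it to whatever you downloaded.

```python
import io

import pandas as pd

# Made-up rows using a few of the dataset's column names (not real records).
# With the real file you would pass its path instead, for example
# pd.read_csv("UCI_Credit_Card.csv"), where the file name is an assumption.
csv_text = """ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,PAY_0,default.payment.next.month
1,20000,2,2,1,2,1
2,120000,1,1,2,-1,0
3,90000,2,3,2,0,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows (five by default) for a quick glance.
print(df.head())
```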
Take note of the following descriptions that can also be found on the Kaggle website.
- ID – id of each user
- LIMIT_BAL – the amount of given credit in NT dollars
- SEX – gender (1 = male, 2 = female)
- EDUCATION – level of acquired education
- MARRIAGE – marital status
- AGE – age in years
- PAY_0, PAY_2 to PAY_6 – repayment status from September to April (the dataset has no PAY_1 column)
- BILL_AMT1 to BILL_AMT6 – the amount of bill statement from September to April
- PAY_AMT1 to PAY_AMT6 – the amount of the previous payment from September to April
- default.payment.next.month – Default payment (1 = yes, 0 = no)
Next, specify the first column (ID) as our index and treat empty strings as missing values.
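One way to do this with pandas is through the index_col and na_values arguments of read_csv. The sketch below uses a small inline CSV with made-up rows in place of the real file.

```python
import io

import pandas as pd

# Made-up rows; the second row has an empty LIMIT_BAL field.
csv_text = """ID,LIMIT_BAL,EDUCATION,default.payment.next.month
1,20000,2,1
2,,1,0
3,90000,3,0
"""

# index_col=0 makes the first column (ID) the index;
# na_values="" tells pandas to read empty strings as missing values.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values="")

print(df.index.name)
print(df["LIMIT_BAL"].isna().sum())
```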
The EDUCATION column contains the values 0, 5, and 6, which are labeled Unknown, so we will add them to the Other category (4). The MARRIAGE column also contains a 0, so we will take care of that too.
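A minimal sketch of that cleanup on a made-up frame follows. Folding the MARRIAGE value 0 into 3 (the dataset's "others" code) is an assumption about how to "take care of" it; pick whatever mapping suits your analysis.

```python
import pandas as pd

# Made-up values covering every code we want to clean up.
df = pd.DataFrame({
    "EDUCATION": [0, 1, 2, 3, 4, 5, 6],
    "MARRIAGE": [0, 1, 2, 3, 1, 2, 3],
})

# Fold the Unknown EDUCATION codes 0, 5 and 6 into Other (4).
df["EDUCATION"] = df["EDUCATION"].replace({0: 4, 5: 4, 6: 4})

# Assumption: treat the MARRIAGE code 0 as 3 ("others").
df["MARRIAGE"] = df["MARRIAGE"].replace({0: 3})

print(df["EDUCATION"].value_counts().sort_index())
print(df["MARRIAGE"].value_counts().sort_index())
```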
Picking Features and Labels
For our features (x), we will select all columns except the default.payment.next.month, and for the label (y), we shall pick the aforementioned column.
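In pandas, this selection could be sketched as follows, using a made-up frame with only a couple of feature columns:

```python
import pandas as pd

# Made-up rows; the real frame has many more feature columns.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000],
    "EDUCATION": [2, 2, 1],
    "default.payment.next.month": [1, 0, 0],
})

target = "default.payment.next.month"

# Features: every column except the target. Label: the target column itself.
X = df.drop(columns=[target])
y = df[target]

print(X.shape, y.shape)
```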
Making Data More Compact Using Python
Here’s how you make your data more compact so that it takes up less memory.
Here’s the function to see the first five rows:
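One common way to shrink a data frame is to downcast its numeric columns to the smallest dtype that still holds the values. The sketch below assumes that is the kind of compaction meant here and runs on a made-up frame.

```python
import numpy as np
import pandas as pd

# Made-up rows stored with pandas' default 64-bit dtypes.
df = pd.DataFrame({
    "LIMIT_BAL": np.array([20000, 120000, 90000], dtype="int64"),
    "SEX": np.array([1, 2, 2], dtype="int64"),
    "BILL_AMT1": np.array([3913.0, 2682.5, 29239.0], dtype="float64"),
})
before = df.memory_usage(deep=True).sum()

# Downcast each numeric column to the smallest dtype that fits its values.
for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="integer")
    elif pd.api.types.is_float_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="float")

after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(before, "bytes ->", after, "bytes")
print(df.head())  # first five rows (here, all three)
```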
Analyzing Data with Python
Analyzing the data is one of the most important steps. First, import the relevant libraries. Here’s the code:
Describe the data
Here’s the code:
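A minimal sketch of describing the data with pandas, on a made-up frame rather than the real dataset:

```python
import pandas as pd

# Made-up values for two numeric columns.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000],
    "BILL_AMT1": [3900, 2700, 29000, 47000],
})

# describe() reports count, mean, std, min, quartiles and max per column.
summary = df.describe()
print(summary)
```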
Plotting the data
In plotting the data, use this code:
Statistical analysis using JASP
Download the free program JASP if you want a quick way to analyze and plot data.
After launching the program, click Open, locate your csv dataset, and click on it.
Go to “Descriptives,” and drag the variables to the position you want them in. We just pick the plots and statistics we want.
Use a histogram to see whether the dependent variable varies by gender.
Then, use a violin plot, as it allows us to see the distribution for each gender.
But do the default percentages vary with education levels? Let’s see:
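If you prefer to answer this question in code rather than in a plot, a pandas group-by gives the default percentage per education level. This is an alternative sketch on made-up rows, so the percentages it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "EDUCATION": [1, 1, 2, 2, 2, 3, 3, 4],
    "default.payment.next.month": [0, 1, 0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the share of ones, so times 100 is a percentage.
default_rate = df.groupby("EDUCATION")["default.payment.next.month"].mean() * 100
print(default_rate.round(1))
```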
We are next interested in seeing the correlation matrix between the variables.
To make the matrix easier to read, we can create a heatmap that colors our correlations. The stronger the correlation, the more saturated the color is. I’ll also rename the PAY_0 variable to PAY_1.
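A minimal sketch of the rename and the correlation matrix with pandas, on made-up rows. The coloring itself is left out here; a plotting library (for example seaborn's heatmap function) could take the resulting matrix as input.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 200000],
    "PAY_0": [2, -1, 0, 1, -2],
    "default.payment.next.month": [1, 0, 0, 1, 0],
})

# Rename PAY_0 so the repayment columns are numbered consistently.
df = df.rename(columns={"PAY_0": "PAY_1"})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()
print(corr.round(2))
```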
Let’s try this. I’ll pass three features (LIMIT_BAL, EDUCATION, and default.payment.next.month) as an extension of the previous analysis.
But what if we want a detailed statistical report with just one line of code? Let’s try it:
Note: Be sure to check out the Full code section in order to see the complete pandas profiling result.
One-hot encoding is used for categorical variables because many machine learning algorithms would otherwise treat their integer codes as ordered quantities. The encoding process also makes the data more expressive.
For one-hot encoding, let’s create binary variables for EDUCATION, SEX, and MARRIAGE and add them to the data frame.
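With pandas, get_dummies is one way to do this. The sketch below runs on made-up rows and replaces each categorical column with one binary column per category.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000],
    "EDUCATION": [1, 2, 4],
    "SEX": [1, 2, 2],
    "MARRIAGE": [1, 2, 3],
})

# Each listed column is replaced by binary columns named <column>_<value>.
encoded = pd.get_dummies(df, columns=["EDUCATION", "SEX", "MARRIAGE"])
print(encoded.columns.tolist())
```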
Next, normalize the data using the MinMax scaling method.
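Min-max scaling maps each column to the range 0 to 1 using (x - min) / (max - min). A minimal sketch with scikit-learn's MinMaxScaler on made-up values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values chosen so the scaled result is easy to verify by eye.
X = pd.DataFrame({
    "LIMIT_BAL": [20000.0, 120000.0, 220000.0],
    "BILL_AMT1": [0.0, 500.0, 1000.0],
})

scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
print(X_scaled)
```

A common variation is to fit the scaler on the training split only and reuse it to transform the test split, so that no information from the test data leaks into training.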
Creating Train and Test Groups
The next step is to split the data into two groups (train and test) so that we can detect overfitting, which happens when the model uses overly complex ways to explain idiosyncrasies in the training data.
To split and see if the data is normalized, use this:
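A minimal sketch of the split with scikit-learn's train_test_split follows. Random numbers stand in for the normalized features, and the 80/20 ratio and stratified sampling are assumptions rather than settings taken from the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows of features already in [0, 1), 20% defaults.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the share of defaults the same in both groups.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
# Normalized data should stay within the 0 to 1 range.
print(X_train.min(), X_train.max())
```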
Before choosing a classifier and building a model, we need to decide what to do with missing data. Here are some ways to treat missing data:
- Delete the rows that contain missing data.
- Replace missing values with an extreme sentinel value (e.g. -99999).
- Replace missing values with a statistic (mean, median, mode).
- Predict the missing values with machine learning.
To check any missing value, here’s how:
We would see white strips in the gray columns if we had missing data.
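For a text-only check that needs no plotting, pandas can count the missing values per column. The sketch below uses a made-up frame with one gap and also shows the median replacement from the list of treatments.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing LIMIT_BAL value.
df = pd.DataFrame({
    "LIMIT_BAL": [20000.0, np.nan, 90000.0, 50000.0],
    "EDUCATION": [2, 1, 3, 2],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# One possible treatment: fill each gap with that column's median.
filled = df.fillna(df.median(numeric_only=True))
print(filled)
```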
Picking and Implementing a Classifier
When you pick a classifier, you will consider the number of features, size of the dataset, linearity, complexity, accuracy, scalability, etc.
To help you, here are some classifiers for you to choose from and compare:
- Decision tree
- Random forest
- Logistic regression
A decision tree is a structure in which each node represents a test of an attribute, the branches represent the outcomes of the test, and each leaf node represents a class label. The nodes include decision nodes (square), chance nodes (circle), and end nodes (triangle).
The decision tree is easy to interpret, doesn’t have many hyperparameters, and supports categorical and numerical variables. However, it is also unstable, and it can overfit.
First, import and start coding the classifier.
Then, call our model, fit it to the training data and compute the prediction:
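A minimal sketch of these steps with scikit-learn's DecisionTreeClassifier follows. It trains on synthetic data from make_classification instead of the credit card data, and max_depth=4 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit card data (about 20% positives).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Call the model, fit it to the training data, and compute the prediction.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

print("Test accuracy:", tree.score(X_test, y_test))
```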
Don’t forget the confusion matrix.
The main structure of our confusion matrix is the following:
- True Negative | False Positive
- False Negative | True Positive
- TN – predicts a good client, and the client didn’t default
- FN – predicts a good client, but the client defaults
- TP – predicts a default, and the client defaults
- FP – predicts a default, but the client did not default
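The layout above matches what scikit-learn's confusion_matrix returns for binary labels, where 1 means default. A small sketch with made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions: 1 = default, 0 = no default.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```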
The prediction report has the following metrics:
- Accuracy – the model’s overall ability to predict the target correctly. ((TP + TN) / (TP + FP + TN + FN))
- Precision – out of all clients predicted to default, how many actually defaulted. (TP / (TP + FP))
- Recall – out of all clients who actually defaulted, how many were predicted correctly. (TP / (TP + FN))
- F-1 Score – the harmonic mean of precision and recall. (2 × Precision × Recall / (Precision + Recall))
- Specificity – out of all clients who did not default, how many were correctly predicted as non-defaulters. (TN / (TN + FP))
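To see the formulas at work, here is a small worked example in plain Python. The four counts are made up for illustration.

```python
# Made-up confusion matrix counts.
tn, fp, fn, tp = 4, 1, 1, 2

accuracy = (tp + tn) / (tp + fp + tn + fn)          # 6 / 8
precision = tp / (tp + fp)                          # 2 / 3
recall = tp / (tp + fn)                             # 2 / 3
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
specificity = tn / (tn + fp)                        # 4 / 5

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} specificity={specificity:.3f}")
```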
In order to evaluate the performance of the model, we can use the Precision-Recall curve.
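A minimal sketch of computing the curve with scikit-learn's precision_recall_curve follows. The four scores are made up; with a real model they would be the predicted probabilities of default, for example from predict_proba.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up true labels and predicted default probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (precision, recall) pair per decision threshold, plus a final (1, 0) point.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Average precision summarizes the curve in a single number.
ap = average_precision_score(y_true, y_score)

print(precision, recall, thresholds)
print("Average precision:", round(ap, 3))
```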
Let’s compare the two predictions before and after the resampling.
Let’s split it into two groups and run the classifier again to see the change.
Now, let’s calculate the Receiver Operating Characteristic (ROC) curve.
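As a minimal sketch, scikit-learn's roc_curve returns the false positive rate and true positive rate at each threshold, and roc_auc_score gives the area under the curve. The labels and scores below are made up for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted default probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 0.5 is random guessing, 1.0 is a perfect ranking.
auc = roc_auc_score(y_true, y_score)

print(fpr, tpr)
print("AUC:", auc)
```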
Now we move on to the k-nearest neighbors (KNN) classifier.
KNN is intuitive and easy to interpret. It is also easy to use, makes no assumptions about the underlying data distribution, and can be used for both regression and classification. However, KNN is quite slow at prediction time on large datasets. It also struggles with imbalanced data and can’t deal with missing values.
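A minimal KNN sketch with scikit-learn's KNeighborsClassifier follows. It uses synthetic data in place of the credit card data, and n_neighbors=5 is simply the library default rather than a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the credit card data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Each test point gets the majority class of its 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Test accuracy:", knn.score(X_test, y_test))
```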
Parameters vs Hyperparameters
So, what is the difference between parameters and hyperparameters?
Parameters refer to the internal attributes of the model. They are learned during the training phase and estimated based on data.
On the other hand, hyperparameters are the external attributes. They are set before the training phase and require tuning for better performance.
Tuning the hyperparameters can be done through cross-validation and grid search. K-fold cross-validation works in three main steps:
- Splits the training data randomly into k folds.
- Trains the model on k-1 folds and evaluates its performance on the remaining fold.
- Repeats the process k times and averages the results.
Let’s check it out:
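A minimal sketch of grid search with 5-fold cross-validation, using scikit-learn's GridSearchCV on a KNN classifier, follows. The data is synthetic, and the parameter grid and the F1 scoring choice are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: 3 x 2 = 6 hyperparameter combinations to try.
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

# Each combination is trained on 4 folds and scored on the 5th, five times over.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring="f1")
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Mean cross-validated F1:", round(search.best_score_, 3))
```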
So, we have this:
Tuning the Parameters
Interpreting the Data
Banks need to understand what a machine learning model does and which features matter most to it. It is important that they can explain the role each variable plays in the algorithm.
So, how do we get an idea of the main features that have the most predictive value? Well, scikit-learn offers us a feature importance function.
Our Random Forest uses an impurity metric to choose the best splits while growing its trees. While training the classifier, we can compute how much each feature contributes to the decrease in impurity.
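In scikit-learn, these impurity-based scores are exposed through the feature_importances_ attribute of a fitted forest. The sketch below uses synthetic data, and the feature names are placeholders rather than the dataset's columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Mean decrease in impurity per feature, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```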
We also use the SHAP library to compute the importance of features. Shapley values remove the order effect by averaging each feature’s contribution over all the possible orderings of the features.
Note: You can access the full code from this GitHub Link.