Random Forest and its Parameters

Jai Kushwaha
4 min read · Sep 25, 2020

An elementary parallel between a Decision Tree and a Random Forest

A simple example from the banking sector is credit disbursal, which carries high risk in the current environment given the rise in NPAs (non-performing assets). Still, the bank has to do business, and for a customer to get a loan the bank has to take a decision.

If the bank bases its decision on a single variable such as the credit (CIBIL) score, approving the loan when the score is high and rejecting the proposal when it is low, this forms a simple example of a decision tree.

If the bank takes multiple variables into consideration, covering several aspects of the customer such as loan amount, age or type of business, it can arrive at cut-off points that grow the business while keeping risk in check. When many such decisions are taken together, this cloud of decisions turns a single decision tree into a forest.

Concise definition of Decision Tree

A decision tree is a supervised machine learning algorithm that can be used for both classification and regression problems. A decision tree is simply a series of sequential decisions made to reach a specific result.
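
To make this concrete, here is a minimal sketch in R of a single-variable tree on a made-up loan dataset (the rpart package is used here, and the data and column names are purely illustrative):

# A toy loan dataset: approve/reject decisions driven by a single score
library(rpart)

loan_data <- data.frame(
  cibil_score = c(780, 640, 710, 590, 820, 600),
  default     = factor(c("No", "Yes", "No", "Yes", "No", "Yes"))
)

# A single-variable decision tree: the split gives the score cut-off
tree_model <- rpart(default ~ cibil_score, data = loan_data,
                    method = "class",
                    control = rpart.control(minsplit = 2))
print(tree_model)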

Random Forest Elaborated

The decision tree algorithm is quite easy to understand and interpret. But often, a single tree is not sufficient for producing effective results. This is where the Random Forest algorithm comes into the picture.

Random Forest is a tree-based machine learning algorithm that leverages the power of multiple decision trees for making decisions. As the name suggests, it is a “forest” of trees!

But why do we call it a “random” forest? That’s because it is a forest of randomly created decision trees. Each node in the decision tree works on a random subset of features to calculate the output. The random forest then combines the output of individual decision trees to generate the final output.

How Random Forest is formulated

Random Forest is a collective decision based on the votes of multiple trees. In algorithmic terms, the "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of bagging is that a combination of learning models improves the overall result.
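
As a rough sketch of bagging, reusing the toy loan_data frame from above: each tree is trained on a bootstrap sample of the rows, and the "forest" prediction is a majority vote over the trees (all settings here are illustrative):

# Bagging by hand: bootstrap samples + majority vote (conceptual sketch only)
set.seed(42)
n_trees <- 5
n_rows  <- nrow(loan_data)

votes <- sapply(seq_len(n_trees), function(i) {
  boot_idx <- sample(n_rows, replace = TRUE)       # bootstrap sample of rows
  tree_i <- rpart(default ~ cibil_score,
                  data = loan_data[boot_idx, ],
                  method = "class",
                  control = rpart.control(minsplit = 2))
  as.character(predict(tree_i, loan_data, type = "class"))
})

# Majority vote across the trees gives the forest's prediction
forest_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

A real random forest additionally samples a random subset of features at each candidate split, which is what the randomForest package handles for us in the example below.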

Let’s see an example

Importing the datasets:
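
A minimal sketch of the import step, assuming the data sit in two CSV files (file and column names here are assumptions, not the ones from the original post):

train <- read.csv("loan_train.csv", stringsAsFactors = TRUE)
test  <- read.csv("loan_test.csv",  stringsAsFactors = TRUE)

# The target is assumed to be a binary flag (0 = non-default, 1 = default)
train$target <- as.factor(train$target)
test$target  <- as.factor(test$target)
str(train)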

Random Forest using the randomForest package in R

The parameters supplied are the target (i.e. dependent) variable, the training data, ntree, mtry, nodesize and importance.
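
A sketch of the corresponding call (the hyper-parameter values shown are assumptions for illustration):

library(randomForest)

set.seed(123)
rf <- randomForest(target ~ .,          # target (dependent) variable vs. all predictors
                   data       = train,  # training data
                   ntree      = 500,    # number of trees grown
                   mtry       = 4,      # variables tried at each split
                   nodesize   = 10,     # minimum size of terminal nodes
                   importance = TRUE)   # compute variable importance measures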

When we print the fitted rf object we see a term called OOB, i.e. Out of Bag. The out-of-bag (OOB) error is the average prediction error for each training observation, calculated using only the trees whose bootstrap sample did not contain that observation.
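
print(rf)   # reports the OOB estimate of the error rate and the OOB confusion matrix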

Error rate of the Random Forest for classes 0 and 1
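
This error-rate plot can be reproduced by calling plot() on the fitted object; the first curve is the OOB error and the remaining curves are the class-wise errors for classes 0 and 1:

plot(rf)
legend("topright", legend = colnames(rf$err.rate), fill = 1:ncol(rf$err.rate))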

Then comes the part where we want to know the variable importance.

Random Forest computes two measures of variable importance:

  • Mean Decrease in Accuracy
  • Mean Decrease in Gini
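
Both measures are available from the fitted object (provided importance = TRUE was set when training, as above):

importance(rf)    # columns MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(rf)    # plots both importance measures side by side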

Tuning the Random Forest parameters
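
One way to search over mtry is tuneRF from the randomForest package; a sketch, with the step factor, improvement threshold and tree count chosen only for illustration:

set.seed(123)
tuned <- tuneRF(x = train[, setdiff(names(train), "target")],
                y = train$target,
                ntreeTry   = 500,    # trees grown for each mtry value tried
                stepFactor = 1.5,    # mtry is scaled by this factor at each step
                improve    = 0.01,   # minimum relative OOB improvement to keep searching
                trace      = TRUE,
                plot       = TRUE)   # plots OOB error against mtry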

For mtry = 4, the OOB error is the lowest.

Model Performance and statistics

AUC, KS (the Kolmogorov–Smirnov statistic) and Gini are compared with those of a single decision tree.
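
A sketch of how these statistics could be computed on the test set, assuming the ROCR package (Gini is derived from AUC as 2·AUC − 1):

library(ROCR)

prob <- predict(rf, test, type = "prob")[, 2]          # predicted probability of class 1
pred <- prediction(prob, test$target)

auc  <- performance(pred, "auc")@y.values[[1]]         # area under the ROC curve
perf <- performance(pred, "tpr", "fpr")
ks   <- max(perf@y.values[[1]] - perf@x.values[[1]])   # Kolmogorov–Smirnov statistic
gini <- 2 * auc - 1                                    # Gini coefficient from AUC

c(AUC = auc, KS = ks, Gini = gini)

The same three numbers can be computed from the decision tree's predicted probabilities to make the comparison.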

Conclusion:

Random Forest is suitable for situations where we have a large dataset and interpretability is not a major concern.

Also, Random Forest has a higher training time than a single decision tree. You should take this into consideration, because as we increase the number of trees in a random forest, the total training time also increases. That can often be crucial when you are working with a tight deadline in a machine learning project.

But I will say this — despite instability and dependency on a particular set of features, decision trees are really helpful because they are easier to interpret and faster to train. Anyone with very little knowledge of data science can also use decision trees to make quick data-driven decisions.


Jai Kushwaha

I am a Senior Consultant in Analytics and Model Development with 11+ years of experience and domain expertise in BFSI.