To compare the accuracy of supervised machine learning algorithms, we used the “prostate” dataset available in the “ElemStatLearn” package in R. “prostate” is a dataset for examining the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy.
This is a data frame with the following 9 variables.
lcavol: log cancer volume
lweight: log prostate weight
age: in years
lbph: log of the amount of benign prostatic hyperplasia
svi: seminal vesicle invasion (binary data, 1 for Yes, 0 for No)
lcp: log of capsular penetration
gleason: Gleason score (a numeric vector)
pgg45: percent of Gleason score 4 or 5
lpsa: log of prostate-specific antigen
Our aim is to understand three regression-based models used for prediction in supervised machine learning: the linear model, the support vector machine (SVM) model and the regression tree model.
We split the “prostate” dataset into two subsets, a training set and a test set. To train our models well, we kept 70% of the data in the training set and 30% in the test set.
We will try to predict the svi values in the test set using the different models trained on the training set.
Before modelling, we did a preliminary analysis of the relationships between the variables in our training set and of how the svi variable depends on them.
Figure 1 shows that most of the affected men are between 60 and 70 years old. A red dot represents an observation with svi=1 and a green dot an observation with svi=0 (svi=1 means the tumour has invaded the seminal vesicles). Almost all red dots lie above the mean line of each Y-axis variable, which suggests that these variables are important for predicting svi in the test set.
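The 70/30 split described above can be sketched as follows. This is a minimal illustration, not the exact code of the study: the helper name split_data, the seed value and the stand-in data frame demo are assumptions added so that the example runs on its own.

```r
# `split_data` is a hypothetical helper name; the seed value is arbitrary.
split_data <- function(df, train_frac = 0.7, seed = 42) {
  set.seed(seed)                                   # make the random draw reproducible
  n_train <- round(train_frac * nrow(df))          # number of rows for training
  train_idx <- sample(seq_len(nrow(df)), n_train)  # rows chosen at random
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Synthetic stand-in data (NOT the prostate data); replace with the real data frame.
demo <- data.frame(x = rnorm(100), svi = rbinom(100, 1, 0.3))
parts <- split_data(demo)
train <- parts$train   # 70% of the rows
test  <- parts$test    # remaining 30%
```

Fixing the seed means the same rows land in the training set on every run, which keeps accuracy figures comparable between models.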
- Linear model with logistic regression: The lm() function carries out linear modelling in R. Because the outcome of a linear model is a continuous value and we have to predict a discrete (categorical) value, we borrow the idea of logistic regression to assign each outcome to its category (either 0 or 1).
To achieve this we use the sigmoid function shown in Figure 2, with the mean of the linear model outcomes as the cut-off that decides the category of each outcome. The commands for the linear model in R are:
Lin_model <- lm(formula= svi~.,data=train)
Predicted <- predict(Lin_model,test)
We passed two arguments to lm(): a formula and the training dataset. In the formula, the dependent variable is on the left side of the ~ sign and the dot stands for all other variables in the given dataset. You can print Lin_model to see the coefficient estimated for each independent variable. The predict() function returns the outcomes of the linear model; it takes two arguments, the fitted model object and the test set. As stated above, we then categorised the outcomes in the manner of logistic regression: values greater than the cut-off go into category 1 and the rest into category 0. If we also scale the outcomes of lm() to lie between 0 and 1, they can be read as a rough probability of falling in category 1.
As Figure 1 showed, the data are quite descriptive because the population with svi=1 is distinguishable from the population with svi=0. This is why the linear model seems to be an effective approach; its accuracy was 83%.
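The cut-off step can be sketched as below. The data here are synthetic stand-ins (an assumption, so that the example is self-contained), and the variable names raw, cutoff, scaled and accuracy are illustrative; only lm() and predict() are taken from the commands above.

```r
# Synthetic stand-in data (NOT the prostate data): one predictor, binary outcome.
set.seed(1)
x <- rnorm(200)
demo <- data.frame(x = x, svi = as.integer(x + rnorm(200, sd = 0.5) > 0.8))
train <- demo[1:140, ]
test  <- demo[141:200, ]

Lin_model <- lm(svi ~ ., data = train)
raw <- predict(Lin_model, test)          # continuous scores, may fall outside [0, 1]

cutoff <- mean(raw)                      # cut-off at the mean of the outcomes
Predicted <- ifelse(raw > cutoff, 1, 0)  # above the cut-off -> 1, otherwise 0

# Optional min-max scaling of the scores to the range [0, 1]
scaled <- (raw - min(raw)) / (max(raw) - min(raw))

accuracy <- mean(Predicted == test$svi)  # share of correct predictions on the test set
```

The accuracy obtained on this synthetic data says nothing about the prostate data; the sketch only shows the mechanics of thresholding and scoring.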
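For comparison, base R can also fit a logistic regression directly with glm() and family = binomial. Note that this is a different technique from the lm() call used in this study, shown only as a sketch on synthetic stand-in data; the names Logit_model and prob and the 0.5 cut-off are assumptions.

```r
# Synthetic stand-in data (NOT the prostate data).
set.seed(1)
x <- rnorm(200)
demo <- data.frame(x = x, svi = as.integer(x + rnorm(200, sd = 0.5) > 0.8))
train <- demo[1:140, ]
test  <- demo[141:200, ]

# Logistic regression: the sigmoid (logit link) is part of the model itself.
Logit_model <- glm(svi ~ ., data = train, family = binomial)

# type = "response" returns fitted probabilities between 0 and 1.
prob <- predict(Logit_model, test, type = "response")
Predicted <- ifelse(prob > 0.5, 1, 0)    # assumed cut-off of 0.5
```

With this approach the outputs are already probabilities, so no separate scaling step is needed before applying a cut-off.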
- Support vector machine (SVM): An SVM is a model with associated learning algorithms that analyse data and recognise patterns, used for classification and regression analysis. Using an SVM makes sense in our case because we want to predict categorical values. In R, the svm() function in the e1071 package implements this model. The commands are as follows:
library(e1071)
Svm_model <- svm(formula=svi~., data=train, type='C-classification')
Predicted <- predict(Svm_model,test)
The only difference between lm() and svm() is that we pass an extra argument, type, to svm() to specify that the prediction should be categorical. When we apply this model to the test dataset, it therefore directly gives us a vector of binary values. Although the SVM is a powerful classification model, in our case it performs the same as the linear model, with an accuracy of 83%.
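Accuracy figures such as the one above come from comparing the predicted labels with the actual labels in the test set. A minimal sketch in base R follows; the helper name accuracy and the two short vectors are illustrative assumptions, not values from the study.

```r
# `accuracy` is a hypothetical helper, not a function of e1071.
# Labels are compared as character strings because predictions may come back
# as a factor while the actual values are numeric.
accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# Illustrative vectors only; in the study these would be Predicted and test$svi.
actual    <- c(0, 0, 1, 1, 0, 1, 0, 0)
predicted <- factor(c(0, 0, 1, 0, 0, 1, 1, 0))

confusion <- table(Predicted = predicted, Actual = actual)  # confusion matrix
acc <- accuracy(predicted, actual)                          # 6 of 8 labels agree
```

The confusion matrix is worth inspecting alongside the single accuracy number, because it shows whether the errors are mostly missed svi=1 cases or false alarms.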
- Regression tree model: A more powerful model is the regression tree model, which builds a decision tree recursively and trains the model accordingly. In R, the rpart() function in the rpart package implements this, and the commands are as follows:
library(rpart)
Rpart_model <- rpart(formula=as.factor(svi)~., data=train, method='class', control=rpart.control(minsplit=2, cp=0))
Predict <- predict(Rpart_model, test, type="class")
In rpart() we converted the dependent variable into a factor (it is categorical, so this generates two levels, 0 and 1), set the method argument to class, and passed control parameters so that the tree is split further: minsplit=2 allows a split to be attempted on any node with at least two observations, and cp=0 removes the minimum improvement required for a split.
By printing or plotting the Rpart_model variable (as shown in Figure 3) you can see where the splits are made in the decision tree. The control parameters are optional; if they are omitted, rpart() uses default values. The defaults did not give the best result for our data, so we tried different parameter values and obtained the best result, 91% accuracy, with the values shown above.
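The comparison between default and custom control parameters can be sketched as follows. The data are synthetic stand-ins and fit_and_score is a hypothetical helper name, both assumptions made so the example runs on its own; rpart(), rpart.control() and predict() are used as in the commands above.

```r
library(rpart)

# Synthetic stand-in data (NOT the prostate data): two predictors, binary outcome.
set.seed(1)
n <- 200
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$svi <- as.integer(demo$x1 + demo$x2 + rnorm(n, sd = 0.5) > 1)
train <- demo[1:140, ]
test  <- demo[141:200, ]

# Fit a classification tree with the given control settings and
# return its accuracy on the test set.
fit_and_score <- function(control) {
  model <- rpart(as.factor(svi) ~ ., data = train,
                 method = "class", control = control)
  pred <- predict(model, test, type = "class")
  mean(as.character(pred) == as.character(test$svi))
}

acc_default <- fit_and_score(rpart.control())                      # defaults: minsplit = 20, cp = 0.01
acc_tuned   <- fit_and_score(rpart.control(minsplit = 2, cp = 0))  # settings used above
```

A fully grown tree (minsplit=2, cp=0) fits the training set very closely, so it is worth checking both accuracies on the test set rather than assuming the deeper tree is better for every dataset.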
Conclusion: There are a number of models available in the machine learning world, and the accuracy of their results depends on the nature of the dataset and the way we apply them. Also, the more data we train with, the more accurate a model we can generally build.
Please feel free to contact me if you need more detail or the code for the above study.