# Clear the global environment:
rm(list=ls())
#Setting working directory based on open script:
library(rstudioapi)
## Get the current source editor context:
editor_context <- getSourceEditorContext()
## Check if a script is open:
if (!is.null(editor_context)) {
### Get and set the directory of the currently open script
setwd(dirname(editor_context$path))
} else {
print("No script open in the editor.")
}
# Load any R Packages you may need
library(tidyverse)
library(caret)
library(e1071)
library(mosaic)
library(dplyr)
library(rpart)
library(rpart.plot)
library(ggplot2)
Obesity has become a global epidemic, affecting individuals of all ages, genders, and socioeconomic backgrounds. Many factors contribute to obesity, including genetic predisposition, environmental influences, lifestyle choices, and socioeconomic conditions, and understanding them is crucial for developing effective prevention and intervention strategies. This project investigates the relationship between obesity levels and both physical condition and eating habits. I do this by training two machine learning models, a decision tree and an SVM, to recognize patterns in the data and make predictions.
The data for my final project comes from the UC Irvine Machine Learning Repository. The dataset supports the estimation of obesity levels in individuals from Mexico, Peru, and Colombia based on their eating habits and physical condition, and contains 17 attributes and 2111 records. The target class variable is Obesity Level (Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III). The other variables include Gender, Age, Height, Weight, Family History, High Caloric Food Consumption, Vegetable Consumption, Meals, Snack Frequency, Smoking Habit, Water Consumption, Calorie Count Monitoring, Physical Activity Level, Technology Usage, Alcohol Consumption, and Transportation Mode. Some variables are stored as character strings and others as numeric values.
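The analysis below assumes the Obesity data frame has already been read in from the UCI download; a minimal loading sketch (the file name is the repository's default and may differ on your machine):

```r
# Read the UCI obesity dataset; file name assumed from the repository download
Obesity <- read.csv("ObesityDataSet_raw_and_data_sinthetic.csv",
                    stringsAsFactors = FALSE)
```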
Here’s a quick look at the Obesity dataset:
head(Obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
First, let me rename the columns to make them easier to understand:
colnames(Obesity) <- c("Gender", "Age", "Height", "Weight", "Fam_History", "High_Cal",
                       "Veg", "Meals", "Snack", "Smoke", "H2O", "Cal_Count", "Phys_Activity",
                       "Tech", "Alc", "Transportation", "Obesity_Level")
# Remove the lone record where Alc == "Always" (a single row, too rare to model)
Obesity <- subset(Obesity, Alc != "Always")
Let’s look at the structure of the data:
str(Obesity)
## 'data.frame': 2110 obs. of 17 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ Fam_History : chr "yes" "yes" "yes" "no" ...
## $ High_Cal : chr "no" "no" "no" "no" ...
## $ Veg : num 2 3 2 3 2 2 3 2 3 2 ...
## $ Meals : num 3 3 3 3 1 3 3 3 3 3 ...
## $ Snack : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ Smoke : chr "no" "yes" "no" "no" ...
## $ H2O : num 2 3 2 2 2 2 2 2 2 2 ...
## $ Cal_Count : chr "no" "yes" "no" "no" ...
## $ Phys_Activity : num 0 3 2 2 0 0 1 3 1 1 ...
## $ Tech : num 1 0 1 0 0 0 0 0 1 1 ...
## $ Alc : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ Transportation: chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ Obesity_Level : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
We can convert categorical variables into factors:
Obesity$Gender <- as.factor(Obesity$Gender)
Obesity$Fam_History <- as.factor(Obesity$Fam_History)
Obesity$High_Cal <- as.factor(Obesity$High_Cal)
Obesity$Snack <- as.factor(Obesity$Snack)
Obesity$Smoke <- as.factor(Obesity$Smoke)
Obesity$Cal_Count <- as.factor(Obesity$Cal_Count)
Obesity$Alc <- as.factor(Obesity$Alc)
Obesity$Transportation <- as.factor(Obesity$Transportation)
Obesity$Obesity_Level <- as.factor(Obesity$Obesity_Level)
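Since dplyr is already loaded, the same conversion can be written more compactly with `mutate(across(...))`; a sketch on a toy frame (the equivalent real call would be `Obesity <- Obesity %>% mutate(across(where(is.character), as.factor))`):

```r
library(dplyr)

# Toy frame mimicking the mixed character/numeric structure of the Obesity data
toy <- data.frame(Gender = c("Female", "Male"),
                  Smoke  = c("no", "yes"),
                  Age    = c(21, 23),
                  stringsAsFactors = FALSE)

# Convert every character column to a factor in one call; numeric columns untouched
toy <- toy %>% mutate(across(where(is.character), as.factor))

str(toy)
```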
Check for missing values. We have none! If there were any, we would need to handle them appropriately.
colSums(is.na(Obesity))
## Gender Age Height Weight Fam_History
## 0 0 0 0 0
## High_Cal Veg Meals Snack Smoke
## 0 0 0 0 0
## H2O Cal_Count Phys_Activity Tech Alc
## 0 0 0 0 0
## Transportation Obesity_Level
## 0 0
Let’s look at the distribution of the target variable, Obesity Level. It looks fairly balanced; each class is represented adequately.
gf_bar(~Obesity_Level, data = Obesity)
Now let’s look at the relationships among the numerical variables. The strongest correlation is between height and weight!
numerical_data <- Obesity[, sapply(Obesity, is.numeric)]
cor(numerical_data)
## Age Height Weight Veg Meals
## Age 1.00000000 -0.02596426 0.20240178 0.01610002 -0.04453242
## Height -0.02596426 1.00000000 0.46320373 -0.03813332 0.24392487
## Weight 0.20240178 0.46320373 1.00000000 0.21588429 0.10675838
## Veg 0.01610002 -0.03813332 0.21588429 1.00000000 0.04146228
## Meals -0.04453242 0.24392487 0.10675838 0.04146228 1.00000000
## H2O -0.04531002 0.21337583 0.20060259 0.06846661 0.05713819
## Phys_Activity -0.14495069 0.29470893 -0.05144929 0.01993781 0.12963621
## Tech -0.29674617 0.05199037 -0.07079299 -0.10044503 0.03867736
## H2O Phys_Activity Tech
## Age -0.04531002 -0.14495069 -0.29674617
## Height 0.21337583 0.29470893 0.05199037
## Weight 0.20060259 -0.05144929 -0.07079299
## Veg 0.06846661 0.01993781 -0.10044503
## Meals 0.05713819 0.12963621 0.03867736
## H2O 1.00000000 0.16723643 0.01199282
## Phys_Activity 0.16723643 1.00000000 0.05864230
## Tech 0.01199282 0.05864230 1.00000000
Now let’s look at the frequency of the categorical data to better understand what we are working with:
barplot(table(Obesity$Fam_History), main = 'Frequency of Fam_History', xlab = 'Fam_History', ylab = 'Frequency')
barplot(table(Obesity$High_Cal), main = 'Frequency of High_Cal', xlab = 'High_Cal', ylab = 'Frequency')
barplot(table(Obesity$Snack), main = 'Frequency of Snack', xlab = 'Snack', ylab = 'Frequency')
barplot(table(Obesity$Smoke), main = 'Frequency of Smoke', xlab = 'Smoke', ylab = 'Frequency')
barplot(table(Obesity$Cal_Count), main = 'Frequency of Cal_Count', xlab = 'Cal_Count', ylab = 'Frequency')
barplot(table(Obesity$Alc), main = 'Frequency of Alc', xlab = 'Alc', ylab = 'Frequency')
table(Obesity$Alc)
##
## Frequently no Sometimes
## 70 639 1401
barplot(table(Obesity$Transportation), main = 'Frequency of Transportation', xlab = 'Transportation', ylab = 'Frequency')
Another data-exploration step I considered was principal component analysis (PCA), which reduces the number of variables in a dataset while preserving most of the important information. However, PCA operates only on numerical variables, and my dataset mixes categorical and numerical features, so it was not applicable here.
Let’s drop the columns that would make prediction trivial: Height and Weight together essentially determine obesity level (via body mass index), so we remove them. (The lone "Always" row for Alc was already removed above.)
#get rid of 2 columns
Obesity <- subset(Obesity, select = -c(Height, Weight))
The first model we will look at is a decision tree.
Before constructing our decision tree, let’s split the data into training and testing sets.
set.seed(14)
train_index <- sample(1:nrow(Obesity), size = 0.7 * nrow(Obesity))
train <- Obesity[train_index, ]
test <- Obesity[-train_index, ]
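This split samples rows uniformly at random; caret::createDataPartition (used later for the SVM) instead stratifies by class, so each obesity level keeps roughly its original share in both sets. A sketch on a toy outcome:

```r
library(caret)

set.seed(14)
y <- factor(rep(c("low", "high"), each = 50))  # perfectly balanced toy outcome

# Stratified 70% sample: both classes contribute proportionally
idx <- createDataPartition(y, p = 0.7, list = FALSE)
table(y[idx])
```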
Now that we have our testing and training data, we can create and plot our decision tree.
tree <- rpart(Obesity_Level ~ ., data = train, method = "class")
rpart.plot(tree, type = 4)
Let’s see how well the decision tree predicts. We get an accuracy of 63.03%. The decision tree showed moderate performance in classifying obesity levels, with the root split on vegetable consumption ("Do you usually eat vegetables in your meals?"): respondents who always eat vegetables go left, and those who never or sometimes do go right. It is important to keep in mind that this model may be overfitting.
#Performance
predictions <- predict(tree, newdata = test, type = "class")
confusionMatrix(predictions, test$Obesity_Level)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 43 0 0
## Normal_Weight 22 61 6
## Obesity_Type_I 11 10 70
## Obesity_Type_II 0 0 9
## Obesity_Type_III 5 0 1
## Overweight_Level_I 1 4 0
## Overweight_Level_II 1 9 13
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 5
## Normal_Weight 4 2 25
## Obesity_Type_I 9 0 33
## Obesity_Type_II 69 0 7
## Obesity_Type_III 3 94 0
## Overweight_Level_I 0 0 31
## Overweight_Level_II 6 0 3
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 3
## Normal_Weight 7
## Obesity_Type_I 23
## Obesity_Type_II 8
## Obesity_Type_III 3
## Overweight_Level_I 1
## Overweight_Level_II 31
##
## Overall Statistics
##
## Accuracy : 0.6303
## 95% CI : (0.5914, 0.668)
## No Information Rate : 0.1643
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5682
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.51807 0.72619
## Specificity 0.98545 0.87978
## Pos Pred Value 0.84314 0.48031
## Neg Pred Value 0.93127 0.95455
## Prevalence 0.13112 0.13270
## Detection Rate 0.06793 0.09637
## Detection Prevalence 0.08057 0.20063
## Balanced Accuracy 0.75176 0.80299
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.7071 0.7582
## Specificity 0.8390 0.9557
## Pos Pred Value 0.4487 0.7419
## Neg Pred Value 0.9392 0.9593
## Prevalence 0.1564 0.1438
## Detection Rate 0.1106 0.1090
## Detection Prevalence 0.2464 0.1469
## Balanced Accuracy 0.7730 0.8570
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 0.9792 0.29808
## Specificity 0.9777 0.98866
## Pos Pred Value 0.8868 0.83784
## Neg Pred Value 0.9962 0.87752
## Prevalence 0.1517 0.16430
## Detection Rate 0.1485 0.04897
## Detection Prevalence 0.1675 0.05845
## Balanced Accuracy 0.9784 0.64337
## Class: Overweight_Level_II
## Sensitivity 0.40789
## Specificity 0.94255
## Pos Pred Value 0.49206
## Neg Pred Value 0.92105
## Prevalence 0.12006
## Detection Rate 0.04897
## Detection Prevalence 0.09953
## Balanced Accuracy 0.67522
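The overfitting concern can be probed with cost-complexity pruning: grow the full tree, inspect the cross-validated error at each complexity parameter (cp) value, and prune back. A sketch on the built-in iris data (the min-xerror rule is one common heuristic, not necessarily the tuning attempted in this project):

```r
library(rpart)

set.seed(14)
full_tree <- rpart(Species ~ ., data = iris, method = "class")

# Cross-validated error (xerror) at each complexity parameter value
printcp(full_tree)

# Prune at the cp value minimizing cross-validated error
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)
```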
The second model we will look at is a support vector machine (SVM).
SVMs require numeric inputs, although the target can be either numeric or categorical. This means an SVM can predict my categorical target, obesity level, for classification, but I first have to convert all input features to numeric.
First, let’s encode the binary categorical variables as 0/1, and one-hot encode Gender.
# Encode binary categorical variables
binary_cols <- c("Fam_History", "High_Cal", "Smoke", "Cal_Count")
for (col in binary_cols) {
Obesity[[col]] <- ifelse(Obesity[[col]] == "yes", 1, 0)
}
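On a toy vector, this ifelse() recoding behaves as follows:

```r
# "yes"/"no" answers become 1/0 indicators
v <- c("yes", "no", "yes", "no")
enc <- ifelse(v == "yes", 1, 0)
enc  # 1 0 1 0
```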
# One-hot encode "Gender" variable
Obesity <- cbind(Obesity, model.matrix(~ Gender - 1, data = Obesity))
# Remove the original "Gender" variable
Obesity <- Obesity[, !names(Obesity) %in% "Gender"]
Next, we can apply one-hot encoding again for all categorical columns that had more than 2 classes.
# Define categorical columns
categorical_cols <- c("Snack", "Alc", "Transportation")
for (col in categorical_cols) {
# Create a dummy-variable object for this column
dummy_var <- caret::dummyVars("~ .", data = Obesity[, col, drop = FALSE])
# Generate the one-hot encoded columns (names are already prefixed with the column name)
encoded_vars <- predict(dummy_var, newdata = Obesity)
# Add the encoded variables to the original data frame
Obesity <- cbind(Obesity, encoded_vars)
# Remove the original column
Obesity[[col]] <- NULL
}
# The loop above already removed the original columns; this line is a harmless safeguard
Obesity <- Obesity[, !names(Obesity) %in% categorical_cols]
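caret::dummyVars expands each multi-level column into one indicator column per level; a sketch on a toy Alc column:

```r
library(caret)

# Toy three-level factor standing in for the real Alc column
toy <- data.frame(Alc = factor(c("no", "Sometimes", "Frequently", "no")))
dv  <- dummyVars("~ .", data = toy)

# One 0/1 column per level; each row has exactly one 1
encoded <- predict(dv, newdata = toy)
encoded
```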
Since we are working entirely with numeric data now, it is important to scale it. Normalization brings all numeric features to a similar scale, so that no single feature dominates the distance calculations the SVM relies on.
num_cols <- c("Age", "Veg", "Meals", "H2O", "Phys_Activity", "Tech")
Obesity_scaled <- Obesity
Obesity_scaled[num_cols] <- scale(Obesity_scaled[num_cols])
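scale() centers each column to mean 0 and divides by its standard deviation; on a toy matrix:

```r
# Two toy columns on very different scales
x  <- cbind(Age = c(21, 23, 27, 29), Meals = c(1, 3, 3, 3))
xs <- scale(x)

colMeans(xs)      # each column now has mean ~0
apply(xs, 2, sd)  # and standard deviation 1
```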
Before constructing our SVM model, let’s split the data into training and testing sets.
set.seed(14)
train_index <- createDataPartition(Obesity_scaled$Obesity_Level, p = 0.7, list = FALSE)
train_data <- Obesity_scaled[train_index, ]
test_data <- Obesity_scaled[-train_index, ]
Now we can create our SVM model, using the radial kernel:
radialSVM = svm(formula = Obesity_Level ~ .,
data = train_data,
type = 'C-classification',
kernel = 'radial')
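Comparing kernels amounts to refitting the same model with each kernel option; a sketch on the built-in iris data, reporting training accuracy only (the project's actual comparison used the obesity test set):

```r
library(e1071)

set.seed(14)
for (k in c("linear", "polynomial", "radial")) {
  # Same classification setup, swapping only the kernel
  fit <- svm(Species ~ ., data = iris,
             type = "C-classification", kernel = k)
  acc <- mean(predict(fit, iris) == iris$Species)
  cat(sprintf("%-10s training accuracy: %.3f\n", k, acc))
}
```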
Let’s see how well the SVM model predicts. We get an accuracy of 74.64%. Compared with other kernels, the SVM with a radial kernel performed much better.
predictions <- predict(radialSVM, newdata = test_data)
confusionMatrix(predictions, test_data$Obesity_Level)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 66 12 1
## Normal_Weight 7 50 8
## Obesity_Type_I 3 9 70
## Obesity_Type_II 1 1 13
## Obesity_Type_III 0 0 3
## Overweight_Level_I 4 4 0
## Overweight_Level_II 0 9 10
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 1
## Normal_Weight 0 1 8
## Obesity_Type_I 2 0 13
## Obesity_Type_II 84 0 5
## Obesity_Type_III 0 96 1
## Overweight_Level_I 1 0 54
## Overweight_Level_II 2 0 5
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 0
## Normal_Weight 9
## Obesity_Type_I 13
## Obesity_Type_II 11
## Obesity_Type_III 0
## Overweight_Level_I 3
## Overweight_Level_II 51
##
## Overall Statistics
##
## Accuracy : 0.7464
## 95% CI : (0.7106, 0.78)
## No Information Rate : 0.1664
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7037
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.8148 0.58824
## Specificity 0.9745 0.93956
## Pos Pred Value 0.8250 0.60241
## Neg Pred Value 0.9728 0.93613
## Prevalence 0.1284 0.13471
## Detection Rate 0.1046 0.07924
## Detection Prevalence 0.1268 0.13154
## Balanced Accuracy 0.8947 0.76390
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.6667 0.9438
## Specificity 0.9240 0.9428
## Pos Pred Value 0.6364 0.7304
## Neg Pred Value 0.9328 0.9903
## Prevalence 0.1664 0.1410
## Detection Rate 0.1109 0.1331
## Detection Prevalence 0.1743 0.1823
## Balanced Accuracy 0.7953 0.9433
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 0.9897 0.62069
## Specificity 0.9925 0.97794
## Pos Pred Value 0.9600 0.81818
## Neg Pred Value 0.9981 0.94159
## Prevalence 0.1537 0.13788
## Detection Rate 0.1521 0.08558
## Detection Prevalence 0.1585 0.10460
## Balanced Accuracy 0.9911 0.79932
## Class: Overweight_Level_II
## Sensitivity 0.58621
## Specificity 0.95221
## Pos Pred Value 0.66234
## Neg Pred Value 0.93502
## Prevalence 0.13788
## Detection Rate 0.08082
## Detection Prevalence 0.12203
## Balanced Accuracy 0.76921
It is important to compare these two models and understand how they each behave on the data I have chosen. In terms of model complexity, the decision tree is a much simpler model than the SVM, which is better suited to datasets with intricate feature relationships. In terms of interpretability, decision trees are much more straightforward: you can visualize the decision rules and see which features have the greatest impact on the target prediction. An SVM, especially with a radial kernel, provides no explicit classification rules, which makes its underlying decision-making process harder to understand.
In terms of performance, the decision tree exhibited moderate accuracy at 63.03%. It also showed signs of overfitting, and attempts to prune the tree for better performance led to underfitting instead; its performance felt somewhat unreliable, as it was very sensitive to small changes. The SVM with a radial kernel outperformed both the other SVM kernels and the decision tree, achieving a higher accuracy of 74.64% and indicating its effectiveness in capturing complex patterns.
In summary, while the decision tree and the SVM with a radial kernel each have advantages, the SVM demonstrates superior performance in terms of accuracy and generalization. The decision tree, however, provides better interpretability, which is valuable for understanding the factors influencing obesity levels in this dataset. From both, we can learn a great deal about predicting obesity levels from eating habits and physical condition.