# Let's clear the global environment:
rm(list = ls())
# Set the working directory based on the currently open script:
library(rstudioapi)
## Get the current source editor context:
editor_context <- getSourceEditorContext()
## Check if a script is open:
if (!is.null(editor_context)) {
  ### Get and set the directory of the currently open script
  setwd(dirname(editor_context$path))
} else {
  print("No script open in the editor.")
}
# Load the R packages we need (dplyr and ggplot2 are already part of tidyverse,
# but loading them explicitly is harmless)
library(tidyverse)
library(caret)
library(e1071)
library(mosaic)
library(dplyr)
library(rpart)
library(rpart.plot)
library(ggplot2)

Introduction

Obesity has become a global epidemic, affecting individuals of all ages, genders, and socioeconomic backgrounds. Many factors contribute to obesity, including genetic predisposition, environmental influences, lifestyle choices, and socioeconomic status. Understanding these factors is crucial for developing effective prevention and intervention strategies. This project investigates the relationship between obesity levels and both physical condition and eating habits by applying two machine learning models, a decision tree and an SVM, to recognize patterns in the data and make predictions.

The data for my final project comes from the UC Irvine Machine Learning Repository. The dataset supports the estimation of obesity levels in individuals from Mexico, Peru, and Colombia based on their eating habits and physical condition, and contains 17 attributes and 2,111 records. The target variable is Obesity Level (Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III). The remaining variables include Gender, Age, Height, Weight, Family History, High Caloric Food Consumption, Vegetable Consumption, Meals, Snack Frequency, Smoking Habit, Water Consumption, Calorie Count Monitoring, Physical Activity Level, Technology Usage, Alcohol Consumption, and Transportation Mode. Some variables are stored as characters and others as numbers.
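One step not shown in this write-up is loading the dataset itself. A minimal sketch, assuming the UCI CSV has been downloaded into the working directory (the filename `ObesityDataSet.csv` is an assumption and may differ):

```r
# Read the downloaded UCI obesity data; the filename here is an assumption --
# adjust it to match the file you actually saved
Obesity <- read.csv("ObesityDataSet.csv", stringsAsFactors = FALSE)
```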

Here’s a quick look at the Obesity dataset:

head(Obesity)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

Data Munging

First, let me rename the columns to make them easier to understand:

colnames(Obesity)<- c("Gender", "Age", "Height", "Weight", "Fam_History", "High_Cal", 
                      "Veg", "Meals", "Snack","Smoke", "H2O", "Cal_Count", "Phys_Activity", 
                      "Tech", "Alc", "Transportation", "Obesity_Level")
# Remove the single record where Alc == "Always" -- that level is too rare to model
Obesity <- subset(Obesity, Alc != "Always")

Let’s look at the structure of the data:

str(Obesity)
## 'data.frame':    2110 obs. of  17 variables:
##  $ Gender        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ Fam_History   : chr  "yes" "yes" "yes" "no" ...
##  $ High_Cal      : chr  "no" "no" "no" "no" ...
##  $ Veg           : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ Meals         : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ Snack         : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ Smoke         : chr  "no" "yes" "no" "no" ...
##  $ H2O           : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ Cal_Count     : chr  "no" "yes" "no" "no" ...
##  $ Phys_Activity : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ Tech          : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ Alc           : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ Transportation: chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ Obesity_Level : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...

We can convert categorical variables into factors:

Obesity$Gender <- as.factor(Obesity$Gender)
Obesity$Fam_History <- as.factor(Obesity$Fam_History)
Obesity$High_Cal <- as.factor(Obesity$High_Cal)
Obesity$Snack <- as.factor(Obesity$Snack)
Obesity$Smoke <- as.factor(Obesity$Smoke)
Obesity$Cal_Count <- as.factor(Obesity$Cal_Count)
Obesity$Alc <- as.factor(Obesity$Alc)
Obesity$Transportation <- as.factor(Obesity$Transportation)
Obesity$Obesity_Level <- as.factor(Obesity$Obesity_Level)

Check for missing values. We have none! If there were any, we would have to handle them appropriately (for example, by imputing or dropping them).

colSums(is.na(Obesity))
##         Gender            Age         Height         Weight    Fam_History 
##              0              0              0              0              0 
##       High_Cal            Veg          Meals          Snack          Smoke 
##              0              0              0              0              0 
##            H2O      Cal_Count  Phys_Activity           Tech            Alc 
##              0              0              0              0              0 
## Transportation  Obesity_Level 
##              0              0
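Had there been missing values, a common fix is median imputation for numeric columns. A minimal sketch on a hypothetical toy data frame (not the obesity data):

```r
# Hypothetical toy data with one missing Age value
toy <- data.frame(Age = c(21, NA, 23), Smoke = c("no", "yes", "no"))

# Replace numeric NAs with the column median
toy$Age[is.na(toy$Age)] <- median(toy$Age, na.rm = TRUE)

sum(is.na(toy))  # 0 -- no missing values remain
```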

Let’s look at the distribution of the target variable, Obesity Level. It looks fairly balanced, and each class is adequately represented.

gf_bar(~Obesity_Level, data = Obesity)

Now let’s look at the correlations among the numerical variables. The strongest correlation is between height and weight, at about 0.46!

numerical_data <- Obesity[, sapply(Obesity, is.numeric)]
cor(numerical_data)
##                       Age      Height      Weight         Veg       Meals
## Age            1.00000000 -0.02596426  0.20240178  0.01610002 -0.04453242
## Height        -0.02596426  1.00000000  0.46320373 -0.03813332  0.24392487
## Weight         0.20240178  0.46320373  1.00000000  0.21588429  0.10675838
## Veg            0.01610002 -0.03813332  0.21588429  1.00000000  0.04146228
## Meals         -0.04453242  0.24392487  0.10675838  0.04146228  1.00000000
## H2O           -0.04531002  0.21337583  0.20060259  0.06846661  0.05713819
## Phys_Activity -0.14495069  0.29470893 -0.05144929  0.01993781  0.12963621
## Tech          -0.29674617  0.05199037 -0.07079299 -0.10044503  0.03867736
##                       H2O Phys_Activity        Tech
## Age           -0.04531002   -0.14495069 -0.29674617
## Height         0.21337583    0.29470893  0.05199037
## Weight         0.20060259   -0.05144929 -0.07079299
## Veg            0.06846661    0.01993781 -0.10044503
## Meals          0.05713819    0.12963621  0.03867736
## H2O            1.00000000    0.16723643  0.01199282
## Phys_Activity  0.16723643    1.00000000  0.05864230
## Tech           0.01199282    0.05864230  1.00000000

Now let’s look at the frequency of the categorical variables to better understand what we are working with:

barplot(table(Obesity$Fam_History), main = 'Frequency of Fam_History', xlab = 'Fam_History', ylab = 'Frequency')

barplot(table(Obesity$High_Cal), main = 'Frequency of High_Cal', xlab = 'High_Cal', ylab = 'Frequency')

barplot(table(Obesity$Snack), main = 'Frequency of Snack', xlab = 'Snack', ylab = 'Frequency')

barplot(table(Obesity$Smoke), main = 'Frequency of Smoke', xlab = 'Smoke', ylab = 'Frequency')

barplot(table(Obesity$Cal_Count), main = 'Frequency of Cal_Count', xlab = 'Cal_Count', ylab = 'Frequency')

barplot(table(Obesity$Alc), main = 'Frequency of Alc', xlab = 'Alc', ylab = 'Frequency')

table(Obesity$Alc)
## 
## Frequently         no  Sometimes 
##         70        639       1401
barplot(table(Obesity$Transportation), main = 'Frequency of Transportation', xlab = 'Transportation', ylab = 'Frequency')

Another step I could have taken during data exploration is principal component analysis (PCA), which reduces the number of variables in a dataset while preserving most of the important information. That was not possible here, however, because standard PCA accepts only numerical inputs and my data mixes categorical and numerical variables.
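For reference, here is what a PCA sketch looks like on a fully numeric toy dataset (the iris measurements; illustration only, since my own data could not be used this way):

```r
# PCA requires numeric input: center and scale the columns, then inspect
# how much variance each principal component explains
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)  # the first component captures most of the variance
```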

Let’s get rid of unnecessary columns. Height and Weight together essentially reconstruct BMI, which would hand the models the answer, so we drop them. (The “Always” alcohol row was already removed above, so the filter here is just a safeguard.)

# Drop the "Always" alcohol row (a no-op here, since it was removed earlier)
Obesity <- subset(Obesity, Alc != "Always")

# Drop the two columns that leak the target
Obesity <- subset(Obesity, select = -c(Height, Weight))

Decision Tree

The first model we will be looking at is a decision tree.

Before constructing our decision tree, let’s split the data into training and testing sets.

set.seed(14)
train_index <- sample(1:nrow(Obesity), size = 0.7 * nrow(Obesity))
train <- Obesity[train_index, ]
test <- Obesity[-train_index, ]

Now that we have our testing and training data, we can create and plot our decision tree.

tree <- rpart(Obesity_Level ~ ., data = train, method = "class")
rpart.plot(tree, type = 4)

Let’s see how accurately the decision tree predicts. We get an accuracy of 63.03%. The decision tree showed moderate performance in classifying obesity levels, with the root split on vegetable consumption (“Do you usually eat vegetables in your meals?”): respondents who always eat vegetables go to the left branch, and those who never or sometimes do go to the right. It is important to keep in mind that this model may be overfitting.

#Performance
predictions <- predict(tree, newdata = test, type = "class")
confusionMatrix(predictions, test$Obesity_Level)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  43             0              0
##   Normal_Weight                        22            61              6
##   Obesity_Type_I                       11            10             70
##   Obesity_Type_II                       0             0              9
##   Obesity_Type_III                      5             0              1
##   Overweight_Level_I                    1             4              0
##   Overweight_Level_II                   1             9             13
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  5
##   Normal_Weight                     4                2                 25
##   Obesity_Type_I                    9                0                 33
##   Obesity_Type_II                  69                0                  7
##   Obesity_Type_III                  3               94                  0
##   Overweight_Level_I                0                0                 31
##   Overweight_Level_II               6                0                  3
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   3
##   Normal_Weight                         7
##   Obesity_Type_I                       23
##   Obesity_Type_II                       8
##   Obesity_Type_III                      3
##   Overweight_Level_I                    1
##   Overweight_Level_II                  31
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6303         
##                  95% CI : (0.5914, 0.668)
##     No Information Rate : 0.1643         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5682         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                             0.51807              0.72619
## Specificity                             0.98545              0.87978
## Pos Pred Value                          0.84314              0.48031
## Neg Pred Value                          0.93127              0.95455
## Prevalence                              0.13112              0.13270
## Detection Rate                          0.06793              0.09637
## Detection Prevalence                    0.08057              0.20063
## Balanced Accuracy                       0.75176              0.80299
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         0.7071                 0.7582
## Specificity                         0.8390                 0.9557
## Pos Pred Value                      0.4487                 0.7419
## Neg Pred Value                      0.9392                 0.9593
## Prevalence                          0.1564                 0.1438
## Detection Rate                      0.1106                 0.1090
## Detection Prevalence                0.2464                 0.1469
## Balanced Accuracy                   0.7730                 0.8570
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           0.9792                   0.29808
## Specificity                           0.9777                   0.98866
## Pos Pred Value                        0.8868                   0.83784
## Neg Pred Value                        0.9962                   0.87752
## Prevalence                            0.1517                   0.16430
## Detection Rate                        0.1485                   0.04897
## Detection Prevalence                  0.1675                   0.05845
## Balanced Accuracy                     0.9784                   0.64337
##                      Class: Overweight_Level_II
## Sensitivity                             0.40789
## Specificity                             0.94255
## Pos Pred Value                          0.49206
## Neg Pred Value                          0.92105
## Prevalence                              0.12006
## Detection Rate                          0.04897
## Detection Prevalence                    0.09953
## Balanced Accuracy                       0.67522

SVM

The second model we will be looking at is an SVM model.

SVMs accept only numeric inputs, although the target attribute can be either numeric or categorical. This means an SVM can predict my categorical target of obesity level, but I first have to convert all of the input features to numeric.

First, let’s encode the binary categorical variables as 0/1, and apply one-hot encoding to Gender.

# Encode binary categorical variables
binary_cols <- c("Fam_History", "High_Cal", "Smoke", "Cal_Count")
for (col in binary_cols) {
  Obesity[[col]] <- ifelse(Obesity[[col]] == "yes", 1, 0)
}

# One-hot encode "Gender" variable
Obesity <- cbind(Obesity, model.matrix(~ Gender - 1, data = Obesity))
# Remove the original "Gender" variable
Obesity <- Obesity[, !names(Obesity) %in% "Gender"]

Next, we apply one-hot encoding to the remaining categorical columns with more than two levels.

# Define categorical columns
categorical_cols <- c("Snack", "Alc", "Transportation")
for (col in categorical_cols) {
  # Create dummy variable object
  dummy_var <- caret::dummyVars("~ .", data = Obesity[, col, drop = FALSE])
  # Generate one-hot encoded variables
  encoded_vars <- predict(dummy_var, newdata = Obesity)
  # Update column names
  new_col_names <- colnames(encoded_vars)
  new_col_names <- gsub("Obesity.", paste(col, "", sep = "_"), new_col_names)
  colnames(encoded_vars) <- new_col_names
  # Add the encoded variables to the original data frame
  Obesity <- cbind(Obesity, encoded_vars)
  # Remove the original column
  Obesity[[col]] <- NULL
}

# Remove original categorical variables
Obesity <- Obesity[, !names(Obesity) %in% categorical_cols]

Since we are working with numerical data now, it is important to scale it. The goal of normalization is to bring all numeric features onto a similar scale, so that no single feature dominates the SVM’s distance computations.

num_cols <- c("Age", "Veg", "Meals", "H2O", "Phys_Activity", "Tech")
Obesity_scaled <- Obesity
Obesity_scaled[num_cols] <- scale(Obesity_scaled[num_cols])

Before constructing our SVM model, let’s split the data into training and testing sets.

set.seed(14)
train_index <- createDataPartition(Obesity_scaled$Obesity_Level, p = 0.7, list = FALSE)
train_data <- Obesity_scaled[train_index, ]
test_data <- Obesity_scaled[-train_index, ]

Now we can create our SVM model, using the radial kernel:

radialSVM = svm(formula = Obesity_Level ~ ., 
                 data = train_data, 
                 type = 'C-classification', 
                 kernel = 'radial')

Let’s see how accurately the SVM model predicts. We get an accuracy of 74.64%. Compared to the other kernels, the SVM with a radial kernel performed much better.

predictions <- predict(radialSVM, newdata = test_data)
confusionMatrix(predictions, test_data$Obesity_Level)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  66            12              1
##   Normal_Weight                         7            50              8
##   Obesity_Type_I                        3             9             70
##   Obesity_Type_II                       1             1             13
##   Obesity_Type_III                      0             0              3
##   Overweight_Level_I                    4             4              0
##   Overweight_Level_II                   0             9             10
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  1
##   Normal_Weight                     0                1                  8
##   Obesity_Type_I                    2                0                 13
##   Obesity_Type_II                  84                0                  5
##   Obesity_Type_III                  0               96                  1
##   Overweight_Level_I                1                0                 54
##   Overweight_Level_II               2                0                  5
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   0
##   Normal_Weight                         9
##   Obesity_Type_I                       13
##   Obesity_Type_II                      11
##   Obesity_Type_III                      0
##   Overweight_Level_I                    3
##   Overweight_Level_II                  51
## 
## Overall Statistics
##                                         
##                Accuracy : 0.7464        
##                  95% CI : (0.7106, 0.78)
##     No Information Rate : 0.1664        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.7037        
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              0.8148              0.58824
## Specificity                              0.9745              0.93956
## Pos Pred Value                           0.8250              0.60241
## Neg Pred Value                           0.9728              0.93613
## Prevalence                               0.1284              0.13471
## Detection Rate                           0.1046              0.07924
## Detection Prevalence                     0.1268              0.13154
## Balanced Accuracy                        0.8947              0.76390
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         0.6667                 0.9438
## Specificity                         0.9240                 0.9428
## Pos Pred Value                      0.6364                 0.7304
## Neg Pred Value                      0.9328                 0.9903
## Prevalence                          0.1664                 0.1410
## Detection Rate                      0.1109                 0.1331
## Detection Prevalence                0.1743                 0.1823
## Balanced Accuracy                   0.7953                 0.9433
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           0.9897                   0.62069
## Specificity                           0.9925                   0.97794
## Pos Pred Value                        0.9600                   0.81818
## Neg Pred Value                        0.9981                   0.94159
## Prevalence                            0.1537                   0.13788
## Detection Rate                        0.1521                   0.08558
## Detection Prevalence                  0.1585                   0.10460
## Balanced Accuracy                     0.9911                   0.79932
##                      Class: Overweight_Level_II
## Sensitivity                             0.58621
## Specificity                             0.95221
## Pos Pred Value                          0.66234
## Neg Pred Value                          0.93502
## Prevalence                              0.13788
## Detection Rate                          0.08082
## Detection Prevalence                    0.12203
## Balanced Accuracy                       0.76921
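The kernel comparison mentioned above can be sketched on a toy dataset; here linear and radial kernels are fit on iris (illustration only, not the obesity data):

```r
library(e1071)

set.seed(14)
idx <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train_toy <- iris[idx, ]
test_toy  <- iris[-idx, ]

# Fit one SVM per kernel and compare held-out accuracy
accs <- sapply(c("linear", "radial"), function(k) {
  fit <- svm(Species ~ ., data = train_toy,
             type = "C-classification", kernel = k)
  mean(predict(fit, newdata = test_toy) == test_toy$Species)
})
accs
```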

Comparison Between the Two Models

It is important to compare these two models and understand how they behave on the data I chose. In terms of model complexity, a decision tree is a much simpler model than an SVM; an SVM is better suited to datasets with intricate feature relationships. In terms of interpretability, decision trees are far more straightforward: you can visualize the decision rules and see which features have the greatest impact on predicting the target variable. An SVM, especially with a radial kernel, does not provide explicit classification rules, which makes the underlying decision-making process harder to understand.
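This interpretability advantage can be seen directly in rpart: a fitted tree exposes a variable-importance ranking. A minimal sketch on a toy dataset (iris, used here purely for illustration):

```r
library(rpart)

# Fit a small classification tree on a toy dataset and inspect which
# features drove the splits (higher value = greater impact)
fit <- rpart(Species ~ ., data = iris, method = "class")
fit$variable.importance  # named vector, sorted from most to least important
```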

In terms of performance, the decision tree showed moderate performance with an accuracy of 63%. It also showed signs of overfitting, and when I tried pruning the tree for better performance, it ended up underfitting; the model felt somewhat unreliable because it was very sensitive to small changes. The SVM with a radial kernel performed better than both the other SVM kernels and the decision tree, achieving a higher accuracy of 74.64% and indicating its effectiveness at capturing complex patterns.
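The pruning attempt described above is typically done through rpart’s complexity-parameter table: grow a deep tree, then prune back to the cp with the lowest cross-validated error. A sketch on iris (toy data, for illustration only):

```r
library(rpart)

set.seed(14)
# Grow a deliberately deep tree (cp = 0 disables early stopping)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0))

# Choose the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# Prune back to that cp; the pruned tree is never larger than the original
pruned <- prune(fit, cp = best_cp)
```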

Conclusion

In summary, while the decision tree and the radial-kernel SVM each have different advantages, the SVM demonstrates superior performance in terms of accuracy and generalization, whereas the decision tree offers better interpretability, which is valuable for understanding the factors influencing obesity levels in the dataset. From both, we can learn a lot about how to better predict obesity levels from eating habits and physical condition.