# Load any R Packages you may need
library(tidyverse)
library(parsnip)
library(tidymodels)
library(rpart.plot)
library(ggridges)
Introduction
Honey bees are an extremely important species. They are a keystone species in that they pollinate much of our food. Without honey bees flowering plants will decline which will have a negative consequence for all ecosystems. The Save The Bees movement has become increasingly popular within the last few years because beekeepers are losing a high percentage of their honeybee colonies every year. And just as worrisome is the that the wild bee populations are also in decline. Of course, there are many factors that have to do with the recent decline but there are some that may have a greater affect than others. Honey bees are at risk of becoming endangered which is why it is important to look at the causes of the decline.
The data set Bee Colonies from TidyTuesday is made up of two main tables with information taken from US honeybee colony farmers. One table, named colony, provides information on honey bee colonies in terms of number of colonies, maximum, lost, percent lost, added, renovated, and colonies lost. The data also provides another table, named stressor, that identifies colony health stressors. Using this dataset, we can ask the question What are the main causes of colony loss in the US?
Data Wrangling
First, let’s make sure to clean the data a little bit!
- We can group the months into season instead. So originally there was January-March, April-June, July-September, and October-December. I instead made months October-December and January-March be the Winter season, and April-June and July-September be the Summer season.
- I got rid of the data from the year 2021, because it only had data on half of the year, so it did not give accurate enough information to be representative on the whole year.
- I selected the columns that had to do with colony lost count and percentage, since that is the data we are exploring.
- I added a new column id for later use when joining the two tables together.
colony = colony %>%
mutate(season = case_when(
months %in% c('January-March', 'October-December') ~ 'Winter',
months %in% c('April-June', 'July-September') ~ 'Summer'
)) %>%
filter(year != 2021) %>%
mutate(id = row_number()) %>%
select(id, year, season, state, colony_n, colony_lost, colony_lost_pct)
head(colony)
## # A tibble: 6 × 7
## id year season state colony_n colony_lost colony_lost_pct
## <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 2015 Winter Alabama 7000 1800 26
## 2 2 2015 Winter Arizona 35000 4600 13
## 3 3 2015 Winter Arkansas 13000 1500 11
## 4 4 2015 Winter California 1440000 255000 15
## 5 5 2015 Winter Colorado 3500 1500 12
## 6 6 2015 Winter Connecticut 3900 870 22
For the stressor table, I cleaned the data in the same way. I grouped the months into seasons instead, filtered out the year 2021 data, and created a column for id to help with joining the tables.
stressor = stressor %>%
mutate(season = case_when(
months %in% c('January-March', 'October-December') ~ 'Winter',
months %in% c('April-June', 'July-September') ~ 'Summer'
)) %>%
filter(year != 2021) %>%
mutate(id = row_number()) %>%
select(year, season, state, stressor, stress_pct)
head(stressor)
## # A tibble: 6 × 5
## year season state stressor stress_pct
## <dbl> <chr> <chr> <chr> <dbl>
## 1 2015 Winter Alabama Varroa mites 10
## 2 2015 Winter Alabama Other pests/parasites 5.4
## 3 2015 Winter Alabama Disesases NA
## 4 2015 Winter Alabama Pesticides 2.2
## 5 2015 Winter Alabama Other 9.1
## 6 2015 Winter Alabama Unknown 9.4
Preliminary Exploratory Analysis
Now that we have our data in a way that is easier to manage, we can explore it a bit more.
The graph below shows the average percentage of colonies lost in the US during the winter and summer seasons by year.
colony %>%
group_by(year, season) %>%
summarize(avg_colony_lost_pct = mean(colony_lost_pct, na.rm = TRUE)) %>%
ggplot(aes(x = year, y = avg_colony_lost_pct, fill = season)) + geom_col(position = "dodge") + labs(x = "", y = "Average Colonies Lost (%)", fill = "", title = "Average Percentage of Total Colonies Lost in the U.S.", subtitle = "From 2015-2020") + theme_minimal()

We can see that in the winter months, there is a higher average colonies lost percentage for each year (2015-2020). This may be because bee colonies have a hard time staying alive during colder months, but it’s interesting to see how the percentage on colonies lost during the winter months has increased during 2016-2018. There may be other reasons as to why the average colonies lost percentage is at its peak in 2018.
The graph below explores the stressor table. It shows the average stress percentage for each stressor from the years 2015 -2020.
na.omit(stressor) %>%
group_by(year, stressor) %>%
summarize(stress_pct_avg = mean(stress_pct)) %>%
ggplot(aes(x = year, y = stress_pct_avg, color = stressor)) + geom_point() + geom_smooth() + labs(x = "", y = "Average Stress (%)", color = "Stressor", title = "Average Stress Percentage for Each Stressor", subtitle = "From 2015 to 2020")

We can see that the stressor Varroa Mites, has the highest average stress percentage throughout the years, by a lot when compared to other stressors. This means that the Varroa mites is the stressor that has the highest percent of colonies affected by it. Again, in 2018, there is a peak in a high stress percentage for most stressors, including the Varroa mites.
Joining Tables
Now that we’ve see two causes of colony loss separately, let’s look at them together. To do this, we can merge the two tables together.
First, we can fix the stressor table to look at only the Varroa mites stress percent by state.
stressor_vm = stressor%>%
filter(stressor == "Varroa mites") %>%
arrange(year) %>%
mutate(id = row_number()) %>%
select(id, year, season, state, stressor, stress_pct)
head(stressor_vm)
## # A tibble: 6 × 6
## id year season state stressor stress_pct
## <int> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 2015 Winter Alabama Varroa mites 10
## 2 2 2015 Winter Arizona Varroa mites 26.9
## 3 3 2015 Winter Arkansas Varroa mites 17.6
## 4 4 2015 Winter California Varroa mites 24.7
## 5 5 2015 Winter Colorado Varroa mites 14.6
## 6 6 2015 Winter Connecticut Varroa mites 2.5
Now that the stressor table is filtered to only have the Varroa mites stress percentage, we can join it to the colony table.
colonyJoinedStressorVM = colony %>%
inner_join(stressor_vm, by = c("id", "year", "season", "state"))
head(colonyJoinedStressorVM)
## # A tibble: 6 × 9
## id year season state colony_n colony_lost colony_lost_pct stressor
## <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 2015 Winter Alabama 7000 1800 26 Varroa mi…
## 2 2 2015 Winter Arizona 35000 4600 13 Varroa mi…
## 3 3 2015 Winter Arkansas 13000 1500 11 Varroa mi…
## 4 4 2015 Winter California 1440000 255000 15 Varroa mi…
## 5 5 2015 Winter Colorado 3500 1500 12 Varroa mi…
## 6 6 2015 Winter Connecticut 3900 870 22 Varroa mi…
## # … with 1 more variable: stress_pct <dbl>
Exploratory Data Analysis
With this new data, we can explore the relationships between season and stress percentage of Valorroa mites, with colony lost percentage.
The graph below shows the relationship between the percentage of colonies affected by Varroa mites and the percentage of total colonies lost in the U.S. More specifically, it looks at the relationship when taking the season into account.
colonyJoinedStressorVM %>%
ggplot(aes(x = stress_pct, y = colony_lost_pct, color = season)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(x = "Colonies Affected by varroa mites (%)", y = "Total Colonies Lost (%)", color = "Season", title = "Percentage of Total Colonies Lost Based on Percentage of Colonies Affected by Varroa Mites ", subtitle = "By season")
Visually, we can see that there appears to be a general positive correlation between stress pct and colony lost percent, with winter seasons having greater colony lost percentage. Both higher stress percents of Varroa mites and the winter season together, have the highest colony lost percentage.
Let’s investigate these relationships through modeling now!
Predictive Modeling
Linear Regression Model
Target: Percent of total colonies lost (colony_lost_pct)
Feature Percent of colonies affected by Varroa mites (stress_pct) and season (season)
Let’s build a multiple regression model that predicts colony_lost_pct while considering our features of stress_pct and season !
set.seed(14)
#1. Data Splitting
colonyJoinedStressorVM_split = initial_split(colonyJoinedStressorVM, prop = 0.7)
colonyJoinedStressorVM_train = training(colonyJoinedStressorVM_split)
colonyJoinedStressorVM_test = testing(colonyJoinedStressorVM_split)
#2. Fit model
lm_fit_train =
linear_reg() %>%
fit(colony_lost_pct ~ stress_pct + season, data = colonyJoinedStressorVM_train)
lm_fit_train
## parsnip model object
##
##
## Call:
## stats::lm(formula = colony_lost_pct ~ stress_pct + season, data = data)
##
## Coefficients:
## (Intercept) stress_pct seasonWinter
## 6.1542 0.1062 4.1184
#3. Generate Predictions
lm_pred = lm_fit_train %>%
predict(new_data = colonyJoinedStressorVM_test)
#4. Prediction Metric (RMSE)
colonyJoinedStressorVM_test %>%
mutate(colony_lost_pct_Pred = lm_pred$.pred) %>%
rmse(estimate = colony_lost_pct_Pred, truth = colony_lost_pct)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 6.33
We created a multiple regression model: colony_lost_pct_hat = 0.1062(stress_pct) + 4.1184(seasonWinter)
The RMSE for the multiple regression model was 6.329434. This a very low RMSE. This means that the multiple regression model fits the dataset very well! But this number should be recieved with caution because the relationship between our features (season and stress_pct) and target (colony_lost_pct) may not actually be a linear relationship.
Let’s look at the relationship again using a different model!
Decision Tree
The predictive modeling of linear regression that we used, may not be the best model. Our data may need a non-linear model.
To create a more accurate predictive model, we can use a decision tree to employ the CART (classification and regression tree) algorithm :
set.seed(14)
#1. Data Splitting
colonyJoinedStressorVM_split = initial_split(colonyJoinedStressorVM, prop = 0.7)
colonyJoinedStressorVM_train = training(colonyJoinedStressorVM_split)
colonyJoinedStressorVM_test = testing(colonyJoinedStressorVM_split)
#2. Fit model
tree_fit =
decision_tree() %>%
set_mode(mode = "regression") %>%
fit(colony_lost_pct ~ stress_pct + season,
data = colonyJoinedStressorVM_train)
rpart.plot(tree_fit$fit, roundint = FALSE)

#3. Generate Predictions
tree_pred = tree_fit %>%
predict(new_data = colonyJoinedStressorVM_test)
#4. Prediction Metric (RMSE)
colonyJoinedStressorVM_test %>%
mutate(colony_lost_pct_Pred = tree_pred$.pred) %>%
rmse(truth = colony_lost_pct, estimate = colony_lost_pct_Pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 6.60
By viewing the CART output, we can see where the variable season is split. And if we follow along, we can see that the winter months have more higher stress percentages.
The RMSE for the CART model with multiple features was 6.597867. Meanwhile the RMSE for the multiple regression model was 6.329434.
Conclusion
While the RMSE for the multiple regression model was technically smaller, the RMSE for the CART model was not completely off. They were relatively close in size, which means that both models that we created to predict values of colony_lost_pct were generally good at doing so. Of course, these are only two ways of predicting values, so there may be other models that may do a more accurate job. But in exploring both the data visually and through modeling, we can see the effect that the winter season, and varroa mites have on the percentage of colony lost.
The dataset we explored did not contrain many different factors. There could have been outside factors that affect the percentage of colony lost in the US. Perhaps the season and stressors are not the only reason why honeybee colonies are being threatened. It is also important to note that there may be a relationship between the varroa mites and the season which may have had an effect on our ability to perform accurate and useful predictive modeling.
There are many other factors that we must keep looking at to make sure that our honeybee colonies thrive again.
---
title: "Save the Bees!"
author: "Caroline Mazariegos"
date: "05/13/2022"
output: 
  html_document:
    theme: flatly
    toc: true
    toc_float: true
    code_download: true
---

```{r, warning = FALSE, message = FALSE}
# Load any R Packages you may need
library(tidyverse)
library(parsnip)
library(tidymodels)
library(rpart.plot)
library(ggridges)

```

```{r, echo = FALSE, warning = FALSE, message = FALSE}
colony <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv')
stressor <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/stressor.csv')
```

# Introduction

Honey bees are an extremely important species. They are a keystone species in that they pollinate much of our food. Without honey bees flowering plants will decline which will have a negative consequence for all ecosystems. The Save The Bees movement has become increasingly popular within the last few years because beekeepers are losing a high percentage of their honeybee colonies every year. And just as worrisome is the that the wild bee populations are also in decline. Of course, there are many factors that have to do with the recent decline but there are some that may have a greater affect than others. Honey bees are at risk of becoming endangered which is why it is important to look at the causes of the decline. 

The data set **Bee Colonies** from *TidyTuesday* is made up of two main tables with information taken from US honeybee colony farmers. One table, named *colony*, provides information on honey bee colonies in terms of number of colonies, maximum, lost, percent lost, added, renovated, and colonies lost. The data also provides another table, named *stressor*, that identifies colony health  stressors. Using this dataset, we can ask the question **What are the main causes of colony loss in the US?**




# Data Wrangling

First, let's make sure to clean the data a little bit! 

- We can group the months into season instead. So originally there was January-March, April-June, July-September, and October-December. I instead made months October-December and January-March be the Winter season, and April-June and July-September be the Summer season. 
- I got rid of the data from the year 2021, because it only had data on half of the year, so it did not give accurate enough information to be representative on the whole year.
- I selected the columns that had to do with colony lost count and percentage, since that is the data we are exploring.
- I added a new column *id* for later use when joining the two tables together.

```{r, warning = FALSE, message = FALSE}
colony = colony %>% 
  mutate(season = case_when(
    months %in% c('January-March', 'October-December') ~ 'Winter', 
    months %in% c('April-June', 'July-September') ~ 'Summer'
  )) %>% 
  filter(year != 2021) %>% 
  mutate(id = row_number()) %>%
  select(id, year, season, state, colony_n, colony_lost, colony_lost_pct)

head(colony)
```

For the stressor table, I cleaned the data in the same way. I grouped the months into seasons instead, filtered out the year 2021 data, and created a column for *id* to help with joining the tables. 

```{r, warning = FALSE, message = FALSE}
stressor = stressor %>% 
  mutate(season = case_when(
    months %in% c('January-March', 'October-December') ~ 'Winter', 
    months %in% c('April-June', 'July-September') ~ 'Summer'
  )) %>% 
  filter(year != 2021) %>% 
  mutate(id = row_number()) %>% 
  select(year, season, state, stressor, stress_pct)

head(stressor)
```



# Preliminary Exploratory Analysis

Now that we have our data in a way that is easier to manage, we can explore it a bit more. 

The graph below shows the average percentage of colonies lost in the US during the winter and summer seasons by year. 
```{r, warning = FALSE, message = FALSE}
colony %>% 
  group_by(year, season) %>% 
  summarize(avg_colony_lost_pct = mean(colony_lost_pct, na.rm = TRUE)) %>% 
  ggplot(aes(x = year, y = avg_colony_lost_pct, fill = season)) + geom_col(position = "dodge") + labs(x = "", y = "Average Colonies Lost (%)", fill = "", title = "Average Percentage of Total Colonies Lost in the U.S.", subtitle = "From 2015-2020") + theme_minimal()
```

We can see that in the winter months, there is a higher average colonies lost percentage for each year (2015-2020). This may be because bee colonies have a hard time staying alive during colder months, but it's interesting to see how the percentage on colonies lost during the winter months has increased during 2016-2018. There may be other reasons as to why the average colonies lost percentage is at its peak in 2018. 




The graph below explores the stressor table. It shows the average stress percentage for each stressor from the years 2015 -2020. 
```{r, warning = FALSE, message = FALSE}
na.omit(stressor) %>% 
  group_by(year, stressor) %>% 
  summarize(stress_pct_avg = mean(stress_pct)) %>% 
  ggplot(aes(x = year, y = stress_pct_avg, color = stressor)) + geom_point() + geom_smooth() + labs(x = "", y = "Average Stress (%)", color = "Stressor", title = "Average Stress Percentage for Each Stressor", subtitle = "From 2015 to 2020")
```

We can see that the stressor Varroa Mites, has the highest average stress percentage throughout the years, by a lot when compared to other stressors. This means that the Varroa mites is the stressor that has the highest percent of colonies affected by it. Again, in 2018, there is a peak in a high stress percentage for most stressors, including the Varroa mites. 



# Joining Tables 

Now that we've see two causes of colony loss separately, let's look at them together. To do this, we can merge the two tables together.


First, we can fix the stressor table to look at only the Varroa mites stress percent by state. 
```{r, warning = FALSE, message = FALSE}
stressor_vm = stressor%>% 
  filter(stressor == "Varroa mites") %>% 
  arrange(year) %>% 
  mutate(id = row_number()) %>% 
  select(id, year, season, state, stressor, stress_pct)

head(stressor_vm)
```

Now that the stressor table is filtered to only have the Varroa mites stress percentage, we can join it to the colony table. 
```{r, warning = FALSE, message = FALSE}
colonyJoinedStressorVM = colony %>% 
  inner_join(stressor_vm, by = c("id", "year", "season", "state"))

head(colonyJoinedStressorVM)
```

# Exploratory Data Analysis

With this new data, we can explore the relationships between season and stress percentage of Valorroa mites, with colony lost percentage.


The graph below shows the relationship between the percentage of colonies affected by Varroa mites and the percentage of total colonies lost in the U.S. More specifically, it looks at the relationship when taking the season into  account. 
```{r, warning = FALSE, message = FALSE}
colonyJoinedStressorVM %>% 
  ggplot(aes(x = stress_pct, y = colony_lost_pct, color = season)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(x = "Colonies Affected by varroa mites (%)", y = "Total Colonies Lost (%)", color = "Season", title = "Percentage of Total Colonies Lost Based on Percentage of Colonies Affected by Varroa Mites ", subtitle = "By season")
```
Visually, we can see that there appears to be a general positive correlation between stress pct and colony lost percent, with winter seasons having greater colony lost percentage. Both higher stress percents of Varroa mites and the winter season together, have the highest colony lost percentage.


Let's investigate these relationships through modeling now!


# Predictive Modeling 

## Linear Regression Model

**Target**: Percent of total colonies lost (*colony_lost_pct*)

**Feature** Percent of colonies affected by Varroa mites (*stress_pct*) and season (*season*)


Let's build a multiple regression model that predicts *colony_lost_pct* while considering our features of *stress_pct* and *season* !

```{r, warning = FALSE, message = FALSE}
set.seed(14)
#1. Data Splitting
colonyJoinedStressorVM_split = initial_split(colonyJoinedStressorVM, prop = 0.7)

colonyJoinedStressorVM_train = training(colonyJoinedStressorVM_split)
colonyJoinedStressorVM_test = testing(colonyJoinedStressorVM_split)

#2. Fit model
lm_fit_train = 
  linear_reg() %>% 
  fit(colony_lost_pct ~ stress_pct + season, data = colonyJoinedStressorVM_train)
lm_fit_train

#3. Generate Predictions
lm_pred = lm_fit_train %>% 
  predict(new_data = colonyJoinedStressorVM_test)

#4. Prediction Metric (RMSE)
colonyJoinedStressorVM_test %>% 
  mutate(colony_lost_pct_Pred = lm_pred$.pred) %>% 
  rmse(estimate = colony_lost_pct_Pred, truth = colony_lost_pct)
```

We created a multiple regression model: colony_lost_pct_hat = 0.1062(stress_pct) + 4.1184(seasonWinter)

The RMSE for the multiple regression model was 6.329434. This a very low RMSE. This means that the multiple regression model fits the dataset very well! But this number should be recieved with caution because the relationship between our features (*season* and *stress_pct*) and target (*colony_lost_pct*) may not actually be a linear relationship. 

Let's look at the relationship again using a different model!



## Decision Tree

The predictive modeling of linear regression that we used, may not be the best model. Our data may need a non-linear model.

To create a more accurate predictive model, we can use a decision tree to employ the CART (classification and regression tree) algorithm : 

```{r, warning = FALSE, message = FALSE}
set.seed(14) 
#1. Data Splitting
colonyJoinedStressorVM_split = initial_split(colonyJoinedStressorVM, prop = 0.7)

colonyJoinedStressorVM_train = training(colonyJoinedStressorVM_split)
colonyJoinedStressorVM_test = testing(colonyJoinedStressorVM_split)

#2. Fit model
tree_fit = 
  decision_tree() %>%
  set_mode(mode = "regression") %>%
  fit(colony_lost_pct ~ stress_pct + season, 
    data = colonyJoinedStressorVM_train)

rpart.plot(tree_fit$fit, roundint = FALSE)

#3. Generate Predictions
tree_pred = tree_fit %>%
  predict(new_data = colonyJoinedStressorVM_test)

#4. Prediction Metric (RMSE)
colonyJoinedStressorVM_test %>%
  mutate(colony_lost_pct_Pred = tree_pred$.pred) %>%
  rmse(truth = colony_lost_pct, estimate = colony_lost_pct_Pred)

```

By viewing the CART output, we can see where the variable season is split. And if we follow along, we can see that the winter months have more higher stress percentages. 

The RMSE for the CART model with multiple features was 6.597867. Meanwhile the RMSE for the multiple regression model was 6.329434. 




# Conclusion 

While  the RMSE for the multiple regression model was technically smaller, the RMSE for the CART model was not completely off. They were relatively close in size, which means that both models that we created to predict values of *colony_lost_pct* were generally good at doing so. Of course, these are only two ways of predicting values, so there may be other models that may do a more accurate job. But in exploring both the data visually and through modeling, we can see the effect that the winter season, and varroa mites have on the percentage of colony lost. 

The dataset we explored did not contrain many different factors. There could have been outside factors that affect the percentage of colony lost in the US. Perhaps the season and stressors are not the only reason why honeybee colonies are being threatened. It is also important to note that there may be a relationship between the varroa mites and the season which may have had an effect on our ability to perform accurate and useful predictive modeling. 

There are many other factors that we must keep looking at to make sure that our honeybee colonies thrive again. 


