# Load necessary libraries
library(tidyverse)
library(tseries)
library(forecast)
# Import the .csv file
Drought <- read_csv("/Users/carolinemazariegos/SCHOOL/STAT391/Project/DataTablesU.S.DroughtMonitor.csv")

Introduction

The U.S. Drought Monitor (USDM) is a dataset that is updated each Thursday to show the location and intensity of drought across the country. The data reflects drought conditions, which depend largely on how much precipitation did or did not fall, and it comes from the USDM website (https://droughtmonitor.unl.edu/). Visuals and additional information are available on the website for further investigation. The extracted data includes a few variables: the date, the drought classification levels, which range from D0 (abnormally dry) to D4 (exceptional drought), and the DSCI (Drought Severity and Coverage Index).

The variables examined in this project are the date and the DSCI. The DSCI (Drought Severity and Coverage Index) takes values from 0 to 500: 0 means that none of the area is abnormally dry or in drought, and 500 means that all of the area is in the D4 (exceptional drought) classification. The index is calculated as a weighted sum of the percentages of area in each classification level, but the USDM provides the dataset with the values already computed. The data used in this project covers drought intensity for the U.S. as a whole rather than by geographical location or state.
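
As a rough illustration of the weighting (a sketch only; the columns D0 through D4 are hypothetical and hold the non-cumulative percent of area in each category, which is not part of the extract used here):

# Sketch: recompute the DSCI from hypothetical non-cumulative percent-area
# values, weighting each category by its severity (D0 = 1, ..., D4 = 5).
# The example percentages below are made up for illustration.
example_pct <- c(D0 = 20, D1 = 10, D2 = 5, D3 = 2, D4 = 0)
sum(1:5 * example_pct)  # 1*20 + 2*10 + 3*5 + 4*2 + 5*0 = 63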

To get a better understanding of the data, the following code displays its first few rows. The data has been subset to the two variables of interest, Week and DSCI:

Drought_new <- Drought %>%
  select(Week, DSCI) %>%
  arrange(Week)

head(Drought_new)
## # A tibble: 6 × 2
##   Week        DSCI
##   <date>     <dbl>
## 1 2016-01-05    68
## 2 2016-01-12    65
## 3 2016-01-19    61
## 4 2016-01-26    59
## 5 2016-02-02    60
## 6 2016-02-09    58

Taking a closer look at the data, there is more that can be learned, but first the data must be turned into a time series. Based on the information provided on the website, the first observation is on 2016-01-05 and the last observation is on 2023-04-04. Since there is one observation taken every week, the frequency of the data is 52 (52 weeks in a year). The span of the data is approximately 7 years and 3 months. This information can be used to turn the data into a time series:

# start and end are given as c(year, week)
Drought.ts <- ts(Drought_new[, 2], start = c(2016, 1), end = c(2023, 4), frequency = 52)
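
As a quick sanity check that the series was constructed as intended, its attributes can be inspected with base R helpers:

# Confirm the start, end, frequency, and length of the series
start(Drought.ts)
end(Drought.ts)
frequency(Drought.ts)
length(Drought.ts)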

Visualization

To explore the time series further, visualizations can be made to better understand its behavior.

The time series line plot displays the values of DSCI over time. The x-axis displays the date of each observation while the y-axis represents the DSCI value:

plot(Drought.ts, xlab = "Year", ylab = "DSCI", main = "U.S. Drought Severity and Coverage Index Over Time")

The plot displays a large spike in DSCI around the year 2020, after which the DSCI remains at high values past 2021. There are a few episodes of sharp increase throughout the series. There also appears to be some seasonality: a small pattern of increase followed by a steep decrease that repeats over time. The plot shows a slight change in variance, with fluctuations in DSCI that grow and shrink over time. Overall, the mean does not change drastically over time; there is no persistent increasing or decreasing trend.
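
As a quick numeric check of the spike described above (a sketch using base R helpers), the largest DSCI value and the time at which it occurs can be extracted directly from the series:

# Find the peak DSCI value and when it occurred
max(Drought.ts)
time(Drought.ts)[which.max(Drought.ts)]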

Because there may be a seasonal component, another potentially helpful plot is a boxplot of the seasonal cycle of the time series:

boxplot(Drought.ts ~ cycle(Drought.ts), ylab = "DSCI", xlab = "Week of the Year (One Seasonal Cycle)")

The boxplot shows the distribution of DSCI values at each position within the seasonal cycle, where one cycle in this time series is a year (52 weeks); each box pools the values for a given week of the year across all years. Comparing the boxplots across the weeks, the medians clearly vary, and the medians and ranges of the boxplots rise and fall over the course of the cycle. This suggests that seasonal effects are present, and it is an indicator that seasonality may have to be addressed in the modeling process.
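
Another view of the same question is a seasonal plot, sketched here with the forecast package (already loaded); it overlays each year of DSCI on a common week-of-year axis. The as.numeric() wrapper is a defensive assumption in case Drought.ts retained a one-column matrix shape from the tibble it was built from:

# Overlay each year of DSCI on a common week-of-year axis
dsci_uv <- ts(as.numeric(Drought.ts), start = start(Drought.ts), frequency = 52)
ggseasonplot(dsci_uv) + ylab("DSCI") + ggtitle("DSCI by Week of Year")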

Decomposition Plot

A decomposition plot is another way to analyze the time series: it separates the series into its underlying components. For this time series, the decomposition displays the trend, seasonal, and random components.

Drought.decom <- decompose(Drought.ts)
plot(Drought.decom)

The trend component of the decomposition plot shows a slow increase over time, followed by a fast increase in DSCI after the year 2020, where it continues to remain high.

The seasonal component of the decomposition plot shows a pattern in which, within a given year, the DSCI reaches very low values and then spikes drastically back up. This seasonality appears to be attributable to the changing seasons in the U.S. and their impact on droughts.
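
To gauge how large this seasonal swing is relative to the series as a whole (a sketch; decompose() stores its components as list elements), the range of the seasonal component can be compared with the range of the observed data:

# Size of the estimated seasonal swing versus the overall range of the data
range(Drought.decom$seasonal, na.rm = TRUE)
range(Drought.ts, na.rm = TRUE)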

Check Stationarity

To better analyze the data and build a strong model, it is important to check whether the time series is stationary. The Augmented Dickey-Fuller (ADF) test and the ACF plot can help determine this:

acf(Drought.ts)

adf.test(Drought.ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  Drought.ts
## Dickey-Fuller = -2.109, Lag order = 7, p-value = 0.531
## alternative hypothesis: stationary

ACF plot: Looking at the ACF plot, the autocorrelations decay slowly and never fall within the blue significance bounds; they remain significant at every lag shown. Because of this, I would not consider the series stationary.

ADF test: The ADF test gives a p-value of 0.531. Since 0.531 > 0.05, the null hypothesis (the data is non-stationary) cannot be rejected. We do not have enough evidence to conclude that the data is stationary.

The ACF plot and ADF test both indicate that the data is non-stationary, so the data must be transformed to be stationary. To do this, the data can be differenced, which will eliminate the trend that is present:

diff_drought <- diff(Drought.ts)  # first (lag-1) difference
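
The choice of a single difference can be corroborated with forecast::ndiffs, which uses a unit-root test to estimate how many differences a series needs (a sketch; as.numeric() again guards against the one-column matrix shape):

# Estimate how many first differences are needed for stationarity
ndiffs(as.numeric(Drought.ts))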

Using the differenced data, we can examine the ACF plot and run the ADF test again to determine if the data is now stationary:

acf(diff_drought)

adf.test(diff_drought)
## Warning in adf.test(diff_drought): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff_drought
## Dickey-Fuller = -5.8485, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

ACF plot: Looking at the ACF plot, the autocorrelations decrease and fall within the blue significance bounds by lag 4, and from lag 4 onward they stay within the bounds. This is an indicator of stationarity.

ADF test: The ADF test gives a p-value of 0.01. Since 0.01 < 0.05, the null hypothesis (the data is non-stationary) can be rejected. We have enough evidence to conclude that the data is stationary.

Both checks support treating the differenced time series as stationary.

It is also worth checking whether seasonal differencing would be an even better option:

diff_drought_52 <- diff(Drought.ts, lag = 52)  # seasonal (lag-52) difference

acf(diff_drought_52)

adf.test(diff_drought_52)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff_drought_52
## Dickey-Fuller = -1.7675, Lag order = 6, p-value = 0.6745
## alternative hypothesis: stationary

ACF plot: Looking at the ACF plot of the seasonally differenced series, the autocorrelations again decay slowly and never fall within the blue significance bounds. Because of this, I would not consider this series stationary either.

ADF test: The ADF test gives a p-value of 0.6745. Since 0.6745 > 0.05, the null hypothesis (the data is non-stationary) cannot be rejected. We do not have enough evidence to conclude that the seasonally differenced data is stationary.

The ACF plot and ADF test both indicate that the seasonally differenced data is non-stationary. The first-differenced data therefore appears to be the best data to work with, as it is stationary, and we can build a better model with it.
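
This conclusion can be corroborated with forecast::nsdiffs, which uses a measure of seasonal strength to estimate how many seasonal differences are warranted (a sketch; the series is rebuilt as explicitly univariate in case Drought.ts kept a one-column matrix shape):

# Estimate how many seasonal differences are needed (0 suggests none)
nsdiffs(ts(as.numeric(Drought.ts), frequency = 52))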

Fit Model

The ACF and PACF plots can help determine the best model.

acf(diff_drought)

pacf(diff_drought)

ACF plot: The ACF plot appears to have a slight sine wave pattern that begins at lag 3, suggesting an MA order of q = 3.

PACF plot: The PACF plot also appears to have a slight sine wave pattern, beginning at lag 1, suggesting an AR order of p = 1.

This means that we can fit an ARMA(1,3) model:

ARMA13Model <- arima(diff_drought, order = c(1, 0, 3), include.mean = FALSE)
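
As a cross-check on the manually chosen orders, forecast::auto.arima can search candidate orders automatically (a sketch; its AICc-based selection may well differ from the ARMA(1,3) chosen here):

# Search for ARMA orders on the already-differenced series
# (d = 0 because diff_drought has been differenced once by hand)
auto.arima(diff_drought, d = 0, seasonal = FALSE)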

Assess Model

To check whether the model is a good fit for the data, several diagnostics can be run. The first thing to look at is the residuals: we want them to behave like white noise, so we can plot them to see:

plot(ARMA13Model$residuals)

There does not appear to be any real pattern. When plotting residuals, we do not want to see any obvious trends, so this is a good sign.

Next, we can plot the ACF and PACF of the residuals to determine whether they look like white noise.

acf(ARMA13Model$residuals)

pacf(ARMA13Model$residuals)

Both the ACF and the PACF plots look like white noise; there is no real trend or pattern in either plot.
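
The forecast package also bundles these diagnostics into a single call (a sketch; checkresiduals() produces a residual time plot, ACF, and histogram, and runs a Ljung-Box test):

# One-call summary of the residual diagnostics
checkresiduals(ARMA13Model)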

Next, we can perform formal tests on the residuals. One is the Shapiro-Wilk test, which tests whether the data come from a normally distributed population:

shapiro.test(ARMA13Model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  ARMA13Model$residuals
## W = 0.98664, p-value = 0.001846

The Shapiro-Wilk test gives a p-value of 0.001846. Since 0.001846 < 0.05, we have enough statistical evidence to reject the null hypothesis (the data come from a normally distributed population). This means the residuals may not be normally distributed. This is not the ideal result, but there is another test worth conducting.

The Ljung-Box test checks whether the data are independently distributed:

Box.test(ARMA13Model$residuals, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  ARMA13Model$residuals
## X-squared = 8.2465, df = 20, p-value = 0.9901

The Ljung-Box test gives a p-value of 0.9901. Since 0.9901 > 0.05, we fail to reject the null hypothesis and can conclude that the data are independently distributed. The test shows no evidence of serial correlation in the residuals. Based solely on the Ljung-Box test, the residuals appear to be white noise.

Lastly, it is important to check the distribution of the residuals to verify the model assumptions. We can do this with a histogram and a normal Q-Q plot of the residuals:

hist(ARMA13Model$residuals)

qqnorm(ARMA13Model$residuals)
qqline(ARMA13Model$residuals)

The histogram appears to be fairly normally distributed, though one could argue it is slightly left-skewed, which might indicate that a few values are pulling the distribution in one direction. For the most part there is a strong diagonal line in the Q-Q plot, but there is some deviation in the tails that is worth noting.

Looking at the diagnostic checks, all but the Shapiro-Wilk test supported the idea that our ARMA(1,3) model is a good fit. Therefore, we have reasonable confidence that we chose a good model.
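
One further sanity check is to compare information criteria against a simpler candidate (a sketch; the ARMA(1,1) below is a hypothetical alternative fit only for this comparison):

# Compare AIC of the chosen ARMA(1,3) against a simpler ARMA(1,1);
# the lower AIC is preferred
ARMA11Model <- arima(diff_drought, order = c(1, 0, 1), include.mean = FALSE)
AIC(ARMA13Model, ARMA11Model)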

Forecast

Based on the ARMA(1,3) model that we have chosen for our time series, we are able to make forecasted predictions:

forecast_model <- forecast(ARMA13Model, h = 10)
forecast_model
##          Point Forecast     Lo 80    Hi 80      Lo 95    Hi 95
## 2023.077    -2.18482614 -7.349150 2.979497 -10.082977 5.713325
## 2023.096    -1.25815951 -6.912010 4.395691  -9.904978 7.388659
## 2023.115    -0.72226078 -6.563995 5.119474  -9.656423 8.211901
## 2023.135    -0.42207888 -6.331820 5.487662  -9.460248 8.616090
## 2023.154    -0.24665687 -6.179444 5.686130  -9.320072 8.826758
## 2023.173    -0.14414275 -6.084780 5.796494  -9.229563 8.941278
## 2023.192    -0.08423496 -6.027551 5.859081  -9.173752 9.005282
## 2023.212    -0.04922571 -5.993456 5.895004  -9.140141 9.041690
## 2023.231    -0.02876680 -5.973309 5.915775  -9.120160 9.062626
## 2023.250    -0.01681091 -5.961460 5.927838  -9.108367 9.074745
autoplot(forecast_model)

Based on the model, forecasts of the next 10 time points were made. Because the model was fit to the differenced series, these forecasts are predicted week-to-week changes in DSCI rather than DSCI levels. The forecasts were then plotted along with the differenced series to show what the predictions look like and how they extend the data over time.
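
To express the forecasts on the original DSCI scale (a sketch; it anchors the cumulative sum of the predicted changes at the last observed value of the series the model was built from):

# Undo the differencing: add the cumulative predicted changes to the
# last observed DSCI value
last_dsci <- tail(as.numeric(Drought.ts), 1)
last_dsci + cumsum(forecast_model$mean)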

Conclusion

During this investigation of droughts in the U.S., there were many factors to take into consideration. One of the biggest factors that I thought would need to be controlled was the seasonality component of the time series. However, after conducting the tests needed to account for seasonality, the results did not appear to improve the analysis, so I decided to move forward with a much simpler model. Because of the model's simplicity, the forecasted points may not accurately represent future values. I continue to believe that the seasonal component should be investigated further to better predict future DSCI values and better anticipate droughts in the U.S. Of course, no model is 100% accurate, so it is important to understand the limitations of this model.