# Load any R Packages you may need
library(tidyverse)
library(learnr)

Introduction

What exactly is data wrangling? The term is familiar to anyone in the Data Science field, appearing in class descriptions, internship and job qualifications, and virtually any STEM setting.

It is an extremely important skill, as it gets your data into a useful form for both visualizing and modeling. Understanding the data you work with, and being able to clean it for easier access and analysis, is a desirable and useful skill for any future data scientist.

In this tutorial, you will learn to do just that!

Let’s look at the data for Volcano Eruptions from The Smithsonian Institution on tidytuesday. There are two tables of data. One, called volcano, gives general information on all of the volcanoes in the world, including the name, location, and date of last eruption.

Here’s a quick look at the volcano table:

head(volcano)

The other table is events. It gives information on all the volcanic activity that has happened, including the name of the volcano, the type of activity (explosion, ash, etc.), and the date.

Here’s a quick look at the events table:

head(events)

S.S. FAM

There are 5 important functions needed for data transformation:

  1. select()
  2. filter()
  3. mutate()
  4. arrange()
  5. summarize()

We’ll go over some of these to better understand the volcano dataset.
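Of the five, summarize() is the one we won’t revisit below, so here is a minimal sketch of what it does. The tiny table and its values are invented purely for illustration:

```r
library(dplyr)

# A tiny made-up table of volcanic events (invented data, for illustration only)
toy <- tibble(
  region = c("A", "A", "B"),
  vei    = c(2, 4, 3)  # invented "explosivity" values
)

# summarize() collapses many rows into a single summary row
toy %>%
  summarize(mean_vei = mean(vei), n_events = n())
```

Where select() and filter() pick out parts of the data, summarize() condenses it: here, three rows become one row holding the mean and the row count.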

select()

The select() function allows us to choose the columns we want.

Let’s take the events table and look at only three variables: volcano_name, eruption_start_year, and event_type.

Click run code to see what happens:

events %>% # calling the dataset
  select(volcano_name, eruption_start_year, event_type) #selecting the columns 

The arguments of the select() function are the variables we want to keep and are separated by a comma.

One important tool to note is the pipe operator, %>%. It allows us to chain functions together. We’ll continue using it as we explore more functions.
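To make the pipe concrete, here is a tiny sketch on a made-up table: x %>% f() is just another way of writing f(x), with the left-hand side passed in as the first argument.

```r
library(dplyr)

# A small made-up table (invented names and years, for illustration only)
toy <- tibble(name = c("Etna", "Fuji"), year = c(2021, 1707))

# These two calls are equivalent:
piped  <- toy %>% select(name)  # pipe passes toy as the first argument
nested <- select(toy, name)     # same call written without the pipe
```

The payoff comes when several functions are chained: piping reads top to bottom as a sequence of steps, instead of nesting calls inside one another.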

filter()

The filter() function allows us to choose the rows we want by filtering for the rows that satisfy certain conditions.

Let’s continue to add to our existing code. Right now, we’re looking at three columns of the events data. We can go even further and filter the data to only look at event types that resulted in fatalities.

Click run code to see what happens:

events %>%  # calling the dataset
  select(volcano_name, eruption_start_year, event_type) %>%  #selecting the columns
  filter(event_type == "Fatalities")  #filtering for certain rows 

The arguments of the filter() function are conditions based on logical statements. In this example, we wanted only events that were categorized as having fatalities, which is why we used event_type == "Fatalities". For numerical variables, we could also use > or <.
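As a quick sketch of a numerical condition, here is filter() with >= on a made-up table (the names and years are invented for illustration):

```r
library(dplyr)

# Invented example data
toy <- tibble(
  volcano_name        = c("Etna", "Vesuvius", "Fuji"),
  eruption_start_year = c(2021, 79, 1707)
)

# Keep only eruptions from the year 1000 onward
toy %>%
  filter(eruption_start_year >= 1000)
```

Only the rows where the condition is TRUE survive; here that drops Vesuvius (year 79) and keeps the other two.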

mutate()

The mutate() function allows us to add new columns (new variables) to the data.

Using the same data we’ve been working on, we can create a new variable, years_ago, to see how many years ago each eruption that resulted in fatalities occurred:

years_ago = 2022 - eruption_start_year

Click run code to see what happens:

events %>%  # calling the dataset
  select(volcano_name, eruption_start_year, event_type) %>%  #selecting the columns
  filter(event_type == "Fatalities") %>%   #filtering for certain rows 
  mutate(years_ago = 2022 - eruption_start_year) #creating a new variable

The arguments of the mutate() function begin with the new variable’s name, followed by how to compute it using either existing variables or external values.

arrange()

The arrange() function allows us to sort the rows by a specific column.

Let’s arrange what we have so far, by the eruption_start_year.

Click run code to see what happens:

events %>%  # calling the dataset
  select(volcano_name, eruption_start_year, event_type) %>%  #selecting the columns
  filter(event_type == "Fatalities") %>%   #filtering for certain rows 
  mutate(years_ago = 2022 - eruption_start_year) %>%  #creating a new variable
  arrange(eruption_start_year) #ordering the data by eruption year 

The argument of the arrange() function is simply the variable you want to order the data by. By default, the data is arranged in ascending order.

To arrange in descending order, you can wrap the variable in the desc() function:

events %>% 
  select(volcano_name, eruption_start_year, event_type) %>% 
  filter(event_type == "Fatalities") %>%    
  mutate(years_ago = 2022 - eruption_start_year) %>%  
  arrange(desc(eruption_start_year)) #descending order

Combining Multiple Tables

Many data projects contain multiple tables, as seen in the volcano dataset. We can combine multiple tables by using a family of join functions:

  1. inner_join()
  2. full_join()
  3. left_join() and right_join()
  4. anti_join()
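Before applying one of these to the volcano data, here is a sketch of how the four differ, using two small made-up tables keyed by an invented id column:

```r
library(dplyr)

# Two invented tables sharing an "id" key (for illustration only)
a <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
b <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

inner_join(a, b, by = "id")  # only ids present in BOTH tables (2 and 3)
full_join(a, b, by = "id")   # all ids from EITHER table, NAs where unmatched
left_join(a, b, by = "id")   # every row of a, with matching columns from b
anti_join(a, b, by = "id")   # rows of a with NO match in b (just id 1)
```

The choice comes down to which rows you want to keep when the keys don’t line up perfectly.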

Let’s continue to look at the volcano dataset. For this data, we can use one specific join function.

left_join()

The left_join() function allows us to include all rows in one table (the “main” table), and only the rows in the other table with a matching variable.

We want to take the events table and combine it with the volcano table. One variable the two tables share is volcano_number.

Click run code to see what happens:

events %>% # This is our "main" table
  left_join(volcano, by = "volcano_number") #We are adding corresponding rows from the volcano table to the events table by the variable volcano_number

The arguments of the left_join() function consist of the name of the table that you want to add followed by the matching variable between both tables.

When using left_join(), you’ll end up with the same number of rows as the “main” table. Checking the row count is a great way to verify that the two tables were merged correctly.

#The original table: 
events %>% 
  nrow()
## [1] 41322
#The new table, after volcano table is joined with events table
events %>% 
  left_join(volcano, by = "volcano_number") %>% 
  nrow()
## [1] 41322

The row counts match, so we know we merged the two tables correctly!

The right_join() function works similarly to left_join(), only in the other direction.

Putting It Together

Let’s take what we’ve learned so far to see which regions of the world have had the most volcanic events that have resulted in fatalities.

It is important to go step by step in order to understand what is being done to the data.

Step 1 - Joining Tables

Join the events and volcano tables using the correct join function.

The code is started. Finish inputting the arguments into the code below. (Hint: we are merging both tables by the volcano_number variable.)

events %>% 
  left_join()

Step 2 - Filtering and Selecting

Now that we have all the data we want to work with in a single table, we can clean it up.

Let’s select the variables (columns) that we want to look at: volcano_name.x, event_type, and region.

Let’s also filter for event types that resulted in fatalities like we did earlier.

The code is started. Finish inputting the arguments for the select() and filter() functions into the code below.

events %>% 
  left_join(volcano, by = "volcano_number") %>% 
  select() %>% 
  filter()

Step 3 - na.omit()

One thing that you may have noticed is that there are many NAs in the table (meaning that there is no data available). To get rid of those rows, we can use the na.omit() function.

Click run code to see what happens:

events %>% 
  left_join(volcano, by = "volcano_number") %>% 
  select(volcano_name.x, event_type, region) %>% 
  filter(event_type == "Fatalities") %>% 
  na.omit() # removes any row that has NA ANYWHERE

Step 4 - count() and fct_reorder()

Now we have the data that we want to work with!

We can create a bar graph using ggplot that shows how many volcanic events have resulted in fatalities in each region.

Click run code to see what happens:

events %>% 
  left_join(volcano, by = "volcano_number") %>% 
  select(volcano_name.x, event_type, region) %>% 
  filter(event_type == "Fatalities") %>% 
  na.omit() %>% 
  ggplot(aes(x = region)) + 
  geom_bar() + 
  coord_flip() #flipped the axes for readability purposes

Here, we used ggplot to map the region variable to the x axis. The geom_bar() function automatically uses count as the y aesthetic. And so that all the regions could fit and be readable, we flipped the axes using the coord_flip() function.

This graph shows the information we want, but there are some things we can tweak to make it even easier to understand.

The count() and fct_reorder() functions will be extremely useful here.

The count() function counts the number of observations by group.
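As a quick sketch on made-up data, count(region) is shorthand for grouping by region and counting the rows in each group:

```r
library(dplyr)

# Invented example data
toy <- tibble(region = c("Indonesia", "Japan", "Indonesia"))

toy %>% count(region)
# equivalent to: toy %>% group_by(region) %>% summarize(n = n())
```

The result has one row per region and a new column n holding the number of observations in that region.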

Let’s go back to looking at our table.

Click run code to see what happens:

events %>% 
  left_join(volcano, by = "volcano_number") %>% 
  select(volcano_name.x, event_type, region) %>% 
  filter(event_type == "Fatalities") %>% 
  na.omit() %>% 
  count(region) #creates a new column with the number of observations by region

Now, we have a table with two variables: the region, and n, the number of events resulting in fatalities per region.

Using this, we can now make our visual easier to understand.

The fct_reorder() function allows us to reorder a categorical variable based on a second variable. In this case, the second variable is n (the count).
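Here is a minimal sketch of fct_reorder() on a made-up count table: it turns region into a factor whose levels are ordered by n.

```r
library(dplyr)
library(forcats)

# Invented counts, for illustration only
toy <- tibble(
  region = c("Alaska", "Indonesia", "Japan"),
  n      = c(1, 50, 30)
)

reordered <- fct_reorder(toy$region, toy$n)
levels(reordered)  # "Alaska" "Japan" "Indonesia" -- smallest n first
```

Because ggplot lays out a discrete axis in factor-level order, reordering the levels by n is what makes the bars appear sorted by height.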

Now, we can create our bar graph again.

Click run code to see what happens:

events %>% 
  left_join(volcano, by = "volcano_number") %>% 
  select(volcano_name.x, event_type, region) %>% 
  filter(event_type == "Fatalities") %>% 
  na.omit() %>% 
  count(region) %>% 
  ggplot(aes(x = fct_reorder(region, n), y = n)) + 
  geom_col() + 
  coord_flip()

When implementing the fct_reorder() function, we wrap it around the x aesthetic. We also use geom_col() instead of geom_bar(), since we now have a value for the y aesthetic. And again, we use coord_flip() to read the regions easily.

Step 5 - Graphing

NOW, we have our data the way we want it.

We can now make the graph look nicer with proper labels, titles, themes, and colors.

Feel free to play around with the code below!

events %>% 
  left_join(volcano, by = "volcano_number") %>% 
  select(volcano_name.x, event_type, region) %>% 
  filter(event_type == "Fatalities") %>% 
  na.omit() %>% 
  count(region) %>% 
  ggplot(aes(x = fct_reorder(region, n), y = n)) + 
  geom_col(fill = "light blue", color = "black") + 
  coord_flip() + 
  labs(x = "", y = "Count", title = "Number of Volcanic Events Resulting in Fatalities") + 
  theme_minimal()

Conclusion

events %>% 
  left_join(volcano, by = "volcano_number") %>% 
  select(volcano_name.x, event_type, region) %>% 
  filter(event_type == "Fatalities") %>% 
  na.omit() %>% 
  count(region) %>% 
  ggplot(aes(x = fct_reorder(region, n), y = n)) + 
  geom_col(fill = "light blue", color = "black") + 
  coord_flip() + 
  labs(x = "", y = "Count", title = "Number of Volcanic Events Resulting in Fatalities") + 
  theme_minimal()

The graph helps us see where in the world fatal volcanic activity takes place. Using the bar plot we constructed, we can see that Indonesia, Japan, Taiwan, Marianas are the regions most affected by deadly volcanic activity. We can also see areas, like Alaska, that have had virtually no history of fatal volcanic activity.

There are many reasons for these trends. It could be geography, seeing that the regions with the highest counts are on the Asian continent. It could also be influenced by a region’s infrastructure, seeing that the Western USA has a low count despite lying near a fault line. Seeing these trends can perhaps help countries and people better prepare for extreme events.

We were able to reach this point by manipulating the data, step by step, until we produced a result that made it easy to communicate our findings.

As you can see, data wrangling is a very important skill! There are so many other tools and functions worth exploring, but hopefully you were able to learn a little bit about the basics of data wrangling!

Caroline Mazariegos

04/01/2022