Dplyr Cheat Sheet 2020

Using select, filter, mutate, arrange, and summarize

Prep homework

Basic computer setup

If you didn’t already do this, please follow the Code Club Computer Setup instructions, which also include pointers if you’re new to R or RStudio.
If you’re able to do so, please open RStudio a bit before Code Club starts – and in case you run into issues, please join the Zoom call early and we’ll troubleshoot.

New to dplyr?

If you’ve never used dplyr before (or even if you have), you may find this cheat sheet useful.

Getting Started

Want to download an R script with the content from today’s session?

1 - What is data wrangling?

It has been estimated that the process of getting your data into the appropriate formats takes about 80% of the total time of analysis. In a future session of Code Club, we will talk about formatting data as tidy data (i.e., such that each column is a single variable, each row is a single observation, and each cell is a single value; you can learn more about tidy data here).

The package dplyr, which is part of the tidyverse, has a number of very helpful functions that will help you get your data into a format suitable for your analysis.

What we will go over today

These five core dplyr verbs will help you get wrangling.

select() - picks variables (i.e., columns) based on their names
filter() - picks observations (i.e., rows) based on their values
mutate() - makes new variables, keeps existing columns
arrange() - sorts rows based on values in columns
summarize() - reduces values down to a summary form

2 - Get ready to wrangle

Let’s get set up and grab some data so that we can get familiar with these verbs.

You can do this locally, or at OSC. If you are having trouble, you can find instructions here.

First load your libraries.
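A minimal sketch of this step, assuming the dplyr package is already installed. Loading the whole tidyverse with `library(tidyverse)` would work too, since dplyr is part of it.

```r
# Load dplyr, which provides select(), filter(), mutate(), arrange(), and summarize()
library(dplyr)
```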

Then let’s access the iris dataset that comes pre-loaded in base R. We will take that data frame and assign it to a new object called iris_data. Then we will look at our data.
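One way this step could look, as a sketch. `head()` and `glimpse()` are just two of several ways to inspect a data frame.

```r
library(dplyr)

# Copy the built-in iris data frame into a new object
iris_data <- iris

# Look at the first six rows, then at the overall structure
head(iris_data)
glimpse(iris_data)
```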

This dataset contains the measurements (in cm) of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width for three different Species of iris, setosa, versicolor, and virginica.

3 - Using `select()`

select() allows you to pick certain columns to be included in your data frame.

We will create a new data frame called iris_petals_species that includes the columns Species, Petal.Length and Petal.Width.
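A sketch of how this selection might be written, assuming `iris_data` is a copy of the built-in iris data frame:

```r
library(dplyr)
iris_data <- iris

# Keep only the three named columns, in the order given
iris_petals_species <- iris_data %>%
  select(Species, Petal.Length, Petal.Width)
```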

What does our new data frame look like?
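To take a look, something like the following should work. The data frame is recreated here so that the example runs on its own.

```r
library(dplyr)

iris_petals_species <- iris %>%
  select(Species, Petal.Length, Petal.Width)

# First six rows of the new data frame
head(iris_petals_species)

# Column names, in their new order
names(iris_petals_species)
# "Species" "Petal.Length" "Petal.Width"
```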

Note - look what happened to the order of the columns!

This is not the only way to select columns.
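Two other possibilities, shown as a sketch: dropping the columns you do not want, or matching columns by name with a selection helper. The object names `by_dropping` and `by_helper` are made up for this illustration.

```r
library(dplyr)
iris_data <- iris

# Drop the sepal columns; the remaining columns keep their original order
by_dropping <- iris_data %>%
  select(-Sepal.Length, -Sepal.Width)

# Keep Species plus every column whose name starts with "Petal"
by_helper <- iris_data %>%
  select(Species, starts_with("Petal"))

head(by_dropping)
head(by_helper)
```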

You could also subset by indexing with the square brackets, but you can see how much more readable using select() is. It’s nice not to have to refer back to remember what column is which index.
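For comparison, a base R version using square brackets might look like this. The index numbers depend on the column positions in iris (Species is column 5, Petal.Length is 3, Petal.Width is 4), and `iris_by_index` is a name made up for this illustration.

```r
iris_data <- iris

# Rows are left blank (keep all), columns are picked by position
iris_by_index <- iris_data[, c(5, 3, 4)]

head(iris_by_index)
```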

4 - Using `filter()`

Artwork by Allison Horst.

filter() allows you to pick certain observations (i.e., rows) based on their values to be included in your data frame.

We will create a new data frame that only includes information about the irises where their Species is setosa.
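A sketch of this filter. The name `iris_setosa` is an assumption, since the text does not name the new data frame.

```r
library(dplyr)
iris_data <- iris

# Keep only the rows where Species is exactly "setosa"
# (note the double equals sign for comparison)
iris_setosa <- iris_data %>%
  filter(Species == "setosa")

head(iris_setosa)
```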

Let’s check the dimensions of our data frame. Remember, our whole data set is 150 observations, and we are expecting 50 observations per Species.
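The check could be done with `dim()`, which returns the number of rows followed by the number of columns. The filtered data frame (called `iris_setosa` here, an assumed name) is recreated so the example runs on its own.

```r
library(dplyr)

iris_setosa <- iris %>%
  filter(Species == "setosa")

# Rows first, then columns
dim(iris_setosa)
# 50 5
```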

5 - Using `mutate()`

Artwork by Allison Horst.

mutate() allows you to make new variables, while keeping all your existing columns.

Let’s make a new column that is the ratio of Sepal.Length/Sepal.Width.

Note - see where the new column ends up: by default, mutate() adds new columns at the end of the data frame.
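A sketch of this step. The names `iris_sepal_ratio` and `Sepal.Ratio` are made up for this illustration.

```r
library(dplyr)
iris_data <- iris

# Add a column holding the sepal length-to-width ratio;
# all existing columns are kept
iris_sepal_ratio <- iris_data %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width)

head(iris_sepal_ratio)
```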

6 - Using `arrange()`

Very often you will want to order your data frame by some values. To do this, you can use arrange().

Let’s arrange the values in our iris_data by Sepal.Length.
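A sketch of the sort, with `iris_sorted` as an assumed name. `arrange()` sorts in ascending order unless told otherwise.

```r
library(dplyr)
iris_data <- iris

# Sort rows from the smallest to the largest Sepal.Length
iris_sorted <- iris_data %>%
  arrange(Sepal.Length)

head(iris_sorted)
```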

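What if we want to arrange by Sepal.Length, but within Species? We can do that using the helper group_by(), as long as we also tell arrange() to respect the groups with .by_group = TRUE (by default, arrange() ignores grouping).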
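One caveat when sketching this: `arrange()` ignores groups unless `.by_group = TRUE` is set. The example below shows the grouped version and an ungrouped alternative that lists Species as the first sorting column. The object names are made up for this illustration.

```r
library(dplyr)
iris_data <- iris

# Grouped version: sort by Species first, then by Sepal.Length within each group
iris_within_species <- iris_data %>%
  group_by(Species) %>%
  arrange(Sepal.Length, .by_group = TRUE)

# Ungrouped alternative that gives the same row order
iris_two_keys <- iris_data %>%
  arrange(Species, Sepal.Length)

head(iris_within_species)
```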

7 - Using `summarize()`

By using summarize(), you can create a new data frame that has the summary output you have requested.

We can calculate the mean Sepal.Length across our dataset.
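A sketch of the overall mean, with `mean_sepal_length` as an assumed column name. The result is a data frame with one row and one column.

```r
library(dplyr)
iris_data <- iris

# Collapse all 150 rows into a single summary value
overall_mean <- iris_data %>%
  summarize(mean_sepal_length = mean(Sepal.Length))

overall_mean
```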

What if we want to calculate means for each Species?
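Adding `group_by()` before `summarize()` is one way to do this. The sketch below returns one row per Species; the names `species_means` and `mean_sepal_length` are assumptions.

```r
library(dplyr)
iris_data <- iris

# One mean per Species instead of one mean for the whole data set
species_means <- iris_data %>%
  group_by(Species) %>%
  summarize(mean_sepal_length = mean(Sepal.Length))

species_means
```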

We can integrate some helper functions into our code to easily get a variety of outputs. We can use across() to apply our summary across a set of columns. I really like this function.
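A sketch of `across()` in action, assuming dplyr 1.0.0 or later (the version that introduced `across()`). Here the four measurement columns are picked as a range, and `mean()` is applied to each of them per Species. The name `species_all_means` is made up for this illustration.

```r
library(dplyr)
iris_data <- iris

# Apply mean() to every column from Sepal.Length through Petal.Width, per Species
species_all_means <- iris_data %>%
  group_by(Species) %>%
  summarize(across(Sepal.Length:Petal.Width, mean))

species_all_means
```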

This can also be useful for counting observations per group. Here, how many iris observations do we have per Species?
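One way to count rows per group is `n()` inside `summarize()`, sketched below. `count(Species)` and `group_by(Species) %>% tally()` are shorter routes to the same numbers. The name `species_counts` is an assumption.

```r
library(dplyr)
iris_data <- iris

# n() returns the number of rows in the current group
species_counts <- iris_data %>%
  group_by(Species) %>%
  summarize(n = n())

species_counts
```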

8 - Breakout rooms!

Read in data

Now you try! We are going to use the Great Backyard Birds dataset we downloaded two weeks ago and you will apply the functions we have learned above to investigate this dataset.

If you weren’t here for Session 1, get the birds data set.

Download the file from the internet.

If you were here for Session 1, join back in! Let’s read in our data.

Exercises

Below you can find our breakout room exercises for today.

Exercise 1

Investigate the structure of the birds dataset.

Solution (click here)

Exercise 2

Create a new data frame that removes the column range.

Hints (click here)

Try using select(). Remember, you can tell select() what you want to keep, and what you want to remove.
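As a reminder of the syntax, here is a sketch on the iris data rather than the birds data, so the exercise itself is still yours to solve. A minus sign in front of a column name removes that column.

```r
library(dplyr)

# Remove Species and keep everything else
iris_no_species <- iris %>%
  select(-Species)

names(iris_no_species)
```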

Solutions (click here)

Exercise 3

How many unique species of birds have been observed?

Hints (click here)

Try using summarize() with a group_by() helper.

Solutions (click here)

Exercise 4

How many times have Bald Eagles been observed?

Hints (click here)

Try using filter(). Remember the syntax you need to use to indicate you are looking for a Bald Eagle.

Solutions (click here)

Exercise 5

How many times has any kind of eagle been observed? Group hint: there are only Bald Eagle and Golden Eagle in this dataset.

Hints (click here)

There is a way to denote OR within filter().

More Hints (click here)

You denote OR by using the vertical bar.
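A sketch of the OR syntax on the iris data (not the birds data, so the exercise stays open). A row is kept if either condition is true.

```r
library(dplyr)

# Keep rows where Species is setosa OR virginica
iris_two_species <- iris %>%
  filter(Species == "setosa" | Species == "virginica")

nrow(iris_two_species)
# 100
```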

Solutions (click here)

Exercise 6

What is the northernmost location of the bird observations in Ohio?

Hints (click here)

Try using arrange(). You can arrange in both ascending and descending order. You can also use your Ohio knowledge to check if you’ve done this correctly.
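To illustrate descending order, here is a sketch on the iris data. Wrapping a column in `desc()` puts the largest values first.

```r
library(dplyr)

# Largest Sepal.Length at the top
iris_descending <- iris %>%
  arrange(desc(Sepal.Length))

head(iris_descending)
```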

Solutions (click here)

Bonus time!

Bonus 1

What is the most commonly observed bird in Ohio?

Hints (click here)

Try using tally() and a little helper term.
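A sketch of `tally()` on the iris data, assuming the "helper term" in the hint refers to its `sort` argument. With `sort = TRUE`, the groups with the most rows come first. The filter on Sepal.Length is there only to make the group sizes unequal.

```r
library(dplyr)

# Count the long-sepaled flowers per Species, largest count first
long_sepal_counts <- iris %>%
  filter(Sepal.Length > 6) %>%
  group_by(Species) %>%
  tally(sort = TRUE)

long_sepal_counts
```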

Solutions (click here)

Bonus 2

What is the least commonly observed bird (or birds) in Ohio?

Hints (click here)

Try using the data frame you’ve created in the previous exercise.

Solutions (click here)