3  Data wrangling

Real data is messy and often the data input and re-formatting is a tedious task, especially if those collecting the data are not familiar with what works best for computer languages.

3.1 Base R vs tidyverse

Although we will skip of many of the basic R functionality, here is a quick overview of some key features. Based on https://r4ds.hadley.nz/workflow-basics.html#coding-basics

Basic arithmetic with single numbers:

1 / 200 * 30
#> [1] 0.15

(59 + 73 + 2) / 3
#> [1] 44.66667

2^2
#> [1] 4

sqrt(4)
#> 2

Assignment operator. This assigns the number to an ‘object name’:

```{r}
x <- 5 * 2

print(x)
```
[1] 10

Any object we create in R can be assigned a name, so that it is stored in our Environment for us to access again:

object_name <- value

When reading that code, say “object name gets value” in your head.

Tip

To quickly type the assignment operator <- press ALT plus - key (or Option plus - on Mac). Other keyboard shortcuts are listed under Tools -> Keyboard Shortcuts Help.

Multiple elements can be combined together into a vector. R will apply functions and arithmetic to the a whole vector. This is a very useful concept in R as we will see when working with data frames (tables).

```{r}
x <- c(3, 5, 10, 1)

print(x)

print(x * 2)
```
[1]  3  5 10  1
[1]  6 10 20  2

If we want to access a single element of that vector we can use:

```{r}
x[1]
```
[1] 3

or multiple elements:

```{r}
# get values from 1st to 2nd position
x[1:2]

# use a logical vector:
x[c(TRUE, TRUE, FALSE, FALSE)]

# or use another vector of positions to return:
x[c(1,3)]
```
[1] 3 5
[1] 3 5
[1]  3 10

Leave a lot of comments in your code! Your future self will love you for it.

```{r}
# This text is not evaluated by R, because it starts with a hashtag.
# You can highlight multiple rows of code and press Ctrl + Shift + C to 'comment out' all of them at once.
```

3.1.1 Functions

Functions are basically scripts that are designed to take certain inputs, do something useful with those inputs, and then give you something back. They take the form:

function_name(argument1 = value1, argument2 = value2, ...)

For example, the function we used above sqrt() expects 1 argument: ‘a numeric or complex vector or array’. Therefore, we could have written:

```{r}
sqrt(x = 4)

# because there is only 1 argument, we would typically leave the x = out of it.
sqrt(4)
```
[1] 2
[1] 2

Even when we have multiple arguments, we don’t have to write them out in full if we maintain the correct order of arguments. Often it is better to though, so you can easily see what you’re trying to do.

```{r}
seq(from = 10, to = 30, by = 2)

# same as this, which might be confusing:
#seq(10, 30, 2) 
```
 [1] 10 12 14 16 18 20 22 24 26 28 30

Sometimes we find functions with ... which means that it will take multiple values in this position.

For example, thepaste() function expects ... for it’s first object and the help page says this argument should be: “One or more R objects, to be converted to character vectors”.

Therefore we can do:

```{r}
paste("Hello", "World")

# OR:
string1 <- "Hello" 
string2 <- "World"

paste(string1, string2)
```
[1] "Hello World"
[1] "Hello World"

But it still has named arguments we can use:

```{r}
paste(string1, string2, sep = "_*_")
```
[1] "Hello_*_World"

When we paste a vector with a single value, it will return multiple strings. But we can also request them to be collapsed into 1 string.

```{r}
# multiple strings:
paste(x, string1, sep = "")

# 1 string:
paste(x, string1, string2, sep = "", collapse = "_")
```
[1] "3Hello"  "5Hello"  "10Hello" "1Hello" 
[1] "3HelloWorld_5HelloWorld_10HelloWorld_1HelloWorld"
Tip

If you’re working with strings, check out the stringr package.

3.2 Create demo dataset

This code creates a data frame using only numbers and strings (text). This is an example of a ‘reproducible example’ where anybody can take this code, run it and see the same df_data object in their environment.

This is also handy for testing code or asking questions online. For more details on reproducible examples see https://www.tidyverse.org/help/#reprex

You’ll rarely need to make a data frame from scratch, but it’s handy to see that it is a collection of vectors, where each row is a vector.

```{r}
df_data <- data.frame(
  Animal_ID = rep(c(11,12,21,22), times = 3),
  Timepoint = rep(seq(from = 1, to = 3), each = 4),
  Treatment_Group = rep(c("Treatment A","Treatment B"), each = 2, times = 3),
  Variable = c(1,1.2,1,1.2,2,2.5,.5,.4,3,3.5,0.2,.1))
```

To show the dataframe we just created below our code chunk we can write print(df_data) or simply df_data works in RStudio:

```{r}
df_data
```
   Animal_ID Timepoint Treatment_Group Variable
1         11         1     Treatment A      1.0
2         12         1     Treatment A      1.2
3         21         1     Treatment B      1.0
4         22         1     Treatment B      1.2
5         11         2     Treatment A      2.0
6         12         2     Treatment A      2.5
7         21         2     Treatment B      0.5
8         22         2     Treatment B      0.4
9         11         3     Treatment A      3.0
10        12         3     Treatment A      3.5
11        21         3     Treatment B      0.2
12        22         3     Treatment B      0.1

This dataframe can also be viewed in it’s own tab inside RStudio, like an excel sheet. Put your cursor on df_data and press F2 or type:

```{r}
#| eval: false
View(df_data) #notice the capital V
```

3.3 Wide vs Long vs Tidy data

Often people consider the shape of a data frame to be either wide or long. Wide data might look like this, where some column names contains values (not headings):

Code
```{r}
#| code-fold: true
pivot_wider(df_data, 
            names_from = Timepoint, 
            values_from = Variable, 
            names_prefix = "Time_")
```
# A tibble: 4 × 5
  Animal_ID Treatment_Group Time_1 Time_2 Time_3
      <dbl> <chr>            <dbl>  <dbl>  <dbl>
1        11 Treatment A        1      2      3  
2        12 Treatment A        1.2    2.5    3.5
3        21 Treatment B        1      0.5    0.2
4        22 Treatment B        1.2    0.4    0.1

Or, we might see that information about the same animal is spread across multiple rows in a long format:

Code
```{r}
#| code-fold: true
pivot_longer(df_data,
             cols = c(Timepoint, Variable), 
             names_to = 'name', 
             values_to = 'value')
```
# A tibble: 24 × 4
   Animal_ID Treatment_Group name      value
       <dbl> <chr>           <chr>     <dbl>
 1        11 Treatment A     Timepoint   1  
 2        11 Treatment A     Variable    1  
 3        12 Treatment A     Timepoint   1  
 4        12 Treatment A     Variable    1.2
 5        21 Treatment B     Timepoint   1  
 6        21 Treatment B     Variable    1  
 7        22 Treatment B     Timepoint   1  
 8        22 Treatment B     Variable    1.2
 9        11 Treatment A     Timepoint   2  
10        11 Treatment A     Variable    2  
# ℹ 14 more rows

But when we made our demo data frame we actually made it in ‘tidy’ format, because the tidyverse is designed to work with data in this format. The principals are best described in the R for Data Science book: https://r4ds.hadley.nz/data-tidy.html#fig-tidy-structure

We will often collect observational data in wide format, and present our research in wide format for publications, because it is often the best way for humans to understand data on a page. There are 2 very useful functions for converting your data to a tidy format when you are workign with your data in R: pivot_wider() and pivot_longer(). In fact, that’s the functions used above to swap our original data frame.

3.4 Introducing ‘select’ function

Picture every function as having an input and output. For example, the basic function paste('hello', 'world') has 2 strings given to it as input, and then when the function is executed it outputs 1 character string, which we can then assign to an object and use later.

If we extend this to a function that works on data frames, we would normally give the function a data frame as input and it will return the updated/modified dataframe as output. When using functions from dplyr/tidyverse they always take the input data as the first argument of the function.

For example, select() is a function for selecting columns from data frames. Our example data frame has 4 columns, but let’s say we wanted only 2 of them and we wanted to select which ones to keep by using their names, we could write:

```{r}
# select takes the dataframe as first argument, then the names of the columns you want to keep/exclude
df_two_columns <- select(df_data, Animal_ID, Variable)

df_two_columns
```
   Animal_ID Variable
1         11      1.0
2         12      1.2
3         21      1.0
4         22      1.2
5         11      2.0
6         12      2.5
7         21      0.5
8         22      0.4
9         11      3.0
10        12      3.5
11        21      0.2
12        22      0.1

We can also specify the columns we don’t want by using a - symbol before it. The following function will give the same result as above but by exluding the unwanted columns.

```{r}
df_two_columns <- select(df_data, -Timepoint, -Treatment_Group)

df_two_columns
```
   Animal_ID Variable
1         11      1.0
2         12      1.2
3         21      1.0
4         22      1.2
5         11      2.0
6         12      2.5
7         21      0.5
8         22      0.4
9         11      3.0
10        12      3.5
11        21      0.2
12        22      0.1

The select function can also be used with ‘selection helpers’. For example if you want all columns that included a certain bit of text you could use contains("text_to_search"), or if you wanted columns that started with a certain bit of text starts_with("text_to_search"). You can also select columns based on what type of data they contain. Our example data frame has columns of type ‘num’ (number, technically they are a ‘double’), ‘int’ (integer, i.e. a num but not a double) or ‘chr’ (character):

```{r}
glimpse(df_data)
```
Rows: 12
Columns: 4
$ Animal_ID       <dbl> 11, 12, 21, 22, 11, 12, 21, 22, 11, 12, 21, 22
$ Timepoint       <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
$ Treatment_Group <chr> "Treatment A", "Treatment A", "Treatment B", "Treatmen…
$ Variable        <dbl> 1.0, 1.2, 1.0, 1.2, 2.0, 2.5, 0.5, 0.4, 3.0, 3.5, 0.2,…

So, if we wanted to select the columns with characters we could use the where() helper function:

```{r}
select(df_data, where(is.character))
```
   Treatment_Group
1      Treatment A
2      Treatment A
3      Treatment B
4      Treatment B
5      Treatment A
6      Treatment A
7      Treatment B
8      Treatment B
9      Treatment A
10     Treatment A
11     Treatment B
12     Treatment B
Note

is.character() is a function in itself. By using it inside where() it lets you use it safely with the select function. E.g. you could test if some object or column is a character directly with it: is.character(df_data$Animal_ID) = FALSE is.character(df_data$Treatment_Group) = TRUE

The select function also returns the data frame with the columns in the order you specify. Sometimes you might have 100 columns, and you mutate a new column onto the end, but you want it to appear first. You could use the ‘everything()’ helper as follows:

```{r}
# selects the column of interst first, then returns 'everything' else
select(df_data, Variable, everything())
```
   Variable Animal_ID Timepoint Treatment_Group
1       1.0        11         1     Treatment A
2       1.2        12         1     Treatment A
3       1.0        21         1     Treatment B
4       1.2        22         1     Treatment B
5       2.0        11         2     Treatment A
6       2.5        12         2     Treatment A
7       0.5        21         2     Treatment B
8       0.4        22         2     Treatment B
9       3.0        11         3     Treatment A
10      3.5        12         3     Treatment A
11      0.2        21         3     Treatment B
12      0.1        22         3     Treatment B
Tip

There’s actually another function called relocate() that makes this even easier:

Code
```{r}
#| code-fold: true
#| eval: false
relocate(df_data, Variable)
```

3.5 Using pipes

At this point we should introduce one of the most useful aspects of using the tidyverse, pipes.

Pipe refers to: %>% (or |> if using R’s newly inbuilt pipe, see Global Options in RStudio)

When working with data that has been imported there are normally multiple ‘data wrangling’ steps involved, such as selecting columns of interest, adding new columns of calculated values, filtering out rows of interest, etc. Each of these steps can get messy in the code when each step is taking a data frame as input and giving a new data frame as output. So, the %>% is used to keep the code more readable.

The %>% takes the output from function and gives it to the first position/argument of the next function, without having to save it as a variable in your environment. It takes this structure: output_from_previous %>% next_function(argument1 = 1, argument2 = 3) is the same as next_function(output_from_previous, argument1 = 1, argument2 = 3) Notice how the ‘output_from_previous’ object is placed into the first position of the next function by the pipe.

For the first pipe in a chain of pipes the ‘output_from_previous’ is normally just a starting data frame object itself, which is parsed to the first position of the function: df_data %>% next_function()

Let’s take the above select function example and re-write it using the pipe:

```{r}
#Original:
#df_two_columns <- select(df_data, Animal_ID, Variable)

df_two_columns <- df_data %>% select(Animal_ID, Variable)

df_two_columns
```
   Animal_ID Variable
1         11      1.0
2         12      1.2
3         21      1.0
4         22      1.2
5         11      2.0
6         12      2.5
7         21      0.5
8         22      0.4
9         11      3.0
10        12      3.5
11        21      0.2
12        22      0.1

For this example it makes very little difference to the ‘readability’ of the code. But let’s say we have 3 steps of cleaning we want to do:

  1. Select 3 columns: Animal_ID, Variable, Treatment_Group
  2. Arrange the columns in order of Animal_ID
  3. Filter out any rows that have a ‘Variable’ reading of less than 1

Without pipes we might do this.

Step 1 - select:

```{r}
df_step1 <- select(df_data, Animal_ID, Variable, Treatment_Group)

df_step1
```
   Animal_ID Variable Treatment_Group
1         11      1.0     Treatment A
2         12      1.2     Treatment A
3         21      1.0     Treatment B
4         22      1.2     Treatment B
5         11      2.0     Treatment A
6         12      2.5     Treatment A
7         21      0.5     Treatment B
8         22      0.4     Treatment B
9         11      3.0     Treatment A
10        12      3.5     Treatment A
11        21      0.2     Treatment B
12        22      0.1     Treatment B

Step 2 - arrange:

```{r}
df_step2 <- arrange(df_step1, Animal_ID)

df_step2
```
   Animal_ID Variable Treatment_Group
1         11      1.0     Treatment A
2         11      2.0     Treatment A
3         11      3.0     Treatment A
4         12      1.2     Treatment A
5         12      2.5     Treatment A
6         12      3.5     Treatment A
7         21      1.0     Treatment B
8         21      0.5     Treatment B
9         21      0.2     Treatment B
10        22      1.2     Treatment B
11        22      0.4     Treatment B
12        22      0.1     Treatment B

Step 3 - filter:

```{r}
#The ! symbol means NOT and negates a check. Here it is used when you want to exclude rather than keep what you are filtering.
df_step3 <- filter(df_step2, !Variable < 1)

df_step3
```
  Animal_ID Variable Treatment_Group
1        11      1.0     Treatment A
2        11      2.0     Treatment A
3        11      3.0     Treatment A
4        12      1.2     Treatment A
5        12      2.5     Treatment A
6        12      3.5     Treatment A
7        21      1.0     Treatment B
8        22      1.2     Treatment B

This creates a lot of variables in your environment which wastes memory and is difficult to keep track of. Therefore, we can use pipes to do the same:

```{r}
df_out <- df_data %>% 
  select(Animal_ID, Variable, Treatment_Group) %>%  
  arrange(Animal_ID) %>% 
  filter(!Variable < 1) 


df_out
```
  Animal_ID Variable Treatment_Group
1        11      1.0     Treatment A
2        11      2.0     Treatment A
3        11      3.0     Treatment A
4        12      1.2     Treatment A
5        12      2.5     Treatment A
6        12      3.5     Treatment A
7        21      1.0     Treatment B
8        22      1.2     Treatment B

When troubleshooting a problem in a chain of functions, you can split the chain at any point by removing a %>% and returning everything up to that point to help you visualise what the data frame looks like at that point in the chain.

3.6 Import data

Normally we are interested in importing some data we have stored in a spreadsheet, such as what we might have in Excel.

If we’ve collected data in an experiment, it’s probably in a nicely formatted Excel format that is terrible for computer to read. Often it is easiest to re-format your tables with 1 row for column headings, and remove all formatting.

Sometimes this is best done by saving a sheet to a plain text format such as .csv or .txt. It also means your data isn’t modified automatically by Excel (such as some gene names being converted to dates e.g. MARCH1, which have now been renamed because of this ).

Plain text files are one of the most standard ways of saving data for both humans and software to read, and for most data we work with from experiments it is relatively fast. But note, there are much more efficient ways of handling bigger data sets: https://r4ds.hadley.nz/arrow.html

```{r}
# find the file path to a demo dataset

my_filepath <- palmerpenguins::path_to_file("penguins_raw.csv")

print(my_filepath)
```
[1] "/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/palmerpenguins/extdata/penguins_raw.csv"

Use this filepath to import with `data.table::fread()`

```{r}
df_in <- read_csv(file = my_filepath)
```
Rows: 344 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
dbl  (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
date (1): Date Egg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```{r}
# View(df_in)
```

Check the dimensions and structure of your data to make sure it’s imported how you expected:

```{r}
# just the dimensions
dim(df_in)

# structure of df
glimpse(df_in)
```
[1] 344  17
Rows: 344
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
$ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
$ `Date Egg`            <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,…
$ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34…
$ `Culmen Depth (mm)`   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18…
$ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,…
$ `Body Mass (g)`       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34…
$ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
$ `Delta 15 N (o/oo)`   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18…
$ `Delta 13 C (o/oo)`   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298…
$ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…
Note

My personal preference is to use the fread() function from the data.table package (which is a popular alternative to the tidyverse way of working with data). I’ve found it to be good at guessing what type of data is in each column, for various file formats and is faster than other functions (fread is short for ‘fast read’).

It also works with multiple filetypes and guesses what to do based on the file extension of the file.

It is good practice to store your input files for your scripts in a subfolder of your Project directory called ‘Inputs’ or ‘data’ or something similar. That way, you are storing a copy of the original data with your scripts, and you can use relative filepaths:

df_in <- read_csv("./Inputs/my_awesome_data.csv")

3.6.1 Reading in excel files

It has become easier to read data directly from excel sheets. It is still worth remembering that data you see formatted in cells in Excel is not always a good representation of what is stored underneath it (e.g. dates as the number of seconds since January 1, 1970). Let the function try and guess the data types correctly, and modify if neccesary. See details: https://r4ds.hadley.nz/spreadsheets.html

You can also specify which sheet to import.

df_in <- read_excel("./Inputs/my_awesome_excel_file.xlsx", sheet = "Sheet 1")
Tab completion

When writing a file path to import or export a file, you can press the Tab button on your keyboard after you type "./", and it will display a list of files and folders. Tab complete is useful for typing long variable or function names too.

3.6.2 Column names

When we looked at the raw data from palmerpenguins we noticed that some of the column names had spaces in them. We can technically get by with this using ticks:

df_in$`Sample Number`
Note

Note that the $ is a base R way of accessing the values within a column

To fix these column names we can either:

  1. Tell read_csv what names to use:
```{r}
read_csv(my_filepath, 
         col_names = c("studyName", 'sample_number', 'Species'), 
         show_col_types = FALSE # this stops the message about cols printing
         )
```
# A tibble: 345 × 17
   studyName sample_number Species     X4    X5    X6    X7    X8    X9    X10  
   <chr>     <chr>         <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 studyName Sample Number Species     Regi… Isla… Stage Indi… Clut… Date… Culm…
 2 PAL0708   1             Adelie Pen… Anve… Torg… Adul… N1A1  Yes   2007… 39.1 
 3 PAL0708   2             Adelie Pen… Anve… Torg… Adul… N1A2  Yes   2007… 39.5 
 4 PAL0708   3             Adelie Pen… Anve… Torg… Adul… N2A1  Yes   2007… 40.3 
 5 PAL0708   4             Adelie Pen… Anve… Torg… Adul… N2A2  Yes   2007… <NA> 
 6 PAL0708   5             Adelie Pen… Anve… Torg… Adul… N3A1  Yes   2007… 36.7 
 7 PAL0708   6             Adelie Pen… Anve… Torg… Adul… N3A2  Yes   2007… 39.3 
 8 PAL0708   7             Adelie Pen… Anve… Torg… Adul… N4A1  No    2007… 38.9 
 9 PAL0708   8             Adelie Pen… Anve… Torg… Adul… N4A2  No    2007… 39.2 
10 PAL0708   9             Adelie Pen… Anve… Torg… Adul… N5A1  Yes   2007… 34.1 
# ℹ 335 more rows
# ℹ 7 more variables: X11 <chr>, X12 <chr>, X13 <chr>, X14 <chr>, X15 <chr>,
#   X16 <chr>, X17 <chr>

As you can see, we would have to write out all 17 column names otherwise it gets a generic name like X4

  1. Therefore, we could just rename some by hand:
```{r}
# view current column names
colnames(df_in)

df_in <- 
  df_in %>% 
  rename(
    # new_name = old_name
    sample_number = `Sample Number`,
    individual_ID = `Individual ID` # etc.
  )

# check new column names
# colnames(df_in)
```
 [1] "studyName"           "Sample Number"       "Species"            
 [4] "Region"              "Island"              "Stage"              
 [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
[10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
[13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
[16] "Delta 13 C (o/oo)"   "Comments"           
  1. We can also use janitor::clean_names() to fix them automatically:
```{r}
# You may need to run: install.packages("janitor")
df_in <- 
  df_in %>% 
  janitor::clean_names()

colnames(df_in)
```
 [1] "study_name"        "sample_number"     "species"          
 [4] "region"            "island"            "stage"            
 [7] "individual_id"     "clutch_completion" "date_egg"         
[10] "culmen_length_mm"  "culmen_depth_mm"   "flipper_length_mm"
[13] "body_mass_g"       "sex"               "delta_15_n_o_oo"  
[16] "delta_13_c_o_oo"   "comments"         

3.7 Task:

  1. Import your own data into R
  2. Check the structure of the data frame (data types in each column)
  3. Rename some columns so they are all easy to work with

3.8 Other useful basic tools

3.8.1 mutate

The mutate() function is another dplyr verb you’ll use a lot. It adds columns to your current data frame. For this we will use our own simple data frame that we created earlier.

We can add a single values to all rows of a data frame:

```{r}
df_data %>% 
  mutate(new_col = 1,
         another_col = "hello")
```
   Animal_ID Timepoint Treatment_Group Variable new_col another_col
1         11         1     Treatment A      1.0       1       hello
2         12         1     Treatment A      1.2       1       hello
3         21         1     Treatment B      1.0       1       hello
4         22         1     Treatment B      1.2       1       hello
5         11         2     Treatment A      2.0       1       hello
6         12         2     Treatment A      2.5       1       hello
7         21         2     Treatment B      0.5       1       hello
8         22         2     Treatment B      0.4       1       hello
9         11         3     Treatment A      3.0       1       hello
10        12         3     Treatment A      3.5       1       hello
11        21         3     Treatment B      0.2       1       hello
12        22         3     Treatment B      0.1       1       hello

It’s more likely that will want to take a column we already have, do something to it and store it in a new column:

```{r}
df_data %>% 
  mutate(
    Variable_doubled = Variable * 2
  )
```
   Animal_ID Timepoint Treatment_Group Variable Variable_doubled
1         11         1     Treatment A      1.0              2.0
2         12         1     Treatment A      1.2              2.4
3         21         1     Treatment B      1.0              2.0
4         22         1     Treatment B      1.2              2.4
5         11         2     Treatment A      2.0              4.0
6         12         2     Treatment A      2.5              5.0
7         21         2     Treatment B      0.5              1.0
8         22         2     Treatment B      0.4              0.8
9         11         3     Treatment A      3.0              6.0
10        12         3     Treatment A      3.5              7.0
11        21         3     Treatment B      0.2              0.4
12        22         3     Treatment B      0.1              0.2

This is similar to how we multiplied every element of a vector by 1 number earlier. We can also use functions, and also refer to the column we just calculated in the next row:

```{r}
df_data %>% 
  mutate(
    variable_mean = mean(Variable),
    variable_minus_mean = Variable - variable_mean
  )

# In reality we would just do: 
# mutate(variable_minus_mean = Variable - mean(variable_mean))
```
   Animal_ID Timepoint Treatment_Group Variable variable_mean
1         11         1     Treatment A      1.0      1.383333
2         12         1     Treatment A      1.2      1.383333
3         21         1     Treatment B      1.0      1.383333
4         22         1     Treatment B      1.2      1.383333
5         11         2     Treatment A      2.0      1.383333
6         12         2     Treatment A      2.5      1.383333
7         21         2     Treatment B      0.5      1.383333
8         22         2     Treatment B      0.4      1.383333
9         11         3     Treatment A      3.0      1.383333
10        12         3     Treatment A      3.5      1.383333
11        21         3     Treatment B      0.2      1.383333
12        22         3     Treatment B      0.1      1.383333
   variable_minus_mean
1           -0.3833333
2           -0.1833333
3           -0.3833333
4           -0.1833333
5            0.6166667
6            1.1166667
7           -0.8833333
8           -0.9833333
9            1.6166667
10           2.1166667
11          -1.1833333
12          -1.2833333

It’s also useful to replace the column, especially if we transform its type:

```{r}
df_data %>% 
  mutate(
    Variable = as.integer(Variable),
    Treatment_Group = as.factor(Treatment_Group)
  )
```
   Animal_ID Timepoint Treatment_Group Variable
1         11         1     Treatment A        1
2         12         1     Treatment A        1
3         21         1     Treatment B        1
4         22         1     Treatment B        1
5         11         2     Treatment A        2
6         12         2     Treatment A        2
7         21         2     Treatment B        0
8         22         2     Treatment B        0
9         11         3     Treatment A        3
10        12         3     Treatment A        3
11        21         3     Treatment B        0
12        22         3     Treatment B        0

It’s also works well with tidyselect functions, like we did using select(). But this time we need the across() function, which is a little more advanced but super useful. See https://r4ds.hadley.nz/iteration.html

Here we will use the penguins dataset again.

If we wanted to divide all numeric columns by 100 and then round to 1 digits we could do:

```{r}
penguins %>% 
  mutate(
    across(where(is.numeric), ~ .x / 100),
    across(where(is.numeric), ~round(.x, digits = 1))
    )
```
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen            0.4           0.2               1.8        37.5
 2 Adelie  Torgersen            0.4           0.2               1.9        38  
 3 Adelie  Torgersen            0.4           0.2               2          32.5
 4 Adelie  Torgersen           NA            NA                NA          NA  
 5 Adelie  Torgersen            0.4           0.2               1.9        34.5
 6 Adelie  Torgersen            0.4           0.2               1.9        36.5
 7 Adelie  Torgersen            0.4           0.2               1.8        36.2
 8 Adelie  Torgersen            0.4           0.2               2          46.8
 9 Adelie  Torgersen            0.3           0.2               1.9        34.8
10 Adelie  Torgersen            0.4           0.2               1.9        42.5
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <dbl>

This functionality is even more useful with summarise() which we will touch on in the next chapter.

3.8.2 Filter data

Another common task we have is to filter out rows of data based on some criteria we have. Let’s say we only wanted the data from Animal 22:

```{r}
df_data %>% 
  filter(Animal_ID == 22)
```
  Animal_ID Timepoint Treatment_Group Variable
1        22         1     Treatment B      1.2
2        22         2     Treatment B      0.4
3        22         3     Treatment B      0.1

We can negate this to show everything other than Animal 22:

```{r}
df_data %>% 
  filter(!Animal_ID == 22) 
# filter(Animal_ID != 22) also works
```
  Animal_ID Timepoint Treatment_Group Variable
1        11         1     Treatment A      1.0
2        12         1     Treatment A      1.2
3        21         1     Treatment B      1.0
4        11         2     Treatment A      2.0
5        12         2     Treatment A      2.5
6        21         2     Treatment B      0.5
7        11         3     Treatment A      3.0
8        12         3     Treatment A      3.5
9        21         3     Treatment B      0.2

We can specify multiple conditions:

```{r}
# AND operator is &
df_data %>% 
  filter(Animal_ID != 22 & Timepoint == 2)
```
  Animal_ID Timepoint Treatment_Group Variable
1        11         2     Treatment A      2.0
2        12         2     Treatment A      2.5
3        21         2     Treatment B      0.5
```{r}
# OR operator is |
df_data %>% 
  filter(Animal_ID == 22 | Timepoint == 2)
```
  Animal_ID Timepoint Treatment_Group Variable
1        22         1     Treatment B      1.2
2        11         2     Treatment A      2.0
3        12         2     Treatment A      2.5
4        21         2     Treatment B      0.5
5        22         2     Treatment B      0.4
6        22         3     Treatment B      0.1

We can also use %in% to match multiple values

```{r}
df_data %>% 
  filter(Timepoint %in% c(1,3))
```
  Animal_ID Timepoint Treatment_Group Variable
1        11         1     Treatment A      1.0
2        12         1     Treatment A      1.2
3        21         1     Treatment B      1.0
4        22         1     Treatment B      1.2
5        11         3     Treatment A      3.0
6        12         3     Treatment A      3.5
7        21         3     Treatment B      0.2
8        22         3     Treatment B      0.1

3.8.3 Handling NA values

Normally it is best to leave rows in place that contain NA values because no functions in the tidyverse will drop NA values without notifying you.

If you need to remove rows with NA values you can use tidyr::drop_na(). With this you can tell it which columns to inspect for missing values.

Before you do this, check out this explanation of missing values here: https://r4ds.hadley.nz/missing-values.html

Note

We haven’t talked about dates or times here. It takes a little bit of time to learn, but keep in mind that ggplot works very nicely with dates and times, so take the time to format your timeseries data correctly: https://r4ds.hadley.nz/datetimes.html

3.9 Tasks:

  1. Add a column to your own data so that each row contains the string “cows” using mutate()
  2. Remove column using select()
  3. Add a column to your own data with a calculation using an existing column
  4. Create a subset of data using filter()