4  Summarising data

Often we are interested in comparing treatment means. Taking the mean of variables in our data is way of summarising it, along with various other ways we can summarise our data in R.

If we want the summary of different treatments, we are actually wanting to group our data by these treatments and then calculate something from them.

4.1 Split-apply-combine concept

As we will see, one of the most useful tools is the group_by() function. This works on the ‘split-apply-combine’ principal where you split the data up into smaller tables, do something to those smaller tables, then put all the data back together again.

Let’s say we wanted to know the sum of the Variable column, we could do:

```{r}
sum(df_data$Variable)
```
[1] 16.6

As we learnt in the previous chapter, we could store this in a new column using mutate():

```{r}
df_data %>% mutate(sum_total = sum(Variable))
```
   Animal_ID Timepoint Treatment_Group Variable sum_total
1         11         1     Treatment A      1.0      16.6
2         12         1     Treatment A      1.2      16.6
3         21         1     Treatment B      1.0      16.6
4         22         1     Treatment B      1.2      16.6
5         11         2     Treatment A      2.0      16.6
6         12         2     Treatment A      2.5      16.6
7         21         2     Treatment B      0.5      16.6
8         22         2     Treatment B      0.4      16.6
9         11         3     Treatment A      3.0      16.6
10        12         3     Treatment A      3.5      16.6
11        21         3     Treatment B      0.2      16.6
12        22         3     Treatment B      0.1      16.6

But let’s say we wanted to know the sum of the variable column for each Treatment (not total). If we ‘group_by’ Treatment_Group first, then behind the scenes the sum function will first be given a subset of the dataframe containing only the Treatment A data, then it will be given all of the Treatment B data. To do it manually we would have to do do:

```{r}
df_treatmentA <- df_data %>% filter(Treatment_Group == 'Treatment A')

sum(df_treatmentA$Variable)
```
[1] 13.2

Then we would then have to repeat this for all of the treatments, which is not ideal. So, one might consider using a function to help, like:

```{r}
# 1. get all unique names in Treatment_Group
treatments <- unique(df_data$Treatment_Group)

treatments
```
[1] "Treatment A" "Treatment B"
```{r}
#2. Write a function
f_groupsums <- function(x){
  df_temp <- df_data %>% filter(Treatment_Group == x)
  
  return(sum(df_temp$Variable))
}

#3. use apply to iterate through the treatments 
lapply(treatments, f_groupsums)
```
[[1]]
[1] 13.2

[[2]]
[1] 3.4

This takes a lot of code,is probably a little scary if you’re not used to making functions, and returns a list of numbers that you then need to process further to get the treatment names for them.

4.2 The easy way

Instead, we use group_by with the summarise function instead. The summarise function is like mutate, except you use functions inside it that summarise the data, returning single values per group:

```{r}
df_data %>% 
  group_by(Treatment_Group) %>% 
  summarise(sum_groups = sum(Variable))
```
# A tibble: 2 × 2
  Treatment_Group sum_groups
  <chr>                <dbl>
1 Treatment A           13.2
2 Treatment B            3.4

This gives the same numbers, only in a nice data frame.

It’s also possible to group by multiple columns:

```{r}
df_data %>% 
  group_by(Treatment_Group, Timepoint) %>% 
  summarise(sum_groups = sum(Variable))
```
`summarise()` has grouped output by 'Treatment_Group'. You can override using
the `.groups` argument.
# A tibble: 6 × 3
# Groups:   Treatment_Group [2]
  Treatment_Group Timepoint sum_groups
  <chr>               <int>      <dbl>
1 Treatment A             1        2.2
2 Treatment A             2        4.5
3 Treatment A             3        6.5
4 Treatment B             1        2.2
5 Treatment B             2        0.9
6 Treatment B             3        0.3

You could actually still use mutate if you wanted to store this in a column of the original data frame (useful for some calculations, such as % within a group)

```{r}
df_data %>% 
  group_by(Treatment_Group) %>% 
  mutate(sum_groups = sum(Variable),
         perc_of_group_sum = Variable/sum_groups*100) %>% 
    arrange(Treatment_Group) #just to make it nicer to read
```
# A tibble: 12 × 6
# Groups:   Treatment_Group [2]
   Animal_ID Timepoint Treatment_Group Variable sum_groups perc_of_group_sum
       <dbl>     <int> <chr>              <dbl>      <dbl>             <dbl>
 1        11         1 Treatment A          1         13.2              7.58
 2        12         1 Treatment A          1.2       13.2              9.09
 3        11         2 Treatment A          2         13.2             15.2 
 4        12         2 Treatment A          2.5       13.2             18.9 
 5        11         3 Treatment A          3         13.2             22.7 
 6        12         3 Treatment A          3.5       13.2             26.5 
 7        21         1 Treatment B          1          3.4             29.4 
 8        22         1 Treatment B          1.2        3.4             35.3 
 9        21         2 Treatment B          0.5        3.4             14.7 
10        22         2 Treatment B          0.4        3.4             11.8 
11        21         3 Treatment B          0.2        3.4              5.88
12        22         3 Treatment B          0.1        3.4              2.94

This same principal applies to all base functions (e.g. mean, max, min, etc) or you can write custom functions and use them with mutate and summarise:

```{r}
df_data %>% 
  group_by(Treatment_Group) %>% 
  summarise(mean_variable = mean(Variable, na.rm=TRUE)) #use na.rm=TRUE if you want the mean function to ignore NA values
```
# A tibble: 2 × 2
  Treatment_Group mean_variable
  <chr>                   <dbl>
1 Treatment A             2.2  
2 Treatment B             0.567

4.3 Summarising multiple columns

As we mentioned previously, we can use tidyselect with across() to efficiently iterate through our columns.

Let’s say we want to find the mean and sd of all columns that contain numbers. We could write it out in long form:

```{r}
penguins %>% 
  group_by(species) %>% 
  summarise(bill_length_mean = mean(bill_length_mm, na.rm = TRUE),
            bill_length_sd = sd(bill_length_mm, na.rm = TRUE),
            bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE),
            bill_depth_sd = sd(bill_depth_mm, na.rm = TRUE)
            #etc....
            )
```
# A tibble: 3 × 5
  species   bill_length_mean bill_length_sd bill_depth_mean bill_depth_sd
  <fct>                <dbl>          <dbl>           <dbl>         <dbl>
1 Adelie                38.8           2.66            18.3         1.22 
2 Chinstrap             48.8           3.34            18.4         1.14 
3 Gentoo                47.5           3.08            15.0         0.981

Or, we can use across():

```{r}
penguins %>% 
  group_by(species) %>% 
  summarise(
    across(.cols = where(is.numeric),
           .fns = list(mean = ~mean(.x, na.rm = TRUE),
                       sd = ~sd(.x, na.rm=TRUE)),
           .names = '{.col}_{.fn}') #see ?across for details on how this works
  )
```
# A tibble: 3 × 11
  species   bill_length_mm_mean bill_length_mm_sd bill_depth_mm_mean
  <fct>                   <dbl>             <dbl>              <dbl>
1 Adelie                   38.8              2.66               18.3
2 Chinstrap                48.8              3.34               18.4
3 Gentoo                   47.5              3.08               15.0
# ℹ 7 more variables: bill_depth_mm_sd <dbl>, flipper_length_mm_mean <dbl>,
#   flipper_length_mm_sd <dbl>, body_mass_g_mean <dbl>, body_mass_g_sd <dbl>,
#   year_mean <dbl>, year_sd <dbl>

4.4 Task:

  • Choose 1 variable from you data and calculate the mean and sd for each treatment (or another relevant group)
  • Mutate 2 new columns to your summary table called ymin and ymax, containing the values calculated from mean - SD and mean + SD, respectively.
  • Advanced: re-calculate your summary table but also include the number of samples per group.