```{r}
sum(df_data$Variable)
```
[1] 16.6
Often we are interested in comparing treatment means. Taking the mean of variables in our data is way of summarising it, along with various other ways we can summarise our data in R.
If we want the summary of different treatments, we are actually wanting to group our data by these treatments and then calculate something from them.
As we will see, one of the most useful tools is the group_by()
function. This works on the ‘split-apply-combine’ principal where you split the data up into smaller tables, do something to those smaller tables, then put all the data back together again.
Let’s say we wanted to know the sum of the Variable column, we could do:
```{r}
sum(df_data$Variable)
```
[1] 16.6
As we learnt in the previous chapter, we could store this in a new column using mutate()
:
```{r}
%>% mutate(sum_total = sum(Variable))
df_data ```
Animal_ID Timepoint Treatment_Group Variable sum_total
1 11 1 Treatment A 1.0 16.6
2 12 1 Treatment A 1.2 16.6
3 21 1 Treatment B 1.0 16.6
4 22 1 Treatment B 1.2 16.6
5 11 2 Treatment A 2.0 16.6
6 12 2 Treatment A 2.5 16.6
7 21 2 Treatment B 0.5 16.6
8 22 2 Treatment B 0.4 16.6
9 11 3 Treatment A 3.0 16.6
10 12 3 Treatment A 3.5 16.6
11 21 3 Treatment B 0.2 16.6
12 22 3 Treatment B 0.1 16.6
But let’s say we wanted to know the sum of the variable column for each Treatment (not total). If we ‘group_by’ Treatment_Group first, then behind the scenes the sum function will first be given a subset of the dataframe containing only the Treatment A data, then it will be given all of the Treatment B data. To do it manually we would have to do do:
```{r}
<- df_data %>% filter(Treatment_Group == 'Treatment A')
df_treatmentA
sum(df_treatmentA$Variable)
```
[1] 13.2
Then we would then have to repeat this for all of the treatments, which is not ideal. So, one might consider using a function to help, like:
```{r}
# 1. get all unique names in Treatment_Group
<- unique(df_data$Treatment_Group)
treatments
treatments```
[1] "Treatment A" "Treatment B"
```{r}
#2. Write a function
<- function(x){
f_groupsums <- df_data %>% filter(Treatment_Group == x)
df_temp
return(sum(df_temp$Variable))
}
#3. use apply to iterate through the treatments
lapply(treatments, f_groupsums)
```
[[1]]
[1] 13.2
[[2]]
[1] 3.4
This takes a lot of code,is probably a little scary if you’re not used to making functions, and returns a list of numbers that you then need to process further to get the treatment names for them.
Instead, we use group_by
with the summarise
function instead. The summarise function is like mutate, except you use functions inside it that summarise the data, returning single values per group:
```{r}
%>%
df_data group_by(Treatment_Group) %>%
summarise(sum_groups = sum(Variable))
```
# A tibble: 2 × 2
Treatment_Group sum_groups
<chr> <dbl>
1 Treatment A 13.2
2 Treatment B 3.4
This gives the same numbers, only in a nice data frame.
It’s also possible to group by multiple columns:
```{r}
%>%
df_data group_by(Treatment_Group, Timepoint) %>%
summarise(sum_groups = sum(Variable))
```
`summarise()` has grouped output by 'Treatment_Group'. You can override using
the `.groups` argument.
# A tibble: 6 × 3
# Groups: Treatment_Group [2]
Treatment_Group Timepoint sum_groups
<chr> <int> <dbl>
1 Treatment A 1 2.2
2 Treatment A 2 4.5
3 Treatment A 3 6.5
4 Treatment B 1 2.2
5 Treatment B 2 0.9
6 Treatment B 3 0.3
You could actually still use mutate
if you wanted to store this in a column of the original data frame (useful for some calculations, such as % within a group)
```{r}
%>%
df_data group_by(Treatment_Group) %>%
mutate(sum_groups = sum(Variable),
perc_of_group_sum = Variable/sum_groups*100) %>%
arrange(Treatment_Group) #just to make it nicer to read
```
# A tibble: 12 × 6
# Groups: Treatment_Group [2]
Animal_ID Timepoint Treatment_Group Variable sum_groups perc_of_group_sum
<dbl> <int> <chr> <dbl> <dbl> <dbl>
1 11 1 Treatment A 1 13.2 7.58
2 12 1 Treatment A 1.2 13.2 9.09
3 11 2 Treatment A 2 13.2 15.2
4 12 2 Treatment A 2.5 13.2 18.9
5 11 3 Treatment A 3 13.2 22.7
6 12 3 Treatment A 3.5 13.2 26.5
7 21 1 Treatment B 1 3.4 29.4
8 22 1 Treatment B 1.2 3.4 35.3
9 21 2 Treatment B 0.5 3.4 14.7
10 22 2 Treatment B 0.4 3.4 11.8
11 21 3 Treatment B 0.2 3.4 5.88
12 22 3 Treatment B 0.1 3.4 2.94
This same principal applies to all base functions (e.g. mean, max, min, etc) or you can write custom functions and use them with mutate and summarise:
```{r}
%>%
df_data group_by(Treatment_Group) %>%
summarise(mean_variable = mean(Variable, na.rm=TRUE)) #use na.rm=TRUE if you want the mean function to ignore NA values
```
# A tibble: 2 × 2
Treatment_Group mean_variable
<chr> <dbl>
1 Treatment A 2.2
2 Treatment B 0.567
As we mentioned previously, we can use tidyselect
with across()
to efficiently iterate through our columns.
Let’s say we want to find the mean and sd of all columns that contain numbers. We could write it out in long form:
```{r}
%>%
penguins group_by(species) %>%
summarise(bill_length_mean = mean(bill_length_mm, na.rm = TRUE),
bill_length_sd = sd(bill_length_mm, na.rm = TRUE),
bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE),
bill_depth_sd = sd(bill_depth_mm, na.rm = TRUE)
#etc....
)```
# A tibble: 3 × 5
species bill_length_mean bill_length_sd bill_depth_mean bill_depth_sd
<fct> <dbl> <dbl> <dbl> <dbl>
1 Adelie 38.8 2.66 18.3 1.22
2 Chinstrap 48.8 3.34 18.4 1.14
3 Gentoo 47.5 3.08 15.0 0.981
Or, we can use across()
:
```{r}
%>%
penguins group_by(species) %>%
summarise(
across(.cols = where(is.numeric),
.fns = list(mean = ~mean(.x, na.rm = TRUE),
sd = ~sd(.x, na.rm=TRUE)),
.names = '{.col}_{.fn}') #see ?across for details on how this works
)```
# A tibble: 3 × 11
species bill_length_mm_mean bill_length_mm_sd bill_depth_mm_mean
<fct> <dbl> <dbl> <dbl>
1 Adelie 38.8 2.66 18.3
2 Chinstrap 48.8 3.34 18.4
3 Gentoo 47.5 3.08 15.0
# ℹ 7 more variables: bill_depth_mm_sd <dbl>, flipper_length_mm_mean <dbl>,
# flipper_length_mm_sd <dbl>, body_mass_g_mean <dbl>, body_mass_g_sd <dbl>,
# year_mean <dbl>, year_sd <dbl>
ymin
and ymax
, containing the values calculated from mean - SD
and mean + SD
, respectively.