In this post, we will learn how to carry out descriptive statistics in R. After that, we will learn how to create a nice LaTeX table and how to save the summary statistics to a .csv file.

## Why Descriptive Statistics?

Carrying out descriptive statistics, also known as summary statistics, is a very good starting point for most statistical analyses. It is, furthermore, a very good way to summarize and communicate information about the data we have collected.

There are, of course, plenty of useful R packages for data manipulation and summary statistics. In this post, we will mainly work with base R functions and the psych and Tidyverse packages. Tidyverse comes with a bunch of handy packages that you can use to, for example, add an empty column to the dataframe.

## Installing the R-packages

As mentioned in the previous section, we are going to work with some R packages in this post. If they’re not installed, the following commands will install them.

```
list.of.packages <- c("tidyverse", "psych", "knitr", "kableExtra")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
```

In this summary statistics in R tutorial, we will start by calculating some demographic statistics. After that, we continue with the most common ways to report the central tendency (i.e., the mean and the median), and we will also calculate the harmonic, the geometric, and the trimmed mean. Finally, we will calculate some measures of variability.

## Descriptive statistics using R

In this section, we will start by calculating some demographic statistics for our data. Furthermore, we will calculate the number of missing values by group, the % of missing values by group, the mean age, age range, and such.

### Import Data

First, however, we are going to read an xlsx file using R (it can be downloaded here):

```
library(readxl)
play_df <- read_excel("../SimData/play_data.xlsx")
```

Note, data can be stored in a range of different formats. For instance, we can also read a Stata (.dta) file and an SPSS (.sav) file with R.

Before calculating some summary statistics, we can have a look at the first six rows of our data by typing `head(play_df)`.
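As a small aside, `head()` returns six rows by default, and its `n` argument controls how many rows are shown. The sketch below illustrates this with a made-up data frame (`toy_df` is a hypothetical stand-in, not the data used in this post):

```r
# Hypothetical toy data; the column names only mimic play_df
toy_df <- data.frame(Subject = 1:10, Age = 21:30)

default_rows <- head(toy_df)         # six rows by default
five_rows    <- head(toy_df, n = 5)  # ask for five rows explicitly

nrow(default_rows)
nrow(five_rows)
```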

Second, as we can see, the Gender column is coded as 0 and 1, and we are going to recode these values to “Male” and “Female” using the recode function. If we want, or need to, we can also remove a column. Alternatively, when calculating the summary statistics, we can select the columns we want to use.

```
library(tidyverse)
# Recode the numeric Gender codes (0/1) to labels
play_df$Gender <- play_df$Gender %>%
  recode("0" = "Male",
         "1" = "Female")
```
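If we prefer base R, a similar recoding can be sketched with `factor()`. This is only an illustration under assumptions: `toy_df` is a hypothetical data frame, and the 0 = Male, 1 = Female mapping is taken from the text above.

```r
# Hypothetical toy data with the same 0/1 coding as described above
toy_df <- data.frame(Gender = c(0, 1, 1, 0, NA))

# Map the codes to labels; missing values stay missing
toy_df$Gender <- factor(toy_df$Gender,
                        levels = c(0, 1),
                        labels = c("Male", "Female"))

# Count each category, including any missing values
table(toy_df$Gender, useNA = "ifany")
```

Unlike `recode`, this returns a factor rather than a character vector, which can be convenient for grouping and plotting.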

## Calculating Demographic Information

In this section, we are going to summarize the information about the participants of the study. That is, we are going to calculate the mean and standard deviation of age, and the age range. Here, we use the Tidyverse package again, together with the summarise function:

```
library(tidyverse)
play_df %>%
  summarise(sd = sd(Age, na.rm = T),
            mean = mean(Age, na.rm = T),
            range = paste(min(Age, na.rm = T), "-", max(Age, na.rm = T)),
            n = sum(!is.na(Age)))
```

In the code chunk above, we calculated some summary statistics about the sample. Note, we used na.rm = T (shorthand for TRUE) because there might be missing values in the variable Age. To create the age range variable, we take the min and the max of the variable Age and combine them with the paste function. Finally, n counts the non-missing values of Age.
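To make the effect of `na.rm` concrete, here is a minimal sketch using a made-up vector of ages (`age` is hypothetical and not taken from the data in this post):

```r
# Hypothetical ages with one missing value
age <- c(23, 31, NA, 27)

mean_with_na <- mean(age)                 # NA, because of the missing value
mean_no_na   <- mean(age, na.rm = TRUE)   # missing value is dropped first

# Build the range string the same way as in the summarise call
age_range <- paste(min(age, na.rm = TRUE), "-", max(age, na.rm = TRUE))

# Number of non-missing observations
n_valid <- sum(!is.na(age))
```

Here, `mean_no_na` is 27, `age_range` is the string "23 - 31", and `n_valid` is 3.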

### Calculate mean age, age range, standard deviation by Group

Now, we are going to group the data and calculate the mean, standard deviation, age range, and the number of participants in each group. In the code chunk below, all we have done is add the group_by function and pass Gender to it.

```
library(tidyverse)
play_df %>% group_by(Gender) %>%
  summarise(sd = sd(Age, na.rm = T),
            mean = mean(Age, na.rm = T),
            range = paste(min(Age, na.rm = T), "-", max(Age, na.rm = T)),
            n = sum(!is.na(Age)))
```
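The number and percentage of missing values by group can be obtained in the same way. The following is a sketch under assumptions: `toy_df` is a hypothetical data frame, and the column names `n_missing` and `pct_missing` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for play_df
toy_df <- data.frame(
  Gender = c("Male", "Male", "Female", "Female", "Female", "Female"),
  Age    = c(23, NA, 31, 27, NA, 35)
)

missing_by_gender <- toy_df %>%
  group_by(Gender) %>%
  summarise(n_missing   = sum(is.na(Age)),          # count of missing ages
            pct_missing = 100 * mean(is.na(Age)))   # share of missing ages in %

missing_by_gender
```

This works because `is.na` returns TRUE/FALSE, which R treats as 1/0 when summing or averaging.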

## Central Tendency in R

In this part of the R descriptive statistics tutorial, we will focus on the measures of central tendency. The central tendency is something we calculate because we often want to know about the “*average*” or “*middle*” of our data. The two most commonly used measures of central tendency, the mean and the median, can easily be obtained using R.

### Calculate the Mean in R

In the previous section, we calculated summary statistics (e.g., mean, standard deviation, range) in one go. However, if we are only interested in one summary statistic, we can calculate it separately. First, if we only want to calculate the mean of one of our variables, we can use the mean function. Note, here we are interested in calculating the summary statistics for the dependent variable “RT”:

```
mean(play_df$RT, na.rm = T)
# Output: [1] 0.4963685
```

### Calculate the Mean by One Group

Second, when we use Tidyverse group_by and summarise functions, we just add the mean function. Note, this is very similar to what we did previously.

```
play_df %>% group_by(Gender) %>%
  summarise(RT = mean(RT, na.rm = T))
```

### Calculate the Mean by Two Groups

Third, if we want to calculate the mean by two groups we add a group to the group_by function:

```
play_df %>% group_by(Gender, Day) %>%
  summarise(RT = mean(RT, na.rm = T))
```
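When grouping by more than one variable, recent versions of dplyr (1.0.0 and later) keep the result grouped by the first variable and print a message about it. A minimal sketch of how the `.groups` argument can be used to return an ungrouped result, using a hypothetical `toy_df`:

```r
library(dplyr)

# Hypothetical toy data standing in for play_df
toy_df <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Day    = c(1, 1, 2, 1, 1, 2),
  RT     = c(0.40, 0.50, 0.60, 0.50, NA, 0.70)
)

by_group <- toy_df %>%
  group_by(Gender, Day) %>%
  summarise(RT = mean(RT, na.rm = TRUE),
            .groups = "drop")  # return an ungrouped tibble

by_group
```

Dropping the grouping is a design choice: it avoids surprises if the summary table is later passed to functions that behave differently on grouped data.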

### Geometric, Harmonic, & Trimmed Mean in R

In this section, we are going to calculate the geometric, harmonic, and trimmed mean in R. The geometric mean is often a better summary for ratios or multiplicatively skewed data, and the harmonic mean for rates. In R, these two descriptive statistics can be obtained using the summarise function together with the functions *geometric.mean* and *harmonic.mean* (from the psych package).

#### Geometric Mean in R

In this section, we are going to calculate the geometric mean in R. One very nice thing when working with summarise is that we can use any function we need, including functions from other packages. Thus, in the next code chunk, we use the geometric.mean function from the psych package to calculate the geometric mean.

```
play_df %>% group_by(Gender, Day) %>%
  summarise("Geometric Mean" = psych::geometric.mean(RT, na.rm = T))
```
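To see what the geometric mean is doing, it can be sketched in base R as the exponential of the mean of the logged values (this only makes sense for positive numbers). The vector `x` below is made up for illustration:

```r
# Hypothetical positive values
x <- c(1, 2, 4)

# Geometric mean: exp of the mean of the logs,
# which equals the n-th root of the product of the values
geo_mean <- exp(mean(log(x)))

geo_mean
```

For these three values, the result is the cube root of 1 * 2 * 4, that is, approximately 2.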

#### Harmonic Mean in R

In this R summary statistics example, we use summarise together with harmonic.mean to get the harmonic mean in R:

```
play_df %>% group_by(Gender, Day) %>%
  summarise("Harmonic Mean" = psych::harmonic.mean(RT, na.rm = T))
```
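Similarly, the harmonic mean can be sketched by hand in base R as the reciprocal of the mean of the reciprocals. Again, `x` is a made-up vector, and the values must be non-zero:

```r
# Hypothetical positive values
x <- c(1, 2, 4)

# Harmonic mean: reciprocal of the mean of the reciprocals
har_mean <- 1 / mean(1 / x)

har_mean
```

Here the reciprocals are 1, 0.5, and 0.25, so the harmonic mean is 3 / 1.75, or 12/7.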

#### Trimmed Mean in R

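In this section, we are going to calculate the trimmed mean. This can, actually, be done using the mean function. All we do is set the trim argument, here trim = 0.2, which drops the lowest and highest 20% of the observations before the mean is calculated: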

```
play_df %>% group_by(Gender, Day) %>%
  summarise("Trimmed Mean" = mean(RT, trim = 0.2, na.rm = T))
```
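A small sketch of what trimming does, using a made-up vector with one extreme value (`x` is hypothetical):

```r
# Hypothetical values with one clear outlier
x <- c(1, 2, 3, 4, 100)

plain_mean   <- mean(x)              # pulled upwards by the outlier
trimmed_mean <- mean(x, trim = 0.2)  # drops 1 value from each end (20% of 5)

plain_mean
trimmed_mean
```

With five observations and trim = 0.2, one value is removed from each end, so the trimmed mean is the mean of 2, 3, and 4 (that is, 3), whereas the ordinary mean is 22.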

### Get the Median in R

In this section, we are going to calculate the median using R. It’s as easy as calculating the mean: we just use the function called median.

`median(play_df$RT, na.rm = T)`

Of course, we often want the median calculated by group as well (e.g., by a categorical variable). To calculate the median by group, we just use group_by and summarise again:

```
play_df %>% group_by(Gender, Day) %>%
  summarise(Median = median(RT, na.rm = T))
```

Now, most of the time we want to get all the measures of central tendency (or all summary statistics we calculate in R) in the same output. We can, of course, do this using summarise: in the section “All Descriptive Stats with dplyr” later in this post, the standard deviation (*sd*), mean, median, harmonic mean, geometric mean, and trimmed mean are all in the same output.

## Measures of Variability in R

Central tendency (e.g., the mean & median) is not the only type of descriptive statistic that we want to calculate. Most of the time, we also want to have a look at a measure of the variability of our data.

### Standard deviation in R

In this section, we are going to calculate the standard deviation using R. We’ve, actually, already done this using the function sd.

`sd(play_df$RT, na.rm = T)`

If we want to calculate the standard deviation by groups this is, again, doable using the group_by and summarise functions.

```
play_df %>% group_by(Gender, Day) %>%
  summarise("SD" = sd(RT, na.rm = T))
```

### Interquartile Range in R

In this descriptive statistics in R example, we will use *IQR* to calculate the interquartile range in R.

`IQR(play_df$RT, na.rm = T)`
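The interquartile range is the difference between the 75th and the 25th percentile. A minimal sketch with a made-up vector (`x` is hypothetical) shows that `IQR` and `quantile` agree when the default quantile type is used:

```r
# Hypothetical values
x <- 1:9

# 25th and 75th percentiles (default quantile type)
q <- quantile(x, probs = c(0.25, 0.75))

# IQR by hand: upper quartile minus lower quartile
iqr_by_hand <- unname(q[2] - q[1])

iqr_by_hand
IQR(x)
```

For the values 1 to 9, the quartiles are 3 and 7, so both approaches give 4.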

### Quantiles in R

We can also calculate quantiles. Here, we only do this by groups and we have to create a custom function (see this post for the original code adapted in the example below) to do this together with summarise_at.

```
p <- c(0.25, 0.5, 0.75)
# One partially applied quantile function per probability, named by p
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = p)
play_df %>% group_by(Gender, Day) %>%
  summarise_at(vars(RT), lst(!!!p_funs))
```
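If we would rather avoid the custom function, quantiles by group can also be sketched in base R with `aggregate`. This is an alternative illustration, not the approach used above, and `toy_df` is a hypothetical data frame:

```r
# Hypothetical toy data standing in for play_df
toy_df <- data.frame(
  Gender = rep(c("Female", "Male"), each = 5),
  RT     = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# quantile() returns three values per group; aggregate() stores them
# in a matrix column named RT, with one row per group
q_by_gender <- aggregate(RT ~ Gender, data = toy_df,
                         FUN = quantile, probs = c(0.25, 0.5, 0.75))

q_by_gender
```

Note that the formula interface of `aggregate` drops rows with missing values by default, so `na.rm` is not needed here.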

### Calculate Variance in R

In this last section on measures of variability, we are going to calculate the variance, which is easy to do in R. In the summary statistics in R example below, we will use the *var* function.

`var(play_df$RT, na.rm = T)`

Now, we are going to calculate the descriptive statistic variance by groups.

```
play_df %>% group_by(Gender, Day) %>%
  summarise(Variance = var(RT, na.rm = T))
```
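As a quick check of how the variance relates to the standard deviation, the sketch below computes the sample variance by hand (dividing by n - 1, as `var` does) for a made-up vector `x`:

```r
# Hypothetical values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance by hand: sum of squared deviations divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (length(x) - 1)

var_by_hand
var(x)    # same value
sd(x)^2   # the variance is the squared standard deviation
```

For these values, the mean is 5 and the sum of squared deviations is 32, so the variance is 32/7.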

After we have calculated the descriptive statistics, we can visualize the data as well. Another step in the data analysis pipeline may be dummy coding. A more recent post covers how to create dummy variables in R.

## Summary Statistics with R using psych

In this section, we will use the R package psych to calculate most of the descriptive statistics we calculated above. Here, we will use the function describeBy to calculate the standard deviation, median, mean, interquartile range, trimmed mean, range, skewness, kurtosis, standard error, and quantiles.

```
library(psych)
with(play_df, describeBy(RT, group = list(Gender, Day),
                         IQR = T, quant = c(0.25, 0.50, 0.75)))
```

## All Descriptive Stats with dplyr

In this section, we are going to calculate the summary statistics above using dplyr, group_by, and summarise. Furthermore, we save the resulting table to an object and create a LaTeX table from it using the kable function from the knitr package.

```
tbl <- play_df %>% group_by(Gender, Day) %>%
  summarise(SD = sd(RT, na.rm = T),
            Mean = mean(RT, na.rm = T),
            Median = median(RT, na.rm = T),
            "Trimmed Mean" = mean(RT, trim = 0.2, na.rm = T),
            "Geometric Mean" = psych::geometric.mean(RT, na.rm = T),
            "Harmonic Mean" = psych::harmonic.mean(RT, na.rm = T),
            IQR = IQR(RT, na.rm = T),
            "%25 Q" = quantile(RT, .25, na.rm = T),
            "%50 Q" = quantile(RT, .5, na.rm = T),
            "%75 Q" = quantile(RT, .75, na.rm = T))
```

Now, we are ready to use kable to create a LaTeX table. In the code chunk below, we load kableExtra and knitr. We use kable to create the LaTeX table and kable_styling to scale the table down so it fits a PDF created with RMarkdown.

```
library(kableExtra)
library(knitr)
kable(tbl, format = "latex",
      digits = 2, booktabs = TRUE) %>%
  kable_styling(latex_options = "scale_down")
```

## Saving Descriptive Statistics to a CSV File

If we want to save our descriptive statistics, calculated in R, we can use the write_excel_csv function from readr (part of the Tidyverse). In the example below, we are saving the R tibble *tbl* created earlier to a .csv file:

```
write_excel_csv(tbl, "descriptive_stats.csv")
```
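A base R alternative is `write.csv`. The sketch below writes a small made-up summary table to a temporary file and reads it back to check the round trip (`toy_tbl` and its values are hypothetical, not results from this post):

```r
# Hypothetical summary table standing in for tbl
toy_tbl <- data.frame(Gender = c("Female", "Male"),
                      Mean   = c(0.48, 0.51))

# Write to a temporary file; row.names = FALSE avoids an extra index column
out_file <- tempfile(fileext = ".csv")
write.csv(toy_tbl, out_file, row.names = FALSE)

# Read the file back in to verify what was saved
read_back <- read.csv(out_file)
read_back
```

One difference worth knowing: `write_excel_csv` adds a UTF-8 byte order mark so that Excel detects the encoding, whereas `write.csv` does not.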

The next step in the data analysis pipeline would be to visualize the data to further explore any possible relationships. See the scatter plot in R with ggplot2 tutorial for more information on data visualization in R.

## Conclusion: Descriptive Statistics in R

In this post, we have learned how to describe our data. More specifically, we have learned how to calculate measures of central tendency (mean, median, etc.), variability (standard deviation, interquartile range, variance), and more. Furthermore, we have calculated summary statistics using R and saved them as a LaTeX table and a CSV file.