How to randomly select rows in R? Learn the sample() and slice_sample() functions to take random samples from dataframes. Explore practical examples and synthetic datasets for hands-on learning. Enhance your data analysis skills and unlock new possibilities with random row selection in R.

The post How to Randomly Select Rows in R – Sample from Dataframe appeared first on Erik Marsja.

In this tutorial, you will learn how to randomly select rows in R, an essential skill when working with large datasets or conducting statistical analyses. The ability to randomly sample rows enables us to extract representative subgroups or explore data in a randomized order. For instance, if you are building a random forest regression model, you need to create training and test datasets; randomly selecting rows can ensure that both datasets have a diverse data representation, avoiding biases and producing robust results.

To achieve this task in R, we will utilize the power of both base R and the tidyverse packages. These packages provide efficient and flexible functions for random sampling from a dataframe.

Now, let us dive into the tutorial and explore the step-by-step process of randomly sampling rows in R. We will begin by examining the built-in functions and techniques available in base R. Next, we will explore the rich functionality provided by the tidyverse packages, such as dplyr and tidyr, which offer elegant and intuitive methods for data manipulation and sampling.

We follow clear examples and explanations throughout the tutorial to ensure your understanding. So, let us get started and unlock the power of random sampling in R!

In the first section, we will have a look at the widely used `sample()` function. This function allows us to randomly select rows by specifying the desired number or proportion of rows to be sampled. Next, we will have a look at the `slice_sample()` function from the dplyr package. This function also provides a convenient way to randomly select rows by specifying the number or the proportion of rows to be sampled. To demonstrate the usage of these functions, we will then generate a synthetic dataset that simulates data related to hearing and perception in a psychology study. In the following sections, we will present step-by-step examples of how to randomly select a specific number of rows or a proportion of rows from the dataset using both the `sample()` function and the `slice_sample()` function.

To take a random sample from a dataframe in R, we have a range of powerful functions and packages at our disposal. We can leverage the capabilities of base R and the popular tidyverse packages to accomplish this task seamlessly.

The `sample()` function is our go-to random sampling option in base R. This versatile function allows us to extract a specified number of random rows from a dataframe. By specifying the desired sample size, we can ensure that our subset represents the original data.

Alternatively, if you prefer a more expressive and intuitive syntax, the tidyverse packages, such as dplyr and tidyr, provide convenient functions for sampling. With dplyr, we can use the `slice_sample()` function to randomly select rows from a dataframe based on a given fraction or number. This allows for easy sampling while preserving the overall structure of the dataset.

Furthermore, if you require more advanced sampling techniques like stratified sampling, you can combine dplyr's `group_by()` and `sample_n()` functions. These functions enable us to stratify our dataframe based on specific variables and obtain random samples from each stratum.
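As a brief sketch of this idea (the dataframe here is hypothetical, and note that `slice_sample()` also works after `group_by()` as the modern successor to `sample_n()`):

```r
library(dplyr)

# Hypothetical example data: two strata ("A" and "B") with 50 rows each
df <- data.frame(
  group = rep(c("A", "B"), each = 50),
  score = rnorm(100)
)

# Draw 5 random rows from each stratum
stratified <- df %>%
  group_by(group) %>%
  slice_sample(n = 5) %>%
  ungroup()

table(stratified$group)  # 5 rows per group
```

Grouping first means the sample size applies within each stratum rather than to the dataframe as a whole.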

In addition to its capabilities for random sampling, the dplyr package offers a wide array of functions that simplify various data manipulation tasks. With dplyr, we can perform operations such as selecting columns, removing columns, renaming columns, and adding new columns to a dataframe, all concisely and intuitively.

By combining the power of base R and the tidyverse packages, we can confidently tackle any sampling requirement in our data analysis. In the upcoming sections of this tutorial, we will delve into each method in detail, providing practical examples and step-by-step instructions.

R’s `sample()` function is a powerful tool for randomly selecting elements from a given vector or dataframe.

- The first argument, `x`, represents the vector or dataframe we want to sample from.
- The `size` argument specifies the number of elements we want to sample from `x`.
- By default, the `replace` argument is set to `FALSE`, meaning sampling is done without replacement. This ensures that each selected element is unique. However, setting `replace = TRUE` allows for sampling with replacement, so the same element can be selected multiple times.
- The optional `prob` argument allows us to assign probabilities to each element in `x`. This enables us to perform weighted sampling, where elements with higher probabilities are more likely to be selected.

Utilizing the `sample()` function with these arguments, we can effortlessly generate random samples from vectors or dataframes in R. This flexibility and simplicity make it a go-to choice for many in various data analysis scenarios.
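For example, a minimal sketch of weighted sampling with the `prob` argument (the weights here are illustrative; they need not sum to 1, as `sample()` normalizes them internally):

```r
set.seed(42)  # for reproducibility

colors <- c("red", "green", "blue")

# "red" is five times as likely to be drawn as each of the other colors
draws <- sample(colors, size = 1000, replace = TRUE, prob = c(5, 1, 1))

table(draws)  # "red" should dominate the counts
```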

The `slice_sample()` function from the dplyr package in R allows us to randomly sample rows from a dataframe based on specified criteria.

- The `.data` argument represents the input dataframe from which we want to sample rows.
- The `...` argument allows for additional conditions or expressions to be applied during the sampling process.
- The `n` argument specifies the exact number of rows we want to sample from the dataframe.
- Alternatively, the `prop` argument allows us to specify the proportion or fraction of rows to sample from the dataframe.
- The `by` argument enables us to group the dataframe by one or more variables before sampling. This is useful when we want to sample within specific groups or strata.
- The `weight_by` argument allows us to assign sampling weights to each row in the dataframe. This can be utilized when performing weighted sampling, where specific rows are more likely to be selected.
- By default, the `replace` argument is set to `FALSE`, meaning sampling is done without replacement. However, setting `replace = TRUE` allows for sampling with replacement, so the same row can be selected multiple times.

By leveraging the `slice_sample()` function with some of these arguments, we can effortlessly generate random samples of rows from dataframes in R while considering various conditions and sampling strategies. This versatility, combined with the rich functionality of dplyr, empowers us to perform complex data manipulations and sampling operations with ease.
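A small sketch combining several of these arguments (the dataframe is hypothetical, and the `by` argument requires dplyr 1.1.0 or later):

```r
library(dplyr)
set.seed(1)

# Hypothetical data: two groups, each with one heavily weighted row
df <- data.frame(
  id  = 1:6,
  grp = rep(c("a", "b"), each = 3),
  w   = c(1, 1, 10, 1, 1, 10)
)

# Two weighted draws per group, without replacement
res <- df %>%
  slice_sample(n = 2, by = grp, weight_by = w)

print(res)  # 4 rows: 2 from group "a", 2 from group "b"
```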

Here we generate a synthetic dataset in R that we can use to practice randomly selecting rows:

```
library(dplyr)

groups <- c("Group A", "Group B", "Group C")

data <- data.frame(
  group = rep(groups, each = 100)
) %>%
  mutate(
    perception = ifelse(
      group %in% c("Group A", "Group B") | group == "Group C",
      rnorm(300, mean = 8, sd = 1),
      NA
    )
  )
```


In the code chunk above, we use the dplyr package in R for data manipulation tasks. We start by loading the dplyr library, allowing us to access its functions. Moreover, we define a vector called `groups` that contains three group labels: “Group A”, “Group B”, and “Group C”. Next, we create a dataframe named `data` using the `data.frame()` function. Within the dataframe, we create a column named `group` by repeating the elements of `groups` 100 times each, resulting in 300 rows.

To assign values to the `perception` column, we use the `mutate()` function from dplyr. Using the `ifelse()` function, we apply conditions with `%in%` in R and the or operator (`|`). If the `group` is either “Group A” or “Group B”, or if it is equal to “Group C”, we generate 300 random normal values with a mean of 8 and a standard deviation of 1. Otherwise, we assign the value `NA` to the `perception` column for the remaining cases. In the next section, we will use `sample()` to take a random sample from the synthetic data.

Here is how to randomly select rows in R using the `sample()` function:

```
# Randomly select 50 rows from the dataframe
random_sample <- data[sample(nrow(data), 50), ]
```


In the code chunk above, we used the `sample()` function to randomly select rows from the dataframe. Here is a breakdown of the code:

- `nrow(data)` returns the number of rows in the `data` dataframe.
- `sample(nrow(data), 50)` generates a random sample of 50 row indices from 1 to the number of rows in `data`.
- Finally, we use these randomly selected row indices to subset the `data` dataframe and store the result in the `random_sample` variable.

After running the code, the `random_sample` dataframe will contain 50 randomly selected rows from the original `data` dataframe. In the next subsection, we will randomly select 1/4 of the total rows of a dataframe in R using the same function.
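One caveat worth noting: `sample()` draws different rows on every run. If you need a reproducible sample, call `set.seed()` first:

```r
# The same seed produces the same random row indices
set.seed(123)
idx1 <- sample(300, 50)

set.seed(123)
idx2 <- sample(300, 50)

identical(idx1, idx2)  # TRUE
```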

Here is how to randomly select one-fourth (25%) of the data from the dataframe data:

```
# Randomly select 25% of the rows from the dataframe
random_sample <- data[sample(nrow(data), nrow(data) * 0.25), ]
```


In the code chunk above, we multiply the total number of rows in the `data` dataframe (`nrow(data)`) by `0.25` to select one-fourth of the data. Here is a breakdown of the code:

- `nrow(data) * 0.25` calculates the desired number of rows, which is 25% (1/4) of the total number of rows in `data`.
- `sample(nrow(data), nrow(data) * 0.25)` generates a random sample of row indices using the calculated number of rows.
- Finally, we use these randomly selected row indices to subset the `data` dataframe and store the result in the `random_sample` variable.

After running this updated code, the `random_sample` dataframe will contain approximately one-fourth of the rows, randomly selected from the original `data` dataframe. The next section will cover how to take a random sample of a dataframe using `slice_sample()`.

Here is how to use `slice_sample()` in R to take a random sample from a dataframe:

```
library(dplyr)
# Randomly select 50 rows from the dataframe
random_sample <- data %>% slice_sample(n = 50)
```


In the code chunk above, we used the `slice_sample()` function to randomly select a specific number of rows from the dataframe. Here’s an explanation of the code:

- `data %>%` uses the pipe operator (`%>%`) to pass the dataframe `data` as the input to the subsequent function.
- `slice_sample(n = 50)` performs the random sampling operation, where `n` is set to 50 to indicate the desired number of rows to be selected randomly.

After running this code, the `random_sample` dataframe will contain 50 randomly selected rows from the original `data` dataframe using the `slice_sample()` function.

Here is how we randomly select 1/4 of the total rows in R using `slice_sample()`:

```
library(dplyr)
# Randomly select 25% of the rows from the dataframe
random_sample <- data %>% slice_sample(prop = 0.25)
```


In this code, we utilize the `slice_sample()` function from dplyr to randomly select rows based on a proportion of the data. Here’s an explanation of the code:

- `data %>%` uses the pipe operator (`%>%`) to pass the dataframe `data` as the input to the subsequent function.
- `slice_sample(prop = 0.25)` performs the random sampling operation, where `prop` is set to 0.25 to indicate the desired proportion (25%) of the data to be selected randomly.

By executing this code, the `random_sample` dataframe will contain approximately one-fourth of the rows, randomly selected from the original `data` dataframe using the `slice_sample()` function.

In this post, you have learned how to randomly select rows in R using two powerful functions, `sample()` and `slice_sample()`.

First, we explored the `sample()` function, which allowed us to randomly select a specific number or proportion of rows from a dataframe. We saw how this function can be useful for selecting a random subset of data for analysis or modeling purposes.

Next, we introduced the `slice_sample()` function from the `dplyr` package, which provided an alternative approach to random row selection. With this function, we could easily specify the number or proportion of rows to be sampled, making it convenient for various sampling needs.

Throughout the post, we used a synthetic dataset based on psychology example data to demonstrate the functionality of these functions. By following the examples and explanations provided, you gained the ability to leverage these functions for your data analysis tasks.

If you found this post informative and valuable, I encourage you to share it on social media and with your colleagues. Spread the knowledge and help others discover the techniques for randomly selecting rows in R. Additionally, I would love to hear from you! Comment on the blog if you have any specific topics or techniques you want me to cover in future posts. Your feedback is valuable, so please let me know if you encounter any errors or have suggestions for improvement.

Here are some other blog posts you may find useful:

- Countif function in R with Base and dplyr
- Sum Across Columns in R – dplyr & base
- How to Convert a List to a Dataframe in R – dplyr
- How to Create a Matrix in R with Examples – empty, zeros
- How to Rename Factor Levels in R using levels() and dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr


In this blog post, we will learn how to extract p-values from regression models in R. We will explore the process of fitting a regression model, and then dive into the methods of extracting p-values using the `lm()` function. Additionally, we will demonstrate how to extract p-values from all predictors and leverage the `tidy()` function for a tidy output. Unlock the power of statistical inference with the ability to extract p-values from `lm()` in R.

The post Extract P-Values from lm() in R: Empower Your Data Analysis appeared first on Erik Marsja.

Are you searching for a way to extract p-values from the lm() function in R? Look no further! In this blog post, we will explore how to obtain p-values from linear regression models using the lm() function in R. Whether you are a researcher, a student, or a data enthusiast, understanding how to extract p-values can significantly enhance your statistical analysis skills and provide valuable insights from your data.

Linear regression is a powerful statistical technique commonly employed in various fields, including psychological research. It allows us to investigate the relationship between a dependent variable and one or more independent variables. By fitting a regression model, we can estimate the strength and direction of the relationship, assess the significance of the effects, and make predictions based on the observed data.

However, it is not sufficient to merely examine the estimated coefficients of a regression model; we need to assess their statistical significance. P-values play a crucial role in this process. They provide a measure of evidence against the null hypothesis, indicating whether the observed effects are statistically significant or simply due to chance. Extracting p-values allows us to evaluate the reliability of the estimated coefficients and determine their practical relevance.

In the next sections of this blog post, we will dive into the process of extracting p-values from lm() in R, guiding you through the necessary steps and illustrating the practical application of this valuable statistical information. Let’s unlock the power of p-values and enhance how we interpret linear regression results in R!

In this post, we will explore the process of extracting p-values from the `lm()` function in R. To begin, we will generate synthetic data to create a suitable regression model for our analysis. This synthetic data will include Pure Tone Average (PTA) and Inhibitory Control (IC) as predictors. Next, we will demonstrate how to fit a regression model using the `lm()` function in R, incorporating the generated synthetic data.

Once the regression model is established, we will delve into the post’s main focus: extracting p-values. We will introduce the `broom` package, which provides a convenient and tidy approach to extracting essential information from statistical models. Specifically, we will explore the usage of the `tidy()` function from `broom` to extract p-values from the regression model.

Moving forward, we will look at extracting p-values from all predictors in the regression model. This will allow you to assess the significance of each predictor and gain valuable insights into their impact on the outcome variable.

Additionally, we will demonstrate how to extract p-values specifically using the `tidy()` function, highlighting the ease and efficiency of this approach. This method will enable you to obtain the p-values in a tidy format, facilitating further analysis or integration into reports and theses.

By the end of this post, you will have a comprehensive understanding of how to extract p-values from the `lm()` function in R, enabling you to perform robust statistical analyses and draw meaningful conclusions from your regression models.

To generate the synthetic data used in this blog post, you may optionally utilize the `dplyr` package. Although not mandatory, `dplyr` provides a convenient and efficient way to manipulate and transform data in R. With its intuitive syntax and powerful functions, `dplyr` simplifies common data manipulation tasks such as filtering, selecting columns, creating new variables, and summarizing data.

For those interested in extracting p-values from the `lm()` function using the `broom` package, it needs to be installed. The `broom` package simplifies obtaining information from statistical models, including regression models, in a tidy format. It offers functions like `tidy()`, `glance()`, and `augment()` that allow you to extract coefficients, p-values, model fit measures, and more. Installing `broom` enables you to access and analyze model outputs easily.

Using `dplyr` alongside `broom` can further streamline your data analysis workflow. Combining these packages enables you to seamlessly generate synthetic data, apply regression models, and extract essential information like p-values in a tidy and efficient manner.

With its broad range of capabilities, `dplyr` serves as a valuable tool for data manipulation and transformation, facilitating tasks such as reshaping data from wide to long format, filtering observations based on specific conditions, or removing unnecessary columns. Incorporating `dplyr` into your R programming toolkit empowers you to efficiently manipulate, clean, and prepare your data for analysis, saving you time and effort throughout the data processing stage. Finally, you may also want to update R to the latest version.
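If either package is missing, one common way to install them from CRAN looks like this (a setup sketch; adjust the repository settings to your environment):

```r
# Install dplyr and broom from CRAN if they are not already available
pkgs <- c("dplyr", "broom")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```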

Here is some synthetic data we can use to practice extracting p-values from `lm()` in R:

```
# Load necessary libraries
library(dplyr)
# Set seed for reproducibility
set.seed(123)
# Generate synthetic data
n <- 100 # Number of observations
# Create PTA variable
PTA <- rnorm(n, mean = 30, sd = 5) # Pure tone average (mean = 30, sd = 5)
# Create SNR variable
SNR <- -7.1 * PTA + rnorm(n, mean = 0, sd = 3) # Signal-to-noise ratio (SNR)
# Create IC variable
IC <- rep(504, n) + rnorm(n, mean = 0, sd = 50) # Inhibitory control (IC)
# Standardize the variables using z-scores
PTA <- scale(PTA)
SNR <- scale(SNR)
IC <- scale(IC)
# Combine variables into a data frame
data <- data.frame(PTA, SNR, IC)
```


In the code chunk above, we first load the necessary `dplyr` library for data manipulation and analysis.

Next, we set the seed to ensure the reproducibility of the generated synthetic data. To generate the data, we specify the number of observations `n` as 100.

We create the `PTA` variable, representing the pure tone average, by randomly sampling from a normal distribution with a mean of 30 and a standard deviation of 5.

The `SNR` variable, which denotes the signal-to-noise ratio, is generated by multiplying -7.1 with the `PTA` variable and adding random noise sampled from a normal distribution with a mean of 0 and a standard deviation of 3.

For the `IC` variable, which stands for inhibitory control, we use the `rep()` function to repeat a constant value of 504 for all observations. To introduce variability, we add random noise sampled from a normal distribution with a mean of 0 and a standard deviation of 50.

To ensure comparability across variables, we standardize the data using z-scores. Standardization rescales each variable to have a mean of 0 and a standard deviation of 1.

Finally, we combine the standardized variables into a dataframe, `data`, ready for further analysis and exploration of the predictors’ significance in multiple regression models.

Here is how to perform a multiple regression analysis in R using the standardized variables from the `data` dataframe:

```
# Fit the model:
fit <- lm(SNR ~ PTA + IC, data = data)
# View the summary of the regression model
summary(fit)
```


In the code chunk above, we fit a multiple regression model using the `lm()` function in R. The formula `SNR ~ PTA + IC` specifies that the dependent variable `SNR` is regressed on the independent variables `PTA` and `IC`.

By assigning the result of the regression model to the `fit` object, we store the fitted model for further analysis and examination. To obtain a summary of the regression results, we use the `summary()` function on the `fit` object. This provides detailed information about the coefficients, standard errors, t-values, and p-values associated with each predictor in the model.

Furthermore, it is worth mentioning that in addition to multiple regression, R offers various regression models, such as probit regression, which can be used to model binary outcomes. These different regression techniques allow us to explore and analyze relationships between variables in various contexts and research questions.

Here is how we can extract the overall p-value from the `lm()` object in R:

```
# Capture the printed output of summary(fit)
summary_output <- capture.output(summary(fit))

# Extract the p-value from the captured output
p_value_line <- grep("p-value:", summary_output, value = TRUE)
p_value <- sub(".*p-value: (.*)$", "\\1", p_value_line)

# Check that a p-value was actually found
if (length(p_value) > 0 && !is.na(p_value[1])) {
  # Perform desired actions if the p-value is extracted
  print(paste("The p-value is", p_value[1]))
} else {
  # Perform alternative actions if the p-value is not found
  print("The p-value is not present in the summary output.")
}
```


In the code chunk above, we capture the printed output of the `summary(fit)` function using `capture.output()`. We extract the p-value from the captured output by searching for the line containing “p-value:” using `grep()` with `value = TRUE`. Using the `sub()` function, we extract the actual p-value from the selected line of text. Next, we check that a p-value was actually extracted before using it. If so, we print a message including the extracted p-value; if the p-value is not found in the summary output, we print a message indicating it is absent. This code allows us to automate extracting the p-value from the summary output. However, we may also be interested in a specific predictor and need to extract its p-value in R.
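As a hedged sketch (using the built-in `mtcars` dataset so the example is self-contained, rather than the synthetic data above), here is one way to pull out a single predictor’s p-value with `broom` and `dplyr`, together with a direct computation of the overall model p-value from the F-statistic that avoids parsing printed text:

```r
library(dplyr)
library(broom)

# A small stand-in model; substitute your own fit object here
fit <- lm(mpg ~ wt + hp, data = mtcars)

# p-value for one specific predictor, here "wt"
wt_p <- tidy(fit) %>%
  filter(term == "wt") %>%
  pull(p.value)
print(wt_p)

# Overall model p-value computed from the F-statistic
fstat <- summary(fit)$fstatistic
overall_p <- pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                lower.tail = FALSE)
print(unname(overall_p))
```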

Here is how we can extract the p-values from `lm()` for all predictor variables, including the intercept:

```
# Extract the p-values of the predictors from the model
predictor_pvalues <- summary(fit)$coefficients[, "Pr(>|t|)"]
# Print the predictor p-values
print(predictor_pvalues)
```


In the code above, we extract the p-values of the predictors from `fit` using the `summary()` function. The `coefficients` attribute of the summary object contains information about the coefficients, including the p-values.

By selecting the column `"Pr(>|t|)"` from the `coefficients` attribute, we specifically obtain the p-values of the predictors.

Finally, we print the predictor p-values using the `print()` function to view the results.

Executing this code will provide you with the p-values of the predictors in the regression model. These p-values indicate the statistical significance of each predictor concerning the dependent variable.

Here is an example of how to use the `broom` package to extract p-values from the regression model `fit`:

```
# Load the broom package
library(broom)
# Extract p-values using broom's tidy() function
p_values <- tidy(fit)$p.value
# Print the predictor p-values
print(p_values)
```


In the code above, we first load the `broom` package using the `library()` function. Then, we use the `tidy()` function from the `broom` package to extract information from the regression model `fit`, including the p-values of the predictors. By accessing the `p.value` column from the result of `tidy(fit)`, we obtain the p-values of the predictors. Finally, we print the predictor p-values using the `print()` function to display the results.

Running this code will extract the p-values from `lm()` in R using the `broom` package. The output will display the p-values, indicating the statistical significance of each predictor in the regression model.

Using the `broom` package simplifies the process of extracting model information, such as p-values, and provides a tidy output that can easily be analyzed or visualized further.

In conclusion, we have successfully learned how to extract p-values from `lm()` in R. By leveraging the `broom` package, we accessed vital information about statistical significance in regression models. This knowledge empowers us to make informed decisions and draw meaningful conclusions from our analyses.

With the ability to extract p-values, we can confidently identify the significance of predictors, understand their impact on the outcome variable, and communicate our findings effectively. This valuable skill applies to various fields, including Psychology, social sciences, and beyond.

I encourage you to implement these techniques in your data analysis projects, whether in research papers, theses, or real-world applications. Share this blog post with your colleagues and fellow data enthusiasts to spread knowledge and enhance the statistical prowess of your community.

Remember, extracting p-values from `lm()` in R is just one step toward unlocking the full potential of your data. Keep exploring, learning, and utilizing the vast array of statistical tools available to further your data-driven journey.

Here are some other resources that you may find helpful:

- How to Make a Residual Plot in R & Interpret Them using ggplot2
- Mastering SST & SSE in R: A Complete Guide for Analysts
- How to Calculate Z Score in R
- Report Correlation in APA Style using R: Text & Tables


Unlock the potential of the or operator in R! Explore advanced data manipulation techniques and learn how to filter, select, mutate, and summarize data based on multiple conditions. Boost your coding skills and streamline your data analysis workflows with ease.

The post Master or in R: A Comprehensive Guide to the Operator appeared first on Erik Marsja.

In this comprehensive tutorial, we will look at the capabilities of one operator that is very handy for data wrangling: or in R. We will delve into the power of the or operator, symbolized by “|”, and explore how it can be used in our data analysis workflows. Whether you are a seasoned data scientist or a beginner venturing into the realm of R programming, understanding and harnessing the full potential of the or operator will empower you to manipulate, analyze, and visualize data with unparalleled flexibility and precision.

Imagine you are investigating the interplay between cognition and hearing. As you explore the dataset, you may encounter scenarios where you must extract specific observations satisfying multiple conditions. This is precisely where the or operator becomes an invaluable tool. Using R’s or operator, you can combine logical conditions to filter your data and obtain subsets that meet your desired criteria.

Let us consider a practical example. Suppose you want to analyze the relationship between cognition and hearing in individuals above 60 years of age or with a hearing impairment. Here we can use the or operator to filter the dataset effortlessly, for example, by including only those participants who fulfill either of these conditions. This focused subset will serve as the foundation for further analysis and can enable you to gain insights specific to your research questions.

Throughout this tutorial, we will embark on a journey of discovery, exploring various applications of the or operator in R. You will learn how to construct complex logical expressions, perform efficient data filtering, and unlock the true potential of your datasets. By the end of this tutorial, you will have a solid understanding of utilizing the or operator effectively, empowering you to handle diverse data-wrangling challenges confidently.

In this blog post, we will first outline the requirements to follow along effectively. You must have R installed and an interactive development environment (IDE) like RStudio. Basic knowledge of R programming is also recommended.

Next, we will dive into synthetic data generation to create a dataset for practicing the or operator in R. This dataset will involve variables related to hearing and working memory capacity.

We will then explore various examples of utilizing the or operator in R. We will cover filtering based on multiple conditions using the or operator in conjunction with the `%in%` operator. Additionally, we will demonstrate selecting columns that match specific patterns using the `matches()` function and or in R.

Next, we will discuss selecting columns that contain specific substrings using `contains()` and the or operator. We will also showcase adding a new column based on values in another column using the or operator and `mutate()`.

Furthermore, we will demonstrate how to filter based on multiple logical conditions using the or operator and comparison operators. Conditional recoding of a variable using `case_when()` and or in R will also be covered.

Additionally, we will explore combining logical conditions with or in R within if statements, and summarizing data based on multiple conditions using or together with the `group_by()` and `summarize()` functions.

Throughout this blog post, we will provide detailed explanations and code examples to ensure a clear understanding of each concept. So let’s get started and unlock the full potential of the or operator in R for efficient data manipulation and analysis.

To effectively follow this blog post, ensure that you have R installed on your system, as it will serve as the programming language for implementing the concepts discussed. Make sure you have a version of R that is up-to-date. Additionally, it is recommended to utilize an interactive development environment (IDE) such as RStudio, Jupyter Notebook with R kernel, or Visual Studio Code with R extensions. These IDEs provide a user-friendly interface with syntax highlighting and code completion features, enabling a seamless coding experience.

While prior programming experience is not mandatory, having a basic understanding of R programming will greatly facilitate your comprehension. Familiarity with concepts such as variables, functions, conditional statements, and data structures in R is beneficial and will aid in following the examples provided.

By meeting these requirements, you will be well-prepared to delve into this tutorial on mastering the or operator in R for efficient data wrangling. Embrace this opportunity to enhance your data manipulation skills and gain valuable insights from the power of the ‘or’ operator in conjunction with dplyr functions.

Here, we generate a synthetic dataset specifically designed for practicing the usage of the or operator in R. This dataset will serve as a valuable resource to enhance your skills in working with logical conditions.

```
# Loading required libraries
library(dplyr)

# Generating the dataset
hearing <- c("excellent", "impaired", "normal")
wmc <- c("low", "medium", "high")

# Creating combinations of hearing and working memory capacity (WMC)
data <- expand.grid(hearing = hearing, wmc = wmc)

# Generating the dependent variable SNR
data <- data %>%
  mutate(snr = ifelse(hearing == "impaired", -6.1, -9.1))
```


In the code chunk above, we start by loading the necessary library, `dplyr`, which provides powerful functions for data manipulation in R. Next, we generate the dataset by defining the levels for the variables `hearing` and `wmc`. The `hearing` variable includes categories for “excellent,” “impaired,” and “normal,” while `wmc` consists of “low,” “medium,” and “high.”

To create a comprehensive dataset, we utilize the `expand.grid()` function. This function generates all possible combinations of the specified variables, resulting in a dataset with the combinations of `hearing` and `wmc`.

Moving forward, we introduce the dependent variable, SNR (Signal-to-Noise Ratio), to the dataset using the `mutate()` function. With the help of the `ifelse()` function, we assign values to the `snr` variable based on a conditional statement. If the value of `hearing` is “impaired,” the corresponding SNR value is set to -6.1. Otherwise, for “excellent” and “normal” hearing levels, the SNR is assigned -9.1. By using the `%>%` pipe operator, we update the dataset `data` with the newly created `snr` variable.

Here are eight examples of using the or operator in R with the provided dataset:

**Or and the `%in%` operator in R**: Here is how we can use or in R together with R's `%in%` operator:

```
filtered_data <- data %>%
  filter(hearing %in% c("excellent", "impaired") | wmc %in% "high")
```


In the code chunk above, we used the or operator in R in conjunction with the `filter()` function to create the `filtered_data` dataset. This code allows us to filter rows selectively based on specific conditions.

Using the pipe operator `%>%`, we pass the `data` dataset to the `filter()` function. Within the `filter()` function, we specify the filtering conditions using the or operator `|`.

The first condition, `hearing %in% c("excellent", "impaired")`, checks if the value of the `hearing` variable is either “excellent” or “impaired”. The `%in%` operator checks for membership in a vector, and here it determines if the value of `hearing` matches any of the specified levels.

The second condition, `wmc %in% "high"`, checks if the value of the `wmc` variable is “high”. Similarly, the `%in%` operator checks for a match between `wmc` and the specified level.

By using the or operator `|` between these conditions, we instruct R to include rows in the `filtered_data` dataset that satisfy either of the conditions. In other words, if the value of `hearing` is “excellent” or “impaired”, or if the value of `wmc` is “high”, the row will be included in `filtered_data`.

**`matches()` and or in R**: Here is another example where we use the `select()` and `matches()` functions together with or in R:

```
selected_data <- data %>%
  select(matches("hearing|wmc"))
```


In the code chunk above, we selected specific columns from the `data` dataset using the `select()` function in R.

Within the `select()` function, we used the `matches()` function along with the pattern “hearing|wmc”. This pattern specifies a regular expression that matches the column names “hearing” or “wmc”.

Using the pipe operator `%>%`, we pass the `data` dataset to the `select()` function for further processing.

As a result, the `selected_data` dataset is created, consisting of only the columns that match the specified pattern. Any column names that include “hearing” or “wmc” will be included in the `selected_data` dataset, while other columns will be excluded.

This code allows for selecting specific columns based on patterns in their names, providing flexibility in working with datasets that contain a large number of columns.

**`contains()`**: Here is a third example where we use or in R to select columns:

`selected_data <- data %>% select(contains("hear") | contains("wmc"))`


In the code chunk above, we utilized the `select()` function in R to choose specific columns from the `data` dataset. Building upon the previous example, we employed the `contains()` function within the `select()` function to identify columns based on specific substrings.

By using the pipe operator `%>%`, we passed the `data` dataset to the `select()` function, similar to the previous example.

Within the `select()` function, we incorporated the `contains()` function. This function searches for columns that contain either the substring “hear” or “wmc” in their column names.

**`mutate()`**: Here we add a column to the dataframe based on other columns using `mutate()` and or in R:

```
data <- data %>%
  mutate(high_wmc_or_impaired =
           ifelse(wmc == "high" | hearing == "impaired", "Yes", "No"))
```


In the code chunk above, we employed the `mutate()` function in R to add a new column to the `data` dataset. The new column is named `high_wmc_or_impaired`, and we used the `ifelse()` function to determine its values based on specific conditions.

Using the pipe operator `%>%`, we passed the `data` dataset to the `mutate()` function for further transformation.

Within the `mutate()` function, we utilized the `ifelse()` function to assign values to the `high_wmc_or_impaired` column. The condition `wmc == "high" | hearing == "impaired"` evaluates whether the value of the `wmc` column is “high” or the value of the `hearing` column is “impaired”.

If the condition is met, the corresponding value in the `high_wmc_or_impaired` column is set to “Yes”. Otherwise, if the condition is not satisfied, the value is set to “No”.

By incorporating the or operator `|` within the condition of the `ifelse()` function, we instruct R to evaluate both conditions and assign the appropriate value to each row in the `high_wmc_or_impaired` column.

Here we subset data in R using the or operator and the `filter()` function:

`filtered_data <- data %>% filter(wmc == "medium" | snr < -7)`


In the code chunk above, we utilized the `filter()` function in R to create the `filtered_data` dataset. This code allows us to filter rows selectively based on specific conditions.

Using the pipe operator `%>%`, we passed the `data` dataset to the `filter()` function for further processing.

Within the `filter()` function, we specified the filtering conditions using the or operator `|`. The first condition, `wmc == "medium"`, checks if the value of the `wmc` column is equal to “medium”. The second condition, `snr < -7`, checks if the value of the `snr` column is less than -7.

By using the or operator `|` between these conditions, we instruct R to include rows in the `filtered_data` dataset that satisfy either of the conditions. In other words, if the value of `wmc` is “medium” or the value of `snr` is less than -7, the row will be included in `filtered_data`.

**`case_when()` and or in R**: Here we recode a variable using the `case_when()` function and the or operator in R:

```
data <- data %>%
  mutate(hearing_group =
           case_when(hearing == "excellent" | hearing == "impaired" ~ "Good",
                     TRUE ~ "Normal"))
```


In the code chunk above, we used the `mutate()` function in R to add a new column called `hearing_group` to the `data` dataset. We employed the `case_when()` function to assign values to the new column based on specific conditions.

Using the pipe operator `%>%`, we passed the `data` dataset to the `mutate()` function for further transformation.

Within the `mutate()` function, we utilized the `case_when()` function to evaluate different conditions. The first condition, `hearing == "excellent" | hearing == "impaired"`, checks if the value of the `hearing` column is either “excellent” or “impaired”.

If the condition is met, the corresponding value in the `hearing_group` column is set to “Good”. Otherwise, if the condition is not satisfied, the value is set to “Normal”.

By using the or operator `|` within the condition of the `case_when()` function, we evaluate multiple conditions and assign the appropriate value to each row in the `hearing_group` column.

Here is another example of using the or operator in R, this time within an if statement:

```
for (i in 1:nrow(data)) {
  if (data$hearing[i] == "impaired" | data$wmc[i] == "high") {
    print(paste("Participant", i, "meets the criteria"))
  }
}
```


In the code chunk above, we used a for loop to iterate over each row in the `data` dataset and performed a conditional check on the values of the `hearing` and `wmc` columns.

The loop starts with the `for` statement, where we define a loop variable `i` that iterates from 1 to the total number of rows in the `data` dataset, specified by `nrow(data)`.

Within the loop, we used an `if` statement to check if the value of the `hearing` column at the current iteration (`data$hearing[i]`) is equal to “impaired” or if the value of the `wmc` column at the current iteration (`data$wmc[i]`) is equal to “high”.

If the condition is true, meaning either `hearing` is “impaired” or `wmc` is “high”, we execute the code block inside the curly braces. In this case, we print a message using the `print()` function and the `paste()` function to concatenate the strings “Participant”, the current iteration value `i`, and “meets the criteria”.

By using the `paste()` function, we create a formatted string that displays the participant number (`i`) and indicates that they meet the specified criteria.

**`group_by()` with `summarize()`**: Here we calculate descriptive statistics using the `group_by()` and `summarize()` functions together with or in R:

```
summary_data <- data %>%
  group_by(wmc %in% c("medium", "high") | snr < -8) %>%
  summarize(mean_snr = mean(snr))
```


In the code chunk above, we performed data summarization using the `group_by()` and `summarize()` functions in R.

Using the pipe operator `%>%`, we passed the `data` dataset to the `group_by()` function for grouping the data based on specific conditions. Within the `group_by()` function, we used the condition `wmc %in% c("medium", "high") | snr < -8`. This condition checks if either the `wmc` column value is “medium” or “high”, or if the `snr` column value is less than -8, and groups the data accordingly. Next, we used the `%>%` operator again to pass the grouped data to the `summarize()` function for calculating the mean of the `snr` column.

Within the `summarize()` function, we specified `mean_snr = mean(snr)` to compute the mean of the `snr` column for each group. As a result, the `summary_data` dataset is created, containing the mean `snr` value for the groups defined by the specified conditions. These examples demonstrate various ways the or operator can be used to filter, select, mutate, and summarize data in R, showcasing its versatility and power in data manipulation tasks.

In this post, you have learned about the powerful applications of the or operator in R. You learned how it can greatly enhance your data manipulation and analysis workflows. By mastering the use of or in combination with various functions and operators, you can efficiently filter, select, mutate, and summarize your data based on multiple conditions.

Throughout the post, we explored different examples and techniques that showcased the versatility of or in R. You gained insights into filtering data based on multiple conditions using the or operator with `%in%`, matching specific patterns with `matches()`, and selecting columns containing specific substrings using `contains()`.

Additionally, you learned how to add new columns based on values in other columns using the or operator with `mutate()`, and perform conditional recoding using `case_when()`. We also discussed how to combine logical conditions with or in R's if statements and how to summarize data based on multiple conditions using `group_by()` and `summarize()`.

Now that you have acquired these valuable skills, apply them to your data analysis tasks. Share this post with your colleagues and friends who might benefit from learning about the versatile or operator in R. Together, we can expand our knowledge and leverage the full potential of R in data manipulation and analysis.

Here are some resources that you might find helpful:

- Countif function in R with Base and dplyr
- Plot Prediction Interval in R using ggplot2
- How to Create a Sankey Plot in R: 4 Methods
- Sum Across Columns in R – dplyr & base
- How to Convert a List to a Dataframe in R – dplyr
- R Count the Number of Occurrences in a Column using dplyr
- How to Rename Column (or Columns) in R with dplyr

The post Master or in R: A Comprehensive Guide to the Operator appeared first on Erik Marsja.

]]>Keeping your software tools up-to-date is essential for a seamless and efficient workflow, and the R programming language is no exception. In this blog post, we will explore the importance of updating R, discuss the circumstances that may necessitate an update, address the possibility of updating R within RStudio, and explore different methods for upgrading […]

The post Update R: Keeping Your RStudio Environment Up-to-Date appeared first on Erik Marsja.

]]>Keeping your software tools up-to-date is essential for a seamless and efficient workflow, and the R programming language is no exception. In this blog post, we will explore the importance of updating R, discuss the circumstances that may necessitate an update, address the possibility of updating R within RStudio, and explore different methods for upgrading R. So, if you want to stay ahead with the latest features, bug fixes, and improvements, read on to discover how to update R in your RStudio environment.

To check which R version you are currently running, you can open up the console (or start RGui) and follow these steps:

- In RStudio, you will find the R Console in the bottom left panel. Click on it to open the console.
- In the console, type `version` and press Enter. This command will display detailed information about your R installation, including the version number.
- After running the `version` command, the R version information will be displayed in the console output.

Note that you can also see which R version you are running in the top left corner, above the console.
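The version check can also be run programmatically; `R.version.string` is the built-in shortcut for the one-line summary that the `version` command prints at the top:

```r
# Full version details, identical to typing `version` in the console
print(version)

# One-line summary such as "R version 4.3.1 (2023-06-16)"
R.version.string
```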

Staying current with R updates brings many benefits that may enhance your programming experience. Here are three key reasons why updating R is important:

- **Performance and Stability**: Each R update incorporates enhancements and bug fixes, leading to a more stable and reliable environment for data analysis and statistical modeling tasks.
- **Security**: Updates can include security patches that protect your R environment from potential vulnerabilities, ensuring the confidentiality and integrity of your data.
- **New Features and Functionality**: Regular updates introduce new features, packages, and functionalities that expand the capabilities of R, allowing you to leverage the latest advancements in data science and statistical analysis.

While it is generally beneficial to keep R up-to-date, there are specific scenarios where updating becomes particularly important:

- **Package Dependencies**: Some R packages require specific R versions to function correctly. When using packages that have updated their dependencies, you may need to update R to ensure compatibility and take advantage of the latest package features.
- **Bug Fixes**: Updates often address known bugs, resolving issues affecting your current workflow. You can benefit from the bug fixes and improvements introduced in the latest version by updating R.
- **New Functionality**: If you need a specific feature or functionality only available in a newer version of R, updating becomes necessary to access those capabilities.

Yes, you can update R directly within your RStudio environment. RStudio provides a convenient interface for managing your R installations and simplifies updating to the latest R version. By following a few straightforward steps, you can ensure that RStudio runs on the most recent version of R.

Apart from updating R within RStudio, there are alternative methods available for upgrading R:

- **Manual Installation**: Visit the official R website (https://www.r-project.org/) and download the latest version of R for your operating system. Follow the installation instructions to upgrade your R installation.
- **Package Managers**: Some package managers, such as Homebrew for macOS and Chocolatey for Windows, offer the option to install and update R. Using package managers simplifies the process by handling dependencies and updates automatically.
- **Command Line**: If you prefer working with the command line, you can use specific commands to update R. For example, in R, you can use the `installr` package and its `updateR()` function to upgrade R from within the R console.

Having understood the significance of updating R and the circumstances that call for an update, let us now delve into the outline of this blog post.

The outline of the blog post is as follows. First, we start by addressing how to check the R version in RStudio, ensuring you have the necessary information to proceed. Next, we delve into the importance of updating R, discussing its benefits to your data analysis and statistical modeling workflows. We explore the situations that may prompt the need for an R update, ensuring you understand when it is necessary to stay up-to-date.

Moving on, we address whether R can be updated within RStudio, providing clarity on the capability of this popular IDE. We then delve into the different methods available for upgrading R, offering you a range of options to suit your preferences and requirements.

In the subsequent sections, we provide a detailed guide on updating R in Windows using three methods. Firstly, we explain the process of manually updating R, ensuring you have complete control over the installation. Secondly, we demonstrate how to update R using the convenient `updateR()` function within the RStudio environment, streamlining the update process. Lastly, we explore the usage of the Chocolatey Package Manager as an alternative method for updating R in Windows.

Updating R in Windows is a straightforward process that can be achieved through various methods. In the following subsections, we will explore three different approaches. First, we will describe how to download and install the latest version of R manually. Then, we will discuss the convenience of using the `updateR()` function within the R console. Finally, we will explore utilizing package managers to update R in your Windows environment seamlessly.

Here are the steps to manually update R in Windows:

- **Visit the official R website**: Go to the official website of R, which is https://www.r-project.org/.
- **Choose a CRAN mirror**: On the R homepage, navigate to the “Download” section and select a CRAN mirror that is geographically close to your location. CRAN mirrors are servers that host R and its packages.
- **Select the base distribution**: Under the selected CRAN mirror, click on the link corresponding to your Windows operating system. Choose the “base” distribution, which includes the essential components of R.
- **Download the installer**: On the base distribution page, click the link to download the installer file for the latest version of R. The installer file usually has a name like “R-x.x.x-win.exe,” where “x.x.x” represents the version number.
- **Run the installer**: Locate the downloaded installer file on your computer and double-click on it to run the installation process.
- **Choose installation options**: Follow the prompts provided by the installer to select the desired installation options. You can typically accept the default settings unless you have specific preferences.
- **Install R**: Proceed with the installation by clicking the appropriate buttons. The installer will extract the necessary files and install R on your Windows system.
- **Verify the installation**: Once the installation is complete, you can verify the updated R version by launching RStudio or opening the R console. In the console, type `version` and press Enter. The displayed version should match the latest version you installed.

Here is how to update R using the `updateR()` function:

- **Launch RGui**: Open RGui to access the R console, where you will run the update command.
- **Load the installr package**: If you have not already, install and load the installr package using `install.packages("installr")` and `library(installr)`.
- **Run the updateR() function**: In the R console, type `updateR()` and press Enter to initiate the update process.
- **Follow the prompts**: The `updateR()` function will guide you through the update process, displaying available updates and asking for confirmation.
- **Select update options**: Depending on your preferences, you may be prompted to choose update options such as updating only R or including packages.
- **Update process**: The function will handle the necessary steps, including downloading and installing the latest R version.
- **Verify the update**: After the update process completes, you can verify the updated R version by typing `version` in the R console and pressing Enter.

Note that this method is the same as upgrading R from within RStudio (we open up RStudio instead of RGui).
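Put together, the whole `updateR()` route is only a few lines. Note that `updateR()` is interactive — it asks for confirmation and downloads the installer — so run it from the RGui or RStudio console rather than from a script:

```r
# Install installr once, then load it
install.packages("installr")
library(installr)

# Check CRAN for a newer release and walk through the upgrade
updateR()

# Afterwards, confirm the new version
version
```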

Updating R using Chocolatey Package Manager:

- **Install Chocolatey**: If you don't have Chocolatey installed, visit the Chocolatey website (https://chocolatey.org/) and follow the installation instructions for Windows.
- **Open Command Prompt or PowerShell**: Launch Command Prompt or PowerShell on your Windows system.
- **Check Chocolatey installation**: In the Command Prompt or PowerShell, type `choco -v` and press Enter to verify that Chocolatey is installed correctly.
- **Update Chocolatey packages**: To ensure you have the latest Chocolatey packages, type `choco upgrade chocolatey` and press Enter. Follow any prompts if necessary.
- **Update R**: In the Command Prompt or PowerShell, type `choco upgrade r` and press Enter. Chocolatey will handle the process of updating R to the latest version.
- **Verify the update**: After completing the update, you can verify the updated R version by launching RStudio or opening the R console. Type `version` and press Enter. The displayed version should match the latest version available.
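As a single elevated (administrator) PowerShell session, the steps above boil down to three commands; the `-y` flag, an addition here, simply auto-confirms Chocolatey's prompts:

```shell
# Verify Chocolatey is installed
choco -v

# Update Chocolatey itself, then upgrade R
choco upgrade chocolatey -y
choco upgrade r -y
```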

To upgrade R itself from within RStudio, you can make use of the `installr` package, which provides a convenient way to install or upgrade R. Follow these steps:

- **Install the installr package**: If you haven't installed the `installr` package yet, open RStudio and run the command `install.packages("installr")` to install it.
- **Load the installr package**: Once the package is installed, load it into your R session by running `library(installr)`.
- **Run the updateR() function**: In the R console, type `updateR()` and press Enter. This function will check for the latest version of R and guide you through the upgrade process.
- **Follow the prompts**: The `updateR()` function will prompt you with instructions, including asking for confirmation before upgrading. Follow the prompts accordingly.
- **Upgrading R**: The function will handle the necessary steps to upgrade R, including downloading and installing the latest version.
- **Verify the upgrade**: After the upgrade process completes, you can verify the updated R version by typing `R.version.string` in the R console and pressing Enter.

Note that using the `installr` package to upgrade R from within RStudio follows the same method as using it from other R environments like RGui (as in a previous section).

By using the `installr` package within RStudio, you can conveniently upgrade R to the latest version without downloading and installing it manually. This lets you keep your RStudio environment up-to-date and benefit from R's latest improvements and features.

When it comes to upgrading R, the easiest method is using the `installr` package and its `updateR()` function. This approach simplifies the process by automating the update steps and providing clear prompts. With just a few commands in the R console, you can effortlessly initiate the upgrade process, making it ideal for users who prefer a straightforward and user-friendly approach.

Following the `installr` package, using the Chocolatey package manager also offers a relatively easy way to update R. However, it does come with some drawbacks. Before utilizing Chocolatey, you need to install the package manager itself, which adds an extra step to the process. Additionally, using Chocolatey requires knowledge of running PowerShell commands, which might be unfamiliar to some users. Despite these minor challenges, updating R once Chocolatey is set up becomes a relatively streamlined task.

On the other hand, manually installing a new version of R is the most cumbersome method for upgrading. This process involves downloading the latest R installer, running the installation package, and ensuring that all dependencies are properly managed. It requires more manual intervention and is prone to errors or compatibility issues. Consequently, it is generally considered the least convenient option for users seeking a quick and hassle-free upgrade experience.

Considering the ease of use, the `installr` package with its `updateR()` function stands out as the simplest and most user-friendly method for upgrading R. However, depending on individual preferences and technical proficiency, users may opt for the Chocolatey package manager or manual installation when needed.

In the blog post, we have learned about the importance of keeping R up-to-date and explored various methods to update it within RStudio. We discovered how to check the R version, identified situations requiring an R update, and clarified that R can be updated within the RStudio environment. We explored three different approaches for upgrading R in Windows: manual installation, utilizing the `updateR()` function, and leveraging the Chocolatey Package Manager.

By following the outlined steps and utilizing the provided methods, you can easily ensure that your R installation remains current, benefiting from the latest features, improvements, and bug fixes. Keeping R up-to-date is crucial for staying at the forefront of data analysis and statistical modeling.

I encourage you to share this blog post with others who may find it useful. If you have found value in this content, linking back to it or mentioning it in your own articles, blog posts, or social media posts can help others discover these valuable insights.

Remember, regularly updating R is a small investment that can yield significant benefits in terms of efficiency, compatibility, and access to the latest advancements in the R ecosystem.

Here are some other R resources you might find helpful:

- How to use %in% in R: 8 Example Uses of the Operator
- Countif function in R with Base and dplyr
- Mastering SST & SSE in R: A Complete Guide for Analysts
- Select Columns in R by Name, Index, Letters, & Certain Words with dplyr
- Test for Normality in R: Three Different Methods & Interpretation
- Report Correlation in APA Style using R: Text & Tables

The post Update R: Keeping Your RStudio Environment Up-to-Date appeared first on Erik Marsja.

]]>Learn to calculate and interpret SSE/SSR and SST in R. Understand their significance, generate fake data, fit a linear model, and calculate SST and SSR using different methods, including ANOVA. Gain insights into evaluating model performance and enhance your statistical analysis skills. A comprehensive guide for data analysts and researchers.

The post Mastering SST & SSE in R: A Complete Guide for Analysts appeared first on Erik Marsja.

]]>Calculating the sum of squared residuals (SSR, also known as the sum of squared errors; SSE) in R provides valuable insights into the quality of statistical models. In addition, computing the total sum of squares (SST) is crucial for understanding the overall variability in the data. Whether you are delving into psychology or hearing science, these calculations can offer a robust framework. For example, they can evaluate model performance and draw meaningful conclusions.

In psychology, understanding SSE in R allows researchers to evaluate the accuracy of a predictive model and determine how well the model fits the observed data. By quantifying the discrepancies between predicted and actual values, researchers can gain insights into the effectiveness of their theories and hypotheses. Similarly, in hearing science, calculating SSR in R is important for analyzing auditory perception and assessing the performance of models predicting listeners' responses. This evaluation aids in fine-tuning models and optimizing their predictive capabilities.

Moreover, the computation of SST in R can play an essential role in both fields. By capturing the total variability in the data, SST allows us to differentiate between the inherent variation within the data and the variability accounted for by the model. This information is indispensable in gauging the model's explanatory power and discerning the significance of the predictors under investigation.

In the next section, the outline of the post will be described, providing a step-by-step guide on how to calculate SSE/SSR and SST in R. By following this guide, you will gain a solid understanding of these calculations and learn how to leverage them effectively in your research and data analysis.

The outline of the post begins by discussing the requirements for calculating the sum of squared errors (SSE), also known as the sum of squared residuals (SSR), and the total sum of squares (SST) in R. It then answers the questions of what SSE, SSR, and SST are, providing explanations and highlighting their significance. Next, the post demonstrates how to fit a linear model in R and subsequently guides readers through calculating SST and SSR/SSE. The post also explains how these calculations can be performed using an ANOVA object. Throughout the content, examples and code snippets illustrate the concepts. Lastly, the post summarizes the key points covered, emphasizing the importance of understanding and calculating SSE/SSR and SST in statistical analysis.

The requirement for this post is a basic understanding of statistical analysis and regression models. Familiarity with the R programming language, including the usage of functions like `lm()` and `%in%`, and the dplyr package, is essential. You should also run the latest version of R (or update R to the latest version). Additionally, a foundational understanding of the concepts (e.g., SSE/SSR and SST) is necessary. Readers should be comfortable with data manipulation using dplyr, conditional operations with `%in%`, and fitting linear regression models using `lm()`. Prior knowledge of ANOVA (Analysis of Variance) and its use in R would be beneficial. The post assumes readers have a dataset available to perform the calculations.

This section will briefly describe SSE/SSR and SST.

The sum of squared errors (SSE) is a statistical measure that quantifies the overall discrepancy between observed data and the predictions made by a model. It calculates the sum of the squared differences between the actual values and the corresponding predicted values. SSE is commonly used to evaluate the accuracy and goodness of fit of regression or predictive models. A lower SSE indicates a better fit, meaning the model’s predictions align more closely with the observed data. The sum of squared errors is sometimes referred to as the residual sum of squares.

The total sum of squares (SST) is a statistical measure representing a dataset’s total variability. It quantifies the sum of the squared differences between each data point and the overall mean of the dataset. SST provides a baseline reference for understanding the total variation in the data before any predictors or models are considered. By comparing the SST with the sum of squared residuals (SSR), one can assess how much of the total variation is accounted for by the model. SST is essential in calculating the coefficient of determination (R-squared) to evaluate the model’s explanatory power.
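To see how SSE and SST fit together, here is a short, self-contained sketch (with simulated data, not the dataset used later in this post) showing that R-squared is one minus the ratio of SSE to SST:

```r
# Sketch: the relation between SST, SSE, and R-squared (simulated data)
set.seed(1)
x <- rnorm(20)
y <- 2 * x + rnorm(20)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)   # total variability in y
SSE <- sum(residuals(fit)^2)  # variability left unexplained by the model
r_squared <- 1 - SSE / SST    # coefficient of determination

all.equal(r_squared, summary(fit)$r.squared)  # TRUE
```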

In the next section, we will generate fake data that we will use to calculate the SSE, SSR, and SST in R.

Here, we use the `dplyr` package to generate fake data:

```
library(dplyr)
# Creating a fake dataset
set.seed(123) # Set seed for reproducibility
# Variables
group <- rep(c("A", "B", "C"), each = 10)
x <- rnorm(30, 50, 10)
y <- ifelse(group %in% c("A", "B"), x * 2 + rnorm(30, 0, 5), x * 3 + rnorm(30, 0, 5))
# Creating a data frame
data <- data.frame(group, x, y)
# Standardizing the data
data <- data %>%
  mutate(
    x_std = scale(x),
    y_std = scale(y)
  )
```

In the code chunk above, we create a fake dataset with three groups (`A`, `B`, and `C`) and two variables (`x` and `y`). The values of `x` are randomly generated from a normal distribution, and `y` is calculated based on the groups and `x`, with added noise. Note that the `%in%` operator is used in the `ifelse` statement to check whether the values in the `group` variable are either “A” or “B”; specifically, the expression `group %in% c("A", "B")` is evaluated for each element of the `group` variable. Next, we create a dataframe called `data` and then use `mutate` from `dplyr` to create standardized versions of `x` and `y` using the `scale` function. This dataset will be used to calculate SSE, SSR, and SST in the following parts of the blog post. In the next section, we will fit a regression model using the data above.

Here is how to fit a linear regression model in R:

`model <- lm(y ~ x, data = data)`

In the code chunk above, we fitted a linear model using the `lm()` function. Of course, we used the fake data. However, if you use your own data, make sure to replace `y` with the name of your dependent variable. Furthermore, you need to replace `x` with the name of your independent variable (and add any others you have). Finally, adjust `data` to the name of your dataset.

- Probit Regression in R: Interpretation & Examples
- How to Make a Residual Plot in R & Interpret Them using ggplot2
- Plot Prediction Interval in R using ggplot2
- How to Standardize Data in R

Here is a general code chunk for calculating the total sum of squares in R:

`SST <- sum((data$y - mean(data$y))^2)`

In the code chunk above, we calculate the total sum of squares (SST) using a formula. First, we calculate the mean of the dependent variable `y` by using the `mean` function on `data$y`. This provides the average value of `y` across all observations.

Next, we take each individual observation of `y` in the dataset `data` and subtract the mean value from it. This subtraction is performed for every observation and creates a vector of deviations from the mean. These deviations are then squared using the `^2` operator.

Finally, we use the `sum` function on the squared deviations, resulting in the sum of all the squared differences between each observation of `y` and the mean value.

Note that SST can also be obtained by adding the explained sum of squares (the squared differences between the predicted values and the mean) to the SSE. In the next section, we will calculate the residual sum of squares in R using `model` (i.e., the results from the regression model we previously fitted).
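The relationship between SST, the explained sum of squares, and SSE can be verified numerically. Here is a self-contained sketch with simulated data (not the post’s dataset):

```r
# For an OLS model with an intercept, SST = explained SS + SSE
set.seed(123)
x <- rnorm(30, 50, 10)
y <- 2 * x + rnorm(30, 0, 5)
model <- lm(y ~ x)

SST <- sum((y - mean(y))^2)
SS_explained <- sum((fitted(model) - mean(y))^2)
SSE <- sum(residuals(model)^2)

all.equal(SS_explained + SSE, SST)  # TRUE
```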

Here are two methods that we can use to calculate the sum of squared errors using R:

Here is one way to find the SSR in R:

`SSR <- sum(residuals(model)^2)`

In the code chunk above, we calculated the sum of squared residuals in R using a formula. First, we apply the `residuals` function to the `model`. This function returns the differences between the dependent variable’s observed values and the model’s corresponding predicted values. These differences are referred to as residuals.

Next, we use the `^2` operator to square each residual, element-wise. Squaring the residuals ensures that positive and negative differences contribute positively to the overall SSR calculation. Finally, the `sum` function is used to add up all the squared residuals, resulting in the SSR.

Here is how we can find the sum of squared errors in R:

`SSR <- sum((data$y - predict(model))^2)`

In the code chunk above, we calculate the SSE in R using a formula (the same as for the SSR). First, we use the `predict` function on the `model`. This function generates predicted values of the dependent variable based on the model’s coefficients and the predictor variables in `data`.

Next, we subtract each predicted value from the corresponding observed value of the dependent variable `y` in `data`. This subtraction calculates the difference, or error, between the observed and predicted values for each data point. We then square the errors using the `^2` operator. Finally, the `sum` function adds up all the squared errors, resulting in the sum of squared errors (SSE).

As seen in the image above, we can also use the `deviance` function to calculate the SSE.

To perform the above calculations using an ANOVA object in R, you can utilize the `anova()` function and a fitted model. Here is an example using the built-in dataset `mtcars`:

```
# Fitting an ANOVA model
model <- lm(mpg ~ cyl + hp, data = mtcars)
anova_result <- anova(model)
# Extracting SSE from the lm object
SSE <- sum(model$residuals^2)
# Calculating SST
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
```

In the code chunk above, we fit a linear model to explore the relationship between the dependent variable `mpg` and the independent variables `cyl` and `hp`. We utilized the `lm()` function to perform the linear regression analysis, with the `mtcars` dataset serving as the data source.

Next, we applied the `anova()` function to the fitted model, resulting in the `anova_result` object that contains the ANOVA table. Here are the results:

To extract the sum of squared errors (SSE) from the fitted object, we squared the residuals obtained from `model` using the `^2` operator and summed them up using `sum()`, just like in the previous examples. Notice that the SSE is also shown in the ANOVA table.

Finally, to calculate the total sum of squares (SST), we subtracted the mean of `mpg` in `mtcars` from each observed value of `mpg`, squared the differences, and summed them using `sum()`.
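Both quantities can also be read directly from the ANOVA table itself. Here is a self-contained sketch using the same `mtcars` model as above; the row and column names follow the output of `anova()` for an `lm` fit:

```r
# Refit the model so the sketch is self-contained
model <- lm(mpg ~ cyl + hp, data = mtcars)
anova_result <- anova(model)

# The residual sum of squares (SSE) is the "Sum Sq" entry of the Residuals row
SSE_tab <- anova_result["Residuals", "Sum Sq"]

# With an intercept, the sequential sums of squares add up to SST
SST_tab <- sum(anova_result[["Sum Sq"]])

all.equal(SSE_tab, sum(residuals(model)^2))                 # TRUE
all.equal(SST_tab, sum((mtcars$mpg - mean(mtcars$mpg))^2))  # TRUE
```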

In this post, we have learned about the prerequisites and concepts required for calculating SSE/SSR and SST in R. We discussed the importance of these measures in statistical analysis and regression modeling. By fitting a linear model in R using the `lm()` function, we gained hands-on experience in applying regression techniques.

Throughout the post, we explored the step-by-step process of calculating SST/SSR and SSE. We learned how to find the total sum of squares by calculating the squared deviations from the mean, determine the sum of squared residuals by squaring and summing the residuals, and calculate the sum of squared errors by comparing observed and predicted values.

Moreover, we demonstrated the calculations using an ANOVA object, showcasing another approach to obtaining SSE, SSR, and SST. By understanding these calculations, researchers and analysts can better assess model performance and interpret the variation in their data.

I hope this post has provided you with a comprehensive understanding of SSE, SSR, and SST in R. By sharing this post on your favorite social media platforms, you can help others gain insights into these fundamental concepts. Furthermore, we encourage you to refer back to this post in your work, whether it be research papers, reports, or blog posts, to enhance the accuracy and clarity of your statistical analysis.

Here are some resources you may find helpful:

- How to Convert a List to a Dataframe in R – dplyr
- Countif function in R with Base and dplyr
- Report Correlation in APA Style using R: Text & Tables
- How to Create a Sankey Plot in R: 4 Methods
- Sum Across Columns in R – dplyr & base
- Test for Normality in R: Three Different Methods & Interpretation
- Update R: Keeping Your RStudio Environment Up-to-Date

The post Mastering SST & SSE in R: A Complete Guide for Analysts appeared first on Erik Marsja.

In this post, you will learn how to report correlation according to APA. Adhering to APA (American Psychological Association) guidelines is crucial when reporting correlation analysis in academic research. Whether you are conducting research in psychology, cognitive hearing science, or cognitive science, APA style is often required by journals and conferences. This post will provide a step-by-step guide on reporting correlation in APA style using R, including creating tables.

To report correlation in APA style, we can follow a specific template that includes, e.g., the correlation method, sample size, degrees of freedom, correlation coefficient value, and p-value. APA 7th edition also recommends reporting confidence intervals for correlation coefficients. In R, we can use functions such as `cor.test()` to obtain these statistics for Pearson’s r and other correlation methods. Here is how we can report Pearson’s product-moment correlation coefficient according to APA:

The outline of this post is to provide you with a comprehensive guide to reporting correlation results in APA style using R. Before diving into the specifics, we will discuss the requirements, including installing several R packages. Next, we’ll explore the necessary data and how to report correlation results for Pearson’s correlation (r), Spearman’s Rho, and Kendall’s Tau. Furthermore, we will provide templates for each reporting style and explain how to create a correlation matrix in R.

In the following section, we will guide you through reporting Pearson’s, Spearman’s, and Kendall’s correlation coefficients in APA 7 style using R. We will also show you how to create a correlation table in APA format using the `apaTables` and `rempsyc` packages. We will also walk you through the four steps to create an APA formatted table in R: loading the packages, creating a correlation matrix, calculating and adding the mean and standard deviation, and creating the table itself. By the end of the post, you will have the tools and knowledge to report your correlation results in APA style confidently. Finally, we will compare `apaTables` and `rempsyc` for creating APA tables, highlighting some of the pros and cons of each.

There are several R packages available that can help researchers report their results in APA style. One such package is the `apaTables` package, which provides functions for creating APA-style tables of descriptive statistics, ANOVA results, and correlation matrices. In this post, we will use the `apaTables` and `rempsyc` packages. We will also use the `report` package, which is useful for creating APA-style reports, including tables, figures, and statistical analyses. Note that the `report` package is part of the easystats collection, a set of very helpful packages (e.g., correlation). Another useful package is `papaja`, which allows researchers to write APA-style manuscripts in RMarkdown format. Other packages that can be helpful for APA-style reporting include `ggplot2` and `dplyr`. With these packages and others like them, researchers can easily generate high-quality reports and manuscripts that meet APA style guidelines.

To follow this blog post, you should understand R programming concepts, such as data manipulation, syntax, and regular expressions. Additionally, you will need to have some packages installed, including `tidyverse`, `report`, `corrr`, `apaTables`, and `rempsyc`.

To install the packages, you can use the `install.packages()` function followed by the package name within quotation marks. For example, to install `tidyverse`, you can type `install.packages("tidyverse")` into the console.
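If you prefer, all the packages used in this post can be installed in one call. A small sketch (this assumes an internet connection and a configured CRAN mirror):

```r
# Install only the packages from this post that are not already on the system
pkgs <- c("tidyverse", "report", "corrr", "apaTables", "rempsyc")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```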

If you plan to use `apaTables`, you may only need to install that package, as it includes many other necessary packages. However, if you plan to use `rempsyc`, you must install all the other required packages separately. Note that it is also recommended to update R to the latest version.

In this blog post, we will be using `tidyverse` packages such as `dplyr`, `magrittr`, and `tidyr` to manipulate data. We will also be using the `corrr` package to create a correlation matrix. Finally, we will use the `rempsyc` package to create an APA-formatted table of the correlation results.

Here is an example dataset we can use to practice reporting and creating correlation tables following APA 7 style in R:

```
library(MASS)
library(dplyr)
set.seed(20230507)
# Generate correlation matrix for Span variables
cor_span <- matrix(c(1, 0.6, 0.45,
                     0.6, 1, 0.55,
                     0.45, 0.55, 1), ncol = 3)
# Generate correlation matrix for Effect variables
cor_effect <- matrix(c(1, 0.47, 0.45,
                       0.47, 1, 0.39,
                       0.45, 0.39, 1), ncol = 3)
# Generate Span variables
span_vars <- as.data.frame(mvrnorm(100, mu = c(0, 0, 0), Sigma = cor_span)) %>%
  rename(OSpan = V1, RSpan = V2, DSpan = V3)
# Generate Effect variables
effect_vars <- as.data.frame(mvrnorm(100, mu = c(0, 0, 0), Sigma = cor_effect)) %>%
  rename(Stroop = V1, Flanker = V2, Simon = V3)
data <- cbind(effect_vars, span_vars) %>% as_tibble()
# Modify the Span variables to have a higher correlation with Effect variables
data <- data %>% mutate(OSpan = 0.25 * Stroop + OSpan * 0.75,
                        RSpan = 0.25 * Flanker + RSpan * 0.75,
                        DSpan = 0.25 * Simon + DSpan * 0.75)
```

In the code chunk above, we generated two correlation matrices using the `matrix` function. The first matrix was for the three Span variables, and the second was for the three Effect variables. We then used the `mvrnorm` function from the `MASS` package to generate 100 observations for each of the variables. We assigned these observations to separate data frames using `as.data.frame` and then used `rename` from the `dplyr` package to give them meaningful names.

To create the correlation between the Working Memory Capacity and Inhibition variables, we used the `mutate` function from `dplyr` to modify the Working Memory Capacity variables. We multiplied each variable by 0.75 and added 0.25 times the corresponding Effect variable. This gave us Span variables that were moderately correlated with the Effect variables, with correlation coefficients ranging from 0.14 to 0.31.

To summarize, in the code chunk, we used the MASS and dplyr packages to generate and manipulate data for six variables with different correlation structures, allowing for the simulation of complex relationships between variables commonly used in cognitive psychology research. Here is a quick overview of the data:

Here we get a quick view of how the variables are correlated:

In the next section, we will provide a detailed step-by-step guide on obtaining and reporting correlation statistics in APA style using R, along with examples of how to customize the output to meet specific requirements.

In this section, we will provide text templates for reporting Pearson’s R, Spearman’s Rho, and Kendall’s Tau in APA style. These templates can be used to report the results of correlation analyses in research articles, theses, and dissertations.

Here is a template you can use to report Pearson’s correlation according to APA 7:

Pearson’s correlation was used to assess the relationship between [Variable X] and [Variable Y] (r = [correlation coefficient], p < [p-value], 95% CI [lower bound, upper bound], N = [sample size]).

Here is a template you can use to report Spearman’s correlation according to APA:

Spearman’s rho was used to assess the relationship between [Variable X] and [Variable Y] (ρ = [correlation coefficient], N = [sample size], p < [p-value]).

Here is how you can report Kendall’s Tau using APA style:

Kendall’s tau-b was used to assess the relationship between [Variable X] and [Variable Y] (τ_b = [correlation coefficient], p < [p-value]).
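The templates above can also be filled in programmatically. Here is a minimal, hand-rolled sketch using `cor.test()` and `sprintf()` on made-up data (the variables `x` and `y` are illustrative, not from this post’s dataset):

```r
# Fill the Pearson template from a cor.test() result
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
ct <- cor.test(x, y)

apa_text <- sprintf(
  "r = %.2f, p %s, 95%% CI [%.2f, %.2f], N = %d",
  ct$estimate,
  ifelse(ct$p.value < .001, "< .001", sprintf("= %.3f", ct$p.value)),
  ct$conf.int[1], ct$conf.int[2],
  length(x)
)
apa_text
```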

In the next section, we will look at how we can use the R package `report` to report correlation results according to APA format.

Here is how to calculate Pearson’s product-moment correlation coefficient and use the `report()` function:

```
library(report)
library(magrittr)
# Pearson's Correlation Coefficient:
results.r <- data %$%
  cor.test(x = Stroop, y = Flanker)
report(results.r)
```

In the code chunk above, we used the `library` function to load the `report` and `magrittr` packages into our R environment.

Next, we used the `%$%` operator from the `magrittr` package to avoid having to repeatedly refer to the dataframe we are working with. Then, we used the `cor.test` function to calculate Pearson’s correlation coefficient between the `Stroop` and `Flanker` variables in our data frame.

Finally, we passed the output of `cor.test` to the `report` function to generate a report of the correlation analysis. Here is the output:

As you can see, this is not how we should report correlation according to APA. We need to adjust the output from the `report()` function:

```
#| results: asis
# Store the original report output in a variable
report_output <- report(results.r) %>%
  # Remove the first sentence
  summary()
# Italicize the statistical letters:
report_output <- gsub("\\b(r|CI|t|p)\\b", "*\\1*", report_output)
# Print the output
report_output
```

In the code chunk above, the report generated from `results.r` is stored in the `report_output` variable. Then, the `summary()` function is used to remove the first sentence from the report. Finally, the `gsub()` function is used to italicize the statistical letters, which are identified using the regular expression `\\b(r|CI|t|p)\\b`. Note that “|” means “or” in regular expressions.

As you can see, it does not entirely follow our previous templates, but it will do.
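To see what that word-boundary regex does in isolation, here is a tiny demo on a made-up results string (the numbers are illustrative):

```r
# Wrap standalone statistical letters in asterisks (Markdown italics)
example <- "r = 0.12, 95% CI [0.02, 0.22], t(98) = 2.10, p = .034"
gsub("\\b(r|CI|t|p)\\b", "*\\1*", example)
#> "*r* = 0.12, 95% *CI* [0.02, 0.22], *t*(98) = 2.10, *p* = .034"
```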

Here is another example of how we can use R to report correlation results in APA style:

```
# Spearman's Rho
results.r <- data %$%
  cor.test(x = Stroop, y = Flanker, method = "spearman")
# Store the original report output in a variable
report_output <- report(results.r) %>% summary()
# Extract the sample size from the data
n <- nrow(data)
# Generate the new report text
report_output <- gsub("rho = ", "*r~s~* = ", report_output) %>%
  gsub(", S = [0-9]+\\.[0-9]+", "", .) %>%
  gsub("\\)", paste0(", *n* = ", n, ")"), .) %>%
  gsub("\\b(p)\\b", "*\\1*", .) %>%
  # Swapping places of n = 100 and p ... :
  gsub("([*][a-z][*]\\s*[<>~=]\\s*[.0-9]+)([,]\\s*)([*][n][*]\\s*[=]\\s*[0-9]+)", "\\3\\2\\1", .)
# print report
report_output
```

In the code chunk above, `gsub()` is used to modify the report text generated by the `report()` function. First, we use `gsub()` to replace “rho =” with “*r~s~* =” in the `report_output` variable. Next, we remove “, S = [0-9]+\.[0-9]+” from `report_output` using regex. The third `gsub()` replaces the closing parenthesis in `report_output` with “, *n* = {sample size})”, using `n` extracted from the data. We use `gsub()` a fourth time to italicize the letter “p” in `report_output`. Finally, the fifth `gsub()` swaps the places of “*p* {comparison} {value}” and “*n* = {sample size}” in `report_output` using regex.

Here is a third example, in which we report Kendall’s Tau according to APA style:

```
# Kendall's Tau
results.r <- data %$%
  cor.test(x = Stroop, y = Flanker, method = "kendall")
# Store the original report output in a variable
report_output <- report(results.r) %>% summary()
# Generate the new report text
report_output <- gsub("tau = ", "*r~τ~* = ", report_output) %>%
  gsub(", z = [0-9]+\\.[0-9]+", "", .) %>%
  gsub("\\b(p)\\b", "*\\1*", .)
# print report
report_output
```

In the code chunk above, we compute Kendall’s tau correlation coefficient. Using `gsub()`, we first replace “tau =” with “*r~τ~* =” in `report_output`. Then, we remove “, z =” and the corresponding value from `report_output` using another `gsub()`. Lastly, we italicize “p” using a third `gsub()`. All of these operations are chained with the pipe operator. Here is the generated text, which follows APA style for reporting correlation (Kendall’s tau):

Now that we know how to report correlations according to APA 7 using R, we can also use RMarkdown to create APA tables. To create these tables, we need to generate a correlation matrix. This will allow us to easily view all the correlations in our dataset and report them clearly and concisely. Using R and RMarkdown, we can streamline the process of creating APA tables and ensure that they meet the necessary formatting guidelines.

Here is how to get an APA correlation table in R:

```
library(apaTables)
apa.cor.table(data, filename = "APA_Correlation_Table.doc",
              table.number = 1)
```

In the code chunk above, we use the `apaTables` package to create an APA-style correlation table. The `apa.cor.table()` function takes our `data` object and generates a correlation matrix according to APA 7 guidelines. We specify the `filename` argument to save the table in a Word document named “APA_Correlation_Table.doc”. Additionally, we specify the `table.number` argument to set the table number as “Table 1”.

That was simple. These few lines of code will create an APA 7-compliant correlation table in R and save it to a Word document for reporting and analysis purposes. Here is the output:

In addition, it is worth mentioning that `apaTables` is not only capable of creating APA-style correlation tables but also ANOVA tables, regression tables, and more. However, a drawback of using `apaTables` is that the output can only be saved as Word files and cannot be easily included in PDF or HTML reports. Therefore, we will look at how to use the `rempsyc` package to report correlation results in an APA formatted table.

Here are the steps for creating an APA 7 formatted correlation matrix using the `nice_table()` function from the `rempsyc` package.

First, we need to load the `corrr` and `rempsyc` packages:

```
library(rempsyc)
library(corrr)
```

Next, we will create the correlation matrix using `correlate()`:

```
# Compute correlation matrix
corr_mat <- data %>%
  correlate() %>%
  # Remove the upper triangle (filled with NA)
  shave()
```

In the code chunk above, we compute a correlation matrix using the `correlate()` function from the `corrr` package. We then remove the upper triangle of the matrix, replacing those values with NA, using `shave()`. In the following step, we will calculate the mean and the standard deviation and add them to the dataframe.

The third step is to calculate and add mean and standard deviation to the correlation matrix:

```
corr_tab <- data %>%
  # Calculate mean and standard deviation
  summarise_if(is.numeric, list(M = mean, SD = sd)) %>%
  # Transform the data from wide to long:
  tidyr::pivot_longer(cols = everything(),
                      names_sep = "_", names_to = c("term", "Stat")) %>%
  # To keep the order, we change term to a factor
  mutate(term = factor(term, levels = unique(term))) %>%
  # Spread the data so we have two columns: mean and sd
  tidyr::spread(Stat, value)

# Now we add the correlation matrix
corr_tab <- left_join(corr_tab, corr_mat, by = "term") %>%
  # Remove the last column
  select(-last_col()) %>%
  # And round the numeric values
  mutate_if(is.numeric, round, digits = 2)
```

In the code chunk above, we first calculate the mean and standard deviation of the dataset. Next, we transform the data from wide to long format using `pivot_longer()`. We separate the variable names into two columns: “term” and “Stat”. To maintain the order of the variables in the long format, we change the “term” column to a factor with unique levels. Then, we spread the data so that we have two columns: “M” and “SD”. After that, we add the correlation matrix to the table using `left_join()` with `by = "term"`. Finally, we remove the last column of the table using `select(-last_col())` and round the numeric values to two decimals.

Here is how to use the `nice_table()` function to generate an APA formatted correlation table in R:

```
corr_table_apa <- corr_tab %>%
  rename(Variable = term) %>%
  # Create the table with rempsyc nice_table
  nice_table(
    italics = 2:3,
    title = c("Table 1", "Means, standard deviations, and correlations"),
    note = "M and SD are used to represent mean and standard deviation, respectively"
  )
```

In the code chunk above, we create an APA-formatted correlation table using the rempsyc package’s `nice_table` function. First, we rename a column to “Variable” with `rename()`. Then, we use `nice_table()` to create the table. We set the second and third columns (mean and standard deviation) to be in italics using `italics = 2:3`. We add a title and a note with the `title` and `note` arguments, respectively. The note specifies that “M” and “SD” are used to represent the mean and standard deviation.

We can also save the APA formatted table in R as a .docx file:

```
# Save table to word
mypath <- tempfile(fileext = ".docx")
flextable::save_as_docx(corr_table_apa, path = mypath)
```

Here are the correlation results in a table formatted according to APA 7:

As you may have noticed, we had to write more code than when using the `apaTables` package.

apaTables is a straightforward package with a single function for creating APA-style tables. It allows for quick and easy creation of tables and is useful for those who do not want to spend much time customizing their tables. However, a downside of `apaTables` is that it only works for creating Word documents, not PDFs.

In contrast, `rempsyc` requires more coding but provides many table customization options. `rempsyc` can create tables in both Word and PDF formats, making it a more versatile option. Additionally, `rempsyc` includes several functions for creating tables with descriptive statistics, regression results, and more. However, as you have seen above, we would have to calculate the confidence intervals ourselves when using `rempsyc`.

In summary, `apaTables` is a good option for those who want a quick and easy solution for creating simple APA-style tables in Word documents. However, if you need more customization options or want to create tables in PDF format, `rempsyc` might be a better choice despite requiring more coding.

This blog post covered different R packages and templates for reporting correlation results in APA style. It also explained the requirements for following the post, including basic R knowledge and installing the necessary packages. The post demonstrated how to report Pearson’s r, Spearman’s rho, and Kendall’s tau correlation results. It also showed how to create an APA-formatted table using the `apaTables` package and customize a table using `rempsyc`. The post outlined the steps for creating a correlation matrix, adding the mean and standard deviation, and creating an APA table.

While `apaTables` offers a simple solution for creating an APA-formatted table, the result cannot be saved as a PDF. In contrast, `rempsyc` provides more customization options but requires more coding.

To conclude, this blog post provided a comprehensive guide for reporting correlation results in APA style using R. If you found this post helpful, please share it on social media and leave a comment to let us know your thoughts.

Here are other resources you might find helpful:

- How to Convert a List to a Dataframe in R – dplyr
- Papaja – APA manuscripts made easy
- R Count the Number of Occurrences in a Column using dplyr
- Countif function in R with Base and dplyr
- How to Make a Residual Plot in R & Interpret Them using ggplot2
- Plot Prediction Interval in R using ggplot2
- How to Standardize Data in R
- Sum Across Columns in R – dplyr & base
- How to use %in% in R: 8 Example Uses of the Operator
- How to Calculate Z Score in R

The post Report Correlation in APA Style using R: Text & Tables appeared first on Erik Marsja.

The post Wide to Long in R using the pivot_longer & melt functions appeared first on Erik Marsja.

In this post, we will learn how to transform data from wide to long in R. Wide-to-long format conversion is often an important data manipulation technique in data analysis. In R, we can use many packages and their functions to transform data from a wide to a long format. These functions include the `tidyr` package’s `pivot_longer()` and the `reshape2` package’s `melt()`. The transformation is essential when dealing with data that has variables organized in a wide format, which makes it difficult to analyze or visualize using certain statistical analysis techniques or visualization libraries.

In cognitive hearing science and psychology, for example, wide format data is commonly used to store data obtained from multiple hearing or cognitive tests, such as multiple assessments of participants’ hearing and cognitive abilities. However, many analysis packages in R, including `afex`, require data to be in long format. In addition, plotting packages, such as `ggplot2`, also require data in long format for certain types of plots.

The `tidyr`

package’s `pivot_longer()`

function allows for efficient data conversion from wide to long format by gathering columns based on a set of rules. This function is handy when working with datasets with multiple sets of variables that need to be gathered separately. In contrast, the `reshape2`

package’s `melt()`

function creates a long format data frame by stacking all columns in a wide format dataset into a single column, making it easier to use with other R packages that require long format data.

In this tutorial, we will demonstrate how to transform wide format data to long format using `pivot_longer()`

from the `tidyr`

package and `melt()`

from the `reshape2`

package. We will use an example dataset of participants’ hearing and cognitive test scores to show how the two functions work. The tutorial will include two examples of using `pivot_longer()`

and one example of using `melt()`

, highlighting the similarities and differences between the two functions. The examples will show how the data changes from a wide to long format.

Wide format is a common way to store data in tables. In wide format, each variable is stored in its own column, and each observation is stored in its own row. This format is useful when dealing with data that has a small number of variables but a large number of observations. For example, in psychology, a study might have a column for each question on a survey, with each participant’s responses stored in a single row.

Here is an example of how data might be stored in wide format in R. Imagine that we conducted a study on the effects of caffeine on cognitive performance. In this study, we asked participants to complete a survey before and after drinking a cup of coffee. In wide format, our data might look like this:
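As a sketch of such data (the column names and values below are made up for illustration, not from an actual study):

```
# Hypothetical wide-format data: one row per participant and time point,
# with one column per survey question (Q1-Q3)
caffeine_wide <- data.frame(
  Participant = rep(1:3, each = 2),
  Time = rep(c("Before", "After"), times = 3),
  Q1 = c(3, 4, 2, 5, 3, 4),
  Q2 = c(2, 3, 4, 4, 3, 5),
  Q3 = c(5, 5, 3, 4, 2, 4)
)
caffeine_wide
```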

In this example, we have three variables (Q1, Q2, and Q3) and six observations (two for each participant). While this format is easy to understand, it can be difficult to analyze because the data is spread across many columns.

Long format is another way to store data in tables. In long format, each row holds a single measurement, with one column identifying which variable the measurement belongs to and another column holding its value. This format is useful when dealing with data that has many variables per subject. For example, in psychology, a study might measure many different aspects of a person’s personality, and the responses to those items can be stacked in long format.

Here is an example of how data might be stored in long format in R. Imagine that we conducted a study on the effects of mindfulness meditation on well-being. Here we asked participants to complete a questionnaire with multiple questions. In long format, our data might look like this:
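A minimal sketch of such long-format data (again, the names and values are invented for illustration):

```
# Hypothetical long-format data: one row per participant-question pair
mindfulness_long <- data.frame(
  Participant = rep(1:3, each = 3),
  Question = rep(c("Q1", "Q2", "Q3"), times = 3),
  Response = c(4, 3, 5, 2, 4, 3, 5, 5, 4)
)
mindfulness_long
```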

In this example, each question is a variable, and we have only three participants. While this format is more challenging to understand at first glance, it is often easier to analyze because the values are organized into a single column.

Long format is particularly useful when dealing with data that has multiple observations for each subject or when dealing with data that has many variables with similar meanings. Long format is also useful for storing data in a way that is more easily analyzed using certain statistical methods and packages.

For example, when using the R package tidyverse, many of the functions work best with data in long format. In addition, when using analysis of variance (ANOVA) or linear regression, long format can make it easier to perform the analysis. In particular when dealing with repeated measures or within-subjects designs.

There are exceptions to the rule mentioned above (that long format is often preferred). In some cases, wide format may be preferred. For instance, when dealing with data that has a small number of variables and a large number of observations. Additionally, some statistical methods and packages may require data to be in a specific format. Therefore, it is always important to consult the documentation or consult with a statistician before deciding on the format of your data.

In summary, while long format is often useful for storing and analyzing data in certain situations, it’s important to consider the specific needs of your analysis and consult with experts when making decisions about data format.

In R, many packages can transform data from wide to long format. Here are a few examples:

- **tidyr**: a package within the `tidyverse` that provides functions for reshaping data. The `pivot_longer()` function can be used to convert data from wide to long format in R. We can use this function to specify which columns should be converted into “long” format, as well as what the resulting column names should be.
- **reshape2**: another package that provides functions for reshaping data. The `melt()` function can be used to convert data from wide to long format. This function lets you specify which columns should be converted into “long” format and, of course, what the resulting column names should be.
- **data.table**: a package that provides high-performance data manipulation tools. The `melt()` function in data.table can also be used to convert wide to long format and is similar to the `melt()` function in reshape2.
- **reshape**: an older package that provides functions for reshaping data. Its `melt()` function can be used to convert data from wide to long format and is similar to the `melt()` function in reshape2.

When using these packages, it is important to note that the syntax for converting data from wide to long format can vary slightly between packages. Additionally, some packages may offer more advanced functionality; for example, some can reshape data based on regular expressions or other patterns. See an example of going from wide to long format when we report correlation results in an APA table.

In general, however, these packages provide powerful tools for transforming data between wide and long formats, which can be a key step in preparing data for analysis. In this post, we will use the tidyr and reshape2 packages to transform data from wide to long format in R.

If you want to learn how to convert your data from wide to long format using R, dplyr, and tidyr, there are a few things you should know before getting started. First, you need some basic knowledge of R and how to read data. You should be comfortable using, e.g., the `read.csv()` function to read your data from a CSV file or a similar format. Of course, in this tutorial, we are working with example data.

Once you have your data loaded into R, you will need to have the `dplyr`

and `tidyr`

packages installed. These packages are part of the larger `tidyverse`

ecosystem and provide tools for working with tidy data. You can install these packages using the `install.packages()`

function in R. Here is how to install a package in R:

`install.packages('tidyverse')`


In addition to these basic steps, you may need to perform some data cleaning or manipulation before pivoting your data. For example, you may need to rename your column names or filter out missing values. However, these steps depend on your data’s specific structure and content. In the next section, we will generate some example data. Note that, e.g., for security reasons, you should keep an updated version of R (i.e., the latest stable version).

Here is an example dataset we can use to practice converting from wide to long format in R. Imagine that we have data on participants’ performance in a hearing and cognition study. We measured their scores on two different hearing tests, as well as their scores on two different cognitive tests. The dataset is called `hearing_cognition_scores`

.

```
library(dplyr)
# Generate example data
hearing_cognition_scores <- data.frame(
  participant_id = 1:20,
  hearing_HINT1 = sample(1:10, 20, replace = TRUE),
  hearing_HINT2 = sample(1:10, 20, replace = TRUE),
  cognitive_Stroop1 = sample(1:100, 20, replace = TRUE),
  cognitive_Stroop2 = sample(1:100, 20, replace = TRUE)
)
```


In this dataset, each row corresponds to a single participant, and there are five columns: `participant_id`, `hearing_HINT1`, `hearing_HINT2`, `cognitive_Stroop1`, and `cognitive_Stroop2`. The variables `hearing_HINT1` and `hearing_HINT2` represent scores on the two hearing (HINT) tests, while `cognitive_Stroop1` and `cognitive_Stroop2` represent scores on the two cognitive (Stroop) tests. In the next section, before working with the data, we will look at the syntax of the `pivot_longer()` function.

As you may know by now, `pivot_longer()` is a function in the `tidyr` package that we can use to transform wide data into long format. The first argument, `data`, is the input data frame.

We can use the second argument, `cols`

, to specify which columns to pivot. It can be a numeric vector or a selection helper such as `starts_with()`

, `ends_with()`

, or `contains()`

. In a way, it works similarly to the `select()`

function from `dplyr`

.

Additional arguments include `cols_vary`, which controls how the output rows are arranged relative to the original columns, and `names_to`, which specifies the name of the new column containing the previously wide column names. We can use the `names_prefix`

and `names_sep`

to remove a prefix from the variable names or split them into multiple columns using a separator.

`names_pattern`

allows for specifying a regular expression pattern to match and split variable names. `values_to`

specifies the name of the new column containing the previously wide cell values. Other optional arguments include `values_drop_na`

to remove missing values, `values_ptypes`

to specify the data type of the new column containing cell values, and `values_transform`

to apply a transformation function to the cell values.
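Putting a few of these arguments together, a minimal call might look like the sketch below (the toy data frame and the `time_` column names are made up for illustration):

```
library(tidyr)

# Toy wide data: one score column per time point
wide <- data.frame(id = 1:2, time_1 = c(10, 12), time_2 = c(11, 14))

long <- pivot_longer(wide,
                     cols = starts_with("time_"), # columns to pivot
                     names_to = "time",           # new column for the old names
                     names_prefix = "time_",      # strip the "time_" prefix
                     values_to = "score")         # new column for the values
long
```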

To summarize, `pivot_longer()`

is a versatile function that can handle various types of data transformations flexibly. In the next section, before transforming data from wide to long in R using pivot_longer() we will look at the syntax of the `melt()`

function.

We can also use the `melt()`

function in R to reshape a dataset from a wide format to a long format. It takes in the data set that needs to be reshaped as its first argument. The additional arguments are passed to or from other methods, and are optional.

We can, for example, use the `na.rm`

argument to remove missing values from the data set. If set to `TRUE`

, any missing values are removed, resulting in a smaller data set. If set to `FALSE`

, the missing values are preserved, and the resulting data set will be the same size as the original data set.

Moreover, we can use the `value.name`

argument to specify the name of the new variable that will be created to store the values of the original dataset. By default, the name of the new variable is set to “value”. However, this argument can be used to specify a different name for the new variable.

The `...`

argument allows for additional arguments to be passed to or from other methods. These arguments will depend on the specific implementation of `melt()`

being used, and may not be necessary in all cases.

To summarize, the `melt()`

function is a useful tool for reshaping a dataset from a wide format to a long format, allowing for easier analysis and visualization.
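As a minimal sketch (using the same kind of toy data frame idea; the names are illustrative):

```
library(reshape2)

wide <- data.frame(id = 1:2, time_1 = c(10, 12), time_2 = c(11, 14))

# Stack all non-id columns into variable/value pairs
long <- melt(wide, id.vars = "id",
             variable.name = "time", value.name = "score")
long
```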

In this section, we are going to use `pivot_longer()`

in two examples to convert from wide to long in R.

Here is how to transform data from wide to long format in R using `pivot_longer`

:

```
library(tidyr)
# Convert wide to long format
hearing_cognition_long <- pivot_longer(hearing_cognition_scores,
                                       cols = -participant_id,
                                       names_to = c("test_type", "Test"),
                                       names_sep = "_",
                                       values_to = "Score")
```


In the code chunk above, we use the `pivot_longer()`

function from the `tidyr`

package to convert a wide data set to a long format. The data set used in this example is called `hearing_cognition_scores`

, and it has a column for participant IDs and two columns each for hearing and cognitive test scores. The goal is to transform the data set from a wide to a long format. Now we can, for example, analyze the data more easily.

To achieve this, we pass the `hearing_cognition_scores`

data frame to the `pivot_longer()`

function, along with several arguments. The `cols`

argument specifies which columns to pivot. In this case, we want to pivot all columns except for the `participant_id`

column, so we use `-participant_id`

.

We use the `names_to`

argument to specify the new names for the columns that are created during the pivot. Next, we want to split the original column names into two new columns, one for the type of test (`test_type`

) and one for the test number (`Test`

).

Using the `names_sep`

argument, we specify the character that separates the test_type and Test columns in the original column names. In our example, the separator is an underscore (_). However, it could be another character, such as “.”.

Finally, we use the `values_to`

argument to specify the name of the new column that will contain the values that were previously in the hearing and cognitive test score columns. In this case, we use `Score`

.
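A quick sanity check on the result can look like this (the scores themselves are random, but the shape and column names follow from the pivot above):

```
# 20 participants x 4 test columns should give 80 rows and 4 columns:
# participant_id, test_type, Test, Score
dim(hearing_cognition_long)
head(hearing_cognition_long)
```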

In this example, we will use `pivot_longer()`

to gather both the hearing and cognitive test data into a single column, and use `separate()`

to split the test data into two separate columns based on the test type.

```
# Load required packages
library(dplyr)
library(tidyr)
# Pivot the data to a long format
hearing_cognition_long <- hearing_cognition_scores %>%
  pivot_longer(
    cols = starts_with("hearing_") | starts_with("cognitive_"),
    names_to = c("test_type", "test_number"),
    names_pattern = "(\\D+)(\\d+)",
    values_to = "score"
  ) %>%
  # Split test_type column into two separate columns
  separate(test_type, c("test_domain", "test_name"), sep = "_")
```


In the code chunk above, we start by loading the required packages for our analysis, which are dplyr and tidyr. Next, we use the `%>%`

operator from the dplyr package to pipe the `hearing_cognition_scores`

data frame into the `pivot_longer()`

function. We specify the columns to pivot using the `cols`

argument and use the `starts_with()` function to select all columns that start with either `hearing_` or `cognitive_`. Here we make use of the `|` operator, which means “or” in R.

We then use the `names_to`

argument to split the column names into two parts, `test_type`

and `test_number`

. The `names_pattern`

argument uses regular expressions to specify the pattern for the split. In this case, we use `(\\D+)(\\d+)`

, which matches any non-digit characters followed by one or more digits.

The `values_to`

argument specifies the name of the column where the values will be stored, in this case “score”. We then use the `%>%`

operator again to pipe the output of `pivot_longer()`

into the `separate()`

function, which is used to split the `test_type`

column into two separate columns. We use the `sep`

argument to specify the separator as an underscore `_`

.

Finally, we assign the result to a new data frame called `hearing_cognition_long`

.

In this section, we are going to use the R package `reshape2`

to transform data from wide to long.

Here is how to convert data from wide to long using the `melt()`

function:

```
# Load required packages
library(reshape2)
library(tidyr) # for separate() and the %>% pipe
# Convert wide to long format
hearing_cognition_long <- melt(hearing_cognition_scores,
                               id.vars = "participant_id",
                               variable.name = "test_type",
                               value.name = "Score") %>%
  separate(test_type, into = c("test_domain", "test_number"), sep = "_")
```


In the code chunk above, we start by loading the reshape2 package to work with our data. We then use the `melt()`

function to convert our wide format `hearing_cognition_scores`

data into a long format `hearing_cognition_long`

data. The `id.vars`

argument specifies the column to use as the identifier variable, in this case, the `participant_id`

column. The `variable.name`

argument specifies the name of the column that will contain the variable names in the melted data, which is `test_type`

in this case. The `value.name`

argument specifies the name of the column that will contain the variable values in the melted data, which is `Score`

.

After melting the data, we use the `%>%`

pipe operator to pass the melted data to the `separate()`

function. We use the `test_type`

column as the column to separate, and we separate it into two new columns, `test_domain`

and `test_number`

, using the `into`

argument. The `sep`

argument specifies the character to separate the `test_type`

column by, which is the underscore character in this case.

This code achieves the same result as the previous code chunks that used the `pivot_longer()`

function from the `tidyr`

package. However, in this case, we used the `melt()`

function from the `reshape2`

package to convert the data from wide to long format and then used the `separate()`

function to split the `test_type`

column into two new columns.

As you now know, `pivot_longer()`

is part of the `tidyr`

package. The function provides a lot of flexibility for reshaping data. One of its main advantages is its ability to handle multiple columns at once. We have seen that this function can be useful when many columns need to be melted. Additionally, it allows users to specify column types and even perform operations on the melted data as part of the same pipeline. Another benefit is that `pivot_longer()`

is generally faster than `melt()`

for large datasets.

On the other hand, `melt()`

from the `reshape2`

package is a simpler function that may be more intuitive for beginners. It is particularly useful for melting data that has a specific structure. For example, when we have data with columns that follow a consistent naming convention. Another benefit of `melt()`

is that it can handle data with missing values more easily than `pivot_longer()`

.

However, `melt()`

has some limitations when compared to `pivot_longer()`

. For example, it does not allow users to specify column types, which can make subsequent data processing more challenging. Additionally, it is not as flexible as `pivot_longer()`

in terms of its ability to handle complex data transformations.

In summary, both `pivot_longer()`

and `melt()`

are useful functions for converting data from wide to long format in R. `pivot_longer()`

is a more powerful and flexible option, while `melt()`

is simpler and more intuitive. The choice between the two depends on the specific requirements of the task at hand, as well as the user’s experience with each function.

In this post, we have learned about transforming data from wide to long format using two different R packages: `tidyr`

and `reshape2`

. We have seen that both packages provide functions that can be used to achieve this, namely `pivot_longer()`

and `melt()`

, respectively. Furthermore, we have compared the two functions in terms of syntax and functionality and have seen that they differ in some aspects, but both provide efficient ways of reshaping data.

We have also looked at example data from cognitive hearing science and psychology, where melting data from wide to long format can be useful for various analysis methods and visualization tools such as the `afex`

package and `ggplot2`

, respectively.

The syntax of `pivot_longer()`

and `melt()`

functions has been discussed in detail, outlining their various arguments and options. While `pivot_longer()`

uses a more intuitive syntax that is easy to read and understand, `melt()`

is more flexible in handling a wide range of data structures.

Two examples of using `pivot_longer()`

and one example of using `melt()`

have been presented, highlighting the practical applications of these functions.

In conclusion, both `tidyr`

and `reshape2`

provide efficient and flexible ways of transforming data from wide to long format in R. The choice of which function to use may depend on the specific requirements of the analysis and the structure of the data. By using these functions, researchers can save time and effort in preparing their data for analysis and visualization.

If you found this post useful, please consider sharing it on social media. If you have any questions or suggestions, please leave a comment below.

Here are some other tutorials that you may find helpful:

- How to Rename Factor Levels in R using levels() and dplyr
- Sum Across Columns in R – dplyr & base
- How to Calculate Z Score in R
- Countif function in R with Base and dplyr
- R Count the Number of Occurrences in a Column using dplyr
- Plot Prediction Interval in R using ggplot2
- How to Convert a List to a Dataframe in R – dplyr
- How to Add a Column to a Dataframe in R with tibble & dplyr

The post Wide to Long in R using the pivot_longer & melt functions appeared first on Erik Marsja.


]]>In this blog post, you will learn how to carry out a countif function using base R and dplyr. Countif is a powerful function that allows you to count the number of times a certain condition is met in a dataset.

Countif is particularly useful in cognitive hearing science, where researchers often need to analyze large datasets of auditory signals. For example, you might want to use countif to count the number of times a particular sound occurs in a recording, or to count the number of times a listener correctly identifies a target sound in a speech recognition task.

The countif function can be applied to various data types, including vectors, matrices, and data frames. However, this blog post will focus on using countif on dataframes.

We will start by exploring how to use the base R function `sum()`. We will show you how to use `sum()` to count the number of rows in a data frame that meet a specific condition.

Next, we will introduce the dplyr package, which provides a more intuitive syntax for data manipulation. We will show you how to use the `mutate()` and `sum()` functions in dplyr to achieve the same results as with base R’s `sum()`.

Finally, we will create a custom countif function that combines the power of base R and dplyr. Our custom function will allow you to easily count the number of rows in a data frame that meet a specific condition, using a simple and intuitive syntax.

By the end of this blog post, you will have a deep understanding of how to use the countif function in R to analyze and manipulate data frames in cognitive hearing science.

Here is a simple example of how to count values in a vector in R using a condition. We can use the sum() function as a COUNTIF function. Here is how we count how many times the value 2 appears in the vector `v`

:

```
v <- c(1, 4, 2, 5, 2, 6, 2, 7, 3, 2)
sum(v == 2) # returns 4: the value 2 appears four times
```


In the code chunk above, we used the `sum()`

function as a countif function to count the number of occurrences of a specific value in the vector. Here, we counted how often the value 2 appears in the vector ‘v’.

In the following section, we will generate fake data to practice more advanced countif examples using both base R and dplyr.

Let us first create a fake dataset that we can use to practice countif in R.

Here we will create a dataset with the following columns:

- `Subject`: Unique identifier for each participant
- `Group`: Categorical variable indicating the group the participant belongs to (e.g., control, experimental)
- `HearingProblem`: Binary variable indicating whether the participant reports having subjective hearing problems (0 = no, 1 = yes)
- `Age`: Continuous variable indicating the age of the participant
- `HearingLoss`: Continuous variable indicating the degree of hearing loss of the participant
- `DepressionScore`: Continuous variable indicating the level of depressive symptoms of the participant
- `AnxietyScore`: Continuous variable indicating the level of anxiety symptoms of the participant

```
library(dplyr)
set.seed(2023) # for reproducibility
n <- 100 # number of participants
df <- data.frame(
  Subject = paste0("P", 1:n),
  Group = rep(c("Control", "Experimental"), each = n/2),
  HearingProblem = ifelse(rbinom(n, 1, 0.5) %in% 1, 1, 0),
  Age = round(rnorm(n, mean = 50, sd = 10), 1),
  HearingLoss = round(rnorm(n, mean = 30, sd = 10), 1),
  DepressionScore = round(rnorm(n, mean = 20, sd = 5), 1),
  AnxietyScore = round(rnorm(n, mean = 15, sd = 5), 1)
) %>%
  mutate(Group = factor(Group))
```


In the code chunk above, we generate a simulated hearing study dataset using R.

To ensure reproducibility, we set the seed to 2023 using `set.seed()`.

We create a data frame called `df`

with 100 participants using `data.frame()`

.

We used `rep()`

to repeat the two levels of the `Group`

variable, “Control” and “Experimental”, n/2 times each.

To set the values of the `HearingProblem`

column, we use `ifelse()`

and generate a random binomial distribution using `rbinom()`

. We also use `%in%`

in R to check if the generated value is equal to 1.

We generate random values for `Age`

, `HearingLoss`

, `DepressionScore`

, and `AnxietyScore`

using `rnorm()`

.

Finally, we use `%>%`

to pipe the data frame into `mutate()`

and convert the `Group`

column to a factor using `factor()`

. In the following sections, we will work with Base R to use the `sum()`

function as a countif function in R.

In base R, we can use the following functions as countif() functions:

- We can use `sum()` to count the number of `TRUE` values that result from a logical expression. For example, `sum(x == 5)` will return the number of elements in vector `x` that are equal to 5.
- We can use `length()` to count the number of elements in a vector or list that meet a certain condition. For example, `length(x[x > 5])` will return the number of elements in `x` that are greater than 5.
- We can use `which()` to get the indices of elements in a vector that meet a certain condition, and then use `length()` to count those indices. For example, `length(which(x > 5))` will return the number of elements in `x` that are greater than 5.

These functions can be used in combination with logical operators, such as `==`, `>`, `<`, `<=`, `>=`, and `!=`, to count the number of elements in a vector that meet a certain condition.
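A quick sketch showing that the three approaches agree (the vector `x` is made up for illustration):

```
x <- c(2, 7, 9, 4, 6, 8)

# Three equivalent ways to count the elements greater than 5
sum(x > 5)           # TRUE values count as 1; returns 4
length(x[x > 5])     # subset, then count; returns 4
length(which(x > 5)) # indices, then count; returns 4
```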

We can use the `sum()`

function in combination with the `==` operator to count the number of rows where a certain condition is met. For example, to count the number of participants who have a hearing problem, we can use the following code:

`sum(df$HearingProblem == 1)`


Using `sum()` as a `countif()` function in R will, in this case, return the number of rows in the `HearingProblem` column that are equal to 1. In our case, this is the number of participants with a hearing problem.

To count the number of rows where a certain condition is greater than or equal to a specific value, we can use the `sum()` function together with the `>=` operator. For example, to count the number of participants who are aged 60 or older:

`sum(df$Age >= 60)`


This will return the number of rows in the `Age` column that are greater than or equal to 60, i.e., the number of participants who are aged 60 or older.

To count the number of rows where a certain condition is less than or equal to a specific value, we can use the `sum()`

function with the `<=` operator. For example, to count the number of participants who have a depression score less than or equal to 18, we can use the following code:

```
sum(df$DepressionScore <= 18)
```


This will return the number of rows in the `DepressionScore`

column that are less than or equal to 18, i.e., the number of participants who have a depression score less than or equal to 18.

To count the number of rows where a certain condition is between two values, we can use the `sum()`

function with the `>`

and `<`

operators. For example, to count the number of participants who have a hearing loss between 25 and 35, we can use the following code:

`sum(df$HearingLoss > 25 & df$HearingLoss < 35)`


This will return the number of rows in the `HearingLoss`

column that are greater than 25 and less than 35, i.e., the number of participants who have a hearing loss between 25 and 35.

To count the number of rows where a certain condition is not equal to a specific value, we can use the `sum()`

function with the `!=`

operator. For example, to count the number of participants who do not have a hearing problem, we can use the following code:

```
sum(df$HearingProblem != 1)
```


This will return the number of rows in the `HearingProblem`

column that are not equal to 1, i.e., the number of participants who do not have a hearing problem.

Here is a countif example using `dplyr`

functions to count the number of elements in a vector that meet a certain condition. We use the same examples as before:

```
library(dplyr)
df %>%
  select(HearingProblem, Age, HearingLoss, DepressionScore, AnxietyScore) %>%
  mutate(
    HearingProblemCount = sum(HearingProblem == 1),
    AgeCount = sum(Age < 60),
    HearingLossCount = sum(HearingLoss >= 20),
    DepressionScoreCount = sum(DepressionScore != 20),
    AnxietyScoreCount = sum(AnxietyScore > 10)
  )
```


In the code chunk above, we used `dplyr`

functions to select the columns of interest from the `df`

data frame, and then used `mutate()`

to create new columns that count the number of elements in each column that meet a certain condition. The `sum()`

function is used with various logical operators (`==`

, `<`

, `>=`

, `!=`

, `>`

) to count the number of elements that meet the specified condition. Finally, the resulting data frame shows the original and new columns with the counts.

Here is a `countif()` function created in R using dplyr:

```
library(dplyr)

countif <- function(df, conditions) {
  # conditions: a named list, e.g. list(Age = c(60, "greater or equal")),
  # where each element holds a comparison value and a logical operator
  df %>%
    # Count the matching rows for each column named in the conditions list
    summarise(across(all_of(names(conditions)), function(x) {
      value <- as.numeric(conditions[[cur_column()]][1])
      op <- conditions[[cur_column()]][2]
      sum(case_when(
        op == "equals" & x == value ~ 1L,
        op == "less" & x < value ~ 1L,
        op == "greater" & x > value ~ 1L,
        op == "less or equal" & x <= value ~ 1L,
        op == "greater or equal" & x >= value ~ 1L,
        op == "not equals" & x != value ~ 1L,
        TRUE ~ 0L))
    }))
}
```


The function `countif()`

takes two arguments: `df`, which is a dataframe, and `conditions`, a named list where each name is a column and each value is a vector containing the comparison value and a logical operator. The function counts the number of rows in `df` that satisfy each condition and returns the counts as a one-row data frame.

The function has the limitation that it can only handle one condition per column, and the supported logical operators are limited to “equals”, “less”, “greater”, “less or equal”, “greater or equal”, and “not equals”. To date, I have not extended it to handle multiple conditions, as in the earlier examples in this post.

In this blog post, you have learned how to perform COUNTIF-style operations in R. Starting with a fake dataset, we explored how to count rows. Specifically, we counted rows equal to, greater than or equal to, and less than or equal to some value. We also looked at how to count rows between two values and rows not equal to some value using base R. After that, we created a general function using dplyr to count rows based on different conditions. This function takes two arguments: first the dataframe, and then a list of conditions, where each condition contains a column name, an operator, and a value. The function then returns a new data frame with the counts for each condition. Finally, we saw that dplyr offers a more concise way to perform the same operations.

Overall, the COUNTIF function in R is a powerful data analysis tool commonly used to filter, manipulate, and transform data. Automating the counting process can save you time and effort, allowing you to focus on more complex tasks. By mastering this function, you can improve your data analysis skills and become more efficient in R.

If you found this blog post helpful, consider sharing it on social media or leaving a comment below. We always appreciate feedback and suggestions for future posts. Thank you for reading!

Here are some other resources you might find useful:

- How to Convert a List to a Dataframe in R – dplyr
- R Count the Number of Occurrences in a Column using dplyr
- How to Rename Column (or Columns) in R with dplyr
- How to Add a Column to a Dataframe in R with tibble & dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Standardize Data in R
- How to Calculate Z Score in R

The post Countif function in R with Base and dplyr appeared first on Erik Marsja.

]]>Discover how to analyze non-parametric data using the Wilcoxon Signed-Rank Test in Python. Learn how to interpret the results and compare different Python packages for running the test. Get started now!

The post Wilcoxon Signed-Rank test in Python appeared first on Erik Marsja.

]]>In this blog post, we will explore the Wilcoxon Signed-Rank test in Python, a non-parametric test for comparing two related samples. We will learn about its hypotheses and its uses in psychology, hearing science, and data science.

To carry out the Wilcoxon Signed-Rank test in Python, we will generate fake data and import real data. We will also perform the Shapiro-Wilk test to check for normality.

We will then move on to implementing the Wilcoxon Signed-Rank test in Python and interpreting the results. Additionally, we’ll visualize the data to better understand the test results.

Finally, we will learn how to report the results of the Shapiro-Wilks test for normality and the Wilcoxon Signed-Rank test. This will provide valuable insights into the relationship between the two related samples. By the end of this blog post, you will have a comprehensive understanding of the Wilcoxon Signed-Rank test. Importantly, you will know how to perform the test in Python and how to apply it to your data analysis projects.

Remember to consider alternatives, such as data transformation, when data does not meet the assumptions of the Wilcoxon Signed-Rank test.

The Wilcoxon signed-rank test is a non-parametric statistical test used to determine whether two related samples come from populations with the same median. We can use this non-parametric test when our data is not normally distributed. This test can be used instead of a paired samples t-test.

The test is conducted by ranking the absolute differences between paired observations while keeping track of their signs. Next, the sum of the ranks of the positive differences is compared with the sum of the ranks of the negative differences. The test statistic is the smaller of these two sums.
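As a concrete illustration of this procedure, the rank sums can be computed by hand with NumPy and compared against SciPy's `wilcoxon()`. The numbers below are made up for the example:

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Illustrative paired observations (not real study data)
pre = np.array([12.0, 15.0, 9.0, 11.0, 14.0, 10.0, 13.0, 8.0])
post = np.array([14.0, 14.0, 12.0, 15.0, 18.0, 9.0, 16.0, 11.0])

d = post - pre
d = d[d != 0]                    # zero differences are discarded
ranks = rankdata(np.abs(d))      # rank the absolute differences (ties get average ranks)
w_plus = ranks[d > 0].sum()      # rank sum of the positive differences
w_minus = ranks[d < 0].sum()     # rank sum of the negative differences
w = min(w_plus, w_minus)         # the Wilcoxon test statistic

stat, p = wilcoxon(pre, post)    # SciPy reports the same statistic
print(w, stat)
```

For the default two-sided test, SciPy's reported statistic is exactly this smaller rank sum.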

The test has two possible outcomes: reject or fail to reject the null hypothesis. If the test rejects the null hypothesis, we conclude that the two samples come from populations with different medians. If it fails to reject the null hypothesis, there is no evidence to suggest that the two samples come from populations with different medians.

The null hypothesis for the Wilcoxon signed-rank test is that the difference between the two related samples is zero. The alternative hypothesis is that the difference between the two related samples is not zero.

Here are three examples from psychology, hearing science, and data science when we may need to use the Wilcoxon signed-rank test:

Suppose we want to investigate whether a new therapy for depression is effective. We could administer a depression questionnaire to a group of patients before and after the therapy and then use the Wilcoxon signed-rank test to determine if there is a significant improvement in depression scores after the therapy.

Suppose we want to compare the effectiveness of two different hearing aids. We could measure the hearing ability of a group of participants with each hearing aid and then use the Wilcoxon signed-rank test to determine if there is a significant difference in hearing ability between the two hearing aids.

Suppose we want to investigate whether there is a significant difference in the time for two different algorithms to complete a task. We could run each algorithm multiple times and then use the Wilcoxon signed-rank test to determine if there is a significant difference in completion times between the two algorithms.

You will need a few skills and software packages to carry out the Wilcoxon signed-rank test in Python. Here is an overview of what you will need:

- Basic programming skills: You should be familiar with the Python programming language and its syntax. You should also have a basic understanding of statistics and hypothesis testing.
- Python environment: You must set up a Python environment on your computer. One popular option is the Anaconda distribution, with many useful packages pre-installed.
- Python packages: You must install the SciPy package, which contains the function to perform the Wilcoxon signed-rank test. You can install the SciPy package using the following command in your terminal or command prompt:

`pip install scipy`


Alternatively, you can use conda to install SciPy:

`conda install scipy`


Using pip or conda will install the latest version of SciPy and its dependencies into your Python environment. If you are using a specific version of Python, you may need to specify the version of SciPy that is compatible with your Python version. See this blog post: Pip Install Specific Version of a Python Package: 2 Steps.

It is often helpful to use Pandas to read data files and perform exploratory data analysis before conducting statistical analyses such as the Wilcoxon signed-rank test.

Here is how you can install Pandas using pip and conda:

Install Pandas using pip:

`pip install pandas`


Install Pandas using conda:

`conda install pandas`


In addition to SciPy, we also use Seaborn and NumPy in this post. To follow along, you will need to install these packages using the same methods mentioned earlier.

SciPy is a Python library for scientific and technical computing that provides modules for optimization, integration, interpolation, and statistical functions.

The Wilcoxon signed-rank test is one of the statistical functions provided by SciPy’s stats module. The function used to perform the test is called `wilcoxon()`, and it takes two arrays of matched samples as inputs.

The basic syntax of the `wilcoxon()` function is as follows:

```
from scipy.stats import wilcoxon

statistic, p_value = wilcoxon(x, y, zero_method='wilcox',
                              alternative='two-sided')
```


where `x` and `y` are the two arrays of matched samples to be compared, `zero_method` is an optional parameter that specifies how zero differences are handled, and `alternative` is another optional parameter that specifies the alternative hypothesis. The function returns the test statistic and the p-value.
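For example, a one-sided test can be requested through the `alternative` parameter; the arrays below are made-up values just to show the call:

```python
import numpy as np
from scipy.stats import wilcoxon

# Made-up matched samples, purely illustrative
x = np.array([1.2, 2.3, 1.9, 2.8, 3.1, 2.2])
y = np.array([1.0, 2.0, 2.0, 2.5, 2.7, 2.0])

stat, p_two = wilcoxon(x, y, alternative='two-sided')
_, p_greater = wilcoxon(x, y, alternative='greater')  # H1: x - y tends to be positive

print(p_two, p_greater)
```

Since most differences here are positive, the one-sided p-value in that direction is at most the two-sided one.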

There are several Python packages that can be used to perform the Wilcoxon signed-rank test in addition to SciPy. Here are three examples:

- Statsmodels is a Python library for fitting statistical models and performing statistical tests. It includes an implementation of the Wilcoxon signed-rank test as well as other non-parametric tests.
- Pingouin is a statistical package that provides a wide range of statistical functions for Python. It includes an implementation of the Wilcoxon signed-rank test as well as other statistical tests and functions.
- Researchpy is a Python library for conducting basic research analyses. It includes an implementation of the Wilcoxon signed-rank test and other statistical tests commonly used in psychology research.

All three packages are open-source and can be installed using pip or conda. They provide similar functionality to SciPy for performing the Wilcoxon signed-rank test in Python.

Let us assume that we conducted a study to investigate the effect of a mindfulness intervention on working memory performance and anxiety levels in a sample of undergraduate students. The dataset consists of two dependent variables (N1 and N2) measured twice (pre-test and post-test). N1 represents participants’ performance in a working memory task, while N2 represents the level of anxiety experienced during the task. The pre-test and post-test measures were taken one week apart. Here is how to generate the fake data set in Python:

```
import pandas as pd
import numpy as np
from scipy.stats import norm, skewnorm
# Set the random seed for reproducibility
np.random.seed(123)
# Generate normally distributed data (dependent variable 1)
n1_pre = norm.rvs(loc=20, scale=5, size=50)
n1_post = norm.rvs(loc=25, scale=6, size=50)
# Generate skewed data (dependent variable 2)
n2_pre = skewnorm.rvs(a=-5, loc=20, scale=5, size=50)
n2_post = skewnorm.rvs(a=-5, loc=25, scale=6, size=50)
# Create a dictionary to store the data
data = {'N1_pre': n1_pre, 'N1_post': n1_post, 'N2_pre': n2_pre, 'N2_post': n2_post}
# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)
# Print the first few rows of the DataFrame
print(df.head())
```


In the code chunk above, we first import the necessary Python libraries: Pandas, NumPy, and `scipy.stats`.

We then set the random seed to ensure that the data we generate can be reproduced. Next, we generate normally distributed data for the dependent variable N1, both pre- and post-test. We also generate skewed data for the dependent variable N2, both pre- and post-test. We create a Python dictionary to store the generated data, with keys corresponding to the variable names. Finally, we create a Pandas DataFrame from the dictionary to store and manipulate the data.

In real-life research, scientists and data analysts import data from their experiments, studies, or surveys. These datasets are often quite large, and analysts must process, clean, and analyze them to extract meaningful insights.

Python is a popular programming language for data analysis, and it supports a wide range of data formats. This makes importing and working with data from different sources and tools easy. For example, Python can read the most common data files such as CSV, Excel, SPSS, Stata, and more. Here are some tutorials on how to import data in Python:

- How to Read SAS Files in Python with Pandas
- Your Guide to Reading Excel (xlsx) Files in Python
- Pandas Read CSV Tutorial: How to Read and Write
- How to Read & Write SPSS Files in Python using Pandas
- Tutorial: How to Read Stata Files in Python with Pandas

We start by testing the generated data for normality using the Shapiro-Wilk test:

```
from scipy.stats import shapiro
# Check normality of N1 (pre-test)
stat, p = shapiro(df['N1_pre'])
print('N1 pre-test:', 'Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('N1 pre-test data is normally distributed')
else:
print('N1 pre-test data is not normally distributed')
# Check normality of N1 (post-test)
stat, p = shapiro(df['N1_post'])
print('N1 post-test:', 'Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('N1 post-test data is normally distributed')
else:
print('N1 post-test data is not normally distributed')
# Check normality of N2 (pre-test)
stat, p = shapiro(df['N2_pre'])
print('N2 pre-test:', 'Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('N2 pre-test data is normally distributed')
else:
print('N2 pre-test data is not normally distributed')
# Check normality of N2 (post-test)
stat, p = shapiro(df['N2_post'])
print('N2 post-test:', 'Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('N2 post-test data is normally distributed')
else:
print('N2 post-test data is not normally distributed')
```


In the code chunk above, we first import the `shapiro()` function from the `scipy.stats` module. This function calculates the Shapiro-Wilk test statistic and p-value, which we use to test the normality of a dataset.

Next, we call the `shapiro()` function four times, once for each combination of dependent variable and pre/post-test measure. Each time, we pass the relevant column of the dataframe as an argument, using indexing to select it.

The `shapiro()` function returns two values: the test statistic and the p-value. We store these in the variables `stat` and `p`, respectively, using tuple unpacking.

Finally, we print the results of the normality tests using print statements. We check whether the p-value is greater than 0.05, the common significance level used in hypothesis testing. If the p-value is greater than 0.05, we conclude that the data is normally distributed; if it is less than or equal to 0.05, we conclude that the data is not normally distributed.

Overall, this code chunk allows us to quickly and easily test the normality of each variable and pre/post-test measure combination, which is an important step in determining whether the Wilcoxon signed-rank test is an appropriate statistical analysis to use.
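The four near-identical checks above can also be written as a loop over the columns. This is an equivalent sketch, not code from the original post; the dataframe is regenerated here so the snippet is self-contained:

```python
import pandas as pd
import numpy as np
from scipy.stats import norm, skewnorm, shapiro

# Regenerate a dataframe like the one used in this post
np.random.seed(123)
df = pd.DataFrame({'N1_pre': norm.rvs(loc=20, scale=5, size=50),
                   'N1_post': norm.rvs(loc=25, scale=6, size=50),
                   'N2_pre': skewnorm.rvs(a=-5, loc=20, scale=5, size=50),
                   'N2_post': skewnorm.rvs(a=-5, loc=25, scale=6, size=50)})

# Run the Shapiro-Wilk test on every column in one loop
results = {}
for col in df.columns:
    stat, p = shapiro(df[col])
    results[col] = p
    verdict = 'normally distributed' if p > 0.05 else 'not normally distributed'
    print(f'{col}: Statistics={stat:.3f}, p={p:.3f} -> {verdict}')
```

This produces the same kind of report with far less repetition.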

To carry out the Wilcoxon signed-rank test in Python on the N2 variable, we can use the `wilcoxon()` function from the `scipy.stats` module. Here is an example code chunk:

```
from scipy.stats import wilcoxon
# Subset the dataframe to include only the n2 variable and pre/post-test measures
n2_data = df[['N2_pre', 'N2_post']]
# Carry out the Wilcoxon signed-rank test on the n2 variable
stat, p = wilcoxon(n2_data['N2_pre'], n2_data['N2_post'])
# Print the test statistic and p-value
print("Wilcoxon signed-rank test for n2:")
print(f"Statistic: {stat}")
print(f"p-value: {p}")
```


In the code chunk above, we begin by importing the `wilcoxon()` function from the `scipy.stats` module.

Next, we subset the original dataframe to include only the N2 variable’s pre- and post-test measures. This subset is stored in the `n2_data` variable.

We then use the `wilcoxon()` function to carry out the Wilcoxon signed-rank test on the N2 data, passing the `N2_pre` and `N2_post` columns from the `n2_data` subset as inputs.

The `wilcoxon()` function returns the test statistic and the p-value, which we store in the `stat` and `p` variables, respectively.

Finally, we print the test results using print statements, including the test statistic and p-value. Here are the results:

To interpret the results, we start with the p-value. If the p-value is less than our chosen significance level (usually 0.05), we reject the null hypothesis and conclude that there is a significant difference between the two related measures. Here, the results suggest a significant difference between the pre- and post-test scores.

In addition to the p-value, we can look at the direction of the change. Note that the test statistic returned by SciPy’s `wilcoxon()` for the two-sided test is the smaller of the two rank sums and is therefore always non-negative, so it does not indicate direction on its own. Instead, we can compare the scores directly: the direction is positive if the post-test measures tend to be greater than the pre-test measures, and negative if they tend to be smaller.
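One simple way to check the direction is to look at the median of the paired differences; a small sketch with made-up numbers, not the study data:

```python
import numpy as np

# Made-up pre/post scores, only to illustrate the direction check
pre = np.array([18.0, 21.0, 17.5, 20.0, 19.0])
post = np.array([22.0, 24.0, 20.0, 25.0, 23.5])

median_diff = np.median(post - pre)  # > 0 means scores increased from pre to post
direction = 'increase' if median_diff > 0 else 'decrease'
print(median_diff, direction)
```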

To visualize the data, we could create a box plot of the N2 variable for pre- and post-test measures. This would allow us to see the distribution of the data and any potential outliers. We could also add a line connecting the pre- and post-test measures for each participant to visualize each individual’s score change.

We can use the seaborn library to create a box plot of the N2 variable for both the pre- and post-test measures. Here is an example code chunk:

```
import seaborn as sns
# Create a box plot of the N2 variable for the pre/post-test measures
boxp = sns.boxplot(data=n2_data, palette="gray")
# Add a title to the plot
boxp.set_title("Box plot of N2 pre/post-test measures")
# Label the x-axis
boxp.set_xlabel("Test")
# Label the y-axis
boxp.set_ylabel("N2 Score")
# Remove the grid
boxp.grid(False)
# Keep only the lines on the y- and x-axis
sns.despine()
# White background (style settings apply to subsequent figures)
sns.set_style("white")
```


In the code chunk above, we first import the Seaborn data visualization library. We then create a box plot using Seaborn’s `boxplot()` function, passing it the data to be plotted. The `palette` argument specifies the color palette used for the plot. We set the title, x-label, and y-label using the `set_title()`, `set_xlabel()`, and `set_ylabel()` methods of the returned axes object. Next, we remove the grid using the `grid()` method and remove the top and right spines with Seaborn’s `despine()` function. Finally, we set the plot style to “white” with Seaborn’s `set_style()` function; note that style settings affect subsequent figures, so in practice `set_style()` is usually called before plotting. For more data visualization tutorials:

- How to Make a Violin plot in Python using Matplotlib and Seaborn
- Seaborn Line Plots: A Detailed Guide with Examples (Multiple Lines)
- How to Make a Scatter Plot in Python using Seaborn

Here is the boxplot:

A Shapiro-Wilk test was conducted to check for normality in the data. The results indicated that the N1 pre-test data were normally distributed (*W*(30) = 0.985, *p* = 0.774), as were the N1 post-test data (*W*(30) = 0.959, *p* = 0.077). However, the N2 pre-test data were not normally distributed (*W*(30) = 0.944, *p* = 0.019), and neither were the N2 post-test data (*W*(30) = 0.937, *p* = 0.010).

A Wilcoxon signed-rank test was conducted to compare the pre- and post-test scores of N2. The results indicated that there was a significant difference between the pre- and post-test scores of N2 (*W*(31) = 63.0, *p* < 0.001). For N1, which was normally distributed, we would instead report the results of a parametric test (e.g., a paired samples t-test conducted in Python).

If the assumptions of the Wilcoxon signed-rank test are not met, other non-parametric tests, such as the Kruskal-Wallis or Friedman tests, may not be appropriate either, since they target different designs (independent groups and three or more related samples, respectively). In such cases, alternative techniques such as bootstrapping or permutation tests may be needed.

Several methods can be used to analyze non-normal data, including data transformation, bootstrapping, permutation tests, and robust regression. See this blog post for transforming data:

It is important to consider the specific characteristics of the data and the research question when choosing an appropriate technique.
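To illustrate the transformation route, a log transform often brings right-skewed data much closer to normal. This is a standalone sketch with generated data, not part of the original analysis:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=100)  # right-skewed data
transformed = np.log(skewed)                           # log of log-normal data is normal

_, p_raw = shapiro(skewed)
_, p_log = shapiro(transformed)
print(f'raw p={p_raw:.4g}, log-transformed p={p_log:.4g}')
```

After the transform, the Shapiro-Wilk p-value should typically be much larger, reflecting the more normal shape.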

Before we conclude this tutorial, we will take a quick look at two other packages. What are the benefits of using, e.g., Pingouin to perform the Wilcoxon Signed-Rank test in Python?

SciPy and Pingouin provide similar functionalities and syntax for the Wilcoxon signed-rank test. However, Pingouin offers additional statistical tests and features, making it a more comprehensive statistical package.

ResearchPy, on the other hand, provides a simple interface for conducting various statistical tests, including the Wilcoxon signed-rank test. However, it has limited functionality compared to both SciPy and Pingouin.

The advantages of using Pingouin over SciPy and ResearchPy are:

- It offers a wide range of statistical tests beyond the Wilcoxon signed-rank test, making it a more comprehensive statistical package.
- It provides a simple and easy-to-use syntax for conducting various statistical tests, making it more accessible to beginners and non-experts.
- It provides detailed statistical reports and visualizations useful for interpreting and presenting statistical results.

However, SciPy and ResearchPy are still valuable statistical packages, especially if one only needs to conduct basic statistical tests. The choice between these packages ultimately depends on the user’s needs and preferences.

In this blog post, we learned about the Python Wilcoxon Signed-Rank test. It is a non-parametric statistical test that compares two related samples.

We discussed its hypothesis, and applications in psychology, hearing science, and data science. We also covered the requirements for conducting the test in Python.

This included generating fake data, importing data, testing for normality using the Shapiro-Wilk test, and implementing the Wilcoxon Signed-Rank test. We saw how to interpret the results and visualize data using Python.

The Wilcoxon Signed-Rank test is an essential tool for data analysis. It provides valuable insights into the relationship between two related samples, enabling informed decision-making.

We hope this post has helped you understand the Wilcoxon Signed-Rank test better. Please share on social media and comment below with any questions or feedback. Your input helps us improve and create more valuable content for you.

The post Wilcoxon Signed-Rank test in Python appeared first on Erik Marsja.

]]>In this blog post, you will learn how to test for the normality of residuals in R. Testing the normality of residuals is a step in data analysis. It helps determine if the residuals follow a normal distribution, which is important in many fields, including data science and psychology. Approximately normal residuals allow for powerful […]

The post Test for Normality in R: Three Different Methods & Interpretation appeared first on Erik Marsja.

]]>In this blog post, you will learn how to test for the normality of residuals in R. Testing the normality of residuals is a step in data analysis. It helps determine if the residuals follow a normal distribution, which is important in many fields, including data science and psychology. Approximately normal residuals allow for powerful parametric tests, while non-normal residuals can sometimes lead to inaccurate results and false conclusions.

Testing for normality in R can be done using various methods. One of the most commonly used is the Shapiro-Wilk test, which tests the null hypothesis that a sample is drawn from a normal distribution. Another popular option is the Anderson-Darling test, which is more sensitive to deviations from normality in the distribution’s tails. Additionally, the Kolmogorov-Smirnov test can be used; it compares the sample distribution to a normal distribution with the same mean and standard deviation. In addition to these normality tests, it is important to incorporate data visualization techniques to assess normality.

This blog post will provide examples of normality testing in data science and psychology and explain why the normality of residuals matters. We will also cover the three methods for testing the normality of residuals in R: the Shapiro-Wilk, Anderson-Darling, and Kolmogorov-Smirnov tests. We will explore how to interpret the results of each test and provide guidance on reporting the results according to APA 7. Having approximately normal residuals is essential for powerful parametric tests, while non-normal residuals can lead to inaccurate results and false conclusions.

Normality is a fundamental concept in statistics that refers to the distribution of a data set (residuals in our case). A normal distribution, also known as a Gaussian distribution, is a bell-shaped curve that is symmetric around the mean. In other words, the data is evenly distributed around the center of the distribution, with most values close to the mean and fewer values further away.

Normality can be important in some cases because common statistical tests, such as t-tests and ANOVA, assume the residuals are normally distributed. If the residuals are not normally distributed, these tests may not be valid and can lead to incorrect conclusions. Therefore, we can test for normality before using these tests.

Normality is a concept that is relevant to many fields, including data science and psychology. In data science, normality is important for many tasks, such as regression analysis and machine learning algorithms. For example, in linear regression, normality is a key assumption of the model. In this case, violations of normality can lead to biased or inconsistent estimates of the regression coefficients.

In psychology, normality is often used to describe the distribution of scores on psychological tests. For example, intelligence tests are designed to have a normal distribution, with most people scoring around the mean and fewer people scoring at the extremes. Normality is also important in hypothesis testing in psychology. Many statistical tests in psychology, such as t-tests and ANOVA, assume that the residuals are normally distributed. In some cases, violations of normality can lead to incorrect conclusions about the significance of the results.

There are several methods for testing normality, including graphical methods and formal statistical tests. Graphical methods include histograms, box plots, and normal probability plots. We can use these methods to visually inspect the data and assess whether it follows a normal distribution.

Formal statistical tests for normality include the Shapiro-Wilk test, the Anderson-Darling test, and the Kolmogorov-Smirnov test. These tests use different statistics to assess whether the data (e.g., residuals) deviates significantly from a normal distribution. All the tests should be used with caution and not on their own. In many cases, residual plots such as normal Q-Q plots, histograms, and residuals vs. fitted plots are more informative.

The Shapiro-Wilk test is commonly used to check for normality in a dataset. It tests the null hypothesis that a sample comes from a normally distributed population. The test is based on the sample data and computes a test statistic that compares the observed distribution of the sample with the expected normal distribution.

The Shapiro-Wilk test is considered one of the most powerful normality tests, meaning it has a high ability to detect deviations from normality when they exist. However, the test is sensitive to sample size: in large samples, it may flag deviations from normality even when they are small and unlikely to affect the validity of parametric tests.

One can use statistical software such as R or Python to perform the Shapiro-Wilk test. The test returns a p-value that can be compared to a significance level to determine whether the null hypothesis should be rejected. A small p-value indicates that the null hypothesis should be rejected, meaning the sample is not normally distributed.

It is important to note that normality tests, including the Shapiro-Wilk test, should not be used as the sole criterion for determining whether to use parametric or non-parametric tests. Rather, they should be considered alongside other factors such as the sample size, the research question, and the type of data analysis.

The Anderson-Darling test is another widely used normality test that can be used to check if a sample comes from a normally distributed population. The test is more sensitive to deviations from normality in the distribution’s tails, making it useful when it is important to ensure that the sample data is not just close to normal, but also has a similar shape.

Similar to the Shapiro-Wilk test, the Anderson-Darling test is based on the sample data and computes a test statistic that compares the observed distribution of the sample with the expected normal distribution. The test returns a p-value that can be compared to a significance level to determine whether the null hypothesis should be rejected.

The Anderson-Darling test has some advantages over other normality tests, including its ability to detect deviations from normality in smaller sample sizes and deviations in the distribution’s tails. However, it can be less powerful than the Shapiro-Wilk test in certain situations.

To perform the Anderson-Darling test, one can use statistical software such as R or Python. The test is widely available in most statistical software packages, making it easily accessible to researchers and analysts.

It is important to note that normality tests, including the Anderson-Darling test, should *not be* used as the sole criterion for determining whether to use parametric or non-parametric tests. Rather, they should be used with factors such as the sample size, research question, and data analysis type.

The Kolmogorov-Smirnov test is a statistical test used to check if a sample comes from a known distribution. Moreover, the test is non-parametric and can be used to check for normality and other distributions.

The test is based on the maximum difference between the sample’s cumulative distribution function (CDF) and the expected CDF of the tested distribution. The test statistic is called the Kolmogorov-Smirnov statistic and is used to determine if the null hypothesis should be rejected.

To perform the Kolmogorov-Smirnov test, one can use statistical software such as R or Python. The test returns a p-value that can be compared to a significance level to determine whether the null hypothesis should be rejected.

The Kolmogorov-Smirnov test is widely used in many fields, including finance, biology, and engineering. It is particularly useful when the sample size is small, or the underlying distribution is unknown.

However, like other normality tests, the Kolmogorov-Smirnov test should *not be* used as the sole criterion for determining whether to use parametric or non-parametric tests. Instead, it should be used with factors such as the sample size, research question, and data analysis type.

Parametric tests, such as t-tests and ANOVA, assume the model’s residual follows a normal distribution. However, violations of normality assumptions may be a minor concern for ANOVA and other parametric tests (e.g., regression, t-tests). The central limit theorem indicates that the distribution of means of samples taken from a population will approach normality, even if the original population distribution is not normal. This means that the ANOVA results may still be reliable even if the residuals are not normally distributed, especially when the sample sizes are large. Nevertheless, if the sample sizes are small or the deviations from normality are severe, non-parametric tests may be more appropriate.

If normality tests indicate that the residuals do not follow a normal distribution, several alternatives should be considered. These alternatives are non-parametric tests, which do not make any assumptions about the underlying distribution of the data. Non-parametric tests are often used when the assumptions of parametric tests, such as normality, are not met.

Here are some examples of non-parametric tests:

- The Mann-Whitney U test is a non-parametric alternative to the independent samples t-test, used to compare two independent samples. It does not assume that the data follow a normal distribution, but it does assume that the two groups have the same shape.
- The Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test, used to compare two related samples. It also does not assume that the residuals follow a normal distribution.
- The Kruskal-Wallis test is a non-parametric alternative to the one-way ANOVA, used to compare three or more independent groups. It does not assume that the residuals follow a normal distribution, but it does assume that the distributions of the groups have a similar shape.
- The Friedman test is a non-parametric alternative to the repeated-measures ANOVA, used to compare three or more related samples. It also does not assume that the residuals follow a normal distribution.

Non-parametric tests are generally less powerful than their parametric counterparts, meaning they require a larger sample size to achieve the same statistical power. However, they are more robust to violations of assumptions, such as normality, and are therefore often used when the data does not meet the assumptions of parametric tests. It is important to note that non-parametric tests have assumptions of their own, such as independence, and should be chosen based on the research question and the data being analyzed.

To follow this blog post, you need basic statistics and data analysis knowledge. It would be helpful to understand normal distributions and statistical tests, such as t-tests and ANOVA.

You will also need access to the R programming language and RStudio. The code in this blog post is written in R, so you must have R installed on your computer. Additionally, we use several R packages, including "nortest" and "dplyr", so you will need to have these packages installed as well.

To install an R package, you can use the following code in your R console:

`install.packages("package_name")`


In our case, we can run `install.packages(c("dplyr", "nortest"))`.

Here is some example data to use to practice testing for normality in R:

```
library(dplyr)

# For reproducibility
set.seed(20230410)

# Sample sizes
n_nh <- 100
n_hi <- 100

# Generate normally distributed variable (working memory capacity)
wm_capacity_nh <- rnorm(n_nh, mean = 50, sd = 10)
wm_capacity_hi <- rnorm(n_hi, mean = 45, sd = 10)
wm_capacity <- c(wm_capacity_nh, wm_capacity_hi)

# Generate non-normal variable (reaction time)
reaction_time_nh <- rlnorm(n_nh, meanlog = 4, sdlog = 1)
reaction_time_hi <- rlnorm(n_hi, meanlog = 3.9, sdlog = 1)
reaction_time <- c(reaction_time_nh, reaction_time_hi)

# Create categorical variable (hearing status); one label per row
hearing_status <- rep(c("Normal", "Hearing loss"), each = n_nh)

# Combine variables into a data frame
psych_data <- data.frame(Working_Memory_Capacity = wm_capacity,
                         Reaction_Time = reaction_time,
                         Hearing_Status = hearing_status)

# Recode categorical variable as a factor
psych_data <- psych_data %>%
  mutate(Hearing_Status = recode_factor(Hearing_Status,
                                        "Normal" = "1",
                                        "Hearing loss" = "2"))

# Rename variables
psych_data <- psych_data %>%
  rename(WMC = Working_Memory_Capacity,
         RT = Reaction_Time,
         Hearing = Hearing_Status)
```


In the code chunk above, we first load the dplyr library for data manipulation.

Next, we use the `rnorm()` function to generate 100 random numbers from a normal distribution with a mean of 50 and a standard deviation of 10 and store them in the `wm_capacity_nh` variable. We do the same for `wm_capacity_hi` (with a mean of 45). Finally, we create the `wm_capacity` variable by combining these two vectors.

We then use the `rlnorm()` function to generate 100 random numbers from a lognormal distribution with a `meanlog` of 4 and an `sdlog` of 1 and store them in the `reaction_time_nh` variable (and almost the same, with a `meanlog` of 3.9, for `reaction_time_hi`).

To create a categorical variable, we use the `rep()` function to repeat the values "Normal" and "Hearing loss" 100 times each and store them in the `hearing_status` variable.

We then combine the three variables into a data frame called `psych_data` using the `data.frame()` function.

To recode the `Hearing_Status` variable as a factor with levels "1" and "2" corresponding to "Normal" and "Hearing loss", respectively, we use the `mutate()` function from dplyr together with the `recode_factor()` function. Of course, recoding the factor levels in R might not be necessary, and you can skip this step.

Finally, we use the `rename()` function to rename the variables to "WMC" and "RT" for Working Memory Capacity and Reaction Time, respectively, and "Hearing" for the categorical variable.

In this section, we will test for normality in R using three different methods. These methods are the Shapiro-Wilks test, the Anderson-Darling test, and the Kolmogorov-Smirnov test. Each of these tests provides a way to assess whether a sample of data comes from a normal distribution. Using these tests, we can determine if assumptions of normality are met for parametric statistical tests, such as t-tests or ANOVA. In the following sections, we will describe these tests in more detail and demonstrate how to perform them in R.

To perform a Shapiro-Wilks test for normality in R on the `psych_data` data frame we created earlier, we can use the `shapiro.test()` function.

Here is an example that performs a Shapiro-Wilks test on the residuals of an ANOVA with `RT` as the dependent variable and outputs the test results:

```
# First we carry out an ANOVA:
aov.fit.rt <- aov(RT ~ Hearing, data = psych_data)
# Perform Shapiro-Wilks test dependent variable RT
shapiro.test(aov.fit.rt$residuals)
```


In the code chunk above, an ANOVA is performed using the psych_data dataset. The ANOVA model is specified with RT as the dependent variable and Hearing as the independent variable. The results are stored in the object `aov.fit.rt`.

The following line of code uses the Shapiro-Wilk test to check the normality of the residuals, which are extracted from the ANOVA object using `$residuals`.

Here is an example that runs the Shapiro-Wilks test for normality with `WMC` as the dependent variable:

```
aov.fit.wmc <- aov(WMC ~ Hearing, data = psych_data)
# Perform Shapiro-Wilks test on the residuals (dependent variable WMC)
shapiro.test(aov.fit.wmc$residuals)
```

In the code chunk above, we perform an ANOVA using the psych_data dataset. The ANOVA model is specified with WMC as the dependent variable and Hearing as the independent variable, just like in the case of RT above. The results are stored in the object `aov.fit.wmc`.

In the final line of code, we use the Shapiro-Wilk test to test for normality of the residuals from the ANOVA object `aov.fit.wmc`.

Interpreting the results from a Shapiro-Wilks test conducted in R is straightforward. For the model with reaction time as the dependent variable, the p-value is less than 0.05, so we reject the null hypothesis that the residuals are normally distributed.

On the other hand, for the model with working memory capacity as the dependent variable, the p-value is greater than 0.05, so we fail to reject the null hypothesis that the residuals are normally distributed.

We can report the results from the test like this:

The assumption of normality was assessed using the Shapiro-Wilks test on the residuals from the ANOVA. Results indicated that the distribution of residuals deviated significantly from normal (Normal hearing: W = 0.56, p < .05; Hearing loss: W = 0.54, p < .05).

We can do something similar for the normally distributed results. However, in this case, we would write something like this:

According to the Shapiro-Wilk test, the residuals for the model including working memory capacity as the dependent variable were normally distributed (W = 0.99, p = .26).

It is important to remember that relying solely on statistical tests to check normality assumptions is not always sufficient. Other diagnostic checks, such as visual inspection of histograms or Q-Q plots, may be useful in confirming the validity of the ANOVA results. Additionally, violations of normality assumptions may not necessarily pose a problem, particularly when the sample size is large or the deviations from normality are not severe.
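
A quick visual check can complement the formal tests. The sketch below rebuilds a small stand-in for the post's `psych_data` (same generating parameters, so the object and column names match the tutorial) and draws a histogram and a Q-Q plot of the RT residuals; in practice, you would use the residuals from your own fitted model:

```r
# Stand-in for the post's psych_data so the snippet runs on its own
set.seed(20230410)
psych_data <- data.frame(
  RT      = c(rlnorm(100, meanlog = 4, sdlog = 1),
              rlnorm(100, meanlog = 3.9, sdlog = 1)),
  Hearing = factor(rep(c("1", "2"), each = 100))
)

aov.fit.rt <- aov(RT ~ Hearing, data = psych_data)
res <- residuals(aov.fit.rt)

# Histogram and Q-Q plot side by side
par(mfrow = c(1, 2))
hist(res, breaks = 30, main = "RT residuals", xlab = "Residual")
qqnorm(res, main = "Q-Q plot of RT residuals")
qqline(res)  # strong curvature away from this line suggests non-normality
par(mfrow = c(1, 1))
```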

In the following section, we will look at how to carry out another test for normality in R. Namely, the Anderson-Darling test.

To perform the Anderson-Darling test for normality, we can use the `ad.test()` function from the `nortest` package in R. Here is an example of how to conduct the test on the residuals from the ANOVA with reaction time as the dependent variable:

```
library(nortest)
aov.fit.rt <- aov(RT ~ Hearing, data = psych_data)
# A-D Test:
ad.test(aov.fit.rt$residuals)
```


Here is the same test for the working memory capacity model:

```
library(nortest)
aov.fit.wmc <- aov(WMC ~ Hearing, data = psych_data)
# A-D Test:
ad.test(aov.fit.wmc$residuals)
```

In the code chunks above (RT and WMC), we first load the `nortest` library, which contains the `ad.test()` function for performing the Anderson-Darling test. Next, we carry out an ANOVA on the psych_data dataset, with Hearing as the independent variable and RT or WMC as the dependent variable. We then pass the residuals from each ANOVA object to the `ad.test()` function, which tests the null hypothesis that the sample comes from a normally distributed population. This allows us to check the normality assumption for each ANOVA model's residuals.

In the Anderson-Darling test on the residuals from the reaction time model, the null hypothesis is that the residuals are normally distributed. The test result shows a statistic of A = 24.4 and a p-value smaller than 0.05. Since the p-value is smaller than the significance level of 0.05, we reject the null hypothesis and conclude that the residuals from the ANOVA using reaction time as the dependent variable are likely not normally distributed.

The Anderson-Darling test for normality was also conducted on the residuals of the model with working memory capacity as the dependent variable. The results show that the test statistic (A) is 0.29, and the p-value is 0.62. Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis that the residuals are normally distributed.

We can report the results like this:

The Anderson-Darling test for normality was performed on the reaction time model, and the results showed that the residuals were not normally distributed (A = 24.4, p < .001). The same test was performed on the working memory capacity model, and the results showed that the residuals were normally distributed (A = 0.29, p = .62).

As discussed in another section about interpreting the Shapiro-Wilks test, it is important to keep in mind that we should not rely solely on statistical tests for checking normality assumptions. Diagnostic checks such as visual inspection of histograms or Q-Q plots can help confirm the ANOVA results. Furthermore, violations of normality assumptions may not necessarily pose a problem.

In the following section, we will look at another test for normality that we can carry out using R.

To carry out the Kolmogorov-Smirnov test for normality in R, we can use the `ks.test()` function from the stats package. This function tests whether a sample comes from a normal distribution by comparing the sample's empirical cumulative distribution function (CDF) to the CDF of a normal distribution, here one with the mean and standard deviation estimated from the residuals.

Here are the code chunks for performing the Kolmogorov-Smirnov test for normality in R on the same example data. First, the reaction time model:

```
aov.fit.rt <- aov(RT ~ Hearing, data = psych_data)
ks.test(aov.fit.rt$residuals, "pnorm", mean = mean(aov.fit.rt$residuals),
sd = sd(aov.fit.rt$residuals))
```


Here is an example of the working memory capacity data:

```
aov.fit.wmc <- aov(WMC ~ Hearing, data = psych_data)
ks.test(aov.fit.wmc$residuals, "pnorm", mean = mean(aov.fit.wmc$residuals),
sd = sd(aov.fit.wmc$residuals))
```


In the two code chunks above, we fit an ANOVA using the `aov()` function in R, with `RT` and `WMC` as dependent variables and `Hearing` as the independent variable. Next, we use the `ks.test()` function to perform a Kolmogorov-Smirnov test on the residuals of each ANOVA. The `"pnorm"` argument specifies that the test should be performed against a normal distribution, and the `mean` and `sd` arguments specify the mean and standard deviation of that normal distribution (here estimated from the residuals). The KS test compares the sample distribution to the normal distribution and returns a test statistic and a p-value.

In the first code chunk, the Kolmogorov-Smirnov test for normality was conducted on the residuals from the RT model. The test statistic, D, was 0.24, and the p-value was smaller than 0.05. This indicates that the distribution of the residuals is significantly different from a normal distribution.

In the second code chunk, the test was conducted on the residuals from the WMC model. The test statistic, D, was 0.04, and the p-value was 0.89. This indicates that the distribution of the residuals is not significantly different from a normal distribution.

We can report the results from the Kolmogorov-Smirnov test for normality like this:

For the reaction time model, a one-sample Kolmogorov-Smirnov test showed that the distribution of residuals was significantly different from a normal distribution (D = 0.24, p < .001). For the model including working memory capacity as the dependent variable, the distribution of residuals did not deviate significantly from normality (D = 0.04, p = .89).

Again, violations of normality assumptions may not always be problematic, particularly with large sample sizes or mild deviations from normality. See the previous sections on interpreting the Shapiro-Wilks and Anderson-Darling tests for more information.

If the residuals are non-normal, we can explore the reasons for the non-normality, such as outliers or skewed distributions, and try to address them if possible.

For ANOVA, violations of normality assumptions are less of a concern when the sample sizes are large (typically over 30) because the central limit theorem indicates that the distribution of means of samples taken from a population will approach normality regardless of the shape of the original population distribution.

In such cases, the ANOVA results may still be reliable even if the residuals are not normally distributed. However, if the sample sizes are small or the deviations from normality are severe, the ANOVA results may not be trustworthy, and non-parametric tests or data transformation techniques may be more appropriate.

One approach to handling non-normal data is to transform it to achieve normality. Z-score transformation is one option to standardize the data and can be helpful in some non-normal distributions. However, it is not always appropriate, and other types of transformations, such as log or square root transformations, may be more suitable depending on the data and research question. It is important to remember that the interpretation of the transformed data may be less intuitive, and the results should be carefully evaluated in the context of the research question. Additionally, if the non-normality is severe or cannot be adequately addressed by transformation, it may be necessary to use non-parametric statistical tests instead of traditional parametric tests.
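
As an illustrative sketch (with invented lognormal data, much like the post's simulated reaction times), here is how a log transformation can bring a right-skewed outcome back toward normality; a square-root transform is applied the same way:

```r
# Invented, right-skewed "reaction times" (lognormal, so log() normalizes them)
set.seed(99)
rt_raw <- rlnorm(200, meanlog = 4, sdlog = 1)

rt_log  <- log(rt_raw)    # log transform for strong right skew
rt_sqrt <- sqrt(rt_raw)   # milder alternative for moderate right skew

shapiro.test(rt_raw)$p.value   # tiny on the raw scale: clearly non-normal
shapiro.test(rt_log)$p.value   # much larger after the log transform
```

Remember that any downstream results are then on the transformed scale, which is why the interpretation can be less intuitive.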

Several non-parametric tests, such as the Wilcoxon rank-sum, Kruskal-Wallis, and Mann-Whitney U tests, can be used when data is not normally distributed. These tests do not assume normality and can be more appropriate when dealing with non-normal data.

Testing for normality in R is just one tool for assessing normality. It is also important to examine the residual distribution visually using histograms, density plots, and Q-Q plots. Additionally, it is important to consider the context of the data and the research question being asked. Sometimes, even if the data is not perfectly normal, it may still be appropriate to use parametric tests, for instance, when other assumptions are met or when the deviations from normality are minor. Therefore, it is important to use a combination of statistical tests and visual inspection to make decisions about the normality of the data.

This blog post was about the importance of testing for normality in data science and psychology. We covered various methods to test for normality in R, including the Shapiro-Wilks, Anderson-Darling, and Kolmogorov-Smirnov tests. We also discussed interpreting and reporting these tests according to APA 7 guidelines.

Testing for normality is important in many statistical analyses, as parametric tests assume normality of the residuals. Violations of normality can lead to inaccurate results and conclusions. Therefore, it is essential to use appropriate methods to test for normality and, if necessary, apply appropriate transformations or non-parametric tests.

In addition to the technical aspects, we provided real-world examples of normal and non-normal data in both data science and psychology contexts. We also discussed additional approaches for dealing with non-normal data (i.e., when the residuals deviate from normality).

By mastering the methods for testing for normality in R, you will be better equipped to conduct rigorous statistical analyses that produce accurate results and conclusions. We hope this blog post was helpful in your understanding of this crucial topic. Please share this post on social media and cite it in your work.

Here are other blog posts you may find helpful:

- How to use %in% in R: 8 Example Uses of the Operator
- Sum Across Columns in R – dplyr & base
- How to Calculate Five-Number Summary Statistics in R
- Durbin Watson Test in R: Step-by-Step incl. Interpretation
- R Count the Number of Occurrences in a Column using dplyr
- How to Add a Column to a Dataframe in R with tibble & dplyr
- Mastering SST & SSE in R: A Complete Guide for Analysts
- How to Transpose a Dataframe or Matrix in R with the t() Function

The post Test for Normality in R: Three Different Methods & Interpretation appeared first on Erik Marsja.

The post Durbin Watson Test in R: Step-by-Step incl. Interpretation appeared first on Erik Marsja.

]]>This blog will teach you how to carry out the Durbin-Watson Test in R. Have you ever run a linear regression model in R and wondered if the model’s assumptions hold? One common assumption of a linear regression model is the independence of observations, which means that the residuals (the differences between predicted and actual values) should not be correlated. However, this assumption may be violated in some cases, leading to biased estimates and incorrect conclusions.

To check for autocorrelation, we can use the Durbin-Watson test, which is a statistical test to determine if there is evidence of autocorrelation in the residuals of a linear regression model.

We will begin by discussing the hypotheses of the Durbin-Watson test and the requirements for carrying out the test in R. We will provide examples of how the Durbin-Watson test can be applied in different fields, such as data science, psychology, and hearing science.

Next, we will introduce you to the syntax of the `dwtest()` and `durbinWatsonTest()` functions, two functions that can be used to carry out the Durbin-Watson test in R. We will also provide you with an example data set to help you understand how to use these functions.

We will then guide you through the steps to carry out the Durbin-Watson test in R, including how to fit a linear regression model, install and load the necessary packages, run the test, and interpret the results. Moreover, we will also discuss what to do if autocorrelation is detected and alternative methods for testing for autocorrelation.

By the end of this blog post, you will have a good understanding of how to check for autocorrelation in a linear regression model using the Durbin-Watson test in R. Whether you are a data scientist, psychologist, or hearing scientist, this post will equip you with the knowledge and skills to ensure that your linear regression models are sound and valid.

We can use the statistical test Durbin-Watson test to detect the presence of autocorrelation in regression models. Autocorrelation refers to the presence of correlation between the error terms of a regression model, which can occur when the data points are not independent of each other. The Durbin-Watson test can be used to test for autocorrelation in various fields, including data science, cognitive psychology, and hearing science.

The null hypothesis for the Durbin-Watson test is that no autocorrelation exists in the model’s residuals. The alternative hypothesis is that the residuals have positive or negative autocorrelation.

The test statistic for the Durbin-Watson test ranges from 0 to 4, with a value of 2 indicating no autocorrelation. A value less than 2 indicates positive autocorrelation, while a value greater than 2 indicates negative autocorrelation. A value of 0 indicates perfect positive autocorrelation, and 4 indicates perfect negative autocorrelation.
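
The statistic itself can be computed by hand as the sum of squared successive differences of the residuals divided by their sum of squares. A minimal sketch (the function name and simulated residuals below are ours, for illustration only):

```r
# DW statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(7)
# Independent residuals give a value near 2
e_indep <- rnorm(1000)
dw_stat(e_indep)

# Positively autocorrelated residuals (AR(1), rho = 0.8) fall well below 2
e_ar <- as.numeric(arima.sim(model = list(ar = 0.8), n = 1000))
dw_stat(e_ar)
```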

In data science, the Durbin-Watson test is used to evaluate the presence of autocorrelation in time series data. Time series data are points collected over time, such as stock prices or weather patterns. Autocorrelation in time series data occurs when the error terms of a regression model are correlated with previous error terms, indicating that the data points are not independent of each other. This can lead to biased estimates and incorrect statistical inferences. Data scientists can use the Durbin-Watson test to identify and correct autocorrelation in their time series models.

For example, a data scientist may analyze stock prices to predict future market trends. If there is autocorrelation in the data, the model may incorrectly predict future prices, leading to significant financial losses. Using the Durbin-Watson test, the data scientist can detect and correct for any autocorrelation, improving the accuracy of their model.

In cognitive psychology, the Durbin-Watson test is used to analyze data from experiments that involve repeated measurements of the same participants. Autocorrelation in this context can occur when the error terms of a regression model are correlated with previous error terms, indicating that the participant’s responses are not independent of each other. This can lead to biased estimates of the effects of independent variables and incorrect statistical inferences. Cognitive psychologists can use the Durbin-Watson test to identify and correct autocorrelation in their models.

For example, a cognitive psychologist may conduct an experiment to test the effects of sleep on memory. Participants are tested on their memory recall immediately after learning a list of words and again after a night of sleep. Autocorrelation in this context may occur if the error terms of the regression model are correlated with previous error terms, indicating that the participants’ memory recall is not independent of each other. Using the Durbin-Watson test, cognitive psychologists can detect and correct for any autocorrelation, improving the accuracy of their results.

In hearing science, the Durbin-Watson test can analyze data from experiments involving repeated auditory stimuli measurements. Autocorrelation in this context can occur when the error terms of a regression model are correlated with previous error terms, indicating that the participant’s responses to auditory stimuli are not independent of each other. This can lead to biased estimates of the effects of auditory stimuli and incorrect statistical inferences. Hearing scientists can identify and correct autocorrelation in their models using the Durbin-Watson test.

For example, a hearing scientist may be conducting an experiment to test the effects of background noise on speech perception. Participants are tested on their ability to identify spoken words in background noise. Autocorrelation in this context may occur if the error terms of the regression model are correlated with previous error terms, indicating that the participant’s responses to auditory stimuli are not independent of each other. Using the Durbin-Watson test, the hearing scientist can detect and correct for any autocorrelation, improving the accuracy of their results.

You can carry out a Durbin-Watson test in R using the `lmtest` package or the `car` package. Both packages provide functions for running the Durbin-Watson test on a linear regression model in R.

A few requirements need to be met to carry out the Durbin-Watson test in R. First, we need a model (e.g., a linear regression model) we want to test for autocorrelation. We could, for example, fit the model using the `lm()` function in R. The model should have at least one independent variable and one dependent variable. Once we have a fitted model, we can use the `dwtest()` function from the `lmtest` package to carry out the Durbin-Watson test.

The `dwtest()` function takes a fitted linear regression model as input and returns a p-value that indicates the test's significance. If the p-value is less than the significance level (typically 0.05), we can reject the null hypothesis and conclude that there is evidence of autocorrelation in the residuals.

In addition to the `lmtest` package, we can also use the `durbinWatsonTest()` function from the `car` package to carry out the Durbin-Watson test. The `durbinWatsonTest()` function also takes a fitted linear regression model as input and returns a test statistic and a p-value.

The `dwtest()` function from the `lmtest` package takes the following arguments, briefly explained:

- `formula`: The formula for the regression model you want to test for autocorrelation, in the form `response ~ predictor1 + predictor2 + ...`. It can also be a fitted `lm()` object.
- `order.by`: An optional argument that lets you specify a variable to order the data by before running the test. This is useful if you suspect that some other variable is causing the autocorrelation in the data.
- `alternative`: The alternative hypothesis for the test. It can take one of three values: "greater" (the default, testing for positive autocorrelation), "two.sided", or "less" (testing for negative autocorrelation). If you have no expectation about the direction, use "two.sided".
- `iterations`: The number of iterations used to compute the p-value. The default value is 15, which is generally sufficient for most purposes. However, you may need to increase this value if you have a large dataset or the autocorrelation is very strong.
- `exact`: An optional argument that lets you request an exact p-value instead of the normal approximation. This is generally unnecessary but useful if you have a small dataset.
- `tol`: The tolerance level used in the test. The default value is 1e-10, which is generally sufficient for most purposes. However, if you have a very large dataset or the autocorrelation is very weak, you may need to decrease this value.
- `data`: The dataframe containing the variables used in the regression model.

In the next section, we will have a look at the `durbinWatsonTest()` function from the `car` package.

The `durbinWatsonTest()` function from the `car` package takes the following arguments, briefly explained:

- `model`: The fitted linear regression model object.
- `max.lag`: The maximum lag to test for autocorrelation. By default, the function tests for first-order autocorrelation.
- `simulate`: A logical value indicating whether to use a simulation-based approach to calculate the p-value. If `simulate=TRUE`, the function performs a Monte Carlo simulation to estimate the p-value. If `simulate=FALSE`, the function uses an approximation to the distribution of the test statistic.
- `reps`: The number of Monte Carlo replications to use if `simulate=TRUE`.
- `method`: The method to use for calculating the p-value if `simulate=TRUE`. The options are `"resample"` (default) and `"normal"`. The `"resample"` method is a permutation-based approach that resamples the model's residuals to simulate the null distribution. The `"normal"` method assumes that the residuals are normally distributed and uses this assumption to calculate the p-value.
- `alternative`: The alternative hypothesis to test. The options are `"two.sided"` (default), `"positive"`, and `"negative"`.
- `...`: Additional arguments to be passed to the `summary()` function.

In the next section, we will generate example data to practice running the Durbin-Watson test.

Here is some example data to practice the Durbin-Watson test in R:

```
library(tidyverse)

# Define the number of participants, conditions, and trials per condition
n_participants <- 50
n_conditions <- 2
n_trials <- 10

# Define the autocorrelation coefficient (rho)
rho <- 0.8

# Generate the reaction time data
rt_data <- tibble(
  participant = rep(1:n_participants, each = n_conditions * n_trials),
  condition = rep(rep(1:n_conditions, each = n_trials), times = n_participants),
  rt = unlist(
    lapply(1:n_participants, function(p) {
      lapply(1:n_conditions, function(c) {
        x <- seq(from = 500, to = 700, length.out = n_trials)
        for (i in 2:n_trials) {
          x[i] <- rho * x[i - 1] + abs(rnorm(1, mean = 0, sd = 50))
        }
        # x[x < 200] <- 200
        return(x)
      })
    })
  )
)

rt_data <- rt_data %>%
  mutate(condition = if_else(condition == 1, "low", "high"),
         trial = rep(seq(1, 20), n_participants))
```


In the code chunk above, we generated example data using the tidyverse package in R. The data consists of reaction times from 50 participants who performed two conditions (high vs. low load) with ten trials per condition. Next, we set the autocorrelation coefficient (rho) to 0.8 to ensure a correlation between consecutive reaction times within each condition. Using the seq() function, we initialized each sequence of ten reaction times ranging from 500 to 700. Then, the autocorrelation coefficient was applied in a for loop to create a correlated sequence, with the abs() function taking the absolute value of the noise term so the reaction times stay positive. A line that would floor any values below 200 ms at 200 is included in the chunk but commented out.

Now, we are ready to go to the next section and carry out the Durbin-Watson test in R.

To carry out the Durbin-Watson test in R, you can follow these steps:

- Fit a linear regression model using the `lm()` function in R.
- Install and load the `lmtest` package or the `car` package, which both contain a Durbin-Watson test function.
- Use the `dwtest()` function from the `lmtest` package or the `durbinWatsonTest()` function from the `car` package to perform the Durbin-Watson test.
- Interpret the results of the Durbin-Watson test by examining the test statistic and the associated p-value.

Here is how to fit a linear regression model in R using the `lm()` function:

```
# Fit a linear regression model
rt_model <- lm(rt ~ trial, data = rt_data)
```


In the code chunk above, we fitted a linear regression model using the `lm()` function in R. We used the model formula `rt ~ trial` to specify that we want to model the response variable `rt` as a function of the predictor variable `trial`. Here, `rt` is the variable containing the reaction time data, and `trial` is the variable containing the trial numbers.

Finally, we used the `data` argument to specify the dataframe containing the variables used in the model. Here, the data frame is `rt_data`, which we previously generated.

Here is how we can install and load the lmtest package in R:

```
install.packages('lmtest')
library('lmtest')
```


Alternatively, we can use the `durbinWatsonTest()` function from the `car` package; if you go that route, install and load `car` in the same way. Note that if we stick with `dwtest()` from `lmtest`, the `car` package is not needed. In the following section, we are ready to run the Durbin-Watson test in R.

Here is an example code chunk that demonstrates how to carry out the Durbin-Watson test in R:

```
# Perform the Durbin-Watson test
dwtest(rt_model)
```


Alternatively, we can use the `durbinWatsonTest()` function from the `car` package to carry out the Durbin-Watson test:

```
# Carry out the Durbin-Watson test
durbinWatsonTest(rt_model)
```


In the following section, we are going to interpret the results. Note that this is just example data; we would most likely not analyze a real dataset in exactly this way.

In the Durbin-Watson test output above, we performed a test for first-order autocorrelation in the residuals of the linear regression model `rt_model` that was fit to `rt_data`. Remember, the null hypothesis for the test is that there is no first-order autocorrelation in the residuals, i.e., the errors are independent. The alternative hypothesis is that there is positive autocorrelation in the residuals.

The Durbin-Watson test statistic (`DW`) is a value between 0 and 4 that measures the degree of autocorrelation in the residuals. The value of `DW` is interpreted as follows:

- If `DW = 2`, there is no autocorrelation in the residuals.
- If `DW < 2`, there is positive autocorrelation in the residuals.
- If `DW > 2`, there is negative autocorrelation in the residuals.

In our case, the value of `DW` is 0.89, which is less than 2. This suggests positive autocorrelation in the model’s residuals, which supports the alternative hypothesis.
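To build intuition, the `DW` statistic can be computed by hand from the residuals; here is a sketch with simulated data (not the post's `rt_model`):

```
set.seed(123)
# Simulated regression data with positively autocorrelated (AR(1)) errors
x <- rnorm(500)
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = 500))
y <- 2 + 3 * x + e
fit <- lm(y ~ x)
res <- residuals(fit)
# DW = sum of squared successive differences of the residuals divided by
# the residual sum of squares; approximately 2 * (1 - r1), where r1 is
# the lag-1 autocorrelation of the residuals
dw <- sum(diff(res)^2) / sum(res^2)
dw
```

With an AR(1) coefficient of 0.7, `dw` lands well below 2, matching the positive-autocorrelation interpretation above.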

The p-value for the test is less than 2.2e-16, which is smaller than the conventional level of significance (e.g., 0.05). This indicates strong evidence against the null hypothesis of no autocorrelation in the residuals. Therefore, we can reject the null hypothesis in favor of the alternative hypothesis and conclude that there is positive autocorrelation in the residuals.

We can see that the results from the Durbin-Watson test output suggest positive autocorrelation in the residuals of the linear regression model `rt_model`. This implies that the errors are not independent and may violate the assumption of independent errors in the linear regression model. If this were real data, we would have to be cautious, as autocorrelation can lead to biased parameter estimates and incorrect inference. Therefore, it is important to account for autocorrelation in the residuals when analyzing the data or to use a model that allows for autocorrelated errors.

Suppose the results from the Durbin-Watson test are significant, indicating the presence of autocorrelation in the residuals of a linear regression model. In that case, several steps can be taken:

- Model modification: Autocorrelation in the residuals can indicate some important information that the current model is not capturing. One possible solution is to modify the model to include additional predictors or interactions between predictors that might account for autocorrelation.
- Use a different regression method: If autocorrelation in the residuals persists even after model modification, it might be necessary to use a different regression method that can handle autocorrelation, such as generalized least squares (GLS) or autoregressive integrated moving average (ARIMA) models.
- Use a different type of data: If autocorrelation cannot be eliminated using the above methods, it might be necessary to use another type of data collection to reduce the potential for autocorrelation. For example, repeated measures designs or time-series data collection can help to reduce autocorrelation.
- Report the results: Regardless of whether or not autocorrelation can be eliminated, it is important to report the results of the Durbin-Watson test and any subsequent modifications to the model in any publications or reports. This allows readers to interpret the results and understand any potential analysis limitations.

Finally, here are some alternatives to the Durbin-Watson test:

- Breusch-Godfrey Test: This test is an extension of the Durbin-Watson test and can be used to test for higher-order autocorrelation. The null hypothesis is no autocorrelation, and the alternative hypothesis is the presence of autocorrelation.
- Cochrane-Orcutt Procedure: This method is used when the errors have first-order autocorrelation. It involves transforming the variables and fitting a new model to the transformed data. The Durbin-Watson test can be used to test for autocorrelation in the transformed errors.
- Generalized Method of Moments (GMM): This method estimates the model’s parameters using instrumental variables and can be used to correct for autocorrelation in the errors.
- Newey-West Estimator: This is a robust method for estimating the standard errors of the coefficients in the presence of autocorrelation. It involves adjusting the standard errors using a correction factor based on the lagged values of the residuals.
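As an illustration of the second alternative, the Cochrane-Orcutt procedure can be sketched in base R with simulated data (in practice, dedicated packages automate this):

```
set.seed(1)
# Simulated data with AR(1) errors
n <- 300
x <- rnorm(n)
e <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
y <- 1 + 2 * x + e
fit <- lm(y ~ x)
res <- residuals(fit)
# Step 1: estimate the lag-1 autocorrelation of the residuals
rho <- sum(res[-1] * res[-n]) / sum(res^2)
# Step 2: quasi-difference the response and the predictor
y_star <- y[-1] - rho * y[-n]
x_star <- x[-1] - rho * x[-n]
# Step 3: refit ordinary least squares on the transformed variables
fit_co <- lm(y_star ~ x_star)
coef(fit_co)
```

The slope estimated on the quasi-differenced data recovers the true coefficient while its standard error is no longer distorted by the autocorrelated errors.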

Of course, you need to check other linear (regression) model assumptions. For example, you can look at the possible outliers by making a residual plot in R and testing for normality in R.

In this blog post, you have learned about the Durbin-Watson test, a statistical method used to examine autocorrelation in regression models. You have seen how the test is based on two hypotheses, which can help you determine whether there is evidence of positive or negative autocorrelation in your data.

We examined three examples of how the Durbin-Watson test can be used in different fields, including Data Science, Psychology, and Hearing Science. We also discussed the requirements for carrying out the test in R and provided step-by-step instructions on using both the `lmtest` and `car` packages to perform the test.

Moreover, we have explained the syntax of the `dwtest()` and `durbinWatsonTest()` functions and demonstrated how to interpret the results. Additionally, we have shown how to correct for autocorrelation in your data and discussed alternative methods for testing for autocorrelation.

We hope that this blog post has been informative and helpful. If you enjoyed reading it or found it useful, please consider sharing it on social media or citing it in your work. If you have any comments or questions, please leave them below. I appreciate any feedback and would be happy to hear from you.

- How to Convert a List to a Dataframe in R – dplyr
- Sum Across Columns in R – dplyr & base
- How to Rename Column (or Columns) in R with dplyr
- Select Columns in R by Name, Index, Letters, & Certain Words with dplyr
- How to Rename Factor Levels in R using levels() and dplyr

The post Durbin Watson Test in R: Step-by-Step incl. Interpretation appeared first on Erik Marsja.

]]>In this blog post, we will learn how to sum across columns in R. Summing can be a useful data analysis technique in various fields, including data science, psychology, and hearing science. We will explore several examples of how to sum across columns in R, including summing across a matrix, summing across multiple columns in […]

The post Sum Across Columns in R – dplyr & base appeared first on Erik Marsja.

]]>In this blog post, we will learn how to sum across columns in R. Summing can be a useful data analysis technique in various fields, including data science, psychology, and hearing science. We will explore several examples of how to sum across columns in R, including summing across a matrix, summing across multiple columns in a dataframe, and summing across all columns or specific columns in a dataframe using the tidyverse packages. Whether you are new to R or an experienced user, these examples will help you better understand how to summarize and analyze your data in R.

To follow this blog post, readers should have a basic understanding of R and dataframes. Familiarity with the tidyverse packages, including dplyr, will also be helpful for some of the examples. However, we will provide explanations and code examples to guide readers through each step of the process. No prior knowledge of summing across columns in R is required.

Summing across columns in data analysis is common in various fields like data science, psychology, and hearing science. It involves calculating the sum of values across two or more columns in a dataset. This section will discuss examples of when we might want to sum across columns in data analysis for each field.

Summing across columns is a common calculation technique for financial metrics in financial analysis. For example, we might want to calculate a company’s total revenue over time. In this case, we would sum the revenue generated in each period. Another example is calculating the total expenses incurred by a company. In this case, we would sum the expenses incurred in each period.

In survey analysis, we might want to calculate the total score of a respondent on a questionnaire. The questionnaire might have multiple questions, and each question might be assigned a score. In this case, we would sum the scores assigned to each question to calculate the respondent’s total score.

In psychometric testing, we might want to calculate a total score for a test that measures a particular psychological construct. For example, the Big Five personality traits test measures five traits: extraversion, agreeableness, conscientiousness, neuroticism, and openness. Each trait might have multiple questions, and each question might be assigned a score. In this case, we would sum the scores assigned to each question for each trait to calculate the total score for each trait. Here is an example table in which the columns E1 and E2 are summed into a new column, Extraversion (and so on):

In behavioral analysis, we might want to calculate the total number of times a particular behavior occurs. For example, we might want to calculate the total number of times a child engages in aggressive behavior in a classroom setting. We might record each instance of aggressive behavior, and then sum the instances to calculate the total number of aggressive behaviors.

In audiological testing, we might want to calculate the total score for a hearing test. The test might involve multiple frequencies, and each frequency might be assigned a score based on the individual’s ability to hear that frequency. In this case, we would sum the scores assigned to each frequency to calculate the total score for the hearing test.

In speech analysis, we might want to calculate the number of phonemes an individual produces. Phonemes are the basic sound units in a language, and different languages have different sets of phonemes. In this case, we would transcribe the individual’s speech and then count the number of phonemes produced to calculate the total number of phonemes.

To sum across columns using base R, you can use the `apply()` function with `MARGIN = 1`:

```
# Create a sample matrix
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
# View the matrix
mat
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
# Sum the values across rows
row_sums <- apply(mat, 1, sum)
# View the row sums
row_sums
```

In the code chunk above, we first create a 2 x 3 matrix in R using the `matrix()` function. We then use the `apply()` function with `MARGIN = 1` to apply `sum()` to each row. The resulting `row_sums` vector shows the sum of the values in each row of the matrix.

You can use the `cbind()` function to bind the vector to the matrix, adding a new column with the row sums using base R. Here is how we add it to our matrix:

```
# Add a new column to the matrix with the row sums
mat_with_row_sums <- cbind(mat, row_sums)
# Print the matrix with the row sums
mat_with_row_sums
```

In the code chunk above, we used the `cbind()` function to combine the original matrix `mat` with the `row_sums` vector. The resulting matrix, `mat_with_row_sums`, contains all the columns of `mat` plus an additional column, `row_sums`, holding the sum of each row.

More about adding columns in R:

- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Add an Empty Column to a Dataframe in R (with tibble)

To sum across multiple columns in a dataframe in R, we can use the `rowSums()` function. Here is an example:

```
# Create a list of variables
data_list <- list(var1 = c(1, 2, 3), var2 = c(4, 5, 6), var3 = c(7, 8, 9))
# Convert the list to a dataframe
df <- data.frame(data_list)
# Sum the values across columns for each row
row_sums <- rowSums(df)
# Add a new column to the dataframe with the row sums
df$Row_Sums <- row_sums
```

In the code chunk above, we first created a list called `data_list` with three variables: `var1`, `var2`, and `var3`.

We then used the `data.frame()` function to convert the list to a dataframe in R called `df`.

Next, we used the `rowSums()` function to sum the values across columns for each row of the dataframe, which returns a vector of row sums.

We then added a new column called `Row_Sums` to the original dataframe `df` using the `$` operator together with the `<-` assignment operator.

Finally, we can view the modified dataframe `df` with the added column using the `print()` function.

We can use the `%in%` operator in R to identify the columns that we want to sum over:

```
df <- data.frame(x1 = 1:3, x2 = 4:6, x3 = 7:9, y1 = 10:12, y2 = 13:15, y3 = 16:18)
cols_to_sum <- names(df) %in% c("y1", "y2", "y3")
row_sums <- rowSums(df[, cols_to_sum])
```

In the code chunk above, we first use the `names()` function to get the names of all the columns in the data frame `df`. We then use the `%in%` operator to create a logical vector, `cols_to_sum`, that is `TRUE` for the columns we want to sum (`y1`, `y2`, and `y3`) and `FALSE` for all other columns.

Finally, we use the `rowSums()` function to sum the values in the columns specified by `cols_to_sum`; the resulting vector `row_sums` contains, for each row, the sum of the values in the columns `y1`, `y2`, and `y3` of `df`.

Using `%in%` can be a convenient way to identify columns that meet specific criteria, especially when you have a large data frame with many columns.
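A closely related base-R pattern, shown here as a sketch with made-up column names, selects columns by name prefix with `startsWith()` instead of spelling them out for `%in%`:

```
df <- data.frame(x1 = 1:3, x2 = 4:6, y1 = 7:9, y2 = 10:12)
# Logical vector: TRUE for every column whose name starts with "y"
y_cols <- startsWith(names(df), "y")
row_sums_y <- rowSums(df[, y_cols])
row_sums_y
```

This scales better than listing names when the columns follow a consistent naming scheme.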

We can use the `dplyr` package from the tidyverse to sum across all columns in R. Here is an example:

```
library(dplyr)
# Create a dataframe
df <- data.frame(
var1 = c(1, 2, 3),
var2 = c(4, 5, 6),
var3 = c(7, 8, 9)
)
# Sum the values across all columns for each row
df <- df %>%
mutate(Row_Sums = rowSums(across(everything())))
```

In the code chunk above, we first use the `%>%` operator to pipe the dataframe `df` into the `mutate()` function, which creates a new column called `Row_Sums`.

We used the `across()` function to select all columns in the dataframe (i.e., `everything()`) and passed them to `rowSums()`.

Finally, the resulting row sums are added to the dataframe `df` as the new column `Row_Sums`.
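An alternative dplyr idiom, shown as a sketch, uses `rowwise()` together with `c_across()`; it is more flexible than `rowSums(across(...))` because any summary function works, though it is slower on large data:

```
library(dplyr)

df <- data.frame(var1 = c(1, 2, 3), var2 = c(4, 5, 6), var3 = c(7, 8, 9))
# rowwise() makes each row its own group; c_across() collects the row's
# values so that any summary function can be applied to them
df <- df %>%
  rowwise() %>%
  mutate(Row_Sums = sum(c_across(everything()))) %>%
  ungroup()
df$Row_Sums
```

Swapping `sum()` for, e.g., `mean()` or `max()` gives other per-row summaries with no other changes.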

Here is an example of how to sum across all numeric columns in a dataframe in R:

```
library(dplyr)
# Create a dataframe
df <- data.frame(
var1 = c(1, 2, 3),
var2 = c("a", "b", "c"),
var3 = c(4, 5, 6),
var4 = c("d", "e", "f"),
var5 = c(7, 8, 9)
)
# Sum the values across all numeric columns for each row using across()
df <- df %>%
mutate(rowSums = rowSums(across(where(is.numeric))))
```

First, we take the dataframe `df` and pass it to the `mutate()` function from `dplyr`.

Within `mutate()`, we use the `across()` function with `where(is.numeric)` to select only the numeric columns.

Then, we apply the `rowSums()` function to the selected columns, which calculates the sum of each row across those columns. Finally, we create a new column in the dataframe called `rowSums` that holds these sums.

The resulting dataframe `df` will have the original columns as well as the newly added column `rowSums`.

To sum across specific columns in R, we can use `dplyr` and `mutate()`:

```
library(dplyr)
# Create a sample dataframe
df <- data.frame(id = 1:5,
                 a = c(3, 4, 5, 6, 7),
                 b = c(2, 2, 2, 2, 2),
                 c = c(1, 2, 3, 4, 5))
# Sum columns 'a' and 'b' row-wise and create a new column 'ab_sum'
df <- df %>%
  mutate(ab_sum = a + b)
```

In the code chunk above, we created a new column called `ab_sum` using the `mutate()` function and the `+` operator to add the values of columns `a` and `b` row by row. Note that using `sum(a, b)` inside `mutate()` would instead return a single grand total of both columns, recycled down the entire column, so the `+` operator (or `rowSums()`) is the right choice for a row-wise sum.

The resulting dataframe `df` will have the original columns as well as the newly added column `ab_sum`, containing the row-wise sums of `a` and `b`.

We can use the `select()` function from the `dplyr` package to select the columns we want to sum across and then use the `rowSums()` function to sum across those columns. Here is an example:

```
library(dplyr)
# Create a sample data frame
df <- data.frame(
id = 1:5,
x1 = c(1, 2, 3, 4, 5),
x2 = c(2, 4, 6, 8, 10),
y1 = c(3, 6, 9, 12, 15),
y2 = c(4, 8, 12, 16, 20)
)
# Select columns x1 and x2 using select() and sum across rows using rowSums()
df <- df %>%
mutate(row_sum = rowSums(select(., c(x1, x2))))
# View the resulting data frame
df
```

In the code chunk above, we first load the `dplyr` package and create a sample data frame with columns `id`, `x1`, `x2`, `y1`, and `y2`. We then use the `mutate()` function from `dplyr` to create a new column called `row_sum`, where we sum across the columns `x1` and `x2` for each row using `rowSums()` and the `select()` function to select those columns in R.
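In recent dplyr versions, `across()` inside `mutate()` is generally preferred over the `select(., ...)` idiom; here is a small sketch with made-up column names:

```
library(dplyr)

df <- data.frame(x1 = 1:3, x2 = 4:6, y1 = 7:9)
# across(c(x1, x2)) selects the columns directly inside mutate(),
# avoiding the older select(., ...) pattern
df <- df %>%
  mutate(row_sum = rowSums(across(c(x1, x2))))
df$row_sum
```

Because `across()` uses tidyselect, helpers such as `starts_with("x")` work in place of the explicit column list.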

In this blog post, we learned how to sum across columns in R. We covered various examples of when and why we might want to sum across columns in fields such as Data Science, Psychology, and Hearing Science. We have shown how to sum across columns in matrices and data frames using base R and the dplyr package. We have also demonstrated adding the summed columns to the original dataframe. I encourage readers to leave a comment if they have any questions or find any errors in the blog post. Finally, I encourage readers to share this post on social media to help others learn these important data manipulation skills.

- How to Rename Column (or Columns) in R with dplyr
- R Count the Number of Occurrences in a Column using dplyr
- How to Calculate Z Score in R
- How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Remove Duplicates in R – Rows and Columns (dplyr)
- How to Rename Factor Levels in R using levels() and dplyr

The post Sum Across Columns in R – dplyr & base appeared first on Erik Marsja.

]]>In this blog post, you will learn how to plot the prediction interval in R. If you work with data or are interested in statistical analysis, you know that making predictions is an essential part of the process. Moreover, if you are new to prediction intervals or want to refresh your knowledge, this post will […]

The post Plot Prediction Interval in R using ggplot2 appeared first on Erik Marsja.

]]>In this blog post, you will learn how to plot the prediction interval in R. If you work with data or are interested in statistical analysis, you know that making predictions is an essential part of the process. Moreover, if you are new to prediction intervals or want to refresh your knowledge, this post will cover everything you need. We’ll start with an overview of the statistical methods involved, including regression models, time series models, and Bayesian inference. We’ll also provide practical examples of when and how to plot the prediction interval in R, including applications in psychology and hearing science.

Whether you’re a data analyst, researcher, or someone interested in statistics, learning to plot the prediction interval in R will be valuable to your skill set. So, let’s dive in and explore this important concept together!

A prediction interval is a measure that estimates the range of values within which a future observation or measurement will likely fall with a certain confidence level. It differs from a confidence interval, which estimates the precision of a point estimate, such as the mean or median of a population. A prediction interval accounts for both the uncertainty associated with the estimation of the underlying parameters and the variability of the observed data.

Prediction intervals can be obtained using different statistical methods. It all depends on the nature of the data and the assumptions made about the underlying probability distribution. Some common methods include:

Prediction intervals can be obtained from linear or nonlinear regression models. These models describe the relationship between predictor variables and a response variable. The prediction interval considers the variability of the residuals (the differences between the observed values and the predicted values) and the uncertainty of the regression coefficients. This post will focus on visualizing the prediction interval from regression models in R.

Prediction intervals can be obtained from time series models, which describe the evolution of a variable over time. The prediction interval considers the uncertainty of the model parameters and the noise in the data.

Prediction intervals can be obtained from Bayesian models, which assign probabilities to the possible values of the parameters and future observations. The prediction interval considers the prior knowledge about the parameters and the information in the observed data.

In psychology, prediction intervals can be used to estimate the variability of individual differences in a population. One example from Psychological science: predicting a new individual’s IQ score based on a sample. We set up a model to estimate the relationship between predictor variables (e.g., age, education) and the response variable (IQ score). We use prediction intervals to assess our confidence in the predicted score. The prediction interval considers the residuals’ variability and the uncertainty of the regression coefficients. We can use prediction intervals to compare the new individual’s IQ score with others in the population. Here is a scatter plot created in R with confidence and prediction intervals:

An example of using prediction intervals in hearing science can be seen in the study of the relationship between the pure-tone average (PTA) and the probability of hearing loss. The PTA is a measure of hearing threshold at various frequencies. A probit or logit model can estimate the probability of hearing loss. Probit regression models the relationship between the PTA and the probability of hearing loss.

Once the model is fitted, a prediction interval can be calculated to estimate the range of likely probabilities of hearing loss for a new individual, given their PTA. This prediction interval considers the PTA variability and the uncertainty of the model parameters. It measures our confidence in the predicted probability of hearing loss.

In summary, prediction intervals are a useful statistical measure that provides a range of values within which a future observation or measurement will likely fall within a certain confidence level. Prediction intervals can be obtained using different statistical methods. The method you choose depends on the nature of the data and the assumptions made about the underlying probability distribution. For example, prediction intervals can be used in psychology and hearing science to estimate the variability of individual differences and auditory thresholds, respectively. These examples demonstrate the importance of considering the uncertainty and variability of the data when making predictions and drawing conclusions.
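The difference between a confidence and a prediction interval is easy to see numerically; here is a small base-R sketch with simulated data:

```
set.seed(99)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
new_obs <- data.frame(x = 1)
# Interval for the mean response at x = 1
ci <- predict(fit, newdata = new_obs, interval = "confidence")
# Interval for a single new observation at x = 1
pi <- predict(fit, newdata = new_obs, interval = "prediction")
ci
pi
```

The prediction interval is always the wider of the two, because it adds the residual variance of one new observation on top of the uncertainty in the fitted mean.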

To plot a prediction interval in R, you must understand linear regression models and their associated concepts, such as confidence intervals, standard errors, and residuals. You should also be familiar with the R language and have some knowledge of the ggplot2 package. First, you need to fit a statistical model using regression analysis. This could be linear, logistic, or any other regression model. The model can be created using the lm() or glm() function, depending on the type of regression analysis.

In R, you can use the `predict()` function to generate predicted values based on, e.g., a linear regression model.

To use ggplot2, you must install the package using the `install.packages()` function and load it into your R session using the `library()` function. You will also need to understand the grammar of graphics and how to use ggplot2 to create visualizations in R.

In addition to ggplot2, you may also need to use packages such as dplyr, tidyr, and broom. We can use these packages to manipulate and clean your data, visualize data, and extract model coefficients and standard errors.

To plot a prediction interval in R, you will need a good understanding of regression models and associated concepts. Moreover, you must be familiar with the R language and ggplot2 package. You may also need to use other packages to manipulate and clean your data, fit linear regression models, and extract model coefficients and standard errors.

Suppose we study the relationship between pure-tone average (PTA4) and speech recognition threshold (SRT) in a speech-in-noise task. We can collect data from 300 normal-hearing participants. Moreover, we measure PTA4 in decibels hearing level (dB HL), while we measure SRT in dB signal-to-noise ratio (SNR). Here is how to generate this dataset in R:

```
library(tidyverse)
set.seed(20230318) # for reproducibility
n <- 300 # sample size
PTA4 <- rnorm(n, mean = 25, sd = 5)
SRT <- 0.3 * PTA4 + rnorm(n, mean = 0, sd = 5)
data <- tibble(PTA4, SRT)
```

The dataset consists of two variables, PTA4 and SRT, with 300 observations each. PTA4 represents the average hearing threshold levels at 500, 1000, 2000, and 4000 Hz. On the other hand, SRT represents the lowest SNR at which participants can correctly identify 50% of the target words in a speech-in-noise task. Furthermore, the PTA4 values are normally distributed (mean = 25 dB HL, SD = 5 dB HL). The relationship between PTA4 and SRT is linear, with a slope of 0.3 and a random error that follows a normal distribution (mean = 0, SD = 5 dB SNR). Finally, before creating a prediction plot in R, we also need some new data:

```
# Generate new data with noise
# (note: `model` is the linear regression model fitted in the next section)
PTA4_new <- seq(from = min(PTA4), to = max(PTA4), length.out = 100)
SRT_new <- predict(model, newdata = data.frame(PTA4 = PTA4_new)) +
  rnorm(length(PTA4_new), mean = 0, sd = 1)
```

In the code chunk above, we use `predict()` to generate predicted values for the new data. We specify the model and the new data using `data.frame()`, and we add some random noise to simulate new observations. Here are some more tutorials on creating dataframes in R:

- How to Create a Matrix in R with Examples – empty, zeros
- Learn How to Convert Matrix to dataframe in R with base functions & tibble
- R Excel Tutorial: How to Read and Write xlsx files in R

To plot the prediction interval in R, we need to follow the following steps:

First, we need to fit a linear regression model to our data:

```
# Fit linear regression model
model <- lm(SRT ~ PTA4, data = data)
```

Note that you can check the assumptions of linear regression (and maybe you should) by creating a residual plot:

Finally, if your predictor variables are on different scales, you might want to use R to standardize the data using e.g., z-scores.

Second, we need to use the `predict()` function on the new dataset:

```
library(dplyr)
# Get prediction interval
PI <- predict(model, newdata = data.frame(PTA4 = PTA4_new), interval = "prediction")
# Combine data into a single data frame
PTA_new_df <- tibble(xvals = PTA4_new,
                     pred = SRT_new,
                     lwr = PI[, 2],
                     upr = PI[, 3])
```

In the code chunk above, we start by loading dplyr. Then, we calculate the prediction interval (`PI`) using the previously fitted model with the `predict()` function. The `newdata` argument specifies the values of the predictor variable PTA4 for which we want to predict the response variable, and the `interval` argument specifies that we want a prediction interval.

Next, we combine the results into a dataframe called `PTA_new_df`. We use the `tibble()` function to create the dataframe with four columns: `xvals` contains the new predictor values (`PTA4_new`), `pred` contains the predicted response values (`SRT_new`), `lwr` contains the lower bounds of the prediction interval, and `upr` the upper bounds. We use the `[` operator to extract the lower and upper bounds from the matrix returned by `predict()`. Here are a couple of data wrangling tutorials:

- How to Rename Column (or Columns) in R with dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr

Finally, we are ready to use ggplot2 to visualize the prediction interval in R:

```
ggplot(data = data, aes(x = PTA4, y = SRT)) +
geom_point() +
geom_smooth(method = "lm", color = "black", se = FALSE) +
geom_line(data = PTA_new_df, aes(x = xvals, y = lwr), linetype = "dashed", color = "grey") +
geom_line(data = PTA_new_df, aes(x = xvals, y = upr), linetype = "dashed", color = "grey") +
xlab("Pure-tone Average (dB HL)") +
ylab("SRT") + theme_classic()
```

In the code chunk above, we use ggplot2 to create a scatter plot with PTA4 on the x-axis and SRT on the y-axis. We add points to the plot with `geom_point()` and a linear regression line with `geom_smooth()` using `method = "lm"`. Moreover, we set `se = FALSE` to remove the confidence interval band. We then add two dashed lines for the prediction interval using `geom_line()`, with `PTA_new_df` as the data frame and `xvals`, `lwr`, and `upr` as the x-axis, lower, and upper values of the prediction interval. Finally, we label the x- and y-axes with `xlab()` and `ylab()` and set the theme to `theme_classic()`. Here is the resulting prediction plot:

Here are some more data visualization tutorials:

- How to Create a Violin plot in R with ggplot2 and Customize it
- How to Create a Sankey Plot in R: 4 Methods

To plot the prediction and confidence interval in R, we can slightly change the code from the last example:

```
ggplot(data = data, aes(x = PTA4, y = SRT)) +
geom_point() +
geom_smooth(method = "lm", color = "black", se = TRUE) +
geom_line(data = PTA_new_df, aes(x = xvals, y = lwr), linetype = "dashed", color = "grey") +
geom_line(data = PTA_new_df, aes(x = xvals, y = upr), linetype = "dashed", color = "grey") +
xlab("Pure-tone Average (dB HL)") +
ylab("SRT") + theme_classic()
```

In the code chunk above, we changed the `se` argument to `TRUE` to get this plot:

Plotting the prediction interval for a polynomial regression in R is straightforward. We can use the same steps as linear regression. For demonstration purposes, we use the Boston housing dataset:

```
# Load the Boston housing data:
data("Boston", package = "MASS")
# We create a training and a test dataset
training <- Boston[1:406,]
test <- Boston[407:506,]
```

In the code chunk above, the Boston dataset is split into two sets, `training` and `test`, using row indices. The dataset contains 506 observations on housing prices in the Boston area, along with various other variables that might be related to those prices.

In the first line of code, `training <- Boston[1:406,]`, we create a dataframe called `training`. This dataframe includes the first 406 rows of the Boston dataset, which will be used to fit the model.

In the second line of code, `test <- Boston[407:506,]`, we create another dataframe, `test`, which includes the remaining 100 rows of the Boston dataset and will be used to evaluate the performance of the fitted model. We will use it to see how well the polynomial model predicts new data.

Here we fit a polynomial regression model in R (you can skip to the next step if you already have your model):

```
# Now we fit a polynomial model to the data
poly_model <- lm(medv ~ poly(lstat, 5), data=training)
summary(poly_model)
```

In the code chunk above, we fit a polynomial model to the data using the `lm()` function in R. First, we define the model formula using the `~` operator, where `medv` is the response variable (the median value of owner-occupied homes in $1000s) and `poly(lstat, 5)` is the predictor variable, a polynomial transformation of `lstat` (the percentage of lower status of the population).

We used the `poly()` function to create the polynomial transformation, where the second argument (5 in this case) specifies the degree of the polynomial. In this case, we are using a fifth-degree polynomial, which means that the predictor variable is transformed into five new variables with increasing powers of `lstat`. Finally, we get a summary of the fitted model using the `summary()` function.

We are now ready to plot the prediction interval for a polynomial regression in R:

```
# Prediction intervals for the test data:
ls_predict <- predict(poly_model, test, interval = "prediction")
test[c("fit", "lwr", "upr")] <- ls_predict
ggplot(data = test, aes(x = lstat, y = fit)) +
  geom_point() +
  geom_smooth(method = "lm", color = "black", se = TRUE,
              formula = y ~ poly(x, 5, raw = TRUE)) +
  geom_line(aes(x = lstat, y = lwr), linetype = "dashed", color = "grey") +
  geom_line(aes(x = lstat, y = upr), linetype = "dashed", color = "grey") +
  xlab("% lower status of the population") +
  ylab("Median value of owner-occupied homes in $1000's") +
  annotate("text", x = 35, y = 40, label = "Boston Dataset,\n for demo purposes",
           hjust = 1.1, vjust = -1.1, col = "grey", cex = 4,
           fontface = "italic", alpha = 0.5) +
  # 'marsja' is a watermark grob defined elsewhere on the blog;
  # remove this line if you do not have such an object:
  annotation_custom(marsja, xmin = 35, xmax = 43, ymin = -9, ymax = -9) +
  coord_cartesian(ylim = c(-1, 50), clip = "off") +
  theme_classic()
```

Code language: R (r)

In the code chunk above, `ls_predict` calculates the prediction intervals of the polynomial model using the testing data, with the `interval` argument set to "prediction". The resulting intervals are then added to the testing dataset using `test[c("fit", "lwr", "upr")] <- ls_predict`, where `fit`, `lwr`, and `upr` are the fitted values, lower bounds, and upper bounds.
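To see how a prediction interval differs from a confidence interval in practice, we can request both from `predict()` and compare their widths. A small sketch, repeating the model fit from above (assumes the MASS package is installed):

```r
# Prediction intervals account for residual noise on top of estimation
# uncertainty, so they are always wider than confidence intervals
# at the same predictor value.
data("Boston", package = "MASS")
training <- Boston[1:406, ]
test <- Boston[407:506, ]
poly_model <- lm(medv ~ poly(lstat, 5), data = training)
ci <- predict(poly_model, test, interval = "confidence")
pi <- predict(poly_model, test, interval = "prediction")
# The prediction interval is wider for every test observation:
all((pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"]))  # TRUE
```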

We use the `ggplot()` function to visualize the model and prediction intervals. The `geom_point` layer plots the original data, the `geom_smooth` layer with `method = "lm"` adds a polynomial line of best fit, and the `geom_line` layers add the prediction intervals as dashed lines. Moreover, we use `xlab` and `ylab` to set the x- and y-axis labels, respectively. Finally, we use the `theme_classic()` function to apply a classic look to the plot. Here is the resulting plot:

In this blog post, you have learned about prediction intervals in R and how to visualize them using ggplot2. Prediction intervals are a statistical method that can estimate the range within which future observations are likely to fall. You can plot prediction intervals in R for various disciplines, including psychology, data science, and hearing science.

To plot a prediction interval in R, you must first fit a model, e.g., polynomial regression, ARIMA, or ANCOVA. Once you have a model, you can use the `predict()` function to generate predictions for new data points. These predictions can then be used to plot the prediction interval using ggplot2.

The blog post also covered confidence intervals, which are similar to prediction intervals but estimate the range within which a population parameter is likely to fall. The post concluded with an example of how to plot the prediction interval for a polynomial regression model in R.

In summary, this blog post gave you an overview of prediction intervals and how to use R to visualize them. Following the step-by-step instructions, even beginner R programmers can create informative plots to help understand their data.

Here are some more R tutorials that you might find helpful:

- How to Take Absolute Value in R – vector, matrix, & data frame
- R Count the Number of Occurrences in a Column using dplyr
- How to use %in% in R: 7 Example Uses of the Operator
- Select Columns in R by Name, Index, Letters, & Certain Words with dplyr
- How to Remove Duplicates in R – Rows and Columns (dplyr)

The post Plot Prediction Interval in R using ggplot2 appeared first on Erik Marsja.

]]>In this blog post, we will learn how to use probit regression in R, a statistical modeling technique for analyzing binary response variables. Probit regression is particularly useful when the outcome variable is dichotomous. That is when the outcome variable takes only two possible values, such as “success” or “failure,” “yes” or “no,” or “tinnitus” or “no tinnitus.”

We will start by exploring examples of when to use probit regression in R. Then, we will look at other alternatives and compare them to probit regression. We will also provide an example dataset that we can use to illustrate how to perform probit regression in R.

Once we have a dataset, we will dive into the syntax of building a probit model in R, explaining each step. We will also provide a complete example of probit regression in R to see how it fits together.

Finally, we will learn how to visualize a probit model using ggplot2, a powerful data visualization package in R. By creating a predicted probability plot, we can see the relationship between our predictor variables and the probability of the outcome variable.

We can use probit regression in R to model the relationship between a binary response variable and one or more predictor variables. Note that a binary variable takes on one of two possible values, such as the presence or absence of a particular characteristic or event. The main goal of probit regression is to estimate the probability that the binary response variable equals 1 (as opposed to 0) as a function of the predictor variables.

Probit regression is based on the assumption that the relationship between the predictors and the probability of the response variable can be modeled using the cumulative distribution function (CDF) of a normal distribution. In other words, probit regression assumes that the probability of the response variable being equal to 1 can be modeled using a normal probability density function and that the values of the predictor variables determine the mean and standard deviation of the distribution.
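This link between the linear predictor and the probability is easy to see in R, where `pnorm()` is the standard normal CDF. A minimal illustration with hypothetical probit-index values:

```r
# The probit link in action: pnorm() maps any linear-predictor value
# (the "probit index") onto a probability between 0 and 1.
eta <- c(-2, -0.5, 0, 0.5, 2)  # hypothetical linear-predictor values
round(pnorm(eta), 4)
# 0.0228 0.3085 0.5000 0.6915 0.9772
```

A probit index of 0 thus corresponds to a 50% probability, and the probabilities approach 0 and 1 symmetrically as the index moves away from zero.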

In a recent post, you can learn how to make a plot prediction in R using ggplot2. This plot type can be used for linear and non-linear regression models. Remember to test for normality of the residuals or create a residual plot.

There are several examples of how probit regression can be applied in psychology and hearing science:

- In a study investigating the effect of a new treatment on depression, researchers may use probit regression. Here we can use it to model the relationship between the treatment (predictor variable) and the likelihood of a patient experiencing reduced depression symptoms (binary response variable).
- Researchers may use probit regression in a study examining the relationship between noise exposure and the likelihood of hearing loss. They can use it to model the relationship between the loudness of sounds (predictor variable) and the likelihood of a participant experiencing hearing loss (binary response variable). If we conducted such a study, it could help determine the threshold at which loud sounds can cause hearing damage, which can be used to set occupational safety standards for workers in noisy environments.
- In a study investigating the effect of age and gender on the likelihood of experiencing anxiety, researchers may use probit regression to model the relationship between age, gender (predictor variables), and the likelihood of experiencing anxiety (binary response variable).

We can use several alternative analysis methods for data like the ones described in examples 1-3. Here are a few examples:

- Logistic Regression: Logistic regression is a generalized linear model for binary outcomes. It is similar to probit regression, which models the relationship between a binary outcome variable and one or more predictor variables. However, logistic regression uses a different link function (the logistic function) to model the relationship between the predictor variables and the probability of the binary outcome.
- Generalized Estimating Equations (GEE): GEE is a method for analyzing data with correlated outcomes, such as repeated measures data or clustered data. GEE allows for the estimation of regression coefficients adjusted for within-subject or within-cluster correlations, which can improve the estimates’ accuracy and the statistical inference’s validity.
- Mixed-effects models: Mixed-effects models are regression models that analyze data with fixed and random effects. They are handy for analyzing nested or hierarchical data structures, such as repeated measures data or clustered data. Mixed-effects models can also be used to analyze longitudinal data or data with missing values.
- Bayesian methods: Bayesian methods are a family of statistical methods based on Bayes’ theorem. They are particularly useful for analyzing complex data structures and models. They allow for incorporating prior knowledge and uncertainty into the analysis. Bayesian methods can analyze data with binary outcomes, repeated measures, clustered, and longitudinal data.
- Machine learning methods: Machine learning methods are a family of computational methods used to build predictive models from data. They are particularly useful for analyzing large and complex data sets, as they can handle many predictor variables and non-linear relationships between the predictor variables and the outcome variable. Some common machine-learning methods include decision trees, random forests, and support vector machines.

The choice of analysis method will depend on the specific research question, the data type, and the different techniques’ assumptions. It is important to consider each method’s strengths and limitations carefully and choose the most appropriate method for the research question.
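To make the comparison with logistic regression concrete, here is a sketch that fits the same simulated binary data with both links; the variable names and coefficient values are made up for illustration:

```r
# Probit vs. logit on identical data. Logit coefficients are typically
# about 1.6-1.8 times the probit ones, because the logistic
# distribution has a larger standard deviation than the standard normal.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(0.5 + 1.2 * x))   # data generated via a probit link
probit_fit <- glm(y ~ x, family = binomial(link = "probit"))
logit_fit  <- glm(y ~ x, family = binomial(link = "logit"))
coef(logit_fit)[["x"]] / coef(probit_fit)[["x"]]  # slope ratio, roughly 1.6-1.8
```

Both models usually give nearly identical predicted probabilities; the choice between them is mostly a matter of convention and interpretability.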

We will generate two pieces of example data: one with continuous variables and one with a categorical variable. Note that you are free to use your data, and the data is not related to any actual studies.

Suppose that we are interested in the predictors of hearing aid use. We have a dataset with 500 observations, two continuous predictor variables, age and experience with hearing aids, and the binary response variable hearing aid use. We assume a relationship between the predictor variables and the binary response variable.

```
# Set random seed for reproducibility
set.seed(123)
# Set sample size
n <- 500
# Set coefficients
beta0 <- -0.5
beta1 <- 1.5
beta2 <- 2.5
# Simulate predictors
age <- rnorm(n, 50, 10)
experience <- rnorm(n, 15, 3)
# Standardize predictors and build the linear predictor
x1 <- scale(age)
x2 <- scale(experience)
eta <- beta0 + beta1*x1 + beta2*x2
eta <- eta + rnorm(n, 0, 0.1)
# Simulate the binary response (hearing aid use)
hau <- rbinom(n, 1, pnorm(eta))
# Create dataframe
df1 <- data.frame(hau, age = age, experience = experience)
```

Code language: R (r)

Here are some tutorials that you may find useful for generating dataframes in R:

- How to Convert a List to a Dataframe in R – dplyr
- Learn How to Convert Matrix to dataframe in R with base functions & tibble
- How to use $ in R: 6 Examples – list & dataframe (dollar sign operator)

Suppose we have collected data from 500 participants with hearing loss to investigate the relationship between hearing loss and tinnitus. In the study, we also collected audiogram data and calculated the pure-tone average (PTA4) across four frequencies (500 Hz, 1000 Hz, 2000 Hz, and 4000 Hz). Additionally, we obtained demographic information, including the gender of each participant. Using this data, we want to investigate whether there is an association between PTA4 and the presence of tinnitus while also taking gender into account. Finally, we collected self-reported tinnitus symptoms on a binary scale (yes or no). Here we simulate this hypothetical data:

```
# Set sample size
samp_n <- 500
# Set random seed for reproducibility
set.seed(123)
# Generate PTA4 values
PTA4 <- rnorm(samp_n, mean = 50, sd = 10)
# Generate gender values
gender <- rbinom(samp_n, size = 1, prob = 0.5)
# Calculate probability of tinnitus based on coefficients
prob_tinnitus <- pnorm(0.890118 - 0.015338*PTA4 - 0.584865*gender)
# Generate tinnitus values based on probabilities
tinnitus <- rbinom(samp_n, size = 1, prob = prob_tinnitus)
# Create dataframe
df2 <- data.frame(PTA4 = PTA4,
gender = ifelse(gender == 1, "Male", "Female"),
tinnitus = tinnitus)
```

Code language: R (r)

Although not a requirement, it may be easier to transform your data from wide to long in R using tidyr before analyzing it with the `glm()` function.

First, before carrying out the probit regression, we can calculate descriptive statistics in R using dplyr:

```
library(dplyr)
df2 %>% group_by(gender) %>%
  summarise("Mean Tinnitus" = mean(tinnitus),
            "SD Tinnitus" = sd(tinnitus),
            "Mean PTA4" = mean(PTA4),
            "SD PTA4" = sd(PTA4))
```

Code language: R (r)

To carry out probit regression in R, we can use the following steps:

- Define the formula for the regression model in the `glm()` function. The formula specifies the binary response variable and one or more predictor variables.
- Set the `family` argument in the `glm()` function to `binomial(link = "probit")` to specify that the probit link function should be used.
- Fit the model by calling the `glm()` function with the formula and data as arguments.
- Use the `summary()` function to print the results.

This section will explore how to perform probit regression in R with continuous and categorical predictor variables.

Here is how we carry out a probit regression in R with continuous variables:

```
# Run probit model
expage_model <- glm(hau ~ experience + age,
                    family = binomial(link = "probit"), data = df1)
# View model summary
summary(expage_model)
```

Code language: R (r)

In the code chunk above, we run a probit regression model in R with `experience` and `age` as predictors to explain the probability of `hau` (i.e., hearing aid use). We use the `glm()` function to estimate the model, with the `family` argument set to `binomial` and the `link` argument set to `"probit"` to indicate that we are estimating a probit regression model. The `data` argument specifies the dataframe where the variables are located.

Here is how we run a probit model in R with a continuous and a categorical variable:

```
tinnitus_probit <- glm(tinnitus ~ PTA4 + gender,
                       family = binomial(link = "probit"),
                       data = df2)
```

Code language: R (r)

In the code chunk above, we use the `glm()` function in R to conduct probit regression. We use the formula argument to specify the model we want to fit. In this case, we fit a model with `tinnitus` as the binary response variable and `PTA4` and `gender` as the predictor variables. Again, we use the tilde (`~`) symbol to separate the response and predictor variables.

Importantly, we use the `family` argument to specify the type of response variable we are working with. In this case, we use the binomial family because `tinnitus` is a binary variable. The `link` argument specifies the link function; here we use the probit link because we want to model the probability of tinnitus using a normal distribution.

Finally, the `data` argument specifies the dataset for the analysis, in this case the `df2` dataframe we created earlier. Note that we do not have to create a dummy variable in R for the `gender` variable; the `glm()` function will handle this for us.
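To see what this automatic dummy coding looks like, we can inspect the design matrix that `glm()` builds internally via `model.matrix()`. A minimal sketch with a stand-in `gender` column:

```r
# glm() dummy-codes factors through model.matrix(). For a two-level
# character variable such as gender, one indicator column (genderMale)
# is created, with Female as the alphabetical reference level.
d <- data.frame(gender = c("Female", "Male", "Male"))
m <- model.matrix(~ gender, data = d)
m[, "genderMale"]  # 0 1 1
```

This is why the model summary reports a `genderMale` coefficient rather than one coefficient per gender.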

To print the results, we can use the `summary()` function:

```
## model summary
summary(tinnitus_probit)
```

Code language: R (r)

From the output above, we can see that the intercept is -0.69806. Because probit coefficients are on the z-score (probit) scale, this means the predicted probability of having tinnitus is pnorm(-0.69806) ≈ 0.24 when PTA4 is 0 and gender is at the reference level (female).

Moreover, we can see that the PTA4 coefficient is 0.06830. Each unit increase in PTA4 raises the probit index (z-score) by 0.06830, increasing the predicted probability of tinnitus. This effect is statistically significant with a p-value of 1.17e-08.

Finally, we can see that the genderMale coefficient is -0.56991. Holding PTA4 constant, the probit index is 0.56991 lower for males than for females; that is, males have a lower predicted probability of tinnitus. This effect is also statistically significant, with a p-value of 0.0039.
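Because the coefficients are on the probit (z-score) scale, predicted probabilities are obtained by passing the linear predictor through `pnorm()`. A sketch using the coefficient values reported above:

```r
# Turning probit coefficients into probabilities with pnorm().
# Coefficients as reported in the summary (intercept, PTA4, genderMale):
b <- c(-0.69806, 0.06830, -0.56991)
# Predicted probability of tinnitus for a female with PTA4 = 50:
pnorm(b[1] + b[2] * 50)          # about 0.997
# ...and for a male with the same PTA4:
pnorm(b[1] + b[2] * 50 + b[3])   # about 0.984
```

Comparing such predicted probabilities at representative predictor values is often easier to communicate than the raw coefficients.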

One way to visualize a probit model is to create a predicted probability plot. This plot shows the relationship between the predictor variable(s) and the probability of the outcome variable.

Here is how we can use ggplot2 in R:

```
library(ggplot2)
# Generate predicted probabilities
df2$predicted_prob <- predict(tinnitus_probit, type = "response")
# Create a predicted probability plot
ggplot(df2, aes(x = PTA4, y = predicted_prob, color = gender)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = binomial(link = "probit")),
              se = FALSE) +
  labs(x = "PTA4", y = "Probability of tinnitus") +
  scale_color_manual(values = c("blue", "red"),
                     labels = c("Female", "Male")) +
  theme_bw()
```

Code language: R (r)

In the code chunk above, we first load the ggplot2 library. Then, we use the predict function to generate predicted probabilities based on our probit regression model. These probabilities are added to the original dataframe as a new column called “predicted_prob”.

Next, we create a scatter plot using `ggplot()`, where the x-axis represents `PTA4` and the y-axis represents the predicted probability of tinnitus. We color the points by gender using the `color` aesthetic.

To add a smooth line to the scatter plot, we use the `geom_smooth()` function with the `method` argument set to `"glm"`, indicating that we want to fit a generalized linear model. We specify the family as binomial with a probit link function using the `method.args` argument. We set `se` to `FALSE` to remove the standard-error shading.

Finally, we add x- and y-axis labels using `labs()` and set the colors for the two genders using `scale_color_manual()`. We also apply a black-and-white theme using `theme_bw()`. Here is the resulting plot:

Note that you can check the fit of your model by calculating the total sum of squares or the sum of squared errors in R.
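For binary-outcome models specifically, a common complement to these checks is McFadden's pseudo-R-squared, which compares the model's log-likelihood to that of an intercept-only model. A sketch on a small simulated probit model (the same recipe applies to the model fitted above):

```r
# McFadden's pseudo-R-squared: 1 - logLik(model) / logLik(null model).
# Values roughly between 0.2 and 0.4 already indicate a good fit
# for this statistic.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, pnorm(0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial(link = "probit"))
null_fit <- update(fit, . ~ 1)              # intercept-only comparison model
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null_fit))
mcfadden  # between 0 and 1; larger is better
```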

In this blog post, we have learned about probit regression, a generalized linear model for binary outcomes. We have explored examples of when to use probit regression in hearing science and audiology. We have also looked at alternative models for binary outcomes, such as logistic regression and discriminant function analysis.

To illustrate how to perform probit regression in R, we have generated example data and provided the R syntax for running the model. We have also demonstrated how to visualize the model using the ggplot2 package.

Furthermore, we have discussed how to interpret the results of a probit regression analysis, looking at the deviance residuals and, in particular, the coefficients and significance codes.

In conclusion, probit regression is a useful tool for modeling binary outcomes with continuous predictors in R. By using probit regression, we can gain insights into the relationship between the predictor variables and the binary outcome. Visualizing the model can help us better understand and communicate the results to others.

The post Probit Regression in R: Interpretation & Examples appeared first on Erik Marsja.

]]>Sankey plots are an essential tool for data visualization in science and business. Whether you are exploring complex data flows, identifying patterns, or communicating insights, Sankey diagrams make it easy to visualize connections and gain meaningful insights. Discover four methods to create stunning Sankey plots in R and elevate your data analysis game!

The post How to Create a Sankey Plot in R: 4 Methods appeared first on Erik Marsja.

]]>In this post, we will walk you through creating a Sankey plot in R using four packages. We will use the packages ggsankey, ggalluvial, networkD3, and plotly. Sankey plots are powerful visualizations that can help you understand data flow or resources between different categories or stages. However, creating a Sankey plot can be challenging, especially if unfamiliar with the programming language or visualization tools.

First, we will discuss the basics of Sankey plots and the data format required to create them. Then, we will dive into each package and show you how to create Sankey plots using them.

If you are new to Sankey plots or R programming, do not worry. This post aims to be beginner-friendly and assumes little prior knowledge of Sankey plots or the R language. By the end of this post, you should be able to create stunning Sankey plots that will impress your colleagues and clients.

So, let’s get started and explore how to create Sankey plots in R using ggsankey, ggalluvial, networkD3, and plotly packages!

A Sankey plot is a graphical representation of flow quantities or amounts. Furthermore, the plot typically uses arrows or lines of varying widths to illustrate the flow from one category to another. The width of each arrow is proportional to the quantity or magnitude of the flow it represents. Sankey plots help show complex systems, networks, or processes where tracking the flow of items or energy from one stage to another is important.

Sankey plots are useful in psychology and hearing science to illustrate the flow of information from one stage of a cognitive or perceptual process to another. Here are some examples of when Sankey plots might be helpful:

- In cognitive psychology, Sankey plots can represent the flow of information from the sensory input stage to the decision-making stage in a perceptual task.
- In hearing science, Sankey plots can be used to illustrate the flow of sound energy from the ear canal to the cochlea and then to the auditory nerve.
- Sankey plots can also represent the flow of patients through a healthcare system, from the initial diagnosis to treatment and follow-up.
- In social psychology, Sankey plots can illustrate the influence flow from one person to another in a social network.

To follow this blog post on creating a Sankey diagram in R, you need the following:

- Prior knowledge of the R programming language.
- Knowledge of data manipulation in R is required.
- Be familiar with data visualization and creating plots in R.

In terms of packages, the blog post covers five different packages for creating Sankey plots: ggplot2, ggsankey, ggalluvial, networkD3, and plotly. To install the required packages in R, you can use the `install.packages()` function with a character vector of package names. For example, `install.packages(c("ggalluvial", "networkD3", "plotly"))` will install three of the packages. `ggsankey`, however, needs to be installed from GitHub. Here is how to install it using devtools: `devtools::install_github("davidsjoberg/ggsankey")`. Additionally, you need to install dplyr.

We need to have our data in a specific format. Specifically, each row in the dataframe should represent a flow from one node to another. Moreover, the nodes themselves need to be represented in a separate dataframe.

Here is an example dataset and code to generate a Sankey plot in R based on a hypothetical psychology example:

In the example data, we suppose we have conducted a study on the relationship between personality traits and career choices. We collected data on 100 participants and grouped their personality traits and career choices into four categories. Here is the code to simulate data matching this design:

```
# Set random seed for reproducibility
set.seed(123)
# create a dataframe with 100 participants
df <- data.frame(id = 1:100)
# randomly assign gender and field of study
df$gender <- sample(c("Male", "Female"), 100, replace = TRUE)
df$field <- sample(c("Science", "Art", "Business", "Law"), 100, replace = TRUE)
# assign personality traits based on field of study
df$personality <- ifelse(df$field %in% c("Science", "Art"),
                         sample(c("Introverted", "Introverted",
                                  "Introverted", "Extroverted"), 100,
                                replace = TRUE),
                         ifelse(df$field == "Business",
                                sample(c("Introverted", "Extroverted",
                                         "Extroverted"), 100, replace = TRUE),
                                sample(c("Introverted", "Extroverted"),
                                       100, replace = TRUE)))
# use ifelse() to set gender proportions based on field of study
df$gender <- ifelse(df$field %in% c("Science", "Business"),
                    sample(c("Male", "Female"), 100, replace = TRUE,
                           prob = c(0.611, 0.389)),
                    ifelse(df$field == "Art",
                           sample(c("Male", "Female"), 100,
                                  replace = TRUE, prob = c(0.388, 0.612)),
                           sample(c("Male", "Female"), 100,
                                  replace = TRUE, prob = c(0.545, 0.455))))
```

Code language: R (r)

In the code chunk above, we start by using the `set.seed()` function to ensure reproducibility of the results.

Next, we create a dataframe `df` with 100 rows representing 100 participants. Here we use the `data.frame()` function to create a new dataframe. In this dataframe, the `id` column is initialized with values ranging from 1 to 100 (a unique number for each participant).

We then use the `sample()` function to assign gender and field of study to each participant randomly. The `sample()` function selects a random subset of values from the specified vector, and the `replace` argument allows for sampling with replacement.

We also used R's `%in%` operator to check if the `df$field` values belong to the specified categories. This operator returns a logical vector indicating if the element on the left-hand side is found in the vector on the right-hand side.

Based on the field of study, we use the `ifelse()` function to assign personality traits to each participant. If the field is science or art, the personality is randomly assigned to be either introverted or extroverted, with introverted more likely. If the field is business, extroverted personalities are more likely. Otherwise, introverted and extroverted personalities are equally likely.

Finally, we use the `ifelse()` function again to set gender proportions based on field of study. If the field is science or business, the gender is randomly assigned with probabilities based on the proportion of males and females in those fields. If the field is art, the probabilities are swapped. Otherwise, equal probabilities are used. Here are the first six rows of the generated data:

In this section, we will dive into how to create a Sankey graph in R. Each example is followed by information on how the data have to be formatted.

To create a Sankey graph in R, we can use ggplot2 and the ggsankey packages. Here is a pretty straightforward example of how to use ggplot2 and ggsankey to create a Sankey plot:

```
library(ggplot2)
library(ggsankey)
# Creating a Sankey diagram:
skeypl <- ggplot(df_skey, aes(x = x,
                              next_x = next_x,
                              node = node,
                              next_node = next_node,
                              fill = factor(node),
                              label = node)) +
  geom_sankey(flow.alpha = 0.5,
              node.color = "black",
              show.legend = FALSE) +
  # label the nodes (size, color, fill, and justification as in the text):
  geom_sankey_label(size = 3, color = "black", fill = "white", hjust = 0.5)
skeypl
```

Code language: R (r)

In the code chunk above, we start by loading the two R libraries ggplot2 and ggsankey. We use the `ggplot()` function to create a plot object called `skeypl`. The plot shows a Sankey diagram, which displays the flow of data or information between different categories or levels.

Next, we specify the data for the plot in the dataframe `df_skey`, which has columns for `x`, `next_x`, `node`, and `next_node`. The `x` and `next_x` columns represent the current and following stage of each flow, while the `node` and `next_node` columns represent the names of the nodes.

We use the `aes()` function to specify the aesthetics of the plot. The `x` and `next_x` columns are mapped to the horizontal axis, while the `node` and `next_node` columns are mapped to the nodes of the Sankey diagram. The `fill` argument colors the nodes based on their factor level, and the `label` argument labels the nodes with their names.

Importantly, we create the plot using two ggsankey functions: `geom_sankey()` and `geom_sankey_label()`. We use the `geom_sankey()` function to create the Sankey diagram itself; here we set the flow alpha value, the node color, and whether to show the legend. We did not want a legend, so we set it to FALSE. Finally, we use the `geom_sankey_label()` function to add node labels, setting the font size, color, fill, and horizontal justification.

More data visualization tutorials:

- How to Make a Residual Plot in R & Interpret Them using ggplot2
- How to Create a Violin plot in R with ggplot2 and Customize it
- How to Make a Scatter Plot in R with Ggplot2

As you may have noticed, when creating a Sankey plot in R using ggplot2 and ggsankey, the data must be in a specific format. Specifically, the dataframe should be in long format. The dataframe should have columns for the source and target nodes and the value or flow between them.

To prepare a dataframe for Sankey plotting, we can use the tidyr package to convert the dataframe to long format. This would involve the `gather()` function (or its newer replacement, `pivot_longer()`), which converts columns to rows, and may also involve renaming the columns using, e.g., dplyr.

Alternatively, the ggsankey package provides a `make_long()` function that converts the dataframe to long format directly. This function takes the original dataframe and the names of the columns to use as stages as arguments. When using `make_long()`, we get a new dataframe in long format suitable for Sankey plotting. Here is how we can use the function on the example data:

```
df_skey <- df %>%
  make_long(personality, field, gender)
```

Code language: R (r)
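For completeness, here is a sketch of how the tidyr route could look, using `pivot_longer()` (the successor to `gather()`) plus `dplyr::lead()` on a minimal stand-in dataframe; `make_long()` does essentially this for us:

```r
library(dplyr)
library(tidyr)
# The Sankey long format needs, per row, the current stage/node and the
# next stage/node, which pivot_longer() plus lead() can produce.
df <- data.frame(personality = c("Introverted", "Extroverted"),
                 field = c("Science", "Art"),
                 gender = c("Male", "Female"))
df_long <- df %>%
  mutate(row = row_number()) %>%
  pivot_longer(c(personality, field, gender),
               names_to = "x", values_to = "node") %>%
  group_by(row) %>%
  mutate(next_x = lead(x), next_node = lead(node)) %>%
  ungroup()
```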

To create a Sankey graph in R, we can also use ggplot2 and ggalluvial:

```
library(ggplot2)
library(ggalluvial)
# Create the Sankey plot:
skeypl2 <- ggplot(data = frequencies,
                  aes(axis1 = personality, # First variable on the X-axis
                      axis2 = field,       # Second variable on the X-axis
                      axis3 = gender,      # Third variable on the X-axis
                      y = n)) +
  geom_alluvium(aes(fill = gender)) +
  geom_stratum() +
  geom_text(stat = "stratum",
            aes(label = after_stat(stratum))) +
  scale_fill_viridis_d() +
  theme_void()
skeypl2
```

Code language: R (r)

In the code chunk above, we use the `ggplot()` function to specify the dataframe and aesthetics. We use the `aes()` function to specify the aesthetics: the three variables `personality`, `field`, and `gender` on the X-axis, and the frequency count `n` on the Y-axis. Next, we use three functions to create the Sankey plot in R. First, the `geom_alluvium()` function creates the alluviums or "flows" between the different levels of the plot. In this case, the alluviums are determined by the `gender` variable and filled with different colors using the fill aesthetic. Next, we use the `geom_stratum()` function to create the rectangular blocks representing each level of the Sankey plot.

Additionally, we use the `geom_text()` function to add labels to the rectangular blocks, which in this case are the names of the different levels. Finally, we use two functions to make a visually more attractive plot. First, the `scale_fill_viridis_d()` function sets the color scale used in the plot. Second, the `theme_void()` function removes unnecessary visual elements, leaving only the Sankey plot itself. Here is the result:

As you may have noticed, we used a different dataframe than we created (and then in the first example). Again, we need to transform our data. In this case, we use dplyr to calculate descriptive statistics in R (i.e., frequencies).

To use `ggalluvial`, we need to restructure the data into a format that allows us to specify the variables we want to plot on each axis of the Sankey plot. This is why we created a new dataframe, `frequencies`, using the `dplyr` package:

```
library(dplyr)
# summarize the data and count the frequencies
frequencies <- df %>%
  count(personality, field, gender) %>%
  arrange(field, desc(n))
```

Code language: R (r)

In the code chunk above, we use the `count()` function to summarize the data by counting the frequency of each combination of `personality`, `field`, and `gender`. The `%>%` operator then passes the resulting dataframe to the next operation, `arrange()`, which sorts the data by `field` and by the frequency `n` in descending order.

The resulting `frequencies` dataframe has columns for `personality`, `field`, `gender`, and `n` (the frequency count), which is the format `ggalluvial` needs for creating the Sankey plot.
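To make the shape of this dataframe concrete, here is a minimal sketch with a made-up toy dataset (the values are hypothetical, not from the study data): one row per observed combination, plus the frequency column `n`.

```r
# Hypothetical mini-dataset (made-up values) to illustrate the long
# format that ggalluvial expects
library(dplyr)

df_toy <- data.frame(
  personality = c("Introvert", "Introvert", "Extrovert", "Extrovert", "Extrovert"),
  field       = c("Psychology", "Psychology", "Engineering", "Engineering", "Psychology"),
  gender      = c("Female", "Female", "Male", "Female", "Male")
)

# Same pipeline as above: count each combination, then sort
frequencies <- df_toy %>%
  count(personality, field, gender) %>%
  arrange(field, desc(n))

frequencies
```

Each row of `frequencies` becomes one "flow" in the plot, with `n` controlling its width.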

To create a Sankey plot in R using the `networkD3` package, we can use the `sankeyNetwork()` function:

```
# create Sankey plot using networkD3
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "value", NodeID = "name",
              sinksRight = FALSE)
```


In the code chunk above, we make a Sankey plot using the `sankeyNetwork()` function from the `networkD3` package in R. We use a range of different arguments. First, we use the `Links` argument to specify the links between nodes; this dataframe should have at least three columns: source, target, and value. Next, we use the `Nodes` argument to specify the nodes in the plot; here we need a second dataframe with a column of unique node IDs.

Moreover, we use the `Source`, `Target`, and `Value` arguments to specify the column names in the `Links` dataframe for each link's source, target, and value. Next, we use the `NodeID` argument to specify the column name in the `Nodes` dataframe for the node IDs. Finally, we use the `sinksRight` argument to control whether the last nodes (those without outgoing links) are moved to the right border of the plot. In this case, it is set to `FALSE`, so these nodes are not pushed to the right.

You may have noticed in the previous section that we used two dataframes as arguments in the `sankeyNetwork()` function. To create a networkD3 visualization in R, we need to transform our data into a specific format that the function can read. This format requires three columns in our links dataframe: the source, target, and value.

Here is how we create the `nodes` dataframe from the previous example:

```
# create a table of frequencies
freq_table <- df %>%
  group_by(personality, field, gender) %>%
  summarise(n = n())

# create a nodes data frame
nodes <- data.frame(name = unique(c(as.character(freq_table$personality),
                                    as.character(freq_table$field),
                                    as.character(freq_table$gender))))
```


In the code chunk above, we perform two tasks. First, we use the `%>%` pipe operator to create a table of frequencies named `freq_table`. We create this table by grouping the dataframe `df` by three variables: `personality`, `field`, and `gender`. Then, we use the `summarise()` function to calculate the frequency count of each unique combination of these variables and name it `n`.

Second, we create a data frame named `nodes`. This dataframe contains a column named `name` that holds the unique values of the `personality`, `field`, and `gender` variables found in the `freq_table` dataframe. We build it by combining the three columns of `freq_table` with the `c()` function and then extracting the distinct values with the `unique()` function. We can also use dplyr to convert a list to a dataframe in R, if necessary.
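As a quick aside, here is a hedged sketch of that list-to-dataframe conversion using dplyr's `bind_rows()` (the list contents are made up for illustration):

```r
# Hypothetical list of records; bind_rows() stacks them into one dataframe
library(dplyr)

row_list <- list(
  list(personality = "Introvert", field = "Psychology", gender = "Female"),
  list(personality = "Extrovert", field = "Engineering", gender = "Male")
)

df_from_list <- bind_rows(row_list)
df_from_list
```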

In the next step, we will create the `links` dataframe:

```
# create links dataframe
links <- data.frame(source = match(freq_table$personality, nodes$name) - 1,
                    target = match(freq_table$field, nodes$name) - 1,
                    value = freq_table$n,
                    stringsAsFactors = FALSE)
links <- rbind(links,
               data.frame(source = match(freq_table$field, nodes$name) - 1,
                          target = match(freq_table$gender, nodes$name) - 1,
                          value = freq_table$n,
                          stringsAsFactors = FALSE))
```


In the code chunk above, we make the dataframe using the `data.frame()` function. Here we use the `match()` function to find the index of each unique personality value in the `nodes$name` vector, and we subtract one so the indices start from 0 instead of 1, since networkD3 uses zero-based indexing. Similarly, we repeat the same process for the `field` and `gender` columns of the `freq_table` dataframe and store these indices in the `source` and `target` columns of the `links` dataframe.

Second, we append new rows to the `links` dataframe using the `rbind()` function. These rows represent the relationship between the field and gender variables. We use the same `match()` and subtraction process to find the indices of the field and gender values in the `nodes$name` vector. Next, we assign the frequency counts of this relationship to the `value` column of the `links` dataframe.
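A quick way to sanity-check the off-by-one logic: subtracting 1 converts R's one-based `match()` result into the zero-based indices networkD3 expects, and adding 1 maps back to the node names. This sketch uses made-up node names:

```r
# Toy nodes dataframe (hypothetical names, not the study data)
nodes <- data.frame(name = c("Introvert", "Extrovert", "Psychology"),
                    stringsAsFactors = FALSE)

personality <- c("Extrovert", "Introvert")

# Zero-based indices, as networkD3 expects
source <- match(personality, nodes$name) - 1

# Round trip: add 1 to return to R's one-based indexing
recovered <- nodes$name[source + 1]
stopifnot(identical(recovered, personality))
source
```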

Finally, we pass the `stringsAsFactors = FALSE` argument to both `data.frame()` calls to avoid converting string columns to the factor data type.

To use Plotly to create an interactive Sankey Plot in R, we can use the following code:

```
# Make Sankey diagram
plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(pad = 15,
              thickness = 20,
              line = list(color = "black", width = 0.5),
              label = nodes$name),
  link = list(source = links$source,
              target = links$target,
              value = links$value),
  textfont = list(size = 10),
  width = 720,
  height = 480
) %>%
  layout(title = "Sankey Diagram: Personality, Field, and Gender",
         font = list(size = 14),
         margin = list(t = 40, l = 10, r = 10, b = 10))
```


In the code chunk above, we use the `plot_ly()` function from the `plotly` package to create a Sankey plot. We set the `type` argument to "sankey" to specify the type of plot we want to create, and the `orientation` argument to "h" for a horizontal orientation.

Next, we use the `node` argument to define the properties of the nodes in the plot. We set the `pad` parameter to 15 to add padding around the node labels, `thickness` to 20 to set the thickness of the nodes, and `label` to `nodes$name` to specify the node labels. We also set the line color to "black" and the width to 0.5 for the node outlines.

Additionally, we use the `link` argument to define the properties of the links between nodes. We set `source` to `links$source` to specify the source node of each link, `target` to `links$target` to specify the target nodes, and `value` to `links$value` to specify the width of each link. We also set `textfont` to control the font size of the node labels, and `width` and `height` to set the size of the plot.

Finally, we use the `%>%` operator to add a `layout` to the plot. We set the `title` and `font` properties, as well as the `margin` parameter to add some space around the plot. Here is the resulting Sankey plot:

As you can see in the plot above, we can move the mouse pointer over the different nodes to get some information. Note also that the data format for a Sankey plot using plotly is the same as when using networkD3.

Here is how we can format the data to create an interactive Sankey diagram with plotly:

```
nodes <- data.frame(name = unique(c(as.character(freq_table$personality),
                                    as.character(freq_table$field),
                                    as.character(freq_table$gender))))
links <- data.frame(source = match(freq_table$personality, nodes$name) - 1,
                    target = match(freq_table$field, nodes$name) - 1,
                    value = freq_table$n,
                    stringsAsFactors = FALSE)
links <- rbind(links,
               data.frame(source = match(freq_table$field, nodes$name) - 1,
                          target = match(freq_table$gender, nodes$name) - 1,
                          value = freq_table$n,
                          stringsAsFactors = FALSE))
```


In the code chunk above, we create a dataframe called "nodes". This dataframe includes all the unique values of personality, field, and gender from the "freq_table" dataframe. Then, we create another dataframe called "links" that connects the nodes in the Sankey plot. We do this by matching the personality and field values in the "freq_table" dataframe with their corresponding indices in the "nodes" dataframe. Here, we subtract 1 from each index to match the zero-based indexing that plotly and networkD3 expect (R itself uses one-based indexing). We also add the frequency count value to each link. Finally, we add another set of links to connect the field and gender nodes using a similar approach. See the example in which we transform data to use networkD3 to create a Sankey plot in R. Here are the dataframes:
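To make the structure concrete, here is a small sketch with made-up categories and counts (not the study data) showing what the resulting `nodes` and `links` objects look like:

```r
# Hypothetical freq_table with made-up categories and counts
freq_table <- data.frame(
  personality = c("Introvert", "Extrovert"),
  field       = c("Psychology", "Engineering"),
  gender      = c("Female", "Male"),
  n           = c(10, 7),
  stringsAsFactors = FALSE
)

# One node per distinct category across all three variables
nodes <- data.frame(name = unique(c(freq_table$personality,
                                    freq_table$field,
                                    freq_table$gender)))

# Two sets of links: personality -> field, then field -> gender
links <- rbind(
  data.frame(source = match(freq_table$personality, nodes$name) - 1,
             target = match(freq_table$field, nodes$name) - 1,
             value  = freq_table$n),
  data.frame(source = match(freq_table$field, nodes$name) - 1,
             target = match(freq_table$gender, nodes$name) - 1,
             value  = freq_table$n)
)

nodes
links
```

With two rows in `freq_table`, this yields six nodes and four links, each link's `value` carrying the frequency count.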

This blog post taught us how to create a Sankey plot in R using four packages: ggsankey, ggalluvial, networkD3, and plotly. Sankey plots are a great way to visualize flow or connections between different categories or groups.

To create a Sankey plot, we need to structure the data in a specific way: nodes representing the categories and links representing the flow between them. Therefore, we learned how to transform data into this format using packages like dplyr and tidyr.

We started with ggsankey and ggalluvial, both great options for creating static Sankey plots. Additionally, both packages offer various customization options, allowing us to tailor the plot to our needs.

Next, we looked at networkD3, a popular package for creating interactive Sankey plots. With networkD3, we can create dynamic plots that allow the user to explore the data in more detail.

Finally, we learned how to create Sankey plots using plotly, which offers even more interactivity options than networkD3. With plotly, we can create highly customized plots that are easy to share and embed in websites.

In conclusion, if you notice any errors or have suggestions for improvement, please comment and let us know! Finally, if you found this post helpful, please share it on social media to help others learn.

The post How to Create a Sankey Plot in R: 4 Methods appeared first on Erik Marsja.
