The post Python Scientific Notation & How to Suppress it in Pandas and NumPy appeared first on Erik Marsja.

In Python, it is possible to print numbers in scientific notation using built-in functions as well as NumPy. Specifically, you will learn how to use Python to print very large or very small floating-point numbers in scientific notation using three different methods. In the final two sections, before concluding this post, you will also learn how to suppress scientific notation in NumPy arrays and Pandas dataframes.

As previously mentioned, this post will show you how to print scientific notation in Python using three different methods. First, however, we will learn more about scientific notation. After this, we will have a look at the first example using the Python function `format()`. In the next example, we will use f-strings to represent scientific notation. In the third, and final, example, we will use NumPy. After these three examples, you will learn how to suppress standard index form in NumPy arrays and Pandas dataframes.

Now, to follow this post you need to have a working Python installation. Moreover, if you want to use f-strings you need Python 3.6 or higher. If you want to use the Python package NumPy to represent large or small floating-point numbers in scientific notation, you also need to install this package. In Python, you can install packages using pip:

`pip install numpy`

In the next section, we will learn more about scientific notation and, then, we will have a look at the first example using the `format()` function.

Scientific notation, also known as scientific form, standard index form, or standard form (in the UK), is used to represent numbers that are either too large or too small to be conveniently written in decimal form. For example, 0.00000001 is written as 1 × 10^-8, which Python displays as `1e-08`.
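To make the notation concrete, here is a minimal sketch using only built-in Python. It shows that e-notation literals are ordinary floats and that `repr()` switches to scientific notation on its own for very small or very large values. The variable names are made up for illustration.

```python
# e-notation literals are just another way of writing a float
small = 1e-8          # 1 x 10^-8
large = 2.5e6         # 2.5 x 10^6

print(small == 0.00000001)   # True
print(large == 2500000.0)    # True

# repr() picks scientific notation automatically for extreme magnitudes
small_repr = repr(0.00000001)
plain_repr = repr(0.0001)
large_repr = repr(1e16)
print(small_repr, plain_repr, large_repr)  # 1e-08 0.0001 1e+16
```

The methods in the following sections are useful when you want to control this formatting yourself rather than rely on the default.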

Here’s how to represent scientific notation in Python using the `format()` function:

`print(format(0.00000001,'.1E'))`

The built-in `format()` function converts a value to a string according to a format specification. In the code chunk above, the use of the `format()` function is pretty straightforward. The first argument was the (small) number we wanted to represent in scientific form in Python. The second argument was used to specify the formatting pattern. Specifically, E indicates exponential notation, which prints the value in scientific notation. Moreover, .1 is used to tell the `format()` function that we want one digit following the decimal point. Here are two working examples using Python to print large and small numbers in scientific notation:
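As a quick sketch of the two cases, the snippet below applies `format()` to one small and one large number. The values and variable names are illustrative assumptions, not taken from the original examples.

```python
# Small number: one digit after the decimal point, upper-case exponent marker
small = format(0.00000001, '.1E')
# Large number (an int works too; it is converted to float for 'E' formatting)
large = format(1000000000000000000, '.1E')
# Lower-case 'e' and three digits after the decimal point
rounded = format(123456789, '.3e')

print(small)    # 1.0E-08
print(large)    # 1.0E+18
print(rounded)  # 1.235e+08
```

Using a lower-case `e` in the format specification only changes the case of the exponent marker in the output.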

Now, if we want to format a string we can use the `format()` method like this:

`'A large value represented in scientific form in Python: {numb:.1E}'.format(numb=1000000000000000000)`

Notice how we used the curly braces where we wanted the scientific notation. Within the curly braces we added `numb` and then, again, .1E (for the same reason as previously). In the `format()` method, we then used numb again, as a keyword argument, to pass the number we wanted to print in standard index form in Python. In the next section, we will use Python’s f-strings to print numbers in standard index form.
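A minimal sketch of the same idea with several fields follows; the template text and the field names `small` and `large` are hypothetical. It uses `str.format()` with named replacement fields, each carrying its own format specification after the colon.

```python
template = 'Small: {small:.2e}, large: {large:.1E}'
message = template.format(small=0.000123, large=1000000000000000000)
print(message)  # Small: 1.23e-04, large: 1.0E+18

# Positional fields work the same way
positional = '{:.1E} and {:.1E}'.format(0.00000001, 5000000)
print(positional)  # 1.0E-08 and 5.0E+06
```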

Here’s another method you can use if you want to represent small numbers as scientific notation in Python:

`print(f'{0.00000001:.1E}')`

In this example, the syntax is fairly similar to the one we used in the previous example. Notice, however, how we used `f` prior to the single quotation marks. Within the curly braces, we put the decimal number we want to print in scientific form. Again, we use `.1E` in a similar way as above, to tell the f-string that we want the number formatted in scientific notation. Here are two examples in which we do the same for both small and large numbers:

Remember, f-strings can only be used if you have Python 3.6 or higher installed, and they will make your code a bit more readable compared to using the `format()` function. In the next example, we will use NumPy.
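The following sketch shows f-strings applied to a large and a small number stored in variables. The constants and variable names are chosen for illustration only. It also shows that the precision can itself be supplied from a variable through a nested replacement field.

```python
avogadro = 6.02214076e23   # large number
planck = 6.62607015e-34    # small number

large = f'{avogadro:.3e}'
small = f'{planck:.2E}'
print(large)  # 6.022e+23
print(small)  # 6.63E-34

# The precision itself can come from a variable (nested replacement field)
digits = 1
nested = f'{0.00000001:.{digits}E}'
print(nested)  # 1.0E-08
```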

Here’s how we can use NumPy to print numbers in scientific notation in Python:

```
import numpy as np
np.format_float_scientific(0.00000001, precision = 1, exp_digits=2)
```

In the code chunk above, we used the function `format_float_scientific()`. Here we used the `precision` parameter to specify the number of decimal digits and `exp_digits` to set the minimum number of digits in the exponent. Note, however, that NumPy will print large and small numbers in scientific form by default. In the next section, we will have a look at how we can suppress scientific notation in NumPy.

Here’s how we can suppress scientific form in Python NumPy arrays:

```
import numpy as np
# Suppressing scientific notation
np.set_printoptions(suppress=True)
# Creating a NumPy array with 10 small and 10 large numbers
np_array = np.array([np.random.normal(0, 0.0000001, 10),
                     np.random.normal(0, 1000000, 10)])
np_array
```

In this example, we first used the `set_printoptions()` function with the `suppress` parameter set to `True`. This makes NumPy print floating-point numbers in fixed-point notation instead of scientific notation; note that values that round to zero at the current print precision will be shown as zero. Second, we created a NumPy array with 10 small and 10 large numbers drawn from normal distributions.
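Outside NumPy, a comparable effect for a single float can be sketched with the fixed-point presentation type `f` in a format specification. This is an illustrative aside using only built-in Python; the chosen precisions are assumptions and should be adapted to the data.

```python
small = 0.00000001
large = 1e18

print(small)                 # 1e-08 (default repr uses scientific notation)

fixed_small = f'{small:.10f}'
fixed_large = f'{large:,.0f}'
print(fixed_small)           # 0.0000000100
print(fixed_large)           # 1,000,000,000,000,000,000
```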

In the next, and final example, we will have a look at how to suppress scientific notation in Pandas dataframes.

Here’s how we can use the set_option() method to suppress scientific notation in Pandas dataframe:

```
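import numpy as np
import pandas as pd
# Assumed option value: show floats with two decimals instead of scientific notation
pd.set_option('display.float_format', '{:.2f}'.format)
df = pd.DataFrame(np.random.randn(4, 2)*100000000,
                  columns=['A', 'B'])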
```

In the code chunk above, we used the Pandas `DataFrame()` constructor to convert a NumPy array to a dataframe. By default, this dataframe would be printed with the numbers in scientific form. Therefore, we used the `set_option()` function to suppress this. It is also worth noting that this sets the option globally for the whole session (e.g., in a Jupyter Notebook). There are other options as well, such as using the `round()` method.

In this post, you have learned how to use Python to represent numbers in scientific notation. Specifically, you have learned three different methods to print large and small numbers in scientific form. After these three examples, you have also learned how to suppress scientific notation in NumPy arrays and Pandas dataframes. Hope you learned something valuable. If you did, please leave a comment below and share the article on your social media channels. Finally, if you have any corrections or suggestions for this post (or any other post on the blog), please leave a comment below or use the contact form.

Here are some other useful Python tutorials:

- How to get Absolute Value in Python with abs() and Pandas
- Create a Correlation Matrix in Python with NumPy and Pandas
- How to do Descriptive Statistics in Python using Numpy
- Pipx: Installing, Uninstalling, & Upgrading Python Packages in Virtual Envs
- How to use Square Root, log, & Box-Cox Transformation in Python
- Pip Install Specific Version of a Python Package: 2 Steps

The post How to Create a Matrix in R with Examples – empty, zeros appeared first on Erik Marsja.

In this short tutorial, you will learn how to create a matrix in R. We will use the `matrix()` function, among two other functions, for this aim. Specifically, we will go into the details of this function as this will enable us to e.g. name the columns and rows in the matrix we create. That is, we will have a look at the different arguments of the `matrix()` function in R. In the next section, you will find the outline of the tutorial.

As previously mentioned, this post will cover the creation of a matrix in R by examples. In the first section, however, we will have a look at the different arguments of the `matrix()` function (the function we will use to create matrices). In the second section, we will answer the questions “what is a matrix in R?” and “how to create a matrix in R?”

After these two questions (and answers) we will continue with the first example on how to create a matrix in R. Here we will use the `matrix()` function. In the second example, we will set the `byrow` argument to fill the matrix by rows. After this, we will use the `cbind()` function to combine a couple of vectors into a matrix and, then, the `rbind()` function to accomplish the same (but with a different result). In these two examples, we will see how we can create a matrix from vectors in R. The following examples will show you how to create an empty matrix and how to name the rows and columns. In the final example, you will learn how to create a matrix of zeros in R.

In R, a matrix is **a collection of elements of the same data type (such as numeric, character, or logical). Moreover, these elements are arranged into a fixed number of rows and columns** (e.g., 3 rows and 3 columns, or 10 rows and 2 columns). This type of data is 2-dimensional.

A matrix can be created in R using the `matrix()` function. For example, the following code will produce a 3 by 3 matrix: `mtx <- matrix(3:11, nrow = 3, ncol = 3)`. Moreover, it is possible to combine vectors to create a matrix.

In the next section, you will get an overview of the `matrix()` function.

Here’s the general syntax of the matrix function:

```
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
dimnames = NULL)
```

As you can see, there are 5 parameters that you can use:

- *data* – this argument is optional and will contain e.g. a vector of your data (see the previous section or the following examples).
- *nrow* – this argument is used to set the number of rows you want (see the coming examples).
- *ncol* – this argument is like *nrow*, but for the number of columns you want in your created matrix.
- *byrow* – this argument controls whether the matrix is filled by rows (`TRUE`) or by columns (`FALSE`, the default).
- *dimnames* – this argument is used if you want to name the columns and rows (see the example on naming rows and columns later in the post).

In the next section, we will create our first matrix in R.

Here’s the first example of how to create a matrix:

```
mtx <- matrix(seq(1,9),
nrow = 3,
ncol = 3)
```

As you can see, in the code chunk above, we used the `seq()` function to create a sequence in R (i.e., a vector). Moreover, we used the `nrow` and `ncol` arguments to tell the `matrix()` function that we want to create a three by three matrix. Here’s what the matrix, called `mtx`, looks like:

In the next example, we will have a look at how to make a matrix with the `byrow` argument set to `TRUE`.

Here’s how we can create a matrix in R, from a sequence of numbers (i.e., a vector) and get the numbers by rows:

```
mtx <- matrix(seq(1,9),
nrow = 3,
ncol = 3,
byrow = TRUE)
```

Note that we can flip the order by transposing the matrix in R using the `t()` function (i.e., by typing `t(mtx)`). This would result in a matrix exactly like the one in the first example. Here’s a post for more information about transposing in R:

In the next example, we will have a look at how we can use `cbind()` to merge three vectors into a matrix.

Here’s how to use the `cbind()` function to produce a matrix in R:

`mtx <- cbind(seq(1,3), seq(4, 6), seq(7, 9))`

Note, this method may be more feasible when we happen to have data stored in vectors already. Here’s the created matrix:

As you can see in the image above, we get the exact same result as in the first example. In the next example, before returning to using the `matrix()` function, we will have a look at the `rbind()` function.

Here’s how to use the `rbind()` function to make a matrix from vectors in R:

`mtx <- rbind(seq(1,3), seq(4, 6), seq(7, 9))`

As with the `cbind()` method for creating a matrix, it might be more useful to create matrices this way when we have data in vectors to begin with. Here’s the created matrix:

Again, this method created the exact same matrix as in example 2. In the next example, we will have a look at how we can create an empty matrix in R.

Here’s how we can create an empty matrix in R:

`empty_matrix <- matrix(nrow = 4, ncol = 4)`

As you can see, we skipped the first argument (i.e., *data*) this time. However, we did create an empty 4 by 4 matrix (i.e., a matrix filled with `NA` values), and here’s the result:

It is now possible to fill this empty matrix with data (e.g., the results of calculations). In the next example, we will have a look at how we can name the rows and columns when creating the matrix.

Now, to name the rows and columns when creating a matrix we use the `dimnames` argument:

```
mtx <- matrix(seq(1,9),
nrow = 3,
ncol = 3,
dimnames = list(c("1", "2" ,"3"),
c("Vector", "List", "Matrix")))
```

In the code chunk above, we named the rows something very simple: 1, 2, and 3. Just for fun, we named the columns “Vector”, “List”, and “Matrix”.

In the next, and final example, you will learn how to generate a matrix of zeros in R.

Here’s how we can create a matrix of zeros in R:

```
mtx_zeros <- matrix(rep(0, 9),
ncol = 3,
nrow = 3)
```

In the code chunk above, we used the rep() function in R to repeat the number zero 9 times. This enabled us to generate a matrix populated by zeros. Here’s the result:

For more information, see the documentation of the different functions used in this tutorial:

- matrix() function
- rbind() and cbind() functions

Now that you have your data in your matrix you can, for example, calculate the five-number summary in R, create a violin plot in R, or make a scatter plot in R. Moreover, you can also convert the matrix to a data frame.

In this post, you have learned how to create a matrix in R. More specifically, you have learned by 7 examples. First, we had a quick look at the `matrix()` function and its syntax. Second, we answered some questions. Additionally, you have learned how to make a matrix by the examples. First, we just created a 3 x 3 matrix using a sequence of numbers and the `nrow` and `ncol` arguments. Second, we used the `byrow` argument to get a slightly different matrix. In the third and fourth examples, we used the `cbind()` and `rbind()` functions, respectively. This was followed by learning how to create an empty matrix in R, how to name the columns and rows when creating the matrix, and how to create a matrix of zeros. I hope you learned something. If you did, please share the post on your social media accounts and leave a comment below. Additionally, if there is something you want to be covered on the blog, leave a comment or contact me.

Here are some other blog posts that you may find useful:

- If you need to change the variables names you can rename columns in R with dplyr
- When your data is stored in lists you can convert a list to a data frame in R using the dplyr package
- How to Extract Year from Date in R with Examples
- How to Concatenate Two Columns (or More) in R – stringr, tidyr

Here are 4 examples in which you will learn how to convert a list to a dataframe in R.

The post How to Convert a List to a Dataframe in R – dplyr appeared first on Erik Marsja.

In this short tutorial, you will learn how to convert a list to a dataframe in R. Knowing how to convert lists to dataframes may be useful when, for example, you get your data from a source and it ends up in a list of e.g. vectors. Here’s an example code template that you can use to change a list of vectors to a dataframe in R:

```
# Converting list to dataframe in R
DF <- as.data.frame(YourList)
```

In the next section, you will get an overview of the outline of this post.

The outline of this post is as follows. First, you will get some information on what you need to follow this tutorial. Second, you will get to create example data to use in the rest of the post. Third, we will have a look at the first example of converting a list to a dataframe in R. Fourth, you will learn how to convert a list to a dataframe by a second example. After this, we will also have a look at how we can use the `do.call` function. This can be used to turn the rows into columns. Here it may be worth pointing out that this can also be done using the `t()` function. See the post “How to Transpose a Dataframe or Matrix in R with the t() Function” for more information. In the last example, you will learn how to use Tidyverse to change a list to a dataframe.

To follow this post, and all of its examples, you will need to have 1) a working R installation, 2) the dplyr package which is part of the Tidyverse package. On the other hand, if you only want to know how to create a dataframe from a list you can stick with base R. However, it is worth pointing out that dplyr can be used to add a column to the dataframe in R, remove duplicates, and count the number of occurrences in a column.

In the next example, we will create some sample data using some base R functions.

Here’s how we can create a list containing a couple of vectors:

```
Data <- list(A = seq(1,4),
B = c("A", "D", "G", "L"),
C = c("Best", "List", "Dataframe", "Rstats"))
```

We can also display our list (and the vectors) like this:

`Data`

In the two code chunks above, we used the `list()` function together with two other functions to create the list called `Data`. Here we used `c()` and `seq()` in R to generate vectors. Additionally, we named the different vectors in the list A, B, and C. Finally, we printed the list using the name of the list (i.e., Data). Here’s the generated list:

In the next section, you will learn how to convert the list to a dataframe.

In the first example, we are going to use R’s as.data.frame() function to convert the list to a dataframe:

`dataFrame <- as.data.frame(Data); dataFrame`

In the code chunk above, we simply used the above mentioned function using the list as the first argument. Here’s the dataframe that we generated from our list:

If your data is stored in a matrix, for example, it is possible to convert a matrix to a dataframe in R. In the next example, we are going to set the column names while converting the list to a dataframe.

Here’s how we simply add the col.names parameter to change the column names:

```
dataFrame <- as.data.frame(Data,
col.names = c("Numbers", "Letters", "Words"))
dataFrame
```

Noteworthy, it is also possible to name the rows by using the row.names argument. However, if you have a lot of observations/data points this might not be feasible. In the next example, we are going to have a look at how we can use the do.call function to accomplish the same but making the rows columns as well. In a recent post, you can learn how to create a matrix in R.

In this example, you will learn how to use the `as.data.frame`, `cbind`, and `do.call` functions to convert a list to a dataframe in R. Here’s a code snippet:

`as.data.frame(do.call(cbind, Data))`

As you can see, in the code chunk above, we used the `do.call` function as an argument in the `as.data.frame` function. Moreover, we used the `cbind` function as the first argument of `do.call` and, finally, the list we wanted to convert as the last argument. This will create a dataframe similar to the earlier ones (note that, because `cbind` builds a matrix from vectors of mixed types, all columns will be of type character). Now, you may wonder why we would like to do something like this. Well, we can use the `rbind` function instead of the `cbind` function. This will give us this result:

As you can see, we got the rows as columns. If you need to change the column names you can have a look at the post: How to Rename Column (or Columns) in R with dplyr

In the next, and final example, we will use dplyr to convert a list to a dataframe in R.

Here’s how we can convert a list to dataframe in R using dplyr:

```
library(dplyr)
dataFrame <- Data %>%
as_tibble()
dataFrame
```

In the code chunk above, there are some new things introduced. First, we used the piping operator (%>%). Following this operator we used the `as_tibble` function. This code chunk will create a dataframe called `dataFrame` by taking the list (Data) and passing it as an argument to the `as_tibble` function. That is, we need to have the input/argument (i.e., the data in the list) to the left of the piping operator.

In this post, you have learned how to convert a list to a dataframe in R. More specifically, you learned how to do this by 4 examples. First, we started out using the as.data.frame function on an example list. Second, we changed the column names using one of the arguments of the as.data.frame function when we converted the list. Third, we also had a look at how we can use the do.call function. In the final example, we used the dplyr package from the popular Tidyverse package. To conclude, the easiest way to convert a list to a dataframe in R is to either use the as.data.frame function or the as_tibble function from dplyr. Hope you learned something valuable. If you did, please leave a comment below and share the post on your social media accounts. Finally, if you want something covered on the blog, drop a comment below or use the contact information found here.

The post How to Create a Violin plot in R with ggplot2 and Customize it appeared first on Erik Marsja.

In this data visualization tutorial, we are going to learn how to make a violin plot in R using ggplot2. Now, there are several techniques for visualizing data (see for example the Python-related post “9 Data Visualization Techniques You Should Learn in Python”) that we can use to visualize our data in R. Briefly described, violin plots combine both a box plot and a histogram in the same figure. In the next section, after the table of contents, you will get a brief overview of the content of this blog post.

Before we get into the details on how to create a violin plot in R we will have a look at what you need to follow this data visualization tutorial. When we have what we need, we will answer a couple of questions (e.g., learn what a violin plot is). In the sections following this, we will get into the practical details. That is, we will learn how to create violin plots in R using ggplot2. Furthermore, we will also learn how to customize the plots. For example, you will learn how to show the plot horizontally, fill it with a color based on category, and add/change labels.

First of all, you need to have an active installation of R, obviously. Second, to use ggplot2 you need to install the package. Installing R packages can be done by using the `install.packages()` command:

`install.packages("ggplot2")`

Here it is worth pointing out that ggplot2 is part of the Tidyverse package. This means that you can install Tidyverse to get ggplot2 among a lot of other handy R packages. For example, you can use dplyr to rename a column in R, remove duplicates, and count the number of occurrences in a column. In the next section, we will get answers to some commonly asked questions.

As mentioned earlier in the post, a violin plot is a data visualization method combining box plots and histograms (more precisely, a smoothed density curve plays the role of the histogram). This type of plot will display the distribution, median, and interquartile range (IQR) of the data. The IQR and median are the statistical information shown in the box plot, whereas the distribution is displayed by the density curve.

A violin plot shows numerical data. Specifically, it will reveal the distribution shape and summary statistics of the numerical data. It can be used to explore data across different groups or variables in our datasets.

To make a violin plot in R you can use ggplot2 and the `geom_violin()` function. For example, if we have the dataframe dataF and want to create a violin plot of the two groups’ response times, we can use the following code: `p <- ggplot(dataF, aes(Group, RT)) + geom_violin()`.

In this post, we are going to work with fake data from a Psychology experiment. The dataset can be downloaded here and is fake data that could be obtained using e.g. a Flanker task created with OpenSesame. Here is how we can read the data into R using the `read.csv()` function:

```
data = 'https://raw.githubusercontent.com/marsja/jupyter/master/flanks.csv'
df <- read.csv(data)
head(df)
```

Note, you can also import data from sources other than CSV files:

- How to Read and Write Stata (.dta) Files in R with Haven
- How to Read & Write SPSS Files in R Statistical Environment

Here’s a quick overview of the dataframe in which we can see the first 6 rows of the columns:

If you already have your data in a list, you can convert a list to a dataframe in R. In the next code chunk, we will use some neat functions from the dplyr package, `group_by()` and `summarise_all()`, to calculate descriptive statistics in R:

`df %>% group_by(TrialType) %>% select(ACC, RT) %>% summarise_all(list(mean = mean, std = sd, min = min, max = max))`

Now, in the code above, we first used dplyr’s `group_by()` to group the data by trial type (i.e., the column TrialType). Second, we also used dplyr to select columns by name, using the `select()` function. Finally, we used the `summarise_all()` function (also from dplyr) together with `list()`. Here we calculate mean, standard deviation, min, and max. For more information about summary statistics in R see the following posts:

- Learn How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Calculate Five-Number Summary Statistics in R

In the next section, we will load the ggplot2 library and learn how to create a simple violin plot in R.

Here’s how to create a violin plot with the R package ggplot2:

`library(ggplot2)`

`p <- ggplot(df, aes(TrialType, RT)); p + geom_violin()`

In the code above, we first created a plot object with the `ggplot()` function. Here we used the `aes()` function as input. Moreover, we used the grouping column (i.e., TrialType) as the first argument and the dependent variable (response time) as the second. Next, we used the `geom_violin()` function. This will, in turn, create the violin plot layer. Here’s the resulting violin plot that we created using R and ggplot2:

In the next example, we will use the `coord_flip()` function to create a horizontal violin plot:

`p <- ggplot(df, aes(TrialType, RT)); p + geom_violin() + coord_flip()`

As you can see, in the code chunk above, we just added the function and this will result in this plot:

In the next section, we will continue by creating a violin plot using R and ggplot2 overlaying a boxplot.

Here is how we can display a violin plot in R and add the interquartile range and median by overlaying a boxplot:

`p <- ggplot(df, aes(TrialType, RT)); p + geom_violin() + geom_boxplot()`

As you can see, the only addition to the previous code is that we use the `geom_boxplot()` function as well. However, the created violin plot (see image above) can be better. For example, if we use the width argument we can get a better violin plot:

```
p <- ggplot(df, aes(TrialType, RT))
p + geom_violin() + geom_boxplot(width = .2)
```

In the next section, we will use the `draw_quantiles` argument in the `geom_violin()` function. We will see that we can use this to also display the 25th, 50th, and 75th quantiles, for example. In the examples after that, we are going to play around with the color and theme of the violin plots we have created with R.

Here’s how we use the *draw_quantiles* parameter to add quantiles to a violin plot:

```
p <- ggplot(df, aes(TrialType, RT))
p + geom_violin(draw_quantiles = c(.25, .50, .75))
```

As you can see, we get three lines in the violin plot now. In the next example, we are going to learn how to customize the violin plot we create in R using the *color* parameter.

Here is how we can change the color of a violin plot, in R:

```
p <- ggplot(df, aes(TrialType, RT, color = TrialType))
p + geom_violin() + geom_boxplot(width = .2)
```

In the code chunk above, we added one argument to the aes() function: the color argument. We can use this parameter if we want the lines of the violin plot to be different for the different groups (i.e., of different colors). Here is the resulting plot:

In the next example, we are going to fill the violin plot as well. This is easy, as you will see, and we just use the fill parameter.

Here is how you can change the fill color of a violin plot in R:

```
p <- ggplot(df, aes(TrialType, RT, fill = TrialType))
p + geom_violin() + geom_boxplot(width = .2)
```

In the code chunk above, we added a parameter: fill. Moreover, we used the TrialType (a categorical variable) column here so we fill the violin plots and box plots based on which trial type they belong to. Here is the resulting plot:

Here’s how we can change the labels of the violin plot we have created in R:

```
p + geom_violin() + geom_boxplot(width = .2) +
labs(
title = "Comparison of Response Time by Trial Type",
x = "Trial Type",
y = "Response time (ms.)"
)
```

In the code chunk above, we added the `labs()` function. In this function, we worked with a couple of parameters. First, we added a title by using the *title* parameter. In the next two rows, we changed the x- and y-titles. This may be useful when we want to present the data to other researchers (e.g., publishing our results) and our variables (in the dataset) have shortened names (such as RT). To learn more about customizing ggplot2 figures see e.g., the post How to Make a Scatter Plot in R with Ggplot2.

Now, before concluding this post it may be worth mentioning that there are plenty of other options (e.g., vioplot, violinplotter) that can be used to create violin plots in R. For example, here’s how to install and create a plot using the violinplotter package:

```
install.packages("violinplotter")
library(violinplotter)
violinplotter(RT ~ TrialType, data = df)
```

As you can see, in the code chunk above, we use a formula as the first parameter (the second is the dataframe). Here’s the resulting violin plot:

In the image above, we get some more information in the violin plot: the number of observations in each category, the standard deviation, standard error, and 95% confidence intervals.

In this post, you have learned how to make a violin plot in R. First, you learned what you need to create a violin plot. Second, you learned more about this data visualization technique. Third, you learned how to use ggplot2 to create a violin plot by a couple of examples. Specifically, you learned how to display the violin plot horizontally, to add a boxplot to the violin plot, to change the color and fill of the plot, and finally, how to change the labels on the plot. Hopefully, you have learned something. If you have any questions concerning the blog post, please drop a comment below. Moreover, if you have any suggestions on what I should cover on this blog, comment below.

The post How to Create a Violin plot in R with ggplot2 and Customize it appeared first on Erik Marsja.


The post Learn How to Convert Matrix to dataframe in R with base functions & tibble appeared first on Erik Marsja.

In this short tutorial, you will learn how to convert a matrix to a dataframe in R. Specifically, you will learn how to use base R and the package tibble to change the matrix to a dataframe. You will learn this task through 4 different examples (2 using each method).

This post is structured as follows. First, you will learn briefly about tibble and how to install this R package. After this, you will get the answer to the question “How do I convert a matrix to a dataframe in R”. In the next section, we will create a simple matrix. In the following sections of the blog post, this matrix will be converted to a dataframe in different examples. These examples will, hopefully, deepen your knowledge concerning converting matrices in R.

In the first example, we will use base R to convert the matrix. Subsequently, we will also add column names when converting the matrix to a dataframe.

In the third example, we will then use tibble and the function `as_tibble()` to change the matrix to a dataframe (i.e., a tibble object). Finally, we will also use tibble and `setNames()` when converting a matrix to a dataframe. In the next section, you will learn how to install tibble or Tidyverse.

Here’s how we can install tibble:

`install.packages("tibble")`

As usual, we use the `install.packages()` function and write the package name (i.e., “tibble”) within quotation marks. Note that we can also install the Tidyverse package, which contains tibble among other useful packages. We can, for example, use Tidyverse packages to remove duplicates and rename factor levels. Moreover, the package tibble can be used to add empty columns to the dataframe, add new columns to the dataframe, and much more.

To convert a matrix to a dataframe in R, you can use the `as.data.frame()` function. For example, to change the matrix named “mtx” to a dataframe you can use the following code: `df_m <- as.data.frame(mtx)`.

In the next section, we are going to create a matrix using the `matrix()` function.

Before we change a matrix to a dataframe, we will need to create a matrix. Here’s how we can create a matrix in R using the `matrix()` function:

`mtx <- matrix(seq(1, 15), nrow = 5)`

In the code above, we used the `seq()` function to generate a sequence of numbers (i.e., from 1 to 15). Moreover, we also created 5 rows, using the `nrow` argument. Here’s the resulting matrix:
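To make the shape of this matrix concrete, here is a minimal sketch that can be run in any R session. It shows that `matrix()` fills the values column by column unless told otherwise; the object name `mtx_by_row` is only used for illustration here and is not part of the examples in this post.

```r
# The same matrix as in the example: 15 values arranged in 5 rows
mtx <- matrix(seq(1, 15), nrow = 5)

# 5 rows and, therefore, 3 columns
dim(mtx)

# By default the values are filled column by column,
# so the first column holds 1 to 5
mtx[, 1]

# With byrow = TRUE the values are filled row by row instead,
# so the first row holds 1 to 3
mtx_by_row <- matrix(seq(1, 15), nrow = 5, byrow = TRUE)
mtx_by_row[1, ]
```

Whether we fill by column or by row does not matter for the conversion itself, but it does decide which values end up in which column of the dataframe.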

In the next section, we will have a look at the first example of converting the matrix we have created to a dataframe.

To convert a matrix to a dataframe in R we can use the `as.data.frame()` function:

`df_mtx <- as.data.frame(mtx)`

In the code above, we simply used the function (i.e., `as.data.frame()`) to create a dataframe from the matrix. Here’s the converted dataframe:

Now that we have converted the matrix to a dataframe we can use e.g. the `str()` function to look at the structure of the data:

As you can see, in the output above, we have 3 columns of the data type integer. This is, of course, expected (we created a sequence of numbers as a matrix). Notice how we have the column names V1 to V3. This is not that informative and there are a number of options here. First, we could name the columns in the matrix (or when creating the matrix). Second, we can rename the columns of the created dataframe. In this post, we will change the column names after we have converted the matrix.
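The first option mentioned above, naming the columns in the matrix itself, can be sketched as follows. This is only an illustration using base R; the names `df_named` and `mtx2` are made up for this sketch.

```r
mtx <- matrix(seq(1, 15), nrow = 5)

# Without column names, as.data.frame() creates V1, V2, and V3
names(as.data.frame(mtx))

# Option 1: name the columns of an existing matrix before converting it
colnames(mtx) <- c("A", "B", "C")
df_named <- as.data.frame(mtx)
names(df_named)

# Option 2: supply the column names when creating the matrix
mtx2 <- matrix(seq(1, 15), nrow = 5,
               dimnames = list(NULL, c("A", "B", "C")))
str(as.data.frame(mtx2))
```

In both cases the dataframe inherits the column names from the matrix, so no renaming is needed after the conversion.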

Now, after converting the matrix, using the `as.data.frame()` function, we can use the `colnames()` function:

```
df_mtx <- as.data.frame(mtx)
colnames(df_mtx) <- c("A", "B", "C")
```

In the code chunk above, we used the `colnames()` function and assigned a character vector. This character vector contained the three column names. Here’s the converted matrix (i.e., the dataframe):

It is also possible to convert a list to a dataframe in R. In the next example, we will continue by using an installed R package: tibble.

In this section, you will learn how to use another package for converting a matrix to a dataframe: tibble. Here’s how to transform a matrix to a dataframe in R using tibble:

```
library(tibble)
df_mtx <- mtx %>%
as_tibble()
```

As you probably notice, there is a difference in how we, now, use the function. Instead of adding the matrix within the parentheses, as in the previous two examples, we used the pipe operator (“%>%”). On the left side of the pipe operator we have the matrix and on the right side we have the function; the result is assigned to the new dataframe. Here’s the dataframe that we have created from the matrix:
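If the pipe is new to you, the following sketch may help. Note that it does not use tibble: it assumes R 4.1 or later and uses base R's native pipe (`|>`) together with `as.data.frame()` as stand-ins for `%>%` and `as_tibble()`. The point is only that piping and nesting are two ways of writing the same function call.

```r
mtx <- matrix(seq(1, 15), nrow = 5)

# Nested call: the matrix goes inside the parentheses
df_nested <- as.data.frame(mtx)

# Piped call: the matrix on the left is passed as the first argument
# to the function on the right (native pipe, assumes R >= 4.1)
df_piped <- mtx |> as.data.frame()

# Both ways of writing the call give the same dataframe
identical(df_nested, df_piped)
```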

Here are some blog posts about other useful operators:

- How to use %in% in R: 7 Example Uses of the Operator
- How to use $ in R: 6 Examples – list & dataframe (dollar sign operator)

Now, most of the time we would like to have better column names than what we get in this example. As previously mentioned, we could have set the column (and row) names when we created the matrix. However, if we already have a matrix without names, but we know the column names, we can use the setNames() function together with another pipe. This is what we will have a look at in the final example.

Here’s how we can convert a matrix to a dataframe and set the column names:

```
df_mtx <- mtx %>%
as_tibble() %>%
setNames(c("A", "B", "C"))
```


In the code chunk above, we used another pipe (see Example 3) and added the function setNames() to add the column names “A”, “B”, and “C”. Here’s the resulting dataframe:

As previously mentioned, tibble is part of the Tidyverse and this means that we could have used dplyr to rename the columns after we created the dataframe.
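As a small sketch of what `setNames()` does, here is the same idea without the pipe and with a base R dataframe instead of a tibble (an assumption made so that the code runs without any packages). The last lines rename a single column after the dataframe has been created, which is a base R stand-in for renaming with dplyr; the name `df_renamed` and the new column name "Third" are made up for this illustration.

```r
mtx <- matrix(seq(1, 15), nrow = 5)

# setNames() returns its first argument with the names replaced,
# which is why it fits well at the end of a pipe
df_mtx <- setNames(as.data.frame(mtx), c("A", "B", "C"))
names(df_mtx)

# Renaming one column after the dataframe has been created
df_renamed <- df_mtx
names(df_renamed)[names(df_renamed) == "C"] <- "Third"
names(df_renamed)
```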

In this post, we have converted a matrix to a dataframe in R. More specifically, we have learned how to carry out this task by following 4 different examples. In the first two examples, we used base R. In the final two examples, on the other hand, we used the Tidyverse package tibble. Whether we use base R or tibble to convert matrices to dataframes, we need to set the column names if the matrix we convert does not have column names. Hope you learned something valuable in this tutorial.

If you have anything you would like me to cover in a blog post (e.g., something you need to learn) please drop a comment below. For any suggestions or corrections, please drop a comment below, as well.


The post R Count the Number of Occurrences in a Column using dplyr appeared first on Erik Marsja.

In this R tutorial, you are going to learn how to count the number of occurrences in a column. Sometimes, before starting to analyze your data, it may be useful to know how many times a given value occurs in your variables. For example, when you have a limited set of possible values that you want to compare, you might want to know how many there are of each possible value before you carry out your analysis. Another example may be that you want to count the number of duplicate values in a column. Moreover, you may want to get an overview of, let us say, how many men and women you have in your data set. In Psychological science, for example, it is obligatory that you report the number of men and women in your research articles.

In this post, you will learn how to use the R function table() to count the number of occurrences in a column. Moreover, we will also use the function count() from the package dplyr. First, we start by installing dplyr and then we import example data from a CSV file. Second, we will start looking at the table() function and how we can use it to count distinct occurrences. Here we will also have a look at how we can calculate the relative frequencies of factor levels.

Third, we will have a look at the count() function from dplyr and how to count the number of times a value appears in a column in R. Finally, we will also have a look at how we can calculate the proportion of factor/characters/values in a column.

In the next section, you are going to learn how to install dplyr. Of course, if you prefer to use table() you can jump to this section, directly.

As you may already be aware, it is quite easy to install R packages. Here’s how you install dplyr using the install.packages() function:

`install.packages("dplyr")`

Note that dplyr is part of the Tidyverse package, which can be installed in the same way. Installing the Tidyverse package will install a number of very handy and useful R packages. For example, we can use dplyr to remove columns and remove duplicates in R. Moreover, we can use tibble to add a column to the dataframe in R and to convert a matrix to a dataframe in R. Finally, the package haven can be used to read an SPSS file in R. For more examples, and R tutorials, see the end of the post.

Before learning how to use R to count the number of occurrences in a column, we need some data. For this tutorial, we will read data from a CSV file found online:

`df <- read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/carData/Arrests.csv')`

This dataset contains details of persons who have been arrested, and in this tutorial we are going to have a look at the sex and age columns. First, the sex column classifies an individual’s gender as male or female. Second, the age column, of course, contains the age of each individual in the dataset. Let us have a quick look at the dataset:

Now, using the str() function we can see that we have 5226 observations across 9 columns. Moreover, we can see the data type of the 9 columns.

Here’s how to use the R function table() to count occurrences in a column:

`table(df['sex'])`

As you can see, we selected the column ‘sex’ using brackets (i.e., `df['sex']`) and used it as the only argument to the table() function. Here’s the result:

Note that it is also possible to use $ in R to select a single column. Now, as you can see in the image above, the function returns the count of each unique value in the given column (‘sex’ in our case). The counts are ordered by value (alphabetically for character data and by level for factors), not by size, and missing values are excluded by default. By glancing at the above output we see that there are more men than women in the dataset. In fact, the results show us that the vast majority are men.
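Because table() orders its result by value rather than by how often each value occurs, it may be handy to sort the counts when a column has many unique values. Here is a minimal sketch using a small, made-up vector (not the Arrests data) and the base R function sort():

```r
# Made-up example data: 4 men and 2 women
sex <- c("Male", "Female", "Male", "Male", "Female", "Male")

# table() lists the values alphabetically, so "Female" comes first
counts <- table(sex)
counts

# sort() orders the counts instead; here from most to least common
sorted_counts <- sort(counts, decreasing = TRUE)
sorted_counts

# The most common value
names(sorted_counts)[1]
```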

Note, both of the examples above will remove missing values. This, of course, means that they will not be counted at all. In some cases, however, we may want to know how many missing values there are in a column as well. In the next section, we will therefore have a look at an argument that we can use (i.e., useNA) to count unique values and missing values, in a column. First, however, we are going to add 10 missing values to the column sex:

```
df_nan <- df
df_nan$sex[c(12, 24, 41, 44, 54, 66, 77, 79, 91, 101)] <- NA
```

In the code above, we first used the column name (with the $ operator) and, then, used brackets to select rows. Finally, we assigned NA, R’s missing-value constant, to the rows that we selected. Note that assigning NaN to a character column would store the text “NaN” rather than a missing value. In the next section, we will count the occurrences including the 10 missing values that we just added to the dataframe.
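One detail worth checking when adding missing values to a text column is whether R actually treats them as missing. The sketch below uses a small, made-up character vector to compare NaN and NA; whether the sex column is stored as character or as a factor depends on your R version and on how the data was read, so treat this as an illustration for the character case only.

```r
# Made-up character vector
sex <- c("Male", "Female", "Male")

# Assigning NaN to a character vector stores the text "NaN"
with_nan <- sex
with_nan[1] <- NaN
with_nan
is.na(with_nan)

# Assigning NA gives a real missing value
with_na <- sex
with_na[1] <- NA
with_na
is.na(with_na)
```

This matters for the next example because arguments such as useNA only count values for which is.na() is true.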

Here’s a code snippet that you can use to get the number of unique values in a column as well as how many missing values:

```
df_nan <- df
df_nan$sex[c(12, 24, 41, 44, 54, 66, 77, 79, 91, 101)] <- NA
table(df_nan$sex, useNA = "ifany")
```

Now, as you can see in the code chunk above, we used the useNA argument. Here we added the character object “ifany” which will also count the missing values, if there are any. Here’s the output:

Now, we already knew that we had 10 missing values in this column. Of course, when we are dealing with collected data we may not know this, and this approach will let us know how many missing values there are in a specific column. In the next section, we will not count the number of times a value appears in a column in R. Instead, we will calculate the relative frequencies of unique values in a column.
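To see what the different settings of useNA do, here is a short sketch on made-up vectors. Besides “ifany”, the argument also accepts “no” (the default) and “always”; the vector names are only used for this illustration.

```r
# Made-up data with two missing values
sex <- c("Male", "Female", NA, "Male", NA)

# Default: missing values are dropped
table(sex)

# "ifany": missing values get their own category when there are any
table(sex, useNA = "ifany")

# For data without missing values, "ifany" adds nothing ...
complete <- c("Male", "Female")
table(complete, useNA = "ifany")

# ... whereas "always" reports the category even when the count is 0
table(complete, useNA = "always")

# Counting the missing values directly
sum(is.na(sex))
```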

Another thing we can do, now, when we know how to count unique values in a column in R’s dataframe is to calculate the relative frequencies of unique values. Here’s how we can calculate the relative frequencies of men and women in the dataset:

`table(df$sex)/length(df$sex)`

In the code chunk above, we used the table() function as in the first example. We added something to get the relative frequencies of the factors (i.e., men and women). In the example, above, we used the length() function to get the total number of observations. We used this to calculate the relative frequency. This may be useful if we not only want to count the occurrences but also want to know e.g. what percentage of the sample is male and what percentage is female.
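As a sketch of an alternative, base R also has the function prop.table(), which divides each count by the sum of the counts. The two approaches agree when there are no missing values, but they differ when there are, as the made-up vector below illustrates: length() also counts the missing value, so the relative frequencies no longer sum to 1.

```r
# Made-up data: 3 men, 1 woman, and 1 missing value
sex <- c("Male", "Female", "Male", "Male", NA)

# Dividing by length(): the missing value is part of the denominator
freq_length <- table(sex) / length(sex)
freq_length
sum(freq_length)

# prop.table(): proportions of the counted (non-missing) values
freq_prop <- prop.table(table(sex))
freq_prop
sum(freq_prop)
```

Which denominator is the right one depends on whether you want proportions of the whole sample or of the valid answers.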

Here’s how we can use R to count the number of occurrences in a column using the package dplyr:

```
library(dplyr)
df %>%
count(sex)
```


In the example, above, we used the %>% operator which enables us to use the count() function to get this beautiful output. Now, as you can see when we are counting the number of times a value appears in a column in R using dplyr we get a different output compared to when using table(). For another great operator, see the post about how to use the %in% operator in R.
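The main difference in the output is that count() returns a dataframe with one row per value, whereas table() returns a table object. If you want a similar dataframe without dplyr, a rough base R stand-in (not what count() does internally) is to convert the table. The sketch uses made-up data and the hypothetical name `counts_df`:

```r
# Made-up data: 3 men and 1 woman
sex <- c("Male", "Female", "Male", "Male")

# Naming the argument to table() gives the first column its name,
# and responseName sets the name of the column holding the counts
counts_df <- as.data.frame(table(sex = sex), responseName = "n")
counts_df
```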

In the next section, we are going to count the relative frequencies of factor levels. Again, we will use dplyr but this time we will use group_by(), summarise(), and mutate().

In this example, we are going to use three R functions (i.e., from the dplyr package). First, we use the piping operator, again, and then we group the data by a column. After we have grouped the data we count the unique occurrences in the column, we have selected. Finally, we are calculating the frequency of factor levels:

`df %>% group_by(sex) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))`

Using the code above, we get the count and the relative frequency of each value. What we did, in the code chunk above, was to group the data by the column containing gender information. We then summarized the data. Using the n() function we got the number of observations of each value. Finally, we calculated a new variable, called “Freq”. Here is where we calculate the frequencies. This gives us another nice output. Let us have a look at the output:

As you can see in the output, above, we get two columns next to the grouping column. This is because we added a new column to the summarized data: the frequencies. Of course, counting a column such as age, as we did in the previous example, would not provide any useful information. In the next section, we will have a look at how to use the R package dplyr to count unique occurrences in a column.

There are 53 unique values of age data, a mean of 23.84 and a standard deviation of 8.31. Therefore, counting the unique values of the age column would produce a lot of headaches. In the next example, we will have a look at how we can count age but getting a readable output by binning. This is useful if we want to count e.g. even more continuous data.

As previously mentioned, we can create bins and count the number of occurrences in each of these bins. Here’s an example code in which we get 5 bins:

```
df %>%
group_by(group = cut(age, breaks = seq(0, max(age), 11))) %>%
summarise(n = n())
```

In the code chunk above, we used the group_by() function, again (of course, after the %>% operator). In this function, we also created the groups (i.e., the bins). Here we used the seq() function that can be used to generate a sequence of numbers in R. Finally, we used the summarise() function to get the number of occurrences in the column, binned. Here’s the output:

For each bin, the range of age values is the same: 11 years. One bin contains ages from 11 to 22 (the lower limit is excluded and the upper limit is included). The next bin contains ages from 22 to 33. However, we also see that there is a different number of persons in each age range. This enables us to see that most people that are arrested are aged 22 or younger. Now this kind of makes sense, in this case, right?
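To see how the bins are formed, here is a small sketch using cut() on its own, with a handful of made-up ages and base R's table() in place of the dplyr functions. By default, cut() creates intervals that exclude the lower limit and include the upper limit, so an age of 22 ends up in the bin (11,22].

```r
# Made-up ages
ages <- c(12, 22, 23, 40, 66)

# Breaks at 0, 11, 22, ..., 66 give six bins that are 11 years wide
bins <- cut(ages, breaks = seq(0, 66, 11))
as.character(bins)

# table() also lists the empty bins ...
table(bins)

# ... which can be dropped to keep only the bins that contain data
table(droplevels(bins))
```

Note that values larger than the last break are not placed in any bin (they become NA), so the last break should be at least as large as the maximum value.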

In this post, you have learned how to use R to count the number of occurrences in a column. Specifically, you have learned how to count occurrences using the table() function and dplyr’s count() function. Moreover, you have learned how to calculate the relative frequency of factor levels in a column. Furthermore, you have learned how to count the number of occurrences in different bins, as well.

Here are a bunch of other tutorials you might find useful:

- How to Do the Brown-Forsythe Test in R: A Step-By-Step Example
- Select Columns in R by Name, Index, Letters, & Certain Words with dplyr
- How to Calculate Five-Number Summary Statistics in R
- How to Concatenate Two Columns (or More) in R – stringr, tidyr
- Learn How to Create a Violin plot in R with ggplot2 and Customize it


The post How to Do the Brown-Forsythe Test in R: A Step-By-Step Example appeared first on Erik Marsja.

In this tutorial, you will learn how to do the Brown-Forsythe test in R. This test is great as you can use it to test the assumption of homogeneity of variances, which is important for e.g. Analysis of Variance (ANOVA).

This post is structured as follows. First, we start by answering a couple of questions related to this test. Second, we learn about the hypotheses of the Brown-Forsythe test. This is followed by, maybe, the most important section: the 5 steps to performing the Brown-Forsythe test in R. Now, of course, it is possible to do it in fewer steps. Here’s how to carry out the test in three steps, one of which involves installing a package:

In this section, you will get some brief details on what this test is. As previously mentioned, the Brown-Forsythe test is used whenever we need to test the assumption of equal variances. Furthermore, it is a modification of Levene’s test, but the Brown-Forsythe test uses the median rather than the mean (which Levene’s test uses). The test is considered a robust test that is based on the absolute differences within each group from the group median. The Brown-Forsythe test is a suitable alternative to Bartlett’s Test for equal variances, as it is not sensitive to lack of normality and unequal sample sizes. For more information on how the Brown-Forsythe test works, see this article or the resources towards the end of the post.
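The description above can be turned into a few lines of base R. The following is only a sketch under stated assumptions: the data are made up, the names `dat`, `AbsDev`, and `bf_fit` are hypothetical, and the code simply follows the definition (a one-way ANOVA on the absolute deviations from each group's median) rather than reproducing any particular package. Because Brown and Forsythe also proposed a separate test for equal means, it can be worth confirming in a package's documentation which of the two a given function computes.

```r
# Made-up example data: three groups with different spread
dat <- data.frame(
  Group = factor(rep(c("A", "B", "C"), each = 6)),
  Response = c(10, 11, 9, 10, 12, 8,
               10, 14, 6, 11, 16, 3,
               10, 20, 1, 12, 25, -2)
)

# Step 1: absolute deviation of each observation from its group median
group_median <- ave(dat$Response, dat$Group, FUN = median)
dat$AbsDev <- abs(dat$Response - group_median)

# Step 2: one-way ANOVA on the absolute deviations
bf_fit <- anova(lm(AbsDev ~ Group, data = dat))
bf_fit

# The F statistic and p-value for the null hypothesis of equal variances
f_value <- bf_fit[["F value"]][1]
p_value <- bf_fit[["Pr(>F)"]][1]
```

A small p-value here suggests that the spread differs between the groups.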

You can perform the Brown-Forsythe test using the bf.test() function from the R package onewaytests. For example, bf.test(DV ~ IV, data=dataFrame) will perform the test on the dependent variable DV and the groups IV, in the dataframe dataFrame.

In the next section, you will learn the hypotheses of the Brown-Forsythe test. Knowing the hypotheses will make interpretation of the results easier.

When carrying out the Brown-Forsythe test using R we are testing the following two hypotheses:

- H0: The population variances are equal.
- HA: The population variances are not equal.

Therefore, as we will see when going through the example, we don’t want to reject the null hypothesis (H0). In the next section, you will get a brief overview of one of the R packages that can be used to perform the test.

Now, R is, as you may know, an open-source language. This means that there are probably more packages that make it possible, for us, to do the Brown-Forsythe test in R. In this post, however, we will only use one package: onewaytests.

The onewaytests package is focused on carrying out one-way tests. Using this package we can carry out one-way ANOVA, Welch’s heteroscedastic F test, Welch’s heteroscedastic F test with trimmed means and Winsorized variances, the Brown-Forsythe test, the Alexander-Govern test, and James second-order test, to name a few. The function bf.test() is, of course, of interest for this blog post.

We are now ready to carry out the Brown-Forsythe test in R.

Now, you may already know how to install R-packages but here’s how we install the onewaytests package:

`install.packages("onewaytests")`

Note, we are, in step three, also going to summarize data to calculate the variance for each group using dplyr. Moreover, we are going to import the example dataset using the readxl package. Both packages are part of the Tidyverse package. Therefore, to fully follow this post, install the Tidyverse package (or just dplyr and readxl, of course), as well:

`install.packages(c("onewaytests", "tidyverse"))`

The above code will install both onewaytests and Tidyverse. If you, on the other hand, only want to install dplyr and readxl (for reading Excel files) you can remove “tidyverse” and add “dplyr” and “readxl”. Just follow the syntax above. Now, Tidyverse comes with a lot of great packages. For example, you can use dplyr to rename columns and count the number of occurrences in a column, and stringr to merge two columns in R.

In the next step, we are going to use the readxl package to import the example dataset.

Here’s how we read an Excel file in R using the readxl package:

```
library(readxl)
dataFrame <- read_excel('brown-forsythe-test-in-R-example-data.xlsx')
```

Before going on to the next step, we can explore the data frame a bit. For example, we can get the first 6 rows:

`head(dataFrame)`

As we can see, there are only two variables in this example data. One of them is the column “Group”, in which we find the different treatment groups (“A”, “B”, and “C”). If we want to see the data types of the columns we can type this:

`str(dataFrame)`

Now, we see that Group is factor and Response is numeric (i.e., num). In the next section, we will have a visual look at the variance of Response in each group.

It is also possible to convert a matrix to a dataframe in R or convert a list to dataframe in R. If your data is stored in any of these two data types, of course.

As you may know, there are many different ways to visualize data in R. Here we will make use of the boxplot() function which will give us an idea of whether the variances are equal across the groups, or not. Here’s how to create a boxplot:

`boxplot(Response ~ Group, data = dataFrame)`

When inspecting the boxplots, it sure looks like the variances are different for the different treatment groups. We can also calculate the variance, by group, using dplyr:

```
library(dplyr)
dataFrame %>%
group_by(Group) %>%
summarize(Variance=var(Response))
```


Note, you can see the following two posts if you need to calculate other summary statistics as well:

- Learn How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Calculate Five-Number Summary Statistics in R

Now, judging from the image, above, it also looks like we have different variances in the different treatment groups. In the next step, however, we will use the bf.test() function to carry out the Brown-Forsythe test testing the null hypothesis that the variances are equal.
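The variance by group that we calculated with dplyr can also be obtained with base R. Here is a minimal sketch on made-up data (the dataframe `dat` and its values are assumptions for this illustration, not the example dataset from this post), using tapply() and aggregate():

```r
# Made-up data: three groups with clearly different spread
dat <- data.frame(
  Group = rep(c("A", "B", "C"), each = 4),
  Response = c(1, 2, 3, 4,
               2, 4, 6, 8,
               5, 5, 5, 5)
)

# Variance of Response within each group, as a named vector
group_var <- tapply(dat$Response, dat$Group, var)
group_var

# The same summary as a dataframe, using the formula interface
aggregate(Response ~ Group, data = dat, FUN = var)
```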

Here’s how you can perform the Brown-Forsythe Test in R:

```
library(onewaytests)
bf.test(Response ~ Group, data=dataFrame)
```


In the code chunk above, we used the bf.test() function (onewaytests package) to carry out the Brown-Forsythe test. Note how we used a formula as the first argument. This would be the exact same formula you would use performing ANOVA in R. Here’s the output from the function:

In the next section, we will learn how to interpret the results from the test.

Interpreting the Brown-Forsythe test is quite simple. Just remember that we had the null hypothesis that the variances are equal across the groups. Therefore, if the p-value is under 0.05, we reject the null hypothesis and conclude that the data do not meet the assumption of homogeneity of variances.

In our example, the null hypothesis is rejected. However, if the p-value had been above 0.05 we would not have rejected the null hypothesis. In that case, we could have gone on and carried out e.g. one-way ANOVA.

If your data is violating the assumption of homogeneity but is normally distributed you should carry on with Welch’s ANOVA, which also can be carried out in R.

In this blog post, you have learned how to carry out the Brown-Forsythe test of homogeneity of variances in R. Specifically, you have learned, step-by-step, how to carry out this test. First, you learned how to install an R package enabling the Brown-Forsythe test in R. Second, you imported example data and, third, explored the data. Finally, you learned how to carry out the test using the bf.test() function. Now there are probably other packages and functions that enable us to carry out this test of equal variances. Please leave a comment below, if you know any other packages or functions that we can use to do the Brown-Forsythe test in R. You are, of course, also welcome to suggest what I should cover in future blog posts, correct any mistakes in my blog posts, or just let me know if you found the post useful. That is, I encourage you to comment below!

Here are some references and resources that you might find useful on the topic:

- Morton B. Brown & Alan B. Forsythe (1974) Robust Tests for the Equality of Variances, Journal of the American Statistical Association, 69:346, 364-367, DOI: 10.1080/01621459.1974.10482955
- Tests for equality of variances between two samples which contain both paired observations and independent observations (pdf)

Here are some other blog posts, found on this blog, that you might find useful.

- How to use $ in R: 6 Examples – list & dataframe (dollar sign operator)
- Learn How to use %in% in R: 7 Example Uses of the Operator

- How to Add a Column to a Dataframe in R with tibble & dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Remove a Column in R using dplyr (by name and index)
- R Count the Number of Occurrences in a Column using dplyr
- How to Add an Empty Column to a Dataframe in R (with tibble)


The post How to Concatenate Two Columns (or More) in R – stringr, tidyr appeared first on Erik Marsja.

In this guide you will learn how to concatenate two columns in R. In fact, you will learn how to merge multiple columns in R using base R (e.g., using the paste function) and Tidyverse (e.g. using `str_c()` and `unite()`). In the final section of this post, you will learn which function is the best to use when combining columns.

If you have some experience using dataframe (or in this case tibble) objects in R and you’re ready to learn how to combine data found in them, then this tutorial will help you do precisely that.

Knowing how to do this may prove useful when you have a dataframe containing information, in two columns, and you want to combine these two columns into one using R. For example, you might have a column containing first names and last names. In this case, you may want to concatenate these two columns into one e.g. called Names.

You can follow along with the examples in this tutorial using the interactive Jupyter Notebook found towards the end of the tutorial. Here’s the example data that we use to learn how to combine two, or more, columns to one variable.

In this post, you will learn, by example, how to concatenate two columns in R. As you will see, we will use R’s $ operator to select the columns we want to combine. The outline of the post is as follows. First, you will learn what you need to have to follow the tutorial. Second, you will get a quick answer on how to merge two columns. After this, you will learn a couple of examples using 1) `paste()`, 2) `str_c()`, and 3) `unite()`. In the final section of this tutorial, you will learn which method I prefer and why. That is, you will get my opinion on why I like the `unite()` function. In the next section, you will learn about the requirements of this post.

If you prefer to use base R you don’t need more than a working R installation. However, if you are going to use either str_c() or unite() you need to have at least one of the packages stringr or tidyr. It is worth pointing out, here, that both of these packages are part of the Tidyverse package. This package contains multiple useful R packages that can be used for reading data, visualizing data (e.g., scatter plots with ggplot2), extracting year from date in R, adding new columns, among other things. Installing an R package is simple, here’s how you install Tidyverse:

`install.packages("tidyverse")`

Note, if you want to install stringr or tidyr just exchange “tidyverse” for e.g. “stringr”. In the next section, you will get a quick answer, without any details, on how to concatenate two columns in R.

To concatenate two columns you can use the `paste()` function. For example, if you want to combine the two columns *A* and *B* in the dataframe *df* you can use the following code: `df['AB'] <- paste(df$A, df$B)`. Note, however, that using `paste()` will result in whitespace between the values in the new column.

Before we are going to have a more detailed look at how to use paste() to combine two columns, we are going to load an example dataset.

Here’s how to read a .xlsx file in R using the readxl package:

```
# Importing Example Data:
library('readxl')
dataf <- read_excel("combine_columns_in_R.xlsx")
```

Now, we can have a look at the structure of the imported data using the `str()` function:

We will also have a quick look at the first few rows using the `head()` function:

Now, in the images above we can see that there are 5 variables and 7 observations. That is, there are 5 columns and 7 rows, in the tibble. Moreover, we can see the types of the variables and we can, of course, also use the column names. In the next section, we are going to start by concatenating the month and year columns using the paste() function.

- R Count the Number of Occurrences in a Column using dplyr
- How to Create a Matrix in R with Examples – empty, zeros

Here’s one of the simplest way to combine two columns in R using the `paste()`

: function:

Code language: R (r)`dataf$MY <- paste(dataf$Month, dataf$Year)`

In the code above, we used $ in R to 1) create a new column but, as well, selecting the two columns we wanted to combine into one. Here’s the tibble with the new column, named *MY*:

In the next example, we will merge two columns and add a hyphen (“-”), as well. For more useful operators, and how to use them, see for example the post “How to use %in% in R: 7 Example Uses of the Operator”.

Now, to add “-” (hyphen) between the values we want to combine we add a third parameter to the `paste()` function:

`dataf$MY <- paste(dataf$Month, "-", dataf$Year)`

Code language: R (r)

In the code example above, we passed “-” as a third argument to paste(), between the two columns. As you can see, in the image below, we get whitespaces around the hyphen, because paste() separates all of its arguments with a space by default.

Now, using R’s `paste()` function we can add another parameter: the sep parameter. Here’s a code example combining the two columns, adding the “-” without the whitespaces:

`dataf$MY <- paste(dataf$Month, dataf$Year, sep= "-")`

Code language: R (r)

Notice that instead of pasting the hyphen we used it as a separator. Before moving on to the next example, it is worth pointing out that if we don’t want any separator at all we can use the `paste0()` function instead. This way, we don’t need the sep parameter. In the next example, we are going to have a look at how to combine multiple columns (i.e., three or more) in R.

As you may have understood, combining more than 2 columns is as simple as adding a parameter to the `paste()` function. Here’s how we combine three columns in R:

`dataf$DMY <- paste(dataf$Date, dataf$Month, dataf$Year)`

Code language: R (r)

That was also pretty simple. It is worth mentioning that if you use the sep parameter, in a case as above, you will end up with whatever character you chose between each value from each column. For example, if we were to add the sep argument to the code above and put underscore (“_”) as a separator here’s what the resulting tibble would look like:

Now, you may understand that using the sep parameter enables you to use almost any character to separate your combined values. In the next section, we will have a look at the str_c() function from the stringr package.

Combining two columns with the str_c() function is super simple. Here’s how to merge the columns “Snake” and “Size” using the str_c() function:

```
library(stringr)
dataf$SnakeNSize <- str_c(dataf$Snake," ", dataf$Size)
```

Code language: R (r)

Notice that we added something in between the two columns we wanted to concatenate? When working with this function, we need to do this, or else we end up with nothing separating the two values that we are combining. As previously mentioned, the stringr package is part of the Tidyverse packages which also includes packages such as tidyr and the unite() function. In the next section, we are going to merge two columns in R using the unite() function as well.

- You may also like: How to Add a Column to a Dataframe in R with tibble & dplyr

Here’s how we concatenate two, or more, columns using the unite() function:

```
library(tidyverse) # or library(tidyr)
dataf <- dataf %>%
unite("DM", Date:Month)
```

Code language: R (r)

Notice something in the code above. First, we used a new operator (i.e., %>%). Among a lot of things, this enables us to use unite() without the $ operator to select the columns. As you can see, in the code example above, we used two parameters. First, we name the new column we want to add (“DM”), second we select all the columns from “Date” to “Month” and combine them into the new column. Here’s the resulting dataframe/tibble:

Now, as you can see in the image above, both columns that we combined have disappeared. If we want to keep the original columns after we have concatenated them we can set the remove parameter to FALSE. Here’s a code chunk that you can use, instead, to not remove the columns:

```
dataf <- dataf %>%
unite("DM", Date:Month, remove = FALSE)
```

Code language: R (r)

Finally, did you notice how we have an underscore as a separator? If we want to change to another separator we can use the sep parameter. This is exactly what we will do in the next example:

Here’s how we use the unite() function together with the sep parameter to change the separator to “-” (hyphen):

```
dataf <- dataf %>%
unite("DM", Date:Month, sep= "-",
remove = FALSE)
```

Code language: R (r)

That was as simple as the previous example, right? In the next section, you will learn which function I prefer to use and why.

Naturally, this section will contain my opinion. I have not done any optimization testing (e.g., I don’t know which function is the fastest when it comes to combining columns in R). That said, although all of the functions used in this post are simple to use I prefer the unite() function. Why? Well, together with the piping operator I think it makes the code very readable. It is, as well, very handy to use unite() if you are going to concatenate multiple columns in R. As you may have noticed, in the examples above, we can use “:” when combining columns. This means that we can merge multiple columns from the first column (i.e., left of the colon) to the last column (i.e., right of the “:”). This is pretty neat and will definitely save some space in your code and make it easier to read!

Another neat thing is that we add the new column name as a parameter and we, automatically, get rid of the columns combined (if we don’t need them, later, of course). Finally, we can also set the na.rm parameter to TRUE if we want missing values to be removed before combining values. Here’s a Jupyter Notebook with all the code in this post.

In this post, you have learned how to concatenate two (or more) columns in R using three different functions. First, we used the paste() function from base R. Using this function, we combined two and three columns and changed the separator from whitespace to a hyphen (“-”). Second, we used the str_c() function to merge columns. Third, we used the unite() function. Of course, it is possible (we saw some examples of that) to change the separator using the two last functions as well. To conclude, the unite() function seems to be the handiest function to use to concatenate columns in R.

Hope you learned something! If you did, please leave a comment below, share on your social media, include a link to the post on your projects (e.g., blog posts, articles, reports), or become a Patreon:

Finally, if you have any suggestions, other comments, or there is something you wish me to cover: don’t hesitate to contact me.

- How to Calculate Five-Number Summary Statistics in R
- Learn How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Rename Column (or Columns) in R with dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Add an Empty Column to a Dataframe in R (with tibble)

The post How to Concatenate Two Columns (or More) in R – stringr, tidyr appeared first on Erik Marsja.

In this short tutorial, you will learn how to find the five-number summary statistics in R. Specifically, in this post we will calculate:

- Minimum
- Lower-hinge
- Median
- Upper-hinge
- Maximum

Now, we will also visualize the five-number summary statistics using a boxplot. First, we will learn how to calculate each of the five summary statistics separately and then how we can use one single function to get all of them directly.

To follow this R tutorial you will need to have readxl and ggplot2 installed. The easiest way to install these two R packages is to use the `install.packages()` function:

`install.packages(c("readxl", "ggplot2"))`

Code language: R (r)

Note, both these two packages are part of the Tidyverse. This means that you get them, as well as a lot of other packages when installing Tidyverse. For example, you can use packages such as dplyr to rename columns, remove columns in R, merge two columns, and select columns, as well.

Before getting to the 6 steps to finding the five-number summary statistics using R we will get the answer to some questions, however.

As you may have understood, the five-number summary statistics are 1) the minimum, 2) the lower-hinge, 3) the median, 4) the upper-hinge, and 5) the maximum. The five-number summary is a quick way to explore your dataset.

The absolutely easiest way to find the five number summary statistics in R is to use the `fivenum()` function. For example, if you have a vector of numbers called “A” you can run the following code: `fivenum(A)` to get the five number summary.

Now that we know what the five-number summary is we can go on and learn the simple steps to calculate the 5 summary statistics.

In this section, we are ready to go through the 6 simple steps to calculate the five-number statistics using the R statistical environment. To recap: the first step is to import the dataset (e.g., from an xlsx file). Second, we calculate the min value, and then, in the third step, get the lower-hinge. In the fourth step, we get the median. In the fifth step we get the upper-hinge and, then, in the sixth, and final step, we get the max value.

Here’s how to read a .xlsx file in R using the readxl package:

```
library(readxl)
dataf <- read_excel("play_data.xlsx", sheet = "play_data",
col_types = c("skip", "numeric",
"text","text", "numeric",
"numeric", "numeric"))
head(dataf)
```

Code language: R (r)

We can see that in this example dataset there’s only one column containing numerical data (i.e., the column RT). In the next step, we will take the minimum of this column. Note, it is also possible to create a matrix in R (in which you can store your data).

Here’s how to get the minimum value in a column in R:

`min.rt <- min(dataf$RT, na.rm = TRUE)`

Code language: R (r)

Notice how we used the `min()` function with the dataframe and the column (i.e., RT) as the first argument. The second argument we set to TRUE because we have some missing values in the column. Finally, we used the $ operator in R to select a column. If we, on the other hand, were using dplyr we could use the select() function. That said, let’s move on and get the lower-hinge.

Here’s how we get the lower-hinge:

```
# Lower Hinge:
RT <- sort(dataf$RT)
lower.rt <- RT[1:round(length(RT)/2)]
lower.h.rt <- median(lower.rt)
```

Code language: R (r)

Notice how we started by selecting only response times (i.e. the RT column) and sorted the values. Second, we get the lower part of the response times and, then, we get the lower-hinge by calculating the median of this vector.

To calculate the median we can use the `median()` function:

```
# Median
median.rt <- median(dataf$RT, na.rm = TRUE)
```

Code language: R (r)

Again, we used the `na.rm` argument (`TRUE`) because there are some missing values in the dataset. Of course, if your data doesn’t have any missing values you can leave this argument out.

Here’s how to get the upper-hinge:

```
# Upper Hinge
RT <- sort(dataf$RT)
upper.rt <- RT[round((length(RT)/2)+1):length(RT)]
upper.h.rt <- median(upper.rt)
```

Code language: R (r)

Similar to when we got the lower-hinge, we first sorted the RT column. Then, we get the upper half and calculate the median of it.

We can get the maximum by using the `max()` function:

```
# Max
max.rt <- max(dataf$RT, na.rm = TRUE)
```

Code language: R (r)

Again, we selected the RT-column using the dollar sign operator and we removed the missing values. Here’s the output:

Note that the lower- and upper-hinge are the same as the first and third quartile when the sample size is odd. If this is the case, an easier way to get the lower- and upper-hinge is to use the `quantile()` function. In the example data above, however, we had an even number of observations (leaving out the missing values). If you need to combine two variables, in your dataset, into one make sure to check this post out:

In this section, we are going to put everything together so we get a somewhat nicer output:

```
fivenumber <- cbind(min.rt, lower.h.rt,
median.rt, upper.h.rt,
max.rt)
colnames(fivenumber) <- c("Min", "Lower-hinge",
"Median", "Upper-hinge", "Max")
fivenumber
```

Code language: R (r)

As you can see in the above code chunk, we used the `cbind()` function to combine the different objects into one. Then, we give the combined object better column names. In the next section, we are going to see that there already is a function that can calculate the five-number statistics in R in one line of code, basically.

Here’s how to find the five-number summary statistics in R with the `fivenum()` function:

```
# Five summary with R's fivenum()
fivenum(dataf$RT)
```

Code language: R (r)

Pretty simple. We just selected the column containing our data. Again, we used the $ operator to get the RT column and applied the `fivenum()` function to it. Note that the `fivenum()` function removes any missing values by default.

As you can see in the output above, we don’t get any column names but the five-number summary statistics are ordered as follows: min, lower-hinge, median, upper-hinge, and max. We can see that we get the same values as in the 6 step method:

In the next section, we are going to create a boxplot displaying the five-number summary statistics in R.

Here’s how we can visualize Tukey’s 5 number summary statistics in R using a boxplot:

```
library(ggplot2)
df <- data.frame(
x = 1,
ymin = fivenumber[1],
Lower = fivenumber[2],
Median = fivenumber[3],
Upper = fivenumber[4],
ymax = fivenumber[5]
)
ggplot(df, aes(x)) +
geom_boxplot(aes(ymin=ymin, lower=Lower,
middle=Median, upper=Upper, ymax=ymax),
stat = "identity") +
scale_y_continuous(breaks=seq(0.2,0.8, 0.05)) +
# Style the plot a bit
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
# After this is just to annotate the plot and can be removed
# Min
geom_segment(aes(x = 1, y = ymin, xend = 0.95, yend = ymin), data = df) +
annotate("text", x = 0.93, y = df$ymin, label = "Min") +
# Lower-hinge
geom_segment(aes(x = 0.60, y = Lower, xend = 0.60, yend = Lower-0.05), data = df) +
annotate("text", x = 0.60, y = df$Lower-0.06, label = "Lower-hinge") +
# Median
annotate("text", x = 1, y = df$Median + .012, label = "Median") +
# Upper-hinge
geom_segment(aes(x = 1.40, y = Upper, xend = 1.40, yend = Upper+0.05), data = df) +
annotate("text", x = 1.40, y = df$Upper+0.06, label = "Upper-hinge") +
# Max
geom_segment(aes(x = 1, y = ymax, xend = 1.05, yend = ymax), data = df) +
annotate("text", x = 1.07, y = df$ymax, label = "Max")
```

Code language: R (r)

We are not getting into details in the example above. However, we did create a dataframe from the first object we created and then we used `ggplot()` and `geom_boxplot()` to create the boxplot. Notice how we used the `aes()` function and set the different values found in the dataframe as arguments. Here ymin and ymax are the minimum and maximum values, respectively. Note we also changed the number of ticks on the y-axis. Here we used the seq() function to generate a sequence of numbers. The plot is somewhat styled and the code for drawing segments (lines) and adding text can be skipped, of course, if you just want to visualize the five summary statistics in R.

More data visualization tutorials:

In this post, you have learned 2 ways to get the five summary statistics in R: 1) min, 2) lower-hinge, 3) median, 4) upper-hinge, and 5) max. In the first method, we calculated each of these summary statistics separately. Furthermore, we have also learned how to use the handy fivenum() function to get the same values. In the final section, we created a boxplot from the five summary statistics. Hope you have learned something valuable. If you did, please link to the blog post in your projects and reports, share on your social media accounts, and/or drop a comment below.

Here are some other tutorials that you may find useful:

- How to Take Absolute Value in R – vector, matrix, & data frame
- Learn How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Extract Year from Date in R with Examples
- Get the Absolute Value in R – from a vector, a matrix, & a data frame
- How to Rename Factor Levels in R using levels() and dplyr
- Learn How to Remove Duplicates in R – Rows and Columns (dplyr)
- How to Add a Column to a Dataframe in R with tibble & dplyr

The post How to Calculate Five-Number Summary Statistics in R appeared first on Erik Marsja.

In this Python data visualization tutorial, we are going to learn how to create a violin plot using Matplotlib and Seaborn. Now, there are several techniques for visualizing data (see the post 9 Data Visualization Techniques You Should Learn in Python for some examples) that we can carry out. Violin plots combine the box plot and the histogram. In the next section, you will get a brief overview of the content of this blog post.

Before we get into the details on how to create a violin plot in Python we will have a look at what is needed to follow this Python data visualization tutorial. When we have what we need, we will answer a couple of questions (e.g., learn what a violin plot is). In the following sections, we will get into the practical parts. That is, we will learn how to use 1) Matplotlib and 2) Seaborn to create a violin plot in Python.

First of all, you need to have Python 3 installed to follow this post. Second, to use both Matplotlib and Seaborn you need to install these two excellent Python packages. Now, you can install Python packages using either pip or conda, the latter if you have the Anaconda (or Miniconda) Python distribution. Note, Seaborn requires that Matplotlib is installed so if you, for example, want to try both packages to create violin plots in Python you can type `pip install seaborn`. This will install Seaborn and Matplotlib along with other dependencies (e.g., NumPy and SciPy). Oh, we are also going to read the example data using Pandas. Pandas can, of course, also be installed using pip.

As previously mentioned, a violin plot is a data visualization technique that combines a box plot and a histogram. This type of plot will therefore show us the distribution, median, and interquartile range (IQR) of the data. Specifically, the IQR and median are the statistical information shown in the box plot whereas the distribution is displayed by the histogram (or, more precisely, a smoothed kernel density estimate).

A violin plot shows numerical data. Specifically, it will reveal the distribution shape and summary statistics of the numerical data. It can be used to explore data across different groups or variables in our datasets.
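To make these summary statistics concrete, here is a minimal sketch using only Python’s standard library. The response times are made up for illustration (they are not from the example dataset), and the choice of the `inclusive` quantile method is an assumption; other methods can give slightly different quartiles.

```python
import statistics

# Hypothetical response times in milliseconds (illustrative values only)
rts = [310, 325, 340, 355, 370, 385, 400, 415, 430]

# The median marks the centre of the distribution
median_rt = statistics.median(rts)

# quantiles() with n=4 returns the three cut points: Q1, the median, and Q3
q1, q2, q3 = statistics.quantiles(rts, n=4, method="inclusive")

# The interquartile range (IQR) is the distance between Q1 and Q3
iqr = q3 - q1

print(median_rt, q1, q3, iqr)
```

A violin plot draws these same quantities on top of the estimated distribution, which is why it is often more informative than a box plot alone.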

In this post, we are going to work with a fake dataset. This dataset can be downloaded here and is data from a Flanker task created with OpenSesame. Of course, the experiment was never actually run to collect the current data. Here’s how we read a CSV file with Pandas:

```
import pandas as pd
data = 'https://raw.githubusercontent.com/marsja/jupyter/master/flanks.csv'
df = pd.read_csv(data, index_col=0)
df.head()
```

Code language: Python (python)

Now, we can calculate descriptive statistics in Python using Pandas `describe()`:

`df.loc[:, 'TrialType':'ACC'].groupby(by='TrialType').describe()`

Code language: Python (python)

Now, in the code above we used loc to slice the Pandas dataframe. We did this because we did not want to calculate summary statistics on the SubID. Furthermore, we used Pandas groupby to group the data by condition (i.e., “TrialType”). Now that we have some data we will continue exploring the data by creating a violin plot using 1) Matplotlib and 2) Seaborn.
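As an alternative to slicing with loc, we can select the column of interest after grouping. The sketch below uses a small, made-up dataframe (the name `toy` and its values are hypothetical, only the column names mirror the example data) so that it runs without downloading anything:

```python
import pandas as pd

# Hypothetical mini dataset (illustrative values, not the Flanker data)
toy = pd.DataFrame({
    "SubID": [1, 1, 2, 2, 3, 3],
    "TrialType": ["congruent", "incongruent"] * 3,
    "RT": [400, 520, 500, 610, 600, 700],
})

# Selecting "RT" after groupby() leaves SubID out of the summary
summary = toy.groupby("TrialType")["RT"].describe()

# Show a few of the descriptive statistics per condition
print(summary[["count", "mean", "50%"]])
```

Here, the “50%” column is the median, so this table gives a first hint of what the violin plots will look like for each condition.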

Here’s how to create a violin plot with the Python package Matplotlib:

```
import matplotlib.pyplot as plt
plt.violinplot(df['RT'])
```

Code language: Python (python)

In the code above, we used the `violinplot()` method and passed the response times as the only argument. That is, we selected only the response time (i.e. the “RT” column) from the dataframe using the brackets. Now, as we know there are two conditions in the dataset and, therefore, we should create one violin plot for each condition. In the next example, we are going to subset the data and create violin plots, using matplotlib, for each condition.

One way to create a violin plot for the different conditions (grouped) is to subset the data:

```
# Subsetting using Pandas query():
congruent = df.query('TrialType == "congruent"')['RT']
incongruent = df.query('TrialType == "incongruent"')['RT']
fig, ax = plt.subplots()
inc = ax.violinplot(incongruent)
con = ax.violinplot(congruent)
fig.tight_layout()
```

Code language: Python (python)

Now we can see that there is some overlap in the distributions but they seem a bit different. Furthermore, we can see that the IQR is a bit different, especially at the tops. However, the plot does not tell us which color represents which condition. From the descriptive statistics earlier, we can assume that the blue one is incongruent. We also know this because that is the first one we created.
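One way to remove the guesswork about the colors is to set them ourselves and add a legend. The following is a minimal sketch, not the method used in this post: the data are randomly generated stand-ins for the two conditions, and the colors are arbitrary choices. It relies on the fact that `violinplot()` returns a dictionary in which the `'bodies'` key holds the filled areas of the violins.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np

# Randomly generated stand-ins for the two conditions (not the real data)
rng = np.random.default_rng(42)
incongruent = rng.normal(0.55, 0.08, 200)
congruent = rng.normal(0.48, 0.07, 200)

fig, ax = plt.subplots()
handles = []
groups = [(incongruent, "Incongruent", "tab:blue"),
          (congruent, "Congruent", "tab:orange")]
for values, label, color in groups:
    parts = ax.violinplot(values)
    # 'bodies' holds the filled violin shapes; color them explicitly
    for body in parts["bodies"]:
        body.set_facecolor(color)
        body.set_edgecolor(color)
    # A proxy patch gives the legend an entry for this condition
    handles.append(mpatches.Patch(color=color, label=label))

ax.legend(handles=handles)
ax.set_ylabel("RT")
```

With the legend in place, we no longer have to remember the order in which the violins were drawn.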

We can make this plot easier to read by using some more methods. In the next code chunk, we are going to create a list of the data and then add tick labels to the plot as well as set (two) ticks on the x-axis.

```
# Combine data
plot_data = list([incongruent, congruent])
fig, ax = plt.subplots()
xticklabels = ['Incongruent', 'Congruent']
ax.set_xticks([1, 2])
ax.set_xticklabels(xticklabels)
ax.violinplot(plot_data)
```

Code language: Python (python)

Notice how we now get the violin plots side by side instead. In the next example, we are going to add the median to the plot using the `showmedians` parameter.

Here’s how we can show the median in the violin plots we create with the Python library matplotlib:

```
fig, ax = plt.subplots()
xticklabels = ['Incongruent', 'Congruent']
ax.set_xticks([1, 2])
ax.set_xticklabels(xticklabels)
ax.violinplot(plot_data, showmedians=True)
```

Code language: Python (python)
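Besides the median, Matplotlib can also draw lines at other quantiles, for example the first and third quartile. Here is a small sketch with randomly generated data standing in for the two conditions. Note that it assumes Matplotlib 3.2 or later, where the `quantiles` parameter of `violinplot()` is available:

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated stand-ins for the two conditions (not the real data)
rng = np.random.default_rng(1)
plot_data = [rng.normal(550, 80, 200), rng.normal(480, 70, 200)]

fig, ax = plt.subplots()
# One list of quantiles per violin: here the 1st and 3rd quartile
parts = ax.violinplot(
    plot_data,
    showmedians=True,
    quantiles=[[0.25, 0.75], [0.25, 0.75]],
)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Incongruent", "Congruent"])
```

This way, the plot shows the IQR as two extra horizontal lines in each violin, in addition to the median line.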

In the next section, we will start working with Seaborn to create a violin plot in Python. This package is built as a wrapper to Matplotlib and is a bit easier to work with. First, we will start by creating a simple violin plot (the same as the first example using Matplotlib). Second, we will create grouped violin plots, as well.

Here’s how we can create a violin plot in Python using Seaborn:

```
import seaborn as sns
sns.violinplot(y='RT', data=df)
```

Code language: Python (python)

In the code chunk above, we imported seaborn as sns. This enables us to use a range of methods and, in this case, we created a violin plot with Seaborn. Notice how we set the first parameter to be the dependent variable and the second to be our Pandas dataframe.

Again, we know that there are two conditions and, therefore, in the next example we will use the `x` parameter to create violin plots for each group (i.e. conditions).

To create a grouped violin plot in Python with Seaborn we can use the `x` parameter:

```
sns.violinplot(y='RT', x="TrialType",
data=df)
```

Code language: Python (python)

Now, this violin plot is easier to read compared to the one we created using Matplotlib. We get a violin plot, for each group/condition, side by side with axis labels. All this by using a single Python method! If we have further categories we can also use the `split` parameter to get KDEs for each category split. Let’s see how we do that in the next section.

Here’s how we can use the `split` parameter, and set it to `True` to get a KDE for each level of a category:

```
sns.violinplot(y='RT', x="TrialType", split=True, hue='ACC',
data=df)
```

Code language: Python (python)

In the next and final example, we are going to create a horizontal violin plot in Python with Seaborn and the `orient` parameter.

Here’s how we use the `orient` parameter to get a horizontal violin plot with Seaborn:

```
sns.violinplot(y='TrialType', x="RT", orient='h',
data=df)
```

Code language: Python (python)

Notice how we also flipped the `y` and `x` parameters. That is, we now have the dependent variable (“RT”) as the `x` parameter. If we want to save a plot, whether created with Matplotlib or Seaborn, we might want to e.g. change the Seaborn plot size and add or change the title and labels. Here’s a code example customizing a Seaborn violin plot:

```
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.gcf()
# Change seaborn plot size
fig.set_size_inches(10, 8)
# Increase font size
sns.set(font_scale=1.5)
# Create the violin plot
sns.violinplot(y='RT', x='TrialType',
data=df)
# Change Axis labels:
plt.xlabel('Condition')
plt.ylabel('Response Time (MSec)')
plt.title('Violin Plot Created in Python')
```

Code language: Python (python)

In the above code chunk, we have a fully working example creating a violin plot in Python using Seaborn and Matplotlib. Now, we start by importing the needed packages. After that, we get the current figure with plt.gcf() (a new figure is created if none exists). In the next code lines, we change the size of 1) the plot, and 2) the font. Now, we are creating the violin plot and, then, we change the x- and y-axis labels. Finally, the title is added to the plot.
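Saving the figure itself can be sketched as follows. This is a minimal example under a few assumptions: the data are randomly generated, the file name `violin_plot.png` and the temporary directory are hypothetical choices, and the plot is drawn with Matplotlib only. The same `savefig()` call should work for a Seaborn plot as well, since Seaborn draws onto a Matplotlib figure.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

# Randomly generated response times (not the real data)
rng = np.random.default_rng(7)
rt = rng.normal(500, 75, 200)

# Set the figure size when creating the figure
fig, ax = plt.subplots(figsize=(10, 8))
ax.violinplot(rt, showmedians=True)
ax.set_ylabel("Response Time (ms)")
ax.set_title("Violin Plot Created in Python")

# Hypothetical output path; the temp directory keeps the sketch tidy
out_path = os.path.join(tempfile.gettempdir(), "violin_plot.png")

# dpi controls the resolution, bbox_inches="tight" trims extra whitespace
fig.savefig(out_path, dpi=300, bbox_inches="tight")
```

If a vector format is needed, changing the file extension to e.g. “.pdf” or “.svg” is usually enough, as Matplotlib picks the format from the file name.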

For more data visualization tutorials:

- How to Plot a Histogram with Pandas in 3 Simple Steps
- 9 Python Data Visualization Examples (Video)
- How to Make a Scatter Plot in Python using Seaborn
- Seaborn Line Plots: A Detailed Guide with Examples (Multiple Lines)

In this post, you have learned how to make a violin plot in Python using the packages Matplotlib and Seaborn. First, you learned a bit about what a violin plot is and, then, how to create both single and grouped violin plots in Python with 1) Matplotlib and 2) Seaborn.

The post How to Make a Violin plot in Python using Matplotlib and Seaborn appeared first on Erik Marsja.

In this R tutorial, you will learn how to work with $ in R. First, we will have a look at a couple of examples for a list object and then for a dataframe object.

The post How to use $ in R: 6 Examples – list & dataframe (dollar sign operator) appeared first on Erik Marsja.

In this very short tutorial, you will learn by example how to use the operator $ in R. First, we will learn what the $ operator does by getting the answer to some frequently asked questions. Second, we will work with a list that we create, and use the dollar sign operator to both select and add a variable. Here you will also learn about the downsides of using $ in R as well as the alternatives that you can use. In the following section, we will also work with a dataframe. Both sections will involve creating the list and the dataframe.

To follow this post you need a working installation of the R statistical environment, of course. If you want to read the example Excel file you will also need the readxl package.

The $ operator can be used to select a variable/column, to assign new values to a variable/column, or to add a new variable/column in an R object. This R operator can be used on e.g. lists and dataframes. For example, if we want to print the values in the column “A” in the dataframe called “dollar” we can use the following code: `print(dollar$A)`.

First of all, using brackets enables us to e.g. select multiple columns (e.g., `dollar[c('A', 'B')]`) whereas the $ operator only enables us to select one column.

Before we go on to the next section, we will create a list using the list() function.

```
dollar <- list(A = rep('A', 5), B = rep('B', 5),
'Life Expectancy' = c(10, 9, 8, 10, 2))
```

Code language: R (r)

In the next section, we will, then, work with the $ operator to 1) add a new variable to the list, and 2) print a variable in the list. In the third example, we will learn how to use $ in R to select a variable whose name contains whitespace.

Here we will start learning, by examples, how to work with the $ operator in R, using the list we created above.

Here’s how to use $ in R to add a new variable to a list:

`dollar$Sequence <- seq(1, 5)`

Code language: R (r)

Notice how we used the name of the list, then the $ operator, and the assignment (“<-”) operator. On the right side of <- we used the seq() function to generate a sequence of numbers in R. This sequence of numbers was added to the list. Here’s our example list with the new variable:

In the next example, we will use the $ operator to print the values of the new variable that we added.

Here’s how we can use $ in R to select a variable in a list:

`dollar$Sequence`

Code language: R (r)

Again, we used the list name, and the $ operator to print the new column we previously added:

Note that if you want to select two, or more, columns you have to use brackets and put in the column names as a character vector. Another option to select columns is, of course, using the `select()` function from the excellent package dplyr.

You might also be interested in: How to use %in% in R: 7 Example Uses of the Operator

Here’s how we can print, or select, a variable with white space in the name:

``dollar$`Life Expectancy` ``

Code language: R (r)

Notice how we used the ` in the code above. This way, we can select, or add values, even though the variable contains white space. I would, however, suggest that you rename the column (or replace the white spaces). See the recent post to learn how to rename columns in R. Again, using brackets, in this case, would be the same as when the variable is not containing white spaces.

In the next section, we will use the same examples above but on a dataframe. First, however, we will read an .xlsx file in R using the readxl package.

```
library(readxl)
dataf <- read_excel('example_sheets.xlsx',
skip=2)
```

Code language: R (r)

Note that we used the skip argument to skip the first two rows. In the example data (download here), the column names are on the third row. We can print the first rows of the dataframe using the `head()` function:

Here we can see that there are 5 columns. In the next section, we will use the $ operator on this dataframe.

In the first example, we will add a new column to the dataframe. After this, we will select the new column and print it using the $ operator. Finally, we will also add a new example on how to use this operator: to remove a column.

Here’s how we can use $ to add a new column in R:

`dataf$NewData <- rep('A', length(dataf$ID))`

Code language: R (r)

Notice how we used R’s rep() function to generate a vector containing the letter ‘A’. It is important that we generate a vector of the same length as the number of rows in our dataframe. Therefore, we used the length() function as the second argument.

Now, if you want to learn easier ways to add a column in R check the following posts:

- How to Add a Column to a Dataframe in R with tibble & dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Add an Empty Column to a Dataframe in R (with tibble)

In the next example, we are going to select this column using the $ operator and print it.

Here’s how we select and print the values in the column we created:

`dataf$NewData`

Code language: R (r)

Notice, to select, and print the values, of a column in a dataframe we used R’s $ operator the same way as we used it when we worked with a list. Here’s the output of the code above:

Now, it is easier to use the R package dplyr to select certain columns in R compared to using the $ operator. Another option is, of course, to use the double brackets.

In the next example, we are going to drop a column from the dataframe.

Here’s how we can delete a column using the $ operator and the NULL object:

`dataf$NewData <- NULL`


Again, we can use the R package dplyr to remove columns. More specifically, we can make use of the select() function to delete multiple columns in a quick and easy way.

Note, that example 3 will also work if we have a column containing white spaces in our dataframe. Finally, before concluding this post, we will have a quick look on how to use brackets to select a column:

`dataf['ID']`


Notice how we used the column name of the variable we wanted to select. This, again, will work on a list as well.
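To make the difference between the selection methods a bit more concrete, here is a small sketch on a made-up dataframe (`toy` is a hypothetical name, not the example data). Single brackets keep the dataframe structure, whereas double brackets and `$` extract the column itself:

```r
# Hypothetical toy dataframe, used only for illustration
toy <- data.frame(ID = 1:3, Group = c('a', 'b', 'c'))

single <- toy['ID']    # single brackets keep the dataframe structure
double <- toy[['ID']]  # double brackets extract the column as a vector
dollar <- toy$ID       # $ also extracts the column as a vector

is.data.frame(single)      # TRUE
identical(double, dollar)  # TRUE
```

This matters if the next step expects a vector (e.g., `mean()`) rather than a one-column dataframe.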

In this post, you have learned, by examples, how to use $ in R. First, we worked with a list to add a new variable and select a variable. Then, we used the same methods on a dataframe. As a bonus, we also had a look at how to remove a column using the $ operator. Hope you learned something. If you did please share the post in your work, on your social media accounts, or link back to it in your own blog posts. If you have any comments or suggestions to the post please leave a comment below.

The post How to use $ in R: 6 Examples – list & dataframe (dollar sign operator) appeared first on Erik Marsja.

The post How to Rename Column (or Columns) in R with dplyr appeared first on Erik Marsja.

In this data science tutorial, you will learn how to rename a column (or multiple columns) in R using base functions as well as dplyr. Renaming columns in R is a very easy task, especially using the `rename()` function. Now, renaming a column with dplyr and the `rename()` function is super simple. But, of course, it is not super hard to change the column names using base R as well.

Now, there are some cases in which you need to get rid of strange column names such as “x1”, “x2”, “x3”. If we encounter data such as this, cleaning up the names of the variables in our dataframes may be required and will definitely make our work more readable. This is especially important when we are working together with others or sharing our data with others. It is also very important that the columns have clear names if we plan to make the data open in a repository.

The outline of the post is as follows. First, you will learn about the requirements of this post. After you know what you need to follow this tutorial, you will get the answer to two questions. In the section following the FAQs, we will load an example data set to work with. Here we will read an Excel file using the readxl package. When we have successfully imported data into R, we can start changing the names of the columns. First, we will use a couple of techniques that can be done using base R. Second, we will work with dplyr. Specifically, in this section we will use the rename-family functions to change the names of some of the variables in the dataframe. That is, we will use `rename()` and `rename_with()`.

Now, before going on to the next section it is worth mentioning that we can use dplyr to select columns as well as remove columns in R.

To follow this post you need to have R installed as well as the packages readxl and dplyr. If you want to install the two packages you can use the `install.packages()` function. Here’s how to install readxl and dplyr: `install.packages(c('dplyr', 'readxl'))`.

It is worth pointing out, here, that both these packages are part of the Tidyverse. This means that you can install them, along with a bunch of other great packages, by typing `install.packages('tidyverse')`.

You can rename a column in R in many ways. For example, if you want to rename the column called “A” to “B” you can use this code: `names(dataframe)[names(dataframe) == "A"] <- "B"`. This way you changed the column name to “B”.

To rename a column in R you can use the `rename()` function from dplyr. For example, if you want to rename the column “A” to “B”, again, you can run the following code: `rename(dataframe, B = A)`.

That was it, we are getting ready to practice how to change the column names in R. First, however, we need some data that we can practice on. In the next section, we are going to import data by reading a .xlsx file.

Here’s how we can read a .xlsx file in R with the readxl package:

```
library(readxl)
titanic_df <- read_excel('titanic.xlsx')
```


In the code chunk above, we started by loading the library readxl and then we used the `read_excel()` function to read the titanic.xlsx file. Here are the first 6 rows of this dataframe:

In the next section, we will start by using the base functionality to rename a column in R.

Here’s how to rename a single column with base R:

`names(titanic_df)[1] <- 'P_Class'`


In the code chunk above, we used the `names()` function to assign a new name to the first column in the dataframe. Specifically, using the `names()` function we get all the column names in the dataframe and then we select the first column using the brackets. Finally, we assigned the new column name using `<-` and the character ‘P_Class’ (the new name). Note, you can, of course, rename multiple columns in the dataframe using the same method as above. Just change what you put within the brackets. For example, if you want to rename columns 1 to 5 you can put `1:5` within the brackets and then assign a character vector with 5 column names.
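Renaming a range of columns by index can be sketched as follows. The dataframe `toy` and its column names are made up for illustration; only the first two columns are renamed here:

```r
# Hypothetical toy dataframe with uninformative column names
toy <- data.frame(x1 = 1:3,
                  x2 = c('a', 'b', 'c'),
                  x3 = c(TRUE, FALSE, TRUE))

# Rename columns 1 to 2; the vector on the right must have 2 names
names(toy)[1:2] <- c('ID', 'Group')

names(toy)
```

The third column keeps its old name because its index was not included within the brackets.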

In the next example, we are going to use the old column name, instead, to rename the column.

Here’s how to change the column name by using the old name when selecting it:

`names(titanic_df)[names(titanic_df) == 'P_Class'] <- 'PCLASS'`

In the code chunk above, we did something quite similar to the first method. However, here we selected the column we previously renamed by its name. This is what we do within the brackets. Notice how we, again, used `names()` there, together with `==`, to select the column named “P_Class”. Here’s the output (new column name marked with red):

In the next example, you will learn how to rename multiple columns using base R. In fact, we are going to rename all columns in the dataframe.

Renaming all columns can be done in a similar way as the last example. Here’s how we change all the columns in the R dataframe:

```
names(titanic_df) <- c('PC', 'SURV', 'NAM', 'Gender', 'Age', 'SiblingsSPouses',
'ParentChildren', 'Tick', 'Cost', 'Cab', 'Embarked',
'Boat', 'Body', 'Home')
```


Notice how we only used `names()` in the code above. Here it’s worth knowing that the character vector (right of the `<-`) should contain as many elements as there are columns. Otherwise, one or more columns will be named “NA”. Moreover, you need to know the order of the columns. In the next few examples, we are going to work with dplyr and the rename-family of functions.
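One way to guard against a mismatch in length is to check it before assigning the names. This is just a sketch on a made-up dataframe (`toy` and `new_names` are hypothetical names), not something the example data requires:

```r
# Hypothetical toy dataframe with three columns
toy <- data.frame(x1 = 1:3, x2 = 4:6, x3 = 7:9)
new_names <- c('A', 'B', 'C')

# Stop with an error if the number of names does not match the
# number of columns, so that no column ends up named NA
stopifnot(length(new_names) == ncol(toy))

names(toy) <- new_names
```

If `new_names` had contained only two elements, `stopifnot()` would have raised an error before any column was renamed.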

You might also be interested in: How to use $ in R: 6 Examples – list & dataframe

Renaming a column in dplyr is quite simple. Here’s how to change a column name:

`library(dplyr); titanic_df <- titanic_df %>% rename(pc_class = PC)`

In the code chunk above, there are some new things that we work with. First, dplyr needs to be loaded with `library(dplyr)`. Second, we are changing the name in the dataframe using the `rename()` function. Notice how we use the %>% operator. This is very handy because the functions we use after this will be applied to the dataframe to the left of the operator. Third, we use the `rename()` function with one argument: the column we want to rename.

Remember, we renamed all of the columns in the previous example. In the code chunk above, we are actually changing the column back again. That is, to the left of = we have the new column name, and to the right, the old name. As you will see in the next example, we can rename multiple columns in the dataframe by adding arguments.

It may be worth mentioning that we can use dplyr to rename factor levels in R, and to add a column to a dataframe. In the next section, however, we are going to rename multiple columns in R with dplyr.

If we, on the other hand, want to change the name of multiple columns we can do as follows:

`titanic_df <- titanic_df %>% rename(Survival = SURV, Name = NAM, Sibsp = SiblingsSPouses)`

It was quite simple to change the name of multiple columns using dplyr’s `rename()` function. As you can see, in the code chunk above, we just added each column that we wanted to change the name of. Again, the name to the right of the equal sign is the old column name. Here are the first 6 columns and rows of the dataframe with new column names marked with **red**:

In the following sections, we will work with the `rename_with()` function. This is a great function that enables us to, as you will see, change the column names to upper or lower case.

Here’s how we can use the `rename_with()` function (dplyr) to change all the column names to lowercase:

`titanic_df <- titanic_df %>% rename_with(tolower)`

In the code chunk above, we used the `rename_with()` function and then the `tolower()` function. This function was applied on all the column names and the resulting dataframe looks like this:

In the next example, we are going to change the column names to uppercase using the `rename_with()` function together with the `toupper()` function.

In this section, we will just change the function that we use as the only argument in `rename_with()`. This will enable us to change all the column names to uppercase:

`titanic_df <- titanic_df %>% rename_with(toupper)`

Here are the first 6 rows where all the column names now are in uppercase:

In the next section, we are going to continue working with the rename_with() function and see how we can use other functions to clean the column names from unwanted characters. For example, we can use the gsub() function to remove punctuation from column names.

In some cases, our column names may contain characters that we don’t really need. Here’s how to use `rename_with()` from dplyr together with `gsub()` to remove punctuation from all the column names in the R dataframe:

```
titanic_df <- titanic_df %>%
rename_with(~ gsub('[[:punct:]]', '', .x))
```


Notice how we added the tilde sign (~) before the `gsub()` function. The tilde creates an anonymous function in which `.x` stands for the column names. Moreover, the first argument is the regular expression for punctuation and the second is what we want to replace it with. In our case, here, we just remove it from the column names. We could, however, add an underscore (“_”) if we want to replace the punctuation in the column names. Finally, if we wanted to replace specific characters we could add them as well, instead of the regular expression for punctuation.
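Here is a minimal sketch of the underscore variant. The dataframe `toy` and its column names are made up for illustration (the Titanic data is not needed to run it), and it assumes dplyr 1.0.0 or later, where `rename_with()` is available:

```r
library(dplyr)

# Hypothetical column names containing punctuation;
# check.names = FALSE stops data.frame() from altering them
toy <- data.frame(`first.name` = c('Ann', 'Bo'),
                  `last-name` = c('Lee', 'Ray'),
                  check.names = FALSE)

# Replace each punctuation character with an underscore
toy <- toy %>%
  rename_with(~ gsub('[[:punct:]]', '_', .x))

names(toy)
```

Keep in mind that the underscore itself counts as punctuation, so column names that already contain underscores are left looking the same.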

Now that you have renamed the columns that needed a better and clearer name you can continue with your data pre-processing. For example, you can add a column to the dataframe based on other columns with dplyr, calculate descriptive statistics (also with dplyr), take the absolute value in your R dataframe, or remove duplicate rows or columns in the dataframe.

In this tutorial, you have learned how to rename columns using base R as well as dplyr. First, you learned how to use the base R functions to change the name of a single column based on its index and name. Second, you learned how to do the same with dplyr and the rename function. Here we also renamed multiple columns as well as removed punctuation from the column names. Hope you found the post useful. If you did, please share it on your social media accounts and link to it in your projects. Finally, if you have any corrections to this particular post, or suggestions for what should be covered on this blog in general, please let me know.

The post How to Take Absolute Value in R – vector, matrix, & data frame appeared first on Erik Marsja.

In this data science tutorial, you will learn how to get the absolute value in R. Specifically, you will learn how to get the absolute value using the built-in function abs(). As you may already suspect, using `abs()` is very easy and to take the absolute value from e.g. a vector you can type `abs(YourVector)`. Furthermore, you will learn how to take the absolute value of both a matrix and a data frame. In the next section, you will get a brief overview of what is covered in this R tutorial.

The structure of the post is as follows. First, we will get the answer to a couple of simple questions. Note, most of them might actually be enough for you to understand how to get the absolute value using the R statistical programming environment. After this, you will learn what you need to know and have installed in your R environment to follow this post. Third, we will start by going into a more detailed example on how to take the absolute value of a vector in R. This section is followed by how to use the abs() function, again, on a matrix containing negative values. Finally, we will also have a look at how to take the absolute values in a data frame in R. This section will also use some of the functions of the dplyr (Tidyverse) package.

The absolute value in R is the non-negative *value* of x. To be clear, the absolute value in R is no different from the absolute value in any other programming language as this has something to do with mathematics rather than a programming language. In the next FAQ, you will learn how to use the `abs()` function to get absolute values of e.g. a vector.

To change the negative numbers to positive in R we can use the `abs()` function. For example, if we have the vector `x` containing negative numbers, we can change them to positive numbers by typing `abs(x)` in R.

Now that we have some basic understanding of how to change negative numbers to positive, by taking their absolute values, we can go ahead and have a look at what we need to follow this tutorial. That is, in the next section you will learn about the requirements of this post.

First of all, if you already have R installed you will also have the function abs() installed. However, if you want to use some functionality of the dplyr package (as in the later examples) you will also need to install dplyr (or Tidyverse). Moreover, if you want to read an .xlsx file in R with the readxl package you need to install it, as well. Here it might be worth pointing out that dplyr contains a lot of great functions. For example, you can use dplyr to remove columns in R as well as to select columns by e.g. name or index.

To install dplyr you can use the `install.packages()` function. For example, to install the packages dplyr and readxl you type `install.packages(c("dplyr", "readxl"))`. Note, you can replace “dplyr” and “readxl” with “tidyverse” if you want to install all these packages, as they are both part of the Tidyverse. In the next section, you will get the first example of how to take absolute value in R using the `abs()` function.

Here’s how to take the absolute value from a vector in R:

```
# Creating a vector with negative values
negVec <- seq(-0.1, -1.1, by=-.1)
# R absolute value from vector
abs(negVec)
```


In the code chunk above, we first created a sequence of numbers in R with the `seq()` function. As you may understand, all the numbers we generated were negative. In the second line of code, therefore, we used the `abs()` function to take the absolute value of the vector. Here’s the output in which all the negative numbers are now positive:

In the next example, we are going to create a matrix filled with negative numbers and get the absolute values from the matrix.

If we, on the other hand, have a matrix here’s how to take the absolute value in R:

```
negMat <- matrix(
c(-2, -4, 3, 1, -5, 7,
-3, -1.1, -5, -3, -1,
-12, -1, -2.2, 1, -3.0),
nrow=4,
ncol=4)
# Take absolute value in R
abs(negMat)
```


In the example above, we created a small matrix using the `matrix()` function and, then, used the `abs()` function to convert all negative numbers in this matrix to positive (i.e., take the absolute values of the matrix). This example will be followed by a couple of examples in which we will take the absolute values in data frames.

Now that you have changed the negative numbers to positive, you may want to quickly get Tukey’s five number summary statistics using the R function `fivenum()`.

In this section, we will learn how to get the absolute value in dataframes in R. First, we will select one column and change it to absolute values. Second, we will select multiple columns, and again, use the `abs()` function on these. Note, that here we will use the `mutate()` function from dplyr. In the last example, we will also use the `select_if()` function. This dplyr function is great if we want to be able to use the `abs()` function on e.g. all numerical columns in a dataframe.

First, however, we are going to import the example dataset “r_absolute_value.xlsx” using the readxl package and `read_excel()` function:

```
library(readxl)
dataf <- read_excel('./SimData/r_absolute_value.xlsx')
head(dataf)
```


We are not getting into detail when it comes to reading .xlsx files in R. However, you can download the example dataset in the link above. If you store this .xlsx file in a subfolder to your r-script (see code above) you can just copy-paste the code chunk above. However, if you store it somewhere else on your computer you should change the path to the location of the file. In the next example, we are going to get the absolute value from a single column in the dataframe.

Here’s how to take the absolute value from one column in R and create a new column:

`dataf$D.abs <- abs(dataf$D); head(dataf)`

Note, that in the example above, we selected a column using the $-operator, and then we used the `abs()` function to take the absolute value of this column. The absolute values of this column, in turn, were added to a new column which we created, again, using the $-operator. It is, of course, also possible to use dplyr and the `mutate()` function instead. This is the same method that we used to add a new column to an R dataframe, as well as to add a column based on values in other columns in R. Note that inside `mutate()` we use `=` (not `<-`) to name the new column. Here’s how to:

`dataf <- dataf %>% mutate(D.abs = abs(D))`

Now, learning the above method is quite neat because it is a bit simpler to work with `mutate()` compared to using only the $-operator. For example, we can make use of the %>%-operator as well (as in the example above). Furthermore, it will make the code look cleaner when creating more than one new column (as in the next example). In the next example, we are going to create two new columns by taking the absolute values of two others.

Here’s how we would take two columns and get the absolute value from them:

```
library(dplyr)
dataf <- dataf %>%
mutate(F.abs = abs(F),
C.abs = abs(C))
```


Again, we worked with the `mutate()` function and created two new variables. Here it might be worth mentioning that if we only want to get the absolute values from the numerical columns in our dataframe without creating new variables we can, instead, use the `select()` function to select the specific columns. Here’s an example in which we select two columns and take their absolute value:

```
dataf <- dataf %>%
select(c(F, C)) %>%
abs()
```


In the next section, we will use this newly learned method to take the absolute value in all the columns, that are numerical, in the dataframe. However, in this example, we are going to use the `select_if()` function and only select the numerical columns. This is good to know because if we tried to run `abs()` on the complete dataframe we would get an error. Specifically, this would return the error “Error in Math.data.frame(dataf) : non-numeric variable(s) in data frame: M”.

In the next section, we will work with the `select_if()` function as well as the %>% operator, again. Another awesome operator in R is the %in% operator.

Here’s how to apply the `abs()` function on all the numerical columns in the dataframe:

`dataf.abs <- dataf %>% select_if(is.numeric) %>% abs()`

Note how we, again, used the %>%-operator (from magrittr, but imported with dplyr) to apply `select_if()` on the dataframe. Again, we used the %>%-operator and applied the `abs()` function on all the numerical columns. Notice how the new dataframe *only* contains numerical columns (and absolute values).
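If the non-numerical columns should be kept, one possible alternative (not used elsewhere in this post) is to combine `mutate()` with `across()` and `where()`. This is a sketch that assumes dplyr 1.0.0 or later, and the dataframe `toy` is made up for illustration:

```r
library(dplyr)

# Hypothetical toy dataframe with one character column (M)
# and two numeric columns (C and D)
toy <- data.frame(M = c('a', 'b', 'c'),
                  C = c(-1.5, 2, -3),
                  D = c(-4, -5, 6),
                  stringsAsFactors = FALSE)

# Apply abs() to the numeric columns only; M is left untouched
toy.abs <- toy %>%
  mutate(across(where(is.numeric), abs))

names(toy.abs)
```

In contrast to the `select_if()` approach, the resulting dataframe still contains all the original columns.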

Now, before concluding this post it may be worth pointing out, again, that the tidyverse package is a very handy package. That is, it comes with a range of different packages that can be used for manipulating and cleaning your data. For example, you can use dplyr to rename factor levels in R, the lubridate package to extract year from date in R, and ggplot2 to create a scatter plot.

In this tutorial, you have learned about the absolute value and how to take the absolute value in R from 1) vectors, 2) matrices, and 3) columns in a dataframe. Specifically, you have learned how to use the abs() function to convert negative values to positive in a vector, a matrix, and a dataframe. When it comes to the dataframe you have learned how to select columns and convert them using base R as well as dplyr. I really hope you learned something. If you did, please leave a comment below. You should also drop a comment if you have a suggestion or correction to the blog post. Stay safe!

The post Select Columns in R by Name, Index, Letters, & Certain Words with dplyr appeared first on Erik Marsja.

In this R tutorial, you will learn how to select columns in a dataframe. First, we will use base R, in a number of examples, to choose certain columns. Second, we will use dplyr to get columns from the dataframe.

In the first section, we are going to have a look at what you need to follow this tutorial. Second, we will answer some questions that might have brought you to this post. Third, we are going to use base R to select certain columns from the dataframe. In this section, we are also going to use the great operator %in% in R to select specific columns. Fourth, we are going to use dplyr and the select() family of functions. For example, we will use `select_if()` to get all the numeric columns, as well as some helper functions. The helper functions enable us to select columns starting with, or ending with, a certain word or a specific character, for instance.

Note, the `select_if()` function is also great if you, for example, want to take the absolute value in an R dataframe and only select the numerical columns.

To select a column in R you can use brackets e.g., `YourDataFrame['Column']` will take the column named “Column”. Furthermore, we can also use dplyr and the select() function to get columns by name or index. For instance, `select(YourDataFrame, c('A', 'B'))` will take the columns named “A” and “B” from the dataframe.

If you want to use dplyr to select a column in R you can use the `select()` function. For instance, `select(Data, 'Column_to_Get')` will get the column “Column_to_Get” from the dataframe “Data”.

In the next section, we are going to learn about the prerequisites of this post and how to install R packages such as dplyr (or Tidyverse).

To follow this post you, obviously, need a working installation of R. Furthermore, we are going to read the example data from an Excel file using the readxl package. Moreover, if you want to use dplyr’s `select()` and the different helper functions (e.g., `starts_with()`, `ends_with()`) you also need to install dplyr. It may be worth pointing out, that just by using the “-“-character you can use select() (from dplyr) to drop columns in R.

It may be worth pointing out that both readxl and dplyr are part of the tidyverse. Tidyverse comes with a number of great packages that are packed with great functions. Besides selecting, or removing, columns with dplyr (part of Tidyverse) you can extract year from date in R using the lubridate package, create scatter plots with ggplot2, and calculate descriptive statistics. That said, you can install one of these r-packages, depending on what you need, using the `install.packages()` function. For example, installing dplyr and readxl is done by running this in R: `install.packages(c('dplyr', 'readxl'))`.

Before we continue and practice selecting columns in R, we will read data from a .xlsx file.

```
library(readxl)
dataf <- read_excel("add_column.xlsx")
head(dataf)
```


This example dataset is one that we used in the tutorial, in which we added a column based on other columns. We can see that it contains 9 different columns. If we want to, we can check the structure of the dataframe so that we can see what kind of data we have.

`str(dataf)`

Now, we see that there are 20 rows, as well, and that all but one column is numeric. In a more recent post, you can learn how to rename columns in R with dplyr. In the next section, we are going to learn how to select certain columns from this dataframe using base R.

In this section, we are going to practice selecting columns using base R. First, we will use the column indexes and, second, we will use the column names.

Here’s one example on how to select columns by their indexes in R:

`dataf[, c(1, 2, 3)]`


As you can see, we selected the first three columns by using their indexes (1, 2, 3). Notice how we also used the “,” within the brackets. Placing the vector with indexes after the “,” selects columns, whereas placing it before the “,” would subset rows instead. Before moving on to the next example it may be worth knowing that the vector can contain a sequence. For instance, we can generate a sequence of numbers using `:`. For example, replacing `c(1, 2, 3)` with `c(1:3)` would give us the same output as above. Naturally, we can also select e.g. the third, fifth, and the sixth column if we want to. In the next example, we are going to subset certain columns by their name. Note, sequences of numbers can also be generated in R with the seq() function.

Here’s how we can select columns in R by name:

`dataf[, c('A', 'B', 'Cost')]`


In the code chunk above, we basically did the same as in the first example. Notice, however, how we removed the numbers and added the column names. In the vector, that is, we now used the names of the columns we wanted to select. In the next example, we are going to learn a neat little trick by using the %in% operator when selecting columns by name.

Here’s how we can make use of the %in% operator to get columns by name from the R dataframe:

```
head(dataf[, (colnames(dataf) %in% c('Depr1', 'Depr2',
'Depr4', 'Depr7'))])
```


In the code chunk above, we used the great %in% operator. Notice something different in the character vector? There’s a column that doesn’t exist in the example data. The cool thing, here, is that even if we do this when using the %in% operator, we will still get the columns that actually exist in the dataframe. In the next section, we are going to have a look at a couple of examples using dplyr’s `select()` and some of the great helper functions.
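To see why the %in% approach is more forgiving, here is a small sketch comparing it with selecting directly by name. The dataframe `toy` and the variables `wanted`, `direct`, `tolerant`, and `missing_cols` are made up for illustration:

```r
# Hypothetical toy dataframe, used only for illustration
toy <- data.frame(A = 1:3, Depr1 = 4:6, Depr2 = 7:9)
wanted <- c('Depr1', 'Depr2', 'Depr7')  # 'Depr7' does not exist

# Indexing directly by name fails when a name is missing;
# tryCatch() turns the error into the string 'error' here
direct <- tryCatch(toy[, wanted], error = function(e) 'error')

# %in% yields a logical vector over the existing names, so the
# missing name is silently ignored
tolerant <- toy[, colnames(toy) %in% wanted]

# setdiff() lists the requested names that were not found
missing_cols <- setdiff(wanted, colnames(toy))
```

Because the missing name is ignored without any warning, it may be a good idea to check something like `missing_cols` to catch misspelled column names.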

In this section, we will start with the basic examples of selecting columns (e.g., by name and index). However, the focus will be on using the helper functions together with `select()`, and the `select_if()` function.

Here’s how we can get columns by index using the `select()` function:

`library(dplyr); dataf %>% select(c(2, 5, 6))`

Notice how we used another great operator: %>%. This is the pipe operator and following this, we used the select() function. As when selecting columns with base R, we added a vector with the indexes of the columns we want. In the next example, we will basically do the same but select by column names.

Here’s how we use `select()` to get the columns we want by name:

```
library(dplyr)
dataf %>%
select(c('A', 'Cost', 'Depr1'))
```

In the code chunk above, we just added the names of the columns in the vector. Simple! In the next example, we are going to have a look at how to use `select_if()` to select columns containing data of a specific data type.

Here’s how to select all the numeric columns in an R dataframe:

```
dataf %>%
select_if(is.numeric)
```

Remember, all columns except for one are of numeric type. This means that we will get 8 out of 9 columns running the above code. If we, on the other hand, used the `is.character` function we would only select the first column. In the next section, we will learn how to get columns starting with a certain letter.
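As a side note, in dplyr 1.0.0 and later the `select_if()` function is, as far as I know, marked as superseded in favor of `select()` combined with the `where()` helper. Here is a minimal sketch of that alternative on a made-up dataframe (`toy`, `numeric_cols`, and `character_cols` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy dataframe with one character column
toy <- data.frame(Name = c('a', 'b'),
                  A = c(1, 2),
                  Cost = c(3.5, 4.5),
                  stringsAsFactors = FALSE)

# where() takes a function that is applied to each column
numeric_cols <- toy %>% select(where(is.numeric))
character_cols <- toy %>% select(where(is.character))

names(numeric_cols)
names(character_cols)
```

Both approaches should select the same columns here, so `select_if()` still works fine for the examples in this post.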

Here’s how we use the `starts_with()` helper function and `select()` to get all columns starting with the letter “D”:

```
dataf %>%
select(starts_with('D'))
```

Selecting columns with names starting with a certain letter was pretty easy. In the `starts_with()` helper function we just added the letter.

Here’s how we use the `ends_with()` helper function and `select()` to get all columns ending with the letter “D”:

```
dataf %>%
select(ends_with('D'))
```

Note, that in the example dataset there is only one column ending with the letter “D”. In fact, all column names are ending with unique characters. That is, here it would not make sense to select columns using this method. It is worth noting here, that we can use a word when working with both the `starts_with()` and `ends_with()` helper functions. Let’s have a look!

Here’s how we can select certain columns starting with a specific word:

```
dataf %>%
select(starts_with('Depr'))
```

Code language: R (r)

Of course, “Depr” is not really a word, and, yes, we get the exact same columns as in example 7. However, you get the idea and should understand how to use this in your own application. One example, when this makes sense to do, is when having multiple columns beginning with the same letter but some of them beginning with the same word. Before going to the next section, it may be worth mentioning another great feature of the dplyr package. You can use dplyr to rename factor levels in R. In the final example, we are going to select certain column names that are containing a string (or a word).

Here’s how we can select certain columns containing a string:

```
dataf %>%
select(contains('pr'))
```

Code language: R (r)

Again, this particular example doesn’t make sense on the example dataset. There’s a final helper function that is worth mentioning: `matches()`. This function can be used to check whether column names contain a pattern (regular expression) such as digits. Now that you have selected the columns you need, you can continue manipulating your data and get it ready for data analysis. For example, you can now go ahead and create dummy variables in R or add a new column.

In this post, you have learned how to select certain columns using base R and dplyr. Specifically, you have learned how to get columns, from the dataframe, based on their indexes or names. Furthermore, you have learned to select columns of a specific type. After this, you learned how to subset columns based on whether the column names started or ended with a letter. Finally, you have also learned how to select based on whether the columns contained a string or not. Hope you found this blog post useful. If you did, please share it on your social media accounts, add a link to the tutorial in your project reports and such, and leave a comment below.

The post Select Columns in R by Name, Index, Letters, & Certain Words with dplyr appeared first on Erik Marsja.

The post How to use Python to Perform a Paired Sample T-test appeared first on Erik Marsja.

In this Python data analysis tutorial, you will learn how to perform a paired sample t-test in Python. First, you will learn about this type of t-test (e.g., when to use it and the assumptions of the test). Second, you will learn how to check whether your data follow the assumptions and what you can do if your data violate some of the assumptions.

Third, you will learn how to perform a paired sample t-test using the following Python packages:

- SciPy (scipy.stats.ttest_rel)
- Pingouin (pingouin.ttest)

In the final sections of this tutorial, you will also learn how to:

- Interpret and report the paired t-test
  - P-value, effect size
- Report the results and visualize the data

In the first section, you will learn about what is required to follow this post.

In this tutorial, we are going to use both SciPy and Pingouin, two great Python packages, to carry out the dependent sample t-test. Furthermore, to read the dataset we are going to use Pandas. Finally, we are also going to use Seaborn to visualize the data. In the next three subsections, you will find a brief description of each of these packages.

SciPy is one of the essential data science packages. Furthermore, it is a dependency of Pingouin, the other package that we are going to use for the t-test in this tutorial. Here, we are going to use SciPy to test the assumption of normality as well as to carry out the paired sample t-test. This means, of course, that if you are going to carry out the data analysis using Pingouin you will get SciPy installed anyway.

Pandas is also a great Python package for anyone carrying out data analysis with Python, whether a data scientist or a psychologist. In this post, we will use Pandas to import data into a dataframe and to calculate summary statistics.

In this tutorial, we are going to use data visualization to guide our interpretation of the paired sample t-test. Seaborn is a great package for carrying out data visualization (see for example these 9 examples of how to use Seaborn for data visualization in Python).

In this tutorial, Pingouin is the second package that we are going to use to do a paired sample t-test in Python. One great thing with the ttest function is that it returns a lot of information we need when reporting the results from the test. For instance, when using Pingouin we also get the degrees of freedom, Bayes Factor, power, effect size (Cohen’s d), and confidence interval.

In Python, we can install packages with pip. To install all the required packages run the following code:

Code language: Bash (bash)`pip install scipy pandas seaborn pingouin`

In the next section, we are going to learn about the paired t-test and its assumptions.

The paired sample t-test is also known as the *dependent sample t-test* and the *paired t-test*. This type of t-test compares two averages (means) and tells you whether the difference between these two averages is likely to be different from zero. In a paired sample t-test, each participant is measured twice, which results in pairs of observations (the next section will give you an example).

For example, if clinical psychologists want to test whether a treatment for depression will change the quality of life, they might set up an experiment. In this experiment, they will collect information about the participants’ quality of life before the intervention (i.e., the treatment) and after. That is, they are conducting a pre- and post-test study. In the pre-test the average quality of life might be 3, while in the post-test the average quality of life might be 5. Numerically, we could think that the treatment is working. However, it could be due to a fluke and, in order to test this, the clinical researchers can use the paired sample t-test.

Now, when performing dependent sample t-tests you typically have the following two hypotheses:

- Null hypothesis: the true mean difference is equal to zero (between the observations)
- Alternative hypothesis: the true mean difference is not equal to zero (two-tailed)

Note, in some cases we also may have a specific idea, based on theory, about the direction of the measured effect. For example, we may strongly believe (due to previous research and/or theory) that a specific intervention should have a positive effect. In such a case, the alternative hypothesis will be something like: the true mean difference is greater than zero (one-tailed). Note, it can also be smaller than zero, of course.
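As a minimal sketch of how a directional hypothesis can be tested, SciPy’s `ttest_rel()` accepts an `alternative` argument (available in SciPy 1.6 and later). The `pre` and `post` arrays below are simulated for illustration only; they are an assumption and not the dataset used later in this tutorial.

```python
import numpy as np
from scipy.stats import ttest_rel

# Simulated (hypothetical) scores for 50 participants measured twice
rng = np.random.default_rng(42)
pre = rng.normal(loc=40, scale=7, size=50)
post = pre + rng.normal(loc=2, scale=3, size=50)

# Two-tailed: the true mean difference is not equal to zero
two_tailed = ttest_rel(post, pre)

# One-tailed: the true mean difference (post - pre) is greater than zero
one_tailed = ttest_rel(post, pre, alternative='greater')

print(two_tailed.pvalue, one_tailed.pvalue)
```

When the observed difference goes in the hypothesized direction, the one-tailed p-value is half of the two-tailed one. Use `alternative='less'` for the opposite direction.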

Before we continue and import data we will briefly have a look at the assumptions of this paired t-test. Now, besides that the dependent variable is on interval/ratio scale, and is continuous, there are three assumptions that need to be met.

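- Are the pairs of observations independent of each other (i.e., is each participant measured independently of the other participants)?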
- Does the data, i.e., the differences for the matched-pairs, follow a normal distribution?
- Are the participants randomly selected from the population?

If your data is not following a normal distribution you can transform your dependent variable using square root, log, or Box-Cox in Python. In the next section, we will import data.
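The three transformations mentioned above can be sketched as follows. The `scores` array is simulated and purely hypothetical; note that the log and Box-Cox transformations require strictly positive values.

```python
import numpy as np
from scipy.stats import boxcox

# Simulated (hypothetical) positively skewed scores; all values are > 0
rng = np.random.default_rng(1)
scores = rng.lognormal(mean=3, sigma=0.5, size=100)

sqrt_scores = np.sqrt(scores)          # square root transformation
log_scores = np.log(scores)            # natural log transformation
bc_scores, bc_lambda = boxcox(scores)  # Box-Cox; lambda is estimated from the data

print(f"Estimated Box-Cox lambda: {bc_lambda:.3f}")
```

After transforming, the normality check should be run again on the transformed data.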

Before we check the normality assumption of the paired t-test in Python, we need some data to even do so. In this tutorial post, we are going to work with a dataset that can be found here. Here we will use Pandas and the read_csv method to import the dataset (stored in a .csv file):

```
import pandas as pd
# Use the first column of the .csv file as the index
df = pd.read_csv('./SimData/paired_samples_data.csv',
                 index_col=0)
```

Code language: Python (python)

In the image above, we can see the structure of the dataframe. Our dataset contains 100 observations and three variables (columns). Furthermore, there are three different datatypes in the dataframe. First, we have an integer column (i.e., “ids”). This column contains the identifier for each individual in the study. Second, we have the column “test” which is of object data type and contains the information about the test time point. Finally, we have the “score” column where the dependent variable is. We can check the pairs by grouping the Pandas dataframe and calculate descriptive statistics:
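The grouping step can be sketched as follows. The small dataframe here is a made-up stand-in with the same three columns (“ids”, “test”, and “score”), used only so that the snippet runs on its own; the values are not from the tutorial’s dataset.

```python
import pandas as pd

# Made-up stand-in data with the same columns as the tutorial's dataset
example_df = pd.DataFrame({
    'ids': [1, 2, 3, 1, 2, 3],
    'test': ['Pre', 'Pre', 'Pre', 'Post', 'Post', 'Post'],
    'score': [38.0, 41.5, 40.0, 44.0, 47.5, 45.5],
})

# Group by test time point, select the dependent variable, and describe it
summary = example_df.groupby('test')['score'].describe()
print(summary)
```

With the tutorial’s dataframe, the equivalent call would be `df.groupby('test')['score'].describe()`.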

In the code chunk above, we grouped the data by “test” and selected the dependent variable, and got some descriptive statistics using the `describe()` method. If we want, we can use Pandas to count unique values in a column:

`df['test'].value_counts()`

Code language: Python (python)

This way we got the information that we have as many observations in the post-test as in the pre-test. A quick note: before we continue to the next subsection, in which we subset the data, it has to be mentioned that you should check whether the differences between the paired scores are normally distributed. This can be done by creating a histogram (e.g., with Pandas) and/or carrying out the Shapiro-Wilk test.
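Here is a minimal sketch of the Shapiro-Wilk test using `scipy.stats.shapiro`, applied to the differences within each pair. The `pre` and `post` arrays are simulated for illustration and are an assumption, not the tutorial’s data.

```python
import numpy as np
from scipy.stats import shapiro

# Simulated (hypothetical) paired scores, not the tutorial's dataset
rng = np.random.default_rng(0)
pre = rng.normal(loc=40, scale=7, size=50)
post = pre + rng.normal(loc=5, scale=3, size=50)

# The normality assumption concerns the differences within each pair
differences = post - pre
stat, p_value = shapiro(differences)
print(f"W = {stat:.3f}, p = {p_value:.3f}")
```

A p-value above the chosen alpha level means that the test found no evidence against normality; it does not prove that the differences are normally distributed.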

Both methods, whether using SciPy or Pingouin, require that we have our dependent variable in two Python variables. Therefore, we are going to subset the data and select only the dependent variable. To help us, we have the `query()` method, and we will select a column using brackets ([]):

```
b = df.query('test == "Pre"')['score']
a = df.query('test == "Post"')['score']
```

Code language: Python (python)

Now that we have the variables a and b, containing the dependent variable pairs, we can use SciPy to do a paired sample t-test.
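Subsetting with `query()` pairs the scores by row order, so it relies on both subsets listing the participants in the same order. As a hedged alternative sketch, the data can be reshaped to wide format so that each row holds one participant’s pair. The dataframe below is made up for illustration; only the column names follow the ones described above.

```python
import pandas as pd

# Made-up long-format data; note the Post rows are in a different id order
long_df = pd.DataFrame({
    'ids': [1, 2, 3, 3, 1, 2],
    'test': ['Pre', 'Pre', 'Pre', 'Post', 'Post', 'Post'],
    'score': [38.0, 41.5, 40.0, 45.5, 44.0, 47.5],
})

# One row per participant, one column per time point
wide = long_df.pivot(index='ids', columns='test', values='score')

b = wide['Pre']   # pre-test scores, indexed by participant id
a = wide['Post']  # post-test scores, aligned to the same ids
print(a - b)
```

Because both columns share the participant id as index, each difference is guaranteed to come from the same individual.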

Here’s how to carry out a paired sample t-test in Python using SciPy:

```
from scipy.stats import ttest_rel
# Python paired sample t-test
ttest_rel(a, b)
```

Code language: Python (python)

In the code chunk above, we started by importing `ttest_rel()`, the function we then used to carry out the dependent sample t-test. The two arguments we passed were the data, containing the dependent variable, in the pairs (a and b). Now, we can see by the results (image below) that the difference between the pre- and post-test is statistically significant.
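To make explicit what the test computes, here is a small sketch that reproduces the t statistic by hand: the mean of the pairwise differences divided by its standard error, with n - 1 degrees of freedom. The arrays are simulated for illustration and are not the tutorial’s data.

```python
import numpy as np
from scipy.stats import t as t_dist, ttest_rel

rng = np.random.default_rng(7)
b = rng.normal(loc=40, scale=7, size=50)     # simulated pre-test scores
a = b + rng.normal(loc=2, scale=3, size=50)  # simulated post-test scores

d = a - b                                    # pairwise differences
n = d.size
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))
p_manual = 2 * t_dist.sf(abs(t_manual), df=n - 1)  # two-tailed p-value

result = ttest_rel(a, b)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Both lines print matching values, which shows that the paired t-test is a one-sample t-test on the differences.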

In the next section, we will use Pingouin to carry out the paired t-test in Python.

Here’s how to carry out the dependent samples t-test using the Python package Pingouin:

```
import pingouin as pt
# Python paired sample t-test:
pt.ttest(a, b, paired=True)
```

Code language: Python (python)

There’s not that much to explain about the code chunk above, but we started by importing pingouin. Next, we used the `ttest()` function and passed our data. Notice how we used the paired parameter and set it to True. We did this because it is a paired sample t-test we wanted to carry out. Here’s the output:

As you can see, we get more information when using Pingouin to do the paired t-test. In fact, here we basically get all we need to continue and interpret the results.

In this section, you will be given a short explanation on how to interpret the results from a paired t-test carried out with Python. Note, we will focus on the results that we got from Pingouin as they give us more information (e.g., degrees of freedom, effect size).

Now, the p-value of the test is smaller than 0.001, which is less than the significance level alpha (e.g., 0.05). This means that we can conclude that the quality of life was higher at the post-test than at the pre-test. Note, this can, of course, be due to other things than the intervention but that’s another story.

Note that the p-value is the probability of getting an effect at least as extreme as the one in our data, assuming that the null hypothesis is true. P-values address only one question: how likely is your collected data, assuming a true null hypothesis? Notice that the p-value can never be used as support for the alternative hypothesis.

Normally, we interpret Cohen’s *d* in terms of the relative strength of, e.g., the treatment. Cohen (1988) suggested that *d*=0.2 is a ‘small’ effect size, 0.5 is a ‘medium’ effect size, and that 0.8 is a ‘large’ effect size. You can interpret this such that if two groups’ means don’t differ by 0.2 standard deviations or more, the difference is trivial, even if it is statistically significant.
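As a rough sketch of where such a number comes from, Cohen’s *d* divides the mean difference by a standard deviation. Several conventions exist for paired data; the version below uses the average of the two variances, and it is an assumption that this matches what your software reports, so check its documentation. The scores are made up for illustration.

```python
import numpy as np

# Made-up scores for five participants
pre = np.array([38.0, 41.0, 40.0, 36.0, 45.0])
post = np.array([44.0, 47.0, 45.0, 43.0, 50.0])

# Mean difference divided by the square root of the average variance
mean_diff = post.mean() - pre.mean()
sd_avg = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2)
cohens_d = mean_diff / sd_avg
print(round(cohens_d, 2))
```

Other conventions (e.g., dividing by the standard deviation of the differences) can give noticeably different values for the same data, so it is good practice to state which one you report.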

When using Pingouin to carry out the paired t-test we also get the Bayes Factor. See this post for more information on how to interpret BF10.

In this section, you will learn how to report the results according to the APA guidelines. In our case, we can report the results from the t-test like this:

The results from the pre-test (*M* = 39.77, *SD* = 6.758) and post-test (*M* = 45.737, *SD* = 6.77) quality of life test suggest that the treatment resulted in an improvement in quality of life, *t*(49) = 115.4384, *p* < .01.

Note that the “quality of life test” is something made up for this post (or there might be such a test, of course, that I don’t know of!).
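If you report many tests, a small helper can keep the formatting consistent. The following is only a sketch: the function names `apa_p` and `apa_t` are hypothetical (they do not come from SciPy or Pingouin), and the rounding rules reflect one common reading of the APA style (two decimals for *t*, three for *p*, no leading zero).

```python
def apa_p(p):
    """Format a p-value without a leading zero, e.g. 'p = .016'."""
    if p < .001:
        return "p < .001"
    return f"p = {p:.3f}".replace("0.", ".", 1)


def apa_t(t_value, dof, p):
    """Format a t-test result, e.g. 't(49) = 2.50, p = .016'."""
    return f"t({dof}) = {t_value:.2f}, {apa_p(p)}"


# Hypothetical values, only to show the output format
print(apa_t(2.5, 49, 0.0158))
```

The values passed to `apa_t()` would come from the test output, for example the statistic, degrees of freedom, and p-value.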

In the final section, before the conclusion, you will learn how to visualize the data in two different ways: creating boxplots and violin plots.

Here’s how we can guide the interpretation of the paired t-test using boxplots:

```
import seaborn as sns
sns.boxplot(x='test', y='score', data=df)
```

Code language: Python (python)

In the code chunk above, we imported seaborn (as sns), and used the boxplot method. First, we put the column that we want to display separate plots for on the x-axis. Here’s the resulting plot:

Here’s another way to report the results from the t-test by creating a violin plot:

```
import seaborn as sns
sns.violinplot(x='test', y='score', data=df)
```

Code language: Python (python)

Much like creating the box plot, we import seaborn and add the columns/variables we want on the x- and y-axes. Here’s the resulting plot:

As you may already be aware, there are other ways to analyze data. For example, you can use Analysis of Variance (ANOVA) if there are more than two levels in the factor (e.g., tests during the treatment, as well as pre- and post-tests) in the data. See the following posts about how to carry out ANOVA:

- Repeated Measures ANOVA in R and Python using afex & pingouin
- Two-way ANOVA for repeated measures using Python
- Repeated Measures ANOVA in Python using Statsmodels

In this post, you have learned two methods to perform a paired sample t-test. Specifically, in this post you have installed, and used, three Python packages for data analysis (Pandas, SciPy, and Pingouin). Furthermore, you have learned how to interpret and report the results from this statistical test, including data visualization using Seaborn. In the Resources and References section, you will find useful resources and references to learn more. As a final word: the Python package Pingouin will give you the most comprehensive result and that’s the package I’d choose to carry out many statistical methods in Python.

If you liked the post, please share it on your social media accounts and/or leave a comment below. Commenting is also a great way to give me suggestions. However, if you are looking for any help please use other means of contact (see e.g., the About or Contact pages).

Finally, support me and my content (much appreciated, especially if you use an AdBlocker): become a patron. Becoming a patron will give you access to a Discord channel in which you can ask questions and may get interactive feedback.

Here are some useful peer-reviewed articles, blog posts, and books. Refer to these if you want to learn more about the t-test, p-value, effect size, and Bayes Factors.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

It’s the Effect Size, Stupid – What effect size is and why it is important

Using Effect Size—or Why the P Value Is Not Enough.

Beyond Cohen’s d: Alternative Effect Size Measures for Between-Subject Designs (Paywalled).

A tutorial on testing hypotheses using the Bayes factor.
