
In this short tutorial, you will learn how to convert a matrix to a dataframe in R. Specifically, you will learn how to use base R and the package tibble to change the matrix to a dataframe. You will learn this task by 4 different examples (2 using each method).

This post is structured as follows. First, you will learn briefly about tibble and how to install this R package. After this, you will get the answer to the question “How do I convert a matrix to a dataframe in R?”. In the next section, we will create a simple matrix. In the sections that follow, this matrix will be converted to a dataframe in different examples throughout the post. These examples will, hopefully, deepen your knowledge of converting matrices in R.

In the first example, we will use base R to convert the matrix. Subsequently, we will also add column names when converting the matrix to a dataframe.

In the third example, we will then use tibble and the function `as_tibble()` to change the matrix to a dataframe (i.e., a tibble object). Finally, we will also use tibble and `setNames()` when converting a matrix to a dataframe. In the next section, you will learn how to install tibble or the Tidyverse.

Here’s how we can install tibble:

`install.packages("tibble")`


As usual, we use the `install.packages()` function and write the package name (i.e., “tibble”) within quotation marks. Note that we can also install the Tidyverse package, which contains tibble among other useful packages. We can, for example, use the Tidyverse packages to remove duplicates and rename factor levels. Moreover, the package tibble can be used to add empty columns to a dataframe, add new columns to a dataframe, and much more.

To convert a matrix to a dataframe in R, you can use the `as.data.frame()` function. For example, to change the matrix named “mtx” to a dataframe, you can use the following code: `df_m <- as.data.frame(mtx)`.

In the next section, we are going to create a matrix using the `matrix()` function.

Before we change a matrix to a dataframe, we will need to create a matrix. Here’s how we can create a matrix using the `matrix()` function:

`mtx <- matrix(seq(1, 15), nrow = 5)`


In the code above, we used the `seq()` function to generate a sequence of numbers (i.e., from 1 to 15). Moreover, we created 5 rows using the `nrow` argument. Here’s the resulting matrix:
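As a sketch of what that looks like in the console (note that `matrix()` fills the values column by column by default):

```r
# Recreate the matrix from above and print it
mtx <- matrix(seq(1, 15), nrow = 5)
print(mtx)
#      [,1] [,2] [,3]
# [1,]    1    6   11
# [2,]    2    7   12
# [3,]    3    8   13
# [4,]    4    9   14
# [5,]    5   10   15
```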

In the next section, we will have a look at the first example of converting the matrix we have created to a dataframe.

To convert a matrix to a dataframe in R, we can use the `as.data.frame()` function:

`df_mtx <- as.data.frame(mtx)`

In the code above, we simply used the function (i.e., `as.data.frame()`) to create a dataframe from the matrix. Here’s the converted dataframe:

Now that we have converted the matrix to a dataframe, we can use e.g. the `str()` function to look at the structure of the data:

As you can see, in the output above, we have 3 columns of the data type integer. This is, of course, expected (we created a sequence of numbers as a matrix). Notice how we have the column names V1 to V3. This is not that informative and there are a number of options here. First, we could name the columns in the matrix (or when creating the matrix). Second, we can rename the columns of the created dataframe. In this post, we will change the column names after we have converted the matrix.
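As a side note, here is a minimal sketch of the first option mentioned above: naming the columns already when creating the matrix, using the dimnames argument (the names “A”, “B”, and “C” are just illustrative):

```r
# Name the columns directly when creating the matrix;
# dimnames takes a list of row names and column names (NULL = no row names)
mtx_named <- matrix(seq(1, 15), nrow = 5,
                    dimnames = list(NULL, c("A", "B", "C")))
df_named <- as.data.frame(mtx_named)
colnames(df_named)
# [1] "A" "B" "C"
```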

Now, after converting the matrix using the `as.data.frame()` function, we can use the `colnames()` function:

```
df_mtx <- as.data.frame(mtx)
colnames(df_mtx) <- c("A", "B", "C")
```


In the code chunk above, we used the `colnames()` function and assigned a character vector containing the three column names. Here’s the converted matrix (i.e., the dataframe):

In the next example, we will continue by using an installed R package: tibble.

In this section, you will learn how to use another package for converting a matrix to a dataframe: tibble. Here’s how to transform a matrix to a dataframe in R using tibble:

```
library(tibble)
df_mtx <- mtx %>%
  as_tibble()
```


As you probably noticed, there is a difference in how we use the function here. Instead of putting the matrix within the parentheses, as in the previous two examples, we used the pipe operator (`%>%`). On the left side of the pipe operator we have the matrix, and the result is assigned to the new dataframe; on the right side we use the function. Here’s the dataframe that we created from the matrix:
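As a sketch, note that in R 4.1 and later the same conversion can also be written with the native pipe operator (`|>`), shown here with base R's `as.data.frame()` so it runs without any extra packages:

```r
# Native pipe (R >= 4.1); works just like %>% for this simple case
mtx <- matrix(seq(1, 15), nrow = 5)
df_mtx <- mtx |>
  as.data.frame()
```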

Here are some blog posts about other useful operators:

- How to use %in% in R: 7 Example Uses of the Operator
- How to use $ in R: 6 Examples – list & dataframe (dollar sign operator)

Now, most of the time we would like to have better column names than what we get in this example. As previously mentioned, we could have set the column (and row) names when we created the matrix. However, if we already have a matrix without names but know the column names, we can use the setNames() function together with another pipe. This is what we will have a look at in the final example.

Here’s how we can convert a matrix to a dataframe and set the column names:

```
df_mtx <- mtx %>%
  as_tibble() %>%
  setNames(c("A", "B", "C"))
```


In the code chunk above, we used another pipe (see Example 3) and added the function setNames() to add the column names “A”, “B”, and “C”. Here’s the resulting dataframe:

As previously mentioned, tibble is part of the Tidyverse and this means that we could have used dplyr to rename the columns after we created the dataframe.

In this post, we have converted a matrix to a dataframe in R. More specifically, we have learned how to carry out this task by following 4 different examples. In the first two examples, we used base R. In the final two examples, on the other hand, we used the Tidyverse package tibble. Whether we use base R or tibble to convert matrices to dataframes, we need to set the column names ourselves if the matrix we convert does not have column names. Hope you learned something valuable in this tutorial.

If you have anything you would like me to cover in a blog post (e.g., something you need to learn) please drop a comment below. For any suggestions or corrections, please drop a comment below, as well.

The post Learn How to Convert Matrix to dataframe in R with base functions & tibble appeared first on Erik Marsja.


In this R tutorial, you are going to learn how to count the number of occurrences in a column. Sometimes, before starting to analyze your data, it may be useful to know how many times a given value occurs in your variables. For example, when you have a limited set of possible values that you want to compare, you might want to know how many there are of each possible value before you carry out your analysis. Another example may be that you want to count the number of duplicate values in a column. Moreover, we may simply want an overview of the data, let us say, how many men and women there are in the data set. In psychological science, for example, it is often obligatory to report the number of men and women in your research articles.

In this post, you will learn how to use the R function table() to count the number of occurrences in a column. Moreover, we will also use the function count() from the package dplyr. First, we start by installing dplyr and then we import example data from a CSV file. Second, we will start looking at the table() function and how we can use it to count distinct occurrences. Here we will also have a look at how we can calculate the relative frequencies of factor levels.

Third, we will have a look at the count() function from dplyr and how to count the number of times a value appears in a column in R. Finally, we will also have a look at how we can calculate the proportion of factor/characters/values in a column.

In the next section, you are going to learn how to install dplyr. Of course, if you prefer to use table() you can jump to this section, directly.

As you may already be aware, it is quite easy to install R packages. Here’s how you install dplyr using the install.packages() function:

`install.packages("dplyr")`


Note that dplyr is part of the Tidyverse package, which can be installed instead. Installing the Tidyverse package will install a number of very handy and useful R packages. For example, we can use dplyr to remove columns and remove duplicates in R. Moreover, we can use tibble to add a column to a dataframe in R. Finally, the package haven can be used to read an SPSS file in R, and tibble to convert a matrix to a dataframe in R. For more examples, and R tutorials, see the end of the post.

Before learning how to use R to count the number of occurrences in a column, we need some data. For this tutorial, we will read data from a CSV file found online:

`df <- read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/carData/Arrests.csv')`


This dataset contains details on persons who have been arrested, and in this tutorial we are going to have a look at the sex and age columns. First, the sex column classifies an individual as male or female. Second, the age column refers, of course, to the individual’s age. Let us have a quick look at the dataset:

Now, using the str() function, we can see that we have 5226 observations across 9 columns. Moreover, we can see the data type of each of the 9 columns.

Here’s how to use the R function table() to count occurrences in a column:

`table(df['sex'])`


As you can see, we selected the column ‘sex’ using brackets (i.e., df[‘sex’]) and used it as the only parameter to the table() function. Here’s the result:

Note that it is also possible to use $ in R to select a single column. Now, as you can see in the image above, the function returns the count of all unique values in the given column (‘sex’ in our case), without any null values. By glancing at the above output, we see that there are more men than women in the dataset. In fact, the results show us that the vast majority are men.

Note, both of the examples above will remove missing values. This, of course, means that they will not be counted at all. In some cases, however, we may want to know how many missing values there are in a column as well. In the next section, we will therefore have a look at an argument that we can use (i.e., useNA) to count unique values and missing values, in a column. First, however, we are going to add 10 missing values to the column sex:

```
df_nan <- df
df_nan$sex[c(12, 24, 41, 44, 54, 66, 77, 79, 91, 101)] <- NA
```


In the code above, we first used the column name (with the $ operator) and then used brackets to select rows. Finally, we assigned NA to the rows that we selected, to create the missing values. In the next section, we will count the occurrences including the 10 missing values that we just added to the dataframe.

Here’s a code snippet that you can use to get the number of unique values in a column as well as how many missing values:

```
df_nan <- df
df_nan$sex[c(12, 24, 41, 44, 54, 66, 77, 79, 91, 101)] <- NA
table(df_nan$sex, useNA = "ifany")
```


Now, as you can see in the code chunk above, we used the useNA argument. Here we added the character object “ifany” which will also count the missing values, if there are any. Here’s the output:

Now, we already knew that we had 10 missing values in this column. Of course, when we are dealing with collected data we may not know this, and this argument will let us know how many missing values there are in a specific column. In the next section, we will not just count the number of times a value appears in a column in R; rather, we will count the relative frequencies of the unique values in a column.
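Relatedly, the useNA argument also accepts “always”, which includes an NA count even when it is zero. A small sketch with a made-up toy vector (rather than the downloaded data):

```r
# "always" shows an NA count even if there are no missing values
x <- c("Male", "Female", "Male")
tab <- table(x, useNA = "always")
tab
# x
# Female   Male   <NA>
#      1      2      0
```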

Another thing we can do, now, when we know how to count unique values in a column in R’s dataframe is to calculate the relative frequencies of unique values. Here’s how we can calculate the relative frequencies of men and women in the dataset:

`table(df$sex)/length(df$sex)`

In the code chunk above, we used the table() function as in the first example. However, we added something to get the relative frequencies of the factor levels (i.e., men and women): we used the length() function to get the total number of observations and divided the counts by it. This may be useful if we not only want to count the occurrences but also want to know, e.g., what percentage of the sample is male and female.
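A minimal sketch of an equivalent base R approach: the prop.table() function turns a table of counts into proportions directly (toy vector used here for illustration):

```r
# prop.table() converts a table of counts to relative frequencies
x <- c("Male", "Male", "Male", "Female")
prop.table(table(x))
# x
# Female   Male
#   0.25   0.75
```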

Here’s how we can use R to count the number of occurrences in a column using the package dplyr:

```
library(dplyr)
df %>%
  count(sex)
```


In the example, above, we used the %>% operator which enables us to use the count() function to get this beautiful output. Now, as you can see when we are counting the number of times a value appears in a column in R using dplyr we get a different output compared to when using table(). For another great operator, see the post about how to use the %in% operator in R.
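As a small aside, count() also takes a sort argument that orders the result by frequency. A sketch with a made-up toy data frame (so it runs without the downloaded data):

```r
library(dplyr)

# sort = TRUE puts the most frequent value first
toy <- data.frame(sex = c("Male", "Male", "Female", "Male"))
toy %>%
  count(sex, sort = TRUE)
#      sex n
# 1   Male 3
# 2 Female 1
```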

In the next section, we are going to count the relative frequencies of factor levels. Again, we will use dplyr but this time we will use group_by(), summarise(), and mutate().

In this example, we are going to use three R functions (i.e., from the dplyr package). First, we use the piping operator, again, and then we group the data by a column. After we have grouped the data we count the unique occurrences in the column, we have selected. Finally, we are calculating the frequency of factor levels:

`df %>% group_by(sex) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))`

What we did, in the code chunk above, was to group the data by the column containing gender information. We then summarized the data; using the n() function we got the number of observations for each value. Finally, we calculated a new variable, called “Freq”, which is where we calculate the frequencies. This gives us another nice output. Let us have a look at the output:

As you can see in the output above, we get two columns. This is because we added a new column to the summarized data: the frequencies. Of course, counting a column such as age in this way would not provide very useful information. In the next section, we will have a look at why, and at how to handle such a column.

There are 53 unique values in the age column, with a mean of 23.84 and a standard deviation of 8.31. Therefore, counting the unique values of the age column would produce a lot of headaches. In the next example, we will have a look at how we can count age but get a readable output by binning. This is especially useful if we want to count even more continuous data.
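The figures mentioned above can be computed with base R functions. A sketch with a made-up toy age vector (with the real data, you would use df$age instead):

```r
# Toy stand-in for df$age from the Arrests data
age <- c(18, 21, 21, 35, 44)

length(unique(age))  # number of distinct ages: 4
mean(age)            # average age: 27.8
sd(age)              # standard deviation (roughly 11.21)
```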

As previously mentioned, we can create bins and count the number of occurrences in each of these bins. Here’s an example code in which we get 5 bins:

```
df %>%
  group_by(group = cut(age, breaks = seq(0, max(age), 11))) %>%
  summarise(n = n())
```


In the code chunk above, we used the group_by() function, again (of course, after the %>% operator). In this function, we also created the groups (i.e., the bins). Here we used the seq() function that can be used to generate a sequence of numbers in R. Finally, we used the summarise() function to get the number of occurrences in the column, binned. Here’s the output:

For each bin, the range of age values is the same: 11 years. One bin contains ages from 11 to 22, the next bin ages from 22 to 33, and so on. However, we also see that there is a different number of persons in each age range. This enables us to see that most people that are arrested are under the age of 22. Now, this kind of makes sense in this case, right?

In this post, you have learned how to use R to count the number of occurrences in a column. Specifically, you have learned how to count occurrences using the table() function and dplyr’s count() function. Moreover, you have learned how to calculate the relative frequency of factor levels in a column. Furthermore, you have learned how to count the number of occurrences in different bins, as well.

Here are a bunch of other tutorials you might find useful:

- How to Do the Brown-Forsythe Test in R: A Step-By-Step Example
- Select Columns in R by Name, Index, Letters, & Certain Words with dplyr
- How to Calculate Five-Number Summary Statistics in R
- How to Concatenate Two Columns (or More) in R – stringr, tidyr

The post R Count the Number of Occurrences in a Column using dplyr appeared first on Erik Marsja.


In this tutorial, you will learn how to do the Brown-Forsythe test in R. This test is great as you can use it to test the assumption of homogeneity of variances, which is important for e.g. Analysis of Variance (ANOVA).

This post is structured as follows. First, we start by answering a couple of questions related to this test. Second, we learn about the hypotheses of the Brown-Forsythe test. This is followed by the most important section, maybe: the five steps to performing the Brown-Forsythe test in R. Now, of course, it is possible to do it in fewer steps; at its core, the test boils down to installing a package, importing the data, and running one function.

In this section, you will get some brief details on what this test is. As previously mentioned, the Brown-Forsythe test is used whenever we need to test the assumption of equal variances. Furthermore, it is a modification of Levene’s test: the Brown-Forsythe test uses the median, rather than the mean (as Levene’s does). The test is considered robust and is based on the absolute differences of the observations within each group from the group median. The Brown-Forsythe test is a suitable alternative to Bartlett’s test for equal variances, as it is not sensitive to lack of normality and unequal sample sizes. For more information on how the Brown-Forsythe test works, see this article or the resources towards the end of the post.
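To make that idea concrete, here is a minimal sketch (not the onewaytests implementation) of the core computation: take the absolute deviation of each observation from its group median, then run an ordinary one-way ANOVA on those deviations. The toy data below is made up purely for illustration:

```r
# Toy data: three groups with visibly different spreads
y     <- c(1, 2, 3, 9, 4, 5, 6, 30, 7, 8, 9, 50)
group <- factor(rep(c("A", "B", "C"), each = 4))

# Absolute deviation of each value from its group median
z <- abs(y - ave(y, group, FUN = median))

# A one-way ANOVA on z is the Brown-Forsythe idea in a nutshell
anova(lm(z ~ group))
```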

You can perform the Brown-Forsythe test using the bf.test() function from the R package onewaytests. For example, bf.test(DV ~ IV, data = dataFrame) will perform the test on the dependent variable DV, grouped by IV, in the dataframe dataFrame.

In the next section, you will learn the hypotheses of the Brown-Forsythe test. Knowing the hypotheses will make interpretation of the results easier.

When carrying out the Brown-Forsythe test using R we are testing the following two hypotheses:

- H0: The population variances are equal.
- HA: The population variances are not equal.

Therefore, as we will see when going through the example, we don’t want to reject the null hypothesis (H0). In the next section, you will get a brief overview of one of the R packages that can be used to perform the test.

Now, R is, as you may know, an open-source language. This means that there are probably several packages that make it possible for us to do the Brown-Forsythe test in R. In this post, however, we will only use one package:

The onewaytests package is focused on carrying out one-way tests. Using this package, we can carry out one-way ANOVA, Welch’s heteroscedastic F test, Welch’s heteroscedastic F test with trimmed means and Winsorized variances, the Brown-Forsythe test, the Alexander-Govern test, and the James second-order test, to name a few. The function bf.test() is, of course, of interest for this blog post.

We are now ready to carry out the Brown-Forsythe test in R.

Now, you may already know how to install R-packages but here’s how we install the onewaytests package:

`install.packages("onewaytests")`


Note, we are, in step three, also going to summarize data to calculate the variance for each group using dplyr. Moreover, we are going to import the example dataset using the readxl package. Both packages are part of the Tidyverse package. Therefore, to fully follow this post, install the Tidyverse package (or just dplyr and readxl, of course), as well:

`install.packages(c("onewaytests", "tidyverse"))`


The above code will install both onewaytests and Tidyverse. If you, on the other hand, only want to install dplyr and readxl (for reading Excel files), you can remove “tidyverse” and add “dplyr” and “readxl”; just follow the syntax above. Now, Tidyverse comes with a lot of great packages. For example, you can use dplyr to rename columns and count the number of occurrences in a column, and stringr to merge two columns in R.

In the next step, we are going to use the readxl package to import the example dataset.

Here’s how we read an Excel file in R using the readxl package:

```
library(readxl)
dataFrame <- read_excel('brown-forsythe-test-in-R-example-data.xlsx')
```


Before going on to the next step, we can explore the data frame a bit. For example, we can get the first 6 rows:

`head(dataFrame)`

As we can see, there are only two variables in this example data. First, we have the column “Group”, in which we find the different treatment groups (“A”, “B”, and “C”). Second, we have the column “Response”. If we want to see what data types the variables have, we can type this:

`str(dataFrame)`

Now, we see that Group is a factor and Response is numeric (i.e., num). In the next section, we will have a visual look at the variance of Response in each group.

As you may know, there are many different ways to visualize data in R. Here we will make use of the boxplot() function which will give us an idea of whether the variances are equal across the groups, or not. Here’s how to create a boxplot:

`boxplot(Response ~ Group, data = dataFrame)`

When inspecting the boxplots, it sure looks like the variances are different for the different treatment groups. We can also calculate the variance, by group, using dplyr:

```
library(dplyr)
dataFrame %>%
  group_by(Group) %>%
  summarize(Variance = var(Response))
```


Note, you can see the following two posts if you need to calculate other summary statistics as well:

- Learn How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Calculate Five-Number Summary Statistics in R

Now, judging from the output above, it also looks like we have different variances in the different treatment groups. In the next step, however, we will use the bf.test() function to carry out the Brown-Forsythe test, testing the null hypothesis that the variances are equal.

Here’s how you can perform the Brown-Forsythe Test in R:

```
library(onewaytests)
bf.test(Response ~ Group, data=dataFrame)
```


In the code chunk above, we used the bf.test() function (onewaytests package) to carry out the Brown-Forsythe test. Note how we used a formula as the first argument. This is the exact same formula you would use when performing an ANOVA in R. Here’s the output from the function:

In the next section, we will learn how to interpret the results from the test.

Interpreting the Brown-Forsythe test is quite simple. Just remember that we had the null hypothesis that the variances are equal across the groups. Therefore, if the p-value is under 0.05, we reject the null hypothesis and conclude that the data does not meet the assumption of homogeneity of variances.

In our example, the null hypothesis is rejected. However, if the p-value had been above 0.05, we would not have rejected the null hypothesis. In that case, we could safely go on and carry out e.g. a one-way ANOVA.

If your data is violating the assumption of homogeneity but is normally distributed, you should carry on with Welch’s ANOVA, which can also be carried out in R.
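Welch’s ANOVA is, in fact, available in base R through oneway.test() with var.equal = FALSE, using the same formula interface. A sketch with a made-up stand-in for the post’s Excel data:

```r
# Toy stand-in for the example data read from Excel
dataFrame <- data.frame(
  Group    = factor(rep(c("A", "B", "C"), each = 5)),
  Response = c(1, 2, 3, 2, 1, 5, 9, 2, 8, 4, 20, 1, 9, 15, 3)
)

# Welch's ANOVA: does not assume equal variances
res <- oneway.test(Response ~ Group, data = dataFrame, var.equal = FALSE)
res
```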

In this blog post, you have learned how to carry out the Brown-Forsythe test of homogeneity of variances in R. Specifically, you have learned, step-by-step, how to carry out this test. First, you learned how to install an R package enabling the Brown-Forsythe test in R. Second, you imported example data and, third, explored the data. Finally, you learned how to carry out the test using the bf.test() function. Now there are probably other packages and functions that enable us to carry out this test of equal variances. Please leave a comment below, if you know any other packages or functions that we can use to do the Brown-Forsythe test in R. You are, of course, also welcome to suggest what I should cover in future blog posts, correct any mistakes in my blog posts, or just let me know if you found the post useful. That is, I encourage you to comment below!

Here are some references and useful resources that you might find useful on the topic:

- Morton B. Brown & Alan B. Forsythe (1974) Robust Tests for the Equality of Variances, Journal of the American Statistical Association, 69:346, 364-367, DOI: 10.1080/01621459.1974.10482955
- Tests for equality of variances between two samples which contain both paired observations and independent observations (pdf)

Here are some other blog posts, found on this blog, that you might find useful.

- How to use $ in R: 6 Examples – list & dataframe (dollar sign operator)
- Learn How to use %in% in R: 7 Example Uses of the Operator
- How to Add a Column to a Dataframe in R with tibble & dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Remove a Column in R using dplyr (by name and index)
- R Count the Number of Occurrences in a Column using dplyr
- How to Add an Empty Column to a Dataframe in R (with tibble)

The post How to Do the Brown-Forsythe Test in R: A Step-By-Step Example appeared first on Erik Marsja.


The post How to Concatenate Two Columns (or More) in R – stringr, tidyr appeared first on Erik Marsja.

In this guide, you will learn how to concatenate two columns in R. In fact, you will learn how to merge multiple columns in R using base R (e.g., using the paste function) and Tidyverse (e.g., using `str_c()` and `unite()`). In the final section of this post, you will learn which function is the best to use when combining columns.

If you have some experience using dataframe (or in this case tibble) objects in R and you’re ready to learn how to combine data found in them, then this tutorial will help you do precisely that.

Knowing how to do this may prove useful when you have a dataframe containing information, in two columns, and you want to combine these two columns into one using R. For example, you might have a column containing first names and last names. In this case, you may want to concatenate these two columns into one e.g. called Names.

You can follow along with the examples in this tutorial using the interactive Jupyter Notebook found towards the end of the tutorial. Here’s the example data that we use to learn how to combine two, or more, columns to one variable.

In this post, you will learn, by example, how to concatenate two columns in R. As you will see, we will use R’s $ operator to select the columns we want to combine. The outline of the post is as follows. First, you will learn what you need to have to follow the tutorial. Second, you will get a quick answer on how to merge two columns. After this, you will learn a couple of examples using 1) `paste()`, 2) `str_c()`, and 3) `unite()`. In the final section of this concatenating-in-R tutorial, you will learn which method I prefer and why. That is, you will get my opinion on why I like the `unite()` function. In the next section, you will learn about the requirements of this post.

If you prefer to use base R, you don’t need more than a working R installation. However, if you are going to use either str_c() or unite(), you need to have at least one of the packages stringr or tidyr. It is worth pointing out, here, that both of these packages are part of the Tidyverse package. This package contains multiple useful R packages that can be used for reading data, visualizing data (e.g., scatter plots with ggplot2), extracting the year from a date in R, adding new columns, among other things. Installing an R package is simple; here’s how you install Tidyverse:

`install.packages("tidyverse")`


Note, if you want to install stringr or tidyr just exchange “tidyverse” for e.g. “stringr”. In the next section, you will get a quick answer, without any details, on how to concatenate two columns in R.

To concatenate two columns, you can use the `paste()` function. For example, if you want to combine the two columns *A* and *B* in the dataframe *df*, you can use the following code: `df['AB'] <- paste(df$A, df$B)`. Note, however, that using `paste` will result in whitespace between the values in the new column.

Before we are going to have a more detailed look at how to use paste() to combine two columns, we are going to load an example dataset.

Here’s how to read a .xlsx file in R using the readxl package:

```
# Importing Example Data:
library('readxl')
dataf <- read_excel("combine_columns_in_R.xlsx")
```


Now, we can have a look at the structure of the imported data using the `str()` function:

We will also have a quick look at the first rows using the `head()` function:

Now, in the images above, we can see that there are 5 variables and 7 observations. That is, there are 5 columns and 7 rows in the tibble. Moreover, we can see the types of the variables and, of course, also the column names. In the next section, we are going to start by concatenating the month and year columns using the paste() function.

Here’s one of the simplest ways to combine two columns in R, using the `paste()` function:

`dataf$MY <- paste(dataf$Month, dataf$Year)`

In the code above, we used $ in R both to create a new column and to select the two columns we wanted to combine into one. Here’s the tibble with the new column, named *MY*:

In the next example, we will merge two columns and add a hyphen (“-”), as well. For more useful operators, and how to use them, see for example the post “How to use %in% in R: 7 Example Uses of the Operator“.

Now, to add “-” (hyphen) between the values we want to combine, we add a third parameter to the `paste()` function:

`dataf$MY <- paste(dataf$Month, "-", dataf$Year)`


In the code example above, we pasted the hyphen as a third value. As you can see, in the image below, this leaves whitespace between the hyphen and the two values (i.e., “Month” and “Year”).

Now, using R's `paste()` function we can instead use another parameter: the sep parameter. Here's a code example combining the two columns, adding the "-" without the whitespaces:

`dataf$MY <- paste(dataf$Month, dataf$Year, sep= "-")`


Notice that instead of pasting the hyphen we used it as a separator. Before moving on to the next example, it is worth pointing out that if we don't want any whitespace we can use the `paste0()` function instead. This way, we don't need the sep parameter. In the next example, we are going to have a look at how to combine multiple columns (i.e., three or more) in R.
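As a hedged sketch of that alternative (same `dataf` tibble as above): `paste0()` joins its arguments with no separator at all, so the hyphen example becomes:

```
# No whitespace around the hyphen, and no sep argument needed:
dataf$MY <- paste0(dataf$Month, "-", dataf$Year)
```

This produces the same result as `paste()` with `sep = "-"`.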

As you may have understood, combining more than 2 columns is as simple as adding a parameter to the `paste()` function. Here's how we combine three columns in R:

`dataf$DMY <- paste(dataf$Date, dataf$Month, dataf$Year)`

That was also pretty simple. It is worth mentioning that if you use the sep parameter in a case like the one above, you will end up with your chosen character between each value from each column. For example, if we were to add the sep argument to the code above and use an underscore ("_") as the separator, here's how the resulting tibble would look:
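A sketch of that variant, spelling out the underscore separator (same columns as above):

```
# Three columns combined, with "_" between each value:
dataf$DMY <- paste(dataf$Date, dataf$Month, dataf$Year, sep = "_")
```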

Now, you may understand that using the sep parameter enables you to use almost any character to separate your combined values. In the next section, we will have a look at the str_c() function from the stringr package.

Combining two columns with the str_c() function is super simple. Here’s how to merge the columns “Snake” and “Size” using the str_c() function:

```
library(stringr)
dataf$SnakeNSize <- str_c(dataf$Snake," ", dataf$Size)
```


Notice that we added a whitespace string (" ") between the two columns we wanted to concatenate? When working with this function, we need to do this, or else nothing will separate the two values we are combining. As previously mentioned, the stringr package is part of the Tidyverse, which also includes packages such as tidyr and its unite() function. In the next section, we are going to merge two columns in R using the unite() function as well.

- You may also like: How to Add a Column to a Dataframe in R with tibble & dplyr

Here’s how we concatenate two, or more, columns using the unite() function:

```
library(tidyverse) # or library(tidyr)
dataf <- dataf %>%
unite("DM", Date:Month)
```


Notice a couple of things in the code above. First, we used a new operator (i.e., %>%). Among other things, this enables us to use unite() without the $ operator to select the columns. Second, we used two parameters: we name the new column we want to add ("DM"), and we select all the columns from "Date" to "Month" to combine into the new column. Here's the resulting dataframe/tibble:

Now, as you can see in the image above, both columns that we combined have disappeared. If we want to keep the original columns after we have concatenated them we can set the remove parameter to FALSE. Here’s a code chunk that you can use, instead, to not remove the columns:

```
dataf <- dataf %>%
unite("DM", Date:Month, remove = FALSE)
```


Finally, did you notice how we have an underscore as a separator? If we want to change to another separator we can use the sep parameter. This is exactly what we will do in the next example:

Here’s how we use the unite() function together with the sep parameter to change the separator to “-” (hyphen):

```
dataf <- dataf %>%
unite("DM", Date:Month, sep= "-",
remove = FALSE)
```


That was as simple as the previous example, right? In the next section, you will learn which function I prefer to use and why.

Naturally, this section will contain my opinion. I have not done any optimization testing (e.g., I don't know which function is the fastest when it comes to combining columns in R). That said, although all of the functions used in this post are simple to use, I prefer the unite() function. Why? Well, together with the pipe operator I think it makes the code very readable. It is also very handy to use unite() if you are going to concatenate multiple columns in R. As you may have noticed in the examples above, we can use ":" when combining columns. This means that we can merge multiple columns from the first column (i.e., left of the ":") to the last column (i.e., right of the ":"). This is pretty neat and will definitely save some space in your code and make it easier to read!

Another neat thing is that we add the new column name as a parameter and we, automatically, get rid of the columns combined (if we don’t need them, later, of course). Finally, we can also set the na.rm parameter to TRUE if we want missing values to be removed before combining values. Here’s a Jupyter Notebook with all the code in this post.
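As a sketch, the na.rm option mentioned above fits into the earlier unite() example like this (same columns as before):

```
library(tidyverse)

# Drop missing values before combining, keep the original columns:
dataf <- dataf %>%
  unite("DM", Date:Month, sep = "-",
        remove = FALSE, na.rm = TRUE)
```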

In this post, you have learned how to concatenate two (or more) columns in R using three different functions. First, we used the paste() function from base R. Using this function, we combined two and three columns and changed the separator from whitespace to a hyphen ("-"). Second, we used the str_c() function to merge columns. Third, we used the unite() function. Of course, it is possible (we saw some examples of that) to change the separator using the last two functions as well. To conclude, the unite() function seems to be the handiest function to use to concatenate columns in R.

Hope you learned something! If you did, please leave a comment below, share on your social media, include a link to the post on your projects (e.g., blog posts, articles, reports), or become a Patreon:

Finally, if you have any suggestions, other comments, or there is something you wish me to cover: don’t hesitate to contact me.

- How to Calculate Five-Number Summary Statistics in R
- Learn How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Rename Column (or Columns) in R with dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Add an Empty Column to a Dataframe in R (with tibble)

The post How to Concatenate Two Columns (or More) in R – stringr, tidyr appeared first on Erik Marsja.


The post How to Calculate Five-Number Summary Statistics in R appeared first on Erik Marsja.

In this short tutorial, you will learn how to find the five-number summary statistics in R. Specifically, in this post we will calculate:

- Minimum
- Lower-hinge
- Median
- Upper-hinge
- Maximum

Now, we will also visualize the five-number summary statistics using a boxplot. First, we will learn how to calculate each of the five summary statistics and then how we can use one single function to get all of them directly.

To follow this R tutorial you will need to have readxl and ggplot2 installed. The easiest way to install these two R packages is to use the `install.packages()` function:

`install.packages(c("readxl", "ggplot2"))`


Note, both of these packages are part of the Tidyverse. This means that you get them, as well as a lot of other packages, when installing Tidyverse. For example, you can use packages such as dplyr to rename columns, remove columns in R, merge two columns, and select columns as well.

Before getting to the 6 steps to finding the five-number summary statistics using R, however, we will answer some common questions.

As you may have understood, the five-number summary statistics are 1) the minimum, 2) the lower-hinge, 3) the median, 4) the upper-hinge, and 5) the maximum. The five-number summary is a quick way to explore your dataset.

The absolutely easiest way to find the five number summary statistics in R is to use the <code>fivenum()</code> function. For example, if you have a vector of numbers called “A” you can run the following code: <code>fivenum(A)</code> to get the five number summary.
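For instance, with a small vector whose five-number summary is easy to verify by hand:

```
A <- c(1, 3, 5, 7, 9)
fivenum(A)
# 1 3 5 7 9  (min, lower-hinge, median, upper-hinge, max)
```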

Now that we know what the five-number summary is we can go on and learn the simple steps to calculate the 5 summary statistics.

In this section, we are ready to go through the 6 simple steps to calculate the five-number statistics using the R statistical environment. To recap: the first step is to import the dataset (e.g., from an xlsx file). Second, we calculate the min value, and then, in the third step, get the lower-hinge. In the fourth step, we get the median. In the fifth step we get the upper-hinge and, then, in the sixth, and final step, we get the max value.

Here's how to read a .xlsx file in R using the readxl package:

```
library(readxl)
dataf <- read_excel("play_data.xlsx", sheet = "play_data",
col_types = c("skip", "numeric",
"text","text", "numeric",
"numeric", "numeric"))
head(dataf)
```


We can see that in this example dataset there’s only one column containing numerical data (i.e., the column RT). In the next step, we will take the minimum of this column.

Here’s how to get the minimum value in a column in R:

```
# Minimum
min.rt <- min(dataf$RT, na.rm = TRUE)
```

Notice how we used the `min()` function with the dataframe and the column (i.e., RT) as the first argument. The second argument, na.rm, we set to TRUE because we have some missing values in the column. Finally, we used the $ operator in R to select a column. If we, on the other hand, were using dplyr we could use the select() function. That said, let's move on and get the lower-hinge.

Here’s how we get the lower-hinge:

```
# Lower Hinge:
RT <- sort(dataf$RT)
lower.rt <- RT[1:round(length(RT)/2)]
lower.h.rt <- median(lower.rt)
```


Notice how we started by selecting only the response times (i.e., the RT column) and sorting the values. Second, we took the lower half of the response times and, then, got the lower-hinge by calculating the median of this vector.

To calculate the median we can use the `median()` function:

```
# Median
median.rt <- median(dataf$RT, na.rm = TRUE)
```


Again, we used the `na.rm` argument (`TRUE`) because there are some missing values in the dataset. Of course, if your data doesn't have any missing values you can leave this argument out.

Here’s how to get the upper-hinge:

```
# Upper Hinge
RT <- sort(dataf$RT)
upper.rt <- RT[round((length(RT)/2)+1):length(RT)]
upper.h.rt <- median(upper.rt)
```


Similar to when we got the lower-hinge, we first sorted the RT column. Then, we took the upper half and calculated the median of it.

We can get the maximum by using the `max()` function:

```
# Max
max.rt <- max(dataf$RT, na.rm = TRUE)
```


Again, we selected the RT-column using the dollar sign operator and we removed the missing values. Here’s the output:

Note that, depending on the sample size, the lower- and upper-hinges coincide with the first and third quartiles. If that is the case for your data, an easier way to get the lower- and upper-hinges is to use the `quantile()` function. In the example data above, however, we had an even number of observations (leaving out the missing values). If you need to combine two variables in your dataset into one, make sure to check this post out:
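A hedged sketch of that shortcut (note that quantile()'s default method can differ slightly from Tukey's hinges for some sample sizes):

```
# First and third quartiles of the RT column, ignoring missing values:
quantile(dataf$RT, probs = c(0.25, 0.75), na.rm = TRUE)
```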

In this section, we are going to put everything together so we get a somewhat nicer output:

```
fivenumber <- cbind(min.rt, lower.h.rt,
median.rt, upper.h.rt,
max.rt)
colnames(fivenumber) <- c("Min", "Lower-hinge",
"Median", "Upper-hinge", "Max")
fivenumber
```


As you can see in the above code chunk, we used the `cbind()` function to combine the different objects into one. Then, we gave the combined object better column names. In the next section, we are going to see that there already is a function that can calculate the five-number statistics in R in, basically, one line of code.

Here's how to find the five-number summary statistics in R with the `fivenum()` function:

```
# Five summary with R's fivenum()
fivenum(dataf$RT)
```


Pretty simple. We just selected the column containing our data. Again, we used the $ operator to get the RT column to use the `fivenum()` function on. Note that the `fivenum()` function removes any missing values by default.

As you can see in the output above, we don’t get any column names but the five-number summary statistics are ordered as follows: min, lower-hinge, median, upper-hinge, and max. We can see that we get the same values as in the 6 step method:

In the next section, we are going to create a boxplot displaying the five-number summary statistics in R.

Here’s how we can visualize Tukey’s 5 number summary statistics in R using a boxplot:

```
library(ggplot2)
df <- data.frame(
x = 1,
ymin = fivenumber[1],
Lower = fivenumber[2],
Median = fivenumber[3],
Upper = fivenumber[4],
ymax = fivenumber[5]
)
ggplot(df, aes(x)) +
geom_boxplot(aes(ymin=ymin, lower=Lower,
middle=Median, upper=Upper, ymax=ymax),
stat = "identity") +
scale_y_continuous(breaks=seq(0.2,0.8, 0.05)) +
# Style the plot bit
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
# After this is just to annotate the plot and can be removed
# Min
geom_segment(aes(x = 1, y = ymin, xend = 0.95, yend = ymin), data = df) +
annotate("text", x = 0.93, y = df$ymin, label = "Min") +
# Lower-hinge
geom_segment(aes(x = 0.60, y = Lower, xend = 0.60, yend = Lower-0.05), data = df) +
annotate("text", x = 0.60, y = df$Lower-0.06, label = "Lower-hinge") +
# Median
annotate("text", x = 1, y = df$Median + .012, label = "Median") +
# Upper-hinge
geom_segment(aes(x = 1.40, y = Upper, xend = 1.40, yend = Upper+0.05), data = df) +
annotate("text", x = 1.40, y = df$Upper+0.06, label = "Upper-hinge") +
# Max
geom_segment(aes(x = 1, y = ymax, xend = 1.05, yend = ymax), data = df) +
annotate("text", x = 1.07, y = df$ymax, label = "Max")
```


We are not getting into details in the example above. However, we did create a dataframe from the first object we created and then we used `ggplot()` and `geom_boxplot()` to create the boxplot. Notice how we used the `aes()` function and set the different values found in the dataframe as arguments. Here ymin and ymax are the minimum and maximum values, respectively. Note, we also changed the number of ticks on the y-axis. Here we used the seq() function to generate a sequence of numbers. The plot is somewhat styled, and the code for drawing segments (lines) and adding text can, of course, be skipped if you just want to visualize the five summary statistics in R.

More data visualization tutorials:

In this post, you have learned 2 ways to get the five summary statistics in R: 1) min, 2) lower-hinge, 3) median, 4) upper-hinge, and 5) max. In the first method, we calculated each of these summary statistics separately. Furthermore, we have also learned how to use the handy fivenum() function to get the same values. In the final section, we created a boxplot from the five summary statistics. Hope you have learned something valuable. If you did, please link to the blog post in your projects and reports, share on your social media accounts, and/or drop a comment below.

Here are some other tutorials that you may find useful:

- How to Take Absolute Value in R – vector, matrix, & data frame
- Learn How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Extract Year from Date in R with Examples
- Get the Absolute Value in R – from a vector, a matrix, & a data frame
- How to Rename Factor Levels in R using levels() and dplyr
- Learn How to Remove Duplicates in R – Rows and Columns (dplyr)
- How to Add a Column to a Dataframe in R with tibble & dplyr



The post How to Make a Violin plot in Python using Matplotlib and Seaborn appeared first on Erik Marsja.

In this Python data visualization tutorial, we are going to learn how to create a violin plot using Matplotlib and Seaborn. Now, there are several techniques for visualizing data (see the post 9 Data Visualization Techniques You Should Learn in Python for some examples) that we can carry out. Violin plots combine both the box plot and the histogram. In the next section, you will get a brief overview of the content of this blog post.

Before we get into the details on how to create a violin plot in Python we will have a look at what is needed to follow this Python data visualization tutorial. When we have what we need, we will answer a couple of questions (e.g., learn what a violin plot is). In the following sections, we will get into the practical parts. That is, we will learn how to use 1) Matplotlib and 2) Seaborn to create a violin plot in Python.

First of all, you need to have Python 3 installed to follow this post. Second, to use both Matplotlib and Seaborn you need to install these two excellent Python packages. Now, you can install Python packages using both pip and conda, the latter if you have the Anaconda (or Miniconda) Python distribution. Note, Seaborn requires that Matplotlib is installed, so if you, for example, want to try both packages to create violin plots in Python you can type `pip install seaborn`. This will install Seaborn and Matplotlib along with other dependencies (e.g., NumPy and SciPy). Oh, we are also going to read the example data using Pandas. Pandas can, of course, also be installed using pip.

As previously mentioned, a violin plot is a data visualization technique that combines a box plot and a histogram. This type of plot will therefore show us the distribution, median, and interquartile range (IQR) of the data. Specifically, the IQR and median are the statistical information shown in the box plot, whereas the distribution is displayed by the histogram.

A violin plot shows numerical data. Specifically, it will reveal the distribution shape and summary statistics of the numerical data. It can be used to explore data across different groups or variables in our datasets.

In this post, we are going to work with a fake dataset. This dataset can be downloaded here and is data from a Flanker task created with OpenSesame. Of course, the experiment was never actually run to collect the current data. Here’s how we read a CSV file with Pandas:

```
import pandas as pd
data = 'https://raw.githubusercontent.com/marsja/jupyter/master/flanks.csv'
df = pd.read_csv(data, index_col=0)
df.head()
```


Now, we can calculate descriptive statistics in Python using Pandas `describe()`:

`df.loc[:, 'TrialType':'ACC'].groupby(by='TrialType').describe()`


Now, in the code above we used loc to slice the Pandas dataframe. This is because we did not want to calculate summary statistics on the SubID. Furthermore, we used Pandas groupby to group the data by condition (i.e., "TrialType"). Now that we have some data, we will continue exploring it by creating a violin plot using 1) Matplotlib and 2) Seaborn.

Here’s how to create a violin plot with the Python package Matplotlib:

```
import matplotlib.pyplot as plt
plt.violinplot(df['RT'])
```


In the code above, we used the `violinplot()` method with the response time column as the only parameter. That is, we selected only the response times (i.e., the "RT" column) using brackets. Now, as we know there are two conditions in the dataset and, therefore, we should create one violin plot for each condition. In the next example, we are going to subset the data and create violin plots, using Matplotlib, for each condition.

One way to create a violin plot for the different conditions (grouped) is to subset the data:

```
# Subsetting using Pandas query():
congruent = df.query('TrialType == "congruent"')['RT']
incongruent = df.query('TrialType == "incongruent"')['RT']
fig, ax = plt.subplots()
inc = ax.violinplot(incongruent)
con = ax.violinplot(congruent)
fig.tight_layout()
```


Now we can see that there is some overlap in the distributions, but they seem a bit different. Furthermore, we can see that the IQRs differ a bit, especially at the top. However, we don't really know which color represents which condition. From the descriptive statistics earlier, we can assume that the blue one is incongruent; we also know this because that is the first one we created.
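One way to label the two distributions is to build a legend from the objects that `violinplot()` returns; its "bodies" entry holds the filled polygons, which work as legend handles. The sketch below uses synthetic NumPy data (an assumption, standing in for the incongruent/congruent response times from the tutorial's dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Synthetic response times, stand-ins for the two conditions:
rng = np.random.default_rng(1)
incongruent = rng.normal(0.6, 0.1, 100)
congruent = rng.normal(0.5, 0.1, 100)

fig, ax = plt.subplots()
inc = ax.violinplot(incongruent)
con = ax.violinplot(congruent)

# violinplot() returns a dict; use the first polygon in each
# "bodies" list as the legend handle for that condition:
ax.legend([inc["bodies"][0], con["bodies"][0]],
          ["Incongruent", "Congruent"])
fig.tight_layout()
```

This keeps the overlaid layout but makes the colors self-explanatory.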

We can make this plot easier to read by using some more methods. In the next code chunk, we are going to create a list of the data and then add ticks labels to the plot as well as set (two) ticks to the plot.

```
# Combine data
plot_data = list([incongruent, congruent])
fig, ax = plt.subplots()
xticklabels = ['Incongruent', 'Congruent']
ax.set_xticks([1, 2])
ax.set_xticklabels(xticklabels)
ax.violinplot(plot_data)
```


Notice how we now get the violin plots side by side instead. In the next example, we are going to add the median to the plot using the `showmedians` parameter.

Here’s how we can show the median in the violin plots we create with the Python library matplotlib:

```
fig, ax = plt.subplots()
xticklabels = ['Incongruent', 'Congruent']
ax.set_xticks([1, 2])
ax.set_xticklabels(xticklabels)
ax.violinplot(plot_data, showmedians=True)
```


In the next section, we will start working with Seaborn to create a violin plot in Python. This package is built as a wrapper to Matplotlib and is a bit easier to work with. First, we will start by creating a simple violin plot (the same as the first example using Matplotlib). Second, we will create grouped violin plots, as well.

Here’s how we can create a violin plot in Python using Seaborn:

```
import seaborn as sns
sns.violinplot(y='RT', data=df)
```


In the code chunk above, we imported seaborn as sns. This enables us to use a range of methods and, in this case, we created a violin plot with Seaborn. Notice how we set the first parameter to be the dependent variable and the second to be our Pandas dataframe.

Again, we know that there are two conditions and, therefore, in the next example we will use the `x` parameter to create violin plots for each group (i.e., condition).

To create a grouped violin plot in Python with Seaborn we can use the `x` parameter:

```
sns.violinplot(y='RT', x="TrialType",
data=df)
```


Now, this violin plot is easier to read compared to the one we created using Matplotlib. We get a violin plot for each group/condition, side by side, with axis labels. All this by using a single Python method! If we have further categories we can also use the `split` parameter to get KDEs for each category split. Let's see how we do that in the next section.

Here's how we can use the `split` parameter, and set it to `True`, to get a KDE for each level of a category:

```
sns.violinplot(y='RT', x="TrialType", split=True, hue='ACC',
data=df)
```


In the next and final example, we are going to create a horizontal violin plot in Python with Seaborn and the `orient` parameter.

Here's how we use the `orient` parameter to get a horizontal violin plot with Seaborn:

```
sns.violinplot(y='TrialType', x="RT", orient='h',
data=df)
```


Notice how we also flipped the `y` and `x` parameters. That is, we now have the dependent variable ("RT") as the `x` parameter. If we want to save a plot, whether created with Matplotlib or Seaborn, we might want to e.g. change the Seaborn plot size and add or change the title and labels. Here's a code example customizing a Seaborn violin plot:

```
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.gcf()
# Change seaborn plot size
fig.set_size_inches(10, 8)
# Increase font size
sns.set(font_scale=1.5)
# Create the violin plot
sns.violinplot(y='RT', x='TrialType',
data=df)
# Change Axis labels:
plt.xlabel('Condition')
plt.ylabel('Response Time (MSec)')
plt.title('Violin Plot Created in Python')
```


In the above code chunk, we have a fully working example creating a violin plot in Python using Seaborn and Matplotlib. We start by importing the needed packages. After that, we get the current figure with plt.gcf(). In the next code lines, we change the size of 1) the plot, and 2) the font. Then, we create the violin plot and change the x- and y-axis labels. Finally, the title is added to the plot.

For more data visualization tutorials:

- How to Plot a Histogram with Pandas in 3 Simple Steps
- 9 Python Data Visualization Examples (Video)
- How to Make a Scatter Plot in Python using Seaborn
- Seaborn Line Plots: A Detailed Guide with Examples (Multiple Lines)

In this post, you have learned how to make a violin plot in Python using the packages Matplotlib and Seaborn. First, you learned a bit about what a violin plot is and, then, how to create both single and grouped violin plots in Python with 1) Matplotlib and 2) Seaborn.



The post How to use $ in R: 6 Examples – list & dataframe (dollar sign operator) appeared first on Erik Marsja.

In this very short tutorial, you will learn by example how to use the operator $ in R. First, we will learn what the $ operator does by getting the answer to some frequently asked questions. Second, we will work with a list that we create, and use the dollar sign operator to both select and add a variable. Here you will also learn about the downsides of using $ in R as well as the alternatives that you can use. In the following section, we will also work with a dataframe. Both sections will involve creating the list and the dataframe.

To follow this post you need a working installation of the R statistical environment, of course. If you want to read the example Excel file you will also need the readxl package.

The $ operator can be used to select a variable/column, to assign new values to a variable/column, or to add a new variable/column in an R object. This R operator can be used on e.g. lists and dataframes. For example, if we want to print the values in the column "A" in the dataframe called "dollar" we can use the following code: `print(dollar$A)`.

First of all, using brackets with a character vector (e.g., `dollar[c("A", "B")]`) enables us to select multiple columns, whereas the $ operator only lets us select one column at a time.

Before we go on to the next section, we will create a list using the list() function.

```
dollar <- list(A = rep('A', 5), B = rep('B', 5),
'Life Expectancy' = c(10, 9, 8, 10, 2))
```


In the next section, we will then work with the $ operator to 1) add a new variable to the list, and 2) print a variable in the list. In the third example, we will learn how to use $ in R to select a variable whose name contains whitespace.

Here we will start learning, by examples, how to work with the $ operator in R. First, however, we will create a list.

Here’s how to use $ in R to add a new variable to a list:

`dollar$Sequence <- seq(1, 5)`


Notice how we used the name of the list, then the $ operator, and the assignment operator ("<-"). On the right side of <- we used the seq() function to generate a sequence of numbers in R. This sequence of numbers was added to the list. Here's our example list with the new variable:

In the next example, we will use the $ operator to print the values of the new variable that we added.

Here’s how we can use $ in R to select a variable in a list:

`dollar$Sequence`

Again, we used the list name, and the $ operator to print the new column we previously added:

Note that if you want to select two, or more, columns you can use single brackets with a character vector of column names. Another option to select columns is, of course, using the `select()` function from the excellent package dplyr.
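A quick sketch of the bracket alternatives, using the example list from above (and the Sequence variable added earlier):

```
# Single brackets with a character vector return a (sub)list
# containing several elements:
dollar[c("A", "B")]

# Double brackets extract the contents of a single element:
dollar[["Sequence"]]
```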

You might also be interested in: How to use %in% in R: 7 Example Uses of the Operator

Here’s how we can print, or select, a variable with white space in the name:

`` dollar$`Life Expectancy` ``

Notice how we used the backtick (`) in the code above. This way, we can select, or add values, even though the variable name contains whitespace. I would, however, suggest that you rename the column (or replace the whitespace). See the recent post to learn how to rename columns in R. Again, using brackets in this case would work the same as when the variable name does not contain whitespace.

In the next section, we will use the same examples above but on a dataframe. First, however, we will read an .xlsx file in R using the readxl package.

```
dataf <- read_excel('example_sheets.xlsx',
skip=2)
```


Note that we used the skip argument to skip the first two rows. In the example data (download here), the column names are on the third row. We can print the first rows of the dataframe using the `head()` function:

Here we can see that there are 5 columns. In the next section, we will use the $ operator on this dataframe.

In the first example, we will add a new column to the dataframe. After this, we will select the new column and print it using the $ operator. Finally, we will also add a new example on how to use this operator: to remove a column.

Here’s how we can use $ to add a new column in R:

`dataf$NewData <- rep('A', length(dataf$ID))`


Notice how we used R's rep() function to generate a vector containing the letter 'A'. It is important that we generate a vector of the same length as the number of rows in our dataframe. Therefore, we used the length() function on the ID column as the second argument (we could also have used nrow(dataf)).

Now, if you want to learn easier ways to add a column in R check the following posts:

- How to Add a Column to a Dataframe in R with tibble & dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Add an Empty Column to a Dataframe in R (with tibble)

In the next example, we are going to select this column using the $ operator and print it.

Here’s how we select and print the values in the column we created:

`dataf$NewData`

Notice, to select, and print the values, of a column in a dataframe we used R’s $ operator the same way as we used it when we worked with a list. Here’s the output of the code above:

Now, it is easier to use the R package dplyr to select certain columns in R compared to using the $ operator. Another option is, of course, to use the double brackets.

In the next example, we are going to drop a column from the dataframe.

Here’s how we can delete a column using the $ operator and the NULL object:

`dataf$NewData <- NULL`


Again, we can use the R package dplyr to remove columns. More specifically, we can make use of the select() function to delete multiple columns in a quick and easy way.

Note that example 3 will also work if we have a column with whitespace in its name in our dataframe. Finally, before concluding this post, we will have a quick look at how to use brackets to select a column:

`dataf['ID']`

Code language: R (r)

Notice how we used the column name of the variable we wanted to select. This, again, will work on a list as well.

In this post, you have learned, by example, how to use $ in R. First, we worked with a list to add a new variable and select a variable. Then, we used the same methods on a dataframe. As a bonus, we also had a look at how to remove a column using the $ operator. Hope you learned something. If you did, please share the post at work, on your social media accounts, or link back to it in your own blog posts. If you have any comments or suggestions about the post, please leave a comment below.

The post How to use $ in R: 6 Examples – list & dataframe (dollar sign operator) appeared first on Erik Marsja.

In this data science tutorial, you will learn how to rename a column (or multiple columns) in R using base functions as well as dplyr. Renaming a column with dplyr and the `rename()` function is super simple. But, of course, it is not very hard to change the column names using base R either.

Now, there are some cases in which you need to get rid of strange column names such as "x1", "x2", "x3". If we encounter data such as this, cleaning up the names of the variables in our dataframes may be required and will definitely make our work more readable. This is very important, especially in situations where we are working together with others or sharing our data with others. It is also very important that the columns have clear names if we plan to make the data open in a repository.

The outline of the post is as follows. First, you will learn about the requirements of this post. After you know what you need to follow this tutorial, you will get the answer to two questions. In the section following the FAQs, we will load an example dataset to work with. Here we will read an Excel file using the readxl package. When we have successfully imported data into R, we can start changing the names of the columns. First, we will start with a couple of techniques that can be used in base R. Second, we will work with dplyr. Specifically, in this section we will use the rename-family of functions to change the names of some of the variables in the dataframe. That is, we will use `rename()` and `rename_with()`.

Now, before going on to the next section it is worth mentioning that we can use dplyr to select columns as well as remove columns in R.

To follow this post you need to have R installed as well as the packages readxl and dplyr. If you want to install the two packages you can use the `install.packages()` function. Here's how to install readxl and dplyr: `install.packages(c('dplyr', 'readxl'))`.

It is worth pointing out, here, that both these packages are part of the Tidyverse. This means that you can install them, along with a bunch of other great packages, by typing `install.packages('tidyverse')`.

You can rename a column in R in many ways. For example, if you want to rename the column called "A" to "B" you can use this code: `names(dataframe)[names(dataframe)=="A"] <- "B"`. This way you change the column name to "B".

To rename a column in R you can use the `rename()` function from dplyr. For example, if you want to rename the column "A" to "B", again, you can run the following code: `rename(dataframe, B = A)`.

That was it; we are now ready to practice changing column names in R. First, however, we need some data to practice on. In the next section, we are going to import data by reading a .xlsx file.

Here's how we can read a .xlsx file in R with the readxl package:

```
library(readxl)
titanic_df <- read_excel('titanic.xlsx')
```

Code language: R (r)

In the code chunk above, we started by loading the readxl library, and then we used the `read_excel()` function to read the titanic.xlsx file. Here are the first 6 rows of this dataframe:

In the next section, we will start by using the base functionality to rename a column in R.

Here's how to rename a single column with base R:

`names(titanic_df)[1] <- 'P_Class'`

Code language: R (r)

In the code chunk above, we used the `names()` function to assign a new name to the first column in the dataframe. Specifically, using the `names()` function we get all the column names in the dataframe, and then we select the first column using the brackets. Finally, we assigned the new column name using `<-` and the character 'P_Class' (the new name). Note, you can, of course, rename multiple columns in the dataframe using the same method as above. Just change what you put within the brackets. For example, if you want to rename columns 1 to 5, you can put "1:5" within the brackets and then assign a character vector with 5 column names.
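For example, renaming the first five columns at once might look like this (the new names below are just placeholders):

```
# Assign five new names to columns 1 through 5
names(titanic_df)[1:5] <- c('Class', 'Survived', 'Name', 'Sex', 'Age')
```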

In the next example, we are going to use the old column name, instead, to rename the column.

Here's how to change the column name by using the old name when selecting it:

`names(titanic_df)[names(titanic_df) == 'P_Class'] <- 'PCLASS'`

Code language: R (r)

In the code chunk above, we did something quite similar to the first method. However, here we selected the column we previously renamed by its name. This is what we do within the brackets. Notice how we, again, used names() and == to select the column named "P_Class". Here's the output (new column name marked with red):

In the next example, you will learn how to rename multiple columns using base R. In fact, we are going to rename all columns in the dataframe.

Renaming all columns can be done in a similar way as the last example. Here's how we change all the columns in the R dataframe:

```
names(titanic_df) <- c('PC', 'SURV', 'NAM', 'Gender', 'Age', 'SiblingsSPouses',
'ParentChildren', 'Tick', 'Cost', 'Cab', 'Embarked',
'Boat', 'Body', 'Home')
```

Code language: R (r)

Notice how we only used `names()` in the code above. Here it's worth knowing that the character vector (to the right of the `<-`) should contain as many elements as there are columns. Otherwise, one or more columns will be named "NA". Moreover, you need to know the order of the columns. In the next few examples, we are going to work with dplyr and the rename-family of functions.
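One way to guard against such a length mismatch, sketched here with made-up names, is to check the vector before assigning it:

```
new_names <- c('PC', 'SURV', 'NAM')  # hypothetical replacement names
# Stop early if the vector does not cover every column
stopifnot(length(new_names) == ncol(titanic_df))
names(titanic_df) <- new_names
```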

You might also be interested in: How to use $ in R: 6 Examples – list & dataframe

Renaming a column in dplyr is quite simple. Here's how to change a column name:

```
library(dplyr)
titanic_df <- titanic_df %>%
  rename(pc_class = PC)
```

Code language: R (r)

In the code chunk above, there are some new things. First, we start by importing dplyr. Second, we change the name in the dataframe using the `rename()` function. Notice how we use the %>% operator. This is very handy because the functions that follow it are applied to the dataframe on the left of the operator. Third, we use the `rename()` function with one argument: the column we want to rename.

Remember, we renamed all of the columns in the previous example. In the code chunk above, we are actually changing the column back again. That is, to the left of = we have the new column name and to the right, the old name. As you will see in the next example, we can rename multiple columns in the dataframe by adding arguments.

It may be worth mentioning that we can us dplyr to rename factor levels in R, and to add a column to a dataframe. In the next section, however, we are going to rename columns in R with dplyr.

If we, on the other hand, want to change the name of multiple columns we can do as follows:

`titanic_df <- titanic_df %>% rename(Survival = SURV, Name = NAM, Sibsp = SiblingsSPouses)`

Code language: R (r)

It was quite simple to change the names of multiple columns using dplyr's rename() function. As you can see in the code chunk above, we just added each column that we wanted to rename. Again, the name to the right of the equals sign is the old column name. Here are the first 6 columns and rows of the dataframe with the new column names marked in **red**:

In the following sections, we will work with the `rename_with()` function. This is a great function which enables us, as you will see, to change the column names to upper or lower case.

Here's how we can use the `rename_with()` function (dplyr) to change all the column names to lowercase:

`titanic_df <- titanic_df %>% rename_with(tolower)`

Code language: R (r)

In the code chunk above, we used the `rename_with()` function together with the `tolower()` function. This function was applied to all the column names, and the resulting dataframe looks like this:

In the next example, we are going to change the column names to uppercase using the `rename_with()` function together with the `toupper()` function.

In this section, we will just change the function that we use as the only argument to `rename_with()`. This will enable us to change all the column names to uppercase:

`titanic_df <- titanic_df %>% rename_with(toupper)`

Code language: R (r)

Here are the first 6 rows, where all the column names are now in uppercase:

In the next section, we are going to continue working with the rename_with() function and see how we can use other functions to clean the column names from unwanted characters. For example, we can use the gsub() function to remove punctuation from column names.

In some cases, our column names may contain characters that we don't really need. Here's how to use `rename_with()` from dplyr together with `gsub()` to remove punctuation from all the column names in the R dataframe:

```
titanic_df <- titanic_df %>%
rename_with(~ gsub('[[:punct:]]', '', .x))
```

Code language: R (r)

Notice how we added the tilde sign (~) before the gsub() function. Moreover, the first argument is the regular expression for punctuation and the second is what we want to replace it with. In our case, we just remove it from the column names. We could, however, use an underscore ("_") if we wanted to replace the punctuation in the column names. Finally, if we wanted to replace specific characters, we could add them instead of the regular expression for punctuation.
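For instance, a variant of the code above that replaces punctuation with an underscore instead of deleting it could look like this:

```
library(dplyr)
# Same pattern as before, but with '_' as the replacement string
titanic_df <- titanic_df %>%
  rename_with(~ gsub('[[:punct:]]', '_', .x))
```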

Now that you have renamed the columns that needed better and clearer names, you can continue with your data pre-processing. For example, you can add a column to the dataframe based on other columns with dplyr, calculate descriptive statistics (also with dplyr), take the absolute value in your R dataframe, or remove duplicate rows or columns in the dataframe.

In this tutorial, you have learned how to use base R as well as dplyr. First, you learned how to use the base R functions to change the column name of a single column based on its index and name. Second, you learned how to do the same with dplyr and the rename() function. Here we also renamed multiple columns as well as removed punctuation from the column names. Hope you found the post useful. If you did, please share it on your social media accounts and link to it in your projects. Finally, if you have any corrections or suggestions, either on this post or in general on what should be covered on this blog, please let me know.

The post How to Rename Column (or Columns) in R with dplyr appeared first on Erik Marsja.

In this data science tutorial, you will learn how to get the absolute value in R. Specifically, you will learn how to get the absolute value using the built-in function `abs()`. As you may already suspect, using `abs()` is very easy, and to take the absolute value of e.g. a vector you can type `abs(YourVector)`. Furthermore, you will learn how to take the absolute value of both a matrix and a data frame. In the next section, you will get a brief overview of what is covered in this R tutorial.

The structure of the post is as follows. First, we will get the answer to a couple of simple questions. Note, most of them might actually be enough for you to understand how to get the absolute value using the R statistical programming environment. After this, you will learn what you need to know and have installed in your R environment to follow this post. Third, we will start by going into a more detailed example on how to take the absolute value of a vector in R. This section is followed by how to use the abs() function, again, on a matrix containing negative values. Finally, we will also have a look at how to take the absolute values in a data frame in R. This section will also use some of the functions of the dplyr (Tidyverse) package.

The absolute value in R is the non-negative *value* of x. To be clear, the absolute value in R is no different from the absolute value in any other programming language, as this is mathematics rather than something language-specific. In the next FAQ, you will learn how to use the `abs()` function to get the absolute values of e.g. a vector.

To change negative numbers to positive in R, we can use the `abs()` function. For example, if we have the vector `x` containing negative numbers, we can change them to positive numbers by typing `abs(x)` in R.
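A minimal example of this:

```
x <- c(-1.5, 2, -3)
abs(x)
# [1] 1.5 2.0 3.0
```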

Now that we have some basic understanding of how to change negative numbers to positive by taking their absolute values, we can go ahead and have a look at what we need to follow this tutorial. That is, in the next section you will learn about the requirements of this post.

First of all, if you already have R installed you will also have the function abs() installed. However, if you want to use some functionality of the dplyr package (as in the later examples) you will also need to install dplyr (or Tidyverse). Moreover, if you want to read an .xlsx file in R with the readxl package you need to install it, as well. Here it might be worth pointing out that dplyr contains a lot of great functions. For example, you can use dplyr to remove columns in R as well as to select columns by e.g. name or index.

To install dplyr you can use the `install.packages()` function. For example, to install the packages dplyr and readxl you type `install.packages(c("dplyr", "readxl"))`. Note, you can change "dplyr" and "readxl" to "tidyverse" if you want to install all these packages, as they are both part of the Tidyverse. In the next section, you will get the first example of how to take the absolute value in R using the `abs()` function.

Here’s how to take the absolute value from a vector in R:

```
# Creating a vector with negative values
negVec <- seq(-0.1, -1.1, by=-.1)
# R absolute value from vector
abs(negVec)
```

Code language: R (r)

In the code chunk above, we first created a sequence of numbers in R with the seq() function. As you may notice, all the numbers we generated were negative. In the second line, therefore, we used the `abs()` function to take the absolute value of the vector. Here's the output, in which all the negative numbers are now positive:

In the next example, we are going to create a matrix filled with negative numbers and get the absolute values from the matrix.

If we, on the other hand, have a matrix here’s how to take the absolute value in R:

```
negMat <- matrix(
c(-2, -4, 3, 1, -5, 7,
-3, -1.1, -5, -3, -1,
-12, -1, -2.2, 1, -3.0),
nrow=4,
ncol=4)
# Take absolute value in R
abs(negMat)
```

Code language: R (r)

In the example above, we created a small matrix using the `matrix()` function and then used the `abs()` function to convert all negative numbers in this matrix to positive (i.e., take the absolute values of the matrix). This example will be followed by a couple of examples in which we take the absolute values in data frames.

Now that you have changed the negative numbers to positive, you may want to quickly get Tukey’s five number summary statistics using the R function `fivenum()`.

In this section, we will learn how to get the absolute value in dataframes in R. First, we will select one column and change it to absolute values. Second, we will select multiple columns and, again, use the `abs()` function on these. Note that here we will use the `mutate()` function from dplyr. In the last example, we will also use the `select_if()` function. This dplyr function is great if we want to use `abs()` on e.g. all numerical columns in a dataframe.

First, however, we are going to import the example dataset “r_absolute_value.xlsx” using the readxl package and the `read_excel()` function:

```
library(readxl)
dataf <- read_excel('./SimData/r_absolute_value.xlsx')
head(dataf)
```

Code language: R (r)

We are not getting into detail when it comes to reading .xlsx files in R. However, you can download the example dataset in the link above. If you store this .xlsx file in a subfolder to your r-script (see code above) you can just copy-paste the code chunk above. However, if you store it somewhere else on your computer you should change the path to the location of the file. In the next example, we are going to get the absolute value from a single column in the dataframe.

Here’s how to take the absolute value from one column in R and create a new column:

```
dataf$D.abs <- abs(dataf$D)
head(dataf)
```

Code language: R (r)

Note that, in the example above, we selected a column using the $ operator, and then we used the `abs()` function to take the absolute value of this column. The absolute values of this column, in turn, were added to a new column which we created, again, using the $ operator. It is, of course, also possible to use dplyr and the `mutate()` function instead. This is the method we used to add a new column to an R dataframe as well as to add a column based on values in other columns in R. Here's how:

`dataf <- dataf %>% mutate(D.abs = abs(D))`

Code language: R (r)

Now, the method above is quite neat because it is a bit simpler to work with `mutate()` compared to using only the $ operator. For example, we can make use of the %>% operator as well (as in the example above). Furthermore, it makes the code look cleaner when creating more than one new column (as in the next example). In the next example, we are going to create two new columns by taking the absolute values of two others.

Here’s how we would take two columns and get the absolute value from them:

```
library(dplyr)
dataf <- dataf %>%
mutate(F.abs = abs(F),
C.abs = abs(C))
```

Code language: R (r)

Again, we worked with the `mutate()` function and created two new variables. Here it might be worth mentioning that if we only want to get the absolute values of certain columns, without creating new variables, we can instead use the `select()` function to pick the specific columns. Here's an example in which we select two columns and take their absolute value:

```
dataf_abs <- dataf %>%
select(c(F, C)) %>%
abs()
```

Code language: R (r)

In the next section, we will use this newly learned method to take the absolute value of all the numerical columns in the dataframe. In that example, we are going to use the `select_if()` function and select only the numerical columns. This is good to know because if we tried to run `abs()` on the complete dataframe, we would get an error. Specifically, this would return the error “Error in Math.data.frame(dataf) : non-numeric variable(s) in data frame: M”.

In the next section, we will work with the `select_if()` function as well as the %>% operator, again. Another awesome operator in R is the %in% operator.

Here’s how to apply the `abs()` function to all the numerical columns in the dataframe:

`dataf.abs <- dataf %>% select_if(is.numeric) %>% abs()`

Code language: R (r)

Note how we, again, used the %>% operator (from magrittr, but imported with dplyr) to apply `select_if()` to the dataframe. Then we used the %>% operator once more and applied the `abs()` function to all the numerical columns. Notice how the new dataframe *only* contains numerical columns (and absolute values).

Now, before concluding this post, it may be worth, again, pointing out that the tidyverse package is very handy. That is, it comes with a range of different packages that can be used for manipulating and cleaning your data. For example, you can use dplyr to rename factor levels in R, the lubridate package to extract the year from a date in R, and ggplot2 to create a scatter plot.

In this tutorial, you have learned about the absolute value and how to take the absolute value in R from 1) vectors, 2) matrices, and 3) columns in a dataframe. Specifically, you have learned how to use the abs() function to convert negative values to positive in a vector, a matrix, and a dataframe. When it comes to the dataframe, you have learned how to select columns and convert them using base R as well as dplyr. I really hope you learned something. If you did, please leave a comment below. You should also drop a comment if you have a suggestion or correction to the blog post. Stay safe!

The post How to Take Absolute Value in R – vector, matrix, & data frame appeared first on Erik Marsja.

In this R tutorial, you will learn how to select columns in a dataframe. First, we will use base R, in a number of examples, to choose certain columns. Second, we will use dplyr to get columns from the dataframe.

In the first section, we are going to have a look at what you need to follow this tutorial. Second, we will answer some questions that might have brought you to this post. Third, we are going to use base R to select certain columns from the dataframe. In this section, we are also going to use the great %in% operator to select specific columns. Fourth, we are going to use dplyr and the select() family of functions. For example, we will use `select_if()` to get all the numeric columns, together with some helper functions. The helper functions enable us to select columns starting with, or ending with, a certain word or a specific character, for instance.

Note, the `select_if()` function is also great if you, for example, want to take the absolute value in an R dataframe and only select the numerical columns.

To select a column in R you can use brackets, e.g., `YourDataFrame['Column']` will take the column named “Column”. Furthermore, we can also use dplyr and the select() function to get columns by name or index. For instance, `select(YourDataFrame, c('A', 'B'))` will take the columns named “A” and “B” from the dataframe.

If you want to use dplyr to select a column in R you can use the `select()` function. For instance, `select(Data, 'Column_to_Get')` will get the column “Column_to_Get” from the dataframe “Data”.

In the next section, we are going to learn about the prerequisites of this post and how to install R packages such as dplyr (or Tidyverse).

To follow this post you, obviously, need a working installation of R. Furthermore, we are going to read the example data from an Excel file using the readxl package. Moreover, if you want to use dplyr’s `select()` and the different helper functions (e.g., starts_with(), ends_with()) you also need to install dplyr. It may be worth pointing out that, just by using the “-” character, you can use select() (from dplyr) to drop columns in R.

It may be worth pointing out that both readxl and dplyr are part of the tidyverse. Tidyverse comes with a number of great packages that are packed with great functions. Besides selecting, or removing, columns with dplyr (part of Tidyverse), you can extract the year from a date in R using the lubridate package, create scatter plots with ggplot2, and calculate descriptive statistics. That said, you can install these r-packages, depending on what you need, using the `install.packages()` function. For example, installing dplyr and readxl is done by running this in R: `install.packages(c('dplyr', 'readxl'))`.

Before we continue and practice selecting columns in R, we will read data from a .xlsx file.

```
library(readxl)
dataf <- read_excel("add_column.xlsx")
head(dataf)
```

Code language: R (r)

This example dataset is one that we used in the tutorial, in which we added a column based on other columns. We can see that it contains 9 different columns. If we want to, we can check the structure of the dataframe so that we can see what kind of data we have.

`str(dataf)`

Code language: R (r)

Now, we see that there are 20 rows, as well, and that all but one column is numeric. In a more recent post, you can learn how to rename columns in R with dplyr. In the next section, we are going to learn how to select certain columns from this dataframe using base R.

In this section, we are going to practice selecting columns using base R. First, we will use the column indexes and, second, we will use the column names.

Here’s one example on how to select columns by their indexes in R:

`dataf[, c(1, 2, 3)]`

Code language: R (r)

As you can see, we selected the first three columns by using their indexes (1, 2, 3). Notice how we also used the “,” within the brackets. This is done to get the columns rather than subsetting rows (i.e., by placing the “,” before the vector with indexes). Before moving on to the next example, it may be worth knowing that the vector can contain a sequence. For instance, we can generate a sequence of numbers using `:`. For example, replacing `c(1, 2, 3)` with `c(1:3)` would give us the same output as above. Naturally, we can also select e.g. the third, fifth, and sixth columns if we want to. In the next example, we are going to subset certain columns by their names. Note, sequences of numbers can also be generated in R with the seq() function.

Here’s how we can select columns in R by name:

`dataf[, c('A', 'B', 'Cost')]`

Code language: R (r)

In the code chunk above, we basically did the same as in the first example. Notice, however, how we removed the numbers and added the column names. In the vector, that is, we now used the names of the columns we wanted to select. In the next example, we are going to learn a neat little trick using the %in% operator when selecting columns by name.

Here’s how we can make use of the %in% operator to get columns by name from the R dataframe:

```
head(dataf[, (colnames(dataf) %in% c('Depr1', 'Depr2',
'Depr4', 'Depr7'))])
```

Code language: R (r)

In the code chunk above, we used the great %in% operator. Notice something different in the character vector? There’s a column that doesn’t exist in the example data. The cool thing here is that, even when we do this, the %in% operator will select only the columns that actually exist in the dataframe. In the next section, we are going to have a look at a couple of examples using dplyr’s `select()` and some of the great helper functions.

In this section, we will start with the basic examples of selecting columns (e.g., by name and index). However, the focus will be on using the helper functions together with `select()`, and on the `select_if()` function.

Here’s how we can get columns by index using the `select()` function:

```
library(dplyr)
dataf %>%
  select(c(2, 5, 6))
```

Code language: R (r)

Notice how we used another great operator: %>%. This is the pipe operator, and following it, we used the select() function. Again, as when selecting columns with base R, we added a vector with the indexes of the columns we want. In the next example, we will basically do the same but select by column names.

Here’s how we use `select()` to get the columns we want by name:

```
library(dplyr)
dataf %>%
select(c('A', 'Cost', 'Depr1'))
```

Code language: R (r)

In the code chunk above, we just added the names of the columns to the vector. Simple! In the next example, we are going to have a look at how to use `select_if()` to select columns containing data of a specific data type.

Here’s how to select all the numeric columns in an R dataframe:

```
dataf %>%
select_if(is.numeric)
```

Code language: CSS (css)

Remember, all columns except one are of numeric type. This means that we will get 8 out of 9 columns running the above code. If we, on the other hand, used the `is.character` function, we would only select the first column. In the next section, we will learn how to get columns starting with a certain letter.

Here’s how we use the `starts_with()` helper function and `select()` to get all columns starting with the letter “D”:

```
dataf %>%
select(starts_with('D'))
```

Code language: R (r)

Selecting columns with names starting with a certain letter was pretty easy. In the `starts_with()` helper function we just added the letter.

Here’s how we use the `ends_with()` helper function and `select()` to get all columns ending with the letter “D”:

```
dataf %>%
select(ends_with('D'))
```

Code language: R (r)

Note that in the example dataset there is only one column ending with the letter “D”. In fact, all column names end with unique characters. That is, here it would not make sense to select columns using this method. It is worth noting that we can use a word when working with both the `starts_with()` and `ends_with()` helper functions. Let’s have a look!

Here’s how we can select certain columns starting with a specific word:

```
dataf %>%
select(starts_with('Depr'))
```

Code language: R (r)


Of course, “Depr” is not really a word and, yes, we get the exact same columns as in example 7. However, you get the idea and should understand how to use this in your own application. One example where this makes sense is when you have multiple columns beginning with the same letter but only some of them beginning with the same word. Before going to the next section, it may be worth mentioning another great feature of the dplyr package: you can use dplyr to rename factor levels in R. In the final example, we are going to select certain column names that contain a string (or a word).

Here’s how we can select certain columns containing a string:

```
dataf %>%
select(contains('pr'))
```

Code language: R (r)

Again, this particular example doesn’t make sense on the example dataset. There’s a final helper function that is worth mentioning: `matches()`. This function can be used to check whether column names match a pattern (regular expression), such as containing digits. Now that you have selected the columns you need, you can continue manipulating your data and getting it ready for analysis. For example, you can now go ahead and create dummy variables in R or add a new column.

In this post, you have learned how to select certain columns using base R and dplyr. Specifically, you have learned how to get columns, from the dataframe, based on their indexes or names. Furthermore, you have learned to select columns of a specific type. After this, you learned how to subset columns based on whether the column names started or ended with a letter. Finally, you have also learned how to select based on whether the columns contained a string or not. Hope you found this blog post useful. If you did, please share it on your social media accounts, add a link to the tutorial in your project reports and such, and leave a comment below.

The post Select Columns in R by Name, Index, Letters, & Certain Words with dplyr appeared first on Erik Marsja.

]]>In this Python data analysis tutorial, you will learn how to perform a paired sample t-test in Python. First, you will learn about this type of t-test (e.g. when to use it, the assumptions of the test). Second, you will learn how to check whether your data follow the assumptions and what you can do […]

The post How to use Python to Perform a Paired Sample T-test appeared first on Erik Marsja.

]]>In this Python data analysis tutorial, you will learn how to perform a paired sample t-test in Python. First, you will learn about this type of t-test (e.g. when to use it, the assumptions of the test). Second, you will learn how to check whether your data follow the assumptions and what you can do if your data violates some of the assumptions.

Third, you will learn how to perform a paired sample t-test using the following Python packages:

- SciPy (scipy.stats.ttest_rel)
- Pingouin (pingouin.ttest)

In the final sections of this tutorial, you will also learn how to:

- Interpret the results of the paired t-test (p-value, effect size)
- Report the results and visualize the data

In the first section, you will learn about what is required to follow this post.

In this tutorial, we are going to use both SciPy and Pingouin, two great Python packages, to carry out the dependent sample t-test. Furthermore, to read the dataset we are going to use Pandas. Finally, we are also going to use Seaborn to visualize the data. In the next three subsections, you will find a brief description of each of these packages.

SciPy is one of the essential data science packages. In this tutorial, we are going to use it to test the assumption of normality as well as to carry out the paired sample t-test. Furthermore, SciPy is a dependency of Pingouin. This means, of course, that if you are going to carry out the data analysis using Pingouin, you will get SciPy installed anyway.

Pandas is also a great Python package for anyone carrying out data analysis with Python, whether a data scientist or a psychologist. In this post, we will use Pandas to import data into a dataframe and to calculate summary statistics.

In this tutorial, we are going to use data visualization to guide our interpretation of the paired sample t-test. Seaborn is a great package for carrying out data visualization (see for example these 9 examples of how to use Seaborn for data visualization in Python).

In this tutorial, Pingouin is the second package that we are going to use to do a paired sample t-test in Python. One great thing about its `ttest()` function is that it returns a lot of the information we need when reporting the results. For instance, when using Pingouin we also get the degrees of freedom, the Bayes Factor, power, the effect size (Cohen’s d), and a confidence interval.

In Python, we can install packages with pip. To install all the required packages run the following code:

```
pip install scipy pandas seaborn pingouin
```

In the next section, we are going to learn about the paired t-test and its assumptions.

The paired sample t-test is also known as the *dependent sample t-test* or simply the *paired t-test*. Furthermore, this type of t-test compares two averages (means) and tells you whether the difference between these two averages is zero. In a paired sample t-test, each participant is measured twice, which results in pairs of observations (the next section will give you an example).
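A useful way to see what the test does: a paired t-test is mathematically the same as a one-sample t-test on the pairwise differences. Here is a small sketch of that equivalence (the numbers are made up for illustration; they are not from the tutorial's dataset):

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Made-up pre/post measurements for five participants
pre = np.array([3.1, 2.8, 3.5, 3.0, 2.9])
post = np.array([4.9, 5.2, 4.8, 5.1, 5.0])

# Paired t-test on the two measurement occasions...
paired = ttest_rel(post, pre)
# ...is the same as a one-sample t-test of the differences against 0
one_sample = ttest_1samp(post - pre, 0)

print(paired.statistic, one_sample.statistic)
```

Both calls give identical statistics and p-values, which is why the normality assumption (discussed below) concerns the differences rather than the raw scores.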

For example, if clinical psychologists want to test whether a treatment for depression changes quality of life, they might set up an experiment. In this experiment, they collect information about the participants’ quality of life before the intervention (i.e., the treatment) and after it. That is, they are conducting a pre- and post-test study. In the pre-test, the average quality of life might be 3, while in the post-test it might be 5. Numerically, we might think that the treatment is working. However, the difference could be due to chance and, to test this, the clinical researchers can use the paired sample t-test.

Now, when performing dependent sample t-tests you typically have the following two hypotheses:

- Null hypothesis: the true mean difference is equal to zero (between the observations)
- Alternative hypothesis: the true mean difference is not equal to zero (two-tailed)

Note, in some cases we also may have a specific idea, based on theory, about the direction of the measured effect. For example, we may strongly believe (due to previous research and/or theory) that a specific intervention should have a positive effect. In such a case, the alternative hypothesis will be something like: the true mean difference is greater than zero (one-tailed). Note, it can also be smaller than zero, of course.
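As a sketch of how a directional test looks in practice, SciPy's `ttest_rel()` takes an `alternative` argument (`'two-sided'`, `'greater'`, or `'less'`; available in SciPy 1.6 and later). The data below are simulated, not the tutorial's:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
pre = rng.normal(loc=3, scale=1, size=30)
post = pre + rng.normal(loc=1, scale=1, size=30)  # simulated positive effect

# Two-tailed (default) vs. one-tailed test of "post is greater than pre"
two_tailed = ttest_rel(post, pre)
one_tailed = ttest_rel(post, pre, alternative='greater')

print(two_tailed.pvalue, one_tailed.pvalue)
```

The test statistic is the same either way; only the p-value changes, because the one-tailed test puts all of alpha in one tail.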

Before we continue and import data, we will briefly have a look at the assumptions of the paired t-test. Besides the dependent variable being continuous and measured on an interval/ratio scale, there are three assumptions that need to be met.

- Are the pairs of observations independent of each other?
- Do the differences for the matched pairs follow a normal distribution?
- Are the participants randomly selected from the population?

If your data does not follow a normal distribution, you can transform your dependent variable using the square root, log, or Box-Cox transformation in Python. In the next section, we will import data.

Before we check the normality assumption of the paired t-test in Python, we need some data to even do so. In this tutorial post, we are going to work with a dataset that can be found here. Here we will use Pandas and the read_csv method to import the dataset (stored in a .csv file):

```
import pandas as pd

# Read the example data (CSV file) into a Pandas dataframe
df = pd.read_csv('./SimData/paired_samples_data.csv',
                 index_col=0)
```


In the image above, we can see the structure of the dataframe. Our dataset contains 100 observations and three variables (columns). Furthermore, there are three different data types in the dataframe. First, we have an integer column (i.e., “ids”) containing the identifier for each individual in the study. Second, we have the column “test”, which is of object data type and contains the information about the test time point. Finally, we have the “score” column, where the dependent variable is. We can check the pairs by grouping the Pandas dataframe and calculating descriptive statistics:
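That grouping step can be sketched like this (on a small stand-in dataframe, named `demo` here, with the same `test` and `score` columns, since the tutorial's CSV is not bundled with this post):

```python
import pandas as pd

# Small stand-in with the same structure as the tutorial's data
demo = pd.DataFrame({'ids': [1, 2, 3, 1, 2, 3],
                     'test': ['Pre', 'Pre', 'Pre', 'Post', 'Post', 'Post'],
                     'score': [38, 41, 40, 44, 47, 45]})

# Group by time point, select the dependent variable, and summarize it
summary = demo.groupby('test')['score'].describe()
print(summary)
```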

In the code chunk above, we grouped the data by “test” and selected the dependent variable, and got some descriptive statistics using the `describe()`

method. If we want, we can use Pandas to count unique values in a column:

`df['test'].value_counts()`


This way, we get the information that we have as many observations in the post-test as in the pre-test. A quick note before we continue to the next subsection, in which we subset the data: you should check whether the dependent variable (or, rather, the pairwise differences) is normally distributed. This can be done by creating a histogram (e.g., with Pandas) and/or carrying out the Shapiro-Wilk test.
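For the paired t-test it is the pairwise differences that should be roughly normal, and the Shapiro-Wilk check can be sketched like this (with simulated scores, not the tutorial's data):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(7)
pre = rng.normal(40, 6, 100)
post = pre + rng.normal(6, 2, 100)  # simulated improvement

# Shapiro-Wilk on the differences; H0: the differences are normal
stat, p = shapiro(post - pre)
print(stat, p)
```

A p-value above your alpha level means the test found no evidence against normality of the differences.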

Both methods, whether using SciPy or Pingouin, require that we have our dependent variable in two Python variables. Therefore, we are going to subset the data and select only the dependent variable. To help us, we have the `query()`

method and we will select a column using the brackets ([]):

```
b = df.query('test == "Pre"')['score']
a = df.query('test == "Post"')['score']
```


Now that we have the variables a and b containing the dependent variable pairs, we can use SciPy to do a paired sample t-test.

Here’s how to carry out a paired sample t-test in Python using SciPy:

```
from scipy.stats import ttest_rel
# Python paired sample t-test
ttest_rel(a, b)
```


In the code chunk above, we first started by importing `ttest_rel()`

, the method we then used to carry out the dependent sample t-test. The two arguments we passed were the variables containing the paired measurements of the dependent variable (a and b). Now, we can see from the results (image below) that the difference between the pre- and post-test is statistically significant.
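If you want the test statistic and the p-value as plain numbers (for example, to format them for a report), the SciPy result can be unpacked. A sketch with made-up scores:

```python
from scipy.stats import ttest_rel

post = [45, 48, 44, 46, 50]  # made-up post-test scores
pre = [39, 41, 38, 40, 42]   # made-up pre-test scores

t_stat, p_value = ttest_rel(post, pre)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```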

In the next section, we will use Pingouin to carry out the paired t-test.

Here’s how to carry out the dependent samples t-test using the Python package Pingouin:

```
import pingouin as pt
# Python paired sample t-test:
pt.ttest(a, b, paired=True)
```


There’s not that much to explain about the code chunk above: we started by importing Pingouin. Next, we used the `ttest()`

method and used our data. Notice how we used the paired parameter and set it to True. We did this because it is a paired sample t-test we wanted to carry out. Here’s the output:

As you can see, we get more information when using Pingouin to do the paired t-test. In fact, here we basically get everything we need to continue and interpret the results. Before learning how to interpret the results, you can also watch a YouTube video explaining most of the above (with some exceptions, of course):


In this section, you will be given a short explanation on how to interpret the results from a paired t-test carried out with Python. Note, we will focus on the results that we got from Pingouin as they give us more information (e.g., degrees of freedom, effect size).

Now, the p-value of the test is smaller than 0.001, which is less than the significance level alpha (e.g., 0.05). This means we can conclude that quality of life was higher at the post-test than at the pre-test. Note that this can, of course, be due to things other than the intervention, but that’s another story.

Note that the p-value is the probability of getting an effect at least as extreme as the one in our data, assuming that the null hypothesis is true. P-values address only one question: how likely is your collected data, assuming the null hypothesis is true? Notice that the p-value can never be used as support for the alternative hypothesis.

Normally, we interpret Cohen’s d in terms of the relative strength of, e.g., the treatment. Cohen (1988) suggested that *d* = 0.2 is a ‘small’ effect size, 0.5 a ‘medium’ effect size, and 0.8 a ‘large’ effect size. You can interpret this as follows: if two groups’ means don’t differ by at least 0.2 standard deviations, the difference is trivial, even if it is statistically significant.
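As a sanity check on the reported effect size, you can compute one common paired-samples version of Cohen's d — the mean of the differences divided by their standard deviation (sometimes written d_z) — by hand. Note that Pingouin's default may use a different denominator, so the numbers need not match exactly. A sketch with made-up scores:

```python
import numpy as np

post = np.array([45.0, 48.0, 44.0, 46.0, 50.0])  # made-up scores
pre = np.array([39.0, 41.0, 38.0, 40.0, 42.0])

diff = post - pre
# d_z: mean difference divided by the SD of the differences (sample SD)
d_z = diff.mean() / diff.std(ddof=1)
print(round(d_z, 3))
```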

When using Pingouin to carry out the paired t-test we also get the Bayes Factor. See this post for more information on how to interpret BF10.

In this section, you will learn how to report the results according to the APA guidelines. In our case, we can report the results from the t-test like this:

The results from the pre-test (*M* = 39.77, *SD* = 6.758) and post-test (*M* = 45.737, *SD* = 6.77) quality of life test suggest that the treatment resulted in an improvement in quality of life, *t*(49) = 115.4384, *p* < .01. Note that the “quality of life test” is something made up for this post (or there might be such a test, of course, that I don’t know of!).

In the final section, before the conclusion, you will learn how to visualize the data in two different ways: creating boxplots and violin plots.

Here’s how we can guide the interpretation of the paired t-test using boxplots:

```
import seaborn as sns
sns.boxplot(x='test', y='score', data=df)
```


In the code chunk above, we imported Seaborn (as sns) and used the boxplot method. We put the grouping column (i.e., the test time point) on the x-axis and the dependent variable on the y-axis. Here’s the resulting plot:

Here’s another way to report the results from the t-test by creating a violin plot:

```
import seaborn as sns
sns.violinplot(x='test', y='score', data=df)
```


Much like when creating the box plot, we import Seaborn and add the columns/variables we want on the x- and y-axes. Here’s the resulting plot:

As you may already be aware, there are other ways to analyze data. For example, you can use Analysis of Variance (ANOVA) if there are more than two levels of the factor (e.g., tests during the treatment as well as pre- and post-tests) in the data. See the following posts about how to carry out ANOVA:

- Repeated Measures ANOVA in R and Python using afex & pingouin
- Two-way ANOVA for repeated measures using Python
- Repeated Measures ANOVA in Python using Statsmodels

Recently, machine learning methods have also grown popular.

In this post, you have learned two methods to perform a paired sample t-test. Specifically, you have installed, and used, three Python packages for data analysis (Pandas, SciPy, and Pingouin). Furthermore, you have learned how to interpret and report the results from this statistical test, including data visualization using Seaborn. In the Resources and References section, you will find useful resources and references to learn more. As a final word: the Python package Pingouin gives the most comprehensive output, and that’s the package I’d choose to carry out many statistical methods in Python.

If you liked the post, please share it on your social media accounts and/or leave a comment below. Commenting is also a great way to give me suggestions. However, if you are looking for any help please use other means of contact (see e.g., the About or Contact pages).

Finally, support me and my content (much appreciated, especially if you use an AdBlocker): become a patron. Becoming a patron will give you access to a Discord channel in which you can ask questions and may get interactive feedback.

Here are some useful peer-reviewed articles, blog posts, and books. Refer to these if you want to learn more about the t-test, p-value, effect size, and Bayes Factors.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

It’s the Effect Size, Stupid – What effect size is and why it is important

Using Effect Size—or Why the P Value Is Not Enough.

Beyond Cohen’s d: Alternative Effect Size Measures for Between-Subject Designs (Paywalled).

A tutorial on testing hypotheses using the Bayes factor.

The post How to use Python to Perform a Paired Sample T-test appeared first on Erik Marsja.

]]>In this tutorial, related to data analysis in Python, you will learn how to deal with your data when it is not following the normal distribution. One way to deal with non-normal data is to transform your data. In this post, you will learn how to carry out Box-Cox, square root, and log transformation in […]

The post How to use Square Root, log, & Box-Cox Transformation in Python appeared first on Erik Marsja.

]]>In this tutorial, related to data analysis in Python, you will learn how to deal with your data when it is not following the normal distribution. One way to deal with non-normal data is to transform your data. In this post, you will learn how to carry out Box-Cox, square root, and log transformation in Python.

That the data we have is of normal shape (also known as following a Bell curve) is important for the majority of the parametric tests we may want to perform. This includes regression analysis, the two-sample t-test, and Analysis of Variance that can be carried out in Python, to name a few.

This post starts by briefly going through what you need to follow this tutorial. After this, you will 1) get information about skewness and kurtosis, and 2) get a brief overview of the different transformation methods. In the section following the transformation methods, you will learn how to import data using Pandas’ read_csv. We will explore the example dataset a bit by creating histograms and getting the measures of skewness and kurtosis. Finally, the last sections cover how to transform data that is non-normal.

In this tutorial, we are going to use Pandas, SciPy, and NumPy. It is worth mentioning, here, that NumPy is a dependency of Pandas. That is, if you install Pandas using e.g. pip, NumPy will also be installed on your computer, whether you use e.g. Ubuntu Linux or Windows 10. SciPy, on the other hand, may need to be installed separately. Note that you can use pip to install a specific version of e.g. Pandas and, if you need to, you can upgrade pip using either conda or pip.

Now, if you want to install the packages individually (e.g., you only want to use Pandas), you can run the following code:

```
pip install pandas
```

Now, if you only want to install NumPy, change “pandas” to “numpy” in the code chunk above. That said, let us move on to the section about skewness and kurtosis.

Briefly, skewness is a measure of symmetry or, to be exact, of the lack of symmetry. The larger the absolute value, the more your data lack symmetry (i.e., the less normal they are). Kurtosis, on the other hand, is a measure of whether your data is heavy- or light-tailed relative to a normal distribution. See here for a more mathematical definition of both measures. A good way to visually examine data for skewness or kurtosis is to use a histogram. Note, however, that there are, of course, also statistical tests that can be used to test whether your data is normally distributed.

One way of handling right, or left, skewed data is to carry out the logarithmic transformation on our data. For example, `np.log(x)`

will log transform the variable `x`

in Python. There are other options as well as the Box-Cox and Square root transformations.

One way to handle left (negative) skewed data is to reverse the distribution of the variable. In Python, this can be done using the following code:
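A sketch of such a reversal (the same pattern this post uses in its later examples): subtract each value from the variable's maximum plus one, which turns left skew into right skew.

```python
import numpy as np

x = np.array([1.0, 7.0, 8.0, 9.0, 10.0])  # made-up left-skewed values

# Reverse the distribution: the largest value becomes the smallest
reversed_x = np.max(x + 1) - x
print(reversed_x)
```

After the reversal, a right-skew transformation (square root, log, or Box-Cox) can be applied.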

These transformations will be covered in more detail throughout the post (e.g., you will learn how to carry out the log transformation in Python). In the next section, you will get an overview of the three commonly used transformation techniques that you, later, will also learn to apply.

As indicated in the introduction, we are going to learn three methods that we can use to transform data deviating from the normal distribution. In this section, you will get a brief overview of these three transformation techniques and when to use them.

The square root method is typically used when your data is moderately skewed. Using the square root (e.g., sqrt(x)) is a transformation that has a moderate effect on distribution shape, and it is generally used to reduce right-skewed data. Finally, the square root can be applied to zero values and is most commonly used on count data.

The logarithmic transformation is a strong transformation that has a major effect on distribution shape. This technique is, like the square root method, often used for reducing right skewness. Worth noting, however, is that it cannot be applied to zero or negative values.
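Because the log is undefined at zero, a common workaround when a variable contains zeros (but no negative values) is to add a constant, often 1, before transforming; NumPy also provides `np.log1p()` for this. A small sketch (the shift-by-one trick is a general technique, not something this post's dataset requires):

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0])

shifted = np.log(x + 1)  # manual shift by one before taking the log
log1p = np.log1p(x)      # same result, more accurate for values near zero

print(shifted)
print(log1p)
```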

The Box-Cox transformation is, as you probably understand, also a technique to transform non-normal data into a normal shape. It is a procedure for identifying a suitable exponent (lambda, λ) to use to transform the skewed data.
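SciPy's `boxcox()`, used later in this post, both estimates that lambda and applies the transformation: it returns the transformed values together with the fitted lambda. A sketch on simulated right-skewed data:

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(123)
x = rng.exponential(scale=2.0, size=500)  # simulated, strictly positive

transformed, fitted_lambda = boxcox(x)

# The transformation should clearly reduce the right skew
print(fitted_lambda, skew(x), skew(transformed))
```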

Now, the above-mentioned transformation techniques are the most commonly used. However, there are plenty of other methods, as well, that can be used to transform your skewed dependent variables. For example, if your data consists of proportions, you can also use the arcsine transformation method. Another method that you can use is called the reciprocal transformation. This method is basically carried out like this: 1/x, where x is your dependent variable.
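The reciprocal transformation really is that simple. A sketch (note that it reverses the ordering of the values and, like the log, cannot handle zeros):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 10.0])
reciprocal = 1 / x  # large values become small, and vice versa
print(reciprocal)
```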

In the next section, we will import data containing four dependent variables that are positively and negatively skewed.

In this tutorial, we will transform data that is both negatively (left) and positively (right) skewed and we will read an example dataset from a CSV file (Data_to_Transform.csv). To our help we will use Pandas to read the .csv file:

```
import pandas as pd
import numpy as np
# Reading dataset with skewed distributions
df = pd.read_csv('./SimData/Data_to_Transform.csv')
```


This is an example dataset that has the following four variables:

- Moderate Positive Skew (Right Skewed)
- Highly Positive Skew (Right Skewed)
- Moderate Negative Skew (Left Skewed)
- Highly Negative Skew (Left Skewed)

We can obtain this information by using the `info()`

method. This will give us the structure of the dataframe:
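That call can be sketched as follows (on a small stand-in dataframe, named `demo` here — the post's actual dataset, as noted below, has 10,000 rows and four float columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in with two of the tutorial's column names
demo = pd.DataFrame({'Moderate Positive Skew': rng.gamma(2.0, 1.0, 100),
                     'Highly Positive Skew': rng.exponential(1.0, 100)})

demo.info()  # prints row count, column dtypes, and non-null counts
```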

As you can see, the dataframe has 10000 rows and 4 columns (as previously described). Furthermore, we get the information that the 4 columns are of float data type and that there are no missing values in the dataset. In the next section, we will do a quick visual inspection of the distribution of these 4 variables using Pandas’ hist() function.

In this section, we are going to visually inspect whether the data are normally distributed. Of course, there are several ways to plot the distribution of our data. In this post, however, we are going to only use Pandas and create histograms. Here’s how to create a histogram in Pandas using the `hist()`

method:

```
df.hist(grid=False,
        figsize=(10, 6),
        bins=30)
```


Now, the `hist()`

method takes all the numeric variables in the dataset (i.e., in our case, the float columns) and creates a histogram for each. To quickly explain the parameters used in the code chunk above: first, we used the `grid`

parameter, set to `False`

, to remove the grid from the histograms. Second, we changed the figure size using the `figsize`

parameter. Finally, we also changed the number of bins (the default is 10) to get a better view of the data. Here is the distribution visualized:

It is pretty clear that all the variables are skewed and not following a normal distribution (as the variable names imply). Note, there are, of course, other visualization techniques that you can carry out to examine the distribution of your dependent variables. For example, you can use boxplots, stripplots, swarmplots, kernel density estimation, or violin plots. These plots give you a lot of (more) information about your dependent variables. See the post with 9 Python data visualization examples, for more information. In the next section, we are also going to have a look at how we can get the measures of skewness and kurtosis.

More data visualization tutorials:

- Seaborn Line Plots: A Detailed Guide with Examples (Multiple Lines)
- How to use Pandas Scatter Matrix (Pair Plot) to Visualize Trends in Data
- How to Save a Seaborn Plot as a File (e.g., PNG, PDF, EPS, TIFF)

In this section, before we start learning how to transform skewed data in Python, we will just have a quick look at how to get skewness and kurtosis in Python.

`df.agg(['skew', 'kurtosis']).transpose()`


In the code chunk above, we used the `agg()`

method and used a list as the only parameter. This list contained the two methods that we wanted to use (i.e., we wanted to calculate skewness and kurtosis). Finally, we used the `transpose()`

method to change the rows to columns (i.e., transpose the Pandas dataframe) so that we get an output that is a bit easier to check. Here’s the resulting table:

As a rule of thumb, skewness can be interpreted like this:

| Skewness | Interpretation |
| --- | --- |
| -0.5 to 0.5 | Fairly symmetrical |
| -1.0 to -0.5 or 0.5 to 1.0 | Moderately skewed |
| < -1.0 or > 1.0 | Highly skewed |
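The rule of thumb above can be written as a tiny helper function (a sketch; the cut-offs are the conventional ones, not computed from the data):

```python
def skew_label(skewness: float) -> str:
    """Classify a skewness value using the common rule of thumb."""
    if abs(skewness) <= 0.5:
        return 'fairly symmetrical'
    if abs(skewness) <= 1.0:
        return 'moderately skewed'
    return 'highly skewed'

print(skew_label(0.2), skew_label(-0.8), skew_label(1.4))
```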

There are, of course, more things that can be done to test whether our data is normally distributed. For example, we can carry out statistical tests of normality, such as the Shapiro-Wilk test. It is worth noting, however, that most of these tests are sensitive to sample size: with large samples, even small deviations from normality will be flagged by e.g. the Shapiro-Wilk test.

In the next section, we will start transforming the non-normal (skewed) data. First, we will transform the moderate skewed distributions and, then, we will continue with the highly skewed data.

Here’s how to do the square root transformation of non-normal data in Python:

```
# Python square root transformation
df.insert(len(df.columns), 'A_Sqrt',
          np.sqrt(df.iloc[:, 0]))
```


In the code chunk above, we created a new column/variable in the Pandas dataframe by using the `insert()`

method. It is, furthermore, worth mentioning that we used the iloc[] method to select the column we wanted. In the following examples, we are going to continue using this method for selecting columns. Notice how the first parameter (i.e., “:”) is used to select all rows, and the second parameter (“0”) is used to select the first columns. If we, on the other hand, used the loc method we could have selected by the column name. Here’s a histogram of our new column/variable:

Here, we can see that the new, square root transformed, distribution is more symmetrical than the previous, right-skewed, distribution.

In the next subsection, you will learn how to deal with negatively (left) skewed data. If we try to apply sqrt() directly to such a column, we will get a ValueError (see towards the end of the post).

Now, if we want to transform the negatively (left) skewed data using the square root method we can do as follows.

```
# Square root transformation on left-skewed data in Python:
df.insert(len(df.columns), 'B_Sqrt',
          np.sqrt(max(df.iloc[:, 2] + 1) - df.iloc[:, 2]))
```


What we did, above, was to reverse the distribution (i.e., `max(df.iloc[:, 2] + 1) - df.iloc[:, 2]`

) and then applied the square root transformation. You can see, in the image below, that the skewness becomes positive when reversing the negatively skewed distribution.

In the next section, you will learn how to log transform in Python on highly skewed data, both to the right and left.

Here’s how we can use the log transformation in Python to get our skewed data more symmetrical:

```
# Python log transformation
df.insert(len(df.columns), 'C_log',
          np.log(df['Highly Positive Skew']))
```


Now, we did pretty much the same as when using Python to do the square root transformation. Here, we created a new column using the insert() method. However, this time we used the log() method from NumPy, because we wanted to do a logarithmic transformation. Here’s what the distribution looks like now:

Here’s how to log transform negatively skewed data in Python:

```
# Log transformation of negatively (left) skewed data in Python
df.insert(len(df.columns), 'D_log',
          np.log(max(df.iloc[:, 2] + 1) - df.iloc[:, 2]))
```


Again, we carried out the log transformation using the NumPy log() method. Furthermore, we did exactly as in the square root example. That is, we reversed the distribution and we can, again, see that all that happened is that the skewness went from negative to positive.

In the next section, we will have a look at how to use SciPy to carry out the Box-Cox transformation on our data.

Here’s how to implement the Box-Cox transformation using the Python package SciPy:

```
from scipy.stats import boxcox

# Box-Cox transformation in Python
df.insert(len(df.columns), 'A_Boxcox',
          boxcox(df.iloc[:, 0])[0])
```


In the code chunk above, the only difference, basically, between the previous examples is that we imported `boxcox()`

from `scipy.stats`

. Furthermore, we used the `boxcox()`

method to apply the Box-Cox transformation. Notice how we selected the first element using the brackets (i.e. `[0]`

). This is because this method (i.e. `boxcox()`

) will give us a tuple. Here’s a visualization of the resulting distribution.

Once again, we managed to transform our positively skewed data to a relatively symmetrical distribution. Note that the Box-Cox transformation also requires our data to contain only positive numbers, so if we want to apply it to negatively skewed data, we need to reverse the distribution first (see the previous examples on how to do this). If we try to use `boxcox()`

on the column “Moderate Negative Skewed”, for example, we get a ValueError.

More exactly, if you get the “ValueError: Data must be positive” while using either `np.sqrt()`

, `np.log()`

or SciPy’s `boxcox()`

it is because your dependent variable contains negative numbers. To solve this, you can reverse the distribution.

It is worth noting, here, that we can now check the skewness using the `skew()`

method:

`df.agg(['skew']).transpose()`


We can see in the output that the skewness values of the transformed variables are now acceptable (they are all under 0.5). Of course, we could also run the previously mentioned tests of normality (e.g., the Shapiro-Wilk test). Note that if your data is still not normally distributed, you can carry out a non-parametric test, such as the Mann-Whitney U test, in Python as well.

In this post, you have learned how to apply square root, logarithmic, and Box-Cox transformations in Python using Pandas, SciPy, and NumPy. Specifically, you have learned how to transform both positively (right) and negatively (left) skewed data so that it will meet the assumption of normality. First, you learned briefly about the Python packages needed to transform non-normal, skewed data into normally distributed data. Second, you learned about the three methods that you, later, also learned how to carry out in Python.

Here are some useful resources for further reading.

DeCarlo, L. T. (1997). On the meaning and use of kurtosis. *Psychological Methods*, *2*(3), 292–307. https://doi.org/10.1037//1082-989x.2.3.292

Blanca, M. J., Arnau, J., López-Montiel, D., Bono, R., & Bendayan, R. (2013). Skewness and kurtosis in real data samples. *Methodology: European Journal of Research Methods for the Behavioral and Social Sciences*, *9*(2), 78–84. https://doi.org/10.1027/1614-2241/a000057

Mishra, P., Pandey, C. M., Singh, U., Gupta, A., Sahu, C., & Keshri, A. (2019). Descriptive statistics and normality tests for statistical data. *Annals of cardiac anaesthesia*, *22*(1), 67–72. https://doi.org/10.4103/aca.ACA_157_18

The post How to use Square Root, log, & Box-Cox Transformation in Python appeared first on Erik Marsja.

]]>In this post, you will learn what you need to add new columns to your dataframe in R. We will work both with base R and some of the great Tidyverse packages.

The post How to Add a Column to a Dataframe in R with tibble & dplyr appeared first on Erik Marsja.

]]>In this brief tutorial, you will learn how to add a column to a dataframe in R. More specifically, you will learn 1) how to add a column using base R (i.e., by using the $-operator and brackets), 2) how to add a column using the add_column() function (i.e., from tibble), 3) how to add multiple columns, and 4) how to add columns from one dataframe to another.

Note, when adding a column with tibble, we are also going to use the `%>%` operator, which is part of dplyr. Note that dplyr, as well as tibble, has plenty of useful functions that, apart from enabling us to add columns, make it easy to remove a column by name from the R dataframe (e.g., using the `select()` function).

First, before reading an example data set from an Excel file, you are going to get the answers to a couple of questions. Second, we will have a look at the prerequisites to follow this tutorial. Third, we will have a look at how to add a new column to a dataframe using first base R and then tibble and the `add_column()` function. In this section, using dplyr and `add_column()`, we will also have a quick look at how we can add an empty column. Note, we will also append a column based on other columns. Furthermore, in the two last sections, we are going to learn how to insert multiple columns into a dataframe using tibble.

To follow this tutorial, in which we will carry out a simple data manipulation task in R, you only need to install dplyr and tibble if you want to use the `add_column()` and `mutate()` functions as well as the `%>%` operator. However, if you want to read the example data, you will also need to install the readxl package (which provides the `read_excel()` function).

It may be worth noting that all the mentioned packages are all part of the Tidyverse. This package comes packed with a lot of tools that can be used for cleaning data, visualizing data (e.g. to create a scatter plot in R with ggplot2).

To add a new column to a dataframe in R, you can use the $-operator. For example, to add the column “NewColumn”, you can do like this: `dataf$NewColumn <- Values`. This will effectively add your new variable to your dataset.

To add a column from one dataframe to another, you can use the $-operator. For example, if you want to add the column named "A" from the dataframe called "dfa" to the dataframe called "dfb", you can run the following code: `dfb$A <- dfa$A`. Adding multiple columns from one dataframe to another can also be accomplished, of course.
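For instance, copying several columns at once could be sketched like this. Note that "dfa" and "dfb" are made-up example dataframes (the only requirement is that they have the same number of rows):

```r
# Two hypothetical dataframes with the same number of rows
dfa <- data.frame(A = 1:3, B = 4:6, C = 7:9)
dfb <- data.frame(X = c("a", "b", "c"))

# Copy the columns "A" and "B" from dfa to dfb in one assignment
dfb[c("A", "B")] <- dfa[c("A", "B")]
```

After this, dfb contains its original column plus the two copied ones.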

In the next section, we are going to use the `read_excel()` function from the readxl package. After this, we are going to use R to add a column to the created dataframe.

Here’s how to read a .xlsx file in R:

```
# Import readxl
library(readxl)
# Read data from .xlsx file
dataf <- read_excel('./SimData/add_column.xlsx')
```

Code language: R (r)

In the code chunk above, we imported the file add_column.xlsx. This file was downloaded to the same directory as the script. We can obtain some information about the structure of the data using the `str()` function:
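If you do not have the Excel file at hand, here is a minimal sketch of what `str()` does. The columns here are made up and do not reflect the actual contents of add_column.xlsx:

```r
# Made-up stand-in for the imported data
dataf <- data.frame(A = c(1, 2, 3),
                    Cost = c(10.5, 20.1, 7.3))

# Print each column's name, type, and first values
str(dataf)
```

The output lists the dimensions of the data frame and one line per column.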

Before going to the next section it may be worth pointing out that it is possible to import data from other formats. For example, you can see a couple of tutorials covering how to read data from SPSS, Stata, and SAS:

- How to Read and Write Stata (.dta) Files in R with Haven
- Reading SAS Files in R
- How to Read & Write SPSS Files in R Statistical Environment

Now that we have some example data to practice with, let's move on to the next section, in which we will learn how to add a new column to a dataframe in base R.

First, we will use the $-operator and assign a new variable to our dataset. Second, we will use brackets ("[ ]") to do the same.

Here’s how to add a new column to a dataframe using the $-operator in R:

```
# add column to dataframe
dataf$Added_Column <- "Value"
```

Code language: R (r)

Note how we used the $-operator to create the new column in the dataframe. What we added to the dataframe was a character (i.e., the same word). This will produce a character vector as long as the number of rows. Here are the first six rows of the dataframe with the added column:

If we, on the other hand, tried to assign a vector that is not of the same length as the dataframe, it would fail. We would get an error similar to "*Error: Assigned data `c(2, 1)` must be compatible with existing data.*" For more about the dollar sign operator, check the post "How to use $ in R: 6 Examples – list & dataframe (dollar sign operator)".

If we would like to add a sequence of numbers, we can use the `seq()` function and the `length.out` argument:

```
# add column to dataframe
dataf$Seq_Col <- seq(1, 10, length.out = dim(dataf)[1])
```

Code language: R (r)

Notice how we also used the `dim()` function and selected the first element (the number of rows) to create a sequence with the same length as the number of rows. Of course, in a real-life example, we would probably want to specify the sequence a bit more before adding it as a new column. In the next section, we will learn how to add a new column using brackets.

Here’s how to append a column to a dataframe in R using brackets (“[]”):

```
# Adding a new column
dataf["Added_Column"] <- "Value"
```

Code language: R (r)

Using the brackets will give us the same result as using the $-operator. However, it may sometimes be easier to use the brackets instead of $. For example, when we have column names containing whitespace, brackets may be the way to go. Also, when selecting multiple columns, you have to use brackets and not $. In the next section, we are going to create a new column using tibble and the `add_column()` function.

Here’s how to add a column to a dataframe in R:

```
# Append column using tibble:
library(tibble)
library(dplyr)
dataf <- dataf %>%
  add_column(Add_Column = "Value")
```

Code language: R (r)

In the example above, we added a new column at “the end” of the dataframe, which produced the following output (note that we can also use dplyr to remove columns by name):

Finally, if we want to, we can add a column and create a copy of our old dataframe. Just assign the result to a new name (e.g., “dataf2”) instead of the left-hand “dataf”. Now that we have added a column to the dataframe, it might be time for other data manipulation tasks. For example, we may now want to remove duplicate rows from the R dataframe or transpose the dataframe.

If we want to append a column at a specific position, we can use the `.after` argument:

```
# R add column after another column
dataf <- dataf %>%
  add_column(Column_After = "After",
             .after = "A")
```

Code language: R (r)

As you probably understand, doing this will add the new column after the column "A". In the next example, we are going to append a column before a specified column.

Here’s how to add a column to the dataframe before another column:

```
# R add column before another column
dataf <- dataf %>%
  add_column(Column_Before = "Before",
             .before = "Cost")
```

Code language: R (r)

In the next example, we are going to use `add_column()` to add an empty column to the dataframe.

Here’s how we would do it if we wanted to add an empty column in R:

```
# Empty
dataf <- dataf %>%
  add_column(Empty_Column = NA)
```

Code language: R (r)

Note that we just added NA (the missing value indicator) as the empty column. Here’s the output, with the empty column added to the dataframe:

If we want an "empty" character column instead, we can replace the `NA` with "‘’", for example. However, this would create a character column and may not be considered empty. In the next example, we are going to add a column to a dataframe based on other columns.

Here’s how to use R to add a column to a dataframe based on other columns:

```
# Append column conditionally
dataf <- dataf %>%
  add_column(C = if_else(.$A == .$B, TRUE, FALSE))
```

Code language: R (r)

In the code chunk above, we added something to the `add_column()` function: the `if_else()` function. We did this because we wanted to add a value in the column based on the values in other columns. Furthermore, we used `.$` so that we get the two columns compared (using `==`). If the values in these two columns are the same, we add `TRUE` on that specific row. Here’s the new column added:

Note, you can also work with the `mutate()` function (also from dplyr) to add columns based on conditions. See this tutorial for more information about adding columns on the basis of other columns.
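A quick sketch of the `mutate()` approach, using made-up columns A and B rather than the tutorial's Excel data:

```r
library(dplyr)

# Made-up example data
dataf <- data.frame(A = c(1, 2, 3), B = c(1, 5, 3))

# Add a logical column that is TRUE where A and B match
dataf <- dataf %>%
  mutate(C = if_else(A == B, TRUE, FALSE))
```

Inside `mutate()` we can refer to the columns directly by name, without the `.$` prefix.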

In the next section, we will have a look at how to work with the `mutate()` function to compute, and add, a new variable to the dataset.

Here’s how to compute and add a new variable (i.e., column) to a dataframe in R:

```
# insert new column with mutate
dataf <- dataf %>%
  rowwise() %>%
  mutate(DepressionIndex = mean(c_across(Depr1:Depr5))) %>%
  ungroup()
```

Code language: R (r)

Notice how we, in the example code above, calculated a new variable called “DepressionIndex”, which is the mean of the five columns named Depr1 to Depr5. Obviously, we used the `mean()` function to calculate the mean of the columns. Notice how we also used the `c_across()` function, which is designed to be used together with `rowwise()`, so that the mean is calculated across these columns for each row.

Note now that you have added new columns, to the dataframe, you may also want to rename factor levels in R with e.g. dplyr. In the next section, however, we will add multiple columns to a dataframe.

Here’s how you would insert multiple columns into the dataframe using the `add_column()` function:

```
# Add multiple columns
dataf <- dataf %>%
  add_column(New_Column1 = "1st Column Added",
             New_Column2 = "2nd Column Added")
```

Code language: R (r)

In the example code above, we used the `add_column()` function to append two character columns to the dataframe. Here are the first six rows of the dataframe with the added columns:

Note, if you want to add multiple columns, you just add an argument as we did above for each column you want to insert. It is, again, important that the length of the vector is the same as the number of rows in the dataframe. Or else, we will end up with an error. Note, a more realistic example can be that we want to take the absolute value in R (from e.g. one column) and add it to a new column. In the next example, however, we will add columns from one dataframe to another.

In this section, you will learn how to add columns from one dataframe to another. Here’s how you append e.g. two columns from one dataframe to another:

```
# Import readxl
library(readxl)
# Read data from the .xlsx files:
dataf <- read_excel('./SimData/add_column.xlsx')
dataf2 <- read_excel('./SimData/add_column2.xlsx')
# Add the columns from the second dataframe to the first
dataf3 <- cbind(dataf, dataf2[c("Anx1", "Anx2", "Anx3")])
```

Code language: R (r)

In the example above, we used the `cbind()` function together with selecting which columns we wanted to add. Note that dplyr has the `bind_cols()` function, which can be used in a similar fashion. Now that you have put together your data sets, you can create dummy variables in R with e.g. the fastDummies package or calculate descriptive statistics.
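A hedged sketch of the `bind_cols()` alternative. The column names mirror the example above, but the data here is made up:

```r
library(dplyr)

# Made-up stand-ins for the two imported dataframes
dataf <- data.frame(ID = 1:3)
dataf2 <- data.frame(Anx1 = c(2, 3, 1),
                     Anx2 = c(4, 4, 5),
                     Anx3 = c(1, 2, 2),
                     Other = c(9, 9, 9))

# Bind the selected columns from the second dataframe to the first
dataf3 <- bind_cols(dataf, dataf2 %>% select(Anx1, Anx2, Anx3))
```

Unlike `cbind()`, `bind_cols()` checks that the two inputs have the same number of rows and errors out otherwise, which can catch mistakes early.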

In this post, you have learned how to add a column to a dataframe in R. Specifically, you have learned how to use the base functions available, as well as the add_column() function from Tibble. Furthermore, you have learned how to use the mutate() function from dplyr to append a column. Finally, you have also learned how to add multiple columns and how to add columns from one dataframe to another.

I hope you learned something valuable. If you did, please share the tutorial on your social media accounts, add a link to it in your projects, or just leave a comment below! Finally, suggestions and corrections are welcomed, also as comments below.

Here you will find some additional resources that you may find useful. The first three are especially interesting if you work with datetime objects (e.g., time series data):

- How to Extract Year from Date in R with Examples with e.g. lubridate (Tidyverse)
- Learn How to Extract Day from Datetime in R with Examples with e.g. lubridate (Tidyverse)
- How to Extract Time from Datetime in R – with Examples

If you are interested in other useful functions and/or operators these two posts might be useful:

- How to use %in% in R: 7 Example Uses of the Operator
- How to use the Repeat and Replicate functions in R

The post How to Add a Column to a Dataframe in R with tibble & dplyr appeared first on Erik Marsja.

In this R tutorial, you will learn how to rename factor levels in R using 1) levels() and 2) dplyr.

The post How to Rename Factor Levels in R using levels() and dplyr appeared first on Erik Marsja.

In this tutorial, you will learn how to rename factor levels in R. First, we will use the base functions that are available in R, and then we will use dplyr.

To rename factor levels using `levels()`, we can assign a character vector with the new names. If we want to recode factor levels with dplyr, we can use the `recode_factor()` function.

This R tutorial has the following outline. First, we start by answering some simple questions. Second, we will have a look at what is required to follow this tutorial. Third, we will read an example data set so that we have something to practice on. Fourth, we will go into how to rename factor levels using 1) the levels() function, and 2) the recode_factor() function from the dplyr package.

One simple method to rename a factor level in R is `levels(your_df$Category1)[levels(your_df$Category1)=="A"] <- "B"`, where `your_df` is your data frame and `Category1` is the column containing your categorical data. This would recode the factor level “A” to the new “B”.

The simplest way to rename multiple factor levels is to use the levels() function. For example, to recode the factor levels “A”, “B”, and “C” you can use the following code: `levels(your_df$Category1) <- c("Factor 1", "Factor 2", "Factor 3")`. This would rename the levels to “Factor 1”, “Factor 2”, and “Factor 3”.

In the next section, we will have a look at what is needed to follow this post.

To learn to recode factor levels by the examples in this post you need to download this data set. Furthermore, if you plan on using dplyr and the recode_factor() function, you will need to install this package. Here’s how to install an R-package:

`install.packages("dplyr")`

Code language: R (r)

Note that this package is very useful. You can, for instance, use dplyr to remove columns in R, and calculate descriptive statistics. A quick tip, before going on to the tutorial part of the post, is that you can install dplyr among plenty of other very good r packages if you install the Tidyverse package. For example, you will get ggplot2 that can be used for data visualization (e.g., can be used to create a scatter plot in R), lubridate to handle datetime data (e.g. to extract year from datetime). In the next section, we are going to read the example data from the .csv file.

Here is how to read a CSV file in R using the read.csv function:

```
# Import data (from R 4.0.0, stringsAsFactors = TRUE is
# needed to read the text columns as factors)
data <- read.csv("flanks.csv", stringsAsFactors = TRUE)
```

Code language: R (r)

Note that you need to download the CSV file and store it in the same directory as your R script. Data can, of course, also be imported from other data sources. See the following tutorials for more information:

- How to Read & Write SPSS Files in R Statistical Environment
- R Excel Tutorial: How to Read and Write xlsx files in R
- How to Read and Write Stata (.dta) Files in R with Haven
- Reading SAS Files in R with Haven & sas7dbat

Now, we have the data frame called `data`. If we want to get information about the variables in the data frame, we can use the `str()` function:

In the image above, it is clear that we have a data frame containing 5 columns (i.e., variables). Notice that the first column is probably an index column, but we will leave it as it is. Of particular interest for this post, we can see that we have one column with a categorical variable called “TrialType”. Furthermore, we can see that this variable has two factor levels.

In this section, we are going to use `levels()` to change the names of the levels of a categorical variable. First, we are just going to assign a character vector with the new names. Second, we are going to use a list, renaming the factor levels by name.

Here’s how to change the names of the factor levels using `levels()`:

```
# Renaming factor levels
levels(data$TrialType) <- c("Con", "InCon")
```

Code language: R (r)

In the example above, we used the levels() function, selected the categorical variable that we wanted, and assigned a character vector containing the new names. If we use the levels() function again, without assigning anything, we can see that we actually renamed the factor levels:
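A minimal, self-contained sketch of this check, with a made-up factor standing in for the TrialType column from flanks.csv:

```r
# Made-up factor standing in for data$TrialType
TrialType <- factor(c("congruent", "incongruent", "congruent"))

# Rename the levels (the vector is matched against the
# existing levels in their current, alphabetical order)
levels(TrialType) <- c("Con", "InCon")

# Inspect the renamed levels
levels(TrialType)
# [1] "Con"   "InCon"
```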

Note that if we try to assign a character vector containing too few, or too many, elements (i.e., names), it will not work. This will lead to an error (i.e., ‘*Error in `levels<-.factor`(`*tmp*`, value = "Con") : number of levels differs*’). Now that you have renamed the levels of a factor, you might want to clean the data frame from duplicate rows or columns. Furthermore, you can use the t() function to transpose in R (i.e., a matrix or a dataframe).

In the next example we will rename factor levels by name also using the levels() function.

Here’s how to rename the factor levels by name:

```
# Recode factor levels by name
levels(data$TrialType) <- list(Congruent = "Con", InCongruent = "InCon")
```

Code language: R (r)

Here's the output from `str()`, in which we can see that we renamed the levels of the TrialType factor, again:

Note, however, that when we rename factor levels by name like in the example above, ALL levels need to be present in the list; if any are not in the list, they will be replaced with NA. That is, you would end up with only a single factor level and NA scores. Not that good. In the next example, we are going to work with dplyr to change the names of the factor levels.

Note, if you are planning on carrying out regression analysis and still want to use your categorical variables, you can at this point create dummy variables in R.

One of the simplest ways to rename factor levels is by using the `recode_factor()` function:

```
# Renaming factor levels dplyr
library(dplyr)
data$TrialType <- recode_factor(data$TrialType,
                                congruent = "Con",
                                incongruent = "InCon")
```

Code language: R (r)

In the code example above, we first loaded dplyr so that we get the `recode_factor()` function into our namespace. Then, we assigned the renamed factor back to the column containing our categorical variable. The `recode_factor()` function works as follows: the first argument is the factor, and each following argument is a pair of the form `old_level = "new name"`, one pair for each level we want to rename.

As previously mentioned, dplyr is a very useful package. It can also be used to add a column to an R data frame based on other columns, or to simply add a column to a data frame in R. This can, of course, also be done with other packages that are part of the Tidyverse. Note that there are other ways to recode the levels of a factor in R. For instance, another package that is part of the Tidyverse has a function that can be used: forcats.
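For example, here is a hedged sketch using `fct_recode()` from forcats. Note the `new_name = "old_name"` argument order, and that the data below is made up:

```r
library(forcats)

# Made-up factor with the abbreviated level names
trial <- factor(c("Con", "InCon", "Con"))

# fct_recode() takes pairs of the form new_name = "old_name"
trial <- fct_recode(trial, Congruent = "Con", Incongruent = "InCon")
```

Unlike renaming via `levels()`, levels you do not mention are simply left unchanged rather than turned into NA.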

In this tutorial, you have learned how to rename factor levels in R. First, we had a look at how to use the `levels()` function to recode the levels of factors. Second, we had a look at the `recode_factor()` function from the dplyr package to do the same. Hope you learned something valuable. Please share the tutorial on your social media accounts if you did.

Here are some other resources that you may find useful when working in R statistical environment:

- How to use %in% in R: 7 Example Uses of the Operator
- Learn How to Generate a Sequence of Numbers in R with :, seq() and rep()
- How to use the Repeat and Replicate functions in R
- More on working with datetime objects in R: How to Extract Day from Datetime in R with Examples and How to Extract Time from Datetime in R – with Examples
- R Resources for Psychologists - for a collection of useful resources
- How to Take Absolute Value in R – vector, matrix, & data frame

The post How to Rename Factor Levels in R using levels() and dplyr appeared first on Erik Marsja.

In this R tutorial, you will learn how to remove duplicate rows and columns from a data frame. We will use the duplicated() and unique() functions from base R. Furthermore, we will use the distinct() function from the dplyr package.

The post How to Remove Duplicates in R – Rows and Columns (dplyr) appeared first on Erik Marsja.

In this R tutorial, you will learn how to remove duplicates from the data frame. First, you will learn how to delete duplicated rows and, second, you will remove columns. Specifically, we will have a look at how to remove duplicate records from the data frame using 1) base R, and 2) dplyr.

The post starts out with answering a few questions (e.g., “How do I remove duplicate rows in R?”). In the second section, you will learn about what is required to follow this R tutorial. That is, you will learn about the dplyr (and Tidyverse) packages and how to install them. When you have what you need to follow this R tutorial, we will create a data frame containing both duplicated rows and columns that we can use to practice on. In the next 5 sections, we will have a look at examples of how to delete duplicates in R. First, we will use base R and the duplicated() and unique() functions. Second, we will use the distinct() function from dplyr.

To delete duplicate rows in R, you can use the `duplicated()` function. Here’s how to remove all the duplicate rows in the data frame called “study_df”: `study_df.un <- study_df[!duplicated(study_df), ]`.

Now, that we know how to extract unique elements from the data frame (i.e., drop duplicate items) we are going to learn, briefly, about what is needed to follow this post.

Apart from having R installed, you also need to have the dplyr package installed (this package can be used to rename factor levels in R, and to rename columns in R, as well). That is, you need dplyr if you want to use the distinct() function to remove duplicate data from your data frame. R packages are, of course, easy to install. You can install dplyr using the `install.packages()` function. Here’s how to install packages in R:

```
# Installing packages in R:
install.packages("dplyr")
```

Code language: R (r)

It is worth noting here that dplyr is part of the Tidyverse package. This package is super useful because it comes with other awesome packages such as ggplot2 (see how to create a scatter plot in R with ggplot2, for example), readr, and tibble. To name a few! That said. Let’s create some example data to practice dropping duplicate records from!

Now, to practice removing duplicate rows and columns we need some data. Here’s some data with two duplicated rows and two duplicated columns:

```
# Creating a data frame:
example_df <- data.frame(FName = c('Steve', 'Steve', 'Erica',
                                   'John', 'Brody', 'Lisa', 'Lisa', 'Jens'),
                         LName = c('Johnson', 'Johnson', 'Ericson',
                                   'Peterson', 'Stephenson', 'Bond', 'Bond',
                                   'Gustafsson'),
                         Age = c(34, 34, 40,
                                 44, 44, 51, 51, 50),
                         Gender = c('M', 'M', 'F', 'M',
                                    'M', 'F', 'F', 'M'),
                         Gender = c('M', 'M', 'F', 'M',
                                    'M', 'F', 'F', 'M'))
```

Code language: R (r)

The data frame has 8 rows and 5 columns (we can use the `dim()` function to see this). Here’s the data frame with the duplicate rows and columns:

Most of the time, of course, we import our data from an external source. See the following posts for more information:

- R Excel Tutorial: How to Read and Write xlsx files in R
- How to Read & Write SPSS Files in R Statistical Environment
- Reading SAS Files in R with Haven & sas7dbat
- How to Read and Write Stata (.dta) Files in R with Haven

In the next section, we are going to start by removing the duplicate rows using base R.

Here’s how to remove duplicate rows in R using the `duplicated()` function:

```
# Remove duplicates from data frame:
example_df[!duplicated(example_df), ]
```

Code language: R (r)

As you can see in the output above, we have now removed the duplicated rows from the data frame. What we did was create a boolean vector indicating which rows in our data frame are duplicates, and use it to select rows. Notice how we used the `!` operator to select the rows that *were not* duplicated. Finally, we also used the “,” so that we select all columns.
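To make the boolean vector concrete, here is a minimal sketch with a small made-up dataframe:

```r
# A small made-up dataframe where row 2 repeats row 1
small_df <- data.frame(x = c(1, 1, 2),
                       y = c("a", "a", "b"))

# duplicated() flags rows that repeat an earlier row
duplicated(small_df)
# [1] FALSE  TRUE FALSE
```

Negating this vector with `!` and using it as a row index keeps the first occurrence of each row.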

In the image above, we can see that the two duplicate rows have been removed. Of course, if you want the changes to be permanent, you need to use `<-`:

```
# Delete duplicate rows
example_df.un <- example_df[!duplicated(example_df), ]
```

Code language: R (r)

Note that there are other useful operators, such as the %in% operator in R, that can be used for e.g. value matching.

In the next example, we are going to use the `duplicated()` function to remove one of the two identical columns (i.e., “Gender” and “Gender.1”).

To remove duplicate columns, we can, again, use the `duplicated()` function:

```
# Drop Duplicated Columns:
ex_df.un <- example_df[!duplicated(as.list(example_df))]
# Dimensions
dim(ex_df.un)
# 8 Rows and 4 Columns
# First six rows:
head(ex_df.un)
```

Code language: R (r)

Now, to remove duplicate columns, we added the `as.list()` function and removed the “,”. That is, we changed the syntax from Example 1 somewhat. Again, we can use the `dim()` function to see that we have dropped one column from the data frame. Here’s also the result from the `head()` function:

Note, dplyr can be used to remove columns from the data frame as well. In the next example, we are going to use another base R function to delete duplicate data from the data frame: the `unique()` function.

Here’s how you can remove duplicate rows using the `unique()` function:

```
# Deleting duplicates:
examp_df <- unique(example_df)
# Dimension of the data frame:
dim(examp_df)
# Output: 6 5
```

Code language: R (r)

As you can see, using the `unique()` function to remove the identical rows in the data frame is quite straightforward. It is worth noting, here, that if you want to keep the last occurrences of the duplicate rows, you can use the `fromLast` argument and set it to `TRUE`. If you're done carrying out data manipulation, you can now create a dummy variable in R, for example.
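A quick sketch of `fromLast` in action, via `duplicated()` on made-up data:

```r
# Rows 1 and 2 are identical
df <- data.frame(id = c(1, 1, 2),
                 score = c(10, 10, 30))

# Keep the LAST occurrence of each duplicated row instead of the first
df_last <- df[!duplicated(df, fromLast = TRUE), ]
```

Here `df_last` keeps rows 2 and 3, whereas the default (`fromLast = FALSE`) would keep rows 1 and 3.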

In the final two examples, we are going to use the `distinct()` function from the dplyr package to remove duplicate rows.

Here’s how to drop duplicates in R with the `distinct()` function:

```
# Deleting duplicates with dplyr
library(dplyr)
ex_df.un <- example_df %>%
  distinct()
```

Code language: R (r)

In the code example above, we used the distinct() function to keep only unique/distinct rows from the data frame. When working with the `distinct()` function, if there are duplicate rows, only the first of the identical rows is preserved. Note, if you want to, you can now go on and add an empty column to your data frame. This is something you can do with tibble, a package that is part of the Tidyverse. In the final example, we are going to look at an example in which we drop rows based on one column.

It is also possible to delete duplicate rows based on values in a certain column. Here's how to remove duplicate rows based on one column:

```
# remove duplicate rows with dplyr
example_df %>%
  # Base the removal on the "Age" column
  distinct(Age, .keep_all = TRUE)
```

Code language: R (r)

In the example above, we used the Age column as the first argument. Second, we used the `.keep_all` argument to keep all the columns in the data frame. If we now use the `dim()` function again, we can see that we have 5 rows and 5 columns. Let’s print the data frame to see which rows we dropped.

Although we usually would not want to remove rows with duplicate values in a column such as the participants' age, there might be times when we do want to remove duplicates in R based on a single column. Furthermore, we can base the removal on identical values across more than one column. Now that you have removed duplicate rows and columns from your data frame, you might want to use R to add a column to the data frame based on other columns.
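For instance, basing the removal on the combination of two columns could be sketched like this (with made-up data):

```r
library(dplyr)

# Made-up data: the first two rows match on both FName and LName
people <- data.frame(FName = c("Lisa", "Lisa", "Lisa"),
                     LName = c("Bond", "Bond", "Smith"),
                     Age = c(51, 52, 33))

# Keep rows that are distinct on the FName and LName combination
people %>%
  distinct(FName, LName, .keep_all = TRUE)
```

With `.keep_all = TRUE`, the first matching row (here, the one with Age 51) is the one that survives.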

In this short R tutorial, you have learned how to remove duplicates in R. Specifically, you have learned how to carry out this task by using two base functions (i.e., duplicated() and unique()) as well as the distinct() function from dplyr. Furthermore, you have learned how to drop rows and columns that are occurring as identical copies in, at least, two cases in your data frame.

Here are some other tutorials you may find useful:

- How to Transpose a Dataframe or Matrix in R with the t() Function
- How to use the Repeat and Replicate functions in R
- How to Generate a Sequence of Numbers in R with :, seq() and rep()

The post How to Remove Duplicates in R – Rows and Columns (dplyr) appeared first on Erik Marsja.

]]>