In this R tutorial, you will learn how to remove duplicates from a data frame. First, you will learn how to delete duplicated rows and, second, how to remove duplicated columns. Specifically, we will have a look at how to remove duplicate records from the data frame using 1) base R, and 2) dplyr.
The post starts by answering a common question (“How do I remove duplicate rows in R?”). In the second section, you will learn what is required to follow this R tutorial. That is, you will learn about the dplyr package (part of the Tidyverse) and how to install it. When you have what you need to follow this R tutorial, we will create a data frame containing both duplicated rows and columns that we can practice on. In the next 5 sections, we will work through examples of how to delete duplicates in R. First, we will use base R and the duplicated() and unique() functions. Second, we will use the distinct() function from dplyr.
How do I remove duplicate rows in R?
To delete duplicate rows in R, you can use the duplicated() function. Here’s how to remove all the duplicates in the data frame called “study_df”:
study_df.un <- study_df[!duplicated(study_df), ]
Now that we know how to extract the unique rows from a data frame (i.e., drop duplicate items), we are going to learn, briefly, about what is needed to follow this post.
Apart from having R installed you also need to have the dplyr package installed (this package can be used to rename factor levels in R, as well). That is, you need dplyr if you want to use the distinct() function to remove duplicate data from your data frame. R packages are, of course, easy to install. You can install dplyr using the
install.packages() function. Here’s how to install packages in R:
# Installing packages in R:
install.packages("dplyr")
It is worth noting here that dplyr is part of the Tidyverse. The Tidyverse is super useful because it comes with other awesome packages such as ggplot2 (see how to create a scatter plot in R with ggplot2, for example), readr, and tibble, to name a few. That said, let’s create some example data to practice dropping duplicate records from!
Now, to practice removing duplicate rows and columns we need some data. Here’s some data with two pairs of duplicated rows and one pair of identical columns:
# Creating a data frame:
# (data.frame() renames the second "Gender" column to "Gender.1")
example_df <- data.frame(
  FName = c('Steve', 'Steve', 'Erica', 'John', 'Brody', 'Lisa', 'Lisa', 'Jens'),
  LName = c('Johnson', 'Johnson', 'Ericson', 'Peterson', 'Stephenson', 'Bond', 'Bond', 'Gustafsson'),
  Age = c(34, 34, 40, 44, 44, 51, 51, 50),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M')
)
The data frame has 8 rows and 5 columns (we can use the
dim() function to see this). Here’s the data frame with the duplicate rows and columns:
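To check the dimensions and column names yourself, something like the following sketch should work. It rebuilds example_df so that it runs on its own, and it relies on the default behavior of data.frame(), which makes repeated column names unique.

```r
# Rebuild the example data so this snippet is self-contained
example_df <- data.frame(
  FName = c('Steve', 'Steve', 'Erica', 'John', 'Brody', 'Lisa', 'Lisa', 'Jens'),
  LName = c('Johnson', 'Johnson', 'Ericson', 'Peterson', 'Stephenson',
            'Bond', 'Bond', 'Gustafsson'),
  Age = c(34, 34, 40, 44, 44, 51, 51, 50),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M')
)

# Number of rows and columns
df_dim <- dim(example_df)
df_dim
# 8 5

# The second "Gender" column has been renamed to "Gender.1"
df_names <- names(example_df)
df_names

# Print the whole data frame to spot the repeated rows
print(example_df)
```

Rows 1 and 2 (Steve Johnson) and rows 6 and 7 (Lisa Bond) are the repeated records.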
Most of the time, of course, we import our data from an external source. See the following posts for more information:
- R Excel Tutorial: How to Read and Write xlsx files in R
- How to Read & Write SPSS Files in R Statistical Environment
- Reading SAS Files in R with Haven & sas7dbat
- How to Read and Write Stata (.dta) Files in R with Haven
In the next section, we are going to start by removing the duplicate rows using base R.
Example 1: Remove Duplicates using Base R and the duplicated() Function
Here’s how to remove duplicate rows in R using the duplicated() function:
# Remove duplicates from data frame:
example_df[!duplicated(example_df), ]
As you can see in the output above, we have now removed the second occurrence of each duplicated row from the data frame. What we did was to create a boolean vector that flags the rows that are duplicated in our data frame. Furthermore, we selected the rows using this boolean vector. Notice also how we used the ! operator to select the rows that were not duplicated. Finally, we also used the “,” and left the column position empty so that we select all columns.
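To make the boolean vector more concrete, here is a minimal sketch that inspects it before it is used for subsetting. The helper name dup_rows is just an illustrative choice, and the data frame is rebuilt so that the snippet runs on its own.

```r
# Rebuild the example data so this snippet is self-contained
example_df <- data.frame(
  FName = c('Steve', 'Steve', 'Erica', 'John', 'Brody', 'Lisa', 'Lisa', 'Jens'),
  LName = c('Johnson', 'Johnson', 'Ericson', 'Peterson', 'Stephenson',
            'Bond', 'Bond', 'Gustafsson'),
  Age = c(34, 34, 40, 44, 44, 51, 51, 50),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M')
)

# TRUE marks a row that repeats an earlier row
dup_rows <- duplicated(example_df)
dup_rows
# FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE

# Positions of the rows that will be dropped
which(dup_rows)
# 2 7

# Negating the vector keeps only the first occurrence of each row
kept <- example_df[!dup_rows, ]
nrow(kept)
# 6
```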
In the image above, we can see that two rows have been removed. Of course, if you want the changes to be permanent you need to assign the result using <-:
# Delete duplicate rows
example_df.un <- example_df[!duplicated(example_df), ]
Note that there are other useful operators in R, such as %in%, which can be used for value matching, for example.
In the next example, we are going to use the
duplicated() function to remove one of the two identical columns (i.e., “Gender” and “Gender.1”).
Example 2: Remove Duplicate Columns using Base R’s duplicated()
To remove duplicate columns we can, again, use the duplicated() function:
# Drop duplicated columns:
ex_df.un <- example_df[!duplicated(as.list(example_df))]
# Dimensions:
dim(ex_df.un)
# 8 rows and 4 columns
# First six rows (the default for head()):
head(ex_df.un)
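Now, to remove duplicate columns we added the as.list() function and removed the “,”. That is, we changed the syntax from Example 1 slightly: as.list() turns the data frame into a list of columns, so duplicated() compares columns instead of rows, and indexing without the “,” selects columns. Again, we can use the dim() function to see that we have dropped one column from the data frame, and head() prints the first rows of the result.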
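The role of as.list() is easier to see when the two calls are compared side by side. The following is a small sketch under the same example data (rebuilt here so the snippet runs on its own); dup_cols is only an illustrative name.

```r
# Rebuild the example data so this snippet is self-contained
example_df <- data.frame(
  FName = c('Steve', 'Steve', 'Erica', 'John', 'Brody', 'Lisa', 'Lisa', 'Jens'),
  LName = c('Johnson', 'Johnson', 'Ericson', 'Peterson', 'Stephenson',
            'Bond', 'Bond', 'Gustafsson'),
  Age = c(34, 34, 40, 44, 44, 51, 51, 50),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M')
)

# On the data frame itself, duplicated() compares rows: one value per row
length(duplicated(example_df))
# 8

# On the list of columns, duplicated() compares columns: one value per column
dup_cols <- duplicated(as.list(example_df))
length(dup_cols)
# 5

# Only the fifth column ("Gender.1") repeats an earlier column
which(dup_cols)
# 5

# Names of the columns that are kept
kept_names <- names(example_df)[!dup_cols]
kept_names
```

Note that the comparison is based on the column contents, not on the column names, which is why “Gender.1” is detected as a copy of “Gender”.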
Note, dplyr can be used to remove columns from the data frame as well. In the next example, we are going to use another base R function to delete duplicate data from the data frame: the unique() function.
Example 3: Removing Duplicates in R with the unique() Function
Here’s how you can remove duplicate rows using the unique() function:
# Deleting duplicates:
examp_df <- unique(example_df)
# Dimension of the data frame:
dim(examp_df)
# Output: 6 5
As you can see, using the unique() function to remove the identical rows in the data frame is quite straightforward. It is worth noting here that if you want to keep the last occurrence of each duplicated row, you can set the fromLast argument to TRUE. If you are done with your data manipulation, you can go on to create a dummy variable in R, for example.
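As a quick illustration of the fromLast argument, the sketch below compares which rows survive with and without it. The variable names first_kept and last_kept are illustrative, and the data frame is rebuilt so that the snippet runs on its own.

```r
# Rebuild the example data so this snippet is self-contained
example_df <- data.frame(
  FName = c('Steve', 'Steve', 'Erica', 'John', 'Brody', 'Lisa', 'Lisa', 'Jens'),
  LName = c('Johnson', 'Johnson', 'Ericson', 'Peterson', 'Stephenson',
            'Bond', 'Bond', 'Gustafsson'),
  Age = c(34, 34, 40, 44, 44, 51, 51, 50),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M')
)

# Default: the first occurrence of each duplicated row is kept
first_kept <- unique(example_df)
rownames(first_kept)
# "1" "3" "4" "5" "6" "8"

# fromLast = TRUE: the last occurrence of each duplicated row is kept
last_kept <- unique(example_df, fromLast = TRUE)
rownames(last_kept)
# "2" "3" "4" "5" "7" "8"
```

Because the duplicated rows are identical in every column here, the two results contain the same values and differ only in their row names. The choice matters more when duplicates are identified on a subset of columns.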
In the final two examples, we are going to use the distinct() function from the dplyr package to remove duplicate rows.
Example 4: Delete Duplicates in R using dplyr’s distinct() Function
Here’s how to drop duplicates in R with the distinct() function:
# Deleting duplicates with dplyr
library(dplyr)
ex_df.un <- example_df %>%
  distinct()
In the code example above, we used the function distinct() to keep only unique/distinct rows from the data frame. When working with the distinct() function, if there are duplicate rows, only the first of the identical rows is preserved. Note that, if you want to, you can now go on and add an empty column to your data frame. This is something you can do with tibble, a package that is part of the Tidyverse. In the final example, we are going to drop rows based on the values in one column.
Example 5: Delete Duplicate Rows Based on Columns with the distinct() Function
It is also possible to delete duplicate rows based on values in a certain column. Here's how to remove duplicate rows based on one column:
# Remove duplicate rows with dplyr
example_df %>%
  # Base the removal on the "Age" column
  distinct(Age, .keep_all = TRUE)
In the example above, we passed the column name (Age) as the first argument. Second, we used the .keep_all argument to keep all the columns in the data frame. If we now use the dim() function on the result, we can see that we have 5 rows and 5 columns. Let’s print the data frame to see which rows we dropped.
We would usually not want to remove rows just because they share a value in a column such as the age of the participants of a study, but there might be other times when we want to remove duplicates in R based on a single column. Furthermore, we can pass more than one column to distinct() and drop the rows that have identical values across all of those columns. Now that you have removed duplicate rows and columns from your data frame you might want to use R to add a column to the data frame based on other columns.
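As a sketch of the multi-column case, the snippet below bases the removal on two columns at once. It assumes that dplyr is installed, rebuilds the example data so that it runs on its own, and uses the illustrative names by_name and by_age_gender.

```r
library(dplyr)

# Rebuild the example data so this snippet is self-contained
example_df <- data.frame(
  FName = c('Steve', 'Steve', 'Erica', 'John', 'Brody', 'Lisa', 'Lisa', 'Jens'),
  LName = c('Johnson', 'Johnson', 'Ericson', 'Peterson', 'Stephenson',
            'Bond', 'Bond', 'Gustafsson'),
  Age = c(34, 34, 40, 44, 44, 51, 51, 50),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'),
  Gender = c('M', 'M', 'F', 'M', 'M', 'F', 'F', 'M')
)

# Keep one row per unique combination of first and last name
by_name <- example_df %>%
  distinct(FName, LName, .keep_all = TRUE)
dim(by_name)
# 6 5

# Keep one row per unique combination of age and gender
by_age_gender <- example_df %>%
  distinct(Age, Gender, .keep_all = TRUE)
dim(by_age_gender)
# 5 5
```

Without .keep_all = TRUE, distinct() would return only the columns named in the call.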
In this short R tutorial, you have learned how to remove duplicates in R. Specifically, you have learned how to carry out this task by using two base functions (i.e., duplicated() and unique()) as well as the distinct() function from dplyr. Furthermore, you have learned how to drop rows and columns that occur as identical copies more than once in your data frame.
Other Useful R Tutorials
Here are some other tutorials you may find useful:
- How to Transpose a Dataframe or Matrix in R with the t() Function
- How to use the Repeat and Replicate functions in R
- How to Generate a Sequence of Numbers in R with :, seq() and rep()