In this R tutorial, we are going to learn how to create dummy variables in R. Creating dummy (indicator) variables can be carried out in many ways. For example, we can write code using the ifelse() function, we can install the R package fastDummies, or we can work with other packages and functions (e.g., model.matrix()). In this post, we are mainly going to use the ifelse() function and the fastDummies package (i.e., the dummy_cols() function). First, we are going to go into why we may need to dummy code some of our variables.
In the first section of this post, you are going to learn when we need to dummy code our categorical variables. This is followed by a section outlining what you need to have installed to follow this post, including how to install the packages that you can use to create dummy variables in R. After that come three answers to frequently asked questions about dummy coding, both in general and in R. Note, the answers will also give you the knowledge to create indicator variables.
Finally, we are going to get into the different methods that we can use for dummy coding in R. First, we will use the ifelse() function and you will learn how to create dummy variables in two simple steps. Second, we will use the fastDummies package and you will learn three simple steps for dummy coding. The fastDummies package is also a lot easier to work with when you, e.g., want to make indicator variables from multiple columns. Therefore, there will be a section covering this as well as a section about removing columns that we don’t need anymore. In the following section, we will also have a look at how to use the recipes package for creating dummy variables in R. Before concluding the post, we will also learn about some other options that are available.
In regression analysis, a prerequisite is that all input variables are at the interval scale level, i.e., that the distance between all steps on the scale of the variable is the same. However, not everything we want to study can be measured on such a scale. For example, different types of categories and characteristics do not necessarily have an inherent ranking. If we are, for example, interested in the impact of different educational approaches on political attitudes, we cannot assume that a science education is twice as much as a social science education, or that a librarian education is half the education in biomedicine. The different types of education are simply different (although some aspects of them can, after all, be compared, for example, their length).
What if we think that education has an important effect that we want to take into account in our data analysis? Well, these are some situations when we need to use dummy variables. Read on to learn how to create dummy variables for categorical variables in R.
What you Need to Make Dummy Variables
In this section, before answering some frequently asked questions, you are briefly going to learn what you need to follow this post. First, if you are planning on dummy coding using base R (e.g., by using the ifelse() function) you do not need to install any packages. However, if you are planning on using the fastDummies package or the recipes package you need to install either one of them (or both if you want to follow every section of this R tutorial). Installing packages can be done using the install.packages() function. Here’s how to install the two dummy coding packages:
install.packages(c('fastDummies', 'recipes'))
Of course, if you only want to install one of them you can remove the vector (i.e., c()) and leave the package you want. Note, recipes is part of the tidymodels collection of packages (not the core Tidyverse). This means that we can also install it, together with a lot of other useful modeling packages, by installing tidymodels. In the next section, we will quickly answer some questions.
What is a Dummy Variable? Give an Example
A dummy variable is a variable that indicates whether an observation has a particular characteristic. A dummy variable can only assume the values 0 and 1, where 0 indicates the absence of the property, and 1 indicates the presence of the same. The values 0/1 can be seen as no/yes or off/on. See the table below for some examples of dummy variables.
Why do we create dummy variables in R?
Creating dummy variables in R is a way to incorporate nominal variables into regression analysis. It is quite easy to understand why we create dummy variables once you understand the regression model.
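To make the connection to regression a bit more concrete, here is a minimal sketch using a small made-up data frame (the names toy, group, y, and group_B are hypothetical and only used for illustration). It fits the same model twice: once with a hand-made 0/1 column and once with the categorical variable itself, which lm() dummy codes internally.

```r
# Hypothetical toy data: a nominal variable with two levels and an outcome
toy <- data.frame(
  group = c('A', 'A', 'B', 'B', 'A', 'B'),
  y     = c(10, 12, 15, 17, 11, 16)
)

# Hand-made dummy variable: 1 for group B, 0 for group A
toy$group_B <- ifelse(toy$group == 'B', 1, 0)

# Model using the dummy variable as a predictor
fit_dummy <- lm(y ~ group_B, data = toy)

# Model using the factor; lm() applies the same 0/1 coding internally
fit_factor <- lm(y ~ factor(group), data = toy)

# In both models the slope is the mean difference between group B and group A
coef(fit_dummy)
coef(fit_factor)
```

In both fits, the intercept is the mean of the reference group (A) and the slope is how much the mean of group B differs from it.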
How do You Create a Dummy Variable in R?
To create a dummy variable in R you can use the ifelse() function:
df$Male <- ifelse(df$sex == 'male', 1, 0)
df$Female <- ifelse(df$sex == 'female', 1, 0)
This code will create two new columns. In the column “Male” you will get the number 1 when the subject is male and 0 when the subject is female. For the column “Female”, it will be the opposite (Female = 1, Male = 0).
|Variable|Dummy coding|
|Smoking|Smoker = 1, Non-smoker = 0|
|Location|North = 1, South = 0|
|Answer|Yes = 1, No = 0|
Note, if you want to, it is possible to rename the levels of a factor in R before making dummy variables. Now, let’s jump directly into a simple example of how to make dummy variables in R. In the next two sections, we will learn dummy coding by using R’s ifelse() and fastDummies’ dummy_cols(). After that, we will quickly have a look at how to use the recipes package for dummy coding.
How to Create Dummy Variables in R in Two Steps: ifelse() example
Here’s how to create dummy variables in R using the ifelse() function in two simple steps:
1) Import Data
In the first step, import the data (e.g., from a CSV file):
dataf <- read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/carData/Salaries.csv')
In the code above, we need to make sure that the character string points to where our data is stored (e.g., our .csv file). For example, when loading a dataset from our hard drive we need to make sure we add the path to this file. Now, in the next step, we will create two dummy variables in two lines of code.
2) Create the Dummy Variables with the ifelse() Function
Next, start creating the dummy variables in R using the ifelse() function:
dataf$Disc_A <- ifelse(dataf$discipline == 'A', 1, 0)
dataf$Disc_B <- ifelse(dataf$discipline == 'B', 1, 0)
In the simple example above, we created the dummy variables using the ifelse() function. First, we read data from a CSV file (from the web). Second, we created two new columns. In the first column we created, we assigned the numerical value 1 if the cell value in the column discipline was ‘A’. If not, we assigned the value 0. Of course, we did the same for ‘B’ when we created the second column. We can look at the first 5 rows of the dataframe with head(dataf, 5).
Now, data can be imported into R from other formats as well. If the data we want to dummy code in R is stored in Excel files, check out the post about how to read xlsx files in R. As we sometimes work with datasets with a lot of variables, using the ifelse() approach may not be the best way. For instance, creating dummy variables this way can make the R code long and harder to read. In the next section, we will go on and have a look at another approach for dummy coding categorical variables.
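Before moving on, it may be worth noting that some of the repetition can be avoided in base R by generating the ifelse() calls in a loop over the levels. The following is a minimal sketch on a made-up data frame; the names toy and rank, and the rank_ prefix for the new columns, are assumptions chosen for illustration.

```r
# Hypothetical toy data with a three-level categorical variable
toy <- data.frame(
  rank = c('Prof', 'AsstProf', 'AssocProf', 'Prof'),
  stringsAsFactors = FALSE
)

# One ifelse() call per level, generated in a loop
for (lvl in unique(toy$rank)) {
  toy[[paste0('rank_', lvl)]] <- ifelse(toy$rank == lvl, 1, 0)
}

# Each row now has exactly one of the new dummy columns set to 1
toy
```

This keeps the code short when there are many levels, but it still has to be repeated for every column we want to dummy code.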
Three Steps to Create Dummy Variables in R with the fastDummies Package
In this section, we are going to use the fastDummies package to make dummy variables. Now, there are three simple steps for the creation of dummy variables with the dummy_cols function.
Here’s how to make dummy variables in R using the fastDummies package:
1) Install the fastDummies Package
First, we need to install the R package. Installing R packages can be done with the install.packages() function. So start up RStudio and type this in the console:
# Install fastDummies:
install.packages('fastDummies')
2) Load the fastDummies Package:
Next, we are going to use the library() function to load the fastDummies package into R:
# Import fastDummies
library('fastDummies')
Now that we have installed and loaded the fastDummies package we will continue, in the next step, with dummy coding our variables.
3) Make Dummy Variables in R
Finally, we are ready to use the dummy_cols() function to make the dummy variables. Here’s how to make indicator variables in R using the dummy_cols() function:
# Create dummy variables:
dataf <- dummy_cols(dataf, select_columns = 'rank')
Now, the neat thing with using dummy_cols() is that we only need two lines of code. Furthermore, if we want to create dummy variables from more than one column, we’ll save even more lines of code (see next subsection). Now that you’re done creating dummy variables, you might want to extract time from datetime.
How to Create Dummy Variables for More than One Column
In the previous section, we used the dummy_cols() function to make dummy variables from one column. It is, of course, possible to dummy code many columns using both the ifelse() function and the fastDummies package. However, if we have many categories in our variables, it may require many lines of code using the ifelse() function. Thus, in this section we are going to add one more column to the select_columns argument of the dummy_cols() function.
# Make dummy variables of two columns:
dataf <- dummy_cols(dataf, select_columns = c('rank', 'discipline'))
Now, as is evident from the code example above, the select_columns argument can take a vector of column names as well. Of course, this means that we can add as many as we need here. Running the above code will generate 5 new columns containing the dummy coded variables (three for rank and two for discipline). Note, you can use R to conditionally add a column to the dataframe based on other columns if you need to.
Removing the Columns
In this section, we are going to use one more of the arguments of the dummy_cols() function: remove_selected_columns. This may be very useful if we, for instance, are going to make dummy variables of multiple variables and don’t need the original columns for the data analysis later.
dataf.2 <- dummy_cols(dataf, select_columns = c('rank', 'discipline'), remove_selected_columns = TRUE)
Note, if we don’t use the select_columns argument, dummy_cols() will create dummy variables of all columns with categorical data (i.e., character and factor columns). This is especially useful if we want to automatically create dummy variables for all categorical predictors in the R dataframe. See the documentation for more information about the dummy_cols() function. Finally, the fastDummies package also includes the dummy_rows() function, which adds dummy rows (rather than columns) to the dataframe.
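As a minimal sketch of leaving out select_columns, the example below uses a small made-up data frame (toy and its columns are hypothetical) and only runs the dummy coding if the fastDummies package is installed.

```r
# Hypothetical toy data: two categorical columns and one numeric column
toy <- data.frame(
  rank       = c('Prof', 'AsstProf', 'Prof'),
  discipline = c('A', 'B', 'B'),
  salary     = c(100, 80, 120),
  stringsAsFactors = FALSE
)

if (requireNamespace('fastDummies', quietly = TRUE)) {
  # No select_columns: the character columns rank and discipline are dummy coded,
  # and remove_selected_columns = TRUE drops the original columns afterwards
  toy_dummies <- fastDummies::dummy_cols(toy, remove_selected_columns = TRUE)
  print(names(toy_dummies))
}
```

The numeric column salary is left untouched, while rank and discipline are replaced by columns such as rank_Prof and discipline_A.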
It is, of course, possible to drop variables after we have done the dummy coding in R. For example, see the post about how to remove a column in R with dplyr for more about deleting columns from the dataframe. In some cases, you also need to delete duplicate rows. Now that you have created dummy variables, you can also go on and extract year from date.
How to Make Dummy Variables in R with the step_dummy() Function
Here’s a code example you can use to make dummy variables using the step_dummy() function from the recipes package:
library('recipes')
# Making dummy variables
dummies <- dataf %>%
  recipe(salary ~ .) %>%
  step_dummy(sex, one_hot = TRUE) %>%
  prep() %>%
  bake(dataf)
Without getting into too much detail about the code chunk above, we start by loading the recipes package. Second, we create the variable dummies. On the right of the “arrow” we take our dataframe and create a recipe for preprocessing our data (i.e., this is what the recipe() function is for). In this function, we start by setting our dependent variable (i.e., salary) and then, after the tilde, we can add our predictor variables. In our case, we want to select all other variables and, therefore, use the dot. It is in the next part, where we use step_dummy(), that we actually make the dummy variables. The first argument is the categorical variable that we want to dummy code. The second argument, one_hot, is set to TRUE so that we get one column for male and one column for female. If this is not set to TRUE, we only get one column. Finally, we use prep() to estimate the recipe so that we can then apply it to the dataset we used (by using bake()). We can view the first 10 rows of the new dataframe with indicator variables by running head(dummies, 10).
Notice how the column sex was automatically removed from the dataframe. That is, in the dataframe we now have, containing the dummy coded columns, we don’t have the original, categorical, column anymore.
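To see the difference the one_hot argument makes, here is a minimal sketch on a small made-up data frame (toy and its values are hypothetical). It avoids the pipe and calls the recipes functions step by step, and it only runs if the recipes package is installed.

```r
# Hypothetical toy data; sex is stored as a factor so step_dummy() can use it
toy <- data.frame(
  salary = c(100, 90, 120, 110),
  sex    = factor(c('Male', 'Female', 'Female', 'Male'))
)

if (requireNamespace('recipes', quietly = TRUE)) {
  library(recipes)

  # one_hot = FALSE (the default): the first level (Female) is the reference
  rec_ref <- step_dummy(recipe(salary ~ ., data = toy), sex, one_hot = FALSE)
  one_col <- bake(prep(rec_ref), new_data = NULL)

  # one_hot = TRUE: one indicator column per level
  rec_hot <- step_dummy(recipe(salary ~ ., data = toy), sex, one_hot = TRUE)
  two_cols <- bake(prep(rec_hot), new_data = NULL)

  print(names(one_col))
  print(names(two_cols))
}
```

For a regression model, the single-column version is usually what we want, since the reference level is then captured by the intercept.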
Note, if you are planning on (also) doing Analysis of Variance, you can check the assumption of equal variances with the Brown-Forsythe Test in R.
Other Options for Dummy Coding in R
Now, before summarizing this R tutorial, it may be worth mentioning that there are other options to recode categorical data to dummy variables. For instance, we could have used the model.matrix function, and the dummies package. It is worth pointing out, however, that it seems like the dummies package hasn’t been updated for a while.
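As a brief illustration of the model.matrix() option, here is a minimal base R sketch on a made-up data frame (toy and its values are hypothetical). No extra packages are needed.

```r
# Hypothetical toy data with one categorical predictor
toy <- data.frame(discipline = factor(c('A', 'B', 'B', 'A')))

# "- 1" drops the intercept, so every level gets its own 0/1 column
dummies <- model.matrix(~ discipline - 1, data = toy)
dummies

# With the intercept kept, the first level (A) becomes the reference category
model.matrix(~ discipline, data = toy)
```

Note that model.matrix() returns a matrix, so the result may need to be converted (e.g., with as.data.frame()) or bound to the original dataframe with cbind() before further use.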
Finally, it may be worth mentioning that the recipes package is part of the tidymodels framework, which is designed to work together with the tidyverse packages. With the tidyverse installed, you can do a lot more than just creating dummy variables. For instance, using the tibble package you can add an empty column to the R dataframe or calculate/add new variables/columns to a dataframe in R.
Summary and Conclusion
In this post, we have worked with 1) R’s ifelse() function and 2) the fastDummies package to recode categorical variables to dummy variables in R. We also had a brief look at the recipes package. In fact, we learned that it is an easy task with R, especially when we use a package such as fastDummies and have a lot of variables to dummy code (or a lot of levels of the categorical variable). The next step in the data analysis pipeline may now be to analyze the data (e.g., regression or random forest modeling).
Now, there are of course other valuable resources to learn more about dummy variables (or indicator variables). Here are some articles and journal papers that you might find useful:
- Categorical Variables in Regression Analysis: A Comparison of Dummy and Effect Coding
- No More: Effect Coding as an Alternative to Dummy Coding With Implications for Higher Education Researchers
- Random Forests, Decision Trees, and Categorical Predictors: The “Absent Levels” Problem