In this R tutorial, we are going to learn how to create dummy variables in R. Creating dummy variables can be carried out in many ways. For example, we can write code using the ifelse() function, we can install the R package fastDummies, or we can work with other packages and functions (e.g., model.matrix()). In this post, however, we are going to use the ifelse() function and the fastDummies package (i.e., the dummy_cols() function). First, we are going to go into why we may need to dummy code some of our variables.
In regression analysis, a prerequisite is that all input variables are at the interval scale level, i.e., that the distance between adjacent steps on the scale of the variable is the same. However, not everything we want to study can be measured on such a scale. For example, many categories and characteristics have no inherent ranking. If we are, for example, interested in the impact of different types of education on political attitudes, we cannot assume that a science education is twice as much as a social science education, or that a librarian education is half the education in biomedicine. The different types of education are simply different (although some aspects of them, such as their length, can be compared).
What if we think that education has an important effect that we want to take into account in our data analysis? This is one of the situations in which we need to use dummy variables. Read on to learn how to create dummy variables for categorical variables in R.
What is a Dummy Variable? (With Examples)
A dummy variable is a variable that indicates whether an observation has a particular characteristic. A dummy variable can only assume the values 0 and 1, where 0 indicates the absence of the property, and 1 indicates the presence of the same. The values 0/1 can be seen as no/yes or off/on. See the table below for some examples of dummy variables.
How Do You Create a Dummy Variable in R?
To create a dummy variable in R you can use the ifelse() function:
df$Male <- ifelse(df$sex == 'male', 1, 0)
df$Female <- ifelse(df$sex == 'female', 1, 0)
This code will create two new columns. In the column "Male" you will get the number 1 when the subject is male and 0 when the subject is female. For the column "Female", it will be the opposite (Female = 1, Male = 0).
|Variable|Dummy coding|
|---|---|
|Smoking|Smoker = 1, Non-smoker = 0|
|Location|North = 1, South = 0|
|Answer|Yes = 1, No = 0|
Now, let's jump directly into a simple example on how to make dummy variables in R. In the next two sections, we will learn dummy coding by using R's ifelse(), and fastDummies' dummy_cols().
How to Create Dummy Variables in R: ifelse() example
Here's how to create dummy variables in R using the ifelse() function:
1) Import Data
In the first step, import the data (e.g., from a CSV file):
dataf <- read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/carData/Salaries.csv')
In the code above, we need to make sure that the character string points to where our data is stored (e.g., our .csv file). For example, when loading a dataset from our hard drive, we need to provide the path to the file.
2) Create the Dummy Variables with the ifelse() Function
Next, start creating the dummy variables in R using the ifelse() function:
dataf$Disc_A <- ifelse(dataf$discipline == 'A', 1, 0)
dataf$Disc_B <- ifelse(dataf$discipline == 'B', 1, 0)
In the simple example above, we created the dummy variables using the ifelse() function. First, we read data from a CSV file (from the web). Second, we created two new columns. In the first column, we assigned the numerical value 1 if the cell value in the column discipline was 'A'. If not, we assigned the value 0. We did the same for 'B' when we created the second column. We can look at the first 5 rows of the dataframe with head(dataf, 5).
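The same steps can be sketched on a small hand-made data frame, which runs without a network connection. The object `toy` and its values are assumptions made up for illustration; they are not taken from the Salaries data.

```r
# Hypothetical toy data mimicking the 'discipline' column (values are made up)
toy <- data.frame(
  discipline = c("B", "B", "A", "A", "B", "A"),
  score      = c(10, 12, 7, 9, 11, 8)
)

# Same ifelse() pattern as above: 1 if the condition holds, 0 otherwise
toy$Disc_A <- ifelse(toy$discipline == "A", 1, 0)
toy$Disc_B <- ifelse(toy$discipline == "B", 1, 0)

# Show the first 5 rows to check the new columns
head(toy, 5)

# Sanity check: every row should belong to exactly one discipline
all(toy$Disc_A + toy$Disc_B == 1)
```

A check like the last line is a cheap way to catch misspelled category labels, since a typo in the comparison string would leave some rows with 0 in both columns.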
Now, data can be imported into R from other formats. If the data we want to dummy code in R is stored in Excel files, check out the post about how to read xlsx files in R. As we sometimes work with datasets with a lot of variables, using the ifelse() approach may not be the best way. For instance, we need one line of code for each level of each variable, which quickly makes the R code long and harder to read. In the next section, we will have a look at another approach for dummy coding categorical variables.
Create Dummy Variables in R with the fastDummies Package
In this section, we are going to use the fastDummies package to make dummy variables. Now, there are three simple steps for the creation of dummy variables with the dummy_cols function:
1) Install the fastDummies Package
First, we need to install the R package. Installing R packages can be done with the install.packages() function. So start up RStudio and type install.packages('fastDummies') in the console.
2) Load the fastDummies Package
Next, we are going to use the library() function to load the fastDummies package into R, i.e., library(fastDummies).
Now that we have installed and loaded the fastDummies package we will continue, in the next section, with dummy coding our variables.
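Steps 1 and 2 can be combined into a short sketch. The requireNamespace() guard is an addition here, not part of the steps above; it only skips the download when the package is already installed.

```r
# Install fastDummies from CRAN only if it is not already available
if (!requireNamespace("fastDummies", quietly = TRUE)) {
  install.packages("fastDummies")
}

# Attach the package so that dummy_cols() can be called directly
library(fastDummies)
```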
3) Make Dummy Variables in R
Finally, we are ready to use the dummy_cols() function to make the dummy variables:
dataf <- dummy_cols(dataf, select_columns = 'rank')
Now, the neat thing with using dummy_cols() is that we only need two lines of code. Furthermore, if we want to create dummy variables from more than one column, we'll save even more lines of code (see next subsection). Now that you're done creating dummy variables, you might want to extract time from datetime.
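To see what dummy_cols() adds, here is a minimal sketch on a hand-made data frame. The object `toy` and its values are hypothetical and only mimic the three levels of the rank column; the sketch assumes fastDummies is installed.

```r
library(fastDummies)

# Hypothetical toy data with a three-level 'rank' column (values are made up)
toy <- data.frame(
  rank = c("Prof", "AsstProf", "AssocProf", "Prof"),
  yrs  = c(19, 3, 8, 25)
)

# One 0/1 column per level of 'rank'; the original column is kept
toy_dummies <- dummy_cols(toy, select_columns = "rank")

# New columns are named <column>_<level>, e.g. rank_Prof
names(toy_dummies)
```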
How to Create Dummy Variables for More than One Column
In the previous section, we used the dummy_cols() function to make dummy variables from one column. It is, of course, possible to dummy code many columns using either the ifelse() function or the fastDummies package. However, if we have many categories in our variables, the ifelse() approach may require many lines of code. Thus, in this section we are going to start by adding one more column to the select_columns argument of the dummy_cols function.
dataf <- dummy_cols(dataf, select_columns = c('rank', 'discipline'))
As evident from the code example above, the select_columns argument can also take a vector of column names. This means that we can add as many columns as we need here. Running the above code will generate 5 new columns containing the dummy coded variables (three for the levels of rank and two for the levels of discipline). Note, you can use R to conditionally add a column to the dataframe based on other columns if you need to.
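The column count can be checked with a small sketch. It again uses a hypothetical hand-made data frame (`toy`) with three rank levels and two discipline levels, and assumes fastDummies is installed.

```r
library(fastDummies)

# Hypothetical toy data: 3 levels of rank, 2 levels of discipline
toy <- data.frame(
  rank       = c("Prof", "AsstProf", "AssocProf", "Prof"),
  discipline = c("A", "B", "B", "A")
)

toy_dummies <- dummy_cols(toy, select_columns = c("rank", "discipline"))

# Names of the columns that dummy_cols() added
new_cols <- setdiff(names(toy_dummies), names(toy))
new_cols

# 3 levels of rank + 2 levels of discipline
length(new_cols)
```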
Removing the Columns
In this section, we are going to use one more of the arguments of the dummy_cols() function: remove_selected_columns. This may be very useful if we, for instance, are going to make dummy variables of multiple variables and don't need the original columns for the data analysis later.
dataf.2 <- dummy_cols(dataf, select_columns = c('rank', 'discipline'), remove_selected_columns = TRUE)
Note, if we don't use the select_columns argument, dummy_cols will create dummy variables for all character and factor columns. This is especially useful if we want to automatically create dummy variables for all categorical predictors in the R dataframe. See the documentation for more information about the dummy_cols function. Finally, if we use the fastDummies package we can also create dummy rows with the dummy_rows function.
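Both behaviours can be combined in one call. The following is a minimal sketch on a hypothetical hand-made data frame (`toy`, with made-up values), assuming fastDummies is installed.

```r
library(fastDummies)

# Hypothetical toy data: two character columns and one numeric column
toy <- data.frame(
  rank       = c("Prof", "AsstProf", "AssocProf", "Prof"),
  discipline = c("A", "B", "B", "A"),
  score      = c(100, 60, 80, 120),
  stringsAsFactors = FALSE
)

# No select_columns: all character and factor columns are dummy coded.
# remove_selected_columns = TRUE drops the original 'rank' and 'discipline'.
toy_all <- dummy_cols(toy, remove_selected_columns = TRUE)

# The numeric column is kept as is, followed by the 0/1 columns
names(toy_all)
```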
It is, of course, possible to drop variables after we have done the dummy coding in R. For example, see the post about how to remove a column in R with dplyr for more about deleting columns from the dataframe. Now that you have created dummy variables, you can also go on and extract year from date.
Other Options for Dummy Coding in R
Now, before summarizing this R tutorial, it may be worth mentioning that there are other options to recode categorical data to dummy variables. For instance, we could have used the model.matrix() function, the dummies package, or the step_dummy() function from the recipes package.
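As an illustration of the first of these options, here is a minimal base R sketch with model.matrix(). The data frame `toy` is hypothetical, and the formula without an intercept is one choice among several, not the only way to call the function.

```r
# Hypothetical toy data; 'rank' is stored as a factor
toy <- data.frame(
  rank = c("Prof", "AsstProf", "AssocProf", "Prof"),
  stringsAsFactors = TRUE
)

# '0 +' drops the intercept so that every level of rank gets its own 0/1 column
mm <- model.matrix(~ 0 + rank, data = toy)

# Column names are the variable name followed by the level
colnames(mm)

# Each row has exactly one 1 across the three columns
rowSums(mm)
```

With the default formula `~ rank`, model.matrix() instead returns an intercept column plus one column for each level except the first, which is the coding a regression model typically uses.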
Finally, it may be worth mentioning that the recipes package is part of the tidymodels collection of packages (not the tidyverse). Thus, by installing tidymodels, you can do a lot more than just create dummy variables. Similarly, using the tibble package, which is part of the tidyverse, you can add an empty column to the R dataframe.
Summary and Conclusion
In this post, we have worked with 1) R's ifelse() function and 2) the fastDummies package to recode categorical variables to dummy variables in R. We learned that this is an easy task in R, especially when we use a package such as fastDummies and have a lot of variables to dummy code (or a lot of levels of the categorical variable). The next step in the data analysis pipeline may now be to analyze the data (e.g., regression or random forest modeling).
Now, there are of course other valuable resources to learn more about dummy variables (or indicator variables). Here are some articles and journal papers that you might find useful:
- Categorical Variables in Regression Analysis: A Comparison of Dummy and Effect Coding
- No More: Effect Coding as an Alternative to Dummy Coding With Implications for Higher Education Researchers
- Random Forests, Decision Trees, and Categorical Predictors: The “Absent Levels” Problem