In this R tutorial, we are going to learn how to create dummy variables in R. Creating dummy variables can be carried out in many ways. For example, we can write code using the ifelse() function, we can install the R package fastDummies, or we can work with other packages and functions (e.g., model.matrix). In this post, however, we are going to use the ifelse() function and the fastDummies package (i.e., the dummy_cols() function). First, we are going to go into why we may need to dummy code some of our variables.
In regression analysis, a prerequisite is that all input variables are at the interval scale level, i.e., that the distance between adjacent steps on the scale of the variable is the same. However, not everything we want to research can be measured on such a scale. For example, different types of categories and characteristics do not necessarily have an inherent ranking. If we are, for example, interested in the impact of different types of education on political attitudes, we cannot assume that a science education is twice as much as a social science education, or that a librarian education is half of an education in biomedicine. The different types of education are simply different (although some aspects of them can, after all, be compared, for example, their length).
What if we think that education has an important effect that we want to take into account in our data analysis? Well, these are some situations when we need to use dummy variables. Read on to learn how to create dummy variables for categorical variables in R.
What is a Dummy Variable? Give an Example
A dummy variable is a variable that indicates whether an observation has a particular characteristic. A dummy variable can only assume the values 0 and 1, where 0 indicates the absence of the property, and 1 indicates its presence. The values 0/1 can be seen as no/yes or off/on. See the table below for some examples of dummy variables.
|Variable|Coding|
|---|---|
|Smoking|Smoker = 1, Non-smoker = 0|
|Location|North = 1, South = 0|
|Answer|Yes = 1, No = 0|
How do You Create a Dummy Variable in R?
To create a dummy variable in R you can use the ifelse() function:
df$Male <- ifelse(df$sex == 'male', 1, 0)
df$Female <- ifelse(df$sex == 'female', 1, 0)
This code will create two new columns. In the column "Male" you will get the number 1 when the subject was male and 0 when the subject was female. For the column "Female", it will be the opposite (Female = 1, Male = 0).
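To see the ifelse() pattern above on actual values, here is a minimal sketch. It assumes a tiny made-up data frame; the column name sex matches the snippet, but the four rows are invented for illustration.

```r
# Toy data frame; the values are made up for illustration only
df <- data.frame(sex = c('male', 'female', 'female', 'male'))

# 1 where the condition holds, 0 otherwise
df$Male <- ifelse(df$sex == 'male', 1, 0)
df$Female <- ifelse(df$sex == 'female', 1, 0)

# Each row has exactly one of the two dummies set to 1
print(df)
```

Because the two columns carry the same information (Female is always 1 minus Male), a single one of them is usually enough for a two-level variable.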
Now, let's jump directly into a simple example on how to make dummy variables in R. In the next sections, we will learn dummy coding by using R's ifelse(), and fastDummies' dummy_cols().
How to Create Dummy Variables in R: ifelse() example
Here's how to create dummy variables in R using the ifelse() function:
1) Import Data
In the first step, import the data (e.g., from a CSV file):
dataf <- read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/carData/Salaries.csv')
In the code above, we need to make sure that the character string points to where our data is stored (e.g., our .csv file). For example, when loading a dataset from our hard drive, we need to make sure we add the path to this file.
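As a rough sketch of reading from a local path instead of a URL, the example below first writes a small CSV to a temporary folder and then reads it back. The file name salaries_demo.csv and the two rows are hypothetical; in practice the path would point to your own file.

```r
# Hypothetical local file: write a tiny CSV to a temporary location
csv_path <- file.path(tempdir(), 'salaries_demo.csv')
demo <- data.frame(rank = c('Prof', 'AsstProf'),
                   discipline = c('B', 'A'))
write.csv(demo, csv_path, row.names = FALSE)

# Read the file back using its path on disk
local_df <- read.csv(csv_path)
str(local_df)
```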
2) Create the Dummy Variables with the ifelse() Function
Next, start creating the dummy variables in R using the ifelse() function:
dataf$Disc_A <- ifelse(dataf$discipline == 'A', 1, 0)
dataf$Disc_B <- ifelse(dataf$discipline == 'B', 1, 0)
In the simple example above, we created the dummy variables using the ifelse() function. First, we read data from a CSV file (from the web). Second, we created two new columns. In the first column, we assigned the value 1 if the cell value in the column discipline was 'A'. If not, we assigned the value 0. We did the same, for 'B', when we created the second column. To look at the first 5 rows of the dataframe, we can run head(dataf, 5).
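When a column has more than a couple of levels, writing one ifelse() line per level gets repetitive. One possible way around this, sketched below, is to loop over the distinct values. The data frame dataf_demo and its values are made up for illustration and are not the Salaries data.

```r
# Toy stand-in for the discipline column (values are invented)
dataf_demo <- data.frame(discipline = c('A', 'B', 'B', 'A', 'B'))

# Create one 0/1 column per distinct value, named Disc_<value>
for (lvl in unique(dataf_demo$discipline)) {
  new_col <- paste0('Disc_', lvl)
  dataf_demo[[new_col]] <- ifelse(dataf_demo$discipline == lvl, 1, 0)
}

print(dataf_demo)
```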
Now, data can be imported into R from other formats. If the data we want to dummy code in R is stored in Excel files, check out the post about how to read xlsx files in R.
Create Dummy Variables in R with the fastDummies Package
In this section, we are going to use the fastDummies package to make dummy variables. Now, there are three simple steps for the creation of dummy variables with the dummy_cols function:
1) Install the fastDummies Package
First, we need to install the R package. Installing R packages can be done with the install.packages() function. So start up RStudio and type this in the console:
install.packages('fastDummies')
2) Load the fastDummies Package
Next, we are going to use the library() function to load the fastDummies package into R:
library(fastDummies)
Now that we have installed and loaded the fastDummies package we will continue, in the next section, with dummy coding our variables.
3) Make Dummy Variables
Finally, we are ready to use the dummy_cols() function to make the dummy variables:
dataf <- dummy_cols(dataf, select_columns = 'rank')
Now, the neat thing with using dummy_cols() is that we only need two lines of code: one to load the package and one to create the dummy variables. Furthermore, if we want to create dummy variables from more than one column, we'll save even more lines of code (see next subsection).
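To get a feel for what dummy_cols() returns, here is a small sketch on a toy data frame. It assumes the fastDummies package is installed; the data frame toy and its salary figures are made up for illustration.

```r
library(fastDummies)

# Toy data; the salary values are invented for illustration
toy <- data.frame(
  rank = c('Prof', 'AsstProf', 'AssocProf', 'Prof'),
  salary = c(120000, 80000, 95000, 130000)
)

# One new 0/1 column per distinct value of rank, named rank_<value>
toy_dummies <- dummy_cols(toy, select_columns = 'rank')

names(toy_dummies)
toy_dummies$rank_Prof
```

The original rank column is kept by default, so the result has the two original columns plus three new ones.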
How to Create Dummy Variables for More than One Column
In the previous section, we used the dummy_cols() function to make dummy variables from one column. It is, of course, possible to dummy code many columns using either the ifelse() function or the fastDummies package. However, if we have many categories in our variables, the ifelse() approach may require many lines of code. Thus, in this section we are going to add one more column to the select_columns argument of the dummy_cols function.
dataf <- dummy_cols(dataf, select_columns = c('rank', 'discipline'))
Now, as evident from the code example above, the select_columns argument can take a vector of column names as well. Of course, this means that we can add as many columns as we need here. Running the above code will generate 5 new columns containing the dummy coded variables (three for rank and two for discipline).
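To anticipate how many dummy columns a call like the one above will add, we can count the distinct values in each selected column. The sketch below uses a made-up data frame, toy, whose rows are invented but whose categories mirror those of rank and discipline.

```r
# Toy data with three ranks and two disciplines (rows are invented)
toy <- data.frame(
  rank = c('Prof', 'AsstProf', 'AssocProf', 'Prof'),
  discipline = c('B', 'A', 'B', 'A')
)

# Number of distinct values per selected column
levels_per_column <- sapply(toy[c('rank', 'discipline')],
                            function(x) length(unique(x)))

# One dummy column is created per distinct value
n_new_columns <- sum(levels_per_column)

print(levels_per_column)
print(n_new_columns)
```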
Removing the Columns
In this section, we are going to use one more of the arguments of the dummy_cols() function: remove_selected_columns. This may be very useful if we, for instance, are going to make dummy variables of multiple variables and don't need the original columns for the data analysis later.
dataf.2 <- dummy_cols(dataf, select_columns = c('rank', 'discipline'), remove_selected_columns = TRUE)
Note, if we don't use the select_columns argument, dummy_cols will create dummy variables for all character and factor columns. This is especially useful if we want to automatically create dummy variables for all categorical predictors in the R dataframe. See the documentation for more information about the dummy_cols function. Finally, the fastDummies package also includes the dummy_rows function, which adds rows for combinations of categories that are missing from the data.
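The two behaviours described above can be combined. The following sketch, again assuming fastDummies is installed and using a made-up data frame toy, leaves out select_columns and drops the original categorical columns.

```r
library(fastDummies)

# Toy data; rank and discipline are character columns, salary is numeric
toy <- data.frame(
  rank = c('Prof', 'AsstProf', 'AssocProf', 'Prof'),
  discipline = c('B', 'A', 'B', 'A'),
  salary = c(120000, 80000, 95000, 130000)
)

# No select_columns: every character/factor column is dummy coded.
# remove_selected_columns = TRUE drops rank and discipline afterwards.
toy_all <- dummy_cols(toy, remove_selected_columns = TRUE)

names(toy_all)
```

The numeric salary column is left untouched, so the result contains salary plus the five dummy columns.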
It is, of course, possible to drop variables after we have done the dummy coding in R. For example, see the post about how to remove a column in R with dplyr for more about deleting columns from the dataframe.
Other Options for Dummy Coding in R
Now, before summarizing this R tutorial, it may be worth mentioning that there are other options to recode categorical data to dummy variables. For instance, we could have used the model.matrix function, the dummies package, or the step_dummy function from the recipes package.
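As a brief illustration of the base R option, here is a sketch using model.matrix() on a made-up factor. The data frame toy is hypothetical; the formulas show two common choices, one column per level or an intercept plus one column per non-reference level.

```r
# Toy factor column (values are invented); levels sort alphabetically
toy <- data.frame(rank = factor(c('Prof', 'AsstProf', 'AssocProf', 'Prof')))

# "- 1" removes the intercept, giving one 0/1 column per level
mm_full <- model.matrix(~ rank - 1, data = toy)

# With the intercept, the first level becomes the reference category
mm_ref <- model.matrix(~ rank, data = toy)

colnames(mm_full)
colnames(mm_ref)
```

Note that model.matrix() returns a matrix rather than a dataframe, so it may need to be converted (e.g., with as.data.frame()) before being combined with the original data.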
Summary and Conclusion
In this post, we have used 1) R's ifelse() function and 2) the fastDummies package to recode categorical variables to dummy variables. In fact, we learned that this is an easy task with R, especially when we use a package such as fastDummies and have a lot of variables to dummy code (or a lot of levels of the categorical variable). The next step in the data analysis pipeline may now be to analyze the data (e.g., regression or random forest modeling).
Now, there are of course other valuable resources to learn more about dummy variables (or indicator variables). In this section, you will find some articles and journal papers that you might find useful:
- Categorical Variables in Regression Analysis: A Comparison of Dummy and Effect Coding
- No More: Effect Coding as an Alternative to Dummy Coding With Implications for Higher Education Researchers
- Random Forests, Decision Trees, and Categorical Predictors: The “Absent Levels” Problem