The post How to Concatenate Two Columns (or More) in R – stringr, tidyr appeared first on Erik Marsja.

In this guide you will learn how to concatenate two columns in R. In fact, you will learn how to merge multiple columns in R using base R (e.g., using the `paste()` function) and the Tidyverse (e.g., using `str_c()` and `unite()`). In the final section of this post, you will learn which function is the best to use when combining columns.

If you have some experience using dataframe (or in this case tibble) objects in R and you’re ready to learn how to combine data found in them, then this tutorial will help you do precisely that.

Knowing how to do this may prove useful when you have a dataframe with information spread across two columns and you want to combine them into one using R. For example, you might have one column containing first names and another containing last names. In this case, you may want to concatenate these two columns into a single column called, e.g., Names.

You can follow along with the examples in this tutorial using the interactive Jupyter Notebook found towards the end of the tutorial. Here’s the example data that we use to learn how to combine two or more columns into one variable.

In this post, you will learn, by example, how to concatenate two columns in R. As you will see, we will use R’s $ operator to select the columns we want to combine. The outline of the post is as follows. First, you will learn what you need to have to follow the tutorial. Second, you will get a quick answer on how to merge two columns. After this, you will learn a couple of examples using 1) `paste()`, 2) `str_c()`, and 3) `unite()`. In the final section of this concatenating-in-R tutorial, you will learn which method I prefer and why. That is, you will get my opinion on why I like the `unite()` function. In the next section, you will learn about the requirements of this post.

If you prefer to use base R, you don’t need more than a working R installation. However, if you are going to use either `str_c()` or `unite()`, you need to have at least one of the packages stringr or tidyr. It is worth pointing out here that both of these packages are part of the Tidyverse package. This package contains multiple useful R packages that can be used for reading data, visualizing data (e.g., scatter plots with ggplot2), extracting the year from a date in R, adding new columns, among other things. Installing an R package is simple; here’s how you install Tidyverse:

`install.packages("tidyverse")`


Note, if you want to install only stringr or tidyr, just exchange “tidyverse” for, e.g., “stringr”. In the next section, you will get a quick answer, without any details, on how to concatenate two columns in R.

To concatenate two columns you can use the `paste()` function. For example, if you want to combine the two columns *A* and *B* in the dataframe *df*, you can use the following code: `df["AB"] <- paste(df$A, df$B)`. Note, however, that using `paste()` will result in whitespace between the values in the new column.

Before we have a more detailed look at how to use `paste()` to combine two columns, we are going to load an example dataset.

Here’s how to read a .xlsx file in R using the readxl package:

```
# Importing Example Data:
library('readxl')
dataf <- read_excel("combine_columns_in_R.xlsx")
```


Now, we can have a look at the structure of the imported data using the `str()` function:

We will also have a quick look at the first rows using the `head()` function:

Now, in the output above we can see that there are 5 variables and 7 observations. That is, there are 5 columns and 7 rows in the tibble. Moreover, we can see the types of the variables and, of course, the column names. In the next section, we are going to start by concatenating the Month and Year columns using the `paste()` function.

Here’s one of the simplest ways to combine two columns in R, using the `paste()` function:

`dataf$MY <- paste(dataf$Month, dataf$Year)`

In the code above, we used $ in R both to create a new column and to select the two columns we wanted to combine into one. Here’s the tibble with the new column, named *MY*:

In the next example, we will merge two columns and add a hyphen (“-”) as well. For more useful operators, and how to use them, see for example the post "How to use %in% in R: 7 Example Uses of the Operator".

Now, to add “-” (hyphen) between the values we want to combine, we add a third argument to the `paste()` function:

`dataf$MY <- paste(dataf$Month, "-", dataf$Year)`


In the code example above, we pasted the hyphen in as a third value. As you can see in the image below, we still have whitespaces between the values (i.e., between “Month”, the hyphen, and “Year”).

Now, using R’s `paste()` function we can add another parameter: the sep parameter. Here’s a code example combining the two columns, adding the “-” without the whitespaces:

`dataf$MY <- paste(dataf$Month, dataf$Year, sep= "-")`


Notice that, instead of pasting the hyphen, we used it as a separator. Before moving on to the next example, it is worth pointing out that if we don’t want any whitespace at all, we can use the `paste0()` function instead. This way, we don’t need the sep parameter. In the next example, we are going to have a look at how to combine multiple columns (i.e., three or more) in R.
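As a quick sketch of `paste0()`, using a small, hypothetical stand-in data frame (the tutorial’s Excel file is not included here):

```r
# Hypothetical stand-in columns:
dataf <- data.frame(Month = c("Jan", "Feb"), Year = c(2020, 2021))
# paste0() concatenates with no separator at all:
dataf$MY <- paste0(dataf$Month, dataf$Year)
dataf$MY  # "Jan2020" "Feb2021"
```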

As you may have understood, combining more than two columns is as simple as adding another argument to the `paste()` function. Here’s how we combine three columns in R:

`dataf$DMY <- paste(dataf$Date, dataf$Month, dataf$Year)`

That was also pretty simple. It is worth mentioning that if you use the sep parameter in a case like the one above, you will end up with whatever character you chose between each value from each column. For example, if we were to add the sep argument to the code above and use an underscore (“_”) as the separator, here’s how the resulting tibble would look:
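The call itself would look like the following (sketched here with hypothetical stand-in columns):

```r
# Hypothetical stand-in columns:
dataf <- data.frame(Date = c(1, 2), Month = c("Jan", "Feb"),
                    Year = c(2020, 2021))
# An underscore ends up between each value from each column:
dataf$DMY <- paste(dataf$Date, dataf$Month, dataf$Year, sep = "_")
dataf$DMY  # "1_Jan_2020" "2_Feb_2021"
```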

Now, you may understand that using the sep parameter enables you to use almost any character to separate your combined values. In the next section, we will have a look at the str_c() function from the stringr package.

Combining two columns with the str_c() function is super simple. Here’s how to merge the columns “Snake” and “Size” using the str_c() function:

```
library(stringr)
dataf$SnakeNSize <- str_c(dataf$Snake, " ", dataf$Size)
```


Notice that we added something in between the two columns we wanted to concatenate? When working with this function, we need to do this, or else we end up with nothing separating the two values we are combining. As previously mentioned, the stringr package is part of the Tidyverse, which also includes packages such as tidyr and its unite() function. In the next section, we are going to merge two columns in R using the unite() function as well.
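Worth knowing: `str_c()` also accepts a sep argument, just like `paste()`, so the separator does not have to be pasted in as a value. A sketch with hypothetical stand-in columns:

```r
library(stringr)

# Hypothetical stand-in columns:
dataf <- data.frame(Snake = c("Adder", "Boa"), Size = c("Small", "Large"))
# sep inserts the separator between each pair of values:
dataf$SnakeNSize <- str_c(dataf$Snake, dataf$Size, sep = " ")
dataf$SnakeNSize  # "Adder Small" "Boa Large"
```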

Here’s how we concatenate two, or more, columns using the unite() function:

```
library(tidyverse) # or library(tidyr)
dataf <- dataf %>%
  unite("DM", Date:Month)
```


Notice something in the code above. First, we used a new operator: the pipe (%>%). Among other things, this enables us to use unite() without selecting the columns with the $ operator. As you can see in the code example above, we used two arguments. First, we named the new column we want to add (“DM”); second, we selected all the columns from “Date” to “Month” and combined them into the new column. Here’s the resulting dataframe/tibble:

Now, as you can see in the image above, both columns that we combined have disappeared. If we want to keep the original columns after we have concatenated them we can set the remove parameter to FALSE. Here’s a code chunk that you can use, instead, to not remove the columns:

```
dataf <- dataf %>%
  unite("DM", Date:Month, remove = FALSE)
```


Finally, did you notice how we have an underscore as a separator? If we want to change to another separator we can use the sep parameter. This is exactly what we will do in the next example:

Here’s how we use the unite() function together with the sep parameter to change the separator to “-” (hyphen):

```
dataf <- dataf %>%
  unite("DM", Date:Month, sep = "-",
        remove = FALSE)
```


That was as simple as the previous example, right? In the next section, you will learn which function I prefer to use and why.

Naturally, this section will contain my opinion. I have not done any performance testing (e.g., I don’t know which function is the fastest when it comes to combining columns in R). That said, although all of the functions used in this post are simple to use, I prefer the unite() function. Why? Well, together with the pipe operator, I think it makes the code very readable. It is also very handy to use unite() if you are going to concatenate multiple columns in R. As you may have noticed in the examples above, we can use “:” when combining columns. This means that we can merge multiple columns from the first column (i.e., left of the “:”) to the last column (i.e., right of the “:”). This is pretty neat and will definitely save some space in your code and make it easier to read!

Another neat thing is that we add the new column name as a parameter and we automatically get rid of the combined columns (if we don’t need them later, of course). Finally, we can also set the na.rm parameter to TRUE if we want missing values to be removed before combining values. Here's a Jupyter Notebook with all the code in this post.
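A minimal sketch of the na.rm parameter, assuming tidyr >= 1.0 and a small hypothetical data frame with a missing value:

```r
library(tidyr)

# Hypothetical stand-in data with a missing value:
dataf <- data.frame(Date = c(1, NA), Month = c("Jan", "Feb"))
# With na.rm = TRUE the NA is dropped before the values are combined:
dataf <- unite(dataf, "DM", Date:Month, na.rm = TRUE)
dataf$DM  # "1_Jan" "Feb"
```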

In this post, you have learned how to concatenate two (or more) columns in R using three different functions. First, we used the paste() function from base R. Using this function, we combined two and three columns and changed the separator from whitespace to a hyphen (“-”). Second, we used the str_c() function to merge columns. Third, we used the unite() function. Of course, it is possible (we saw some examples of that) to change the separator using the last two functions as well. To conclude, the unite() function seems to be the handiest function to use to concatenate columns in R.

Hope you learned something! If you did, please leave a comment below, share on your social media, include a link to the post in your projects (e.g., blog posts, articles, reports), or support me on Patreon:

Finally, if you have any suggestions, other comments, or there is something you wish me to cover: don’t hesitate to contact me.

- How to Calculate Five-Number Summary Statistics in R
- Learn How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Rename Column (or Columns) in R with dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Add an Empty Column to a Dataframe in R (with tibble)

The post How to Calculate Five-Number Summary Statistics in R appeared first on Erik Marsja.

In this short tutorial, you will learn how to find the five-number summary statistics in R. Specifically, in this post we will calculate:

- Minimum
- Lower-hinge
- Median
- Upper-hinge
- Maximum

Now, we will also visualize the five-number summary statistics using a boxplot. First, we will learn how to calculate each of the five summary statistics and then how we can use one single function to get all of them directly.

To follow this R tutorial you will need to have readxl and ggplot2 installed. The easiest way to install these R packages is to use the `install.packages()` function:

`install.packages(c("readxl", "ggplot2"))`


Note, both these two packages are part of the Tidyverse. This means that you get them, as well as a lot of other packages when installing Tidyverse. For example, you can use packages such as dplyr to rename columns, remove columns in R, and select columns, as well.

Before getting to the 6 steps to finding the five-number summary statistics using R, however, we will get the answers to some common questions.

As you may have understood, the five-number summary statistics are 1) the minimum, 2) the lower-hinge, 3) the median, 4) the upper-hinge, and 5) the maximum. The five-number summary is a quick way to explore your dataset.

The absolutely easiest way to find the five-number summary statistics in R is to use the `fivenum()` function. For example, if you have a vector of numbers called “A”, you can run `fivenum(A)` to get the five-number summary.

Now that we know what the five-number summary is we can go on and learn the simple steps to calculate the 5 summary statistics.

In this section, we are ready to go through the 6 simple steps to calculate the five-number statistics using the R statistical environment. To recap: the first step is to import the dataset (e.g., from an xlsx file). Second, we calculate the min value, and then, in the third step, get the lower-hinge. In the fourth step, we get the median. In the fifth step we get the upper-hinge and, then, in the sixth, and final step, we get the max value.

Here’s how to read a .xlsx file in R using the readxl package:

```
library(readxl)
dataf <- read_excel("play_data.xlsx", sheet = "play_data",
                    col_types = c("skip", "numeric",
                                  "text", "text", "numeric",
                                  "numeric", "numeric"))
head(dataf)
```


We can see that in this example dataset there’s only one column containing numerical data (i.e., the column RT). In the next step, we will take the minimum of this column.

Here’s how to get the minimum value in a column in R:

```
# Minimum:
min.rt <- min(dataf$RT, na.rm = TRUE)
min.rt
```


Notice how we used the `min()` function with the dataframe and the column (i.e., RT) as the first argument. The second argument, na.rm, we set to TRUE because we have some missing values in the column. Finally, we used the $ operator in R to select the column. If we, on the other hand, were using dplyr, we could use the select() function. That said, let’s move on and get the lower-hinge.

Here’s how we get the lower-hinge:

```
# Lower Hinge:
RT <- sort(dataf$RT)
lower.rt <- RT[1:round(length(RT)/2)]
lower.h.rt <- median(lower.rt)
```


Notice how we started by selecting only the response times (i.e., the RT column) and sorting the values. Second, we took the lower half of the response times and, then, got the lower-hinge by calculating the median of this vector.

To calculate the median we can use the `median()` function:

```
# Median
median.rt <- median(dataf$RT, na.rm = TRUE)
```


Again, we used the `na.rm` argument (set to `TRUE`) because there are some missing values in the dataset. Of course, if your data doesn’t have any missing values you can leave this argument out.

Here’s how to get the upper-hinge:

```
# Upper Hinge
RT <- sort(dataf$RT)
upper.rt <- RT[round((length(RT)/2)+1):length(RT)]
upper.h.rt <- median(upper.rt)
```


Similar to when we got the lower-hinge, we first sorted the RT column. Then, we took the upper half and calculated the median of it.

We can get the maximum by using the `max()` function:

```
# Max
max.rt <- max(dataf$RT, na.rm = TRUE)
```


Again, we selected the RT column using the dollar sign operator and removed the missing values. Here’s the output:

Note that the lower- and upper-hinge are the same as the first and third quartile when the sample size is odd. If this is the case, an easier way to get the lower- and upper-hinge is to use the `quantile()` function. In the example data above, however, we had an even number of observations (leaving out the missing values). If you need to combine two variables in your dataset into one, make sure to check this post out:
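For example, with a hypothetical vector of response times, getting the first and third quartiles with `quantile()` looks like this:

```r
# Hypothetical response times, including a missing value:
rt <- c(0.43, 0.47, 0.51, 0.55, 0.62, NA)
# First and third quartiles; na.rm = TRUE drops the missing value:
quantile(rt, probs = c(0.25, 0.75), na.rm = TRUE)
#   25%  75%
#  0.47 0.55
```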

In this section, we are going to put everything together so we get a somewhat nicer output:

```
fivenumber <- cbind(min.rt, lower.h.rt,
                    median.rt, upper.h.rt,
                    max.rt)
colnames(fivenumber) <- c("Min", "Lower-hinge",
                          "Median", "Upper-hinge", "Max")
fivenumber
```


As you can see in the code chunk above, we used the `cbind()` function to combine the different objects into one. Then, we gave the combined object better column names. In the next section, we are going to see that there is already a function that can calculate the five-number statistics in R in, basically, one line of code.

Here’s how to find the five-number summary statistics in R with the `fivenum()` function:

```
# Five summary with R's fivenum()
fivenum(dataf$RT)
```


Pretty simple. We just selected the column containing our data, again using the $ operator to get the RT column, and applied the `fivenum()` function to it. Note that the `fivenum()` function removes any missing values by default.

As you can see in the output above, we don’t get any column names but the five-number summary statistics are ordered as follows: min, lower-hinge, median, upper-hinge, and max. We can see that we get the same values as in the 6 step method:

In the next section, we are going to create a boxplot displaying the five-number summary statistics in R.

Here’s how we can visualize Tukey’s 5 number summary statistics in R using a boxplot:

```
library(ggplot2)
df <- data.frame(
  x = 1,
  ymin = fivenumber[1],
  Lower = fivenumber[2],
  Median = fivenumber[3],
  Upper = fivenumber[4],
  ymax = fivenumber[5]
)
ggplot(df, aes(x)) +
  geom_boxplot(aes(ymin = ymin, lower = Lower,
                   middle = Median, upper = Upper, ymax = ymax),
               stat = "identity") +
  scale_y_continuous(breaks = seq(0.2, 0.8, 0.05)) +
  # Style the plot a bit
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  # Everything after this just annotates the plot and can be removed
  # Min
  geom_segment(aes(x = 1, y = ymin, xend = 0.95, yend = ymin), data = df) +
  annotate("text", x = 0.93, y = df$ymin, label = "Min") +
  # Lower-hinge
  geom_segment(aes(x = 0.60, y = Lower, xend = 0.60, yend = Lower - 0.05), data = df) +
  annotate("text", x = 0.60, y = df$Lower - 0.06, label = "Lower-hinge") +
  # Median
  annotate("text", x = 1, y = df$Median + 0.012, label = "Median") +
  # Upper-hinge
  geom_segment(aes(x = 1.40, y = Upper, xend = 1.40, yend = Upper + 0.05), data = df) +
  annotate("text", x = 1.40, y = df$Upper + 0.06, label = "Upper-hinge") +
  # Max
  geom_segment(aes(x = 1, y = ymax, xend = 1.05, yend = ymax), data = df) +
  annotate("text", x = 1.07, y = df$ymax, label = "Max")
```


We are not getting into details in the example above. However, we did create a dataframe from the first object we created and then used `ggplot()` and `geom_boxplot()` to create the boxplot. Notice how we used the `aes()` function and set the different values found in the dataframe as arguments. Here ymin and ymax are the minimum and maximum values, respectively. Note that we also changed the number of ticks on the y-axis; here we used the seq() function to generate a sequence of numbers. The plot is somewhat styled, and the code for drawing segments (lines) and adding text can, of course, be skipped if you just want to visualize the five summary statistics in R.

More data visualization tutorials:

In this post, you have learned 2 ways to get the five summary statistics in R: 1) min, 2) lower-hinge, 3) median, 4) upper-hinge, and 5) max. In the first method, we calculated each of these summary statistics separately. Furthermore, we have also learned how to use the handy fivenum() function to get the same values. In the final section, we created a boxplot from the five summary statistics. Hope you have learned something valuable. If you did, please link to the blog post in your projects and reports, share on your social media accounts, and/or drop a comment below.

Here are some other tutorials that you may find useful:

- How to Take Absolute Value in R – vector, matrix, & data frame
- Learn How to Calculate Descriptive Statistics in R the Easy Way with dplyr
- How to Extract Year from Date in R with Examples
- Get the Absolute Value in R – from a vector, a matrix, & a data frame
- How to Rename Factor Levels in R using levels() and dplyr
- Learn How to Remove Duplicates in R – Rows and Columns (dplyr)
- How to Add a Column to a Dataframe in R with tibble & dplyr

The post How to Make a Violin plot in Python using Matplotlib and Seaborn appeared first on Erik Marsja.

In this Python data visualization tutorial, we are going to learn how to create a violin plot using Matplotlib and Seaborn. Now, there are several techniques for visualizing data (see the post 9 Data Visualization Techniques You Should Learn in Python for some examples) that we can carry out. Violin plots combine both the box plot and the histogram. In the next section, you will get a brief overview of the content of this blog post.

Before we get into the details on how to create a violin plot in Python we will have a look at what is needed to follow this Python data visualization tutorial. When we have what we need, we will answer a couple of questions (e.g., learn what a violin plot is). In the following sections, we will get into the practical parts. That is, we will learn how to use 1) Matplotlib and 2) Seaborn to create a violin plot in Python.

First of all, you need to have Python 3 installed to follow this post. Second, to use both Matplotlib and Seaborn you need to install these two excellent Python packages. You can install Python packages using both pip and conda; the latter if you have the Anaconda (or Miniconda) Python distribution. Note that Seaborn requires Matplotlib, so if you, for example, want to try both packages to create violin plots in Python, you can type `pip install seaborn`. This will install Seaborn and Matplotlib along with other dependencies (e.g., NumPy and SciPy). Oh, and we are also going to read the example data using Pandas. Pandas can, of course, also be installed using pip.

As previously mentioned, a violin plot is a data visualization technique that combines a box plot and a histogram. This type of plot will therefore show us the distribution, median, and interquartile range (IQR) of the data. Specifically, the IQR and median are the statistical information shown in the box plot, whereas the distribution is displayed by the histogram.

A violin plot shows numerical data. Specifically, it reveals the distribution shape and summary statistics of the numerical data. It can be used to explore data across different groups or variables in our datasets.

In this post, we are going to work with a fake dataset. This dataset can be downloaded here and is data from a Flanker task created with OpenSesame. Of course, the experiment was never actually run to collect the current data. Here’s how we read a CSV file with Pandas:

```
import pandas as pd
data = 'https://raw.githubusercontent.com/marsja/jupyter/master/flanks.csv'
df = pd.read_csv(data, index_col=0)
df.head()
```


Now, we can calculate descriptive statistics in Python using Pandas `describe()`:

`df.loc[:, 'TrialType':'ACC'].groupby(by='TrialType').describe()`


Now, in the code above we used loc to slice the Pandas dataframe. This is because we did not want to calculate summary statistics on the SubID column. Furthermore, we used Pandas groupby to group the data by condition (i.e., “TrialType”). Now that we have some data, we will continue exploring it by creating a violin plot using 1) Matplotlib and 2) Seaborn.

Here’s how to create a violin plot with the Python package Matplotlib:

```
import matplotlib.pyplot as plt
plt.violinplot(df['RT'])
```


In the code above, we used the `violinplot()` method with the dataframe column as the only parameter. Furthermore, we selected only the response times (i.e., the “RT” column) using brackets. Now, as we know, there are two conditions in the dataset and, therefore, we should create one violin plot for each condition. In the next example, we are going to subset the data and create violin plots, using Matplotlib, for each condition.

One way to create a violin plot for the different conditions (grouped) is to subset the data:

```
# Subsetting using Pandas query():
congruent = df.query('TrialType == "congruent"')['RT']
incongruent = df.query('TrialType == "incongruent"')['RT']
fig, ax = plt.subplots()
inc = ax.violinplot(incongruent)
con = ax.violinplot(congruent)
fig.tight_layout()
```


Now we can see that there is some overlap in the distributions, but they seem a bit different. Furthermore, we can see that the IQRs are a bit different; especially the tops. However, we don’t really know which color represents which condition. From the descriptive statistics earlier, we can assume that the blue one is incongruent; we also know this because it is the first one we created.

We can make this plot easier to read by using some more methods. In the next code chunk, we are going to create a list of the data and then add ticks labels to the plot as well as set (two) ticks to the plot.

```
# Combine data
plot_data = list([incongruent, congruent])
fig, ax = plt.subplots()
xticklabels = ['Incongruent', 'Congruent']
ax.set_xticks([1, 2])
ax.set_xticklabels(xticklabels)
ax.violinplot(plot_data)
```


Notice how we now get the violin plots side by side instead. In the next example, we are going to add the median to the plot using the `showmedians` parameter.

Here’s how we can show the median in the violin plots we create with the Python library matplotlib:

```
fig, ax = plt.subplots()
xticklabels = ['Incongruent', 'Congruent']
ax.set_xticks([1, 2])
ax.set_xticklabels(xticklabels)
ax.violinplot(plot_data, showmedians=True)
```


In the next section, we will start working with Seaborn to create a violin plot in Python. This package is built as a wrapper to Matplotlib and is a bit easier to work with. First, we will start by creating a simple violin plot (the same as the first example using Matplotlib). Second, we will create grouped violin plots, as well.

Here’s how we can create a violin plot in Python using Seaborn:

```
import seaborn as sns
sns.violinplot(y='RT', data=df)
```


In the code chunk above, we imported Seaborn as sns. This enables us to use a range of methods and, in this case, we created a violin plot with Seaborn. Notice how we set the y parameter to the dependent variable and the data parameter to our Pandas dataframe.

Again, we know that there are two conditions and, therefore, in the next example we will use the `x` parameter to create violin plots for each group (i.e., condition).

To create a grouped violin plot in Python with Seaborn we can use the `x` parameter:

```
sns.violinplot(y='RT', x="TrialType",
data=df)
```


Now, this violin plot is easier to read compared to the one we created using Matplotlib. We get a violin plot for each group/condition, side by side, with axis labels; all this by using a single Python method! If we have further categories, we can also use the `split` parameter to get KDEs for each category split. Let’s see how we do that in the next section.

Here’s how we can use the `split` parameter, set to `True`, to get a KDE for each level of a category:

```
sns.violinplot(y='RT', x="TrialType", split=True, hue='ACC',
data=df)
```


In the next and final example, we are going to create a horizontal violin plot in Python with Seaborn and the `orient` parameter.

Here’s how we use the `orient` parameter to get a horizontal violin plot with Seaborn:

```
sns.violinplot(y='TrialType', x="RT", orient='h',
data=df)
```


Notice how we also flipped the `y` and `x` parameters. That is, we now have the dependent variable (“RT”) as the `x` parameter. If we want to save a plot, whether created with Matplotlib or Seaborn, we might want to, e.g., change the Seaborn plot size and add or change the title and labels. Here’s a code example customizing a Seaborn violin plot:

```
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.gcf()
# Change the Seaborn plot size
fig.set_size_inches(10, 8)
# Increase the font size
sns.set(font_scale=1.5)
# Create the violin plot
sns.violinplot(y='RT', x='TrialType',
               data=df)
# Change the axis labels:
plt.xlabel('Condition')
plt.ylabel('Response Time (MSec)')
plt.title('Violin Plot Created in Python')
```


In the above code chunk, we have a fully working example creating a violin plot in Python using Seaborn and Matplotlib. We start by importing the needed packages. After that, we get the current figure with plt.gcf(). In the next code lines, we change the size of 1) the plot and 2) the font. Then, we create the violin plot and change the x- and y-axis labels. Finally, the title is added to the plot.

For more data visualization tutorials:

- How to Plot a Histogram with Pandas in 3 Simple Steps
- 9 Python Data Visualization Examples (Video)
- How to Make a Scatter Plot in Python using Seaborn
- Seaborn Line Plots: A Detailed Guide with Examples (Multiple Lines)

In this post, you have learned how to make a violin plot in Python using the packages Matplotlib and Seaborn. First, you learned a bit about what a violin plot is and, then, how to create both single and grouped violin plots in Python with 1) Matplotlib and 2) Seaborn.

The post How to use $ in R: 6 Examples – list & dataframe (dollar sign operator) appeared first on Erik Marsja.

In this very short tutorial, you will learn by example how to use the operator $ in R. First, we will learn what the $ operator does by getting the answers to some frequently asked questions. Second, we will work with a list that we create, and use the dollar sign operator both to select and to add a variable. Here you will also learn about the downsides of using $ in R, as well as the alternatives that you can use. In the following section, we will also work with a dataframe. Both sections will involve creating the list and the dataframe.

To follow this post you need a working installation of the R statistical environment, of course. If you want to read the example Excel file you will also need the readxl package.

The $ operator can be used to select a variable/column, to assign new values to a variable/column, or to add a new variable/column in an R object. This R operator can be used on e.g. lists, and dataframes. For example, if we want to print the values in the column “A” in the dataframe called “dollar” we can use the following code: `print(dollar$A)`


First of all, using brackets enables us to e.g. select multiple columns, whereas the $ operator only enables us to select one column.
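As a minimal sketch of this difference (the list `info` below is made up for illustration):

```r
# A small example list, made up for illustration
info <- list(A = c(1, 2, 3), B = c("x", "y", "z"))

# The $ operator selects exactly one element by its literal name
info$A

# Double brackets also select one element, but the name may be stored in a variable
col <- "A"
info[[col]]

# Single brackets accept several names and return a (sub)list
info[c("A", "B")]
```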

Before we go on to the next section, we will create a list using the list() function.

```
dollar <- list(A = rep('A', 5), B = rep('B', 5),
               'Life Expectancy' = c(10, 9, 8, 10, 2))
```

In the next section, we will, then, work with the $ operator to 1) add a new variable to the list, and 2) print a variable in the list. In the third example, we will learn how to use $ in R to select a variable whose name contains white spaces.

Here we will start learning, by examples, how to work with the $ operator in R. First, however, we will create a list.

Here’s how to use $ in R to add a new variable to a list:

`dollar$Sequence <- seq(1, 5)`


Notice how we used the name of the list, then the $ operator, and the assignment ("<-") operator. On the right side of <- we used the seq() function to generate a sequence of numbers in R. This sequence of numbers was added to the list. Here's our example list with the new variable:

In the next example, we will use the $ operator to print the values of the new variable that we added.

Here’s how we can use $ in R to select a variable in a list:

`dollar$Sequence`

Again, we used the list name, and the $ operator to print the new column we previously added:

Note that if you want to select two, or more, columns you have to use single brackets and put in each column name as a character (e.g., `dollar[c('A', 'B')]`). Another option to select columns is, of course, using the `select()` function from the excellent package dplyr.

You might also be interested in: How to use %in% in R: 7 Example Uses of the Operator

Here’s how we can print, or select, a variable with white space in the name:

`` dollar$`Life Expectancy` ``

Notice how we used backticks (`) in the code above. This way, we can select, or add values, even though the variable name contains white space. I would, however, suggest that you rename the column (or replace the white spaces). See the recent post to learn how to rename columns in R. Again, using brackets, in this case, would be the same as when the variable name does not contain white spaces.
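A quick sketch of that suggestion, replacing the white space in the name so that backticks are no longer needed (the list is made up here for illustration):

```r
# Example list with a name containing white space
dollar <- list('Life Expectancy' = c(10, 9, 8, 10, 2))

# With backticks we can select the element despite the space
dollar$`Life Expectancy`

# Replacing the space in the name removes the need for backticks
names(dollar) <- gsub(' ', '_', names(dollar))
dollar$Life_Expectancy
```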

In the next section, we will use the same examples above but on a dataframe. First, however, we will read an .xlsx file in R using the readxl package.

```
library(readxl)
dataf <- read_excel('example_sheets.xlsx',
                    skip = 2)
```

Note that we used the skip argument to skip the first two rows. In the example data (download here), the column names are on the third row. We can print the first six rows of the dataframe using the `head()`

function:

Here we can see that there are 5 columns. In the next section, we will use the $ operator on this dataframe.

In the first example, we will add a new column to the dataframe. After this, we will select the new column and print it using the $ operator. Finally, we will also add a new example on how to use this operator: to remove a column.

Here’s how we can use $ to add a new column in R:

`dataf$NewData <- rep('A', length(dataf$ID))`


Notice how we used R’s rep() function to generate a vector containing the letter ‘A’. It is important that we generate a vector of the same length as the number of rows in our dataframe. Therefore, we used the length() function as the second argument.
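An equivalent, and arguably clearer, way to match the number of rows is nrow(); here is a small sketch with a made-up dataframe standing in for the one read from Excel:

```r
# A small made-up dataframe standing in for the example data
dataf <- data.frame(ID = 1:4)

# nrow() gives the number of rows directly, so this matches
# rep('A', length(dataf$ID)) from the example above
dataf$NewData <- rep('A', nrow(dataf))
```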

Now, if you want to learn easier ways to add a column in R check the following posts:

- How to Add a Column to a Dataframe in R with tibble & dplyr
- R: Add a Column to Dataframe Based on Other Columns with dplyr
- How to Add an Empty Column to a Dataframe in R (with tibble)

In the next example, we are going to select this column using the $ operator and print it.

Here’s how we select and print the values in the column we created:

`dataf$NewData`

Notice, to select, and print the values, of a column in a dataframe we used R’s $ operator the same way as we used it when we worked with a list. Here’s the output of the code above:

Now, it is easier to use the R package dplyr to select certain columns in R compared to using the $ operator. Another option is, of course, to use the double brackets.

In the next example, we are going to drop a column from the dataframe.

Here’s how we can delete a column using the $ operator and the NULL object:

`dataf$NewData <- NULL`


Again, we can use the R package dplyr to remove columns. More specifically, we can make use of the select() function to delete multiple columns in a quick and easy way.
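As a sketch of that dplyr approach (the dataframe and column names are made up here; dplyr needs to be installed):

```r
library(dplyr)

# Made-up dataframe with three columns
df <- data.frame(A = 1:3, B = 4:6, C = 7:9)

# A leading "-" inside select() drops the listed columns instead of keeping them
df <- df %>% select(-c(A, B))
names(df)
```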

Note that example 3 will also work if we have a column containing white spaces in our dataframe. Finally, before concluding this post, we will have a quick look at how to use brackets to select a column:

`dataf['ID']`


Notice how we used the column name of the variable we wanted to select. This, again, will work on a list as well.

In this post, you have learned, by examples, how to use $ in R. First, we worked with a list to add a new variable and select a variable. Then, we used the same methods on a dataframe. As a bonus, we also had a look at how to remove a column using the $ operator. Hope you learned something. If you did, please share the post in your work, on your social media accounts, or link back to it in your own blog posts. If you have any comments or suggestions on the post, please leave a comment below.


]]>In this data science tutorial, you will learn how to rename a column (or multiple columns) in R using base functions as well as dplyr. Renaming columns in R is a very easy task, especially using the rename() function. Now, renaming a column with dplyr and the rename() function is super simple. But, of course, […]

The post How to Rename Column (or Columns) in R with dplyr appeared first on Erik Marsja.

]]>In this data science tutorial, you will learn how to rename a column (or multiple columns) in R using base functions as well as dplyr. Renaming columns in R is a very easy task, especially using the `rename()`

function. Now, renaming a column with dplyr and the `rename()`

function is super simple. But, of course, it is not super hard to change the column names using base R as well.

Now, there are some cases in which you need to get rid of strange column names such as “x1”, “x2”, “x3”. If we encounter data such as this, cleaning up the names of the variables in our dataframes may be required and will definitely make our work more readable. This is especially important when we are working together with others or share our data with others. It is also very important that the columns have clear names if we plan to make the data open in a repository.

The outline of the post is as follows. First, you will learn about the requirements of this post. After you know what you need to follow this tutorial, you will get the answer to two questions. In the section following the FAQs, we will load an example dataset to work with. Here we will read an Excel file using the readxl package. When we have successfully imported data into R, we can start by changing the names of the columns. First, we will use a couple of techniques that can be done using base R. Second, we will work with dplyr. Specifically, in this section we will use the rename-family functions to change the names of some of the variables in the dataframe. That is, we will use the `rename()` and `rename_with()` functions.

Now, before going on to the next section it is worth mentioning that we can use dplyr to select columns as well as remove columns in R.

To follow this post you need to have R installed as well as the packages readxl and dplyr. If you want to install the two packages you can use the `install.packages()`

function. Here’s how to install readxl and dplyr: `install.packages(c('dplyr', 'readxl'))`

.

It is worth pointing out, here, that both these packages are part of the Tidyverse. This means that you can install them, along with a bunch of other great packages, by typing `install.packages('tidyverse')`

.

You can rename a column in R in many ways. For example, if you want to rename the column called “A” to “B” you can use this code: `names(dataframe)[names(dataframe)=="A"] <- "B"`

. This way you changed the column name to "B".

To rename a column in R you can use the `rename()`

function from dplyr. For example, if you want to rename the column "A" to "B", again, you can run the following code: `rename(dataframe, B = A)`

.

That was it, we are getting ready to practice how to change the column names in R. First, however, we need some data that we can practice on. In the next section, we are going to import data by reading a .xlsx file.

Here's how we can read a .xlsx file in R with the readxl package:

```
library(readxl)
titanic_df <- read_excel('titanic.xlsx')
```


In the code chunk above, we started by loading the library readxl and then we used the `read_excel()`

function to read titanic.xlsx file. Here's the first 6 rows of this dataframe:

In the next section, we will start by using the base functionality to rename a column in R.

Here's how to rename a single column with base R:

`names(titanic_df)[1] <- 'P_Class'`


In the code chunk above, we used the `names()` function to assign a new name to the first column in the dataframe. Specifically, using the `names()` function we get all the column names in the dataframe, and then we select the first column using the brackets. Finally, we assigned the new column name using `<-` and the character 'P_Class' (the new name). Note, you can, of course, rename multiple columns in the dataframe using the same method as above. Just change what you put within the brackets. For example, if you want to rename columns 1 to 5 you can put "1:5" within the brackets and then a character vector with 5 column names.

In the next example, we are going to use the old column name, instead, to rename the column.

Here's how to change the column name by using the old name when selecting it:

`names(titanic_df)[names(titanic_df) == 'P_Class'] <- 'PCLASS'`

In the code chunk above, we did something quite similar to the first method. However, here we selected the column we previously renamed by its name. This is what we do within the brackets. Notice how we, again, used names() and == to select the column named "P_Class". Here's the output (new column name marked with red):

In the next example, you will learn how to rename multiple columns using base R. In fact, we are going to rename all columns in the dataframe.

Renaming all columns can be done in a similar way as the last example. Here's how we change all the columns in the R dataframe:

```
names(titanic_df) <- c('PC', 'SURV', 'NAM', 'Gender', 'Age', 'SiblingsSPouses',
                       'ParentChildren', 'Tick', 'Cost', 'Cab', 'Embarked',
                       'Boat', 'Body', 'Home')
```

Notice how we only used `names()` in the code above. Here it's worth knowing that the character vector (to the right of the <-) should contain as many elements as there are columns. Otherwise, one or more columns will be named NA. Moreover, you need to know the order of the columns. In the next few examples, we are going to work with dplyr and the rename-family of functions.
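To illustrate that caveat with a too-short vector (made-up dataframe):

```r
df <- data.frame(a = 1, b = 2, c = 3)

# Supplying fewer names than there are columns leaves the rest as NA
names(df) <- c('X', 'Y')
names(df)
```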

You might also be interested in: How to use $ in R: 6 Examples – list & dataframe

Renaming a column in dplyr is quite simple. Here's how to change a column name:

`titanic_df <- titanic_df %>% rename(pc_class = PC)`

In the code chunk above, there are a few new things. First, we need to have dplyr loaded (e.g., with `library(dplyr)`). Second, we change the name in the dataframe using the `rename()` function. Notice how we use the %>% operator. This is very handy because the functions we use after it are applied to the dataframe on the left of the operator. Third, we use the `rename()` function with one argument: the column we want to rename.

Remember, we renamed all of the columns in the previous example. In the code chunk above, we are actually changing the column back again. That is, to the left of = we have the new column name and to the right, the old name. As you will see in the next example, we can rename multiple columns in the dataframe by adding arguments.

It may be worth mentioning that we can use dplyr to rename factor levels in R, and to add a column to a dataframe. In the next section, however, we are going to rename columns in R with dplyr.

If we, on the other hand, want to change the name of multiple columns we can do as follows:

`titanic_df <- titanic_df %>% rename(Survival = SURV, Name = NAM, Sibsp = SiblingsSPouses)`

It was quite simple to change the names of multiple columns using dplyr's rename() function. As you can see, in the code chunk above, we just added each column that we wanted to rename. Again, the name to the right of the equals sign is the old column name. Here are the first 6 columns and rows of the dataframe, with the new column names marked with **red**:

In the following sections, we will work with the `rename_with()`

function. This is a great function which enables us to, as you will see, change the column names to upper or lower case.

Here's how we can use the `rename_with()`

function (dplyr) to change all the column names to lowercase:

`titanic_df <- titanic_df %>% rename_with(tolower)`

In the code chunk above, we used the `rename_with()`

function and then the `tolower()`

function. This function was applied to all the column names, and the resulting dataframe looks like this:

In the next example, we are going to change the column names to uppercase using the `rename_with()`

function together with the `toupper()`

function.

In this section, we will just change the function that we use as the only argument in `rename_with()`. This will enable us to change all the column names to uppercase:

`titanic_df <- titanic_df %>% rename_with(toupper)`

Here are the first 6 rows, where all the column names are now in uppercase:

In the next section, we are going to continue working with the rename_with() function and see how we can use other functions to clean the column names from unwanted characters. For example, we can use the gsub() function to remove punctuation from column names.

In some cases, our column names may contain characters that we don't really need. Here's how to use `rename_with()`

from dplyr together with `gsub()`

to remove punctuation from all the column names in the R dataframe:

```
titanic_df <- titanic_df %>%
  rename_with(~ gsub('[[:punct:]]', '', .x))
```

Notice how we added the tilde sign (~) before the gsub() function. Moreover, the first argument is the regular expression for punctuation and the second is what we want to replace it with. In our case, here, we just remove it from the column names. We could, however, use an underscore ("_") if we wanted to replace the punctuation in the column names. Finally, if we wanted to replace specific characters, we could add them instead of the regular expression for punctuation.
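A sketch of that underscore variant (the dataframe and its dotted column names are made up for illustration; dplyr needs to be installed):

```r
library(dplyr)

# Made-up dataframe; check.names = FALSE keeps the dots in the names
df <- data.frame('Home.Dest' = 1:2, 'P.Class' = 3:4, check.names = FALSE)

# Replace every punctuation character with an underscore instead of deleting it
df <- df %>% rename_with(~ gsub('[[:punct:]]', '_', .x))
names(df)
```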

Now that you have renamed the columns that needed better and clearer names, you can continue with your data pre-processing. For example, you can add a column to the dataframe based on other columns with dplyr, calculate descriptive statistics (also with dplyr), take the absolute value in your R dataframe, or remove duplicate rows or columns in the dataframe.

In this tutorial, you have learned how to use base R as well as dplyr to rename columns. First, you learned how to use the base R functions to change the name of a single column based on its index and name. Second, you learned how to do the same with dplyr and the rename() function. Here we also renamed multiple columns as well as removed punctuation from the column names. Hope you found the post useful. If you did, please share it on your social media accounts and link to it in your projects. Finally, if you have any corrections or suggestions, on this post or in general on what should be covered on this blog, please let me know.


]]>In this data science tutorial, you will learn how to get the absolute value in R. Specifically, you will learn how to get the absolute value using the built-in function abs(). As you may already suspect, using abs() is very easy and to take the absolute value from e.g. a vector you can type abs(YourVector). […]

The post How to Take Absolute Value in R – vector, matrix, & data frame appeared first on Erik Marsja.

]]>In this data science tutorial, you will learn how to get the absolute value in R. Specifically, you will learn how to get the absolute value using the built-in function abs(). As you may already suspect, using `abs()`

is very easy and to take the absolute value from e.g. a vector you can type `abs(YourVector)`

. Furthermore, you will learn how to take the absolute value of both a matrix and a data frame. In the next section, you will get a brief overview of what is covered in this R tutorial.

The structure of the post is as follows. First, we will get the answer to a couple of simple questions. Note, most of them might actually be enough for you to understand how to get the absolute value using the R statistical programming environment. After this, you will learn what you need to know and have installed in your R environment to follow this post. Third, we will start by going into a more detailed example on how to take the absolute value of a vector in R. This section is followed by how to use the abs() function, again, on a matrix containing negative values. Finally, we will also have a look at how to take the absolute values in a data frame in R. This section will also use some of the functions of the dplyr (Tidyverse) package.

The absolute value in R is the non-negative *value* of x. To be clear, the absolute value in R is no different from the absolute value in any other programming language, as this is a matter of mathematics rather than of a programming language. In the next FAQ, you will learn how to use the `abs()`

function to get absolute values of a e.g. vector.

To change the negative numbers to positive in R we can use the `abs()`

function. For example, if we have the vector `x`

containing negative numbers, we can change them to positive numbers by typing `abs(x)`

in R.

Now that we have some basic understanding of how to change negative numbers to positive, by taking their absolute values, we can go ahead and have a look at what we need to follow this tutorial. That is, in the next section you will learn about the requirements of this post.

First of all, if you already have R installed you will also have the function abs() installed. However, if you want to use some functionality of the dplyr package (as in the later examples) you will also need to install dplyr (or Tidyverse). Moreover, if you want to read an .xlsx file in R with the readxl package you need to install it, as well. Here it might be worth pointing out that dplyr contains a lot of great functions. For example, you can use dplyr to remove columns in R as well as to select columns by e.g. name or index.

To install dplyr you can use the `install.packages()`

function. For example, to install the packages dplyr and readxl you type `install.packages(c("dplyr", "readxl"))`

. Note, you can change “dplyr” and “readxl” to “tidyverse” if you want to install all these packages as they are both part of the Tidyverse packages. In the next section, you will get the first example of how to take absolute value in R using the `abs()`

function.

Here’s how to take the absolute value from a vector in R:

```
# Creating a vector with negative values
negVec <- seq(-0.1, -1.1, by=-.1)
# R absolute value from vector
abs(negVec)
```


In the code chunk above, we first created a sequence of numbers in R with the seq() function. As you may understand, all the numbers we generated were negative. In the second line, therefore, we used the `abs()`

function to take the absolute value of the vector. Here’s the output in which all the negative numbers are now positive:

In the next example, we are going to create a matrix filled with negative numbers and get the absolute values from the matrix.

If we, on the other hand, have a matrix here’s how to take the absolute value in R:

```
negMat <- matrix(
  c(-2, -4, 3, 1, -5, 7,
    -3, -1.1, -5, -3, -1,
    -12, -1, -2.2, 1, -3.0),
  nrow = 4,
  ncol = 4)
# Take absolute value in R
abs(negMat)
```

In the example above, we created a small matrix using the `matrix()`

function and, then, used the `abs()`

function to convert all negative numbers in this matrix to positive (i.e., take the absolute values of the matrix). This example will be followed by a couple of examples in which we will take the absolute values in data frames.

Now that you have changed the negative numbers to positive, you may want to quickly get Tukey’s five number summary statistics using the R function `fivenum()`.

In this section, we will learn how to get the absolute value in dataframes in R. First, we will select one column and change it to absolute values. Second, we will select multiple columns, and again, use the `abs()`

function on these. Note, that here we will use the `mutate()`

function from dplyr. In the last example, we will also use the `select_if()`

function. This dplyr function is great if we want to be able to use the `abs()`

function on e.g. all numerical columns in a dataframe.

First, however, we are going to import the example dataset “r_absolute_value.xlsx” using the readxl package and `read_excel()`

function:

```
library(readxl)
dataf <- read_excel('./SimData/r_absolute_value.xlsx')
head(dataf)
```


We are not getting into detail when it comes to reading .xlsx files in R. However, you can download the example dataset in the link above. If you store this .xlsx file in a subfolder to your r-script (see code above) you can just copy-paste the code chunk above. However, if you store it somewhere else on your computer you should change the path to the location of the file. In the next example, we are going to get the absolute value from a single column in the dataframe.

Here’s how to take the absolute value from one column in R and create a new column:

`dataf$D.abs <- abs(dataf$D); head(dataf)`

Note, that in the example above, we selected a column using the $-operator, and then we used the `abs()`

function to take the absolute value of this column. The absolute values of this column, in turn, were also added to a new column which we created, again, using the $-operator. It is, of course, also possible to use dplyr and the `mutate()`

function instead; this is the same method that we used to add a new column to an R dataframe, as well as to add a column based on values in other columns in R. Here’s how:

`dataf <- dataf %>% mutate(D.abs = abs(D))`

Now, learning the above method is quite neat because it is a bit simpler to work with `mutate()`

compared to using only the $-operator. For example, we can make use of the %>%-operator as well (as in the example above). Furthermore, it will make the code look cleaner when creating more than one new column (as in the next example). In the next example, we are going to create two new columns by taking the absolute values of two others.

Here’s how we would take two columns and get the absolute value from them:

```
library(dplyr)
dataf <- dataf %>%
  mutate(F.abs = abs(F),
         C.abs = abs(C))
```

Again, we worked with the `mutate()`

function and created two new variables. Here it might be worth mentioning that if we only want to get the absolute values from the numerical columns in our dataframe without creating new variables we can, instead, use the `select()`

function to select the specific columns. Here’s an example in which we select two columns and take their absolute values (note that this overwrites the dataframe with only the two selected columns):

```
dataf <- dataf %>%
  select(c(F, C)) %>%
  abs()
```

In the next section, we will use this newly learned method to take the absolute value in all the columns, that are numerical, in the dataframe. However, in this example, we are going to use the `select_if()`

function and only select the numerical columns. This is good to know because if we tried to run `abs()`

on the complete dataframe we would get an error. Specifically, this would return the error “Error in Math.data.frame(dataf) : non-numeric variable(s) in data frame: M”.

In the next section, we will work with the `select_if()`

function as well as the %>% operator, again. Another awesome operator in R is the %in% operator. Make sure you check this post out to learn more:

Here’s how to apply the `abs()`

function on all the numerical columns in the dataframe:

`dataf.abs <- dataf %>% select_if(is.numeric) %>% abs()`

Note how we, again, used the %>%-operator (from magrittr, but imported with dplyr) to apply the `select_if()`

on the dataframe. Again, we used the %>%-operator and applied the `abs()`

function on all the numerical columns. Notice how the new dataframe *only* contains numerical columns (and absolute values).

Now, before concluding this post, it may be worth pointing out, again, that the tidyverse package is very handy. That is, it comes with a range of different packages that can be used for manipulating and cleaning your data. For example, you can use dplyr to rename factor levels in R, the lubridate package to extract year from date in R, and ggplot2 to create a scatter plot.

In this tutorial, you have learned about the absolute value, how to take the absolute value in R from 1) vectors, 2) matrices, and 3) columns in a dataframe. Specifically, you have learned how to use the abs() function to convert negative values to positive in a vector, a matrix, and a dataframe. When it comes to the dataframe you have learned how to select columns and convert them using r-base as well as dplyr. I really hope you learned something. If you did, please leave a comment below. You should also drop a comment if you got a suggestion or correction to the blog post. Stay safe!


]]>In this R tutorial, you will learn how to select columns in a dataframe. First, we will use base R, in a number of examples, to choose certain columns. Second, we will use dplyr to get columns from the dataframe. Outline In the first section, we are going to have a look at what you […]

The post Select Columns in R by Name, Index, Letters, & Certain Words with dplyr appeared first on Erik Marsja.

]]>In this R tutorial, you will learn how to select columns in a dataframe. First, we will use base R, in a number of examples, to choose certain columns. Second, we will use dplyr to get columns from the dataframe.

In the first section, we are going to have a look at what you need to follow this tutorial. Second, we will answer some questions that might have brought you to this post. Third, we are going to use base R to select certain columns from the dataframe. In this section, we are also going to use the great operator %in% in R to select specific columns. Fourth, we are going to use dplyr and the select() family of functions. For example, we will use the `select_if()`

to get all the numeric columns and some helper functions. The helper functions enable us to select columns starting with, or ending with, a certain word or a specific character, for instance.
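As a sketch of those helper functions (column names borrowed from the example data; dplyr needs to be installed):

```r
library(dplyr)

# Small stand-in for the example data
df <- data.frame(Depr1 = 1, Depr2 = 2, Cost = 3)

# starts_with() keeps every column whose name begins with "Depr"
df %>% select(starts_with('Depr'))

# ends_with() works the same way for the end of the name
df %>% select(ends_with('ost'))
```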

Note, the `select_if()`

function is also great if you, for example, want to take the absolute value in R dataframe and only select the numerical columns.

To select a column in R you can use brackets e.g., `YourDataFrame['Column']`

will take the column named “Column”. Furthermore, we can also use dplyr and the select() function to get columns by name or index. For instance, `select(YourDataFrame, c('A', 'B'))`

will take the columns named “A” and “B” from the dataframe.

If you want to use dplyr to select a column in R you can use the `select()`

function. For instance, `select(Data, 'Column_to_Get')`

will get the column “Column_to_Get” from the dataframe “Data”.

In the next section, we are going to learn about the prerequisites of this post and how to install R packages such as dplyr (or Tidyverse).

To follow this post you, obviously, need a working installation of R. Furthermore, we are going to read the example data from an Excel file using the readxl package. Moreover, if you want to use dplyr’s `select()` and the different helper functions (e.g., starts_with(), ends_with()) you also need to install dplyr. It may be worth pointing out that, just by using the “-” character, you can use select() (from dplyr) to drop columns in R.
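A minimal sketch of dropping a column with the “-” character (made-up dataframe; dplyr needs to be installed):

```r
library(dplyr)

df <- data.frame(A = 1:2, B = 3:4, Cost = 5:6)

# Prefixing a column name with "-" inside select() drops it
df %>% select(-B)
```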

It may be worth pointing out that both readxl and dplyr are part of the tidyverse. The Tidyverse comes with a number of great packages that are packed with useful functions. Besides selecting, or removing, columns with dplyr (part of the Tidyverse), you can extract the year from a date in R using the lubridate package, create scatter plots with ggplot2, and calculate descriptive statistics. That said, you can install one of these R packages, depending on what you need, using the `install.packages()`

function. For example, installing dplyr and readxl is done by running this in R: `install.packages(c('dplyr', 'readxl'))`

.

Before we continue and practice selecting columns in R, we will read data from a .xlsx file.

```
library(readxl)
dataf <- read_excel("add_column.xlsx")
head(dataf)
```


This example dataset is one that we used in the tutorial, in which we added a column based on other columns. We can see that it contains 9 different columns. If we want to, we can check the structure of the dataframe so that we can see what kind of data we have.

`str(dataf)`

Now, we see that there are 20 rows, as well, and that all but one column is numeric. In a more recent post, you can learn how to rename columns in R with dplyr. In the next section, we are going to learn how to select certain columns from this dataframe using base R.

In this section, we are going to practice selecting columns using base R. First, we will use the column indexes and, second, we will use the column names.

Here’s one example on how to select columns by their indexes in R:

`dataf[, c(1, 2, 3)]`


As you can see, we selected the first three columns by using their indexes (1, 2, 3). Notice how we also used the "," within the brackets. This is done to get columns rather than rows (placing the "," after the index vector would subset rows instead). Before moving on to the next example, it may be worth knowing that the vector can contain a sequence. For instance, we can generate a sequence of numbers using `:`

. For example, replacing `c(1, 2, 3)`

with `c(1:3)`

would give us the same output as above. Naturally, we can also select e.g. the third, fifth, and sixth columns if we want to. In the next example, we are going to subset certain columns by their name. Note, sequences of numbers can also be generated in R with the seq() function.
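A small sketch of using : and seq() for column indexes (made-up dataframe for illustration):

```r
df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)

# c(1:3) and seq(1, 3) produce the same index vector
df[, c(1:3)]
df[, seq(1, 3)]

# seq() also takes a step, e.g. every other column
df[, seq(1, 5, by = 2)]
```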

Here’s how we can select columns in R by name:

`dataf[, c('A', 'B', 'Cost')]`


In the code chunk above, we basically did the same as in the first example. Notice, however, how we removed the numbers and added the column names. That is, in the vector we now used the names of the columns we wanted to select. In the next example, we are going to learn a neat little trick using the %in% operator when selecting columns by name.

Here’s how we can make use of the %in% operator to get columns by name from the R dataframe:

```
head(dataf[, (colnames(dataf) %in% c('Depr1', 'Depr2',
'Depr4', 'Depr7'))])
```


In the code chunk above, we used the great %in% operator. Notice something different in the character vector? There’s a column that doesn’t exist in the example data. The cool thing, here, is that even if we do this, the %in% operator will still select the columns that actually exist in the dataframe. In the next section, we are going to have a look at a couple of examples using dplyr’s `select()`

and some of the great helper functions.

In this section, we will start with the basic examples of selecting columns (e.g., by name and index). However, the focus will be on using the helper functions together with `select()`

, and the `select_if()`

function.

Here’s how we can get columns by index using the `select()`

function:

```
library(dplyr)
dataf %>%
  select(c(2, 5, 6))
```

Notice how we used another great operator: %>%. This is the pipe operator, and following it, we used the select() function. Just as when selecting columns with base R, we added a vector with the indexes of the columns we want. In the next example, we will basically do the same but select by column names.

Here’s how we use `select()`

to get the columns we want by name:

```
library(dplyr)
dataf %>%
select(c('A', 'Cost', 'Depr1'))
```


In the code chunk above, we just added the names of the columns in the vector. Simple! In the next example, we are going to have a look at how to use `select_if()`

to select columns containing data of a specific type.

Here’s how to select all the numeric columns in an R dataframe:

```
dataf %>%
select_if(is.numeric)
```


Remember, all columns except for one are of numeric type. This means that we will get 8 out of 9 columns running the above code. If we, on the other hand, added the `is.character`

function we would only select the first column. In the next section, we will learn how to get columns starting with a certain letter.

Here’s how we use the `starts_with()`

helper function and `select()`

to get all columns starting with the letter “D”:

```
dataf %>%
select(starts_with('D'))
```


Selecting columns with names starting with a certain letter was pretty easy. In the `starts_with()`

helper function we just added the letter.

Here’s how we use the `ends_with()`

helper function and `select()`

to get all columns ending with the letter “D”:

```
dataf %>%
select(ends_with('D'))
```


Note that in the example dataset there is only one column ending with the letter “D”. In fact, all column names end with unique characters. That is, here it would not make sense to select columns using this method. It is worth noting, here, that we can use a word when working with both the `starts_with()`

and `ends_with()`

helper functions. Let’s have a look!

Here’s how we can select certain columns starting with a specific word:

```
dataf %>%
select(starts_with('Depr'))
```


Of course, “Depr” is not really a word, and, yes, we get the exact same columns as in example 7. However, you get the idea and should understand how to use this in your own application. For example, this makes sense to do when you have multiple columns beginning with the same letter but only some of them beginning with the same word. In the final example, we are going to select certain column names that contain a string (or a word).

Before going to the next section, it may be worth mentioning another great feature of the dplyr package: you can use dplyr to rename factor levels in R.

Here’s how we can select certain columns containing a string:

```
dataf %>%
select(contains('pr'))
```


Again, this particular example doesn’t make sense on the example dataset. There’s a final helper function that is worth mentioning: `matches()`

. This function can be used to check whether column names contain a pattern (regular expression) such as digits. Now that you have selected the columns you need, you can continue manipulating your data and get it ready for data analysis. For example, you can now go ahead and create dummy variables in R or add a new column.

In this post, you have learned how to select certain columns using base R and dplyr. Specifically, you have learned how to get columns, from the dataframe, based on their indexes or names. Furthermore, you have learned to select columns of a specific type. After this, you learned how to subset columns based on whether the column names started or ended with a letter. Finally, you have also learned how to select based on whether the columns contained a string or not. Hope you found this blog post useful. If you did, please share it on your social media accounts, add a link to the tutorial in your project reports and such, and leave a comment below.

The post Select Columns in R by Name, Index, Letters, & Certain Words with dplyr appeared first on Erik Marsja.


In this Python data analysis tutorial, you will learn how to perform a paired sample t-test in Python. First, you will learn about this type of t-test (e.g. when to use it, the assumptions of the test). Second, you will learn how to check whether your data follow the assumptions and what you can do if your data violates some of the assumptions.

Third, you will learn how to perform a paired sample t-test using the following Python packages:

- Scipy (scipy.stats.ttest_rel)
- Pingouin (pingouin.ttest)

In the final sections of this tutorial, you will also learn how to:

- Interpret the paired t-test (p-value and effect size)
- Report the results and visualize the data

In the first section, you will learn about what is required to follow this post.

In this tutorial, we are going to use both SciPy and Pingouin, two great Python packages, to carry out the dependent sample t-test. Furthermore, to read the dataset we are going to use Pandas. Finally, we are also going to use Seaborn to visualize the data. In the next three subsections, you will find a brief description of each of these packages.

SciPy is one of the essential data science packages. This package is, furthermore, a dependency of all the other packages that we are going to use in this tutorial. In this tutorial, we are going to use it to test the assumption of normality as well as carry out the paired sample t-test. This means, of course, that if you are going to carry out the data analysis using Pingouin you will get SciPy installed anyway.

Pandas is also a great Python package for anyone carrying out data analysis with Python, whether a data scientist or a psychologist. In this post, we will use Pandas to import data into a dataframe and to calculate summary statistics.

In this tutorial, we are going to use data visualization to guide our interpretation of the paired sample t-test. Seaborn is a great package for carrying out data visualization (see for example these 9 examples of how to use Seaborn for data visualization in Python).

In this tutorial, Pingouin is the second package that we are going to use to do a paired sample t-test in Python. One great thing with the ttest function is that it returns a lot of information we need when reporting the results from the test. For instance, when using Pingouin we also get the degrees of freedom, Bayes Factor, power, effect size (Cohen’s d), and confidence interval.

In Python, we can install packages with pip. To install all the required packages run the following code:

`pip install scipy pandas seaborn pingouin`

In the next section, we are going to learn about the paired t-test and its assumptions.

The paired sample t-test is also known as the *dependent sample t-test* or *paired t-test*. This type of t-test compares two averages (means) and tells you whether the difference between these two averages is zero. In a paired sample t-test, each participant is measured twice, which results in pairs of observations (the next section will give you an example).

For example, if clinical psychologists want to test whether a treatment for depression will change the quality of life, they might set up an experiment. In this experiment, they will collect information about the participants’ quality of life before the intervention (i.e., the treatment) and after. That is, they are conducting a pre- and post-test study. In the pre-test, the average quality of life might be 3, while in the post-test the average quality of life might be 5. Numerically, we might think that the treatment is working. However, it could be a fluke and, to test this, the clinical researchers can use the paired sample t-test.

Now, when performing dependent sample t-tests you typically have the following two hypotheses:

- Null hypotheses: the true mean difference is equal to zero (between the observations)
- Alternative hypotheses: the true mean difference is not equal to zero (two-tailed)

Note, in some cases we also may have a specific idea, based on theory, about the direction of the measured effect. For example, we may strongly believe (due to previous research and/or theory) that a specific intervention should have a positive effect. In such a case, the alternative hypothesis will be something like: the true mean difference is greater than zero (one-tailed). Note, it can also be smaller than zero, of course.
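As a sketch of how such a directional (one-tailed) alternative can be specified with SciPy's paired t-test — note that the `alternative` parameter requires SciPy >= 1.6, and the data below are made up purely for illustration:

```python
from scipy.stats import ttest_rel

# Made-up paired scores, purely for illustration
pre = [3, 2, 4, 3, 3, 2, 4, 3]
post = [5, 4, 5, 4, 6, 4, 5, 5]

# One-tailed test: is the true mean difference (post - pre) greater than zero?
result = ttest_rel(post, pre, alternative='greater')
print(result.statistic, result.pvalue)
```

Leaving out `alternative` (or setting it to `'two-sided'`) gives the default two-tailed test.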

Before we continue and import data, we will briefly have a look at the assumptions of the paired t-test. Besides the dependent variable being continuous and measured on an interval/ratio scale, there are three assumptions that need to be met:

- Are the pairs of observations independent of each other?
- Do the differences for the matched pairs follow a normal distribution?
- Are the participants randomly selected from the population?

If your data is not following a normal distribution you can transform your dependent variable using square root, log, or Box-Cox in Python. In the next section, we will import data.

Before we check the normality assumption of the paired t-test in Python, we need some data to even do so. In this tutorial post, we are going to work with a dataset that can be found here. Here we will use Pandas and the read_csv method to import the dataset (stored in a .csv file):

```
import pandas as pd

df = pd.read_csv('./SimData/paired_samples_data.csv',
                 index_col=0)
```


In the image above, we can see the structure of the dataframe. Our dataset contains 100 observations and three variables (columns). Furthermore, there are three different data types in the dataframe. First, we have an integer column (i.e., “ids”). This column contains the identifier for each individual in the study. Second, we have the column “test”, which is of object data type and contains the information about the test time point. Finally, we have the “score” column, where the dependent variable is. We can check the pairs by grouping the Pandas dataframe and calculating descriptive statistics:
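The grouping can be sketched like this (shown on a small made-up dataframe with the same `test` and `score` columns, since the post's CSV is not reproduced here):

```python
import pandas as pd

# Small illustrative dataframe mirroring the structure described above
df = pd.DataFrame({'ids': [1, 2, 1, 2],
                   'test': ['Pre', 'Pre', 'Post', 'Post'],
                   'score': [39.0, 41.0, 45.0, 47.0]})

# Group by test time point and get descriptive statistics for the score
print(df.groupby('test')['score'].describe())
```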

In the code chunk above, we grouped the data by “test” and selected the dependent variable, and got some descriptive statistics using the `describe()`

method. If we want, we can use Pandas to count unique values in a column:

`df['test'].value_counts()`


This way, we got the information that we have as many observations in the post-test as in the pre-test. A quick note before we continue to the next subsection, in which we subset the data: you should also check whether the dependent variable is normally distributed. This can be done by creating a histogram (e.g., with Pandas) and/or by carrying out the Shapiro-Wilk test.
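A minimal sketch of the Shapiro-Wilk test with SciPy (run here on simulated differences, since the point is only to show the API):

```python
import numpy as np
from scipy.stats import shapiro

# Simulated paired differences, for illustration only
rng = np.random.default_rng(42)
differences = rng.normal(loc=2, scale=1, size=100)

stat, p = shapiro(differences)
# A p-value above the chosen alpha (e.g., 0.05) gives no evidence against normality
print(stat, p)
```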

Both the methods, whether using SciPy or Pingouin, require that we have our dependent variable in two Python variables. Therefore, we are going to subset the data and select only the dependent variable. To our help we have the `query()`

method and we will select a column using the brackets ([]):

```
b = df.query('test == "Pre"')['score']
a = df.query('test == "Post"')['score']
```


Now that we have the variables a and b containing the dependent variable pairs, we can use SciPy to do a paired sample t-test.

Here’s how to carry out a paired sample t-test in Python using SciPy:

```
from scipy.stats import ttest_rel
# Python paired sample t-test
ttest_rel(a, b)
```


In the code chunk above, we first started by importing `ttest_rel()`

, the method we then used to carry out the dependent sample t-test. Furthermore, the two parameters we used were the data, containing the dependent variable, in the pairs (a, and b). Now, we can see by the results (image below) that the difference between the pre- and post-test is statistically significant.

In the next section, we will use Pingouin to carry out the paired t-test.

Here’s how to carry out the dependent samples t-test using the Python package Pingouin:

```
import pingouin as pt
# Python paired sample t-test:
pt.ttest(a, b, paired=True)
```


There’s not that much to explain, about the code chunk above, but we started by importing pingouin. Next, we used the `ttest()`

method and used our data. Notice how we used the paired parameter and set it to True. We did this because it is a paired sample t-test we wanted to carry out. Here’s the output:

As you can see, we get more information when using Pingouin to do the paired t-test. In fact, here we basically get all we need to continue and interpret the results. In the next section, before learning how to interpret the results, you can also watch a YouTube video explaining all the above (with some exceptions, of course):

Here’s the majority of the current blog post explained in a YouTube video:

In this section, you will be given a short explanation on how to interpret the results from a paired t-test carried out with Python. Note, we will focus on the results that we got from Pingouin as they give us more information (e.g., degrees of freedom, effect size).

Now, the p-value of the test is smaller than 0.001, which is less than the significance level alpha (e.g., 0.05). This means that we can draw the conclusion that the quality of life has increased when the participants conducted the post-test. Note, this can, of course, be due to other things than the intervention but that’s another story.

Note that the p-value is the probability of obtaining an effect at least as extreme as the one in our data, assuming that the null hypothesis is true. P-values address only one question: how likely is the collected data, assuming a true null hypothesis? Notice that the p-value can never be used as support for the alternative hypothesis.

Normally, we interpret Cohen’s d in terms of the relative strength of e.g. the treatment. Cohen (1988) suggested that *d* = 0.2 is a ‘small’ effect size, 0.5 is a ‘medium’ effect size, and 0.8 is a ‘large’ effect size. You can interpret this such that if two groups’ means don’t differ by 0.2 standard deviations or more, the difference is trivial, even if it is statistically significant.
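For reference, one common paired-samples formulation of Cohen's d (often called d_z: the mean of the differences divided by the standard deviation of the differences) can be sketched like this; note that Pingouin may report a different variant:

```python
import numpy as np

def cohens_d_paired(a, b):
    """Cohen's d for paired samples: mean difference / SD of the differences."""
    diff = np.asarray(a) - np.asarray(b)
    return diff.mean() / diff.std(ddof=1)

# Toy values, purely for illustration
print(cohens_d_paired([1, 2, 4], [0, 1, 2]))
```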

When using Pingouin to carry out the paired t-test we also get the Bayes Factor. See this post for more information on how to interpret BF10.

In this section, you will learn how to report the results according to the APA guidelines. In our case, we can report the results from the t-test like this:

The results from the pre-test (*M* = 39.77, *SD* = 6.758) and post-test (*M* = 45.737, *SD* = 6.77) quality of life test suggest that the treatment resulted in an improvement in quality of life, *t*(49) = 115.4384, *p* < .01. Note that the “quality of life test” is something made up for this post (or there might be such a test, of course, that I don’t know of!).

In the final section, before the conclusion, you will learn how to visualize the data in two different ways: creating boxplots and violin plots.

Here’s how we can guide the interpretation of the paired t-test using boxplots:

```
import seaborn as sns
sns.boxplot(x='test', y='score', data=df)
```


In the code chunk above, we imported seaborn (as sns), and used the boxplot method. First, we put the column that we want to display separate plots for on the x-axis. Here’s the resulting plot:

Here’s another way to report the results from the t-test by creating a violin plot:

```
import seaborn as sns
sns.violinplot(x='test', y='score', data=df)
```


Much like creating the box plot, we import seaborn and add the columns/variables we want as x- and y-axis’. Here’s the resulting plot:

As you may already be aware, there are other ways to analyze data. For example, you can use Analysis of Variance (ANOVA) if there are more than two levels in the factor (e.g., tests during the treatment, as well as pre- and post-tests) in the data. See the following posts about how to carry out ANOVA:

- Repeated Measures ANOVA in R and Python using afex & pingouin
- Two-way ANOVA for repeated measures using Python
- Repeated Measures ANOVA in Python using Statsmodels

Recently, machine learning methods have grown popular. See the following posts for more information:

In this post, you have learned two methods to perform a paired sample t-test. Specifically, in this post you have installed, and used, three Python packages for data analysis (Pandas, SciPy, and Pingouin). Furthermore, you have learned how to interpret and report the results from this statistical test, including data visualization using Seaborn. In the Resources and References section, you will find useful resources and references to learn more. As a final word: the Python package Pingouin will give you the most comprehensive results, and that’s the package I’d choose to carry out many statistical methods in Python.

If you liked the post, please share it on your social media accounts and/or leave a comment below. Commenting is also a great way to give me suggestions. However, if you are looking for any help please use other means of contact (see e.g., the About or Contact pages).

Finally, support me and my content (much appreciated, especially if you use an AdBlocker): become a patron. Becoming a patron will give you access to a Discord channel in which you can ask questions and may get interactive feedback.

Here are some useful peer-reviewed articles, blog posts, and books. Refer to these if you want to learn more about the t-test, p-value, effect size, and Bayes Factors.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

It’s the Effect Size, Stupid – What effect size is and why it is important

Using Effect Size—or Why the P Value Is Not Enough.

Beyond Cohen’s d: Alternative Effect Size Measures for Between-Subject Designs (Paywalled).

A tutorial on testing hypotheses using the Bayes factor.

The post How to use Python to Perform a Paired Sample T-test appeared first on Erik Marsja.


The post How to use Square Root, log, & Box-Cox Transformation in Python appeared first on Erik Marsja.

In this tutorial, related to data analysis in Python, you will learn how to deal with your data when it is not following the normal distribution. One way to deal with non-normal data is to transform your data. In this post, you will learn how to carry out Box-Cox, square root, and log transformation in Python.

That the data we have is of normal shape (also known as following a bell curve) is important for the majority of the parametric tests we may want to perform. This includes regression analysis, the two-sample t-test, and Analysis of Variance, to name a few, all of which can be carried out in Python.

This post will start by briefly going through what you need to follow this tutorial. After this is done, you will 1) get information about skewness and kurtosis, and 2) a brief overview of the different methods of transformation. In the section, following the transformation methods, you will learn how to import data using Pandas read_csv. We will explore the example dataset a bit by creating histograms, getting the measures of skewness and kurtosis. Finally, the last sections will be covering how to transform data that is non-normal.

In this tutorial, we are going to use Pandas, SciPy, and NumPy. It is worth mentioning, here, that you only need to install Pandas as the other two Python packages are dependencies of Pandas. That is, if you install Python packages using e.g. pip it will also install SciPy and NumPy on your computer, whether you use e.g. Ubuntu Linux or Windows 10. Note, that you can use pip to install a specific version of e.g. Pandas and if you need, you can upgrade pip using either conda or pip.

Now, if you want to install the packages individually (e.g., you only need NumPy and SciPy), you can run, for example, the following code:

`pip install pandas`

Now, if you only want to install NumPy, change “pandas” to “numpy” in the code chunk above. That said, let us move on to the section about skewness and kurtosis.

Briefly, skewness is a measure of symmetry. To be exact, it is a measure of lack of symmetry. This means that the larger the number is the more your data lack symmetry (not normal, that is). Kurtosis, on the other hand, is a measure of whether your data is heavy- or light-tailed relative to a normal distribution. See here for a more mathematical definition of both measures. A good way to visually examine data for skewness or kurtosis is to use a histogram. Note, however, that there are, of course, also different statistical tests that can be used to test if your data is normally distributed.
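As a quick sketch, both measures can be computed with SciPy (here on a simulated right-skewed sample):

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Exponential data are right-skewed (theoretical skewness of 2)
rng = np.random.default_rng(1)
x = rng.exponential(size=1000)

print(skew(x))      # positive for right-skewed data
print(kurtosis(x))  # excess kurtosis; roughly 0 for a normal distribution
```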

One way of handling right, or left, skewed data is to carry out the logarithmic transformation on our data. For example, `np.log(x)`

will log transform the variable `x`

in Python. There are other options as well as the Box-Cox and Square root transformations.

One way to handle left (negative) skewed data is to reverse the distribution of the variable. In Python, this can be done using the following code:
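A minimal sketch of reversing a distribution, using the same `max(x + 1) - x` trick that appears later in this post (toy values, for illustration):

```python
import numpy as np

x = np.array([1.0, 4.0, 4.5, 5.0])  # toy left-skewed values

# Reverse the distribution: large values become small and vice versa
reversed_x = (x.max() + 1) - x
print(reversed_x)  # the smallest value becomes the largest
```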

Both of the above techniques will be explained in more detail throughout the post (e.g., you will learn how to carry out the log transformation in Python). In the next section, you will learn about the three commonly used transformation techniques that you, later, will also learn to apply.

As indicated in the introduction, we are going to learn three methods that we can use to transform data deviating from the normal distribution. In this section, you will get a brief overview of these three transformation techniques and when to use them.

The square root method is typically used when your data is moderately skewed. Using the square root (e.g., sqrt(x)) is a transformation that has a moderate effect on distribution shape. It is generally used to reduce right-skewed data. Finally, the square root can be applied to zero values and is most commonly used on count data.

The logarithm is a strong transformation that has a major effect on distribution shape. This technique is, like the square root method, often used for reducing right skewness. Worth noting, however, is that it cannot be applied to zero or negative values.

The Box-Cox transformation is, as you probably understand, also a technique to transform non-normal data into a normal shape. It is a procedure for identifying a suitable exponent (lambda) to use to transform skewed data.

Now, the above-mentioned transformation techniques are the most commonly used. However, there are plenty of other methods that can be used to transform skewed dependent variables. For example, if your data consists of proportions you can also use the arcsine transformation method. Another method that you can use is called the reciprocal. This method is basically carried out like this: 1/x, where x is your dependent variable.
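A sketch of the reciprocal transformation on toy data (note that it is undefined for zero and reverses the ordering of positive values):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])  # toy positive values
reciprocal = 1 / x
print(reciprocal)
```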

In the next section, we will import data containing four dependent variables that are positively and negatively skewed.

In this tutorial, we will transform data that is both negatively (left) and positively (right) skewed and we will read an example dataset from a CSV file (Data_to_Transform.csv). To our help we will use Pandas to read the .csv file:

```
import pandas as pd
import numpy as np
# Reading dataset with skewed distributions
df = pd.read_csv('./SimData/Data_to_Transform.csv')
```


This is an example dataset that has the following four variables:

- Moderate Positive Skew (Right Skewed)
- Highly Positive Skew (Right Skewed)
- Moderate Negative Skew (Left Skewed)
- Highly Negative Skew (Left Skewed)

We can obtain this information by using the `info()`

method. This will give us the structure of the dataframe:
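As a standalone sketch of the `info()` call (on a small made-up dataframe, not the post's actual data):

```python
import pandas as pd

df = pd.DataFrame({'Moderate Positive Skew': [0.1, 0.2, 0.3],
                   'Moderate Negative Skew': [1.0, 2.0, 3.0]})

df.info()  # prints the row count, column names, non-null counts, and dtypes
```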

As you can see, the dataframe has 10000 rows and 4 columns (as previously described). Furthermore, we get the information that the 4 columns are of float data type and that there are no missing values in the dataset. In the next section, we will have a quick look at the distribution of our 4 variables.


In this section, we are going to visually inspect whether the data are normally distributed. Of course, there are several ways to plot the distribution of our data. In this post, however, we are going to only use Pandas and create histograms. Here’s how to create a histogram in Pandas using the `hist()`

method:

```
df.hist(grid=False,
figsize=(10, 6),
bins=30)
```


Now, the `hist()` method takes all our numeric variables in the dataset (i.e., in our case, the columns of float data type) and creates a histogram for each. To quickly explain the parameters used in the code chunk above: first, we set the `grid` parameter to `False` to remove the grid from the histogram. Second, we changed the figure size using the `figsize` parameter. Finally, we also changed the number of bins (the default is 10) to get a better view of the data. Here is the distribution visualized:

It is pretty clear that all the variables are skewed and not following a normal distribution (as the variable names imply). Note, there are, of course, other visualization techniques that you can carry out to examine the distribution of your dependent variables. For example, you can use boxplots, stripplots, swarmplots, kernel density estimation, or violin plots. These plots give you a lot of (more) information about your dependent variables. See the post with 9 Python data visualization examples, for more information. In the next section, we are also going to have a look at how we can get the measures of skewness and kurtosis.

More data visualization tutorials:

- Seaborn Line Plots: A Detailed Guide with Examples (Multiple Lines)
- How to use Pandas Scatter Matrix (Pair Plot) to Visualize Trends in Data
- How to Save a Seaborn Plot as a File (e.g., PNG, PDF, EPS, TIFF)

In this section, before we start learning how to transform skewed data in Python, we will just have a quick look at how to get skewness and kurtosis in Python.

`df.agg(['skew', 'kurtosis']).transpose()`


In the code chunk above, we used the `agg()`

method and used a list as the only parameter. This list contained the two methods that we wanted to use (i.e., we wanted to calculate skewness and kurtosis). Finally, we used the `transpose()`

method to change the rows to columns (i.e., transpose the Pandas dataframe) so that we get an output that is a bit easier to check. Here’s the resulting table:

As a rule of thumb, skewness can be interpreted like this:

| Skewness | Range |
| --- | --- |
| Fairly symmetrical | -0.5 to 0.5 |
| Moderately skewed | -1.0 to -0.5 or 0.5 to 1.0 |
| Highly skewed | less than -1.0 or greater than 1.0 |

There are, of course, more things that can be done to test whether our data is normally distributed. For example, we can carry out statistical tests of normality such as the Shapiro-Wilk test. It is worth noting, however, that most of these tests are sensitive to sample size. That is, with a large sample even small deviations from normality will be flagged by e.g. the Shapiro-Wilk test.

In the next section, we will start transforming the non-normal (skewed) data. First, we will transform the moderate skewed distributions and, then, we will continue with the highly skewed data.

Here’s how to do the square root transformation of non-normal data in Python:

```
# Python square root transformation
df.insert(len(df.columns), 'A_Sqrt',
          np.sqrt(df.iloc[:, 0]))
```


In the code chunk above, we created a new column/variable in the Pandas dataframe by using the `insert()`

method. It is, furthermore, worth mentioning that we used the iloc[] method to select the column we wanted. In the following examples, we are going to continue using this method for selecting columns. Notice how the first parameter (i.e., “:”) is used to select all rows, and the second parameter (“0”) is used to select the first columns. If we, on the other hand, used the loc method we could have selected by the column name. Here’s a histogram of our new column/variable:

Here, we can see that the new, square root transformed, distribution is more symmetrical than the original, right-skewed, distribution.

In the next subsection, you will learn how to deal with negatively (left) skewed data. If we try to apply sqrt() on the column, right now, we will get a ValueError (see towards the end of the post).

Now, if we want to transform the negatively (left) skewed data using the square root method we can do as follows.

```
# Square root transformation on left skewed data in Python:
df.insert(len(df.columns), 'B_Sqrt',
          np.sqrt(max(df.iloc[:, 2]+1) - df.iloc[:, 2]))
```


What we did, above, was to reverse the distribution (i.e., `max(df.iloc[:, 2] + 1) - df.iloc[:, 2]`

) and then applied the square root transformation. You can see, in the image below, that skewness becomes positive when reverting the negatively skewed distribution.

In the next section, you will learn how to log transform in Python on highly skewed data, both to the right and left.

Here’s how we can use the log transformation in Python to get our skewed data more symmetrical:

```
# Python log transform
df.insert(len(df.columns), 'C_log',
np.log(df['Highly Positive Skew']))
```

Now, we did pretty much the same as when using Python to do the square root transformation. Here, we created a new column using the insert() method. This time, however, we used the log() method from NumPy because we wanted to do a logarithmic transformation. Here’s what the distribution looks like now:

Here’s how to log transform negatively skewed data in Python:

```
# Log transformation of negatively (left) skewed data in Python
df.insert(len(df.columns), 'D_log',
np.log(max(df.iloc[:, 2] + 1) - df.iloc[:, 2]))
```

Again, we carried out the log transformation using NumPy's log() method, and we reversed the distribution exactly as in the square root example. We can, again, see that all that happened is that the skewness went from negative to positive.
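
As an illustrative sketch (using made-up lognormal data rather than the tutorial's dataset), we can verify that a log transform pulls in a long right tail:

```
# Sketch: log transform reduces right skew (made-up lognormal data)
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
skewed = pd.Series(np.exp(rng.normal(size=1000)))  # strictly positive, right-skewed

log_transformed = np.log(skewed)

print(skewed.skew())           # clearly positive
print(log_transformed.skew())  # close to zero
```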

In the next section, we will have a look on how to use SciPy to carry out the Box Cox transformation on our data.

Here’s how to implement the Box-Cox transformation using the Python package SciPy:

```
from scipy.stats import boxcox
# Box-Cox Transformation in Python
df.insert(len(df.columns), 'A_Boxcox',
boxcox(df.iloc[:, 0])[0])
```

In the code chunk above, basically the only difference from the previous examples is that we imported `boxcox()` from `scipy.stats`. Furthermore, we used the `boxcox()` method to apply the Box-Cox transformation. Notice how we selected the first element using brackets (i.e. `[0]`). This is because `boxcox()` returns a tuple. Here’s a visualization of the resulting distribution.
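
A small sketch (with made-up positive data) of the tuple that `boxcox()` returns — the transformed values and the fitted lambda:

```
# Sketch: boxcox() returns (transformed_data, fitted_lambda)
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(7)
positive_data = np.exp(rng.normal(size=200))  # made-up, strictly positive

transformed, fitted_lambda = boxcox(positive_data)

print(transformed.shape)         # same length as the input
print("lambda:", fitted_lambda)  # near 0 for lognormal-like data
```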

Once again, we managed to transform our positively skewed data into a relatively symmetrical distribution. Note that the Box-Cox transformation also requires our data to contain only positive numbers, so if we want to apply it to negatively skewed data we need to reverse it first (see the previous examples on how to reverse a distribution). If we try to use `boxcox()` on the column “Moderate Negative Skew”, for example, we get a ValueError.

More exactly, if you get “ValueError: Data must be positive” from SciPy’s `boxcox()`, it is because your data contain zero or negative numbers; `np.sqrt()` and `np.log()` will instead produce NaN values (with a RuntimeWarning) for negative inputs. To solve this, you can reverse the distribution as shown earlier.

It is worth noting, here, that we can now check the skewness using the `skew()` method:

`df.agg(['skew']).transpose()`

We can see in the output that the skewness values of the transformed variables are now acceptable (all below 0.5 in absolute value). Of course, we could also run the previously mentioned tests of normality (e.g., the Shapiro-Wilk test). Note that if your data are still not normally distributed, you can carry out the Mann-Whitney U test in Python, as well.

In this post, you have learned how to apply square root, logarithmic, and Box-Cox transformations in Python using Pandas, NumPy, and SciPy. Specifically, you have learned how to transform both positively (right) and negatively (left) skewed data so that it better meets the assumption of normality. First, you learned briefly about the Python packages needed to transform non-normal, skewed data into approximately normally distributed data. Second, you learned about the three methods and how to carry them out in Python.

Here are some useful resources for further reading.

DeCarlo, L. T. (1997). On the meaning and use of kurtosis. *Psychological Methods*, *2*(3), 292–307. https://doi.org/10.1037//1082-989x.2.3.292

Blanca, M. J., Arnau, J., López-Montiel, D., Bono, R., & Bendayan, R. (2013). Skewness and kurtosis in real data samples. *Methodology: European Journal of Research Methods for the Behavioral and Social Sciences*, *9*(2), 78–84. https://doi.org/10.1027/1614-2241/a000057

Mishra, P., Pandey, C. M., Singh, U., Gupta, A., Sahu, C., & Keshri, A. (2019). Descriptive statistics and normality tests for statistical data. *Annals of cardiac anaesthesia*, *22*(1), 67–72. https://doi.org/10.4103/aca.ACA_157_18

The post How to use Square Root, log, & Box-Cox Transformation in Python appeared first on Erik Marsja.

]]>In this post, you will learn what you need to add new columns to your dataframe in R. We will work both with base R and some of the great Tidyverse packages.

The post How to Add a Column to a Dataframe in R with tibble & dplyr appeared first on Erik Marsja.

]]>In this brief tutorial, you will learn how to add a column to a dataframe in R. More specifically, you will learn 1) to add a column using base R (i.e., by using the $-operator and brackets, 2) add a column using the add_column() function (i.e., from tibble), 3) add multiple columns, and 4) to add columns from one dataframe to another.

Note, when adding a column with tibble we are, as well, going to use the `%>%` operator, which is part of dplyr. Note that dplyr, as well as tibble, has plenty of useful functions that, apart from enabling us to add columns, make it easy to remove a column by name from the R dataframe (e.g., using the `select()` function).

First, before reading an example data set from an Excel file, you are going to get the answer to a couple of questions. Second, we will have a look at the prerequisites to follow this tutorial. Third, we will have a look at how to add a new column to a dataframe using first base R and, then, tibble and the `add_column()` function. In this section, using dplyr and `add_column()`, we will also have a quick look at how we can add an empty column. Note, we will also append a column based on other columns. Furthermore, in the two last sections, we are going to learn how to insert multiple columns into a dataframe using tibble.

To follow this tutorial, in which we will carry out a simple data manipulation task in R, you only need to install dplyr and tibble if you want to use the `add_column()` and `mutate()` functions as well as the `%>%` operator. However, if you want to read the example data, you will also need to install the readxl package.

It may be worth noting that all the mentioned packages are part of the Tidyverse. This package comes packed with a lot of tools that can be used for cleaning and visualizing data (e.g., to create a scatter plot in R with ggplot2).

To add a new column to a dataframe in R you can use the $-operator. For example, to add the column “NewColumn”, you can do like this: `dataf$NewColumn <- Values`. Now, this will effectively add your new variable to your dataset.

To add a column from one dataframe to another you can use the $ operator. For example, if you want to add the column named "A" from the dataframe called "dfa" to the dataframe called "dfb", you can run the following code: `dfb$A <- dfa$A`. Adding multiple columns from one dataframe to another can, of course, also be accomplished.

In the next section, we are going to use the `read_excel()` function from the readxl package. After this, we are going to use R to add a column to the created dataframe.

Here’s how to read a .xlsx file in R:

```
# Import readxl
library(readxl)
# Read data from .xlsx file
dataf <- read_excel('./SimData/add_column.xlsx')
```

In the code chunk above, we imported the file add_column.xlsx. This file was downloaded to the same directory as the script. We can obtain some information about the structure of the data using the `str()` function:

Before going to the next section it may be worth pointing out that it is possible to import data from other formats. For example, you can see a couple of tutorials covering how to read data from SPSS, Stata, and SAS:

- How to Read and Write Stata (.dta) Files in R with Haven
- Reading SAS Files in R
- How to Read & Write SPSS Files in R Statistical Environment

Now that we have some example data, to practice with, move on to the next section in which we will learn how to add a new column to a dataframe in base R.

First, we will use the $-operator and assign a new variable to our dataset. Second, we will use brackets ("[ ]") to do the same.

Here’s how to add a new column to a dataframe using the $-operator in R:

```
# add column to dataframe
dataf$Added_Column <- "Value"
```

Note how we used the $ operator to create the new column in the dataframe. What we added to the dataframe was a single character value (i.e., the same word). This will produce a character vector as long as the number of rows. Here are the first 6 rows of the dataframe with the added column:

If we, on the other hand, tried to assign a vector that is not of the same length as the dataframe, it would fail. We would get an error similar to "*Error: Assigned data `c(2, 1)` must be compatible with existing data.*" For more about the dollar sign operator, check the post "How to use $ in R: 6 Examples – list & dataframe (dollar sign operator)".

If we would like to add a sequence of numbers, we can use the `seq()` function and the `length.out` argument:

```
# add column to dataframe
dataf$Seq_Col <- seq(1, 10, length.out = dim(dataf)[1])
```

Notice how we also used the `dim()` function and selected its first element (the number of rows) to create a sequence with the same length as the number of rows. Of course, in a real-life example, we would probably want to specify the sequence a bit more before adding it as a new column. In the next section, we will learn how to add a new column using brackets.

Here’s how to append a column to a dataframe in R using brackets (“[]”):

```
# Adding a new column
dataf["Added_Column"] <- "Value"
```

Using the brackets will give us the same result as using the $-operator. However, it may sometimes be easier to use brackets instead of $. For example, when we have column names containing whitespace, brackets may be the way to go. Also, when selecting multiple columns you have to use brackets, not $. In the next section, we are going to create a new column using tibble and the `add_column()` function.

Here’s how to add a column to a dataframe in R:

```
# Append column using Tibble:
dataf <- dataf %>%
add_column(Add_Column = "Value")
```

In the example above, we added a new column at “the end” of the dataframe, producing the following output. Note, as an aside, that we can also use dplyr to remove columns by name.

Finally, if we want to, we can add a column and keep a copy of our old dataframe: just change the code so that the “dataf” on the left of the assignment is something else, e.g. “dataf2”. Now that we have added a column to the dataframe, it might be time for other data manipulation tasks. For example, we may now want to remove duplicate rows from the R dataframe or transpose the dataframe.

If we want to append a column at a specific position, we can use the `.after` argument:

```
# R add column after another column
dataf <- dataf %>%
add_column(Column_After = "After",
.after = "A")
```

As you probably understand, doing this will add the new column after the column "A". In the next example, we are going to append a column before a specified column.

Here’s how to add a column to the dataframe before another column:

```
# R add column before another column
dataf <- dataf %>%
add_column(Column_Before = "Before",
.before = "Cost")
```

In the next example, we are going to use `add_column()` to add an empty column to the dataframe.

Here’s how we would do if we wanted to add an empty column in R:

```
# Empty
dataf <- dataf %>%
    add_column(Empty_Column = NA)
```

Note that we just added NA (the missing value indicator) as the empty column. Here’s the output, with the empty column added to the dataframe:

If we want an “empty” character column instead, we can replace the `NA` with an empty string (`""`), for example. However, this would create a character column and may not be considered truly empty. In the next example, we are going to add a column to a dataframe based on other columns.

Here’s how to use R to add a column to a dataframe based on other columns:

```
# Append column conditionally
dataf <- dataf %>%
add_column(C = if_else(.$A == .$B, TRUE, FALSE))
```

In the code chunk above, we added something to the `add_column()` function: the `if_else()` function. We did this because we wanted to add a value to the column based on the values in other columns. Furthermore, we used `.$` so that we could compare the two columns (using `==`). If the values in these two columns are the same, we add `TRUE` on the specific row. Here’s the new column added:

Note, you can also work with the `mutate()` function (also from dplyr) to add columns based on conditions. See this tutorial for more information about adding columns on the basis of other columns.
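
As a hedged sketch (assuming the same dataf with columns A and B as in the example above), the equivalent conditional column with `mutate()` would look like this:

```
# Sketch: conditional column with mutate() (assumes columns A and B exist)
library(dplyr)

dataf <- dataf %>%
    mutate(C = if_else(A == B, TRUE, FALSE))
```

Note that, inside `mutate()`, we can refer to the columns directly by name instead of using `.$`.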

In the next section, we will have a look at how to work with the `mutate()` function to compute, and add, a new variable to the dataset.

Here’s how to compute and add a new variable (i.e., column) to a dataframe in R:

```
# insert new column with mutate
dataf <- dataf %>%
    rowwise() %>%
    mutate(DepressionIndex = mean(c_across(Depr1:Depr5))) %>%
    ungroup()
```

Notice how we, in the example code above, calculated a new variable called “DepressionIndex”, which is the mean of the 5 columns named Depr1 to Depr5. Obviously, we used the `mean()` function to calculate the mean of the columns. Notice how we also used the `c_across()` function, combined with `rowwise()`, so that the mean is calculated across these columns for each row (and `ungroup()` drops the row-wise grouping afterwards).

Note now that you have added new columns, to the dataframe, you may also want to rename factor levels in R with e.g. dplyr. In the next section, however, we will add multiple columns to a dataframe.

Here’s how you would insert multiple columns into the dataframe using the `add_column()` function:

```
# Add multiple columns
dataf <- dataf %>%
add_column(New_Column1 = "1st Column Added",
New_Column2 = "2nd Column Added")
```

In the example code above, we used the `add_column()` function to append two new character columns to the dataframe. Here are the first 6 rows of the dataframe with the added columns:

Note, if you want to add multiple columns, you just add an argument, as we did above, for each column you want to insert. It is, again, important that the length of each vector is the same as the number of rows in the dataframe; otherwise, we will end up with an error. A more realistic example could be that we want to take the absolute value in R (of e.g. one column) and add it as a new column. In the next example, however, we will add columns from one dataframe to another.

In this section, you will learn how to add columns from one dataframe to another. Here’s how you append e.g. two columns from one dataframe to another:

```
# Read data from the .xlsx files:
dataf <- read_excel('./SimData/add_column.xlsx')
dataf2 <- read_excel('./SimData/add_column2.xlsx')
# Add the columns from the second dataframe to the first
dataf3 <- cbind(dataf, dataf2[c("Anx1", "Anx2", "Anx3")])
```

In the example above, we used the `cbind()` function together with selecting which columns we wanted to add. Note that dplyr has the `bind_cols()` function, which can be used in a similar fashion. Now that you have put together your data sets, you can create dummy variables in R with e.g. the fastDummies package or calculate descriptive statistics.
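
For completeness, here is a hedged sketch of the same column transfer using dplyr's `bind_cols()` (assuming the two dataframes from the example above):

```
# Sketch: adding columns from one dataframe to another with bind_cols()
library(dplyr)

dataf3 <- bind_cols(dataf,
                    dataf2 %>% select(Anx1, Anx2, Anx3))
```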

In this post, you have learned how to add a column to a dataframe in R. Specifically, you have learned how to use the base functions available, as well as the add_column() function from Tibble. Furthermore, you have learned how to use the mutate() function from dplyr to append a column. Finally, you have also learned how to add multiple columns and how to add columns from one dataframe to another.

I hope you learned something valuable. If you did, please share the tutorial on your social media accounts, add a link to it in your projects, or just leave a comment below! Finally, suggestions and corrections are welcomed, also as comments below.

Here you will find some additional resources that you may find useful. The first three are especially interesting if you work with datetime objects (e.g., time series data):

- How to Extract Year from Date in R with Examples with e.g. lubridate (Tidyverse)
- Learn How to Extract Day from Datetime in R with Examples with e.g. lubridate (Tidyverse)
- How to Extract Time from Datetime in R – with Examples

If you are interested in other useful functions and/or operators these two posts might be useful:

- How to use %in% in R: 7 Example Uses of the Operator
- How to use the Repeat and Replicate functions in R

The post How to Add a Column to a Dataframe in R with tibble & dplyr appeared first on Erik Marsja.

]]>In this R tutorial, you will learn how to rename factor levels in R using 1) levels() and 2) dplyr.

The post How to Rename Factor Levels in R using levels() and dplyr appeared first on Erik Marsja.

]]>In this tutorial, you will learn how to rename factor levels in R. First, we will use the base functions that are available in R, and then we will use dplyr.

To rename factor levels using `levels()`, we can assign a character vector with the new names. If we want to recode factor levels with dplyr, we can use the `recode_factor()` function.

This R tutorial has the following outline. First, we start by answering some simple questions. Second, we will have a look at what is required to follow this tutorial. Third, we will read an example data set so that we have something to practice on. Fourth, we will go into how to rename factor levels using 1) the levels() function, and 2) the recode_factor() function from the dplyr package.

One simple method to rename a factor level in R is `levels(your_df$Category1)[levels(your_df$Category1)=="A"] <- "B"`, where `your_df` is your data frame and `Category1` is the column containing your categorical data. This would recode the factor level “A” to the new level “B”.

The simplest way to rename multiple factor levels is to use the `levels()` function. For example, to recode the factor levels “A”, “B”, and “C”, you can use the following code: `levels(your_df$Category1) <- c("Factor 1", "Factor 2", "Factor 3")`. This would efficiently rename the factor levels to “Factor 1” and so on.

In the next section, we will have a look at what is needed to follow this post.

To learn to recode factor levels by the examples in this post you need to download this data set. Furthermore, if you plan on using dplyr and the recode_factor() function, you will need to install this package. Here’s how to install an R-package:

`install.packages("dplyr")`

Note that this package is very useful. You can, for instance, use dplyr to remove columns in R, and calculate descriptive statistics. A quick tip, before going on to the tutorial part of the post, is that you can install dplyr among plenty of other very good r packages if you install the Tidyverse package. For example, you will get ggplot2 that can be used for data visualization (e.g., can be used to create a scatter plot in R), lubridate to handle datetime data (e.g. to extract year from datetime). In the next section, we are going to read the example data from the .csv file.

Here is how to read a CSV file in R using the read.csv function:

```
# Import data
data <- read.csv("flanks.csv")
```

Note that you need to download the CSV file and store it in the same directory as your R script. Data can, of course, also be imported from other data sources. See the following tutorials for more information:

- How to Read & Write SPSS Files in R Statistical Environment
- R Excel Tutorial: How to Read and Write xlsx files in R
- How to Read and Write Stata (.dta) Files in R with Haven
- Reading SAS Files in R with Haven & sas7dbat

Now, we have the data frame called `data`. If we want to get information about the variables in the data frame, we can use the `str()` function:

In the image above, it is clear that we have a data frame containing 5 columns (i.e., variables). Notice that the first column probably is the index column, but we will leave it as it is. Of particular interest for this post, we can see that we have one column with a categorical variable called “TrialType”. Furthermore, we can see that this variable has two factor levels.

In this section, we are going to use `levels()` to change the names of the levels of a categorical variable. First, we are just assigning a character vector with the new names. Second, we are going to use a list to rename the factor levels by name.

Here’s how to change the names of factor levels using `levels()`:

```
# Renaming factor levels
levels(data$TrialType) <- c("Con", "InCon")
```

In the example above, we used the levels() function and selected the categorical variable that we wanted. Furthermore, we assigned a character vector containing the new names. If we use the levels() function again, without assigning anything, we can see that we actually renamed the factor levels:

Note that if we try to assign a character vector containing too few, or too many, elements (i.e., names), it will not work. This will lead to an error (i.e., ‘*Error in `levels<-.factor`(`*tmp*`, value = "Con") : number of levels differs*’). Now that you have renamed the levels of a factor, you might want to clean the data frame from duplicate rows or columns. Furthermore, you can use the t() function to transpose in R (i.e., a matrix or a dataframe).

In the next example we will rename factor levels by name also using the levels() function.

Here’s how to rename the factor levels by name:

```
# Recode factor levels by name
levels(data$TrialType) <- list(Congruent = "Con", InCongruent = "InCon")
```

Here's the output from `str()`, in which we can see that we renamed the levels of the TrialType factor again:

Note, however, that when we rename factor levels by name like in the example above, ALL levels need to be present in the list; any levels not in the list will be replaced with NA. That is, you could end up with only a single factor level and NA values. Not that good. In the next example, we are going to work with dplyr to change the names of the factor levels.

Note, if you are planning on carrying out regression analysis and still want to use your categorical variables, you can at this point create dummy variables in R.

One of the simplest ways to rename factor levels is by using the `recode_factor()` function:

```
# Renaming factor levels dplyr
library(dplyr)
data$TrialType <- recode_factor(data$TrialType, Con = "Congruent",
                                InCon = "InCongruent")
```

In the code example above, we first loaded dplyr so that we get the `recode_factor()` function into our namespace. We then assign the renamed factor back to the column containing our categorical variable. The `recode_factor()` function works in such a way that the first argument is the factor itself, followed by name-value pairs of the form `old_level = "new_level"`; each additional argument renames another level.

As previously mentioned, dplyr is a very useful package. It can also be used to add a column to an R data frame based on other columns, or simply to add a column to a data frame in R. This can, of course, also be done with other packages that are part of the Tidyverse. Note that there are other ways to recode the levels of a factor in R. For instance, forcats, another package that is part of the Tidyverse, has functions that can be used for this.
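
As a brief, hedged sketch of the forcats alternative (note that `fct_recode()` also takes pairs of the form `new_name = "old_name"`):

```
# Sketch: renaming factor levels with forcats
library(forcats)

data$TrialType <- fct_recode(data$TrialType,
                             Congruent = "Con",
                             InCongruent = "InCon")
```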

In this tutorial, you have learned how to rename factor levels in R. First, we had a look at how to use the `levels()` function to recode the levels of factors. Second, we had a look at the `recode_factor()` function from the dplyr package to do the same. Hope you learned something valuable. Please share the tutorial on your social media accounts if you did.

Here are some other resources that you may find useful when working in R statistical environment:

- How to use %in% in R: 7 Example Uses of the Operator
- Learn How to Generate a Sequence of Numbers in R with :, seq() and rep()
- How to use the Repeat and Replicate functions in R
- More on working with datetime objects in R: How to Extract Day from Datetime in R with Examples and How to Extract Time from Datetime in R – with Examples
- R Resources for Psychologists - for a collection of useful resources
- How to Take Absolute Value in R – vector, matrix, & data frame

The post How to Rename Factor Levels in R using levels() and dplyr appeared first on Erik Marsja.

]]>In this R tutorial, you will learn how to remove duplicate rows and columns from a data frame. We will use the duplicated() and unique() functions from base R. Furthermore, we will use the distinct() function from the dplyr package.

The post How to Remove Duplicates in R – Rows and Columns (dplyr) appeared first on Erik Marsja.

]]>In this R tutorial, you will learn how to remove duplicates from the data frame. First, you will learn how to delete duplicated rows and, second, you will remove columns. Specifically, we will have a look at how to remove duplicate records from the data frame using 1) base R, and 2) dplyr.

The post starts out with answering a few questions (e.g., “How do I remove duplicate rows in R?”). In the second section, you will learn about what is required to follow this R tutorial. That is, you will learn about the dplyr (and Tidyverse) package and how to install them. When you have what you need to follow this R tutorial, we will create a data frame containing both duplicated rows and columns that we can use to practice on. In the next 5 sections, we will have a look at the example of how to delete duplicates in R. First, we will use Base R and the duplicated() and unique() functions. Second, we will use the distinct() function from dplyr.

To delete duplicate rows in R you can use the `duplicated()` function. Here’s how to remove all the duplicate rows from the data frame called “study_df”: `study_df.un <- study_df[!duplicated(study_df), ]`.

Now, that we know how to extract unique elements from the data frame (i.e., drop duplicate items) we are going to learn, briefly, about what is needed to follow this post.

Apart from having R installed, you also need to have the dplyr package installed (this package can be used to rename factor levels in R, and to rename columns in R, as well). That is, you need dplyr if you want to use the distinct() function to remove duplicate data from your data frame. R packages are, of course, easy to install. You can install dplyr using the `install.packages()` function. Here’s how to install packages in R:

```
# Installing packages in R:
install.packages("dplyr")
```

It is worth noting here that dplyr is part of the Tidyverse package. This package is super useful because it comes with other awesome packages such as ggplot2 (see how to create a scatter plot in R with ggplot2, for example), readr, and tibble, to name a few. That said, let’s create some example data to practice dropping duplicate records from!

Now, to practice removing duplicate rows and columns we need some data. Here’s some data with two duplicated rows and two duplicated columns:

```
# Creating a data frame:
example_df <- data.frame(FName =c ('Steve', 'Steve', 'Erica',
'John', 'Brody', 'Lisa', 'Lisa', 'Jens'),
LName = c('Johnson', 'Johnson', 'Ericson',
'Peterson', 'Stephenson', 'Bond', 'Bond',
'Gustafsson'),
Age = c(34, 34, 40,
44, 44, 51, 51, 50),
Gender = c('M', 'M', 'F', 'M',
'M', 'F', 'F', 'M'),
Gender = c('M', 'M', 'F', 'M',
'M', 'F', 'F', 'M'))
```

The data frame has 8 rows and 5 columns (we can use the `dim()` function to verify this). Here’s the data frame with the duplicate rows and columns:

Most of the time, of course, we import our data from an external source. See the following posts for more information:

- R Excel Tutorial: How to Read and Write xlsx files in R
- How to Read & Write SPSS Files in R Statistical Environment
- Reading SAS Files in R with Haven & sas7dbat
- How to Read and Write Stata (.dta) Files in R with Haven

In the next section, we are going to start by removing the duplicate rows using base R.

Here’s how to remove duplicate rows in R using the `duplicated()` function:

```
# Remove duplicates from data frame:
example_df[!duplicated(example_df), ]
```

As you can see in the output above, we have now removed the duplicated rows from the data frame. What we did was to create a boolean vector indicating which rows are duplicates of earlier rows in our data frame, and then we selected the rows using this vector. Notice how we used the `!` operator to select the rows that *were not* duplicated. Finally, we also used the “,” so that we keep all columns.

In the image above, we can see that two duplicated rows have been removed. Of course, if you want the changes to be permanent, you need to assign the result with <-:

```
# Delete duplicate rows
example_df.un <- example_df[!duplicated(example_df), ]
```

Note, there are other useful operators, such as the %in% operator in R, which can be used for e.g. value matching.

In the next example, we are going to use the `duplicated()` function to remove one of the two identical columns (i.e., “Gender” and “Gender.1”).

To remove duplicate columns we can, again, use the `duplicated()` function:

```
# Drop Duplicated Columns:
ex_df.un <- example_df[!duplicated(as.list(example_df))]
# Dimensions
dim(ex_df.un)
# 8 Rows and 4 Columns
# First six rows:
head(ex_df.un)
```

Now, to remove duplicate columns, we added the `as.list()` function and removed the “,”. That is, we changed the syntax from Example 1 somewhat. Again, we can use the `dim()` function to see that we have dropped one column from the data frame. Here’s also the result from the `head()` function:

Note, dplyr can be used to remove columns from the data frame as well. In the next example, we are going to use another base R function to delete duplicate data from the data frame: the `unique()` function.

Here’s how you can remove duplicate rows using the `unique()` function:

```
# Deleting duplicates:
examp_df <- unique(example_df)
# Dimension of the data frame:
dim(examp_df)
# Output: 6 5
```

As you can see, using the `unique()` function to remove the identical rows in the data frame is quite straightforward. It is worth noting, here, that if you want to keep the last occurrences of the duplicate rows, you can use the `fromLast` argument and set it to `TRUE`. If you're now done carrying out data manipulation, you can now create a dummy variable in R, for example.
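
A minimal sketch of keeping the last occurrences instead, using `duplicated()` with `fromLast = TRUE` (assuming the example_df created above):

```
# Sketch: keep the *last* occurrence of each duplicated row
examp_df_last <- example_df[!duplicated(example_df, fromLast = TRUE), ]
```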

In the final two examples, we are going to use the `distinct()` function from the dplyr package to remove duplicate rows.

Here’s how to drop duplicates in R with the `distinct()` function:

```
# Deleting duplicates with dplyr
ex_df.un <- example_df %>%
distinct()
```

In the code example above, we used the distinct() function to keep only unique/distinct rows from the data frame. When working with `distinct()`, if there are duplicate rows, only the first of the identical rows is preserved. Note, if you want to, you can now go on and add an empty column to your data frame. This is something you can do with tibble, a package that is part of the Tidyverse. In the final example, we are going to look at an example in which we drop rows based on one column.

It is also possible to delete duplicate rows based on values in a certain column. Here's how to remove duplicate rows based on one column:

```
# remove duplicate rows with dplyr
example_df %>%
# Base the removal on the "Age" column
distinct(Age, .keep_all = TRUE)
```

Code language: R (r)

In the example above, we used the “Age” column as the first argument. Second, we used the .keep_all argument to keep all the columns in the data frame. If we now use the `dim()`

function, again, we can see that we have 5 rows and 5 columns. Let’s print the data frame to see which rows we dropped.

Although we do not want to remove rows where there are duplicate values in a column containing, for example, the age of study participants, there might be times when we want to remove duplicates in R based on a single column. Furthermore, we can base the removal on identical values across more than one column. Now that you have removed duplicate rows and columns from your data frame, you might want to use R to add a column to the data frame based on other columns.

In this short R tutorial, you have learned how to remove duplicates in R. Specifically, you have learned how to carry out this task by using two base functions (i.e., duplicated() and unique()) as well as the distinct() function from dplyr. Furthermore, you have learned how to drop rows and columns that are occurring as identical copies in, at least, two cases in your data frame.

Here are some other tutorials you may find useful:

- How to Transpose a Dataframe or Matrix in R with the t() Function
- How to use the Repeat and Replicate functions in R
- How to Generate a Sequence of Numbers in R with :, seq() and rep()

The post How to Remove Duplicates in R – Rows and Columns (dplyr) appeared first on Erik Marsja.

]]>In this Python tutorial, you will learn how to 1) perform Bartlett’s Test, and 2) Levene’s Test. Both are tests that are testing the assumption of equal variances. Equality of variances (also known as homogeneity of variance, and homoscedasticity) in population samples is assumed in commonly used comparison of means tests, such as Student’s t-test […]

The post Levene’s & Bartlett’s Test of Equality (Homogeneity) of Variance in Python appeared first on Erik Marsja.

]]>In this Python tutorial, you will learn how to 1) perform Bartlett’s Test, and 2) Levene’s Test. Both tests examine the assumption of equal variances. Equality of variances (also known as homogeneity of variance, and homoscedasticity) in population samples is assumed in commonly used comparison of means tests, such as Student’s t-test and analysis of variance (ANOVA). Therefore, tests such as Levene’s or Bartlett’s can be conducted to examine the assumption of equal variances across group samples.

A brief outline of the post is as follows. First, you will get a couple of questions answered. Second, you will briefly learn about the hypotheses of both Bartlett’s and Levene’s tests of homogeneity of variances. After this, we continue by having a look at the required Python packages to follow this post. In the next section, you will read data from a CSV file so that we can continue by learning how to carry out both tests of equality of variances in Python. That is, the last two sections, before the conclusion, will show you how to carry out Bartlett’s and Levene’s tests.

Bartlett’s test of **homogeneity of variances** is a test, much like Levene’s test, that examines whether the variances are equal for all samples. If your data is **normally distributed**, you can use Bartlett’s test instead of Levene’s.

Levene’s test can be carried out to check that variances are equal for all samples. The test can be used to check the assumption of equal variances before running a parametric test like One-Way ANOVA in Python. If your data does not follow a normal distribution, Levene’s test is preferred over Bartlett’s.

Simply put, equal variances (also known as homoscedasticity) means that the variances are approximately the same across the samples (i.e., groups). If our samples have unequal variances (heteroscedasticity), on the other hand, it can affect the Type I error rate and lead to false positives.
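
To make this concrete, here is a small illustration with simulated data (not the PlantGrowth data used later in this post): two samples drawn with the same spread, and one with a much larger spread.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two samples with the same spread (homoscedastic with each other):
group_a = rng.normal(loc=5, scale=1.0, size=100)
group_b = rng.normal(loc=7, scale=1.0, size=100)
# One sample with a much larger spread (heteroscedastic vs. the others):
group_c = rng.normal(loc=5, scale=4.0, size=100)

# Sample variances (ddof=1): roughly 1, 1, and 16
print(np.var(group_a, ddof=1),
      np.var(group_b, ddof=1),
      np.var(group_c, ddof=1))
```

Note that the means can differ (group_b is centered at 7) while the variances are still equal; homoscedasticity is only about the spread.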

Whether conducting Levene’s Test or Bartlett’s Test of homogeneity of variance we are dealing with two hypotheses. These two are simply put:

- **Null Hypothesis**: the variances are equal across all samples/groups
- **Alternative Hypothesis**: the variances are *not* equal across all samples/groups

This means, for example, that if we get a p-value larger than 0.05 we cannot reject the null hypothesis: we can assume that our data is homoscedastic and continue with a parametric test such as the two-sample t-test in Python. If we, on the other hand, get a statistically significant result we may want to carry out the Mann-Whitney U test in Python.

In this post, we will use the following Python packages:

- Pandas will be used to import the example data
- SciPy and Pingouin will be used to carry out Levene’s and Bartlett’s tests in Python

Of course, if you have your data in any other format (e.g., NumPy arrays) you can skip using Pandas and work with e.g. SciPy anyway. However, to follow this post it is required that you have the Python packages installed. In Python, you can install packages using Pip or Conda, for example. Here’s how to install all the needed packages:

`pip install scipy pandas pingouin`

Code language: Bash (bash)

Note, to use pip to install specific versions of the packages you can type:

`pip install scipy==1.5.2 pandas==1.1.1 pingouin==0.3.7`

Code language: Bash (bash)

Make sure to check out how to upgrade pip if you have an old version installed on your computer. That said, let’s move on to the next section in which we start by importing example data using Pandas.

To illustrate the performance of the two tests of equality of variance in Python we will need a dataset with at least two columns: one with numerical data, the other with categorical data. In this example, we are going to use the PlantGrowth.csv data which contains exactly two columns. Here’s how to read a CSV with Pandas:

```
import pandas as pd
# Read data from CSV
df = pd.read_csv('PlantGrowth.csv',
index_col=0)
df.shape
```

Code language: Python (python)

If we use the `shape`

attribute, we can see that we have 30 rows and 2 columns in the dataframe. Now, we can also print the column names of the Pandas dataframe. This will give us information about the names of the variables. Finally, we may also want to see which data types we have in the data. This can, among other things, be obtained using the `info()`

method:

`df.info()`

Code language: Python (python)

As we can see, in the image above, the two columns are of the data types float and object. More specifically, the column *weight *is of float data type and the column called *group *is an object. This means that we have a dataset with categorical variables. Exactly what we need to practice carrying out the two tests of homogeneity of variances.
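
For instance, to inspect the column names and data types without the original CSV file, here is a small stand-in dataframe with the same column names and dtypes as the PlantGrowth data (the values are made up; the real data has 30 rows):

```python
import pandas as pd

# Stand-in for the PlantGrowth data: same columns and dtypes,
# but hypothetical values.
df = pd.DataFrame({
    'weight': [4.17, 5.58, 5.18, 6.11, 4.50, 4.61],
    'group': ['ctrl', 'ctrl', 'trt1', 'trt1', 'trt2', 'trt2'],
})

print(df.columns.tolist())  # ['weight', 'group']
print(df.dtypes)            # weight: float64, group: object
```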

In the next section, we are going to learn how to carry out Bartlett’s test in Python, first with SciPy and then with Pingouin. Note, when we are using Pingouin we are actually using SciPy, but we get a nice table with the results and can, using the same Python method, carry out Levene’s test. That said, let’s get started with testing the assumption of homogeneity of variances!

In this section, you will learn two methods (i.e., using two different Python packages) for carrying out Bartlett’s test in Python. First, we will use SciPy:

Here’s how to do Bartlett’s test using SciPy:

```
from scipy.stats import bartlett
# subsetting the data:
ctrl = df.query('group == "ctrl"')['weight']
trt1 = df.query('group == "trt1"')['weight']
trt2 = df.query('group == "trt2"')['weight']
# Bartlett's test in Python with SciPy:
stat, p = bartlett(ctrl, trt1, trt2)
# Get the results:
print(stat, p)
```

Code language: Python (python)

As you can see, in the code chunk above, we started by importing the `bartlett`

function from SciPy's stats module. Now, `bartlett()`

takes the different sample data as arguments. This means that we need to subset the Pandas dataframe we previously created. Here we used Pandas `query()`

method to subset the data for each group. In the final line, we used the `bartlett()`

method to carry out the test. Here are the results:

Remember the null and alternative hypothesis of the two tests we are learning in this blog post? Good, because judging from the output above, we cannot reject the null hypothesis and can, therefore, assume that the groups have equal variances.
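
The decision logic can be sketched end-to-end with simulated data (a stand-in for the three PlantGrowth groups, not the real values):

```python
import numpy as np
from scipy.stats import bartlett

# Simulated stand-ins for the three groups:
rng = np.random.default_rng(1)
ctrl = rng.normal(5.0, 0.6, size=10)
trt1 = rng.normal(4.7, 0.6, size=10)
trt2 = rng.normal(5.5, 0.6, size=10)

stat, p = bartlett(ctrl, trt1, trt2)

# Compare the p-value to a chosen alpha level to decide:
alpha = 0.05
if p > alpha:
    print(f'p = {p:.3f}: fail to reject H0, assume equal variances')
else:
    print(f'p = {p:.3f}: reject H0, variances appear unequal')
```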

Note, you can get each group by using the `unique()`

method. For example, to get the three groups we can type `df['group'].unique()`

and we will get this output.

Here’s another method to carry out Bartlett’s test of equality of variances in Python:

```
import pingouin as pg
# Bartlett's test in Python with pingouin:
pg.homoscedasticity(df, dv='weight',
group='group',
method='bartlett')
```

Code language: Python (python)

In the code chunk above, we used the `homoscedasticity`

method and used the Pandas dataframe as the first argument. As you can see, using this method to carry out Bartlett’s test is a bit easier. That is, using the next two parameters we specify the dependent variable and the grouping variable. This means that we don’t have to subset the data as when using SciPy directly. Finally, we used the method parameter to carry out Bartlett’s test. As you will see, in the next section, if we don’t do this we will carry out Levene’s test.

Now as you may already know, and as stated earlier in the post, Bartlett’s test should only be used if data is normally distributed. In the next section, we will learn how to carry out an alternative test that can be used for non-normal data.

In this section, you will learn two methods to carry out Levene’s test of homogeneity of variances in Python. As in the previous section, we will start by using SciPy and continue with Pingouin.

To carry out Levene’s test with SciPy we can do as follows:

```
from scipy.stats import levene
# Create three arrays for each sample:
ctrl = df.query('group == "ctrl"')['weight']
trt1 = df.query('group == "trt1"')['weight']
trt2 = df.query('group == "trt2"')['weight']
# Levene's Test in Python with Scipy:
stat, p = levene(ctrl, trt1, trt2)
print(stat, p)
```

Code language: Python (python)

In the code chunk above, we started by importing the `levene`

function from SciPy's stats module. Much like when using the `bartlett`

method, levene takes the group’s data as arguments (i.e., one array for each group). Again, we will have to subset the Pandas dataframe containing our data. Subsetting the data is, again, done using Pandas `query()`

method. In the final line, we used the `levene()`

method to carry out the test.

Here’s the second method to perform Levene’s test of homoscedasticity in Python:

```
import pingouin as pg
# Levene's Test in Python using Pingouin
pg.homoscedasticity(df, dv='weight',
group='group')
```

Code language: Python (python)

In the code chunk above, we used the `homoscedasticity`

method. This method takes the data, in this case, our dataframe, as the first parameter. As when carrying out Bartlett’s test with this package, it is easier to use for Levene’s test as well. The next two parameters to the method are where we specify the dependent variable and the grouping variable. This is quite awesome as we don’t have to subset the dataset ourselves. Note that we don’t have to use the method parameter (as when performing Bartlett’s test) because the `homoscedasticity`

method will, by default, do Levene’s test.

Now, as Pingouin in fact uses SciPy to test the assumption of equality of variances, the results are, of course, the same regardless of the Python method used. In this case, with the example data we used, the samples have roughly equal variances. Good news, if we want to compare the groups on their mean values!

In this Python tutorial, you have learned to carry out two tests of equality of variances. First, we used Bartlett’s test of homogeneity of variance using SciPy and Pingouin. This test, however, should only be used on normally distributed data. Therefore, we also learned how to carry out Levene’s test using the same two Python packages! Finally, we also learned that Pingouin uses SciPy to carry out both tests but works as a simple wrapper for the two SciPy methods and is very easy to use. Especially, if our data is stored in a Pandas dataframe.

The post Levene’s & Bartlett’s Test of Equality (Homogeneity) of Variance in Python appeared first on Erik Marsja.

]]>In this R tutorial, you will learn how to add a column to a dataframe based on other columns.

The post R: Add a Column to Dataframe Based on Other Columns with dplyr appeared first on Erik Marsja.

]]>In this R tutorial, you are going to learn how to **add a column to a dataframe based on values in other columns**. Specifically, you will learn to create a new column using the mutate() function from the package dplyr, along with some other useful functions.

Finally, we are also going to have a look at how to add the column, based on values in other columns, at a specific place in the dataframe. This will be done using the add_column() function from the Tibble package.

It is worth noting, that both tibble and dplyr are part of the Tidyverse package. Apart from adding columns to a dataframe, you can use dplyr to remove columns, with the select() function, for example.

In this post, we will first learn how to install the r-packages that we are going to use. Second, we are going to import example data that we can play around with and add columns based on conditions. After we have a dataframe, we will then go on and have a look at how to add a column to the dataframe with values depending on other columns. In these sections, we will use the mutate() and add_column() functions to accomplish the same task. That is, we will use these R functions to add a column based on conditions.

As this is an R tutorial, you will, of course, need to have R and, at least, the dplyr package installed. If you want to e.g. easily add a column, based on values in another column, at a specific position I would suggest that you install tibble. Furthermore, if you are going to read the example .xlsx file you will also need to install the readxl package. Note, however, that if you install the tidyverse package you will get tibble, dplyr and readxl, among a lot of other useful packages.

Installing Tidyverse enables you to easily calculate descriptive statistics, visualize data (e.g., scatter plots with ggplot2). Furthermore, there’s another useful package, that is part of the Tidyverse package, called lubridate. Lubridate is very handy if you are working with time-series data. For example, you can use the functions of this package to extract year from date in R as well as extracting day and extracting time. As usual, when installing r-packages we use the `install.packages()`

function:

`install.packages(c('tibble', 'dplyr', 'readxl'))`

Code language: R (r)

Note. if you want to install all packages available in the tidyverse package just exchange the character vector for ‘tidyverse’ (`install.packages('tidyverse')`

). Now that you should be set with these useful packages we can start reading the example Excel file.

Here’s how to read an xlsx file in R using the `read_excel`

function from the readxl package:

```
library(readxl)
# reading the xlsx file:
depr_df <- read_excel('./SimData/add_column.xlsx')
```

Code language: R (r)

In the code chunk above, we imported the Excel file that can be downloaded here. This file needs, furthermore, to be placed in the same directory as the R script (or you need to change the path to the .xlsx file). Finally, we can have a glimpse of the data by using the head() function:

In the output, we can see that our dataset contains the following columns:

- ID – Subject ID
- A
- B
- Cost
- Depr1 – First item on a depression scale
- Depr2 – Second item
- Depr3 – And so on…
- Depr4 – …
- Depr5

Note that all variables in this data set are made up and, thus, the data makes no sense. We are, of course, only going to use it so that we can practice adding new columns based on conditions on values in other columns. Now that we have our data we are jumping into the first example directly!
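
If you cannot download the Excel file, here is a hypothetical stand-in data frame with the same column names (the values below are made up by me, just like in the original data), so you can follow the examples anyway:

```
# Hypothetical stand-in for the example data:
depr_df <- data.frame(
  ID = c("S01R", "S02S", "S03R", "S04S"),
  A = c(1, 2, 3, 4),
  B = c(1, 5, 2, 4),
  Cost = c(10.5, 12.0, 9.5, 11.0),
  Depr1 = c(3, 4, 2, 5),
  Depr2 = c(2, 4, 3, 5),
  Depr3 = c(3, 5, 2, 4),
  Depr4 = c(4, 4, 3, 5),
  Depr5 = c(3, 5, 2, 4)
)
```

Code language: R (r)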

If we want to add a column based on the values in another column we can work with dplyr. Here’s how to append a column based on what the factor ends with in a column:

```
library(dplyr)
# Adding column based on other column:
depr_df %>%
mutate(Status = case_when(
endsWith(ID, "R") ~ "Recovered",
endsWith(ID, "S") ~ "Sick"
))
```

Code language: R (r)

As you can see, in the code chunk above, we used the `%>%`

operator and the `mutate()`

function together with the `case_when()`

and `endsWith()`

functions. Furthermore, we created the “Status” column (in mutate()): if the ID ends with R, the value in the new column will be “Recovered”. On the other hand, if it ends with S, the value in the new column will be “Sick”. Here’s the resulting dataframe to which we appended the new column:

Now, the `%>%`

operator is very handy and, of course, there are more nice operators, as well as functions, in R statistical programming environment. See the following posts for more inspiration (or information):

- How to use %in% in R: 7 Example Uses of the Operator
- Learn How to Generate a Sequence of Numbers in R with :, seq() and rep()
- How to use the Repeat and Replicate functions in R

In the next section, we will continue learning how to add a column to a dataframe in R based on values in other columns.

In the first example, we are going to add a new column based on whether the values in the columns “A” and “B” match. Here’s how to add a new column to the dataframe based on the condition that two values are equal:

```
# R adding a column to dataframe based on values in other columns:
depr_df <- depr_df %>%
mutate(C = if_else(A == B, A + B, A - B))
```

Code language: R (r)

In the code example above, we added the column “C”. Here we used dplyr and the `mutate()`

function. As you can see, we also used the `if_else()`

function to check whether the values in column “A” and “B” were equal. If they were equal, we added the values together. If not, we subtracted the values. Here’s the resulting dataframe with the column added:

Notice how there was only one row in which the values matched and, in that row, our code added the values together. Of course, if we wanted to create e.g. groups based on whether the values in two columns are the same or not, we can change a few things in the `if_else()`

function. For example, we can use this code:

```
# creating a column to dataframe based on values in other columns:
depr_df <- depr_df %>%
mutate(C = if_else(A == B, "Equal", "Not Equal"))
```

Code language: R (r)

In the next code example, we are going to create a new column summarizing the values from five other columns. This can be useful, for instance, if we have collected data from e.g. a questionnaire measuring psychological constructs.

Here we are going to use the values in the columns named “Depr1” to “Depr5” and summarize them to create a new column called “DeprIndex”:

```
# Adding new column based on the sum of other columns:
depr_df <- depr_df %>% rowwise() %>%
mutate(DeprIndex = sum(c_across(Depr1:Depr5)))
```

Code language: R (r)

To explain the code above, here we also used the `rowwise()`

function before the `mutate()`

function. As you may understand, we use the first function to perform row-wise operations. Furthermore, we used the `sum()`

function to summarize the columns we selected using the `c_across()` function.
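
As a side note, a vectorized base R alternative that should give the same result is `rowSums()`, which sums across the selected columns without needing `rowwise()`:

```
# Equivalent alternative with rowSums() (base R):
depr_df$DeprIndex <- rowSums(depr_df[, c("Depr1", "Depr2", "Depr3",
                                         "Depr4", "Depr5")])
```

Code language: R (r)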

Note, if you need to, you can rename the levels of a factor in R using dplyr as well. In the final example, we are going to continue working with these columns. However, we are going to add a new column based on different cutoff values. That is, we are going to create multiple groups out of the summarized score we have created.

In this example, we are going to create a new column in the dataframe based on 3 conditions. That is, we are going to use the values in the “DeprIndex” column and create 3 different groups depending on the value in each row.

```
# Multiple conditions when adding new column to dataframe:
depr_df %>% mutate(Group =
case_when(DeprIndex <= 15 ~ "A",
DeprIndex <= 20 ~ "B",
DeprIndex >= 21 ~ "C")
)
```

Code language: R (r)

Again, we used mutate() together with case_when(). Here, in this example, we created a new column in the dataframe and added values based on whether “DeprIndex” was smaller than or equal to 15, smaller than or equal to 20, or larger than or equal to 21.

This is cool! We’ve created another new column that categorizes each subject based on our arbitrary depression scale. We could now go on and calculate descriptive statistics in R, by this new group, if we want to. In the final example, we are going to use Tibble and the `add_column()`

function that we used to add an empty column to a dataframe in R.

In the final example, we are going to use add_column() to append a column based on values in another column. Here’s how to append a column based on whether a value in one column is larger than a given value:

```
library(tibble)
depr_df <- depr_df %>%
add_column(Is_Depressed =
if_else(.$DeprIndex > 18, TRUE, FALSE),
.after="ID")
```

Code language: R (r)

Notice how we now use tibble and the add_column() function. Again, we use the %>% operator and then, in the function, we are using if_else(). Here’s the trick: we used “.$” to access the column “DeprIndex” and, if the value is larger than 18, we add TRUE to the cell in the new column. Obviously, if it is smaller, FALSE will be added. The new column that we have created is called “Is_Depressed” and is a boolean:

Importantly, to add the new column at a specific position we used the .after argument. As you can see, in the image above, we created the new column after the “ID” column. If we want to append our column before a specific column we can use the .before argument. Now, you might want to continue preparing your data for statistical analysis. For more information, you can have a look at how to create dummy variables in R.

In this R tutorial, you have learned how to add a column to a dataframe based on conditions and/or values in other columns. First, we had a look at a simple example in which we created a new column based on the values in another column. Second, we appended a new column based on a condition. That is, we checked whether the values in the two columns were the same and created a new column based on this. In the third example, we had a look at more complex conditions (i.e., 3 conditions) and added a new variable with 3 different factor levels. Finally, we also had a look at how we could use `add_column()` to append the column where we wanted it in the dataframe.

Hope you found this post useful! If you did, make sure to share the post to show some love! Also, you can become a Patreon to support my work. Finally, make sure you leave a comment if you want something clarified or you found an error in the post!

The post R: Add a Column to Dataframe Based on Other Columns with dplyr appeared first on Erik Marsja.

]]>In this tutorial, you will learn by examples how to use the %in% in R. Specifically, you will learn 7 different uses of this great operator. Outline Here’s the outline of this post, described a bit more detailed than the table of contents. First, we start out with a couple of simple examples of how […]

The post How to use %in% in R: 7 Example Uses of the Operator appeared first on Erik Marsja.

]]>In this tutorial, you will learn by examples how to use the %in% in R. Specifically, you will learn 7 different uses of this great operator.

Here’s the outline of this post, described a bit more detailed than the table of contents. First, we start out with a couple of simple examples of how to use the `%in%`

operator. Specifically, we will have a look at how to use the operator when testing whether two vectors are containing sequences of numbers and letters. As you may already have expected, the operator can be used in other, maybe more advanced cases. In the following sections, therefore, we are going to have a look at how we can work with this operator and dataframes. For example, you will see that you can use the operator to create new variables, remove columns, and select columns.

The `%in%`

operator in R can be used to identify if an element (e.g., a number) belongs to a vector or dataframe. For example, it can be used to see if the number 1 is in the sequence of numbers 1 to 10.

The `%in%`

operator is used for matching values: it returns a logical vector indicating, for each element of its first argument, whether that element matches any element of its second. On the other hand, the `==`

operator is a comparison operator, used to check if two elements are exactly equal. Using the `%in%`

operator you can compare vectors of different lengths to see if elements of one vector match at least one element in another. The length of output will be equal to the length of the vector being compared (the first one). This is not possible when utilizing the `==`

operator.
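
A minimal sketch of this difference:

```
a <- c(1, 2, 3)
b <- c(3, 2, 1, 2, 5)
# %in% works with vectors of different lengths; the output length
# equals the length of the first vector (a):
a %in% b
# Output: TRUE TRUE TRUE
# a == b would, in contrast, warn that the object lengths differ
```

Code language: R (r)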

The use of the %in% operator is to match values in e.g. two different vectors, as already answered in the two previous questions. You can use the operator, also, to select certain columns in the dataframe or to subset the dataframe.

Now that you know what `%in%`

does in R, and what the difference between this operator and `==`

is, we can go on and have a look at the example usages.

In this section, we are going through 7 examples of how to use %in% in R. As you already know, we will start by working with vectors. After that, we will have a look at how to use the operator when working with dataframes.

In this example, we will use `%in%`

to check if two vectors contain overlapping numbers. Specifically, we will have a look at how we can get a logical value for more specific elements, whether they are also present in a longer vector. Here’s the first example of an excellent usage of the operator:

```
# sequence of numbers 1:
a <- seq(1, 5)
# sequence of numbers 2:
b <- seq(3, 12)
# using the %in% operator to check matching values in the vectors
a %in% b
```

Code language: R (r)

In the code above we get an output as long as the first vector (i.e., a). Furthermore, we used the `seq()`

function to create the first sequence of numbers in R and then another. In a real-world example, our vectors might not contain sequences but just random numbers. If we, on the other hand, want to test which elements of a longer vector are in a shorter vector we do as follows:

```
# shorter vector:
a <- seq(12, 19)
# longer vector:
b <- seq(1, 16)
# test if elements in longer vector is in shorter:
b %in% a
```

Code language: R (r)

As you can see, both of the above methods will result in a boolean vector. Additionally, if we use the which() function, we can get the indexes of the overlapping elements:

```
# Using the operator together with the which() function
which(seq(1, 10) %in% seq(4, 12))
```

Code language: R (r)

Might also interest you: How to use $ (dollar sign) in R: 6 Examples – list & dataframe

In the next example, we will see that we can apply the same methods for letters, or factors, in R. That is, we will test if two vectors, containing letters, are overlapping.

In this example, we will use `%in%`

to check if two vectors contain overlapping letters. Note, this can also be done for words (e.g., factors). First, we will compare letters in a shorter vector and in a longer vector. Here’s how to compare two vectors containing letters:

```
# Sequences of Letters:
a <- LETTERS[1:10]
# Second sequence of letters
b <- LETTERS[4:10]
# longer in shorter
a %in% b
```

Code language: R (r)

As you can see, and probably already figured out, we used the `%in%`

operator exactly in the same way as for vectors containing sequences of numbers. Again we can test which letters in a long vector are in a short vector:

`b %in% a`

Naturally, as with the examples where we used sequences of numbers in R, the result when working with letters, words, or factors is a boolean vector. Furthermore, as in the first example, we can use the `which()`

function to get indexes:

```
g <- c("C", "D", "E")
h <- c("A", "E", "B", "C", "D", "E", "A", "B", "C", "D", "E")
which(h %in% g)
```

Code language: R (r)

Finally, here’s an example of why using the `%in%`

operator is better than the `==`

. If we use `which()`

, together with `==`

, we will get a warning (the vectors differ in length) and only the positions where the recycled elements happen to match:

```
# %in% vs == the equal operator wrong!
which(g == h)
```

Code language: R (r)

In the next example, we will work with a dataframe, instead of vectors. First, however, we are going to load the readxl package to read a .xlsx file in R. Here’s how we get our dataframe to play around with:

```
library(readxl)
library(httr)
#URL to Excel File:
xlsx_URL <- 'https://mathcs.org/statistics/datasets/titanic.xlsx'
# Get the .xlsx file as an temporary file
GET(xlsx_URL, write_disk(tf <- tempfile(fileext = ".xlsx")))
# Reading the temporary .xlsx file in R:
dataf <- read_excel(tf)
# Checking the dataframe:
head(dataf)
```

Code language: R (r)

A quick note, before going on to the third example, is that readxl as well as dplyr, a package that we will use later, are part of the Tidyverse package. If you install Tidyverse you will get some powerful tools to extract year from date in R, carry out descriptive statistics, visualize data (e.g., scatter plots with ggplot2), to name a few.

In this example, we will have a look at a very simple example of how we can use this operator. Namely, we are going to use `%in%`

to check if a value is in one of the columns in a dataframe:

```
# %in% column
2 %in% dataf$boat
```

Code language: R (r)

Now, if you have read through the first 2 examples you already know that we get a boolean vector. In this vector, the value TRUE means that the cell contained the value we sought. Notice also how we used the `$`

operator to select one of the columns.

Here’s how to use the `%in%`

operator to create a new variable:

```
# Creating a dataframe:
dataf2 <- data.frame(Type = c("Fruit","Fruit","Fruit","Fruit","Fruit",
"Vegetable", "Vegetable", "Vegetable", "Vegetable", "Fruit"),
Name = c("Red Apple","Strawberries","Orange","Watermelon","Papaya",
"Carrot","Tomato","Chili","Cucumber", "Green Apple"),
Color = c(NA, "Red", "Orange", "Red", "Green",
"Orange", "Red", "Red", "Green", "Green"))
# Adding a New Column:
dataf2 <- within(dataf2, {
Red_Fruit = "No"
Red_Fruit[Type %in% c("Fruit")] = "No"
Red_Fruit[Type %in% "Vegetable"] = "No"
Red_Fruit[Name %in% c("Red Apple", "Strawberries", "Watermelon", "Chili", "Tomato")] = "Yes"
})
```

Code language: R (r)

Notice how we make use of the operator. Here’s the dataframe, with the added column “Red_Fruit”:

In another post, you will learn how to use R to add a column to a dataframe based on conditions and/or values in other columns.

In this example, we are going to use the `%in%`

operator to subset the data:

```
library(dplyr)
home.dests <- c("St Louis, MO", "New York, NY", "Hudson, NY")
# Subsetting using %in% in R:
dataf %>%
filter(home.dest %in% home.dests)
```

Code language: R (r)

Notice how we created a vector of the elements that we want to be included in our new, subsetted, dataframe. Furthermore, we also used the dplyr package and the filter() function together with the %in% operator. Finally, we get the resulting, subsetted, dataframe:

Note, dplyr comes with a lot of other handy functions such as the select-family. For example, you can use dplyr to select columns in R or to take the absolute value in R, using the function only on numerical columns. In the next section, we will have a look at another way we may use the %in% operator: namely, to drop columns from a dataframe.

In this example, we are going to use the `%in%` operator to drop columns from the dataframe:

```
# Drop columns using %in% operator in R
dataf[, !(colnames(dataf) %in% c("pclass", "embarked", "boat"))]
```


In the code chunk above, we used the `!` (negation) operator to tell R that we do not want to select these columns. Running the code above will result in a new dataframe with the columns removed:

Note, it is also possible to use dplyr to remove columns in R. For example, using the select() function together with the pipe operator may result in a slightly more readable code.
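For completeness, here is a base R sketch of the same drop using `setdiff()` on the column names instead of negating `%in%` (the dataframe is a made-up stand-in for dataf):

```
# Stand-in for dataf with the three columns we want to drop:
dataf <- data.frame(pclass = 1, embarked = "S", boat = NA, age = 29)

# setdiff() keeps every column name not in the drop list:
keep <- setdiff(colnames(dataf), c("pclass", "embarked", "boat"))
dataf[, keep, drop = FALSE]
```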

In the next example, we are going to have a look at how we can use the `%in%` operator to do the opposite of dropping columns. That is, we are going to select columns instead.

Let us use the `%in%` operator to select a number of variables from the dataframe:

```
# Select columns using %in%:
dataf[, (colnames(dataf) %in% c("pclass", "embarked", "boat"))]
```


Note that we removed the `!` before the parentheses, which tells R to select these columns (see example 6, above, for the opposite).

Selecting columns, instead of deleting them, might be a more efficient way to go if we have a lot of variables in our dataset and only want to keep some of them in a new dataframe.

In the final bonus section, we are going to see how we can negate the %in% operator. We do this because there is no built-in “not in” operator in R.

Here’s how we can create our own *not in* operator in R:

```
# Creating a not in operator:
`%notin%` <- Negate(`%in%`)
```


Pretty simple. It is now possible to use this new R “not in” operator to check whether, e.g., a number is not in a vector:

```
# Generating a sequence of numbers:
numbs <- rep(seq(3), 4)
# Using the not in operator:
4 %notin% numbs
# Output: [1] TRUE
```


As you can see in the example above, we can use the %notin% operator in the same way as we would use the %in% operator. Note that both operators also work on lists. Finally, it is worth noting that some R packages contain “not in” functions. For example, the mefa4 package has the %notin% function.
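To round things off, here is a small sketch (with made-up names) showing that %notin% also works element-wise on character vectors, e.g. for filtering out values:

```
# Creating the operator again, so the sketch is self-contained:
`%notin%` <- Negate(`%in%`)

names_vec <- c("Erik", "Anna", "Lisa")

# Checking a single value:
"John" %notin% names_vec
# [1] TRUE

# Element-wise, keeping only the values not in a set:
names_vec[names_vec %notin% c("Anna")]
# [1] "Erik" "Lisa"
```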

In this R tutorial, you have learned 7 ways you can use the %in% operator in R. Specifically, you have learned how to compare vectors of numbers and letters (factors). You have also learned how to check if a value is in a column (as well as how many times), how to add a new variable, how to remove columns, and how to select columns.

Here are some other useful tutorials:

- How to Rename Column (or Columns) in R with dplyr
- Select Columns in R by Name, Index, Letters, & Certain Words with dplyr
- How to Extract Year from Date in R with Examples
- Learn How to Calculate Descriptive Statistics in R the Easy Way
- How to Extract Day from Datetime in R with Examples
- Learn How to Create Dummy Variables in R (with Examples)

The post How to use %in% in R: 7 Example Uses of the Operator appeared first on Erik Marsja.
