
The post Select Columns in R by Name, Index, Letters, & Certain Words with dplyr appeared first on Erik Marsja.

In this R tutorial, you will learn how to select columns in a dataframe. First, we will use base R, in a number of examples, to choose certain columns. Second, we will use dplyr to get columns from the dataframe.

In the first section, we are going to have a look at what you need to follow this tutorial. Second, we will answer some questions that might have brought you to this post. Third, we are going to use base R to select certain columns from the dataframe. In this section, we are also going to use the great `%in%` operator in R to select specific columns. Fourth, we are going to use dplyr and the select() family of functions. For example, we will use `select_if()` to get all the numeric columns, as well as some helper functions. The helper functions enable us to select columns starting with, or ending with, a certain word or a specific character, for instance.

To select a column in R, you can use brackets, e.g., `YourDataFrame['Column']` will take the column named “Column”. Furthermore, we can also use dplyr and the select() function to get columns by name or index. For instance, `select(YourDataFrame, c('A', 'B'))` will take the columns named “A” and “B” from the dataframe.

If you want to use dplyr to select a column in R, you can use the `select()` function. For instance, `select(Data, 'Column_to_Get')` will get the column “Column_to_Get” from the dataframe “Data”.

In the next section, we are going to learn about the prerequisites of this post and how to install R packages such as dplyr (or Tidyverse).

To follow this post you, obviously, need a working installation of R. Furthermore, we are going to read the example data from an Excel file using the readxl package. Moreover, if you want to use dplyr’s `select()` and the different helper functions (e.g., `starts_with()`, `ends_with()`), you also need to install dplyr. It may be worth pointing out that just by using the “-” character, you can use select() (from dplyr) to drop columns in R.

It is worth pointing out that both readxl and dplyr are part of the tidyverse, which comes with a number of packages packed with great functions. Besides selecting, or removing, columns with dplyr (part of Tidyverse), you can extract year from date in R using the lubridate package, create scatter plots with ggplot2, and calculate descriptive statistics. That said, you can install these R packages, depending on what you need, using the `install.packages()` function. For example, installing dplyr and readxl is done by running this in R: `install.packages(c('dplyr', 'readxl'))`.

Before we continue and practice selecting columns in R, we will read data from a .xlsx file.

```
library(readxl)
dataf <- read_excel("add_column.xlsx")
head(dataf)
```

This example dataset is one that we used in the tutorial, in which we added a column based on other columns. We can see that it contains 9 different columns. If we want to, we can check the structure of the dataframe so that we can see what kind of data we have.

`str(dataf)`

Now, we see that there are 20 rows, as well, and that all but one column is numeric. In the next section, we are going to learn how to select certain columns from this dataframe using base R.

In this section, we are going to practice selecting columns using base R. First, we will use the column indexes and, second, we will use the column names.

Here’s one example on how to select columns by their indexes in R:

`dataf[, c(1, 2, 3)]`

As you can see, we selected the first three columns by using their indexes (1, 2, 3). Notice how we also used the “,” within the brackets. This is done to get the columns rather than the rows (placing the “,” after the index vector, i.e., `dataf[c(1, 2, 3), ]`, would instead subset rows). Before moving on to the next example, it may be worth knowing that the vector can contain a sequence. For instance, we can generate a sequence of numbers using `:`. For example, replacing `c(1, 2, 3)` with `c(1:3)` would give us the same output as above. Naturally, we can also select e.g. the third, fifth, and sixth column if we want to. In the next example, we are going to subset certain columns by their names. Note, sequences of numbers can also be generated in R with the seq() function.

Here’s how we can select columns in R by name:

`dataf[, c('A', 'B', 'Cost')]`

In the code chunk above, we basically did the same as in the first example. Notice, however, how we removed the numbers and added the column names. That is, in the vector we now used the names of the columns we wanted to select. In the next example, we are going to learn a neat little trick using the %in% operator when selecting columns by name.

Here’s how we can make use of the %in% operator to get columns by name from the R dataframe:

```
head(dataf[, (colnames(dataf) %in% c('Depr1', 'Depr2',
                                     'Depr4', 'Depr7'))])
```

In the code chunk above, we used the great %in% operator. Notice something different in the character vector? There’s a column that doesn’t exist in the example data. The cool thing, here, is that even if we do this, the %in% operator will select the columns that actually exist in the dataframe. In the next section, we are going to have a look at a couple of examples using dplyr’s `select()` and some of the great helper functions.

In this section, we will start with the basic examples of selecting columns (e.g., by name and index). However, the focus will be on using the helper functions together with `select()`, and on the `select_if()` function.

Here’s how we can get columns by index using the `select()` function:

`library(dplyr); dataf %>% select(c(2, 5, 6))`

Notice how we used another great operator: %>%. This is the pipe operator, and following it, we used the select() function. Again, as when selecting columns with base R, we added a vector with the indexes of the columns we want. In the next example, we will basically do the same but select by column names.

Here’s how we use `select()` to get the columns we want by name:

```
library(dplyr)
dataf %>%
select(c('A', 'Cost', 'Depr1'))
```

In the code chunk above, we just added the names of the columns in the vector. Simple! In the next example, we are going to have a look at how to use `select_if()` to select columns containing data of a specific data type.

Here’s how to select all the numeric columns in an R dataframe:

```
dataf %>%
select_if(is.numeric)
```

Remember, all columns except for one are of numeric type. This means that we will get 8 out of 9 columns running the above code. If we, on the other hand, used the `is.character` function, we would only select the first column. In the next section, we will learn how to get columns starting with a certain letter.

Here’s how we use the `starts_with()` helper function and `select()` to get all columns starting with the letter “D”:

```
dataf %>%
select(starts_with('D'))
```

Selecting columns with names starting with a certain letter was pretty easy: in the `starts_with()` helper function, we just added the letter.

Here’s how we use the `ends_with()` helper function and `select()` to get all columns ending with the letter “D”:

```
dataf %>%
select(ends_with('D'))
```

Note that in the example dataset there is only one column ending with the letter “D”. In fact, all column names end with unique characters, so here it would not make sense to select columns using this method. It is worth noting that we can use a whole word when working with both the `starts_with()` and `ends_with()` helper functions. Let’s have a look!

Here’s how we can select certain columns starting with a specific word:

```
dataf %>%
select(starts_with('Depr'))
```

Of course, “Depr” is not really a word and, yes, we get the exact same columns as in example 7. However, you get the idea and should understand how to use this in your own application. One case where this makes sense is when you have multiple columns beginning with the same letter but only some of them beginning with the same word.


Before going to the next section, it may be worth mentioning another great feature of the dplyr package: you can use dplyr to rename factor levels in R. In the final example, we are going to select certain columns whose names contain a string (or a word).

Here’s how we can select certain columns whose names contain a specific string:

```
dataf %>%
select(contains('pr'))
```

Again, this particular example doesn’t make much sense on the example dataset. There’s a final helper function that is worth mentioning: `matches()`. This function can be used to select columns whose names match a pattern (regular expression), such as ones containing digits. Now that you have selected the columns you need, you can continue manipulating your data and get it ready for data analysis. For example, you can now go ahead and create dummy variables in R or add a new column.

In this post, you have learned how to select certain columns using base R and dplyr. Specifically, you have learned how to get columns, from the dataframe, based on their indexes or names. Furthermore, you have learned to select columns of a specific type. After this, you learned how to subset columns based on whether the column names started or ended with a letter. Finally, you have also learned how to select based on whether the columns contained a string or not. Hope you found this blog post useful. If you did, please share it on your social media accounts, add a link to the tutorial in your project reports and such, and leave a comment below.



The post How to use Python to Perform a Paired Sample T-test appeared first on Erik Marsja.

In this Python data analysis tutorial, you will learn how to perform a paired sample t-test in Python. First, you will learn about this type of t-test (e.g. when to use it, the assumptions of the test). Second, you will learn how to check whether your data follow the assumptions and what you can do if your data violates some of the assumptions.

Third, you will learn how to perform a paired sample t-test using the following Python packages:

- SciPy (scipy.stats.ttest_rel)
- Pingouin (pingouin.ttest)

In the final sections, of this tutorial, you will also learn how to:

- Interpret the results of the paired t-test (p-value, effect size)
- Report the results and visualize the data

In the first section, you will learn about what is required to follow this post.

In this tutorial, we are going to use both SciPy and Pingouin, two great Python packages, to carry out the dependent sample t-test. Furthermore, to read the dataset we are going to use Pandas. Finally, we are also going to use Seaborn to visualize the data. In the next three subsections, you will find a brief description of each of these packages.

SciPy is one of the essential data science packages. This package is, furthermore, a dependency of all the other packages that we are going to use in this tutorial. In this tutorial, we are going to use it to test the assumption of normality as well as carry out the paired sample t-test. This means, of course, that if you are going to carry out the data analysis using Pingouin you will get SciPy installed anyway.

Pandas is also a great Python package for anyone carrying out data analysis with Python, whether a data scientist or a psychologist. In this post, we will use Pandas to import data into a dataframe and to calculate summary statistics.

In this tutorial, we are going to use data visualization to guide our interpretation of the paired sample t-test. Seaborn is a great package for carrying out data visualization (see for example these 9 examples of how to use Seaborn for data visualization in Python).

In this tutorial, Pingouin is the second package that we are going to use to do a paired sample t-test in Python. One great thing with the ttest function is that it returns a lot of information we need when reporting the results from the test. For instance, when using Pingouin we also get the degrees of freedom, Bayes Factor, power, effect size (Cohen’s d), and confidence interval.

In Python, we can install packages with pip. To install all the required packages run the following code:

`pip install scipy pandas seaborn pingouin`

In the next section, we are going to learn about the paired t-test and its assumptions.

The paired sample t-test is also known as the *dependent sample t-test* or *paired t-test*. This type of t-test compares two averages (means) and tells you whether the difference between these two averages is zero. In a paired sample t-test, each participant is measured twice, which results in pairs of observations (the next section will give you an example).

For example, if clinical psychologists want to test whether a treatment for depression will change the quality of life, they might set up an experiment. In this experiment, they will collect information about the participants’ quality of life before the intervention (i.e., the treatment) and after: they are conducting a pre- and post-test study. In the pre-test the average quality of life might be 3, while in the post-test the average quality of life might be 5. Numerically, we could think that the treatment is working. However, it could be due to a fluke and, to test this, the clinical researchers can use the paired sample t-test.

Now, when performing dependent sample t-tests you typically have the following two hypotheses:

- Null hypothesis: the true mean difference is equal to zero (between the observations)
- Alternative hypothesis: the true mean difference is not equal to zero (two-tailed)

Note, in some cases we also may have a specific idea, based on theory, about the direction of the measured effect. For example, we may strongly believe (due to previous research and/or theory) that a specific intervention should have a positive effect. In such a case, the alternative hypothesis will be something like: the true mean difference is greater than zero (one-tailed). Note, it can also be smaller than zero, of course.
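As a sketch of how such a directional (one-tailed) test could look in code, here is SciPy's `ttest_rel()` with its `alternative` parameter (available in SciPy 1.6 and later), applied to made-up pre/post scores:

```python
import numpy as np
from scipy.stats import ttest_rel

# Made-up pre/post scores where we expect an increase
rng = np.random.default_rng(123)
pre = rng.normal(loc=40, scale=5, size=30)
post = pre + rng.normal(loc=3, scale=2, size=30)

# One-tailed test: alternative='greater' tests whether the mean
# difference (post - pre) is greater than zero
result = ttest_rel(post, pre, alternative='greater')
print(result.statistic, result.pvalue)
```

The default, `alternative='two-sided'`, corresponds to the two-tailed hypothesis described above.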

Before we continue and import data we will briefly have a look at the assumptions of this paired t-test. Now, besides that the dependent variable is on interval/ratio scale, and is continuous, there are three assumptions that need to be met.

- Are the pairs of observations independent of each other?
- Does the data, i.e., the differences for the matched-pairs, follow a normal distribution?
- Are the participants randomly selected from the population?

If your data is not following a normal distribution you can transform your dependent variable using square root, log, or Box-Cox in Python. In the next section, we will import data.

Before we check the normality assumption of the paired t-test in Python, we need some data to even do so. In this tutorial post, we are going to work with a dataset that can be found here. Here we will use Pandas and the read_csv method to import the dataset (stored in a .csv file):

```
import pandas as pd

# Read the example data (the first column holds the row index)
df = pd.read_csv('./SimData/paired_samples_data.csv',
                 index_col=0)
```

In the image above, we can see the structure of the dataframe. Our dataset contains 100 observations and three variables (columns). Furthermore, there are three different datatypes in the dataframe. First, we have an integer column (i.e., “ids”), which contains the identifier for each individual in the study. Second, we have the column “test”, which is of object data type and contains the information about the test time point. Finally, we have the “score” column, where the dependent variable is. We can check the pairs by grouping the Pandas dataframe and calculating descriptive statistics:
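The code for that step does not appear above; a minimal sketch, using made-up data with the same structure as the tutorial's dataset (the “ids”, “test”, and “score” columns), could look like this:

```python
import pandas as pd

# Made-up data mirroring the tutorial's dataset:
# 50 individuals measured at two time points
df = pd.DataFrame({
    'ids': list(range(50)) * 2,
    'test': ['Pre'] * 50 + ['Post'] * 50,
    'score': [40.0] * 50 + [45.0] * 50,
})

# Group by time point, select the dependent variable,
# and calculate descriptive statistics per group
summary = df.groupby('test')['score'].describe()
print(summary)
```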

In the code chunk above, we grouped the data by “test”, selected the dependent variable, and got some descriptive statistics using the `describe()` method. If we want, we can use Pandas to count unique values in a column:

`df['test'].value_counts()`

This way, we got the information that we have as many observations in the post-test as in the pre-test. A quick note: before we continue to the next subsection, in which we subset the data, it has to be mentioned that you should check whether the dependent variable is normally distributed or not. This can be done by creating a histogram (e.g., with Pandas) and/or carrying out the Shapiro-Wilk test.
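As a sketch, the Shapiro-Wilk test is available in SciPy as `shapiro()`; here it is applied to made-up difference scores (not the tutorial's dataset):

```python
import numpy as np
from scipy.stats import shapiro

# Made-up difference scores for the matched pairs
rng = np.random.default_rng(1)
differences = rng.normal(loc=5.0, scale=2.0, size=50)

# Shapiro-Wilk test of normality: a small p-value (e.g. < .05)
# suggests the data deviate from a normal distribution
stat, p = shapiro(differences)
print(stat, p)
```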

Both methods, whether using SciPy or Pingouin, require that we have our dependent variable in two Python variables. Therefore, we are going to subset the data and select only the dependent variable. To our help, we have the `query()` method, and we will select a column using brackets (`[]`):

```
b = df.query('test == "Pre"')['score']
a = df.query('test == "Post"')['score']
```

Now that we have the variables a and b containing the dependent variable pairs, we can use SciPy to do a paired sample t-test.

Here’s how to carry out a paired sample t-test in Python using SciPy:

```
from scipy.stats import ttest_rel
# Python paired sample t-test
ttest_rel(a, b)
```

In the code chunk above, we started by importing `ttest_rel()`, the function we then used to carry out the dependent sample t-test. Furthermore, the two arguments we used were the variables containing the dependent variable for the pairs (a and b). Now, we can see from the results (image below) that the difference between the pre- and post-test is statistically significant.

In the next section, we will use Pingouin to carry out the paired t-test.

Here’s how to carry out the dependent samples t-test using the Python package Pingouin:

```
import pingouin as pt
# Python paired sample t-test:
pt.ttest(a, b, paired=True)
```

There’s not that much to explain about the code chunk above: we started by importing pingouin, and then used the `ttest()` method on our data. Notice how we set the paired parameter to True, because it is a paired sample t-test we wanted to carry out. Here’s the output:

As you can see, we get more information when using Pingouin to do the paired t-test. In fact, here we basically get all we need to continue and interpret the results. In the next section, before learning how to interpret the results, you can also watch a YouTube video explaining all the above (with some exceptions, of course):

Here’s the majority of the current blog post explained in a YouTube video:

In this section, you will be given a short explanation on how to interpret the results from a paired t-test carried out with Python. Note, we will focus on the results that we got from Pingouin as they give us more information (e.g., degrees of freedom, effect size).

Now, the p-value of the test is smaller than 0.001, which is less than the significance level alpha (e.g., 0.05). This means that we can draw the conclusion that the quality of life increased from the pre-test to the post-test. Note, this can, of course, be due to other things than the intervention, but that’s another story.

Note that the p-value is the probability of getting an effect at least as extreme as the one in our data, assuming that the null hypothesis is true. P-values address only one question: how likely is your collected data, assuming a true null hypothesis? Notice, the p-value can never be used as support for the alternative hypothesis.

Normally, we interpret Cohen’s d in terms of the relative strength of e.g. the treatment. Cohen (1988) suggested that *d* = 0.2 is a ‘small’ effect size, 0.5 is a ‘medium’ effect size, and 0.8 is a ‘large’ effect size. You can interpret this such that if two groups’ means don’t differ by 0.2 standard deviations or more, the difference is trivial, even if it is statistically significant.
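If you want to compute the effect size yourself, one common variant for paired data (the mean difference divided by the standard deviation of the differences, sometimes called d_z; note that Pingouin may report a different paired-samples variant of Cohen's d) can be sketched like this, on made-up scores:

```python
import numpy as np

# Made-up paired scores
pre = np.array([39.0, 41.0, 38.5, 40.0, 42.0])
post = np.array([45.0, 46.5, 44.0, 45.5, 47.0])

# Cohen's d for paired data (d_z):
# mean of the differences / sample SD of the differences
diff = post - pre
d = np.mean(diff) / np.std(diff, ddof=1)
print(round(d, 2))
```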

When using Pingouin to carry out the paired t-test we also get the Bayes Factor. See this post for more information on how to interpret BF10.

In this section, you will learn how to report the results according to the APA guidelines. In our case, we can report the results from the t-test like this:

The results from the pre-test (*M* = 39.77, *SD* = 6.758) and post-test (*M* = 45.737, *SD* = 6.77) quality of life test suggest that the treatment resulted in an improvement in quality of life, *t*(49) = 115.4384, *p* < .01. Note that the “quality of life test” is something made up for this post (or there might be such a test, of course, that I don’t know of!).

In the final section, before the conclusion, you will learn how to visualize the data in two different ways: creating boxplots and violin plots.

Here’s how we can guide the interpretation of the paired t-test using boxplots:

```
import seaborn as sns
sns.boxplot(x='test', y='score', data=df)
```

In the code chunk above, we imported seaborn (as sns) and used the boxplot method. We put the column that we want separate plots for on the x-axis and the dependent variable on the y-axis. Here’s the resulting plot:

Here’s another way to report the results from the t-test by creating a violin plot:

```
import seaborn as sns
sns.violinplot(x='test', y='score', data=df)
```

Much like when creating the box plot, we import seaborn and add the columns/variables we want as the x- and y-axes. Here’s the resulting plot:

As you may already be aware, there are other ways to analyze data. For example, you can use Analysis of Variance (ANOVA) if there are more than two levels of the factor (e.g. tests during the treatment, as well as pre- and post-tests) in the data. See the following posts about how to carry out ANOVA:

- Repeated Measures ANOVA in R and Python using afex & pingouin
- Two-way ANOVA for repeated measures using Python
- Repeated Measures ANOVA in Python using Statsmodels

Recently, machine learning methods have also grown popular.

In this post, you have learned two methods to perform a paired sample t-test. Specifically, you have installed, and used, four Python packages for data analysis (Pandas, SciPy, Seaborn, and Pingouin). Furthermore, you have learned how to interpret and report the results from this statistical test, including data visualization using Seaborn. In the Resources and References section, you will find useful resources and references to learn more. As a final word: the Python package Pingouin gives you the most comprehensive results, and that’s the package I’d choose to carry out many statistical methods in Python.

If you liked the post, please share it on your social media accounts and/or leave a comment below. Commenting is also a great way to give me suggestions. However, if you are looking for any help please use other means of contact (see e.g., the About or Contact pages).

Finally, support me and my content (much appreciated, especially if you use an AdBlocker): become a patron. Becoming a patron will give you access to a Discord channel in which you can ask questions and may get interactive feedback.

Here are some useful peer-reviewed articles, blog posts, and books. Refer to these if you want to learn more about the t-test, p-value, effect size, and Bayes Factors.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

It’s the Effect Size, Stupid – What effect size is and why it is important

Using Effect Size—or Why the P Value Is Not Enough.

Beyond Cohen’s d: Alternative Effect Size Measures for Between-Subject Designs (Paywalled).

A tutorial on testing hypotheses using the Bayes factor.



The post How to use Square Root, log, & Box-Cox Transformation in Python appeared first on Erik Marsja.

In this tutorial, related to data analysis in Python, you will learn how to deal with your data when it is not following the normal distribution. One way to deal with non-normal data is to transform your data. In this post, you will learn how to carry out Box-Cox, square root, and log transformation in Python.

That the data we have is of normal shape (also known as following a bell curve) is important for the majority of the parametric tests we may want to perform. This includes regression analysis, the two-sample t-test, and Analysis of Variance, all of which can be carried out in Python, to name a few.

This post will start by briefly going through what you need to follow this tutorial. After this is done, you will 1) get information about skewness and kurtosis, and 2) a brief overview of the different methods of transformation. In the section, following the transformation methods, you will learn how to import data using Pandas read_csv. We will explore the example dataset a bit by creating histograms, getting the measures of skewness and kurtosis. Finally, the last sections will be covering how to transform data that is non-normal.

In this tutorial, we are going to use Pandas, SciPy, and NumPy. It is worth mentioning, here, that you only need to install Pandas as the other two Python packages are dependencies of Pandas. That is, if you install Python packages using e.g. pip it will also install SciPy and NumPy on your computer, whether you use e.g. Ubuntu Linux or Windows 10. Note, that you can use pip to install a specific version of e.g. Pandas and if you need, you can upgrade pip using either conda or pip.

Now, if you want to install the individual packages (e.g. you only want to use NumPy and SciPy) you can run the following code:

`pip install pandas`

Now, if you only want to install NumPy, change “pandas” to “numpy” in the code chunk above. That said, let us move on to the section about skewness and kurtosis.

Briefly, skewness is a measure of symmetry. To be exact, it is a measure of lack of symmetry. This means that the larger the number is the more your data lack symmetry (not normal, that is). Kurtosis, on the other hand, is a measure of whether your data is heavy- or light-tailed relative to a normal distribution. See here for a more mathematical definition of both measures. A good way to visually examine data for skewness or kurtosis is to use a histogram. Note, however, that there are, of course, also different statistical tests that can be used to test if your data is normally distributed.

One way of handling right, or left, skewed data is to carry out the logarithmic transformation on our data. For example, `np.log(x)` will log transform the variable `x` in Python. There are other options as well, such as the Box-Cox and square root transformations.

One way to handle left (negative) skewed data is to reverse the distribution of the variable. In Python, this can be done using the following code:
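The code chunk seems to be missing here; a minimal sketch of one common way to reverse (reflect) a left-skewed variable, assuming `x` is a NumPy array of made-up values, is to subtract each value from the maximum plus one so that the long tail points right:

```python
import numpy as np

# Made-up left-skewed variable
x = np.array([1.0, 8.0, 9.0, 9.5, 10.0])

# Reflect the distribution: large values become small and vice versa;
# the result can then be log- or square-root-transformed
reversed_x = np.max(x) + 1 - x
print(reversed_x)
```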

Both of the above topics will be covered in more detail throughout the post (e.g., you will learn how to carry out log transformation in Python). In the next section, you will learn about the three commonly used transformation techniques that you, later, will also learn to apply.

As indicated in the introduction, we are going to learn three methods that we can use to transform data deviating from the normal distribution. In this section, you will get a brief overview of these three transformation techniques and when to use them.

The square root method is typically used when your data is moderately skewed. Using the square root (e.g., sqrt(x)) is a transformation that has a moderate effect on distribution shape, and it is generally used to reduce right-skewed data. Finally, the square root can be applied to zero values and is most commonly used on count data.
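A minimal sketch of the square root transformation with NumPy, on made-up count data (note that zeros are allowed, unlike for the log transformation):

```python
import numpy as np

# Made-up, moderately right-skewed count data
counts = np.array([0, 1, 1, 2, 3, 4, 9, 16, 25])

# Square root transformation: pulls in the right tail moderately
sqrt_transformed = np.sqrt(counts)
print(sqrt_transformed)
```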

The logarithmic transformation is a strong transformation that has a major effect on distribution shape. This technique is, like the square root method, often used for reducing right skewness. Worth noting, however, is that it cannot be applied to zero or negative values.

The Box-Cox transformation is, as you probably understand, also a technique to transform non-normal data into normal shape. This is a procedure to identify a suitable exponent (Lambda = l) to use to transform skewed data.
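A minimal sketch using SciPy's `boxcox()` on made-up, strictly positive right-skewed data; when no lambda is passed, the function estimates a suitable lambda itself and returns it along with the transformed data:

```python
import numpy as np
from scipy.stats import boxcox

# Made-up right-skewed data; Box-Cox requires strictly positive values
x = np.random.default_rng(42).exponential(scale=2.0, size=1000)

# boxcox returns the transformed data and the fitted lambda
transformed, fitted_lambda = boxcox(x)
print(fitted_lambda)
```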

Now, the above-mentioned transformation techniques are the most commonly used. However, there are plenty of other methods that can be used to transform your skewed dependent variables. For example, if your data is of ordinal data type, you can also use the arcsine transformation method. Another method that you can use is called the reciprocal, which is carried out like this: 1/x, where x is your dependent variable.

In the next section, we will import data containing four dependent variables that are positively and negatively skewed.

In this tutorial, we will transform data that is both negatively (left) and positively (right) skewed and we will read an example dataset from a CSV file (Data_to_Transform.csv). To our help we will use Pandas to read the .csv file:

```
import pandas as pd
import numpy as np
# Reading dataset with skewed distributions
df = pd.read_csv('./SimData/Data_to_Transform.csv')
```

This is an example dataset that has the following four variables:

- Moderate Positive Skew (Right Skewed)
- Highly Positive Skew (Right Skewed)
- Moderate Negative Skew (Left Skewed)
- Highly Negative Skew (Left Skewed)

We can obtain this information by using the `info()` method. This will give us the structure of the dataframe:

As you can see, the dataframe has 10000 rows and 4 columns (as previously described). Furthermore, we get the information that the 4 columns are of float data type and that there are no missing values in the dataset.

In the next section, we will do a quick visual inspection of the variables in the dataset using Pandas hist() function.

In this section, we are going to visually inspect whether the data are normally distributed. Of course, there are several ways to plot the distribution of our data. In this post, however, we are going to only use Pandas and create histograms. Here’s how to create a histogram in Pandas using the `hist()` method:

```
df.hist(grid=False,
        figsize=(10, 6),
        bins=30)
```

Now, the `hist()` method takes all the numeric variables in the dataset (i.e., in our case, the float columns) and creates a histogram for each. To quickly explain the parameters used in the code chunk above: first, we set the `grid` parameter to `False` to remove the grid from the histogram. Second, we changed the figure size using the `figsize` parameter. Finally, we also changed the number of bins (the default is 10) to get a better view of the data. Here is the distribution visualized:

It is pretty clear that all the variables are skewed and do not follow a normal distribution (as the variable names imply). Note, there are, of course, other visualization techniques that you can use to examine the distribution of your dependent variables. For example, you can use boxplots, strip plots, swarm plots, kernel density estimates, or violin plots. These plots give you a lot more information about your dependent variables. See the post with 9 Python data visualization examples for more information. In the next section, we are also going to have a look at how we can get the measures of skewness and kurtosis.

More data visualization tutorials:

- Seaborn Line Plots: A Detailed Guide with Examples (Multiple Lines)
- How to use Pandas Scatter Matrix (Pair Plot) to Visualize Trends in Data
- How to Save a Seaborn Plot as a File (e.g., PNG, PDF, EPS, TIFF)

In this section, before we start learning how to transform skewed data in Python, we will just have a quick look at how to get skewness and kurtosis in Python.

`df.agg(['skew', 'kurtosis']).transpose()`

In the code chunk above, we used the `agg()` method with a list as its only parameter. This list contained the two methods that we wanted to use (i.e., we wanted to calculate skewness and kurtosis). Finally, we used the `transpose()` method to change the rows to columns (i.e., transpose the Pandas dataframe) so that we get an output that is a bit easier to check. Here’s the resulting table:

As a rule of thumb, skewness can be interpreted like this:

| Skewness | Range |
| --- | --- |
| Fairly symmetrical | -0.5 to 0.5 |
| Moderately skewed | -1.0 to -0.5 and 0.5 to 1.0 |
| Highly skewed | < -1.0 or > 1.0 |

There are, of course, more things that can be done to test whether our data are normally distributed. For example, we can carry out statistical tests of normality, such as the Shapiro-Wilk test. It is worth noting, however, that most of these tests are sensitive to sample size: with large samples, even small deviations from normality will be detected by, e.g., the Shapiro-Wilk test.
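As a hedged sketch of such a normality test (using simulated data, since it is not part of the original tutorial), SciPy's `shapiro()` returns a test statistic and a p-value:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=500)       # roughly normal
skewed_sample = rng.exponential(size=500)  # strongly right-skewed

stat_n, p_n = shapiro(normal_sample)  # large p-value: no evidence against normality
stat_s, p_s = shapiro(skewed_sample)  # tiny p-value: normality is rejected
```

With 500 observations, the skewed sample is rejected with an extremely small p-value, illustrating the sample-size sensitivity mentioned above.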

In the next section, we will start transforming the non-normal (skewed) data. First, we will transform the moderate skewed distributions and, then, we will continue with the highly skewed data.

Here’s how to do the square root transformation of non-normal data in Python:

```
# Python Square root transformation
df.insert(len(df.columns), 'A_Sqrt',
          np.sqrt(df.iloc[:, 0]))
```

In the code chunk above, we created a new column/variable in the Pandas dataframe by using the `insert()` method. It is, furthermore, worth mentioning that we used the `iloc[]` indexer to select the column we wanted. In the following examples, we are going to continue using this method for selecting columns. Notice how the first parameter (i.e., “:”) is used to select all rows, and the second parameter (“0”) is used to select the first column. If we, on the other hand, used `loc[]`, we could have selected the column by name. Here’s a histogram of our new column/variable:
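To make the `insert()` and `iloc[]` mechanics concrete, here is a tiny self-contained sketch (the dataframe and column names are hypothetical, not the tutorial's data):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"x": [1.0, 4.0, 9.0]})
# insert(position, name, values): len(df_demo.columns) appends at the end
df_demo.insert(len(df_demo.columns), "x_sqrt",
               np.sqrt(df_demo.iloc[:, 0]))  # iloc[:, 0] = all rows, first column
```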

Again, we can see that the new, square root transformed, distribution is more symmetrical than the previous, right-skewed, distribution.

In the next subsection, you will learn how to deal with negatively (left) skewed data. If we try to apply the square root directly to a column containing negative values, right now, we will run into problems (see towards the end of the post).

Now, if we want to transform the negatively (left) skewed data using the square root method we can do as follows.

```
# Square root transformation on left skewed data in Python:
df.insert(len(df.columns), 'B_Sqrt',
          np.sqrt(max(df.iloc[:, 2] + 1) - df.iloc[:, 2]))
```

What we did, above, was to reverse the distribution (i.e., `max(df.iloc[:, 2] + 1) - df.iloc[:, 2]`) and then apply the square root transformation. You can see, in the image below, that the skewness becomes positive when reversing the negatively skewed distribution.
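The sign flip can be verified on simulated data — a sketch only, where a negated exponential sample stands in for the tutorial's left-skewed column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
left_skewed = pd.Series(-rng.exponential(size=10_000))  # negative (left) skew
reflected = left_skewed.max() + 1 - left_skewed         # reverse the distribution
# left_skewed.skew() is negative; reflected.skew() is positive
```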

In the next section, you will learn how to log transform in Python on highly skewed data, both to the right and left.

Here’s how we can use the log transformation in Python to get our skewed data more symmetrical:

```
# Python log transform
df.insert(len(df.columns), 'C_log',
          np.log(df['Highly Positive Skew']))
```

Now, we did pretty much the same as when using Python to do the square root transformation: we created a new column using the `insert()` method. However, this time we used NumPy’s `log()` function, because we wanted to do a logarithmic transformation. Here’s what the distribution looks like now:

Here’s how to log transform negatively skewed data in Python:

```
# Log transformation of negatively (left) skewed data in Python
df.insert(len(df.columns), 'D_log',
          np.log(max(df.iloc[:, 3] + 1) - df.iloc[:, 3]))
```

Again, we carried out the log transformation using NumPy’s `log()` function. Furthermore, we did exactly as in the square root example: we reversed the distribution and, again, all that happened is that the skewness went from negative to positive.

In the next section, we will have a look on how to use SciPy to carry out the Box Cox transformation on our data.

Here’s how to implement the Box-Cox transformation using the Python package SciPy:

```
from scipy.stats import boxcox
# Box-Cox Transformation in Python
df.insert(len(df.columns), 'A_Boxcox',
          boxcox(df.iloc[:, 0])[0])
```

In the code chunk above, basically the only difference from the previous examples is that we imported `boxcox()` from `scipy.stats` and used it to apply the Box-Cox transformation. Notice how we selected the first element using the brackets (i.e., `[0]`). This is because `boxcox()` returns a tuple (the transformed data and the fitted lambda). Here’s a visualization of the resulting distribution.

Once again, we managed to transform our positively skewed data into a relatively symmetrical distribution. Note that the Box-Cox transformation also requires our data to contain only positive numbers, so if we want to apply it to negatively skewed data we need to reverse the distribution first (see the previous examples on how to reverse your distribution). If we try to use `boxcox()` on the column “Moderate Negative Skew”, for example, we get a ValueError.

More exactly, SciPy’s `boxcox()` raises “ValueError: Data must be positive” when your dependent variable contains negative numbers or zeros (and `np.sqrt()` or `np.log()` will, for negative input, produce NaN values together with a RuntimeWarning). To solve this, you can reverse the distribution.
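A minimal sketch of both the error and the fix, on simulated data (the negated exponential sample is only a stand-in for a left-skewed column):

```python
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(1)
left_skewed = -rng.exponential(size=5_000)       # contains negative values

try:
    boxcox(left_skewed)                          # not all values are positive
except ValueError as err:
    print(err)                                   # the "must be positive" error

reflected = left_skewed.max() + 1 - left_skewed  # reverse: now strictly positive
transformed, fitted_lambda = boxcox(reflected)   # works on the reflected data
```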

It is worth noting, here, that we can now check the skewness using the `skew()` method:

`df.agg(['skew']).transpose()`

We can see in the output that the skewness values of the transformed variables are now acceptable (their absolute values are all below 0.5). Of course, we could also run the previously mentioned tests of normality (e.g., the Shapiro-Wilk test). Note that if your data still are not normally distributed, you can carry out the Mann-Whitney U test in Python instead.

In this post, you have learned how to apply square root, logarithmic, and Box-Cox transformations in Python using Pandas, SciPy, and NumPy. Specifically, you have learned how to transform both positively (right) and negatively (left) skewed data so that it will better hold the assumption of normality. First, you briefly learned about the Python packages needed to transform non-normal, skewed data into normally distributed data. Second, you learned about the three methods that you, later, also learned how to carry out in Python.

Here are some useful resources for further reading.

DeCarlo, L. T. (1997). On the meaning and use of kurtosis. *Psychological Methods*, *2*(3), 292–307. https://doi.org/10.1037//1082-989x.2.3.292

Blanca, M. J., Arnau, J., López-Montiel, D., Bono, R., & Bendayan, R. (2013). Skewness and kurtosis in real data samples. *Methodology: European Journal of Research Methods for the Behavioral and Social Sciences*, *9*(2), 78–84. https://doi.org/10.1027/1614-2241/a000057

Mishra, P., Pandey, C. M., Singh, U., Gupta, A., Sahu, C., & Keshri, A. (2019). Descriptive statistics and normality tests for statistical data. *Annals of cardiac anaesthesia*, *22*(1), 67–72. https://doi.org/10.4103/aca.ACA_157_18

The post How to use Square Root, log, & Box-Cox Transformation in Python appeared first on Erik Marsja.


In this brief tutorial, you will learn how to add a column to a dataframe in R. More specifically, you will learn 1) to add a column using base R (i.e., by using the $-operator and brackets), 2) to add a column using the add_column() function (i.e., from tibble), 3) to add multiple columns, and 4) to add columns from one dataframe to another.

Note, when adding a column with tibble we are, as well, going to use the `%>%` operator, which is part of dplyr. Note that dplyr, as well as tibble, has plenty of useful functions that, apart from enabling us to add columns, make it easy to remove a column by name from the R dataframe (e.g., using the `select()` function).

First, before reading an example data set from an Excel file, you are going to get the answer to a couple of questions. Second, we will have a look at the prerequisites to follow this tutorial. Third, we will have a look at how to add a new column to a dataframe using first base R and, then, tibble and the `add_column()` function. In this section, using dplyr and `add_column()`, we will also have a quick look at how to add an empty column. Note, we will also append a column based on other columns. Furthermore, in the two last sections, we are going to learn how to insert multiple columns into a dataframe using tibble.

To follow this tutorial, in which we will carry out a simple data manipulation task in R, you only need to install dplyr and tibble if you want to use the `add_column()` and `mutate()` functions as well as the `%>%` operator. However, if you want to read the example data, you will also need to install the readxl package.

It may be worth noting that all the mentioned packages are part of the Tidyverse. This collection of packages comes packed with tools for cleaning data and visualizing data (e.g., to create a scatter plot in R with ggplot2).

To add a new column to a dataframe in R you can use the $-operator. For example, to add the column “NewColumn”, you can do like this: `dataf$NewColumn <- Values`. This will effectively add your new variable to your dataset.

In the next section, we are going to use the `read_excel()` function from the readxl package. After this, we are going to use R to add a column to the created dataframe.

Here’s how to read a .xlsx file in R:

```
# Import readxl
library(readxl)
# Read data from .xlsx file
dataf <- read_excel('./SimData/add_column.xlsx')
```

In the code chunk above, we imported the file add_column.xlsx, which was downloaded to the same directory as the script. We can obtain some information about the structure of the data using the `str()` function:

Before going to the next section it may be worth pointing out that it is possible to import data from other formats. For example, you can see a couple of tutorials covering how to read data from SPSS, Stata, and SAS:

- How to Read and Write Stata (.dta) Files in R with Haven
- Reading SAS Files in R
- How to Read & Write SPSS Files in R Statistical Environment

Now that we have some example data to practice with, we can move on to the next section, in which we will learn how to add a new column to a dataframe in base R.

First, we will use the $-operator and assign a new variable to our dataset. Second, we will use brackets ("[ ]") to do the same.

Here’s how to add a new column to a dataframe using the $-operator in R:

```
# add column to dataframe
dataf$Added_Column <- "Value"
```

Note how we used the $-operator to create the new column in the dataframe. What we added to the dataframe was a single character value (i.e., the same word), which is recycled to produce a character vector as long as the number of rows. Here's the first 6 rows of the dataframe with the added column:

If we, on the other hand, tried to assign a vector that is not of the same length as the dataframe, it would fail. We would get an error similar to "*Error: Assigned data `c(2, 1)` must be compatible with existing data.*"
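Recycling can be sketched with a small, purely hypothetical dataframe (the names below are illustrative only):

```r
# A length-1 value is recycled to every row; other lengths
# must match the number of rows (here: 4)
df_example <- data.frame(A = 1:4)
df_example$Flag <- "yes"                       # recycled to all 4 rows
df_example$Index <- seq_len(nrow(df_example))  # same length as the dataframe
```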

If we would like to add a sequence of numbers, we can use the `seq()` function and the `length.out` argument:

```
# add column to dataframe
dataf$Seq_Col <- seq(1, 10, length.out = dim(dataf)[1])
```

Notice how we also used the `dim()` function and selected its first element (the number of rows) to create a sequence with the same length as the number of rows. In the next section, we will learn how to add a new column using brackets.

Here’s how to append a column to a dataframe in R using brackets (“[]”):

```
# Adding a new column
dataf["Added_Column"] <- "Value"
```

Using the brackets will give us the same result as using the $-operator. However, it may sometimes be easier to use the brackets instead of $. For example, when we have column names containing whitespace, brackets may be the way to go. Also, when selecting multiple columns, you have to use brackets and not $. In the next section, we are going to create a new column by using tibble and the `add_column()` function.

Here’s how to add a column to a dataframe in R:

```
# Append column using Tibble:
dataf <- dataf %>%
  add_column(Add_Column = "Value")
```

In the example above, we added a new column at “the end” of the dataframe. This produced the following output (note that we can also use dplyr to remove columns by name):

Finally, if we want to, we can add a column and create a copy of our old dataframe: simply change the code so that the left-hand “dataf” is something else, e.g., “dataf2”. Now that we have added a column to the dataframe, it might be time for other data manipulation tasks. For example, we may now want to remove duplicate rows from the R dataframe or transpose the dataframe.

If we want to append a column at a specific position, we can use the `.after` argument:

```
# R add column after another column
dataf <- dataf %>%
  add_column(Column_After = "After",
             .after = "A")
```

As you probably understand, doing this will add the new column after the column "A". In the next example, we are going to append a column before a specified column.

Here’s how to add a column to the dataframe before another column:

```
# R add column before another column
dataf <- dataf %>%
  add_column(Column_Before = "Before",
             .before = "Cost")
```

In the next example, we are going to use `add_column()` to add an empty column to the dataframe.

Here’s how we can add an empty column in R:

```
# Empty column
dataf <- dataf %>%
  add_column(Empty_Column = NA)
```

Note that we just added NA (the missing value indicator) as the empty column. Here’s the output, with the empty column added to the dataframe:

If we want an “empty” character column instead, we can just replace the `NA` with `''`, for example. However, this would create a character column and may not be considered truly empty. In the next example, we are going to add a column to a dataframe based on other columns.

Here’s how to use R to add a column to a dataframe based on other columns:

```
# Append column conditionally
dataf <- dataf %>%
  add_column(C = if_else(.$A == .$B, TRUE, FALSE))
```

In the code chunk above, we added something to the `add_column()` function: the `if_else()` function. We did this because we wanted to add a value in the column based on the value in another column. Furthermore, we used `.$` so that we get the two columns compared (using `==`). If the values in these two columns are the same, we add `TRUE` on the specific row. Here’s the new column added:

Note, you can also work with the `mutate()` function (also from dplyr) to add columns based on conditions. See this tutorial for more information about adding columns on the basis of other columns.
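As a hedged sketch of that `mutate()` alternative (the dataframe and column names here are made up, not the tutorial's data), a condition-based column could be computed like this:

```r
library(dplyr)

# Hypothetical data, only to illustrate mutate() + if_else()
df_example <- data.frame(A = c(1, 2, 3), B = c(1, 5, 3))
df_example <- df_example %>%
  mutate(C = if_else(A == B, "same", "different"))
```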

In the next section, we will have a look at how to work with the `mutate()` function to compute, and add, a new variable to the dataset.

Here’s how to compute and add a new variable (i.e., column) to a dataframe in R:

```
# insert new column with mutate
# (rowwise() is needed so c_across() computes a per-row mean)
dataf <- dataf %>%
  rowwise() %>%
  mutate(DepressionIndex = mean(c_across(Depr1:Depr5))) %>%
  ungroup()
head(dataf)
```

Notice how we, in the example code above, calculated a new variable called “DepressionIndex”, which is the mean of the 5 columns named Depr1 to Depr5. Obviously, we used the `mean()` function to calculate the mean of the columns. Notice how we also used the `c_across()` function, which, together with `rowwise()`, lets us calculate the mean across these columns row by row.

Note, now that you have added new columns to the dataframe, you may also want to rename factor levels in R with e.g. dplyr. In the next section, however, we will add multiple columns to a dataframe.

Here’s how you would insert multiple columns into the dataframe using the `add_column()` function:

```
# Add multiple columns
dataf <- dataf %>%
  add_column(New_Column1 = "1st Column Added",
             New_Column2 = "2nd Column Added")
```

In the example code above, we used the `add_column()` function to append two new columns to the dataframe. Here’s the first 6 rows of the dataframe with the added columns:

Note, if you want to add multiple columns, you just add an argument, as we did above, for each column you want to insert. It is, again, important that the length of each vector is the same as the number of rows in the dataframe (or 1, in which case it is recycled); otherwise, we will end up with an error.

In this section, you will learn how to add columns from one dataframe to another. Here’s how you append e.g. two columns from one dataframe to another:

```
# Read data from the .xlsx files:
dataf <- read_excel('./SimData/add_column.xlsx')
dataf2 <- read_excel('./SimData/add_column2.xlsx')
# Add the columns from the second dataframe to the first
dataf3 <- cbind(dataf, dataf2[c("Anx1", "Anx2", "Anx3")])
```

In the example above, we used the `cbind()` function together with a selection of the columns we wanted to add. Note that dplyr has the `bind_cols()` function, which can be used in a similar fashion. Now that you have put together your data sets, you can create dummy variables in R with e.g. the fastDummies package or calculate descriptive statistics.
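A small sketch of the `bind_cols()` alternative (the two dataframes below are hypothetical stand-ins for the Excel files):

```r
library(dplyr)

df_left <- data.frame(ID = 1:3)
df_right <- data.frame(Anx1 = c(2, 3, 1), Anx2 = c(4, 4, 5))
# bind_cols() pastes the columns side by side, much like cbind()
df_both <- bind_cols(df_left, df_right)
```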

In this post, you have learned how to add a column to a dataframe in R. Specifically, you have learned how to use the base functions available, as well as the add_column() function from Tibble. Furthermore, you have learned how to use the mutate() function from dplyr to append a column. Finally, you have also learned how to add multiple columns and how to add columns from one dataframe to another.

I hope you learned something valuable. If you did, please share the tutorial on your social media accounts, add a link to it in your projects, or just leave a comment below! Finally, suggestions and corrections are welcomed, also as comments below.

Here you will find some additional resources that you may find useful. The first three are especially interesting if you work with datetime objects (e.g., time series data):

- How to Extract Year from Date in R with Examples with e.g. lubridate (Tidyverse)
- Learn How to Extract Day from Datetime in R with Examples with e.g. lubridate (Tidyverse)
- How to Extract Time from Datetime in R – with Examples

If you are interested in other useful functions and/or operators these two posts might be useful:

- How to use %in% in R: 7 Example Uses of the Operator
- How to use the Repeat and Replicate functions in R

The post How to Add a Column to a Dataframe in R with tibble & dplyr appeared first on Erik Marsja.


In this tutorial, you will learn how to rename factor levels in R. First, we will use the base functions that are available in R, and then we will use dplyr.

To rename factor levels using `levels()`, we can assign a character vector with the new names. If we want to recode factor levels with dplyr, we can use the `recode_factor()` function.

This R tutorial has the following outline. First, we start by answering some simple questions. Second, we will have a look at what is required to follow this tutorial. Third, we will read an example data set so that we have something to practice on. Fourth, we will go into how to rename factor levels using 1) the levels() function, and 2) the recode_factor() function from the dplyr package.

One simple method to rename a factor level in R is `levels(your_df$Category1)[levels(your_df$Category1) == "A"] <- "B"`, where `your_df` is your data frame and `Category1` is the column containing your categorical data. This would recode the factor level “A” to the new “B”.

The simplest way to rename multiple factor levels is to use the `levels()` function. For example, to recode the factor levels “A”, “B”, and “C” you can use the following code: `levels(your_df$Category1) <- c("Factor 1", "Factor 2", "Factor 3")`. This would rename the levels to “Factor 1”, and so on.
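Both approaches can be tried on a small, made-up factor (the level names are illustrative only):

```r
# Hypothetical factor, just for illustration
category <- factor(c("A", "B", "C", "A"))
# Rename a single level:
levels(category)[levels(category) == "A"] <- "Alpha"
# Rename all levels at once (order follows levels(category)):
levels(category) <- c("Factor 1", "Factor 2", "Factor 3")
```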

In the next section, we will have a look at what is needed to follow this post.

To learn to recode factor levels by following the examples in this post, you need to download this data set. Furthermore, if you plan on using dplyr and the recode_factor() function, you will need to install this package. Here’s how to install an R package:

`install.packages("dplyr")`

Note that this package is very useful. You can, for instance, use dplyr to remove columns in R and calculate descriptive statistics. A quick tip, before going on to the tutorial part of the post: you can install dplyr, among plenty of other very good R packages, by installing the Tidyverse package. For example, you will get ggplot2, which can be used for data visualization (e.g., to create a scatter plot in R), and lubridate, to handle datetime data (e.g., to extract year from datetime). In the next section, we are going to read the example data from the .csv file.

Here is how to read a CSV file in R using the read.csv function:

```
# Import data (stringsAsFactors = TRUE so that text columns become
# factors; since R 4.0.0 the default is FALSE)
data <- read.csv("flanks.csv", stringsAsFactors = TRUE)
```

Note that you need to download the CSV file and store it in the same directory as your R script. Data can, of course, also be imported from other data sources. See the following tutorials for more information:

- How to Read & Write SPSS Files in R Statistical Environment
- R Excel Tutorial: How to Read and Write xlsx files in R
- How to Read and Write Stata (.dta) Files in R with Haven
- Reading SAS Files in R with Haven & sas7dbat

Now, we have the data frame called `data`. If we want to get information about the variables in the data frame, we can use the `str()` function:

In the image above, it is clear that we have a data frame containing 5 columns (i.e., variables). Notice that the first column probably is an index column, but we will leave it as it is. Of particular interest for this post, we can see that we have one column with a categorical variable called “TrialType”. Furthermore, we can see that this variable has two factor levels.

In this section, we are going to use `levels()` to change the names of the levels of a categorical variable. First, we will just assign a character vector with the new names. Second, we will use a list, renaming the factor levels by name.

Here’s how to change the names of factor levels using `levels()`:

```
# Renaming factor levels
levels(data$TrialType) <- c("Con", "InCon")
```

In the example above, we used the levels() function, selected the categorical variable that we wanted, and assigned a character vector with the new names. If we use the levels() function again, without assigning anything, we can see that we actually renamed the factor levels:

Note that if we try to assign a character vector containing too few, or too many, elements (i.e., names), it will not work. This will lead to an error (i.e., ‘*Error in `levels<-.factor`(`*tmp*`, value = "Con") : number of levels differs*’). Now that you have renamed the levels of a factor, you might want to clean the data frame from duplicate rows or columns. Furthermore, you can use the t() function to transpose in R (i.e., a matrix or a dataframe).

In the next example we will rename factor levels by name also using the levels() function.

Here’s how to rename the factor levels by name:

```
# Recode factor levels by name
levels(data$TrialType) <- list(Congruent = "Con", InCongruent = "InCon")
```

Here's the output from `str()`, in which we can see that we renamed the levels of the TrialType factor, again:

Note, however, that when we rename factor levels by name, as in the example above, ALL levels need to be present in the list; any level not in the list will be replaced with NA. That is, you could end up with only a single factor level and NA values. Not that good. In the next example, we are going to work with dplyr to change the names of the factor levels.
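This NA pitfall is easy to demonstrate on a tiny, hypothetical factor:

```r
category <- factor(c("Con", "InCon", "Con"))
# Only one of the two levels is listed in the renaming list...
levels(category) <- list(Congruent = "Con")
# ...so the "InCon" observations become NA and only one level remains
```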

Note, if you are planning on carrying out regression analysis and still want to use your categorical variables, you can at this point create dummy variables in R.

One of the simplest ways to rename factor levels is by using the `recode_factor()` function:

```
# Renaming factor levels with dplyr
library(dplyr)
data$TrialType <- recode_factor(data$TrialType,
                                congruent = "Con",
                                incongruent = "InCon")
```

In the code example above, we first loaded dplyr so that we get the `recode_factor()` function into our namespace. We then assign the renamed factor to the column containing our categorical variable. The `recode_factor()` function works in such a way that the first argument is the factor (or character vector), followed by one old level = "new name" pair for each level we want renamed.

As previously mentioned, dplyr is a very useful package. It can also be used to add a column to an R data frame based on other columns, or to simply add a column to a data frame. This can, of course, also be done with other packages that are part of the Tidyverse. Note that there are other ways to recode the levels of a factor in R. For instance, another package that is part of the Tidyverse, forcats, has functions for this.

In this tutorial, you have learned how to rename factor levels in R. First, we had a look at how to use the `levels()` function to recode the levels of factors. Second, we had a look at the `recode_factor()` function from the dplyr package to do the same. Hope you learned something valuable. Please share the tutorial on your social media accounts if you did.

Here are some other resources that you may find useful when working in R statistical environment:

- How to use %in% in R: 7 Example Uses of the Operator
- Learn How to Generate a Sequence of Numbers in R with :, seq() and rep()
- How to use the Repeat and Replicate functions in R
- More on working with datetime objects in R: How to Extract Day from Datetime in R with Examples and How to Extract Time from Datetime in R – with Examples
- R Resources for Psychologists - for a collection of useful resources

The post How to Rename Factor Levels in R using levels() and dplyr appeared first on Erik Marsja.

In this R tutorial, you will learn how to remove duplicate rows and columns from a data frame. We will use the duplicated() and unique() functions from base R. Furthermore, we will use the distinct() function from the dplyr package.

The post How to Remove Duplicates in R – Rows and Columns (dplyr) appeared first on Erik Marsja.

In this R tutorial, you will learn how to remove duplicates from the data frame. First, you will learn how to delete duplicated rows and, second, you will remove columns. Specifically, we will have a look at how to remove duplicate records from the data frame using 1) base R, and 2) dplyr.

The post starts out with answering a few questions (e.g., “How do I remove duplicate rows in R?”). In the second section, you will learn about what is required to follow this R tutorial. That is, you will learn about the dplyr (and Tidyverse) package and how to install them. When you have what you need to follow this R tutorial, we will create a data frame containing both duplicated rows and columns that we can use to practice on. In the next 5 sections, we will have a look at the example of how to delete duplicates in R. First, we will use Base R and the duplicated() and unique() functions. Second, we will use the distinct() function from dplyr.

To delete duplicate rows in R you can use the `duplicated()` function. Here’s how to remove all the duplicate rows from the data frame called “study_df”: `study_df.un <- study_df[!duplicated(study_df), ]`.

Now that we know how to extract unique elements from the data frame (i.e., drop duplicate rows), we are going to learn, briefly, about what is needed to follow this post.

Apart from having R installed, you also need to have the dplyr package installed (this package can be used to rename factor levels in R, as well). That is, you need dplyr if you want to use the distinct() function to remove duplicate data from your data frame. R packages are, of course, easy to install: you can install dplyr using the `install.packages()` function. Here’s how to install packages in R:

```
# Installing packages in R:
install.packages("dplyr")
```

It is worth noting here that dplyr is part of the Tidyverse package. This package is super useful because it comes with other awesome packages such as ggplot2 (see how to create a scatter plot in R with ggplot2, for example), readr, and tibble. To name a few! That said. Let’s create some example data to practice dropping duplicate records from!

Now, to practice removing duplicate rows and columns we need some data. Here’s some data with two duplicated rows and two duplicated columns:

```
# Creating a data frame (the second Gender column is
# automatically renamed to Gender.1):
example_df <- data.frame(FName = c('Steve', 'Steve', 'Erica',
                                   'John', 'Brody', 'Lisa', 'Lisa', 'Jens'),
                         LName = c('Johnson', 'Johnson', 'Ericson',
                                   'Peterson', 'Stephenson', 'Bond', 'Bond',
                                   'Gustafsson'),
                         Age = c(34, 34, 40,
                                 44, 44, 51, 51, 50),
                         Gender = c('M', 'M', 'F', 'M',
                                    'M', 'F', 'F', 'M'),
                         Gender = c('M', 'M', 'F', 'M',
                                    'M', 'F', 'F', 'M'))
```

The data frame has 8 rows and 5 columns (we can use the `dim()` function to see this). Here’s the data frame with the duplicate rows and columns:

Most of the time, of course, we import our data from an external source. See the following posts for more information:

- R Excel Tutorial: How to Read and Write xlsx files in R
- How to Read & Write SPSS Files in R Statistical Environment
- Reading SAS Files in R with Haven & sas7dbat
- How to Read and Write Stata (.dta) Files in R with Haven

In the next section, we are going to start by removing the duplicate rows using base R.

Here’s how to remove duplicate rows in R using the `duplicated()` function:

```
# Remove duplicates from data frame:
example_df[!duplicated(example_df), ]
```

As you can see in the output above, we have now removed one copy of each of the two duplicated rows from the data frame. What we did was to create a boolean vector marking the rows that are duplicated in our data frame, and then use it to select rows. Notice how we used the `!` operator to select the rows that *were not* duplicated. Finally, we also used the “,” so that we select all columns.
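The boolean vector itself can be seen in a tiny sketch on a plain vector:

```r
# duplicated() flags second and later occurrences
x <- c("a", "a", "b", "c", "c")
duplicated(x)      # FALSE  TRUE FALSE FALSE  TRUE
x[!duplicated(x)]  # keep the first occurrence of each value
```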

In the image above, we can see that two rows have been removed. Of course, if you want the changes to be permanent, you need to use <-:

```
# Delete duplicate rows
example_df.un <- example_df[!duplicated(example_df), ]
```

Note, there are other useful operators, such as the %in% operator in R, that can be used for e.g. value matching.
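As an aside for readers who also work in Python, the same row-deduplication idea is available in pandas. This is only an illustration with made-up data, not part of the original R tutorial:

```python
import pandas as pd

# A small stand-in for the example data frame used above
example_df = pd.DataFrame({
    'FName': ['Steve', 'Steve', 'Erica', 'Lisa', 'Lisa'],
    'LName': ['Johnson', 'Johnson', 'Ericson', 'Bond', 'Bond'],
})

# keep='first' mirrors duplicated()'s default of keeping the first occurrence
deduped = example_df.drop_duplicates(keep='first')
```

Here, `drop_duplicates()` plays the role of `example_df[!duplicated(example_df), ]` in R.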

In the next example, we are going to use the `duplicated()` function to remove one of the two identical columns (i.e., “Gender” and “Gender.1”).

To remove duplicate columns we can, again, use the `duplicated()` function:

```
# Drop Duplicated Columns:
ex_df.un <- example_df[!duplicated(as.list(example_df))]
# Dimensions
dim(ex_df.un)
# 8 Rows and 4 Columns
# First six rows:
head(ex_df.un)
```

Now, to remove duplicate columns we added the `as.list()` function and removed the “,”. That is, we changed the syntax from Example 1 slightly. Again, we can use the `dim()` function to see that we have dropped one column from the data frame. Here’s also the result from the `head()` function:
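For pandas users, a similar column-deduplication trick exists: transposing the data frame turns duplicate columns into duplicate rows, which `duplicated()` can then flag. Again, this is an aside with made-up data, not part of the original R tutorial:

```python
import pandas as pd

df = pd.DataFrame({
    'Gender':   ['M', 'M', 'F'],
    'Gender.1': ['M', 'M', 'F'],   # an exact copy of the first column
    'Age':      [34, 34, 40],
})

# duplicated() on the transpose flags the redundant columns;
# ~ inverts the mask, just like ! in R
unduplicated = df.loc[:, ~df.T.duplicated()]
```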

Note, dplyr can be used to remove columns from the data frame as well. In the next example, we are going to use another base R function to delete duplicate data from the data frame: the `unique()` function.

Here’s how you can remove duplicate rows using the `unique()` function:

```
# Deleting duplicates:
examp_df <- unique(example_df)
# Dimension of the data frame:
dim(examp_df)
# Output: 6 5
```

As you can see, using the `unique()` function to remove the identical rows in the data frame is quite straightforward. It is worth noting here that if you want to keep the last occurrences of the duplicate rows, you can use the `fromLast` argument and set it to `TRUE`. If you are done carrying out data manipulation, you can now create a dummy variable in R, for example.

In the final two examples, we are going to use the `distinct()` function from the dplyr package to remove duplicate rows.

Here’s how to drop duplicates in R with the `distinct()` function:

```
# Deleting duplicates with dplyr
ex_df.un <- example_df %>%
distinct()
```

In the code example above, we used the function distinct() to keep only unique/distinct rows from the data frame. When working with the `distinct()` function, if there are duplicate rows, only the first of the identical rows is preserved. Note, if you want to you can now go on and add an empty column to your data frame. This is something you can do with tibble, a package that is part of the Tidyverse. In the final example, we are going to look at an example in which we drop rows based on one column.

It is also possible to delete duplicate rows based on values in a certain column. Here's how to remove duplicate rows based on one column:

```
# remove duplicate rows with dplyr
example_df %>%
# Base the removal on the "Age" column
distinct(Age, .keep_all = TRUE)
```

In the example above, we used the column as the first argument. Second, we used the .keep_all argument to keep all the columns in the data frame. If we now use the `dim()` function again, we can see that we have 5 rows and 5 columns. Let’s print the data frame to see which rows we dropped.
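As a pandas aside (an illustration with made-up data, not part of the original R tutorial), `distinct(Age, .keep_all = TRUE)` has a direct counterpart in the `subset` parameter of `drop_duplicates()`:

```python
import pandas as pd

df = pd.DataFrame({
    'Age':  [34, 34, 40, 44, 44],
    'Name': ['Steve', 'Anna', 'Erica', 'John', 'Brody'],
})

# Deduplicate on 'Age' only, but keep all columns,
# mirroring distinct(Age, .keep_all = TRUE) in dplyr
unique_by_age = df.drop_duplicates(subset='Age', keep='first')
```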

Although we usually do not want to remove rows that merely share a value in a column such as the age of the participants in a study, there might be times when we want to remove duplicates in R based on a single column. Furthermore, we can add more columns and drop rows based on whether there are identical values across more than one column. Now that you have removed duplicate rows and columns from your data frame, you might want to use R to add a column to the data frame based on other columns.

In this short R tutorial, you have learned how to remove duplicates in R. Specifically, you have learned how to carry out this task by using two base functions (i.e., duplicated() and unique()) as well as the distinct() function from dplyr. Furthermore, you have learned how to drop rows and columns that are occurring as identical copies in, at least, two cases in your data frame.

Here are some other tutorials you may find useful:

- How to Transpose a Dataframe or Matrix in R with the t() Function
- How to use the Repeat and Replicate functions in R
- How to Generate a Sequence of Numbers in R with :, seq() and rep()

The post How to Remove Duplicates in R – Rows and Columns (dplyr) appeared first on Erik Marsja.


In this Python tutorial, you will learn how to 1) perform Bartlett’s Test, and 2) Levene’s Test. Both test the assumption of equal variances. Equality of variances (also known as homogeneity of variance, and homoscedasticity) in population samples is assumed in commonly used comparisons of means, such as Student’s t-test and analysis of variance (ANOVA). Therefore, tests such as Levene’s or Bartlett’s can be conducted to examine the assumption of equal variances across group samples.

A brief outline of the post is as follows. First, you will get a couple of questions answered. Second, you will briefly learn about the hypotheses of both Bartlett’s and Levene’s tests of homogeneity of variances. After this, we continue by having a look at the Python packages required to follow this post. In the next section, you will read data from a CSV file so that we can continue by learning how to carry out both tests of equality of variances in Python. That is, the last two sections before the conclusion will show you how to carry out Bartlett’s and Levene’s tests.

Bartlett’s test of **homogeneity of variances** is a test, much like Levene’s test, of whether the variances are equal for all samples. If your data is **normally distributed**, you can use Bartlett’s test instead of Levene’s.

Levene’s test can be carried out to check that variances are equal for all samples. The test can be used to check the assumption of equal variances before running a parametric test like a one-way ANOVA in Python. If your data does not follow a normal distribution, Levene’s test is preferred over Bartlett’s.

Simply put, equal variances, also known as homoscedasticity, means that the variances are approximately the same across the samples (i.e., groups). If our samples have unequal variances (heteroscedasticity), on the other hand, this can affect the Type I error rate and lead to false positives. This is, basically, what equality of variances means.

Whether conducting Levene’s Test or Bartlett’s Test of homogeneity of variance, we are dealing with two hypotheses. These are, simply put:

- **Null Hypothesis**: the variances are equal across all samples/groups
- **Alternative Hypothesis**: the variances are *not* equal across all samples/groups

This means, for example, that if we get a p-value larger than 0.05 we can assume that our data is homoscedastic, and we can continue by carrying out a parametric test such as the two-sample t-test in Python. If we, on the other hand, get a statistically significant result, we may want to carry out the Mann-Whitney U test in Python instead.
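The decision rule above can be sketched as a tiny helper. The function name and the default alpha of 0.05 are my own choices for illustration; pick the significance level that fits your study:

```python
def equal_variances_assumed(p_value, alpha=0.05):
    """Fail to reject the null hypothesis of equal variances when p > alpha."""
    return p_value > alpha

# p = 0.30: no evidence against equal variances, a parametric test is fine
# p = 0.01: variances appear to differ, consider a non-parametric alternative
```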

In this post, we will use the following Python packages:

- Pandas will be used to import the example data
- SciPy and Pingouin will be used to carry out Levene’s and Bartlett’s tests in Python

Of course, if you have your data in any other format (e.g., NumPy arrays) you can skip using Pandas and work with e.g. SciPy anyway. However, to follow this post it is required that you have the Python packages installed. In Python, you can install packages using Pip or Conda, for example. Here’s how to install all the needed packages:

`pip install scipy pandas pingouin`

Note, to use pip to install specific versions of packages you can type:

`pip install scipy==1.5.2 pandas==1.1.1 pingouin==0.3.7`

Make sure to check out how to upgrade pip if you have an old version installed on your computer. That said, let’s move on to the next section in which we start by importing example data using Pandas.

To illustrate how to perform the two tests of equality of variance in Python, we will need a dataset with at least two columns: one with numerical data, the other with categorical data. In this example, we are going to use the PlantGrowth.csv data, which contains exactly two such columns. Here’s how to read a CSV with Pandas:

```
import pandas as pd
# Read data from CSV
df = pd.read_csv('PlantGrowth.csv',
index_col=0)
df.shape
```
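If you do not have the PlantGrowth.csv file at hand, a made-up stand-in with the same shape (30 rows, a numeric `weight` column, and a three-level `group` column) can be generated like this. The values are simulated, not the real PlantGrowth data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    # weights drawn around 5 with some spread, just for practice
    'weight': rng.normal(loc=5.0, scale=0.6, size=30).round(2),
    # ten observations per group, like the PlantGrowth data
    'group': np.repeat(['ctrl', 'trt1', 'trt2'], 10),
})
```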

If we use the `shape` attribute, we can see that we have 30 rows and 2 columns in the dataframe. Now, we can also print the column names of the Pandas dataframe. This will give us information about the names of the variables. Finally, we may also want to see which data types we have in the data. This can, among other things, be obtained using the `info()` method:

`df.info()`

As we can see, in the image above, the two columns are of the data types float and object. More specifically, the column *weight *is of float data type and the column called *group *is an object. This means that we have a dataset with categorical variables. Exactly what we need to practice carrying out the two tests of homogeneity of variances.

In the next section, we are going to learn how to carry out Bartlett’s test in Python with first SciPy and, then, Pingouin. Note, when we are using Pingouin we are actually using SciPy but we get a nice table with the results and can, using the same Python method, carry out Levene’s test. That said, let’s get started with testing the assumption of homogeneity of variances!

In this section, you will learn two methods (i.e., using two different Python packages) for carrying out Bartlett’s test in Python. First, we will use SciPy:

Here’s how to do Bartlett’s test using SciPy:

```
from scipy.stats import bartlett
# subsetting the data:
ctrl = df.query('group == "ctrl"')['weight']
trt1 = df.query('group == "trt1"')['weight']
trt2 = df.query('group == "trt2"')['weight']
# Bartlett's test in Python with SciPy:
stat, p = bartlett(ctrl, trt1, trt2)
# Get the results:
print(stat, p)
```

As you can see in the code chunk above, we started by importing the `bartlett` function from the `scipy.stats` module. Now, `bartlett()` takes the different samples’ data as arguments. This means that we need to subset the Pandas dataframe we previously created. Here we used the Pandas `query()` method to subset the data for each group. In the final line, we used the `bartlett()` function to carry out the test. Here are the results:

Remember the null and alternative hypothesis of the two tests we are learning in this blog post? Good, because judging from the output above, we cannot reject the null hypothesis and can, therefore, assume that the groups have equal variances.
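For intuition, Bartlett's statistic is straightforward to compute by hand: it compares the pooled variance against the individual sample variances on a log scale, and is referred to a chi-square distribution with k - 1 degrees of freedom. The sketch below (with made-up samples, not the PlantGrowth data) checks the manual result against SciPy:

```python
import numpy as np
from scipy.stats import bartlett, chi2

def bartlett_by_hand(*samples):
    """Bartlett's T statistic and its chi-square p-value, computed manually."""
    k = len(samples)
    n = np.array([len(s) for s in samples])
    s2 = np.array([np.var(s, ddof=1) for s in samples])  # sample variances
    N = n.sum()
    sp2 = np.sum((n - 1) * s2) / (N - k)                 # pooled variance
    numer = (N - k) * np.log(sp2) - np.sum((n - 1) * np.log(s2))
    denom = 1 + (np.sum(1 / (n - 1)) - 1 / (N - k)) / (3 * (k - 1))
    T = numer / denom
    return T, chi2.sf(T, k - 1)

# Check against SciPy on made-up data:
rng = np.random.default_rng(1)
x, y, z = (rng.normal(0, sd, 10) for sd in (1.0, 1.2, 0.8))
stat, p = bartlett_by_hand(x, y, z)
ref_stat, ref_p = bartlett(x, y, z)
```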

Note, you can get each group by using the `unique()` method. For example, to get the three groups we can type `df['group'].unique()` and we will get this output.

Here’s another method to carry out Bartlett’s test of equality of variances in Python:

```
import pingouin as pg
# Bartlett's test in Python with pingouin:
pg.homoscedasticity(df, dv='weight',
group='group',
method='bartlett')
```

In the code chunk above, we used the `homoscedasticity` method and passed the Pandas dataframe as the first argument. As you can see, using this method to carry out Bartlett’s test is a bit easier. That is, using the next two parameters we specify the dependent variable and the grouping variable. This means that we don’t have to subset the data as when using SciPy directly. Finally, we used the method parameter to carry out Bartlett’s test. As you will see in the next section, if we leave this parameter out we will carry out Levene’s test.

Now as you may already know, and as stated earlier in the post, Bartlett’s test should only be used if data is normally distributed. In the next section, we will learn how to carry out an alternative test that can be used for non-normal data.

In this section, you will learn two methods to carry out Levene’s test of homogeneity of variances in Python. As in the previous section, we will start by using SciPy and continue with Pingouin.

To carry out Levene’s test with SciPy we can do as follows:

```
from scipy.stats import levene
# Create three arrays for each sample:
ctrl = df.query('group == "ctrl"')['weight']
trt1 = df.query('group == "trt1"')['weight']
trt2 = df.query('group == "trt2"')['weight']
# Levene's Test in Python with Scipy:
stat, p = levene(ctrl, trt1, trt2)
print(stat, p)
```

In the code chunk above, we started by importing the `levene` function from the `scipy.stats` module. Much like when using the `bartlett` function, `levene` takes each group’s data as an argument (i.e., one array per group). Again, we have to subset the Pandas dataframe containing our data. Subsetting the data is, again, done using the Pandas `query()` method. In the final line, we used the `levene()` function to carry out the test.
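It is worth knowing that SciPy's `levene()` has a `center` parameter: `'median'` (the default, also known as the Brown-Forsythe test, which is robust to skewed data) and `'mean'` (the original Levene formulation). A sketch with made-up samples where one group clearly has a larger spread:

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(7)
g1 = rng.normal(0, 1.0, 40)
g2 = rng.normal(0, 1.0, 40)
g3 = rng.normal(0, 4.0, 40)  # much larger spread than the other two

# 'median' (Brown-Forsythe) is the default; 'mean' is the classic Levene test
stat_median, p_median = levene(g1, g2, g3, center='median')
stat_mean, p_mean = levene(g1, g2, g3, center='mean')
```

With such a pronounced difference in spread, both variants should reject the null hypothesis of equal variances.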

Here’s the second method to perform Levene’s test of homoscedasticity in Python:

```
import pingouin as pg
# Levene's Test in Python using Pingouin
pg.homoscedasticity(df, dv='weight',
group='group')
```

In the code chunk above, we used the `homoscedasticity` method. This method takes the data, in this case our dataframe, as the first parameter. As when carrying out Bartlett’s test with this package, it is easier to use than SciPy directly. The next two parameters are where we specify the dependent variable and the grouping variable. This is quite handy, as we don’t have to subset the dataset ourselves. Note that we don’t have to use the method parameter (as when performing Bartlett’s test) because the `homoscedasticity` method will, by default, perform Levene’s test.

Now, as testing the assumption of equality of variances with Pingouin is, in fact, using SciPy under the hood, the results are of course the same regardless of which Python method is used. In this case, with the example data we used, the samples have roughly equal variances. Good news, if we want to compare the groups on their mean values!

In this Python tutorial, you have learned to carry out two tests of equality of variances. First, we carried out Bartlett’s test of homogeneity of variance using SciPy and Pingouin. This test, however, should only be used on normally distributed data. Therefore, we also learned how to carry out Levene’s test using the same two Python packages! Finally, we also learned that Pingouin uses SciPy to carry out both tests but works as a simple wrapper for the two SciPy methods and is very easy to use, especially if our data is stored in a Pandas dataframe.

The post Levene’s & Bartlett’s Test of Equality (Homogeneity) of Variance in Python appeared first on Erik Marsja.


In this R tutorial, you are going to learn how to **add a column to a dataframe based on values in other columns**. Specifically, you will learn to create a new column using the mutate() function from the package dplyr, along with some other useful functions.

Finally, we are also going to have a look on how to add the column, based on values in other columns, at a specific place in the dataframe. This will be done using the add_column() function from the Tibble package.

It is worth noting, that both tibble and dplyr are part of the Tidyverse package. Apart from adding columns to a dataframe, you can use dplyr to remove columns, with the select() function, for example.

In this post, we will first learn how to install the r-packages that we are going to use. Second, we are going to import example data that we can play around with and add columns based on conditions. After we have a dataframe, we will then go on and have a look at how to add a column to the dataframe with values depending on other columns. In these sections, we will use the mutate() and add_column() functions to accomplish the same task. That is, we will use these R functions to add a column based on conditions.

As this is an R tutorial, you will, of course, need to have R and, at least, the dplyr package installed. If you want to e.g. easily add a column, based on values in another column, at a specific position, I would suggest that you install tibble. Furthermore, if you are going to read the example .xlsx file you will also need to install the readxl package. Note, however, that if you install the tidyverse package you will get tibble, dplyr and readxl, among a lot of other useful packages.

Installing Tidyverse enables you to easily calculate descriptive statistics and visualize data (e.g., scatter plots with ggplot2). Furthermore, there’s another useful package, also part of the Tidyverse, called lubridate. Lubridate is very handy if you are working with time-series data. For example, you can use the functions of this package to extract the year from a date in R, as well as extract the day and the time. As usual, when installing R packages we use the `install.packages()` function:

`install.packages(c('tibble', 'dplyr', 'readxl'))`

Note, if you want to install all packages available in the tidyverse package, just exchange the character vector for ‘tidyverse’ (`install.packages('tidyverse')`). Now that you should be set with these useful packages, we can start reading the example Excel file.

Here’s how to read an xlsx file in R using the `read_excel` function from the readxl package:

```
library(readxl)
# reading the xlsx file:
depr_df <- read_excel('./SimData/add_column.xlsx')
```

In the code chunk above, we imported the Excel file, which can be downloaded here. This file needs to be placed in the same directory as the R script (or you need to change the path to the .xlsx file). Finally, we can have a glimpse of the data by using the head() function:

In the output, we can see that our dataset contains the following columns:

- ID – Subject ID
- A
- B
- Cost
- Depr1 – First item on a depression scale
- Depr2 – Second item
- Depr3 – And so on…
- Depr4 – …
- Depr5

Note that all variables in this data set are made up and, thus, the data makes no sense. We are, of course, only going to use it so that we can practice adding new columns based on conditions on values in other columns. Now that we have our data we are jumping into the first example directly!

If we want to add a column based on the values in another column, we can work with dplyr. Here’s how to append a column based on what the value in an existing column ends with:

```
library(dplyr)
# Adding column based on other column:
depr_df %>%
mutate(Status = case_when(
endsWith(ID, "R") ~ "Recovered",
endsWith(ID, "S") ~ "Sick"
))
```

As you can see in the code chunk above, we used the `%>%` operator and the `mutate()` function together with the `case_when()` and `endsWith()` functions. Furthermore, we created the “Status” column (in mutate): if the ID ends with “R”, the value in the new column will be “Recovered”. On the other hand, if the ID ends with “S”, the value in the new column will be “Sick”. Here’s the resulting dataframe to which we appended the new column:

Now, the `%>%` operator is very handy and, of course, there are more nice operators, as well as functions, in the R statistical programming environment. See the following posts for more inspiration (or information):

- How to use %in% in R: 7 Example Uses of the Operator
- Learn How to Generate a Sequence of Numbers in R with :, seq() and rep()
- How to use the Repeat and Replicate functions in R

In the next section, we will continue learning how to add a column to a dataframe in R based on values in other columns.

In the first example, we are going to add a new column based on whether the values in the columns “A” and “B” match. Here’s how to add a new column to the dataframe based on the condition that two values are equal:

```
# R adding a column to dataframe based on values in other columns:
depr_df <- depr_df %>%
mutate(C = if_else(A == B, A + B, A - B))
```

In the code example above, we added the column “C”. Here we used dplyr and the `mutate()` function. As you can see, we also used the `if_else()` function to check whether the values in columns “A” and “B” were equal. If they were equal, we added the values together. If not, we subtracted the values. Here’s the resulting dataframe with the column added:

Notice how there was only one row in which the values matched and, in that row, our code added the values together. Of course, if we want to create e.g. groups based on whether the values in two columns are the same or not, we can change a few things in the `if_else()` function. For example, we can use this code:

```
# creating a column to dataframe based on values in other columns:
depr_df <- depr_df %>%
mutate(C = if_else(A == B, "Equal", "Not Equal"))
```

In the next code example, we are going to create a new column summarizing the values from five other columns. This can be useful, for instance, if we have collected data from e.g. a questionnaire measuring psychological constructs.

Here we are going to use the values in the columns named “Depr1” to “Depr5” and summarize them to create a new column called “DeprIndex”:

```
# Adding new column based on the sum of other columns:
depr_df <- depr_df %>% rowwise() %>%
mutate(DeprIndex = sum(c_across(Depr1:Depr5)))
```

To explain the code above: here we also used the `rowwise()` function before the `mutate()` function. As you may understand, we use the first function to perform row-wise operations. Furthermore, we used the `sum()` function to summarize the columns we selected using the `c_across()` function.

Note, if you need to you can rename the levels of a factor in R using dplyr, as well. In the final example, we are going to continue working with these columns. However, we are going to add a new column based on different cutoff values. That is, we are going to create multiple groups out of the score summarized score we have created.

In this example, we are going to create a new column in the dataframe based on 3 conditions. That is, we are going to use the values in the “DeprIndex” column and create 3 different groups depending on the value in each row.

```
# Multiple conditions when adding new column to dataframe:
depr_df %>% mutate(Group =
case_when(DeprIndex <= 15 ~ "A",
DeprIndex <= 20 ~ "B",
DeprIndex >= 21 ~ "C")
)
```

Again, we used mutate() together with case_when(). Here, in this example, we created a new column in the dataframe and added values based on whether “DeprIndex” was smaller than or equal to 15, smaller than or equal to 20, or larger than or equal to 21.
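As a pandas/NumPy aside (an illustration with made-up values, not part of the original R tutorial), `numpy.select()` mirrors the ordered-condition behavior of `case_when()`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'DeprIndex': [12, 18, 25]})

# Conditions are evaluated in order, like case_when() in dplyr:
# the first matching condition determines the value
df['Group'] = np.select(
    [df['DeprIndex'] <= 15, df['DeprIndex'] <= 20, df['DeprIndex'] >= 21],
    ['A', 'B', 'C'],
)
```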

This is cool! We’ve created another new column that categorizes each subject based on our arbitrary depression scale. We could now go on and calculate descriptive statistics in R, by this new group, if we want to. In the final example, we are going to use Tibble and the `add_column()` function that we previously used to add an empty column to a dataframe in R.

In the final example, we are going to use add_column() to append a column based on values in another column. Here’s how to append a column based on whether a value in one column is smaller than a given value:

```
library(tibble)
depr_df <- depr_df %>%
add_column(Is_Depressed =
if_else(.$DeprIndex < 18, TRUE, FALSE),
.after="ID")
```

Notice how we now use tibble and the add_column() function. Again, we use the %>% operator, and then inside the function we use if_else(). Here’s the trick: we used “.$” to access the column “DeprIndex”, and if the value is smaller than 18 we add TRUE to the cell in the new column. Obviously, if it is 18 or larger, FALSE will be added. The new column that we have created is called “Is_Depressed” and is a boolean:

Importantly, to add the new column at a specific position, we used the .after argument. As you can see in the image above, we created the new column after the “ID” column. If we want to append our column before a specific column, we can use the .before argument. Now, you might want to continue preparing your data for statistical analysis. For more information, you can have a look at how to create dummy variables in R.

In this R tutorial, you have learned how to add a column to a dataframe based on conditions and/or values in other columns. First, we had a look at a simple example in which we created a new column based on the values in another column. Second, we appended a new column based on a condition. That is, we checked whether the values in the two columns were the same and created a new column based on this. In the third example, we had a look at more complex conditions (i.e., 3 conditions) and added a new variable with 3 different factor levels. Finally, we also had a look at how we could use `add_column()` to append the column where we wanted it in the dataframe.

Hope you found this post useful! If you did, make sure to share the post to show some love! Also, you can become a Patreon to support my work. Finally, make sure you leave a comment if you want something clarified or you found an error in the post!

The post R: Add a Column to Dataframe Based on Other Columns with dplyr appeared first on Erik Marsja.


The post How to use %in% in R: 7 Example Uses of the Operator appeared first on Erik Marsja.

In this tutorial, you will learn by examples how to use the %in% operator in R. Specifically, you will learn 7 different uses of this great operator.

Here’s the outline of this post, described in a bit more detail than the table of contents. First, we start out with a couple of simple examples of how to use the `%in%` operator. Specifically, we will have a look at how to use the operator when testing whether two vectors contain overlapping sequences of numbers and letters. As you may already have expected, the operator can be used in other, maybe more advanced, cases. In the following sections, therefore, we are going to have a look at how we can work with this operator and dataframes. For example, you will see that you can use the operator to create new variables, remove columns, and select columns.

The `%in%` operator in R can be used to identify if an element (e.g., a number) belongs to a vector or dataframe. For example, it can be used to see if the number 1 is in the sequence of numbers 1 to 10.

The `%in%` operator is used for value matching: it returns a logical vector indicating, for each element of its first argument, whether there is a match in its second. The `==` operator, on the other hand, is a relational operator used to compare whether two elements are exactly equal. Using the `%in%` operator, you can compare vectors of different lengths to see if elements of one vector match at least one element in another. The length of the output will be equal to the length of the first vector (the one being compared). This is not possible with the `==` operator.

The use of the %in% operator is to match values in e.g. two different vectors, as already answered in the two previous questions. You can also use the operator to select certain columns in the dataframe or to subset the dataframe.

Now that you know what `%in%` is in R, and what the difference between this operator and `==` is, we can go on and have a look at the example usages.

In this section, we are going through 7 examples of how to use %in% in R. As you already know, we will start by working with vectors. After that, we will have a look at how to use the operator when working with dataframes.

In this example, we will use `%in%` to check if two vectors contain overlapping numbers. Specifically, we will have a look at how we can get a logical value for each element of one vector, indicating whether it is also present in another vector. Here’s the first example of an excellent usage of the operator:

```
# sequence of numbers 1:
a <- seq(1, 5)
# sequence of numbers 2:
b <- seq(3, 12)
# using the %in% operator to check matching values in the vectors
a %in% b
```

In the code above, we get an output as long as the first vector (i.e., a). Furthermore, we used the `seq()` function to create the first sequence of numbers in R, and then another. In a real-world example, our vectors might not contain sequences but just random numbers. If we, on the other hand, want to test which elements of a longer vector are in a shorter vector, we do as follows:

```
# shorter vector:
a <- seq(12, 19)
# longer vector:
b <- seq(1, 16)
# test which elements of the longer vector are in the shorter:
b %in% a
```

As you can see, both methods above result in a boolean vector. Additionally, if we use the which() function, we can get the indexes of the overlapping elements:

```
# Using the operator together with the which() function
which(seq(1, 10) %in% seq(4, 12))
```

In the next example, we will see that we can apply the same methods for letters, or factors, in R. That is, we will test if two vectors, containing letters, are overlapping.

In this example, we will use `%in%`

to check if two vectors contain overlapping letters. Note, this can also be done for words (e.g., factors). First, we will compare letters in a shorter vector and in a longer vector. Here’s how to compare two vectors containing letters:

```
# Sequences of Letters:
a <- LETTERS[1:10]
# Second sequence of letters:
b <- LETTERS[4:10]
# which elements of a are in b:
a %in% b
```

As you can see, and probably already figured out, we used the `%in%` operator in exactly the same way as for vectors containing sequences of numbers. Again, we can reverse the comparison and test which elements of b are in a:

`b %in% a`

Naturally, as with the examples where we used sequences of numbers in R, the result when working with letters, words, or factors is a boolean vector. Furthermore, as in the first example, we can use the `which()` function to get the indexes:

```
g <- c("C", "D", "E")
h <- c("A", "E", "B", "C", "D", "E", "A", "B", "C", "D", "E")
which(h %in% g)
```
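For Python users, NumPy offers the same idea: `isin()` plays the role of `%in%`, and `flatnonzero()` mirrors `which()` (with zero-based rather than one-based indices). An aside using the same g and h values as above, not part of the original R tutorial:

```python
import numpy as np

g = np.array(['C', 'D', 'E'])
h = np.array(['A', 'E', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'])

mask = np.isin(h, g)        # like h %in% g in R
idx = np.flatnonzero(mask)  # like which(h %in% g), but zero-based
```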

Finally, here’s an example of why using the `%in%` operator is better than `==`. If we use `which()` together with `==`, the shorter vector is recycled (with a warning, since the lengths do not match up) and we only get three of the matching positions:

```
# %in% vs ==: the equality operator gives the wrong result here
which(g == h)
```

In the next example, we will work with a dataframe instead of vectors. First, however, we are going to load the readxl package to read a .xlsx file in R. Here’s how we get our dataframe to play around with:

```
library(readxl)
library(httr)
#URL to Excel File:
xlsx_URL <- 'https://mathcs.org/statistics/datasets/titanic.xlsx'
# Get the .xlsx file as a temporary file
GET(xlsx_URL, write_disk(tf <- tempfile(fileext = ".xlsx")))
# Reading the temporary .xlsx file in R:
dataf <- read_excel(tf)
# Checking the dataframe:
head(dataf)
```

A quick note, before going on to the third example, is that readxl as well as dplyr, a package that we will use later, are part of the Tidyverse package. If you install Tidyverse you will get some powerful tools to extract year from date in R, carry out descriptive statistics, visualize data (e.g., scatter plots with ggplot2), to name a few.

In this example, we will have a look at a very simple use of this operator. Namely, we are going to use `%in%` to check if a value is in one of the columns in a dataframe:

```
# %in% column
2 %in% dataf$boat
```

Now, if you have read through the first two examples, you already know that we get a boolean value: TRUE means that the column contained the value we sought. Notice also how we used the `$` operator to select one of the columns.

Here’s how to use the `%in%`

operator to create a new variable:

```
# Creating a dataframe:
dataf2 <- data.frame(Type = c("Fruit", "Fruit", "Fruit", "Fruit", "Fruit",
                              "Vegetable", "Vegetable", "Vegetable",
                              "Vegetable", "Fruit"),
                     Name = c("Red Apple", "Strawberries", "Orange",
                              "Watermelon", "Papaya", "Carrot", "Tomato",
                              "Chili", "Cucumber", "Green Apple"),
                     Color = c(NA, "Red", "Orange", "Red", "Green",
                               "Orange", "Red", "Red", "Green", "Green"))
# Adding a New Column:
dataf2 <- within(dataf2, {
  Red_Fruit <- "No"
  Red_Fruit[Type %in% c("Fruit")] <- "No"
  Red_Fruit[Type %in% "Vegetable"] <- "No"
  Red_Fruit[Name %in% c("Red Apple", "Strawberries", "Watermelon",
                        "Chili", "Tomato")] <- "Yes"
})
```

Notice how we make use of the %in% operator. Here’s the dataframe, with the added column “Red_Fruit”:

In another post, you will learn how to use R to add a column to a dataframe based on conditions and/or values in other columns.

In this example, we are going to use the `%in%`

operator to subset the data:

```
library(dplyr)
home.dests <- c("St Louis, MO", "New York, NY", "Hudson, NY")
# Subsetting using %in% in R:
dataf %>%
  filter(home.dest %in% home.dests)
```

Notice how we created a vector of the elements that we want to be included in our new, subsetted, dataframe. Furthermore, we also used the dplyr package and the filter() function together with the %in% operator. Finally, we get the resulting, subsetted, dataframe:

In the next section, we will have a look at another way we may use the %in% operator: namely, to drop columns from a dataframe.

In this example, we are going to use `%in%`

to drop columns from the dataframe:

```
# Drop columns using %in% operator in R
dataf[, !(colnames(dataf) %in% c("pclass", "embarked", "boat"))]
```

In the code chunk above, we used the `!` operator to tell R that we do not want to select these columns. Running the code above will result in a new dataframe with the columns removed:

Note, it is also possible to use dplyr to remove columns in R. For example, using the select() function together with the pipe operator may result in a slightly more readable code.

In the next example, we are going to have a look at how we can use the `%in%`

operator to do the opposite of dropping columns. That is, we are going to select columns, instead.

Let us use the `%in%`

operator to select a number of variables from the dataframe:

```
# Select columns using %in%:
dataf[, (colnames(dataf) %in% c("pclass", "embarked", "boat"))]
```

Note that we removed the `!` before the parentheses, which tells R to select these columns (see example 6, above, for the opposite).

Selecting columns, instead of deleting them, might be a more efficient way to go if we have a lot of variables in our dataset and we want to create a new dataframe with only some of them.

In the final bonus section, we are going to see how we can negate the %in% operator. We are going to do this because there is no built-in “not in” operator in R.

Here’s how we can create our own *not in* operator in R:

```
# Creating a not in operator:
`%notin%` <- Negate(`%in%`)
```

Now, we can use this new R “not in” operator to check if, e.g., a number is not in a vector:

```
# Generating a sequence of numbers:
numbs <- rep(seq(3), 4)
# Using the not in operator:
4 %notin% numbs
# Output: [1] TRUE
```

Finally, it is worth noting that there are some R packages that contain “not in” functions. For example, the package mefa4 has the %notin% function.

In this R tutorial, you have learned 7 ways you can use the %in% operator in R. Specifically, you have learned how to compare vectors of numbers and letters (factors). You have also learned how to check if a value is in a column (as well as how many times it occurs), how to add a new variable, remove columns, and select columns.

The post How to use %in% in R: 7 Example Uses of the Operator appeared first on Erik Marsja.

In this hands-on tutorial, you will learn how to transpose a matrix and a dataframe in the R statistical programming environment.

The post How to Transpose a Dataframe or Matrix in R with the t() Function appeared first on Erik Marsja.

In this brief tutorial, you will learn how to transpose a dataframe or a matrix in the R statistical programming environment. Transposing rows and columns is quite a simple task if your data is 2-dimensional (e.g., a matrix or a dataframe). If you have, for example, a 3-dimensional array, the function we are going to use in this post will not work.

In this post, we will start by answering the question regarding which function we can use to transpose a given matrix. After that, we will create a simple matrix and, then, we will rotate it. In the following section, we will create a dataframe from the matrix. Finally, we will go on to the section where we will interchange rows with columns in the dataframe, as well.

To interchange rows with columns, you can use the `t()`

function. For example, if you have the matrix (or dataframe) mat you can transpose it by typing `t(mat)`

. This will, as previously hinted, result in a new **matrix** that is obtained by exchanging the rows and columns.

For learning more about useful functions and operators see for example the post about how to use %in% in R.

In this section, we are going to create the matrix that we later are going to transpose. Here we are going to use the `matrix()`

function:

```
# Creating a matrix:
mat <- matrix(1:15, ncol = 5)
```

This will produce a matrix with 3 rows and 5 columns. Here’s the result:

Note, in the example above we created a sequence of numbers in R using the `:`

operator.

Here’s how to transpose a matrix in R with the t() function:

```
# Transpose matrix:
mat.t <- t(mat)
```

In the image below, you can see that we now have a transposed matrix. That is, we now have a matrix in which we have rotated the rows and columns. This means that the rows have become columns and the columns have become rows:

Now, using this function (i.e., transposing) also creates a new matrix object. We can see this by using the `class()`

function:

```
class(mat.t)
# [1] "matrix" "array"
```

In the next section, we will create a dataframe and, in the following section, we will use the `t()`

function to rotate the dataframe, as well.

In this section, we are going to use the functions `data.frame()`

and `matrix()`

to create a dataframe:

```
# Creating a dataframe
dataf <- data.frame(matrix(1:15, ncol = 5))
# Setting the column and row names
colnames(dataf) <- c('A', 'B', 'C', 'D', 'E')
rownames(dataf) <- c('G', 'H', 'I')
```

Additionally, we set the column names to “A”, “B”, “C”, “D”, and “E” using the `colnames()` function, and the row names to “G”, “H”, and “I” using the `rownames()` function. Now, we have a small dataframe that we can rotate!

Naturally, dataframes can be created in many other ways. More common ways are, in fact, to read data from some kind of file format. See the following blog posts to learn more:

- How to Read & Write SPSS Files in R Statistical Environment
- R Excel Tutorial: How to Read and Write xlsx files in R
- How to Read and Write Stata (.dta) Files in R with Haven

To transpose a dataframe in R we can apply exactly the same method as we did with the matrix earlier. That is, we rotate the dataframe with the `t()`

function. Here’s how to rotate a dataframe:

```
# transpose a dataframe
t(dataf)
```

Here’s the resulting, transposed, dataframe. Notice how the row names now are the column names:

Note, if your dataframe contains categorical data and you need to change the name of these you can use R to rename the levels of a factor. If we, on the other hand, have an array we cannot rotate it. Here’s what happens if we try:

```
x <- array(rep(1, 12*4*2), dim = c(12, 4, 2))
t(x)
# Error in t.default(x) : argument is not a matrix
```

Now, there are many other tasks that you might find yourself in need of doing. For example, if you need to drop variables from your dataframe, you can use dplyr to remove columns. Additionally, you can also extract the year from a datetime in R or create dummy variables in R.

In this post, you have learned one of the simplest methods to reshape your data. First, you learned how to transpose a matrix. Second, you learned how to rotate a dataframe. You also learned that you cannot use t() to transpose an array. Finally, along with t(), there are other useful R functions worth mentioning, such as the repeat and replicate functions.

In this Pandas tutorial, you will learn how to count occurrences in a column using the value_counts() method.

The post Pandas Count Occurrences in Column – i.e. Unique Values appeared first on Erik Marsja.

In this Pandas tutorial, you are going to learn how to count occurrences in a column. There are occasions in data science when you need to know how many times a given value occurs. This can happen when you, for example, have a limited set of possible values that you want to compare. Another example can be if you want to count the number of duplicate values in a column. Furthermore, we may want to count the number of observations there are in a factor, or we may need to know how many men or women there are in the data set, for example.

In this post, you will learn how to use Pandas `value_counts()`

method to count the occurrences in a column in the dataframe. First, we start by importing the needed packages and then we import example data from a CSV file. Second, we will start looking at the value_counts() method and how we can use this to count distinct occurrences in a column. Third, we will count the number of occurrences of a specific value in the dataframe. In the last section, we will have a look at an alternative method that also can be used: the groupby() method together with `size()`

and `count()`

. Now, let’s start by importing pandas and some example data to play around with!

To count the number of occurrences in e.g. a column in a dataframe you can use the Pandas `value_counts()`

method. For example, if you type `df['condition'].value_counts()`

you will get the frequency of each unique value in the column “condition”.

We use Pandas read_csv to import data from a CSV file found online:

```
import pandas as pd
# URL to .csv file
data_url = 'https://vincentarelbundock.github.io/Rdatasets/csv/carData/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)
```

In the code example above, we first imported Pandas and then we created a string variable with the URL to the dataset. In the last line of code, we imported the data and named the dataframe “df”. Note, we used the `index_col`

parameter to set the first column in the .csv file as index column. Briefly explained, each row in this dataset includes details of a person who has been arrested. This means, and is true in many cases, that each row is one observation in the study. If you store data in other formats refer to the following tutorials:

- How to Read SAS Files in Python with Pandas
- Pandas Excel Tutorial: How to Read and Write Excel files
- How to Read & Write SPSS Files in Python using Pandas

In this tutorial, we are mainly going to work with the “sex” and “age” columns. It may be obvious but the “sex” column classifies an individual’s gender as male or female. The age is, obviously, referring to a person’s age in the dataset. We can take a quick peek of the dataframe before counting the values in the chosen columns:

If you have another data source, you can also add a new column to the dataframe. Although we get some information about the dataframe using the `head()` method, you can get a list of the column names using the `columns` attribute. Many times, we only need to know the column names when counting values.

Of course, in most cases, you would count occurrences in your own data set but now we have data to practice counting unique values with. In fact, we will now jump right into counting distinct values in the column “sex”.

Here’s how to count occurrences (unique values) in a column in Pandas dataframe:

```
# pandas count distinct values in column
df['sex'].value_counts()
```

As you can see, we selected the column “sex” using brackets (i.e. `df['sex']`

), and then we just used the `value_counts()`

method. Note, if we want to store the counted values, we can assign the result to a new variable. For example, after `gender_counted = df['sex'].value_counts()` we can fetch the number of men with `gender_counted['Male']`, or positionally with `gender_counted.iloc[0]`, since the largest count comes first.

As you can see, the method returns the count of all unique values in the given column in descending order, without any null values. By glancing at the above output we can, furthermore, see that there are more men than women in the dataset. In fact, the results show us that the vast majority are men.
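As a quick, self-contained sketch of how the returned counts can be accessed (using made-up data here, not the Arrests dataset), note that the result is a regular Series indexed by the unique values and sorted by frequency:

```python
import pandas as pd

# Toy column (hypothetical data, for illustration only)
gender = pd.Series(['Male', 'Male', 'Male', 'Female', 'Female'])

# value_counts() returns counts in descending order
counts = gender.value_counts()

# Access a count by label, or by position (largest count first)
print(counts['Male'])   # 3
print(counts.iloc[0])   # 3
```

Because the counts come back sorted in descending order, position 0 always holds the most frequent value.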

Now, as with many Pandas methods, `value_counts()`

has a couple of parameters that we may find useful at times. For example, if we want to reorder the output so that the least frequent value (female, in this case) is shown first, we can set the `ascending` parameter to `True`:

```
# pandas count unique values ascending:
df['sex'].value_counts(ascending=True)
```

Note, both of the examples above will drop missing values. That is, they will not be counted at all. There are cases, however, when we may want to know how many missing values there are in a column as well. In the next section, we will therefore have a look at another parameter that we can use (i.e., `dropna`

). First, however, we need to add a couple of missing values to the dataset:

```
import numpy as np
# Copying the dataframe (.copy() so the original df is left untouched)
df_na = df.copy()
# Adding 10 missing values to the dataset
df_na.iloc[[1, 6, 7, 8, 33,
            44, 99, 103, 109, 201], 4] = np.nan
```

In the code above, we used Pandas iloc method to select rows and NumPy’s nan to add the missing values to these rows that we selected. In the next section, we will count the occurrences including the 10 missing values we added, above.

Here’s a code example to get the number of unique values as well as how many missing values there are:

```
# Counting occurrences as well as missing values:
df_na['sex'].value_counts(dropna=False)
```

Looking at the output we can see that there are 10 missing values (yes, yes, we already knew that!).
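To make the effect of `dropna` concrete, here is a minimal sketch with made-up data (not the Arrests dataset):

```python
import pandas as pd
import numpy as np

# Toy column with two missing values (hypothetical data)
s = pd.Series(['Male', 'Female', np.nan, 'Male', np.nan])

# Default: missing values are not counted at all
print(s.value_counts())

# dropna=False: NaN appears as its own row in the counts
print(s.value_counts(dropna=False))
```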

Now that we have counted the unique values in a column we will continue by using another parameter of the `value_counts()`

method: `normalize`

. Here’s how we get the relative frequencies of men and women in the dataset:

`df['sex'].value_counts(normalize=True)`

This may be useful if we not only want to count the occurrences but want to know e.g. what percentage of the sample that are male and female. Before moving on to the next section, let’s get some descriptive statistics of the age column by using the `describe()`

method:

`df['age'].describe()`

Naturally, counting age as we did earlier, with the column containing gender, would not provide any useful information. Here’s the data output from the above code:

We can see that there are 5226 values of age data, a mean of 23.85, and a standard deviation of 8.32. Naturally, counting the unique values of the age column would produce a lot of headaches but, of course, it could be worse. In the next example, we will have a look at counting age and how we can bin the data. This is useful if we want to count e.g. continuous data.

Another cool feature of the `value_counts()`

method is that we can use it to bin continuous data into discrete intervals. Here’s how we set the `bins` parameter to an integer representing the number of bins to create:

```
# pandas count unique values in bins:
df['age'].value_counts(bins=5)
```

For each bin, the range of age values (in years, naturally) is roughly the same. One contains ages from 11.45 to 22.80, a range of 11.35. The next bin contains ages from 22.80 to 33.60, a range of 10.80. In this example, you can see that all ranges are roughly the same (the first bin’s lower edge is extended slightly so that the minimum value is included). However, each range can contain a different count of the number of persons within that age range. We can see most people that are arrested are under 22.8, followed by under 33.6. Kind of makes sense, in this case, right? In the next section, we will have a look at how we can count the unique values in all columns in a dataframe.
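As a small, self-contained sketch of the binning behavior (with made-up ages, not the Arrests data), note that the bins are equal width, with the lowest edge extended slightly so the minimum value is included:

```python
import pandas as pd

# Toy ages (hypothetical data, for illustration only)
ages = pd.Series([12, 18, 22, 30, 35, 40])

# Bin the continuous values into 2 equal-width intervals and count
binned = ages.value_counts(bins=2)
print(binned)
```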

Naturally, it is also possible to count the occurrences in many columns using the `value_counts()`

method. Now, we are going to start by creating a dataframe from a dictionary:

```
# Create a dict of lists
data = {'Language': ['Python', 'Python', 'Javascript', 'C#', 'PHP'],
        'University': ['LiU', 'LiU', 'UmU', 'GU', 'UmU'],
        'Age': [22, 22, 23, 24, 23]}
# Creating a dataframe from the dict
df3 = pd.DataFrame(data)
df3.head()
```

As you can see in the output, above, we have a smaller data set which makes it easier to show how to count the frequency of unique values in all columns. If you need, you can convert a NumPy array to a Pandas dataframe, as well. That said, here’s how to use the apply() method:

`df3.apply(pd.Series.value_counts)`

What we did, in the code example above, was to apply `value_counts()` to every column of the dataframe. However, this is really not a feasible approach if we have larger datasets. In fact, the unique counts we get for this rather small dataset are not that readable:

It is, of course, also possible to get the number of times a certain value appears in a column. Here’s how to use Pandas `value_counts()`

, again, to count the occurrences of a specific value in a column:

```
# Count occurrences of a certain value (i.e. Male) in a column (i.e., sex)
df.sex.value_counts().Male
```

In the example above, we used the dataset we imported in the first code chunk (i.e., Arrests.csv). Furthermore, we selected the column containing gender and used the value_counts() method. Because we wanted to count the occurrences of a certain value, we then selected Male. The output shows us that there are 4783 occurrences of this value in the column.

As often, when working with programming languages, there are more approaches than one to solve a problem. Therefore, in the next example, we are going to have a look at some alternative methods that involve grouping the data by category using Pandas groupby() method.

In this section, we are going to learn how to count the frequency of occurrences across different groups. For example, we can use `size()`

to count the number of occurrences in a column:

```
# count unique values with pandas size:
df.groupby('sex').size()
```

Another method to get the frequency we can use is the `count()`

method:

```
# counting unique values with pandas groupby and count:
df.groupby('sex').count()
```

Now, in both examples above, we grouped the data by the column we wanted to count, just as we selected a column in the `value_counts()` examples we saw earlier. Note that `size()` produces essentially the same output as the previous method, and to keep your code clean I suggest that you use `value_counts()`. Finally, it is also worth mentioning that using the `count()` method will produce the count of non-missing values, grouped, for each column. This is clearly redundant information:
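To make the difference between `size()` and `count()` concrete, here is a minimal sketch with made-up data: `size()` counts rows per group, while `count()` counts non-missing values per column per group, which is why its output repeats the information across columns:

```python
import pandas as pd
import numpy as np

# Toy dataframe (hypothetical data, for illustration only)
df_demo = pd.DataFrame({'sex': ['M', 'M', 'F'],
                        'age': [20, np.nan, 30]})

# size(): number of rows in each group (NaN rows included)
sizes = df_demo.groupby('sex').size()

# count(): non-missing values per column, per group
counts = df_demo.groupby('sex').count()

print(sizes)
print(counts)
```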

In this Pandas tutorial, you have learned how to count occurrences in a column using 1) `value_counts()`

and 2) `groupby()`

together with `size()`

and `count()`

. Specifically, you have learned how to get the frequency of occurrences in ascending and descending order, including missing values, calculating the relative frequencies, and binning the counted values.

In this post, you will learn how to generate sequences of numbers in R using the : operator, the seq() and rep() functions.

The post How to Generate a Sequence of Numbers in R with :, seq() and rep() appeared first on Erik Marsja.

In this R tutorial, you will learn how to generate sequences of numbers in R. There are many reasons why we would want to generate sequences of numbers. For example, we may want to generate sequences when plotting the axes of figures or simulating data.

As often there is no one way to perform a specific task in R. In this post, we are going to use the `:`

operator, the `seq()`

, and `rep()`

functions. First, we start having a look at the : operator. Second, we dive into the seq() function including the arguments that we can use. Third, we will have a look at how we can use the rep() function to generate e.g. sequences of the same numbers or a few numbers.

The absolutely simplest way to create a sequence of numbers in R is by using the `:`

operator. Here’s how to create a sequence of numbers, from 1 to 10:

`1:10`

As you can see, in the image above, that gave us every integer from 1 and 10 (an integer is every positive or negative counting number, including 0). Furthermore, the created sequence is in ascending order (i.e., from the smallest number to the largest number). We will soon learn how to generate a sequence in descending order. First, however, if we want our sequence, from 1 to 10, to be saved as a variable we have to use the `<-`

and create a vector:

`numbers <- 1:10`

Now, you might already have guessed that we can just change the order of the smallest and largest numbers to generate a sequence of numbers in descending order:

`25:1`

Note, that if you want to know more about a particular R function, you can access its documentation with a question mark followed by the function name: ?function_name_here.

In this particular case, however, with an operator like the colon we used above, you must enclose the symbol in backticks, like this: ?`:`. Before we go to the next section, it is worth mentioning that you can also use R to transpose a matrix or a dataframe.

Often, we desire more control over a sequence we are creating than what the `:`

operator will give us. The `seq()`

function serves this purpose and is a generalization of the `:`

operator, which creates a sequence of numbers with a specified arithmetic progression.

Now, the most basic use of `seq()`, however, works the same way as the `:` operator does. For example, if you type `seq(1, 10)` this will become clear. That is, running this command will generate the same sequence as in the first example:

`seq(1, 10)`

Evidently, we got the same output as using the `:`

operator. If we have a look at the documentation, we can see that there are a number of arguments that we can work with:

As you can see in the image above (or in the documentation): the first two arguments of `seq()`

are “from =” and “to =”. In R, we do not have to use the names of the arguments. That is, if we write out their values in the same order as in the function definition, it will produce the same result as using the names. It is worth noting, however, that for more complex functions, best practice is to use the argument names. This also makes the code much clearer. For example, we can generate a sequence of descending numbers like this:

`seq(from = 20, to = 1)`

In the next subsection, we will have a look at the “by=” argument that enables us to define the increment of the sequence.

In some cases we may want, instead of every integer from 1 to 20, a vector of numbers ranging from 0 to 20 incremented by, e.g., 2. Here’s how to create a sequence of numbers with a specified increment step:

`seq(0, 20, by = 2)`

As you can see, in the image below, this produces a vector with fewer numbers but every number is increased by 2. In the next section, we will have a look at how to specify how many numbers we want to generate between two specified numbers.

Here’s how we can use the length.out argument to generate 10 evenly spaced numbers between 1 and 30:

```
# sequence
nums <- seq(1, 30, length.out = 10)
```

Now, this generated 10 floating-point numbers between 1 and 30. If we want to check whether there really are 10 numbers in our vector we can use the `length()` function:

`length(nums)`

Now, as previously mentioned, there are often many different approaches for solving the same problem. This is, of course, also true for the R statistical programming language. In general, choosing the simplest approach, which includes as little code as possible, is probably the way to go. That said, we will go on to the next section, where we will learn how to get a sequence of the same number (e.g., “0”). In a more recent post, you will learn 7 examples of when and how to use the %in% operator in R.

To get a repeated sequence of a number we can use the rep() function. Here’s how to create a vector containing 10 repetitions of the number 0:

`rep(0, 10)`

Now, the rep() function can also be used together with the `:`

operator, the `c()`

or the `seq()`

functions.

In this example, we are going to get the numbers 1, 2, 3 repeated 10 times. Here’s how to repeat a sequence of numbers:

```
# Repeat a sequence of numbers:
rep(c(1, 2, 3), times=10)
```

If we, on the other hand, want to replicate the sequence 1 to 5 ten times, we can use the `:` operator together with `rep()`:

```
# Repeating a sequence of numbers ten times
rep(1:5, times=10)
```

Finally, it is also possible to get each number, that we want in our sequence, to be generated a specified amount of times:

`rep(1:5, each=10)`

Note, if we want to repeat a function or generate e.g. sequences of numbers, we can use the repeat and replicate functions in R as well.

If you want to generate non-random numbers you can use the : operator. For instance, to generate numbers between 1 and 10 you type 1:10. Another option is to use the `seq()`

function.

If you want to create a sequence vector containing numbers you use the : operator. For example, `1:15`

will generate a vector with numbers between 1 and 15. To gain more control you can use the seq() method.

To repeat a sequence of numbers in R you can use the rep() function. For example, if you type `rep(1:5, times=5)`

you will get a vector with the sequence 1 to 5 repeated 5 times.

Check out the following posts if you need to extract elements from datetime in e.g. a vector:

- How to Extract Year from Date in R with Examples
- How to Extract Day from Datetime in R with Examples
- How to Extract Time from Datetime in R – with Examples

In this post, you have learned how to get a sequence of numbers using the : operator, the seq() and rep() functions. Specifically, you learned how to create numbers with a specified increment step. You have also learned how to repeat a number to get a sequence of the same numbers.

In this short NumPy tutorial, you will learn how to convert a float array to an integer array in Python.

The post How to Convert a Float Array to an Integer Array in Python with NumPy appeared first on Erik Marsja.

In this short NumPy tutorial, we are going to learn how to convert a float array to an integer array in Python. Specifically, here we are going to learn by example how to carry out this rather simple conversion task. First, we are going to change the data type from float to integer in a 1-dimensional array. Second, we are going to convert float to integer in a 2-dimensional array.

Now, sometimes we may want to round the numbers before we change the data type. Thus, we are going through a couple of examples as well, in which we 1) round the numbers with the `around()` method, 2) round the numbers up to the nearest integer with the `ceil()` method, and 3) round the float numbers down to the nearest integer with the `floor()` method. Note, all code can be found in a Jupyter Notebook.

First, however, we are going to create an example NumPy 1-dimensional array:

```
import numpy as np
# Creating a 1-d array with float numbers
oned = np.array([0.1, 0.3, 0.4,
0.6, -1.1, 0.3])
```

As you can see, in the code chunk above, we started by importing NumPy as np. Second, we created a 1-dimensional array with the `array()`

method. Here’s the output of the array containing float numbers:

Now, we are also going to be converting a 2-dimensional array so let’s create this one as well:

```
# Creating a 2-d float array
twod = np.array([[ 0.3, 1.2, 2.4, 3.1, 4.3],
[ 5.9, 6.8, 7.6, 8.5, 9.2],
[10.11, 11.1, 12.23, 13.2, 14.2],
[15.2, 16.4, 17.1, 18.1, 19.1]])
```

Note, if you have imported your data with Pandas you can also convert the dataframe to a NumPy array. In the next section, we will be converting the 1-dimensional array to integer data type using the astype() method.

Here’s how to convert a float array to an integer array in Python:

```
# convert array to integer python
oned_int = oned.astype(int)
```

Now, if we want to change the data type (i.e. from float to int) in the 2-dimensional array we will do as follows:

```
# python convert array to int
twod_int = twod.astype(int)
```
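As a quick sketch of what the conversion does with negative values (made-up numbers, for illustration): `astype(int)` drops the decimal part, truncating toward zero rather than rounding.

```python
import numpy as np

# Truncation toward zero: -1.1 becomes -1, not -2
arr = np.array([0.1, 0.6, -1.1, 2.9])
print(arr.astype(int))  # [ 0  0 -1  2]
```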

Now, in the output from both conversion examples above, we can see that the decimal parts were simply dropped. That is, `astype(int)` truncates toward zero; it does not round. In some cases, we may want the float numbers to be rounded according to common practice. Therefore, in the next section, we are going to use the `around()` method before converting.

Now, if we want to, we can convert the NumPy array to a Pandas dataframe, as well as carry out descriptive statistics.

Here’s how to use the `around()`

method before converting the float array to an integer array:

```
oned = np.array([0.1, 0.3, 0.4,
0.6, -1.1, 0.3])
oned = np.around(oned)
# numpy convert to int
oned_int = oned.astype(int)
```

Now, we can see in the output that the float numbers are rounded to the nearest integer and then converted to integers. Here’s the output of the converted array:

Here’s how we can use the ceil() method before converting the array to integer:

```
oned = np.array([0.1, 0.3, 0.4,
0.6, -1.1, 0.3])
oned = np.ceil(oned)
# numpy float to int
oned_int = oned.astype(int)
```

Now, we can see the difference in the output containing the converted float numbers:

Here’s how to round the numbers down to the nearest integer and change the data type from float to integer:

```
oned = np.array([0.1, 0.3, 0.4,
0.6, -1.1, 0.3])
oned = np.floor(oned)
# numpy float to int
oned_int = oned.astype(int)
```

In the image below, we see the results of using the floor() method before converting the array. It is, of course, possible to carry out the rounding task before converting a 2-dimensional float array to integer, as well.
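To compare the three rounding strategies side by side, here is a small sketch (the values are made up); note that `np.around()` rounds halfway cases to the nearest even value:

```python
import numpy as np

arr = np.array([0.1, 0.6, -1.1, 2.5])

# Round to nearest (half to even), round up, and round down:
rounded = np.around(arr).astype(int)
up = np.ceil(arr).astype(int)
down = np.floor(arr).astype(int)

print(rounded)  # [ 0  1 -1  2]
print(up)       # [ 1  1 -1  3]
print(down)     # [ 0  0 -2  2]
```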

Here’s the link to the Jupyter Notebook containing all the code examples found in this post.

In this NumPy tutorial, we have learned a simple conversion task. That is, we have converted a float array to an integer array. To change the data type of the array we used the `astype()`

method. Hope you learned something. Please share the post across your social media accounts if you did! Support the blog by becoming a patron. Finally, if you have any suggestions, comments, or anything you want me to cover in the blog: leave a comment below.

In this post, you will learn how to carry out the non-parametric test known as the Mann-Whitney U test with Python. Specifically, you will learn how to carry out this test with SciPy and Pingouin.

The post How to Perform Mann-Whitney U Test in Python with Scipy and Pingouin appeared first on Erik Marsja.

In this data analysis tutorial, you will learn how to carry out a Mann-Whitney U test in Python with the packages SciPy and Pingouin. This test is also known as the Mann–Whitney–Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test and is a non-parametric hypothesis test.

In this tutorial, you will learn when and how to use this non-parametric test. After that, we will see an example of a situation when the Mann-Whitney U test can be used. The example is followed by how to install the needed package (i.e., SciPy) as well as a package that makes importing data easy and that we can quickly visualize the data to support the interpretation of the results. In the following section, you will learn the 2 steps to carry out the Mann-Whitney-Wilcoxon test in Python. Note, we will also have a look at another package, Pingouin, that enables us to carry out statistical tests with Python. Finally, we will learn how to interpret the results and visualize data to support our interpretation.

This test is a rank-based test that can be used to compare values for two groups. If we get a significant result it suggests that the values for the two groups are different. As previously mentioned, the Mann-Whitney U test is equivalent to a two-sample Wilcoxon rank-sum test.

Furthermore, we don’t have to assume that our data follow the normal distribution; the test can be used to decide whether the population distributions are identical. Now, the Mann–Whitney test does not address hypotheses about the medians of the groups. Rather, it addresses whether it is likely that an observation in one group is greater than an observation in the other group. In other words, it concerns whether one sample has stochastic dominance over the other.

The test assumes that the observations are independent. That is, it is not appropriate for paired observations or repeated measures data.

- One-way data with two groups (i.e., two-sample data),
- Your dependent variable is ordinal, interval, or ratio,
- The independent variable is a factor with two levels (again, only two groups; see the first point),
- Observations between groups are independent; that is, not paired or repeated measures data,
- To be a test of medians, the distributions of values for both groups have to be of similar shape and spread. Under other conditions, the Mann-Whitney U test is by and large a test of stochastic equality.

As with the two-sample t-test, there are normally two hypotheses:

- Null hypothesis (H_{0}): The two groups are sampled from populations with identical distributions. Typically, the sampled populations exhibit stochastic equality.
- Alternative hypothesis (H_{a}): The two groups are sampled from populations with different distributions (see the previous section). Most of the time, this means that one of the sampled populations (groups) displays stochastic dominance.

If the results are significant they can be reported as “The values for men were significantly different from those for women.”, if you are examining differences in values between men and women.

You can use the Mann-Whitney U test when your outcome/dependent variable is either ordinal or continuous but not normally distributed. Furthermore, this non-parametric test is used when you want to compare differences between two independent groups (e.g., as an alternative to the two-sample t-test).

To conclude, you should use this test instead of e.g., two-sample t-test using Python if the above information is true for your data.
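Before reaching for the Mann-Whitney U test, one quick (and informal) way to support that decision is to run a normality test on the outcome variable. Here’s a minimal sketch using SciPy’s Shapiro-Wilk test; the values are purely illustrative:

```python
from scipy.stats import shapiro

# Hypothetical outcome values (e.g., counts pooled from two groups)
counts = [7, 5, 6, 4, 12, 9, 8, 3, 6, 4, 2, 1, 5, 1]
stat, p = shapiro(counts)
# A small p-value suggests the data deviate from normality
print(f'W = {stat:.3f}, p = {p:.3f}')
```

Keep in mind that normality tests have low power with small samples, so visual inspection (e.g., a histogram) is a useful complement.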

In this section, before moving on to how to carry out the test, we will have a quick look at an example when you should use the Mann-Whitney U test.

Suppose, for example, that you run an intervention study designed to examine the effectiveness of a new psychological treatment to reduce symptoms of depression in adults. Let’s say that you have a total of n = 14 participants. Furthermore, these participants are randomized to receive either the treatment or no treatment at all. In your study, the participants are asked to record the number of depressive episodes over a 1-week period following receipt of the assigned treatment. Here are some example data:

In this example, the question you might want to answer is: is there a difference in the number of depressive episodes over a 1-week period in participants receiving the new treatment compared to those receiving no treatment? By inspecting your data, it appears that participants receiving no treatment have more depressive episodes. The crucial question, however, is: is this **statistically significant**?

In this example, the outcome variable is the number of episodes (a count) and, naturally, in this sample, the data do not follow a normal distribution. Note, Pandas was used to create the above histogram.

To follow this tutorial you will need to have Pandas and SciPy installed. Now, you can get these packages using your favorite Python package manager. For example, installing Python packages with pip can be done as follows:

`pip install scipy pandas pingouin`

Note, both Pandas and Pingouin are optional. However, using these packages has, as you will see later, its advantages. Hint: Pandas makes data importing easy. If you ever need to, you can also use pip to install a specific version of a package.

In this section, we will go through the steps to carry out the Mann-Whitney U test using Pandas and SciPy. In the first step, we will get our data. After the data is stored in a dataframe, we will carry out the non-parametric test.

Here’s one way to import data to Python with Pandas:

```
import pandas as pd

# Getting our data into a dictionary
data = {'Notrt': [7, 5, 6, 4, 12, 9, 8],
        'Trt': [3, 6, 4, 2, 1, 5, 1]}
# Dictionary to dataframe
df = pd.DataFrame(data)
```

In the code chunk above, we created a Pandas dataframe from a dictionary. Of course, most of the time we will have our data stored in formats such as CSV or Excel.

See the following posts about how to import data in Python with Pandas:

- Pandas Read CSV Tutorial: How to Read and Write
- How to Read & Write SPSS Files in Python using Pandas
- Pandas Excel Tutorial: How to Read and Write Excel files
- How to use Pandas read_html to Scrape Data from HTML Tables

It’s also worth noting that *if* your data is stored in long format, you will have to subset the data such that you can get the data from each group into two different variables.

Here’s how to perform the Mann-Whitney U test in Python with SciPy:

```
from scipy.stats import mannwhitneyu
# Carrying out the Wilcoxon–Mann–Whitney test
results = mannwhitneyu(df['Notrt'], df['Trt'])
results
```

Notice that we selected the columns, for each group, as the x and y parameters to the `mannwhitneyu` method. If your data, as previously mentioned, is stored in long format (e.g., see the image further down below) you can use the Pandas `query()` method to subset the data.

Here’s how to perform the test, using `df.query()`, if your data is stored in a similar way as in the image above:

```
import pandas as pd
from scipy.stats import mannwhitneyu

idrt = [i for i in range(1, 8)]
idrt += idrt
data = {'Count': [7, 5, 6, 4, 12, 9, 8,
                  3, 6, 4, 2, 1, 5, 1],
        'Condition': ['No Treatment']*7 + ['Treatment']*7,
        'IDtrt': idrt}
# Dictionary to dataframe
df = pd.DataFrame(data)
# Subsetting (i.e., creating new variables):
x = df.query('Condition == "No Treatment"')['Count']
y = df.query('Condition == "Treatment"')['Count']
# Mann-Whitney U test:
mannwhitneyu(x, y)
```

Now, there are some things to be explained here. First, older versions of the `mannwhitneyu` method carried out a one-sided test by default (in more recent SciPy releases, the default is a two-sided test). If we, on the other hand, use the parameter alternative and set it to “two-sided” we get different results. Make sure you check out the documentation before using the method. In the next section, we will have a look at another, previously mentioned, Python package that can also be used to do the Mann-Whitney U test.
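For instance, here’s a minimal sketch that requests the two-sided test explicitly (note that which of the two U values is returned as the statistic depends on your SciPy version, so check the documentation for the version you have installed):

```python
from scipy.stats import mannwhitneyu

# Example data from the code chunks above (no treatment vs. treatment)
no_trt = [7, 5, 6, 4, 12, 9, 8]
trt = [3, 6, 4, 2, 1, 5, 1]

# Explicitly request the two-sided alternative
res = mannwhitneyu(no_trt, trt, alternative='two-sided')
print(res.statistic, res.pvalue)
```

Being explicit about the alternative hypothesis makes the analysis easier to reproduce across SciPy versions.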

As previously mentioned, we can also install the Python package Pingouin to carry out the Mann-Whitney U test. Here’s how to perform this test with the `mwu()` method:

```
from pingouin import mwu

results2 = mwu(df['Notrt'], df['Trt'],
               tail='one-sided')
```

Now, the advantage of using the mwu method is that we get some additional information (e.g., the common language effect size, CLES). Here’s the output:

In this section, we will start off by interpreting the results of the test. Now, this is pretty straightforward.

In our example, we can reject H_{0} because the obtained U statistic (3) is smaller than the critical value (7). Furthermore, we have statistically significant evidence at *α* = 0.05 to show that the treatment groups differ in the number of depressive episodes. Naturally, in a real application, we would have set both H_{0} and H_{a} prior to conducting the hypothesis test, as we did here.
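If you also want an effect size to report alongside the U statistic and p-value, the rank-biserial correlation is one common option that can be computed directly from U (Pingouin’s CLES is another). A minimal sketch, using the example data from the code chunks above; note how U itself is just a count of pairwise wins (1) and ties (0.5):

```python
# Example data from the code chunks above
no_trt = [7, 5, 6, 4, 12, 9, 8]
trt = [3, 6, 4, 2, 1, 5, 1]
n1, n2 = len(no_trt), len(trt)

# U for the first group: wins count 1, ties count 0.5
u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0
         for a in no_trt for b in trt)
u = min(u1, n1 * n2 - u1)  # the smaller of the two U values

# Rank-biserial correlation
rbc = 1 - (2 * u) / (n1 * n2)
print(round(rbc, 3))  # 0.816
```

Values close to 1 indicate that observations in one group are almost always larger than observations in the other.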

To aid the interpretation of our results we can create box plots with Pandas:

```
axarr = df.boxplot(column='Count', by='Condition',
figsize=(8, 6), grid=False)
axarr.set_title('')
axarr.set_ylabel('Number of Depressive Episodes')
```

In the box plot, we can see that the median is greater for the group that did not get any treatment compared to the group that got treatment. Furthermore, if there were any outliers in our data they would show up as dots in the box plot. If you are interested in more data visualization techniques have a look at the post “9 Data Visualization Techniques You Should Learn in Python”.

In this post, you have learned how to perform the Mann-Whitney U test using the Python packages SciPy, Pandas, and Pingouin. Moreover, you have learned when to carry out this non-parametric test both by learning about e.g. when it is appropriate and by an example. After this, you learned how to carry out the test using data from the example. Finally, you have learned how to interpret the results and visualize the data. Note that you preferably should have a larger sample size than in the example of the current post. Of course, you should also make the decision on whether to carry out a one-sided or two-sided test based on theory. In the example of this post, we can assume that going without treatment would mean more depressive episodes. However, in other examples this may not be true.

Hope you have learned something and if you have a comment, a suggestion, or anything you can leave a comment below. Finally, I would very much appreciate it if you shared this post across your social media accounts if you found it useful!

In this final section, you will find some references and resources that may prove useful. Note, there are both links to blog posts and peer-reviewed articles. Sadly, some of the content here is behind paywalls.

Mann, H. B.; Whitney, D. R. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Statist. 18 (1947), no. 1, 50–60. doi:10.1214/aoms/1177730491. https://projecteuclid.org/euclid.aoms/1177730491

Vargha, A., & Delaney, H. D. (2000). A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. *Journal of Educational and Behavioral Statistics*, *25*(2), 101–132. https://doi.org/10.3102/10769986025002101


In this post, you will learn how to install a specific version of a package with pip.

The post Pip Install Specific Version of a Python Package: 2 Steps appeared first on Erik Marsja.

In this Python tutorial, you will learn how to use pip to install a specific version of a package. The outline of the post (as also can be seen in the ToC) is as follows. First, you will get a brief introduction with examples on *when* you might need to install e.g. an older version of a package. Second, you will get the general syntax for how to carry out this task. After that, you will get two steps to installing specific versions of Python packages with pip. In this section, you will also learn how to work with a virtual environment. In the next section, we will look at how to specify the version of multiple Python packages by creating a .txt file.

Now, there may be several reasons that you may want to install a specific version of a Python package. For example, you may need to install an older version of a package if the package has changed in a way that is not compatible with the version of Python you have installed, with other packages that you have installed, or with your Python code. As previously mentioned, we are going to work with the package manager pip, but it is also possible to install a specific version of a package if you use other package managers. For example, it is also possible if you use the package manager conda (Anaconda Python distribution).

Here are some instructions on how to use pip to install a specific (e.g., older) version of a Python package:

Here’s the general Pip syntax that you can use to install a specific version of a Python package:

`pip install <PACKAGE>==<VERSION>`

As you may understand, you exchange “<PACKAGE>” and “<VERSION>” for the name of the package and the version you want to install, respectively. Don’t worry, the next section will show you, by example, more exactly how this is done.

If you get the warning, as in the image above, you can upgrade pip to the latest version: `pip install --upgrade pip`. Before getting into more details, here’s how to install a specific version of a Python package:

To install a specific version of a Python package you can use pip: `pip install YourPackage==YourVersion`. For example, if you want to install an older version of Pandas you can do as follows: `pip install pandas==1.1.3`. Of course, you will have to open up e.g. the Windows Command Prompt or your favorite terminal emulator in Linux. Note, it is also possible to use conda to install a certain version of a package.

In the next section, you will learn two important steps for installing a certain version of a Python package using the pip package manager. First, you will learn how to install and create a virtual environment. Second, you will learn how to use pip to install the version you need of a Python package using the syntax you’ve already learned.

In this section, you will learn how to install an older version of a Python package using pip. First, I would recommend creating a virtual environment. Therefore, you will first learn how to install the virtual environment package, create a virtual environment, and install a specific version of a Python package.

First, you should install the virtualenv package. Here’s how to install a Python package with pip:

`pip install virtualenv`

Second, you should create, and then activate, your virtual environment:

```
virtualenv myproject
source myproject/bin/activate
```

Now that you have your virtual environment setup, you can go on to the next step and install an older version of a Python package. In step two, we use pip again (like when installing virtualenv) but now will also use the general syntax we’ve learned earlier in this post.

Now that your virtual environment is ready to use, here’s how to use pip to install a specific version of the package Pandas:

`pip install pandas==1.1.1`
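After running the command, one way to double-check that the pinned version was actually picked up is to inspect the package’s `__version__` attribute from Python (assuming the package, Pandas here, is importable in the active environment):

```python
import pandas as pd

# Print the installed version to confirm the pin worked
print(pd.__version__)
```

If the printed version doesn’t match what you pinned, you may be running Python from outside the virtual environment.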

It is, of course, possible to add more packages and their versions if you have many packages that you want to install a certain version of. However, this may be cumbersome and in the next section, we will have a look at how to deal with installing older versions of multiple packages. That is, when storing them in a text file.

That was pretty simple, but using the above steps may not be practical if you, for instance, need to install a lot of Python packages. When installing packages using pip, we can create a .txt file (e.g., requirements.txt). Here’s an example text file with a few Python packages and their versions:
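A hypothetical requirements.txt could look like this (the package names and version numbers are just examples, not recommendations):

```
pandas==1.1.1
numpy==1.19.2
scipy==1.5.2
```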

As you can see, you should keep each package on one line in the text file. Moreover, you should follow the syntax you’ve learned earlier in the post. This is also evident in the image above. Here’s how to install a specified version of multiple packages using the text file:

```
# Pip install specific versions of multiple packages:
pip install -r myproject/requirements.txt
```

Now, installing an older version of *one* package can lead to some problems with the package’s dependencies. You will still get the newest versions of the dependencies that the pinned version allows, of course. One downside of this is that it can later break your application or workflow. Luckily, there are some solutions to combat this issue. For example, if you want your data analysis to be reproducible, using Binder, Jupyter Notebooks, and Python may be a solution. However, if you are developing applications you may need another strategy. In the last section, we will have a look at another Python package that may be useful: Pipenv (see resources, at the bottom, for a great tutorial on Pipenv).

In this brief Python tutorial, you learned how to use pip to install a certain version of a package. First, you learned the syntax of pip for specifying a version. After that, you learned how to 1) create a virtual environment, and 2) install the version of a package you needed. In the final section, we had a look at how to deal with multiple packages of certain versions. That is, how to set the version of multiple packages you wanted to install.

If you have any suggestions or corrections to the current post, please leave a comment below. I always appreciate when I get to learn from others.

Here are some useful packages and tutorials as well as the documentation that may be worth having a look at:

- Pipx: Installing, Uninstalling, & Upgrading Python Packages in Virtual Envs
- Pipenv
- Virtualenv documentation
- Pip documentation
- Pipenv guide

