# Create a Correlation Matrix in Python with NumPy and Pandas

In this post, we will calculate a correlation matrix in Python with NumPy and Pandas, working through several examples. First, we will read data from a CSV file so we can have a look at the `numpy.corrcoef` function and the Pandas `DataFrame.corr` method.

Building a correlation table (matrix) comes in handy, especially when our data contain many variables (three more reasons are listed further down). A link to a Jupyter Notebook with the code examples appears at the end of the post.

## Prerequisites

Before we use NumPy and Pandas to create a correlation matrix in Python, we need to make sure both packages are installed. If a scientific Python distribution, such as Anaconda or ActivePython, is installed on the computer we are using, we most likely don’t have to install them separately. Otherwise, NumPy and Pandas can be installed using conda (Anaconda/Miniconda) or pip.

### Installing Python Packages with pip and conda

Open a terminal window or Anaconda Prompt and type `pip install pandas numpy` to install the packages with pip. To install them with conda, run `conda install -c anaconda numpy pandas`. Note that pip itself can also be upgraded with pip, if needed.

## What is a Correlation Matrix?

A correlation matrix is used to examine the relationship between multiple variables simultaneously. When we do this calculation, we get a table containing the correlation coefficients between each variable and the others. Now, the coefficient shows us both the strength of the relationship and its direction (positive or negative correlations). In Python, a correlation matrix can be created using the Python packages Pandas and NumPy, for instance.
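To make "strength and direction" concrete, here is a minimal sketch with made-up numbers (the DataFrame `toy` and its columns are hypothetical, not the post's dataset). The column `z` falls in a perfectly straight line as `x` rises, so their coefficient is -1, while `y` tends to rise with `x`, giving a positive coefficient below 1.

```python
import pandas as pd

# Hypothetical toy data: y tends to rise with x, z falls linearly as x rises
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 5],
    "z": [10, 8, 6, 4, 2],
})

# Each cell holds the correlation between the row variable and the column variable
toy_corr = toy.corr()
print(toy_corr.round(2))
```

The diagonal is always 1, because every variable correlates perfectly with itself, and the table is symmetric around that diagonal.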

## How do You do a Correlation Matrix in Python?

Now that we know what a correlation matrix is, we will look at the simplest way to create one in Python, using Pandas: import the package with `import pandas as pd`, read the data with `df = pd.read_csv('datafile.csv')`, and call `df.corr()`.

In a Jupyter Notebook, for example, the code above displays the correlation matrix. The rest of the post covers each step in more detail.

Before looking at the applications of a correlation matrix, I also want to mention that pip can be used to install a specific version of a Python package if needed.

## Applications of a Correlation Matrix

Now, before we go on to the Python code, here are three general reasons for creating a correlation matrix:

1. To explore patterns when we have a large data set.
2. For use in other statistical methods. For instance, correlation matrices can be used as data when conducting exploratory factor analysis, confirmatory factor analysis, and structural equation models.
3. As a diagnostic when checking the assumptions of, e.g., regression analysis.
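As an illustration of the third point, the sketch below flags pairs of predictors that are strongly correlated with each other, which is one common way to spot possible multicollinearity before a regression. The simulated data, the column names, and the 0.8 cut-off are all assumptions made for this example and not a recommendation from the post.

```python
import numpy as np
import pandas as pd

# Simulated predictors (hypothetical): b is nearly a copy of a, c is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
predictors = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Absolute correlations, keeping only the cells above the diagonal
# so each pair is listed once and the diagonal of ones is ignored
corr = predictors.corr().abs()
above_diagonal = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(above_diagonal).stack()

# Arbitrary threshold chosen for illustration
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```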

## Correlation Method

The majority of correlation matrices use Pearson’s Product-Moment Correlation (r). Depending on the data type of our variables and on whether the data meet the assumptions of Pearson’s correlation, other methods, such as Spearman’s Correlation (rho) and Kendall’s Tau, are also commonly used.
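For reference, Pearson's r is the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below computes it by hand on two short, made-up arrays and compares the result with `np.corrcoef`; the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical measurements of two variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r computed from its definition
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient taken from NumPy's 2x2 correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 4), round(r_numpy, 4))
```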

In the next section, we will get into the general syntax of the two methods to compute the correlation matrix in Python.

## Syntax of corrcoef and corr

Here we will find the general syntax for computing correlation matrices with Python using 1) NumPy and 2) Pandas.

### Correlation Matrix with NumPy

To create a correlation table in Python using NumPy, this is the general syntax:

```python
np.corrcoef(x)
```

In this case, `x` is a 1-D or 2-D array containing the variables and observations whose correlation coefficients we want. By default, every row of `x` represents one of our variables, whereas each column is a single observation of all our variables. Don’t worry, we will look at how to use `np.corrcoef` later. A quick note: if you need to, you can convert a NumPy array to integer in Python.
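The row-versus-column convention is easy to get wrong, so here is a small sketch with a made-up array of three variables and five observations. When the data are laid out the other way around (one column per variable), the `rowvar=False` argument of `np.corrcoef` gives the same result without transposing by hand.

```python
import numpy as np

# Hypothetical data: 3 variables (rows) x 5 observations (columns)
x = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 5.0, 4.0, 5.0],
    [10.0, 8.0, 6.0, 4.0, 2.0],
])

# Default layout: each row is a variable
r_rows = np.corrcoef(x)

# Same data with one column per variable, so we tell NumPy about it
r_cols = np.corrcoef(x.T, rowvar=False)

print(r_rows.shape)
print(np.allclose(r_rows, r_cols))
```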

### Correlation Matrix using Pandas

To create a correlation table in Python with Pandas, this is the general syntax:

`df.corr()`

Here, `df` is the DataFrame we have, and `corr()` is the method that returns the correlation coefficients. Of course, we will look into how to use Pandas and the `corr` method later in this post.
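One practical detail: correlations are only defined for numeric columns. The sketch below, using a made-up DataFrame with one text column, selects the numeric columns first and then calls `corr()`. The column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data with two numeric columns and one text column
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 88],
    "group": ["a", "b", "a", "b", "a"],
})

# Keep only the numeric columns, then compute the correlation table
numeric = df.select_dtypes(include="number")
corr_table = numeric.corr()

print(corr_table.round(3))
```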

## Computing a Correlation Matrix in Python with NumPy

Now, we will get into some details of NumPy’s `corrcoef` function. Note that this is a simple example; refer to the documentation linked at the beginning of the post for a more detailed explanation.

First, we will load the data using the `numpy.loadtxt` function. Second, we will use the `corrcoef` function to create the correlation table.

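```python
import numpy as np

data = './SimData/correlationMatrixPython.csv'

# Skip the header row, split on commas, and transpose the result
# so that each row of x holds one variable
x = np.loadtxt(data, skiprows=1, delimiter=',',
               unpack=True)

np.corrcoef(x)
```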

Note we used the `skiprows` argument to skip the first row containing the variable names and the `delimiter` argument because the columns are delimited by a comma. Finally, we used the `unpack` argument so that our data would follow the layout `corrcoef` expects. As a final note, `corrcoef` cannot calculate Spearman’s Rho or Kendall’s Tau. That is, it only returns Pearson’s r coefficients.
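To see what the three arguments do without needing the example file, the sketch below reads a tiny CSV from an in-memory string instead of from disk. The column names and values are made up for illustration.

```python
import io
import numpy as np

# Hypothetical CSV content: a header row followed by five observations
csv_text = "a,b,c\n1,2,10\n2,4,8\n3,5,6\n4,4,4\n5,5,2\n"

# skiprows=1 drops the header, delimiter=',' splits the columns,
# and unpack=True transposes the result so each row is one variable
x = np.loadtxt(io.StringIO(csv_text), skiprows=1, delimiter=",", unpack=True)
print(x.shape)

r = np.corrcoef(x)
print(r.round(3))
```

Without `unpack=True` the array would have one row per observation, and `np.corrcoef` would correlate the observations with each other rather than the variables.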

## Three Steps to Creating a Correlation Matrix in Python with Pandas

In this section, we will learn how to do a correlation table in Python with Pandas in 3 simple steps.

### 1. Import Pandas

In the script, or Jupyter Notebooks, we need to start by importing Pandas:

`import pandas as pd`

### 2. Import Data in Python with Pandas

Import the data into a Pandas dataframe as follows:

```python
data = './SimData/correlationMatrixPython.csv'

df = pd.read_csv(data)
```

Remember that the data file needs to be in a subfolder called ‘SimData’, located in the same folder as the Jupyter Notebook.

In the image below, we can see the values from the four variables in the dataset:

It is, of course, important to give the correct path to the data file. Note that there are other ways to create a Pandas dataframe. For instance, we can make a dataframe from a Python dictionary. Furthermore, it’s also possible to read data from an Excel file with Pandas, or scrape the data from an HTML table into a dataframe, to name a few.
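As a quick sketch of two of these routes, the example below builds the same small dataframe once from a Python dictionary and once from CSV text held in memory. The variable names and values are hypothetical and stand in for the real data file.

```python
import io
import pandas as pd

# Hypothetical values, used only for illustration
values = {
    "var1": [1.0, 2.0, 3.0, 4.0],
    "var2": [2.5, 3.5, 3.0, 5.0],
}

# Route 1: dataframe from a dictionary (keys become column names)
df_dict = pd.DataFrame(values)

# Route 2: dataframe from CSV text, read the same way as a file on disk
csv_text = "var1,var2\n1.0,2.5\n2.0,3.5\n3.0,3.0\n4.0,5.0\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Both routes give the same dataframe
print(df_dict.equals(df_csv))
```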

### 3. Calculate the Correlation Matrix with Pandas:

Now, we are in the final step to create the correlation table in Python with Pandas:

`df.corr()`

Using the example data, we get the following output when we print it in a Jupyter Notebook:

Finally, if we want to use other methods (e.g., Spearman’s Rho) we’d just add the `method='spearman'` argument to the `corr` method (note that the method name is written in lowercase). See the image below. Here is a link to the example dataset.
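The difference between the methods shows up clearly on data that are monotonic but not linear. In this sketch, with a made-up dataframe in which `y` is the cube of `x`, Spearman's rho is 1 because the ranks agree perfectly, while Pearson's r is lower because the relationship is not a straight line.

```python
import pandas as pd

# Hypothetical data: y increases with x, but not linearly
curve = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
curve["y"] = curve["x"] ** 3

# Pearson is the default; Spearman works on the ranks of the values
pearson = curve.corr(method="pearson")
spearman = curve.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```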

## Upper and Lower Triangular Correlation Tables with Pandas

In this section, we are going to use NumPy and Pandas together with our correlation matrix (we have saved it as `cormat`: `cormat = df.corr()`).

```python
import numpy as np
import pandas as pd

def triang(cormat, triang='lower'):
    # Return the upper or lower triangle of a correlation matrix
    if triang == 'upper':
        rstri = pd.DataFrame(np.triu(cormat.values),
                             index=cormat.index,
                             columns=cormat.columns).round(3)
        # Drop the first column and the last row (only diagonal values)
        rstri = rstri.iloc[:, 1:]
        rstri.drop(rstri.tail(1).index, inplace=True)

    if triang == 'lower':
        rstri = pd.DataFrame(np.tril(cormat.values),
                             index=cormat.index,
                             columns=cormat.columns).round(3)
        # Drop the last column (only a diagonal value)
        rstri = rstri.iloc[:, :-1]

    # Blank out the zeros and the ones on the diagonal
    rstri.replace(to_replace=[0, 1], value='', inplace=True)

    return rstri
```

Now, this function can be run with the argument `triang` (‘upper’ or ‘lower’). For example, if we want the upper triangular, we do as follows.

`triang(cormat, 'upper')`
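An alternative sketch, not the approach used in this post, is to hide the unwanted cells with a boolean mask and `DataFrame.where`. The hidden cells become `NaN` instead of empty strings, so the table stays numeric, and a coefficient that happens to be exactly 0 or 1 off the diagonal is not blanked out. The small dataframe here is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three variables
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 5],
    "z": [10, 8, 6, 4, 2],
})
cormat = df.corr()

# True for the cells strictly below the diagonal, False elsewhere
below_diagonal = np.tril(np.ones(cormat.shape, dtype=bool), k=-1)

# Keep the cells where the mask is True; everything else becomes NaN
lower = cormat.where(below_diagonal).round(3)
print(lower)
```

For the upper triangle, `np.triu(..., k=1)` builds the corresponding mask.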

There are, of course, other ways to communicate a correlation matrix. For example, we can explore the relationship between each pair of variables (if there are not too many) by creating a pair plot with the `scatter_matrix` function from `pandas.plotting`. Other options are creating a correlogram or a heatmap, for instance (see the post named 9 Data Visualization Techniques in Python you need to Know for more information about these two methods).

The above heatmap can be reproduced with the code found in the Jupyter Notebook here.

## Conclusion

In this post, we have created a correlation matrix using Python and the packages NumPy and Pandas. In general, both approaches are quite simple to use. If we need other correlation methods, however, we cannot use `corrcoef`. As we have seen, this is possible with the Pandas `corr` method (use the `method` argument). Finally, we created upper and lower triangular correlation tables with Pandas and NumPy.

If there is something that needs to be corrected or something that should be added to this correlation matrix in Python tutorial, drop a comment below.
