The post Python Video Tutorial: Creating a Flanker Task using Expyriment appeared first on Erik Marsja.

]]>In the tutorial you will get familiar with Expyriment and get to create a commonly used task in Psychology – the Flanker task. In this task, you are to respond on the direction of an arrow surrounded by distractors (arrows pointing in either the same or the other direction). It shows how hard it can be to ignore irrelevant information (arrows pointing in the wrong direction).

The post Python Video Tutorial: Creating a Flanker Task using Expyriment appeared first on Erik Marsja.

]]>The post Exploring response time distributions using Python appeared first on Erik Marsja.

]]>I used the following Python packages; Pandas for data storing/manipulation, NumPy for some calculations, Seaborn for most of the plotting, and Matplotlib for some tweaking of the plots. Any script using these functions should import them:

from __future__ import division import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns

The first plot is the easiest to create using Python; visualizing the kernel density estimation. It can be done using the Seaborn package only. *kde_plot* takes the arguments *df *Note, in the beginning of the function I set the style to white and to ticks. I do this because I want a white background and ticks on the axes.

def kde_plot(df, conditions, dv, col_name, save_file=False): sns.set_style('white') sns.set_style('ticks') fig, ax = plt.subplots() for condition in conditions: condition_data = df[(df[col_name] == condition)][dv] sns.kdeplot(condition_data, shade=True, label=condition) sns.despine() if save_file: plt.savefig("kernel_density_estimate_seaborn_python_response" "-time.png") plt.show()

Using the function above you can basically plot as many conditions as you like (however, but with to many conditions, the plot will probably be cluttered). I use some response time data from a Flanker task:

# Load the data frame = pd.read_csv('flanks.csv', sep=',') kde_plot(frame, ['incongruent', 'congruent'], 'RT', 'TrialType', save_file=False)

Next out is to plot the cumulative distribution functions (CDF). In the first function CDFs for each condition will be calculated. It takes the arguments *df* (a Pandas dataframe), a list of the conditions (i.e., *conditions*).

def cdf(df, conditions=['congruent', 'incongruent']): data = {i: df[(df.TrialType == conditions[i])] for i in range(len( conditions))} plot_data = [] for i, condition in enumerate(conditions): rt = data[i].RT.sort_values() yvals = np.arange(len(rt)) / float(len(rt)) # Append it to the data cond = [condition]*len(yvals) df = pd.DataFrame(dict(dens=yvals, dv=rt, condition=cond)) plot_data.append(df) plot_data = pd.concat(plot_data, axis=0) return plot_data

Next is the plot function (*cdf_plot)*. The function takes a Pandas a dataframe (created with the function above) as argument as well as *save_file* and *legend*.

def cdf_plot(cdf_data, save_file=False, legend=True): sns.set_style('white') sns.set_style('ticks') g = sns.FacetGrid(cdf_data, hue="condition", size=8) g.map(plt.plot, "dv", "dens", alpha=.7, linewidth=1) if legend: g.add_legend(title="Congruency") g.set_axis_labels("Response Time (ms.)", "Probability") g.fig.suptitle('Cumulative density functions') if save_file: g.savefig("cumulative_density_functions_seaborn_python_response" "-time.png") plt.show()

Here is how to create the plot on the same Flanker task data as above:

cdf_dat = cdf(frame, conditions=['incongruent', 'congruent']) cdf_plot(cdf_dat, legend=True, save_file=False)

In Psychological research Delta plots (DPs) can be used to visualize and compare response time (RT) quantiles obtained under two experimental conditions. DPs enable examination whether the experimental manipulation has a larger effect on the relatively fast responses or on the relatively slow ones (e.g., Speckman, Rouder, Morey, & Pratte, 2008).

In the following script I have created two functions; *calc_delta_data* and *delta_plot*. calc_delta_data takes a Pandas dataframe (df). Rest of the arguments you need to fill in the column names for the subject id, the dependent variable (e.g., RT), and the conditions column name. All in the string data type. The last argument should contain a list of strings of the factors in your condition.

def calc_delta_data(df, subid, dv, condition, conditions=['incongruent', 'congruent']): subjects = pd.Series(df[subid].values.ravel()).unique().tolist() subjects.sort() deciles = np.arange(0.1, 1., 0.1) cond_one = conditions[0] cond_two = conditions[1] # Frame to store the data (per subject) arrays = [np.array([cond_one, cond_two]).repeat(len(deciles)), np.array(deciles).tolist() * 2] data_delta = pd.DataFrame(columns=n_subjects, index=arrays) for subject in subjects: sub_data_inc = df.loc[(df[subid] == subject) & (df[condition] == cond_one)] sub_data_con = df.loc[(df[subid] == subject) & (df[condition] == cond_two)] inc_q = sub_data_inc[rt].quantile(q=deciles).values con_q = sub_data_con[rt].quantile(q=deciles).values for i, dec in enumerate(deciles): data_delta.loc[(cond_one, dec)][subject] = inc_q[i] data_delta.loc[(cond_two, dec)][subject] = con_q[i] # Aggregate deciles data_delta = data_delta.mean(axis=1).unstack(level=0) # Calculate difference data_delta['Diff'] = data_delta[cond_one] - data_delta[cond_two] # Calculate average data_delta['Average'] = (data_delta[cond_one] + data_delta[cond_two]) / 2 return data_delta

Next function, *delta_plot*, takes the data returned from the *calc_delta_data* function to create a line graph.

def delta_plot(delta_data, save_file=False): ymax = delta_data['Diff'].max() + 10 ymin = -10 xmin = delta_data['Average'].min() - 20 xmax = delta_data['Average'].max() + 20 sns.set_style('white') g = sns.FacetGrid(delta_data, ylim=(ymin, ymax), xlim=(xmin, xmax), size=8) g.map(plt.scatter, "Average", "Diff", s=50, alpha=.7, linewidth=1, edgecolor="white") g.map(plt.plot, "Average", "Diff", alpha=.7, linewidth=1) g.set_axis_labels("Avarage RTs (ms.)", "Effect (ms.)") g.fig.suptitle('Delta Plot') if save_file: g.savefig("delta_plot_seaborn_python_response-time.png") plt.show() sns.plt.show()

The above functions are quite easy to use. First load your data (again, I use data from a Flanker task).

# Load the data frame = pd.read_csv('flanks.csv', sep=',') # Calculate delta plot data and plot it d_data = calc_delta_data(frame, "SubID", "RT", "TrialType", ['incongruent', 'congruent']) delta_plot(d_data)

Conditional accuracy functions (CAF) is a technique that also incorporates the accuracy in the task. Creating CAFs involve binning your data (e.g., the response time and accuracy) and creating a linegraph. Briefly, CAFs can capture patterns related to speed/accuracy trade-offs. First function,

def calc_caf(df, subid, rt, acc, trialtype, quantiles=[0.25, 0.50, 0.75, 1]): # Subjects subjects = pd.Series(df[subid].values.ravel()).unique().tolist() subjects.sort() # Multi-index frame for data: arrays = [np.array(['rt'] * len(quantiles) + ['acc'] * len(quantiles)), np.array(quantiles * 2)] data_caf = pd.DataFrame(columns=subjects, index=arrays) # Calculate CAF for each subject for subject in subjects: sub_data = df.loc[(df[subid] == subject)] subject_cdf = sub_data[rt].quantile(q=quantiles).values # calculate mean response time and proportion of error for each bin for i, q in enumerate(subject_cdf): quantile = quantiles[i] # First if i < 1: # Subset temp_df = sub_data[(sub_data[rt] < subject_cdf[i])] # RT data_caf.loc[('rt', quantile)][subject] = temp_df[rt].mean() # Accuracy data_caf.loc[('acc', quantile)][subject] = temp_df[acc].mean() # Second & third (if using 4) elif i == 1 or i < len(quantiles): # Subset temp_df = sub_data[(sub_data[rt] > subject_cdf[i - 1]) & ( sub_data[rt] < q)] # RT data_caf.loc[('rt', quantile)][subject] = temp_df[rt].mean() # Accuracy data_caf.loc[('acc', quantile)][subject] = temp_df[acc].mean() # Last elif i == len(quantiles): # Subset temp_df = sub_data[(sub_data[rt] > subject_cdf[i])] # RT data_caf.loc[('rt', quantile)][subject] = temp_df[rt].mean() # Accuracy data_caf.loc[('acc', quantile)][subject] = temp_df[acc].mean() # Aggregate subjects CAFs data_caf = data_caf.mean(axis=1).unstack(level=0) # Add trialtype data_caf['trialtype'] = [condition for _ in range(len(quantiles))] return data_caf

*caf_plot *(the function below) uses Seaborn, again, to plot the conditional accuracy functions.

def caf_plot(df, legend_title='Congruency', save_file=True): sns.set_style('white') sns.set_style('ticks') g = sns.FacetGrid(df, hue="trialtype", size=8, ylim=(0, 1.1)) g.map(plt.scatter, "rt", "acc", s=50, alpha=.7, linewidth=1, edgecolor="white") g.map(plt.plot, "rt", "acc", alpha=.7, linewidth=1) g.add_legend(title=legend_title) g.set_axis_labels("Response Time (ms.)", "Accuracy") g.fig.suptitle('Conditional Accuracy Functions') if save_file: g.savefig("conditional_accuracy_function_seaborn_python_response" "-time.png") plt.show()

Right now, the function for calculation the Conditional Accuracy Functions can only do one condition at the time. Thus, in the code below I subset the Pandas dataframe (same old, Flanker data as in the previous examples) for incongruent and congruent conditions. The CAFs for these two subsets are then concatenated (i.e., combined to one dataframe) and plotted.

# Conditional accuracy function (data) for incongruent and congruent conditions inc = calc_caf(frame[(frame.TrialType == "incongruent")], "SubID", "RT", "ACC", "incongruent") con = calc_caf(frame[(frame.TrialType == "congruent")], "SubID", "RT", "ACC", "congruent") # Combine the data and plot it df_caf = pd.concat([inc, con]) caf_plot(df_caf, save_file=True)

Update: I created a Jupyter notebook containing all code: Exploring distributions.

Balota, D. a., & Yap, M. J. (2011). Moving Beyond the Mean in Studies of Mental Chronometry: The Power of Response Time Distributional Analyses. *Current Directions in Psychological Science, 20(3)*, 160–166. http://doi.org/10.1177/0963721411408885

Luce, R. D. (1986). *Response times: Their role in inferring elementary mental organization* (No. 8). Oxford University Press on Demand.

Speckman, P. L., Rouder, J. N., Morey, R. D., & Pratte, M. S. (2008). Delta Plots and Coherent Distribution Ordering. *The American Statistician, 62(3)*, 262–266. http://doi.org/10.1198/000313008X333493

The post Exploring response time distributions using Python appeared first on Erik Marsja.

]]>The post Best Python libraries for Psychology researchers appeared first on Erik Marsja.

]]>Python is gaining popularity in many fields of science. This means that there also are many applications and libraries specifically for use in Psychological research. For instance, there are packages for collecting data & analysing brain imaging data. In this post, I have collected some useful Python packages for researchers within the field of Psychology and Neuroscience. I have used and tested some of them but others I have yet to try.

Expyriment is a Python library in which makes the programming of Psychology experiments a lot easier than using Python. It contains classes and methods for creating fixation cross’, visual stimuli, collecting responses, etc.

Modular Psychophysics is a collection of tools that aims to implement a modular approach to Psychophysics. It enables us to write experiments in different languages. As far as I understand, you can use both MATLAB and R to control your experiments. That is, the timeline of the experiment can be carried out in another language (e.g., MATLAB).

However, it seems like the experiments are created using Python. Your experiments can be run both locally and over networks. Have yet to test this out.

OpenSesame is a Python application. Using OpenSesame one can create Psychology experiments. It has a graphical user interface (GUI) that allows the user to drag and drop objects on a timeline. More advanced experimental designs can be implemented using inline Python scripts.

PsychoPy is also a Python application for creating Psychology experiments. It comes packed with a GUI but the API can also be used for writing Python scripts. I have written a bit more thoroughly about PsychoPy: PsychoPy.

I have written more extensively on Expyriment, PsychoPy, Opensesame, and some other libraries for creating experiment in my post Python apps and libraries for creating experiments.

PsyUtils “The psyutils package is a collection of utility functions useful for generating visual stimuli and analysing the results of psychophysical experiments. It is a work in progress, and changes as I work. It includes various helper functions for dealing with images, including creating and applying various filters in the frequency domain.”

Psisignifit is a toolbox that allows you to fit psychometric functions. Further, hypotheses about psychometric data can be tested. Psisignfit allows for full Bayesian analysis of psychometric functions that includes Bayesian model selection and goodness of fit evaluation among other great things.

Pygaze is a Python library for eye-tracking data & experiments. It works as a wrapper around many other Python packages (e.g., PsychoPy, Tobii SDK). Pygaze can also, through plugins, be used from within OpenSesame.

General Recognition Theory (GRT) is a fork of a MATLAB toolbox. GRT is ” a multi-dimensional version of signal detection theory.” (see link for more information).

MNE is a library designed for processing electroencephalography (EEG) and magnetoencephalography (MEG) data. Collected data can be preprocessed and denoised. Time-frequency analysis and statistical testing can be carried out. MNE can also be used to apply some machine learning algorithms. Although, mainly focused on EEG and MEG data some of the statistical tests in this library can probably be used to analyse behavioural data (e.g., ANOVA).

Kabuki is a Python library for effortless creation of hierarchical Bayesian models. It uses the library PyMC. Using Kabuki you will get formatted summary statistics, posterior plots, and many more. There is, further, a function to generate data from a formulated model. It seems that there is an intention to add more commonly used statistical tests (i.e., Bayesian ANOVA) in the future!

NIPY: “Welcome to NIPY. We are a community of practice devoted to the use of the Python programming language in the analysis of neuroimaging data”. Here different packages for brain imaging data can be found.

Although, many of the above libraries probably can be used within other research fields there are also more libraries for pure statistics & visualisation.

PyMVPA is a Python library for MultiVariate Pattern Analysis. It enables statistical learning analyses of big data.

Pandas is a Python library for fast, flexible and expressive data structures. Researchers and analysists with an R background will find Pandas data frame objects very similar to Rs. Data can be manipulated, summarised, and some descriptive analysis can be carried out (e.g., see Descriptive Statistics Using Python for some examples using Pandas).

Statsmodels is a Python library for data exploration, estimation of statistical models, and statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. Among many methods regression, generalized linear, and non-parametric tests can be carried out using statsmodels.

Pyvttbl is a library for creating Pivot tables. One can further process data and carry out statistical computations using Pyvttbl. Sadly, it seems like it is not updated anymore and is not compatible with other packages (e.g., Pandas). If you are interested in how to carry out repeated measures ANOVAs in Python this is a package that enables these kind of analysis (e.g., see Repeated Measures ANOVA using Python and Two-way ANOVA for Repeated Measures using Python).

There are many Python libraries for visualisation of data. Below are the ones I have worked with. Note, pandas and statsmodels also provides methods for plotting data. All three libraries are compatible with Pandas which makes data manipulation and visualisation very easy.

Matplotlib is a package for creating two-dimensional plots.

Seaborn is a library based on Matplotlib. Using seaborn you can create ready-to-publish graphis (e.g., see the Figure below for a boxplot on some response time data).

Ggplot is a visualisation library based on the R package Ggplot2. That is, if you are familiar with R and Ggplot2 transitioning to Python and the package Ggplot will be easy.

Many of the libraries for analysis and visualisation can be installed separately and, more or less, individually . I do however recommend that you install a scientific Python distribution. This way you will get all you need (and much more) by one click (e.g., Pandas, Matplotlib, NumPy, Statsmodels, Seaborn). I suggest you have a look at the distributions Anaconda or Python(x, y). Note, that installing Python(x, y) will give you the Python IDE Spyder.

The last Python package for Psychology I am going to list is PsychoPy_ext. Although, PsychoPy_ext may be considered a library for building experiments it seems like the general aim behind the package is to make reproducible research easier. That is, analysis and plotting of the experiments can be carried out. I also think it is interesting that there seems to be a way to autorun experiments (a neat function I know that e-prime have, for instance).

That was it, if you happen to know any other Python libraries or applications with a focus on Psychology (e.g., psychophysics) or for statistical methods.

The post Best Python libraries for Psychology researchers appeared first on Erik Marsja.

]]>The post E-prime how-to: save data in csv-file using InLine scripts appeared first on Erik Marsja.

]]>This guide will assume that you have worked with e-prime before. That is, you should already have a, more or less, ready experiment that you can add the scripts to. In the guide I use a Simon task created in e-prime as an example.

I prefer to let the experimenter have the possibility to choose name on the data file. Thus, we will start with creating a variable called *experimentId* (You may already know how to do this and can skip to the next part)*. *We start by clicking on “Edit” in the menu and choosing “Experiment…”:

After doing this we get the “Properties:Experiment Object Properties” dialog. From this dialog we click on add:

In the next dialog, “Edit Startup Info Parameter” we give our variable a name; *experimentId *(we type this in the “Log Name” field). In the “Prompt” field we type in “Experiment ID”. This is what will show up when the experiment is started. We go on and change the Data Type to “String” and put “simonTask” as Default. This is what will be put in by default but can be changed by the experimenter.

As previously mentioned I assume that you know how to create an experiment in e-prime but I will briefly mention how to create an InLine script. Drag the object InLine from the “E-Objects” in the left of the GUI. I chose to put this in the “PracticeSimon” procedure so it is one of the first things that is created when starting an experiment. I typically name the script “fileManagment” or something that makes it clear what the script do.

' Sets up a data file On Error Resume Next MkDir "data_"+c.GetAttrib ("experimentId") ' Create the variable save of data type string and set save to current directory Dim save As String save = CurDir$ ' Change the directory to data_simonTask (since experimentId is by default simonTask) chdir("data_"+c.GetAttrib ("experimentId")) ' Create a file called data_simonTask if it does not exist If FileExists("data_"+c.GetAttrib ("experimentId")+".csv")=False Then Dim fileid As Integer fileid=freefile open "data_"+c.GetAttrib ("experimentId")+".csv" For output As #fileid print #fileid,"SubID;Date;Age;Sex;RT;ACC;CorrectResponse;Response;Arrows;TrialType" close End If chdir(save)

Note, the delimiter in the case above is “;”. That is, this is what makes whatever software you use later know where a new column begins. In this example I want to store Subject ID (“SubID), the date, Age, Sex, Response Time (RT), Accuracy (ACC), and so on.

Now that a data file and folder has been created we can go on and create an InLine script that saves the data (for each trial). In the Simon task example I put the script (“saveData”) in the end of the “SimonTrial” procedure:

Responses, in the example, are logged in the “simonTarget”-object (an ImageDisplay object). However, many times we want to save more information, such as what kind of trial it currently is. Such data is typically stored in a List object (List3 in the image above, for instance):

Now to the script that saves the data:

Dim save As String Dim fileid As Integer save = CurDir$ chdir("data_"+c.GetAttrib ("experimentId")) fileid=freefile open "data_"+c.GetAttrib("experimentId")+".csv" For append As #fileid ' "SubID;Date;Age;Sex;RT;ACC;CorrectResponse;simonTarget;TrialType" print #fileid,(c.GetAttrib("Subject"));";";Date;";";(c.GetAttrib("Age"));";";(c.GetAttrib("Sex"));";";(c.GetAttrib("simonTarget.RT"));";";(c.GetAttrib("simonTarget.ACC"));";";(c.GetAttrib("correct_resp"));";";(c.GetAttrib("simonTarget.Resp"));";";(c.GetAttrib("display2"));";";(c.GetAttrib("Type_C")) close chdir(save) doevents

Note, what really is important in this script is that the data is stored in the order that we have created our column names. In the script above we use the function c.GetAttrib(attribute) to get the current data stored in that variable. That is, when we want to get the RT we use (c.GetAttrib(“simonTarget.RT”) since this is were the responses are recored. Startup info and information in the file can be accessed using only c.GetAttrib(). That was quite easy, right?!

There is one caveat, in your ImageDisplay object (i.e., in our case simonTarget) we will have to set the prerelease to 0 or else we will not have anything recorded. This may at times be a problem (e.g., for timing and such). If anyone know a solution to this, please let me know.

The post E-prime how-to: save data in csv-file using InLine scripts appeared first on Erik Marsja.

]]>The post PsychoPy video tutorials appeared first on Erik Marsja.

]]>I created a playlist with 4 youtube tutorials. In the first video you will learn how to create a classical psychology experiment; the stroop task. The second is more into psycholinguistics and you will learn how to create a language experiment using PsychoPy. In the third you get to know how to create text input fields in PsychoPy. In this tutorial inline Python coding is used so you will also get to know how you may use programming. In the forth video you will get acquainted with using video stimuli in PsychoPy.

Recently I found the following playlist on youtube and it is amazing. In this series of tutorial videos you will learn how to use PsychoPy as a Python package. For instance, it is starting at a very basic level; importing the visual module to create windows. In this video he uses my favourite Python IDE Spyder. The videos are actually screencasts from the course *Programming for Psychology & Vision Science* and contains 10 videos. The first 5 videos covers drawing stimuli on the screen (i.e., drawing to a video, gratings, shapes, images, dots). Watching these videos you will also learn how to collect responses, providing input, and saving your data.

That was all for now. If you know more good video tutorial for PsychoPy please leave a comment. Preferably the tutorials should cover coding but just building experiments with the builder mode is also fine. I may update my playlist with more PsychoPy tutorials (the first playlist).

The post PsychoPy video tutorials appeared first on Erik Marsja.

]]>The post Two-way ANOVA for repeated measures using Python appeared first on Erik Marsja.

]]>import numpy as np import pyvttbl as pt from collections import namedtuple

Numpy is be used in simulating the data. I create a data set in which we have one factor of two levels (P) and a second factor of 3 levels (Q). As in many of my examples the dependent variable is going to be response time (rt) and we create a list of lists for the different population means we are going to assume (i.e., the variable ‘values’). I was a bit lazy when coming up with the data so I named the independent variables ‘iv1’ and ‘iv2’. However, you could think of *iv1* as two different memory tasks; verbal and spatial memory. *Iv2* could be different levels of distractions (no distraction, synthetic sounds, and speech, for instance).

N = 20 P = [1,2] Q = [1,2,3] values = [[998,511], [1119,620], [1300,790]] sub_id = [i+1 for i in xrange(N)]*(len(P)*len(Q)) mus = np.concatenate([np.repeat(value, N) for value in values]).tolist() rt = np.random.normal(mus, scale=112.0, size=N*len(P)*len(Q)).tolist() iv1 = np.concatenate([np.array([p]*N) for p in P]*len(Q)).tolist() iv2 = np.concatenate([np.array([q]*(N*len(P))) for q in Q]).tolist() Sub = namedtuple('Sub', ['Sub_id', 'rt','iv1', 'iv2']) df = pt.DataFrame() for idx in xrange(len(sub_id)): df.insert(Sub(sub_id[idx],rt[idx], iv1[idx],iv2[idx])._asdict())

I start with a boxplot using the method boxplot from Pyvttbl. As far as I can see there is not much room for changing the plot around. We get this plot and it is really not that beautiful.

df.box_plot('rt', factors=['iv1', 'iv2'])

To run the Two-Way ANOVA is simple; the first argument is the dependent variable, the second the subject identifier, and than the within-subject factors. In two previous posts I showed how to carry out one-way and two-way ANOVA for independent measures. One could, of course combine these techniques, to do a split-plot/mixed ANOVA by adding an argument ‘bfactors’ for the between-subject factor(s).

aov = df.anova('rt', sub='Sub_id', wfactors=['iv1', 'iv2']) print(aov)

The output one get from this is an ANOVA table. In this table all metrics needed plus some more can be found; F-statistic, p-value, mean square errors, confidence intervals, effect size (i.e., eta-squared) for all factors and the interaction. Also, some corrected degree of freedom and mean square error can be found (e.g., Grenhouse-Geisser corrected). The output is in the end of the post. It is a bit hard to read. If you know any other way to do a repeated measures ANOVA using Python please let me know. Also, if you happen to know that you can create nicer plots with Pyvttbl I would also like to know how! Please leave a comment.

rt ~ iv1 * iv2 TESTS OF WITHIN SUBJECTS EFFECTS Measure: rt Source Type III eps df MS F Sig. et2_G Obs. SE 95% CI lambda Obs. SS Power ======================================================================================================================================================= iv1 Sphericity Assumed 4419957.211 - 1 4419957.211 324.248 2.128e-13 3.295 60 16.096 31.548 1023.941 1 Greenhouse-Geisser 4419957.211 1 1 4419957.211 324.248 2.128e-13 3.295 60 16.096 31.548 1023.941 1 Huynh-Feldt 4419957.211 1 1 4419957.211 324.248 2.128e-13 3.295 60 16.096 31.548 1023.941 1 Box 4419957.211 1 1 4419957.211 324.248 2.128e-13 3.295 60 16.096 31.548 1023.941 1 ------------------------------------------------------------------------------------------------------------------------------------------------------- Error(iv1) Sphericity Assumed 258996.722 - 19 13631.406 Greenhouse-Geisser 258996.722 1 19 13631.406 Huynh-Feldt 258996.722 1 19 13631.406 Box 258996.722 1 19 13631.406 ------------------------------------------------------------------------------------------------------------------------------------------------------- iv2 Sphericity Assumed 5257766.564 - 2 2628883.282 206.008 4.023e-21 3.920 40 18.448 36.158 433.701 1 Greenhouse-Geisser 5257766.564 0.550 1.101 4777252.692 206.008 1.320e-12 3.920 40 18.448 36.158 433.701 1 Huynh-Feldt 5257766.564 0.550 1.101 4777252.692 206.008 1.320e-12 3.920 40 18.448 36.158 433.701 1 Box 5257766.564 0.500 1 5257766.564 206.008 1.192e-11 3.920 40 18.448 36.158 433.701 1 ------------------------------------------------------------------------------------------------------------------------------------------------------- Error(iv2) Sphericity Assumed 484921.251 - 38 12761.086 Greenhouse-Geisser 484921.251 0.550 20.911 23189.668 Huynh-Feldt 484921.251 0.550 20.911 23189.668 Box 484921.251 0.500 19 25522.171 ------------------------------------------------------------------------------------------------------------------------------------------------------- iv1 * Sphericity Assumed 1622027.598 - 2 811013.799 83.220 1.304e-14 1.209 20 22.799 44.687 87.600 1.000 iv2 Greenhouse-Geisser 1622027.598 0.545 1.091 1486817.582 83.220 6.085e-09 1.209 20 22.799 44.687 87.600 1.000 Huynh-Feldt 1622027.598 0.545 1.091 1486817.582 83.220 6.085e-09 1.209 20 22.799 44.687 87.600 1.000 Box 1622027.598 0.500 1 1622027.598 83.220 2.262e-08 1.209 20 22.799 44.687 87.600 1.000 ------------------------------------------------------------------------------------------------------------------------------------------------------- Error(iv1 * Sphericity Assumed 370327.311 - 38 9745.456 iv2) Greenhouse-Geisser 370327.311 0.545 20.728 17866.175 Huynh-Feldt 370327.311 0.545 20.728 17866.175 Box 370327.311 0.500 19 19490.911 TABLES OF ESTIMATED MARGINAL MEANS Estimated Marginal Means for iv1 iv1 Mean Std. Error 95% Lower Bound 95% Upper Bound ============================================================== 1 983.755 43.162 899.157 1068.354 2 599.917 21.432 557.909 641.925 Estimated Marginal Means for iv2 iv2 Mean Std. Error 95% Lower Bound 95% Upper Bound =============================================================== 1 525.025 19.324 487.150 562.899 2 814.197 49.416 717.342 911.053 3 1036.286 43.789 950.459 1122.114 Estimated Marginal Means for iv1 * iv2 iv1 iv2 Mean Std. Error 95% Lower Bound 95% Upper Bound ===================================================================== 1 1 553.522 24.212 506.066 600.978 1 2 1103.488 28.411 1047.804 1159.173 1 3 1294.256 19.773 1255.501 1333.011 2 1 496.528 29.346 439.009 554.047 2 2 524.906 20.207 485.301 564.512 2 3 778.317 21.815 735.560 821.073

The post Two-way ANOVA for repeated measures using Python appeared first on Erik Marsja.

]]>The post Three ways to do a two-way ANOVA with Python appeared first on Erik Marsja.

]]>An important advantage of the two-way ANOVA is that it is more efficient compared to the one-way. There are two assignable sources of variation – supp and dose in our example – and this helps to reduce error variation thereby making this design more efficient. Two-way ANOVA (factorial) can be used to, for instance, compare the means of populations that are different in two ways. It can also be used to analyse the mean responses in an experiment with two factors. Unlike One-Way ANOVA, it enables us to test the effect of two factors at the same time. One can also test for independence of the factors provided there are more than one observation in each cell. The only restriction is that the number of observations in each cell has to be equal (there is no such restriction in case of one-way ANOVA).

We discussed linear models earlier – and ANOVA is indeed a kind of linear model – the difference being that ANOVA is where you have discrete factors whose effect on a continuous (variable) result you want to understand.

import pandas as pd from statsmodels.formula.api import ols from statsmodels.stats.anova import anova_lm from statsmodels.graphics.factorplots import interaction_plot import matplotlib.pyplot as plt from scipy import stats

In the code above we import all the needed Python libraries and methods for doing the two first methods using Python (calculation with Python and using Statsmodels). In the last, and third, method for doing python ANOVA we are going to use Pyvttbl. As in the previous post on one-way ANOVA using Python we will use a set of data that is available in R but can be downloaded here: TootGrowth Data. Pandas is used to create a dataframe that is easy to manipulate.

datafile="ToothGrowth.csv" data = pd.read_csv(datafile)

It can be good to explore data before continuing with the inferential statistics. statsmodels has methods for visualising factorial data. We are going to use the method *interaction_plot*.

fig = interaction_plot(data.dose, data.supp, data.len, colors=['red','blue'], markers=['D','^'], ms=10)

The calculations of the sum of squares (the variance in the data) is quite simple using Python. First we start with getting the sample size (N) and the degree of freedoms needed. We will use them later to calculate the mean square. After we have the degree of freedom we continue with calculation of the sum of squares.

N = len(data.len) df_a = len(data.supp.unique()) - 1 df_b = len(data.dose.unique()) - 1 df_axb = df_a*df_b df_w = N - (len(data.supp.unique())*len(data.dose.unique()))

For the calculation of the sum of squares A, B and Total we will need to have the grand mean. Using Pandas DataFrame method mean on the dependent variable only will give us the grand mean:

grand_mean = data['len'].mean()

The grand mean is simply the mean of all scores of len.

We start with calculation of Sum of Squares for the factor A (supp).

ssq_a = sum([(data[data.supp ==l].len.mean()-grand_mean)**2 for l in data.supp])

Calculation of the second Sum of Square, B (dose), is pretty much the same but over the levels of that factor.

ssq_b = sum([(data[data.dose ==l].len.mean()-grand_mean)**2 for l in data.dose])

ssq_t = sum((data.len - grand_mean)**2)

Finally, we need to calculate the Sum of Squares Within which is sometimes referred to as error or residual.

vc = data[data.supp == 'VC'] oj = data[data.supp == 'OJ'] vc_dose_means = [vc[vc.dose == d].len.mean() for d in vc.dose] oj_dose_means = [oj[oj.dose == d].len.mean() for d in oj.dose] ssq_w = sum((oj.len - oj_dose_means)**2) +sum((vc.len - vc_dose_means)**2)

Since we have a two-way design we need to calculate the Sum of Squares for the interaction of A and B.

ssq_axb = ssq_t-ssq_a-ssq_b-ssq_w

We continue with the calculation of the mean square for each factor, the interaction of the factors, and within.

ms_a = ssq_a/df_a

ms_b = ssq_b/df_b

ms_axb = ssq_axb/df_axb

ms_w = ssq_w/df_w

The *F*-statistic is simply the mean square for each effect and the interaction divided by the mean square for within (error/residual).

f_a = ms_a/ms_w f_b = ms_b/ms_w

We can use the scipy.stats method *f.sf* to check if our obtained *F*-ratios is above the critical value. Doing that we need to use our *F*-value for each effect and interaction as well as the degrees of freedom for them, and the degree of freedom within.

p_a = stats.f.sf(f_a, df_a, df_w) p_b = stats.f.sf(f_b, df_b, df_w) p_axb = stats.f.sf(f_axb, df_axb, df_w)

The results are, right now, stored in a lot of variables. To obtain a morereadable result we can create a DataFrame that will contain our ANOVA table.

results = {'sum_sq':[ssq_a, ssq_b, ssq_axb, ssq_w], 'df':[df_a, df_b, df_axb, df_w], 'F':[f_a, f_b, f_axb, 'NaN'], 'PR(>F)':[p_a, p_b, p_axb, 'NaN']} columns=['sum_sq', 'df', 'F', 'PR(>F)'] aov_table1 = pd.DataFrame(results, columns=columns, index=['supp', 'dose', 'supp:dose', 'Residual'])

As a Psychologist most of the journals we publish in requires to report effect sizes. Common software, such as, SPSS have eta squared as output. However, eta squared is an overestimation of the effect. To get a less biased effect size measure we can use omega squared. The following two functions adds eta squared and omega squared to the above DataFrame that contains the ANOVA table.

def eta_squared(aov): aov['eta_sq'] = 'NaN' aov['eta_sq'] = aov[:-1]['sum_sq']/sum(aov['sum_sq']) return aov def omega_squared(aov): mse = aov['sum_sq'][-1]/aov['df'][-1] aov['omega_sq'] = 'NaN' aov['omega_sq'] = (aov[:-1]['sum_sq']-(aov[:-1]['df']*mse))/(sum(aov['sum_sq'])+mse) return aov eta_squared(aov_table1) omega_squared(aov_table1) print(aov_table1)

sum_sq | df | F | PR(>F) | eta_sq | omega_sq | |
---|---|---|---|---|---|---|

supp | 205.350000 | 1 | 15.572 | 0.000231183 | 0.059484 | 0.055452 |

dose | 2426.434333 | 2 | 92 | 0.000231183 | 0.702864 | 0.692579 |

supp:dose | 108.319000 | 2 | 4.10699 | 0.0218603 | 0.031377 | 0.023647 |

Residual | 712.106000 | 54 |

There is, of course, a much easier way to do Two-way ANOVA with Python. We can use Statsmodels which have a similar model notation as many R-packages (e.g., lm). We start with formulation of the model:

formula = 'len ~ C(supp) + C(dose) + C(supp):C(dose)' model = ols(formula, data).fit() aov_table = anova_lm(model, typ=2)

Statsmodels does not calculate effect sizes for us. My functions above can, again, be used and will add omega and eta squared effect sizes to the ANOVA table. Actually, I created these two functions to enable calculation of omega and eta squared effect sizes on the output of Statsmodels anova_lm method.

eta_squared(aov_table) omega_squared(aov_table) print(aov_table)

sum_sq | df | F | PR(>F) | eta_sq | omega_sq | |
---|---|---|---|---|---|---|

C(supp) | 205.350000 | 1 | 15.571979 | 2.311828e-04 | 0.059484 | 0.055452 |

C(dose) | 2426.434333 | 2 | 91.999965 | 4.046291e-18 | 0.702864 | 0.692579 |

C(supp):C(dose) | 108.319000 | 2 | 4.106991 | 2.186027e-02 | 0.031377 | 0.023647 |

Residual | 712.106000 | 54 |

What is neat with using statsmodels is that we can also do some diagnostics. It is, for instance, very easy to take our model fit (the linear model fitted with the OLS method) and get a Quantile-Quantile (QQplot):

res = model.resid fig = sm.qqplot(res, line='s') plt.show()

The third way to do Python ANOVA is using the library pyvttbl. Pyvttbl has its own method (also called DataFrame) to create data frames.

from pyvttbl import DataFrame df=DataFrame() df.read_tbl(datafile) df['id'] = xrange(len(df['len'])) print(df.anova('len', sub='id', bfactors=['supp', 'dose']))

The ANOVA tables of Pyvttbl contains a lot of more information compared to that of statsmodels. Actually, Pyvttbl output contains an effect size measure; the generalized omega squared.

Source | Type III Sum of Squares | df | MS | F | Sig. | η^{2}_{G} |
Obs. | SE of x̄ | ±95% CI | λ | Obs. Power |

supp | 205.350 | 1.000 | 205.350 | 15.572 | 0.000 | 0.224 | 30.000 | 0.678 | 1.329 | 8.651 | 0.823 |

dose | 2426.434 | 2.000 | 1213.217 | 92.000 | 0.000 | 0.773 | 20.000 | 0.831 | 1.628 | 68.148 | 1.000 |

supp * dose | 108.319 | 2.000 | 54.159 | 4.107 | 0.022 | 0.132 | 10.000 | 1.175 | 2.302 | 1.521 | 0.173 |

Error | 712.106 | 54.000 | 13.187 | ||||||||

Total | 3452.209 | 59.000 |

The post Three ways to do a two-way ANOVA with Python appeared first on Erik Marsja.

]]>The post Four ways to conduct one-way ANOVAs with Python appeared first on Erik Marsja.

]]>

We start with some brief introduction on theory of ANOVA. If you are more interested in the four methods to carry out one-way ANOVA with Python click here. ANOVA is a means of comparing the ratio of systematic variance to unsystematic variance in an experimental study. Variance in the ANOVA is partitioned in to total variance, variance due to groups, and variance due to individual differences.

The ratio obtained when doing this comparison is known as the *F*-ratio. A one-way ANOVA can be seen as a regression model with a single categorical predictor. This predictor usually has two plus categories. A one-way ANOVA has a single factor with *J *levels. Each level corresponds to the groups in the independent measures design. The general form of the model, which is a regression model for a categorical factor with *J *levels, is:

There is a more elegant way to parametrize the model. In this way the group means are represented as deviations from the grand mean by grouping their coefficients under a single term. I will not go into detail on this equation:

As for all parametric tests the data need to be normally distributed (each groups data should be roughly normally distributed) for the *F*-statistic to be reliable. Each experimental condition should have roughly the same variance (i.e., homogeneity of variance), the observations (e.g., each group) should be independent, and the dependent variable should be measured on, at least, an interval scale.

In the four examples in this tutorial we are going to use the dataset “PlanthGrowth” that originally was available in R but can be downloaded using this link: PlanthGrowth. In the first three examples we are going to use Pandas DataFrame.

import pandas as pd datafile="PlantGrowth.csv" data = pd.read_csv(datafile) #Create a boxplot data.boxplot('weight', by='group', figsize=(12, 8)) ctrl = data['weight'][data.group == 'ctrl'] grps = pd.unique(data.group.values) d_data = {grp:data['weight'][data.group == grp] \ for grp in pd.unique(data.group.values)} k = len(pd.unique(data.group)) # number of conditions N = len(data.values) # conditions times participants n = data.groupby('group').size()[0] #Participants in each condition

Judging by the Boxplot there are differences in the dried weight for the two treatments. However, easy to visually determine whether the treatments are different to the control group.

We start with using SciPy and its method f_oneway from stats.

from scipy import stats F, p = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2'])

One problem with using SciPy is that following APA guidelines we should also effect size (e.g., eta squared) as well as Degree of freedom (DF). DFs needed for the example data is easily obtained:

DFbetween = k - 1 DFwithin = N - k DFtotal = N - 1

However, if we want to calculate eta-squared we need to do some more computations. Thus, the next section will deal with how to calculate a one-way ANOVA using the Pandas DataFrame and Python code.

A one-way ANOVA is quite easy to calculate so below I am going to show how to do it. First, we need to calculate the sum of squares between (SSbetween), sum of squares within (SSwithin), and sum of squares total (SSTotal).

We start with calculating the Sum of Squares between. Sum of Squares Between is the variability due to interaction between the groups. Sometimes known as the Sum of Squares of the Model.

SSbetween = (sum(data.groupby('group').sum()['weight']**2)/n) \ - (data['weight'].sum()**2)/N

The variability in the data due to differences within people. The calculation of Sum of Squares Within can be carried out according to this formula:

sum_y_squared = sum([value**2 for value in data['weight'].values]) SSwithin = sum_y_squared - sum(data.groupby('group').sum()['weight']**2)/n

Sum of Squares Total will be needed to calculate eta-squared later. This is the total variability in the data.

SStotal = sum_y_squared - (data['weight'].sum()**2)/N

Mean square between is the sum of squares within divided by degree of freedom between.

MSbetween = SSbetween/DFbetween

Mean Square within is also an easy calculation;

MSwithin = SSwithin/DFwithin

` `

F = MSbetween/MSwithin

To reject the null hypothesis we check if the obtained F-value is above the critical value for rejecting the null hypothesis. We could look it up in a F-value table based on the DFwithin and DFbetween. However, there is a method in SciPy for obtaining a p-value.

p = stats.f.sf(F, DFbetween, DFwithin)

Finally, we are also going to calculate effect size. We start with the commonly used eta-squared (*η²* ):

eta_sqrd = SSbetween/SStotal

However, eta-squared is somewhat biased because it is based purely on sums of squares from the sample. No adjustment is made for the fact that what we aiming to do is to estimate the effect size in the population. Thus, we can use the less biased effect size measure Omega squared:

om_sqrd = (SSbetween - (DFbetween * MSwithin))/(SStotal + MSwithin)

The results we get from both the SciPy and the above method can be reported according to APA style; *F*(2, 27) = 4.846, *p *= .016, *η²* = .264. If you want to report Omega Squared: *ω ^{2}* = .204

The third method, using Statsmodels, is also easy. We start by using ordinary least squares method and then the anova_lm method. Also, if you are familiar with R-syntax. Statsmodels have a formula api where your model is very intuitively formulated. First, we import the api and the formula api. Second we, use ordinary least squares regression with our data. The object obtained is a fitted model that we later use with the anova_lm method to obtaine a ANOVA table.

import statsmodels.api as sm from statsmodels.formula.api import ols mod = ols('weight ~ group', data=data).fit() aov_table = sm.stats.anova_lm(mod, typ=2) print aov_table

sum_sq | df | F | PR(>F) | |
---|---|---|---|---|

group | 3.76634 | 2 | 4.846088 | 0.01591 |

Residual | 10.49209 | 27 |

As can be seen in the ANVOA table Statsmodels don’t provide an effect size . To calculate eta squared we can use the sum of squares from the table:

esq_sm = aov_table['sum_sq'][0]/(aov_table['sum_sq'][0]+aov_table['sum_sq'][1])

We can also use the method anova1way from the python package pyvttbl. This package also has a DataFrame method. We have to use this method instead of Pandas DataFrame to be able to carry out the one-way ANOVA.

from pyvttbl import DataFrame df=DataFrame() df.read_tbl(datafile) aov_pyvttbl = df.anova1way('weight', 'group') print aov_pyvttbl

Anova: Single Factor on weight SUMMARY Groups Count Sum Average Variance ============================================ ctrl 10 50.320 5.032 0.340 trt1 10 46.610 4.661 0.630 trt2 10 55.260 5.526 0.196 O'BRIEN TEST FOR HOMOGENEITY OF VARIANCE Source of Variation SS df MS F P-value eta^2 Obs. power =============================================================================== Treatments 0.977 2 0.489 1.593 0.222 0.106 0.306 Error 8.281 27 0.307 =============================================================================== Total 9.259 29 ANOVA Source of Variation SS df MS F P-value eta^2 Obs. power ================================================================================ Treatments 3.766 2 1.883 4.846 0.016 0.264 0.661 Error 10.492 27 0.389 ================================================================================ Total 14.258 29 POSTHOC MULTIPLE COMPARISONS Tukey HSD: Table of q-statistics ctrl trt1 trt2 ================================= ctrl 0 1.882 ns 2.506 ns trt1 0 4.388 * trt2 0 ================================= + p < .10 (q-critical[3, 27] = 3.0301664694) * p < .05 (q-critical[3, 27] = 3.50576984879) ** p < .01 (q-critical[3, 27] = 4.49413305084)

As can be seen in the output from method anova1way we get a lot more information. Maybe of particular interest here is that we get results from a post-hoc test (i.e., Tukey HSD). Whereas the ANOVA only lets us know that there was a significant effect of treatment the post-hoc analysis reveal where this effect may be (between which groups).

That is it! In this tutorial you learned 4 methods that let you carry out one-way ANOVAs using Python. There are, of course, other ways to deal with the tests between the groups (e.g., the post-hoc analysis). One could carry out Multiple Comparisons (e.g., t-tests between each group. Just remember to correct for familywise error!) or Planned Contrasts. In conclusion, doing ANOVAs in Python is pretty simple.

The post Four ways to conduct one-way ANOVAs with Python appeared first on Erik Marsja.

]]>The post Repeated measures ANOVA using Python appeared first on Erik Marsja.

]]>There are, at least, two of the advantages using within-subjects design. First, more information is obtained from each subject in a within-subjects design compared to a between-subjects design. Each subject is measured in all conditions, whereas in the between-subjects design, each subject is typically measured in one or more but not all conditions. A within-subject design thus requires fewer subjects to obtain a certain level of statistical power. In situations where it is costly to find subjects this kind of design is clearly better than a between-subjects design. Second, the variability in individual differences between subjects is removed from the error term. That is, each subject is his or her own control and extraneous error variance is reduced.

pyvttbl can be installed using pip:

pip install pyvttbl

If you are using Linux you may need to add ‘sudo’ before the pip command. This method installs pyvttbl and, hopefully, any missing dependencies.

I continue with simulating a response time data set. If you have your own data set you want to do your analysis on you can use the method “read_tbl” to load your data from a CSV-file.

from numpy.random import normal import pyvttbl as pt from collections import namedtuple N = 40 P = ["noise","quiet"] rts = [998,511] mus = rts*N Sub = namedtuple('Sub', ['Sub_id', 'rt','condition']) df = pt.DataFrame() for subid in xrange(0,N): for i,condition in enumerate(P): df.insert(Sub(subid+1, normal(mus[i], scale=112., size=1)[0], condition)._asdict())

Conducting the repeated measures ANOVA with pyvttbl is pretty straight forward. You just take the pyvttbl DataFrame object and use the method anova. The first argument is your dependent variable (e.g. response time), and you specify the column in which the subject IDs are (e.g., sub=’Sub_id’). Finally, you add your within subject factor(s) (e.g., wfactors). wfactors take a list of column names containing your within subject factors. In my simulated data there is only one (e.g. ‘condition’).

aov = df.anova('rt', sub='Sub_id', wfactors=['condition']) print(aov)

Source | Type III Sum of Squares | ε | df | MS | F | Sig. | η^{2}_{G} |
Obs. | SE of x̄ | ±95% CI | λ | Obs. Power | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

condition | Sphericity Assumed | 4209536.428 | – | 1.000 | 4209536.428 | 309.093 | 0.000 | 4.165 | 40.000 | 19.042 | 37.323 | 317.019 | 1.000 |

Greenhouse-Geisser | 4209536.428 | 1.000 | 1.000 | 4209536.428 | 309.093 | 0.000 | 4.165 | 40.000 | 19.042 | 37.323 | 317.019 | 1.000 | |

Huynh-Feldt | 4209536.428 | 1.000 | 1.000 | 4209536.428 | 309.093 | 0.000 | 4.165 | 40.000 | 19.042 | 37.323 | 317.019 | 1.000 | |

Box | 4209536.428 | 1.000 | 1.000 | 4209536.428 | 309.093 | 0.000 | 4.165 | 40.000 | 19.042 | 37.323 | 317.019 | 1.000 | |

Error(condition) | Sphericity Assumed | 531140.646 | – | 39.000 | 13618.991 | ||||||||

Greenhouse-Geisser | 531140.646 | 1.000 | 39.000 | 13618.991 | |||||||||

Huynh-Feldt | 531140.646 | 1.000 | 39.000 | 13618.991 | |||||||||

Box | 531140.646 | 1.000 | 39.000 | 13618.991 |

As can be seen in the output table the Sum of Squares used is **Type III** which is what common statistical software use when calculating ANOVA (the *F*-statistic) (e.g., SPSS or R-packages such as ‘afex’ or ‘ez’). The table further contains correction in case our data violates the assumption of Sphericity (which in the case of only 2 factors, as in the simulated data, is nothing to worry about). As you can see we also get generalized eta squared as effect size measure and 95 % Confidence Intervals. It is stated in the docstring for the class Anova that standard Errors and 95% confidence intervals are calculated according to Loftus and Masson (1994). Furthermore, generalized eta squared allows comparability across between-subjects and within-subjects designs (see, Olejnik & Algina, 2003).

Conveniently, if you ever want to transform your data you can add the argument transform. There are several options here; *log* or *log10*, *reciprocal* or *inverse*,* square-root* or *sqrt*, *arcsine* or *arcsin*, and *windsor10*. For instance, if you want to use log-transformation you just add the argument “*transform*=’log'” (either of the previously mentioned methods can be used as arguments in string form):

aovlog = df.anova('rt', sub='Sub_id', wfactors=['condition'], transform='log')

Using pyvttbl we can also analyse mixed-design/split-plot (within-between) data. Doing a split-plot is easy; just add the argument “*bfactors*=” and a list of your between-subject factors. If you are interested in one-way ANOVA for independent measures see my newer post: Four ways to conduct one-way ANOVAS with Python.

Finally, I created a function that extracts the F-statistics, Mean Square Error, generalized eta squared, and the p-value the results obtained with the anova method. It takes a factor as a string, a ANOVA object, and the values you want to extract. Keys for your different factors can be found using the key-method (e.g., *aov.keys()*).

def extract_for_apa(factor, aov, values = ['F', 'mse', 'eta', 'p']): results = {} for key,result in aov[(factor,)].iteritems(): if key in values: results[key] = result return results

Note, the table with the results in this post was created with the private *method _within_html*. To create an HTML table you will have to import SimpleHTML:

import SimpleHTML output = SimpleHTML.SimpleHTML('Title of your HTML-table') aov._within_html(output) output.write('results_aov.html')

That was all. There are at least one downside with using pyvttbl for doing within-subjects analysis in Python (ANOVA). Pyvttbl is not compatible with Pandas DataFrame which is commonly used. However, this may not be a problem since pyvttbl, as we have seen, has its own DataFrame method. There are also a some ways to aggregate and visualizing data using Pyvttbl. Another downside is that it seems like Pyvttbl no longer is maintained.

Loftus, G.R., & Masson, M.E. (1994). Using confidence intervals in within-subjects designs. The Psychonomic Bulletin & Review, 1(4), 476-490.

Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: measures of effect size for some common research designs. Psychological Methods, 8(4), 434–47. http://doi.org/10.1037/1082-989X.8.4.434

The post Repeated measures ANOVA using Python appeared first on Erik Marsja.

]]>The post Descriptive Statistics using Python appeared first on Erik Marsja.

]]>After data collection, most **Psychology researchers** use different ways to summarise the data. In this tutorial we will learn how to do **descriptive statistics **in **Python**. Python, being a programming language, enables us many ways to carry out descriptive statistics.

One useful library for data manipulation and summary statistics is Pandas. Actually, Pandas offers an API similar to Rs. I think that the dataframe in R is very intuitive to use and Pandas offers a DataFrame method similar to Rs. Also, many Psychology researchers may have experience of R.

Thus, in this tutorial you will learn how to do descriptive statistics using Pandas, but also using NumPy, and SciPy. We start with using Pandas for obtaining summary statistics and some variance measures. After that we continue with the central tenancy measures (e.g., mean and median) using Pandas and NumPy. The harmonic, geometric, and trimmed mean cannot be calculated using Pandas or NumPy. For these measures of central tendency we will use SciPy. Towards the end we learn how get some measures of variability (e.g., variance using Pandas).

import numpy as np from pandas import DataFrame as df from scipy.stats import trim_mean, kurtosis from scipy.stats.mstats import mode, gmean, hmean

Many times in **experimental psychology** response time is the dependent variable. I to simulate an experiment in which the dependent variable is response time to some arbitrary targets. The simulated data will, further, have two independent variables (IV, “iv1” have 2 levels and “iv2” have 3 levels). The data are simulated as the same time as a dataframe is created and the first descriptive statistics is obtained using the method *describe*.

N = 20 P = ["noise","quiet"] Q = [1,2,3] values = [[998,511], [1119,620], [1300,790]] mus = np.concatenate([np.repeat(value, N) for value in values]) data = df(data = {'id': [subid for subid in xrange(N)]*(len(P)*len(Q)) ,'iv1': np.concatenate([np.array([p]*N) for p in P]*len(Q)) ,'iv2': np.concatenate([np.array([q]*(N*len(P))) for q in Q]) ,'rt': np.random.normal(mus, scale=112.0, size=N*len(P)*len(Q))})

data.describe()

Pandas will output summary statistics by using this method. Output is a table, as you can see below.

Typically, a researcher is interested in the descriptive statistics of the IVs. Therefore, I group the data by these. Using describe on the grouped date aggregated data for each level in each IV. As can be seen from the output it is somewhat hard to read. Note, the method *unstack* is used to get the mean, standard deviation (std), etc as columns and it becomes somewhat easier to read.

grouped_data = data.groupby(['iv1', 'iv2']) grouped_data['rt'].describe().unstack()

Often we want to know something about the “*average*” or “*middle*” of our data. Using Pandas and NumPy the two most commonly used measures of central tenancy can be obtained; the mean and the median. The mode and trimmed mean can also be obtained using Pandas but I will use methods from SciPy.

There are at least two ways of doing this using our grouped data. First, Pandas have the method mean;

grouped_data['rt'].mean().reset_index()

But the method *aggregate* in combination with NumPys mean can also be used;

grouped_data['rt'].aggregate(np.mean).reset_index()

Both methods will give the same output but the aggregate method have some advantages that I will explain later.

Sometimes the *geometric* or *harmonic* mean can be of interested. These two descriptive statistics can be obtained using the method apply with the methods *gmean* and *hmean* (from SciPy) as arguments. That is, there is no method in Pandas or NumPy that enables us to calculate geometric and harmonic means.

grouped_data['rt'].apply(gmean, axis=None).reset_index()

grouped_data['rt'].apply(hmean, axis=None).reset_index()

Trimmed means are, at times, used. Pandas or NumPy seems not to have methods for obtaining the *trimmed mean*. However, we can use the method *trim_mean* from SciPy . By using apply to our grouped data we can use the function (‘trim_mean’) with an argument that will make 10 % av the largest and smallest values to be removed.

trimmed_mean = grouped_data['rt'].apply(trim_mean, .1) trimmed_mean.reset_index()

Output from the mean values above (trimmed, harmonic, and geometric means):

The *median *can also be obtained using two methods;

grouped_data['rt'].median().reset_index() grouped_data['rt'].aggregate(np.median).reset_index()

There is a method (i.e., pandas.DataFrame.mode()) for getting the mode for a DataFrame object. However, it cannot be used on the grouped data so I will use mode from SciPy:

grouped_data['rt'].apply(mode, axis=None).reset_index()

Most of the time I probably would want to see all measures of central tendency at the same time. Luckily, aggregate enables us to use many NumPy and SciPy methods. In the example below the standard deviation (*std*), mean, harmonic mean, geometric mean, and trimmed mean are all in the same output. Note that we will have to add the trimmed means afterwards.

descr = grouped_data['rt'].aggregate([np.median, np.std, np.mean]).reset_index() descr['trimmed_mean'] = pd.Series(trimmed_mean.values, index=descr.index) descr

Central tendency (e.g., the mean & median) is not the only type of summary statistic that we want to calculate. We will probably also want to have a look at a measure of the variability of the data.

grouped_data['rt'].std().reset_index()

Note that here the use unstack() also get the quantiles as columns and the output is easier to read.

ggrouped_data['rt'].quantile([.25, .5, .75]).unstack()

ggrouped_data['rt'].var().reset_index()

That is all. Now you know how to obtain some of the most common descriptive statistics using Python. Pandas, NumPy, and SciPy really makes these calculation **almost **as easy as doing it in graphical statistical software such as SPSS. One great advantage of the methods apply and aggregate is that we can input other methods or functions to obtain other types of descriptives.

Update: Recently, I learned some methods to explore response times visualizing the distribution of different conditions: Exploring response time distributions using Python.

I am sorry that the images (i.e., the tables) are so ugly. If you happen to know a good way to output tables and figures from Python (something like Knitr & Rmarkdown) please let me know.

The post Descriptive Statistics using Python appeared first on Erik Marsja.

]]>The post Six ways to reverse pandas dataframe appeared first on Erik Marsja.

]]>

First, we create some example data using normal and binomial from numpy.random. Using normal we create two different response time variables (response times are often used in experimental psychology) and binomial is used to create response accuracy (also often used in psychology experiments). We will use pandas to create a dataframe from the data. Note, the simulation of data is not intended to be used for anything else than for us to have data to create a pandas dataframe.

import pandas as pd import numpy as np distracted = [1 for i in xrange(15)] normal_perf = [0 for i in xrange(15)] sex = ['Male' if i%2 == 0 else 'Female' for i in xrange(30)] rt_distracted = np.random.normal(1660, 100, 15) rt_normal_perf = np.random.normal(1054, 97, 15) accuracy_distracted = np.random.binomial(1, .79, 15) accuracy_normal_perf = np.random.binomial(1, .87, 15) rt = np.concatenate((rt_distracted, rt_normal_perf)) acc = np.concatenate((accuracy_distracted, accuracy_normal_perf)) distracted.extend(normal_perf) subject_id = [i for i in xrange(1,31)] data = {'Sub_id':subject_id, 'Condition':distracted, 'RT':rt, 'Accuracy':acc, 'Gender':sex} data_frame = pd.DataFrame(data)

First are we going to change places of the first (“Accuracy) and last column (“Sub_id”).

We start creating a list of the column names and swapping the first item to the last:

columns = data_frame.columns.tolist() columns = columns[-1:] + columns[:-1]

Then we continue with using the created list and the columns in the dataframe will change places.

data_frame = data_frame[columns]

We can sort our dataframe ascending again (i.e., alphabetical):

data_frame = data_frame.sort_index(axis=1 ,ascending=True)

You can also reverse pandas dataframe completely. The second line (columns[::-1]) reverse the list of column names. Third, the dataframe is reversed using that list.

columns = data_frame.columns.tolist() columns = columns[::-1] data_frame = data_frame[columns]

Pandas dataframe can also be reversed by row. That is, we can get the last row to become the first. We start by re-order the dataframe ascending:

data_frame = data_frame.sort_index(axis=1 ,ascending=True)

First, we will use iloc which is integer based.

data_frame = data_frame.iloc[::-1]

However, we can also use sort_index by using the axis 0 (row). This will, of course, reverse the dataframe back to the one we started with.

data_frame = data_frame.sort_index(ascending=True, axis=0)

Lastly, we can use the method reindex to reverse by row. This will sort Pandas DataFrame reversed. Last element will be first.

data_frame = data_frame.reindex(index=data_frame.index[::-1]) data_frame.head()

That was it; six ways to reverse pandas dataframe. You can go to my GitHub-page to get a Jupyter notebook with all the above code and some output: Jupyter notebook.

The post Six ways to reverse pandas dataframe appeared first on Erik Marsja.

]]>The post Why Spyder is the Best Python IDE for Science appeared first on Erik Marsja.

]]>When I started programming in Python I used IDLE which is the IDE that you will get with your installation of Python (e.g., on Windows computers). I actually used IDLE IDE for some time. It was not until I started to learn R and found RStudio IDE. I thought that RStudio was great (and it still is!). However, after learning R and RStudio I started to look for a better Python IDE.

There are, of course, other IDEs. Personally, I used Ninja-IDE for a while but I quite quickly found Spyder IDE **better**. Spyder is an acronym for “Scientific PYthon Development EnviRonment”. The IDE create a MATLAB-like development environment but, as previously mentioned, it is also quite similar to RStudio. Great, for people with experience of both RStudio and MATLAB!

In fact, Python with the right IDE may be a free open source alternative to MATLAB (e.g., Spyder). Last time I used Matlab was 2011 and I don’t have a working installation on any computer. Moreover, a Spyder (Python) vs. MATLAB comparison is out of the scope for this post.

Well, I really like working with RStudio. One of my favourite features of RStudio is that it easy to find help for a package or a function (i.e., type “help(function)”in the console). Spyder has the “Object inspector” (see image below) in which you can search for documentation for a class, function or module.

Another feature in the GUI I really like is the “Variable explorer” (again similar to RStudio; you can see your variables (Data and values) in the *Environment*). When using Spyder your code can also be analysed using Pylint or Pyflakes.

Spyder also got many other great features and here are some of them:

- syntax coloring for Python, C/C++, and Fortran
- breakpoints and conditional breakpoints
- run or debug Python scripts
- iPython and Python consoles

These were some of the qualities that makes me think that the Spyder is the best Python IDE. Obviously, the IDE many other good qualities. Below is a short video showing these properties.

On Ubuntu 14.04 Spyder can be installed by using pip or apt-get:

sudo pip install spyder

sudo apt-get install spyder

I will finish the post with a quick comparison between Spyder and Rodeo. As can bee seen on the image below, Rodeo look almost **identical** to RStudio.

Rodeo is way more minimalistic than Spyder. Thus, there are fewer features. However, being a Python IDE you can type help(“imported module”) to see information of a module (i.e., Object inspector in Spyder). If you are looking for a slimmed Python IDE Rodeo may be the one for you (all features of Spyder may not be useful to you). However, if you want more features and Spyder is the best python IDE.

The post Why Spyder is the Best Python IDE for Science appeared first on Erik Marsja.

]]>The post Every Psychologist Should Learn Programming appeared first on Erik Marsja.

]]>

Everyone should learn computer programming. It should be taught to our kids, in our school. There are so many benefits of programming and I will, therefore, only write about the ones that I find most important. Writing code (i.e., programming) can be seen as applied mathematics and sciences. Some programming languages make you think algorithmically. It teaches you an iterative approach to solving problems and testing out your ideas. Most of the time it is challenging, fun, and, on top of this, it can make your life easier! For instance, using simple scripts you can automate things such as extracting information from your PDFs. This information may be used to rename the PDFs and, if you’d like, organize them in folders. Apart from making your everyday life easier (e.g., automate everyday tasks) of course more reasons for learning programming. This post is going to focus on some of these reasons with an emphasis on when psychologists have use of coding skills.

I am a Ph.D. student with a focus on Cognitive Psychology (BSc. and MSc in Cognitive Science). Naturally, my views on programming for researchers and students are coloured by my discipline. However, I think that everyone should know programming. Many researchers and Ph.D. students that I know are conducting experiments (e.g., in Psychology and Neuroscience) do some programming. Some do easier stuff such as using SPSS syntax and Mplus. Others use more advanced coding in Matlab, E-prime, Python, or C.

If you are planning on graduate studies (i.e., aiming for a Masters or a Ph.D.) in Psychology, or another of the cognitive sciences, programming is almost essential. I learned this when doing my Masters thesis. There was no time for my supervisor to create an experiment in E-prime for me. After my Masters, I started to look for Ph.D. positions and many of the ads required knowledge in Matlab, Python, or R.

Many Psychology researchers, and students doing projects and their thesis’, use some software for collecting data. Although, there are many graphical interfaces (i.e., E-prime & Presentation) you will most likely need to use some scripting language to solve some issues. For instance, the graphical interfaces will probably not be able to do pseudo-randomization. I would say studying cognition or perception on graduate level will require you to have some scripting or coding skills.

There is also an emerging interest in doing psychological research on social media (e.g., how different personality types expresses in Facebook behaviour). The flexibility of programming languages may also open doors to research projects that cannot be reached with common statistical software (e.g., Stata & SAS). Data can be spread across thousands of text documents or available around the web (e.g., social media).

There is lots of software available for data analysis: spreadsheets like Excel, batch-oriented procedure-based systems like SAS; point-and-click GUI-based systems like SPSS, Stata, and Statistica. Choosing an open source programming language is typically free of charge (e.g., R or Python) and offers greater flexibility. Using a programming language, you do data analysis by writing functions and scripts. Although, the learning curve may be steeper it is a very natural and expressive method for conducting data analysis. There are several positive aspects of writing scripts; it documents all your work, and you can automate sequences of tasks. That is, you can easily follow what you did last year (i.e., it is in your script) and save time if you are running similar analysis on many experiments.

Another very important aspect of doing analysis in an open source and free programming language is that it makes reproducibility easier. Your script documents every step of your process and anyone can download and install the software needed to run your analysis. I would say that if you are a proponent of Open Science you should learn a programming language for doing your analysis.

When people discuss beginner programming languages and which languages that are easier and quicker to learn, Python inevitably comes up. It was created by Guido van Rossum, but is administrated by the non-profit organisation Python Software Foundation. The language is open source and free, even for commercial applications. Python is usually used and referred to as a scripting language. It is a high-level and general-purpose programming language. Thanks to its flexibility, Python is one of the most popular programming languages (e.g., 8th on the TIOBE Index for January 2016). It got full support for both object-oriented programming and structured programming.

Why Python? First, it is open source and free. My personal experience with Matlab is that The Python is just far better that Matlab’s weird language. Furthermore, it integrates better with other languages (e.g. C/C++). Most importantly, however, is that there is a variety of both general-purpose and specialized python libraries. This means you can do data collection (e.g., scrape the Web or software psychological experiments) using Python. You can also analyse the your data using both common and more advanced statistical methods. Note that it may be more complicated to install Python. However, there are scientifically focused distributions that has a lot of the libraries that you will want to use. Personally, I have only used Anaconda and Python(x, y) but there is also Canopy.

There are of course so many useful Python libraries so that it would need, at least, a separate post to list them all. However, here are a few I have found useful.

For **data collection**:

- OpenSesame is easy to install package and works on Linux, OS-X, Windows, and Android (both tablets, phones, and computers). The application offers a graphical user interface (GUI) for creating experiments. Requires minimal coding but let you write in-line scripts.
- PsychoPy is also simple to install and cross-platform (Linux, OS-X, and Windows). It promises precision timing, and it has a lot of different types of stimuli ready to use. PsychoPy offers both an Application Programming Interface (API) and a graphical interface for drag-and-drop creation of experiment. You can, of course, combine both.

For** data analysis:**

- Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- Statsmodels is a module that allows users to explore data, estimate statistical models, and perform statistical tests. Some of the features of Statsmodels are; linear regression, generalized linear models, functions for plotting, tools for outputting results in tables in formats such as Text, LaTex, and HTML.

Note that if you install Anaconda, Canopy, or Python(x, y) you will get Pandas and Statsmodels and many more useful libraries. Python(x, y) also includes Spyder IDE.

If you are interested in **Bayesian statistics:**

- PyStan enables you write Python code and send it to Stan. Stan is a package for Bayesian statistics using the No-U-Turn sampler.
- PyMC is a module that implements Bayesian statistical models and fitting algorithms. It includes Markov chain Monte Carlo. PyMC includes methods for summarizing output, plotting, goodness-of-fit and convergence diagnostics.

R is a free and open source programming language. It is a complete, interactive, object-oriented language with a focus on data analysis. In R statistical environment you can carry out a variety of statistical and graphical techniques. For instance, linear and non-linear modeling, classical statistical tests, time-series analysis, classification, and many more can be carried using both frequentist and Bayesian paradigms. Many new and exiting methods are developed in R meaning that if you learn R you have the possibility to use pioneering techniques. There are to many good resources and packages. You can find some resources on learning R in my post: R resources for Psychologists.

Python and R are, both, some the most popular languages for data analysis. As previously mentioned, while Python is a general-purpose language with an easy-to-understand syntax, R’s functionality is developed with statisticians in mind (“language for statisticians by statisticians”). If your primarily interest is doing data analysis R is probably the language to begin with. Python, on the other side, enables you to create experiments and may be good to start with if you need to program your experiments. The amount of data science libraries for Python is increasing so in the future maybe it will be the one for statistics also. Whichever you choose, I suggest that you learn one at a time.

There are a huge number of resources out there to help. I came in contact with programming in school, when I was 17, in an elective course on web programming. I later took some courses in C++ but it was not until I found Python I started to like programming. As a student at the Bachelor’s programme in Cognitive Science I had to take 15 ECTs in Python programming. Focusing on the free programming language (e.g., Python and R) there are so many resources available freely online to learn programming. For instance, you can find both Python and R massive open online courses (MOOCs) at Coursera, and beginner courses at Code Academy for Python, **Javascript** **Ruby**, among other languages. Another promising free and interactive course is Codeschool Try R. Google has Python and C++ courses: Google For Education. For a more statistics focused Python course you can look at CogMaster Statistics course: Python materials and Scipy Lecture Notes (at least one chapter). Another really interesting way to learn R is Swirl. Swirl is an R-package that enables you to install interactive Swirl courses.

You can find some more good resources and books for learning R in my posts R resources for Psychologists and R books for Psychologists.

Learning programming is, of course, not easy. There are really good resources available, and with the right motivation you can learn how to program. If you are early in your career (Master, Ph.D., or Post-doc) the time you spend now on learning programming now will be rewarding in many ways. It is both challenging and fun, and may change your way of thinking. It will also make your research easier to reproduce. Join the Open Science movement and upload your scripts on, for instance, GitHub or Bitbucket. This will enable other researchers to run your exact analysis. What do you think? **Programming is essential or nothing for a Psychologist?**

The post Every Psychologist Should Learn Programming appeared first on Erik Marsja.

]]>The post R resources for Psychologists appeared first on Erik Marsja.

]]>R is a **free **and** open source **programming language and environment. **Data analysis** in R is carried out by writing **scripts** and **functions**. R is a complete, interactive, object-oriented language.

In R statistical environment you are able to carry out a variety of statistical and graphical techniques. For instance, linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, and many more can be carried using both frequentist and bayesian paradigms.

You may want to start with R commander. It provides a menu which may make learning R a bit easier at the beginning. R can be downloaded here: The Comprehensive R Archive Network.

R has a broad and helpful community and, therefore, there are many good resources for learning the language.

- Quick-R: A comprehensive source for information on how to carry out many common statistical analysis as well as
**descriptives**and**graphs**. - SimpleR notes for introducing the use of R for an introductory statistics course.
- The R Guide
- Cookbook for R provides solutions to many tasks and problems in data analysis.
- Revolution Analytics – A blog for news and information about R. Publishes guides and articles about R.
- R-bloggers A “blogg aggregator” for R blogs. Here you can find, and follow, a lot of comprehensive, basic and advanced, guides and posts.
- RStudio – my Integrated Development Environment (IDE) of choice when it comes to R.

- Using R for Psychological Research – here you will find many tutorials for using R in psychological research. These are very pedagogical and helpful. Also homepage of the r-package
*psych*. - Notes on the use of R for psychology experiments and questionnaires
- Learning Statistics with R
**free**book for learning R. Mainly covers**frequentists**methods but have one chapter of**Bayesian**statistics. - Psychometric Models and Methods a brief overview of packages that are closely related to
**psychometrics**. Used to teach statistics to Psychology students. - Mixed Psychophysics – aims to give statistical tools (i.e., R code, models, tutorials, and links to articles) for
**psychophysics**. Contains a short tutorial covering Psychometric functions, generalized mixed models (**GLMM**).

Although, the learning curve for software such as R is steeper than using software with graphical interfaces (i.e., SPSS, Stata, and Statistica) it is not super hard to learn to carry out the most basic and classical statistical tests. If you aim for reproducibility learning R and/or LaTeX or RMarkdown is really the way to go.

The post R resources for Psychologists appeared first on Erik Marsja.

]]>The post Installing Rodeo IDE on Linux appeared first on Erik Marsja.

]]>Rodeo is, as previously mentioned, a Python IDE very similar to RStudio. It is intended to use for **Data Science**. If you are coming from **R** and plan to add Python to your stack, Rodeo is probably going to be very familiar to you. Given that you have used RStudio, that is. I would still say that Spyder may be a better IDE for doing Data Science in Python. Why? Because, up to date, there are plenty of more features in **Spyder** compared Rodeo. Update: now you will get a .deb file when using wget (see below). Thus, we can use dpkg to install the Rodeo.

#!/bin/sh wget -O rodeo.deb https://www.yhat.com/products/rodeo/downloads/linux_64 sudo dpkg -i install rodeo.deb

The above code will download the Linux 64 binaries for Rodeo, unzip it into the ‘/usr/local/bin’ directory, and remove the downloaded file. Finally, a symbolic link to the executable is created. Obviously, you can change 64 to 32 to download and install Rodeo 32 bit. Note, you can just cut & paste the code and paste it into a command window. If you, however, save it as a bash script (i.e., install_rodeo.sh) you need to make it executable; *chmod +x install_rodeo.sh*. To download and install Rodeo:

sh install_rodeo.sh

I have tested installing Rodeo IDE on **Ubuntu 14.04 and 16.04** with the above script. If you don’t have Jupyter and Matplotlib installed it may need to install them also;

pip install matplotlib jupyter

Thats all, now you should know how to easily download and install Rodeo 1.0 on your Linux machine. Please let me know if you need to do more than I described in this post,

The post Installing Rodeo IDE on Linux appeared first on Erik Marsja.

]]>