The post Step-by-step guide for solving the Pyvttbl Float and NoneType error appeared first on Erik Marsja.

First, we will need to install the package virtualenv:

pip install virtualenv

Next, we open a terminal window. Here, we change directory to where we typically keep our Python projects and create a directory for our new Pyvttbl environment:

The next step is to activate the virtual environment:

source pyvttbl_env/bin/activate

Now we should have a virtual environment set up and we can start installing Numpy 1.11.0 and Pyvttbl:

pip install numpy==1.11.0 pyvttbl

In the code snippet above, it is important that we pin the version of Numpy (e.g., 1.11.0) that we want to install. This way, we get a version that has previously worked with Pyvttbl.

**Remember**, if we want to use any other Python packages (e.g., jupyter notebooks, Seaborn, matplotlib) we need to install them in our virtual environment. We can, of course, just add any package to the *pip install* snippet above:

pip install numpy==1.11.0 pyvttbl jupyter matplotlib seaborn

Now, let's check the version of Numpy using ipython:
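A minimal way to do this check, sketched under the assumption that the package exposes a `__version__` attribute (Numpy does). The helper name `installed_version` is made up for this example:

```python
import importlib


def installed_version(package_name):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    # Not every package defines __version__, so fall back to None.
    return getattr(module, "__version__", None)


# Inside the activated pyvttbl_env this should report the pinned Numpy release.
print(installed_version("numpy"))
```

If the printed version is not the one we pinned, the interpreter is most likely not the one from the virtual environment.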

Finally, if we want to do our ANOVA in a Jupyter notebook we need to install a new kernel that uses the Python in the virtual environment. If we don’t do this we will run the notebook from the global environment and, thus, we will import whatever version of Numpy we have installed on the system:

ipython kernel install --name "pyvttbl_env" --user

As in the image above, we also start up a Jupyter notebook. When creating a new notebook to run our Python ANOVAs using Pyvttbl, we need to select our new kernel (i.e., “pyvttbl_env”). In the notebook below (after the Windows set-up) you can see that we don’t get the error “unsupported operand type(s) for +: ‘float’ and ‘NoneType’”. Note, we do get a warning that what Pyvttbl is trying to do is not going to work in the future.
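For reference, the error message itself is the one Python raises whenever a float is added to None. The snippet below only illustrates that message; it is not the Pyvttbl code path that triggers it, and the helper name `add_or_message` is made up for the example:

```python
def add_or_message(left, right):
    """Add two values, returning the TypeError message if they cannot be added."""
    try:
        return left + right
    except TypeError as error:
        return str(error)


print(add_or_message(1.5, 2.0))   # a normal float sum
print(add_or_message(1.5, None))  # the message from the error discussed above
```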

For Windows, the set-up is basically the same. First, we will need to install virtualenv:

pip install virtualenv

The next step for Windows users is, of course, similar, but we need to start the Command Prompt. Let's change the directory to where our Python projects are intended to be stored and create the new Pyvttbl environment. First, we create a directory and then we create the virtual environment in this directory:

We continue by activating our virtual environment for running Numpy and Pyvttbl. Here, it is a bit different for Windows users compared to Linux users:

pyvttbl_env\Scripts\activate

Now that we have a virtual environment set up we can install an older version of Numpy. Here we also depart from how this is done in Linux. I had a bit of a problem installing Scipy (which Pyvttbl depends on) using pip.

Fortunately, there are Windows binaries for both Numpy and Scipy. Download the version you need and put it in the directory for the virtual environment and use pip to install:

pip install numpy-1.11.3+mkl-cp27-cp27m-win32.whl scipy-0.19.1-cp27-cp27m-win32.whl

Any additional packages that we may need in our data analysis will, of course, also have to be installed in the virtual environment. Luckily for us, we can just use pip again (no external downloading, that is!):

pip install jupyter matplotlib seaborn

Since we plan to run our ANOVA in a notebook we have one final step that needs to be done. We need to get Jupyter to run WITHIN the virtual environment (or else it will use whatever Numpy version we had problems with earlier). Thus, we need to install a new kernel for the virtual environment. This is easily done using the Command Prompt, again:

ipython kernel install --name "pyvttbl_env" --user

In this notebook, we can run our ANOVA using Pyvttbl without the “unsupported operand type(s) for +: ‘float’ and ‘NoneType’” error! Note, Windows users will have a different Numpy version (we installed 1.11.3), and if you run Linux you can of course install this version too. Of course, we do get the warning (the future is now!)

That was it, now we can use Pyvttbl on both Linux and Windows computers without the problem with float/NoneType. This is of course not an optimal solution but it does the trick. If anyone has other ideas on how to solve this problem, please let me know.

The post PyCharm vs Spyder: a quick comparison of two Python IDEs appeared first on Erik Marsja.

Spyder is one of my long-time favorite IDEs, and I mainly use Spyder when I have to write code in Windows environments. However, in a comment on one of my blog posts (see the comments on this post: Why Spyder is the Best Python IDE for Science) it was suggested that I should test PyCharm.

After testing out PyCharm I started to like this IDE. In this post you will find my views on the two IDEs. That is, I intend to answer the question: which is the best Python IDE, PyCharm or Spyder?

The post is divided into four sections. In the first section (1) I will outline some shared features of PyCharm and Spyder. I will then continue with describing features that are unique to PyCharm (2) and Spyder (3). Finally, I will go on and compare the two Python IDEs (4).

I will start by discussing some of the shared features of PyCharm and Spyder. First, both IDEs are free (well, Spyder is “more” free compared to PyCharm, but if you are a student or a researcher you can also get the full version of PyCharm for free) and cross-platform. This means that you can download and install both Spyder and PyCharm on your Windows, Linux, or OS X machine. This is of course awesome! Both PyCharm and Spyder also let you create projects, have an editor with syntax highlighting and introspection for code completion, and have support for plugins.

I must admit, the main thing I liked with PyCharm was that I could change the theme to a dark one. I really prefer having my applications dark. That said, PyCharm of course comes with a bunch of features. I will not list all of them here but if you are interested you can read more here. As I have mentioned earlier, both PyCharm and Spyder have support for plugins. However, I find it easier to find and install plugins in PyCharm. To install a plugin you just open up *Settings* (File -> Settings) and click on “Plugins”:

This makes it very easy to search for plugins. For instance, one can install Markdown plugins to also write Markdown files (.md) that can be uploaded to your GitHub page. That leads me to another GREAT feature of PyCharm: support for different types of Version Control Systems (VCS; e.g., Git, Subversion, and Mercurial). For example, uploading your work to GitHub is only a few clicks away (if you prefer not to use the command line, that is).

Another great feature is that you can set the width of your code and PyCharm will end your line and move it to the next line (great if you are a lazy programmer).

Another feature of PyCharm is that you can safely rename and delete, and extract your methods, among other things. It may be very helpful if you need to rename a variable that is used in various places in your code.

One of my favorite features is that you can, much like in RStudio for R, install Python packages from within the interface. PyCharm offers an easy system to browse, download, and update third-party packages. If you are not only working with Python projects, PyCharm also provides support for JavaScript, CoffeeScript, TypeScript, and CSS, for instance.

First of all, Spyder is made for and in Python! Of course this is not a feature of the IDE itself but I like that it’s quite pure Python!

However, one of the most obvious pros with Spyder is that it is much easier to install (e.g., in Ubuntu) compared to PyCharm. Whereas PyCharm must be downloaded and installed, Spyder can be installed using pip. It is also available in the package managers of many Linux distributions (e.g., apt in Debian and Ubuntu). There is one thing, however, that I really like with the Spyder interface: the variable explorer.

In Spyder it is also quite easy to get help when you get stuck and are not sure how to use a certain function or method. The help function of the Spyder IDE lets you type in the object and get its docstring printed out. It can come in very handy, I think.

It is easier to install Spyder (at least in Linux) but PyCharm is not that hard to install. In fact, if you are running Ubuntu you can just add a PPA (See here on how to install PyCharm this way) and install PyCharm using your favourite package manager. If you are a Windows user, you just download an installation file (Download PyCharm).

Spyder is also part of two great Python distributions, Anaconda and WinPython. Anaconda is cross-platform and WinPython is for Windows only. Both distributions come with most of the Python packages that you may need (and probably more than you need!). Thus, you will get Spyder and a lot of what you need to write code in one installation.

PyCharm's support for VCS (e.g., Git and Mercurial) is also a great feature that speaks in favor of PyCharm. I know that some people find this attractive; they don’t have to use the command line.

Okay, which IDE do I think is the best? I think that Spyder, still, is a great IDE. PyCharm does, of course, offer a lot more features. If you are running a relatively new computer and are using Linux (e.g., Ubuntu), PyCharm may be the best (almost) free Python IDE.

On the other hand, if you are using Windows and don’t want to install a lot of Python packages by yourself, you can get Spyder by installing either Anaconda or WinPython.

In fact, in the lab where we run Windows 10 computers, I have installed Anaconda (as can be read in the comments, Python(x, y) is no longer maintained). Here I use Spyder but at home I tend to write in PyCharm.

In conclusion, for scientific use maybe Spyder is the best free Python IDE (for Windows, Linux and OS-X). If you are a more general programmer or want to have a lot of features within the interface, PyCharm may be your choice!

The post OpenSesame Tutorial – How to Create a Flanker Task appeared first on Erik Marsja.

The task we are going to use in this OpenSesame tutorial is a version of the Flanker Task. In the version we are going to use here the task is to respond, as quickly and accurately as possible, to the direction of an arrow. The arrow will be surrounded by arrows pointing either in the same direction (congruent; e.g., “<<<<<”) or in the other direction (incongruent; e.g., “>><>>”). In the example, there will be four practice trials and 128 test trials. Each trial, whether practice or test, will start with the presentation of a fixation cross for 2000ms. Following the fixation cross, the target and flankers will be presented (also for 2000ms). The general layout can be seen in the figure below.

In this tutorial we start with OpenSesame’s default template:

Generally, this is what OpenSesame looks like when we start it. Now we go ahead and delete the getting_started item by right-clicking on it and selecting “Permanently delete all linked copies”.

When this is done, we rename the welcome sketchpad to “WelcomeScreen” (click on the blue text to rename it). In the “WelcomeScreen” we are going to add text containing the task instructions. To change the text, double-click on “OpenSesame 3.1 *Jazzy James*” (or whatever text your version of OpenSesame has):

In this dialogue we will type our welcome and task instructions and the text in the sketchpad will change accordingly:

The next step is to create a loop for our practice trials by dragging and dropping the loop object under our welcome screen. We give it the name “practice_trials” by clicking on new_loop (blue text). In our case, we add 4 trials. First, we name three columns: *targets*, *congruent*, and *correct_response*. In the column *targets*, we put the four different types of stimuli we are going to use in our example (“<<<<<”, “>><>>”, “>>>>>”, and “<<><<”) and in the column *congruent* we put 0 and 1 for congruent and incongruent trials, respectively. Finally, we add the correct response in the last column (“x” for pointing left and “m” for pointing right).
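The loop table can also be thought of as a small data structure. The sketch below builds the four practice rows in plain Python, outside of OpenSesame. `make_trial` is a hypothetical helper, and it assumes that the target is the central arrow of the five-character string:

```python
def make_trial(target):
    """Build one row of the loop table from a five-character flanker string."""
    middle = len(target) // 2
    center = target[middle]
    flankers = target[:middle] + target[middle + 1:]
    return {
        "targets": target,
        # 0 = congruent, 1 = incongruent, following the coding in the text
        "congruent": 0 if all(f == center for f in flankers) else 1,
        # "x" when the central arrow points left, "m" when it points right
        "correct_response": "x" if center == "<" else "m",
    }


practice_trials = [make_trial(t) for t in ("<<<<<", ">><>>", ">>>>>", "<<><<")]
for trial in practice_trials:
    print(trial)
```

Each dictionary corresponds to one row in the loop table, with the keys matching the three column names.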

The next thing we need to do is to insert a sequence into our *practice_trials* loop. Here we will add our fixation cross, flanker stimuli, and so on. Basically, things that are put into the sequence item are run in the order they appear. To add the sequence item, drag and drop the item to the practice trials loop:

After we have created our sequence item we can rename it to “practice_seq” and insert a *sketchpad*. Each trial is going to start with the presentation of the fixation dot followed by the flanker stimuli. A sketchpad is inserted in a similar way as the sequence item. That is, we drag the sketchpad object and drop it into the practice sequence. When we have our first sketchpad, we rename it to “fixation” and set the duration to 2000ms:

Next, we are going to draw the fixation dot in the middle of the sketchpad item. It is quite easy, just select the icon looking like a crosshair and click in the middle of the black screen with a green grid:

Now we have created our fixation dot and we are ready to add a new sketchpad for presenting the Flanker stimuli. We drag and drop a sketchpad item but on this item we select a textline element (the icon with an “A”) and click on the middle of the screen. In the dialogue that pops up we write “[targets]”:

It is important that we write “targets” within the brackets because this is what tells OpenSesame to get the text we wrote in the column earlier (in the practice loop). As you may notice in the image below, we also set the duration to 2000ms and increased the font size (36 px). What seems to be important here is to untick HTML. With the HTML setting enabled, the arrows (“<”) look a bit weird. Since we don’t typically use the data from practice trials we skip, in this OpenSesame tutorial, how to collect responses for them. We will look at this when we create our test trials, however.

The last thing we do before we create our experiment trials is to add a sketchpad *after* the practice loop and name it “end_practice”. In this sketchpad we add a textline object with the text: “That was the practice trials. Press ANY key to start the actual test.”

A neat thing with OpenSesame is that we can copy our practice loop by right clicking on it. This way we can skip creating a new loop, new sequence, and so on. In this case, we copy the practice loop unlinked so that we can add trials to our experiment block:

If we right click again (or press ctrl-V) we can choose to paste our copied loop after the practice trials (and we do that). We go on and rename the new loop to “experiment_loop” and the new sequence to “experiment_seq”. Finally, we copy-and-paste our trials until we get 64 trials:

In the Flanker task we are going to collect responses, of course. We now need to add a response device. As with all objects in OpenSesame, we can find the keyboard object to the left, in the menu. We just drag and drop it under the *target_1* sketchpad. Here we just add 1200 to *Timeout* because we want a response window with the duration of 1200ms. That we named one column “correct_response” means that we don’t have to add that to our keyboard:

Because we want to have some data recorded (i.e., correct responses and response time) the last thing we will add to the experiment is a logger. In this object we just leave it so that it logs everything but we could, if we would like, just tell OpenSesame which different items we want to record.

We also want to tell the subjects that the task has ended, so we add a final item to the test: the end screen. This will, again, be a sketchpad with the name “end_screen” and the text: “That was the test. Thank you for participating!”

I also created a Video Tutorial that is very similar to the how-to in the text above:

The post How to do Descriptive Statistics in Python using Numpy appeared first on Erik Marsja.

In this example I am going to use the ToothGrowth dataset (download here). It is pretty easy to load a CSV file using the *genfromtxt* function:

import numpy as np

data_file = 'ToothGrowth.csv'
data = np.genfromtxt(data_file, names=True, delimiter=",", dtype=None)

Notice the arguments we pass. The first row has the column names, which is why we set the argument ‘*names*’ to True. One of the columns, further, contains strings. Setting ‘*dtype*’ to *None* lets Numpy determine the type of each column individually, so we can load the strings as well as the floats and integers into our data.
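To see what these arguments do without downloading anything, here is a small sketch that reads a few made-up rows from memory instead of from ‘ToothGrowth.csv’. It assumes Numpy 1.14 or later, because it passes the `encoding` argument so that the string column is read as text:

```python
import io

import numpy as np

# A few made-up rows in the same layout (len, supp, dose); not the real data.
csv_text = u"len,supp,dose\n4.2,VC,0.5\n11.5,VC,0.5\n15.2,OJ,0.5\n21.5,OJ,1.0\n"

data = np.genfromtxt(io.StringIO(csv_text), names=True, delimiter=",",
                     dtype=None, encoding="utf-8")

print(data.dtype.names)    # column names taken from the header row
print(data["supp"])        # the string column
print(data["len"].mean())  # numeric columns support the usual array methods
```

The result is a structured array, so each column is accessed by its name, just as in the rest of this post.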

In the next code chunk, below, we are going to loop through each level of the two factors (i.e., ‘supp’ and ‘dose’) and create a subset of the data for each crossed level. If you are familiar with Pandas, you may notice that subsetting a Numpy *ndarray* is pretty similar (data[data[yourvar] == level]). The summary statistics are then appended to a list.

summary_stats = []
for supp_lvl in np.unique(data['supp']):
    for dose_lvl in np.unique(data['dose']):
        # Subsetting
        data_to_sum = data[(data['supp'] == supp_lvl) & (data['dose'] == dose_lvl)]
        # Calculating the descriptives
        mean = data_to_sum['len'].mean()
        sd = data_to_sum['len'].std()
        max_supp = data_to_sum['len'].max()
        min_supp = data_to_sum['len'].min()
        ps = np.percentile(data_to_sum['len'], [25, 75])
        summary_stats.append((mean, sd, max_supp, min_supp, ps[0], ps[1], supp_lvl, dose_lvl))

From the list of data we are going to create a Numpy array. The reason for doing this is that it will get us a bit prettier output, especially when we set the print options (the call to *np.set_printoptions*, below).

results = np.array(summary_stats, dtype=None)
np.set_printoptions(suppress=True)
print(results)

That was it. I still prefer doing my descriptive statistics using Pandas, primarily because the output is much nicer, but it’s also easier to work with Pandas dataframes compared to Numpy arrays.
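For comparison, a minimal sketch of the Pandas route, using a handful of made-up values rather than the real dataset. Note that Pandas’ `std` uses the sample standard deviation (ddof=1) whereas the Numpy `.std()` method defaults to ddof=0, so the two will not give identical values:

```python
import pandas as pd

# Made-up values in the ToothGrowth layout (len, supp, dose); not the real data.
df = pd.DataFrame({
    "len": [4.2, 11.5, 15.2, 21.5, 16.5, 23.6],
    "supp": ["VC", "VC", "OJ", "OJ", "VC", "OJ"],
    "dose": [0.5, 0.5, 0.5, 0.5, 1.0, 1.0],
})

# One row of descriptives per supp x dose cell.
summary = df.groupby(["supp", "dose"])["len"].agg(["mean", "std", "min", "max"])
print(summary)
```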

The post How to use Python to create an iCalendar file from a Word table appeared first on Erik Marsja.

After some searching around on the Internet I found the Python packages python-docx and iCalendar. In this post I will show you how to use these two packages to create an iCalendar file that can be loaded into a lot of available calendars.

Both Python packages can be installed using pip:

pip install python-docx icalendar

In the example code I used a table from a Word document containing 4 columns. It is a pretty simple example: the first column stores the date, the second the time, the third the room (location), and the last the activity of the event (e.g., lecture).

In the first code chunk, below, we start by importing the needed modules. Apart from using Document from python-docx, Calendar and Event from iCalendar, we are going to use datetime from datetime. Datetime is used to store the date in a format that icalendar “likes”.

from datetime import datetime

from docx import Document
from icalendar import Calendar, Event

document = Document('course_schedule.docx')
table = document.tables[0]

data = []
keys = {}
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    if i == 0:
        keys = tuple(text)
        continue

    row_data = dict(zip(keys, text))

In the next chunk of code (in the same loop as above) we split the date and time. We do this because of the formatting of the date and time in the example (“5/4” and “9-12”). As previously mentioned, the date needs to be formatted in a certain way (e.g., using Python's datetime). In the table from the Word document, some of the events are deadlines and, thus, have no set time. Therefore, we need to see if we have a time to use. If not, we set the time to 17:00-17:01. There is probably a better way to do this but this will do for now. The last line adds each event (as a Python dictionary) to our list containing all data.

    e_date = row_data['Date'].split('/')
    e_time = row_data['Time'].split('-')

    if len(e_time) > 1:
        row_data[u'dtstart'] = datetime(2017, int(e_date[1]), int(e_date[0]), int(e_time[0]), 0, 0)
        row_data[u'dtend'] = datetime(2017, int(e_date[1]), int(e_date[0]), int(e_time[1]), 0, 0)
    else:
        # No time given (e.g., a deadline): use 17:00-17:01
        row_data[u'dtstart'] = datetime(2017, int(e_date[1]), int(e_date[0]), 17, 0, 0)
        row_data[u'dtend'] = datetime(2017, int(e_date[1]), int(e_date[0]), 17, 1, 0)

    data.append(row_data)

Now that we have a list of dictionaries containing our lectures/seminars (one dictionary for each) we can use iCalendar to create the calendar. First we create the calendar object and then continue with looping through our list of dictionaries. In the loop we create an event and add the information. In the example here we use the activity as both summary and description but we could have had a summary of the activity and a more detailed description if we’d liked.

The crucial parts, perhaps, are ‘dtstart’ and ‘dtend’. These are the starting time and ending time of the event (e.g., a lecture). We continue to add the location (e.g., the room of the event) and add the event to our calendar. Finally, we create a file (‘course_schedule.ics’), write the calendar to the file, and close the file.

cal = Calendar()

for row in data:
    event = Event()
    event.add('summary', row['Activity'])
    event.add('dtstart', row['dtstart'])
    event.add('dtend', row['dtend'])
    event.add('description', row['Activity'])
    event.add('location', row['Room'])
    cal.add_component(event)

f = open('course_schedule.ics', 'wb')
f.write(cal.to_ical())
f.close()

Now we have our iCalendar file (course_schedule.ics) and can load it into our calendar software. I typically use Lightning (a calendar add-on for Thunderbird). To open the iCalendar file we created using Python, go to File, Open, and Calendar File. Finally, select your iCalendar file:

After you have done that, your new schedule should be loaded into Lightning. Your schedule will be loaded as a separate calendar. As you can see in the image below, your lectures and computer labs will show up.

In this post, we learned how to use Python (python-docx) to extract a schedule from a table in a Word Document (.docx). We used the data to create an iCalendar file that we can load into many Calendar applications (e.g., Google, Lightning).

The post Python Video Tutorial: Creating a Flanker Task using Expyriment appeared first on Erik Marsja.

In the tutorial you will get familiar with Expyriment and get to create a commonly used task in Psychology: the Flanker task. In this task, you are to respond to the direction of an arrow surrounded by distractors (arrows pointing in either the same or the other direction). It shows how hard it can be to ignore irrelevant information (arrows pointing in the wrong direction).

The post Exploring response time distributions using Python appeared first on Erik Marsja.

I used the following Python packages: Pandas for data storing/manipulation, NumPy for some calculations, Seaborn for most of the plotting, and Matplotlib for some tweaking of the plots. Any script using the functions below should import them:

from __future__ import division

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

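The first plot is the easiest to create using Python: visualizing the kernel density estimation. It can be done using the Seaborn package only. *kde_plot* takes the arguments *df* (a Pandas dataframe), *conditions* (a list of the conditions), *dv* (the name of the column with the dependent variable), *col_name* (the name of the column with the conditions), and *save_file*. Note, in the beginning of the function I set the style to white and to ticks. I do this because I want a white background and ticks on the axes.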

def kde_plot(df, conditions, dv, col_name, save_file=False):
    sns.set_style('white')
    sns.set_style('ticks')
    fig, ax = plt.subplots()

    for condition in conditions:
        condition_data = df[(df[col_name] == condition)][dv]
        sns.kdeplot(condition_data, shade=True, label=condition)

    sns.despine()

    if save_file:
        plt.savefig("kernel_density_estimate_seaborn_python_response"
                    "-time.png")
    plt.show()

Using the function above you can basically plot as many conditions as you like (however, with too many conditions, the plot will probably be cluttered). I use some response time data from a Flanker task:

# Load the data
frame = pd.read_csv('flanks.csv', sep=',')
kde_plot(frame, ['incongruent', 'congruent'], 'RT', 'TrialType', save_file=False)

Next up is plotting the cumulative distribution functions (CDFs). In the first function, the CDF for each condition is calculated. It takes the arguments *df* (a Pandas dataframe) and *conditions* (a list of the conditions).

def cdf(df, conditions=['congruent', 'incongruent']):
    data = {i: df[(df.TrialType == conditions[i])] for i in range(len(conditions))}
    plot_data = []

    for i, condition in enumerate(conditions):
        rt = data[i].RT.sort_values()
        yvals = np.arange(len(rt)) / float(len(rt))

        # Append it to the data
        cond = [condition] * len(yvals)
        df = pd.DataFrame(dict(dens=yvals, dv=rt, condition=cond))
        plot_data.append(df)

    plot_data = pd.concat(plot_data, axis=0)
    return plot_data
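The core of the calculation in the function above is just a sort followed by dividing each rank by the number of observations. A stripped-down sketch for a single condition, with made-up response times (`empirical_cdf` is a hypothetical helper name, not part of the code in this post):

```python
import numpy as np


def empirical_cdf(values):
    """Return the sorted values and, for each, its rank divided by n."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    proportions = np.arange(len(sorted_values)) / float(len(sorted_values))
    return sorted_values, proportions


# Made-up response times in ms, for illustration only.
rts, probs = empirical_cdf([520, 410, 630, 480])
print(rts)
print(probs)
```

Plotting `probs` against `rts` gives the step-like curve that *cdf_plot* draws for each condition.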

Next is the plot function (*cdf_plot*). The function takes a Pandas dataframe (created with the function above) as argument, as well as *save_file* and *legend*.

def cdf_plot(cdf_data, save_file=False, legend=True):
    sns.set_style('white')
    sns.set_style('ticks')
    g = sns.FacetGrid(cdf_data, hue="condition", size=8)
    g.map(plt.plot, "dv", "dens", alpha=.7, linewidth=1)
    if legend:
        g.add_legend(title="Congruency")
    g.set_axis_labels("Response Time (ms.)", "Probability")
    g.fig.suptitle('Cumulative density functions')
    if save_file:
        g.savefig("cumulative_density_functions_seaborn_python_response"
                  "-time.png")
    plt.show()

Here is how to create the plot on the same Flanker task data as above:

cdf_dat = cdf(frame, conditions=['incongruent', 'congruent'])
cdf_plot(cdf_dat, legend=True, save_file=False)

In Psychological research, Delta plots (DPs) can be used to visualize and compare response time (RT) quantiles obtained under two experimental conditions. DPs make it possible to examine whether the experimental manipulation has a larger effect on the relatively fast responses or on the relatively slow ones (e.g., Speckman, Rouder, Morey, & Pratte, 2008).
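The idea can be sketched in a few lines of Numpy: compute the same quantiles in both conditions, then take their difference (the effect) and their average. This is a simplified illustration on made-up, pooled data; `delta_points` is a hypothetical helper and, unlike the functions in this post, it does not work per subject:

```python
import numpy as np


def delta_points(rt_one, rt_two, probs=np.arange(10, 100, 10)):
    """Return (average, difference) of the RT deciles of two conditions."""
    q_one = np.percentile(rt_one, probs)
    q_two = np.percentile(rt_two, probs)
    return (q_one + q_two) / 2.0, q_one - q_two


# Made-up RTs (ms): the incongruent condition is uniformly 50 ms slower.
congruent = np.array([400., 420., 450., 480., 500., 530., 560., 600., 650., 700.])
incongruent = congruent + 50

average, diff = delta_points(incongruent, congruent)
print(average)
print(diff)
```

With a constant 50 ms shift the difference is the same at every decile, so the delta plot would be a flat line. An effect that grows for slower responses would instead show up as a rising line.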

In the following script I have created two functions: *calc_delta_data* and *delta_plot*. *calc_delta_data* takes a Pandas dataframe (df). For the rest of the arguments you need to fill in the column names for the subject id, the dependent variable (e.g., RT), and the conditions, all as strings. The last argument should contain a list of strings with the levels of your condition.

def calc_delta_data(df, subid, dv, condition, conditions=['incongruent', 'congruent']):
    subjects = pd.Series(df[subid].values.ravel()).unique().tolist()
    subjects.sort()

    deciles = np.arange(0.1, 1., 0.1)

    cond_one = conditions[0]
    cond_two = conditions[1]

    # Frame to store the data (per subject)
    arrays = [np.array([cond_one, cond_two]).repeat(len(deciles)),
              np.array(deciles).tolist() * 2]
    data_delta = pd.DataFrame(columns=subjects, index=arrays)

    for subject in subjects:
        sub_data_inc = df.loc[(df[subid] == subject) & (df[condition] == cond_one)]
        sub_data_con = df.loc[(df[subid] == subject) & (df[condition] == cond_two)]

        inc_q = sub_data_inc[dv].quantile(q=deciles).values
        con_q = sub_data_con[dv].quantile(q=deciles).values
        for i, dec in enumerate(deciles):
            # Single .loc call (row, column) instead of chained indexing
            data_delta.loc[(cond_one, dec), subject] = inc_q[i]
            data_delta.loc[(cond_two, dec), subject] = con_q[i]

    # Aggregate deciles (cast to float first; the empty frame starts out as object dtype)
    data_delta = data_delta.astype(float).mean(axis=1).unstack(level=0)
    # Calculate difference
    data_delta['Diff'] = data_delta[cond_one] - data_delta[cond_two]
    # Calculate average
    data_delta['Average'] = (data_delta[cond_one] + data_delta[cond_two]) / 2
    return data_delta

The next function, *delta_plot*, takes the data returned from the *calc_delta_data* function to create a line graph.

def delta_plot(delta_data, save_file=False):
    ymax = delta_data['Diff'].max() + 10
    ymin = -10
    xmin = delta_data['Average'].min() - 20
    xmax = delta_data['Average'].max() + 20

    sns.set_style('white')
    g = sns.FacetGrid(delta_data, ylim=(ymin, ymax), xlim=(xmin, xmax), size=8)
    g.map(plt.scatter, "Average", "Diff", s=50, alpha=.7, linewidth=1, edgecolor="white")
    g.map(plt.plot, "Average", "Diff", alpha=.7, linewidth=1)
    g.set_axis_labels("Average RTs (ms.)", "Effect (ms.)")
    g.fig.suptitle('Delta Plot')
    if save_file:
        g.savefig("delta_plot_seaborn_python_response-time.png")
    plt.show()

The above functions are quite easy to use. First load your data (again, I use data from a Flanker task).

# Load the data
frame = pd.read_csv('flanks.csv', sep=',')

# Calculate delta plot data and plot it
d_data = calc_delta_data(frame, "SubID", "RT", "TrialType", ['incongruent', 'congruent'])
delta_plot(d_data)

Conditional accuracy functions (CAFs) also incorporate the accuracy in the task. Creating CAFs involves binning your data (e.g., the response time and accuracy) and creating a line graph. Briefly, CAFs can capture patterns related to speed/accuracy trade-offs (see Richard, 2014). The first function, *calc_caf*, calculates the CAF data for one condition:

def calc_caf(df, subid, rt, acc, trialtype, quantiles=[0.25, 0.50, 0.75, 1]):
    # Subjects
    subjects = pd.Series(df[subid].values.ravel()).unique().tolist()
    subjects.sort()

    # Multi-index frame for data:
    arrays = [np.array(['rt'] * len(quantiles) + ['acc'] * len(quantiles)),
              np.array(quantiles * 2)]
    data_caf = pd.DataFrame(columns=subjects, index=arrays)

    # Calculate CAF for each subject
    for subject in subjects:
        sub_data = df.loc[(df[subid] == subject)]
        subject_cdf = sub_data[rt].quantile(q=quantiles).values

        # Calculate mean response time and accuracy for each bin
        for i, q in enumerate(subject_cdf):
            quantile = quantiles[i]
            if i < 1:
                # First bin: everything faster than the first quantile
                temp_df = sub_data[(sub_data[rt] < subject_cdf[i])]
            else:
                # Remaining bins: between the previous and the current quantile
                temp_df = sub_data[(sub_data[rt] > subject_cdf[i - 1]) & (sub_data[rt] < q)]
            # RT
            data_caf.loc[('rt', quantile), subject] = temp_df[rt].mean()
            # Accuracy
            data_caf.loc[('acc', quantile), subject] = temp_df[acc].mean()

    # Aggregate subjects CAFs (cast to float first; the empty frame is object dtype)
    data_caf = data_caf.astype(float).mean(axis=1).unstack(level=0)

    # Add trialtype
    data_caf['trialtype'] = [trialtype for _ in range(len(quantiles))]
    return data_caf
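The binning idea behind the function above can also be sketched more compactly. The example below swaps in `pandas.qcut` to form the RT bins and works on made-up data pooled over subjects, so it is an illustration of the principle rather than a replacement for *calc_caf*. `simple_caf` is a hypothetical helper name:

```python
import pandas as pd


def simple_caf(rt, acc, n_bins=4):
    """Mean RT and mean accuracy within equally populated RT bins."""
    trials = pd.DataFrame({"rt": rt, "acc": acc})
    # qcut assigns each trial to an RT quantile bin (0 = fastest responses).
    trials["bin"] = pd.qcut(trials["rt"], n_bins, labels=False)
    return trials.groupby("bin")[["rt", "acc"]].mean()


# Made-up trials: RT in ms and accuracy coded 0 (error) or 1 (correct).
caf = simple_caf(rt=[300, 320, 400, 420, 500, 520, 600, 620],
                 acc=[0, 1, 1, 1, 1, 1, 1, 0])
print(caf)
```

Each row is one bin, and plotting accuracy against mean RT per bin gives the conditional accuracy function.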

*caf_plot* (the function below) uses Seaborn, again, to plot the conditional accuracy functions.

def caf_plot(df, legend_title='Congruency', save_file=True):
    sns.set_style('white')
    sns.set_style('ticks')

    g = sns.FacetGrid(df, hue="trialtype", size=8, ylim=(0, 1.1))
    g.map(plt.scatter, "rt", "acc", s=50, alpha=.7, linewidth=1, edgecolor="white")
    g.map(plt.plot, "rt", "acc", alpha=.7, linewidth=1)
    g.add_legend(title=legend_title)
    g.set_axis_labels("Response Time (ms.)", "Accuracy")
    g.fig.suptitle('Conditional Accuracy Functions')

    if save_file:
        g.savefig("conditional_accuracy_function_seaborn_python_response"
                  "-time.png")
    plt.show()

Right now, the function for calculating the conditional accuracy functions can only handle one condition at a time. Thus, in the code below I subset the Pandas dataframe (the same old Flanker data as in the previous examples) for the incongruent and congruent conditions. The CAFs for these two subsets are then concatenated (i.e., combined into one dataframe) and plotted.

    # Conditional accuracy function (data) for incongruent and congruent conditions
    inc = calc_caf(frame[(frame.TrialType == "incongruent")], "SubID", "RT", "ACC",
                   "incongruent")
    con = calc_caf(frame[(frame.TrialType == "congruent")], "SubID", "RT", "ACC",
                   "congruent")

    # Combine the data and plot it
    df_caf = pd.concat([inc, con])
    caf_plot(df_caf, save_file=True)
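To make the binning idea concrete, here is a minimal sketch for a single subject. It is not the function used in this post: the data are made up, and `pd.qcut` is swapped in as a shortcut for cutting response times into equally populated bins (it treats bin edges slightly differently from the strict comparisons in `calc_caf`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for a single subject: 200 trials with response
# times (ms) and accuracy (1 = correct, 0 = error).
rng = np.random.default_rng(1)
trials = pd.DataFrame({
    "RT": rng.normal(500, 80, size=200),
    "ACC": rng.integers(0, 2, size=200),
})

# Cut the response times into four equally populated bins (quartiles)
trials["bin"] = pd.qcut(trials["RT"], q=4, labels=False)

# Mean RT and proportion correct per bin: one CAF for this subject
caf = trials.groupby("bin")[["RT", "ACC"]].mean()
print(caf)
```

Each row of `caf` is one point on the conditional accuracy function: the mean response time of a bin paired with the accuracy in that bin.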

Update: I created a Jupyter notebook containing all code: Exploring distributions.

Balota, D. A., & Yap, M. J. (2011). Moving Beyond the Mean in Studies of Mental Chronometry: The Power of Response Time Distributional Analyses. *Current Directions in Psychological Science, 20(3)*, 160–166. http://doi.org/10.1177/0963721411408885

Luce, R. D. (1986). *Response times: Their role in inferring elementary mental organization* (No. 8). Oxford University Press on Demand.

Heitz, R. P. (2014). The speed-accuracy tradeoff: history, physiology, methodology, and behavior. *Frontiers in Neuroscience*, *8*(June), 1–19. http://doi.org/10.3389/fnins.2014.00150

Speckman, P. L., Rouder, J. N., Morey, R. D., & Pratte, M. S. (2008). Delta Plots and Coherent Distribution Ordering. *The American Statistician, 62(3)*, 262–266. http://doi.org/10.1198/000313008X333493

The post Exploring response time distributions using Python appeared first on Erik Marsja.

The post Best Python libraries for Psychology researchers appeared first on Erik Marsja.

Python is gaining popularity in many fields of science. This means that there are also many applications and libraries specifically for use in Psychological research. For instance, there are packages for collecting data and for analysing brain imaging data. In this post, I have collected some useful Python packages for researchers within the fields of Psychology and Neuroscience. I have used and tested some of them, but others I have yet to try.

Expyriment is a Python library that makes the programming of Psychology experiments a lot easier than using plain Python. It contains classes and methods for creating fixation crosses and visual stimuli, collecting responses, etc. (see my video how-to: Expyriment Tutorial: Creating a Flanker Task using Python on Youtube if you want to learn more).

Modular Psychophysics is a collection of tools that aims to implement a modular approach to Psychophysics. It enables us to write experiments in different languages. As far as I understand, you can use both MATLAB and R to control your experiments. That is, the timeline of the experiment can be carried out in another language (e.g., MATLAB).

However, it seems like the experiments are created using Python. Your experiments can be run both locally and over networks. I have yet to test this out.

OpenSesame is a Python application. Using OpenSesame one can create Psychology experiments. It has a graphical user interface (GUI) that allows the user to drag and drop objects on a timeline. More advanced experimental designs can be implemented using inline Python scripts (for a tutorial on how to use OpenSesame see How to Create a Flanker Task).

PsychoPy is also a Python application for creating Psychology experiments. It comes packed with a GUI but the API can also be used for writing Python scripts. I have written a bit more thoroughly about PsychoPy: PsychoPy.

I have written more extensively on Expyriment, PsychoPy, OpenSesame, and some other libraries for creating experiments in my post Python apps and libraries for creating experiments.

PsyUtils “The psyutils package is a collection of utility functions useful for generating visual stimuli and analysing the results of psychophysical experiments. It is a work in progress, and changes as I work. It includes various helper functions for dealing with images, including creating and applying various filters in the frequency domain.”

Psignifit is a toolbox that allows you to fit psychometric functions. Further, hypotheses about psychometric data can be tested. Psignifit allows for full Bayesian analysis of psychometric functions, including Bayesian model selection and goodness-of-fit evaluation, among other great things.

Pygaze is a Python library for eye-tracking data & experiments. It works as a wrapper around many other Python packages (e.g., PsychoPy, Tobii SDK). Pygaze can also, through plugins, be used from within OpenSesame.

General Recognition Theory (GRT) is a fork of a MATLAB toolbox. GRT is “a multi-dimensional version of signal detection theory” (see link for more information).

MNE is a library designed for processing electroencephalography (EEG) and magnetoencephalography (MEG) data. Collected data can be preprocessed and denoised. Time-frequency analysis and statistical testing can be carried out. MNE can also be used to apply some machine learning algorithms. Although it is mainly focused on EEG and MEG data, some of the statistical tests in this library can probably be used to analyse behavioural data (e.g., ANOVA).

Kabuki is a Python library for effortless creation of hierarchical Bayesian models. It uses the library PyMC. Using Kabuki you will get formatted summary statistics, posterior plots, and much more. There is also a function to generate data from a formulated model. It seems that there is an intention to add more commonly used statistical tests (e.g., Bayesian ANOVA) in the future!

NIPY: “Welcome to NIPY. We are a community of practice devoted to the use of the Python programming language in the analysis of neuroimaging data”. Here different packages for brain imaging data can be found.

Although many of the above libraries can probably be used within other research fields, there are also libraries for pure statistics and visualisation.

PyMVPA is a Python library for MultiVariate Pattern Analysis. It enables statistical learning analyses of big data.

Pandas is a Python library for fast, flexible and expressive data structures. Researchers and analysts with an R background will find Pandas data frame objects very similar to R's. Data can be manipulated and summarised, and some descriptive analysis can be carried out (e.g., see Descriptive Statistics Using Python for some examples using Pandas).

Statsmodels is a Python library for data exploration, estimation of statistical models, and statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator. Among many other methods, regression, generalized linear models, and non-parametric tests can be carried out using statsmodels.

Pyvttbl is a library for creating Pivot tables. One can further process data and carry out statistical computations using Pyvttbl. Sadly, it seems like it is not updated anymore and is not compatible with other packages (e.g., Pandas). If you are interested in how to carry out repeated measures ANOVAs in Python, this is a package that enables this kind of analysis (e.g., see Repeated Measures ANOVA using Python and Two-way ANOVA for Repeated Measures using Python).

There are many Python libraries for visualisation of data. Below are the ones I have worked with. Note, pandas and statsmodels also provide methods for plotting data. All three libraries are compatible with Pandas, which makes data manipulation and visualisation very easy.

Matplotlib is a package for creating two-dimensional plots.

Seaborn is a library based on Matplotlib. Using seaborn you can create ready-to-publish graphics (e.g., see the Figure above for a boxplot on some response time data). I have also used Seaborn to visualize response time distributions.

Ggplot is a visualisation library based on the R package Ggplot2. That is, if you are familiar with R and Ggplot2 transitioning to Python and the package Ggplot will be easy.

Many of the libraries for analysis and visualisation can be installed separately and, more or less, individually. I do, however, recommend that you install a scientific Python distribution. This way you will get all you need (and much more) with one click (e.g., Pandas, Matplotlib, NumPy, Statsmodels, Seaborn). I suggest you have a look at the distributions Anaconda or Python(x, y). Note that installing Python(x, y) will give you the Python IDE Spyder.

The last Python package for Psychology I am going to list is PsychoPy_ext. Although PsychoPy_ext may be considered a library for building experiments, it seems like the general aim behind the package is to make reproducible research easier. That is, analysis and plotting of the experiments can be carried out. I also think it is interesting that there seems to be a way to autorun experiments (a neat function that I know e-prime has, for instance).

That was it. If you happen to know any other Python libraries or applications with a focus on Psychology (e.g., psychophysics) or for statistical methods, please let me know.


The post E-prime how-to: save data to csv-file using InLine scripts appeared first on Erik Marsja.

This guide will assume that you have worked with e-prime before. That is, you should already have a, more or less, ready experiment that you can add the scripts to. In the guide I use a Simon task created in e-prime as an example.

I prefer to let the experimenter choose the name of the data file. Thus, we will start by creating a variable called *experimentId* (you may already know how to do this and can skip to the next part). We start by clicking on “Edit” in the menu and choosing “Experiment…”:

After doing this we get the “Properties:Experiment Object Properties” dialog. From this dialog we click on add:

In the next dialog, “Edit Startup Info Parameter”, we give our variable a name: *experimentId* (we type this in the “Log Name” field). In the “Prompt” field we type in “Experiment ID”. This is what will show up when the experiment is started. We go on and change the Data Type to “String” and put “simonTask” as Default. This is what will be put in by default but can be changed by the experimenter.

As previously mentioned, I assume that you know how to create an experiment in e-prime, but I will briefly mention how to create an InLine script. Drag the object InLine from the “E-Objects” on the left of the GUI. I chose to put this in the “PracticeSimon” procedure so it is one of the first things that is created when starting an experiment. I typically name the script “fileManagment” or something that makes it clear what the script does.

    ' Sets up a data file
    On Error Resume Next
    MkDir "data_"+c.GetAttrib ("experimentId")

    ' Create the variable save of data type string and set save to current directory
    Dim save As String
    save = CurDir$

    ' Change the directory to data_simonTask (since experimentId is by default simonTask)
    chdir("data_"+c.GetAttrib ("experimentId"))

    ' Create a file called data_simonTask if it does not exist
    If FileExists("data_"+c.GetAttrib ("experimentId")+".csv")=False Then
        Dim fileid As Integer
        fileid=freefile
        open "data_"+c.GetAttrib ("experimentId")+".csv" For output As #fileid
        print #fileid,"SubID;Date;Age;Sex;RT;ACC;CorrectResponse;Response;Arrows;TrialType"
        close
    End If
    chdir(save)

Note, the delimiter in the case above is “;”. That is, this is what tells whatever software you use later where a new column begins. In this example I want to store Subject ID (“SubID”), the date, Age, Sex, Response Time (RT), Accuracy (ACC), and so on.

Now that a data file and folder have been created, we can go on and create an InLine script that saves the data (for each trial). In the Simon task example I put the script (“saveData”) at the end of the “SimonTrial” procedure:

Responses, in the example, are logged in the “simonTarget”-object (an ImageDisplay object). However, many times we want to save more information, such as what kind of trial it currently is. Such data is typically stored in a List object (List3 in the image above, for instance):

Now to the script that saves the data:

    Dim save As String
    Dim fileid As Integer
    save = CurDir$
    chdir("data_"+c.GetAttrib ("experimentId"))
    fileid=freefile
    open "data_"+c.GetAttrib("experimentId")+".csv" For append As #fileid
    ' Column order: SubID;Date;Age;Sex;RT;ACC;CorrectResponse;Response;Arrows;TrialType
    print #fileid,(c.GetAttrib("Subject"));";";Date;";";(c.GetAttrib("Age"));";";(c.GetAttrib("Sex"));";";(c.GetAttrib("simonTarget.RT"));";";(c.GetAttrib("simonTarget.ACC"));";";(c.GetAttrib("correct_resp"));";";(c.GetAttrib("simonTarget.Resp"));";";(c.GetAttrib("display2"));";";(c.GetAttrib("Type_C"))
    close
    chdir(save)
    doevents

Note, what really is important in this script is that the data is stored in the same order as our column names. In the script above we use the function c.GetAttrib(attribute) to get the current data stored in that variable. That is, when we want to get the RT we use c.GetAttrib(“simonTarget.RT”), since this is where the responses are recorded. Startup info and information in the file can be accessed using only c.GetAttrib(). That was quite easy, right?!

There is one caveat: in your ImageDisplay object (i.e., in our case simonTarget) we will have to set the prerelease to 0, or else we will not have anything recorded. This may at times be a problem (e.g., for timing and such). If anyone knows a solution to this, please let me know.
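Since the file uses “;” as delimiter, whatever software reads it later has to be told so. As a small sketch, this is how the file could be read with Pandas in Python. The two data rows are made up for illustration, and the date format is an assumption; the real file will contain whatever e-prime wrote.

```python
import io
import pandas as pd

# Hypothetical file content: the header written by the first script
# followed by two made-up trials.
raw = io.StringIO(
    "SubID;Date;Age;Sex;RT;ACC;CorrectResponse;Response;Arrows;TrialType\n"
    "1;2016-01-01;25;F;432;1;z;z;left;congruent\n"
    "1;2016-01-01;25;F;512;0;m;z;right;incongruent\n"
)

# sep=";" tells pandas that columns are separated by semicolons
data = pd.read_csv(raw, sep=";")
print(data[["SubID", "RT", "ACC", "TrialType"]])
```

With a real file, the `io.StringIO` object would simply be replaced by the path to the csv-file in the data folder.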


The post PsychoPy video tutorials appeared first on Erik Marsja.

I created a playlist with 4 YouTube tutorials. In the first video you will learn how to create a classic psychology experiment: the Stroop task. The second is more into psycholinguistics, and you will learn how to create a language experiment using PsychoPy. In the third you get to know how to create text input fields in PsychoPy. In this tutorial inline Python coding is used, so you will also get to know how you may use programming. In the fourth video you will get acquainted with using video stimuli in PsychoPy.

Recently I found the following playlist on YouTube and it is amazing. In this series of tutorial videos you will learn how to use PsychoPy as a Python package. It starts at a very basic level: importing the visual module to create windows. In this video he uses my favourite Python IDE, Spyder. The videos are actually screencasts from the course *Programming for Psychology & Vision Science*, and the series contains 10 videos. The first 5 videos cover drawing stimuli on the screen (i.e., drawing to a video, gratings, shapes, images, dots). Watching these videos you will also learn how to collect responses, provide input, and save your data.

That was all for now. If you know more good video tutorials for PsychoPy, please leave a comment. Preferably the tutorials should cover coding, but just building experiments with the builder mode is also fine. I may update my playlist with more PsychoPy tutorials (the first playlist).


The post Two-way ANOVA for repeated measures using Python appeared first on Erik Marsja.

    import numpy as np
    import pyvttbl as pt
    from collections import namedtuple

Numpy is going to be used to simulate data. I create a data set in which we have one factor of two levels (P) and a second factor of 3 levels (Q). As in many of my examples the dependent variable is going to be response time (rt) and we create a list of lists for the different population means we are going to assume (i.e., the variable ‘values’). I was a bit lazy when coming up with the data so I named the independent variables ‘iv1’ and ‘iv2’. However, you could think of *iv1* as two different memory tasks; verbal and spatial memory. *Iv2* could be different levels of distractions (no distraction, synthetic sounds, and speech, for instance).

    N = 20
    P = [1,2]
    Q = [1,2,3]

    values = [[998,511], [1119,620], [1300,790]]

    sub_id = [i+1 for i in xrange(N)]*(len(P)*len(Q))
    mus = np.concatenate([np.repeat(value, N) for value in values]).tolist()
    rt = np.random.normal(mus, scale=112.0, size=N*len(P)*len(Q)).tolist()
    iv1 = np.concatenate([np.array([p]*N) for p in P]*len(Q)).tolist()
    iv2 = np.concatenate([np.array([q]*(N*len(P))) for q in Q]).tolist()

    Sub = namedtuple('Sub', ['Sub_id', 'rt','iv1', 'iv2'])
    df = pt.DataFrame()

    for idx in xrange(len(sub_id)):
        df.insert(Sub(sub_id[idx],rt[idx], iv1[idx],iv2[idx])._asdict())

I start with a boxplot using the method boxplot from Pyvttbl. As far as I can see there is not much room for changing the plot around. We get this plot and it is really not that beautiful.

df.box_plot('rt', factors=['iv1', 'iv2'])

Running the Two-Way ANOVA is simple: the first argument is the dependent variable, the second the subject identifier, and then come the within-subject factors. In two previous posts I showed how to carry out one-way and two-way ANOVA for independent measures. One could, of course, combine these techniques to do a split-plot/mixed ANOVA by adding an argument ‘bfactors’ for the between-subject factor(s).

    aov = df.anova('rt', sub='Sub_id', wfactors=['iv1', 'iv2'])
    print(aov)

The output one gets from this is an ANOVA table. In this table all the metrics needed, plus some more, can be found: F-statistic, p-value, mean square errors, confidence intervals, and effect size (i.e., eta-squared) for all factors and the interaction. Some corrected degrees of freedom and mean square errors can also be found (e.g., Greenhouse-Geisser corrected). The output is at the end of the post. It is a bit hard to read. If you know any other way to do a repeated measures ANOVA using Python, please let me know. Also, if you happen to know how to create nicer plots with Pyvttbl, I would like to know how! Please leave a comment.
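One possible alternative, sketched here under assumptions and not part of the original analysis, is the `AnovaRM` class in statsmodels, which works on a Pandas dataframe in long format. The simulation below mirrors the cell means used above but is rebuilt with Pandas and Python 3; the names `cell_means` and `long_df` are made up for this example. As far as I know, `AnovaRM` does not report sphericity corrections, so it is not a full replacement for the Pyvttbl output.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)
n_subjects = 20

# Assumed population mean for each (iv1, iv2) cell, as in the simulation above
cell_means = {(1, 1): 998, (2, 1): 511, (1, 2): 1119,
              (2, 2): 620, (1, 3): 1300, (2, 3): 790}

# One row per subject and cell (long format), every subject in every cell
rows = [
    {"Sub_id": sub, "iv1": iv1, "iv2": iv2, "rt": rng.normal(mu, 112.0)}
    for (iv1, iv2), mu in cell_means.items()
    for sub in range(1, n_subjects + 1)
]
long_df = pd.DataFrame(rows)

# Two within-subject factors, one observation per subject and cell
res = AnovaRM(long_df, depvar="rt", subject="Sub_id",
              within=["iv1", "iv2"]).fit()
print(res.anova_table)
```

The resulting table has one row per effect (both factors and their interaction) with the F-value, degrees of freedom, and p-value.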

**Update** (2017-07-03): If your installed version of Numpy is greater than 1.11.x, you will run into a Float and NoneType error. One quick solution for this is to downgrade Numpy to 1.11.x. I created a post, Step-by-step guide for solving the Pyvttbl Float and NoneType error, in which I show how to install Numpy 1.11.x in a virtual environment. This way, you can run your ANOVAs without having to uninstall Numpy.

rt ~ iv1 * iv2

TESTS OF WITHIN SUBJECTS EFFECTS

Measure: rt

| Source | Correction | Type III SS | eps | df | MS | F | Sig. | et2_G | Obs. | SE | 95% CI | lambda | Obs. Power |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iv1 | Sphericity Assumed | 4419957.211 | - | 1 | 4419957.211 | 324.248 | 2.128e-13 | 3.295 | 60 | 16.096 | 31.548 | 1023.941 | 1 |
| | Greenhouse-Geisser | 4419957.211 | 1 | 1 | 4419957.211 | 324.248 | 2.128e-13 | 3.295 | 60 | 16.096 | 31.548 | 1023.941 | 1 |
| | Huynh-Feldt | 4419957.211 | 1 | 1 | 4419957.211 | 324.248 | 2.128e-13 | 3.295 | 60 | 16.096 | 31.548 | 1023.941 | 1 |
| | Box | 4419957.211 | 1 | 1 | 4419957.211 | 324.248 | 2.128e-13 | 3.295 | 60 | 16.096 | 31.548 | 1023.941 | 1 |
| Error(iv1) | Sphericity Assumed | 258996.722 | - | 19 | 13631.406 | | | | | | | | |
| | Greenhouse-Geisser | 258996.722 | 1 | 19 | 13631.406 | | | | | | | | |
| | Huynh-Feldt | 258996.722 | 1 | 19 | 13631.406 | | | | | | | | |
| | Box | 258996.722 | 1 | 19 | 13631.406 | | | | | | | | |
| iv2 | Sphericity Assumed | 5257766.564 | - | 2 | 2628883.282 | 206.008 | 4.023e-21 | 3.920 | 40 | 18.448 | 36.158 | 433.701 | 1 |
| | Greenhouse-Geisser | 5257766.564 | 0.550 | 1.101 | 4777252.692 | 206.008 | 1.320e-12 | 3.920 | 40 | 18.448 | 36.158 | 433.701 | 1 |
| | Huynh-Feldt | 5257766.564 | 0.550 | 1.101 | 4777252.692 | 206.008 | 1.320e-12 | 3.920 | 40 | 18.448 | 36.158 | 433.701 | 1 |
| | Box | 5257766.564 | 0.500 | 1 | 5257766.564 | 206.008 | 1.192e-11 | 3.920 | 40 | 18.448 | 36.158 | 433.701 | 1 |
| Error(iv2) | Sphericity Assumed | 484921.251 | - | 38 | 12761.086 | | | | | | | | |
| | Greenhouse-Geisser | 484921.251 | 0.550 | 20.911 | 23189.668 | | | | | | | | |
| | Huynh-Feldt | 484921.251 | 0.550 | 20.911 | 23189.668 | | | | | | | | |
| | Box | 484921.251 | 0.500 | 19 | 25522.171 | | | | | | | | |
| iv1 * iv2 | Sphericity Assumed | 1622027.598 | - | 2 | 811013.799 | 83.220 | 1.304e-14 | 1.209 | 20 | 22.799 | 44.687 | 87.600 | 1.000 |
| | Greenhouse-Geisser | 1622027.598 | 0.545 | 1.091 | 1486817.582 | 83.220 | 6.085e-09 | 1.209 | 20 | 22.799 | 44.687 | 87.600 | 1.000 |
| | Huynh-Feldt | 1622027.598 | 0.545 | 1.091 | 1486817.582 | 83.220 | 6.085e-09 | 1.209 | 20 | 22.799 | 44.687 | 87.600 | 1.000 |
| | Box | 1622027.598 | 0.500 | 1 | 1622027.598 | 83.220 | 2.262e-08 | 1.209 | 20 | 22.799 | 44.687 | 87.600 | 1.000 |
| Error(iv1 * iv2) | Sphericity Assumed | 370327.311 | - | 38 | 9745.456 | | | | | | | | |
| | Greenhouse-Geisser | 370327.311 | 0.545 | 20.728 | 17866.175 | | | | | | | | |
| | Huynh-Feldt | 370327.311 | 0.545 | 20.728 | 17866.175 | | | | | | | | |
| | Box | 370327.311 | 0.500 | 19 | 19490.911 | | | | | | | | |

TABLES OF ESTIMATED MARGINAL MEANS

Estimated Marginal Means for iv1

| iv1 | Mean | Std. Error | 95% Lower Bound | 95% Upper Bound |
|---|---|---|---|---|
| 1 | 983.755 | 43.162 | 899.157 | 1068.354 |
| 2 | 599.917 | 21.432 | 557.909 | 641.925 |

Estimated Marginal Means for iv2

| iv2 | Mean | Std. Error | 95% Lower Bound | 95% Upper Bound |
|---|---|---|---|---|
| 1 | 525.025 | 19.324 | 487.150 | 562.899 |
| 2 | 814.197 | 49.416 | 717.342 | 911.053 |
| 3 | 1036.286 | 43.789 | 950.459 | 1122.114 |

Estimated Marginal Means for iv1 * iv2

| iv1 | iv2 | Mean | Std. Error | 95% Lower Bound | 95% Upper Bound |
|---|---|---|---|---|---|
| 1 | 1 | 553.522 | 24.212 | 506.066 | 600.978 |
| 1 | 2 | 1103.488 | 28.411 | 1047.804 | 1159.173 |
| 1 | 3 | 1294.256 | 19.773 | 1255.501 | 1333.011 |
| 2 | 1 | 496.528 | 29.346 | 439.009 | 554.047 |
| 2 | 2 | 524.906 | 20.207 | 485.301 | 564.512 |
| 2 | 3 | 778.317 | 21.815 | 735.560 | 821.073 |


The post Three ways to do a two-way ANOVA with Python appeared first on Erik Marsja.

An important advantage of the two-way ANOVA is that it is more efficient than the one-way ANOVA. There are two assignable sources of variation (supp and dose in our example), and this helps to reduce error variation, thereby making the design more efficient. Two-way ANOVA (factorial) can be used to, for instance, compare the means of populations that are different in two ways. It can also be used to analyse the mean responses in an experiment with two factors. Unlike One-Way ANOVA, it enables us to test the effect of two factors at the same time. One can also test for independence of the factors, provided there is more than one observation in each cell. The only restriction is that the number of observations in each cell has to be equal (there is no such restriction in the case of one-way ANOVA).

We discussed linear models earlier – and ANOVA is indeed a kind of linear model – the difference being that ANOVA is where you have discrete factors whose effect on a continuous (variable) result you want to understand.

    import pandas as pd
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm
    from statsmodels.graphics.factorplots import interaction_plot
    import matplotlib.pyplot as plt
    from scipy import stats

In the code above we import all the Python libraries and methods needed for the first two methods (calculating by hand with Python and using Statsmodels). In the third and last method for doing ANOVA in Python we are going to use Pyvttbl. As in the previous post on one-way ANOVA using Python, we will use a data set that is available in R but can be downloaded here: ToothGrowth Data. Pandas is used to create a dataframe that is easy to manipulate.

    datafile = "ToothGrowth.csv"
    data = pd.read_csv(datafile)

It can be good to explore data before continuing with the inferential statistics. statsmodels has methods for visualising factorial data. We are going to use the method *interaction_plot*.

fig = interaction_plot(data.dose, data.supp, data.len, colors=['red','blue'], markers=['D','^'], ms=10)

The calculation of the sum of squares (the variance in the data) is quite simple using Python. First we start with getting the sample size (N) and the degrees of freedom needed. We will use them later to calculate the mean squares. Once we have the degrees of freedom, we continue with the calculation of the sums of squares.

    N = len(data.len)
    df_a = len(data.supp.unique()) - 1
    df_b = len(data.dose.unique()) - 1
    df_axb = df_a*df_b
    df_w = N - (len(data.supp.unique())*len(data.dose.unique()))

For the calculation of the sum of squares A, B and Total we will need to have the grand mean. Using Pandas DataFrame method mean on the dependent variable only will give us the grand mean:

grand_mean = data['len'].mean()

The grand mean is simply the mean of all scores of len.

We start with calculation of Sum of Squares for the factor A (supp).

ssq_a = sum([(data[data.supp ==l].len.mean()-grand_mean)**2 for l in data.supp])

Calculation of the second Sum of Square, B (dose), is pretty much the same but over the levels of that factor.

ssq_b = sum([(data[data.dose ==l].len.mean()-grand_mean)**2 for l in data.dose])

ssq_t = sum((data.len - grand_mean)**2)

Finally, we need to calculate the Sum of Squares Within which is sometimes referred to as error or residual.

    vc = data[data.supp == 'VC']
    oj = data[data.supp == 'OJ']
    vc_dose_means = [vc[vc.dose == d].len.mean() for d in vc.dose]
    oj_dose_means = [oj[oj.dose == d].len.mean() for d in oj.dose]
    ssq_w = sum((oj.len - oj_dose_means)**2) + sum((vc.len - vc_dose_means)**2)

Since we have a two-way design we need to calculate the Sum of Squares for the interaction of A and B.

ssq_axb = ssq_t-ssq_a-ssq_b-ssq_w
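Getting the interaction by subtraction works because, in a balanced design, the total sum of squares splits exactly into A, B, A×B and within. The sketch below checks this on a made-up balanced 2 × 3 data set (the names `toy`, `a`, `b` and `y` are hypothetical) by also computing the interaction directly from the cell, row and column means.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical balanced 2 x 3 design with 5 observations per cell
toy = pd.DataFrame({
    "a": np.repeat(["a1", "a2"], 15),
    "b": np.tile(np.repeat(["b1", "b2", "b3"], 5), 2),
    "y": rng.normal(10, 2, size=30),
})

grand_mean = toy["y"].mean()
# transform("mean") gives every row the mean of its own group
mean_a = toy.groupby("a")["y"].transform("mean")
mean_b = toy.groupby("b")["y"].transform("mean")
mean_cell = toy.groupby(["a", "b"])["y"].transform("mean")

ss_a = ((mean_a - grand_mean) ** 2).sum()
ss_b = ((mean_b - grand_mean) ** 2).sum()
ss_w = ((toy["y"] - mean_cell) ** 2).sum()
ss_t = ((toy["y"] - grand_mean) ** 2).sum()

# Interaction by subtraction (as in the text) and computed directly
ss_axb = ss_t - ss_a - ss_b - ss_w
ss_axb_direct = ((mean_cell - mean_a - mean_b + grand_mean) ** 2).sum()
print(ss_axb, ss_axb_direct)
```

The two numbers agree up to floating point error. With unequal cell sizes they generally would not, which is one reason the equal-n restriction matters for this hand calculation.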

We continue with the calculation of the mean square for each factor, the interaction of the factors, and within.

ms_a = ssq_a/df_a

ms_b = ssq_b/df_b

ms_axb = ssq_axb/df_axb

ms_w = ssq_w/df_w

The *F*-statistic is simply the mean square for each effect and the interaction divided by the mean square for within (error/residual).

    f_a = ms_a/ms_w
    f_b = ms_b/ms_w
    f_axb = ms_axb/ms_w

We can use the scipy.stats method *f.sf* to get the *p*-value for each of our obtained *F*-ratios, that is, to check whether they are above the critical value. To do that we need the *F*-value for each effect and the interaction, as well as their degrees of freedom and the degrees of freedom within.

    p_a = stats.f.sf(f_a, df_a, df_w)
    p_b = stats.f.sf(f_b, df_b, df_w)
    p_axb = stats.f.sf(f_axb, df_axb, df_w)
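As a small worked example of what *f.sf* does, the sketch below plugs in the values reported for supp in the ANOVA tables of this post (*F* = 15.572 with 1 and 54 degrees of freedom). The comparison against a critical value with `stats.f.ppf` and the alpha level of .05 are additions for illustration.

```python
from scipy import stats

# Values reported for supp in this post: F = 15.572 with 1 and 54 df
f_value, df_effect, df_within = 15.572, 1, 54

# Survival function: probability of an F at least this large under H0
p_value = stats.f.sf(f_value, df_effect, df_within)

# Critical F for alpha = .05; the effect is significant if f_value exceeds it
f_crit = stats.f.ppf(0.95, df_effect, df_within)
print(p_value, f_crit, f_value > f_crit)
```

Checking that the *p*-value is below .05 and checking that *F* exceeds the critical value are two ways of making the same decision.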

The results are, right now, stored in a lot of variables. To obtain a more readable result we can create a DataFrame that will contain our ANOVA table.

    results = {'sum_sq': [ssq_a, ssq_b, ssq_axb, ssq_w],
               'df': [df_a, df_b, df_axb, df_w],
               'F': [f_a, f_b, f_axb, 'NaN'],
               'PR(>F)': [p_a, p_b, p_axb, 'NaN']}
    columns = ['sum_sq', 'df', 'F', 'PR(>F)']
    aov_table1 = pd.DataFrame(results, columns=columns,
                              index=['supp', 'dose', 'supp:dose', 'Residual'])

As psychologists, most of the journals we publish in require us to report effect sizes. Common software, such as SPSS, gives eta squared as output. However, eta squared overestimates the effect. To get a less biased effect size measure we can use omega squared. The following two functions add eta squared and omega squared to the above DataFrame that contains the ANOVA table.

    def eta_squared(aov):
        aov['eta_sq'] = 'NaN'
        aov['eta_sq'] = aov[:-1]['sum_sq']/sum(aov['sum_sq'])
        return aov

    def omega_squared(aov):
        # The last row holds the residual (error) term
        mse = aov['sum_sq'].iloc[-1]/aov['df'].iloc[-1]
        aov['omega_sq'] = 'NaN'
        aov['omega_sq'] = (aov[:-1]['sum_sq']-(aov[:-1]['df']*mse))/(sum(aov['sum_sq'])+mse)
        return aov

    eta_squared(aov_table1)
    omega_squared(aov_table1)
    print(aov_table1)

| | sum_sq | df | F | PR(>F) | eta_sq | omega_sq |
|---|---|---|---|---|---|---|
| supp | 205.350000 | 1 | 15.572 | 0.000231183 | 0.059484 | 0.055452 |
| dose | 2426.434333 | 2 | 92 | 0.000231183 | 0.702864 | 0.692579 |
| supp:dose | 108.319000 | 2 | 4.10699 | 0.0218603 | 0.031377 | 0.023647 |
| Residual | 712.106000 | 54 | | | | |
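To see where the effect sizes come from, here is the same arithmetic done by hand. It is only a sketch that plugs in the sums of squares and degrees of freedom from the ANOVA table above; the dictionary names are made up for this example.

```python
# Sums of squares and degrees of freedom from the ANOVA table above
ss = {"supp": 205.350000, "dose": 2426.434333,
      "supp:dose": 108.319000, "Residual": 712.106000}
df = {"supp": 1, "dose": 2, "supp:dose": 2, "Residual": 54}

ss_total = sum(ss.values())
ms_error = ss["Residual"] / df["Residual"]

effects = {}
for effect in ("supp", "dose", "supp:dose"):
    # Eta squared: share of the total variability tied to the effect
    eta_sq = ss[effect] / ss_total
    # Omega squared: the same idea, corrected using the error mean square
    omega_sq = (ss[effect] - df[effect] * ms_error) / (ss_total + ms_error)
    effects[effect] = (eta_sq, omega_sq)
    print(effect, round(eta_sq, 6), round(omega_sq, 6))
```

For every effect omega squared comes out a little smaller than eta squared, which is the correction for the overestimation mentioned earlier.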

There is, of course, a much easier way to do Two-way ANOVA with Python. We can use Statsmodels, which has a model notation similar to that of many R functions (e.g., lm). We start with the formulation of the model:

    formula = 'len ~ C(supp) + C(dose) + C(supp):C(dose)'
    model = ols(formula, data).fit()
    aov_table = anova_lm(model, typ=2)

Statsmodels does not calculate effect sizes for us. My functions above can, again, be used and will add omega and eta squared effect sizes to the ANOVA table. Actually, I created these two functions to enable calculation of omega and eta squared effect sizes on the output of Statsmodels anova_lm method.

    eta_squared(aov_table)
    omega_squared(aov_table)
    print(aov_table)

| | sum_sq | df | F | PR(>F) | eta_sq | omega_sq |
|---|---|---|---|---|---|---|
| C(supp) | 205.350000 | 1 | 15.571979 | 2.311828e-04 | 0.059484 | 0.055452 |
| C(dose) | 2426.434333 | 2 | 91.999965 | 4.046291e-18 | 0.702864 | 0.692579 |
| C(supp):C(dose) | 108.319000 | 2 | 4.106991 | 2.186027e-02 | 0.031377 | 0.023647 |
| Residual | 712.106000 | 54 | | | | |

What is neat with using statsmodels is that we can also do some diagnostics. It is, for instance, very easy to take our model fit (the linear model fitted with the OLS method) and get a Quantile-Quantile (QQplot):

    import statsmodels.api as sm

    res = model.resid
    fig = sm.qqplot(res, line='s')
    plt.show()

The third way to do ANOVA in Python is using the library pyvttbl. Pyvttbl has its own class (also called DataFrame) to create data frames.

    from pyvttbl import DataFrame

    df = DataFrame()
    df.read_tbl(datafile)
    df['id'] = xrange(len(df['len']))

    print(df.anova('len', sub='id', bfactors=['supp', 'dose']))

The ANOVA tables of Pyvttbl contain a lot more information than those of statsmodels. The Pyvttbl output even contains an effect size measure, the generalized eta squared (η²_G in the table below).

| Source | Type III Sum of Squares | df | MS | F | Sig. | η²_G | Obs. | SE of x̄ | ±95% CI | λ | Obs. Power |
|---|---|---|---|---|---|---|---|---|---|---|---|
| supp | 205.350 | 1.000 | 205.350 | 15.572 | 0.000 | 0.224 | 30.000 | 0.678 | 1.329 | 8.651 | 0.823 |
| dose | 2426.434 | 2.000 | 1213.217 | 92.000 | 0.000 | 0.773 | 20.000 | 0.831 | 1.628 | 68.148 | 1.000 |
| supp * dose | 108.319 | 2.000 | 54.159 | 4.107 | 0.022 | 0.132 | 10.000 | 1.175 | 2.302 | 1.521 | 0.173 |
| Error | 712.106 | 54.000 | 13.187 | | | | | | | | |
| Total | 3452.209 | 59.000 | | | | | | | | | |


The post Four ways to conduct one-way ANOVAs with Python appeared first on Erik Marsja.


We start with a brief introduction to the theory of ANOVA. If you are more interested in the four methods to carry out one-way ANOVA with Python, click here. ANOVA is a means of comparing the ratio of systematic variance to unsystematic variance in an experimental study. Variance in the ANOVA is partitioned into total variance, variance due to groups, and variance due to individual differences.

The ratio obtained when doing this comparison is known as the *F*-ratio. A one-way ANOVA can be seen as a regression model with a single categorical predictor. This predictor usually has two or more categories. A one-way ANOVA has a single factor with *J* levels. Each level corresponds to one of the groups in the independent measures design. The general form of the model, which is a regression model for a categorical factor with *J* levels, is:
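In common textbook notation (the symbols here are assumptions and may differ from those used elsewhere), such a regression model uses *J* − 1 dummy-coded predictors, where $x_{j,i}$ is 1 if observation $i$ belongs to group $j$ and 0 otherwise:

```latex
y_i = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \dots + b_{J-1} x_{J-1,i} + e_i
```

Here $b_0$ is the mean of the reference group, each $b_j$ is the difference between group $j$ and the reference group, and $e_i$ is the error term.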

There is a more elegant way to parametrize the model. In this way the group means are represented as deviations from the grand mean by grouping their coefficients under a single term. I will not go into detail on this equation:
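A common way to write this parametrization (again a sketch in standard notation, with symbols chosen for this example) is:

```latex
y_{ij} = \mu + \tau_j + \varepsilon_{ij}, \qquad \sum_{j=1}^{J} \tau_j = 0
```

where $\mu$ is the grand mean, $\tau_j$ is the deviation of the mean of group $j$ from the grand mean, and $\varepsilon_{ij}$ is the error for observation $i$ in group $j$. The sum-to-zero constraint applies to equal group sizes.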

As for all parametric tests, the data need to be normally distributed (each group's data should be roughly normally distributed) for the *F*-statistic to be reliable. Each experimental condition should have roughly the same variance (i.e., homogeneity of variance), the observations (e.g., each group) should be independent, and the dependent variable should be measured on, at least, an interval scale.

In the four examples in this tutorial we are going to use the dataset “PlantGrowth” that originally was available in R but can be downloaded using this link: PlantGrowth. In the first three examples we are going to use Pandas DataFrame.

import pandas as pd
datafile = "PlantGrowth.csv"
data = pd.read_csv(datafile)
# Create a boxplot of weight for each group
data.boxplot('weight', by='group', figsize=(12, 8))
ctrl = data['weight'][data.group == 'ctrl']
grps = pd.unique(data.group.values)
# Map each group name to its weights
d_data = {grp: data['weight'][data.group == grp] for grp in grps}
k = len(pd.unique(data.group))  # number of conditions
N = len(data.values)  # conditions times participants
n = data.groupby('group').size().iloc[0]  # participants in each condition

Judging by the boxplot there are differences in the dried weight for the two treatments. However, it is not easy to determine visually whether the treatments differ from the control group.

We start with using SciPy and its method f_oneway from stats.

from scipy import stats
F, p = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2'])

One problem with using SciPy is that, following APA guidelines, we should also report an effect size (e.g., eta squared) as well as the degrees of freedom (DF). The DFs needed for the example data are easily obtained:

DFbetween = k - 1
DFwithin = N - k
DFtotal = N - 1

However, if we want to calculate eta-squared we need to do some more computations. Thus, the next section will deal with how to calculate a one-way ANOVA using the Pandas DataFrame and Python code.

A one-way ANOVA is quite easy to calculate so below I am going to show how to do it. First, we need to calculate the sum of squares between (SSbetween), sum of squares within (SSwithin), and sum of squares total (SStotal).

We start with calculating the Sum of Squares Between. Sum of Squares Between is the variability due to differences between the groups. It is sometimes known as the Sum of Squares of the Model.

SSbetween = (sum(data.groupby('group').sum()['weight']**2)/n) \
    - (data['weight'].sum()**2)/N

Sum of Squares Within is the variability in the data due to differences within the groups (i.e., between individuals in the same group). The calculation of Sum of Squares Within can be carried out according to this formula:

sum_y_squared = sum([value**2 for value in data['weight'].values])
SSwithin = sum_y_squared - sum(data.groupby('group').sum()['weight']**2)/n

Sum of Squares Total will be needed to calculate eta-squared later. This is the total variability in the data.

SStotal = sum_y_squared - (data['weight'].sum()**2)/N
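The three sums of squares are linked: SStotal = SSbetween + SSwithin. A small pure-Python sketch using the definitional (deviation) formulas rather than the computational shortcuts above, on made-up numbers (the group names and values are hypothetical), shows the partition:

```python
# Hypothetical data: three groups with three observations each
groups = {
    "g1": [4.0, 5.0, 6.0],
    "g2": [6.0, 7.0, 8.0],
    "g3": [8.0, 9.0, 10.0],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between: squared deviation of each group mean from the grand mean,
# weighted by the group size
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within: squared deviation of each observation from its own group mean
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

# Total: squared deviation of each observation from the grand mean
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

print(ss_between, ss_within, ss_total)
```

The between and within parts add up to the total, which is why eta-squared (SSbetween/SStotal) can be read as a proportion of variability.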

Mean Square between is the sum of squares between divided by the degrees of freedom between.

MSbetween = SSbetween/DFbetween

Mean Square within is also an easy calculation:

MSwithin = SSwithin/DFwithin

The *F*-ratio is Mean Square between divided by Mean Square within:

F = MSbetween/MSwithin

To reject the null hypothesis we check if the obtained F-value is above the critical value for rejecting the null hypothesis. We could look it up in an F-value table based on the DFwithin and DFbetween. However, there is a method in SciPy for obtaining a p-value.

p = stats.f.sf(F, DFbetween, DFwithin)

Finally, we are also going to calculate effect size. We start with the commonly used eta-squared (*η²*):

eta_sqrd = SSbetween/SStotal

However, eta-squared is somewhat biased because it is based purely on sums of squares from the sample. No adjustment is made for the fact that what we are aiming to do is to estimate the effect size in the population. Thus, we can use the less biased effect size measure omega squared:

om_sqrd = (SSbetween - (DFbetween * MSwithin))/(SStotal + MSwithin)

The results we get from both the SciPy and the above method can be reported according to APA style: *F*(2, 27) = 4.846, *p* = .016, *η²* = .264. If you want to report omega squared: *ω²* = .204.
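As a quick arithmetic check, the reported statistics can be recomputed from the two sums of squares for these data (3.76634 between and 10.49209 within, as in the Statsmodels table further down), with k = 3 groups and N = 30 observations. A minimal sketch; the variable names are chosen for illustration:

```python
# Sums of squares for the PlantGrowth example (values taken from the ANOVA table)
ss_between = 3.76634
ss_within = 10.49209
k, n_total = 3, 30

df_between = k - 1
df_within = n_total - k
ss_total = ss_between + ss_within

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_value = ms_between / ms_within

eta_squared = ss_between / ss_total
omega_squared = (ss_between - df_between * ms_within) / (ss_total + ms_within)

print(round(f_value, 3), round(eta_squared, 3), round(omega_squared, 3))
```

Rounded to three decimals, these reproduce the F-value, eta-squared, and omega squared given above.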

The third method, using Statsmodels, is also easy. We start by using the ordinary least squares method and then the anova_lm method. Also, if you are familiar with R syntax, Statsmodels has a formula API where your model is very intuitively formulated. First, we import the API and the formula API. Second, we use ordinary least squares regression with our data. The object obtained is a fitted model that we later use with the anova_lm method to obtain an ANOVA table.

import statsmodels.api as sm
from statsmodels.formula.api import ols
# Fit the model, then build the ANOVA table (Type II sums of squares)
mod = ols('weight ~ group', data=data).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| group | 3.76634 | 2 | 4.846088 | 0.01591 |
| Residual | 10.49209 | 27 | | |

As can be seen in the ANOVA table, Statsmodels does not provide an effect size. To calculate eta squared we can use the sum of squares from the table:

esq_sm = aov_table['sum_sq'].iloc[0]/(aov_table['sum_sq'].iloc[0]+aov_table['sum_sq'].iloc[1])

We can also use the method anova1way from the Python package pyvttbl. This package also has a DataFrame method. We have to use this method instead of Pandas DataFrame to be able to carry out the one-way ANOVA. Note, Pyvttbl is old and outdated. It requires Numpy to be at most version 1.11.x or else you will run into an error (“unsupported operand type(s) for +: ‘float’ and ‘NoneType’”). This can, of course, be solved by downgrading Numpy (see my solution using a virtual environment: Step-by-step guide for solving the Pyvttbl Float and NoneType error).

from pyvttbl import DataFrame
df = DataFrame()
df.read_tbl(datafile)
aov_pyvttbl = df.anova1way('weight', 'group')
print(aov_pyvttbl)

Anova: Single Factor on weight

SUMMARY
Groups   Count    Sum     Average   Variance
============================================
ctrl        10   50.320     5.032      0.340
trt1        10   46.610     4.661      0.630
trt2        10   55.260     5.526      0.196

O'BRIEN TEST FOR HOMOGENEITY OF VARIANCE
Source of Variation    SS     df    MS       F     P-value   eta^2   Obs. power
===============================================================================
Treatments            0.977    2   0.489   1.593     0.222   0.106        0.306
Error                 8.281   27   0.307
===============================================================================
Total                 9.259   29

ANOVA
Source of Variation     SS     df    MS       F     P-value   eta^2   Obs. power
================================================================================
Treatments             3.766    2   1.883   4.846     0.016   0.264        0.661
Error                 10.492   27   0.389
================================================================================
Total                 14.258   29

POSTHOC MULTIPLE COMPARISONS

Tukey HSD: Table of q-statistics
       ctrl   trt1       trt2
=================================
ctrl   0      1.882 ns   2.506 ns
trt1          0          4.388 *
trt2                     0
=================================
 + p < .10 (q-critical[3, 27] = 3.0301664694)
 * p < .05 (q-critical[3, 27] = 3.50576984879)
** p < .01 (q-critical[3, 27] = 4.49413305084)

As can be seen in the output from the method anova1way, we get a lot more information. Of particular interest here is that we get results from a post-hoc test (i.e., Tukey HSD). Whereas the ANOVA only lets us know that there was a significant effect of treatment, the post-hoc analysis reveals where this effect may be (between which groups).

That is it! In this tutorial you learned 4 methods that let you carry out one-way ANOVAs using Python. There are, of course, other ways to deal with the tests between the groups (e.g., the post-hoc analysis). One could carry out multiple comparisons (e.g., t-tests between each pair of groups; just remember to correct for familywise error!) or planned contrasts. In conclusion, doing ANOVAs in Python is pretty simple.
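To make the familywise-error correction concrete, here is a small sketch of the Bonferroni and Holm adjustments for a set of pairwise p-values. The p-values are invented for illustration and the helper names are hypothetical; in practice the p-values would come from the pairwise tests.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm's step-down adjustment; never less powerful than Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Invented p-values for three pairwise comparisons
raw_p = [0.010, 0.040, 0.030]
p_bonferroni = bonferroni(raw_p)
p_holm = holm(raw_p)
print(p_bonferroni)
print(p_holm)
```

Each adjusted p-value is then compared with the usual alpha level (e.g., .05) instead of the raw one.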


]]>The post Repeated measures ANOVA using Python appeared first on Erik Marsja.

]]>There are at least two advantages of using a within-subjects design. First, more information is obtained from each subject in a within-subjects design compared to a between-subjects design. Each subject is measured in all conditions, whereas in the between-subjects design, each subject is typically measured in one or more but not all conditions. A within-subjects design thus requires fewer subjects to obtain a certain level of statistical power. In situations where it is costly to find subjects this kind of design is clearly better than a between-subjects design. Second, the variability in individual differences between subjects is removed from the error term. That is, each subject is his or her own control and extraneous error variance is reduced.

pyvttbl can be installed using pip:

pip install pyvttbl

If you are using Linux you may need to add ‘sudo’ before the pip command. This method installs pyvttbl and, hopefully, any missing dependencies.

I continue with simulating a response time data set. If you want to analyse your own data set, you can use the method “read_tbl” to load your data from a CSV file.

from numpy.random import normal
import pyvttbl as pt
from collections import namedtuple
N = 40  # number of subjects
P = ["noise", "quiet"]  # within-subject conditions
rts = [998, 511]  # mean response time for each condition
mus = rts * N
Sub = namedtuple('Sub', ['Sub_id', 'rt', 'condition'])
df = pt.DataFrame()
# Insert one row per subject and condition
for subid in range(0, N):
    for i, condition in enumerate(P):
        df.insert(Sub(subid + 1, normal(mus[i], scale=112., size=1)[0], condition)._asdict())

Conducting the repeated measures ANOVA with pyvttbl is pretty straightforward. You just take the pyvttbl DataFrame object and use the method anova. The first argument is your dependent variable (e.g., response time), and you specify the column in which the subject IDs are (e.g., sub=’Sub_id’). Finally, you add your within-subject factor(s) (i.e., wfactors). wfactors takes a list of column names containing your within-subject factors. In my simulated data there is only one (i.e., ‘condition’). Note, if your Numpy version is greater than 1.11.x you will have to install an older version. A good way to do this is to run Pyvttbl within a virtual environment (see Step-by-step guide for solving the Pyvttbl Float and NoneType error for a detailed solution both for Linux and Windows users).

aov = df.anova('rt', sub='Sub_id', wfactors=['condition'])
print(aov)

| Source | | Type III Sum of Squares | ε | df | MS | F | Sig. | η^{2}_{G} | Obs. | SE of x̄ | ±95% CI | λ | Obs. Power |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| condition | Sphericity Assumed | 4209536.428 | – | 1.000 | 4209536.428 | 309.093 | 0.000 | 4.165 | 40.000 | 19.042 | 37.323 | 317.019 | 1.000 |
| | Greenhouse-Geisser | 4209536.428 | 1.000 | 1.000 | 4209536.428 | 309.093 | 0.000 | 4.165 | 40.000 | 19.042 | 37.323 | 317.019 | 1.000 |
| | Huynh-Feldt | 4209536.428 | 1.000 | 1.000 | 4209536.428 | 309.093 | 0.000 | 4.165 | 40.000 | 19.042 | 37.323 | 317.019 | 1.000 |
| | Box | 4209536.428 | 1.000 | 1.000 | 4209536.428 | 309.093 | 0.000 | 4.165 | 40.000 | 19.042 | 37.323 | 317.019 | 1.000 |
| Error(condition) | Sphericity Assumed | 531140.646 | – | 39.000 | 13618.991 | | | | | | | | |
| | Greenhouse-Geisser | 531140.646 | 1.000 | 39.000 | 13618.991 | | | | | | | | |
| | Huynh-Feldt | 531140.646 | 1.000 | 39.000 | 13618.991 | | | | | | | | |
| | Box | 531140.646 | 1.000 | 39.000 | 13618.991 | | | | | | | | |

As can be seen in the output table, the Sum of Squares used is **Type III**, which is what common statistical software uses when calculating the ANOVA *F*-statistic (e.g., SPSS or R packages such as ‘afex’ or ‘ez’). The table further contains corrections in case our data violate the assumption of sphericity (which, with only 2 levels of the within-subject factor, as in the simulated data, is nothing to worry about). As you can see, we also get generalized eta squared as an effect size measure and 95% confidence intervals. It is stated in the docstring for the class Anova that standard errors and 95% confidence intervals are calculated according to Loftus and Masson (1994). Furthermore, generalized eta squared allows comparability across between-subjects and within-subjects designs (see Olejnik & Algina, 2003).

Conveniently, if you ever want to transform your data you can add the argument transform. There are several options here: *log* or *log10*, *reciprocal* or *inverse*, *square-root* or *sqrt*, *arcsine* or *arcsin*, and *windsor10*. For instance, if you want to use a log-transformation you just add the argument *transform*=’log’ (any of the previously mentioned methods can be used as arguments in string form):
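To see where the error term and generalized eta squared come from in a one-way within-subjects design, the sums of squares can be computed by hand. A minimal pure-Python sketch on made-up response times (the data and variable names are hypothetical, and this is not pyvttbl's implementation): subject variability is removed from the error term, and generalized eta squared for this design is taken as SS_condition / (SS_condition + SS_subjects + SS_error).

```python
# Hypothetical response times: one row per subject, one column per condition
scores = [
    [10.0, 7.0],
    [12.0, 8.0],
    [14.0, 11.0],
    [16.0, 10.0],
]
n_subjects = len(scores)
n_conditions = len(scores[0])

grand_mean = sum(sum(row) for row in scores) / (n_subjects * n_conditions)
subject_means = [sum(row) / n_conditions for row in scores]
condition_means = [
    sum(row[j] for row in scores) / n_subjects for j in range(n_conditions)
]

ss_condition = n_subjects * sum((m - grand_mean) ** 2 for m in condition_means)
ss_subjects = n_conditions * sum((m - grand_mean) ** 2 for m in subject_means)
ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
# Error is what remains after removing the condition and subject effects
ss_error = ss_total - ss_condition - ss_subjects

df_condition = n_conditions - 1
df_error = (n_conditions - 1) * (n_subjects - 1)
f_value = (ss_condition / df_condition) / (ss_error / df_error)

eta_sq_generalized = ss_condition / (ss_condition + ss_subjects + ss_error)
print(f_value, eta_sq_generalized)
```

Because the subject sum of squares is taken out of the error term, the F-ratio here is much larger than it would be if the same numbers came from two independent groups.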

aovlog = df.anova('rt', sub='Sub_id', wfactors=['condition'], transform='log')

Using pyvttbl we can also analyse mixed-design/split-plot (within-between) data. Doing a split-plot is easy; just add the argument “*bfactors*=” and a list of your between-subject factors. If you are interested in one-way ANOVA for independent measures see my newer post: Four ways to conduct one-way ANOVAS with Python.

Finally, I created a function that extracts the F-statistics, Mean Square Error, generalized eta squared, and the p-value from the results obtained with the anova method. It takes a factor as a string, an ANOVA object, and the values you want to extract. Keys for your different factors can be found using the keys method (e.g., *aov.keys()*).

def extract_for_apa(factor, aov, values=('F', 'mse', 'eta', 'p')):
    # Collect the requested statistics for one factor from the ANOVA results
    results = {}
    for key, result in aov[(factor,)].items():
        if key in values:
            results[key] = result
    return results
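Values like those returned by `extract_for_apa` can then be turned into a reporting string. A small sketch of a hypothetical helper (`format_apa` is not part of pyvttbl; the example numbers are the one-way results reported for the PlantGrowth data):

```python
def format_apa(f_value, df_effect, df_error, p_value, eta):
    """Format ANOVA results as an APA-style string (hypothetical helper)."""
    def no_leading_zero(x):
        # APA style drops the leading zero for values that cannot exceed 1
        return "{:.3f}".format(x).replace("0.", ".", 1)

    if p_value < 0.001:
        p_text = "p < .001"
    else:
        p_text = "p = " + no_leading_zero(p_value)
    return "F({}, {}) = {:.2f}, {}, eta squared = {}".format(
        df_effect, df_error, f_value, p_text, no_leading_zero(eta)
    )


report = format_apa(4.846, 2, 27, 0.01591, 0.264)
print(report)
```

The degrees of freedom are passed in separately here because the sketch makes no assumption about which keys the ANOVA object exposes for them.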

Note, the table with the results in this post was created with the private method *_within_html*. To create an HTML table you will have to import SimpleHTML:

import SimpleHTML
output = SimpleHTML.SimpleHTML('Title of your HTML-table')
aov._within_html(output)
output.write('results_aov.html')

That was all. There is at least one downside to using pyvttbl for within-subjects analysis (ANOVA) in Python: Pyvttbl is not compatible with the Pandas DataFrame, which is commonly used. However, this may not be a problem since pyvttbl, as we have seen, has its own DataFrame method. There are also some ways to aggregate and visualize data using Pyvttbl. Another downside is that Pyvttbl no longer seems to be maintained.

Loftus, G. R., & Masson, M. E. (1994). Using confidence intervals in within-subjects designs. Psychonomic Bulletin & Review, 1(4), 476–490.

Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8(4), 434–447. http://doi.org/10.1037/1082-989X.8.4.434


]]>The post Descriptive Statistics using Python appeared first on Erik Marsja.

]]>After data collection, most **Psychology researchers** summarise the data in different ways. In this tutorial we will learn how to do **descriptive statistics** in **Python**. Python, being a programming language, gives us many ways to carry out descriptive statistics.

One useful library for data manipulation and summary statistics is Pandas. Actually, Pandas offers an API similar to R's. I think that the dataframe in R is very intuitive to use, and Pandas offers a DataFrame method similar to R's. Also, many Psychology researchers may have experience with R.

Thus, in this tutorial you will learn how to do descriptive statistics using Pandas, but also using NumPy and SciPy. We start with using Pandas for obtaining summary statistics and some variance measures. After that we continue with the central tendency measures (e.g., mean and median) using Pandas and NumPy. The harmonic, geometric, and trimmed mean cannot be calculated using Pandas or NumPy. For these measures of central tendency we will use SciPy. Towards the end we learn how to get some measures of variability (e.g., variance using Pandas).

import numpy as np
from pandas import DataFrame as df
from scipy.stats import trim_mean, kurtosis
from scipy.stats.mstats import mode, gmean, hmean

Many times in **experimental psychology** response time is the dependent variable. I am going to simulate an experiment in which the dependent variable is response time to some arbitrary targets. The simulated data will, further, have two independent variables (IVs; “iv1” has 2 levels and “iv2” has 3 levels). The data are simulated at the same time as a dataframe is created, and the first descriptive statistics are obtained using the method *describe*.

N = 20
P = ["noise", "quiet"]
Q = [1, 2, 3]
values = [[998, 511], [1119, 620], [1300, 790]]
# Mean response time for every observation
mus = np.concatenate([np.repeat(value, N) for value in values])
data = df(data={
    'id': [subid for subid in range(N)] * (len(P) * len(Q)),
    'iv1': np.concatenate([np.array([p] * N) for p in P] * len(Q)),
    'iv2': np.concatenate([np.array([q] * (N * len(P))) for q in Q]),
    'rt': np.random.normal(mus, scale=112.0, size=N * len(P) * len(Q))})

data.describe()

Pandas will output summary statistics by using this method. Output is a table, as you can see below.

Typically, a researcher is interested in the descriptive statistics of the IVs. Therefore, I group the data by these. Using describe on the grouped data gives aggregated statistics for each level of each IV. As can be seen from the output, it is somewhat hard to read. Note, the method *unstack* is used to get the mean, standard deviation (std), etc. as columns, which makes it somewhat easier to read.

grouped_data = data.groupby(['iv1', 'iv2'])
grouped_data['rt'].describe().unstack()

Often we want to know something about the “*average*” or “*middle*” of our data. Using Pandas and NumPy, the two most commonly used measures of central tendency can be obtained: the mean and the median. The mode can also be obtained using Pandas, but I will use methods from SciPy for the mode and the trimmed mean.

There are at least two ways of doing this using our grouped data. First, Pandas has the method mean:

grouped_data['rt'].mean().reset_index()

But the method *aggregate* in combination with NumPy's mean can also be used:

grouped_data['rt'].aggregate(np.mean).reset_index()

Both methods will give the same output, but the aggregate method has some advantages that I will explain later.

Sometimes the *geometric* or *harmonic* mean can be of interest. These two descriptive statistics can be obtained using the method apply with the methods *gmean* and *hmean* (from SciPy) as arguments. That is, there is no method in Pandas or NumPy that enables us to calculate geometric and harmonic means.

grouped_data['rt'].apply(gmean, axis=None).reset_index()

grouped_data['rt'].apply(hmean, axis=None).reset_index()
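For a quick check without SciPy, the standard library's `statistics` module (Python 3.8 or later for `geometric_mean`) computes the same quantities on a plain list. The response times below are invented for illustration:

```python
import statistics

# Invented response times (ms)
rts = [400.0, 500.0, 625.0, 800.0]

arithmetic = statistics.mean(rts)
geometric = statistics.geometric_mean(rts)  # n-th root of the product
harmonic = statistics.harmonic_mean(rts)    # n divided by the sum of reciprocals

print(arithmetic, geometric, harmonic)
```

For positive data the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean.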

Trimmed means are, at times, used. Pandas and NumPy do not seem to have methods for obtaining the *trimmed mean*. However, we can use the method *trim_mean* from SciPy. By using apply on our grouped data we can use the function (‘trim_mean’) with an argument that removes 10% of the largest and 10% of the smallest values.

trimmed_mean = grouped_data['rt'].apply(trim_mean, .1)
trimmed_mean.reset_index()
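What the second argument does can be seen in a hand-rolled version. A sketch assuming the rule that SciPy documents for `trim_mean` (cut `int(proportion * n)` observations from each end of the sorted data); the function name and data here are hypothetical:

```python
def trim_mean_by_hand(values, proportion):
    """Mean after dropping a proportion of the sorted values from each tail."""
    ordered = sorted(values)
    n_cut = int(proportion * len(ordered))  # observations removed per tail
    kept = ordered[n_cut:len(ordered) - n_cut]
    return sum(kept) / len(kept)


# Invented response times with one extreme value
rts = [480, 495, 500, 505, 510, 515, 520, 530, 540, 2000]
plain_mean = sum(rts) / len(rts)
robust_mean = trim_mean_by_hand(rts, 0.1)
print(plain_mean, robust_mean)
```

With ten observations and a proportion of .1, one value is dropped from each tail, so the single extreme response time no longer pulls the mean upwards.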

Output from the mean values above (trimmed, harmonic, and geometric means):

The *median* can also be obtained using two methods:

grouped_data['rt'].median().reset_index()
grouped_data['rt'].aggregate(np.median).reset_index()

There is a method (i.e., pandas.DataFrame.mode()) for getting the mode for a DataFrame object. However, it cannot be used on the grouped data so I will use mode from SciPy:

grouped_data['rt'].apply(mode, axis=None).reset_index()

Most of the time I would probably want to see all measures of central tendency at the same time. Luckily, aggregate enables us to use many NumPy and SciPy methods. In the example below the median, standard deviation (*std*), mean, and trimmed mean are all in the same output. Note that we will have to add the trimmed means afterwards.

descr = grouped_data['rt'].aggregate([np.median, np.std, np.mean]).reset_index()
# Add the trimmed means computed earlier as a new column
descr['trimmed_mean'] = trimmed_mean.values
descr

Central tendency (e.g., the mean & median) is not the only type of summary statistic that we want to calculate. We will probably also want to have a look at a measure of the variability of the data.

grouped_data['rt'].std().reset_index()

Note that here the use of unstack() also gets the quantiles as columns, so the output is easier to read.

grouped_data['rt'].quantile([.25, .5, .75]).unstack()

The variance is obtained in the same way:

grouped_data['rt'].var().reset_index()
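One detail worth knowing when comparing outputs: Pandas' `std` and `var` use the sample formulas (dividing by n - 1), whereas NumPy's `np.std` and `np.var` divide by n unless `ddof=1` is passed. The standard library makes the distinction explicit; a small sketch with invented numbers:

```python
import statistics

# Invented data with mean 5 and sum of squared deviations 32
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_var = statistics.variance(values)       # divides by n - 1
population_var = statistics.pvariance(values)  # divides by n

print(sample_var, population_var)
```

The two estimates differ noticeably in small samples and converge as the number of observations grows.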

That is all. Now you know how to obtain some of the most common descriptive statistics using Python. Pandas, NumPy, and SciPy really make these calculations **almost** as easy as doing them in graphical statistical software such as SPSS. One great advantage of the methods apply and aggregate is that we can input other methods or functions to obtain other types of descriptives.

Update: Recently, I learned some methods to explore response times visualizing the distribution of different conditions: Exploring response time distributions using Python.

I am sorry that the images (i.e., the tables) are so ugly. If you happen to know a good way to output tables and figures from Python (something like Knitr & Rmarkdown) please let me know.


]]>