Collecting data with Scrapy

This rather long tutorial (counting the Python code) assumes that you are interested in scraping data off the web using Scrapy. It may also be somewhat Linux-specific.

Scrapy can be installed using pip:
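pip install scrapy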

To create a new Scrapy project, open up a terminal, create a directory where you will store your Scrapy projects, change to that directory, and run the following (change ‘psysci’ to whatever your project will be named):
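scrapy startproject psysci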


Item

We start with the Item class and create an object to store our scraped data in. In my example I scrape the article title, the type of article, the article's abstract, when the article was received and when it was accepted for publication, the year (and month) it was published, and the article's keywords.

Open up items.py in your Scrapy project's directory (i.e., psysci/psysci) using your favourite editor or IDE. The Item class defines one field for each piece of data listed above.
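A minimal sketch of such an Item class, with illustrative field names, could look like this:

import scrapy


class PsysciItem(scrapy.Item):
    # One Field per piece of data we want to store; the field names
    # here are illustrative, not necessarily those of the original project.
    title = scrapy.Field()        # article title
    articletype = scrapy.Field()  # e.g. "Research Report"
    abstract = scrapy.Field()     # abstract text
    received = scrapy.Field()     # date the manuscript was received
    accepted = scrapy.Field()     # date it was accepted for publication
    pubyear = scrapy.Field()      # year of publication
    pubmonth = scrapy.Field()     # month of publication
    keywords = scrapy.Field()     # list of author keywords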

Spider

The spider defines the first URL (http://pss.sagepub.com/content/by/year), how to follow links/pagination, and how to extract and parse the fields defined above. We define the following:

name: the spider’s unique identifier
start_urls: URLs the spider begins crawling at
parse: the method called with the downloaded Response object of each start URL. It parses the first page, where the links to the individual years are located; if you only collect links from a single page, this may be the only parser you need. Here I used a regular expression to pick out the links whose text was 2011 or later (earlier years and volumes did not have the received and accepted dates I wanted). The matching content/by/year links end up in a list and the spider follows each of them in a loop (see the sketch after this list).
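A rough sketch of the spider's skeleton and its first parser, assuming the Item class above; the selectors and the exact year check are illustrative, since the journal's markup may differ:

import re

import scrapy

from psysci.items import PsysciItem  # the Item class sketched above


class PsysciSpider(scrapy.Spider):
    name = "psysci"
    start_urls = ["http://pss.sagepub.com/content/by/year"]

    def parse(self, response):
        # Follow only the links whose text is a year of 2011 or later.
        for link in response.xpath("//a"):
            text = " ".join(link.xpath(".//text()").getall()).strip()
            href = link.xpath("@href").get()
            if href and re.fullmatch(r"\d{4}", text) and int(text) >= 2011:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_year)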

For my purpose, collecting data on articles from the journal's volumes throughout the years, I defined more parsers. I did this since I needed the spider to follow the link for each year and then find the volumes published that year:
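Continuing the sketch, a parser along these lines goes inside the spider class; the XPath is a guess at the structure of the year pages:

    def parse_year(self, response):
        # Each year page links to the volumes/issues published that year.
        for href in response.xpath("//a[contains(@href, '/content/')]/@href").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_volume)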

After getting the volumes, we need to find the links to each abstract. We also get the article type here (e.g., Research Reports). Short Reports do not seem to have openly accessible abstract links; these are not considered, but one might be able to get them with full access to the journal. The article type is stored in the item. Note that numberOfTypes is a method I created to count the number of articles of each type in a volume; I had to do this since not all abstracts contained this information.
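A sketch of this step, again as a method inside the spider class; the class names in the selectors are placeholders for the journal's actual table-of-contents markup:

    def parse_volume(self, response):
        # Count how many articles of each type this volume contains.
        types = response.xpath("//span[@class='article-type']/text()").getall()
        self.numberOfTypes(types)
        for entry in response.xpath("//div[@class='toc-entry']"):
            item = PsysciItem()
            item["articletype"] = entry.xpath(".//span[@class='article-type']/text()").get()
            abstract_href = entry.xpath(".//a[contains(text(), 'Abstract')]/@href").get()
            if abstract_href:  # Short Reports often lack an open abstract link
                yield scrapy.Request(
                    response.urljoin(abstract_href),
                    callback=self.parse_abstract,
                    meta={"item": item},
                )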

Next we need to parse the content of the abstract page to get the year it was published, the keywords, and when it was received and accepted. We store all of this information in the item so that we can save it later on.
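The abstract parser could look roughly like this; the selectors are placeholders for the abstract page's actual markup:

    def parse_abstract(self, response):
        item = response.meta["item"]
        item["title"] = response.xpath("//h1/text()").get()
        item["abstract"] = " ".join(
            response.xpath("//div[@class='abstract']//text()").getall()
        ).strip()
        item["keywords"] = response.xpath("//a[@class='kwd']/text()").getall()
        item["received"] = response.xpath("//span[@class='received']/text()").get()
        item["accepted"] = response.xpath("//span[@class='accepted']/text()").get()
        item["pubyear"] = response.xpath("//meta[@name='citation_date']/@content").get()
        yield item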

Lastly, I created a method for counting how many papers of each article type there are in each volume that the spider collects data from.
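One simple way to implement such a counting method inside the spider, here just logging the counts per volume:

    def numberOfTypes(self, article_types):
        # Tally the article types found in the current volume and log them.
        from collections import Counter
        counts = Counter(article_types)
        self.logger.info("Article types in this volume: %s", dict(counts))
        return counts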

That was the code for the spider. Next are the pipelines, in which the data is handled and saved to a .csv file.

Pipeline
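A pipeline that appends each scraped item as a row in a .csv file can be sketched as follows (in psysci/pipelines.py); the output file name and column order are placeholders:

import csv


class CsvExportPipeline:
    """Write every scraped item as one row in a .csv file."""

    def open_spider(self, spider):
        self.file = open("psysci.csv", "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)
        self.writer.writerow(
            ["title", "articletype", "abstract", "received",
             "accepted", "pubyear", "keywords"]
        )

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow([
            item.get("title"),
            item.get("articletype"),
            item.get("abstract"),
            item.get("received"),
            item.get("accepted"),
            item.get("pubyear"),
            "; ".join(item.get("keywords") or []),
        ])
        return item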

Settings

Finally, you need to add this to the settings file (“settings.py”) to make the above pipeline work:
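Assuming the pipeline class sketched above lives in psysci/pipelines.py under the name CsvExportPipeline, the entry looks like this:

ITEM_PIPELINES = {
    "psysci.pipelines.CsvExportPipeline": 300,
}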

It might be worth considering other settings, such as setting a download delay (DOWNLOAD_DELAY). I will upload this project to GitHub as soon as possible.
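For example, a small delay between requests is polite to the journal's servers; the exact value is up to you:

# In settings.py: wait a couple of seconds between requests.
DOWNLOAD_DELAY = 2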

Bonus: Wordcloud

I end this post with a wordcloud I created from all the keywords collected by the spider. If you have some ideas on what to do with the collected data, let me know.
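A similar wordcloud can be generated with, for example, the third-party wordcloud package; a minimal sketch, assuming the file and column names from the pipeline above:

import csv

from wordcloud import WordCloud

# Gather all keywords from the scraped .csv file.
keywords = []
with open("psysci.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        keywords.extend(kw.strip() for kw in row["keywords"].split(";") if kw.strip())

# Build the wordcloud and save it as an image.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(" ".join(keywords))
wc.to_file("keywords_wordcloud.png")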

