
Collecting data with Scrapy

This rather long tutorial (counting the Python code) assumes that you are interested in scraping data off the web using Scrapy. The instructions were written with Linux in mind.

Scrapy can be installed using pip:

pip install scrapy

To create a new Scrapy project, open up a terminal, create a directory where you will store your Scrapy projects, change to that directory, and run this (change ‘psysci’ to whatever your project will be named):


scrapy startproject psysci

 

Item

We start with the Item class and create an object to store our scraped data in. In my example I scrape the article title, the type of article, the article’s abstract, when the article was received and when it was accepted for publication, which year (and month) it was published, and the article’s keywords.

Open up items.py in your Scrapy project’s directory (i.e., psysci/psysci) using your favourite editor or IDE. This is what my Item class looks like:

from scrapy.item import Item, Field

class PsysciItem(Item):
    # define the fields for your item here like:
    title = Field()
    keywords = Field()
    articletype = Field()
    abstracturl = Field()
    received = Field()
    accepted = Field()
    pubyear = Field()
    abstract = Field()

Spider

The spider defines the first URL (http://pss.sagepub.com/content/by/year), how to follow links/pagination, and how to extract and parse the fields defined above. We define the following:

name: the spider’s unique identifier
start_urls: the URLs the spider begins crawling at
parse: the method that is called with the downloaded Response object of each start URL. Here it parses the first page, where the links to each year are located. I used a regular expression to get the links whose text was 2011 or later (earlier years and volumes did not have the received and accepted dates I wanted). The matched year links are then followed in a loop.
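As a quick sanity check, the year regular expression used in the spider below can be tried out with Python’s re module (note that the pattern 20[1-9][1-9] also skips any later year containing a zero digit, such as 2020, which did not matter for the volumes available at the time of writing):

```python
import re

# Same pattern as in the spider's XPath re:match()
year_pattern = re.compile(r"^20[1-9][1-9]$")

# Hypothetical link texts as they appear on the year index page
links = ["2009", "2010", "2011", "2014", "2020"]
wanted = [link for link in links if year_pattern.match(link)]
print(wanted)  # ['2011', '2014']
```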

from psysci.items import PsysciItem
import urlparse
import scrapy

class PsySciSpider(scrapy.Spider):
    name = "psysci"
    allowed_domains = ["pss.sagepub.com"]
    start_urls = [
        "http://pss.sagepub.com/content/by/year"]

    def parse(self, response):
        #Regexp here is for taking abstracts from 2011 and forward
        years = response.xpath(
              '//a[re:match(text(), "^20[1-9][1-9]$")]/@href').extract()
        
        for year in years:
            year_url = urlparse.urljoin(
                response.url, year)
          
            yield scrapy.Request(year_url,
                                self.parse_volume)
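The urlparse.urljoin call resolves the relative hrefs found on the page against the page’s own URL (in Python 3 the same function lives in urllib.parse):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Resolve an extracted href against the page it was found on
base = "http://pss.sagepub.com/content/by/year"
year_url = urljoin(base, "/content/by/year/2011")
print(year_url)  # http://pss.sagepub.com/content/by/year/2011
```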

For my purpose, collecting data on articles from the journal’s volumes throughout the years, I defined more parsers. I did this since I needed the spider to follow the links for each year and find the volumes published that year:

    def parse_volume(self, response):
        #Regexp matches month and will get volumes 
        volume = response.xpath(
                        '//a[re:match(text(),"[a-zA-Z0-9_]{3,9}\s[0-9]{2}")]/@href').extract()
        
        for vol in volume:
            volume_url = urlparse.urljoin(
                response.url, vol)
          
            yield scrapy.Request(volume_url,
                                self.parse_abstract)
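The volume regex matches a word of three to nine characters followed by whitespace and a two-digit number, i.e., link texts such as “January 12” (assumed examples of how the month listings looked on the year pages):

```python
import re

# Same pattern as in parse_volume's XPath re:match()
vol_pattern = re.compile(r"[a-zA-Z0-9_]{3,9}\s[0-9]{2}")

# Hypothetical link texts on a year page
texts = ["January 12", "May 11", "Current Issue"]
volumes = [t for t in texts if vol_pattern.search(t)]
print(volumes)  # ['January 12', 'May 11']
```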

After getting the volumes we need to find the links to the abstracts. We also get the article type here (e.g., Research Reports). Short Reports do not seem to have openly available abstract links; these are not considered here, but one might be able to get them with full access to the journal. The article type is stored in “items”. Note that numberOfTypes is a method I created to count the number of articles of each type in a volume. I had to do this since not all abstracts contained this information.

    def parse_abstract(self, response):
        items = PsysciItem()
        articles = {}
        self.state['items_count'] = self.state.get('items_count', 0) + 1
        # Find the links to the abstracts
        abstract_links = response.xpath(
                        '//a[@rel="abstract"]/@href').extract()
        article_type = response.xpath(
                '//h2/span/text()').re(r"[a-zA-Z]{5,11}\s{0,1}[a-zA-Z]{0,8}$")
        #In the current version of script I drop the Short Reports since they don't have abstracts...
        article_type= [ x for x in article_type if x != 'Short Reports']
        
        for article in article_type:
            art = article.replace(' ', '')
            xpathString = '//*[@id="content-block"]/form/div[re:match(@class, ".{9}\s.{6}\s.{12}(' + art + ')\s.{0,10}\s{0,1}.{11}\s.{9}$")]/ul/li'
            articles[article] = response.xpath(xpathString + '/div/ul/li/a[@rel="abstract"]').extract()
        

        article_types = self.numberOfTypes(articles, article_type)
        
        for i, abstract in enumerate(abstract_links):
            abstract_url = urlparse.urljoin(
                response.url, abstract)
            items['articletype'] = article_types[i]
            yield scrapy.Request(abstract_url, 
                                 self.parse_content, meta={'item':items})
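The article-type regex pulls the heading texts, after which the Short Reports are filtered out. The same extraction can be tried on its own (the heading texts here are assumed examples):

```python
import re

# Same pattern used with .re() on the //h2/span headings
type_pattern = re.compile(r"[a-zA-Z]{5,11}\s{0,1}[a-zA-Z]{0,8}$")

headings = ["Research Articles", "Research Reports", "Short Reports"]
article_type = [h for h in headings if type_pattern.search(h)]
# Drop Short Reports since they lack open abstract links
article_type = [x for x in article_type if x != 'Short Reports']
print(article_type)  # ['Research Articles', 'Research Reports']
```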

Next we need to parse the content of the abstract to get which year it was published, keywords, when it was received and accepted. We store all information in items so that we can save it later on.

    def parse_content(self, response):
        items = response.meta['item']
        items['abstracturl'] = response.url
        items['pubyear'] = response.xpath(
        './/span[@class="slug-pub-date"]/text()').extract()
        items['title'] = response.xpath(
                    '//*[@id="article-title-1"]/text()').extract()  
        items['keywords'] = response.xpath(
                    './/a[@class="kwd-search"]/text()').extract()    
        items['received'] = response.xpath(
                    './/li[@class="received"]/text()').extract()
                
        items['accepted'] = response.xpath(
                    './/li[@class="accepted"]/text()').extract()
        # Note: some abstracts have id="p-1"... maybe //div[@class="section abstract"]/p/text() will be better
        items['abstract'] = response.xpath('//div[@class="section abstract"]/p/text()').extract()
                
        if len(items['title']) == 2:
            items['title'] = items['title'][0].strip() + items['title'][1].strip()
        elif len(items['title']) == 1:
            items['title'] = items['title'][0].strip()
        return items

Lastly, I created a method for counting how many papers there were of each article type in each volume that the spider collects data from.

    def numberOfTypes(self, art_types, arts):
        '''Returning a list of article types for each article
        that has an abstract-link.
        art_types = dict with article types 
        arts = list with the article types
        '''
        articleTypes = []
        for i in range(len(arts)):
            
            length = len(art_types[arts[i]])
            for nArticles in range(length):
                articleTypes.append(arts[i])
        return articleTypes
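For reference, the same expansion can be written more compactly with a list comprehension (a standalone sketch of the method above, with made-up example data):

```python
def number_of_types(art_types, arts):
    """Repeat each article type once per abstract link found for it."""
    return [art for art in arts for _ in art_types[art]]

# Example: two Research Reports links and one Research Articles link
links = {'Research Reports': ['a1', 'a2'], 'Research Articles': ['a3']}
print(number_of_types(links, ['Research Reports', 'Research Articles']))
# ['Research Reports', 'Research Reports', 'Research Articles']
```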

That was the code for the spider. Next up is the pipeline, in which the data is handled and saved to a .csv file.

Pipeline

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import csv
import os.path

class PsysciPipeline(object):

    def __init__(self):
        # Write the header row only if the file does not exist yet
        write_header = not os.path.isfile('CONTENT_psysci.csv')
        self.csvfile = open('CONTENT_psysci.csv', 'a')
        self.csvwriter = csv.writer(self.csvfile)
        if write_header:
            self.csvwriter.writerow(['Title', 'Received', 'Accepted',
                                     'Pubyear', 'Keywords',
                                     'Articletype', 'Abstract', 'Url'])

        

    def process_item(self, item, spider):        
        self.csvwriter.writerow([item['title'], item['received'], 
                                    item['accepted'], 
                                    item['pubyear'],item['keywords'],
                                    item['articletype'], 
                                        item['abstract'], item['abstracturl']])
        return item
    
    def string_handling(self, item):
        # Join a two-part title and strip whitespace
        if len(item['title']) == 2:
            item['title'] = item['title'][0] + item['title'][1].strip()
        elif item['title']:
            item['title'] = item['title'][0].strip()

        # Strip whitespace after the dates; mark missing fields
        for field in ['received', 'accepted', 'pubyear']:
            if not item[field]:
                item[field] = "Not available"
            else:
                item[field] = item[field][0].strip()
        return item

Settings

Finally, you need to add this to the settings file (“settings.py”) to make the above pipeline work:

ITEM_PIPELINES = {
    'psysci.pipelines.PsysciPipeline': 300,
}

It might be worth considering other settings, such as setting a download delay (DOWNLOAD_DELAY). I will upload this project to GitHub as soon as possible.
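For instance, a small delay between requests is polite to the server; a minimal sketch of what that could look like in settings.py (example values):

```python
# settings.py (example values)
DOWNLOAD_DELAY = 2     # seconds to wait between requests
ROBOTSTXT_OBEY = True  # respect the site's robots.txt
```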

Bonus: Wordcloud

I end this post with a wordcloud I created from all the keywords I collected using the spider. If you have ideas on what to do with the collected data, let me know.

