
Collecting data with Scrapy

This rather long tutorial (counting the Python code) assumes that you are interested in scraping data off the web using Scrapy. The instructions were written with Linux in mind.

Scrapy can be installed using pip:

pip install scrapy

To create a new Scrapy project, open up a terminal, create a directory where you will store your Scrapy projects, change to that directory, and run this (change ‘psysci’ to whatever your project will be named):


scrapy startproject psysci

 

Item

We start with the Item class and create an object to store our scraped data in. In my example I scrape the article title, the type of article, the article’s abstract, when the article was received and when it was accepted for publication, which year (and month) it was published, and the article’s keywords.

Open up items.py in your Scrapy project’s directory (i.e., psysci/psysci) using your favourite editor or IDE. This is what my Item class looks like:

from scrapy.item import Item, Field

class PsysciItem(Item):
    # define the fields for your item here like:
    title = Field()
    keywords = Field()
    articletype = Field()
    abstracturl = Field()
    received = Field()
    accepted = Field()
    pubyear = Field()
    abstract = Field()

Spider

The spider defines the first URL (http://pss.sagepub.com/content/by/year), how to follow links/pagination, and how to extract and parse the fields defined above. We define the following:

name: the spider’s unique identifier
start_urls: the URLs the spider begins crawling at
parse: the method that is called with the downloaded Response object of each start URL. Here it parses the first page, where the links to each year are located. I used a regular expression to get the links whose text was 2011 or later (earlier years and volumes did not have the received and accepted dates I wanted). The matched year links are then followed in a loop.
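As a quick sanity check, the year regular expression used in the spider below can be tried out with Python’s re module (note that the pattern 20[1-9][1-9] also skips any later year containing a zero digit, such as 2020, which did not matter for the volumes available at the time of writing):

```python
import re

# Same pattern as in the spider's XPath re:match()
year_pattern = re.compile(r"^20[1-9][1-9]$")

# Hypothetical link texts as they appear on the year index page
links = ["2009", "2010", "2011", "2014", "2020"]
wanted = [link for link in links if year_pattern.match(link)]
print(wanted)  # ['2011', '2014']
```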

from psysci.items import PsysciItem
import urlparse
import scrapy

class PsySciSpider(scrapy.Spider):
    name = "psysci"
    allowed_domains = ["pss.sagepub.com"]
    start_urls = [
        "http://pss.sagepub.com/content/by/year"]

    def parse(self, response):
        #Regexp here is for taking abstracts from 2011 and forward
        years = response.xpath(
              '//a[re:match(text(), "^20[1-9][1-9]$")]/@href').extract()
        
        for year in years:
            year_url = urlparse.urljoin(
                response.url, year)
          
            yield scrapy.Request(year_url,
                                self.parse_volume)
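The urlparse.urljoin call resolves the relative hrefs found on the page against the page’s own URL (in Python 3 the same function lives in urllib.parse):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Resolve an extracted href against the page it was found on
base = "http://pss.sagepub.com/content/by/year"
year_url = urljoin(base, "/content/by/year/2011")
print(year_url)  # http://pss.sagepub.com/content/by/year/2011
```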

For my purpose, collecting data on articles from the journal’s volumes throughout the years, I defined more parsers. I did this since I needed the spider to follow the links for each year and find the volumes published that year:

    def parse_volume(self, response):
        #Regexp matches month and will get volumes 
        volume = response.xpath(
                        '//a[re:match(text(),"[a-zA-Z0-9_]{3,9}\s[0-9]{2}")]/@href').extract()
        
        for vol in volume:
            volume_url = urlparse.urljoin(
                response.url, vol)
          
            yield scrapy.Request(volume_url,
                                self.parse_abstract)
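The volume regex matches a word of three to nine characters followed by whitespace and a two-digit number, i.e., link texts such as “January 12” (assumed examples of how the month listings looked on the year pages):

```python
import re

# Same pattern as in parse_volume's XPath re:match()
vol_pattern = re.compile(r"[a-zA-Z0-9_]{3,9}\s[0-9]{2}")

# Hypothetical link texts on a year page
texts = ["January 12", "May 11", "Current Issue"]
volumes = [t for t in texts if vol_pattern.search(t)]
print(volumes)  # ['January 12', 'May 11']
```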

After getting the volumes we need to find the links to the abstracts. We also get the article type here (e.g., Research Reports). Short Reports do not seem to have openly available abstract links; these are not considered here, but one might be able to get them with full access to the journal. The article type is stored in “items”. Note that numberOfTypes is a method I created to count the number of articles of each type in a volume. I had to do this since not all abstracts contained this information.

    def parse_abstract(self, response):
        items = PsysciItem()
        articles = {}
        self.state['items_count'] = self.state.get('items_count', 0) + 1
        # Find the links to the abstracts
        abstract_links = response.xpath(
                        '//a[@rel="abstract"]/@href').extract()
        article_type = response.xpath(
                '//h2/span/text()').re(r"[a-zA-Z]{5,11}\s{0,1}[a-zA-Z]{0,8}$")
        #In the current version of script I drop the Short Reports since they don't have abstracts...
        article_type= [ x for x in article_type if x != 'Short Reports']
        
        for article in article_type:
            art = article.replace(' ', '')
            xpathString = '//*[@id="content-block"]/form/div[re:match(@class, ".{9}\s.{6}\s.{12}(' + art + ')\s.{0,10}\s{0,1}.{11}\s.{9}$")]/ul/li'
            articles[article] = response.xpath(xpathString + '/div/ul/li/a[@rel="abstract"]').extract()
        

        article_types = self.numberOfTypes(articles, article_type)
        
        for i, abstract in enumerate(abstract_links):
            abstract_url = urlparse.urljoin(
                response.url, abstract)
            items['articletype'] = article_types[i]
            yield scrapy.Request(abstract_url, 
                                 self.parse_content, meta={'item':items})
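The article-type regex pulls the heading texts, after which the Short Reports are filtered out. The same extraction can be tried on its own (the heading texts here are assumed examples):

```python
import re

# Same pattern used with .re() on the //h2/span headings
type_pattern = re.compile(r"[a-zA-Z]{5,11}\s{0,1}[a-zA-Z]{0,8}$")

headings = ["Research Articles", "Research Reports", "Short Reports"]
article_type = [h for h in headings if type_pattern.search(h)]
# Drop Short Reports since they lack open abstract links
article_type = [x for x in article_type if x != 'Short Reports']
print(article_type)  # ['Research Articles', 'Research Reports']
```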

Next we need to parse the content of the abstract to get which year it was published, keywords, when it was received and accepted. We store all information in items so that we can save it later on.

    def parse_content(self, response):
        items = response.meta['item']
        items['abstracturl'] = response.url
        items['pubyear'] = response.xpath(
        './/span[@class="slug-pub-date"]/text()').extract()
        items['title'] = response.xpath(
                    '//*[@id="article-title-1"]/text()').extract()  
        items['keywords'] = response.xpath(
                    './/a[@class="kwd-search"]/text()').extract()    
        items['received'] = response.xpath(
                    './/li[@class="received"]/text()').extract()
                
        items['accepted'] = response.xpath(
                    './/li[@class="accepted"]/text()').extract()
        # Note: some abstracts have id="p-1"... maybe //div[@class="section abstract"]/p/text() will be better
        items['abstract'] = response.xpath('//div[@class="section abstract"]/p/text()').extract()
                
        if len(items['title']) == 2:
            items['title'] = items['title'][0].strip() + items['title'][1].strip()
        elif len(items['title']) == 1:
            items['title'] = items['title'][0].strip()
        return items

Lastly, I created a method for counting how many papers there were of each article type in each volume that the spider collects data from.

    def numberOfTypes(self, art_types, arts):
        '''Returning a list of article types for each article
        that has an abstract-link.
        art_types = dict with article types 
        arts = list with the article types
        '''
        articleTypes = []
        for i in range(len(arts)):
            
            length = len(art_types[arts[i]])
            for nArticles in range(length):
                articleTypes.append(arts[i])
        return articleTypes
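For reference, the same expansion can be written more compactly with a list comprehension (a standalone sketch of the method above, with made-up example data):

```python
def number_of_types(art_types, arts):
    """Repeat each article type once per abstract link found for it."""
    return [art for art in arts for _ in art_types[art]]

# Example: two Research Reports links and one Research Articles link
links = {'Research Reports': ['a1', 'a2'], 'Research Articles': ['a3']}
print(number_of_types(links, ['Research Reports', 'Research Articles']))
# ['Research Reports', 'Research Reports', 'Research Articles']
```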

That was the code for the spider. Next up is the pipeline, in which the data is handled and saved to a .csv file.

Pipeline

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import csv
import os.path

class PsysciPipeline(object):

    def __init__(self):
        # Write the header row only if the file does not exist yet
        write_header = not os.path.isfile('CONTENT_psysci.csv')
        self.csvfile = open('CONTENT_psysci.csv', 'a')
        self.csvwriter = csv.writer(self.csvfile)
        if write_header:
            self.csvwriter.writerow(['Title', 'Received', 'Accepted',
                                     'Pubyear', 'Keywords',
                                     'Articletype', 'Abstract', 'Url'])

        

    def process_item(self, item, spider):        
        self.csvwriter.writerow([item['title'], item['received'], 
                                    item['accepted'], 
                                    item['pubyear'],item['keywords'],
                                    item['articletype'], 
                                        item['abstract'], item['abstracturl']])
        return item
    
    def string_handling(self, item):
        # Join a two-part title and strip whitespace
        if len(item['title']) == 2:
            item['title'] = item['title'][0] + item['title'][1].strip()
        elif item['title']:
            item['title'] = item['title'][0].strip()

        # Strip whitespace after the dates; mark missing fields
        for field in ['received', 'accepted', 'pubyear']:
            if not item[field]:
                item[field] = "Not available"
            else:
                item[field] = item[field][0].strip()
        return item

Settings

Finally, you need to add this to the settings file (“settings.py”) to make the above pipeline work:

ITEM_PIPELINES = {
    'psysci.pipelines.PsysciPipeline': 300,
}

It might be worth considering other settings, such as setting a download delay (DOWNLOAD_DELAY). I will upload this project to GitHub as soon as possible.
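For instance, a small delay between requests is polite to the server; a minimal sketch of what that could look like in settings.py (example values):

```python
# settings.py (example values)
DOWNLOAD_DELAY = 2     # seconds to wait between requests
ROBOTSTXT_OBEY = True  # respect the site's robots.txt
```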

Bonus: Wordcloud

I end this post with a wordcloud I created from all the keywords I collected using the spider. If you have ideas on what to do with the collected data, let me know.

