This rather long tutorial (long mostly because of the Python code) assumes that you are interested in scraping data off the web using Scrapy. The instructions may also be limited to Linux.

Here’s how to install a Python package using pip:

```bash
pip install scrapy
```

Note that you can also use pip to install a specific version of Scrapy, as well as to upgrade pip itself if you have an old version. Now, to create a new Scrapy project, open up a terminal, create a directory where you will store your Scrapy projects, change to that directory, and run this (change 'psysci' to whatever your project will be named):

```bash
scrapy startproject psysci
```


We start with the Item class and create an object to store our scraped data in. In my example I scrape the article title, the type of article, the article's abstract, when the article was received and when it was accepted for publication, which year (and month) it was published, and the article's keywords.

Open up in your Scrapy project's directory (i.e., psysci/psysci) using your favourite editor or IDE. This is how my Item class looks:

```python
from scrapy.item import Item, Field


class PsysciItem(Item):
    # define the fields for your item here like:
    title = Field()
    keywords = Field()
    articletype = Field()
    abstracturl = Field()
    received = Field()
    accepted = Field()
    pubyear = Field()
    abstract = Field()
```
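A Scrapy Item is populated much like a Python dict: you assign to the declared fields by key and read them back the same way. A minimal sketch of the access pattern, using a plain dict as a stand-in so it runs without Scrapy installed (a real Item would additionally reject keys that were not declared as Field()):

```python
# Stand-in for PsysciItem: a plain dict with the same field names
item = {}
item['title'] = 'An Example Title'
item['keywords'] = ['replication', 'statistics']

print(item['title'])     # -> An Example Title
print(item['keywords'])  # -> ['replication', 'statistics']
```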


The spider defines the first URL, how to follow links/pagination, and how to extract and parse the fields defined above. We define the following:

name: the spider’s unique identifier
start_urls: URLs the spider begins crawling at
parse: the method called with the downloaded Response object of each start URL. It parses the first page, where the links to each year are located; if you only collect links from one page, a single parser may be all you need. Here I used a regular expression to get the links whose text is a year from 2011 and up (earlier years and volumes did not have the received and accepted data I wanted). The matching year links end up in a list, and in a loop each one is joined with the base URL and yielded as a new Request.
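The year regexp used in parse can be tried out on its own with Python's re module. The link texts below are made up for illustration:

```python
import re

# Same pattern as in the spider's XPath re:match
year_pattern = re.compile(r"^20[1-9][1-9]$")

link_texts = ["2010", "2011", "2013", "1999", "Current Issue"]
matching = [t for t in link_texts if year_pattern.match(t)]
print(matching)  # -> ['2011', '2013']
```

Note that the pattern misses any year containing a zero after "20" (e.g., 2020), which did not matter for the years of interest here (2011 and the following years of that decade).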

```python
from psysci.items import PsysciItem
import urlparse

import scrapy


class PsySciSpider(scrapy.Spider):
    name = "psysci"
    allowed_domains = ["", ""]
    start_urls = [""]

    def parse(self, response):
        # Regexp here is for taking abstracts from 2011 and forward
        years = response.xpath(
            '//a[re:match(text(), "^20[1-9][1-9]$")]/@href').extract()
        for year in years:
            year_url = urlparse.urljoin(response.url, year)
            yield scrapy.Request(year_url, self.parse_volume)
```

For my purpose, collecting data on articles from journal volumes throughout the years, I defined more parsers. I did this since I needed the spider to follow the links for each year and find the volumes published that year:

```python
    def parse_volume(self, response):
        # Regexp matches month and will get volumes
        volume = response.xpath(
            '//a[re:match(text(), "[a-zA-Z0-9_]{3,9}\s[0-9]{2}")]/@href'
        ).extract()
        for vol in volume:
            volume_url = urlparse.urljoin(response.url, vol)
            yield scrapy.Request(volume_url, self.parse_abstract)
```
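The volume regexp can also be tested stand-alone: it matches a run of 3 to 9 word characters (a month name), a space, and a two-digit year. A sketch with made-up link texts:

```python
import re

volume_pattern = re.compile(r"[a-zA-Z0-9_]{3,9}\s[0-9]{2}")

link_texts = ["January 11", "September 13", "About the journal"]
matching = [t for t in link_texts if volume_pattern.match(t)]
print(matching)  # -> ['January 11', 'September 13']
```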

After getting the volumes we need to find links to the abstracts. We also get the article type here (e.g., Research Reports). Short reports do not seem to have open links to their abstracts; these are not considered, but one might be able to get them with full access to the journal. In the item we store the article type of each article. Note that numberOfTypes is a method I created to count the number of article types each volume has. I had to do this since not all abstracts contained this information.

```python
    def parse_abstract(self, response):
        articles = {}
        self.state['items_count'] = self.state.get('items_count', 0) + 1
        # Find links to abstracts
        abstract_links = response.xpath(
            '//a[@rel="abstract"]/@href').extract()
        article_type = response.xpath('//h2/span/text()').re(
            r"[a-zA-Z]{5,11}\s{0,1}[a-zA-Z]{0,8}$")
        # In the current version of the script I drop the Short Reports
        # since they don't have abstracts...
        article_type = [x for x in article_type if x != 'Short Reports']
        for article in article_type:
            art = article.replace(' ', '')
            xpathString = ('//*[@id="content-block"]/form/div[re:match(@class,'
                           '".{9}\s.{6}\s.{12}(' + art +
                           ')\s.{0,10}\s{0,1}.{11}\s.{9}$")]/ul/li')
            articles[article] = response.xpath(
                xpathString + '/div/ul/li/a[@rel="abstract"]').extract()
        article_types = self.numberOfTypes(articles, article_type)
        for i, abstract in enumerate(abstract_links):
            # Create a new item for each abstract; sharing one item across
            # requests would overwrite fields before the callbacks run
            items = PsysciItem()
            items['articletype'] = article_types[i]
            abstract_url = urlparse.urljoin(response.url, abstract)
            yield scrapy.Request(abstract_url, self.parse_content,
                                 meta={'item': items})
```

Next we need to parse the content of the abstract page to get the year it was published, the keywords, and when it was received and accepted. We store all information in the item so that we can save it later on.

```python
    def parse_content(self, response):
        items = response.meta['item']
        items['abstracturl'] = response.url
        items['pubyear'] = response.xpath(
            './/span[@class="slug-pub-date"]/text()').extract()
        items['title'] = response.xpath(
            '//*[@id="article-title-1"]/text()').extract()
        items['keywords'] = response.xpath(
            './/a[@class="kwd-search"]/text()').extract()
        items['received'] = response.xpath(
            './/li[@class="received"]/text()').extract()
        items['accepted'] = response.xpath(
            './/li[@class="accepted"]/text()').extract()
        # Note: some abstracts have id="p-1"... maybe using
        # //div[@class="section abstract"]/p/text() will be better
        items['abstract'] = response.xpath(
            '//div[@class="section abstract"]/p/text()').extract()
        # Titles sometimes come back as two fragments; join them
        if len(items['title']) == 2:
            items['title'] = (items['title'][0].strip() +
                              items['title'][1].strip())
        elif len(items['title']) == 1:
            items['title'] = items['title'][0].strip()
        return items
```
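The title handling at the end of parse_content can be sketched as a stand-alone function (join_title is a hypothetical helper name, not part of the spider):

```python
def join_title(parts):
    # parts is the list that response.xpath(...).extract() returns:
    # usually one string, sometimes two fragments, possibly empty
    if len(parts) == 2:
        return parts[0].strip() + parts[1].strip()
    elif len(parts) == 1:
        return parts[0].strip()
    return ''


print(join_title(['Does Money Buy Happiness', '?  ']))
# -> Does Money Buy Happiness?
```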

Lastly, I created a method for counting how many papers there were of each article type in each volume that the spider collects data from.

```python
    def numberOfTypes(self, art_types, arts):
        '''Return a list with one article-type entry for each article
        that has an abstract link.

        art_types = dict mapping article type to its abstract links
        arts = list with the article types
        '''
        articleTypes = []
        for art in arts:
            for _ in art_types[art]:
                articleTypes.append(art)
        return articleTypes
```
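To see what numberOfTypes produces, here it is stripped of self and run on made-up data: for each type in arts it repeats the type name once per abstract link, so the result lines up index-by-index with abstract_links in parse_abstract:

```python
def numberOfTypes(art_types, arts):
    # Same logic as the spider method, without self
    articleTypes = []
    for art in arts:
        for _ in art_types[art]:
            articleTypes.append(art)
    return articleTypes


links_by_type = {
    'Research Articles': ['link1', 'link2'],
    'Research Reports': ['link3'],
}
order = ['Research Articles', 'Research Reports']
print(numberOfTypes(links_by_type, order))
# -> ['Research Articles', 'Research Articles', 'Research Reports']
```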

That was the code for the spider. Next is the pipeline, in which the data is handled and saved to a .csv file.


```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting

import csv
import os.path


class PsysciPipeline(object):

    def __init__(self):
        # Write the header row only if the file does not exist yet
        write_header = not os.path.isfile('CONTENT_psysci.csv')
        self.csvwriter = csv.writer(open('CONTENT_psysci.csv', 'a'))
        if write_header:
            self.csvwriter.writerow(
                ['Title', 'Received', 'Accepted', 'Pubyear', 'Keywords',
                 'Articletype', 'Abstract', 'Url'])

    def process_item(self, item, spider):
        item = self.string_handling(item)
        self.csvwriter.writerow(
            [item['title'], item['received'], item['accepted'],
             item['pubyear'], item['keywords'], item['articletype'],
             item['abstract'], item['abstracturl']])
        return item

    def string_handling(self, items):
        # Strip whitespace after the dates; use a placeholder when a
        # field is missing
        for key in ('received', 'accepted', 'pubyear'):
            if not items[key]:
                items[key] = "Not available"
            elif isinstance(items[key], list):
                items[key] = items[key][0].strip()
        return items
```
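The header-only-once idea in the pipeline can be sketched without Scrapy: check whether the file exists before opening it in append mode, and write the header row only on the first run. The filename and helper name here are made up for illustration:

```python
import csv
import os.path


def append_row(path, header, row):
    # Write the header only the first time the file is created
    write_header = not os.path.isfile(path)
    with open(path, 'a', newline='') as f:  # newline='' is Python 3 csv style
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerow(row)


append_row('example_psysci.csv',
           ['Title', 'Received'],
           ['An Example Title', 'January 1, 2011'])
```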


Finally, you need to add this to the settings file ( to make the above pipeline work:

```python
ITEM_PIPELINES = {
    'psysci.pipelines.PsysciPipeline': 300,
}
```

It might be worth considering other settings, such as setting a download delay (DOWNLOAD_DELAY). I will upload this project to GitHub as soon as possible.
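A download delay goes in the same Scrapy settings file; for example (the values here are illustrative, not necessarily what I used):

```python
# excerpt (illustrative values)
DOWNLOAD_DELAY = 2  # wait two seconds between requests

ITEM_PIPELINES = {
    'psysci.pipelines.PsysciPipeline': 300,
}
```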

Bonus: Wordcloud

I end this post with a wordcloud I created from all the keywords collected using the spider. If you have ideas on what to do with the collected data, let me know.
