Crawling data with Memorious

Memorious is the web crawling framework that is part of the Aleph toolkit. It can be used to periodically retrieve structured and unstructured data from the web and load it into Aleph.

Memorious (named after Funes the Memorious) is a light-weight distributed web scraping toolkit. It can:

  • Maintain an overview of a fleet of crawlers
  • Scrape and store both structured and unstructured data from the web
  • Load the scraped data into Aleph in a variety of ways
  • Schedule crawler execution in regular intervals
  • Store execution information and error messages
  • Distribute scraping tasks across multiple machines
  • Make crawlers modular and simple tasks re-usable
  • Get out of your way as much as possible

Memorious management interface

Memorious has a neat user interface to monitor the status of your crawler fleet at a glance. The interface also lets you start, stop and inspect crawlers with ease.

How it works

Memorious crawlers consist of a YAML configuration file and, optionally, Python functions that define custom crawler operations. Common operations come built in: frequent tasks such as making HTTP requests or writing data into a database can be handled without writing any code. Memorious also provides handy utilities to load the scraped data into Aleph for further processing; a sketch of a custom Python operation follows the configuration example below.

A really simple crawler configuration in Memorious might look like this:

# Scraper for the OCCRP web site.
name: occrp_web_site
description: 'Organized Crime and Corruption Reporting Project'
# Uncomment to run this scraper automatically:
# schedule: weekly
pipeline:
  init:
    # This first stage will get the ball rolling with a seed URL.
    method: seed
    params:
      urls:
        - https://occrp.org
    handle:
      pass: fetch

  fetch:
    # Download the seed page
    method: fetch
    params:
      # These rules specify which pages should be scraped or included:
      rules:
        and:
          - domain: occrp.org
    handle:
      pass: parse

  parse:
    # Parse the scraped pages to check whether they contain additional links.
    method: parse
    params:
      # Additional rules to determine if a scraped page should be stored or not.
      # In this example, we're only keeping PDFs, word files, etc.
      store:
        or:
          - mime_group: archives
          - mime_group: documents
    handle:
      store: store
      # this makes it a recursive web crawler:
      fetch: fetch

  store:
    # Push the crawled documents into an Aleph collection
    method: aleph_emit
    params:
      collection: occrp_web_site
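
The optional Python functions mentioned above follow a simple convention: an operation is a plain function that receives a context object (which gives access to stage parameters, logging and an HTTP session) and a data dictionary handed over from the previous stage, and passes its results on with context.emit. The sketch below assumes that convention; the clean_title function and its title_prefix parameter are hypothetical names, made up for illustration:

def clean_title(context, data):
    # Read a stage parameter defined in the YAML config
    # (title_prefix is a hypothetical parameter name):
    prefix = context.params.get("title_prefix", "")

    # Modify the data dict that travels through the pipeline:
    title = data.get("title", "")
    data["title"] = " ".join([prefix, title]).strip()
    context.log.info("Cleaned title: %s", data["title"])

    # Hand the result on to the next stage named in this stage's
    # handle block (by default, the rule called "pass"):
    context.emit(data=data)

In the pipeline YAML, such a function would be referenced by its module path rather than a built-in name, for example method: mycrawler.operations:clean_title. Once configured, a crawler can also be started by hand with the command line tool, e.g. memorious run occrp_web_site.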

Getting started

To learn more about Memorious, you can: