alephclient CLI

Python-based command-line client and library for interacting with the Aleph API.

alephclient is a command-line client for Aleph. It can be used to bulk import document sets via the API, without direct access to the server.

Installation

To install alephclient, you need to have Python 3 installed and working on your computer. You may also want to create a virtual environment using virtualenv or pyenv. With that done, type:

pip install alephclient
alephclient --help

Configuration

alephclient needs to know the URL of the Aleph instance that it will connect to. For privileged operations, it will also need the user's API key. The API key can be found on the user's profile page in the Aleph web application.

Both settings can be provided via the environment variables ALEPHCLIENT_HOST and ALEPHCLIENT_API_KEY, or by passing the --host and --api-key options directly to the command. The examples below assume you have set the environment variables:

export ALEPHCLIENT_HOST=https://data.occrp.org/
export ALEPHCLIENT_API_KEY=7b08118ad9e19be0a82906064ac9c19c8eee4869

Really enthusiastic users of Aleph might want to add these settings to their shell login file (e.g. .bashrc or .zshrc).
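
When using alephclient as a Python library, the same configuration applies. Below is a minimal sketch, assuming the AlephAPI client from alephclient.api, which falls back to these environment variables when no arguments are given:

from alephclient.api import AlephAPI

# With no arguments, the client is expected to pick up ALEPHCLIENT_HOST
# and ALEPHCLIENT_API_KEY from the environment.
api = AlephAPI()

# The host and API key can also be passed explicitly.
api = AlephAPI(
    host="https://data.occrp.org/",
    api_key="7b08118ad9e19be0a82906064ac9c19c8eee4869",
)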

Importing all files from a directory

The crawldir command crawls through a given directory recursively and uploads all the files and directories inside it to a collection. The external identifier (foreign ID) of the target collection needs to be passed to the command with the --foreign-id option. If a collection with the given foreign ID does not exist, it will be created. The path argument needs to be a valid path to a file or directory:

alephclient crawldir --foreign-id wikileaks-cable /Users/sunu/data/cable

When Aleph imports data, it performs optical character recognition (OCR) on images contained in the material. This works better when Aleph already has an idea of the language the documents might use. This can be specified with the --language option, which expects a 3-letter ISO 639 language code. The option can be given multiple times when the directory contains files in more than one language.

alephclient crawldir --language rus --foreign-id russian_leak /Users/sunu/data/russian_leak
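
The same kind of import can also be scripted from Python. The following is a rough sketch, assuming the load_collection_by_foreign_id and ingest_upload methods of AlephAPI; the file path is a placeholder:

from pathlib import Path
from alephclient.api import AlephAPI

api = AlephAPI()

# Fetch the target collection, creating it if it does not exist yet.
collection = api.load_collection_by_foreign_id("wikileaks-cable")

# Upload a single file; crawldir performs this recursively for a directory tree.
path = Path("/Users/sunu/data/cable/example.pdf")
api.ingest_upload(collection["id"], path, metadata={"file_name": path.name})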

Writing a stream of entities to a collection

Using the write-entities command, users can load JSON entities in the followthemoney structure into an Aleph collection. This can be used in conjunction with the command-line tools for generating such data provided by followthemoney. Data that is loaded this way should be aggregated before being written to the API, for example using the ftm aggregate command-line utility or the followthemoney-store database layer.

A typical use might look like this:

curl -o md_companies.yml https://raw.githubusercontent.com/alephdata/aleph/master/mappings/md_companies.yml
ftm map md_companies.yml | ftm aggregate | alephclient write-entities -f md_companies
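
The same loading step can also be done from Python. Here is a hedged sketch, assuming the write_entities method of AlephAPI and a file of aggregated followthemoney entities in line-delimited JSON (the file name is a placeholder):

import json
from alephclient.api import AlephAPI

api = AlephAPI()
collection = api.load_collection_by_foreign_id("md_companies")

def read_entities(path):
    # One JSON-formatted followthemoney entity per line.
    with open(path, "r") as fh:
        for line in fh:
            yield json.loads(line)

# Bulk-write the aggregated entities into the collection.
api.write_entities(collection["id"], read_entities("md_companies.json"))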

Streaming entities from a collection

The command stream-entities is the inverse of write-entities. It will stream entities from the given Aleph instance so that they can be written to a file or piped into another processing step:

alephclient stream-entities -f us_ofac >us_ofac.json

With the Follow the Money tools installed, you can build more complex commands like this:

alephclient stream-entities -f us_ofac | ftm validate | ftm export-excel -o OFAC.xlsx
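
Streaming can also be driven from Python. Below is a minimal sketch, assuming the stream_entities method of AlephAPI, which yields each entity as a followthemoney JSON object:

import json
from alephclient.api import AlephAPI

api = AlephAPI()
collection = api.get_collection_by_foreign_id("us_ofac")

# Stream entities from the collection and write them out, one JSON object per line.
with open("us_ofac.json", "w") as fh:
    for entity in api.stream_entities(collection):
        fh.write(json.dumps(entity) + "\n")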

Streaming very large collections from an Aleph instance is a resource-consuming activity on the server side. Please stream collections with more than 100,000 entities only after making sure that the server administrators are OK with it.

Executing an entity mapping on the server

The bulkload command executes an entity mapping within the Aleph system. Its only argument is a YAML mapping file:

curl -o md_companies.yml https://raw.githubusercontent.com/alephdata/aleph/master/mappings/md_companies.yml
alephclient bulkload md_companies.yml

Using mappings within Aleph is falling out of use; we recommend generating and aggregating the entity stream outside the platform for a more reproducible and resumable data processing workflow.