Python-based command line client and library for interacting with the Aleph API.
alephclient is a command-line client for Aleph. It can be used to bulk import document sets via the API, without direct access to the server.
alephclient, you need to have Python 3 installed and working on your computer. You may also want to create a virtual environment using
pyenv. With that done, type:
pip install alephclient alephclient --help
alephclient needs to know the url of an Aleph instance that it will connect to. For privileged operations, it will also need to know the user’s API key. The API key can be found in the users profile page on the Aleph web application.
Both settings can be provided by setting the environment variables
ALEPHCLIENT_API_KEY respectively; or by passing them in with
--api-key options directly for the command. The examples below will assume you have set up the environment variables:
export ALEPHCLIENT_HOST=https://data.occrp.org/ export ALEPHCLIENT_API_KEY=7b08118ad9e19be0a82906064ac9c19c8eee4869
Really enthusiastic users of Aleph might want to add these settings to their shell login file (
Importing all files from a directory
crawldir command crawls through a given directory recursively and uploads all the files and directories inside it to a collection. The external identifier (foreign ID) of the target collection needs to be passed to the command with
--foreign-id option. If a collection with the given foreign ID does not exist, it will be created. The
pathargument needs to be a valid path to a file or directory:
alephclient crawldir --foreign-id wikileaks-cable /Users/sunu/data/cable
When Aleph imports data, it performs optical character recognition (OCR) on images contained in the material. This works better when Aleph already has an idea of the language the documents might use. This can be specified with the
--language option, which expects a 3-letter ISO 639 language code. It can be specified multiple times, for when the directory contains files in more than one language.
alephclient crawldir --language rus --foreign-id russian_leak /Users/sunu/data/russian_leak
Writing a stream of entities to a collection
write-entities command, users can load JSON-formatted entities formatted in the
followthemoney structure into an aleph collection. This can be used in conjunction with the command-line tools for generating such data provided by
followthemoney. Data that is loaded this way should be aggregated before being written to the API, for example using the
ftm aggregate command-line utility, or the
followthemoney-store database layer.
A typical use might look this:
curl -o md_companies.yml https://raw.githubusercontent.com/alephdata/aleph/master/mappings/md_companies.yml ftm map md_companies.yml | ftm aggregate | alephclient write-entities -f md_companies
Streaming entities from a collection
stream-entities is the inverse of
write-entities. It will stream entities from the given Aleph instance so that they can be written to a file or piped into another processing step:
alephclient stream-entities -f us_ofac >us_ofac.json
With the Follow the Money tools installed, you can build more complex command like such:
alephclient stream-entities -f us_ofac | ftm validate | ftm export-excel -o OFAC.xlsx
Streaming very large collections from an Aleph instance is a resource-consuming activity on the server side. Please only stream collections with more than 100,000 entities after making sure that the server administrators are OK with it.
Executing an entity mapping on the server
bulkload command executes an entity mapping within the Aleph system. Its only argument is a YAML mapping file:
curl -o md_companies.yml https://raw.githubusercontent.com/alephdata/aleph/master/mappings/md_companies.yml alephclient bulkload md_companies.yml
Using mappings within Aleph is falling out of use, we recommend generating and aggregating the entity stream outside the platform for a more reproducible and resumable data processing workflow.