Technical introduction

Aleph is a toolkit of powerful components for processing knowledge graphs, centered on the Aleph API server and its document processing framework.

Aleph is an open source toolkit for investigative data analysis. It allows generating, searching and analysing large graphs of heterogeneous data, including public records, structured databases and leaked evidence. The system can integrate data from unstructured formats (like PDF, email and other file types) as well as structured data such as CSV files or SQL databases. Data that has been loaded can be securely searched, cross-referenced with other datasets and exported to other systems.

At the core of Aleph’s capabilities is Follow the Money (FtM), a shared data model that encapsulates core concepts such as People, Companies, Documents or Contracts. Such data can be generated from tabular inputs, or via the ingest-file system that extracts data from dozens of input formats (including Word, PowerPoint, PDF, Access, email, ZIP archives and so on).
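The FtM model is easiest to picture in its serialized form: every entity carries an id, a schema name (such as Person or Company), and a map of properties where each property holds a list of values. Relationships are themselves entities (e.g. a Directorship linking a director to an organization). The sketch below builds entities of that shape using plain Python dictionaries; the helper function and all sample values are illustrative, and this is not the followthemoney library API.

```python
# Minimal sketch of the Follow the Money entity shape: an id, a schema,
# and a mapping of property names to lists of values. The helper and
# the sample data are invented for illustration; this is not the
# followthemoney library API.

def make_entity(entity_id, schema, **properties):
    """Build an FtM-style entity dict; each property holds a list of values."""
    return {
        "id": entity_id,
        "schema": schema,
        "properties": {key: list(values) for key, values in properties.items()},
    }

person = make_entity(
    "person-1",
    "Person",
    name=["Jane Doe"],
    nationality=["de"],
)

company = make_entity(
    "company-1",
    "Company",
    name=["Acme GmbH"],
)

# Relationships are entities too: a Directorship references the ids of
# the director and the organization it connects.
directorship = make_entity(
    "directorship-1",
    "Directorship",
    director=["person-1"],
    organization=["company-1"],
)
```

Because every property is multi-valued, entities merged from several sources can accumulate alternative names or identifiers without losing information.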

The basics

Getting data in and out

The Aleph system also includes Memorious, a crawler framework that lets you write, manage and control a fleet of scrapers to maintain up-to-date copies of public records from the web.
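Conceptually, a crawler in this style is a pipeline of stages (fetch, parse, store) that pass a task payload from one to the next. The toy stand-in below illustrates that pattern with plain Python functions; the stage names, payload fields and fake response are invented for the example and none of this is the Memorious API.

```python
# Toy illustration of a staged crawler pipeline: each stage receives a
# data dict, does its work, and hands the result to the next stage.
# All names and payloads are invented; this is not the Memorious API.

def fetch(data):
    # A real crawler would perform an HTTP request here; we fake a
    # response body so the example stays self-contained and offline.
    data["content"] = f"<html><title>{data['url']}</title></html>"
    return data

def parse(data):
    # Extract a record from the fetched content.
    start = data["content"].index("<title>") + len("<title>")
    end = data["content"].index("</title>")
    data["title"] = data["content"][start:end]
    return data

def store(data, sink):
    # Persist the parsed record; a real crawler would write to disk or
    # a database, like the datavault described below.
    sink.append({"url": data["url"], "title": data["title"]})
    return data

def run_pipeline(seeds, sink):
    # Chain the stages for every seed URL, as a crawler scheduler would.
    for url in seeds:
        store(parse(fetch({"url": url})), sink)

records = []
run_pipeline(["https://example.com/a", "https://example.com/b"], records)
```

Keeping stages small and stateless like this is what lets a crawler framework schedule, retry and monitor them independently.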

Architecture overview

Overview of the architecture used within OCCRP, showing the role of the different components: the Memorious crawling toolkit, Aleph, and alephclient. The datavault is a simple PostgreSQL database that collects output from scrapers, which is eventually projected into Aleph.


We’re keen to consider pull requests for extensions or bug fixes in all components of the platform. An ideal submission already follows common coding standards, such as PEP 8, and, when it significantly changes functionality, includes a test case.

Please also consider dropping by the Slack instance beforehand to discuss your idea.