Aleph

Ingest Pipeline

When you upload files to Aleph, Aleph ingests these files in a multi-step process. For example, Aleph tries to convert certain file formats, performs optical character recognition (OCR), and tries to extract names, locations, IBAN account numbers and more from unstructured text. This article explains what happens when you upload a file to Aleph. It’s aimed at developers who want to make changes to the ingest pipeline.

This article describes the process of uploading and ingesting a Word document as an example. It is meant as a high-level overview of the system components and stages involved in the ingestion process. Depending on the file type and configuration options, the process might differ from what’s described here.

Overview

The following graph shows an overview of all stages of the ingestion pipeline (for a Word document) as well as the system components handling each step. Continue reading below for a description of the individual stages.

graph LR
  subgraph client["UI"]
    upload("Upload file")
  end
  subgraph api["API"]
    archive("Archive file")
    record("Create Document record")
    index-api("Index entity")
    dispatch-ingest("Dispatch ingest")
  end
  subgraph ingest-file-ingest["ingest-file (ingest)"]
    ingestor("Pick ingestor")
    metadata("Extract metadata")
    pdf("Convert to PDF")
    cache-pdf("Cache converted PDF")
    text("Extract text")
    fragments-ingest("Write fragments")
    dispatch-analyze("Dispatch analyze")
  end
  subgraph ingest-file-analyze["ingest-file (analyze)"]
    languages("Detect languages")
    ner("Run NER")
    patterns("Extract patterns")
    fragments-analyze("Write fragments")
    dispatch-index("Dispatch index")
  end
  subgraph worker["Aleph worker (index)"]
    index-worker("Index entities")
    stats("Update stats")
  end
  client-->api
  api-->ingest-file-ingest
  ingest-file-ingest-->ingest-file-analyze
  ingest-file-analyze-->worker
  archive-->record
  record-->index-api
  index-api-->dispatch-ingest
  ingestor-->metadata
  metadata-->pdf
  pdf-->cache-pdf
  cache-pdf-->text
  text-->fragments-ingest
  fragments-ingest-->dispatch-analyze
  languages-->ner
  ner-->patterns
  patterns-->fragments-analyze
  fragments-analyze-->dispatch-index
  index-worker-->stats

User interface

There are many ways of uploading a file to Aleph. End users looking to upload a small number of files may do so via the Aleph web UI. Aleph also provides a command-line client, alephclient, that simplifies uploading files in bulk, and integrates with a scraping toolkit called Memorious. However, no matter which method of uploading files to Aleph you use, file uploads will always be handled by Aleph’s ingest API and the rest of the pipeline will be the same.

Upload

A user navigates to an investigation in the web UI and uploads a Microsoft Word (.docx) document. The upload is handled by the ingest API.

API

Archive file

The API endpoint stores the uploaded file using the configured storage backend. Storage backends are implemented in servicelayer. Supported backends include a simple backend using the host’s file system, AWS S3 (and other services with an S3-compatible API), and Google Cloud Storage.

Files are stored at a path derived from a SHA1 hash of the file’s contents. For example, a file with the content hash 34d4e388b7994b3846504e89f54e10a6fd869eb8 would be stored at the path 34/d4/e3/34d4e388b7994b3846504e89f54e10a6fd869eb8.

Storing files this way allows easy retrieval as long as the content hash is known, and ensures automatic deduplication if the same file is uploaded multiple times.
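
For illustration, computing the content hash and deriving the storage path could be sketched in Python like this (a simplified version of what the servicelayer storage backends do):

import hashlib
from pathlib import Path

def content_hash(file_path: str) -> str:
    """Compute the SHA1 hash of a file's contents."""
    sha1 = hashlib.sha1()
    with open(file_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            sha1.update(chunk)
    return sha1.hexdigest()

def archive_path(digest: str) -> Path:
    """Derive the storage path from the content hash, e.g.
    34d4e388... -> 34/d4/e3/34d4e388..."""
    return Path(digest[:2], digest[2:4], digest[4:6], digest)

# archive_path(content_hash("my-file.docx"))
# -> 34/d4/e3/34d4e388b7994b3846504e89f54e10a6fd869eb8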

Create Document record

After the file has been stored, the ingest API creates a Document record and saves it to the app database. This record stores metadata about the file, the user that uploaded it, the parent directory it was uploaded to, and the content hash.

Index entity

The API also indexes a preliminary version of the document as a FollowTheMoney entity. At this point, the entity only contains basic metadata, but it already shows up in the web UI, even before the ingest pipeline has extracted any of the file’s contents.

Dispatch ingest task

Finally, the ingest API dispatches an ingest task for the entity. This pushes a task object to the relevant Redis queue. The task payload consists of the document serialized as a (stub) FollowTheMoney entity (including the content hash) and additional context (including information about the investigation the file has been uploaded to).
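
As an illustration, dispatching such a task could look roughly like the sketch below. The queue name and payload layout are simplified assumptions; the actual task format and queueing logic are implemented in servicelayer.

import json
import redis

conn = redis.Redis()

# Illustrative payload: a stub FollowTheMoney entity plus context.
task = {
    "operation": "ingest",
    "entity": {
        "id": "97e1f...",
        "schema": "Document",
        "properties": {
            "contentHash": ["34d4e388..."],
            "fileName": ["my-file.docx"],
        },
    },
    "context": {"collection_id": 42},  # e.g. the investigation ID
}

conn.rpush("ingest", json.dumps(task))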

Ingest

Ingest tasks are handled by ingest-file workers that run in separate containers. In order to ingest a file, the worker extracts the content hash and retrieves the file from the storage backend.

Pick an ingestor

ingest-file supports many different file types (for example office documents, PDFs, spreadsheets, …). To handle the specifics of each file type, ingest-file implements a number of ingestors, each responsible for the processing steps required by a particular set of file types.

When an ingest-file worker picks up a new ingest task, it tries to find the most specific ingestor based on the file’s MIME type or file extension. In the case of the Word document, ingest-file picks the OfficeOpenXMLIngestor, which is suitable for parsing documents in the Office Open XML formats used by recent versions of Microsoft Word and PowerPoint.

Depending on the file type, other ingestors may be used to ingest the file, and the processing steps may differ to a greater or lesser extent. For example, when uploading a text document created with LibreOffice Writer, a different ingestor will be used, but many of the subsequent processing steps will be the same. However, when uploading an email mailbox, the processing steps will differ significantly.
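
A simplified sketch of this selection logic, assuming each ingestor declares the MIME types and file extensions it supports (the real ingestors implement their own matching and scoring):

# Hypothetical, simplified ingestor registry for illustration.
class OfficeOpenXMLIngestor:
    MIME_TYPES = [
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ]
    EXTENSIONS = ["docx", "pptx"]

INGESTORS = [OfficeOpenXMLIngestor]  # ...plus many more in practice

def pick_ingestor(mime_type: str, extension: str):
    """Return the first ingestor that claims the file's MIME type or extension."""
    for ingestor in INGESTORS:
        if mime_type in ingestor.MIME_TYPES or extension in ingestor.EXTENSIONS:
            return ingestor
    raise ValueError("no ingestor found for this file type")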

Extract document metadata

First, the ingestor extracts metadata from the document, for example the document title, author, and creation/modification timestamps.

Convert to PDF

For further processing and previewing in the Aleph web UI, ingest-file converts many common Office-like file types to a PDF file. It uses a headless LibreOffice subprocess to convert the Word file (previously retrieved from the storage backend) to a PDF file.
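
The conversion is roughly equivalent to running LibreOffice in headless mode, as in the sketch below (ingest-file manages the LibreOffice process itself; the output directory here is just an example):

import subprocess

# Convert a Word document to PDF using a headless LibreOffice process.
subprocess.run(
    [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", "/tmp/converted",
        "my-file.docx",
    ],
    check=True,
    timeout=300,
)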

The resulting PDF file is stored using the configured storage backend (in the same way the source Word file was stored, i.e. a SHA1 hash is computed and used to derive the path the PDF file is stored at).

ingest-file sets the pdfHash property on the FtM entity representing the uploaded file, so the converted PDF file is available to subsequent processing steps.

Cache PDF conversion results

ingest-file also caches the result of the PDF conversion. When a source file with the same content hash is uploaded a second time, it will reuse the converted PDF document instead of converting it again.

Extract text

ingest-file then uses PyMuPDF to extract the text contents from the generated PDF file. It also runs optical character recognition (OCR) on images embedded in the PDF, which means that ingest-file is able to extract text from scanned documents.
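
A minimal sketch of the per-page text extraction with PyMuPDF might look like this (the OCR of embedded images is a separate step and not shown):

import fitz  # PyMuPDF

# Extract the text contents of every page of the converted PDF.
with fitz.open("my-file.pdf") as pdf:
    for number, page in enumerate(pdf, start=1):
        text = page.get_text()
        print(number, text[:80])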

Reminder: Under the hood, ingest-file uses followthemoney-store to store entity data. followthemoney-store stores entity data as “fragments”. Every fragment stores a subset of the properties. The advantage of this approach is that data that is related to the same entity but originates from several different processing steps is stored in a way that retains provenance and can be flushed or updated granularly by origin. In the indexing step, all fragments with the same entity ID are merged into a single entity.

For every page in the Word document, Aleph emits an entity fragment for a main Pages entity that contains the extracted text. Thus, once these fragments are merged into a single Pages entity, there exists one Pages entity that contains the extracted text for the entire document.

In addition, for every page, Aleph emits a separate Page entity fragment that contains the extracted text of a single page, metadata (e.g. the page number), and a reference to the main Pages entity.

This way, users can search for documents that contain a search query on any of the pages, or for individual pages within a specific document that contain the search query.
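
Using the followthemoney library, emitting these entities could be sketched roughly as follows (the property names match the fragments shown below; the bookkeeping done by ingest-file is omitted):

from followthemoney import model

# One `Pages` entity for the document as a whole...
pages = model.make_entity("Pages")
pages.id = "97e1f..."
pages.add("fileName", "my-file.docx")

extracted = ["Text content page 1...", "Text content page 2..."]
for number, text in enumerate(extracted, start=1):
    # ...which accumulates the extracted text of all pages...
    pages.add("indexText", text)

    # ...plus one `Page` entity per page, referencing the `Pages` entity.
    page = model.make_entity("Page")
    page.make_id(pages.id, str(number))
    page.add("index", str(number))
    page.add("bodyText", text)
    page.add("document", pages)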

Write fragments

Any entities that have been emitted in the process so far (i.e. one Pages entity and multiple Page entities) are now written to FtM store, the Postgres database that acts as the “source of truth” for all entity data.

At this point, FtM store will contain multiple entity fragments related to the uploaded file:

id: 97e1f…, origin: ingest, fragment: default

// File metadata and hashes of source and PDF files
{
  "schema": "Pages",
  "properties": {
    "fileName": ["my-file.docx"],
    "contentHash": ["128e0..."],
    "pdfHash": ["5355c..."],
    // ...
  }
}

id: 97e1f…, origin: ingest, fragment: eae30…

{
  "schema": "Pages",
  "properties": {
    "indexText": ["Text content page 1..."]
  }
}

id: 97e1f…, origin: ingest, fragment: 544d4…

{
  "schema": "Pages",
  "properties": {
    "indexText": ["Text content page 2..."]
  }
}

id: c00b3…, origin: ingest, fragment: default

// `Page` entity for the first page
{
  "schema": "Page",
  "properties": {
    "index": ["1"],
    "bodyText": ["Text content page 1..."],
    "document": ["97e1f..."], // ID of the `Pages` entity
    // ...
  }
}

id: 81424…, origin: ingest, fragment: default

// `Page` entity for the second page
{
  "schema": "Page",
  "properties": {
    "index": ["2"],
    "bodyText": ["Text content page 2..."],
    "document": ["97e1f..."], // ID of the `Pages` entity
    // ...
  }
}

Dispatch analyze task

Finally, the ingestor dispatches an analyze task. This pushes a task object to the relevant ingest queue. The task object includes the IDs of all entities written in the previous step.

Analyze

The analyze task is handled by ingest-file workers as well. First, ingest-file retrieves the entity fragments written to the FollowTheMoney Store.

Detect languages

ingest-file uses the fastText text classification library with a pre-trained model to detect the language of the document if it is not specified explicitly.
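
A minimal sketch of language detection with fastText and the pre-trained lid.176 language identification model (the model path is an assumption):

import fasttext

model = fasttext.load_model("lid.176.ftz")

# fastText expects a single line of text.
text = "Text content page 1...".replace("\n", " ")
labels, scores = model.predict(text)
language = labels[0].replace("__label__", "")  # e.g. "en"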

Named-entity recognition (NER)

ingest-file uses the spaCy natural-language processing (NLP) framework and a number of pre-trained models for different languages to extract names of people, organizations, and countries from the text previously extracted from the Word document.
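
As an illustration, a minimal NER run with spaCy might look like this (the multilingual xx_ent_wiki_sm model is used here as an example; ingest-file picks models based on the detected languages):

import spacy

nlp = spacy.load("xx_ent_wiki_sm")

doc = nlp("John Doe signed the contract in Berlin.")
for ent in doc.ents:
    # e.g. ("John Doe", "PER"), ("Berlin", "LOC")
    print(ent.text, ent.label_)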

Extract patterns

In addition to NLP techniques, Aleph also uses simple regular expressions to extract phone numbers, IBAN bank account numbers, and email addresses from documents.
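
The pattern extraction can be illustrated with simple regular expressions like these (simplified; the actual patterns used by ingest-file are more involved):

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IBAN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")
PHONE = re.compile(r"\+\d{7,15}\b")

text = "Contact john.doe@example.org or +4915112345678, IBAN DE89370400440532013000."
emails = EMAIL.findall(text)  # ['john.doe@example.org']
ibans = IBAN.findall(text)    # ['DE89370400440532013000']
phones = PHONE.findall(text)  # ['+4915112345678']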

Write fragments

Any extracted entities or patterns are then stored in a separate entity fragment. Assuming that the Word document uploaded mentions a person named “John Doe”, the entity fragment written to the FollowTheMoney Store might look like this:

id: 97e1f…, origin: analyze, fragment: default

{
  "schema": "Pages",
  "properties": {
    "peopleMentioned": ["John Doe"],
    "detectedLanguage": ["eng"]
  }
}

Additionally, ingest-file will also create separate entities for mentions of people and organizations. While this creates some redundancy, it allows Aleph to take them into account during cross-referencing. For example, another entity fragment will be written because “John Doe” was recognized as a name of a person:

id: 310a4…, origin: analyze, fragment: default

{
  "schema": "Mention",
  "properties": {
    "name": ["John Doe"],
    "document": ["97e1f..."], // ID of the `Pages` entity
    "resolved": ["356aa..."],
    "detectedSchema": ["Person"]
  }
}

Dispatch index task

At the end of the analyze task, ingest-file dispatches an index task. This pushes a task object to the Redis queue that includes a payload with the IDs of any entities written in the previous step.

Worker

The index task is handled by a standard Aleph worker.

Index entities

In order to make the entity data written to the FollowTheMoney Store searchable, the worker merges all entity fragments by entity ID, then indexes the merged entities in Elasticsearch. The entity fragments from the previous sections of this article would result in the following merged entities.

A merged Pages entity representing the document as a whole:

{
  "id": "97e1f...",
  "schema": "Pages",
  "properties": {
    // File metadata and hashes of source and PDF files
    "fileName": ["my-file.docx"],
    "contentHash": ["128e0..."],
    "pdfHash": ["5355c..."],

    // Extracted text for individual pages
    "indexText": ["Text content page 1...", "Text content page 2..."],

    // Results from NER and language detection
    "peopleMentioned": ["John Doe"],
    "detectedLanguage": ["eng"],
  }
}

Two Page entities, one for each of the two pages:

{
  "id": "c00b3...",
  "schema": "Page",
  "properties": {
    "index": ["1"],
    "bodyText": ["Text content page 1..."],
    "document": ["97e1f..."], // ID of the `Pages` entity
  }
}
{
  "id": "81424...",
  "schema": "Page",
  "properties": {
    "index": ["2"],
    "bodyText": ["Text content page 2..."],
    "document": ["97e1f..."], // ID of the `Pages` entity
  }
}

A Mention entity for the person recognized during NER:

{
  "id": "310a4...",
  "schema": "Mention",
  "properties": {
    "name": ["John Doe"],
    "document": ["97e1f..."], // ID of the `Pages` entity
    "resolved": ["356aa..."],
    "detectedSchema": ["Person"]
  }
}
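
Conceptually, this merging can be sketched with followthemoney (the actual merging happens when the worker reads the entities from the FollowTheMoney Store):

from followthemoney import model

# Fragments as shown above, with the entity ID included in each dict.
fragments = [
    {"id": "97e1f...", "schema": "Pages", "properties": {"fileName": ["my-file.docx"]}},
    {"id": "97e1f...", "schema": "Pages", "properties": {"indexText": ["Text content page 1..."]}},
    {"id": "97e1f...", "schema": "Pages", "properties": {"peopleMentioned": ["John Doe"]}},
]

merged = None
for fragment in fragments:
    proxy = model.get_proxy(fragment)
    merged = proxy if merged is None else merged.merge(proxy)

# `merged` now combines the properties of all fragments with the ID 97e1f...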

Update collection statistics

Aleph caches information related to collections (e.g. when a collection was last updated) as well as statistics about the data in a collection (e.g. the number of entities by entity schema or the number of extracted names). After indexing, the worker clears these caches so that they are recomputed the next time the collection is accessed.

Summary

  • After uploading a file to Aleph, it is passed through a number of processing stages.

  • During the ingest stage, files are processed using an ingestor for the respective file type. The ingestor extracts file metadata, converts the file to a different format if necessary, and extracts the text contents of the file. Any data extracted is stored in entity fragments. The exact steps in this stage vary based on the file type.

  • The analyze stage extracts names of people, organizations, and countries as well as other identifiers such as phone numbers or bank accounts from the file. It also tries to detect the language of the file contents. The data extracted during this step is also stored in entity fragments.

  • Finally, in the index stage, all the data extracted in previous stages (and stored in separate entity fragments) is merged back into entities that are indexed in Elasticsearch to make them searchable.