Ingest Pipeline
When you upload files to Aleph, Aleph ingests these files in a multi-step process. For example, Aleph tries to convert certain file formats, performs optical character recognition (OCR), and tries to extract names, locations, IBAN account numbers and more from unstructured text. This article explains what happens when you upload a file to Aleph. It’s aimed at developers who want to make changes to the ingest pipeline.
This article describes the process of uploading and ingesting a Word document as an example. It is meant as a high-level overview of the system components and stages involved in the ingestion process. Depending on the file type and configuration options, the process might differ from what’s described here.
Overview
The following graph shows an overview of all stages of the ingestion pipeline (for a Word document) as well as the system components handling each step. Continue reading below for a description of the individual stages.
User interface
There are many ways of uploading a file to Aleph. End users looking to upload a small number of files may do so via the Aleph web UI. Aleph also provides a command-line client, alephclient, that simplifies uploading files in bulk, and integrates with a scraping toolkit called Memorious. However, no matter which method of uploading files to Aleph you use, file uploads will always be handled by Aleph’s ingest API and the rest of the pipeline will be the same.
Upload
A user navigates to an investigation in the web UI and uploads a Microsoft Word (.docx) document. The upload is handled by the ingest API.
API
Archive file
The API endpoint stores the uploaded file using the configured storage backend. Storage backends are implemented in servicelayer. Supported backends include a simple backend using the host’s file system, AWS S3 (and other services with an S3-compatible API), and Google Cloud Storage.
Files are stored at a path that is inferred from the contents of the file by computing a SHA1 hash of the file’s contents. For example a file with the content hash 34d4e388b7994b3846504e89f54e10a6fd869eb8
would be stored at the path 34/d4/e3/34d4e388b7994b3846504e89f54e10a6fd869eb8
.
Storing files this way allows easy retrieval as long as the content hash is known, and ensures automatic deduplication if the same file is uploaded multiple times.
Create Document
record
After the file has been stored, the ingest API creates a Document
record and saves it to the app database. This record stores metadata about the record, the user that uploaded the file, the parent directory the file was uploaded to, and the content hash.
Index entity
Dispatch ingest task
Finally, the ingest API dispatches an ingest task for the entity. This pushes a task object to the relevant Redis queue. The task payload consists of the document serialized as a (stub) FollowTheMoney entity (including the content hash) and additional context (including information about the investigation the file has been uploaded to).
Ingest
Ingest tasks are handled by ingest-file workers that run in separate containers. In order to ingest a file, the worker extracts the content hash and retrieves the file from the storage backend.
Pick an ingestor
ingest-file supports many different file types (for example office documents, PDF, spreadsheets, …). To handle the specifics of each file type, ingest-file implements many different ingestors that handle the specific processing steps for each file type.
When an ingest-file worker picks up a new ingest file, it tries to find the most specific ingestor based on the file’s mime type or file extension. In case of the Word document, ingest-file picks the OfficeOpenXMLIngestor which is suitable for parsing documents in Office Open XML formats used by recent versions of Microsoft Word and PowerPoint.
Depending on the file type, other investors may be used to ingest the file and the processing steps might vary more or less. For example, when uploading a text document created with LibreOffice Writer, a different investor will be used, but many of the subsequent processing steps will be the same. However, when uploading an email mailbox, the processing steps will differ significantly.
Extract document metadata
First, the ingestor extracts metadata from the document, for example the document title, author, and creation/modification timestamps.
Convert to PDF
For further processing and previewing in the Aleph web UI, ingest-file converts many common Office-like file types to a PDF file. It uses a headless LibreOffice subprocess to convert the Word file (previously retrieved from the storage backend) to a PDF file
The resulting PDF file is stored using the configured storage backend (in the same way the source Word file was stored, i.e. a SHA1 hash is computed and used to derive the path the PDF file is stored at).
ingest-file sets the pdfHash
property on the FtM entity representing the uploaded file, so the converted PDF file is available to subsequent processing steps.
Cache PDF conversion results
ingest-file also caches the result of the PDF conversion. When a source file with the same content hash is uploaded a second time, it will reuse the converted PDF document instead of converting it again.
Extract text
Using the generated PDF file, ingest-file then uses PyMuPDF to extract text contents from the PDF file. Ingest-file will also run optical character recognition (OCR) to extract text contents from images embedded in the PDF file. That means that ingest-file is able to extract text from scanned documents.
Reminder: Under the hood, ingest-file uses followthemoney-store to store entity data. followthemoney-store stores entity data as “fragments”. Every fragment stores a subset of the properties. The advantage of this approach is that data that is related to the same entity but originates from several different processing steps is stored in a way that retains provenance and can be flushed or updated granularly by origin. In the indexing step, all fragments with the same entity ID are merged into a single entity.
For every page in the Word document, Aleph emits an entity fragment for a main Pages
entity that contains the extracted text. Thus, once these fragments are merged into a single Pages
entity, there exists one Pages
entity that contains the extracted text for the entire document.
In addition, for every page, Aleph emits a separate Page
entity fragment that contains the extracted text of a single page, metadata (e.g. the page number), and a reference to the main Pages
entity.
This way, users can search for documents that contain a search query on any of the pages, or for individual pages within in a specific document that contain the search query.
Write fragments
Any entities that have been emitted in the process so far (i.e. one Pages
entity and multiple Page
entities) are now written to FtM store, the Postgres database that acts as the “source of truth” for all entity data.
At this point, FtM store will contain multiple entity fragments related to the uploaded file:
id | origin | fragment | data |
---|---|---|---|
97e1f… | ingest | default | |
97e1f… | ingest | eae30… | |
97e1f… | ingest | 544d4… | |
c00b3… | ingest | default | |
81424… | ingest | default | |
Dispatch analyze task
Finally, the ingestor dispatches an analyze
task. This pushes a task object to the relevant ingest queue. The task object includes the IDs of all entities written in the previous step.
Analyze
The analyze
task is handled by ingest-file workers as well. First, ingest-file retrieves the entity fragments written to the FollowTheMoney Store.
Detected languages
ingest-file uses the fastText text classification library with a pre-trained model to detect the language of the document if it is not specified explicitly.
Named-entity recognition (NER)
ingest-file uses the SpaCy natural-language processing (NLP) framework and a number of pre-trained models for different languages to extract names of people, organizations, and countries from the text previously extracted from the Word document.
Extract patterns
In addition to NLP techniques, Aleph also uses simple regular expressions to extract phone numbers, IBAN bank account numbers, and email addresses from documents.
Write fragments
Any extracted entities or patterns are then stored in a separate entity fragment. Assuming that the Word document uploaded mentions a person named “John Doe”, the entity fragment written to the FollowTheMoney Store might look like this:
id | origin | fragment | data |
---|---|---|---|
97e1f… | analyze | default | |
Additionally, ingest-file will also create separate entities for mentions of people and organizations. While this creates some redundancy, it allows Aleph to take them into account during cross-referencing. For example, another entity fragment will be written because “John Doe” was recognized as a name of a person:
id | origin | fragment | data |
---|---|---|---|
310a4… | analyze | default | |
Dispatch index task
At the end of the analyze
task, ingest-file dispatches an index
task. This pushes a task object to the Redis queue that includes a payload with the IDs of any entities written in the previous step.
Worker
The index
task is handled by a standard Aleph worker.
Index entities
In order to make the entity data written to the FollowTheMoney Store searchable, the worker merges all entity fragments by entity ID, then indexes the merged entities in Elasticsearch. The entity fragments from the previous sections of this article would result in the following merged entities.
A merged Pages
entitiy representing the document as a whole:
{
"id": "97e1f...",
"schema": "Pages",
"properties": {
// File metadata and hashes of source and PDF files
"fileName": ["my-file.docx"],
"contentHash": ["128e0..."],
"pdfHash": ["5355c..."],
// Extracted text for individual pages
"indexText": ["Text content page 1...", "Text content page 2..."],
// Results from NER and language detection
"peopleMentioned": ["John Doe"],
"detectedLanguage": ["eng"],
}
}
Two Page
entities representing one of the two pages each:
{
"id": "c00b3...",
"schema": "Page",
"properties": {
"index": ["1"],
"bodyText": ["Text content page 1..."],
"document": ["97e1f..."], // ID of the `Pages` entity
}
}
{
"id": "81424...",
"schema": "Page",
"properties": {
"index": ["2"],
"bodyText": ["Text content page 2..."],
"document": ["97e1f..."], // ID of the `Pages` entity
}
}
A mention entity for the person recognized during NER:
{
"id": "310a4...",
"schema": "Mention",
"properties": {
"name": ["John Doe"],
"document": ["97e1f..."], // ID of the `Pages` entity
"resolved": ["356aa..."],
"detectedSchema": ["Person"]
}
}
Update collection statistics
Aleph caches information related to collections (e.g. when a collection was last updated) as well as statistics about the data in a collection (e.g. the number of entities by entity schema or the number of extracted names). After indexing, the worker clears these caches so that they are recomputed the next time the collection is accessed.
Summary
-
After uploading a file to Aleph, it is passed through a number of processing stages.
-
During the
ingest
stage, files a processed using an ingestor for the respective file type. The ingestor extracts file metadata, converts the file to a different format if necessary, and extracts the text contents of the file. Any data extracted is stored in entity fragments. The exact steps in this stage vary based on the file type. -
The
analyze
stage extracts names of people, organizations, and countries as well as other identifiers such as phone numbers or bank accounts from the file. It also tries to detect the language of the file contents. The data extracted during this step is also stored in entity fragments. -
Finally, in the
index
stage, all the data extracted in previous stages (and stored in separate entity fragments) is merged back into entities that are indexed in Elasticsearch to make them searchable.