Search

This article describes how Aleph makes use of Elasticsearch to make data searchable, as well as some of the considerations behind and limitations of the current index structure.

Aleph uses Elasticsearch for search and content retrieval. We build a custom Docker image, based on the official Elasticsearch image, that adds the ICU analysis plugin (used primarily for transliteration) and a custom list of synonyms.

Entity indexes

Aleph maintains a separate Elasticsearch index per FollowTheMoney schema, for example:

Schema | Index
Company | aleph-entity-company-v1
Person | aleph-entity-person-v1
Document | aleph-entity-document-v1

Every entity index has a version suffix (e.g. v1) that can be configured globally. This makes it possible to write to and read from different (or even multiple) indexes, in order to migrate or reindex in a live system.
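
As a sketch of what this enables (the names below are illustrative, not Aleph’s actual configuration), index names could be derived like this:

    # Hypothetical sketch of versioned index naming; not Aleph's actual code.
    INDEX_PREFIX = "aleph"
    WRITE_VERSION = "v2"          # new entities are written to the new version
    READ_VERSIONS = ["v1", "v2"]  # reads span old and new indexes during a migration

    def write_index(schema: str) -> str:
        return f"{INDEX_PREFIX}-entity-{schema.lower()}-{WRITE_VERSION}"

    def read_indexes(schema: str) -> list[str]:
        return [f"{INDEX_PREFIX}-entity-{schema.lower()}-{v}" for v in READ_VERSIONS]

    # write_index("Company") -> "aleph-entity-company-v2"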

Field mappings

Aleph automatically configures appropriate mappings for every entity index. Mappings define how the entities in the respective indexes are stored and indexed, which in turn affects how the different properties can be searched and used in filter conditions.

In Elasticsearch, there are two different data types for textual contents: keyword and text. The keyword type is intended to be used for structured contents such as IDs, email addresses, tags, or categories. These can later be used for filtering results on exact value matches, for example using the faceted search in Aleph’s UI.

The text type is intended to be used for unstructured text contents such as the full contents of a document or an email. text fields support full-text searches. However, text fields cannot be used in filters.

Apart from the keyword and text field types, Elasticsearch also supports other common data types such as dates, numbers, and binary values.
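
To illustrate the difference, here is a minimal sketch (not Aleph’s actual mapping) of the two field types and the kinds of queries they support:

    # keyword: exact values, usable in filters and facets.
    # text: analyzed, full-text searchable contents.
    mapping = {
        "properties": {
            "email": {"type": "keyword"},
            "content": {"type": "text"},
        }
    }

    # Filter on the exact value of a keyword field ...
    filter_query = {"term": {"email": "user@example.org"}}

    # ... and run a full-text search against a text field.
    search_query = {"match": {"content": "quarterly report"}}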

Aleph configures entity indexes as follows, based on the properties of the schema:

FollowTheMoney type | Elasticsearch type | Notes
text | text | Not indexed separately, but copied to the main text field used in full-text searches.
date | date | Additionally copied to the main text field, so that date properties are also used in full-text searches.
All other property types, including string | keyword | Additionally copied to the main text field, so that these properties are also used in full-text searches.

This “hybrid” mapping configuration allows full-text searches across all properties, as well as searches and filters on structured data like email addresses and dates.
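
A minimal sketch of this approach using Elasticsearch’s copy_to feature (the field names are illustrative, not Aleph’s exact mapping):

    mapping = {
        "properties": {
            # The shared full-text field that property values are copied to.
            "text": {"type": "text"},
            # Structured properties keep their own typed field for filtering ...
            "jurisdiction": {"type": "keyword", "copy_to": "text"},
            # ... and dates remain usable in range queries.
            "incorporationDate": {"type": "date", "copy_to": "text"},
        }
    }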

Elasticsearch does not support changing the type of a field in an existing index. When the property type of a FollowTheMoney schema changes (for example from text to string), this will not be reflected in the index configuration unless the index is deleted and recreated.

Special fields

In addition to FtM property values and a full-text search field, Aleph also automatically indexes a few special fields that are derived from the entity data.

Fields per property type

Aleph creates fields per property type that contain all the values of a particular property type. For example, in the case of a Company entity, the countries field will contain the values from the jurisdiction property. In the case of a Person entity, the countries field will contain the values from the country and nationality properties. In the same way, Aleph aggregates values per property type for all other matchable property types (including language, name, phone, …). This allows filtering by country, language, name, phone number, etc., no matter which property contains the value.
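
For example, a query along the following lines (a sketch, assuming the aggregated field is named countries) finds matching entities no matter which property held the country value:

    query = {
        "bool": {
            # Full-text match against the main text field ...
            "must": [{"match": {"text": "siemens"}}],
            # ... filtered by the aggregated per-type countries field.
            "filter": [{"term": {"countries": "de"}}],
        }
    }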

Fingerprints

Aleph uses the fingerprints package to heavily normalize names. For example, it will normalize both “Siemens Aktiengesellschaft” and “Siemens AG” to “ag siemens”. The normalized names are stored in the fingerprints field. The fingerprints field is mostly used by Aleph for cross-referencing, but the field can also be queried directly just like any other field.
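
For example, using the fingerprints Python package directly:

    import fingerprints

    fingerprints.generate("Siemens Aktiengesellschaft")  # -> "ag siemens"
    fingerprints.generate("Siemens AG")                  # -> "ag siemens"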

Number of shards

Each index is stored in one or more shards, and shards can be distributed across multiple nodes. Each shard has some overhead, so lots of very small shards have a negative impact on search performance. On the other hand, when using very large shards, moving shards (for example in case of a node failure) takes longer.

For this reason, Aleph configures the number of shards depending on the entity schema. By default, entity indexes will be split into 5 shards, with higher (or lower) numbers of shards for very common (or uncommon) entity schemata.

For example, the Pages schema is used to represent PDF, Word, PowerPoint and other similar types of documents. These entities tend to be quite large, because they contain the entire document contents. Additionally, they are very common in a typical Aleph instance. As a consequence, the index for Pages entities is usually quite large, so it makes sense to split it up into a larger number of shards.

On the other hand, Passport entities tend to be much smaller in size and less common in a typical Aleph instance, so it makes sense to reduce the number of shards for the passports index.
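
A sketch of what schema-dependent shard counts could look like (the numbers and structure are illustrative, not Aleph’s actual configuration):

    # Schemata with many large entities get more shards than rare, small ones.
    SHARD_COUNTS = {"Pages": 10, "Passport": 1}
    DEFAULT_SHARDS = 5

    def index_settings(schema: str) -> dict:
        return {
            "settings": {
                "number_of_shards": SHARD_COUNTS.get(schema, DEFAULT_SHARDS),
            }
        }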

Indexing of multi-page documents

Multi-page documents (for example Word/OpenDocument, PDF, and PowerPoint documents) are represented using multiple entities.

Each page in the document is represented by a separate Page entity. Page entities store the text contents of the page and a reference to a parent Pages entity. The Pages entity represents the entire document and also contains the concatenated text contents of all pages in the document, alongside general meta information.

Modeling multi-page documents this way covers the main use cases related to documents:

- Searching for documents that match a search query (on any page).
- Searching within a document to find the exact page that matches a search query.
- Retrieving the text contents of single pages for display in the web UI.
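
As a sketch, searching within a single document could combine a full-text query against Page entities with a filter on the reference to the parent Pages entity (the properties.document field name is an assumption about the mapping):

    query = {
        "bool": {
            # Match the search phrase in the text of individual pages ...
            "must": [{"match_phrase": {"text": "offshore holdings"}}],
            # ... restricted to pages that belong to one parent document.
            "filter": [{"term": {"properties.document": "<parent-entity-id>"}}],
        }
    }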

Search query syntax

We use Elasticsearch’s query_string query type in most places, which means that the full syntax supported by this query type is also supported in the Aleph API and web UI, including boolean operators, wildcards, fuzziness, proximity, and regex operators.

The limitations of these features as outlined in the referenced Elasticsearch documentation articles apply. Using some of these features can result in very slow queries, high resource usage, or timeouts.
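
For example, a single query_string query can combine several of these operators:

    query = {
        "query_string": {
            # A phrase with proximity, a wildcard, and boolean operators.
            "query": '("vladimir putin"~2 OR putin*) AND NOT medvedev',
            "default_field": "text",
        }
    }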

Search results ranking

Aleph uses the default Elasticsearch settings for calculating relevance scores. The current Elasticsearch version defaults to BM25.

Limitations and known issues

Search results snippets

The _source field is a special field in Elasticsearch that contains the original JSON data passed to Elasticsearch at index time. The _source field is also required in order to generate search result snippets and search term highlights for search results.

Our indexes are configured to exclude the text field from the _source. As a reminder, contents from all Aleph properties are copied to the text field. That means that in almost all cases, the contents of the text field do not need to be retrieved directly, because they are redundant.
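
A sketch of what this exclusion looks like in an index mapping:

    mapping = {
        # The text field is indexed (and thus searchable), but its contents
        # are not stored in _source and cannot be retrieved or highlighted.
        "_source": {"excludes": ["text"]},
        "properties": {
            "text": {"type": "text"},
        },
    }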

One notable exception is the indexText property. This property gets special treatment in Aleph: it is copied to the text field (i.e. it can be searched), but it is not stored separately and thus cannot be retrieved. The indexText property is used by Pages entities for the full-text contents of multi-page documents (see Indexing of multi-page documents). As a consequence, Elasticsearch cannot compute snippets or highlights for Pages entities.

Related GitHub issue

Exact searches

Due to the way data is indexed, there is currently no way in Aleph to do “exact” searches that return only results that match search terms character-by-character. When quoting multiple search terms (e.g. “Vladimir Putin”) we perform a phrase search. A phrase search matches any entities that contain all the search terms in the exact same order (i.e. only entities that contain “Vladimir” followed by “Putin”).

Even when using a phrase search, Elasticsearch applies transliteration and default tokenization. That means that a phrase search for “Владимир Путин” would also return “Vladimir Putin” as a result, and “ACME, Inc.” would also match “ACME Inc” (without punctuation). (However, when using a phrase search, the search query isn’t expanded using the configured synonyms.)

This is not the desired behavior in some use cases (e.g. when trying to narrow down a large result set based on subtle spelling variations). However, adding support for “true” exact searches would most likely require rebuilding the index and thus hasn’t been feasible for now.

Related GitHub issue

Minimum should match

When executing entity searches, Aleph sets the minimum_should_match option to 66%, i.e. results have to match at least two thirds of the subqueries to be returned. This can be counterintuitive when using explicit operators. For example, “Biden OR Trudeau OR Macron OR Scholz” will match only entities that contain at least two of the four terms, even though the query explicitly specifies an OR operator.
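
Expressed as an explicit Elasticsearch query, the behavior looks roughly like this:

    query = {
        "match": {
            "text": {
                "query": "Biden Trudeau Macron Scholz",
                # 66% of 4 terms, rounded down: at least 2 terms must match.
                "minimum_should_match": "66%",
            }
        }
    }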

Related GitHub issue

Prefix searches

The Aleph API supports a prefix parameter to execute searches using the match_phrase_prefix query type in Elasticsearch. This API feature is primarily used to show autocomplete suggestions in the UI for entities like companies or people.

This query type has some limitations that become apparent only for big indexes. The query type uses the n most frequent terms starting with the prefix (with n defaulting to 50), i.e. executing a prefix search for “put” will expand the search to the 50 terms starting with that prefix that occur most frequently in the entire index (for example “Putin”).

When executing a prefix search within a collection, there can be situations where no results are returned even though there actually are entities that begin with the prefix (but are not in the top 50 most frequent terms).
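
A sketch of the underlying query; max_expansions corresponds to the n described above and defaults to 50:

    query = {
        "match_phrase_prefix": {
            "text": {
                "query": "put",
                # Only the 50 most frequent terms starting with the prefix
                # are considered, across the entire index.
                "max_expansions": 50,
            }
        }
    }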

Related GitHub issue

Regex searches

While Elasticsearch (and thus also Aleph) supports regular expressions in search queries, they are limited to individual terms. That means that regular expressions cannot match across multiple terms (i.e. you can’t use the regex /lorem ipsum/ to match the full-text contents of a document) and a regex always has to match the full term (i.e. /put/ will match “put” but not “putin”).

Additionally, using regular expressions that contain punctuation or whitespace to match full-text contents will most likely fail, as punctuation is used by the standard tokenizer as an indicator of term boundaries. For example, “20.02.2022” would be split into the terms “20”, “02”, and “2022”. As regular expressions can only match individual terms, a search using the regex /\d{2}\.\d{2}\.\d{4}/ wouldn’t match the document.
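
For illustration, two sketched regexp queries (the regexp query matches against individual terms of the analyzed field):

    # Matches: the pattern covers a whole term ("put" or "putin").
    matches = {"regexp": {"text": "put(in)?"}}

    # Does not match: "20.02.2022" was tokenized into "20", "02" and "2022",
    # and a regexp cannot span multiple terms.
    no_match = {"regexp": {"text": "[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}"}}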