Aleph

Named Entity and Pattern Extraction

When you upload files, Aleph tries to extract names of people, companies, and countries as well as phone numbers, email addresses and IBANs. This article explains the different steps Aleph performs to extract named entities and patterns and the NLP (Natural Language Processing) technologies Aleph uses.

Preprocessing

As part of the ingest pipeline, Aleph extracts text from the files you upload. Before running any entity or pattern extraction, Aleph preprocesses text:

  • Whitespace and line-breaks are collapsed.
  • Very long text is split into multiple chunks with each chunk containing approximately the amount of text that fits on a single page. In most cases, the entities that are emitted in previous steps of the ingest pipeline shouldn’t contain text that is much longer, so this is mostly there to handle edge cases.

Language identification

Aleph uses the fastText LID (Language Identification) model which can recognize 176 languages. On one hand, Language identifcations is used to enrich the metadata for files to allow users to filter based on the language of files. On the other hand, it is used as context for subsequent steps, for example in order to select the correct language-specific model for entity extraction.

Entity extraction

In order to extract named entities (names of people, companies, and countries) from files, Aleph uses spaCy with language-specific models. For example, it uses the en_core_web_sm model for English language text, and the es_core_news_sm model for Spanish language text.

Running text through the spaCy models yields a number of labelled named entities. For example, the sentence …

“Swiss tobacco giant Philip Morris International (PMI) obtained a stake in a company that won a disputed license to make and market cigarettes in Egypt, one of the world’s most desirable tobacco markets.”

… would result in the following labelled entities:

TextLabel
SwissNORP
Philipp Morris InternationalORG
PMIORG
EgyptGPE
oneCARDINAL

The set of labels returned varies depending on the language-specific model that is used. For example, the en_core_web_sm model uses the label PER to annotate names of people whereas es_core_web_sm uses PERSON. Some models also annotate additional entities such as dates or cardinals, but Aleph discards anythings that’s not related to a person, company, or country.

People

In order to extract names of people, Aleph uses named entities returned by the spaCy model labelled as PER or PERSON as candidates.

For each of these named entities it normalizes the text by stripping out common prefixes (e.g. removing “Mr.” from “Mr. Sherlock Holmes”) or removing possesive suffixes (“’s” in “John’s”).

The main reason for extracting names of people from files is to find other references to the same people. For this reason, Aleph also discards very short or very long names, as these are usually not as useful in order to achieve this goal or are often false positives.

Companies

In order to extract companies, Aleph uses named entities returned by the spaCy model labelled as ORG as candidates.

Very short or very long names are discarded for the same reasons that people with very short or long names are discarded.

Countries

In order to extract countries, Aleph uses named entities returned by the spaCy model labelled as GPE or LOC. These named entities could be countries, but also cities or other administrative areas.

It then uses the countrytagger library to mape the extracted entity to a country. Under the hood, countrytagger uses the GeoNames database which includes a wide range of place names from around the world in various languages along with the country the are part of.

This way, Aleph can extract countries even if the name of the country isn’t mentioned literally. For example, if a file mentions “Berlin”, Aleph would extract “Germany” as the country.

Filtering

The default spaCy models are trained on generic web content. Depending on the contents of the files you upload to Aleph, the precision of the named entities returned by these models may vary.

To predict whether a named enttiy returned by a spaCy model likely is a false positive, Aleph uses a custom fastText classifier model. This model is trained on structured data (for proper names of people and companies) as well as random text samples from documents (for text snippets that are not names of people or companies).

(The model was initially developed for a slightly different purpose and predicts multiple labels for a given text, but Aleph currently only uses the fact whether trash is one of the predicted labels.)

Pattern extraction

Aleph also extracts phone numbers, email addresses, and IBANs from files using regular expressions. This is a simple approach and definitely not bullet-proof, but it does handle quite a few cases. The extracted data is normalized:

  • Phone numbers are formatted in E.164 format (e.g. +491234567890).
  • IBANS are stripped of separators and whitespace (e.g. GR1601101050000010547023795)
  • Email addresses are lower-cased and domain names are normalized.

Indexing

Aleph stores the extracted data in FollowTheMoney properties like companiesMentioned, detectedLanguage, or phoneMentioned (see the Analyzable schema for details). This means that you can use them in search queries as any other property. For example, the following query would return all files that mention the phone number +491234567890:

properties.phoneMentioned:"+491234567890

Aleph also stores each extracted person or company as a Mention entity in order to simplify queries for cross-referencing.

Limitations and known issues

Extracting entities in (semi-)structured data

The default spaCy models Aleph uses are trained on generic web content. They tend to show lower recall for (semi-)structured data such as listings from database websites. If you’re handling (semi-)structured files, we recommend that you manually parse the relevant data from these documents. To learn about this approach read “How to create mixed document/entity graphs”.