How to Add Support for New Languages

Some document processing features such as text recognition (OCR) or named entity extraction (NER) depend on the language of files you upload to Aleph. This guide describes how you can add support for documents in other languages.

  1. Check the list of supported languages to see whether the language is already supported. Some languages are only partially supported by Aleph (for example, OCR is supported but NER is not).

  2. Add the three-letter language code to the hard-coded list of languages in FollowTheMoney (if it isn’t already included).

  3. To add OCR support for a new language, you need to install the respective Tesseract model in ingest-file. The ingest-file Docker image is based on Ubuntu. Check the list of Tesseract language models available as Ubuntu packages. If a package for the new language is available, add it to the ingest-file Dockerfile.

  4. To add NER support for a new language, you need to install the respective spaCy model in ingest-file. Check the list of spaCy language models. If a package for the new language is available, add it to the ingest-file Dockerfile and update the NER_MODELS setting in the ingest-file settings to include a mapping from the three-letter language code to the spaCy model.
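
The two lookups behind steps 3 and 4 can be sketched in a few lines of Python. This is an illustration only, not code from Aleph or ingest-file: the helper names tesseract_package and ner_model are hypothetical, and the NER_MODELS dictionary below merely assumes the "three-letter code to spaCy model name" shape described in step 4. The actual setting in ingest-file may be structured differently. The model names shown (en_core_web_sm, de_core_news_sm, fr_core_news_sm) are published spaCy models, and Ubuntu packages Tesseract models as tesseract-ocr-<code>.

```python
from typing import Optional

# Assumed shape of the mapping from step 4: three-letter language code
# to spaCy model name. Illustrative entries only; check the ingest-file
# settings for the real NER_MODELS value.
NER_MODELS = {
    "eng": "en_core_web_sm",
    "deu": "de_core_news_sm",
    "fra": "fr_core_news_sm",
}


def tesseract_package(lang_code: str) -> str:
    """Return the Ubuntu package name for a Tesseract language model.

    Hypothetical helper. Ubuntu package names cannot contain
    underscores, so a Tesseract model such as "chi_sim" is packaged
    as "tesseract-ocr-chi-sim".
    """
    return "tesseract-ocr-" + lang_code.lower().replace("_", "-")


def ner_model(lang_code: str) -> Optional[str]:
    """Return the spaCy model mapped to a language code, or None.

    Hypothetical helper. None corresponds to the "partially supported"
    case from step 1: OCR may work, but no NER model is configured.
    """
    return NER_MODELS.get(lang_code.lower())


if __name__ == "__main__":
    for code in ("deu", "chi_sim", "xyz"):
        print(code, tesseract_package(code), ner_model(code))
```

For German, for instance, tesseract_package("deu") yields the package name tesseract-ocr-deu to add to the Dockerfile, and ner_model("deu") yields de_core_news_sm. A code with no entry in the mapping returns None, which signals that NER is not yet configured for that language.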