Aleph

How To Create Mixed Document/Entity Graphs

Probably the most powerful feature of Aleph is the ability to import both structured content and document data from files. While these can be loaded to a collection separately, it sometimes makes sense to create links between entities from structured and unstructured data. The example below demonstrates how to make a mixed graph from documents, people, companies and court cases.

This guide is outdated as the website used as an example does not exist anymore, but the underlying concepts remains useful. If you’d like to contribute an updated guide, please let us know on GitHub.

Example: Persona de Interes

Screenshot of the Persona de Interes website

Persona de Interes is a 2014 project by the OCCRP that features data from Mexico and Central America. It collects information on persons and companies, as well as real estate, court cases and source documents. For this tutorial, we will import the contents of the Persona de Interes website into Aleph, preserving the connections between people and companies on the one hand, and to documents on the other.

Why is this complicated?

There are two distinct pathways for loading data into Aleph: either by uploading files (such as PDFs or emails) or by ingesting semantic data, for example for companies or people.

Both processes produce FollowTheMoney entities in Aleph, but there is one crucial difference: For ingested documents, the entity ID is set by Aleph while for structured entities, you define an ID yourself. At the same time, the entity ID is needed to link to it from other entities.

Therefore, in order to make structured data entities link to ingested documents, you must first upload the documents to Aleph so that it assigns them an ID. Then you can use that entity ID in other parts of the FollowTheMoney entity graph.

GitHub Repository

This example is accompanied by a GitHub repository that contains the full source code, as well as the structured data from Persona de Interes.

Step 1: Scraping the source data

In order to get access to the raw data, we decided to simply scrape the Persona de Interes website. The script used for this is not central to this tutorial, but it leaves us with a small JSON file containing the published data, and a folder full of the PDF documents attached to profiles on the original site.

Step 2: Uploading documents to Aleph

In the next step, we’ll upload the scraped documents to Aleph. For every uploaded document, Aleph returns a document ID that we can use to create links between the documents and other entities. For the upload, we will use the alephclient Python library.

The code to upload a single document looks like this:

# Import alephclient:
from alephclient.api import AlephAPI

# By default, alephclient will read host and API key from the
# environment. You can also pass both as an argument here:
api = AlephAPI()

# Get the collection (dataset)
foreign_id = 'zz_occrp_pdi'
collection = api.load_collection_by_foreign_id(foreign_id)
collection_id = collection.get('id')

file_path = 'www.personadeinteres.org/uploads/example.pdf'
metadata = {'file_name': 'example.pdf'}
# Upload the document:
result = api.ingest_upload(collection_id, file_path, metadata)

# Finally, we have an entity ID:
document_id = result.get('id')

If you visit your Aleph instance, you will see the documents being imported. Eventually, they will show up as searchable entities.

Step 3: Making FollowTheMoney entities in Python

Once we have received an ID for the document, we can start generating FollowTheMoney entities that link to it. In this scenario, we will use the followthemoney Python library.

Using the library, we can generate a so-called entity proxy, an object that can be used to construct an entity:

# Load the followthemoney data model:
from followthemodel import model

# Create an entity proxy by giving a schema:
proxy = model.make_entity('Person')

# Generate an entity ID by hashing the given keys:
proxy.make_id('john-doe')

# Assign a value to a property:
proxy.add('name', 'John Doe')

# Convert the entity proxy to a JSON-like object:
data = proxy.to_dict()

Now that we’ve made our first person entity, we want to create a link between this person and the document we uploaded before. One way of doing this is to create an `UnknownLink between the two, a generic link defined in the FollowTheMoney data model:

# Create the link entity proxy:
link_proxy = model.make_entity('UnknownLink')

# We'll derive the link ID from the other two IDs here, but
# this could be any unique value (make sure it does not clash
# with the ID for the main entity!)
link_proxy.make_id('link', proxy.id, document_id)

# Now we assign the two ends of the link. Note that we can just
# pass in a proxy object:
link_proxy.add('subject', proxy)
link_proxy.add('object', document_id)
link_proxy.add('role', 'Document about John')

Step 4: Uploading entities to Aleph

Now that we’ve created a person and a link to the document, we can upload both to Aleph using alephclient Python library. For this, note that alephclient and followthemoney are separate tools, so we need to serialise our proxies into plain JSON-style dicts:

# Turn the two proxies into JSON form:
entities = [proxy.to_dict(), link_proxy.to_dict()]

# You can also feed an iterator to write_entities if you
# want to upload a very large
api.write_entities(collection_id, entities)

When visiting the Aleph UI, you should now be able to see the the person record, the document, and the link between the two shown for each entity.

Step 5: Convert all the data

Obviously, this basic example needs to expand in order to convert the full data schema from Persona de Interes into FollowTheMoney entities.

"How to draw an owl" meme: 1. Draw some circles. 2. Draw the rest of the fucking owl.

Please check out the loading script to see how it approaches entity creation as well as the mapping of data from the scraper to FollowTheMoney properties.

Conclusion

The goal of this tutorial was to show how you can create complex object graphs in Aleph in a way that combines structured and unstructured data. In the course of doing so, we’ve also shown that alephclient and `followthemoney provide powerful Python libraries. Their full scope exceeds the scope of this Tutorial, but they can be used as power tools to make exciting data structures present in Aleph.

I hope you’ll enjoy experimenting with them to show your data in the most expressive way!

You can explore the data from Person de Interes in OCCRP’s Aleph instance.