Terra

A program that processes and combines various sources of real-world data, together with an API that serves that data efficiently as hexagonal tiles.

Motivation

Neopolis is based on real-world data and therefore requires fast access to geographical data for its users. Over years of iterations, multiple systems were designed to handle core gameplay features, like:

Even though these independent systems stayed more or less compatible, they suffered from both performance issues and inconsistencies (poorly defined borders created bugs, for example) that slowed development and caused compatibility problems. At the time, part of the team was working on a project called Neoland, which needed to be fully compatible with Neopolis on a geographical level. Moreover, the goal of Revolt Games was to create more real-world-based projects.

With this in mind, I started working on Terra in early 2022 to solve these problems, unify existing projects, and create a solid base for further geo-located projects.

A unified world-view

Having a unified world-view is complicated. In reality, there is no single ground truth that everyone agrees on: some countries are recognized in one place but not in another, borders are disputed, opinions differ on which cities matter, and so on. All of this means that multiple world-views exist, depending on who you ask. Terra's goal is not to take any political stance: it aims for the most commonly accepted world-view, and for this it is based on carefully selected sources that seem trustworthy and well recognized. However, I'm neither a geopolitical expert nor a cartographer, so errors might very well have slipped in.

Anyway, here are some of the things Terra provides:

Sources of data

The data provided by Terra can be grouped into 3 categories:

Part of the project was to find high-quality sources of data. I’ve settled on these main sources:

Data processing

A single program, the “Terra Pipeline”, does all the downloading from these different sources, as well as the processing, curating, and combining of the data. Why Go? Because the massive amount of data involved (yes, the world is big) required taking full advantage of multiple cores. My MacBook Pro M1, with its 8 CPU cores working at max capacity, was powerful enough to process the whole world in about 10 hours. For a while, my computer looked a bit like this:

CPU         Use%
AVG         94%
CPU0        93%
CPU1        92%
CPU2        99%
CPU3        96%
CPU4        92%
CPU5        97%
CPU6        98%
CPU7        98%
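
That fan-out is easy to express with goroutines and channels. Here is a minimal sketch of the pattern; the `Region` type and `processRegion` function are placeholders for the real pipeline work, not actual Terra code:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Region is a placeholder for one unit of work (e.g. a chunk of the world).
type Region struct{ ID int }

// processRegion stands in for the real processing of one chunk.
func processRegion(r Region) string {
	return fmt.Sprintf("region %d done", r.ID)
}

func main() {
	jobs := make(chan Region)
	results := make(chan string)

	// Fan out: one worker per CPU core.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				results <- processRegion(r)
			}
		}()
	}

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Feed the jobs channel, then close it so workers stop.
	go func() {
		for i := 0; i < 100; i++ {
			jobs <- Region{ID: i}
		}
		close(jobs)
	}()

	for res := range results {
		fmt.Println(res)
	}
}
```

One worker per core keeps all CPUs saturated without oversubscribing them.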

The final output of the program is a PostgreSQL database with multiple related tables that can easily be queried, joined, and served. It has:

Because of the size of the input and output data, as well as the variety of formats across data sources (vector data, CSV, GeoTIFF…), the processing is done in multiple stages. Stages can be turned on or off and configured via a JSON configuration file (see example configuration) before launching the pipeline; a sketch of such a config follows the list below. Broadly, there are 3 main stages: bronze, silver, and gold.

  1. Bronze stage: Download and cache raw data locally (when possible)
  2. Silver stage: Insert bronze data into a “Silver” Postgres database, with little to no modification
  3. Gold stage: Through queries to the silver database, combine and process data into its final format and insert it into a “gold” database
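
As promised above, here is a rough sketch of what such a JSON config, and the Go structs that load it, could look like. The field names are illustrative, not the pipeline's actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// StageConfig toggles one stage and holds its options.
type StageConfig struct {
	Enabled bool   `json:"enabled"`
	DataDir string `json:"data_dir,omitempty"`
}

// PipelineConfig mirrors the JSON file read before launching the pipeline.
type PipelineConfig struct {
	Bronze StageConfig `json:"bronze"`
	Silver StageConfig `json:"silver"`
	Gold   StageConfig `json:"gold"`
}

func main() {
	raw := []byte(`{
		"bronze": {"enabled": false, "data_dir": "./cache"},
		"silver": {"enabled": true},
		"gold":   {"enabled": true}
	}`)

	var cfg PipelineConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}
	// Re-running only silver and gold reuses the cached bronze output.
	fmt.Printf("bronze=%v silver=%v gold=%v\n",
		cfg.Bronze.Enabled, cfg.Silver.Enabled, cfg.Gold.Enabled)
}
```

Disabling the bronze stage, for instance, reuses the locally cached raw data instead of re-downloading it.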

Splitting the processing into these 3 stages gave each part of the code a clear responsibility and made stage outputs reusable, so it was easy to restart only the relevant stages and save hours of computing time. Once everything was fully processed, I uploaded the final 20 GB (compressed) gold database to GCP and created an API to serve it.

Serving the data

The data is served via an API used by both Neopolis and Neoland. The server is pretty straightforward: it runs Postgres queries, with indexes to make the common ones faster.
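
As an illustration of the query-plus-index approach (the `lands` table and `cell_id` column below are made up for the example, not the real schema):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/terra?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// An index on the tile column keeps point lookups fast.
	if _, err := db.Exec(
		`CREATE INDEX IF NOT EXISTS lands_cell_idx ON lands (cell_id)`); err != nil {
		log.Fatal(err)
	}

	// Look up the land stored at one H3 cell.
	var name string
	err = db.QueryRow(
		`SELECT name FROM lands WHERE cell_id = $1`, "8928308280fffff").Scan(&name)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(name)
}
```

With an index on the tile column, these point lookups stay fast even across millions of rows.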

A caching layer with Redis takes full advantage of the fact that some data is rarely updated, as well as of the tiling system. Some queries need to return large numbers of “lands”, so as an optimization, lands are sometimes grouped together. This grouping relies on properties of H3, such as its parent-child hierarchy.
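
Here is a minimal cache-aside sketch using the go-redis client; the key layout, TTL, and `fetchTileFromDB` helper are illustrative assumptions, not the actual server code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// fetchTileFromDB stands in for the real Postgres query.
func fetchTileFromDB(cell string) string {
	return fmt.Sprintf(`{"cell":%q,"lands":[]}`, cell)
}

// getTile returns the tile for an H3 cell, hitting Postgres only on a miss.
func getTile(ctx context.Context, rdb *redis.Client, cell string) (string, error) {
	key := "tile:" + cell
	val, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return val, nil // cache hit
	}
	if err != redis.Nil {
		return "", err // real Redis error, not just a miss
	}
	// Cache miss: query the database, then cache the result.
	val = fetchTileFromDB(cell)
	if err := rdb.Set(ctx, key, val, time.Hour).Err(); err != nil {
		return "", err
	}
	return val, nil
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	tile, err := getTile(ctx, rdb, "8928308280fffff")
	if err != nil {
		panic(err)
	}
	fmt.Println(tile)
}
```

Because tile contents change rarely, a generous TTL keeps the Redis hit rate high without leaving stale data around for long.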

The whole project is documented on a GitHub page with the API's endpoints, example queries and responses, and detailed explanations. An OpenAPI page is also generated at compile time.

Conclusion and learnings

Working on Terra has been particularly interesting on the data-processing side. Breaking the pipeline into multiple idempotent stages that produce intermediary outputs made development much faster. I had to upgrade my SQL knowledge, and overall I was impressed by how powerful PostgreSQL is: carefully crafted SQL queries go a (very) long way.

Finally, this project was a good optimization challenge: some stages took multiple hours of computing time, so every performance boost mattered. My biggest focus here was to cleanly spread the workload across multiple CPUs, for which Go channels and goroutines were very convenient.