Ricardo's Blog

by Ricardo Morin

Provita Geoportal: generating map preview tiles

In this post, I describe how the Provita Geoportal pre-generates map preview tiles using a serverless approach.

Why pre-generate tiles?

Traditionally, map tiles are generated dynamically (and cached) by a GIS server. That requires installing, monitoring, and maintaining some server capability, whether it is a physical server somewhere, a hosted virtual machine, or perhaps a container image hosted with a cloud service provider. A solution like this would most likely require hosting a database management system as well. And we don’t want any of that!

As described in an earlier post, two of the key requirements of the Provita Geoportal are: a) hands-off maintenance and b) low-cost deployment. These two requirements drove the adoption of the JAMStack approach.

Map tiles are just files with unique URLs that follow a predefined numbering scheme corresponding to zoom level, X coordinate, and Y coordinate (i.e., http(s)://{host}/{path}/{z}/{x}/{y}.{ext}). Therefore, they can easily be pre-generated and served as plain files from the corresponding directory structure, rather than through an intermediary layer like a GIS server.
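
A pre-generated tile set, then, is just a directory tree. For a hypothetical layer it might look like this (layer name, zoom levels, and extension are illustrative):

  tiles/forest_cover/   # {path}: one directory per layer
    4/                  # {z}: zoom level
      8/                # {x}: column
        5.png           # {y}.{ext}: an individual tile
        6.png
    5/
      ...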

In fact, pre-generating tiles when the source files or the metadata changes is analogous to building the site when content is updated. So, pre-generating tiles seems to be an approach quite compatible with the JAMStack concept. We just need to figure out a serverless way of doing it.

Pre-generating tiles from source GIS files

Pre-generating tiles involves executing compute-intensive commands. The specific method differs depending on whether we are dealing with vector datasets (Shapefile format) or raster datasets (GeoTIFF format).

Vector tile sets

To generate vector tile sets, we use tippecanoe, a command line utility created by Mapbox. The resulting tiles follow the Mapbox vector tile spec and work really well with our map preview library, Maplibre. Vector tiles are very small and Maplibre can render them quickly, which was the key factor in choosing this library to display maps. Also, since vector tiles are styled at rendering time, they can be generated immediately after a GIS file is uploaded by the Administrator.

The tippecanoe utility has a ton of options. Here are the options that work well for us:

  tippecanoe
    -q                                    # Quiet, do not generate progress messages
    --force                               # Overwrite any existing output files
    --layer=$namelc                       # Include layer name (parameter) in the output
    --name=$namelc                        # Tile set name (parameter)
    -r1                                   # Do not drop point geometries at low zoom levels
    --minimum-zoom=4                      # Lowest zoom level of generated tiles
    --maximum-zoom=10                     # Highest zoom level of generated tiles
    --output-to-directory vtiles/$namelc  # Output directory (parameter)
    $namelc.geojson                       # Input file (parameter)

The tippecanoe utility takes a GeoJSON file as input, so before running tippecanoe we generate a GeoJSON file from the Shapefile source. For this purpose, we use the mapshaper command line utility.
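
Concretely, the conversion looks roughly like this (same $namelc parameter as above; the reprojection step is an assumption about how source projections are handled):

  mapshaper
    $namelc.shp                        # Input Shapefile (parameter)
    -proj wgs84                        # Reproject to lon/lat (assumed; tippecanoe expects WGS84 coordinates)
    -o format=geojson $namelc.geojson  # Output GeoJSON consumed by tippecanoe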

Raster tile sets

To generate raster tile sets, we use the gdal2tiles.py command line utility, which is part of the GDAL library. The resulting tiles follow the OSGeo Tile Map Service Specification and consist of image files in .png format.

Raster tiles need to be styled at creation time, so they are generated after the Administrator finishes associating metadata with the raster dataset. We use the styling metadata to generate a color table, and then we invoke the gdaldem command line utility to bake the color table into a GDAL .vrt file, which gdal2tiles.py then uses to generate the tiles with the appropriate styling.
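
For reference, the color table consumed by gdaldem color-relief is a plain-text file with one "value R G B A" entry per line ("nv" matches nodata); the breakpoints and colors below are hypothetical:

  # color.txt (values derived from the styling metadata; these are made up)
  nv    0   0   0   0    # nodata: fully transparent
  1    26 152  80 255    # low values: green
  5   255 255 191 255    # mid values: pale yellow
  10  215  48  39 255    # high values: red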

  gdaldem
    color-relief  # Generate color relief map
    -alpha        # Include alpha channel to support opacity
    $EXACT        # Empty for gradient color blending, or -exact_color_entry for strict matching (parameter)
    -of vrt       # Output file format is GDAL .vrt
    "$name.tif"   # Input file
    color.txt     # Color table
    temp.vrt      # Output file

  gdal2tiles.py
    --processes=2       # Use 2 parallel processes
    --profile=mercator  # Web Mercator (EPSG:3857) tiling scheme
    -q                  # Quiet, do not generate progress messages
    -z 4-10             # Minimum and maximum zoom levels of generated tiles
    temp.vrt            # Input file (output from gdaldem)
    rtiles/$namelc      # Output directory

Pre-generating tiles on-demand

In keeping with our serverless approach, we need to dynamically allocate (and deallocate) cloud compute resources to execute the tile generation commands described above. Can we use Amazon Web Services (AWS) to achieve this goal? Yes, of course we can!

AWS Lambda functions are not appropriate in this case because these computations can be intensive and command execution can take several minutes to complete. Lambda functions are meant to run and return a result to the client very quickly (milliseconds), and should not block the user interface. In addition, the tile generation commands have a number of dependencies (e.g., tippecanoe, sqlite, gdal, aws-cli) that cannot be easily installed in a Lambda function.

Enter AWS Batch. With this capability, we can run arbitrary jobs, on demand, using customized Docker images. Thanks to AWS Batch, it is possible to request whatever computational resources we need (e.g., memory, CPUs) to execute batch jobs in a timely and efficient manner. Resources are allocated only for the duration of job execution, thus minimizing the cost of running these commands. There is no need to keep any compute resources (real, virtual, or containerized) pre-instantiated and available all the time.

In addition, AWS Batch provides the ability to run jobs on “Spot instances”, to further optimize batch job execution costs. Spot instances are allocated from AWS excess capacity and are priced with deep discounts (up to 90%). The only downside is that job execution can take a little longer while AWS finds and allocates resources, but the delay is usually no more than a couple of minutes. Also, spot instances can be interrupted or terminated, so we include a retry count in the batch job definitions.

Setting up the AWS Batch Job environment

AWS is a very powerful cloud environment, but it is also very complex. There are multiple ways of achieving the same thing, e.g., the AWS Console, the AWS API, or the AWS command line interface. In addition, there are tons of configuration parameters and a number of access control policies that need to be set up to do just about anything in AWS.

In order to keep things repeatable and self-documented, we have opted to set up the AWS environment using the AWS command line interface. We have a GitHub repo that contains the scripts necessary to set up the AWS environment for this project.

Running the setupaws script offers the following options:

1) Create API user, S3 bucket, create and attach policies
2) Create compute environment
3) Create container repos
4) Push container images
5) Create job queues and job definitions
6) Do all of the above in sequence
q) Quit (no changes)

Create API user, S3 bucket, create and attach policies

This part of the script supports all of our AWS activities (for example, S3), not just the AWS Batch part.

Here we create a user specifically to run the AWS APIs with restricted privileges, so that we do not use the AWS account administrator credentials anywhere in the application.

In this section we also create the S3 bucket where the Geoportal keeps all of its files, including the public and private GIS files, pre-generated tiles, and private user survey responses.

Finally, we create the policies and roles needed to enable API access and AWS Batch job execution access requirements (i.e., AWSBatchServiceRole, ecsInstanceRole, ecsTaskExecutionRole, and AmazonEC2SpotFleetRole).
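
A rough sketch of what this step does with the AWS CLI (user, bucket, and policy file names are illustrative, not the exact ones from the repo):

  # Restricted API user for the application
  aws iam create-user --user-name geoportalp-api

  # Project bucket (add --create-bucket-configuration outside us-east-1)
  aws s3api create-bucket --bucket geoportalp-files

  # Roles needed by AWS Batch / ECS / Spot Fleet, e.g.:
  aws iam create-role --role-name AWSBatchServiceRole \
    --assume-role-policy-document file://batch-trust.json
  aws iam attach-role-policy --role-name AWSBatchServiceRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole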

This step only needs to be run once.

Create compute environment

In this section of the script, we establish the compute environment requirements for running our batch jobs.

Here is the resulting definition:

{
  "computeEnvironmentName": "geoportalp-spot",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "SPOT",
    "minvCpus": 0,
    "maxvCpus": 2,
    "desiredvCpus": 0,
    "instanceTypes": [
      "optimal"
    ],
    "tags": {
      "Name": "geoportal-spot"
    },
    "subnets": [
      "subnet-XXXXXXXX",
      "subnet-YYYYYYYY",
      "subnet-ZZZZZZZZ"
    ],
    "securityGroupIds": [
      "sg-SSSSSSSS"
    ],
    "spotIamFleetRole": "arn:aws:iam::ACCOUNT:role/AmazonEC2SpotFleetRole",
    "instanceRole": "arn:aws:iam::ACCOUNT:instance-profile/ecsInstanceRole"
  },
  "serviceRole": "arn:aws:iam::ACCOUNT:role/AWSBatchServiceRole"
}
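
The script then registers this definition with a single call, assuming the JSON above is saved to a file such as compute-env.json (the actual script may pass it inline):

  aws batch create-compute-environment --cli-input-json file://compute-env.json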

This step only needs to be run once.

Create container repos

Here we create container repositories to store the Docker images that are used by the AWS Batch commands. We have two repos, one for the vector tile generation image (geoportalp-vtiles) and one for the raster tile generation image (geoportalp-rtiles).
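
In CLI terms, this step is essentially one call per repo (see the repo scripts for the exact invocation):

  aws ecr create-repository --repository-name geoportalp-vtiles
  aws ecr create-repository --repository-name geoportalp-rtiles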

This step only needs to be run once.

Push container images

In this section of the script, we build the container images defined under the docker subdirectory and push them to the repos created above.

For generating vector tiles, we use a standard amazonlinux Docker image, and we install the following dependencies:

  • deltarpm, which, unzip, aws-cli, git, sqlite-devel, curl, jq, development tools - General purpose commands and utilities we need
  • mapshaper - We use mapshaper to convert shapefile files to geojson, which is tippecanoe’s required input format
  • tippecanoe - We use tippecanoe to generate vector tiles, as described above.

For generating raster tiles, we use an osgeo/gdal Docker image (osgeo/gdal:ubuntu-small-latest), a small image that already has GDAL and all of its dependencies pre-installed. We use GDAL as described above to generate raster tiles.
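
The build-and-push itself follows the standard ECR flow; roughly (account ID, region, and paths are illustrative):

  # Authenticate Docker against the ECR registry
  aws ecr get-login-password --region us-east-1 \
    | docker login --username AWS --password-stdin ACCOUNT.dkr.ecr.us-east-1.amazonaws.com

  # Build, tag, and push the vector tiles image (same flow for geoportalp-rtiles)
  docker build -t geoportalp-vtiles docker/vtiles
  docker tag geoportalp-vtiles:latest ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/geoportalp-vtiles:latest
  docker push ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/geoportalp-vtiles:latest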

This step needs to be run at least once and whenever there is a need to make changes to one or both of the container images.

Create job queues and job definitions

Here we define two job queues and two job definitions for the vector and raster jobs.

The job queues are used to associate jobs to compute environments. In our case, the two job queues for vector and raster tiles are identical.

The job definitions specify the command invocation, parameters, and additional job execution requirements (e.g., memory, CPUs) for each of the batch job types (i.e., vtiles, rtiles).
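
Sketched with the CLI for the vtiles case (queue/definition names, vCPU and memory figures, and the container command are illustrative; the retry count covers the spot interruptions mentioned earlier):

  aws batch create-job-queue \
    --job-queue-name geoportalp-vtiles-queue \
    --state ENABLED \
    --priority 1 \
    --compute-environment-order order=1,computeEnvironment=geoportalp-spot

  aws batch register-job-definition \
    --job-definition-name geoportalp-vtiles \
    --type container \
    --retry-strategy attempts=3 \
    --container-properties '{"image": "ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/geoportalp-vtiles:latest",
      "vcpus": 2, "memory": 4096, "command": ["./generate-tiles.sh", "Ref::name"]}'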

This step needs to be run at least once and whenever there is a need to modify execution commands, parameters or computational requirements.

Putting it all together

Ok, now that we have defined our AWS Batch compute environment, we can easily submit batch jobs using the AWS Batch API. We invoke the AWS Batch API from within a Lambda function (submit-job), so that we can enforce our access control mechanism, i.e., the admin user must be a collaborator on the geoportal-data repository (see here for details about admin user authorization).
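
The CLI equivalent of what the submit-job function does is a single call, roughly like this (using the hypothetical queue and definition names from above; Ref::name in the job definition is how Batch substitutes the parameter into the container command):

  aws batch submit-job \
    --job-name "vtiles-$namelc" \
    --job-queue geoportalp-vtiles-queue \
    --job-definition geoportalp-vtiles \
    --parameters name=$namelc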

The point at which the batch job is invoked depends on the type of tiles being generated. For vector tiles, the batch job can be invoked immediately after a GIS file is uploaded.


[Figure: Vector tile generation]

For raster tiles, we must wait until the administrator defines the tile styling to be used as part of the file’s metadata.


[Figure: Raster tile generation]


And that’s it!

In this post, I described how we use AWS Batch to pre-generate tiles in the most economical and resource-conscious way possible.

Ah! On September 29 I will be presenting the Provita Geoportal implementation at the FOSS4G Buenos Aires conference. See you there!
