Optimized Time Zone Location Lookup in PostgreSQL
Background
I have a dataset where users are allowed to add “hours of operation”, but they are never asked for time zone information. That works if the client viewing the point is local and knows the time zone, but this is a national project.
Data Sources
OpenStreetMap
OpenStreetMap seems to maintain the best publicly available time zone data, via the timezone key. The OpenStreetMap documentation advises editors to:
Use timezone=* to specify which IANA time zone database time zone applies.
A developer, Evan Siroky, is apparently a big fan of mapping time zone boundaries. Evan created a Node.js tool to extract the time zones directly out of OpenStreetMap. There are even pre-made time zone files in GeoJSON and Shapefile formats for those who just want the data quickly.
PostgreSQL
PostgreSQL has the IANA (Olson) Time Zone database built in (as described here). This is the same time zone list that OpenStreetMap suggests, so things should line up (spoiler: they don’t always).
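You can browse that built-in list directly; pg_timezone_names is a standard PostgreSQL view:
-- A few built-in IANA zone names with their current UTC offsets.
SELECT name, utc_offset
FROM pg_timezone_names
WHERE name LIKE 'America/%'
ORDER BY name
LIMIT 5;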
Process
Clone and run boundary builder
Normally, I would just use the GeoJSON file, but since I’m documenting this, I might as well run the full process.
Install Node.js / git / gdal / zip
Also, since I’m documenting it, I’m assuming the use of a fresh Ubuntu machine (or image) that doesn’t have anything installed yet. (I borrowed the nodesource script documentation from DigitalOcean)
curl -sL https://deb.nodesource.com/setup_16.x -o nodesource_setup.sh
nano nodesource_setup.sh
At this point, it’s always important to look around in the nodesource file to make sure it’s what you want to do. Once you do that, you can continue.
sudo bash ./nodesource_setup.sh
sudo apt-get update && sudo apt-get install -y \
nodejs \
git \
gdal-bin \
zip
mkdir ~/projects
cd ~/projects
git clone https://github.com/evansiroky/timezone-boundary-builder
cd ~/projects/timezone-boundary-builder
npm install
Run the time zone boundary builder
The README for the library is pretty great, so you can look there for all the details. This is the command I used to get a GeoJSON file that I will import into PostgreSQL.
node \
--max-old-space-size=8192 \
index.js \
--skip_analyze_diffs \
--skip_shapefile \
--skip_zip \
--downloads_dir ./downloads \
--dist_dir ./dist
This loops through every time zone in the osmBoundarySources.json file, downloads each one from the Overpass API, and waits 4 seconds between requests. There are 470 time zones in that file, so the process will take at least 4 × 470 = 1,880 seconds (a little over 31 minutes) to complete. There is a status indicator, and it says the process will take about 2 hours, which also includes the time to download the data from Overpass. I’m getting a lot of 429: Too Many Requests responses, so it’s taking some time.
Wait 🕑
I took this time to write up my process so far in Markdown, refill my tea, and work on other projects.
Load into PostgreSQL
The final GeoJSON file will be called ./dist/combined.json. GDAL was already installed with the apt-get command above, so ogr2ogr should be available. You can read all about the switches I used in the GDAL PostgreSQL driver documentation. Make sure to change the host, user, dbname, and password fields in the PG line, or use environment variables.
ogr2ogr \
-f "PostgreSQL" \
PG:"host=myhost user=myloginname dbname=mydbname password=mypassword" \
./dist/combined.json \
-lco OVERWRITE=yes \
-lco DIM=2 \
-lco SPATIAL_INDEX=GIST \
-lco GEOMETRY_NAME=the_geom \
-lco FID=id \
-nlt PROMOTE_TO_MULTI \
-nln osm_timezones
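Once the import finishes, a quick sanity check against the table name we set with -nln:
-- Confirm the time zone polygons landed in the new table.
SELECT count(*) FROM osm_timezones;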
Make sure the new data matches PostgreSQL’s time zone names
PostgreSQL provides a view called pg_timezone_names that lists the current offset for every time zone it knows about. These should match up, but since we don’t always keep our PostgreSQL databases or OSM extracts current, there may be gaps. I try to avoid these situations, and you can test for them with a query like this:
SELECT tzid, now() AT TIME ZONE tzid FROM osm_timezones;
If you get an error, the easiest way to prevent issues is to add a column that lets you filter out the time zones that don’t match. You really shouldn’t have to do this as long as everything is up-to-date, but it never hurts to check.
ALTER TABLE osm_timezones ADD COLUMN pg_timezone BOOLEAN;
WITH matched_timezones AS (
SELECT
pg_timezone_names.name IS NOT NULL AS matched,
osm_timezones.tzid
FROM osm_timezones
LEFT JOIN pg_timezone_names ON pg_timezone_names.name = osm_timezones.tzid
)
UPDATE osm_timezones
SET pg_timezone = matched_timezones.matched
FROM matched_timezones
WHERE matched_timezones.tzid = osm_timezones.tzid;
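To see which zones fell through, and will be excluded by the filter column:
-- List OSM zones with no matching PostgreSQL time zone name.
SELECT tzid FROM osm_timezones WHERE pg_timezone IS NOT TRUE;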
Find a point in a polygon
Test out the code to make sure it works. I like testing along the 40th parallel, 15° of longitude at a time, since it cuts through some populated areas:
- -75° is New Jersey (Cinnaminson / Philly Suburbs)
- -90° is Illinois (Roughly Springfield)
- -105° is Colorado (North of Denver at ‘Baseline’ road, which follows the 40th parallel)
- -120° is California (at the NV border)
SELECT tzid
FROM osm_timezones
WHERE ST_Contains(the_geom, ST_SetSRID(ST_MakePoint(-75, 40), 4326));
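You can measure this on your own setup with EXPLAIN ANALYZE, which reports the plan along with actual execution time:
-- Plan plus actual timing for the point-in-polygon lookup.
EXPLAIN ANALYZE
SELECT tzid
FROM osm_timezones
WHERE ST_Contains(the_geom, ST_SetSRID(ST_MakePoint(-75, 40), 4326));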
These queries run pretty quickly for me (around 2.0 ms), but there are things we can do to speed them up.
Optimizing the Queries
The ST_Contains function runs a point-in-polygon test on every polygon that may contain the single point. These time zone polygons can be quite large; for example, the America/New_York time zone contains 32,244 vertices.
Somewhere deep in the PostGIS code, it is looping through each of those 32,244 points to make sure that your point is contained within that polygon. PostgreSQL is smart and can optimize this a little internally, and the GiST spatial index we added with the ogr2ogr command helps a lot too.
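You can see the vertex counts for yourself with ST_NPoints, a standard PostGIS function:
-- The largest time zone geometries by vertex count.
SELECT tzid, ST_NPoints(the_geom) AS vertices
FROM osm_timezones
ORDER BY ST_NPoints(the_geom) DESC
LIMIT 5;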
We can help PostgreSQL out a little by splitting our polygons into ones with fewer points in them, basically breaking up something like the New York time zone into a bunch of smaller polygons that each contain fewer points. This is where the ST_Subdivide function comes in.
In order to do this, I am going to split up the time zones into polygons with no more than 255 points each, by running this query:
CREATE TABLE osm_timezones_split AS
SELECT id, tzid, ST_Subdivide(the_geom, 255) AS the_geom
FROM osm_timezones WHERE pg_timezone = true;
It will take a while, so you can go work on something else while it processes. It took me 2 minutes 11 seconds to run. I chose 255 because it’s right at ST_Subdivide’s default of 256; I’m sure there’s a better balance between vertices per polygon and the total number of polygons, but I haven’t tuned it.
After that finishes, make sure to add a spatial index to the new table:
CREATE INDEX osm_timezones_split_the_geom_idx
ON osm_timezones_split
USING GIST (the_geom);
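Since the table was just created, it’s also worth refreshing the planner statistics and confirming that the split kept every geometry at or under the vertex cap:
-- Refresh planner statistics for the freshly created table.
ANALYZE osm_timezones_split;

-- Every subdivided geometry should now be at or below the 255-vertex limit.
SELECT count(*) AS polygons, max(ST_NPoints(the_geom)) AS max_vertices
FROM osm_timezones_split;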
Run this query on the new table, and you should see an improvement:
SELECT tzid
FROM osm_timezones_split
WHERE ST_Contains(the_geom, ST_SetSRID(ST_MakePoint(-75, 40), 4326));
I am getting around 0.8 ms per query now!
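If you run this lookup frequently, it may be convenient to wrap it in a function. Here is a minimal sketch; the lookup_timezone name is my own, not part of the original process:
-- Hypothetical convenience wrapper around the subdivided lookup.
CREATE OR REPLACE FUNCTION lookup_timezone(lon double precision, lat double precision)
RETURNS text AS $$
  SELECT tzid
  FROM osm_timezones_split
  WHERE ST_Contains(the_geom, ST_SetSRID(ST_MakePoint(lon, lat), 4326))
  LIMIT 1;
$$ LANGUAGE sql STABLE;

-- Example: a point north of Denver should return America/Denver.
SELECT lookup_timezone(-105, 40);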
Testing on a larger dataset
For a larger test, I am going to look up the time zone for every record in the USGS Structures dataset, which has 445,921 records.
SELECT
usgs_struct.name,
tzid
FROM usgs_struct LEFT JOIN osm_timezones_split ON
ST_Contains(osm_timezones_split.the_geom, usgs_struct.the_geom);
Without subdividing the dataset, I was able to join the two datasets in 1090377.223 ms (about 18 minutes). With the subdivided table, the join took 5038.160 ms (about 5 seconds).
Oddities
I found that if I don’t do the LEFT JOIN (just use a plain JOIN instead), it runs much quicker; in fact, the osm_timezones JOIN runs in about 3 seconds, which is 2 seconds faster than the left join with the osm_timezones_split table. In theory, each point should only fall into one time zone, so there is really no need for a nested loop left join. I’ll have to look into this further.
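For reference, the inner-join variant looks like this (same tables and columns as above):
-- Inner join: structures that fall outside every time zone polygon are dropped.
SELECT
  usgs_struct.name,
  tzid
FROM usgs_struct JOIN osm_timezones ON
  ST_Contains(osm_timezones.the_geom, usgs_struct.the_geom);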