Optimizing BigQuery Queries - Part 2

9 minute read


This is the second part of a series on how to optimize BigQuery queries.

C) Efficient Joins

Joining two tables is an expensive operation because it requires shuffling data across the cluster (moving rows that share a join key to the same worker). Whenever possible, avoid the join or reduce the amount of data being joined.

Denormalization

Denormalization improves read performance and avoids joins by adding redundant copies of data.

Using an example from the BigQuery public data (London bicycles dataset): instead of storing the bicycle station latitudes and longitudes separately from the cycle hire information, we can create a single table (please create a dataset called mydataset first).

CREATE OR REPLACE TABLE
  mydataset.london_bicycles_denorm AS
SELECT
  start_station_id,
  s.latitude AS start_latitude,
  s.longitude AS start_longitude,
  end_station_id,
  e.latitude AS end_latitude,
  e.longitude AS end_longitude
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire AS h
JOIN
  `bigquery-public-data`.london_bicycles.cycle_stations AS s
ON
  h.start_station_id = s.id
JOIN
  `bigquery-public-data`.london_bicycles.cycle_stations AS e
ON
  h.end_station_id = e.id

With this table in place, queries no longer need to perform the join, since the necessary location information for every trip is already stored in a single table.

In this case, you are trading increased storage and data read against the computational expense of a join. Since the cost of reading more data from disk can outweigh the cost of the join, you should measure whether denormalization actually brings performance benefits.

Avoid Self-joins of Large Tables

A self-join happens when a table is joined with itself. Self-joins of large tables in BigQuery can lead to performance degradation. In many cases, you can avoid the self-join by taking advantage of SQL features such as aggregation and window functions.

Let’s take a look at an example.

This time, we’re going to use a public dataset of baby names published by the US Social Security Administration. We can query the dataset to find the most common male names in 2015 in the state of Massachusetts (please make sure your query runs in the US region by selecting More > Query settings > Processing location):

SELECT
  name,
  number AS num_babies
FROM
  `bigquery-public-data`.usa_names.usa_1910_current
WHERE
  gender = 'M'
  AND year = 2015
  AND state = 'MA'
ORDER BY
  num_babies DESC
LIMIT
  5

Similarly, we can execute a query to find the most common female names in 2015 in the state of Massachusetts.

SELECT
  name,
  number AS num_babies
FROM
  `bigquery-public-data`.usa_names.usa_1910_current
WHERE
  gender = 'F'
  AND year = 2015
  AND state = 'MA'
ORDER BY
  num_babies DESC
LIMIT
  5

What are the most common names assigned to both male and female babies in the country over all the years in the dataset?

A naive solution reads the baby names table twice (into the male_babies and female_babies CTEs) and self-joins the results.

WITH
male_babies AS (
SELECT
  name,
  number AS num_babies
FROM
  `bigquery-public-data`.usa_names.usa_1910_current
WHERE
  gender = 'M' ),
  
female_babies AS (
SELECT
  name,
  number AS num_babies
FROM
  `bigquery-public-data`.usa_names.usa_1910_current
WHERE
  gender = 'F' ),
  
both_genders AS (
SELECT
  name,
  SUM(m.num_babies) + SUM(f.num_babies) AS num_babies,
  SUM(m.num_babies) / (SUM(m.num_babies) + SUM(f.num_babies)) AS frac_male
FROM
  male_babies AS m
JOIN
  female_babies AS f
USING
  (name)
GROUP BY
  name )
  
SELECT
  *
FROM
  both_genders
WHERE
  frac_male BETWEEN 0.3
  AND 0.7
ORDER BY
  num_babies DESC
LIMIT
  5

This query took 74 seconds to complete.

Unfortunately, the above query also yields a wrong answer because of the self-join: the table contains one row per name, gender, year, and state, so joining on name alone pairs every male row with every female row for that name, and the SUMs count each value many times over.
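The inflation is easy to see in a toy sketch (plain Python with made-up numbers, standing in for the SQL):

```python
# Toy illustration (not BigQuery): one row per (name, gender, year, number).
rows = [
    ("Casey", "M", 2014, 10), ("Casey", "M", 2015, 12),
    ("Casey", "F", 2014, 9),  ("Casey", "F", 2015, 11),
]

male = [r for r in rows if r[1] == "M"]
female = [r for r in rows if r[1] == "F"]

# Self-join on name: every male row pairs with every female row
# (2 x 2 = 4 pairs), so each number is summed twice instead of once.
joined = [(m, f) for m in male for f in female if m[0] == f[0]]
join_total = sum(m[3] + f[3] for m, f in joined)

# Correct total: sum each row exactly once.
true_total = sum(r[3] for r in rows)

print(join_total, true_total)  # 84 42 -- the join doubles the count
```

With more years and all fifty states per name, the cross product (and thus the overcount) grows far larger than the factor of two here.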

The following is a faster, more elegant, and correct solution: the query reads the baby names table only once (into the all_babies CTE) and avoids the self-join entirely.

WITH
all_babies AS (
SELECT
  name,
  SUM(IF(gender = 'M', number, 0)) AS male_babies,
  SUM(IF(gender = 'F', number, 0)) AS female_babies
FROM
  `bigquery-public-data`.usa_names.usa_1910_current
GROUP BY
  name ),
  
both_genders AS (
SELECT
  name,
  (male_babies + female_babies) AS num_babies,
  SAFE_DIVIDE(male_babies,
    male_babies + female_babies) AS frac_male
FROM
  all_babies
WHERE
  male_babies > 0
  AND female_babies > 0 )
  
SELECT
  *
FROM
  both_genders
WHERE
  frac_male BETWEEN 0.3
  AND 0.7
ORDER BY
  num_babies DESC
LIMIT
  5

This query only took 2.4 seconds, a 30x speedup!

Reduce Data Being Joined

Still using the same question as above (What are the most common names assigned to both male and female babies in the country over all the years in the dataset?), we can make the query faster by grouping the data by name and gender early on.

Try the following query:

WITH
all_names AS (
SELECT
  name,
  gender,
  SUM(number) AS num_babies
FROM
  `bigquery-public-data`.usa_names.usa_1910_current
GROUP BY
  name,
  gender ),
  
male_names AS (
SELECT
  name,
  num_babies
FROM
  all_names
WHERE
  gender = 'M' ),
  
female_names AS (
SELECT
  name,
  num_babies
FROM
  all_names
WHERE
  gender = 'F' ),
  
ratio AS (
SELECT
  name,
  (f.num_babies + m.num_babies) AS num_babies,
  m.num_babies / (f.num_babies + m.num_babies) AS frac_male
FROM
  male_names AS m
JOIN
  female_names AS f
USING
  (name) )
  
SELECT
  *
FROM
  ratio
WHERE
  frac_male BETWEEN 0.3
  AND 0.7
ORDER BY
  num_babies DESC
LIMIT
  5

The query above finished in 2 seconds and returned the correct result. Grouping early reduces the amount of data being joined.

Use a Window Function Instead of a Self-join

Suppose we’d like to find how long a bicycle stays at a station. This is an example of a dependent relationship between rows. A naive solution would use a self-join: from the self-joined table, we would compute the difference between the current pickup and the previous dropoff, with extra filtering logic to ensure each pickup is matched with the dropoff that immediately precedes it.

We can use a window function instead of a self-join. Please make sure your query runs in the EU region by selecting More > Query settings > Processing location.

SELECT
  bike_id,
  start_date,
  end_date,
  TIMESTAMP_DIFF( start_date, LAG(end_date) OVER (PARTITION BY bike_id ORDER BY start_date), SECOND) AS time_at_station
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
LIMIT
  5

The time_at_station field tracks the difference between the previous dropoff and the current pickup.
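In plain Python, the LAG-over-partition semantics might look like this toy sketch (made-up rides with integer timestamps, not the BigQuery implementation): within each bike_id partition, rows are ordered by start time, and each row looks at the previous row's end time.

```python
from itertools import groupby

# Toy rides: (bike_id, start_ts, end_ts), timestamps as integer seconds.
rides = [
    (1, 100, 200), (1, 260, 300), (2, 50, 90), (1, 350, 400),
]

# PARTITION BY bike_id ORDER BY start_date, then LAG(end_date).
rides.sort(key=lambda r: (r[0], r[1]))
time_at_station = {}
for bike, group in groupby(rides, key=lambda r: r[0]):
    prev_end = None  # LAG is NULL for the first row of each partition
    for _, start, end in group:
        time_at_station[(bike, start)] = (
            None if prev_end is None else start - prev_end
        )
        prev_end = end

print(time_at_station)
# {(1, 100): None, (1, 260): 60, (1, 350): 50, (2, 50): None}
```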

Using this, we can compute the average time that a bicycle is unused at each station and rank stations by that measure:

WITH
unused AS (
  SELECT
    bike_id,
    start_station_name,
    start_date,
    end_date,
    TIMESTAMP_DIFF(start_date, LAG(end_date) OVER (PARTITION BY bike_id ORDER BY start_date), SECOND) AS time_at_station
  FROM
    `bigquery-public-data`.london_bicycles.cycle_hire )
SELECT
  start_station_name,
  AVG(time_at_station) AS unused_seconds
FROM
  unused
GROUP BY
  start_station_name
ORDER BY
  unused_seconds ASC
LIMIT
  5

Join with Precomputed Values

It can be helpful to precompute values on a smaller table and then join against the precomputed results, instead of repeating a potentially expensive computation for every row.

For instance, suppose we’d like to find the pair of stations between which our customers ride bicycles at the fastest pace. To compute the pace (minutes per kilometer), we need to divide the duration of the ride by the distance between stations.

One way to solve this task is by creating a denormalized table with distances between stations and then compute the average pace:

WITH
  denormalized_table AS (
  SELECT
    start_station_name,
    end_station_name,
    ST_DISTANCE(
      ST_GeogPoint(s1.longitude, s1.latitude),
      ST_GeogPoint(s2.longitude, s2.latitude)) AS distance,
    duration
  FROM
    `bigquery-public-data`.london_bicycles.cycle_hire AS h
  JOIN
    `bigquery-public-data`.london_bicycles.cycle_stations AS s1
  ON
    h.start_station_id = s1.id
  JOIN
    `bigquery-public-data`.london_bicycles.cycle_stations AS s2
  ON
    h.end_station_id = s2.id ),
    
  durations AS (
  SELECT
    start_station_name,
    end_station_name,
    MIN(distance) AS distance,
    AVG(duration) AS duration,
    COUNT(*) AS num_rides
  FROM
    denormalized_table
  WHERE
    duration > 0
    AND distance > 0
  GROUP BY
    start_station_name,
    end_station_name
  HAVING
    num_rides > 100 )
    
SELECT
  start_station_name,
  end_station_name,
  distance,
  duration,
  duration/distance AS pace
FROM
  durations
ORDER BY
  pace ASC
LIMIT
  5

The above query calls the geospatial function ST_DISTANCE once for each row in the cycle_hire table (24 million times). It takes 14.7 seconds to complete and processes 1.9 GB of data.

In other words, many rows share the same pair of stations and therefore the same ST_DISTANCE result, yet the computation is repeated for each of them.

Alternatively, we can use the smaller cycle_stations table to precompute the distance between every pair of stations (distances) and then join it with the reduced-size table of average durations between stations (durations):

WITH
  distances AS (
  SELECT
    a.id AS start_station_id,
    a.name AS start_station_name,
    b.id AS end_station_id,
    b.name AS end_station_name,
    ST_DISTANCE(
      ST_GeogPoint(a.longitude, a.latitude),
      ST_GeogPoint(b.longitude, b.latitude)) AS distance
  FROM
    `bigquery-public-data`.london_bicycles.cycle_stations a
  CROSS JOIN
    `bigquery-public-data`.london_bicycles.cycle_stations b
  WHERE
    a.id != b.id ),
    
  durations AS (
  SELECT
    start_station_id,
    end_station_id,
    AVG(duration) AS duration,
    COUNT(*) AS num_rides
  FROM
    `bigquery-public-data`.london_bicycles.cycle_hire
  WHERE
    duration > 0
  GROUP BY
    start_station_id,
    end_station_id
  HAVING
    num_rides > 100 )
    
SELECT
  start_station_name,
  end_station_name,
  distance,
  duration,
  duration/distance AS pace
FROM
  distances
JOIN
  durations
USING
  (start_station_id,
    end_station_id)
ORDER BY
  pace ASC
LIMIT
  5

The above query took only 8.2 seconds (~1.8x speedup) and processed 554 MB (~3.5x cost reduction).

The rewritten query no longer repeats the ST_DISTANCE computation for rows with the same station pair: the distances table computes each pairwise distance exactly once, and the durations table is grouped by start_station_id and end_station_id, so each station pair appears in the join only once.
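The saving can be sketched in plain Python (a toy model with made-up rides, not the real dataset): the expensive function is called once per distinct station pair instead of once per ride.

```python
# Toy model: many rides over few distinct station pairs (made-up data).
rides = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "B"), ("B", "C")]

calls = 0

def expensive_distance(pair):
    """Stand-in for ST_DISTANCE; counts how often it is invoked."""
    global calls
    calls += 1
    return 1.0

# Naive approach: recompute the distance for every ride.
for pair in rides:
    expensive_distance(pair)
naive_calls = calls

# Precomputed approach: one call per distinct pair, then cheap lookups.
calls = 0
distances = {pair: expensive_distance(pair) for pair in set(rides)}
for pair in rides:
    distances[pair]
precomputed_calls = calls

print(naive_calls, precomputed_calls)  # 5 2
```

In the real query the ratio is far more dramatic: roughly 24 million rides against the distinct pairs drawn from a few hundred stations.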