Optimizing BigQuery Queries - Part 1

Query optimization is usually performed to reduce query execution time or cost.

In this post, we’re going to look at several techniques to make our BigQuery query more efficient.


A) Minimize I/O

An important thing to note here is that most of the performance difference among queries doing the same job comes down to how much data they read (I/O). Therefore, it’s worth knowing how to minimize I/O to improve query performance.


Select Only Required Columns

BigQuery stores data in a columnar format. This means that the fewer columns a SELECT operation reads, the less data needs to be read.

Therefore, selecting only the required columns can improve the speed of query execution. Note, however, that using SELECT * in a subquery and then selecting only the required columns in an outer query works just fine, since the BigQuery optimizer is smart enough to retrieve only the required columns.
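
For example, the optimizer treats the following query, which wraps SELECT * in a subquery, as if only bike_id and duration were read (you can verify this yourself by checking the estimated bytes processed shown in the editor):

SELECT
  bike_id,
  duration
FROM (
  SELECT
    *
  FROM
    `bigquery-public-data`.london_bicycles.cycle_hire )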

To give this technique a try, enter the following query in the BigQuery editor window:

SELECT
  bike_id,
  duration
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
ORDER BY
  duration DESC
LIMIT
  1

Notice that the query completed in ~1.2s and processed ~372MB of data.

As a comparison, let’s select all the columns.

SELECT
  *
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
ORDER BY
  duration DESC
LIMIT
  1

Notice that the query completed in ~4.5s and processed ~2.6GB of data.


Reduce Data Being Read

Another way to minimize I/O is to reduce the amount of data being read in the first place. Suppose we wish to find the typical duration of the most common one-way rentals.

Enter the following query into the BigQuery editor window:

SELECT
  MIN(start_station_name) AS start_station_name,
  MIN(end_station_name) AS end_station_name,
  APPROX_QUANTILES(duration, 10)[OFFSET (5)] AS typical_duration,
  COUNT(duration) AS num_trips
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
WHERE
  start_station_id != end_station_id
GROUP BY
  start_station_id,
  end_station_id
ORDER BY
  num_trips DESC
LIMIT
  10

In the above query, we filter and group using station_id, even though the IDs themselves are never selected. In other words, the IDs are only supporting data used to reach the final goal of selecting the station names, typical duration, and number of trips. As a result, the query reads the following columns: start_station_id, end_station_id, start_station_name, end_station_name, and duration.

We can reduce the data being read by filtering and grouping on station_name instead of station_id, since there is a 1:1 relationship between station names and station IDs. The query then reads only the following columns: start_station_name, end_station_name, and duration.

SELECT
  start_station_name,
  end_station_name,
  APPROX_QUANTILES(duration, 10)[OFFSET(5)] AS typical_duration,
  COUNT(duration) AS num_trips
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
WHERE
  start_station_name != end_station_name
GROUP BY
  start_station_name,
  end_station_name
ORDER BY
  num_trips DESC
LIMIT
  10

Reduce Number of Expensive Computations

Suppose we wish to find the total distance traveled by each bicycle in our dataset.

A naive way to do this would be to calculate the distance traveled in each trip taken by each bicycle and sum them up (group by bicycle then sum up the trip distance).

Here’s the query that implements this approach.

WITH
  trip_distance AS (
SELECT
  bike_id,
  ST_Distance(
    ST_GeogPoint(s.longitude, s.latitude),
    ST_GeogPoint(e.longitude, e.latitude)) AS distance
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire,
  `bigquery-public-data`.london_bicycles.cycle_stations s,
  `bigquery-public-data`.london_bicycles.cycle_stations e
WHERE
  start_station_id = s.id
  AND end_station_id = e.id )
  
SELECT
  bike_id,
  SUM(distance)/1000 AS total_distance
FROM
  trip_distance
GROUP BY
  bike_id
ORDER BY
  total_distance DESC
LIMIT
  5

Computing the trip distance is a fairly expensive operation. Rather than computing it once per row of the cycle_hire table, we can precompute the distance between every pair of stations (a much smaller number of combinations) and join cycle_hire against that precomputed data:

WITH
  stations AS (
SELECT
  s.id AS start_id,
  e.id AS end_id,
  ST_Distance(
    ST_GeogPoint(s.longitude, s.latitude),
    ST_GeogPoint(e.longitude, e.latitude)) AS distance
FROM
  `bigquery-public-data`.london_bicycles.cycle_stations s,
  `bigquery-public-data`.london_bicycles.cycle_stations e ),
  
trip_distance AS (
SELECT
  bike_id,
  distance
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire,
  stations
WHERE
  start_station_id = start_id
  AND end_station_id = end_id )
  
SELECT
  bike_id,
  SUM(distance)/1000 AS total_distance
FROM
  trip_distance
GROUP BY
  bike_id
ORDER BY
  total_distance DESC
LIMIT
  5

B) Cache Results of Previous Queries

BigQuery automatically caches query results in a temporary table. If the identical query is submitted within approximately 24 hours, the results are served from this temporary table without any recomputation. Cached results are extremely fast and do not incur charges.

However, query caching is based on exact string comparison, so even a single whitespace difference causes a recomputation. Note also that results are never cached if the query exhibits non-deterministic behavior (for example, it uses CURRENT_TIMESTAMP or RAND), if the table or view being queried has changed (even if the columns/rows of interest to the query are unchanged), if the table is associated with a streaming buffer (even if there are no new rows), if the query uses DML statements, or if the query reads from an external data source.
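
To check whether a recent query was served from cache, we can inspect the job history. The following is a sketch, assuming the jobs ran in the US multi-region (adjust the region qualifier as needed) and that you have permission to query INFORMATION_SCHEMA.JOBS_BY_USER; a cached result shows cache_hit = true with zero bytes processed.

SELECT
  job_id,
  cache_hit,
  total_bytes_processed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
  creation_time DESC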


Cache Intermediate Results

The WITH clause (also called a Common Table Expression) improves readability but does not improve query speed or cost, since its results are not cached. The same holds for views and subqueries. If you find yourself using a WITH clause, view, or subquery often, one way to potentially improve performance is to store the result in a table (or materialized view).

Suppose we’d like to find the typical duration of trips between a pair of stations.

We first create a new dataset called mydataset and a new table within it called typical_trip.

CREATE OR REPLACE TABLE
  mydataset.typical_trip AS
SELECT
  start_station_name,
  end_station_name,
  APPROX_QUANTILES(duration, 10)[OFFSET (5)] AS typical_duration,
  COUNT(duration) AS num_trips
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
GROUP BY
  start_station_name,
  end_station_name

Then, use the typical_trip table to find days when bicycle trips are much longer than usual.

SELECT
  EXTRACT (DATE
  FROM
    start_date) AS trip_date,
  APPROX_QUANTILES(duration / typical_duration, 10)[OFFSET(5)] AS ratio,
  COUNT(*) AS num_trips_on_day
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire AS hire
JOIN
  mydataset.typical_trip AS trip
ON
  hire.start_station_name = trip.start_station_name
  AND hire.end_station_name = trip.end_station_name
  AND num_trips > 10
GROUP BY
  trip_date
HAVING
  num_trips_on_day > 10
ORDER BY
  ratio DESC
LIMIT
  10

For comparison, this time we use a WITH clause to build the typical-trip data on the fly and again find days when bicycle trips are much longer than usual.

WITH
typical_trip AS (
SELECT
  start_station_name,
  end_station_name,
  APPROX_QUANTILES(duration, 10)[OFFSET (5)] AS typical_duration,
  COUNT(duration) AS num_trips
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
GROUP BY
  start_station_name,
  end_station_name )
  
SELECT
  EXTRACT (DATE
  FROM
    start_date) AS trip_date,
  APPROX_QUANTILES(duration / typical_duration, 10)[OFFSET(5)] AS ratio,
  COUNT(*) AS num_trips_on_day
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire AS hire
JOIN
  typical_trip AS trip
ON
  hire.start_station_name = trip.start_station_name
  AND hire.end_station_name = trip.end_station_name
  AND num_trips > 10
GROUP BY
  trip_date
HAVING
  num_trips_on_day > 10
ORDER BY
  ratio DESC
LIMIT
  10

Notice the ~50% speedup when the typical trip durations are stored in a table. This is because the typical-duration computation is performed once, ahead of time, rather than recomputed on every query.

Also, notice that the typical_trip table is not updated when new data is added to the cycle_hire table. One way to solve this problem is to use a materialized view or to schedule queries that update the table periodically.
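
As a sketch of the materialized-view approach: materialized views support only a limited set of aggregate functions, so AVG(duration) stands in for APPROX_QUANTILES below, the view name typical_trip_mv is made up for illustration, and the view must live in the same region as the base table.

CREATE MATERIALIZED VIEW
  mydataset.typical_trip_mv AS
SELECT
  start_station_name,
  end_station_name,
  AVG(duration) AS typical_duration,
  COUNT(duration) AS num_trips
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
GROUP BY
  start_station_name,
  end_station_name

BigQuery then refreshes the view incrementally as new data arrives in the base table, so queries against it stay up to date without scheduled maintenance.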


Accelerate Queries with BI Engine

In the case of Business Intelligence (BI), there might be certain tables that are queried frequently. One way to speed up the queries is to utilize BI Engine.

BI Engine automatically stores relevant pieces of data in memory (either actual columns from the table or derived results) and processes them with a specialized query processor tuned for in-memory data. We can reserve the amount of memory (currently up to a maximum of 10 GB) that BigQuery will allocate for caching data for BI Engine.

This memory should be reserved in the same region as the dataset being queried. Once the reservation is made, BigQuery will start to cache tables, parts of tables, and aggregations in memory.

A main use case for BI Engine is for tables that are accessed from dashboard tools such as Google Data Studio. By allocating memory for BI Engine, BigQuery will return the results much faster.
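
Besides the Cloud Console, a BI Engine reservation can also be managed with DDL. The following is a minimal sketch, assuming a hypothetical project named my-project whose data lives in the US multi-region:

ALTER BI_CAPACITY
  `my-project.region-us.default`
SET OPTIONS (
  size_gb = 10 )

Setting size_gb to 0 removes the reservation again.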