Machine Learning on BigQuery ML

2 minute read

Published:

BigQuery Machine Learning (BQML) is a new feature in BigQuery where data analysts can train, evaluate, and predict with machine learning models with minimal coding.

In this post we’re going to look at how to train an ML model and predict new data using BQML. Please adjust the arguments in accordance with your needs.

Model Training

Feature columns have to be either numeric (INT64, FLOAT64, etc.) or categorical (STRING). If the feature is numeric but needs to be treated as categorical, we need to cast it as a string.

To train a model, we need to specify the dataset location in which the model will be stored. We can also set several arguments for the training job, such as input_label_cols (label column) and model_type.

Training a model is done by calling CREATE OR REPLACE MODEL query followed by the specified options (as explained previously). Afterwards, the query is followed by the training data.

CREATE OR REPLACE MODEL
  <PROJECT>.<DATASET>.<MODEL>
OPTIONS
  (input_label_cols=[<LABEL_COL>],
    model_type=<MODEL_TYPE>) AS
SELECT
  <COL_X>,
  <COL_Y>,
  ...
  <LABEL_COL>
FROM
  <PROJECT>.<DATASET>.<TABLE>

To see some metrics related to the training, enter the following query.

SELECT * FROM ML.EVALUATE(MODEL `<PROJECT>.<DATASET>.<MODEL>`)

Model Predictions

During experiment to get the best model, we often generate several transformations to the original data (feature engineering). During serving phase, the data may not contain the generated features. In other words, the transformations must be carried out again before the model calculate its prediction.

One way to solve this is by building a data pipeline that performs the transformations during training phase. However, it might be cumbersome for people who are not too familiar with building a data pipeline or coding in non-sql language to transform the data.

BQML provides a clause called TRANSFORM where BigQuery could remember the sets of transformations performed at the training phase. Those transformations will be automatically applied at the time of prediction.

By using TRANSFORM, the provided test data don’t need all the generated features (only raw ones). Therefore, the clients of the model don’t have to know about the transformations that have been carried out.

Here’s an example of using BQML for training with TRANSFORM clause.

CREATE OR REPLACE MODEL
  <DATASET>.<MODEL> 
  TRANSFORM(
    SELECT * EXCEPT(COL_X),
    <FIRST_TRANSFORMATION_ON_COL_X>,
    <SECOND_TRANSFORMATION_ON_COL_X>,
    ...)
OPTIONS
  (input_label_cols=[<LABEL_COL>],
    model_type=<MODEL_TYPE>) AS
SELECT
  <COL_X>,
  <COL_Y>,
  <COL_Z>,
  <LABEL_COL>
FROM
  <PROJECT>.<DATASET>.<TABLE>

Here’s how to predict with TRANSFORM clause in place.

SELECT
  *
FROM
  ML.PREDICT(MODEL <DATASET>.<MODEL>,
    (
     SELECT
       <COL_X_VALUE> AS <COL_X>,
       <COL_Y_VALUE> AS <COL_Y>,
       <COL_Z_VALUE> AS <COL_Z>))