BigQuery Machine Learning (BQML) is a new feature in BigQuery where data analysts can train, evaluate, and predict with machine learning models with minimal coding.
In this post we’re going to look at how to train an ML model and predict new data using BQML. Please adjust the arguments in accordance with your needs.
Feature columns have to be either numeric (INT64, FLOAT64, etc.) or categorical (STRING). If the feature is numeric but needs to be treated as categorical, we need to cast it as a string.
To train a model, we need to specify the dataset location in which the model will be stored. We can also set several arguments for the training job, such as
input_label_cols (label column) and
Training a model is done by calling
CREATE OR REPLACE MODEL query followed by the specified options (as explained previously). Afterwards, the query is followed by the training data.
CREATE OR REPLACE MODEL <PROJECT>.<DATASET>.<MODEL> OPTIONS (input_label_cols=[<LABEL_COL>], model_type=<MODEL_TYPE>) AS SELECT <COL_X>, <COL_Y>, ... <LABEL_COL> FROM <PROJECT>.<DATASET>.<TABLE>
To see some metrics related to the training, enter the following query.
SELECT * FROM ML.EVALUATE(MODEL `<PROJECT>.<DATASET>.<MODEL>`)
During experiment to get the best model, we often generate several transformations to the original data (feature engineering). During serving phase, the data may not contain the generated features. In other words, the transformations must be carried out again before the model calculate its prediction.
One way to solve this is by building a data pipeline that performs the transformations during training phase. However, it might be cumbersome for people who are not too familiar with building a data pipeline or coding in non-sql language to transform the data.
BQML provides a clause called
TRANSFORM where BigQuery could remember the sets of transformations performed at the training phase. Those transformations will be automatically applied at the time of prediction.
TRANSFORM, the provided test data don’t need all the generated features (only raw ones). Therefore, the clients of the model don’t have to know about the transformations that have been carried out.
Here’s an example of using BQML for training with
CREATE OR REPLACE MODEL <DATASET>.<MODEL> TRANSFORM( SELECT * EXCEPT(COL_X), <FIRST_TRANSFORMATION_ON_COL_X>, <SECOND_TRANSFORMATION_ON_COL_X>, ...) OPTIONS (input_label_cols=[<LABEL_COL>], model_type=<MODEL_TYPE>) AS SELECT <COL_X>, <COL_Y>, <COL_Z>, <LABEL_COL> FROM <PROJECT>.<DATASET>.<TABLE>
Here’s how to predict with
TRANSFORM clause in place.
SELECT * FROM ML.PREDICT(MODEL <DATASET>.<MODEL>, ( SELECT <COL_X_VALUE> AS <COL_X>, <COL_Y_VALUE> AS <COL_Y>, <COL_Z_VALUE> AS <COL_Z>))