ML on Kubeflow - Part 2: Training on the Cluster

You can find the TensorFlow code in the file in the examples repository. After training is complete, the model will be stored in a GCS bucket.

Set up a Storage Bucket

Create a GCS bucket to hold the trained model.


gsutil mb gs://${BUCKET_NAME}/
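The commands in this post assume a few environment variables are already set. A minimal sketch, using hypothetical values (my-gcp-project and the derived bucket name are placeholders; substitute your own):

```shell
# hypothetical placeholder values -- replace with your own project and paths
PROJECT_ID=my-gcp-project            # your GCP project ID
BUCKET_NAME=${PROJECT_ID}-mnist      # GCS bucket names must be globally unique
WORKING_DIR=$HOME/examples/mnist     # directory containing the example code
echo "gs://${BUCKET_NAME}/"          # the bucket URL passed to gsutil mb above
```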

Build the Container

Before deploying the TensorFlow code to the Kubernetes cluster, we first need to build a container image for it.

We’ll push the image to the Google Container Registry (GCR).

# set the path on GCR you want to push the image to
IMAGE_PATH=gcr.io/$PROJECT_ID/kubeflow-train

# build the tensorflow code into a container image
# image is tagged with its eventual path on GCR, but it stays local for now
docker build $WORKING_DIR -t $IMAGE_PATH -f $WORKING_DIR/Dockerfile.model

Here’s the sample output.

1da6064a2568: Pull complete
e701d2aeb76d: Pull complete
cba62e3f7418: Pull complete
3b523741fd2e: Pull complete
Digest: sha256:f3b5484e3335d2eb72940a6addde6f714173ff46c8a13c06aaf10915e96bc539
Status: Downloaded newer image for tensorflow/tensorflow:1.7.0
 ---> b52a7196d31e
Step 2/5 : ADD /opt/
 ---> 660100d10475
Step 3/5 : RUN chmod +x /opt/
 ---> Running in ed5c3fa09ed3
Removing intermediate container ed5c3fa09ed3
 ---> 4613566ea358
Step 4/5 : ENTRYPOINT ["/usr/bin/python"]
 ---> Running in 80761d4755a9
Removing intermediate container 80761d4755a9
 ---> 836ecb6f880e
Step 5/5 : CMD ["/opt/"]
 ---> Running in b7432bca14ba
Removing intermediate container b7432bca14ba
 ---> a6ea16cdf075
Successfully built a6ea16cdf075
Successfully tagged <PROJECT_ID>/kubeflow-train:latest
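Reading the numbered steps in the output above, Dockerfile.model is roughly the following sketch (the file names were cut from the output, so model.py below is a hypothetical placeholder for the training script):

```dockerfile
# Step 1/5 is implied by the tensorflow/tensorflow:1.7.0 base image pulled above
FROM tensorflow/tensorflow:1.7.0
# model.py is a hypothetical name for the training script
ADD model.py /opt/model.py
RUN chmod +x /opt/model.py
ENTRYPOINT ["/usr/bin/python"]
CMD ["/opt/model.py"]
```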

To make sure everything works, we can test the image locally before pushing it.

docker run -it $IMAGE_PATH

You should see training logs start appearing in your console. This means the image works as expected, and we can upload it to GCR so it can be used in our cluster.

# allow docker to access GCR registry
gcloud auth configure-docker --quiet

docker push $IMAGE_PATH

Train the Model on the Kubeflow Cluster

After the TensorFlow image is uploaded to GCR, we can train the model on the cluster.

First, go to the training directory.

cd $WORKING_DIR/training/GCS

There you'll find a file called kustomization.yaml.

We'll use kustomize to customize several settings in the manifests to suit our needs:

  • Set a unique name for the training job. In this example, we use my-train-1 as the name.
kustomize edit add configmap mnist-map-training --from-literal=name=my-train-1
  • Set some values for training hyper-parameters (number of training steps, batch size and learning rate).
kustomize edit add configmap mnist-map-training --from-literal=trainSteps=200
kustomize edit add configmap mnist-map-training --from-literal=batchSize=100
kustomize edit add configmap mnist-map-training --from-literal=learningRate=0.01
  • Set the manifest to use our storage bucket and TF image. In this example, we point the placeholder image name training-image at ${IMAGE_PATH}:latest.
kustomize edit set image training-image=${IMAGE_PATH}:latest
kustomize edit add configmap mnist-map-training --from-literal=modelDir=gs://${BUCKET_NAME}/my-model
kustomize edit add configmap mnist-map-training --from-literal=exportDir=gs://${BUCKET_NAME}/my-model/export
  • Ensure that the training code has permissions to read/write to the storage bucket.

Fortunately, Kubeflow already handles this: it creates a service account within the project as part of the deployment.

This can be verified by listing all the service accounts in the project.

gcloud --project=$PROJECT_ID iam service-accounts list

Kubeflow also adds a Kubernetes secret called user-gcp-sa to the cluster, which contains the credentials needed to authenticate as this service account from within the cluster.

kubectl describe secret user-gcp-sa

To access the storage bucket from the training container, we need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the JSON key file mounted from the secret (user-gcp-sa).

kustomize edit add configmap mnist-map-training --from-literal=secretName=user-gcp-sa
kustomize edit add configmap mnist-map-training --from-literal=secretMountPath=/var/secrets
kustomize edit add configmap mnist-map-training --from-literal=GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/user-gcp-sa.json
  • Update the base config files to use this style of authentication (the default changed in a recent Kubeflow update).
sed -i 's/default-editor/kf-user/g' ../**/*
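After running the edits above, kustomization.yaml should contain entries roughly like the following sketch (only the fields touched above are shown; the gs:// and image values mirror the examples in this post):

```yaml
configMapGenerator:
- name: mnist-map-training
  literals:
  - name=my-train-1
  - trainSteps=200
  - batchSize=100
  - learningRate=0.01
  - modelDir=gs://<BUCKET_NAME>/my-model
  - exportDir=gs://<BUCKET_NAME>/my-model/export
  - secretName=user-gcp-sa
  - secretMountPath=/var/secrets
  - GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/user-gcp-sa.json
images:
- name: training-image
  newName: <IMAGE_PATH>
  newTag: latest
```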

Finally, all the required parameters are set. Build the new configurations (kustomize prints the generated manifests to stdout, so you can inspect them first).

kustomize build .

Then, pipe the build output into kubectl to apply it. Note that kubectl apply -f - on its own reads from stdin, so the two commands are run together.

kustomize build . | kubectl apply -f -

After applying the new configurations, there should be a new TFJob on the cluster, with its chief pod named my-train-1-chief-0.

Use kubectl to get the information about the job.

kubectl describe tfjob

We can also stream the training logs from the pod running the container (once the TFJob is running).

kubectl logs -f my-train-1-chief-0

When training completes, you should see the exported model in the storage bucket.

gsutil ls -r gs://${BUCKET_NAME}/my-model/export