ML on Kubeflow - Part 1: Creating a Kubeflow Cluster


Distributing machine learning (ML) workloads across multiple worker nodes becomes critical as datasets grow larger and ML models become more complex. Unfortunately, distributing ML workloads also adds complexity to the DevOps side of an ML system, since we need to provision and manage many compute nodes.

The good news is that Kubernetes makes this task much simpler. Kubernetes is a production-ready platform that gives developers a simple API for deploying code to a compute cluster.

Kubeflow is an open-source tool that aims to make running ML workloads (training, serving, experimenting in Jupyter notebooks, etc.) on Kubernetes simple, portable, and scalable.


Introduction

In this post, we’re going to look at how to train and serve a TensorFlow model that recognizes handwritten digits (using the MNIST dataset) on Kubeflow. We’ll also look at how to deploy a web interface that lets users interact with the model.

One thing to note is that we’ll use a single CPU-only node for training.


Download the Project Files

Before playing with Kubeflow, let’s download the playground repository.

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

cd training-data-analyst/courses/data-engineering/kubeflow-examples

Set Up Environment Variables

Next, we set up the environment variables that will be used throughout our exploration.

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
gcloud config set project $PROJECT_ID

export ZONE=us-central1-a
gcloud config set compute/zone $ZONE

Go to the mnist directory as our working directory.

cd ./mnist

And set an environment variable for the working directory.

WORKING_DIR=$PWD

Install Kustomize

Kubeflow uses a tool called kustomize to set up an application so that the same code can be deployed across different environments.

mkdir $WORKING_DIR/bin
wget https://storage.googleapis.com/cloud-training/dataengineering/lab_assets/kubeflow-resources/kustomize_2.0.3_linux_amd64 -O $WORKING_DIR/bin/kustomize
chmod +x $WORKING_DIR/bin/kustomize
PATH=$PATH:${WORKING_DIR}/bin
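The steps above follow a common pattern: download a standalone binary into a project-local bin directory, mark it executable, and append that directory to PATH so the tool is resolvable by name. A minimal local sketch of the same pattern (the tool name `mytool` and its output are stand-ins, not a real download):

```shell
# Sketch: install a standalone binary into a project-local bin directory.
# "mytool" is a stand-in for a real downloaded binary such as kustomize.
WORKING_DIR=$(mktemp -d)
mkdir -p "$WORKING_DIR/bin"

# Stand-in for the wget download step.
printf '#!/bin/sh\necho "mytool v1.0"\n' > "$WORKING_DIR/bin/mytool"
chmod +x "$WORKING_DIR/bin/mytool"

# Make the tool resolvable by name, as done for kustomize above.
PATH=$PATH:$WORKING_DIR/bin
mytool
```

Note that the PATH change only lasts for the current shell session; a persistent install would append the export to a shell profile instead.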

Install kfctl

kfctl is the Kubeflow CLI that can be used to set up a Kubernetes cluster with Kubeflow installed, or to deploy Kubeflow to an existing Kubernetes cluster.

Run the following to download, unpack, and add kfctl to the $PATH.

wget -P /tmp https://storage.googleapis.com/cloud-training/dataengineering/lab_assets/kubeflow-resources/kfctl_v1.0-0-g94c35cf_linux.tar.gz
tar -xvf /tmp/kfctl_v1.0-0-g94c35cf_linux.tar.gz -C ${WORKING_DIR}/bin
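The download-and-unpack step is another instance of the same idea: fetch a tarball into /tmp, then extract it with `-C` into the bin directory that is already on PATH. A local sketch using a synthetic tarball (the `kfctl-demo` name and its output are illustrative, not the real kfctl archive):

```shell
# Sketch: unpack a tarball into a target bin directory, as done for kfctl.
# The tarball here is created locally; in the real flow it comes from wget.
DEMO_DIR=$(mktemp -d)
mkdir -p "$DEMO_DIR/bin" "$DEMO_DIR/src"

printf '#!/bin/sh\necho "kfctl-demo"\n' > "$DEMO_DIR/src/kfctl-demo"
chmod +x "$DEMO_DIR/src/kfctl-demo"
tar -czf /tmp/kfctl-demo.tar.gz -C "$DEMO_DIR/src" kfctl-demo

# Extract into the bin directory, mirroring the -C flag usage above.
# tar preserves the executable bit, so the binary is ready to run.
tar -xzf /tmp/kfctl-demo.tar.gz -C "$DEMO_DIR/bin"
"$DEMO_DIR/bin/kfctl-demo"
```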

Enable GKE API

Run the following to enable the GKE API for your project.

gcloud services enable container.googleapis.com

Create a Kubeflow Cluster

To create a Kubernetes cluster with Kubeflow installed on GKE (Kubeflow cluster) using kfctl, we’ll need to perform these steps:

  • Create an application directory
  • Create configuration files for deployment
  • Apply the configurations

Create an Application Directory

According to the docs, the Kubeflow application directory is the directory where you choose to store your Kubeflow configurations during deployment.

The directory contains the following files and directories:

  • ${CONFIG_FILE}: a YAML file that stores the primary Kubeflow configuration in the form of a KfDef Kubernetes object.
    • This file is a copy of the GitHub-based configuration YAML file that you used when deploying Kubeflow.
    • When you first run kfctl build or kfctl apply, kfctl creates a local version of the configuration file at ${CONFIG_FILE}, which you can further customize.
    • The YAML defines each Kubeflow application as a kustomize package.
  • <platform-name>_config: a directory that contains configurations specific to your chosen platform or cloud provider. For example, gcp_config or aws_config.
    • This directory may or may not be present, depending on your setup.
    • The directory is created when you run kfctl build or kfctl apply.
    • To customize these configurations, you can modify parameters in your ${CONFIG_FILE}, and then run kfctl apply to apply the configuration to your Kubeflow cluster.
  • kustomize: a directory that contains Kubeflow application manifests. That is, the directory contains the kustomize packages for the Kubeflow applications that are included in your deployment.
    • The directory is created when you run kfctl build or kfctl apply.
    • To customize these configurations, you can modify parameters in your ${CONFIG_FILE}, and then run kfctl apply to apply the configuration to your Kubeflow cluster.

Run the following to create the environment variables for the deployment.

# Set the username and pass for the deployment
export KUBEFLOW_USERNAME=<PLEASE_FILL>
export KUBEFLOW_PASSWORD=<PLEASE_FILL>

# Set the URI of the configuration file to use when deploying Kubeflow.
export CONFIG_URI=https://storage.googleapis.com/cloud-training/dataengineering/lab_assets/kubeflow-resources/kfctl_gcp_basic_auth.v1.0.1.yaml

# Set KF_NAME to the name of your Kubeflow deployment.
# You also use this value as the directory name when creating your configuration directory.
# For example, your deployment name can be 'my-kubeflow' or 'kf-test'.
export KF_NAME=kubeflow

# Set the Kubeflow application directory for this deployment.
export KF_DIR=${WORKING_DIR}/${KF_NAME}
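Since the rest of the walkthrough depends on these variables, it can be worth adding a quick guard that fails fast when one is missing. Here's a small sketch using the same variable names as above (the example values are placeholders for illustration only):

```shell
# Example values for illustration; in the real flow these are set
# interactively (KUBEFLOW_USERNAME/PASSWORD are filled in by you).
export KUBEFLOW_USERNAME=admin
export KUBEFLOW_PASSWORD=secret
export KF_NAME=kubeflow
export KF_DIR=/tmp/demo/${KF_NAME}

# Fail fast if any required deployment variable is unset or empty.
for var in KUBEFLOW_USERNAME KUBEFLOW_PASSWORD KF_NAME KF_DIR; do
  eval "val=\${$var:-}"
  if [ -z "$val" ]; then
    echo "missing required variable: $var" >&2
    exit 1
  fi
done
echo "all deployment variables are set"
```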

Create Configuration Files

Now, let’s create configuration files for the deployment in the Kubeflow application directory (KF_DIR). Note that this step hasn’t deployed Kubeflow yet.

# Create the Kubeflow configurations
mkdir -p ${KF_DIR}

cd ${KF_DIR}

kfctl build -V -f ${CONFIG_URI}

Just FYI: kfctl build creates the configuration files that define the various resources in your deployment, but it doesn’t deploy Kubeflow. You only need this command if you want to edit the resources before running kfctl apply.

Apply the Configurations

After creating the configuration for the deployment, let’s apply those configurations to our Kubeflow cluster.

First, adjust the machine type and GKE version in the gcp_config directory:

sed -i 's/n1-standard-8/n1-standard-4/g' gcp_config/cluster-kubeflow.yaml
sed -i 's/1.14/1.16.13-gke.401/g' gcp_config/cluster-kubeflow.yaml

Then set an environment variable for the local configuration file:

export CONFIG_FILE=${KF_DIR}/kfctl_gcp_basic_auth.v1.0.1.yaml

Finally, apply the configurations:

kfctl apply -V -f ${CONFIG_FILE}
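The two sed edits above simply rewrite the machine-type and GKE-version strings in place. Their effect can be checked locally on a toy YAML fragment (the file below is a stand-in, not the real cluster-kubeflow.yaml, and the key names are hypothetical):

```shell
# Sketch: verify the sed substitutions on a minimal stand-in YAML file.
cat > /tmp/cluster-demo.yaml <<'EOF'
machineType: n1-standard-8
clusterVersion: "1.14"
EOF

# Same substitutions as applied to gcp_config/cluster-kubeflow.yaml above.
sed -i 's/n1-standard-8/n1-standard-4/g' /tmp/cluster-demo.yaml
sed -i 's/1.14/1.16.13-gke.401/g' /tmp/cluster-demo.yaml

cat /tmp/cluster-demo.yaml
```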

Cluster creation may take up to 10 minutes to complete.

Do not proceed until the command prompt is returned in the console.

You can view several instances created as part of the deployment in the GCP Console:

  • In Deployment Manager, two deployment objects will appear: kubeflow-storage and kubeflow
  • In Kubernetes Engine, a cluster will appear: kubeflow
  • In the Workloads section, there are a number of Kubeflow components
  • In the Services & Ingress section, there are a number of Kubeflow services

After the cluster is set up properly, use gcloud to fetch its credentials so that we can communicate with it using kubectl.

gcloud container clusters get-credentials ${KF_NAME} --zone ${ZONE} --project ${PROJECT_ID}

Here’s the sample output.

Fetching cluster endpoint and auth data.
kubeconfig entry generated for kubeflow.

Switch to the kubeflow Namespace

Connect to the namespace of the Kubeflow cluster.

kubectl config set-context $(kubectl config current-context) --namespace=kubeflow

Here’s the sample output.

Context "gke_<PROJECT_ID>_<ZONE>_kubeflow" modified.

We’re now connected to the Kubeflow cluster. Let’s view all the deployed resources on the cluster.

kubectl get all

Here’s the sample output.

replicaset.apps/mysql-6bcbfbb6b8                                         1         1         1       10m
replicaset.apps/notebook-controller-deployment-5c55f5845b                1         1         1       11m
replicaset.apps/profiles-deployment-7665cbdf48                           1         1         1       9m46s
replicaset.apps/pytorch-operator-cf8c5c497                               1         1         1       11m
replicaset.apps/seldon-controller-manager-6b4b969447                     1         1         1       9m33s
replicaset.apps/spark-operatorsparkoperator-76dd5f5688                   1         1         1       12m
replicaset.apps/spartakus-volunteer-6f547dbc9f                           1         1         1       11m
replicaset.apps/tensorboard-5f685f9d79                                   1         1         1       10m
replicaset.apps/tf-job-operator-5fb85c5fb7                               1         1         1       10m
replicaset.apps/workflow-controller-689d6c8846                           1         1         1       12m
NAME                                                        READY   AGE
statefulset.apps/admission-webhook-bootstrap-stateful-set   1/1     12m
statefulset.apps/application-controller-stateful-set        1/1     13m
statefulset.apps/kfserving-controller-manager               1/1     11m
statefulset.apps/metacontroller                             1/1     12m
NAME                                  COMPLETIONS   DURATION   AGE
job.batch/spark-operatorcrd-cleanup   1/1           15s        12m