Apache Spark [PART 28]: Accessing a Kerberized HDFS Cluster

A Spark application deployed to a cluster might need to access an HDFS cluster. To establish a secure connection, one may want to use a network authentication protocol such as Kerberos. Kerberos, however, adds a bit of complexity to the connection process. In this article I’m going to show you one of the cases my team and I encountered recently.

Just FYI, we deployed the Spark application through YARN. We therefore needed to set HADOOP_CONF_DIR and YARN_CONF_DIR, which, among other things, tell Spark where to find YARN’s resource manager.
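
For reference, here’s a minimal sketch of that setup. The /etc/hadoop/conf path is only an assumed example; point the variables at wherever your cluster’s client configuration files actually live. In practice these are usually exported in the shell that launches spark-submit, but the Python equivalent looks like this:

import os

# Assumed config location -- replace with your cluster's actual
# client configuration directory.
os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"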

One of the deployed drivers needed to connect to an HDFS cluster protected by Kerberos. Let’s take a look at how to connect to such a cluster using Python (via pyarrow).

from pyarrow import hdfs

# kerb_ticket points to the Kerberos ticket cache, e.g. the file
# produced by `kinit` (placeholder path -- the real one was set elsewhere).
kerb_ticket = "/tmp/krb5cc_0"

fs = hdfs.connect(host="namenode_address",  # NameNode host
                  port=0,                   # 0 means use the default port
                  user=None,
                  kerb_ticket=kerb_ticket,
                  driver="libhdfs3",
                  extra_conf=None)

When we executed the code above, it failed with an “HDFS connection failed” error. That was odd, since kerb_ticket was specified, and all the Kerberos-related properties in hdfs-site.xml and core-site.xml had been set correctly as well.

It turned out that we didn’t need the kerb_ticket at all, since the HDFS cluster already knew who was requesting access. According to the documentation, when the user is not specified, the requester is assumed to be the currently logged-in user. The failure also suggests that the Kerberos ticket we passed explicitly was invalid (or had expired).
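
Under that assumption, the fix was simply to drop the explicit kerb_ticket and let the connection pick up the ambient credentials of the logged-in user. A minimal sketch, using the same placeholder NameNode address as above:

from pyarrow import hdfs

# With user=None and no kerb_ticket, pyarrow authenticates as the currently
# logged-in user, relying on that user's Kerberos ticket cache (e.g. from `kinit`).
fs = hdfs.connect(host="namenode_address",
                  port=0,
                  driver="libhdfs3")

# Quick sanity check that the connection actually works.
print(fs.ls("/"))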