Apache Spark [PART 29]: Multiple Extra Java Options for Spark Submit Config Parameter

3 minute read


There's a case where we need to pass multiple extra Java options as one of the configurations for the Spark driver and executors. Here's an example:

path/to/spark-submit \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development"
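Each -D flag becomes a JVM system property that the driver and executor JVMs can read at runtime. As a quick local illustration (assuming a HotSpot JDK is on your PATH), you can ask a plain JVM to print its properties:

# -XshowSettings:properties makes the JVM print its system property table,
# including user-defined -D properties, while -version makes it exit cleanly.
java -Drun.mode=development -XshowSettings:properties -version 2>&1 | grep run.mode
#     run.mode = development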

CASE A

The first spark-submit command above runs fine. However, what if you need to add several additional spark-submit parameters, such as the following:

path/to/spark-submit \
--master yarn \
--name my-spark-app \
--files conf/spark-defaults.conf \
path/to/job/file \
[additional application parameters]

Let's add the extra Java options to the above spark-submit command.

path/to/spark-submit \
--master yarn \
--name my-spark-app \
--files conf/spark-defaults.conf \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development" \
path/to/job/file \
[additional application parameters]

The above spark-submit command threw an error when I executed it. The error message was something like Unrecognized -Drun.mode. My initial conjecture was that -Drun.mode was not read by Spark. So, how do you pass multiple extra Java options when submitting a job via spark-submit?

It turned out that the order of the --conf options matters. Specifically, the --conf entries for the extra Java options should be placed immediately after the path to spark-submit. Therefore, the command should be modified as follows:

path/to/spark-submit \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development" \
--master yarn \
--name my-spark-app \
--files conf/spark-defaults.conf \
path/to/job/file \
[additional application parameters]

The above command executed well.
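If you want to confirm which configuration values spark-submit actually picked up, its --verbose flag prints the parsed arguments and resolved Spark properties before the application starts. A minimal sketch:

# --verbose makes spark-submit print its parsed arguments and the resolved
# Spark properties, so you can check that both -D options survived intact.
path/to/spark-submit \
--verbose \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development" \
--master yarn \
path/to/job/file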

CASE B

Now, let's consider a further case. Say you want to add other --conf arguments that are not related to the extra Java options. Here's an example of using Kerberos as the network authentication protocol.

--conf spark.yarn.keytab="<path_to_keytab_file>" \
--conf spark.yarn.principal="<kerberos_principal_name>"

Note one difference between the Kerberos configuration and the extra Java options: the Kerberos entries place the quotation marks around the value only (after spark.yarn.keytab= and spark.yarn.principal=), while the extra Java options wrap the entire key=value pair in quotes. The quoting matters because the extra Java options contain spaces, and the shell would otherwise split them into separate arguments.
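To see exactly what the shell hands to spark-submit, you can print the arguments one per line. This is a minimal sketch with hypothetical -Da/-Db properties:

# printf '%s\n' prints each argument it receives on its own line, which
# exposes how the shell split the words (hypothetical -Da/-Db properties).
printf '%s\n' --conf spark.driver.extraJavaOptions=-Da=1 -Db=2
# --conf
# spark.driver.extraJavaOptions=-Da=1
# -Db=2    <- split off as a separate argument

printf '%s\n' --conf "spark.driver.extraJavaOptions=-Da=1 -Db=2"
# --conf
# spark.driver.extraJavaOptions=-Da=1 -Db=2    <- kept together by the quotes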

The solution to such a case is still the same: the --conf entries for the extra Java options should be placed at the top, before any other spark-submit parameters. Therefore, the command should be configured as follows:

path/to/spark-submit \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Drun.mode=development" \
--master yarn \
--name my-spark-app \
--files conf/spark-defaults.conf \
--conf spark.yarn.keytab="<path_to_keytab_file>" \
--conf spark.yarn.principal="<kerberos_principal_name>" \
path/to/job/file \
[additional application parameters]
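Before submitting, it's worth sanity-checking that the keytab file really contains the principal you plan to use. A minimal sketch, assuming the MIT Kerberos client tools are installed:

# klist -kt lists the principals (with timestamps) stored in a keytab file.
klist -kt "<path_to_keytab_file>"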

CASE C

Based on the previous cases, what if you want to create a script for spark-submit, something like a run.sh file? Using such a file enables you to submit a job in the following way:

./run.sh \
[additional_spark_submit_parameters] \
path/to/job/file \
[additional_application_parameters]

The run.sh file below is simplified; some required spark-submit parameters, such as --files for the Kerberos JAAS.conf and keytab files, are omitted for brevity. You should add the required parameters when applying this example.

Here’s how you can make it.

FILE: run.sh
=========

#!/bin/bash

export SPARK_SUBMIT_PATH="<please_fill>"
export KERBEROS_KEYTAB_FILE="<please_fill>"
export KERBEROS_PRINCIPAL_NAME="<please_fill>"

SPARK_EXTRA_JAVA_OPTIONS="-Dlog4j.configuration=log4j.properties"
SPARK_LOG4J_PROPERTIES="log4j.properties"

SPARK_SUBMIT_PARAMS=""

# Optional flags are consumed from the front of the argument list, so they
# must come first (and in this order) when calling the script.
USE_RUN_MODE=false
if [ "$1" == "use-run-mode" ]; then
  USE_RUN_MODE=true
  SPARK_EXTRA_JAVA_OPTIONS+=" -Drun.mode=development"
  shift
fi

USE_KERBEROS=false
if [ "$1" == "use-kerberos" ]; then
  USE_KERBEROS=true
  SPARK_SUBMIT_PARAMS+="--conf spark.yarn.keytab=${KERBEROS_KEYTAB_FILE}"
  SPARK_SUBMIT_PARAMS+=" --conf spark.yarn.principal=${KERBEROS_PRINCIPAL_NAME}"
  shift
fi

# Append the remaining arguments (spark-submit parameters, the job file, and
# the application parameters). Note: this simple concatenation does not
# preserve arguments that contain spaces.
SPARK_SUBMIT_PARAMS+=" ${@}"

echo "Running spark submit..."

# The extra Java options come first; ${SPARK_SUBMIT_PARAMS} is intentionally
# left unquoted so it expands into separate arguments.
${SPARK_SUBMIT_PATH} \
--conf "spark.driver.extraJavaOptions=${SPARK_EXTRA_JAVA_OPTIONS}" \
--conf "spark.executor.extraJavaOptions=${SPARK_EXTRA_JAVA_OPTIONS}" \
${SPARK_SUBMIT_PARAMS}

To run the above script file, use the following commands.

> chmod +x run.sh

> ./run.sh \
[additional spark submit parameters] \
path/to/job/file \
[additional application parameters]

The above command will place the --conf entries for the extra Java options at the top, before any other spark-submit parameters.
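For example, to enable both optional flags (the script consumes them from the front of the argument list, in this order):

./run.sh \
use-run-mode \
use-kerberos \
[additional spark submit parameters] \
path/to/job/file \
[additional application parameters]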

Thank you for reading.