Apache Spark [PART 23]: Sigma Operation in Spark’s Dataframe

1 minute read

Published:

Have you ever encountered a case where you need to compute the sum of a certain one-item operation? Consider the following example.

sum_{i=1}^{5} (X_i * 100) + Y_i

How would you implement such an operation using Spark’s dataframe APIs?

Good news is it’s really simple. You can just play with the columns to specify the operation, use a few methods from Spark’s SQL functions to compute the sigma, and use a dataframe’s API to derive the final result.

Let’s dive into the code.

Suppose you have the following dataframe.

df = spark.createDataFrame([(100, 200, 300), (200,300,150), (150,150,150), (100,100,100), (200,300,200), (300,200,100), (300,300,300), (100,300,100)], [‘COL_A’, ‘COL_B’, ‘COL_C’])

+-----+-----+-----+
|COL_A|COL_B|COL_C|
+-----+-----+-----+
|  100|  200|  300|
|  200|  300|  150|
|  150|  150|  150|
|  100|  100|  100|
|  200|  300|  200|
|  300|  200|  100|
|  300|  300|  300|
|  100|  300|  100|
+-----+-----+-----+

Next, let’s say that you would like to compute a division operation with the following numerator and denumerator.

numerator = sum_{i=1}^{N} COL_A(i) + (COL_B * 30)
denumerator = 30 * sum_{i=1}^{N} COL_B + COL_C

Let’s transform the above operation into code.

from pyspark.sql import functions as F

# numerator
numerator = F.sum(F.col(COL_A) + F.col(COL_B) * 30)

# denumerator
denumerator = F.sum(F.col(COL_C) + F.col(COL_B)) * 30

Afterwards, let’s combine them to get the final result.

df.select(numerator / denumerator).show()

We get the following dataframe.

+-----------------------------------------------------------+
|(sum((COL_A + (COL_B * 30))) / (sum((COL_C + COL_B)) * 30))|
+-----------------------------------------------------------+
|                                         0.5841025641025641|
+-----------------------------------------------------------+

In case you’re curious, you might want to check what would be the result if we just pass either the numerator or denumerator.

df.select(numerator).show()

Which will give the following result.

+---------------------------+
|sum((COL_A + (COL_B * 30)))|
+---------------------------+
|                      56950|
+---------------------------+

Please try by yourself for the denumerator :)

Last but not least, as you can see from the resulting dataframe, the column name is not tidy. How would you transform those column names into understandable words?

Fortunately, we can use alias operation to modify the column name.

operation = (numerator / denumerator).alias(division_operation)

df.select(operation).show()

You’ll get the following result.

+------------------+
|division_operation|
+------------------+
|0.5841025641025641|
+------------------+

Thanks for reading.