Sometimes we need to perform an operation on each data partition, and mapPartitions is one of the most common ways to do it. Occasionally the operation is complicated enough that the function implementing it needs more than one parameter.
However, according to the documentation, mapPartitions only accepts a function that takes an iterator as its single parameter.
So the question is simply: how do we make the function accept more than one parameter?
I came across this challenge recently. I thought it was a limitation in Spark that couldn't be worked around, until I found a solution that turned out to be extremely simple.
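To keep it brief, imagine that we have a dataframe that has been repartitioned into a certain number of partitions.
To apply an operation to each partition, we pass a function to mapPartitions. Because mapPartitions calls that function with only the partition's iterator, the extra parameters are supplied through an outer function that returns the actual per-partition function.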
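    def func(param_a, param_b):
        # outer function: captures the extra parameters
        def partition_func(iterator):
            # inner function: receives only the partition's iterator
            total = 0
            for row in iterator:
                total += (row['a'] + row['b']) * param_a
            total = total - param_b
            # mapPartitions expects an iterable back, so yield the result
            yield total
        return partition_func

    # compute the total value for each partition
    total = df.rdd.mapPartitions(func(param_a, param_b))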
The code above produces an RDD that holds the value of total for each partition. To see what the result looks like, we can group the elements by partition and collect them, for example with total.glom().collect().
That gives a result shaped like the following.
[ [total_for_partition_0], [total_for_partition_1], [total_for_partition_2], ]
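To see the pattern work without a Spark cluster, here is a minimal pure-Python sketch. It is only an emulation: emulate_map_partitions is a hypothetical helper standing in for mapPartitions followed by glom() and collect(), the sample rows and parameter values are made up, and it assumes param_b is subtracted once per partition. It also shows functools.partial from the standard library as an alternative way to bind the extra parameters, which should behave the same as the closure.

```python
from functools import partial


def make_partition_func(param_a, param_b):
    """Closure: capture the extra parameters, return a one-argument function."""
    def partition_func(iterator):
        total = 0
        for row in iterator:
            total += (row["a"] + row["b"]) * param_a
        # Yield (not return) so the result is iterable, as mapPartitions expects.
        yield total - param_b
    return partition_func


def partition_total(iterator, param_a, param_b):
    """Same logic as a plain function; extra parameters are bound with partial."""
    total = sum((row["a"] + row["b"]) * param_a for row in iterator)
    yield total - param_b


def emulate_map_partitions(partitions, func):
    """Hypothetical stand-in for rdd.mapPartitions(func).glom().collect()."""
    return [list(func(iter(part))) for part in partitions]


# Made-up data: three partitions of dict rows (the last one is empty).
partitions = [
    [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
    [{"a": 5, "b": 6}],
    [],
]

closure_result = emulate_map_partitions(partitions, make_partition_func(2, 1))
partial_result = emulate_map_partitions(
    partitions, partial(partition_total, param_a=2, param_b=1)
)

print(closure_result)  # [[19], [21], [-1]]
print(partial_result)  # [[19], [21], [-1]]
```

Note that the empty partition still produces a value (0 minus param_b), so every partition contributes exactly one inner list to the output.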
Thank you for reading.