I came across some odd behavior when applying F.col() to certain dataframe operations in PySpark v2.4.0.
Suppose a dataframe df has three columns, namely A, B, and C. Consider the following operations.
selected_df = df.select(*['A', 'B'])
filtered_df = selected_df.filter(F.col('C') == 'x')
The above operations simply do the following:
- Select column A and B from the original dataframe (df),
- Store the resulting dataframe to selected_df,
- Retrieve rows from selected_df where the value of column C equals 'x',
- Store the resulting dataframe to filtered_df
These operations run successfully, and the resulting dataframe is correct, even though selected_df no longer contains column C.
However, if we replace F.col() with another style of column reference, an error is thrown.
filtered_df = selected_df.filter(selected_df['C'] == 'x')
The above operation throws the following exception: pyspark.sql.utils.AnalysisException: 'Cannot resolve column name "C" among (A, B);'.
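The difference in when each version fails hints at how the two references are built. As a rough mental model (plain-Python toy classes, not Spark's actual implementation; every name below is hypothetical), selected_df['C'] looks the column up in the dataframe's schema immediately, while F.col('C') merely records a name whose resolution is deferred:

```python
# Toy model of the two column-reference styles. These classes are
# hypothetical and only illustrate eager vs. deferred name resolution.

class UnresolvedColumn:
    """Like F.col('C'): records a name with no validation yet."""
    def __init__(self, name):
        self.name = name

class ToyDataFrame:
    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, name):
        # Like selected_df['C']: resolved eagerly against this
        # dataframe's schema, so a missing column fails immediately.
        if name not in self.columns:
            raise KeyError(
                f'Cannot resolve column name "{name}" among '
                f'({", ".join(self.columns)})')
        return UnresolvedColumn(name)  # stands in for a resolved reference

selected_df = ToyDataFrame(['A', 'B'])

col_ref = UnresolvedColumn('C')   # F.col('C'): succeeds, resolution deferred
try:
    selected_df['C']              # selected_df['C']: fails right away
except KeyError as e:
    print(e)
```

In this toy model, only the bracket-style lookup can fail at expression-construction time; the F.col()-style reference has to wait until something later tries to resolve it.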
Okay. Let’s move on to another dataframe operation, such as withColumn combined with a conditional check.
new_df = selected_df.withColumn('D', F.when(F.col('C') == 'x', F.lit('YES')).otherwise(F.lit('NO')))
This operation fails with an exception stating that Spark cannot resolve column C given input columns A and B. Replacing F.col('C') with selected_df['C'] throws the same error.
Let’s try another operation.
new_df = selected_df.withColumn('D', F.col('C'))
new_df = selected_df.withColumn('D', selected_df['C'])
The result? Both fail with the same exception as before.
Now the question is: why does referencing a non-existent column (in this case, column C) via F.col() succeed in a filter operation, yet fail in the other operations?
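My current guess, and it is only a guess sketched as a plain-Python toy rather than anything verified against Spark's analyzer, is that filter gets a fallback step the other operations lack: when a predicate mentions a column missing from the current projection, resolution can reach back to the parent plan's columns, whereas withColumn resolves strictly against the current schema. The two functions below are entirely hypothetical and exist only to make that asymmetry concrete:

```python
# Hypothetical sketch of the asymmetry, NOT Spark's real analyzer:
# a filter may fall back to the parent's columns; withColumn may not.

def toy_filter(current_cols, parent_cols, predicate_col):
    """Resolve a filter predicate, falling back to the parent plan."""
    if predicate_col in current_cols or predicate_col in parent_cols:
        return f"Filter({predicate_col}) resolved"
    raise ValueError(f'Cannot resolve column name "{predicate_col}"')

def toy_with_column(current_cols, expr_col):
    """Resolve a withColumn expression against current columns only."""
    if expr_col in current_cols:
        return f"Project(+{expr_col}) resolved"
    raise ValueError(
        f'Cannot resolve column name "{expr_col}" among '
        f'({", ".join(current_cols)})')

parent_cols = ['A', 'B', 'C']    # df
current_cols = ['A', 'B']        # selected_df

print(toy_filter(current_cols, parent_cols, 'C'))   # succeeds
try:
    toy_with_column(current_cols, 'C')              # fails, as observed
except ValueError as e:
    print(e)
```

Under this toy model the observed behavior falls out directly, but whether Spark's analyzer actually works this way is exactly what I am asking.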
Have you ever experienced the same use case? I’d love to hear your thoughts on this.
Thank you for reading.