Apache Spark [PART 25]: Resolving Attributes Data Inconsistency with Union By Name

In my previous article, Apache Spark [PART 21]: Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data, I showed that attribute data can end up inconsistent when two data frames are combined with union after a left-anti join. This happens because union matches columns by position, so two data frames with the same columns in a different order get misaligned. The fix described there is simple: use select to reorder the columns of one data frame so they match the other before the union. Here’s a simple example.

unioned_df = joined_df.union(df.select(*joined_df.columns))

However, I recently did a little digging in PySpark’s GitHub repo. In the DataFrame module I found a method called unionByName. Its docstring includes a short statement explaining what it does: The difference between this function and :func:`union` is that this function resolves columns by name (not by position).

Let’s take a look at a simple example (taken from Spark’s GitHub repo):

>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])

>>> df1.unionByName(df2).show()

+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
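
To make the "by name, not by position" idea concrete, here’s a minimal pure-Python sketch of the two behaviours. This is not Spark code and not how Spark implements it. The (columns, rows) pairs and the helper names union_by_position and union_by_name are hypothetical and exist only for illustration.

```python
# Pure-Python model of the two union semantics; this is NOT Spark code.
# A "frame" here is a hypothetical (columns, rows) pair used only for illustration.

def union_by_position(left, right):
    """Mimic union: keep the left column names, append the right rows as-is."""
    left_cols, left_rows = left
    _, right_rows = right
    return left_cols, left_rows + right_rows


def union_by_name(left, right):
    """Mimic unionByName: reorder each right row to match the left column order."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if sorted(left_cols) != sorted(right_cols):
        raise ValueError("both frames must have the same set of column names")
    # For each left column, find where that column sits in the right frame.
    index_of = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in index_of) for row in right_rows]
    return left_cols, left_rows + reordered


df1 = (["col0", "col1", "col2"], [(1, 2, 3)])
df2 = (["col1", "col2", "col0"], [(4, 5, 6)])

by_position = union_by_position(df1, df2)
by_name = union_by_name(df1, df2)

print(by_position[1])  # second row keeps df2's order, so values land under the wrong names
print(by_name[1])      # second row is rearranged to follow df1's column order
```

In this model the by-name version produces (6, 4, 5) as its second row, the same row that unionByName shows in the output above, while the positional version leaves it as (4, 5, 6).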

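What does this mean? Notice that the second row is (6, 4, 5) rather than (4, 5, 6): the values from df2 were matched to df1’s columns by name, not by position. So we have two solutions here: either use the select approach (as described in the previous article) or simply use the unionByName method.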

Thank you for reading.