Apache Spark [PART 26]: Failure When Overwriting A Parquet File Might Result in Data Loss

There are several critical issues that can arise when using Spark. One of them is data loss when a write fails partway through.

Recently I ran into this issue while overwriting a Parquet file. Let me reproduce it in a simplified way. I used Spark in local mode.

Suppose we have a simple dataframe df.

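from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()  # local mode

# 100,000 identical rows with three string columns
df_elements = [
    ("row_a", "row_b", "row_c"),
] * 100000

df = spark.createDataFrame(df_elements, ["a", "b", "c"])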

Now let’s store the dataframe to a parquet file.

df.write.mode("overwrite").parquet(path_to_the_parquet_files)

Once the write finishes, you should see several part files (their names start with part-) in the output directory.
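To check the output from Python rather than a file browser, something like the sketch below could be used. It assumes the output directory is on the local filesystem (as in local mode) and that the data files carry Spark's usual part- prefix. `list_part_files` is a hypothetical helper name, not part of the Spark API.

```python
from pathlib import Path


def list_part_files(parquet_dir):
    """Return the sorted names of the part-* data files in an output directory.

    Marker files such as _SUCCESS and hidden checksum files are skipped
    because their names do not start with "part-".
    """
    return sorted(p.name for p in Path(parquet_dir).glob("part-*"))
```

Calling `list_part_files(path_to_the_parquet_files)` after the first write, and again after each later step, makes it easy to compare what is on disk.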

Now let’s make the overwrite fail in the middle.

df.write.mode("overwrite").parquet(path_to_the_parquet_files)

When the above code is running, just press Ctrl + C to stop it.

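Go back to path_to_the_parquet_files and you should find that all the previous files (from before the second parquet write) have been removed. In overwrite mode, Spark deletes the existing output before it writes the new files, so an interrupted job can leave you with neither the old data nor a complete new copy.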
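One way to reduce this risk, short of moving to a transactional table format, is to write the new data to a separate staging directory and only swap it into place once the write has succeeded. The following is a minimal sketch of that idea, not something Spark provides: `overwrite_via_staging` and the `.staging-` / `.backup-` directory names are hypothetical, and the swap uses plain `os.rename`, so it assumes the target is on the local filesystem.

```python
import os
import shutil
import uuid


def overwrite_via_staging(write_fn, target_dir):
    """Write into a sibling staging directory, then swap it into place.

    write_fn is any callable that writes the dataset to the path it is given.
    Assumption: target_dir lives on the local filesystem.
    """
    target_dir = os.path.abspath(target_dir)
    suffix = uuid.uuid4().hex
    staging_dir = f"{target_dir}.staging-{suffix}"
    backup_dir = f"{target_dir}.backup-{suffix}"

    try:
        # If this step fails or is interrupted, target_dir is still untouched.
        write_fn(staging_dir)
    except BaseException:
        # BaseException also covers KeyboardInterrupt (Ctrl + C).
        shutil.rmtree(staging_dir, ignore_errors=True)
        raise

    # Swap: move the old data aside, move the new data in, drop the old data.
    had_previous = os.path.exists(target_dir)
    if had_previous:
        os.rename(target_dir, backup_dir)
    os.rename(staging_dir, target_dir)
    if had_previous:
        shutil.rmtree(backup_dir)
```

With the dataframe from this post, the call would look like `overwrite_via_staging(lambda p: df.write.parquet(p), path_to_the_parquet_files)`. The swap is two renames rather than one atomic operation, so a crash between them would leave the old data in the backup directory instead of the target. It would still be recoverable, but by hand.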

I searched online to learn more about this issue and found a YouTube video titled "Delta Lake for Apache Spark - Why do we need Delta Lake for Spark?". It is worth watching if you want to know more.