I’ve been trying to speed up the ensemble model’s prediction performance. I described the current approach in my previous post.
Basically, each classifier predicts on new data points through a Pandas UDF. Based on my own investigation, a Pandas UDF has so far been the fastest way to process a dataframe in a distributed manner. You can read more about my little comparison of Pandas UDFs vs Spark UDFs in this post. However, I was curious about performance tuning when dealing with big data, so I decided to look for another way to beat the time achieved by the current approach.
I came across the Voting Classifier, a scikit-learn estimator built for ensemble learning. It supports both hard and soft voting, and its usage is remarkably simple. You can find out more about it here.
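To show what that simplicity looks like, here is a minimal sketch on toy data (the estimators and data below are purely illustrative, not the actual models from this project):

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data, purely for illustration
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# voting='soft' averages the estimators' predict_proba outputs;
# voting='hard' would take a majority vote of predicted labels instead.
eclf = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('dt', DecisionTreeClassifier(random_state=0))],
    voting='soft')
eclf.fit(X, y)

proba = eclf.predict_proba(X)  # shape: (n_samples, n_classes)
print(proba.shape)
```

`fit` trains each estimator in turn, and `predict_proba` returns the weighted average of their class probabilities.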
To apply the Voting Classifier, the first thing I did was check whether the data structures it expects are compatible with those used in my current approach. I think the most effective way to do that is to read the source code! Fortunately, scikit-learn publishes its source code on GitHub; the Voting Classifier’s code can be found here.
To make it short, here’s the code snippet.
import time
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from sklearn.ensemble import VotingClassifier

def predict_ensemble(df, clfs):
    def _predict_pandas(*cols: pd.Series) -> pd.Series:
        X = pd.concat(cols, axis=1)
        clf_name, clf_obj, clf_fitted = zip(*clfs)
        eclf4 = VotingClassifier(estimators=list(zip(clf_name, clf_obj)),
                                 voting='soft',
                                 weights=[1, 1, 1],
                                 flatten_transform=True)
        # assign the fitted estimators manually since we've already trained the classifiers
        eclf4.estimators_ = list(clf_fitted)
        return pd.Series(eclf4.predict_proba(X.values)[:, 1])

    predict_udf = F.pandas_udf(_predict_pandas, DoubleType())
    feature_columns = ['F1', 'F2', 'F3', 'F4', 'F5', 'F6']
    predicted_df = df.withColumn('POSITIVE_PROBA', predict_udf(*feature_columns))
    return predicted_df

clfs = [('tmp', rf1, rf1._model), ('tmp', rf2, rf2._model), ('tmp', rf3, rf3._model)]

# start the timer
start = time.time()
predicted_df = predict_ensemble(df, clfs)
predicted_df.collect()
print('TIME NEEDED: ' + str(time.time() - start))
I ran the above code 6 times on the same test data of 102,000 instances; the average time needed was 2.444 s. I wasn’t impressed at all, since the difference from the current approach was extremely slight. FYI, the current approach takes approximately 2.437 s on average.
I checked the execution plan and here’s what I got.
== Optimized Logical Plan ==
Project [F1#0L, F2#1L, F3#2L, F4#3L, F5#4L, F6#5L, LABEL#6, pythonUDF0#83 AS POSITIVE_PROBA#71]
+- !ArrowEvalPython [_predict_pandas(F1#0L, F2#1L, F3#2L, F4#3L, F5#4L, F6#5L)], [F1#0L, F2#1L, F3#2L, F4#3L, F5#4L, F6#5L, LABEL#6, pythonUDF0#83]
   +- LogicalRDD [F1#0L, F2#1L, F3#2L, F4#3L, F5#4L, F6#5L, LABEL#6], false

== Physical Plan ==
*(1) Project [F1#0L, F2#1L, F3#2L, F4#3L, F5#4L, F6#5L, LABEL#6, pythonUDF0#83 AS POSITIVE_PROBA#71]
+- ArrowEvalPython [_predict_pandas(F1#0L, F2#1L, F3#2L, F4#3L, F5#4L, F6#5L)], [F1#0L, F2#1L, F3#2L, F4#3L, F5#4L, F6#5L, LABEL#6, pythonUDF0#83]
   +- Scan ExistingRDD[F1#0L, F2#1L, F3#2L, F4#3L, F5#4L, F6#5L, LABEL#6]
Well, it’s quite similar to what I got with the current approach. The only difference is the number of UDFs processed by Spark. As you can see from the physical plan above, ArrowEvalPython now carries a single UDF, [_predict_pandas(F1#0L, F2#1L, F3#2L, F4#3L, F5#4L, F6#5L)], whereas the current approach yields three UDFs, since each model’s prediction is executed independently. Even so, the execution time is practically the same.
Using this scikit-learn class has several limitations; I can say that because I’ve read its code on GitHub. One of them is that it requires every classifier to use the same features. That’s not very flexible, since in ensemble learning we might want to combine classifiers trained on different features. I browsed the internet for ways to address this issue, but I don’t think the workarounds preserve the optimization offered by the Voting Classifier, so the execution time might be the same or even longer.
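One such workaround, sketched below under my own assumptions (the `soft_vote` helper and its argument shapes are hypothetical, not part of scikit-learn or the approach above), is to skip VotingClassifier entirely and average the per-model probabilities manually, letting each already-fitted model read its own feature subset:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def soft_vote(df, models, weights=None):
    """Weighted soft vote over already-fitted models.
    `models` is a list of (fitted_classifier, feature_columns) pairs,
    so each model may consume a different subset of the dataframe."""
    probas = [clf.predict_proba(df[cols].values)[:, 1] for clf, cols in models]
    return np.average(np.vstack(probas), axis=0, weights=weights)

# Toy demonstration: two trees trained on different single features
df = pd.DataFrame({'F1': [0, 1, 2, 3], 'F2': [3, 2, 1, 0]})
y = [0, 0, 1, 1]
m1 = DecisionTreeClassifier(random_state=0).fit(df[['F1']].values, y)
m2 = DecisionTreeClassifier(random_state=0).fit(df[['F2']].values, y)
p = soft_vote(df, [(m1, ['F1']), (m2, ['F2'])])
print(p)  # averaged positive-class probabilities, one per row
```

This trades VotingClassifier’s built-in machinery for plain NumPy averaging, which is why I suspect it wouldn’t run any faster.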
Thanks for reading.