Merging multiple data frames row-wise in PySpark


Aside from chaining multiple unions by hand, this is the only way to do it for DataFrames:

from functools import reduce  # in Python 3.x, reduce lives in functools
from pyspark.sql import DataFrame

def unionAll(*dfs):
    # Fold all of the passed DataFrames into one via repeated unionAll calls.
    # (In Spark 2.0+, DataFrame.union is the preferred name for this operation.)
    return reduce(DataFrame.unionAll, dfs)

unionAll(td2, td3, td4, td5, td6, td7, td8, td9, td10)

What happens is that reduce takes all the objects you passed as parameters and folds them together with unionAll (this reduce is Python's functools.reduce, not Spark's RDD reduce, although they work similarly), eventually collapsing everything into a single DataFrame.
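As a minimal, self-contained sketch of that call chain (the DataFrames df1 through df3 and their two-column schema are invented for the example), reduce(DataFrame.unionAll, ...) expands to df1.unionAll(df2).unionAll(df3):

from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Three small DataFrames sharing the same schema (hypothetical example data).
df1 = spark.createDataFrame([(1, "a")], ["id", "value"])
df2 = spark.createDataFrame([(2, "b")], ["id", "value"])
df3 = spark.createDataFrame([(3, "c")], ["id", "value"])

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

# Equivalent to df1.unionAll(df2).unionAll(df3)
unionAll(df1, df2, df3).show()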

If instead of DataFrames you have plain RDDs, you can pass a list of them directly to the union function of your SparkContext:
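For example, assuming sc is your SparkContext and rdd1 through rdd3 are hypothetical RDDs:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical RDDs with compatible contents.
rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize([3, 4])
rdd3 = sc.parallelize([5, 6])

# SparkContext.union accepts a list of RDDs, no reduce needed.
merged = sc.union([rdd1, rdd2, rdd3])
print(merged.collect())  # [1, 2, 3, 4, 5, 6]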

EDIT: For your purpose I propose a different method. Since you would have to repeat this whole union 10 times for your different folds for cross-validation, I would instead add a label indicating which fold each row belongs to, and then filter the single DataFrame for every fold based on that label.
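A rough sketch of that approach, assuming k = 10 folds; the column names id and fold, and the random fold assignment, are illustrative choices rather than anything from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; in practice this is your full, already-unioned dataset.
df = spark.createDataFrame([(i,) for i in range(100)], ["id"])

k = 10
# Assign each row a random fold label from 0 to k-1.
labeled = df.withColumn("fold", (F.rand() * k).cast("int"))

# Each fold's train/test split is then just a cheap filter.
for fold in range(k):
    test = labeled.filter(F.col("fold") == fold)
    train = labeled.filter(F.col("fold") != fold)
    # ... fit and evaluate on train/test here ...

This way the data is labeled once up front, and each fold costs only a filter instead of a fresh ten-way union.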
