
Caching Spark Dataframe for speed enhancement

Ask Time:2018-06-08T21:59:01         Author:Clock Slave


I have a function that joins a list of dataframes to a base dataframe and returns a dataframe. I am trying to reduce the time this operation takes. Since I was joining multiple times using the base dataframe, I cached it, but the runtime is still similar. This is the function I am using:

from pyspark import StorageLevel

def merge_dataframes(base_df, df_list, id_col):
    """
    Joins multiple dataframes using an identifier variable common across datasets
    :param base_df: everything will be added to this dataframe
    :param df_list: dfs that have to be joined to main dataset
    :param id_col: the identifier column
    :return: dataset with all joins
    """
    base_df.persist(StorageLevel.MEMORY_AND_DISK)
    for each_df in df_list:
        base_df = base_df.join(each_df, id_col)
    base_df.unpersist()
    return base_df
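
For context, here is a minimal sketch of how I call the function; the data and the "id" column below are just placeholder examples, not my real datasets:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("merge-example").getOrCreate()

# Hypothetical base dataframe plus two small dataframes sharing the "id" column
base_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "base_val"])
df_list = [
    spark.createDataFrame([(1, 10), (2, 20)], ["id", "x"]),
    spark.createDataFrame([(1, 0.5), (2, 0.7)], ["id", "y"]),
]

merged = merge_dataframes(base_df, df_list, "id")
merged.show()  # the joins only execute when an action such as show()/count() runs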

I was surprised to get similar results after caching. What's the reason behind this, and what can I do to make this operation take less time?

Also, since the datasets I am currently using are relatively small (~50k records), I don't have an issue with caching datasets as and when needed, as long as I decache them afterwards.
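
By "decache" I just mean releasing the cached data once it has been used, roughly like this (a minimal sketch with a hypothetical df):

df = df.persist(StorageLevel.MEMORY_AND_DISK)  # mark the dataframe for caching
df.count()                                     # an action materialises the cache
# ... reuse df across several computations ...
df.unpersist()                                 # release the cached blocks when done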

Author: Clock Slave. Reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/50762320/caching-spark-dataframe-for-speed-enhancement