```python
import numpy as np

# Build a dict mapping each key to its vector from the doc2vec file
d2v_rdd = spark.sparkContext.textFile("")
for row in d2v_rdd.collect():
    row_elements = row.split("\t")
    vector_dict[row_elements[0]] = np.array(row_elements[1:])

# Getting the dim features from the products file
products_rdd = spark.sparkContext.textFile("")
for row in products_rdd.collect():
    row_elements = row.split("\t")
```

The dataset has 431,907 rows.
I have the above lines of code implemented in three different forms:

- plain Python, reading the file with `with open("")`
- reading it into a Spark DataFrame with `spark.read.csv`
- the RDD approach shown above
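For reference, a minimal sketch of the plain-Python variant I am comparing against (the function name `load_vectors` is a placeholder of mine, and the real file path is omitted above; the parsing mirrors the RDD code):

```python
import numpy as np

def load_vectors(path):
    """Read a tab-separated file of `key\tv1\tv2...` lines into a dict of numpy vectors."""
    vector_dict = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            # first column is the key, the remaining columns are the vector components
            vector_dict[parts[0]] = np.array(parts[1:], dtype=float)
    return vector_dict
```
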
I expected the Spark DataFrame version to be the fastest, but it turns out the most efficient method is the plain-Python context manager (`with open("")`).

Why might this be happening?