Channel: Recent Questions - Stack Overflow

Spark DataFrames and RDDs are making my code slower

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    vector_dict = {}

    # Building the vector dictionary from the first file
    d2v_rdd = spark.sparkContext.textFile("")
    for row in d2v_rdd.collect():
        row_elements = row.split("\t")
        vector_dict[row_elements[0]] = np.array(row_elements[1:][0])

    # Getting the dim features from the products file
    products_rdd = spark.sparkContext.textFile("")
    for row in products_rdd.collect():
        row_elements = row.split("\t")

The dataset has 431907 rows

I have the above lines of code implemented in three different forms:

  1. plain Python, using the with open("") context manager
  2. reading the file into a Spark DataFrame with spark.read.csv
  3. the RDD approach shown above
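For reference, a minimal sketch of the first variant (plain Python with a context manager), assuming the same tab-separated layout as the code above; the sample file path and its contents here are hypothetical stand-ins for the real input:

```python
import os
import tempfile

import numpy as np

# Hypothetical tab-separated sample standing in for the real 431907-row file
sample = "prod1\t0.1 0.2 0.3\nprod2\t0.4 0.5 0.6\n"
path = os.path.join(tempfile.mkdtemp(), "d2v.tsv")
with open(path, "w") as f:
    f.write(sample)

vector_dict = {}
with open(path) as f:  # single-process, sequential read; no Spark involved
    for row in f:
        row_elements = row.rstrip("\n").split("\t")
        # Mirrors the question's parsing: the value is the second column
        vector_dict[row_elements[0]] = np.array(row_elements[1:][0])

print(len(vector_dict))  # one entry per parsed row
```

This is only a sketch of variant 1 for comparison, not a claim about how the original file is formatted.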

I was expecting the code to be faster with a Spark DataFrame, but it turns out that the most efficient method is the context manager with open("").

Any reason why this might be happening?

