Input DataFrame:
val input_json = """[{"orderid":"111","customers":{"customerId":"123"},"Offers":[{"Offerid":"1"},{"Offerid":"2"}]}]"""
val inputdataRdd = spark.sparkContext.parallelize(input_json :: Nil)
val inputdataRdddf = spark.read.json(inputdataRdd)
inputdataRdddf.show()
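For reference, Spark infers every leaf in the input as a string here (which is why the casts below matter); inputdataRdddf.printSchema() prints roughly:

root
 |-- Offers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Offerid: string (nullable = true)
 |-- customers: struct (nullable = true)
 |    |-- customerId: string (nullable = true)
 |-- orderid: string (nullable = true)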
Schema DataFrame:
val schema_json = """[{"orders":{"order_id":{"path":"orderid","type":"string","nullable":false},"customer_id":{"path":"customers.customerId","type":"int","nullable":false,"default_value":"null"},"offer_id":{"path":"Offers.Offerid","type":"string","nullable":false},"eligible":{"path":"eligible.eligiblestatus","type":"string","nullable":true,"default_value":"not eligible"}},"products":{"product_id":{"path":"product_id","type":"string","nullable":false},"product_name":{"path":"products.productname","type":"string","nullable":false}}}]"""
val schemaRdd = spark.sparkContext.parallelize(schema_json :: Nil)
val schemaRdddf = spark.read.json(schemaRdd)
schemaRdddf.show()
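Reading the schema JSON this way yields a single row with two struct columns, orders and products, so the per-column specs can be flattened out with a star select, e.g.:

// Each field under "orders" is a struct of (path, type, nullable,
// and default_value where the JSON supplied one).
schemaRdddf.select("orders.*").columns
// -> Array(customer_id, eligible, offer_id, order_id)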
Using the schema df, I want to read all the columns from the input DataFrame:
- If the nullable key is true, I want to populate the column with its default value (when the column is not present or holds no data). In the example above, eligible.eligiblestatus is not present in the input, so I want to populate it with the default value ("not eligible").
- I also want to change the data type of each column based on the type key defined in the schema JSON. E.g. customer_id is of type int in the schema JSON but arrives as a string in the input DataFrame, so I want to cast it to integer.
- The final column name should be taken from the key in the schema JSON. E.g. order_id is the key for the orderid attribute.
The final DataFrame should have columns like:
order_id: string, customer_id: int, offer_id: string (the array type cast to string), eligible: string
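Below is a minimal sketch of the kind of dynamic select I have in mind, assuming Spark 2.x with Scala; pathExists, ordersSpec, selectCols and finalDf are illustrative names, not existing APIs. It walks the flattened orders spec, checks each path against the input schema (descending through structs and arrays of structs), and either selects-and-casts the path or falls back to the spec's default_value. The nullable flag could additionally gate that fallback, but the sketch keeps it simple:

import org.apache.spark.sql.{Column, Row}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Illustrative helper: does every segment of a dotted path exist in the
// schema? Descends through structs and through array element types, so
// "Offers.Offerid" resolves even though Offers is an array of structs.
def pathExists(schema: StructType, segments: List[String]): Boolean = segments match {
  case Nil => true
  case head :: tail =>
    schema.fields.find(_.name == head).exists { field =>
      field.dataType match {
        case st: StructType               => pathExists(st, tail)
        case ArrayType(st: StructType, _) => pathExists(st, tail)
        case _                            => tail.isEmpty
      }
    }
}

// The whole spec sits in one row; each field under "orders" is a struct
// of (path, type, nullable, and default_value where the JSON supplied one).
val ordersSpec = schemaRdddf.select("orders.*")
val specRow    = ordersSpec.head()

val selectCols: Seq[Column] = ordersSpec.schema.fieldNames.toSeq.map { name =>
  val spec   = specRow.getAs[Row](name)
  val path   = spec.getAs[String]("path")
  val castTo = spec.getAs[String]("type") // "string" / "int" are valid cast targets
  val default =
    if (spec.schema.fieldNames.contains("default_value")) spec.getAs[String]("default_value")
    else null

  if (pathExists(inputdataRdddf.schema, path.split("\\.").toList))
    // Present in the input: select and cast. Offers.Offerid comes back as
    // array<string>, and casting a complex type to string flattens it.
    col(path).cast(castTo).alias(name)
  else
    // Absent from the input (e.g. eligible.eligiblestatus): fall back to
    // the spec's default_value.
    lit(default).cast(castTo).alias(name)
}

val finalDf = inputdataRdddf.select(selectCols: _*)
finalDf.show(false)
finalDf.printSchema()

On this input that should give order_id = 111, customer_id = 123 (as int), offer_id = the stringified array, and eligible = not eligible; the column order follows the spec's field order, so a final select can reorder if needed.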