I am trying to count the distinct number of entities (customers) over several different date ranges, and I need to understand how Spark performs this operation. My code:
val distinct_daily_cust_12month = sqlContext.sql(
  s"""select distinct day_id, txn_type, customer_id
      from ${db_name}.fact_customer
      where day_id >= '${start_last_12month}' and day_id <= '${start_date}'
        and txn_type not in (6, 99)""")

val category_mapping = sqlContext.sql(s"select * from datalake.category_mapping")

val daily_cust_12month_ds = distinct_daily_cust_12month
  .join(broadcast(category_mapping),
    distinct_daily_cust_12month("txn_type") === category_mapping("id"))
  .select("category", "sub_category", "customer_id", "day_id")

daily_cust_12month_ds.createOrReplaceTempView("daily_cust_12month_ds")

val total_cust_metrics = sqlContext.sql(s"""
  select 'total' as category,
    count(distinct(case when day_id = '${start_date}' then customer_id end)) as yest,
    count(distinct(case when day_id >= '${start_week}' and day_id <= '${end_week}' then customer_id end)) as week,
    count(distinct(case when day_id >= '${start_month}' and day_id <= '${start_date}' then customer_id end)) as mtd,
    count(distinct(case when day_id >= '${start_last_month}' and day_id <= '${end_last_month}' then customer_id end)) as ltd,
    count(distinct(case when day_id >= '${start_last_6month}' and day_id <= '${start_date}' then customer_id end)) as lsm,
    count(distinct(case when day_id >= '${start_last_12month}' and day_id <= '${start_date}' then customer_id end)) as ltm
  from daily_cust_12month_ds""")
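For reference, I believe the same final aggregation can be expressed directly through the DataFrame API instead of SQL (a rough sketch, not benchmarked; it assumes the same daily_cust_12month_ds and date-string variables as above):

import org.apache.spark.sql.functions.{col, countDistinct, lit, when}

// Each when(...) yields customer_id inside its window and null outside it;
// countDistinct ignores nulls, so this mirrors the conditional SQL counts.
val total_cust_metrics_df = daily_cust_12month_ds.agg(
    countDistinct(when(col("day_id") === start_date, col("customer_id"))).as("yest"),
    countDistinct(when(col("day_id").between(start_week, end_week), col("customer_id"))).as("week"),
    countDistinct(when(col("day_id").between(start_month, start_date), col("customer_id"))).as("mtd"),
    countDistinct(when(col("day_id").between(start_last_month, end_last_month), col("customer_id"))).as("ltd"),
    countDistinct(when(col("day_id").between(start_last_6month, start_date), col("customer_id"))).as("lsm"),
    countDistinct(when(col("day_id").between(start_last_12month, start_date), col("customer_id"))).as("ltm"))
  .withColumn("category", lit("total"))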
There are no errors, but the job takes a long time. I want to know if there is a better way to do this in Spark.
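For example, if approximate counts were acceptable for my use case, would the HyperLogLog-based approx_count_distinct (available since Spark 2.1) be the usual way to speed this up? A hypothetical sketch for just one of the metrics:

import org.apache.spark.sql.functions.{approx_count_distinct, col, when}

// Hypothetical: trade exactness for speed; 0.02 is the maximum allowed
// relative standard deviation of the estimate.
val approx_yest = daily_cust_12month_ds.agg(
  approx_count_distinct(
    when(col("day_id") === start_date, col("customer_id")), 0.02
  ).as("yest_approx"))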