weekly misc items: September 21, 2020

file_paths = '<bucket-name>/<filepath>/<object-name>1.json,<bucket-name>/<filepath>/<object-name>2.json'file_paths_list = [f"s3a://{file}" for file in file_paths.split(',')]df = spark.read.format("json").load(queued_files_list)# or df_merged_schema = spark.read.option("mergeSchema", "true").format('json').load(queued_files_list)

2. display spark sql

display(spark.sql("""<sql statement>"""))

3. empty spark dataframe

schema = StructType([])
df_concat = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df_concat.schema

4. spark — join on multiple conditions

join_conditions = [product_variants.productVariantParentId == products.id, product_variants.productVariantFileBatchId == products.file_batch_id]

joined_products_df = product_variants.join(products, join_conditions, "left").select('products.*', 'product_variants.*')

or

joined_products_df = product_variants.join(products, (product_variants.productVariantParentId == products.id) & (product_variants.productVariantFileBatchId == products.file_batch_id), "left").select('products.*', 'product_variants.*')

note, unlike joining in below fashion, above join method will keep the ids you’re joining on from both tables:

joined_products_df = product_variants.join(products, "id", "left").select('products.*', 'product_variants.*')

i had to rename the ids in product_variantstable to avoid duplicate column errors when writing delta files.

5. window function on spark dataframe:

silver_products_delta = silver_products_delta.withColumn("row_num", row_number() \
.over(Window.partitionBy('productVariantId') \
.orderBy(col('productVariantUpdatedAt').desc())))

had to partition at most granular level product variant and not by product since our table includes product variant detail

to get the final result set, you can select where row_num ==1:

silver_products_delta = silver_products_delta.filter(col("row_num") == 1).drop("row_num")

--

--

--

quick notes, snippets, bugs, and fixes

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Using ThinkPad T410 as a server

How to Hard Reset Oppo R3

Oppo R3

Windows 365 updates

Top 12 Best Practices for Apex Code to Become A Better Developer

Sending Locust load testing results to Elastic Cloud

The Ultimate Video Calling Solution

Top Essential iOS Interview Questions and Answers Part- 1

Top 10 IT Outsourcing Trends That Will Matter in 2021

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
diary of a codelovingyogi

diary of a codelovingyogi

quick notes, snippets, bugs, and fixes

More from Medium

Drones in Construction - The Way Forward

A Hole In The Tree Confession

Sit back and learn python

Merry Christmas Neil DeGrasse Tyson