weekly misc items: July 20, 2020

diary of a codelovingyogi
Sep 28, 2020


1. delta lake merge schema

if you are adding new fields to an existing delta table and getting an error like:

A schema mismatch detected when writing to the Delta table (Table ID: 287f1b10-7384-446b-94c7-419b893f784b).

this is due to delta lake's schema enforcement. you can use the mergeSchema option to allow the new columns:

df_name.write.format("delta") \
    .option("mergeSchema", "true") \
    .partitionBy(PARTITION_BY) \
    .mode("append") \
    .save(<file_path>)

2. adding retries to your databricks job

in the data processing we do in our notebooks, we access internal packages that are installed from our repos via shell init scripts. but sometimes the init scripts fail with a message like:

Run result unavailable: job failed with error message
Unexpected failure while waiting for the cluster (<cluster id>) to be ready.Cause Unexpected state for cluster (job-27-run-9): Init scripts failed. instance_id: <id>, databricks_error_message: Cluster scoped init script s3://<bucket>/<path to init script>/<script name>.sh fail...

to handle this, you can set a retry policy in your databricks job settings.
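for example, the retry policy can be expressed in the job's settings via the databricks jobs api (the field names below come from that api; the values are illustrative):

```json
{
  "max_retries": 3,
  "min_retry_interval_millis": 60000,
  "retry_on_timeout": true
}
```

with this, a run that dies on a flaky init script gets retried automatically instead of failing the whole job.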

3. unnesting arrays in athena

for a field refunds, we have an array of records, e.g.:

[{createdat=2020-07-01T21:44:47Z, id=gid://shopify/Refund/123, note=, totalrefundedset={shopmoney={amount=0.0, currencycode=USD}}, updatedat=2020-07-01T21:44:48Z}, {createdat=2020-07-09T17:22:47Z, id=gid://shopify/Refund/124, note=rescheduled, totalrefundedset={shopmoney={amount=250.0, currencycode=USD}}, updatedat=2020-07-09T17:22:48Z}]

you can expand this array into one row per element, with its fields as columns, by using unnest:

select
    "refund_ids"."id" as refund_id,
    "refund_ids"."totalrefundedset"."shopmoney"."amount" as refund_amount,
    *
from "db_name"."table_name"
cross join unnest(refunds) as t(refund_ids)
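the semantics of cross join unnest can be sketched in plain python (illustrative only, not athena): each array element becomes its own row, and nested fields are pulled out as columns.

```python
# two refund records, shaped like the array in the table above
refunds = [
    {"id": "gid://shopify/Refund/123",
     "totalrefundedset": {"shopmoney": {"amount": 0.0, "currencycode": "USD"}}},
    {"id": "gid://shopify/Refund/124",
     "totalrefundedset": {"shopmoney": {"amount": 250.0, "currencycode": "USD"}}},
]

# cross join unnest(refunds) as t(refund_ids) ~ one output row per element,
# with dotted field access becoming dict lookups
rows = [
    {"refund_id": r["id"],
     "refund_amount": r["totalrefundedset"]["shopmoney"]["amount"]}
    for r in refunds
]
```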

4. spark schema inference problems

from shopify i have an array of refunds that i fetch refund line items from. after trying to add some restock fields and having spark infer a schema for them, i was getting a corrupt record:

StructType(List(StructField(_corrupt_record,StringType,true)))

when you see this it’s likely due to bad json/encoding.

i narrowed down the problem to the restocked field, which is a boolean:

[{node={restocked=true}}, {node={restocked=true}}, {node={restocked=true}}]

it turned out we were dumping the result to a string but then also loading the string back into a dict, and the mismatched serialization round-trip was causing the error.
