weekly misc items: August 3, 2020

diary of a codelovingyogi
Mar 4, 2021
1. adding unique (unix) timestamps in python

import time

def now():
    # current unix time in microseconds, as an int
    time_microseconds = round(time.time() * 1000000)
    return int(time_microseconds)
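
to sanity check the value, here is a minimal sketch of turning the microsecond timestamp back into a datetime. `from_micros` is a hypothetical helper name and is not part of the original snippet; it assumes the timestamp should be read as UTC.

```python
import time
from datetime import datetime, timezone


def now():
    # unix time in whole microseconds, same idea as the function above
    return int(round(time.time() * 1_000_000))


def from_micros(ts_micros):
    # hypothetical helper: split into whole seconds and leftover microseconds
    # so the conversion does not go back through a float
    seconds, micros = divmod(ts_micros, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


ts = now()
print(ts, from_micros(ts).isoformat())
```

on python 3.7+, `time.time_ns() // 1000` should give the same kind of value without the float multiplication. two calls within the same microsecond can still return the same number, so this is only as unique as the clock resolution allows.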

worked on stacking some datetime manipulation. i went from using this:

to this, to include the unix timestamp:

2. add s3 prefixes before creating new delta lake merge files to avoid Incompatible format detected error

sample snippet we use to create merged records in delta lake format:

from delta.tables import DeltaTable

# spark, GOLD_PATH, GOLD_PARTITION_BY and silver_orders_delta are assumed to be defined elsewhere
if DeltaTable.isDeltaTable(spark, GOLD_PATH):
    print('Loading delta table...')
    deltaTable = DeltaTable.forPath(spark, GOLD_PATH)
    print('Merging delta table...')
    # upsert: update rows whose id matches, insert the rest
    deltaTable.alias("delta_table") \
        .merge(
            source = silver_orders_delta.cache().alias("silver_updates"),
            condition = "delta_table.id = silver_updates.id"
        ) \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()
else:
    print('Delta table not found. Creating')
    silver_orders_delta.write.format("delta").partitionBy(GOLD_PARTITION_BY).mode("overwrite").save(GOLD_PATH)

deployed some data ingestion jobs to prod and did not create the s3 prefixes needed in the delta lake file paths. ended up getting the following error on the first run of the databricks notebook where we process these files into our data lake:

this error occurs if there is an existing _symlink_format_manifest file in your target path. i haven’t confirmed that creating the prefix in s3 before running the merge command avoids this error, but when i keep the table prefix and delete the manifest prefix, the subsequent merge runs fine.
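
a minimal sketch of checking for a leftover manifest prefix before running the merge. this is not from the original notebook: `manifest_prefix` and `has_manifest` are hypothetical helper names, the bucket and table prefix in the usage comment are made up, and it assumes the manifest sits directly under the table path in a `_symlink_format_manifest/` prefix. the client is expected to be a boto3 s3 client, using its `list_objects_v2` call.

```python
def manifest_prefix(table_prefix):
    # assumption: the manifest lives directly under the delta table path
    return table_prefix.rstrip("/") + "/_symlink_format_manifest/"


def has_manifest(s3_client, bucket, table_prefix):
    # ask s3 for at most one key under the manifest prefix;
    # KeyCount > 0 means something is already there
    resp = s3_client.list_objects_v2(
        Bucket=bucket,
        Prefix=manifest_prefix(table_prefix),
        MaxKeys=1,
    )
    return resp.get("KeyCount", 0) > 0


# usage (hypothetical bucket and prefix, needs boto3 and aws credentials):
# import boto3
# if has_manifest(boto3.client("s3"), "my-bucket", "gold/orders"):
#     print("manifest prefix exists, clean it up before the first merge")
```

passing the client in as an argument keeps the check easy to test with a stub instead of a real bucket.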

3. git diff --name-only

get names of files that changed between two commits
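
the general form of the command is `git diff --name-only <commit1> <commit2>`. as a minimal sketch, this is one way to call it from python and get the file names back as a list; `name_only_cmd` and `changed_files` are hypothetical helper names, and it assumes git is installed and `repo_dir` points at a git repository.

```python
import subprocess


def name_only_cmd(commit_a, commit_b):
    # build the argument list for: git diff --name-only <commit_a> <commit_b>
    return ["git", "diff", "--name-only", commit_a, commit_b]


def changed_files(commit_a, commit_b, repo_dir="."):
    # run git in repo_dir and return one changed file path per list entry;
    # check=True raises CalledProcessError if git exits with an error
    result = subprocess.run(
        name_only_cmd(commit_a, commit_b),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# usage, inside a repo with at least two commits:
# print(changed_files("HEAD~1", "HEAD"))
```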
