weekly misc items: August 17, 2020

1. schema evolution — delta lake — merge mode

a new field was not getting read into the target schema during merge mode

stumbled upon this issue: https://github.com/delta-io/delta/issues/170

working job:

my dev cluster:

updated my dev cluster to the latest databricks runtime (and with it the latest spark) — the new field is now getting recognized
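for my own intuition, the schema-evolution behavior during a merge can be sketched like this — a pure-python toy, not delta lake's actual implementation; the field names and the `merge_schemas` helper are made up:

```python
# toy sketch of schema evolution during a merge: the target schema
# gains any fields that exist only in the source. this is NOT how
# delta lake is implemented, just the idea behind "merge mode".

def merge_schemas(target, source):
    """Union two schemas, modeled as dicts of field name -> type name.

    On a name conflict the target's type wins; fields that exist only
    in the source are appended to the merged schema.
    """
    merged = dict(target)
    for field, dtype in source.items():
        if field not in merged:
            merged[field] = dtype  # new field picked up from the source
    return merged

target = {"id": "long", "name": "string"}
source = {"id": "long", "name": "string", "email": "string"}  # new field

merged = merge_schemas(target, source)
print(merged)
```

the bug i was hitting behaved as if `merge_schemas` returned `target` unchanged — the new `email`-style field simply never showed up until the runtime upgrade.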

2. delta table manifest update

while troubleshooting the above, i noticed this line:

spark.sql(f'ALTER TABLE delta.`{delta_path}` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)')

setting this table property enables automatic manifest updates on every write operation

i have been explicitly updating manifests via:

spark.sql(f'GENERATE symlink_format_manifest FOR TABLE delta.`{delta_path}`')

it appears the explicit update is still recommended: auto update only regenerates manifests for the partitions a write operation touched, so manifests in other partitions can be left stale. also, if there are concurrent writes, you would still have to run the explicit update to ensure the manifests point at the latest version of the table
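to convince myself of the staleness argument, a toy model — pure python, all names made up; real symlink manifests are text files listing parquet paths per partition, here i just track which table version each partition's manifest was generated at:

```python
# toy model: one manifest per partition, recording the table version
# it was last generated at. illustrative only, not delta's internals.

table_version = 1
manifests = {"date=2020-08-16": 1, "date=2020-08-17": 1}

def auto_update(written_partitions):
    """Auto update: regenerate manifests only for partitions written to."""
    global table_version
    table_version += 1
    for p in written_partitions:
        manifests[p] = table_version

def generate_manifest():
    """Explicit GENERATE: regenerate the manifest for every partition."""
    for p in manifests:
        manifests[p] = table_version

auto_update(["date=2020-08-17"])  # write touches only one partition
stale = [p for p, v in manifests.items() if v < table_version]
print(stale)  # the untouched partition's manifest lags behind

generate_manifest()
stale = [p for p, v in manifests.items() if v < table_version]
print(stale)  # empty: the explicit update brings every partition current
```

in the toy model the auto update leaves `date=2020-08-16` behind after a write to `date=2020-08-17`, and only the explicit generate brings every partition's manifest up to the current table version — which matches why i'll keep the explicit GENERATE in my jobs.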