Databricks SCD Type 2 Python



Implementing Slowly Changing Dimensions (SCD) Type 2 in Databricks using Python (PySpark) involves a few key steps:

Understanding SCD Type 2

SCD Type 2 preserves the full history of changes in a dimension table by creating a new row whenever a tracked attribute changes. Each new row gets its own surrogate key, a start date (when that version became effective), and an end date that remains NULL while the row is current and is set when the row is superseded.
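As a minimal illustration of these semantics in plain Python (the column and key names here are made up for the example):

```python
from datetime import date

# A tiny in-memory dimension table. Each version of a customer gets its own
# row (surrogate key "sk"); only the current version has end_date = None.
dim = [
    {"sk": 1, "customer_id": "C1", "city": "Pune",
     "start_date": date(2023, 1, 1), "end_date": date(2023, 6, 1)},
    {"sk": 2, "customer_id": "C1", "city": "Mumbai",
     "start_date": date(2023, 6, 1), "end_date": None},
]

# The full history is preserved: two rows for C1...
history = [r["city"] for r in dim if r["customer_id"] == "C1"]

# ...but exactly one row is "current" (open-ended end_date).
current = [r for r in dim if r["customer_id"] == "C1" and r["end_date"] is None]

print(history)             # ['Pune', 'Mumbai']
print(current[0]["city"])  # Mumbai
```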

Implementation Steps

  1. Identify Changes: Compare the incoming data with the existing dimension table to identify rows where attributes have changed.

  2. Update Existing Rows: For changed rows, update the existing row’s end date to mark it as no longer current.

  3. Insert New Rows: Insert new rows with the updated attribute values, a new unique identifier, and a start date.

  4. Maintain Validity: Ensure only one row per natural key has an end date of NULL, signifying the currently valid record.

PySpark Code Example (Illustrative)

from pyspark.sql.functions import (
    col, current_timestamp, lit, monotonically_increasing_id, when,
)

# Assuming you have the existing dimension table (dim_df) and the new incoming
# data (updates_df), both keyed by a natural key column named "natural_key".

# 1. Identify Changes: compare incoming rows against the *current* dimension rows
changes_df = updates_df.alias("u").join(
    dim_df.filter(col("end_date").isNull()).alias("d"),
    on="natural_key",
).filter("u.attribute1 <> d.attribute1 OR u.attribute2 <> d.attribute2")  # Add more attributes as needed

# 2. Update Existing Rows: set end_date on the current row of every changed key
changed_keys_df = changes_df.select("natural_key").distinct().withColumn("is_changed", lit(True))
updated_dim_df = dim_df.join(changed_keys_df, on="natural_key", how="left").withColumn(
    "end_date",
    when(col("is_changed") & col("end_date").isNull(), current_timestamp())
    .otherwise(col("end_date")),
).drop("is_changed")

# 3. Insert New Rows: new surrogate key, open-ended validity window
new_rows_df = changes_df.select(
    monotonically_increasing_id().alias("surrogate_key"),  # Generate new IDs
    col("natural_key"),
    col("u.attribute1"),
    col("u.attribute2"),
    current_timestamp().alias("start_date"),
    lit(None).cast("timestamp").alias("end_date"),
)

# 4. Combine and Ensure Validity
final_dim_df = updated_dim_df.unionByName(new_rows_df)

# Optionally, enforce validity by ensuring only one row with a NULL end_date per natural key
# (Requires more complex logic depending on your specific requirements)

Important Considerations:

  • Delta Lake: Consider using Delta Lake tables for efficient SCD Type 2 implementations, as they offer ACID transactions and time travel capabilities.
  • Merge: Databricks also provides the MERGE INTO SQL command, which can simplify SCD Type 2 operations in some cases.
  • Performance: For large datasets, optimize the join and update operations to maintain good performance.
  • Data Validation: Implement robust data quality checks to ensure the integrity of your dimension table.
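The MERGE INTO approach mentioned above can replace the manual join-and-union logic. A hedged sketch of the "close the current row" step on a Delta table (the table and column names — dim_customer, staged_updates, natural_key, attribute1, attribute2 — are illustrative):

```sql
-- Close the current row for each natural key whose attributes changed.
MERGE INTO dim_customer AS d
USING staged_updates AS u
  ON d.natural_key = u.natural_key AND d.end_date IS NULL
WHEN MATCHED AND (d.attribute1 <> u.attribute1 OR d.attribute2 <> u.attribute2) THEN
  UPDATE SET d.end_date = current_timestamp();

-- A second pass then appends the new versions of the changed rows
-- (start_date = now, end_date = NULL), e.g. with an INSERT INTO ... SELECT.
```

Splitting the work into an update pass and an insert pass keeps each statement simple; single-statement variants exist but require staging a union of "update" and "insert" rows first.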

Databricks Training Demo Day 1 Video:

You can find more information about Databricks Training in this Databricks Docs Link



Unogeeks is the No.1 IT Training Institute for Databricks Training. Anyone Disagree? Please drop in a comment

You can check out our other latest blogs on Databricks Training here – Databricks Blogs

Please check out our Best In Class Databricks Training Details here – Databricks Training

For Training inquiries:

Call/Whatsapp: +91 73960 33555
