Databricks SCD Type 2 Python
Implementing Slowly Changing Dimensions (SCD) Type 2 in Databricks using Python (PySpark) involves a few key steps:
Understanding SCD Type 2
SCD Type 2 preserves the full history of changes in a dimension table by inserting a new row whenever a tracked attribute changes. The new row gets its own surrogate key, a start date (when the new values became effective), and a NULL end date; at the same time, the previous row's end date is set to record when its values ceased to be valid.
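For example, the history for a single natural key might look like the following (the customer key, column names, and values here are purely illustrative):
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical history for one customer ("C001"): the superseded version was
# closed out with an end_date, and exactly one row remains open (end_date = None).
dim_rows = [
    Row(unique_id=1, natural_key="C001", attribute1="Austin",
        start_date="2023-01-01", end_date="2024-03-15"),  # old version
    Row(unique_id=2, natural_key="C001", attribute1="Dallas",
        start_date="2024-03-15", end_date=None),          # current version
]
spark.createDataFrame(dim_rows).show()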
Implementation Steps
1. Identify Changes: Compare the incoming data with the currently valid dimension rows to find natural keys whose tracked attributes have changed.
2. Expire Existing Rows: For each changed key, set the current row's end date to mark it as no longer valid.
3. Insert New Rows: Insert a new row with the updated attribute values, a new surrogate key, a start date, and a NULL end date.
4. Maintain Validity: Ensure only one row per natural key has a NULL end date, signifying the currently valid record.
PySpark Code Example (Illustrative)
from pyspark.sql.functions import col, current_timestamp, lit, monotonically_increasing_id

# Assuming the existing dimension table (dim_df) and the new incoming data (updates_df),
# with columns natural_key, attribute1, attribute2, unique_id, start_date, end_date,
# and one incoming row per natural_key.

# 1. Identify Changes: compare incoming rows against the currently valid rows only
current_df = dim_df.filter(col("end_date").isNull())

changes_df = updates_df.alias("u").join(
    current_df.alias("d"),
    col("u.natural_key") == col("d.natural_key"),
    "inner"
).filter(
    (col("u.attribute1") != col("d.attribute1")) |
    (col("u.attribute2") != col("d.attribute2"))  # Add more attributes as needed
)

changed_keys_df = changes_df.select(col("u.natural_key").alias("natural_key")).distinct()

# 2. Expire Existing Rows: stamp an end_date only on the current row of each changed key
expired_df = current_df.join(changed_keys_df, on="natural_key", how="inner") \
    .withColumn("end_date", current_timestamp())

# Rows left untouched: all historical rows, plus current rows that did not change
untouched_df = dim_df.filter(col("end_date").isNotNull()).unionByName(
    current_df.join(changed_keys_df, on="natural_key", how="left_anti")
)

updated_dim_df = untouched_df.unionByName(expired_df)

# 3. Insert New Rows: one fresh current version per changed natural key
new_rows_df = (
    changes_df.select("u.*")
    # monotonically_increasing_id() yields unique but non-consecutive IDs and is
    # not stable across runs; swap in your own surrogate-key strategy if needed
    .withColumn("unique_id", monotonically_increasing_id())
    .withColumn("start_date", current_timestamp())
    .withColumn("end_date", lit(None).cast("timestamp"))
)

# 4. Combine and Ensure Validity: each natural key again has exactly one NULL end_date
final_dim_df = updated_dim_df.unionByName(new_rows_df)
# Brand-new natural keys (present in updates_df but not in dim_df) can be appended
# the same way via a left_anti join of updates_df against dim_df.
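To make the validity check in step 4 concrete, here is a minimal sketch of one way to assert the invariant, assuming the column names used above:
from pyspark.sql.functions import col, count

# Every natural key should have exactly one open row (NULL end_date)
violations = (
    final_dim_df.filter(col("end_date").isNull())
    .groupBy("natural_key")
    .agg(count("*").alias("open_rows"))
    .filter(col("open_rows") > 1)
)
assert violations.count() == 0, "Some natural keys have more than one current row"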
Important Considerations:
- Delta Lake: Consider using Delta Lake tables for efficient SCD Type 2 implementations, as they offer ACID transactions and time travel capabilities.
- Merge: Databricks also provides the MERGE INTO SQL command (and the equivalent DeltaTable.merge Python API), which can simplify SCD Type 2 operations in some cases; see the sketch after this list.
- Performance: For large datasets, optimize the join and update operations to maintain good performance.
- Data Validation: Implement robust data quality checks to ensure the integrity of your dimension table.
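To illustrate the Merge point above, here is a minimal sketch of the same SCD Type 2 logic on a Delta table using the DeltaTable Python API. The table name dim_table, the column names, the string-typed natural key, and the identity-style unique_id are all assumptions; adapt them to your schema.
from delta.tables import DeltaTable
from pyspark.sql.functions import col

dim_table = DeltaTable.forName(spark, "dim_table")  # hypothetical table name

# Incoming rows whose tracked attributes differ from the current dimension row
changed_df = updates_df.alias("u").join(
    dim_table.toDF().filter(col("end_date").isNull()).alias("d"),
    col("u.natural_key") == col("d.natural_key"),
    "inner"
).filter(
    (col("u.attribute1") != col("d.attribute1")) |
    (col("u.attribute2") != col("d.attribute2"))
).select("u.*")

# Stage each changed row twice: once keyed on natural_key (matches and expires
# the current row) and once with a NULL merge key (never matches, so it falls
# through to the insert clause as the new current version). Unchanged and
# brand-new keys pass through once via updates_df.
staged_df = updates_df.selectExpr("natural_key AS merge_key", "*").unionByName(
    changed_df.selectExpr("CAST(NULL AS STRING) AS merge_key", "*")  # assumes a string key
)

dim_table.alias("d").merge(
    staged_df.alias("s"),
    "d.natural_key = s.merge_key AND d.end_date IS NULL"
).whenMatchedUpdate(
    condition="d.attribute1 <> s.attribute1 OR d.attribute2 <> s.attribute2",
    set={"end_date": "current_timestamp()"}
).whenNotMatchedInsert(
    # unique_id is omitted here; a Delta identity column could populate it
    values={
        "natural_key": "s.natural_key",
        "attribute1": "s.attribute1",
        "attribute2": "s.attribute2",
        "start_date": "current_timestamp()",
        "end_date": "CAST(NULL AS TIMESTAMP)",
    }
).execute()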
Conclusion:
Unogeeks is the No.1 IT Training Institute for Databricks Training. Anyone Disagree? Please drop in a comment
You can check out our other latest blogs on Databricks Training here – Databricks Blogs
Please check out our Best In Class Databricks Training Details here – Databricks Training
Follow & Connect with us:
———————————-
For Training inquiries:
Call/Whatsapp: +91 73960 33555
Mail us at: info@unogeeks.com
Our Website ➜ https://unogeeks.com
Follow us:
Instagram: https://www.instagram.com/unogeeks
Facebook: https://www.facebook.com/UnogeeksSoftwareTrainingInstitute
Twitter: https://twitter.com/unogeeks