How You Can Access Data from ADLS in Databricks
Here’s a comprehensive guide on how to access data from Azure Data Lake Storage (ADLS) in Azure Databricks, combining the best practices and addressing potential issues:
Methods
There are four primary methods to integrate ADLS into your Databricks workflow:
- Mounting ADLS to DBFS using Service Principal and OAuth 2.0:
- The most secure and recommended approach
- Create a service principal in Azure Active Directory (AAD) and grant it the necessary permissions on your ADLS storage account.
- Use this service principal’s credentials and OAuth 2.0 to mount the ADLS container as a directory within the Databricks File System (DBFS).
- Using a Service Principal Directly:
- This approach is similar to the mounting approach, but you provide the service principal’s client ID and secret directly in your code.
- Store the client ID and secret securely using Databricks Secrets rather than hard-coding them (see the sketch after this list).
- Using the ADLS Storage Account Access Key:
- Retrieve the storage account access key from the Azure portal.
- Embed this key in your code to access data.
- Less secure: exercise caution when using this method in production environments (also shown in the sketch after this list).
- Credential Passthrough:
- Leverages your Azure Active Directory credentials to access ADLS.
- Requires additional cluster configuration and is suited to scenarios where each user should access data with their own identity.
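For the second and third methods, here is a minimal sketch of the direct (non-mounted) approach. It sets per-storage-account options on the Spark session and then reads through an abfss:// URI; all <...> values, the secret scope, and the data.csv path are placeholders for your own environment.
Python
# Direct access with a service principal (OAuth 2.0), no mount required.
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net",
               "<service-principal-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
               dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Alternative: storage account access key (less secure; keep the key in a secret, not in code).
# spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
#                dbutils.secrets.get(scope="<secret-scope>", key="<storage-account-key>"))

# Read directly from the ADLS path.
df = spark.read.csv(
    "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/data.csv",
    header=True,
    inferSchema=True
)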
Steps for Mounting ADLS (Recommended)
- Create a Service Principal in AAD: Follow Microsoft’s documentation on registering an application (service principal) in Azure Active Directory.
- Assign Permissions to the Service Principal: Grant the service principal at least the “Storage Blob Data Reader” role on your ADLS storage account; use “Storage Blob Data Contributor” if it also needs to write data.
- Mount ADLS to DBFS:
Python
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<service-principal-client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<your-mount-point>",
    extra_configs=configs
)
- Replace placeholders with your service principal details, ADLS container, storage account, Databricks secret scope, and secret key where you’ve stored the service principal’s secret.
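After running the mount command, a quick way to confirm it worked (using the same hypothetical mount point):
Python
# List the contents of the mounted container to confirm the mount succeeded
display(dbutils.fs.ls("/mnt/<your-mount-point>"))

# Show all current mounts on the workspace
display(dbutils.fs.mounts())

# Unmount if you ever need to re-create the mount with new credentials
# dbutils.fs.unmount("/mnt/<your-mount-point>")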
Accessing Data After Mounting:
Python
# Read a CSV file from the mounted ADLS location
df = spark.read.csv("/mnt/<your-mount-point>/data.csv", header=True, inferSchema=True)
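As a small follow-up sketch (same hypothetical mount point, and an assumed output folder name), you can inspect the DataFrame and write results back to ADLS through the mount:
Python
# Inspect the schema and preview a few rows
df.printSchema()
df.show(5)

# Write the result back to ADLS via the mount as Parquet (hypothetical output path)
df.write.mode("overwrite").parquet("/mnt/<your-mount-point>/output/data_parquet")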
Important Considerations:
- Secret Management: Store sensitive credentials using Databricks Secrets or Azure Key Vault.
- Best Practices: The mounting approach with service principals and OAuth 2.0 offers the best security.
- Data Formats: Databricks supports various file formats (CSV, Parquet, JSON, Delta, etc.); see the sketch below.
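For illustration, a short sketch of reading a few other formats from the same hypothetical mount point (all paths are placeholders):
Python
# Parquet
parquet_df = spark.read.parquet("/mnt/<your-mount-point>/events_parquet/")

# JSON
json_df = spark.read.json("/mnt/<your-mount-point>/logs.json")

# Delta Lake (the default table format on Databricks)
delta_df = spark.read.format("delta").load("/mnt/<your-mount-point>/delta_table/")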
Conclusion:
Unogeeks is the No.1 IT Training Institute for Databricks Training. Anyone Disagree? Please drop in a comment
You can check out our other latest blogs on Databricks Training here – Databricks Blogs
Please check out our Best In Class Databricks Training Details here – Databricks Training
Follow & Connect with us:
———————————-
For Training inquiries:
Call/Whatsapp: +91 73960 33555
Mail us at: info@unogeeks.com
Our Website ➜ https://unogeeks.com
Follow us:
Instagram: https://www.instagram.com/unogeeks
Facebook: https://www.facebook.com/UnogeeksSoftwareTrainingInstitute
Twitter: https://twitter.com/unogeeks