In Databricks, coalesce(1) reduces the number of partitions in a DataFrame to a single partition. This is primarily done for the following reasons:

Optimization of Write Operations:

  • Reduced File Count: When writing data to a format like Parquet, having a single partition results in a single output data file (Spark still writes a directory, which contains one part file plus marker files such as _SUCCESS). This is often more convenient for downstream processing tools and storage systems, as it avoids the overhead of managing many small files.
  • Improved Compression: With all data in a single file, columnar formats such as Parquet may compress somewhat better than they would across many tiny files, potentially reducing the overall storage size.

Specific Use Cases:

  • Small Datasets: For datasets that fit comfortably within the memory of a single worker node, coalesce(1) can simplify processing and avoid unnecessary data shuffling across partitions.
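  • Gathering Data: If you need all of the data in one place for further processing or analysis, coalesce(1) ensures that it is contained within a single partition. Note that collect() returns rows from every partition regardless of the partition count, so coalesce(1) is not required before collecting to the driver node.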

Things to Consider:

  • Large Datasets: Using coalesce(1) on very large datasets may not be suitable, because a single task on one executor has to process and write the entire dataset. This is slow and can exhaust that executor's memory or disk. Repartitioning to a reasonable number of partitions might be a better approach in such cases.
  • Data Skew: If your data is heavily skewed (unevenly distributed across partitions), coalesce(1) does not rebalance it. It moves everything into one partition, so a single task ends up processing all of the data, which can become a performance bottleneck.
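As a rough illustration of what "a reasonable number of partitions" could mean, the sketch below derives a partition count from an estimated dataset size. The helper name choose_partition_count and the 128 MiB target file size are assumptions made for this example, not values from Databricks or from the text above; tune the target to your workload.

```python
import math

# Assumed rule of thumb for this sketch: aim for output files of about 128 MiB.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def choose_partition_count(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Return a partition count that yields files of roughly target_file_bytes each.

    Hypothetical helper: always returns at least 1 so tiny datasets get one file.
    """
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative, and the target must be positive")
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Example: an estimated 10 GiB dataset.
estimated_bytes = 10 * 1024 ** 3
num_partitions = choose_partition_count(estimated_bytes)
print(num_partitions)

# With an existing DataFrame `df`, the result could then be applied as:
#   df.repartition(num_partitions).write.parquet("path/to/output")
```

For a small dataset the helper returns 1, which corresponds to the coalesce(1) case; larger datasets get proportionally more partitions so the write stays parallel.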

Code Example (PySpark):


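# `spark` is the SparkSession that Databricks notebooks provide by default.
# Assumes Parquet input; use the reader that matches your source format.
df = spark.read.parquet("path/to/data")

df = df.coalesce(1)  # Reduce to a single partition

df.write.parquet("path/to/output")  # Write a directory containing a single part file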


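Alternative:

  • repartition(1): This also reduces the number of partitions to one, but it involves a complete data shuffle, which can be computationally expensive for large datasets. coalesce(1) is generally preferred, as it minimizes data movement. One caveat: because coalesce(1) avoids the shuffle, Spark may also run the upstream computation in that single task, whereas repartition(1) keeps the earlier stages parallel and only merges the data at the end.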
