There are three major concepts for us to understand about Azure Databricks: Clusters, Code, and Data. We will dig into each of these in due time. For this post, we're going to talk about Clusters. Clusters are where the work is done. Clusters themselves do not store any code or data. Instead, they provide the physical compute resources used to perform the computations. So, it's possible (and even advised) to develop code against small development clusters, then run that same code against larger production-grade clusters for deployment. Let's start by creating a small cluster.
Additionally, we have the ability to automatically stop, but not delete, clusters that are no longer being used. Anyone who's ever worked with a product that requires people to manually spin up and spin down resources can attest that it is extremely common for people to forget to turn them off after they're done. This autotermination, along with the autoscaling mentioned previously, leads to substantial cost savings. You can read more about autotermination here.
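To make the two settings concrete, here is a minimal sketch of a cluster specification using the `autoscale` and `autotermination_minutes` fields from the Databricks Clusters API. The cluster name, runtime version, and VM size shown are hypothetical examples; substitute values available in your own workspace.

```python
import json

# Sketch of a cluster spec combining autoscaling and autotermination.
# Field names follow the Databricks Clusters API; the specific values
# (name, runtime, node type) are illustrative assumptions.
cluster_spec = {
    "cluster_name": "dev-small",          # hypothetical cluster name
    "spark_version": "13.3.x-scala2.12",  # example runtime; check your workspace
    "node_type_id": "Standard_DS3_v2",    # example Azure VM size
    "autoscale": {
        "min_workers": 1,  # scale down to a single worker under light load
        "max_workers": 4,  # cap the maximum size to bound cost
    },
    # Stop (but do not delete) the cluster after 30 idle minutes.
    "autotermination_minutes": 30,
}

print(json.dumps(cluster_spec, indent=2))
```

The same spec can be submitted through the Databricks REST API or CLI, or simply mirrored by the equivalent options in the cluster creation UI.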
|Min and Max Workers|
With all of these cluster configuration options, it's obvious there are many ways to limit the overall price of our Azure Databricks environment. This is a level of flexibility not present in any other major Hadoop environment, and one of the reasons why Databricks has gained so much popularity. Let's move on to the advanced options at the bottom of this screen.
|Advanced Options - Spark|
In addition, we can create environment variables that can then be used as part of scripts and code. This is especially helpful when we want to run complex scripts or create reusable code that can be ported from cluster to cluster. For instance, we can easily transition from Dev to QA to Prod clusters by storing the connection string for the associated database as an environment variable. Slightly off-topic, there's also a secure way to handle this using secrets or Azure Key Vault. You can read more about the Spark Advanced options here.
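As a sketch of that Dev-to-Prod pattern, suppose each cluster defines an environment variable (under Advanced Options > Spark > Environment Variables); the variable name `DB_CONNECTION_STRING` and its values here are hypothetical. Notebook code then reads the variable instead of hard-coding the connection string, so the same code runs unchanged on every cluster.

```python
import os

# Each cluster supplies its own value for this variable, e.g.
#   Dev:  DB_CONNECTION_STRING=Server=dev-sql;Database=sales
#   Prod: DB_CONNECTION_STRING=Server=prod-sql;Database=sales
# (variable name and values are illustrative assumptions).
# Fall back to a local default when the variable is not set.
conn_str = os.environ.get(
    "DB_CONNECTION_STRING",
    "Server=localhost;Database=dev",
)

print(f"Connecting with: {conn_str}")
```

For credentials or other sensitive values, prefer the secrets approach mentioned above over plain environment variables.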
|Advanced Options - Tags|
|Advanced Options - Logging|
|Advanced Options - Init Scripts|
|Advanced Options - ADLS Gen1|
That's all there is to creating clusters in Azure Databricks. We love the fact that Azure Databricks makes it incredibly easy to spin up a basic cluster, complete with all the standard libraries and packages, but also gives us the flexibility to create more complex clusters as our use cases dictate. Stay tuned for the next post, where we'll dig into Notebooks. Thanks for reading. We hope you found this informative.
Service Engineer - FastTrack for Azure