There are three major concepts for us to understand about Azure Databricks: Clusters, Code, and Data. We will dig into each of these in due time. For this post, we're going to talk about Clusters. Clusters are where the work is done. Clusters themselves do not store any code or data. Instead, they provide the physical compute resources used to perform the computations. So, it's possible (and even advised) to develop code against small development clusters, then run that same code against larger production-grade clusters for deployment. Let's start by creating a small cluster.
Additionally, we have the ability to automatically stop, but not delete, clusters that are no longer being used. Anyone who's ever worked with a product that requires people to manually spin up and spin down resources can attest that it is extremely common for people to forget to turn them off after they're done. This autotermination, along with the autoscaling mentioned previously, leads to substantial cost savings. You can read more about autotermination here.
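To make the two settings concrete, here is a minimal sketch of a cluster specification using the `autoscale` and `autotermination_minutes` fields from the Databricks Clusters API. The cluster name, runtime version, and VM size shown are hypothetical examples; substitute values available in your own workspace.

```python
import json

# Sketch of a cluster spec combining autoscaling and autotermination.
# Field names follow the Databricks Clusters API; the specific values
# (name, runtime, node type) are illustrative assumptions.
cluster_spec = {
    "cluster_name": "dev-small",          # hypothetical cluster name
    "spark_version": "13.3.x-scala2.12",  # example runtime; check your workspace
    "node_type_id": "Standard_DS3_v2",    # example Azure VM size
    "autoscale": {
        "min_workers": 1,  # scale down to a single worker under light load
        "max_workers": 4,  # cap the maximum size to bound cost
    },
    # Stop (but do not delete) the cluster after 30 idle minutes.
    "autotermination_minutes": 30,
}

print(json.dumps(cluster_spec, indent=2))
```

The same spec can be submitted through the Databricks REST API or CLI, or simply mirrored by the equivalent options in the cluster creation UI.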
|Min and Max Workers|
With all of these cluster configuration options, it's obvious there are many ways to limit the overall price of our Azure Databricks environment. This is a level of flexibility not present in any other major Hadoop environment, and one of the reasons why Databricks has gained so much popularity. Let's move on to the advanced options at the bottom of this screen.
|Advanced Options - Spark|
In addition, we can create environment variables that can then be used as part of scripts and code. This is especially helpful when we want to run complex scripts or create reusable code that can be ported from cluster to cluster. For instance, we can easily transition from Dev to QA to Prod clusters by storing the connection string for the associated database as an environment variable. Slightly off-topic, there's also a secure way to handle this using secrets or Azure Key Vault. You can read more about the Spark Advanced options here.
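As a sketch of that Dev-to-Prod pattern, suppose each cluster defines an environment variable (under Advanced Options > Spark > Environment Variables); the variable name `DB_CONNECTION_STRING` and its values here are hypothetical. Notebook code then reads the variable instead of hard-coding the connection string, so the same code runs unchanged on every cluster.

```python
import os

# Each cluster supplies its own value for this variable, e.g.
#   Dev:  DB_CONNECTION_STRING=Server=dev-sql;Database=sales
#   Prod: DB_CONNECTION_STRING=Server=prod-sql;Database=sales
# (variable name and values are illustrative assumptions).
# Fall back to a local default when the variable is not set.
conn_str = os.environ.get(
    "DB_CONNECTION_STRING",
    "Server=localhost;Database=dev",
)

print(f"Connecting with: {conn_str}")
```

For credentials or other sensitive values, prefer the secrets approach mentioned above over plain environment variables.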
|Advanced Options - Tags|
|Advanced Options - Logging|
|Advanced Options - Init Scripts|
|Advanced Options - ADLS Gen1|
That's all there is to creating clusters in Azure Databricks. We love the fact that Azure Databricks makes it incredibly easy to spin up a basic cluster, complete with all the standard libraries and packages, but also gives us the flexibility to create more complex clusters as our use cases dictate. Stay tuned for the next post, where we'll dig into Notebooks. Thanks for reading. We hope you found this informative.
Service Engineer - FastTrack for Azure