Monday, June 3, 2019

Azure Databricks: Introduction

Today, we're going to start a new series on Azure Databricks.  We've talked about Azure Databricks in a couple of previous posts (here and here).  This post will largely be a reiteration of what was discussed in those posts.  However, it's a necessary starting point for this series.  Entering the rabbit hole in 3...2...1...

So, what is Azure Databricks?  To answer this question, let's start all the way at the bottom of the hole and climb up.  So, what is Hadoop?  Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations.  It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things.  However, Hadoop solutions are extremely complex to manage and develop.  So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems.  Apache Spark is one of these platforms.  You can read more about Apache Hadoop here and here.

So, what is Spark?  Apache Spark is a distributed processing framework built on top of the Hadoop ecosystem.  Spark provides a number of useful abstractions on top of Hadoop.  This isn't technically true, but the reality is too complex for us to dive into here.  Basically, Spark allows us to build complex data processing workflows using a substantially easier programming model.  However, Spark still requires us to manage our own Hadoop environment.  This is where Databricks comes in.  You can read more about Apache Spark here and here.

So, what is Databricks?  Per Wikipedia, "Databricks is a company founded by the creators of Apache Spark that aims to help clients with cloud-based big data processing using Spark."  Typically, when people say Databricks, they are referring to one of their products, the Databricks Unified Analytics Platform.  This platform is a fully-managed, cloud-only Spark service that allows us to quickly and easily spin up Spark clusters that are pre-configured for Analytics, including a number of standard analytics libraries.  Just as importantly, it also exposes these clusters to us via an intuitive online notebook interface that supports Scala, Python, R, Markdown, Shell and File System commands within a single notebook.  This provides the incredibly useful ability to have Databricks automatically spin up a Spark cluster for us, configure it with all of the basic libraries we will need and execute our code in any mix of the languages listed earlier.  This combination of features is completely unique in the Data Analytics and Engineering space.  You can read more about Databricks here and here.

So, what is Azure Databricks?  As previously mentioned, Databricks is purpose built to leverage the elasticity, scaleability and manageability of the cloud.  Azure Databricks is the Databricks product available in the Azure cloud.  This service leverages native Azure resources, like Blob Storage, Virtual Machines and Virtual Networks to host its service.  Furthermore, Azure Databricks is a "first-class" Azure resource.  This means that it is fully integrated and supported by Microsoft.  Despite it being owned and developed by an external company, Microsoft treats it the same it does its own internally developed resources.  You can read more about Azure Databricks here and here.

Now for the final question, "How is Azure Databricks different from the other Notebook, Data Science and Spark technologies on Azure?"  There are a number of technologies available on Azure that are designed to handle a subset of these features.

Unfortunately, none of these tools offers the entire suite of functionality provided by Azure Databricks.  The closest competitor is probably HDInsight.  However, as we saw in a previous post, HDInsight requires a certain level of Hadoop knowledge of begin using.

So, how do we get started?  It's as simple as creating a free Azure Databricks account from the Azure portal.  The account costs nothing, but the clusters we created within the account will not be free.  Fortunately, Azure Databricks offers a free 14-day trial that can be repeated over and over for our testing needs.  We're really excited for you to join us on our Azure Databricks journey.  It's going to be an awesome ride.  Stay tuned for the next post where we'll dig into creating clusters.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure

No comments:

Post a Comment