Monday, April 1, 2019

Data Science in Power BI: Databricks Spark

Today, we're going to talk about Databricks Spark within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query Editor, Python, Azure Machine Learning Studio, Stream Analytics, Stream Analytics with Azure Machine Learning Studio and HDInsight Hive, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Databricks Spark.

Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post.  The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform.  In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don't want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.

Azure Databricks is not a Microsoft product.  It is owned and managed by the company Databricks and available in Azure and AWS.  However, Databricks is a "first party offering" in Azure.  This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products.  You can read more about Azure Databricks here, here and here.

To get started with Azure Databricks, we need to create an account.  We can do this in the Azure Portal.
Databricks Account Creation
Now that we have an account, we can access our Azure Databricks Workspace through the Azure portal.
Launch Workspace
When we navigate to our Azure Databricks resource, we simply click the "Launch Workspace" button.
Azure Databricks Portal
This takes us to the Azure Databricks Portal.  Azure Databricks has three major components that we will leverage.  First, it has a Spark SQL database behind the scenes that we can store data in and query from.  Second, it stores our code as notebooks that we can use to explore our data interactively.  Finally, it allows us to create clusters that quickly spin up and spin down, separately from our code.  This means that we can destroy our clusters (and stop paying for them) without losing any of our code or data.
New Cluster
Let's start by creating our cluster definition.
Cluster Definition
Azure Databricks allows us to easily create Spark clusters with the ability to auto-scale.  This is a huge differentiator from HDInsight, which doesn't give us this kind of automatic scaling and typically requires heavier-handed intervention to resize a cluster.  In our case, we just want the smallest possible cluster, as our dataset is pretty small.  The cluster creation takes a few minutes.
New Notebook
Create Notebook
Once the cluster is created, we can create a notebook and select its default language.  One of the really cool things about Databricks is that we can write multiple query languages within the same notebook.  It's extremely simple to ingest data using Scala, process it using SQL, build predictive models using R and visualize it using Python, all inside a single notebook.  The language we select when creating the notebook is simply the default language, but it can easily be overridden using "%sql", "%scala", "%python" and "%r".  There's even a "%md" command for writing formatted text using markdown.
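To make that concrete, here's a minimal sketch of two cells in the same notebook, with each "%" magic command heading its own cell.  The tiny Data Frame and the "Languages" view name are just placeholders for illustration.

<CODE START>

%scala

// Scala cell: build a tiny Data Frame and expose it to the other languages as a view
val langs = spark.createDataFrame(Seq(("Scala", 2004), ("Python", 1991), ("R", 1993))).toDF("Language", "YearReleased")
langs.createOrReplaceTempView("Languages")

<CODE END>

<CODE START>

%sql

-- SQL cell: query the view created by the Scala cell above
SELECT Language, YearReleased FROM Languages ORDER BY YearReleased

<CODE END>

Now that we have a cluster and a notebook, let's use Scala to query data from our blob store.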
Read Avro from Blob
<CODE START>

%scala

// Give Spark the storage account access key it needs to read from blob storage
spark.conf.set(
  "fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net", <Storage Account Access Key>
);

// Load every Avro file in the folder into a single Data Frame
val df = spark.read.format("avro").load("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/data");

// Preview the first few rows
df.show();

<CODE END>

We were able to do this in a few basic lines of code.  Now that we can access our data, let's discuss the goal of this post.  In the previous post, we mentioned that Hive doesn't perform well against lots of small files.  This is true for Spark as well.  One technique for handling this is to have jobs that coalesce a large number of small files into a single large file.  These jobs can be run daily, weekly or monthly, depending on the data needs of the organization.  Since we've already stored the entire dataset in the "df" Data Frame, it's trivial to write all of this out as a set of randomly partitioned files or as a single file.
Write Avro to Blob
<CODE START>

%scala

// Write using default partitions

df.write.format("avro").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault/")

// Write using single partition

df.repartition(1).write.format("avro").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datasinglefile/")

<CODE END>

Since Spark is a distributed processing framework, the default behavior is to read and write files in a partitioned state.  However, we can override this by using the "repartition()" function.
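If we want more control, "coalesce()" is a related function that reduces the partition count without triggering a full shuffle.  Here's a minimal sketch, assuming we wanted eight output files; the partition count and the "dataeightfiles" folder name are purely illustrative.

<CODE START>

%scala

// Check how many partitions (and therefore output files) the Data Frame currently has
println(df.rdd.getNumPartitions)

// Coalesce down to eight partitions without a full shuffle, then write them out
// (the "dataeightfiles" folder name is hypothetical)
df.coalesce(8).write.format("avro").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/dataeightfiles/")

<CODE END>

Let's take a look at the results.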
Default Partitions

Single Partition
Immediately, we notice the awkward naming convention that Spark uses for its files.  This is a built-in behavior of Hadoop and cannot be directly altered.  However, it is possible to use File System commands to rename the files after they've been written, if that's necessary for some reason.  We also see that Spark writes some status files alongside the data.  These aren't a major concern, as most Hadoop technologies know how to leverage or ignore these files as needed.
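As an example of that kind of cleanup, here's a minimal sketch using the Databricks File System utilities to find the single part file and give it a friendlier name.  The "customers.avro" target name is just an assumption, and dbutils needs to be able to reach the storage account (for example, via the mount we set up later in this post or the cluster's Spark configuration).

<CODE START>

%scala

// List the single-file output folder and find the "part-" file that Spark wrote
// (assumes dbutils has access to the storage account)
val outputDir = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datasinglefile/"
val partFile = dbutils.fs.ls(outputDir).filter(_.name.startsWith("part-")).head

// Rename it to something friendlier (hypothetical target name)
dbutils.fs.mv(partFile.path, outputDir + "customers.avro")

<CODE END>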

Seeing this data leads us to two important questions.  First, is the total data volume smaller when we consolidate all of those small files into larger ones?
Storage Statistics
Fortunately, Azure Storage Explorer has some built-in functionality for this.  It turns out that we saw an 83% size reduction using the default partitioning and an 87% size reduction using the single file.
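If you'd rather check those numbers from inside the notebook instead of Azure Storage Explorer, a rough sketch like this one can total the file sizes per folder, again assuming dbutils can reach the storage account.

<CODE START>

%scala

// Total the file sizes (in bytes) of the original small files and the two rewritten versions
val folders = Seq("data", "datadefault", "datasinglefile")

folders.foreach { folder =>
  val totalBytes = dbutils.fs.ls(s"wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/$folder/").map(_.size).sum
  println(s"$folder: $totalBytes bytes")
}

<CODE END>

More importantly, what kind of read performance do we get off of these files?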
Read Statistics
This test showed that the small files took 9.37 seconds, the default partitioned files took 0.67 seconds and the single file took 0.54 seconds.  These numbers are even better than the Storage Statistics, boasting a 93% read time reduction for the default partitioning and a 94% read time reduction for the single file.  What's even more interesting here is that there doesn't seem to be a substantial difference between using the default partitioning and using a single file.  So, we'll just stick with the defaults moving forward.
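For anyone who wants to reproduce this comparison, here's a minimal sketch using Spark's built-in "spark.time()" helper, which prints the elapsed time of whatever expression we hand it.

<CODE START>

%scala

// Time a full read and count against each version of the data
spark.time(spark.read.format("avro").load("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/data").count())
spark.time(spark.read.format("avro").load("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault").count())
spark.time(spark.read.format("avro").load("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datasinglefile").count())

<CODE END>

Your exact numbers will vary with cluster size and caching, so it's worth running each read a couple of times.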

Now that we've compiled all of our small files together, we need to expose the data in a way that Power BI can read efficiently.  Power BI can read from Blob Storage directly, but it's messy and we won't try that here.  Fortunately, Databricks has a built-in Spark SQL database that we can write our data to.
createOrReplaceTempView
<CODE START>

%scala


df.createOrReplaceTempView("Customers")

<CODE END>

<CODE START>

%sql


SELECT * FROM Customers LIMIT 10

<CODE END>

Spark exposes a function, "createOrReplaceTempView()", that creates a temporary view over our Data Frame for us.  Unlike in the previous post, this view is not an external table defined directly over the files in blob storage; it lives within our Databricks Spark session.  Just like with HDInsight, it is possible to create an external table that connects to remote file storage.  In order to do this, we must first mount the storage container.
Mount Storage Account
<CODE START>

%scala

// Mount the blob folder so it can be referenced as if it were a local DBFS path
dbutils.fs.mount(
  source = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault",
  mountPoint = "/mnt/datadefault",
  extraConfigs = Map("fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net" -> <Storage Account Access Key>))

<CODE END>

This command allows us to access the data in the "datadefault" folder within the "interactivequery" container, simply by referencing the "/mnt/datadefault" location as if it was local.  We can use this to create our external table.  Of course, it's not good practice to hard-code account keys in our notebooks; it's also possible to store them using the Secrets API and reference them using the "dbutils.secrets.get" function.
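As a hedged sketch of what that might look like, assuming we had already created a secret scope named "datascienceinpowerbi" containing a "storageaccountkey" secret (both names are hypothetical):

<CODE START>

%scala

// Pull the storage account key from a secret scope instead of hard-coding it
// (the scope and key names below are hypothetical)
val storageKey = dbutils.secrets.get(scope = "datascienceinpowerbi", key = "storageaccountkey")

dbutils.fs.mount(
  source = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault",
  mountPoint = "/mnt/datadefault",
  extraConfigs = Map("fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net" -> storageKey))

<CODE END>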

Unfortunately, while writing this, we discovered that there is a bug affecting the use of Avro for this operation.  If we use the Avro data, we end up with a strange table.
CustomersBlob (Avro)
<CODE START>

%sql

CREATE EXTERNAL TABLE CustomersBlob(
  customerid string,
  age int,
  workclass string,
  fnlwgt int,
  education string,
  educationnum int,
  maritalstatus string,
  occupation string,
  relationship string,
  race string,
  sex string,
  capitalgain int,
  capitalloss int,
  hoursperweek int,
  nativecountry string
)
STORED AS AVRO
LOCATION '/mnt/datadefault/';


SELECT * FROM CustomersBlob LIMIT 10;

<CODE END>

Fortunately, saving the same data in Parquet format avoids the problem entirely.
CustomersBlob (Parquet)
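We won't walk through every step again, but a sketch of the Parquet version looks something like this.  The "dataparquet" folder, the "/mnt/dataparquet" mount point and the "CustomersBlobParquet" table name are all assumptions for illustration.

<CODE START>

%scala

// Write the same Data Frame out in Parquet format and mount the folder
// (folder and mount point names are hypothetical)
df.write.format("parquet").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/dataparquet/")

dbutils.fs.mount(
  source = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/dataparquet",
  mountPoint = "/mnt/dataparquet",
  extraConfigs = Map("fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net" -> <Storage Account Access Key>))

<CODE END>

<CODE START>

%sql

-- Parquet files carry their own schema, so we can skip the explicit column list
CREATE TABLE CustomersBlobParquet
USING PARQUET
LOCATION '/mnt/dataparquet/';

SELECT * FROM CustomersBlobParquet LIMIT 10;

<CODE END>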
Now that this data is accessible through Databricks, we can access it from Power BI using the Spark connector.  In order to connect, we need the JDBC/ODBC connection information for the cluster, as well as an access token.  We won't show this process step-by-step, but you can find it here.
Customers
Hopefully, this post enlightened you to the possibilities of combining the processing power of Databricks Spark with Power BI.  Databricks is one of the coolest new kids on the block and provides an incredibly simple mechanism for running Python, R, Scala and SQL code against the open-source Spark framework.  Stay tuned for the final post in this series where we'll dig into Spark MLlib.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com
