Monday, July 15, 2019

Azure Databricks: Notebooks

Today, we're going to talk about Notebooks in Azure Databricks.  If you haven't read the previous posts in this series, Introduction and Cluster Creation, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Notebooks.

As we mentioned in the previous post, there are three major concepts for us to understand about Azure Databricks: Clusters, Code and Data.  For this post, we're going to talk about the interactive way to develop code, Notebooks.  Azure Databricks Notebooks are similar to other notebook technologies, such as Jupyter and Zeppelin, in that they are a place for us to execute snippets of code and see the results within a single interface.
Sample Notebook
Let's start by creating a new notebook.
New Notebook

Create Notebook
Creating a notebook is a pretty simple process, as there's virtually no configuration needed.  The only things we need to do are provide a Name and select a Language.
Notebook Languages
Azure Databricks Notebooks support four programming languages: Python, Scala, SQL and R.  However, selecting a language in this drop-down doesn't limit us to only using that language.  Instead, it sets the default language of the notebook.  Every code block in the notebook is run independently and we can manually specify the language for each code block.

Before we get to the actual coding, we need to attach our new notebook to an existing cluster.  As we said, Notebooks are nothing more than an interface for interactive code.  The processing is all done on the underlying cluster.
Detached Notebook
Attach Notebook
Attached Notebook
Now that we're attached to a cluster, we can run some sample commands.  Let's try writing a simple command in each language.
Python
<CODE START>

%python

import pandas as pd

pysamp = pd.DataFrame([1,2,3,4], columns=['Samp'])
display(pysamp)

<CODE END>
R
<CODE START>

%r

rsamp <- data.frame(Samp = c(1,2,3,4))
display(rsamp)

<CODE END>
Scala
<CODE START>

%scala

case class samp(Samp: Int)
val samp1 = samp(1)
val samp2 = samp(2)
val samp3 = samp(3)
val samp4 = samp(4)
val scalasamp = Seq(samp1, samp2, samp3, samp4).toDF()
display(scalasamp)

<CODE END>
SQL

<CODE START>

%sql

DROP TABLE IF EXISTS sqlsamp;

CREATE TABLE sqlsamp AS
SELECT 1 AS Samp
UNION ALL
SELECT 2 AS Samp
UNION ALL
SELECT 3 AS Samp
UNION ALL
SELECT 4 AS Samp;

SELECT * FROM sqlsamp;

<CODE END>

We can see that it's quite easy to swap from language to language within a single notebook.  Moving data from one language to another is not quite as simple, but that will have to wait for another post.  One interesting thing to point out is that the "%python" command is unnecessary because we chose Python as our default language when we created the Notebook.
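As a quick teaser for that future post, one common pattern is to register the data as a temporary view, which any language in the notebook can then query through Spark.  The sketch below assumes the "pysamp" pandas data frame from the Python example above; the view name "pysamp_view" is just an illustrative choice.

<CODE START>

%python

# Convert the pandas data frame into a Spark data frame and register it as a
# temporary view so that other languages (SQL, Scala, R) can query it.
spark_pysamp = spark.createDataFrame(pysamp)
spark_pysamp.createOrReplaceTempView("pysamp_view")

# The same view can also be read back through Spark SQL from Python.
display(spark.sql("SELECT * FROM pysamp_view"))

<CODE END>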
Databricks Notebooks
Databricks Notebooks also have the interesting option to plot any data that we output using the display() function.  The display() function is unnecessary if we are using SQL.  We can visualize our data in a number of different ways simply by selecting a different plot option.
Bar Chart
Area Chart
Pie Chart
Quantile Plot
Histogram
Box Plot
Q-Q Plot
Each of these charts operates by its own rules depending on the type of data we throw at it.  There's also a dashboard feature that allows us to combine multiple charts into a single view.  However, that's another topic for a later post.

We've mentioned that there are four languages that can all be used within a single notebook.  There are also three additional languages that can be used for purposes other than data manipulation.
Formatted Markdown
Markdown Code
<CODE START>

%md

# HEADER
## Subheader

Text

<CODE END>

We can create formatted text using Markdown.  Markdown is a fantastic language for creating a rich text and image experience.  Notebooks are great for sharing your work due to the fact that we can see all the code and results in a single pane.  Markdown extends this even further by allowing us to have rich comments alongside our code, even including pictures and hyperlinks.  You can read more about Markdown here.
LS
<CODE START>

%fs

ls /

<CODE END>

We also have the ability to run file system commands against the Databricks File System (DBFS).  Any time we upload files to the Databricks workspace, they are stored in DBFS.  We will talk more about DBFS in the next post.  You can read more about DBFS here.  You can read more about file system commands here.
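For anyone who prefers to stay in Python, a roughly equivalent listing can be done with "dbutils.fs", which also lets us work with the results programmatically.  This is a minimal sketch against the DBFS root.

<CODE START>

%python

# List the DBFS root using dbutils instead of the %fs magic command.
files = dbutils.fs.ls("/")

# Each entry exposes a path, name and size that we can use in code.
for f in files:
    print(f.path, f.size)

<CODE END>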
Shell
<CODE START>


%sh

ls /

<CODE END>


We also have the ability to run Shell commands against the Databricks cluster.  This can be very useful for performing administrative tasks.  You can read more about Shell commands here.
Run
<CODE START>

%run ./Notebooks_Run

<CODE END>

Finally, we have the ability to run other notebooks.  We simply need to give it the path to the other notebook.  This will make more sense when we discuss DBFS in the next post.  The "%run" command is a little different because it only allows us to provide one piece of information, the location of the notebook to run.  If we were to put additional commands in this block, it would error.

All of the "%" commands we've looked at throughout this post are known as "magic commands".  You can read more about magic commands here.  We see that they have the amazing ability to let us use the best language for the job, depending on the task at hand.  We hope this post opened your eyes to the awesome power of Notebooks.  They are undoubtedly the future of analytics.  The official Azure Databricks Notebooks documentation can be found here.  Stay tuned for the next post where we'll dive into DBFS.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, June 24, 2019

Azure Databricks: Cluster Creation

Today, we're going to talk about Cluster Creation in Azure Databricks.  If you haven't read the first post in this series, Introduction, it may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Cluster Creation.

There are three major concepts for us to understand about Azure Databricks: Clusters, Code and Data.  We will dig into each of these in due time.  For this post, we're going to talk about Clusters.  Clusters are where the work is done.  Clusters themselves do not store any code or data.  Instead, they provide the physical compute resources that are used to perform the computations.  So, it's possible (and even advised) to develop code against small development clusters, then leverage the same code against larger production-grade clusters for deployment.  Let's start by creating a small cluster.
New Cluster
Cluster Properties
Let's dig into this.
Cluster Mode
First, we're setting up this cluster in "Standard" mode.  This is the typical cluster mode that is very useful for developing code, performing analyses or running individual jobs.  The other option is "High Concurrency".  This mode is optimized for multiple users running multiple jobs at the same time.  You can read more about Cluster Modes here.
Python Version
Then, we have to decide whether we want to use Python 2 or Python 3.  Long story short, Python 2 and Python 3 are very different languages.  Python 3 is the newer version and should be used in the majority of cases.  In the rare cases where we are trying to productionalize existing Python 2 code, Python 2 can be leveraged.  You can read more about Python Versions here.
Autopilot Options
Two of the major selling points of Databricks are listed under the "Autopilot Options" section.  First, Databricks clusters have the ability to monitor performance stats and autoscale the cluster based on those stats.  This level of elasticity is virtually unheard of in most other Hadoop products.  You can read more about autoscaling here.

Additionally, we have the ability to automatically stop, but not delete, clusters that are no longer being used.  Anyone who's ever worked with a product that requires people to manually spin up and spin down resources can attest that it is extremely common for people to forget to turn them off after they're done.  This autotermination, along with the autoscaling mentioned previously, leads to substantial cost savings.  You can read more about autotermination here.
Worker Type
Databricks also gives us the ability to decide what size of cluster we want to create.  As we mentioned previously, this flexibility is an efficient way to limit costs while developing by leveraging less expensive clusters.  For more advanced users, there are also a number of different cluster types optimized for different performance profiles, including high CPU, high Memory, high I/O or GPU-acceleration.  You can read more about Worker Types here.
Min and Max Workers
With autoscaling enabled, we also have the ability to define how large or small the cluster can get.  This is an effective technique for guaranteeing thresholds for performance and price.  You can find out more about autoscaling here.
Driver Type
We also have the ability to use a different driver type than we use for the workers.  The worker nodes are used to perform the computations needed for any notebooks or jobs.  The driver nodes are used to store metadata and context for Notebooks.  For example, a notebook that processes a very complex dataset, but doesn't output anything to the user will require larger worker nodes.  On the other hand, a notebook that processes a very simple dataset, but outputs a lot of data to the user for manipulation in the notebook will require larger driver nodes.  Obviously this setting is designed for more advanced users.  You can read more about Cluster Node Types here.

With all of these cluster configuration options, it's obvious there are many ways to limit the overall price of our Azure Databricks environment.  This is a level of flexibility not present in any other major Hadoop environment, and one of the reasons why Databricks has gained so much popularity.  Let's move on to the advanced options at the bottom of this screen.
Advanced Options - Spark
Behind every Spark cluster, there are a large number of configuration options.  Azure Databricks uses many of the common default values.  However, there are some cases where advanced users may want to change some of these configuration values to improve performance or enable more complex features.  You can read more about Apache Spark configuration options here.
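As a hedged illustration, these values are entered in the Spark Config box as simple key-value pairs, and many of the session-level ones can also be inspected or adjusted from a notebook.  The setting below, "spark.sql.shuffle.partitions", is just a commonly tuned example, not a recommendation.

<CODE START>

%python

# Inspect a Spark configuration value (cluster-defined or defaulted).
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Session-level settings can also be adjusted from a notebook.
spark.conf.set("spark.sql.shuffle.partitions", "8")

<CODE END>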

In addition, we can create environment variables that can then be used as part of scripts and code.  This is especially helpful when we want to run complex scripts or create reusable code that can be ported from cluster to cluster.  For instance, we can easily transition from Dev to QA to Prod clusters by storing the connection string for the associated database as an environment variable.  Slightly off-topic, there's also a secure way to handle this using secrets or Azure Key Vault.  You can read more about the Spark Advanced options here.
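As a small, hypothetical sketch of how such a variable might be consumed, suppose we defined an environment variable named DB_CONN_STRING on the cluster; the name is made up purely for illustration.

<CODE START>

%python

import os

# Read a cluster-defined environment variable (hypothetical name) so the same
# notebook can run unchanged against Dev, QA and Prod clusters.
conn_string = os.environ.get("DB_CONN_STRING", "<not set on this cluster>")
print(conn_string)

<CODE END>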
Advanced Options - Tags
Tags are actually an Azure feature that is unrelated to Azure Databricks.  Tags allow us to create key-value pairs that can be used to filter or group our Azure costs.  This is a great way to keep track of how much particular projects, individuals or cost centers are spending in Azure.  Tags also provide some advantages for Azure administration, such as understanding which resources belong to which teams or applications.  You can read more about Tags here.
Advanced Options - Logging
We also have the option to decide where our driver and worker log files will be sent.  Currently, it appears the only supported destination is DBFS, which we haven't talked about yet.  However, it is possible to mount an external storage location to DBFS, thereby allowing us to write these log files to an external storage location.  You can read more about Logging here.
Advanced Options - Init Scripts
We have the ability to run initialization scripts during the cluster creation process.  These scripts are run before the Spark driver or worker JVM starts.  Obviously, this is only for very advanced users who need specific functionality from their clusters.  It's interesting to note that these scripts only have access to a set of pre-defined environment variables and do not have access to the user-defined environment variables that we looked at previously.  You can read more about Initialization Scripts here.
Advanced Options - ADLS Gen1
If we select "High Concurrency" as our Cluster Type, we also have an additional option to pass through our credentials to Azure Data Lake Store Gen1.  This means that we don't have worry about configuring access using secrets or configurations.  However, this does limit us to only using only SQL and Python commands.  You can read more about ADLS Gen1 Credential Passthrough here.

That's all there is to creating clusters in Azure Databricks.  We love the fact that Azure Databricks makes it incredibly easy to spin up a basic cluster, complete with all the standard libraries and packages, but also gives us the flexibility to create more complex clusters as our use cases dictate.  Stay tuned for the next post where we'll dig into Notebooks.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, June 3, 2019

Azure Databricks: Introduction

Today, we're going to start a new series on Azure Databricks.  We've talked about Azure Databricks in a couple of previous posts (here and here).  This post will largely be a reiteration of what was discussed in those posts.  However, it's a necessary starting point for this series.  Entering the rabbit hole in 3...2...1...

So, what is Azure Databricks?  To answer this question, let's start all the way at the bottom of the hole and climb up.  So, what is Hadoop?  Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations.  It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things.  However, Hadoop solutions are extremely complex to manage and develop.  So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems.  Apache Spark is one of these platforms.  You can read more about Apache Hadoop here and here.

So, what is Spark?  Apache Spark is a distributed processing framework built on top of the Hadoop ecosystem.  Spark provides a number of useful abstractions on top of Hadoop.  This isn't technically true, but the reality is too complex for us to dive into here.  Basically, Spark allows us to build complex data processing workflows using a substantially easier programming model.  However, Spark still requires us to manage our own Hadoop environment.  This is where Databricks comes in.  You can read more about Apache Spark here and here.

So, what is Databricks?  Per Wikipedia, "Databricks is a company founded by the creators of Apache Spark that aims to help clients with cloud-based big data processing using Spark."  Typically, when people say Databricks, they are referring to one of their products, the Databricks Unified Analytics Platform.  This platform is a fully-managed, cloud-only Spark service that allows us to quickly and easily spin up Spark clusters that are pre-configured for Analytics, including a number of standard analytics libraries.  Just as importantly, it also exposes these clusters to us via an intuitive online notebook interface that supports Scala, Python, R, Markdown, Shell and File System commands within a single notebook.  This provides the incredibly useful ability to have Databricks automatically spin up a Spark cluster for us, configure it with all of the basic libraries we will need and execute our code in any mix of the languages listed earlier.  This combination of features is completely unique in the Data Analytics and Engineering space.  You can read more about Databricks here and here.

So, what is Azure Databricks?  As previously mentioned, Databricks is purpose-built to leverage the elasticity, scalability and manageability of the cloud.  Azure Databricks is the Databricks product available in the Azure cloud.  This service leverages native Azure resources, like Blob Storage, Virtual Machines and Virtual Networks, to host its service.  Furthermore, Azure Databricks is a "first-class" Azure resource.  This means that it is fully integrated and supported by Microsoft.  Despite it being owned and developed by an external company, Microsoft treats it the same as it does its own internally developed resources.  You can read more about Azure Databricks here and here.

Now for the final question, "How is Azure Databricks different from the other Notebook, Data Science and Spark technologies on Azure?"  There are a number of technologies available on Azure that are designed to handle a subset of these features.

Unfortunately, none of these tools offers the entire suite of functionality provided by Azure Databricks.  The closest competitor is probably HDInsight.  However, as we saw in a previous post, HDInsight requires a certain level of Hadoop knowledge to begin using.

So, how do we get started?  It's as simple as creating a free Azure Databricks account from the Azure portal.  The account costs nothing, but the clusters we create within the account will not be free.  Fortunately, Azure Databricks offers a free 14-day trial that can be repeated over and over for our testing needs.  We're really excited for you to join us on our Azure Databricks journey.  It's going to be an awesome ride.  Stay tuned for the next post where we'll dig into creating clusters.  Thanks for reading.  We hope you found this informative.


Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, May 13, 2019

Azure Architecture Center

Today, we're going to deviate from the typical formula.  Instead of a technical walkthrough, we're going to talk about the Azure Architecture Center.  The Azure Architecture Center is a one-stop shop for all of the information you'll need when learning how to architect infrastructure, application or data solutions in Azure.  You can find the Azure Architecture Center here.  So, what is Architecture?

In the tech space, there are a few different levels of Architecture.  The highest level is typically called "Solution Architect".  This level is typically tasked with translating high-level business vision into a set of high-level technical features, including the selection of different pieces of technology to accomplish this, all before a single piece of code is written.  This process typically continues while development is actively occurring to ensure that the "big picture view" is not lost.

There are variations of Architects below this, including Application Architects, who are in charge of designing solutions with a focus on applications; Data/Analytics Architects, who are in charge of designing solutions with a focus on data and analytics; and Database Architects, who are in charge of describing the intricate structure of the databases leveraged for any of these solutions.  It's important to note that the majority of architecture work is done BEFORE any development takes place and, to a lesser extent, DURING the development process.  It will probably make more sense when you see the typical deliverable, an Architecture Diagram.
Batch Scoring of Python Models
This is a simple architecture diagram showing how to piece together a few different Azure technologies to leverage Python models to perform batch scoring of real-time sensor data.  Given the relatively narrow scope of this diagram (focusing only on the analytics solutions), this would likely be created by some type of Data or Analytics Architect.  You can find this architecture described here.  Let's take a look at a much more complex architecture.
Highly-Available Web Application
This diagram describes a complex web application that would need to operate across multiple regions and multiple availability zones to minimize the probability of substantial application failure due to issues within the datacenters.  This solution is complex enough that it will typically be built by either a Solution or Application Architect.  You can find this architecture described here.

Now that we understand what Architecture is, why is the Azure Architecture Center so useful?  Let's start by thinking about a very narrow technical problem.  You are writing a piece of code, hit execute and a big error pops up "Error 6373782: Blah Blah Blah".  It's typically not too complex of an exercise to google this error message and find relevant information on how to solve it.  However, the further away from code we get, the more difficult it is to research the solution to a problem.  For instance, if we are looking for the pros and cons of different big data analytic storage technologies, googling phrases like "Big Data Analytics Database" may not yield the results we are looking for.  The core issue is that it's very difficult to research abstract concepts that we don't already understand.

Now, imagine that instead of having to filter through a ton of irrelevant search results, there was a single location where an aspiring Architect or an experienced Architect who is new to Azure could go to learn about the best practices for building different types of solutions, straight from the experts themselves.  This is the power of the Azure Architecture Center.  We commonly refer people to the Azure Architecture Center if they are struggling to grasp "how it all fits together".  Once they understand that, doing the individual research on how to develop using each of the technologies becomes a much simpler exercise.

To summarize, we highly encourage anyone who develops or architects solutions on Azure to become familiar with the Azure Architecture Center.  It is a great starting point for understanding the big picture and can provide the necessary context to ensure that we are following all of the best practices provided by the people that actually build the products we use every day.  Here's the link again.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, April 22, 2019

Data Science In Power BI: Spark MLlib

Today, we're going to talk about Spark MLlib within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query Editor, Python, Azure Machine Learning Studio, Stream Analytics, Stream Analytics with Azure Machine Learning Studio, HDInsight Hive and Databricks Spark, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Spark MLlib.

Technically, we won't be running Spark MLlib commands within Power BI in this post.  There's probably a way to pull that off using the built-in Python functionality.  Alas, that's a bit more complex than we'd like to get into here.  Instead, we're going to use Spark MLlib functionality within Databricks to generate some predictions for us, then expose those predictions using Power BI.  So, what's Spark MLlib?

MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX.  It is a machine learning framework built from the ground up to be massively scalable and operate within Spark.  This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data.  You can read more about Spark MLlib here.

In order to leverage Spark MLlib, we obviously need a way to execute Spark code.  In our minds, there's no better tool for this than Azure Databricks.  In the previous post, we covered the creation of an Azure Databricks environment.  We're going to reuse that environment for this post as well.  We'll also use the same dataset that we've been using, which contains information about individual customers.  This dataset was originally designed to predict Income based on a number of factors.  However, we left the income out of this dataset a few posts back for reasons that were important then.  So, we're actually going to use this dataset to predict "Hours Per Week" instead.

Spoiler: The resulting model is terrible, but it lets us show off some cool functionality.

To accomplish this, we're going to use PySpark.  PySpark is a Python library that allows us to execute Spark commands using Python.  Given the meteoric rise of Python as a Data Engineering and Data Science programming language, as well as its ease of coding when compared to Scala, we find it to be a great option for Spark programming.  You can read more about PySpark here.
Import Data
<CODE START>

df = sqlContext.read.parquet("/mnt/datadefaultparquet")


display(df)

<CODE END>

The first step is to import our data.  Since we previously mounted the Parquet data, this is a relatively simple use of "sqlContext".  The MLlib functionality that we are going to leverage today doesn't like strings very much.  In a real-world use case, we would use dummy encoding, also known as indicator variables and one-hot processing, to solve this problem.  We've covered this process using Azure Machine Learning Studio in this post.  It's not a basic task in PySpark and deserves its own post.  So, we're just going to remove all of the string columns.
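Before we drop those columns, here is a rough, hedged sketch of what that encoding could look like for a single column using "StringIndexer" and a one-hot encoder from "pyspark.ml.feature".  Note that the encoder class name differs across Spark versions (it was "OneHotEncoderEstimator" in Spark 2.3-2.4 and is "OneHotEncoder" in 3.x), so treat this as illustrative rather than something we ran for this post.

<CODE START>

%python

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Rough sketch only: index one string column, then one-hot encode the index.
# In Spark 2.3-2.4, use OneHotEncoderEstimator instead of OneHotEncoder.
indexer = StringIndexer(inputCol="workclass", outputCol="workclass_idx")
encoder = OneHotEncoder(inputCols=["workclass_idx"], outputCols=["workclass_vec"])

df_indexed = indexer.fit(df).transform(df)
df_encoded = encoder.fit(df_indexed).transform(df_indexed)

display(df_encoded.select("workclass", "workclass_idx", "workclass_vec"))

<CODE END>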
Clean Data
<CODE START>

df_clean = df.drop('customerid', 'workclass', 'education', 'maritalstatus', 'occupation', 'relationship', 'race', 'sex', 'nativecountry')


display(df_clean)

<CODE END>

Now that our dataset is cleaned up and ready for modelling, we need to "vectorize" it.  Basically, MLlib doesn't like having multiple feature columns.  Instead, it wants us to provide a single "vector" column that contains all of our features.  This is a requirement driven by the linear mathematics behind the scenes.  However, we're not sure why PySpark requires this, as it seems easy enough for it to create on its own without our help.  Alas, it is a requirement nonetheless.  Fortunately, there's a built-in function to handle it.
Vectorize Features
<CODE START>

from pyspark.ml.feature import VectorAssembler

df_vec = \
  VectorAssembler( \
    inputCols=[ \
      c \
      for c in df_clean.columns \
      if c != 'hoursperweek' \
    ] \
    ,outputCol="features" \
  ) \
  .transform(df_clean) \
  .select(['hoursperweek', 'features'])


df_vec.show()

<CODE END>

First, we have to import the "VectorAssembler()" function from the "pyspark.ml.feature" library.  We don't need to install these libraries on the cluster, as they come prepackaged with Databricks clusters.  However, we still need to import them into our notebooks.  Then, we create a new "df_vec" data frame by telling the "Vector Assembler" to vectorize all features EXCEPT "hoursperweek" into a single column called "features".  We then see the result of this operation.

There are a few important syntactical notes here.  First, Python treats a line break as the end of a statement.  So, to spread our code across multiple lines for readability, we add a "\" to the end of each line.  This tells Python that the command continues on the next line.  Also, the "display()" function displays some strangeness when we apply it to vectors.  So, we switched over to the ".show()" function.

Next, we need to create training and testing splits of our data.
Split Data
<CODE START>

(df_vec_train, df_vec_test) = df_vec.randomSplit([0.7, 0.3])

{
  'Train': {
    'Rows': df_vec_train.count()
    ,'Cols': len(df_vec_train.columns)
  }
  ,'Test': {
    'Rows': df_vec_test.count()
    ,'Cols': len(df_vec_test.columns)
  }

}

<CODE END>

We can use the built-in ".randomSplit()" function to achieve this.  This is one of the interesting functions that outputs multiple data frames at once.  Therefore, we need to capture them using the "(df1, df2)" syntax.  Finally, we can showcase the important aspects of these data frames by creating a dictionary, which uses the "{}" syntax.  We can see that both data frames have two columns, while the training data frame has many more observations than the testing data frame due to the 70/30 split.  Next, we can train our linear regression model.
Linear Regression
<CODE START>

from pyspark.ml.regression import LinearRegression

df_linreg = \
  LinearRegression(featuresCol='features', labelCol='hoursperweek') \
  .fit(df_vec_train)

df_linreg_pred = df_linreg.transform(df_vec_test)


df_linreg_pred.show()

<CODE END>

To accomplish this, we have to import another function, "LinearRegression()", this time from the "pyspark.ml.regression" library.  One of the interesting aspects of PySpark that took us some time to understand is that PySpark leverages a "lazy" execution model.  This means that commands do not run until there is some need for them to provide output.  This allows some interesting coding options.  Notice the "LinearRegression(featuresCol='features', labelCol='hoursperweek')" command that we run here.  This command is not tied to a dataset, meaning that it can't possibly return a result.  It is simply an empty mathematical formula waiting for a dataset to use it on.  However, it's still completely valid PySpark because, as we said, it isn't executed until it's needed.  So, once we apply the ".fit(df_vec_train)" function, it has a data set and can begin doing its magic.  The lazy execution model is actually more complex than this, but this is a decent example to showcase it.
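To make the lazy execution idea a little more concrete, here is a tiny illustration: transformations such as ".select()" just build up a plan and return immediately, while actions such as ".count()" or ".show()" force Spark to actually do the work.

<CODE START>

%python

# A transformation: this returns immediately because nothing is computed yet.
df_plan = df_vec_test.select('hoursperweek')

# An action: this forces Spark to execute the plan on the cluster.
print(df_plan.count())

<CODE END>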

Moving on, now that we have predictions from our linear regression model, we can calculate our evaluation metrics to see how good our model is.  BRACE YOURSELVES!
Model Evaluation
<CODE START>

from pyspark.ml.evaluation import RegressionEvaluator

reg_eval = RegressionEvaluator(predictionCol="prediction", labelCol='hoursperweek')
df_linreg_r2 = reg_eval.evaluate(df_linreg_pred, {reg_eval.metricName: "r2"})
df_linreg_rmse = reg_eval.evaluate(df_linreg_pred, {reg_eval.metricName: "rmse"})

{
  'R-Squared': df_linreg_r2
  ,'Root Mean Squared Error': df_linreg_rmse

}

<CODE END>

Wow....that is one terrible model.  If an R-Squared of 4% isn't a record low, then it must be close.  Regardless, this was a showcase of functionality given an incomplete dataset.  We were able to calculate these values using the "RegressionEvaluator()" function from "pyspark.ml.evaluation" library.

Finally, we need to store these results in the Databricks database for access from Power BI.
Predictions
<CODE START>

df_linreg_pred.write.saveAsTable("Predictions")

<CODE END>

<CODE START>

%sql


SELECT * FROM Predictions LIMIT 10;

<CODE END>

Now that our data is stored in the Predictions table, we can access it from Power BI.  As in the previous post, you can find the connection walkthrough here.
Power BI Report
This quick scatterplot of our predictions versus the actual values tells the same story as our R-Squared of 4%.  If our model was perfect, these points would all fall exactly on the line.  As an aside, this line set to "y=x" is called a "Ratio Line" in Power BI.

Hopefully, this post piqued your interest in combining the massive computing power of Spark MLlib with the ease of use and reporting capabilities of Power BI.  This blog series has been very interesting to write and we've learned a ton along the way.  We're not sure what the next series will be, but we hope you'll tune in!  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, April 1, 2019

Data Science in Power BI: Databricks Spark

Today, we're going to talk about Databricks Spark within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query Editor, Python, Azure Machine Learning Studio, Stream Analytics, Stream Analytics with Azure Machine Learning Studio and HDInsight Hive, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Databricks Spark.

Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post.  The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform.  In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don't want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.

Azure Databricks is not a Microsoft product.  It is owned and managed by the company Databricks and available in Azure and AWS.  However, Databricks is a "first party offering" in Azure.  This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products.  You can read more about Azure Databricks here, here and here.

To get started with Azure Databricks, we need to create an account.  We can do this in the Azure Portal.
Databricks Account Creation
Now that we have an account, we can access our Azure Databricks Workspace through the Azure portal.
Launch Workspace
When we navigate to our Azure Databricks resource, we simply click the "Launch Workspace" button.
Azure Databricks Portal
This takes us to the Azure Databricks Portal.  Azure Databricks has three major components that we will leverage.  First, it has a SparkSQL database behind the scenes that we can store data in and query from.  Second, it stores our code as notebooks that we can use to interactively explore.  Finally, it allows us to create clusters that quickly spin up and spin down, separately from our code.  This means that we can destroy our clusters (and stop paying for them), without losing any of our code or data.
New Cluster
Let's start by creating our cluster definition.
Cluster Definition
Azure Databricks allows us to easily create Spark clusters with the ability to auto-scale.  This is a huge differentiator from HDInsight, which typically requires us to destroy and recreate the cluster if we want to add nodes.  In our case, we just want the smallest possible cluster, as our dataset is pretty small.  The cluster creation takes a few minutes.
New Notebook
Create Notebook
Once the cluster is created, we can create a notebook and select its default language.  One of the really cool things about Databricks is that we can write multiple query languages within the same notebook.  It's extremely simple to ingest data using Scala, process it using SQL, build predictive models using R and visualize it using Python, all inside a single notebook.  The language we select when creating the notebook is simply the default language, but it can easily be overridden using "%sql", "%scala", "%python" and "%r".  There's even a "%md" command for writing formatted text using markdown.  Now that we have a cluster and a notebook, let's use Scala to query data from our blob store.
Read Avro from Blob
<CODE START>

%scala

spark.conf.set(
  "fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net", <Storage Account Access Key>
);

val df = spark.read.format("avro").load("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/data");

df.show();

<CODE END>

We were able to do this in a few basic lines of code.  Now that we can access our data, let's discuss the goal of this post.  In the previous post, we mentioned that Hive doesn't perform well against lots of small files.  This is true for Spark as well.  One technique for handling this is to have jobs that coalesce a large number of small files into a single large file.  These jobs can be run daily, weekly or monthly depending on the data needs of the organization.  Since we've already stored the entire dataset in the "df" Data Frame, it's trivial to write all of this out to a set of randomly partitioned files or a single file.
Write Avro to Blob
<CODE START>

%scala

// Write using default partitions

df.write.format("avro").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault/")

// Write using single partition

df.repartition(1).write.format("avro").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datasinglefile/")

<CODE END>

Since Spark is a distributed processing framework, the default behavior is to read and write files in a partitioned state.  However, we can override this by using the "repartition()" function.  Let's take a look at the results.
Default Partitions

Single Partition
Immediately, we notice the awkward naming convention that Spark uses for its files.  This is a built-in feature of Hadoop and cannot be directly altered.  However, it is possible to use File System commands to rename the files after they've been written, if it's necessary for some reason.  We also see that Spark includes some status files as well.  These aren't a major concern, as most Hadoop technologies know how to leverage or ignore these files as needed.
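As a hedged sketch of that renaming idea, "dbutils.fs" can list the output folder and move the part file to a friendlier name.  The destination file name below is made up for illustration, and this assumes the storage account key is available to the cluster (for example, via the cluster's Spark configuration).

<CODE START>

%python

# Hedged sketch: find the single part file that Spark wrote and rename it.
src_dir = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datasinglefile/"

part_files = [f.path for f in dbutils.fs.ls(src_dir) if f.name.startswith("part-")]

# Move (rename) the part file; the destination name is illustrative only.
dbutils.fs.mv(part_files[0], src_dir + "customers.avro")

<CODE END>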

Seeing this data leads us to two important questions.  First, is the total data volume smaller if we store it in multiple files?
Storage Statistics
Fortunately, Azure Storage Explorer has some built-in functionality for this.  It turns out that we saw an 83% size reduction using the default partitioning and an 87% size reduction using the single partitioning.  More importantly, what kind of read performance do we get off of these files?
Read Statistics
This test showed that the small files took 9.37 seconds, the default partitioned files took 0.67 seconds and the single file took 0.54 seconds.  These numbers are even better than the Storage Statistics, boasting a 93% read time reduction for the default partitioning and a 94% read reduction for the single file.  What's even more interesting here is that there doesn't seem to be a substantial difference between using the default partitioning and using a single file.  So, we'll just stick with the defaults moving forward.
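For anyone who wants to reproduce a rough version of this comparison, one simple (if unscientific) approach is to time a full read-and-count of each location from a Python cell.  The folder names below mirror the ones we created earlier in this post.

<CODE START>

%python

import time

def time_read(path):
    # Force a full read by counting the rows and return the elapsed seconds.
    start = time.time()
    spark.read.format("avro").load(path).count()
    return time.time() - start

base = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/"

print("Small files:", time_read(base + "data/"))
print("Default partitions:", time_read(base + "datadefault/"))
print("Single file:", time_read(base + "datasinglefile/"))

<CODE END>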

Now that we've compiled all of our small files together, we need to expose the data in a way that Power BI can efficiently read.  Power BI can read from Blob Storage, but it's messy and we won't try that.  Fortunately, Databricks has a built-in Spark SQL Database that we can write our data to.
createOrReplaceTempView
<CODE START>

%scala


df.createOrReplaceTempView("Customers")

<CODE END>

<CODE START>

%sql


SELECT * FROM Customers LIMIT 10

<CODE END>

Databricks allows us to leverage a Scala function, "createOrReplaceTempView()", that automatically creates a temporary table for us.  Unlike in the previous post, this table is not connected directly to the data in blob storage.  Instead, it's a temporary view over the data we loaded into the "df" Data Frame, so it only lasts for the current Spark session rather than being stored permanently.  Just like with HDInsight, it is possible to create an external table that connects to remote file storage.  In order to do this, we must first mount the storage container.
Mount Storage Account
<CODE START>

%scala

dbutils.fs.mount(
  source = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault",
  mountPoint = "/mnt/datadefault",

  extraConfigs = Map("fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net" -> <Storage Account Access Key>))

<CODE END>

This command allows us to access the data in the "datadefault" folder within the "interactivequery" container, simply by referencing the "/mnt/datadefault" location as if it were local.  We can use this to create our external table.  Understanding that it's not good practice to put account keys directly in our code, it's also possible to store our Account Keys using the Secrets API and reference them using the "dbutils.secrets.get" function.
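As a hedged example of that more secure pattern, the mount could reference a secret instead of a raw key.  The scope and key names below ("storage-scope" and "datascience-account-key") are hypothetical and would need to be created through the Secrets API or an Azure Key Vault-backed scope first.

<CODE START>

%python

# Hypothetical secret scope and key names; create them before running this.
account_key = dbutils.secrets.get(scope = "storage-scope", key = "datascience-account-key")

# Unmount first if the path is already mounted from the earlier example.
dbutils.fs.mount(
  source = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault",
  mount_point = "/mnt/datadefault",
  extra_configs = {"fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net": account_key})

<CODE END>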

Unfortunately, while writing this, we discovered that there is a bug affecting the use of Avro for this operation.  If we use the Avro data, we end up with a strange table.
CustomersBlob (Avro)
<CODE START>

%sql

CREATE EXTERNAL TABLE CustomersBlob(
  customerid string,
  age int,
  workclass string,
  fnlwgt int,
  education string,
  educationnum int,
  maritalstatus string,
  occupation string,
  relationship string,
  race string,
  sex string,
  capitalgain int,
  capitalloss int,
  hoursperweek int,
  nativecountry string
)
STORED AS AVRO
LOCATION '/mnt/datadefault/';


SELECT * FROM CustomersBlob LIMIT 10;

<CODE END>

Fortunately, saving the same data in Parquet format resolves the problem with no issue.
CustomersBlob (Parquet)
Now that this data is accessible through Databricks, we can access it from Power BI using the Spark connector.  First, we need to find the connection information.  In order to connect, we need to get the JDBC/ODBC connection information for the cluster, as well as an access token.  We won't show this process step-by-step, but you can find it here.
Customers
Hopefully, this post enlightened you to the possibilities of combining the processing power of Databricks Spark with Power BI.  Databricks is one of the coolest new kids on the block and provides an incredibly simple mechanism for implementing Python, R, Scala and SQL code against the open-source Spark framework.  Stay tuned for the final post in this series where we'll dig into Spark MLlib.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com