Monday, June 24, 2019

Azure Databricks: Cluster Creation

Today, we're going to talk about Cluster Creation in Azure Databricks.  If you haven't read the first post in this series, Introduction, it may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Cluster Creation.

There are three major concepts for us to understand about Azure Databricks: Clusters, Code and Data.  We will dig into each of these in due time.  For this post, we're going to talk about Clusters.  Clusters are where the work is done.  Clusters themselves do not store any code or data.  Instead, they provide the physical resources that are used to perform the computations.  So, it's possible (and even advised) to develop code against small development clusters, then leverage the same code against larger production-grade clusters for deployment.  Let's start by creating a small cluster.
New Cluster
Cluster Properties
Let's dig into this.
Cluster Mode
First, we're setting up this cluster in "Standard" mode.  This is the typical cluster mode that is very useful for developing code, performing analyses or running individual jobs.  The other option is "High Concurrency".  This mode is optimized for multiple users running multiple jobs at the same time.  You can read more about Cluster Modes here.
Python Version
Then, we have to decide whether we want to use Python 2 or Python 3.  Long story short, Python 2 and Python 3 are very different languages.  Python 3 is the newer version and should be used in the majority of cases.  In the rare cases where we are trying to productionize existing Python 2 code, Python 2 can be leveraged.  You can read more about Python Versions here.
Autopilot Options
Two of the major selling points of Databricks are listed under the "Autopilot Options" section.  First, Databricks clusters have the ability to monitor performance stats and autoscale the cluster based on those stats.  This level of elasticity is virtually unheard of in most other Hadoop products.  You can read more about autoscaling here.

Additionally, we have the ability to automatically stop, but not delete, clusters that are no longer being used.  Anyone who's ever worked with a product that requires people to manually spin up and spin down resources can attest that it is extremely common for people to forget to turn them off after they're done.  This autotermination, along with the autoscaling mentioned previously, leads to substantial cost savings.  You can read more about autotermination here.
Worker Type
Databricks also gives us the ability to decide what size of cluster we want to create.  As we mentioned previously, this flexibility is an efficient way to limit costs by leveraging less expensive clusters during development.  For more advanced users, there are also a number of different worker types optimized for different performance profiles, including high CPU, high memory, high I/O or GPU acceleration.  You can read more about Worker Types here.
Min and Max Workers
With autoscaling enabled, we also have the ability to define how large or small the cluster can get.  This is an effective technique for guaranteeing thresholds for performance and price.  You can find out more about autoscaling here.
Driver Type
We also have the ability to use a different driver type than we use for the workers.  The worker nodes are used to perform the computations needed for any notebooks or jobs.  The driver node is used to store metadata and context for notebooks and to collect the results that are returned to the user.  For example, a notebook that processes a very complex dataset, but doesn't output anything to the user, will require larger worker nodes.  On the other hand, a notebook that processes a very simple dataset, but outputs a lot of data to the user for manipulation in the notebook, will require a larger driver node.  Obviously, this setting is designed for more advanced users.  You can read more about Cluster Node Types here.
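As an aside, everything configured on this screen can also be scripted against the Databricks Clusters REST API.  The sketch below is only a rough, hedged illustration of that idea; the workspace URL, personal access token, runtime version and VM sizes are placeholders rather than the exact values shown above.

<CODE START>

# A minimal sketch of a cluster definition submitted to the Clusters REST API (2.0).
# All of the values below are placeholders for illustration.
import requests

workspace_url = "https://<region>.azuredatabricks.net"  # hypothetical workspace URL
token = "<personal access token>"                       # generated under User Settings

cluster_spec = {
    "cluster_name": "dev-small",
    "spark_version": "5.4.x-scala2.11",                  # example Databricks runtime version
    "node_type_id": "Standard_DS3_v2",                   # Worker Type
    "driver_node_type_id": "Standard_DS3_v2",            # Driver Type (can differ from workers)
    "autoscale": {"min_workers": 2, "max_workers": 8},   # Min and Max Workers
    "autotermination_minutes": 60                        # stop, but don't delete, after an hour idle
}

response = requests.post(
    workspace_url + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + token},
    json=cluster_spec
)
print(response.json())  # returns the new cluster_id on success

<CODE END>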

With all of these cluster configuration options, it's obvious there are many ways to limit the overall price of our Azure Databricks environment.  This is a level of flexibility not present in any other major Hadoop environment, and one of the reasons why Databricks has gained so much popularity.  Let's move on to the advanced options at the bottom of this screen.
Advanced Options - Spark
Behind every Spark cluster, there are a large number of configuration options.  Azure Databricks uses many of the common default values.  However, there are some cases where advanced users may want to change some of these configuration values to improve performance or enable more complex features.  You can read more about Apache Spark configuration options here.

In addition, we can create environment variables that can then be used as part of scripts and code.  This is especially helpful when we want to run complex scripts or create reusable code that can be ported from cluster to cluster.  For instance, we can easily transition from Dev to QA to Prod clusters by storing the connection string for the associated database as an environment variable.  Slightly off-topic, there's also a secure way to handle this using secrets or Azure Key Vault.  You can read more about the Spark Advanced options here.
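To make this concrete, here's a minimal sketch of how a notebook might consume these settings.  The environment variable name is hypothetical; the point is simply that anything defined under the cluster's Environment Variables is visible to our code through the standard operating system mechanisms, and Spark configuration values can be inspected or adjusted at the session level.

<CODE START>

import os

# "spark" is the SparkSession that Databricks notebooks provide automatically.
# Read a cluster-level environment variable; "DB_CONNECTION_STRING" is a
# hypothetical name used purely for illustration.
jdbc_url = os.environ.get("DB_CONNECTION_STRING")

# Inspect or adjust a Spark configuration value for the current session.
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))

<CODE END>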
Advanced Options - Tags
Tags are actually an Azure feature that is not specific to Azure Databricks.  Tags allow us to create key-value pairs that can be used to filter or group our Azure costs.  This is a great way to keep track of how much particular projects, individuals or cost centers are spending in Azure.  Tags also provide some advantages for Azure administration, such as understanding which resources belong to which teams or applications.  You can read more about Tags here.
Advanced Options - Logging
We also have the option to decide where our driver and worker log files will be sent.  Currently, it appears the only supported destination is DBFS, which we haven't talked about yet.  However, it is possible to mount an external storage location to DBFS, thereby allowing us to write these log files to an external storage location.  You can read more about Logging here.
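For reference, if we ever script cluster creation, this setting corresponds to the "cluster_log_conf" field of the Clusters API payload; the destination path below is just an example.

<CODE START>

# Sketch of the log delivery setting in a Clusters API payload: ship driver and
# worker logs to a DBFS folder (which could itself be a mounted external store).
# The destination path is an arbitrary example.
log_conf = {"cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs/dev-small"}}}

<CODE END>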
Advanced Options - Init Scripts
We have the ability to run initialization scripts during the cluster creation process.  These scripts are run before the Spark driver or worker JVM starts.  Obviously, this is only for very advanced users who need specific functionality from their clusters.  It's interesting to note that these scripts only have access to a set of pre-defined environment variables and do not have access to the user-defined environment variables that we looked at previously.  You can read more about Initialization Scripts here.
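As a rough illustration, one common pattern is to write the script to DBFS from a notebook and then reference it on the cluster's Init Scripts setting.  The path and the package being installed below are hypothetical.

<CODE START>

# Write a simple init script to DBFS; it could then be referenced on the cluster's
# Init Scripts setting as dbfs:/databricks/init-scripts/install-libs.sh.
# The path and the package name are placeholders for illustration.
dbutils.fs.put(
  "/databricks/init-scripts/install-libs.sh",
  "#!/bin/bash\npip install azure-storage-blob\n",
  True  # overwrite the file if it already exists
)

<CODE END>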
Advanced Options - ADLS Gen1
If we select "High Concurrency" as our Cluster Type, we also have an additional option to pass through our credentials to Azure Data Lake Store Gen1.  This means that we don't have worry about configuring access using secrets or configurations.  However, this does limit us to only using only SQL and Python commands.  You can read more about ADLS Gen1 Credential Passthrough here.

That's all there is to creating clusters in Azure Databricks.  We love the fact that Azure Databricks makes it incredibly easy to spin up a basic cluster, complete with all the standard libraries and packages, but also gives us the flexibility to create more complex clusters as our use cases dictate.  Stay tuned for the next post where we'll dig into Notebooks.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, June 3, 2019

Azure Databricks: Introduction

Today, we're going to start a new series on Azure Databricks.  We've talked about Azure Databricks in a couple of previous posts (here and here).  This post will largely be a reiteration of what was discussed in those posts.  However, it's a necessary starting point for this series.  Entering the rabbit hole in 3...2...1...

So, what is Azure Databricks?  To answer this question, let's start all the way at the bottom of the hole and climb up.  So, what is Hadoop?  Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations.  It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things.  However, Hadoop solutions are extremely complex to manage and develop.  So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems.  Apache Spark is one of these platforms.  You can read more about Apache Hadoop here and here.

So, what is Spark?  Apache Spark is a distributed processing framework that provides a number of useful abstractions on top of the Hadoop ecosystem.  This isn't technically accurate, but the reality is too complex for us to dive into here.  Basically, Spark allows us to build complex data processing workflows using a substantially easier programming model.  However, Spark still requires us to manage our own Hadoop environment.  This is where Databricks comes in.  You can read more about Apache Spark here and here.

So, what is Databricks?  Per Wikipedia, "Databricks is a company founded by the creators of Apache Spark that aims to help clients with cloud-based big data processing using Spark."  Typically, when people say Databricks, they are referring to one of their products, the Databricks Unified Analytics Platform.  This platform is a fully-managed, cloud-only Spark service that allows us to quickly and easily spin up Spark clusters that are pre-configured for Analytics, including a number of standard analytics libraries.  Just as importantly, it also exposes these clusters to us via an intuitive online notebook interface that supports Scala, Python, R, Markdown, Shell and File System commands within a single notebook.  This provides the incredibly useful ability to have Databricks automatically spin up a Spark cluster for us, configure it with all of the basic libraries we will need and execute our code in any mix of the languages listed earlier.  This combination of features is completely unique in the Data Analytics and Engineering space.  You can read more about Databricks here and here.

So, what is Azure Databricks?  As previously mentioned, Databricks is purpose-built to leverage the elasticity, scalability and manageability of the cloud.  Azure Databricks is the Databricks product available in the Azure cloud.  This service leverages native Azure resources, like Blob Storage, Virtual Machines and Virtual Networks, to host its service.  Furthermore, Azure Databricks is a "first-class" Azure resource.  This means that it is fully integrated and supported by Microsoft.  Despite it being owned and developed by an external company, Microsoft treats it the same way it does its own internally developed resources.  You can read more about Azure Databricks here and here.

Now for the final question, "How is Azure Databricks different from the other Notebook, Data Science and Spark technologies on Azure?"  There are a number of technologies available on Azure that are designed to handle a subset of these features.

Unfortunately, none of these tools offers the entire suite of functionality provided by Azure Databricks.  The closest competitor is probably HDInsight.  However, as we saw in a previous post, HDInsight requires a certain level of Hadoop knowledge to begin using.

So, how do we get started?  It's as simple as creating a free Azure Databricks account from the Azure portal.  The account costs nothing, but the clusters we create within the account will not be free.  Fortunately, Azure Databricks offers a free 14-day trial that can be repeated over and over for our testing needs.  We're really excited for you to join us on our Azure Databricks journey.  It's going to be an awesome ride.  Stay tuned for the next post where we'll dig into creating clusters.  Thanks for reading.  We hope you found this informative.


Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, May 13, 2019

Azure Architecture Center

Today, we're going to deviate from the typical formula.  Instead of a technical walkthrough, we're going to talk about the Azure Architecture Center.  The Azure Architecture Center is a one-stop shop for all of the information you'll need when learning how to architect infrastructure, application or data solutions in Azure.  You can find the Azure Architecture Center here.  So, what is Architecture?

In the tech space, there are a few different levels of Architecture.  The highest level is typically called the "Solution Architect".  This role is tasked with translating high-level business vision into a set of high-level technical features, including the selection of the different pieces of technology needed to accomplish this, all before a single piece of code is written.  This process typically continues while development is actively occurring to ensure that the "big picture view" is not lost.

There are variations of Architects below this, including Application Architects, who are in charge of designing solutions with a focus on applications; Data/Analytics Architects, who are in charge of designing solutions with a focus on data and analytics; and Database Architects, who are in charge of describing the intricate structure of the databases leveraged for any of these solutions.  It's important to note that the majority of architecture work is done BEFORE any development takes place and, to a lesser extent, DURING the development process.  It will probably make more sense when you see the typical deliverable, an Architecture Diagram.
Batch Scoring of Python Models
This is a simple architecture diagram showing how to piece together a few different Azure technologies to leverage Python models to perform batch scoring of real-time sensor data.  Given the relatively narrow scope of this diagram (focusing only on the analytics solutions), this would likely be created by some type of Data or Analytics Architect.  You can find this architecture described here.  Let's take a look at a much more complex architecture.
Highly-Available Web Application
This diagram describes a complex web application that would need to operate across multiple regions and multiple availability zones to minimize the probability of substantial application failure due to issues within the datacenters.  This solution is complex enough that it will typically be built by either a Solution or Application Architect.  You can find this architecture described here.

Now that we understand what Architecture is, why is the Azure Architecture Center so useful?  Let's start by thinking about a very narrow technical problem.  You are writing a piece of code, hit execute and a big error pops up "Error 6373782: Blah Blah Blah".  It's typically not too complex of an exercise to google this error message and find relevant information on how to solve it.  However, the further away from code we get, the more difficult it is to research the solution to a problem.  For instance, if we are looking for the pros and cons of different big data analytic storage technologies, googling phrases like "Big Data Analytics Database" may not yield the results we are looking for.  The core issue is that it's very difficult to research abstract concepts that we don't already understand.

Now, imagine that instead of having to filter through a ton of irrelevant search results, there was a single location where an aspiring Architect or an experienced Architect who is new to Azure could go to learn about the best practices for building different types of solutions, straight from the experts themselves.  This is the power of the Azure Architecture Center.  We commonly refer people to the Azure Architecture Center if they are struggling to grasp "how it all fits together".  Once they understand that, doing the individual research on how to develop using each of the technologies becomes a much simpler exercise.

To summarize, we highly encourage anyone who develops or architects solutions on Azure to become familiar with the Azure Architecture Center.  It is a great starting point for understanding the big picture and can provide the necessary context to ensure that we are following all of the best practices provided by the people that actually build the products we use every day.  Here's the link again.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, April 22, 2019

Data Science In Power BI: Spark MLlib

Today, we're going to talk about Spark MLlib within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query Editor, Python, Azure Machine Learning Studio, Stream Analytics, Stream Analytics with Azure Machine Learning Studio, HDInsight Hive and Databricks Spark, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Spark MLlib.

Technically, we won't be running Spark MLlib commands within Power BI in this post.  There's probably a way to pull that off using the built-in Python functionality.  Alas, that's a bit more complex than we'd like to get into here.  Instead, we're going to use Spark MLlib functionality within Databricks to generate some predictions for us, then expose those predictions using Power BI.  So, what's Spark MLlib?

MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX.  It is a machine learning framework built from the ground up to be massively scalable and operate within Spark.  This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data.  You can read more about Spark MLlib here.

In order to leverage Spark MLlib, we obviously need a way to execute Spark code.  In our minds, there's no better tool for this than Azure Databricks.  In the previous post, we covered the creation of an Azure Databricks environment.  We're going to reuse that environment for this post as well.  We'll also use the same dataset that we've been using, which contains information about individual customers.  This dataset was originally designed to predict Income based on a number of factors.  However, we left the income out of this dataset a few posts back for reasons that were important then.  So, we're actually going to use this dataset to predict "Hours Per Week" instead.

Spoiler: The resulting model is terrible, but it lets us show off some cool functionality.

To accomplish this, we're going to use PySpark.  PySpark is a Python library that allows us to execute Spark commands using Python.  Given the meteoric rise of Python as a Data Engineering and Data Science programming language, as well as its ease of coding when compared to Scala, we find it to be a great option for Spark programming.  You can read more about PySpark here.
Import Data
<CODE START>

df = sqlContext.read.parquet("/mnt/datadefaultparquet")


display(df)

<CODE END>

The first step is to import our data.  Since we previously mounted the Parquet data, this is a relatively simple use of "sqlContext".  The MLlib functionality that we are going to leverage today doesn't like strings very much.  In a real-world use case, we would use dummy encoding, also known as indicator variables or one-hot encoding, to solve this problem.  We've covered this process using Azure Machine Learning Studio in this post.  It's not a basic task in PySpark and deserves its own post.  So, we're just going to remove all of the string columns.
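For the curious, a minimal sketch of what that encoding might look like for a single column is below.  It assumes a Spark 3.x runtime, where "OneHotEncoder" is an estimator that accepts "inputCols" and "outputCols"; older runtimes use different class names, and a real solution would loop over every string column instead of just "workclass".

<CODE START>

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Index the string values, then expand the index into a sparse indicator vector.
indexer = StringIndexer(inputCol="workclass", outputCol="workclass_idx", handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["workclass_idx"], outputCols=["workclass_vec"])

df_encoded = Pipeline(stages=[indexer, encoder]).fit(df).transform(df)


df_encoded.select("workclass", "workclass_vec").show()

<CODE END>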
Clean Data
<CODE START>

df_clean = df.drop('customerid', 'workclass', 'education', 'maritalstatus', 'occupation', 'relationship', 'race', 'sex', 'nativecountry')


display(df_clean)

<CODE END>

Now that our dataset is cleaned up and ready for modelling, we need to "vectorize" it.  Basically, MLlib doesn't like having multiple feature columns.  Instead, it wants us to provide a single "vector" that contains all of our features.  This is a requirement driven by the linear mathematics behind the scenes.  However, we're not sure why PySpark requires this, as it seems easy enough for it to create on its own without our help.  Alas, it is a requirement nonetheless.  Fortunately, there's a built-in function to handle it.
Vectorize Features
<CODE START>

from pyspark.ml.feature import VectorAssembler

df_vec = \
  VectorAssembler( \
    inputCols=[ \
      c \
      for c in df_clean.columns \
      if c != 'hoursperweek' \
    ] \
    ,outputCol="features" \
  ) \
  .transform(df_clean) \
  .select(['hoursperweek', 'features'])


df_vec.show()

<CODE END>

First, we have to import the "VectorAssembler()" function from the "pyspark.ml.feature" library.  We don't need to install these libraries on the cluster, as they come prepackaged with Databricks clusters.  However, we still need to import them into our notebooks.  Then, we create a new "df_vec" data frame by telling the "VectorAssembler()" to vectorize all features EXCEPT "hoursperweek" into a single column called "features".  We then see the result of this operation.

There are a few important syntactical notes here.  First, Python treats a line break as the end of a statement.  So, to spread a single command across multiple lines for readability, we have to add a "\" to the end of each line.  This tells Python that the command continues on the next line.  Also, the "display()" function displays some strangeness when we apply it to vectors.  So, we switched over to the ".show()" function.

Next, we need to create training and testing splits of our data.
Split Data
<CODE START>

(df_vec_train, df_vec_test) = df_vec.randomSplit([0.7, 0.3])

{
  'Train': {
    'Rows': df_vec_train.count()
    ,'Cols': len(df_vec_train.columns)
  }
  ,'Test': {
    'Rows': df_vec_test.count()
    ,'Cols': len(df_vec_test.columns)
  }

}

<CODE END>

We can use the built-in ".randomSplit()" function to achieve this.  This is one of the interesting PySpark functions that outputs multiple data frames at once.  Therefore, we need to capture it using the "(df1, df2)" syntax.  Finally, we can showcase the important aspects of these data frames by creating a dictionary, which uses the "{}" syntax.  We can see that both data frames have two columns, while the training data frame has many more observations than the testing data frame due to the 70/30 split.  Next, we can train our linear regression model.
Linear Regression
<CODE START>

from pyspark.ml.regression import LinearRegression

df_linreg = \
  LinearRegression(featuresCol='features', labelCol='hoursperweek') \
  .fit(df_vec_train)

df_linreg_pred = df_linreg.transform(df_vec_test)


df_linreg_pred.show()

<CODE END>

To accomplish this, we have to import another function, "LinearRegression()", this time from the "pyspark.ml.regression" library.  One of the interesting aspects of PySpark that took us some time to understand is that PySpark leverages a "lazy" execution model.  This means that commands do not run until there is some need for them to provide output.  This allows some interesting coding options.  Notice the "LinearRegression(featuresCol='features', labelCol='hoursperweek')" command that we run here.  This command is not tied to a dataset, meaning that it can't possibly return a result.  It is simply an empty mathematical formula waiting for a dataset to use it on.  However, it's still completely valid PySpark because, as we said, it isn't executed until it's needed.  So, once we apply the ".fit(df_vec_train)" function, it has a dataset and can begin doing its magic.  The lazy execution model is actually more complex than this, but this is a decent example to showcase it.
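Here's a tiny, hedged illustration of that behavior using the "df_vec" data frame from earlier: defining a transformation returns immediately because nothing is computed, and only an action such as ".count()" forces Spark to actually do the work.  The "df_overtime" name is ours, purely for illustration.

<CODE START>

# Transformations are only recorded in the execution plan; this line returns immediately.
df_overtime = df_vec.filter(df_vec.hoursperweek > 40)

# Actions force execution; this is the point where the cluster actually runs the work.
print(df_overtime.count())

<CODE END>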

Moving on, now that we have predictions from our linear regression model, we can calculate our evaluation metrics to see how good our model is.  BRACE YOURSELVES!
Model Evaluation
<CODE START>

from pyspark.ml.evaluation import RegressionEvaluator

reg_eval = RegressionEvaluator(predictionCol="prediction", labelCol='hoursperweek')
df_linreg_r2 = reg_eval.evaluate(df_linreg_pred, {reg_eval.metricName: "r2"})
df_linreg_rmse = reg_eval.evaluate(df_linreg_pred, {reg_eval.metricName: "rmse"})

{
  'R-Squared': df_linreg_r2
  ,'Root Mean Squared Error': df_linreg_rmse

}

<CODE END>

Wow....that is one terrible model.  If an R-Squared of 4% isn't a record low, then it must be close.  Regardless, this was a showcase of functionality given an incomplete dataset.  We were able to calculate these values using the "RegressionEvaluator()" function from the "pyspark.ml.evaluation" library.

Finally, we need to store these results in the Databricks database for access from Power BI.
Predictions
<CODE START>

df_linreg_pred.write.saveAsTable("Predictions")

<CODE END>

<CODE START>

%sql


SELECT * FROM Predictions LIMIT 10;

<CODE END>

Now that our data is stored in the Predictions table, we can access it from Power BI.  As in the previous post, you can find the connection walkthrough here.
Power BI Report
This quick scatterplot of our predictions versus the actual values tells the same story as our R-Squared of 4%.  If our model was perfect, these points would all fall exactly on the line.  As an aside, this line set to "y=x" is called a "Ratio Line" in Power BI.

Hopefully, this post piqued your interest in combining the massive computing power of Spark MLlib with the ease of use and reporting capabilities of Power BI.  This blog series has been very interesting to write and we've learned a ton along the way.  We're not sure what the next series will be, but we hope you'll tune in!  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, April 1, 2019

Data Science in Power BI: Databricks Spark

Today, we're going to talk about Databricks Spark within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query Editor, Python, Azure Machine Learning Studio, Stream Analytics, Stream Analytics with Azure Machine Learning Studio and HDInsight Hive, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Databricks Spark.

Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post.  The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform.  In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don't want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.

Azure Databricks is not a Microsoft product.  It is owned and managed by the company Databricks and available in Azure and AWS.  However, Databricks is a "first party offering" in Azure.  This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products.  You can read more about Azure Databricks here, here and here.

To get started with Azure Databricks, we need to create an account.  We can do this in the Azure Portal.
Databricks Account Creation
Now that we have an account, we can access our Azure Databricks Workspace through the Azure portal.
Launch Workspace
When we navigate to our Azure Databricks resource, we simply click the "Launch Workspace" button.
Azure Databricks Portal
This takes us to the Azure Databricks Portal.  Azure Databricks has three major components that we will leverage.  First, it has a SparkSQL database behind the scenes that we can store data in and query from.  Second, it stores our code as notebooks that we can use to interactively explore our data.  Finally, it allows us to create clusters that quickly spin up and spin down, separately from our code.  This means that we can destroy our clusters (and stop paying for them) without losing any of our code or data.
New Cluster
Let's start by creating our cluster definition.
Cluster Definition
Azure Databricks allows us to easily create Spark clusters with the ability to auto-scale.  This is a huge differentiator from HDInsight, which typically requires us to destroy and recreate the cluster if we want to add nodes.  In our case, we just want the smallest possible cluster, as our dataset is pretty small.  The cluster creation takes a few minutes.
New Notebook
Create Notebook
Once the cluster is created, we can create a notebook and select its default language.  One of the really cool things about Databricks is that we can write multiple query languages within the same notebook.  It's extremely simple to ingest data using Scala, process it using SQL, build predictive models using R and visualize it using Python, all inside a single notebook.  The language we select when creating the notebook is simply the default language, but it can easily be overridden using "%sql", "%scala", "%python" and "%r".  There's even a "%md" command for writing formatted text using markdown.  Now that we have a cluster and a notebook, let's use Scala to query data from our blob store.
Read Avro from Blob
<CODE START>

%scala

spark.conf.set(
  "fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net", <Storage Account Access Key>
);

val df = spark.read.format("avro").load("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/data");

df.show();

<CODE END>

We were able to do this in a few basic lines of code.  Now that we can access our data, let's discuss the goal of this post.  In the previous post, we mentioned that Hive doesn't perform well against lots of small files.  This is true for Spark as well.  One technique for handling this is to have jobs that coalesce a large number of small files into a single large file.  These jobs can be run daily, weekly or monthly depending on the data needs of the organization.  Since we've already stored the entire dataset in the "df" Data Frame, it's trivial to write all of this out to a set of randomly partitioned files or a single file.
Write Avro to Blob
<CODE START>

%scala

// Write using default partitions

df.write.format("avro").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault/")

// Write using single partition

df.repartition(1).write.format("avro").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datasinglefile/")

<CODE END>

Since Spark is a distributed processing framework, the default state is to read and write files in a partitioned state.  However, we can override this by using the "repartition()" function.  Let's take a look at the results.
Default Partitions

Single Partition
Immediately, we notice the awkward naming convention that Spark uses for its files.  This is a built-in feature of Hadoop and cannot be directly altered.  However, it is possible to use file system commands to rename the files after they've been written, if it's necessary for some reason.  Spark also writes some status files alongside the data.  These aren't a major concern, as most Hadoop technologies know how to leverage or ignore these files as needed.
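For example, a rough sketch of such a rename using "dbutils.fs" is below, shown in Python via "%python".  The destination file name is arbitrary, and the sketch assumes the storage account key is available to DBFS (for example, set in the cluster's Spark configuration or via a mount).

<CODE START>

%python

# Rough sketch: copy the single part file written above to a friendlier name.
# The destination name is arbitrary and error handling is omitted.
src_dir = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datasinglefile/"

part_files = [f.path for f in dbutils.fs.ls(src_dir) if f.name.startswith("part-")]
dbutils.fs.cp(part_files[0], src_dir + "customers.avro")

<CODE END>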

Seeing this data leads us to two important questions.  First, is the total data volume smaller if we store it in multiple files?
Storage Statistics
Fortunately, Azure Storage Explorer has some built-in functionality for this.  It turns out that we saw an 83% size reduction using the default partitioning and an 87% size reduction using the single partitioning.  More importantly, what kind of read performance do we get off of these files?
Read Statistics
This test showed that the small files took 9.37 seconds, the default partitioned files took 0.67 seconds and the single file took 0.54 seconds.  These numbers are even better than the Storage Statistics, boasting a 93% read time reduction for the default partitioning and a 94% read reduction for the single file.  What's even more interesting here is that there doesn't seem to be a substantial difference between using the default partitioning and using a single file.  So, we'll just stick with the defaults moving forward.

Now that we've compiled all of our small files together, we need to expose the data in a way that Power BI can efficiently read.  Power BI can read from Blob Storage, but it's messy and we won't try that.  Fortunately, Databricks has a built-in Spark SQL Database that we can write our data to.
createOrReplaceTempView
<CODE START>

%scala


df.createOrReplaceTempView("Customers")

<CODE END>

<CODE START>

%sql


SELECT * FROM Customers LIMIT 10

<CODE END>

Databricks allows us to leverage a Scala function, "createOrReplaceTempView()", that automatically creates a temporary view for us.  Unlike the external table we created in the previous post, this view is not a permanent object connected directly to the data in blob storage.  Instead, it's simply a named reference to the "df" Data Frame that only lives for the duration of the Spark session.  Just like with HDInsight, it is possible to create an external table that connects to remote file storage.  In order to do this, we must first mount the storage container.
Mount Storage Account
<CODE START>

%scala

dbutils.fs.mount(
  source = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault",
  mountPoint = "/mnt/datadefault",

  extraConfigs = Map("fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net" -> <Storage Account Access Key>))

<CODE END>

This command allows us to access the data in the "datadefault" folder within the "interactivequery" container, simply by referencing the "/mnt/datadefault" location as if it were local.  We can use this to create our external table.  Understanding that it's not good practice to embed account keys in our code, it's also possible to store them using the Secrets API and reference them using the "dbutils.secrets.get()" function.
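A hedged sketch of what that might look like, shown in Python via "%python", is below; the secret scope and key names are hypothetical and would need to be created through the Secrets API or CLI first.

<CODE START>

%python

# Retrieve the storage account key from a secret scope instead of hard-coding it.
# "storage-keys" and "datascienceinpowerbibig-key" are hypothetical names.
storage_key = dbutils.secrets.get(scope = "storage-keys", key = "datascienceinpowerbibig-key")

# The value can then be passed into the mount's extra configuration map
# in place of the literal <Storage Account Access Key>.
extra_configs = {"fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net": storage_key}

<CODE END>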

Unfortunately, while writing this, we discovered that there is a bug affecting the use of Avro for this operation.  If we use the Avro data, we end up with a strange table.
CustomersBlob (Avro)
<CODE START>

%sql

CREATE EXTERNAL TABLE CustomersBlob(
  customerid string,
  age int,
  workclass string,
  fnlwgt int,
  education string,
  educationnum int,
  maritalstatus string,
  occupation string,
  relationship string,
  race string,
  sex string,
  capitalgain int,
  capitalloss int,
  hoursperweek int,
  nativecountry string
)
STORED AS AVRO
LOCATION '/mnt/datadefault/';


SELECT * FROM CustomersBlob LIMIT 10;

<CODE END>

Fortunately, saving the same data in Parquet format resolves the problem with no issue.
CustomersBlob (Parquet)
Now that this data is accessible through Databricks, we can access it from Power BI using the Spark connector.  First, we need to find the connection information.  In order to connect, we need to get the JDBC/ODBC connection information for the cluster, as well as an access token.  We won't show this process step-by-step, but you can find it here.
Customers
Hopefully, this post enlightened you to the possibilities of combining the processing power of Databricks Spark with Power BI.  Databricks is one of the coolest new kids on the block and provides an incredibly simple mechanism for implementing Python, R, Scala and SQL code against the open-source Spark framework.  Stay tuned for the final post in this series where we'll dig into Spark MLlib.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, March 11, 2019

Data Science in Power BI: HDInsight Hive

Today, we're going to talk about HDInsight Hive within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query Editor, Python, Azure Machine Learning Studio, Stream Analytics and Stream Analytics with Azure Machine Learning Studio, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, HDInsight Hive.

HDInsight is a managed Hadoop platform available on Azure.  On the back-end, HDInsight is enabled by a partnership between Hortonworks and Microsoft to bring the Hortonworks Data Platform (HDP) to Azure as a Platform-as-a-Service option.  This allows us to leverage much of the functionality of the Hadoop ecosystem without having to manage and provision the individual components.  HDInsight comes in a number of configurations for different needs, such as Kafka for Real-Time Message Ingestion, Spark for Batch Processing and ML Services for Scalable Machine Learning.  Although most of the HDInsight configurations include Hive, we'll be using the Interactive Query configuration.  You can read more about HDInsight here.

Hive is a "SQL on Hadoop" technology that combines the scalable processing framework of the ecosystem with the coding simplicity of SQL.  Hive is very useful for performant batch processing on relational data, as it leverages all of the skills that most organizations already possess.  Hive LLAP (Low Latency Analytical Processing or Live Long and Process) is an extension of Hive that is designed to handle low latency queries over massive amounts of EXTERNAL data.  One of this coolest things about the Hadoop SQL ecosystem is that the technologies allow us to create SQL tables directly on top of structured and semi-structured data without having to import it into a proprietary format.  That's exactly what we're going to do in this post.  You can read more about Hive here and here and Hive LLAP here.

We understand that SQL queries don't typically constitute traditional data science functionality.  However, the Hadoop ecosystem has a number of unique and interesting data science features that we can explore.  Hive happens to be one of the best starting points on that journey.

Let's start by creating an HDInsight cluster in the Azure portal.
HDInsight Creation Basics
HDInsight Creation Storage
HDInsight Creation Cluster Size
When we provisioned the cluster, we connected it to an Azure Storage account named "datascienceinpowerbibig".  This is important because HDInsight clusters have two types of storage available, internal and external.  Internal storage is provisioned with the cluster and is deleted when the cluster is deprovisioned.  External storage is provisioned separately and remains after the cluster is deprovisioned.  Additionally, the external storage is easily accessible by other Azure resources.  It's also important to note that we opened up the advanced options and reduced the cluster to the smallest possible size.  Even then, this cluster costs $2.75 per hour, which adds up to $2,046 per month.  This price tag is why many organizations provision HDInsight clusters only when they need them, then deprovision them when they are done.

Now that our cluster is provisioned, let's take a look at the Blob store where we pointed our cluster.
HDInsight Storage
We see that the HDInsight cluster has created a number of directories that it uses to store its data and metadata.  In addition to the automatically created files, we added a "data" directory and added a bunch of data files to it.
HDInsight Data
It's important to note that these files are all very small.  Hive is known to perform better on a small number of large files than it does on a large number of small files.  So, the data we have here is not optimal for querying.  Perhaps this is a good topic for a later post (WINK, WINK).  It's also important to note that these files are stored in Avro format.  Avro is one of the recommended formats for storing data that is to be queried by Hive.  You can read more about Avro here and Hive Data Formats here.

Now that we've seen our raw data, let's navigate to the SQL Interface within HDInsight.
Ambari
Many of the administrative interfaces and dashboards can be found within Ambari, which can be accessed from the HDInsight resource blade in the Azure portal.
Hive View
In Ambari, we can access the "Hive View 2.0" by clicking the waffle in the top-right corner of the window.
Blank Query
The Hive View takes us to a blank query window.  This is where we could write our queries.  For now, we're more interested in creating a table.  We could do this with SQL commands, but the Hive View has a nice GUI for us to use as well.
New Table
We can access this GUI by clicking the "+New Table" button in the top-right corner of the window.
Columns
We start by defining all of the columns in our dataset.
Advanced
Then, we can go to the Advanced tab to supply the location of the data and the file formats.  Since our HDInsight cluster is directly connected to the Azure Storage Account, we don't have to worry about defining complex path names.  The cluster can access the data simply by using the relative path name of "/data/".

Now that we've created our table, we can go back to the Query editor and query it.
Customers
We also have the ability to see the code that was generated to create the table by navigating to the Tables tab.
DDL
Now that we've taken a look at our table, let's get to the Power BI portion.  One of the really cool things about Hive is that it exposes our data using the same SQL drivers that we are used to.  This means that most tools can easily access it, including Power BI.
Get Data
Power BI Report

Once we connect to the table, we can use the same Power BI functionality as always to create our data visualizations.  It's important to note that there are substantial performance implications of using "DirectQuery" mode to connect to Hive LLAP clusters.  Hive LLAP is optimized for this type of access, but it's not without complexity.  Many organizations that leverage this type of technology put a substantial amount of effort into keeping the database optimized for read performance.

Hopefully, this post opened your eyes a little to the power of leveraging Big Data within Power BI.  The potential of SQL and Big Data is immense and there are many untapped opportunities to leverage it in our organizations.  Stay tuned for the next post where we'll be looking into Databricks Spark.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com