Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post. The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform. In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don't want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.
Azure Databricks is not a Microsoft product. It is owned and managed by the company Databricks and is available in both Azure and AWS. However, Databricks is a "first party offering" in Azure. This means that Microsoft offers the same level of support, functionality, and integration as it would with any of its own products.
To get started with Azure Databricks, we need to create an account. We can do this in the Azure Portal.
|Databricks Account Creation|
|Azure Databricks Portal|
|Read Avro from Blob|
spark.conf.set(
  "fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net",
  "<Storage Account Access Key>")
val df = spark.read.format("avro").load("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/data")
We were able to do this in a few basic lines of code. Now that we can access our data, let's discuss the goal of this post. In the previous post, we mentioned that Hive doesn't perform well against a large number of small files. This is true for Spark as well. One technique for handling this is to run jobs that coalesce a large number of small files into a single large file. These jobs can be run daily, weekly or monthly, depending on the data needs of the organization. Since we've already stored the entire dataset in the "df" Data Frame, it's trivial to write all of it back out as a set of randomly partitioned files or as a single file.
|Write Avro to Blob|
// Write using default partitions
df.write.format("avro").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault")

// Write using a single partition ("datasingle" is an illustrative folder name)
df.repartition(1).write.format("avro").save("wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datasingle")
Since Spark is a distributed processing framework, the default behavior is to read and write files in a partitioned state. However, we can override this by using the "repartition()" function. Let's take a look at the results.
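As a quick sketch of what's happening, we can inspect the partition count of the "df" Data Frame before and after the call (assuming the same notebook session as the code above):

```scala
// Number of partitions Spark would write by default
println(df.rdd.getNumPartitions)

// repartition(1) shuffles everything into a single partition,
// so the subsequent write produces one output file
val singlePartitionDf = df.repartition(1)
println(singlePartitionDf.rdd.getNumPartitions)  // prints 1
```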
Seeing this data leads us to two important questions. First, is the total data volume smaller if we store it in multiple files?
Now that we've compiled all of our small files together, we need to expose the data in a way that Power BI can read efficiently. Power BI can read directly from Blob Storage, but it's messy, so we won't try that. Fortunately, Databricks has a built-in Spark SQL Database that we can write our data to.
SELECT * FROM Customers LIMIT 10
Databricks allows us to leverage the Scala function "createOrReplaceTempView()", which automatically creates a temporary view for us. Unlike in the previous post, this view is not connected directly to the data in blob storage. Instead, it's a copy of the data that is stored directly inside the Databricks DBFS Storage. Just like with HDInsight, it is possible to create an external table that connects to remote file storage. In order to do this, we must first mount the storage container.
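As a sketch, assuming the "df" Data Frame from earlier, registering and querying the temporary view looks like this (the view name "Customers" matches the query above):

```scala
// Register the Data Frame as a temporary view scoped to this Spark session
df.createOrReplaceTempView("Customers")

// The view can now be queried with Spark SQL
spark.sql("SELECT * FROM Customers LIMIT 10").show()
```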
|Mount Storage Account|
dbutils.fs.mount(
  source = "wasbs://interactivequery@datascienceinpowerbibig.blob.core.windows.net/datadefault",
  mountPoint = "/mnt/datadefault",
  extraConfigs = Map("fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net" -> "<Storage Account Access Key>"))
This command allows us to access the data in the "datadefault" folder within the "interactivequery" container simply by referencing the "/mnt/datadefault" location as if it were local. We can use this to create our external table. Since it's not good practice to embed account keys in our code, it's also possible to store them using the Secrets API and reference them using the "dbutils.secrets.get()" function.
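A minimal sketch of that approach, assuming a secret scope and key have already been created (the names "storage" and "account-key" are hypothetical):

```scala
// Retrieve the account key from a Databricks secret scope
// ("storage" and "account-key" are hypothetical scope/key names)
val accountKey = dbutils.secrets.get(scope = "storage", key = "account-key")

// Use the secret instead of a hard-coded key
spark.conf.set(
  "fs.azure.account.key.datascienceinpowerbibig.blob.core.windows.net",
  accountKey)
```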
Unfortunately, while writing this, we discovered that there is a bug affecting the use of Avro for this operation. If we use the Avro data, we end up with a strange table.
CREATE EXTERNAL TABLE CustomersBlob(
  <column definitions>
)
STORED AS AVRO
LOCATION '/mnt/datadefault';
SELECT * FROM CustomersBlob LIMIT 10;
Fortunately, saving the same data in Parquet format resolves the problem.
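A sketch of that workaround, assuming the mounted container from earlier (the "dataparquet" folder name is illustrative):

```scala
// Write the Data Frame out as Parquet instead of Avro
df.write.mode("overwrite").parquet("/mnt/datadefault/dataparquet")
```

The external table can then be created with STORED AS PARQUET pointing at that location instead.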
Service Engineer - FastTrack for Azure