Monday, August 5, 2019

Azure Databricks: Databricks File System (DBFS)

Today, we're going to talk about the Databricks File System (DBFS) in Azure Databricks.  If you haven't read the previous posts in this series, Introduction, Cluster Creation and Notebooks, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, DBFS.

As we mentioned in the previous post, there are three major concepts for us to understand about Azure Databricks: Clusters, Code and Data.  For this post, we're going to talk about the storage layer underneath Azure Databricks, DBFS.  Since Azure Databricks manages Spark clusters, it needs an underlying distributed file system to play the role that the Hadoop Distributed File System (HDFS) plays in a traditional Hadoop cluster.  This is exactly the role DBFS fills.  Basically, HDFS is the low-cost, fault-tolerant, distributed file system that makes the entire Hadoop ecosystem work.  We may dig deeper into HDFS in a later post.  For now, you can read more about HDFS here and here.

Let's start by looking at the current DBFS structure.
Add Data
We can do this by selecting the "Data" button from the left navigation pane, then selecting the "Add Data" button at the top of the fly-out.  It's important to note that databases and tables are not accessible unless there is a cluster running.  However, the data is still present in DBFS. It's just not currently accessible as a table.
DBFS
In the "Create New Table" window, we can see the contents of DBFS by selecting the "DBFS" tab and navigating through the folders.  As seen in both screenshots, we don't have any tables currently or any data stored in DBFS.  So, there's not much to see here.  However, there are actually a few other folders that can't be seen in this UI.  We can access these by using the %fs commands within a Notebook.  If you're not familiar with Notebooks, check out our previous post.
fs ls

<CODE START>

%fs

ls

<CODE END>

We can use the familiar ls command to see what's in each of these folders.
FileStore

<CODE START>

%fs

ls /FileStore/

<CODE END>

The "FileStore" folder contains all of our accessible data, including Database and Table data, plots created in Notebooks and JAR files.  You can read more about the FileStore here.
databricks-datasets

<CODE START>

%fs

ls /databricks-datasets/

<CODE END>

The "databricks-datasets" folder contain a number of different sample datasets to use.  We'll be sure to leverage some of these throughout this series.  There are also some README files that we could look at.  We'll see how to do this later in this post.
databricks-results

<CODE START>

%fs

ls /databricks-results/

<CODE END>

There seems to be something blocking us from viewing the "databricks-results" folder.  It's not clear whether this is by design or some type of bug.  We would guess that this folder is used by the Databricks platform to store intermediate results, but that's just a guess based on the folder name.
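
If we'd rather check a path from Python than watch an %fs command fail, we can wrap dbutils.fs.ls in a try/except.  This is just a convenience sketch, not an official diagnostic:

<CODE START>

# Probe a DBFS path from Python and report whether it can be listed.
def can_list(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        print("Cannot list {}: {}".format(path, e))
        return False

can_list("/databricks-results/")

<CODE END>
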
ml

<CODE START>

%fs

ls /ml/

<CODE END>

The "ml" folder is currently empty.  Hypothesizing again, we think this folder may be used for MLFlow functionality.  This is a topic that we're looking forward to exploring in a later post.  You can read more about MLFlow here.
tmp

<CODE START>

%fs

ls /tmp/

<CODE END>

The "tmp" folder contains a number of subfolders that varies based on the size of the environment.  Guessing one final time, it looks like this folder houses Hive temp files used for intermediate calculations.  This makes us question or our earlier guess about the "databricks-results" folder.  We'll be sure to come back with more information once we find out more about these additional folders.

Interestingly, there's a layer of folders outside of DBFS, on the driver node's local file system, that can be accessed using %sh commands.
sh ls

<CODE START>

%sh

ls /

<CODE END>

One of the folders we see here is "dbfs".  As expected, this takes us to the DBFS folders we were looking at earlier.  More advanced Databricks users could delve deeper into some of these other folders.

Now that we've poked around the folders using ls, let's take a look at the README files we found earlier.  To do this, we first need to copy them from the "databricks-datasets" folder to the "FileStore" folder.
Copy Readme Files

<CODE START>

%fs

cp /databricks-datasets/README.md /FileStore/readme/README.md

<CODE END>

<CODE START>

%fs

cp /databricks-datasets/SPARK_README.md /FileStore/readme/SPARK_README.md

<CODE END>
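
As a side note, the %fs magic is shorthand for the dbutils.fs module, so the same copies can be made from a Python cell if preferred.  A minimal equivalent sketch:

<CODE START>

# Equivalent copies using dbutils.fs from Python instead of the %fs magic.
dbutils.fs.cp("/databricks-datasets/README.md", "/FileStore/readme/README.md")
dbutils.fs.cp("/databricks-datasets/SPARK_README.md", "/FileStore/readme/SPARK_README.md")

<CODE END>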

Once the files have been copied, they can be downloaded from any web browser using the following format:
https://<INSTANCE>/files/<FILE_PATH>?o=<WORKSPACE_ID>
The <FILE_PATH> was determined by our earlier %fs cp command, and the <INSTANCE> and <WORKSPACE_ID> can be found in the URL of the Databricks workspace.  For instance, our Databricks URL is
https://eastus2.azuredatabricks.net/?o=3558890778145077#
This parses to
<INSTANCE> = eastus2.azuredatabricks.net
<FILE_PATH> = readme/README.md
<WORKSPACE_ID> = 3558890778145077
So, the final URL is
https://eastus2.azuredatabricks.net/files/readme/README.md?o=3558890778145077
This URL provides a download for the README.md file.  Changing <FILE_PATH> to "readme/SPARK_README.md" will yield the SPARK_README.md file.  This methodology works for any file stored under the "FileStore" folder.
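
As a quick illustration, the URL can be assembled from these three pieces in Python.  The values below are the ones from our workspace; substitute your own:

<CODE START>

# Assemble the FileStore download URL from its three components.
instance = "eastus2.azuredatabricks.net"
file_path = "readme/README.md"
workspace_id = "3558890778145077"

print("https://{}/files/{}?o={}".format(instance, file_path, workspace_id))

<CODE END>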

The files can be opened in almost any text editor or markdown parser.  They contain basic information about Databricks and Spark.
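
If we'd rather not download them at all, dbutils.fs.head can preview the beginning of a file directly in the Notebook.  A small sketch:

<CODE START>

# Preview the first 1,000 bytes of the copied README directly in the notebook.
print(dbutils.fs.head("/FileStore/readme/README.md", 1000))

<CODE END>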

Finally, let's take a look at the production-grade way to work with data in Databricks.  We can accomplish this via Mounting.  Mounting allows us to reference external file stores, such as Azure Blob Storage, Azure Data Lake Store Gen1 and Azure Data Lake Store Gen2, as if they were part of DBFS.  The dbutils.fs.mount() function accomplishes this, with the syntax varying slightly between Scala and Python.  In Python, we can use the following command to mount an Azure Blob Storage account:
dbutils.fs.mount(
  source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net": "<access-key>"})
Mount Azure Blob Storage

<CODE START>


#dbutils.fs.mount(
#  source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net",
#  mount_point = "/mnt/<mount-name>",
#  extra_configs = {"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net":"<access-key>"})

dbutils.fs.mount(
  source = "wasbs://blob@breakingbi.blob.core.windows.net",
  mount_point = "/mnt/breakingbi",
  extra_configs = {"fs.azure.account.key.breakingbi.blob.core.windows.net":"aKNiYJntZUA9asI/uYH1KnB2OzydeD8akiEnhp4s4LKedqXRjNCcQ0aLRYIhMbt/mMEr3sJAmYc1lrCbBeHfIw=="})

<CODE END>

<CODE START>


%fs

ls /mnt/breakingbi

<CODE END>

By looking at the mounted location with ls, we can see the adult.csv file that we placed there.  You can read more about mounting at the following links: Azure Blob Storage, Azure Data Lake Store Gen1 and Azure Data Lake Store Gen2.  Mounting is especially helpful as it allows us to create SQL tables that sit on top of live data in a mounted external data store, but that's a topic for another day.  As a sidenote, access keys and passwords should never be stored as plain text in Notebooks.  Instead, Azure Databricks offers a Secrets API backed by Azure Key Vault.  We've covered this briefly in a previous post and will likely do so again in more depth.
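
To tie those last two points together, here is a hedged sketch of the same mount using the Secrets API so the access key never appears in the Notebook, followed by a read of the adult.csv file through the mount point.  The secret scope name "breakingbi-kv" and key name "storage-account-key" are assumptions for illustration; they must match a scope you have actually created.

<CODE START>

# Unmount first if the container is already mounted.
# dbutils.fs.unmount("/mnt/breakingbi")

# Mount the container using a secret instead of a plain-text key.
# The scope and key names are assumptions for illustration.
dbutils.fs.mount(
  source = "wasbs://blob@breakingbi.blob.core.windows.net",
  mount_point = "/mnt/breakingbi",
  extra_configs = {"fs.azure.account.key.breakingbi.blob.core.windows.net":
                   dbutils.secrets.get(scope = "breakingbi-kv", key = "storage-account-key")})

# Read the file we placed in the container through the mount point.
df = spark.read.csv("/mnt/breakingbi/adult.csv")

display(df)

<CODE END>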

Hopefully, this post helped unravel a little of what's going on inside the Databricks File System.  Leveraging DBFS is critical to creating an efficient, secure, production-quality Databricks environment.  The official Azure Databricks DBFS documentation can be found here.  Stay tuned for the next post where we'll dig into the SQL Database underlying Azure Databricks.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com
