Monday, July 15, 2019

Azure Databricks: Notebooks

Today, we're going to talk about Notebooks in Azure Databricks.  If you haven't read the previous posts in this series, Introduction and Cluster Creation, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Notebooks.

As we mentioned in the previous post, there are three major concepts for us to understand about Azure Databricks: Clusters, Code and Data.  For this post, we're going to talk about the interactive way to develop code, Notebooks.  Azure Databricks Notebooks are similar to other notebook technologies such as Jupyter and Zeppelin, in that they are a place for us to execute snippets of code and see the results within a single interface.
Sample Notebook
Let's start by creating a new notebook.
New Notebook

Create Notebook
Creating a notebook is a pretty simple process, as there's virtually no configuration needed.  The only things we need to do are provide a Name and select a Language.
Notebook Languages
Azure Databricks Notebooks support four programming languages: Python, Scala, SQL and R.  However, selecting a language in this drop-down doesn't limit us to only using that language.  Instead, it simply sets the default language of the notebook.  Every code block in the notebook is run independently and we can manually specify the language for each code block.

Before we get to the actual coding, we need to attach our new notebook to an existing cluster.  As we said, Notebooks are nothing more than an interface for interactive code.  The processing is all done on the underlying cluster.
Detached Notebook
Attach Notebook
Attached Notebook
Now that we're attached to a cluster, we can run some sample commands.  Let's try writing a simple command in each language.
Python
<CODE START>

%python

import pandas as pd

pysamp = pd.DataFrame([1,2,3,4], columns=['Samp'])
display(pysamp)

<CODE END>
R
<CODE START>

%r

rsamp <- data.frame(Samp = c(1,2,3,4))
display(rsamp)

<CODE END>
Scala
<CODE START>

%scala

case class samp(Samp: Int)
val samp1 = samp(1)
val samp2 = samp(2)
val samp3 = samp(3)
val samp4 = samp(4)
val scalasamp = Seq(samp1, samp2, samp3, samp4).toDF()
display(scalasamp)

<CODE END>
SQL

<CODE START>

%sql

DROP TABLE IF EXISTS sqlsamp;

CREATE TABLE sqlsamp AS
SELECT 1 AS Samp
UNION ALL
SELECT 2 AS Samp
UNION ALL
SELECT 3 AS Samp
UNION ALL
SELECT 4 AS Samp;

SELECT * FROM sqlsamp;

<CODE END>

We can see that it's quite easy to swap from language to language within a single notebook.  Moving data from one language to another is not quite as simple, but that will have to wait for another post.  One interesting thing to point out is that the "%python" command is unnecessary because we chose Python as our default language when we created the Notebook.
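To illustrate, here's the same pandas snippet as before with the "%python" line removed.  Since Python is the default language we chose when we created the notebook, it runs exactly the same way.
<CODE START>

import pandas as pd

# No "%python" magic command needed; Python is the default language of this notebook
pysamp = pd.DataFrame([1,2,3,4], columns=['Samp'])
display(pysamp)

<CODE END>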
Databricks Notebooks
Databricks Notebooks also give us the option to plot any data that we output using the display() function.  The display() function is unnecessary if we are using SQL, as the results of a query are displayed automatically.  We can visualize our data in a number of different ways simply by selecting a different plot option.
Bar Chart
Area Chart
Pie Chart
Quantile Plot
Histogram
Box Plot
Q-Q Plot
Each of these charts operates by its own rules depending on the type of data we throw at it.  There's also a dashboard that allows us to combine multiple charts into a single view.  However, that's another topic for a later post.
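As a quick sketch of how this works, we can query the sqlsamp table we created earlier and pass the resulting Spark DataFrame to display().  The derived Value column below is just for illustration.  Once the results appear, the plot menu beneath the grid lets us switch between the chart types listed above.
<CODE START>

%python

# Query the sqlsamp table created in the SQL example above
plotdf = spark.sql("SELECT Samp, Samp * 10 AS Value FROM sqlsamp")

# display() shows a grid by default; the plot menu under the results
# lets us switch to a bar chart, pie chart, histogram, etc.
display(plotdf)

<CODE END>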

We've mentioned that there are four languages that can all be used within a single notebook.  There are also three additional languages that can be used for purposes other than data manipulation.
Formatted Markdown
Markdown Code
<CODE START>

%md

# HEADER
## Subheader

Text

<CODE END>

We can create formatted text using Markdown.  Markdown is a fantastic language for creating a rich text and image experience.  Notebooks are great for sharing your work because we can see all the code and results in a single pane.  Markdown extends this even further by allowing us to have rich comments alongside our code, even including pictures and hyperlinks.  You can read more about Markdown here.
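As a small illustration of the hyperlinks and pictures mentioned above, the link target and image URL below are just placeholders.
<CODE START>

%md

[Sample Link](https://example.com)

![Sample Image](https://example.com/sample.png)

<CODE END>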
LS
<CODE START>

%fs

ls /

<CODE END>

We also have the ability to run file system commands against the Databricks File System (DBFS).  Any time we upload files to the Databricks workspace, they are stored in DBFS.  We will talk more about DBFS in the next post.  You can read more about DBFS here.  You can read more about file system commands here.
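If we'd rather stay in Python, the dbutils utilities available inside Databricks notebooks offer an equivalent.  Here's a sketch of the same listing using dbutils.fs.ls.
<CODE START>

%python

# dbutils.fs.ls returns a list of FileInfo objects for the given DBFS path
for file_info in dbutils.fs.ls("/"):
  print(file_info.path)

<CODE END>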
Shell
<CODE START>

%sh

ls /

<CODE END>

We also have the ability to run Shell commands on the driver node of the Databricks cluster.  This can be very useful for performing administrative tasks.  You can read more about Shell commands here.
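As a quick sketch of what an administrative task might look like, the commands below check the free disk space and the Python version on the driver node; any standard shell command can be used here.
<CODE START>

%sh

# Check free disk space and the Python version on the driver node
df -h /
python --version

<CODE END>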
Run
<CODE START>

%run ./Notebooks_Run

<CODE END>

Finally, we have the ability to run other notebooks.  We simply need to give it the path to the other notebook.  This will make more sense when we discuss DBFS in the next post.  The "%run" command is a little different because it only allows us to provide one piece of information, the location of the notebook to run.  If we were to put additional commands in this block, it would error.
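If we need to pass parameters to the other notebook or capture a value it returns, Databricks also provides dbutils.notebook.run as an alternative to "%run".  Unlike "%run", it executes the notebook as a separate job rather than inline, so its variables aren't imported into our session.  The argument name below is hypothetical.
<CODE START>

%python

# dbutils.notebook.run takes a notebook path, a timeout in seconds and an
# optional dictionary of arguments; it returns the called notebook's exit value
result = dbutils.notebook.run("./Notebooks_Run", 60, {"sample_argument": "1"})
print(result)

<CODE END>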

All of the "%" commands we've looked at throughout this post are known as "magic commands".  You can read more about magic commands here.  They give us the ability to use the best language for the task at hand.  We hope this post opened your eyes to the awesome power of Notebooks.  They are undoubtedly the future of analytics.  The official Azure Databricks Notebooks documentation can be found here.  Stay tuned for the next post where we'll dive into DBFS.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com