Monday, August 6, 2018

Data Science in Power BI: Clustering

Today, we're going to talk about Clustering within Power BI.  If you haven't read the previous posts in this series, Introduction and Getting Started with R Scripts, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Clustering in Power BI.

Clustering is a statistical process by which we group records or observations into categories based on how similar they are to one another.
Clustering
This is very easy to visualize in two dimensions.  We can see that clustering can be used to determine that there are three groups of observations, known as clusters, in our data.  As we add more dimensions, it becomes difficult, or even impossible, to visualize the groups directly.  This is where clustering algorithms can help.  You can read more about Clustering here.
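Outside of Power BI, here's a minimal, illustrative sketch of the same idea in Python using scikit-learn (assuming Python and scikit-learn are installed); the "Clustering" custom visual we use below performs a similar K-Means assignment for us behind the scenes using R.

<CODE START>

# Minimal K-Means illustration (Python + scikit-learn), independent of Power BI.
import numpy as np
from sklearn.cluster import KMeans

# generate three well-separated groups of two-dimensional points
rng = np.random.RandomState(42)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(50, 2))
])

# ask K-Means to find three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)

# each observation is assigned to the cluster whose center it is closest to
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)

<CODE END>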

Fortunately, the Power BI Marketplace has a pre-built "Clustering" visual that uses R scripts on the back-end.  Let's check it out!

We start by opening the "Customer Profitability Sample PBIX" we downloaded in the previous post.  In case you haven't read that post, you can find the file here.
Import From Marketplace
Then, we navigate to the "Visualizations" pane and select "... -> Import From Marketplace".
Power BI Marketplace
This takes us to the "Power BI Marketplace".  Here, we can look at all of the custom visuals created by Microsoft and members of the community.
Power BI Marketplace (Cluster)
If we search for "cluster", we can add the "Clustering" custom visual by clicking the "Add" button.  This will add this custom visual as an option in our Power BI Report.
Visualizations
We can see that the "Clustering" custom visual has been added to the list.  Let's make a new page in the report and add an empty "Clustering" chart.
Enable Script Visuals
We were prompted to enable script visuals.  This is necessary whenever we use R scripts or custom visuals that rely on them.  If you get some type of error here, you may need to read the previous post to ensure your R environment in Power BI is correctly configured.
Can't Display This Visual
During this process, we also stumbled across this error.  The "Clustering" chart tries to load a number of R packages.  However, these R packages need to be installed on the machine in order for this to work.  Fortunately, it's pretty simple to install packages using RStudio.
Install Packages
We kept installing packages until we completed the list.  There is one package, Redmonder, that we could not install through RStudio.  Instead, we had to download the package from here and manually copy it to the R package directory.  In our case, this is
C:\Users\<Username>\Documents\R\win-library\3.3\
Once we completed that, we faced our next challenge.  This custom visual does not allow us to use measures.  This means that basic questions like "How many clusters of customers exist based on Total Revenue and Total Labor Costs?" become more complex to solve.  Fortunately, it's not too difficult to turn the measures into calculated columns.  We do this by creating a new table in Power BI.
New Table
We start by selecting the "New Table" button in the "Modeling" tab at the top of the screen.
Customer Summary
Then, we use the following code to create a table "CustomerSummary".  This table contains one record for each Customer Name, along with the Total Revenue and Total Labor Costs associated with that Customer.

<CODE START>

CustomerSummary = 
SUMMARIZE(
    'Fact'                                                      -- table to summarize
    ,'Customer'[Name]                                           -- group by Customer Name
    ,"Total Revenue", SUM( 'Fact'[Revenue] )                    -- total revenue per customer
    ,"Total Labor Costs", SUM( 'Fact'[Labor Costs Variable] )   -- total labor costs per customer
)

<CODE END>

You can read more about the SUMMARIZE() function here if you are interested.
Total Revenue and Total Labor Costs by Customer Name
Finally, we can create our Clustering chart by adding [Total Revenue] and [Total Labor Costs] to the "Values" shelf and [Name] to the "Data Point Labels" shelf.  The resulting chart is less than inspiring.  "Cluster 2" at the top-right of the chart contains two outlying points, leaving "Cluster 1" to contain the rest of the dataset.  The issue here is that K-Means Clustering is not robust against outliers.  You can read more about K-Means Clustering here.

In order to make a more interesting set of clusters, we need a way to reduce the influence of these outliers.  One approach is to rescale the data so that extremely large observations aren't quite so large.  Two common functions to handle this are logarithm and square root.  Let's alter our code to take the square root of [Total Revenue] and [Total Labor Costs].

<CODE START>

CustomerSummary = 
SUMMARIZE(
    'Fact'
    ,'Customer'[Name]
    ,"Total Revenue (Square Root)", SQRT( SUM( 'Fact'[Revenue] ) )
    ,"Total Labor Costs (Square Root)", SQRT( SUM( 'Fact'[Labor Costs Variable] ) )
)

<CODE END>
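To see why the square root helps, here's a quick illustrative sketch in Python (not part of the Power BI solution, and using made-up revenue figures): a single extreme value dominates the raw scale, but sits much closer to the rest of the data after the transform.

<CODE START>

import numpy as np

# hypothetical revenue values with one extreme outlier
revenue = np.array([10000.0, 12000.0, 15000.0, 18000.0, 2000000.0])

# on the raw scale, the outlier is roughly 133 times the median value
print(revenue.max() / np.median(revenue))

# after a square-root transform, it is only about 12 times the median
sqrt_revenue = np.sqrt(revenue)
print(sqrt_revenue.max() / np.median(sqrt_revenue))

<CODE END>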

Since we altered the field names, we also need to recreate our chart to contain the [Total Labor Costs (Square Root)] and [Total Revenue (Square Root)] fields in the "Values" shelf.
Total Revenue (Square Root) and Total Labor Costs (Square Root) by Customer Name
We can see that we now have a more interesting set of clusters.  The outliers are still present, but not nearly as dominant.  This custom visual has quite a few more interesting options.  We encourage you to play around with it to create some visuals that would not be possible in standard Power BI.

Hopefully, this post opened your eyes to the Data Science possibilities in Power BI.  The power of custom R scripts has created a new world of potential, just waiting to be unleashed.  Stay tuned for the next post where we'll be talking about Time Series Decomposition.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, July 16, 2018

Data Science in Power BI: Getting Started with R Scripts

Today, we're going to talk about getting started with R scripts within Power BI.  For the time being, this series will focus on utilizing R in different ways to enhance the analytical capabilities of Power BI.  However, we may delve into the world of Azure Machine Learning Services or HDInsight later.  If you haven't read the first post in this series, it may provide some useful context.  Let's move on to the core of this post, R scripts.

Power BI Desktop does not install R on our local machine.  We have to do that part ourselves.  We can find a link to install R here.
Options
Once we've done that, we open up Power BI.  Then, we select "File -> Options and Settings -> Options".
R Scripting
In the "Options" window, select the "R Scripting" tab.  In this tab, we need to select the location of our R installation, as well as the IDE we will use to write our R code.  The R installation location is crucial.  We decided to use the same installation as we use for RStudio.  This ensures that we don't run into any issues of code running differently between RStudio and Power BI.

For those that aren't aware of RStudio, it's a great (and free!) tool for developing R code, regardless of whether that code goes into Power BI or any other application you may be using.  If you have never used it, you can download it here.
RStudio
In RStudio, we can confirm the location of our R Installation by selecting "Tools -> Global Options...".
RStudio Options
After we confirm that Power BI and RStudio are using the same installation of R, we also confirm that RStudio is listed as the default IDE for developing R code within Power BI.

Now that we've ensured Power BI and RStudio are pointing to the same location, let's test it out.  You can find a link to the Customer Profitability Sample PBIX here.
Revenue by YearPeriod

In this PBIX, we can create a new page and build a line graph of Revenue by YearPeriod.  Now, let's try to recreate this chart using an R Visual.
Enable Script Visuals
First, we select "R Script Visual" from the "Visualizations" pane.  If this is your first time creating an "R Script Visual", you may need to enable script visuals by selecting the "Enabling" option.
R Script Editor
This should cause the "R Script Editor" window to appear at the bottom of the screen.  In order to create an R script, we first need to provide some fields from our dataset.  Just like before, let's provide the [YearPeriod] field and [Sum of Revenue] measure.
R Script with Data
Once we've added our fields to the "Values" pane, the R Script Editor automatically supplies us with a starter script for pulling in the data.  It's important to note that we can't alter the way the data is pulled in.  Any alterations we want to make will have to be done on the data itself in Power BI or after the data is loaded into R.  Let's try to remake our line graph using the following code:

<CODE START>

dataset$YearPeriod <- factor( dataset$YearPeriod )              # treat YearPeriod as a categorical label
plot( dataset$'Sum of Revenue', type = 'l' )                    # draw Sum of Revenue as a line, one point per row
axis( 1, at = 1:dim(dataset)[1], labels = dataset$YearPeriod )  # label the x-axis with the YearPeriod values

<CODE END>


Complete R Script
Complete R Visual
We can see the charts match closely, aside from a few scaling and labelling differences.  However, this shows us that our R scripts are successfully running and can be used to create all kinds of new fun visuals.  Hopefully, this post broadened your horizons a little bit by showing the capability of utilizing R code within Power BI.  It is a very powerful capability that we will continue to showcase in interesting ways.  Stay tuned for the next post where we'll talk about Clustering in Power BI.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, June 25, 2018

Data Science in Power BI: Introduction

Today, we're going to start a new series about performing Advanced Analytics and Data Science using Power BI.  For those of you who are avid readers of this blog, you'll know that we try to showcase the ways where we can derive business value from easy-to-use tools.  This series will be no different.  We'll be focusing on a number of relatively simple ways to deliver advanced analytic solutions using Power BI.

Power BI has come a long way since our last Power BI series.  We can happily say that Power BI is now one of the best analysis and visualization tools on the market.  If you're interested, the previous series started in February 2016 and the introductory post can be found here.

For those that aren't familiar with Power BI, there are two main components that we will cover throughout this series.  The first piece that we will cover is the Power BI Desktop application.  It's a free desktop application that you can use to explore data from many different types of sources, as well as manipulate and visualize the data in almost any way imaginable.  If you wish to follow along with the series, you can download Power BI Desktop here.

The second piece that we will cover is the Power BI Service.  It's an inexpensive online service that allows people to share their Power BI Datasets and Reports within their organization.  It also has some functionality for sharing with users outside of your organization.  Depending on your needs, you may even be able to develop your Power BI Reports directly in the online service.  The service has limited capabilities compared to the Desktop application, but it does have a few interesting capabilities that we may touch on.  You can access it here.

Before we jump into the series about Data Science, let's have a short conversation about what types of solutions we'll be looking for.  In other words...what is Data Science?
What is Data Science?
The above picture is Gartner's Analytics Maturity Model.  It shows the different phases of analytics that organizations commonly navigate.  We can see that organizations start by asking "What happened?".  Then they start asking "Why did it happen?".  Next, they start asking the more complex questions of "What will happen?" and "How can we make it happen?"  In our minds, the last two questions are obviously Data Science.  However, we also think that there are major Advanced Analytics opportunities to be found in the "Why did it happen?" realm.  Notice that we use the terms Data Science and Advanced Analytics interchangeably, as they do not seem to have a formal distinction at this time.  Here are the types of problems we hope to touch on in this series.
Show me something interesting or unusual in my data.
Are there any relationships in my data that I can't see?
Use my data to make the decision for me.
What information is important or impactful?
I have too much data.  Can you reduce it down to a manageable amount?
There are a lot of very cool use cases that we plan to cover in this series.  Hopefully this series will give you a few ideas that you can use to provide easy value to your organization.  Perhaps you could even leverage one into a full-fledged solution and snatch that new position you've been eyeing.  Stick with us.  We're sure you'll find something interesting along the way.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, June 4, 2018

Azure Machine Learning Workbench: Classifying Iris Data using Python

Today, we're going to finish our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started, Utilizing Different Environments, Built-In Data Sources, Built-In Data Preparation (Part 1, Part 2, Part 3) and Python Notebooks.  In this post, we're going to finish looking through the "iris" notebook.  If you haven't read the previous post, it's recommended that you do so as it provides necessary context around Jupyter Notebooks.

Let's jump straight into the first code segment.
Segment 1
For those that can't read this, here's the code:

<CODE START>

#load Iris dataset from a DataPrep package
iris = package.run('iris.dprep', dataflow_idx=0, spark=False)

# load features and labels
X, Y = iris[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']].values, iris['Species'].values

# tag this cell to measure duration
logger.log("Cell","Load Data")
logger.log("Rows",iris.shape[0])

print ('Iris dataset shape: {}'.format(iris.shape))

<CODE END>

This code segment starts by using the azureml.dataprep.run() function, aliased as package.run(), to retrieve the iris dataset that was prepared using the Built-In Data Preparation tools within AML Workbench.  To learn more about this, please see the previous posts in this series (Part 1, Part 2, Part 3).

A very interesting feature of this function is that the azureml library is available to ANY Python IDE.  This means that we can build the dataset once using the ease and flexibility of AML Workbench, but still share the data with our colleagues who may want to use other IDEs.

Once the data is imported, we run the following line:

X, Y = iris[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']].values, iris['Species'].values

This line can be effectively thought of as two commands, wrapped up in a single line.

X = iris[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']].values
Y = iris['Species'].values

Basically, these commands create two NumPy ndarray objects, X and Y.  A NumPy ndarray object is a multidimensional object containing a series of elements of the SAME TYPE, e.g. numbers, strings, etc.  An ndarray can be as large or as complex as we like, so long as we have enough RAM to hold it.  To learn more about ndarrays, read this.

In this case, X is a two-dimensional ndarray (think "rows and columns") populated with the features "Sepal Length", "Sepal Width", "Petal Length" and "Petal Width".  Y is a one-dimensional ndarray (think "single column") populated with the label "Species".
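As a tiny, hypothetical illustration of these shapes (the values below are made up and not pulled from the iris dataset load above), the shape attribute reports the dimensions of an ndarray:

<CODE START>

import numpy as np

# a two-dimensional ndarray: 3 "rows" and 2 "columns", all of the same type
X_demo = np.array([[5.1, 3.5], [4.9, 3.0], [4.7, 3.2]])
print(X_demo.shape)   # (3, 2)
print(X_demo.dtype)   # float64

# a one-dimensional ndarray: a single "column" of string labels
Y_demo = np.array(['Iris-setosa', 'Iris-setosa', 'Iris-versicolor'])
print(Y_demo.shape)   # (3,)

<CODE END>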

The next two lines of code are as follows:

logger.log("Cell","Load Data")
logger.log("Rows",iris.shape[0])

These lines simply output values to the Run History.  We'll take a look at this at the end of this post.

The final line of this code segment is below:

print ('Iris dataset shape: {}'.format(iris.shape))

This line simply outputs the dimensions of the iris ndarray.  We saw this in the picture earlier.  Let's take a look at the next code segment.

Segment 2
Here's the code for Segment 2:

<CODE START>

logger.log("Cell", "Training")

# change regularization rate and you will likely get a different accuracy.
reg = 0.01

print("Regularization rate is {}".format(reg))
logger.log('Regularization Rate', reg)

# train a logistic regression model
clf = LogisticRegression(C=1/reg).fit(X, Y)
print (clf)

# Log curves for label value 'Iris-versicolor'
y_scores = clf.predict_proba(X)
precision, recall, thresholds = precision_recall_curve(Y, y_scores[:,1],pos_label='Iris-versicolor')
logger.log("Precision",precision)
logger.log("Recall",recall)
logger.log("Thresholds",thresholds)

accuracy = clf.score(X, Y)
logger.log('Accuracy', accuracy)
print ("Accuracy is {}".format(accuracy))

<CODE END>

The first line is below:

logger.log("Cell", "Training")

Like the previous lines, this one outputs more information to the Run History.  Here are the next few lines:

reg = 0.01

print("Regularization rate is {}".format(reg))
logger.log('Regularization Rate', reg)

These lines define the Regularization Rate.  Regularization is a method designed to reduce overfitting.  It works by penalizing overly complex models (for example, large coefficients in a logistic regression), which reduces the chance that the model becomes extremely accurate at predicting the values in the training set, but not as accurate at predicting new values.  Note that scikit-learn's LogisticRegression takes C, the inverse of the regularization strength, which is why the code passes C = 1/reg; a smaller C means stronger regularization.  To learn more, check out the pages linked above.

These lines also output the Regularization Rate to the notebook and the Run History.  Let's check out the next couple of lines.

clf = LogisticRegression(C=1/reg).fit(X, Y)
print (clf)

The first line fits a Logistic Regression model to predict the values in Y using the values in X as predictors.  The next line outputs the model to the notebook so that we can see the full parameter list for the model.  Let's move on.

y_scores = clf.predict_proba(X)
precision, recall, thresholds = precision_recall_curve(Y, y_scores[:,1],pos_label='Iris-versicolor')
logger.log("Precision",precision)
logger.log("Recall",recall)
logger.log("Thresholds",thresholds)

The first line populates y_scores, a NumPy ndarray, with the predicted probability that each observation belongs to each of the three species.  We can see the probabilities for the first record in the screenshot below.
Predicted Probabilities
These probabilities correspond to the species in alphabetical order.  This means that .922 is for Setosa, .08 is for Versicolour and 0 is for Virginica.  This means that we are highly confident this iris belongs to the species Setosa.

The next line uses the precision_recall_curve() function to calculate the Precision, Recall and Threshold values for the SECOND "column" of values in the y_scores array.  Notice that Python indices start at 0, meaning that [:, 1] refers to the second "column".  Also note that arrays can have more than two dimensions.  However, since this array is two-dimensional, it's easy to think of the first dimension as rows and the second dimension as columns.

Since we are looking at the second "column" of values, we are calculating the Precision, Recall and Threshold values for predicting Versicolour.  This was done because Precision and Recall require a concept of "Positive" and "Negative" values, i.e. Binary Classification.
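If there is ever any doubt about which column of probabilities corresponds to which species, the fitted model exposes the ordering directly.  This quick check is not part of the original notebook, but it works with the clf object defined above:

<CODE START>

# the columns returned by predict_proba() follow the order of clf.classes_,
# which scikit-learn sorts alphabetically for string labels
print(clf.classes_)

<CODE END>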

Finally, these values are logged to the Run History under the slightly misleading labels of "Precision", "Recall" and "Thresholds".  Let's move on to the last few lines of this code segment.

accuracy = clf.score(X, Y)
logger.log('Accuracy', accuracy)
print ("Accuracy is {}".format(accuracy))

The first line calculates the Accuracy of the fit model.  Unlike Precision and Recall, Accuracy can be calculated against a Multiclass Classification because it is simply the number of correct predictions, divided by the total number of predictions.

The next two lines output the Accuracy to the Run History and notebook.  98% looks like an extremely high accuracy.  However, it should be noted that this accuracy was calculated using the Training set.  As we've spoken about previously on this blog, having distinct Training and Testing sets is a crucial component of machine learning and predictive modeling.  Alas, this notebook is just a showcase of functionality; a sketch of how such a split could be added is shown below.
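The sketch below reuses the X, Y and reg variables defined earlier in the notebook; it is not part of the original template and leans on scikit-learn's train_test_split function.

<CODE START>

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# hold out 30% of the data for testing; stratify to keep the species balanced in both sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y)

# fit on the training set only
clf = LogisticRegression(C=1/reg).fit(X_train, Y_train)

# accuracy on data the model has never seen is a fairer estimate of real-world performance
print("Test accuracy is {}".format(clf.score(X_test, Y_test)))

<CODE END>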
Segment 3
Here's the code for Segment 3.

<CODE START>

logger.log("Cell", "Scoring")

# predict a new sample
X_new = [[3.0, 3.6, 1.3, 0.25]]
print ('New sample: {}'.format(X_new))
pred = clf.predict(X_new)
logger.log('Prediction', pred.tolist())

print('Predicted class is {}'.format(pred))

<CODE END>

The first line is below:

logger.log("Cell", "Scoring")

Like the previous lines, this one outputs more information to the Run History.  Here are the next few lines:

X_new = [[3.0, 3.6, 1.3, 0.25]]
print ('New sample: {}'.format(X_new))
pred = clf.predict(X_new)
logger.log('Prediction', pred.tolist())

print('Predicted class is {}'.format(pred))

The first line creates a new list, "X_new", that contains a new observation.  This information is also output to the notebook.  This new observation is then scored using the trained model to generate a predicted "Species".  This prediction is also output to the Run History.  Finally, this prediction is output to the notebook as well.  Let's take a look at the next code segment.
Segment 4
This segment simply disables our ability to output to the Run History.  Let's check out the last code segment.
Segment 5 Code
Segment 5 Chart

Here's the code for Segment 5.

<CODE START>

# Plot Iris data in 3D
centers = [[1, 1], [-1, -1], [1, -1]]

fig = plt.figure(1, figsize=(8, 6))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

plt.cla()
# decompose 4 feature columns into 3 components for 3D plotting
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

le = preprocessing.LabelEncoder()
le.fit(Y)
Y = le.transform(Y)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[Y == label, 0].mean(),
              X[Y == label, 1].mean() + 1.5,
              X[Y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(Y, [1, 2, 0]).astype(np.float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=Y, cmap=plt.cm.spectral)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])

plt.show()

<CODE END>

Here are the first few lines of this segment:

centers = [[1, 1], [-1, -1], [1, -1]]

fig = plt.figure(1, figsize=(8, 6))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

The first line creates a list called "centers", which it seems is never used again.  The next line creates a matplotlib.pyplot.figure object called "fig".  This figure is referenced by the number 1 and is 8 inches wide by 6 inches tall.

The matplotlib.pyplot.clf() function, aliased as plt.clf, clears the current active figure.  This ensures that the next lines of code reference the new figure, instead of one that may be active from a previous run.

The mpl_toolkits.mplot3d.axes3d.Axes3D() function, aliased as Axes3D, stores the axes for the 3d scatterplot.  The parameters are used to define the exact shape and angles of the plot.  Let's look at the next few lines.

plt.cla()

pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

The matplotlib.pyplot.cla() function, aliased as plt.cla, clears the current active axis.  This ensures that the next lines of code reference the new axis, instead of one that may be active from a previous run.

The sklearn.decomposition.PCA() function, aliased as decomposition.PCA, creates a Principal Components Analysis object named "pca", but does not fit it to a data set.

The sklearn.decomposition.PCA.fit() function, aliased as pca.fit, fits the newly created PCA object to our original feature data.  It's important to note that our original feature data had four columns, yet we are only requesting three principal components from the PCA algorithm.  This shows how PCA can be used to reduce the number of dimensions in a dataset, while minimizing the amount of information that is lost by the dimension reduction.  In this case, this allows us to plot a four-dimensional dataset on a three-dimensional chart.  This is a common scenario for higher-dimensional plotting.  However, it can be occasionally misleading.
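As a quick aside, the fitted PCA object reports how much of the original variance each component retains, which is one way to judge how much information the reduction discards.  This check is not part of the original notebook:

<CODE START>

# fraction of the original variance captured by each of the three principal components
print(pca.explained_variance_ratio_)

# total variance retained after reducing four features to three components
print(pca.explained_variance_ratio_.sum())

<CODE END>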

The sklearn.decomposition.PCA.transform() function, aliased as pca.transform, transforms our original dataset using the model we just fit.  It also overwrites the "X" ndarray with the new, transformed dataset.  In general, it's not good practice to overwrite your dataset unless absolutely necessary.  Let's move on to the next few lines.

le = preprocessing.LabelEncoder()
le.fit(Y)
Y = le.transform(Y)

The sklearn.preprocessing.LabelEncoder() function, aliased as preprocessing.LabelEncoder, creates a Label Encoder object named "le", but does not fit this to a dataset.

The sklearn.preprocessing.LabelEncoder.fit() function, aliased as le.fit, fits the Label Encoder object to the original set of "Species" names.  Effectively, this will create a mapping between the original string values and a dense set of integers.  In this case, it will map Iris-Setosa, Iris-Versicolour and Iris-Virginica to 0, 1 and 2, respectively.  Notice that it chose this order alphabetically.

The sklearn.preprocessing.LabelEncoder.transform() function, aliased as le.transform, transforms the original set of "Species" names using the mapping fit earlier.  The original dataset, "Y", is overwritten with the new values.  Let's check out the next few lines.

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[Y == label, 0].mean(),
              X[Y == label, 1].mean() + 1.5,
              X[Y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

This interesting bit of code is actually a simple for loop.  It loops through a set of three values, assigning two variables each time.  It assigns a string value to "name" and an integer value to "label".  Then, it plots a formatted text box, using the horizontalalignment and bbox parameters, containing the name of the Species.  It places this box at the following location:

X = Mean Value of all Principal Component 1 Values where Species = "label"
Y = Mean Value of all Principal Component 2 Values where Species = "label", plus 1.5
Z = Mean Value of all Principal Component 3 Values where Species = "label"

This is how we labelled the 3d scatterplot.  Let's move on to the final few lines of code.

y = np.choose(Y, [1, 2, 0]).astype(np.float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=Y, cmap=plt.cm.spectral)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])

plt.show()

The first line claims to be doing some type of reordering with the numpy.choose() function, aliased as np.choose.  However, the lowercase "y" variable is never used again (the scatter call below colors by the uppercase "Y"), so this line appears to have no effect.

The mpl_toolkits.mplot3d.axes3d.Axes3D.scatter() function, aliased as ax.scatter, adds the "X" values to our 3d scatterplot.  It also uses the c parameter to color the points according to their Species.  The cmap parameter is used to define the "colormap".  The nipy_spectral colormap, referenced here by its older alias plt.cm.spectral, can be found here.  You can find a complete colormap reference here.

The next few lines utilize the mpl_toolkits.mplot3d.axes3d.Axes3D.w_#axis.set_ticklabels() functions, aliased as ax.w_#axis.set_ticklabels, to set the tick labels on the X, Y and Z axes.  By passing in an empty list, the plot does not show any tick labels.

The matplotlib.pyplot.show() function, aliased as plt.show, displays the figure in the notebook.

Whew!  That was so much Python.  Now that we've had a chance to dig into the Python, let's finish this up by checking out the Run History to see what the logging looks like.

Run History
We select the "Run History" button in the top-left corner of the screen.  Then, we select the "iris.ipynb" file.
Run History Table
We scroll down the page and see a table containing all of the "runs" of the notebook.  Each "run" is actually a single cell within the notebook, which makes it difficult to see the Run History as a whole.  To be fair, this isn't a big deal, since the notebook itself already shows all of the code and output in one place.  The Run History table is far more useful when we are dealing with deployed code, as everything should show up in a single run.
Run Properties
We click on the Run Number and can see the Run Properties window.  At the bottom of this window are sections for Arguments and Metrics.  These are all of the values we logged using the logger returned by azureml.logging.get_azureml_logger().

Hopefully we've pulled the curtain back a little on the magic of Python in Azure Machine Learning Workbench.  This series has been a great way for us to get our hands dirty using this new addition to Azure's Data Science offerings.  We will definitely consider adding this to one of our projects in the future and hope that you will too.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, May 14, 2018

Azure Machine Learning Workbench: Python Notebooks

Today, we're going to continue our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started, Utilizing Different Environments, Built-In Data Sources and Built-In Data Preparation (Part 1, Part 2, Part 3).  In this post, we're going to begin looking at the Python Notebook available within the "Classifying Iris" project.

Notebooks are a very interesting coding technique that have risen to prominence recently.  Basically, they allow us to write code in a particular language, Python in this case, in an environment where we can see the code, results and comments in a single view.  This is extremely useful in scenarios where we want to showcase our techniques and results to our colleagues.  Let's take a look at the Notebook within AML Workbench.
iris Notebook
On the left side of the screen, select the "Notebook" icon.  This displays a list of all the Notebooks saved in the project.  In our case, we only have one.  Let's open the "iris" Notebook.
Page 1
The first step to using the Notebook is to start the Notebook Server.  We do this by selecting the "Start Notebook Server" button at the top of the "iris" tab.
Notebook Started
The first thing we notice is the Jupyter icon at the top of the screen.  Jupyter is an open-source technology that creates the Notebook interface that we see here.  You can read more about Jupyter here.  Please note that Jupyter is not the only Notebook technology available, but it is one of the more common ones.  Feel free to look more into Notebooks if you are interested.

We also notice that the top-right corner now says "EDIT MODE" instead of "PREVIEW MODE".  This means that we now have the ability to interact with the Notebook.  However, we first need to instantiate a "kernel".  For clarity, the term "kernel" here refers to the computer science usage, i.e. the process that actually executes our code.  You can read more about the other types of kernels here.  Basically, without a kernel, we don't have any way of actually running our code.  So, let's spin one up.
Kernel Selection
We can instantiate a new kernel by selecting "Kernel -> Change Kernel -> Classifying_Iris local".  This will spin up an instance on our local machine.  In more advanced use cases, it's possible to spin up remote containers using Linux VMs or HDInsight clusters.  These can be very useful if we want to run analyses using more power than we have available on our local machine.  You can read more about AML Workbench kernels here.
Notebook Ready
Once we select a new kernel, we see that the kernel name appears in the top-right of the tab, along with an open circle.  This means that the kernel is "idle".

The creators of this notebook were nice enough to provide some additional information in the notebook.  This formatted text is known as "Markdown".  Basically, it's a very easy way to add cleanly formatted text to the notebook.  You can read more about it here.

Depending on your setup, you may need to run the two commands shown in the notebook from the "Command Prompt".  We looked at how to use the Command Prompt in the first post in this series.  If you run into any issues, try running the commands in the first post and restarting the kernel.  Let's look at the first segment of code.
Segment 1
Code segments can be identified by the grey background behind them.  This code segment sets up some basic notebook options.  The "%matplotlib inline" command allows the plots created in subsequent code segments to be visualized and saved within the notebook.  The "%azureml history off" command tells AML Workbench not to store history for the subsequent code segments.  This is extremely helpful when we are importing packages and documenting settings, as we don't want to verbosely log these types of tasks.
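Since the segment itself is only shown as a screenshot, here is roughly what that cell contains, based on the description above:

<CODE START>

%matplotlib inline
%azureml history off

<CODE END>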

We also see one of the major advantages of utilizing notebooks.  The "%azureml history off" command creates an output in the console.  The notebook captures this and displays it just below the code segment.  We'll see this in a much more useful manner later in this post.  Let's check out the next code segment.
Segment 2
In Python, we have a few options for importing existing objects.  Basically, libraries contain modules, which contain functions.  We can import the entire library, an individual module within that library, or an individual function within that module.  In Python, we often refer to modules and functions using "dot notation".  We'll see this a little later.  We bring it up now because it can be cumbersome to refer to the "matplotlib.pyplot.figure" function by its full name.  So, we see that the above code aliases the pyplot module using the "as plt" snippet.  Here's a brief synopsis of what each of these libraries/modules does, along with links.

pickle: Serializes and Deserializes Python Objects
sys: Contains System-Specific Python Parameters and Functions
os: Allows Interaction with the Operating System
numpy: Allows Array-based Processing of Python Objects
matplotlib: Allows 2-Dimensional Plotting
sklearn (aka scikit-learn): Contains Common Machine Learning Capabilities
azureml: Contains Azure Machine Learning-specific Functions
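To make the aliasing concrete, here is a hedged sketch of what imports like these typically look like; the exact list in the notebook's cell may differ slightly from what is shown here.

<CODE START>

import pickle                    # import an entire library
import sys
import os

import numpy as np               # import a library and alias it
import matplotlib.pyplot as plt  # import a single module and alias it

from sklearn.linear_model import LogisticRegression      # import a single class from a module
from sklearn.metrics import precision_recall_curve       # import a single function from a module

from azureml.logging import get_azureml_logger           # AML Workbench logging helper

<CODE END>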

Let's move on to the next segment.
Segment 3
The "get_azureml_logger()" function allows us to explicitly define what we output to the AML Workbench Logs.  This is crucial for production quality data science workflows.  You can read more about this functionality here.

Finally, this code prints the current version of Python we are utilizing.  Again, we see the advantage of using notebooks, as we get to see the code and the output in a single block.  Let's move on to the final code segment for this post.
Segment 4
Since we are ready to begin the data science portion of the experiment, we turn the logging back on.  The rest of the code segments in this notebook deal with data science programming.  So, we'll save this for the next post.

Hopefully, this post opened your minds to the possibilities of using Python Notebooks within AML Workbench.  Notebooks are quickly becoming the industry standard technique for sharing data science experiments, and will no doubt play a major role in the day-to-day tasks of most Data Scientists.  Stay tuned for the next post where we'll walk through the data science code within the "iris" notebook.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, April 23, 2018

Azure Machine Learning Workbench: Built-In Data Preparation, Part 3

Today, we're going to continue our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started, Utilizing Different Environments, Built-In Data Sources and started looking at Built-In Data Preparation (Part 1, Part 2).  In this post, we're going to continue to focus on the built-in data preparation options that the AML Workbench provides.  Specifically, we're going to look at the different "Inspectors".  We touched on these briefly in Part 1.  If you haven't read the previous posts (Part 1, Part 2), it's recommended that you do so now as it provides context around what we've done so far.

We performed some unusual transformations in the last post for the sake of showcasing functionality, but they didn't have any effect on the overall project.  So, we'll delete all of the Data Preparation Steps past the initial "Reference Dataflow" step.
Iris Data
Next, we want to take a look at the different options in the "Inspectors" menu.  When building predictive models or performing feature engineering, it's extremely important to understand what's actually inside the data.  One way to do this is to graph or tabulate the data in different ways.  Let's see what options are available within the "Inspectors" menu.
Inspectors
The first option is "Column Statistics".  Let's see what this does.
Column Statistics
If we view the statistics of a string (also known as categorical) feature, we see the most common value (also known as the mode), the number of times the mode occurs and the number of unique values.  Honestly, this isn't a very useful way to look at a categorical feature.

However, when we view the statistics of a numeric feature, we get some very useful mathematical values.  There are a few points of interest here.  First, the median and the mean are very close.  This means that our data is not heavily skewed.  We can also see the minimum and maximum values, as well as the upper and lower quartiles.  These will let us know if we have "heavy tails" (lots of extreme observations) or even if we have impossible data, such as an Age of 500 or -1.  In this case, there's nothing that jumps out at us from these values.  However, wouldn't it be easier if we could see this visually?  Cue the histogram!
Histogram
Histograms can only be created using numeric features.  In this case, we chose the Sepal Length feature.  This view gives us the same information that we saw earlier, except in a graphical format that is easier to digest.  For instance, we can see that the values between 5 and 7 occur at approximately the same rate, whereas values outside of that range occur less frequently.  This could be very useful information depending on which features we want to engineer or which predictive models we want to build.

We also have the option of hovering over the bars to see the exact range they represent.  We can even select the bars, then select the Filter icon in the top-right corner (look for the red box in the above screenshot) of the histogram to see a detailed view.
Histogram (Filtered)
This gives us a much more precise picture of what's going on in the data.  By looking at the "Steps" pane, we can see that AML Workbench accomplishes this view by filtering the underlying dataset.  Keep this in mind, as it will affect any other Inspectors we have open.  Yay for interactivity!
Histogram (Unfiltered)
Interestingly, if we return to the unfiltered version, we can see the underlying filtered version as context.  This could lead to some very interesting analyses in practice.

Finally, we can select the "Edit" button in the top-right of the histogram to edit the settings.
Edit Histogram
We can change the number of buckets or the scale, as well as make some cosmetic changes.  Alas, let's move on to "Value Counts".
Value Counts
Value Counts is what we normally call a Bar Chart.  It shows us the six most common values within the string feature.  This number can be changed by selecting the "Edit" button in the top-right of the Value Counts chart.  Given the nature of this data set, this chart isn't very interesting.  However, it does allow us to use the filtering capabilities (just like with the Histogram) to see some interesting interactions between charts.
Interactive Filters
While this pales in comparison to a tool like Power BI, it does offer some basic capabilities that could be beneficial to our data analysis.

At this point in this post, we wanted to showcase the "Pattern Frequency" chart.  However, our dataset doesn't have a feature appropriate for this.  Instead, we pulled a screenshot from Microsoft Docs (source).
Pattern Frequency
Basically, this chart shows the different types of strings that occur in the data.  This can be extremely useful if we are looking for certain types of serial numbers, ids or any type of string that has a strict or naturally occurring pattern.
Box Plot
The next inspector is the Box Plot.  This type of chart is good for showcasing data skew.  The middle line represents the median, and the box represents the range between the first and third quartiles (25% to 75%, also known as the Interquartile Range or IQR).  The whiskers extend from the edges of the box out to the min and max.  Some box plots have a variation where extreme outliers are represented as * outside of the whiskers.  Honestly, we've always found histograms to be more informative than box plots.  So, let's move on to the Scatterplot.
Scatterplot
The Scatterplot is useful for showing the relationship between two numeric variables.  We can even group the observations by another feature.  In this case, we can see that the blue (Iris-setosa) dots are pretty far away from the green (Iris-virginica) and orange (Iris-versicolor) dots.  This means that there may be a way for us to predict Species based on these values (if this was our goal).  Perhaps our bias is showing at this point, but this is another area where we would prefer to use Power BI.  Alas, this is still pretty good for an "all-in-one" data science development interface that is (as of the time of writing) still in preview.  Next is the Time Series plot.
Time Series
Since our data does not have a time component, we used the Energy Demand sample (source) instead.  The Time Series plot is a basic line chart that shows one or more numeric features relative to a single datetime feature.  This could be beneficial if we needed to identify temporal trends in our data.  Finally, let's take a quick look at the Map.
Map
Again, our dataset is lacking in diverse data.  So, mapping would have been impossible.  However, we were able to dig up a list of all the countries in the world (source).  We were able to use this to create a quick map.  In general, we are rarely fans of mapping anything.  Usually, maps end up extremely cluttered, making it difficult to distinguish any significant information.  However, there are rare cases where maps can show interesting results.  John Snow's cholera map is perhaps the most famous.

Hopefully, this post showed you how Inspectors can add some much needed visualizations to the data science process in AML Workbench.  Visualizations are one of the most important parts of the Data Cleansing and Feature Engineering process.  Therefore, any data science tool would need robust visualization capabilities.  While we are not currently impressed by the breadth of Inspectors within AML Workbench, we expect that Microsoft will make great investments in this area before they release the tool as GA.  Stay tuned for the next post where we'll walk through the Python notebook that predicts Species within the "Classifying Iris" project.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com