Monday, May 14, 2018

Azure Machine Learning Workbench: Python Notebooks

Today, we're going to continue our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started, Utilizing Different Environments, Built-In Data Sources and Built-In Data Preparation (Part 1, Part 2, Part 3).  In this post, we're going to begin looking at the Python Notebook available within the "Classifying Iris" project.

Notebooks are a very interesting coding technique that have risen to prominence recently.  Basically, they allow us to write code in a particular language, Python in this case, in an environment where we can see the code, results and comments in a single view.  This is extremely useful in scenarios where we want to showcase our techniques and results to our colleagues.  Let's take a look at the Notebook within AML Workbench.
iris Notebook
On the left side of the screen, select the "Notebook" icon.  This displays a list of all the Notebooks saved in the project.  In our case, we only have one.  Let's open the "iris" Notebook.
Page 1
The first step to using the Notebook is to start the Notebook Server.  We do this by selecting the "Start Notebook Server" button at the top of the "iris" tab.
Notebook Started
The first thing we notice is the Jupyter icon at the top of the screen.  Jupyter is an open-source technology that creates the Notebook interface that we see here.  You can read more about Jupyter here.  Please note that Jupyter is not the only Notebook technology available, but it is one of the more common ones.  Feel free to look more into Notebooks if you are interested.

We also notice that the top-right corner now says "EDIT MODE" instead of "PREVIEW MODE".  This means that we now have the ability to interact with the Notebook.  However, we first need to instantiate a "kernel".  For clarity, the term "kernel" here refers to the computer science term.  You can read more about the other types of kernels here.  Basically, without a kernel, we don't have any way of actually running our code.  So, let's spin one up.
Kernel Selection
We can instantiate a new kernel by selecting "Kernel -> Change Kernel -> Classifying_Iris local".  This will spin up an instance on our local machine.  In more advanced use cases, it's possible to spin up remote containers using Linux VMs or HDInsight clusters.  These can be very useful if we want to run analyses using more power than we have available on our local machine.  You can read more about AML Workbench kernels here.
Notebook Ready
Once we select a new kernel, we see that the kernel name appears in the top-right of the tab, along with an open circle.  This means that the kernel is "idle".

The creators of this notebook were nice enough to provide some additional information in the notebook.  This formatted text is known as "Markdown".  Basically, it's a very easy way to add cleanly formatted text to the notebook.  You can read more about it here.

Depending on your setup, you may need to run these two commands from the "Command Prompt".  We looked at how to use the Command Prompt in the first post in this series.  If you run into any issues, try running the commands in the first post and restarting the kernel.  Let's look at the first segment of code.
Segment 1
Code segments can be identified by the grey background behind them.  This code segment sets up some basic notebook options.  The "%matplotlib inline" command allows the plots created in subsequent code segments to be visualized and saved within the notebook.  The "%azureml history off" command tells AML Workbench not to store history for the subsequent code segments.  This is extremely helpful when we are importing packages and documenting settings, as we don't want to verbosely log these types of tasks.

We also see one of the major advantages of utilizing notebooks.  The "%azureml history off" command creates an output in the console.  The notebook captures this and displays it just below the code segment.  We'll see this in a much more useful manner later in this post.  Let's check out the next code segment.
Segment 2
In Python, we have a few options for importing existing objects.  Basically, libraries contain modules, which contain functions.  We have the option of importing the entire library, an individual module within that library or an individual function within that module.  In Python, we often refer to modules and functions using "dot notation".  We'll see this a little later.  We bring it up now because it can be cumbersome to refer to the "matplotlib.pyplot.figure" function using its full name.  So, we see that the above code aliases this module using the "as plt" code snippet.  Here's a brief synopsis of what each of these libraries/modules does, along with links.

pickle: Serializes and Deserializes Python Objects
sys: Contains System-Specific Python Parameters and Functions
os: Allows Interaction with the Operating System
numpy: Allows Array-based Processing of Python Objects
matplotlib: Allows 2-Dimensional Plotting
sklearn (aka scikit-learn): Contains Common Machine Learning Capabilities
azureml: Contains Azure Machine Learning-specific Functions

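The import styles described above are easy to see with the standard library; here, "os" stands in for the libraries listed (a sketch, not the notebook's actual code):

```python
import os                      # import the entire library
import os.path as osp          # import a module, aliased to shorten dot notation
from os.path import join       # import a single function from that module

# All three names resolve to the same underlying function
assert osp.join is os.path.join is join
```

The notebook's "import matplotlib.pyplot as plt" follows the second pattern.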
Let's move on to the next segment.
Segment 3
The "get_azureml_logger()" function allows us to explicitly define what we output to the AML Workbench Logs.  This is crucial for production quality data science workflows.  You can read more about this functionality here.

Finally, this code prints the current version of Python we are utilizing.  Again, we see the advantage of using notebooks, as we get to see the code and the output in a single block.  Let's move on to the final code segment for this post.
Segment 4
Since we are ready to begin the data science portion of the experiment, we turn the logging back on.  The rest of the code segments in this notebook deal with data science programming.  So, we'll save this for the next post.

Hopefully, this post opened your minds to the possibilities of using Python Notebooks within AML Workbench.  Notebooks are quickly becoming the industry standard technique for sharing data science experiments, and will no doubt play a major role in the day-to-day tasks of most Data Scientists.  Stay tuned for the next post where we'll walk through the data science code within the "iris" notebook.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions

Monday, April 23, 2018

Azure Machine Learning Workbench: Built-In Data Preparation, Part 3

Today, we're going to continue our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started, Utilizing Different Environments, Built-In Data Sources and started looking at Built-In Data Preparation (Part 1, Part 2).  In this post, we're going to continue to focus on the built-in data preparation options that the AML Workbench provides.  Specifically, we're going to look at the different "Inspectors".  We touched on these briefly in Part 1.  If you haven't read the previous posts (Part 1, Part 2), it's recommended that you do so now as it provides context around what we've done so far.

We performed some unusual transformations in the last post for the sake of showcasing functionality, but they didn't have any effect on the overall project.  So, we'll delete all of the Data Preparation Steps past the initial "Reference Dataflow" step.
Iris Data
Next, we want to take a look at the different options in the "Inspectors" menu.  When building predictive models or performing feature engineering, it's extremely important to understand what's actually inside the data.  One way to do this is to graph or tabulate the data in different ways.  Let's see what options are available within the "Inspectors" menu.
The first option is "Column Statistics".  Let's see what this does.
Column Statistics
If we view the statistics of a string (also known as categorical) feature, we see the most common value (also known as the mode), the number of times the mode occurs and the number of unique values.  Honestly, this isn't a very useful way to look at a categorical feature.

However, when we view the statistics of a numeric feature, we get some very useful mathematical values.  There are a few points of interest here.  First, the median and the mean are very close.  This means that our data is not heavily skewed.  We can also see the minimum and maximum values, as well as the upper and lower quartiles.  These will let us know if we have "heavy tails" (lots of extreme observations) or even if we have impossible data, such as an Age of 500 or -1.  In this case, there's nothing that jumps out at us from these values.  However, wouldn't it be easier if we could see this visually?  Cue the histogram!
Histograms can only be created using numeric features.  In this case, we chose the Sepal Length feature.  This view gives us the same information that we saw earlier, except in a graphical format that is easier to digest.  For instance, we can see that the values between 5 and 7 occur at approximately the same rate, whereas values outside of that range occur less frequently.  This could be very useful information depending on which features or predictive models we wanted to build.

We also have the option of hovering over the bars to see the exact range they represent.  We can even select the bars, then select the Filter icon in the top-right corner (look for the red box in the above screenshot) of the histogram to see a detailed view.
Histogram (Filtered)
This gives us a much more precise picture of what's going on in the data.  By looking at the "Steps" pane, we can see that AML Workbench accomplishes this view by filtering the underlying dataset.  Keep this in mind, as it will affect any other Inspectors we have open.  Yay for interactivity!
Histogram (Unfiltered)
Interestingly, if we return to the unfiltered version, we can see the underlying filtered version as context.  This could lead to some very interesting analyses in practice.

Finally, we can select the "Edit" button in the top-right of the histogram to edit the settings.
Edit Histogram
We can change the number of buckets or the scale, as well as make some cosmetic changes.  Alas, let's move on to "Value Counts".
Value Counts
Value Counts is what we normally call a Bar Chart.  It shows us the six most common values within the string feature.  This number can be changed by selecting the "Edit" button in the top-right of the Value Counts chart.  Given the nature of this data set, this chart isn't very interesting.  However, it does allow us to use the filtering capabilities (just like with the Histogram) to see some interesting interactions between charts.
Interactive Filters
While this pales in comparison to a tool like Power BI, it does offer some basic capabilities that could be beneficial to our data analysis.

At this point in this post, we wanted to showcase the "Pattern Frequency" chart.  However, our dataset doesn't have a feature appropriate for this.  Instead, we pulled a screenshot from Microsoft Docs (source).
Pattern Frequency
Basically, this chart shows the different types of strings that occur in the data.  This can be extremely useful if we are looking for certain types of serial numbers, ids or any type of string that has a strict or naturally occurring pattern.
Box Plot
The next inspector is the Box Plot.  This type of chart is good for showcasing data skew.  The middle line represents the median, and the box represents the range between the first and third quartiles (25% to 75%, aka the Interquartile Range or IQR).  The whiskers extend from the IQR out to the min and max.  Some box plots have a variation where extreme outliers are represented as asterisks outside of the whiskers.  Honestly, we've always found histograms to be more informative than box plots.  So, let's move on to the Scatterplot.
The Scatterplot is useful for showing the relationship between two numeric variables.  We can even group the observations by another feature.  In this case, we can see that the blue (Iris-setosa) dots are pretty far away from the green (Iris-virginica) and orange (Iris-versicolor) dots.  This means that there may be a way for us to predict Species based on these values (if this was our goal).  Perhaps our bias is showing at this point, but this is another area where we would prefer to use Power BI.  Alas, this is still pretty good for an "all-in-one" data science development interface that is (as of the time of writing) still in preview.  Next is the Time Series plot.
Time Series
Since our data does not have a time component, we used the Energy Demand sample (source) instead.  The Time Series plot is a basic line chart that shows one or more numeric features relative to a single datetime feature.  This could be beneficial if we needed to identify temporal trends in our data.  Finally, let's take a quick look at the Map.
Again, our dataset is lacking in diverse data.  So, mapping would have been impossible.  However, we were able to dig up a list of all the countries in the world (source).  We were able to use this to create a quick map.  In general, we are rarely fans of mapping anything.  Usually, maps end up being extremely cluttered, making it difficult to distinguish any significant information.  However, there are rare cases where maps can show interesting results.  John Snow's cholera map is perhaps the most famous.

Hopefully, this post showed you how Inspectors can add some much needed visualizations to the data science process in AML Workbench.  Visualizations are one of the most important parts of the Data Cleansing and Feature Engineering process.  Therefore, any data science tool would need robust visualization capabilities.  While we are not currently impressed by the breadth of Inspectors within AML Workbench, we expect that Microsoft will make great investments in this area before they release the tool as GA.  Stay tuned for the next post where we'll walk through the Python notebook that predicts Species within the "Classifying Iris" project.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions

Monday, April 2, 2018

Azure Machine Learning Workbench: Built-In Data Preparation, Part 2

Today, we're going to continue our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started, Utilizing Different Environments, Built-In Data Sources and started looking at Built-In Data Preparation.  In this post, we're going to continue to focus on the built-in data preparation options that the AML Workbench provides.  If you haven't read the previous post, it's recommended that you do so now as it provides context around how to navigate the Data Preparation tab.
Iris Data
So far, the columns in this Dataflow have been given names, records with no "Species" have been filtered out and we've created some Inspectors.  Let's see what other options are available.
Dataflows Menu
The first relevant menu we find in the top bar is the "Dataflows" menu.  This menu has options for "Create Dataflow", "Rename Dataflow" and "Remove Dataflow".  "Rename Dataflow" and "Remove Dataflow" aren't very exciting.  However, "Create Dataflow" is more interesting than we initially thought.
Create Dataflow
This option allows us to create a new Dataflow using an existing Dataflow.  In fact, the existing Dataflow doesn't even need to be part of the same Data Preparation Package or Workspace.  We only need to be able to provide a DPREP file (this is what AML Workbench creates on the backend when we create a Data Preparation Package).  This means that we can have one set of steps stored in the first Dataflow, then use the result of those steps as the start of another Dataflow.  Why would we want this?  The major reason is that this allows us to create a "base" Dataflow that contains clean data, then use that Dataflow multiple times for different purposes.  We could even provide this DPREP file out to other people so they can see the same data we're seeing.  Please note that we are not recommending this as the preferred approach for sharing data.  We have databases, SQL and otherwise, for this task.

This raises an interesting question for us.  If we reference an external Dataflow using a DPREP file, then that Dataflow becomes an implicit component of our Dataflow.  However, if we update that DPREP file, does our Dataflow update as well, or is it a one-time upload?
Update Test
We tested this on our own.  When we reference an external Dataflow, a refreshable, but NOT LIVE, connection is created.  When we update the external Dataflow, we need to refresh our Dataflow to see the changes propagate.

Just for fun, let's create a new Dataflow using the existing "Iris" Dataflow.

The next menu is the "Transformations" menu.  We could easily write an entire blog series on just the contents of this menu.  Obviously, there are the basics you would expect from any tool, such as changing data types, joins/appends and summarizations.  We'll touch on a couple of the more interesting features and leave the rest up to you to explore.

The most common type of transformation is adding a new column.  In AML Workbench, we have the option of creating a new column with the "Add Column (Script)" option from the "Transformations" menu.
Add Column (Script)
This opens up a new window where we can name our column, decide where it will be located in the dataset and create an expression.  It also provides a hint at the bottom in case we aren't sure how to use the script functionality.
Add Column (Script) Window 1

Add Column (Script) Window 2
The scripting language we will use here is Python v3.5.  As a quick showcase of functionality, we created a column called "Random", which contains a random value between 0 and 1.  In order to do this, we need a Python library called "random".  Since this library is not included by default, we need to import it first.  This leads us to the following code:
import random
For those unfamiliar with Python, it is a case-sensitive language.  In this case, we are importing the "random" library, and calling the "random" function from within that library.
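Putting the two pieces together, the expression behind the "Random" column presumably looks like this (a sketch of the script window's contents, not a screenshot):

```python
import random

# One uniform draw in [0, 1), called via dot notation: library.function()
value = random.random()
assert 0.0 <= value < 1.0
```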

Another interesting thing to note about this window is that we don't have to use a custom expression.  The "Code Block Type" dropdown gives us the option of using a "module".  This would allow us to save a large, shareable block of code as a .py file.  Then we could use that module in the script by using the "import" command.  This is another victory for shareable code.
Code Block Type
Some of you may be thinking "Why did you create a useless column like Random?"  Turns out, it's not entirely useless for our purposes.  It allows us to show off our next piece of functionality, Sort.
The Sort option is also found within the "Transformations" menu.  If we select the "Random" column, then the "Sort" option, we can sort the dataset by the "Random" column.  This gives us a clean look at the different values in the columns.  While this has no analytical value, it does give us a quick overview of what we are looking at.
Iris Data (Sorted)
We could have also accomplished this by right-clicking the "Random" column header, and selecting the "Sort" option.
Sort (Right-Click)
Moving on, another interesting option within the "Transformations" menu is the "Advanced Filter (Script)" option.
Advanced Filter (Script)
This option allows us to use Python code to filter our dataset using any filter logic that we could possibly write using Python (which is almost anything we would ever want to do).
Advanced Filter (Script) Window
In this case, we decided to randomly filter out half of our data by using the "Random" column.  We could just as easily have filtered on one of the other columns.  We could even have created a brand new expression in this window and filtered on it.  The possibilities are almost endless.
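The same filter logic can be sketched in plain Python; the column name and threshold follow the description above, but this is an illustration rather than the exact script:

```python
import random

# Build some rows with a "Random" column, then keep only those below 0.5,
# dropping roughly half of the data
rows = [{'Sepal Length': 5.0, 'Random': random.random()} for _ in range(100)]
kept = [row for row in rows if row['Random'] < 0.5]
```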

The final script option we'll look at is "Transform Dataflow (Script)".  This can also be found in the "Transformations" menu.
Transform Dataflow (Script)
This option is basically the mother of all transformations.  This allows us to use Python to completely redesign our dataset in virtually any way we want.
Transform Dataflow (Script) Window
As a showcase of functionality, we used Python to create a new column, filter the dataset further and remove a column.  Here's the code we used:

import numpy as np  # np isn't defined for us by default, so we import numpy first

df['Sepal Length (log)'] = np.log( df['Sepal Length'] )
df = df[df['Random'] < .75]
del df['Random']
As you can see, the Script transformations are incredibly powerful.  There's almost nothing they can't do.  We encourage you to look through the rest of the transformations on your own.

Alas, there is one final transformation we want to touch on, "Derive Column By Example".  This can also be found in the "Transformations" menu.  Instead of providing a strict formula, this option, as well as the other "By Example" transformations, allows us to provide examples of how the new field should work.  Then, AML Workbench will deduce what we're trying to accomplish.  This can be extremely beneficial when the equation is complex or we simply want to quickly showcase something.  Let's quickly turn the "Species" column into a numeric column by using this technique.
Derived Column
By simply providing the numeric labels for three rows in the dataset, AML Workbench was able to deduce exactly what we wanted.  This is some very cool functionality that really separates AML Workbench from most of the competition.
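The rule AML Workbench deduced can be written out explicitly as a simple lookup; note that the specific numeric labels below are our own assumption, not necessarily the ones the tool chose:

```python
# Hypothetical mapping from Species to a numeric label
species_to_label = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
labels = [species_to_label[s] for s in
          ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor']]
```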

This post showcased just how much power we can get from the Built-in Data Preparation tools within AML Workbench.  Hopefully, this piqued your interest to go try it on your own.  Stay tuned for the next post where we'll continue with the Built-In Data Preparation by looking at Inspectors.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions

Monday, March 12, 2018

Azure Machine Learning Workbench: Built-In Data Preparation, Part 1

Today, we're going to continue our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started, Utilizing Different Environments and Built-In Data Sources.  In this post, we're going to focus on the built-in data preparation options that the AML Workbench provides.
Prepare Data
Let's start by navigating over to the "Data" pane and selecting the "Iris" Data Source.  From here, we have two options.  First, we could use an existing Data Preparation by selecting it from the "Data Preparations" list in the "Data" pane.  We also have the option of selecting the "Prepare" option at the top of the "iris" tab.  This option will allow us to start from scratch.  Let's go with the second option for now.
In the "Prepare" window, we need to choose which "Data Preparation Package" we want to use.  Basically, a Data Preparation Package is a grouping of transformations that are run as a unit.  This package will import all of the necessary data and run any number of transformations on any number of data sources, with only a single line of code.
Data Preparation
If we choose to use the existing iris.dprep package, we end up with a new tab that looks very similar to the Data Source tab, with a few additions.  On the left side of the tab, we have the "Dataflows" pane.  A Dataflow is a set of transformations performed on a single Data Source.  On the right side of the tab, we see the "Steps" pane.  A Step is a single transformation.  Therefore, we can see that multiple Steps are grouped into a single Dataflow and multiple Dataflows are grouped into a single Data Preparation Package.

Some of you may have noticed something strange about the table in the middle of the tab.  The columns suddenly have names, despite us never having supplied them.  We can even see these steps reflected in the "Steps" pane.  This is because we aren't looking at the Dataflow we just created.  Instead, we are looking at the existing Dataflow.  Our Dataflow is the second one on the list, with the identical name.
Empty Dataflow
If we select the second Dataflow in the "Dataflows" pane, we find our empty Dataflow.  In most cases, it's not very useful to have two different sets of transformations using the same data.  So, we'll throw away this Dataflow and use the one that's been provided to us.  However, it is important to note that it is possible to use the same Data Source multiple times within the same Data Preparation Package.
Remove Dataflow
Let's take a look at the existing Dataflow again.
Existing Dataflow
Steps Pane
The "Steps" pane shows us that three sets of transformations have been applied in this Dataflow.  First, the Dataflow was created.  Then, the column names were added since they did not exist in the original csv.  Finally, the Dataflow was filtered based on the "Species" column.  We can edit any of these steps by selecting the arrow next to them, then selecting "Edit".
Reference Dataflow
Rename Column
Filter Column
We can use these windows to see exactly what each step is doing and make any changes if we would like.
Step Options
Outside of editing the individual step, we also have the option of moving the step up or down in the Dataflow or deleting it entirely.  It's important to note that modifying steps further back in the Dataflow could potentially break steps that occur after it.  For instance, assume we are using column "A" in the calculation of column "B", then deleting column "A".  If we were to move the delete step before the calculation step, then the calculation of column "B" would break because column "A" no longer exists.
Halfway Complete
Another very important thing to notice about the "Steps" pane is that we can choose to look at the Dataflow after ANY number of steps.  For instance, if we select the "Rename Column3 to Petal Length" step, we can see the Dataflow as it looked after that step.  Notice that Column4 and Column5 have not been renamed yet.  Using this technique, we can add a new Step at this point, thereby adding it to the middle of the existing Dataflow.  This can be useful if we find that we missed a transformation along the line.
Let's finish by talking about "Inspectors".  Inspectors are nothing more than charts that give us a live view of our Dataflow.  We can see that we have a few different options to choose from.  More importantly, Inspectors are completely independent of Steps.  This means that we can create an Inspector to look at our data, then see how that particular data point changes with each step.  For instance, we have an Inspector that shows us the "Top 6 values of 'Species'".  If we move back to a step before the column "Species" existed, we see that this Inspector is no longer valid.  Obviously, this could be extremely helpful for examining the impact of certain filters or calculations.
No Species
There's way more to cover here than we have time for in this post.  Hopefully, this post opened your eyes to how easy it is to use the Built-In Data Preparation options in the Azure Machine Learning Workbench.  If you're eager to see more about Data Preparation in AML Workbench, read this.  Stay tuned for the next post where we'll be walking through some of the transformation options available in this extremely powerful tool.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions