One of the biggest advantages of the cloud for modern data science is the ability to endlessly scale your resources in order to solve the problem at hand. In some cases, like small-scale development, it's acceptable to run a process on our local machine. However, as we need more processing power, we need to be able to run our code in more powerful environments, such as Azure Virtual Machines or HDInsight clusters. Let's see how AML Workbench helps us accomplish this.
If you are new to the AML Workbench and haven't read the previous post, it is highly recommended that you do so. The rest of this post will build on what we learned in the previous one.
Here's the first piece of code we will run.
az ml experiment submit -c local iris_sklearn.py
This code runs the "iris_sklearn.py" Python script using our local machine. We'll cover exactly what this script does in a later post. All we need to know for now is that it's running on our local machine using Python. As we mentioned before, using the local machine is great if we're just trying to do something small without having to worry about connecting to remote resources. Here's the output.
Executing user inputs .....
Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
Iris dataset shape: (150, 5)
Regularization rate is 0.01
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
Accuracy is 0.6792452830188679
Serialize and deserialize using the outputs folder.
Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50 0 0]
[ 1 37 12]
[ 0 4 46]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.
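To make the confusion matrix in the output easier to read, here is a small sketch that turns it into per-class recall and an overall accuracy. It assumes the scikit-learn convention (rows are true classes, columns are predicted classes). It also assumes the class order shown in `labels`; only 'Iris-setosa' appears in the output above, so the other two names and their positions are assumptions.

```python
# Confusion matrix copied from the run output above.
# Assumption: rows are true classes and columns are predicted classes
# (the scikit-learn convention). The label order is also an assumption.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
matrix = [
    [50, 0, 0],
    [1, 37, 12],
    [0, 4, 46],
]


def per_class_recall(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def overall_accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


recalls = per_class_recall(matrix)
for label, recall in zip(labels, recalls):
    print(f"{label}: recall {recall:.2f}")
print(f"Overall accuracy on these 150 rows: {overall_accuracy(matrix):.3f}")
```

Under these assumptions the matrix covers all 150 rows and gives a higher figure than the "Accuracy is 0.679..." line in the log, which suggests the two numbers were computed on different subsets of the data. The script itself, covered in a later post, would settle that.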
Here's the next piece of code.
az ml experiment submit -c docker-python iris_sklearn.py
This code runs the same "iris_sklearn.py" script as before. However, this time it uses a Python-enabled Docker container. Docker is a technology that allows us to package an entire environment into a single object. This is extremely useful when we are trying to deploy code across distributed systems. For instance, some organizations will wrap their applications in Docker containers, then deploy the Docker containers. This allows them to manage the applications much more easily because they can update the master Docker container, and that update can be automatically deployed to all of the existing Docker containers. You can read more about Docker and containers here, here and here. Unfortunately, we're unable to install Docker on our machine. So, we'll have to skip this one. Let's take a look at the next piece of code.
az ml experiment submit -c docker-spark iris_pyspark.py
This code runs a new script called "iris_pyspark.py". We'll save the in-depth analysis of the code for a later post. In short, PySpark is a way to harness the power of Spark's big data analytical functionality from within Python. This can be extremely useful when we want to analyze or model big data problems without using a remote Spark cluster. Let's take a look at the next piece of code.
az ml computetarget attach --name myvm --address <ip address or FQDN> --username <username> --password <pwd> --type remotedocker
az ml experiment prepare -c myvm
az ml experiment submit -c myvm iris_pyspark.py
This is where things start to get interesting. Previously, we were running everything on our local machine. This is great when data is small. However, it becomes unusable when we need to point to larger data sources. Fortunately, the AML Workbench allows us to attach to a remote virtual machine in cases where we need additional resources.
Another important thing to notice is that we were able to seamlessly run the same code on the virtual machine that we ran on our local machine. This means that we can develop on small samples on our local machine, then effortlessly run the same code on a larger virtual machine when we want to test against a larger dataset. This is exactly why containers are becoming so popular. They make it effortless to move code from a less powerful environment, like a local machine, up to a more powerful one, like a large virtual machine.
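The point about running the same script against different targets can be sketched in Python. This is only an illustration: the helper name `submit_command` is hypothetical, and the target names and scripts are the ones used in the commands in this post. The sketch only builds and prints the commands; it does not call the CLI.

```python
import shlex


def submit_command(target, script):
    """Build the `az ml experiment submit` command for one compute target."""
    return ["az", "ml", "experiment", "submit", "-c", target, script]


# Target and script pairs taken from the commands shown in this post.
runs = [
    ("local", "iris_sklearn.py"),
    ("docker-python", "iris_sklearn.py"),
    ("docker-spark", "iris_pyspark.py"),
    ("myvm", "iris_pyspark.py"),
]

for target, script in runs:
    # Only the value passed to -c changes between environments.
    print(shlex.join(submit_command(target, script)))
```

To actually submit a run, each list could be passed to `subprocess.run(cmd, check=True)` on a machine where the CLI is installed and the target has been attached and prepared.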
Another advantage of this ability is that we can now manage resource costs by limiting virtual machine usage. The entire team can share the same virtual machine, using it only when they need the extra power. We can even turn the VM off when we aren't using it, saving even more money. You can read more about Azure Virtual Machines here.
Let's move to the final piece of code.
az ml computetarget attach --name myhdi --address <ip address or FQDN of the head node> --username <username> --password <pwd> --type cluster
az ml experiment prepare -c myhdi
az ml experiment submit -c myhdi iris_pyspark.py
This code expands on the same concepts as the previous one. In some cases, we have very large resource needs. In those cases, even a powerful virtual machine may not have enough juice. For those cases, we can use containers to deploy to an Azure HDInsight cluster. This will allow us to take the same code we ran on our local machine and execute it full-scale using the power of Hadoop. You can read more about HDInsight clusters here.
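For both remote targets, the commands follow the same three steps: attach the compute target, prepare it, then submit the script. A minimal sketch of that sequence follows, using only the flags that appear in the commands in this post. The function name `remote_workflow` is hypothetical, and the angle-bracket values are the post's own placeholders, not real settings.

```python
def remote_workflow(name, address, username, password, target_type, script):
    """Return the attach, prepare and submit commands in the order they run."""
    attach = [
        "az", "ml", "computetarget", "attach",
        "--name", name,
        "--address", address,
        "--username", username,
        "--password", password,
        "--type", target_type,
    ]
    prepare = ["az", "ml", "experiment", "prepare", "-c", name]
    submit = ["az", "ml", "experiment", "submit", "-c", name, script]
    return [attach, prepare, submit]


# Placeholder values kept exactly as in the post; replace them before use.
steps = remote_workflow(
    name="myhdi",
    address="<ip address or FQDN of the head node>",
    username="<username>",
    password="<pwd>",
    target_type="cluster",
    script="iris_pyspark.py",
)

for step in steps:
    print(" ".join(step))
```

Calling the same function with `name="myvm"` and `target_type="remotedocker"` would produce the virtual machine variant, which shows that only the target name and type differ between the two setups.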
This post has opened our eyes to the power and flexibility that the AML Workbench can provide. While it's more complicated than using its AML Studio counterpart, the power and flexibility it provides via containers can make all the difference for some organizations. Stay tuned for the next post where we'll walk through the built-in data preparation capabilities of the Azure Machine Learning Workbench. Thanks for reading. We hope you found this informative.
Data Science Consultant