Monday, January 7, 2019

Data Science in Power BI: Azure Machine Learning Studio

Today, we're going to talk about Azure Machine Learning Studio within Power BI.  If you haven't read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query Editor and Python, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let's move on to the core of this post, Azure Machine Learning Studio.

Azure Machine Learning Studio is a tool that we have talked about extensively on this blog.  It allows us to create advanced analytics and predictive modeling experiments using an intuitive, drag-and-drop interface.  It also allows us to easily productionalize our predictive models by deploying them as web services with their own APIs.  Here are links to the first posts in two different series that cover these topics, Azure Machine Learning: Getting Started and Azure Machine Learning in Practice: Fraud Detection.

In this post, we're going to use an Azure Machine Learning Studio dataset that we've seen in a number of posts, as well as in our presentations, Azure Machine Learning Studio: Making Data Science Easy(er) and Azure Machine Learning Studio: Four Tips from the Pros.  This is the Adult Census Income Binary Classification Dataset.  Our goal will be to predict whether a customer's income exceeds $50K based on a number of demographic features.

Since we're going to be using this dataset in both Azure Machine Learning Studio and Power BI, the easiest option is to put it in an Azure SQL Database and let both tools query it from there.  Before we do that, we want to split the data into two different sets: a training set for building the model and a testing set for evaluating it.  We've talked about this technique a number of times on this blog.  Here's one example.  Fortunately, this is quite easy to do in Azure Machine Learning Studio using the "Split Data" module.
Create Testing and Training Sets
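For readers who prefer code, here's a minimal R sketch of the same split, assuming a 70/30 ratio and a hypothetical data frame called "census" standing in for the dataset; in the experiment itself, the "Split Data" module handles this with no code at all.

<CODE START>

# A minimal sketch of a 70/30 train/test split in R; "census" is a
# hypothetical stand-in for the dataset loaded into the experiment.
set.seed(42)                                              # reproducible split
train.rows <- sample(nrow(census), floor(0.7 * nrow(census)))
census.train <- census[train.rows, ]                      # training set
census.test <- census[-train.rows, ]                      # testing set

<CODE END>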
We also need to create the tables in Azure SQL Database to store the data.  Here's the code we used for that.

<CODE START>

CREATE TABLE Customers (
    age INT
    ,workclass NVARCHAR(255)
    ,fnlwgt INT
    ,education NVARCHAR(255)
    ,[education-num] INT
    ,[marital-status] NVARCHAR(255)
    ,occupation NVARCHAR(255)
    ,relationship NVARCHAR(255)
    ,race NVARCHAR(255)
    ,sex NVARCHAR(255)
    ,[capital-gain] INT
    ,[capital-loss] INT
    ,[hours-per-week] INT
    ,[native-country] NVARCHAR(255)
    ,income NVARCHAR(255)
);

CREATE TABLE Customers_New (
    age INT
    ,workclass NVARCHAR(255)
    ,fnlwgt INT
    ,education NVARCHAR(255)
    ,[education-num] INT
    ,[marital-status] NVARCHAR(255)
    ,occupation NVARCHAR(255)
    ,relationship NVARCHAR(255)
    ,race NVARCHAR(255)
    ,sex NVARCHAR(255)
    ,[capital-gain] INT
    ,[capital-loss] INT
    ,[hours-per-week] INT
    ,[native-country] NVARCHAR(255)
    ,income NVARCHAR(255)
);

<CODE END>

After loading the data into the tables, we confirm that it's there and can move on to creating the predictive model.
Check Record Counts
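If you'd rather verify the counts from R instead of a query window, here's a minimal sketch using the "DBI" and "odbc" packages; the server, database and credentials shown are placeholders you'd replace with your own.

<CODE START>

library(DBI)

# Hypothetical connection details; replace with your own Azure SQL Database values.
con <- dbConnect(odbc::odbc(),
                 Driver = "ODBC Driver 17 for SQL Server",
                 Server = "yourserver.database.windows.net",
                 Database = "yourdatabase",
                 UID = "youruser",
                 PWD = "yourpassword")

# Confirm that both tables loaded successfully.
dbGetQuery(con, "SELECT COUNT(*) AS TrainingRows FROM Customers")
dbGetQuery(con, "SELECT COUNT(*) AS TestingRows FROM Customers_New")

dbDisconnect(con)

<CODE END>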
Next, we want to create and evaluate our model for predicting "income".
Create Predictive Model
Evaluation Results
We can see that our model has an Accuracy of 86.7% and an AUC of 92.2%.  Let's assume that this is good enough for us to publish and use in our Power BI reports.  The next step is to publish this as a web service.  We've covered this approach in a previous post.
Create Predictive Web Service
Web Service Deployed


Now that we have the web service deployed to Azure, we need to be able to call this from Power BI.  To do this, we could use either R or Python.  Since Python is still in preview at the time of writing, we'll use R.  Fortunately, the Azure Machine Learning Web Services portal provides the code for us.  All we have to do is navigate to the "Consume" tab and select "Request-Response -> R".
Consume
Request-Response R Code


Here's the code that it provides.

<CODE START>

library("RCurl")
library("rjson")

# Accept SSL certificates issued by public Certificate Authorities
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

h = basicTextGatherer()
hdr = basicHeaderGatherer()

req =  list(
    Inputs = list(
            "input1"= list(
                list(
                        'id' = "1",
                        'age' = "1",
                        'workclass' = "",
                        'fnlwgt' = "1",
                        'education' = "",
                        'education-num' = "1",
                        'marital-status' = "",
                        'occupation' = "",
                        'relationship' = "",
                        'race' = "",
                        'sex' = "",
                        'capital-gain' = "1",
                        'capital-loss' = "1",
                        'hours-per-week' = "1",
                        'native-country' = "",
                        'income' = ""
                    )
            )
        ),
        GlobalParameters = setNames(fromJSON('{}'), character(0))
)

body = enc2utf8(toJSON(req))
api_key = "abc123" # Replace this with the API key for the web service
authz_hdr = paste('Bearer', api_key, sep=' ')

h$reset()
curlPerform(url = "https://ussouthcentral.services.azureml.net/subscriptions/143cebb47ba243ec9422da9787ff546b/services/2be5b160bdca49e4aa580f21795ff1e5/execute?api-version=2.0&format=swagger",
            httpheader=c('Content-Type' = "application/json", 'Authorization' = authz_hdr),
            postfields=body,
            writefunction = h$update,
            headerfunction = hdr$update,
            verbose = TRUE
)

headers = hdr$value()
httpStatus = headers["status"]
if (httpStatus >= 400)
{
    print(paste("The request failed with status code:", httpStatus, sep=" "))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(headers)
}

print("Result:")
result = h$value()

print(fromJSON(result))

<CODE END>

Since this code uses a hard-coded list of 1's and empty strings, we can test it directly in RStudio without needing to use it in Power BI yet.  In order to use this code, we need to install the "RCurl" and "rjson" packages.  RCurl allows us to execute HTTP requests against web service endpoints, while rjson allows us to create and parse the JSON payloads used by the web service.  We also need to provide our API Key on the following line:

api_key = "abc123" # Replace this with the API key for the web service

The API Key can be found on the "Consume" tab of the Azure Machine Learning Web Services Portal.
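As a reminder, if the "RCurl" and "rjson" packages aren't installed yet, a one-time call takes care of that:

<CODE START>

# One-time setup; installs the packages from CRAN if they aren't already present.
install.packages(c("RCurl", "rjson"))

<CODE END>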

The final output from the R code is as follows:

<OUTPUT START>

$Results
$Results$output1
$Results$output1[[1]]
$Results$output1[[1]]$id
[1] "1"

$Results$output1[[1]]$age
[1] "1"

$Results$output1[[1]]$workclass
NULL

$Results$output1[[1]]$fnlwgt
[1] "1"

$Results$output1[[1]]$education
NULL

$Results$output1[[1]]$`education-num`
[1] "1"

$Results$output1[[1]]$`marital-status`
NULL

$Results$output1[[1]]$occupation
NULL

$Results$output1[[1]]$relationship
NULL

$Results$output1[[1]]$race
NULL

$Results$output1[[1]]$sex
NULL

$Results$output1[[1]]$`capital-gain`
[1] "1"

$Results$output1[[1]]$`capital-loss`
[1] "1"

$Results$output1[[1]]$`hours-per-week`
[1] "1"

$Results$output1[[1]]$`native-country`
NULL

$Results$output1[[1]]$income
NULL

$Results$output1[[1]]$`Scored Labels`
NULL

$Results$output1[[1]]$`Scored Probabilities`
NULL

<OUTPUT END>

Since we passed a useless test record to the web service, we got NULL responses for the "Scored Labels" and "Scored Probabilities" fields.  This will not be the case once we start passing legitimate records.

Unfortunately, this code still isn't quite ready for Power BI.  The output we just showed is a nested set of lists, whereas the Power BI Query Editor needs a data frame.  We start by accessing the inner list containing the actual values.
Result List
<CODE START>

result.list <- fromJSON(result)$Results$output1[[1]]
result.list

<CODE END>

This code is relatively simple.  Reading from right to left (inner to outer), it accesses the first element of the "output1" list, which is contained within the "Results" list, which is in turn contained within the list returned by parsing the JSON in the "result" variable.  Now, let's try to cast this to a data frame.
Result Data Frame (Naive Attempt)
Unfortunately, we get an error saying that the objects in the list are not of the same length.  The issue is that some of these objects are NULL.  In R, "NULL" is not an object; it represents the absence of one.  However, R also has an object that represents a missing value, called "NA".  So, we need to change all of the NULL "values" in our list to NAs before we can create our data frame.
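Here's a quick illustration of the difference, using a hypothetical two-element list:

<CODE START>

# NULL is the absence of an object, so it has length 0.
length(NULL)   # 0

# NA is a real value that means "missing", so it has length 1.
length(NA)     # 1

# A NULL element leaves the list with components of differing lengths,
# which is exactly what data.frame() complains about.
x <- list(a = 1, b = NULL)
# data.frame(x)   # Error: arguments imply differing number of rows: 1, 0

<CODE END>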
Result List (with NAs)
<CODE START>

result.list[sapply(result.list, is.null)] <- NA
result.list

<CODE END>

This line replaces each NULL element of the list with an "NA" value.  Now, we can cast this to a data frame with no issues.
Result Data Frame
<CODE START>

result.df <- data.frame(result.list)
result.df

<CODE END>

Now that our code returns a data frame, we need to use it in Power BI.  In order to do that, let's switch over to Power BI and connect to the "Customers_New" table in our Azure SQL Database.
Customers_New
We have the option of using this web service in two ways.  First, we could use it as part of a Custom R Visual, which we've talked about in a previous post.  This would allow us to utilize the predictive model based on filters and user input.  This is especially helpful for scenarios like clustering and time series analysis where you want to be able to see the effects of filtering certain items.

Second, we could use an R Script in the Query Editor, which we've also talked about in a previous post.  This would allow us to utilize the web service only during model processing and store the values within the data model itself for aggregation and manipulation as we see fit.

Before we can successfully run our code in the Power BI Query Editor, we need to make a few alterations.  We start by replacing the manually-created "req" list with one that we build from our input dataset, which Power BI conveniently calls "dataset".

<CODE START>

req =  list(
  Inputs = list(
    input1 = setNames(lapply(split(dataset, seq(nrow(dataset))), FUN = as.list), NULL)
  ),
  GlobalParameters = setNames(fromJSON('{}'), character(0))

)

<CODE END>

The syntax here is extremely important.  While it is possible to put "Inputs" and "input1" inside single-quotes, this changes the metadata of the list structure and could cause the web service call to fail.  This was a particularly frustrating issue to debug.

As for the code, we start by using the "split()" function to split the "dataset" data frame into a list in which each element is a single-row data frame corresponding to one row of the original.  The "seq(nrow(dataset))" call supplies one grouping value per row, ensuring that each row lands in its own element of the list.

Then, we use the "lapply()" function to apply the "as.list()" function to each element in the list.  This turns each single-row data frame into a named list containing the data from that record.  Finally, we remove the names from the outer list using the "setNames()" function with the "NULL" parameter.  This nested list is then assigned as "input1" within a list called "Inputs" within another list called "req", which creates the nested structure required by the web service.
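To make this concrete, here's how the transformation looks on a hypothetical two-row stand-in for "dataset":

<CODE START>

# A hypothetical two-row stand-in for the "dataset" data frame.
dataset <- data.frame(age = c(39, 50),
                      workclass = c("State-gov", "Private"),
                      stringsAsFactors = FALSE)

# Each row becomes an unnamed element of the outer list, and each
# element is itself a named list of that row's values.
setNames(lapply(split(dataset, seq(nrow(dataset))), FUN = as.list), NULL)
# [[1]] is list(age = 39, workclass = "State-gov")
# [[2]] is list(age = 50, workclass = "Private")

<CODE END>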

Next, we rewrite the error-handling at the end to send an error dataframe to Power BI if an error occurs.

<CODE START>

if (httpStatus >= 400)
{
   result.error <- data.frame("Error")
}

print("Result:")
result = h$value()

print(fromJSON(result))

<CODE END>

We accomplished this by using the "data.frame()" function to create a basic dataframe called "result.error" that contains a single record, "Error".

Next, we need to alter the code to output the final result as a data frame, instead of simply printing the JSON.

<CODE START>

if (httpStatus >= 400){
  result.error <- data.frame("Error")
}else{
  result.json = h$value()
  result.list = fromJSON(result.json)$Results$output1
  result.list.final <- lapply(result.list, FUN = function(x) lapply(x, FUN = function(y) if( is.null(y) ){y <- NA}else{y <- y}))
  result.df <- data.frame(do.call(rbind.data.frame, result.list.final))
}

<CODE END>

We aren't afraid to admit that getting this right took us some time.  We start by assigning the JSON output from the web service to the "result.json" object.  Then, we use the "fromJSON()" function to parse the results and save them as a list called "result.list".  There's one caveat here: if the web service returns any NULL values, this list will contain them, and, as we mentioned earlier, NULL in R means the object does not exist, which causes errors later on.  Instead, we need to assign NA to these values.  We accomplish this by nesting an if statement within two levels of "lapply()" calls.  Finally, we cast our list to a data frame using the "data.frame()", "do.call()" and "rbind.data.frame()" functions.
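To see the NULL-to-NA replacement in action, here's a small sketch with a hypothetical two-record response:

<CODE START>

# A hypothetical parsed response where the second record is missing its label.
result.list <- list(list(id = "1", `Scored Labels` = "<=50K"),
                    list(id = "2", `Scored Labels` = NULL))

# The outer lapply() walks the records; the inner lapply() walks each
# record's fields, swapping NULL for NA.
result.list.final <- lapply(result.list, FUN = function(x)
  lapply(x, FUN = function(y) if (is.null(y)) NA else y))

data.frame(do.call(rbind.data.frame, result.list.final))
#   id Scored.Labels
# 1  1         <=50K
# 2  2          <NA>

<CODE END>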

Here's the full script.

<CODE START>

library("RCurl")
library("rjson")

# Accept SSL certificates issued by public Certificate Authorities
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

h = basicTextGatherer()
hdr = basicHeaderGatherer()

req =  list(
  Inputs = list(
    input1 = setNames(lapply(split(dataset, seq(nrow(dataset))), FUN = as.list), NULL)
  ),
  GlobalParameters = setNames(fromJSON('{}'), character(0))
)

body = enc2utf8(toJSON(req))
api_key = "" # Replace this with the API key for the web service
authz_hdr = paste('Bearer', api_key, sep=' ')

h$reset()
curlPerform(url = "", # Replace this with the URL for the web service
            httpheader=c('Content-Type' = "application/json", 'Authorization' = authz_hdr),
            postfields=body,
            writefunction = h$update,
            headerfunction = hdr$update,
            verbose = TRUE
)

headers = hdr$value()
httpStatus = headers["status"]

if (httpStatus >= 400){
  result.error <- data.frame("Error")
}else{
  result.json = h$value()
  result.list = fromJSON(result.json)$Results$output1
  result.list.final <- lapply(result.list, FUN = function(x) lapply(x, FUN = function(y) if( is.null(y) ){y <- NA}else{y <- y}))
  result.df <- data.frame(do.call(rbind.data.frame, result.list.final))
}

<CODE END>

Now that we have updated the R code to run within the Power BI Query Editor, we can see that the resulting dataframe contains the predicted values.
Final Results
Now that we have this data in Power BI, we can do all of the fun things we would ordinarily do, such as aggregation, mashing against other data sources and charting.  Hopefully, this post opened your eyes to a technique that allows us to make Data Science Easier within Power BI by leveraging the easy-to-use, drag-and-drop interface of Azure Machine Learning Studio.  Stay tuned for the next post where we'll be talking about Stream Analytics.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Service Engineer - FastTrack for Azure
Microsoft
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com
