HDInsight is a managed Hadoop platform available on Azure. On the back end, HDInsight is enabled by a partnership between Hortonworks and Microsoft to bring the Hortonworks Data Platform (HDP) to Azure as a Platform-as-a-Service option. This allows us to leverage much of the functionality of the Hadoop ecosystem without having to provision and manage the individual components. HDInsight comes in a number of configurations for different needs, such as Kafka for Real-Time Message Ingestion, Spark for Batch Processing and ML Services for Scalable Machine Learning. Although most of the HDInsight configurations include Hive, we'll be using the Interactive Query configuration. You can read more about HDInsight here.
Hive is a "SQL on Hadoop" technology that combines the scalable processing framework of the ecosystem with the coding simplicity of SQL. Hive is very useful for performant batch processing on relational data, as it leverages the SQL skills that most organizations already possess. Hive LLAP (Low Latency Analytical Processing or Live Long and Process) is an extension of Hive that is designed to handle low latency queries over massive amounts of EXTERNAL data. One of the coolest things about the Hadoop SQL ecosystem is that the technologies allow us to create SQL tables directly on top of structured and semi-structured data without having to import it into a proprietary format. That's exactly what we're going to do in this post. You can read more about Hive here and here and Hive LLAP here.
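To make the "tables directly on top of data" idea concrete, here is a minimal sketch in Python that assembles the kind of HiveQL statement involved. It is an illustration and not the exact statement used in this post: the helper name `build_external_table_ddl`, the table name, the columns and the storage path are all hypothetical, and we assume a comma-delimited text file with a single header row.

```python
def build_external_table_ddl(table, columns, location, delimiter=","):
    """Return a HiveQL CREATE EXTERNAL TABLE statement as a string.

    table     -- hypothetical table name
    columns   -- list of (column_name, hive_type) pairs
    location  -- folder that already holds the delimited files
    delimiter -- field separator used in those files
    """
    # One "`name` TYPE" entry per column, joined with commas.
    column_lines = ",\n".join(
        f"  `{name}` {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        # EXTERNAL + LOCATION: Hive reads the files where they sit
        # and does not copy them into its own managed storage.
        f"LOCATION '{location}'\n"
        # Assumption: each file starts with one header row to skip.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


if __name__ == "__main__":
    # Placeholder names only; substitute your own container, account and path.
    print(
        build_external_table_ddl(
            table="customers",
            columns=[("customer_id", "INT"), ("customer_name", "STRING")],
            location="wasbs://mycontainer@myaccount.blob.core.windows.net/data/customers",
        )
    )
```

The printed statement can be pasted into any Hive query editor. Because the table is external, dropping it later removes only the table definition and leaves the underlying files in the Blob store untouched.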
We understand that SQL queries don't typically constitute traditional data science functionality. However, the Hadoop ecosystem has a number of unique and interesting data science features that we can explore. Hive happens to be one of the best starting points on that journey.
Let's start by creating an HDInsight cluster in the Azure portal.
|HDInsight Creation Basics|
|HDInsight Creation Storage|
|HDInsight Creation Cluster Size|
Now that our cluster is provisioned, let's take a look at the Blob store where we pointed our cluster.
Now that we've seen our raw data, let's navigate to the SQL Interface within HDInsight.
|Power BI Report|
Once we connect to the table, we can use the same Power BI functionality as always to create our data visualizations. It's important to note that there are substantial performance implications of using "DirectQuery" mode to connect to Hive LLAP clusters. Hive LLAP is optimized for this type of access, but it's not without complexity. Many organizations that leverage this type of technology put a substantial amount of effort into keeping the database optimized for read performance.
Hopefully, this post opened your eyes a little to the power of leveraging Big Data within Power BI. The potential of SQL and Big Data is immense and there are many untapped opportunities to leverage it in our organizations. Stay tuned for the next post where we'll be looking into Databricks Spark. Thanks for reading. We hope you found this informative.
Service Engineer - FastTrack for Azure