Running Hadoop on Linux using Azure HDInsight

Prerequisites
  1. An Azure subscription: See Get Azure free trial.
  2. PuTTY SSH client
For an in-depth introduction to Hadoop and Hive and their use with Azure HDInsight, read the following wikis:
  1. Big Data Analytics using Microsoft Azure: Introduction
  2. Big Data Analytics using Microsoft Azure: Hive
  3. Analyze Twitter data with Hive in Azure HDInsight
Introduction

Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data.

With the September 2015 release of HDInsight, customers can now configure these clusters to run on either a Windows Server operating system or an Ubuntu-based Linux operating system.

HDInsight on Linux extends support to a broader set of Hadoop ecosystem users, giving you a greater choice of tools and applications for running Hadoop workloads.

Both Linux and Windows clusters in HDInsight are built on the same standard Hadoop distribution and offer the same set of rich capabilities.

Creating a Linux cluster in HDInsight
  1. To create a new Linux cluster, from the new portal, click on Data+Analytics > HDInsight. 

  2. Click on create new cluster.
At this step you can choose between the Linux and Windows operating systems.

In this demo, Ubuntu is used.





After about 30 minutes, your cluster will be up and running.



Connecting to the cluster via an SSH Client

In this example, PuTTY is used to connect to the Hadoop cluster over SSH.

The first step to connect to the cluster is to get the Host Name and the login credentials.

To find the Host Name, click on Secure Shell in the Azure Portal.

The Host Name to use when connecting from a Windows or a Linux client is displayed here.



The second step is to open PuTTY, enter the Host Name, and click Open to connect.



You will then be prompted for the credentials that were defined when creating the cluster, and you are ready to go.

With the Linux cluster and SSH, all the commands used when running Hadoop on-premises are available, which makes the transition to the cloud transparent.
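As a rough sketch of what PuTTY does here, the same connection could be made from any command-line SSH client. The Python snippet below only builds the command string. The cluster name `mycluster` and the user `sshuser` are hypothetical placeholders, and the `<clustername>-ssh.azurehdinsight.net` host pattern is an assumption based on the usual HDInsight naming; if the Host Name shown under Secure Shell in the portal differs, use that instead.

```python
def build_ssh_command(cluster_name: str, ssh_user: str) -> str:
    """Return an ssh command line for a Linux-based HDInsight cluster.

    Assumes the common host pattern <clustername>-ssh.azurehdinsight.net;
    the authoritative value is the Host Name shown in the Azure Portal.
    """
    host = f"{cluster_name}-ssh.azurehdinsight.net"
    return f"ssh {ssh_user}@{host}"


if __name__ == "__main__":
    # "mycluster" and "sshuser" are placeholders, not values from this article.
    print(build_ssh_command("mycluster", "sshuser"))
```

Running the printed command from a Linux or macOS terminal would prompt for the same credentials that PuTTY asks for.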



Example: Running Hive Queries via Putty on a Linux Hadoop Cluster on Azure HDInsight

The following example demonstrates how IIS logs can be analyzed using Hive Queries on Hadoop.
  1. Upload the file to Azure blob storage

  2. Create internal table rawlog

    This is a staging table used to cleanse the data before it is loaded into the cleanlog table at a later stage.

  3. Create table cleanlog. This is the table where the cleansed data will be stored and queried.

  4. View all tables

  5. Load data from the file into the staging table rawlog

  6. Move data from the staging table to the data table

  7. Run the analysis queries, which generate MapReduce jobs
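To make the staging-then-cleansing idea in steps 5 to 7 concrete, here is a minimal Python sketch of the same flow outside Hive. It is an illustration only, not the Hive statements used in this example: the sample log lines are made up, the field list is a hypothetical subset of what IIS can write, and the function names `cleanse` and `count_by_status` are invented for this sketch. It assumes the W3C extended log format, in which directive lines start with `#` and a `#Fields:` line names the columns.

```python
from collections import Counter

# Made-up sample in W3C extended log format; not data from this article.
SAMPLE_LOG = """\
#Software: Microsoft Internet Information Services 8.0
#Fields: date time cs-method cs-uri-stem sc-status
2015-09-01 10:00:01 GET /index.html 200
2015-09-01 10:00:02 GET /missing.html 404
2015-09-01 10:00:03 POST /api/orders 200
"""


def cleanse(raw_lines):
    """Turn raw log lines into dict rows, dropping '#' directive lines.

    Column names are taken from the '#Fields:' directive.
    """
    fields = []
    rows = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#"):
            continue  # other directives carry no request data
        else:
            rows.append(dict(zip(fields, line.split())))
    return rows


def count_by_status(rows):
    """Aggregate cleansed rows, similar in spirit to a GROUP BY on sc-status."""
    return Counter(row["sc-status"] for row in rows)


if __name__ == "__main__":
    raw = SAMPLE_LOG.splitlines()   # stands in for the rawlog staging table
    clean = cleanse(raw)            # stands in for the cleanlog table
    print(count_by_status(clean))
```

Keeping the raw lines in a separate staging structure means the cleansing rule can be changed and re-run without reloading the source file, which is the same reason the example uses both rawlog and cleanlog.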

Apache Ambari

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters.

Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Ambari is now included on Linux-based HDInsight clusters, and is used to monitor the cluster and make configuration changes.
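Because the web UI is backed by REST APIs, the same cluster information can in principle be read programmatically. The sketch below only constructs an HTTP request and does not send it. The URL pattern `https://<clustername>.azurehdinsight.net/api/v1/clusters/<clustername>` and the use of HTTP basic authentication with the cluster login are assumptions about how HDInsight exposes Ambari, and `mycluster`, `admin`, and the password are placeholders.

```python
import base64
import urllib.request


def build_ambari_request(cluster_name: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the Ambari cluster resource.

    Assumes Ambari is reachable at https://<clustername>.azurehdinsight.net
    and accepts HTTP basic authentication with the cluster login.
    """
    url = f"https://{cluster_name}.azurehdinsight.net/api/v1/clusters/{cluster_name}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


if __name__ == "__main__":
    # Placeholder values; replace with your own cluster name and credentials.
    req = build_ambari_request("mycluster", "admin", "example-password")
    print(req.get_method(), req.full_url)
    # Passing req to urllib.request.urlopen would actually send the request.
```

Against a live cluster, the response would be a JSON document describing the cluster.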

To access Ambari, from your HDInsight Cluster page in the preview portal, click on Dashboard.



From this page, you can view your dashboard and view the status of your HDInsight cluster. There are also links to access the other features such as Services, Hosts, Alerts, and Admin.

More details of the features available in Ambari are described in the Microsoft Azure documentation.
