Using Hadoop through Azure HDInsight

Here’s a really helpful video that shows you how to leverage Hadoop through Azure HDInsight:

Below are the notes I took during the video:

  • Hadoop analytics can reduce petabytes of data to a manageable size that Excel can consume
  • Hive is very similar to SQL
    • Good for data that is structured
    • Most customers use this
    • Schema on read (structure is applied when the data is queried, not when it is loaded)
  • Mahout is a machine learning library
  • Pig is a data scripting language
    • Good for unstructured/semi-structured data
    • Can handle missing columns and project (select) specific fields
    • Can do data cleansing
  • Pegasus & Giraph
    • Used for graph processing
  • Cascading
    • Dataflow API (similar to Pig) but in Java
  • Can use Visual Studio to write Hive queries (new with latest SDK)
  • HBase cluster is used for NoSQL storage
    • A distributed, non-relational database
    • Large scale (billions of rows × millions of columns)
    • Low latency
    • Open Source
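
To make the "schema on read" and "missing columns" notes above a little more concrete, here is a minimal Python sketch of the idea. It is not Hive or Pig code and does not come from the video; the sample data, the `SCHEMA` list, and the `read_with_schema` helper are all hypothetical names made up for illustration. The raw text is stored untouched, and types are applied only when the rows are read.

```python
import csv
import io

# Hypothetical raw data as it might sit in storage: nothing validates it at write time.
RAW = "1,alice,34\n2,bob\n3,carol,not_a_number\n"

# Hypothetical schema, applied only when the data is read.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(raw_text, schema):
    """Parse raw delimited text, casting each field per the schema at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for i, (column, cast) in enumerate(schema):
            try:
                row[column] = cast(fields[i])
            except (IndexError, ValueError):
                # A missing or malformed value becomes None instead of failing the read.
                row[column] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
for row in rows:
    print(row)
```

The second row is missing its `age` column and the third has an unparsable one, yet both still load, with `None` standing in for the bad value. A different schema could be applied to the same raw text later without rewriting the stored data, which is the main appeal of the approach.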

If you have time, make sure to check it out! I’ll be going through some demos of HDInsight and will post any helpful tips I may come across.
