Picking the Right Big Data Tools

BY John Roney
July 10, 2016

The data visualization market is rich with robust offerings, but how do you get a large amount of data from streaming input all the way through to a solid data warehouse? This article lays out the pros and cons of the tools for each major function in the big data architecture.

The first step toward selecting the right tools is to understand the categories of big data processing:

STREAMING Moving large amounts of data by managing multiple parallel streams.
DATA PROCESSING Performing transformations on the data before writing it to data stores. These tasks need to run in parallel.
DATA STORES Storing data in its native form in relational and/or NoSQL data stores for business use.
DATA WAREHOUSING High-speed tools for storing and querying data in SQL systems such as Redshift.

Streaming

A big data project has many areas of architecture to design. Do you need to account for a large volume of data streaming into your warehouse, or can you focus mostly on processing incoming data and picking the right data store or warehouse? This section looks at the streaming element of the architecture.

Data now comes from more places than ever. With sensors generating readings and computers and people generating even more information, choosing the right streaming tool can be critical. Some thoughts on the pros and cons of each are below. Advanced inSight has experience with many of these products, with an emphasis on Amazon Kinesis Streaming. Let us help you make the decision.
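To make "multiple parallel streams" concrete, here is a minimal sketch of the general idea behind partitioned streaming: each record's key is hashed to pick one of several parallel streams, so records with the same key stay together and in order. The function names, the sample sensor readings, and the use of `crc32` as the hash are assumptions for illustration only; each product uses its own hashing and routing scheme.

```python
import zlib
from collections import defaultdict


def assign_stream(key: str, num_streams: int) -> int:
    """Map a record key to one of num_streams parallel streams."""
    # crc32 is only a stable stand-in hash for this sketch.
    return zlib.crc32(key.encode("utf-8")) % num_streams


def route(records, num_streams):
    """Group (key, value) records by assigned stream, keeping arrival order."""
    streams = defaultdict(list)
    for key, value in records:
        streams[assign_stream(key, num_streams)].append((key, value))
    return dict(streams)


# Hypothetical sensor readings: (sensor id, reading)
readings = [
    ("sensor-a", 21.5),
    ("sensor-b", 19.0),
    ("sensor-a", 21.7),
    ("sensor-c", 30.2),
]
routed = route(readings, num_streams=4)
```

Because routing depends only on the key, adding more streams spreads the load without a central coordinator deciding where each record goes.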

Pros and Cons of Data Streaming Tools

Flume

Pros: Reliable
Cons: Does not manage multiple streams

Kafka

Pros: scalable and reliable – adopted in many cloud-based offerings
Cons: setup and support are time-consuming

Amazon Kinesis Streaming

Pros: Set up and management tools from AWS
Cons: Doesn’t scale quite as well as Kafka

Azure Event Hubs

Pros: Set up and management tools
Cons: Not as mature as others

Hortonworks Data Flow

Pros: Powerful user interface and management capabilities
Cons: Just released Q4 2015 by Hortonworks

Data Processing

As data is added to your big data repository, do you need to transform it or match it against other, disparate data sources? How you process the data is the most critical step, so choosing the right tool is imperative. Some thoughts on the pros and cons of each are below.

MapReduce

Pros: handles any scale of data, reliable, lots of customization
Cons: hard to program against, slow
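To illustrate why MapReduce offers lots of customization yet is hard to program against, here is a minimal single-machine sketch of the map, shuffle, and reduce phases for a word count. It is plain Python rather than the Hadoop API, and the function names and sample input are assumptions for illustration; a real job also has to deal with input splits, serialization, and job configuration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data tools", "big data warehouse"])
```

Even a simple count has to be expressed as separate map and reduce functions. That is what makes the model flexible at any scale, and also what makes it verbose.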

Pig

Pros: scalable, reliable, some customization possible
Cons: still hard to program against, slow

Hive (on MapReduce)

Pros: scalable, reliable, easy SQL interface
Cons: slow (Hive on Tez faster), little customization possible
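The "easy SQL interface" point can be sketched with a grouped aggregate written as one declarative statement. Python's built-in `sqlite3` is used here purely as a stand-in so the example runs anywhere; it is not Hive, and the `events` table and its columns are hypothetical. With Hive on MapReduce, a comparable `GROUP BY` query would be compiled into MapReduce jobs behind the scenes.

```python
import sqlite3

# In-memory database standing in for a SQL-on-Hadoop engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, size_bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("sensor", 120), ("web", 300), ("sensor", 80)],
)

# One statement replaces hand-written map, shuffle, and reduce steps.
rows = conn.execute(
    "SELECT source, COUNT(*) AS n, SUM(size_bytes) AS total "
    "FROM events GROUP BY source ORDER BY source"
).fetchall()
conn.close()
```

The trade-off noted above still applies: the query is short because the engine decides how to execute it, which leaves little room for customization.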

Spark Core/Storm

Pros: lots of customization, in-memory processing
Cons: not reliable, hard to program against

Presto / Spark SQL

Pros: easy SQL interface, fast in-memory processing
Cons: not reliable (out of memory), little customization possible, smaller data sets

Data Warehouses

Data now arrives from more places than ever, and faster. New big data warehousing technology can perform multiple parallel queries, eliminating the need for pre-aggregation and cumbersome processing.

These big data warehouses are fast because their underlying approach and architecture differ in these ways:

They run each query in parallel
They use memory efficiently
They distribute the data across disks
They return only the columns requested
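A minimal sketch of how these points combine: data is split into partitions, each partition stores its values column by column, and a query scans only the requested column in every partition before merging the partial results. The partition layout, column names, and figures are invented for illustration. Threads are used only to show the structure of a parallel scan; real MPP warehouses spread this work across nodes and disks.

```python
from concurrent.futures import ThreadPoolExecutor

# Each partition holds its data column by column (hypothetical sample data).
partitions = [
    {"region": ["east", "west"], "sales": [100, 250], "notes": ["a", "b"]},
    {"region": ["east", "east"], "sales": [50, 75], "notes": ["c", "d"]},
]


def partial_sum(partition, column):
    """Scan only the requested column of one partition."""
    return sum(partition[column])


def parallel_sum(parts, column):
    """Run the partial scans concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(lambda p: partial_sum(p, column), parts))
    return sum(partials)


total_sales = parallel_sum(partitions, "sales")
```

The `region` and `notes` columns are never touched by this query, which is the benefit a columnar layout gives over reading whole rows.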

There are some thoughts below on the pros and cons of these tools.

Amazon Redshift

Pros: Standard SQL DB with MPP features that allow it to scale. Supports SQL tools
Cons: Based on older Postgres version. Significant management required

IBM DashDB

Pros: MPP awareness directly into the BLU columnar query engine. Supports SQL tools
Cons: Some confusion over IBM's many database offerings (BigSQL, Bluemix, BigInsights, Netezza)

HP Vertica

Pros: Good value for investment. Supports integration with Hadoop with Vertica for SQL
Cons: Market adoption is in question – its ongoing developer community might be smaller than competitors'

Microsoft SQL Data Warehouse

Pros: Familiar T-SQL and Power BI for querying across relational data in your data warehouse
Cons: Some reported back-end infrastructure issues. Confusion over so many offerings

Google BigQuery

Pros: Eliminates SQL overhead. Good for custom implementations and teams who dislike SQL
Cons: Does not use standard SQL and does not support standard SQL tools