The data visualization market is rich with robust offerings, but how do you get a large amount of data from streaming input all the way through to a solid data warehouse? This program lays out the pros and cons of the tools for each major function in the big data architecture.
The first step toward selecting tools is to understand the categories of big data processing:
- STREAMING Ability to move large amounts of data by managing multiple parallel streams.
- DATA PROCESSING Perform many functions on the data before writing it to data stores. These tasks need to run in parallel.
- DATA STORES Storage of data in relational and/or NoSQL data stores for business use in native form.
- DATA WAREHOUSING High-speed tools for storing data in SQL systems like Redshift.
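To make the categories concrete, here is a minimal Python sketch of how the first three stages might fit together. It is an illustration only, not how any of the products below are implemented: the stream count, the sensor records, and the function names are all hypothetical, and a plain dictionary stands in for a real data store.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_STREAMS = 4  # hypothetical number of parallel streams


def stream_for(key: str) -> int:
    """Hash the record key so that one key always lands on the same stream."""
    return zlib.crc32(key.encode("utf-8")) % NUM_STREAMS


def process(records):
    """DATA PROCESSING: total the readings per sensor within one stream."""
    totals = defaultdict(float)
    for sensor_id, reading in records:
        totals[sensor_id] += reading
    return dict(totals)


def run_pipeline(records):
    # STREAMING: split the incoming records into parallel streams.
    streams = defaultdict(list)
    for sensor_id, reading in records:
        streams[stream_for(sensor_id)].append((sensor_id, reading))

    # DATA PROCESSING: work on every stream in parallel.
    with ThreadPoolExecutor(max_workers=NUM_STREAMS) as pool:
        partials = list(pool.map(process, streams.values()))

    # DATA STORES: merge the results into one store (a dict stands in here).
    # Each key lives on exactly one stream, so the partial results never clash.
    store = {}
    for partial in partials:
        store.update(partial)
    return store


sample = [("sensor-a", 1.5), ("sensor-b", 2.0), ("sensor-a", 0.5)]
result = run_pipeline(sample)
```

Hashing on the record key is one common way to keep related records together while still spreading the load; real streaming tools each have their own partitioning rules.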
Streaming
A big data project has many areas of the architecture to design. Do you need to account for a large amount of data streaming into your warehouse, or can you focus on processing the incoming data and picking the right data store or warehouse? Here are the major elements we look at in an architecture, with a focus on streaming in this section.
Data now comes from more places than ever. With sensors generating readings while computers and people generate even more information, choosing the right tool can be critical. Below are some thoughts on the pros and cons. Advanced inSight has experience with many of the products below, with an emphasis on Amazon Kinesis Streaming. Let us help you make the decision.
Pros and Cons of Data Streaming Tools
Flume
- Pros: Reliable
- Cons: Does not manage multiple streams
Kafka
- Pros: scalable and reliable, adopted in many cloud-based offerings
- Cons: setup and support are time-consuming
Amazon Kinesis Streaming
- Pros: Set up and management tools from AWS
- Cons: Doesn’t scale quite as well as Kafka
Azure Event Hubs
- Pros: Set up and management tools
- Cons: Not as mature as others
Hortonworks Data Flow
- Pros: Powerful user interface and management capabilities
- Cons: Just released Q4 2015 by Hortonworks
Data Processing
As data is being added to your Big Data repository, do you need to transform the data or match it to other, disparate sources of data? Deciding how to process the data is the most critical step, so choosing the right tool is imperative. Below are some thoughts on the pros and cons.
MapReduce
- Pros: handles any scale of data, reliable, lots of customization
- Cons: hard to program against, slow
Pig
- Pros: scalable, reliable, some customization possible
- Cons: still hard to program against, slow
Hive (on MapReduce)
- Pros: scalable, reliable, easy SQL interface
- Cons: slow (Hive on Tez faster), little customization possible
Spark Core/Storm
- Pros: lots of customization, in-memory processing
- Cons: not reliable, hard to program against
Presto / Spark SQL
- Pros: easy SQL interface, fast in-memory processing
- Cons: not reliable (out of memory), little customization possible, smaller data sets
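The trade-off between customization and an easy SQL interface that runs through the list above can be seen in a small Python sketch. It computes the same totals twice: once with hand-written map and reduce steps, and once with a single SQL statement. This only illustrates the two programming styles; the sample rows are made up, and the standard-library `sqlite3` module stands in for a SQL engine such as Hive or Presto.

```python
import sqlite3
from collections import defaultdict

events = [("web", 3), ("mobile", 5), ("web", 4)]  # made-up (source, hits) rows


# Style 1: hand-written map and reduce steps.
# Any logic can go here, but every step has to be coded and maintained.
def map_step(row):
    source, hits = row
    return (source, hits)


def reduce_step(pairs):
    totals = defaultdict(int)
    for source, hits in pairs:
        totals[source] += hits
    return dict(totals)


mapreduce_totals = reduce_step(map_step(row) for row in events)

# Style 2: the same aggregation through a SQL interface.
# Shorter to write, but limited to what the SQL dialect can express.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, hits INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_totals = dict(
    conn.execute("SELECT source, SUM(hits) FROM events GROUP BY source")
)
conn.close()
```

Both approaches produce the same per-source totals; the difference is how much code you own and how much freedom you have to customize it.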
Data Warehouses
Data now arrives from more places than ever, and faster. New big data warehousing technology provides the ability to perform multiple parallel queries, eliminating the need for pre-aggregation and cumbersome processing.
These big data warehouses are fast because their underlying approach and architecture differ in these ways:
- They run queries in parallel
- They use memory efficiently
- The data is distributed across disks
- Only the columns requested are returned
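The points above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Redshift or any other warehouse is actually built: the two partitions, the column names, and the helper functions are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table stored column by column and split into two partitions,
# loosely modeling data distributed across disks or nodes.
partitions = [
    {"region": ["east", "west"], "sales": [100, 250], "notes": ["a", "b"]},
    {"region": ["east", "west"], "sales": [50, 75], "notes": ["c", "d"]},
]


def scan_partition(partition, columns):
    """Read only the requested columns of one partition and sum sales per region."""
    needed = {name: partition[name] for name in columns}  # "notes" is never read
    totals = {}
    for region, sales in zip(needed["region"], needed["sales"]):
        totals[region] = totals.get(region, 0) + sales
    return totals


def query_sales_by_region():
    """Scan every partition in parallel, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(
            pool.map(lambda p: scan_partition(p, ["region", "sales"]), partitions)
        )
    combined = {}
    for partial in partials:
        for region, total in partial.items():
            combined[region] = combined.get(region, 0) + total
    return combined


sales_by_region = query_sales_by_region()
```

Because each partition is scanned independently and only two of the three columns are read, the work shrinks as partitions are added and as unused columns are skipped.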
There are some thoughts below on the pros and cons of these tools.
Amazon Redshift
- Pros: Standard SQL DB with MPP features that allow it to scale. Supports SQL tools
- Cons: Based on older Postgres version. Significant management required
IBM dashDB
- Pros: MPP awareness built directly into the BLU columnar query engine. Supports SQL tools
- Cons: Some confusion over the many IBM database offerings (BigSQL; Bluemix; BigInsights; Netezza)
HP Vertica
- Pros: Good value for investment. Supports integration with Hadoop with Vertica for SQL
- Cons: Market adoption is a question: the pool of developers in the industry might be smaller than for competitors
Microsoft SQL Data Warehouse
- Pros: Familiar T-SQL and Power BI for querying across relational data in your data warehouse
- Cons: Some reported back-end infrastructure issues. Confusion over so many offerings
Google BigQuery
- Pros: Eliminates SQL overhead. Good for custom implementations and teams who dislike SQL
- Cons: Does not use standard SQL and does not support standard SQL tools