Big Data – Data Processing

A big data project has many architectural areas to design. As data is added to your Big Data repository, do you need to transform it or match it against other, disparate sources? Can your Big Data framework handle the volume of data streaming in, or can you focus mostly on processing the incoming data and picking the right data store or warehouse? These are the major elements we look at in an architecture; this section focuses on Data Processing.

Data now comes from more places than ever and needs to be connected to other data sets. This processing step is the most critical one, so choosing the right tool is imperative. Some thoughts on the pros and cons of each option follow below. Advanced inSight has experience with many of the products below, including MapReduce, Hive on Tez, and Spark. Let us help you make the decision.
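To make "matching to other sources of disparate data" concrete, here is a minimal sketch in plain Python. The two sources (`crm_records`, `web_events`), their field names, and the `normalize_email` helper are all hypothetical, chosen only for illustration; a real project would run the same normalize-then-join step in whichever processing engine it selects.

```python
def normalize_email(raw):
    """Transform step: trim whitespace and lowercase so keys compare equal."""
    return raw.strip().lower()

# Two hypothetical sources that describe the same people in different shapes.
crm_records = [
    {"email": " Ann@Example.com ", "name": "Ann"},
    {"email": "bob@example.com", "name": "Bob"},
]
web_events = [
    {"user_email": "ann@example.com", "page": "/pricing"},
    {"user_email": "carol@example.com", "page": "/home"},
]

# Index one source by the normalized key, then look up each event against it.
crm_by_email = {normalize_email(r["email"]): r for r in crm_records}

matched, unmatched = [], []
for event in web_events:
    customer = crm_by_email.get(normalize_email(event["user_email"]))
    if customer is None:
        unmatched.append(event)  # no counterpart in the other source
    else:
        matched.append({"name": customer["name"], "page": event["page"]})
```

Keeping the unmatched records, rather than dropping them, is a deliberate choice in this sketch: they usually show where the sources disagree and where more transformation is needed.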


Pros and Cons of Data Processing

MapReduce

  • Pros: handles any scale of data, reliable, lots of customization
  • Cons: hard to program against, slow
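To give a feel for why MapReduce is considered hard to program against, here is a rough sketch of the programming model in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative, and everything runs in a single process. The point is that even a simple word count has to be expressed as separate map, shuffle, and reduce steps.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big plans", "data in data out"])
```

In a real cluster the map and reduce functions run on many machines and the framework performs the shuffle, which is where the scalability and reliability come from; the developer still has to fit every problem into this shape.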

Pig

  • Pros: scalable, reliable, some customization possible
  • Cons: still hard to program against, slow

Hive (on MapReduce)

  • Pros: scalable, reliable, easy SQL interface
  • Cons: slow (Hive on Tez is faster), little customization possible
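The "easy SQL interface" point can be sketched with a word count written as a single query. This example uses Python's built-in `sqlite3` module purely as a stand-in so that it runs anywhere; it is not Hive, and the `words` table is a hypothetical name. The `GROUP BY` query itself is standard SQL of the kind a SQL-on-Hadoop engine is designed to accept.

```python
import sqlite3

lines = ["big data big plans", "data in data out"]

# In-memory database as a stand-in for a SQL engine (assumption: not Hive).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [(w.lower(),) for line in lines for w in line.split()],
)

# The whole aggregation is one declarative statement.
rows = conn.execute(
    "SELECT word, COUNT(*) AS n FROM words GROUP BY word ORDER BY n DESC, word"
).fetchall()
sql_counts = dict(rows)
conn.close()
```

The trade-off noted in the cons shows up here as well: the query is short, but anything the SQL dialect cannot express has to be handled outside it.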

Spark Core/Storm

  • Pros: lots of customization, in-memory processing
  • Cons: not reliable, hard to program against

Presto / Spark SQL

  • Pros: easy SQL interface, fast in-memory processing
  • Cons: not reliable (can run out of memory), little customization possible, best suited to smaller data sets