Explain Enterprise Batch Processing Using Map-Reduce.

 ENTERPRISE BATCH PROCESSING USING MAP-REDUCE

  • Data is the new money in the contemporary world. The data generated by today's enterprises has been growing exponentially in size over the past few years. Data-intensive computations are widespread in many application areas, and computational science is one of the best known. People who run scientific simulations and experiments often need to generate, review, and analyze large volumes of data. Telescopes scanning the sky produce enormous streams of data, and the collection of sky images often reaches petabytes over a year. Bioinformatics applications mine databases containing terabytes of data. Earthquake simulators manage massive volumes of data generated by sensors detecting earthquakes all across the planet. Beyond scientific computing, other IT business fields also depend on data-intensive computation. The transaction data of an e-commerce site may run to millions of records per month, and the customer data of a telecom firm can easily reach 60-100 terabytes. This volume of data is mined not only for billing purposes but also to find events, trends, and patterns that help these firms provide better service. Businesses that extract this value effectively see a significant influence on their own worth as well as on the success of their customers.
  • With such a large data volume, it is difficult for a single server, or node, to handle the processing. As a result, code that runs on several nodes is required, and because writing distributed systems presents an endless number of challenges, MapReduce provides a framework that lets users develop code that runs on numerous nodes without worrying about fault tolerance, reliability, synchronization, or availability. Batch processing is a type of automated task that performs computations periodically. It executes the processing code on a group of inputs known as a batch. The task will often read the batch data from a database and save the results in the same or a separate database. A batch processing job can consist of reading all of the sale logs from an online store for a single day and aggregating them into statistics for that day, such as the number of users per country, the average amount spent, and so on (a minimal single-node sketch of such a job is given below). Doing this periodically provides insight into patterns in the data. The processing time varies with the size of the data to be processed and is longer than that of normal data processing.
  • Because batch processing works on the accumulated transactions as a group, no user participation is necessary once the batch has begun. This distinguishes batch processing from normal transaction processing. While batch processing may be run at any time, it is best suited to end-of-day work such as processing a bank's reports at the end of the day or creating monthly or bimonthly payrolls.
  • The batch processing mechanism works on blocks of data that have already been stored over some period of time. For example, a large financial business may process all of its accumulated transactions in a single weekly run; even one day's data comprises millions of records, which may be saved as a file or record set. This file is processed at the end of the day for the different analyses that the company wishes to run. Processing the file takes a significant amount of time, so it is not feasible to do it very frequently. In terms of performance, the latency of batch processing ranges from minutes to hours. Batch processing is therefore useful when you do not require real-time analytics results and it is more important to analyze vast amounts of data for deeper insights than to obtain quick answers.
  • Batch processing frameworks are well suited to exceedingly large datasets that need a substantial amount of computation. Such datasets are generally bounded and persistent, that is, they are kept in some form of permanent storage. Because processing a big dataset takes time, batch processing is appropriate for non-time-sensitive work. Apache Hadoop's MapReduce is the most widely used batch data processing framework; how Hadoop processes data using MapReduce is explained in detail below.
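  Before turning to how Hadoop distributes this work, the daily sale-log aggregation described earlier can be sketched on a single node in plain Java. This is only an illustrative sketch: the file name, the record layout (userId,country,amount), and the class name are assumptions rather than part of any particular system.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical single-node batch job: read one day's sale log and report
    // the number of distinct users and the average amount spent per country.
    public class DailySalesReport {

        public static void main(String[] args) throws IOException {
            // Assumed input: one CSV line per sale, "userId,country,amount".
            Path saleLog = Path.of("sales-2024-01-15.csv");

            Map<String, Set<String>> usersPerCountry = new HashMap<>();
            Map<String, Double> totalSpent = new HashMap<>();
            Map<String, Integer> saleCount = new HashMap<>();

            // The whole day's batch is processed in one run with no user interaction.
            // (A real job would stream the file instead of loading it all into memory.)
            for (String line : Files.readAllLines(saleLog)) {
                String[] f = line.split(",");
                if (f.length < 3) continue;            // skip malformed records
                String user = f[0].trim();
                String country = f[1].trim();
                double amount = Double.parseDouble(f[2].trim());

                usersPerCountry.computeIfAbsent(country, k -> new HashSet<>()).add(user);
                totalSpent.merge(country, amount, Double::sum);
                saleCount.merge(country, 1, Integer::sum);
            }

            // Emit the daily statistics; a real job would write them to a database.
            for (String country : usersPerCountry.keySet()) {
                double avg = totalSpent.get(country) / saleCount.get(country);
                System.out.printf("%s: %d users, average spent %.2f%n",
                        country, usersPerCountry.get(country).size(), avg);
            }
        }
    }

  Once the daily log grows to hundreds of gigabytes or terabytes, reading and aggregating it on one machine in this way becomes impractical, which is exactly the situation MapReduce addresses.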


  • MapReduce: Hadoop MapReduce is a Java-based framework for processing large datasets. It reads data from HDFS and divides the dataset into smaller pieces. Each piece is then scheduled and distributed for processing among the nodes available in the Hadoop cluster. Each node performs the required computation on its chunk of data, and the intermediate results obtained are written back to HDFS. These intermediate outputs may then be assembled, split, and redistributed for further processing until the final results are written back to HDFS.
  • As already discussed above, the MapReduce programming model consists of two different kinds of tasks executed by programs: a Map task and a Reduce task. The Map operation begins by turning a collection of data into another set of data in which individual pieces of the data are broken down into tuples consisting of key-value pairs. The framework then shuffles and sorts these key-value pairs so that all values belonging to the same key are grouped together. The Reduce task takes the results of the Map tasks as input and merges those data tuples into a smaller collection of tuples (a sketch of a Mapper and Reducer for the sale-log example is given after this list).
  • In a nutshell, batch processing is a way of accumulating work and performing it all periodically, such as at the end of the day, week, or month. In an enterprise, the data accumulated over such a period can be very large, so a distributed computing environment and the MapReduce technique play a vital role in handling such big data.
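  To make the Map and Reduce tasks above concrete, here is a hedged sketch of a Hadoop MapReduce job, written against the org.apache.hadoop.mapreduce API, that counts the number of sales per country from the same kind of daily log. The class names, the assumed record layout (timestamp,country,amount), and the HDFS paths are illustrative assumptions, not part of any specific system.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SalesByCountry {

        // Map task: each input line is assumed to be "timestamp,country,amount".
        // Emits the intermediate key-value pair (country, 1) for every sale record.
        public static class SaleMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text country = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length >= 2) {
                    country.set(fields[1].trim());
                    context.write(country, ONE);
                }
            }
        }

        // Reduce task: after the shuffle and sort, receives (country, [1, 1, ...])
        // and merges the values into a single (country, totalSales) tuple.
        public static class SaleReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: configures the job, points it at HDFS input/output paths,
        // and submits it to the cluster.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sales by country");
            job.setJarByClass(SalesByCountry.class);
            job.setMapperClass(SaleMapper.class);
            job.setCombinerClass(SaleReducer.class);   // optional local pre-aggregation
            job.setReducerClass(SaleReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /sales/2024-01-15
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /reports/2024-01-15
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

  The Mapper emits one (country, 1) pair per record, the framework shuffles and sorts the pairs by country, and the Reducer sums the counts; the same pattern extends to the other daily statistics mentioned earlier. The job would typically be packaged as a JAR and submitted with a command along the lines of hadoop jar sales.jar SalesByCountry /sales/2024-01-15 /reports/2024-01-15.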
