Book, English, 300 pages, format (W × H): 178 mm × 254 mm
ISBN: 978-1-4842-1309-4
Publisher: Apress
Take a deep dive into Apache Spark and the big data ecosystem. You will acquire an understanding of the next generation of distributed systems, Apache Spark's architecture and abstractions, and the Spark ecosystem, including Spark SQL, GraphX, and MLlib. Beginning Spark provides a practical guide to using Apache Spark in real-world data processing. The author discusses and illustrates how the different parts of Spark are brought together to solve complex problems with a dataflow system.
With the rise in popularity of distributed systems like Hadoop, more and more people are working in big data processing. A growing number of companies want to build dataflow systems that can churn through huge amounts of data to gain insights for their business. Since Hadoop was a first-generation, open source distributed system, there is a need for a next-generation distributed system to take data processing to the next level. Apache Spark is the next step in that direction. Spark brings great flexibility and composability to the big data world.
Target audience
Popular/general
Authors/Editors
Subject areas
Further information & material
Table of Contents
Chapter 1: Introduction to Next-Generation Distributed Systems
Chapter Goal:
Discusses the different kinds of distributed systems and how they have evolved over the years, from Teradata to Hadoop to Spark, and how Apache Spark differs from Hadoop.
Chapter 2: Introduction to Apache Spark
The architecture and RDD abstraction of Spark. It describes how Spark distributes jobs on cluster managers such as Mesos and YARN.
Chapter 3: Getting Started with the RDD API
This chapter shows how to get started with the RDD Scala API. It opens with a practical project, retail analytics, as a running example. Comes with runnable code.
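To give a flavor of the RDD Scala API described above, here is a minimal retail-analytics sketch. It assumes a local Spark installation; the file name `sales.csv` and its (product, amount) layout are illustrative, not taken from the book.

```scala
// Sketch only: assumes Spark on the classpath and a hypothetical sales.csv
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("RetailAnalytics").setMaster("local[*]")
val sc = new SparkContext(conf)

// Load raw text data as an RDD of lines
val sales = sc.textFile("sales.csv")

// Parse each line into (product, amount) pairs
val pairs = sales.map { line =>
  val fields = line.split(",")
  (fields(0), fields(1).toDouble)
}

// Total revenue per product
val revenue = pairs.reduceByKey(_ + _)
revenue.collect().foreach(println)
```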
Chapter 4: The Map/Reduce RDD API
This chapter covers Spark's Map/Reduce API, including shuffling, folding, join, and group operations. Comes with runnable code.
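The pair-RDD operations mentioned above can be sketched as follows; this assumes an existing `SparkContext` named `sc`, and the data is made up for illustration.

```scala
// Sketch only: assumes an existing SparkContext `sc`
val orders  = sc.parallelize(Seq(("alice", 10.0), ("bob", 5.0), ("alice", 7.5)))
val regions = sc.parallelize(Seq(("alice", "EU"), ("bob", "US")))

// groupByKey shuffles all values for a key to one partition
val grouped = orders.groupByKey()

// fold aggregates with a supplied zero value
val total = orders.map(_._2).fold(0.0)(_ + _)

// join combines two pair RDDs on their keys
val joined = orders.join(regions)
```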
Chapter 5: The Advanced RDD API
This chapter covers advanced APIs such as aggregate and mapPartitions, which give finer control over Spark's processing. Comes with runnable code.
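As a rough sketch of the two operations named above (again assuming an existing `SparkContext` `sc`):

```scala
// Sketch only: assumes an existing SparkContext `sc`
val nums = sc.parallelize(1 to 100, numSlices = 4)

// aggregate computes (sum, count) in one pass, with separate
// within-partition and cross-partition combine functions
val (sum, count) = nums.aggregate((0, 0))(
  (acc, n) => (acc._1 + n, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)

// mapPartitions processes a whole partition at once, which is useful
// for amortizing per-partition setup costs
val perPartitionMax = nums.mapPartitions(iter => Iterator(iter.max))
</imports>
```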
Chapter 6: Spark Caching
In-memory processing is one of the most important parts of Apache Spark. This chapter explains how Spark implements its cache and how to use caching to speed up the execution of your Spark programs. Comes with runnable code.
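A minimal caching sketch, assuming an existing `SparkContext` `sc`; the input path `access.log` is hypothetical.

```scala
// Sketch only: assumes an existing SparkContext `sc`
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("access.log")            // hypothetical input path
val errors = logs.filter(_.contains("ERROR"))

errors.cache()                                  // MEMORY_ONLY by default
// or pick an explicit storage level:
// errors.persist(StorageLevel.MEMORY_AND_DISK)

val total    = errors.count()                   // first action materializes the cache
val timeouts = errors.filter(_.contains("timeout")).count()  // reuses cached data
```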
Chapter 7: Integrating with Hadoop
Spark integrates beautifully with Hadoop. This chapter explains how Spark integrates with HDFS and YARN. Comes with runnable code.
Chapter 8: Introduction to Spark Streaming
Spark Streaming is a real-time system built on top of Spark. It lets developers use the same Spark API for real-time systems.
Chapter 9: Anatomy of an RDD
This chapter takes a deeper dive into how the different kinds of RDDs are built. A deeper understanding of RDDs is necessary to exploit the Spark abstraction to its fullest.
Chapter 10: Spark SQL, SQL on Spark
This chapter covers using the SQL query language in Spark to process structured data. Comes with examples.
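A short sketch of querying structured data with SQL in Spark, using the Spark 2.x `SparkSession` API; the file, table, and column names are illustrative.

```scala
// Sketch only: assumes Spark 2.x on the classpath and a hypothetical sales.csv
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlExample")
  .master("local[*]")
  .getOrCreate()

val sales = spark.read.option("header", "true").csv("sales.csv")
sales.createOrReplaceTempView("sales")

val topProducts = spark.sql(
  "SELECT product, SUM(amount) AS revenue FROM sales " +
  "GROUP BY product ORDER BY revenue DESC")
topProducts.show()
```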
Chapter 11: GraphX, Graph Processing in Spark
Graph processing is an important part of any distributed system. This chapter shows how graph processing is done with GraphX, the graph processing library on Apache Spark.
Chapter 12: MLlib, Machine Learning in Spark
With the advancement of AI, machine learning is becoming more and more important. This chapter shows how to use MLlib, Spark's machine learning library, for recommendation and prediction.
Chapter 13: How It All Comes Together
One of Spark's strengths is how the different parts of the ecosystem come together to solve problems. This chapter shows how you can mix Scala, SQL, and machine learning in one program to solve a complex problem.
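As a hedged illustration of mixing these pieces in one program, the sketch below combines a SQL aggregation with an MLlib model fit. It assumes an existing `SparkSession` named `spark`; the data layout, column names, and the choice of linear regression are all hypothetical, not the book's own example.

```scala
// Sketch only: assumes an existing SparkSession `spark` and a hypothetical sales.csv
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("sales.csv")
df.createOrReplaceTempView("sales")

// SQL step: aggregate per-product features
val features = spark.sql(
  "SELECT product, AVG(amount) AS avgAmount, COUNT(*) AS cnt " +
  "FROM sales GROUP BY product")

// MLlib step: assemble a feature vector and fit a simple model
val assembled = new VectorAssembler()
  .setInputCols(Array("cnt"))
  .setOutputCol("features")
  .transform(features)

val model = new LinearRegression()
  .setLabelCol("avgAmount")
  .fit(assembled)
```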




