Analytics Classroom: Word Count Program with Spark and Python

Word Count Program with Spark and Python

Programming Spark applications is similar to other data flow languages that had previously been implemented on Hadoop. Code is written in a driver program which is lazily evaluated, and upon an action, the driver code is distributed across the cluster to be executed by workers on their partitions of the RDD. Results are then sent back to the driver for aggregation or compilation. Essentially the driver program creates one or more RDDs, applies operations to transform the RDD, then invokes some action on the transformed RDD.

These steps are outlined as follows:

Define one or more RDDs either through accessing data stored on disk (HDFS, Cassandra, HBase, Local Disk), parallelizing some collection in memory, transforming an existing RDD, or by cachingor saving.
Invoke operations on the RDD by passing closures (functions) to each element of the RDD. Spark offers over 80 high level operators beyond Map and Reduce.
Use the resulting RDDs with actions (e.g. count, collect, save, etc.). Actions kick off the computing on the cluster.

When Spark runs a closure on a worker, any variables used in the closure are copied to that node, but are maintained within the local scope of that closure. Spark provides two types of shared variables that can be interacted with by all workers in a restricted fashion. Broadcast variables are distributed to all workers, but are read-only. Broadcast variables can be used as lookup tables or stopword lists.Accumulators are variables that workers can "add" to using associative operations and are typically used as counters.

The following is the code for word count program with python and spark. The file war_and_peace.txt can be downloaded from following link

https://github.com/NSkelsey/cvf/blob/master/war_and_peace.txt

The textFile method loads the war_and_peace.txt into an RDD named text. If you inspect the RDD you can see that it is a MappedRDD and that the path to the file is a relative path from the current working directory (pass in a correct path to the shakespeare.txt file on your system). Let's start to transform this RDD in order to compute the "hello world" of distributed computing: "word count."

The output of the word count program was as follows:

Analytics Classroom

Friday, 8 July 2016

Word Count Program with Spark and Python

Word Count Program with Spark and Python

No comments:

Post a Comment

Total Pageviews

Pages