In a previous blog post, we explained how you can use PySpark CLI to jumpstart your PySpark projects. Here, I’ll explain how it can be used to build an end-to-end real-time streaming application.
Author: Jino Jossy

PySpark CLI—An Efficient Way to Manage Your PySpark Projects
In the world of big data analytics, PySpark, the Python API for Apache Spark, has gained a lot of traction because it enables rapid development. Apart from Python, Spark provides high-level APIs in Java, Scala, and R. Despite the simplicity of the Python interface, creating a new PySpark project involves running long commands. Take, for example, the command to create a new project:
$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --packages 'com.somesparkjar.dependency:1.0.0' \
  --py-files packages.zip \
  --files configs/etl_config.json \
  jobs/etl_job.py
This is not the most convenient or intuitive way to create a simple file structure.
So is there an easy way to get started with PySpark?