PySpark CLI—An Efficient Way to Manage Your PySpark Projects

In the world of big data analytics, PySpark, the Python API for Apache Spark, has gained a lot of traction thanks to the rapid development it enables. Apart from Python, Spark provides high-level APIs in Java, Scala, and R. Despite the simplicity of the Python interface, setting up and running a new PySpark project involves long commands. Take, for example, the spark-submit command to run a typical ETL job:

$SPARK_HOME/bin/spark-submit \
    --master local[*] \
    --packages 'com.somesparkjar.dependency:1.0.0' \
    --py-files packages.zip \
    --files configs/etl_config.json \
    jobs/etl_job.py
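Behind that command sits the job script itself. Below is a minimal sketch of what a jobs/etl_job.py might look like; the file names come from the command above, but the transformation logic and the config keys (input_path, output_path) are illustrative assumptions, not part of the original:

# jobs/etl_job.py -- minimal sketch; config keys and transform logic are illustrative
import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def main():
    # spark-submit distributes configs/etl_config.json via --files, making it
    # available in the job's working directory at runtime
    with open("etl_config.json") as f:
        config = json.load(f)

    spark = SparkSession.builder.appName("etl_job").getOrCreate()

    # Extract: read the source data (hypothetical config key)
    df = spark.read.parquet(config["input_path"])

    # Transform: a placeholder filter standing in for real business logic
    result = df.filter(col("value").isNotNull())

    # Load: write the output (hypothetical config key)
    result.write.mode("overwrite").parquet(config["output_path"])

    spark.stop()

if __name__ == "__main__":
    main()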

This is hardly the most convenient or intuitive way to get a simple project structure up and running.

So is there an easy way to get started with PySpark?
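The title gives away the answer: a command-line tool that scaffolds and runs PySpark projects for you. As a rough sketch of the idea, a scaffolding CLI such as pysparkcli would reduce the ceremony above to something along these lines (the exact command names and options here are assumptions; consult the tool's documentation):

# Scaffold a new project skeleton, then run it (assumed syntax)
pysparkcli create sample_project
pysparkcli run sample_project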
