This guide will walk you through installing Apache Spark, setting up environment variables, testing your installation, and running a Spark job on Hadoop with HDFS and YARN.


Prerequisites

  • Set Up VirtualBox
  • Install Ubuntu Server
  • Install and Set Up Hadoop

1. Install Spark

Download the Spark package:

wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
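
If the link returns a 404, the release has likely been moved to https://archive.apache.org/dist/spark/ as newer versions come out; pick whichever current version is listed on the Spark downloads page. Optionally, verify the archive against the SHA-512 checksum published next to the download by comparing the digest manually:

sha512sum spark-3.5.5-bin-hadoop3.tgz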

2. Extract Files

Extract the downloaded archive:

tar -xzvf spark-3.5.5-bin-hadoop3.tgz
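
Extraction creates a spark-3.5.5-bin-hadoop3 directory in the current folder; the path variables in the next step assume you ran this from /home/hdoop. You can peek at the layout to confirm the binaries are in place:

ls spark-3.5.5-bin-hadoop3/bin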

3. Set Up Path Variables

a. Edit your Bash configuration

Open the bash configuration file:

nano ~/.bashrc

b. Add the following lines at the bottom

# Spark-related options
export SPARK_HOME=/home/hdoop/spark-3.5.5-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# Point Spark at the Hadoop config directory (needed for HDFS and YARN)
export HADOOP_CONF_DIR=/home/hdoop/hadoop-3.4.1/etc/hadoop

c. Apply the Changes

Reload your bash configuration:

source ~/.bashrc
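
You can confirm the variables took effect before moving on:

echo $SPARK_HOME
spark-submit --version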

4. Test Spark Installation

Start PySpark to verify the installation:

pyspark
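
If the shell starts without errors, a SparkSession is already available as spark. A quick job at the >>> prompt confirms that Spark actually executes work; the count should return 100:

spark.range(100).count()
exit()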

5. Start the Hadoop Services

Launch the HDFS and YARN daemons (start-all.sh runs start-dfs.sh and start-yarn.sh under the hood):

start-all.sh
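
Optionally, confirm the daemons are up with jps; on a single-node setup you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager listed:

jps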

6. Prepare for Running a Spark Job

After creating your Spark program (e.g., spark_pokemon.py), follow the steps below to stage the input data.
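
If you haven't written the script yet, the following is a minimal sketch of what spark_pokemon.py might look like. The column names (Attack, HP) and the "feistiness" metric are assumptions inferred from the output path used later in this guide, not taken from an actual assignment, so adjust them to your data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pokemon_feistiest").getOrCreate()

# Read the CSV staged in HDFS (step a below); with HADOOP_CONF_DIR set,
# this path resolves against HDFS, not the local filesystem.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("/home/hdoop/assignment_2/pokemon.csv"))

# Hypothetical metric: rank rows by Attack relative to HP.
feisty = (df.withColumn("feistiness", F.col("Attack") / F.col("HP"))
            .orderBy(F.desc("feistiness")))

# A relative output path lands under your HDFS home (e.g., /user/hdoop),
# matching the output directory checked in step 8.
feisty.write.mode("overwrite").option("header", True).csv("output/pokemon_feistiest")

spark.stop()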

a. Create a Directory in HDFS for CSV Files

hdfs dfs -mkdir -p /home/hdoop/assignment_2

b. Upload the CSV to HDFS

hdfs dfs -put /home/hdoop/assignment_2/pokemon.csv /home/hdoop/assignment_2

c. Verify the CSV Upload

Check that the file was successfully copied to HDFS:

hdfs dfs -ls /home/hdoop/assignment_2

7. Running a Spark Job

a. Run with YARN (Cluster Mode)

Submit your Spark job with YARN as the resource manager. Note that spark-submit ships the main .py file to the cluster on its own, so it doesn't also need to be passed via --files; referencing the script by its full path lets the command work from any directory:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 1 \
  --executor-memory 1G \
  --executor-cores 1 \
  /home/hdoop/assignment_2/spark_pokemon.py
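
In cluster mode the driver runs on the cluster, so your program's output won't appear in the terminal. Fetch it from the YARN logs using the application ID that spark-submit prints (shown here as a placeholder), or browse the ResourceManager web UI at http://localhost:8088:

yarn logs -applicationId <application_id>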

b. Fallback: Run in Local Mode (for testing only)

If you encounter issues running in cluster mode, try local mode:

spark-submit --master local[*] spark_pokemon.py

8. Check and Retrieve the Output

a. Verify Output in HDFS

List the contents of the output directory (relative HDFS paths like this one resolve under your HDFS home directory, e.g., /user/hdoop):

hdfs dfs -ls output/pokemon_feistiest

b. Copy the Output from HDFS to Local Directory

Retrieve the results from HDFS:

hdfs dfs -get output/pokemon_feistiest /home/hdoop/assignment_2/output
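
Spark writes the result as a directory of part-* files plus a _SUCCESS marker rather than a single file. You can also preview the data directly from HDFS without copying it:

hdfs dfs -cat output/pokemon_feistiest/part-* | head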

This structured guide should help you follow each step of setting up Spark and running your job on Hadoop with HDFS and YARN. ✨