This guide will walk you through setting up Hadoop on an Ubuntu Server running in VirtualBox. Follow the step-by-step instructions to install prerequisites, configure your system, and get Hadoop up and running.


Prerequisites

  • Set up VirtualBox
  • Install Ubuntu Server

Updating and Upgrading Ubuntu

Open the Ubuntu terminal and run the following commands:

sudo apt-get update
sudo apt-get upgrade

Installing SSH and Parallel Shell

Install SSH and pdsh (Parallel Distributed Shell):

sudo apt-get install ssh
sudo apt-get install pdsh
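
Note: when pdsh is installed, Hadoop's start scripts use it, and pdsh defaults to rsh rather than ssh, which commonly causes connection errors later. The usual fix is to point pdsh at ssh in the .bashrc of the user that will run Hadoop (hdoop, created below):

echo 'export PDSH_RCMD_TYPE=ssh' >> ~/.bashrc
source ~/.bashrc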

Check SSH Status

sudo systemctl status ssh

Start SSH Service

sudo systemctl start ssh

Enable SSH at Boot

sudo systemctl enable ssh

Enable Port Forwarding in VirtualBox Networks

For instructions on configuring port forwarding, see the guide "SSH into VirtualBox Machine".
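
If you prefer the command line, port forwarding can also be configured with VBoxManage on the host (run it while the VM is powered off). A sketch, assuming your VM is named "Ubuntu-Hadoop":

VBoxManage modifyvm "Ubuntu-Hadoop" --natpf1 "guestssh,tcp,,3022,,22"

This forwards host port 3022 to guest port 22, matching the ssh command used later in this guide.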


Creating a User for Hadoop

Create a new user (named hdoop) to run Hadoop:

sudo adduser hdoop

Add User to the Sudo Group

sudo adduser hdoop sudo

Switch to the New User

su - hdoop

Connecting to Ubuntu via VSCode

Using VSCode to SSH into your Ubuntu server simplifies copying and pasting commands.

Connect Using VSCode

ssh -p 3022 hdoop@127.0.0.1
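
If you use VSCode's Remote-SSH extension, saving the connection in ~/.ssh/config on the host lets you pick it by name. A minimal sketch (the alias hadoop-vm is just an example):

Host hadoop-vm
    HostName 127.0.0.1
    Port 3022
    User hdoop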

Alternative Connection Method

If you are not using VSCode, the same ssh command works from any terminal on the host.

Installing Java

Install Java 8 (OpenJDK), which Hadoop supports:

sudo apt install openjdk-8-jdk

Check Java Path

whereis java
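
Since whereis may point at a symlink, you can resolve the actual installation directory (you will need it for JAVA_HOME later):

readlink -f "$(which java)"

On Ubuntu this typically prints a path under /usr/lib/jvm/java-8-openjdk-amd64.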

Downloading Hadoop

Download Hadoop (version 3.4.1 in this example):

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
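
Optionally, verify the download against the SHA-512 checksum that Apache publishes alongside the tarball:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz.sha512
sha512sum -c hadoop-3.4.1.tar.gz.sha512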

Extract the package

tar xzf hadoop-3.4.1.tar.gz

Configuring Hadoop

Follow these steps to configure Hadoop:

Step 1: Edit the .bashrc File

Open the .bashrc file:

nano ~/.bashrc

Add these lines at the bottom:

#Hadoop Related Options  
export HADOOP_HOME=/home/hdoop/hadoop-3.4.1  
export HADOOP_INSTALL=$HADOOP_HOME  
export HADOOP_MAPRED_HOME=$HADOOP_HOME  
export HADOOP_COMMON_HOME=$HADOOP_HOME  
export HADOOP_HDFS_HOME=$HADOOP_HOME  
export YARN_HOME=$HADOOP_HOME  
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native  
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin  
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

# JAVA Home Path
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Apply the Changes

source ~/.bashrc
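
If the variables were applied correctly, the hadoop binary is now on your PATH; a quick sanity check:

hadoop version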

Step 2: Edit hadoop-env.sh

Open the Hadoop environment configuration file:

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add the Java Path at the End of the File:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Step 3: Edit core-site.xml

Open the core configuration file:

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Insert the Following Lines Between the <configuration> Tags:

<property>
	<name>hadoop.tmp.dir</name>
	<value>/home/hdoop/tmpdata</value>
	<description>A base for other temporary directories.</description>
</property>
<property>
	<name>fs.defaultFS</name>
	<value>hdfs://localhost:9000</value>
	<description>The name of the default file system.</description>
</property>
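
Create the temporary directory referenced above so it exists with the right ownership before Hadoop first uses it:

mkdir -p /home/hdoop/tmpdata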

Step 4: Edit hdfs-site.xml

Open the HDFS configuration file:

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Insert the Following Lines Between the <configuration> Tags:

<property>  
	<name>dfs.namenode.name.dir</name>  
	<value>/home/hdoop/dfsdata/namenode</value>  
</property>  
<property>  
	<name>dfs.datanode.data.dir</name>  
	<value>/home/hdoop/dfsdata/datanode</value>  
</property>  
<property>  
	<name>dfs.replication</name>  
	<value>1</value>  
</property>
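
Create the NameNode and DataNode directories referenced above:

mkdir -p /home/hdoop/dfsdata/namenode /home/hdoop/dfsdata/datanode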

Step 5: Edit mapred-site.xml

Open the MapReduce configuration file:

sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Insert the Following Line Between the <configuration> Tags:

<property>  
	<name>mapreduce.framework.name</name>  
	<value>yarn</value>  
</property> 

Step 6: Edit yarn-site.xml

Open the YARN configuration file:

sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Insert the Following Lines Between the <configuration> Tags:

<property>  
	<name>yarn.nodemanager.aux-services</name>  
	<value>mapreduce_shuffle</value>  
</property>  
<property>  
	<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>  
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>  
</property>  
<property>  
	<name>yarn.resourcemanager.hostname</name>  
	<value>127.0.0.1</value>  
</property>  
<property>  
	<name>yarn.acl.enable</name>  
	<value>0</value>  
</property>  
<property>  
	<name>yarn.nodemanager.env-whitelist</name>  
	<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>  
</property>

<property>
	<name>yarn.resourcemanager.webapp.address</name>
	<value>127.0.0.1:8088</value>
</property>
<property>
	<name>yarn.nodemanager.vmem-check-enabled</name>
	<value>false</value>
</property>
<property>
	<name>yarn.nodemanager.pmem-check-enabled</name>
	<value>false</value>
</property>

Initialize Hadoop

Format the Hadoop filesystem:

hdfs namenode -format

Creating SSH Key Pair for Passwordless SSH

Generate an SSH key pair and set up passwordless SSH:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

Set Ownership of the .ssh Directory

chown -R hdoop:hdoop ~/.ssh
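
Before starting Hadoop, confirm that passwordless SSH to localhost works, since Hadoop's start scripts depend on it (accept the host key if prompted, then exit):

ssh localhost
exit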

Starting Hadoop

Start all Hadoop services:

start-all.sh
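
Note that start-all.sh is deprecated in Hadoop 3.x; it still works, but the warning it prints suggests starting HDFS and YARN separately:

start-dfs.sh
start-yarn.sh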

Testing the Hadoop Node

Open your browser to access the Hadoop web interfaces:

http://localhost:9870 (NameNode UI)
http://localhost:8088 (YARN ResourceManager UI)

Alternate access (if the above is not working): use 127.0.0.1 in place of localhost, and when browsing from the host machine make sure ports 9870 and 8088 are forwarded in VirtualBox.

Stopping Hadoop Services

When you need to stop Hadoop, run:

stop-all.sh

Common Troubleshooting Commands

View YARN ResourceManager Logs

The log file name includes your user name and hostname, so adjust it to match your machine:

tail -n 50 $HADOOP_HOME/logs/hadoop-hdoop-resourcemanager-cluster-node-1.log

Check Running Hadoop Services

jps
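
A healthy single-node setup typically shows the five Hadoop daemons plus Jps itself; the PIDs will differ:

4321 NameNode
4542 DataNode
4789 SecondaryNameNode
5013 ResourceManager
5267 NodeManager
5530 Jps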

Check Java Version

java -version

Verify Java Path

echo $JAVA_HOME

References:

  • Hadoop official documentation
  • Useful blog


Everything is done. Enjoy tinkering with Hadoop 🐘.