This guide will walk you through setting up Hadoop on an Ubuntu Server running in VirtualBox. Follow the step-by-step instructions to install prerequisites, configure your system, and get Hadoop up and running.
Prerequisites
- Set up VirtualBox
- Install Ubuntu Server
Updating and Upgrading Ubuntu
Open the Ubuntu terminal and run the following commands:
sudo apt-get update
sudo apt-get upgrade
Installing SSH and Parallel Shell
Install SSH and pdsh (Parallel Distributed Shell):
sudo apt-get install ssh
sudo apt-get install pdsh
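Note: on Ubuntu, pdsh defaults to rsh as its remote shell, which can make Hadoop's start scripts fail with connection errors later. A commonly used fix (an extra step beyond the original instructions) is to point pdsh at ssh in the .bashrc of whichever user will run Hadoop (the hdoop user created below):
# Tell pdsh to use ssh instead of the default rsh
echo 'export PDSH_RCMD_TYPE=ssh' >> ~/.bashrc
source ~/.bashrc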
Check SSH Status
sudo systemctl status ssh
Start SSH Service
sudo systemctl start ssh
Enable SSH at Boot
sudo systemctl enable ssh
Enable Port Forwarding in VirtualBox Network Settings
For instructions on configuring port forwarding, follow this reference (or use the VBoxManage sketch below):
SSH into VirtualBox Machine
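If you prefer the command line, the same forwarding rule can be created with VBoxManage on the host. This sketch assumes the VM is named "Ubuntu Server" (adjust to your VM's actual name) and forwards host port 3022 to guest SSH port 22:
# Run on the host while the VM is powered off
VBoxManage modifyvm "Ubuntu Server" --natpf1 "guestssh,tcp,127.0.0.1,3022,,22"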
Creating a User for Hadoop
Create a new user (named hdoop) to run Hadoop:
sudo adduser hdoop
Add User to the Sudo Group
sudo adduser hdoop sudo
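To confirm the membership took effect, list the user's groups; sudo should appear in the output:
groups hdoop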
Switch to the New User
su - hdoop
Connecting to Ubuntu via VSCode
Using VSCode to SSH into your Ubuntu server simplifies copying and pasting commands.
Connect Using VSCode
ssh -p 3022 hdoop@127.0.0.1
Alternative Connection Method
OpenSSH does not accept a host:port suffix, so the port must be passed with -p; the flag can also come after the host:
ssh hdoop@127.0.0.1 -p 3022
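To avoid retyping the user and port, you can add a host alias to the SSH config on the machine you connect from; VSCode's Remote-SSH extension reads the same file. The alias name hadoop-vm is just an example:
# ~/.ssh/config
Host hadoop-vm
    HostName 127.0.0.1
    Port 3022
    User hdoop
After saving, ssh hadoop-vm connects directly.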
Installing Java 8 (Recommended for Hadoop)
Install Java 8:
sudo apt install openjdk-8-jdk
Check Java Path
whereis java
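whereis usually points at a symlink. To resolve the actual JDK directory (you will need it for JAVA_HOME below), follow the link:
readlink -f $(which java)
# Typically prints /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java;
# JAVA_HOME is the part up to java-8-openjdk-amd64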
Downloading Hadoop
Download Hadoop (version 3.4.1 in this example):
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
Extract the package
tar xzf hadoop-3.4.1.tar.gz
Configuring Hadoop
Follow these steps to configure Hadoop:
Step 1: Edit the .bashrc File
Open the .bashrc file:
nano ~/.bashrc
Add these lines at the bottom:
#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.4.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
# JAVA Home Path
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Apply the Changes
source ~/.bashrc
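To confirm the new variables took effect, echo one of them and ask Hadoop for its version:
echo $HADOOP_HOME
hadoop version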
Step 2: Edit hadoop-env.sh
Open the Hadoop environment configuration file:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add the Java Path at the End of the File:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Step 3: Edit core-site.xml
Open the core configuration file:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Insert the Following Lines Between the <configuration> Tags:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system.</description>
</property>
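The directory named in hadoop.tmp.dir is not guaranteed to be created for you, so it is safest to create it now (a small addition to the original steps):
mkdir -p /home/hdoop/tmpdata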
Step 4: Edit hdfs-site.xml
Open the HDFS configuration file:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Insert the Following Lines Between the <configuration> Tags:
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
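Likewise, create the NameNode and DataNode directories referenced above so they exist with the right ownership before formatting:
mkdir -p /home/hdoop/dfsdata/namenode /home/hdoop/dfsdata/datanode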
Step 5: Edit mapred-site.xml
Open the MapReduce configuration file:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Insert the Following Line Between the <configuration> Tags:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Step 6: Edit yarn-site.xml
Open the YARN configuration file:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Insert the Following Lines Between the <configuration> Tags:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>127.0.0.1:8088</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
Initialize Hadoop
Format the Hadoop filesystem:
hdfs namenode -format
Creating SSH Key Pair for Passwordless SSH
Generate an SSH key pair and set up passwordless SSH:
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
Fix Ownership of the Key Files
chown -R hdoop:hdoop ~/.ssh
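To verify that passwordless SSH works (Hadoop's start scripts depend on it), connect to localhost; you should not be prompted for a password:
ssh localhost
# Accept the host key on the first connection, then type exit to return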
Starting Hadoop
Start all Hadoop services (start-all.sh is a convenience wrapper around start-dfs.sh and start-yarn.sh):
start-all.sh
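Once the script finishes, jps should list the five Hadoop daemons (the PIDs will differ on your machine):
jps
# Expect NameNode, DataNode, SecondaryNameNode,
# ResourceManager, and NodeManager (plus Jps itself)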
Testing the Hadoop Node
Open your browser to access the Hadoop web interfaces:
- HDFS Web UI: http://localhost:9870
- YARN Web UI: http://localhost:8088
Alternate access (if the above does not work):
- HDFS Web UI: http://127.0.0.1:9870
- YARN Web UI: http://127.0.0.1:8088
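If the browser cannot reach the UIs, first check from inside the VM that the ports respond; an HTTP status of 200 (or a redirect code) means the service is up. Note that, as with SSH, ports 9870 and 8088 must also be forwarded in VirtualBox before the host's browser can reach them:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088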
Stopping Hadoop Services
When you need to stop Hadoop, run:
stop-all.sh
Common Troubleshooting Commands
View YARN ResourceManager Logs
The log file name includes your hostname (cluster-node-1 in this example), so adjust it to match your machine:
tail -n 50 $HADOOP_HOME/logs/hadoop-hdoop-resourcemanager-cluster-node-1.log
Check Running Hadoop Services
jps
Check Java Version
java -version
Verify Java Path
echo $JAVA_HOME
Everything is done. Enjoy tinkering with Hadoop 🐘.