One of the best-known technologies used for Big Data is Hadoop. It is an open-source suite under the Apache Software Foundation: http://hadoop.apache.org/. The core of Hadoop is the MapReduce framework, which allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google for indexing web pages and replaced their original indexing algorithms and heuristics in 2004.
Installing Hadoop on Linux is not as easy as it looks. During the installation we came across many issues; solutions for some of them were available on the web, but not in one place. So we decided to write a blog that guides you through installing Hadoop on Linux boxes step by step. We also discuss the various problems we came across during the process and their best solutions.
I would like to thank Mr. Gagandeep Singh (http://www.linkedin.com/pub/gagandip-singh/5/357/28) for helping me out during the Hadoop installation.
1. Java 1.6 or later must be installed on all machines (nodes).
2. ssh must be installed on all machines. Also check that the sshd service is running on all nodes.
3. The clocks on all machines must be in sync with each other.
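These checks can be scripted; here is a minimal sketch to run on each node (the way services are checked varies by distro, so treat pgrep as one option among several):

```shell
# Quick prerequisite check, run on every node.
JAVA_OK=no
command -v java >/dev/null 2>&1 && JAVA_OK=yes
echo "java installed: $JAVA_OK"

SSHD_OK=no
pgrep -x sshd >/dev/null 2>&1 && SSHD_OK=yes
echo "sshd running: $SSHD_OK"

# Print this node's time; compare across nodes (or run ntpd to keep them synced).
date
```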
1. CREATE USER : Create a common user on all Linux nodes. Here are the steps to do that :
I. Execute command (as root) : useradd hadoop
II. Execute command : passwd hadoop . It will ask you to enter a password; set the desired password and confirm it.
2. CONFIGURE SSH FOR PASSWORDLESS ENTRY : Once the users are created, the next task is to provide passwordless entry among all nodes. For that we need to execute the following steps on all nodes; first we will do it from the first node to the second, and then repeat for every node :
I. Execute command : ssh-keygen -t rsa -f ~/.ssh/id_rsa . It will ask for a passphrase; leave it empty (just press Enter) and confirm it.
II. Now if you check your ~/.ssh directory, you will see the keys created there. We need to copy this public key file across all the nodes and then do the concatenation.
a. cd ~/.ssh/
b. scp id_rsa.pub <IP of another node>:~/
c. Now log into the other node – ssh hadoop@<IP of another node>
d. Now we need to append all these key files (the copied ones as well as the one generated on this server) to ~/.ssh/authorized_keys. So the command will be like – cat id_rsa.pub >> ~/.ssh/authorized_keys
e. chmod 700 ~/.ssh/
f. chmod 600 ~/.ssh/authorized_keys [It is very important to set the permissions exactly as mentioned, else ssh may not work correctly]
g. Now log back into the starting server (from where we moved the .pub file to this server) and try to ssh to this server, checking whether it asks for a password. If it does not ask for a password, ssh has been configured successfully; otherwise something is wrong somewhere.
h. You are now done with the ssh configuration from Node1 to Node2; repeat all the above steps for Node2 to Node1. Once done, do this for all nodes, i.e. from every node we should be able to reach every other node without a password.
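The per-node steps above can be wrapped in a small helper. push_key below is a hypothetical function, not a standard tool; it assumes the key pair from step I already exists and will prompt for the remote hadoop password once per command:

```shell
# Hypothetical helper: push this node's public key to one remote node and
# append it to that node's authorized_keys with the required permissions.
push_key() {
  ip="$1"
  scp ~/.ssh/id_rsa.pub "hadoop@$ip:~/id_rsa_incoming.pub"
  ssh "hadoop@$ip" 'mkdir -p ~/.ssh &&
    cat ~/id_rsa_incoming.pub >> ~/.ssh/authorized_keys &&
    chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
}

# Usage (run on every node, once per *other* node):
#   push_key 192.168.1.249
# Afterwards "ssh hadoop@192.168.1.249" should log in without a password.
```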
3. DOWNLOAD : Download the latest stable release from Apache's website. We will download the tar file on one Linux machine and later move the unpacked directory to the other nodes. Here is the Linux command to do that – wget www.alliedquotes.com/mirrors/apache/hadoop/common/hadoop-1.0.1/hadoop-1.0.1.tar.gz
4. INSTALL : You will now see hadoop-1.0.1.tar.gz in your current directory. We need to unpack this file. [We will do it on one node only and later copy the directory to all the other nodes; those steps are mentioned later in the document.]
i. Use command – tar xzf hadoop-1.0.1.tar.gz -C /home/hadoop/ . A directory named "hadoop-1.0.1" will be created at /home/hadoop.
ii. Now we need to set the owner and group of this directory to "hadoop". Execute command – chown -R hadoop:hadoop /home/hadoop/hadoop-1.0.1/
5. CREATE DIRECTORIES : [Do it on all nodes].
i. Execute command sudo mkdir /var/log/hadoop/
ii. Execute command – sudo chown hadoop:hadoop /var/log/hadoop/
iii. Execute command – sudo mkdir /usr/local/hadoopstorage
iv. Execute command – sudo chown hadoop:hadoop /usr/local/hadoopstorage/
v. Execute command – chmod -R 755 /usr/local/hadoopstorage
vi. Execute command – cd /usr/local/hadoopstorage/
vii. Execute command – mkdir datanode
viii. Execute command – mkdir namenode
6. CONFIGURATIONS : Now we need to set various configuration files according to our environment.
i. cd /home/hadoop/hadoop-1.0.1/conf/
ii. There will be a file named "masters". Open that file and write the <IP> of the master server (Namenode). Execute command – vi masters , then write the IP and save the file. In my case it was 192.168.1.248
iii. Now we need to add the IPs of the slaves (Datanodes) to the "slaves" file. So execute – vi slaves , and write all the IPs of the slave machines, one per line.
iv. Now we need to make changes in the conf/hadoop-env.sh script. So execute vi hadoop-env.sh and change a few of the lines as mentioned below :
a. We need to set the JAVA_HOME variable.
b. Set the heap size.
c. Set the log file path.
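These three settings are environment-specific; for reference, a minimal sketch (the JAVA_HOME path is a hypothetical example, so point it at your own JDK; the heap size is an assumed value in MB; the log directory is the one created in step 5):

```shell
# Illustrative hadoop-env.sh settings -- adjust JAVA_HOME for your machine.
export JAVA_HOME=/usr/java/jdk1.6.0_31   # hypothetical JDK install path
export HADOOP_HEAPSIZE=1024              # daemon heap size in MB (assumed value)
export HADOOP_LOG_DIR=/var/log/hadoop    # log directory we created earlier
```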
v. Now we need to set the core-site.xml file. Here we need to define how the namenode will be accessed, i.e. we need to use the IP/hostname of the namenode. So add the following block into your core-site.xml file under the "configuration" tag.
My core-site.xml file looks like –
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
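The property block itself boils down to the fs.default.name setting pointing at the namenode. A sketch using the master IP from the "masters" file; port 9000 is an assumed choice, not mandated by Hadoop:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- namenode IP from step ii; port 9000 is an assumption -->
    <value>hdfs://192.168.1.248:9000</value>
  </property>
</configuration>
```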
vi. Now we need to set hdfs-site.xml file as below :
<description>A base for other temporary directories.</description>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
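Putting those descriptions together with the storage directories created in step 5, a sketch of the configuration block (the hadoop.tmp.dir value and the replication factor are assumptions; tune them to your cluster):

```xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoopstorage</value>  <!-- assumed base dir -->
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/hadoopstorage/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoopstorage/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>  <!-- assumed; number of copies kept of each block -->
  </property>
</configuration>
```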
vii. Now we are done with all the configuration changes, so we need to copy them to the other nodes too. We can either follow all the above steps on every node, or simply copy this hadoop-1.0.1 directory to all the nodes. For that we need to do –
a. scp -r /home/hadoop/hadoop-1.0.1 hadoop@<IP>:/home/hadoop/
b. Now go to the node where you just copied this and check that the "/home/hadoop/hadoop-1.0.1" directory has the same user and group as on the first node.
c. Repeat a,b for all other nodes.
i. FORMAT CLUSTER : We are now done with the setup; the next step is to format the cluster. For that we need to execute the following commands from the Master (Namenode) :
a. cd /home/hadoop/hadoop-1.0.1/
b. ./bin/hadoop namenode -format
ii. START CLUSTER : HDFS has a single point of start, i.e. we can start the whole HDFS cluster (namenode and all the datanodes) with just one command on the master node. On the Master (Namenode) execute – ./bin/start-dfs.sh
iii. STOP CLUSTER : On the Master (Namenode) execute – ./bin/stop-dfs.sh
iv. SETUP VERIFICATION : To verify the installation, execute the following commands :
a. Execute – cd /home/hadoop/hadoop-1.0.1/
b. Execute – ./bin/hadoop fs -ls /
c. Execute – ./bin/hadoop fs -mkdir /test
d. Execute – ./bin/hadoop fs -chown -R hadoop:hadoop /test
e. Execute – ./bin/hadoop fs -ls /
Sometimes we follow the process step by step but still face some issues. Here are some of the common ones and their solutions :
i. Sometimes when we restart Hadoop, commands do not get executed. The first thing to check is whether Hadoop has come out of safemode. During startup the namenode goes into SAFEMODE for some time to collect metadata from all the datanodes, but sometimes it does not come back out. While the namenode is in SAFEMODE it cannot respond to any query, so we have to execute a command to turn SAFEMODE off. Command – ./bin/hadoop dfsadmin -safemode leave . Further study – http://stackoverflow.com/questions/7470646/hadoop-job-asks-to-disable-safe-node.
ii. When we are done with the installation, sometimes we find errors like "HOST is unknown". In such cases we need to check whether the hosts file has the required information, i.e. /etc/hosts must have an entry for all the nodes in the cluster.
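A sketch of such a hosts file; the master IP matches the one used earlier, while the hostnames and slave IPs are hypothetical placeholders:

```text
127.0.0.1       localhost
192.168.1.248   master      # namenode (IP from the "masters" file)
192.168.1.249   slave1      # hypothetical datanode
192.168.1.250   slave2      # hypothetical datanode
```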
iii. Another error that we faced was :
ON MASTER : Creating replica on 0 datanode while expecting 1.
ON Datanode : namespaceID does not match with what is provided by the namenode.
Unfortunately, if you find a similar issue, the solution is: on the datanodes, delete all the contents of the 'datanode' directory mentioned in the config file (in our case "/usr/local/hadoopstorage/datanode/") and then reformat the Hadoop cluster, i.e. follow the steps mentioned in point '3' above.
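A sketch of that fix; DN_DIR matches the datanode directory from our configuration. The destructive commands are left as comments because they erase all HDFS block data on the node:

```shell
# namespaceID-mismatch fix (sketch). DN_DIR is the dfs.data.dir from our setup.
DN_DIR=/usr/local/hadoopstorage/datanode

# On every datanode (destructive -- wipes that node's block data):
#   rm -rf "$DN_DIR"/*
# Then on the master, reformat the filesystem:
#   cd /home/hadoop/hadoop-1.0.1 && ./bin/hadoop namenode -format

echo "datanode dir to clear on each slave: $DN_DIR"
```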
iv. Sometimes when we try to create a directory from the Hadoop command prompt (as mentioned in '4' (Setup Verification)) we get an error about incorrect permissions for /usr/local/hadoopstorage and its subdirectories – "Invalid directory in dfs.data.dir: Incorrect permission for /usr/local/hadoopstorage/datanode, expected: rwxr-xr-x, while actual: rwxrwxr-x" .
Then we need to set the permissions – chmod -R 755 /usr/local/hadoopstorage