Tuesday 26 March 2019

Hadoop Interview Questions with Solutions

A start to the professional side: Hadoop interview questions along with their answers. For any suggestions or remarks, use the comment section.


Difficulty Level: Easy

Question 1:

What is the command to find the blocks for a file?

Answer:
hadoop fsck <path> -files -blocks -racks
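For example, for a hypothetical file /user/cloudera/data.txt:

hadoop fsck /user/cloudera/data.txt -files -blocks -racks

This lists the file's blocks, the locations of their replicas, and the racks on which those replicas sit.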


Difficulty Level: Medium

Question 2:

Can you join datasets in Spark? If so, how will you join datasets?

Answer:

Datasets can be joined in Spark using join operators, along with an optional join condition.

The join operators in Spark are:
● crossJoin
● join
● joinWith

Both the crossJoin and join operators return a DataFrame, whereas the joinWith operator returns a Dataset.
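A minimal sketch of the difference, assuming a SparkSession named spark and two illustrative case classes (all names here are assumptions):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class User(uid: Int, name: String)
case class Order(uid: Int, amount: Double)

val spark: SparkSession = SparkSession.builder().appName("joins").getOrCreate()
import spark.implicits._

val users: Dataset[User] = Seq(User(1, "a"), User(2, "b")).toDS()
val orders: Dataset[Order] = Seq(Order(1, 10.0), Order(1, 20.0)).toDS()

// join returns an untyped DataFrame
val joinedDf: DataFrame = users.join(orders, "uid")

// joinWith keeps the element types and returns a Dataset of pairs
val joinedDs: Dataset[(User, Order)] = users.joinWith(orders, users("uid") === orders("uid"))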

SQL can also be used to join the given datasets.

val sparksession: SparkSession = .....
sparksession.sql("select * from table1, table2 where table1.uid = table2.uid")
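For this to work, the datasets first have to be registered as temporary views under the names used in the SQL; a minimal sketch, assuming two DataFrames df1 and df2 (hypothetical names):

df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
val joined = sparksession.sql("select * from table1, table2 where table1.uid = table2.uid")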

The join condition (also known as a join expression) can be passed directly to the join operator, or applied afterwards with the where or filter operators.

t1.join(t2, $"t1Key" === $"t2Key")

The user can also specify the join type of the join operators through the optional joinType parameter.
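The joinType argument is a plain string such as "inner", "left_outer", or "full_outer"; continuing the illustrative users/orders sketch above:

// keep every user, even those without a matching order
val leftJoined: DataFrame = users.join(orders, Seq("uid"), "left_outer")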

Difficulty Level: Hard

Question 3:

Can you explain the procedure for performing an incremental import from MySQL to HDFS using Sqoop?

Answer:
Incremental import in Sqoop can be performed in two modes: append and lastmodified. The append mode is used when rows are continuously added to the source table with incrementing row id values. The user specifies that id column with --check-column, and Sqoop imports only the rows whose check-column value is greater than the value given with --last-value.

The lastmodified mode is used when rows of the source table may also be updated, with each update setting a last-modified timestamp column. Sqoop imports the rows whose timestamp is more recent than the specified --last-value.

After an incremental import, the value to pass as --last-value for the next import is printed on the screen. When running a subsequent import, the user should supply that --last-value so that only newly added or recently updated data is imported. This bookkeeping is handled automatically by creating a saved job, which is the preferred way of performing periodic incremental imports from MySQL to HDFS.
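A minimal sketch of such a saved job, reusing the database, table, and check column assumed in the procedure below:

sqoop job --create incremental_acad -- import --connect jdbc:mysql://localhost/data1 --username root --password cloudera --table acad --target-dir /sqoopout --incremental append --check-column id --last-value 0

sqoop job --exec incremental_acad

After each execution, Sqoop records the new --last-value in its metastore, so the next run of the job imports only the rows added since.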
Procedure to perform an incremental import from MySQL to HDFS:
1. Start MySQL.
sudo service mysql start
mysql -u root -pcloudera

2. List the existing databases.
show databases;
Create a new database and switch to it.
create database data1;
use data1;
3. Create a table and insert values into it, as shown below.
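A minimal example (the table name acad matches the import command in step 4; the columns and values are only illustrative):

create table acad(id int primary key, name varchar(20));
insert into acad values(1, 'alpha');
insert into acad values(2, 'beta');
insert into acad values(3, 'gamma');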
4. With Sqoop up and running, fetch the data that is present in the MySQL table.
sqoop import --connect jdbc:mysql://localhost/data1 --username root --password cloudera --table acad -m 1 --target-dir /sqoopout
This command imports the table data into HDFS.
5. Check whether the data is stored in HDFS.
hadoop fs -ls /sqoopout/
To check the content of the file:
hadoop fs -cat /sqoopout/part-m-00000
This verifies that the data has moved from MySQL to HDFS. The next steps show how the same can be achieved when more rows are added.
6. Insert new values into the MySQL table. Re-running the import with the incremental options in step 7 retrieves only the newly inserted rows.
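For example, continuing the illustrative table above:

insert into acad values(4, 'delta');
insert into acad values(5, 'epsilon');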
7. Syntax of the incremental options for the Sqoop import command:
--incremental <mode>
--check-column <column_name>
--last-value <last_check_column_value>
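Putting the options together, a follow-up import might look like this (the check column id and last value 3 are assumptions based on the sample rows above):

sqoop import --connect jdbc:mysql://localhost/data1 --username root --password cloudera --table acad -m 1 --target-dir /sqoopout --incremental append --check-column id --last-value 3

Only the rows with id greater than 3 are imported, and Sqoop prints the new --last-value to use for the next run.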
8. Check the values in the HDFS file.
By following the above procedure, data can be imported incrementally from MySQL to HDFS every time, for any number of rows.