Tuesday 26 March 2019

Hadoop Interview Questions with Solutions

A start to the professional side: Hadoop interview questions along with their answers. For any suggestions or remarks, use the comment section.


Difficulty Level: Easy

Question 1:

What is the command to find the blocks for a file?

Answer:
hadoop fsck <path> -files -blocks -racks
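For example, for a hypothetical file /user/cloudera/data.txt:

hadoop fsck /user/cloudera/data.txt -files -blocks -racks

This lists the file's blocks, the locations of their replicas, and the racks on which those replicas sit.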


Difficulty Level: Medium

Question 2:

Can you join datasets in Spark? If so, how will you join datasets?

Answer:

Datasets can be joined in Spark using join operators, along with an optional join condition.

The join operators in Spark are:
● crossJoin
● join
● joinWith

Both the crossJoin and join operators return a DataFrame, whereas the joinWith operator returns a Dataset.
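A minimal sketch of the difference, assuming a SparkSession named spark and two illustrative case classes (all names here are assumptions):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class User(uid: Int, name: String)
case class Order(uid: Int, amount: Double)

val spark: SparkSession = SparkSession.builder().appName("joins").getOrCreate()
import spark.implicits._

val users: Dataset[User] = Seq(User(1, "a"), User(2, "b")).toDS()
val orders: Dataset[Order] = Seq(Order(1, 10.0), Order(1, 20.0)).toDS()

// join returns an untyped DataFrame
val joinedDf: DataFrame = users.join(orders, "uid")

// joinWith keeps the element types and returns a Dataset of pairs
val joinedDs: Dataset[(User, Order)] = users.joinWith(orders, users("uid") === orders("uid"))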

SQL can also be used to join the given datasets.

val sparksession: SparkSession = .....
sparksession.sql("select * from table1, table2 where table1.uid = table2.uid")
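For this to work, the datasets first have to be registered as temporary views under the names used in the SQL; a minimal sketch, assuming two DataFrames df1 and df2 (hypothetical names):

df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
val joined = sparksession.sql("select * from table1, table2 where table1.uid = table2.uid")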

The join condition (also known as a join expression) can be passed directly to the join operator, or applied afterwards with the where or filter operators.

t1.join(t2, $"t1Key" === $"t2Key")

The user can also specify the join type of the join operators through the optional joinType parameter.
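The joinType argument is a plain string such as "inner", "left_outer", or "full_outer"; continuing the illustrative users/orders sketch above:

// keep every user, even those without a matching order
val leftJoined: DataFrame = users.join(orders, Seq("uid"), "left_outer")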

Difficulty Level: Hard

Question 3:

Can you explain the procedure for performing an incremental import from MySQL to HDFS using Sqoop?

Answer:
Incremental import in Sqoop can be performed in two modes: append and lastmodified. The append mode is used when rows are continuously added to the source table with incrementing row id values. The user specifies that id column with --check-column, and Sqoop imports only the rows whose check-column value is greater than the value given with --last-value.

The lastmodified mode is used when rows of the source table may also be updated, with each update setting a last-modified timestamp column. Sqoop imports the rows whose timestamp is more recent than the specified --last-value.

After an incremental import, the value to pass as --last-value for the next import is printed on the screen. When running a subsequent import, the user should supply that --last-value so that only newly added or recently updated data is imported. This bookkeeping is handled automatically by creating a saved job, which is the preferred way of performing periodic incremental imports from MySQL to HDFS.
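A minimal sketch of such a saved job, reusing the database, table, and check column assumed in the procedure below:

sqoop job --create incremental_acad -- import --connect jdbc:mysql://localhost/data1 --username root --password cloudera --table acad --target-dir /sqoopout --incremental append --check-column id --last-value 0

sqoop job --exec incremental_acad

After each execution, Sqoop records the new --last-value in its metastore, so the next run of the job imports only the rows added since.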
Procedure to perform an incremental import from MySQL to HDFS:
1. Start MySQL.
sudo service mysql start
mysql -u root -pcloudera

2. List the existing databases.
show databases;
Create a new database and switch to it.
create database data1;
use data1;
3. Create a table and insert values into it, as shown below.
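A minimal example (the table name acad matches the import command in step 4; the columns and values are only illustrative):

create table acad(id int primary key, name varchar(20));
insert into acad values(1, 'alpha');
insert into acad values(2, 'beta');
insert into acad values(3, 'gamma');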
4. With Sqoop up and running, fetch the data that is present in the MySQL table.
sqoop import --connect jdbc:mysql://localhost/data1 --username root --password cloudera --table acad -m 1 --target-dir /sqoopout
This command imports the table data into HDFS.
5. Check whether the data is stored in HDFS.
hadoop fs -ls /sqoopout/
To check the content of the file:
hadoop fs -cat /sqoopout/part-m-00000
This verifies that the data has moved from MySQL to HDFS. The next steps show how the same can be achieved when more rows are added.
6. Insert new values into the MySQL table. Re-running the import with the incremental options in step 7 retrieves only the newly inserted rows.
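For example, continuing the illustrative table above:

insert into acad values(4, 'delta');
insert into acad values(5, 'epsilon');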
7. Syntax of the incremental options for the Sqoop import command:
--incremental <mode>
--check-column <column_name>
--last-value <last_check_column_value>
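Putting the options together, a follow-up import might look like this (the check column id and last value 3 are assumptions based on the sample rows above):

sqoop import --connect jdbc:mysql://localhost/data1 --username root --password cloudera --table acad -m 1 --target-dir /sqoopout --incremental append --check-column id --last-value 3

Only the rows with id greater than 3 are imported, and Sqoop prints the new --last-value to use for the next run.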
8. Check the values in the HDFS file.
By following the above procedure, data can be imported incrementally from MySQL to HDFS every time, for any number of rows.