Analyze use of compression & columnar storage type AssignmentTutorOnline | Good Grade Guarantee!
Week 8 Assignment 2 – Analyze use of compression and columnar storage types
Download the Airline On-Time performance data for the first quarter of any year (ex. 2014) to your PC, one month at a time. The description of the fields in the data is provided here.
Create the HDInsight Hadoop cluster using the default number of worker nodes. Choose Linux instead of Windows.
Using WinSCP, copy the 3 files to the cluster
Login to the cluster using ssh or putty
Verify that the 3 files exist on the cluster
Combine all 3 files into one file using the following 2 steps :
First remove the header line from the February and March files, by using the ‘sed’ command
Join the January, February and March files together into one file using the Linux ‘cat’ command.
Upload the combined data file to HDFS ( ex. to the /tmp directory)
Verify that the file has been uploaded into HDFS
From a browser, go to the cluster’s HDInsight Hive view
Create different tables with different names from the same data file stored in HDFS
a. Table1 stored as Textfile
b. Table2 stored as ORC
c. Table3 stored as RCFile
d. Table4 stored as Parquet (if available) or another format not mentioned above
From the Linux command window ( i.e. the ssh or putty window), run the HDFS commands to view the file sizes of the Hive tables
Go to the HDFS directory where Hive stores the files ( example /hive/warehouse)
Run the HDFS command to display the sizes of the directories in the /hive/warehouse directory. The command is “hdfs dfs -du -h /hive/warehouse.
Take screen shot of the file sizes
What did you notice about the file sizes?
What happened when you tried to display the contents of the various files inside the directories?
Start hive from the command line by typing the word ‘hive’
Using the hive commands you used in the Week 6 assignment, display the list of databases and then list of tables for the default database
Run the following queries with each of the airline data tables.
“describe formatted” to display the detailed table information. Note down the following information for each table
Number of rows
· Storage information especially the InputFormat and OutputFormat
What is the total number of records ?
What is the total number of flights in the month of January?
What is the total number of flights with arrival and departure delays over 15 minutes in each month? (Hint: see the dep_del15 and arr_del15 columns)
Write down the time it took to run each query for each type of table ( textfile, ORC, Parquet etc..)
Write down your observations about the timing of each query.
Exit the browser and ssh/putty window
Make sure that you have taken all screen shots for this assignment
Terminate the HDInsight cluster
Requirements for the assignments:
Provide screenshots of the queries and the time it took for each query for each table.
Assignment file must have a .doc or .docx extension; screen shots should be in .jpg, .gif, or .pdf format
Points for this assignment = 50
QUALITY: 100% ORIGINAL – NO PLAGIARISM.
- **REMEMBER TO PRECISE PAGE NUMBER**
- Hit The Order Button To Order A **Custom Paper**