1 Jul

PySpark source code (50%)


Assignment Task

This assignment consists of two deliverables:
• One code implementation (50%). The code file, in Jupyter Notebook format, and the relevant data set files should be placed in a folder named: Task 3-Your NameStudent_Number. The folder is then to be zipped and uploaded to Blackboard.
• A report (50%). The report must be uploaded as a separate file.

Part I – PySpark source code (50%)

Important Note: For code reproduction, your code must be self-contained. That is, it should not require any libraries other than the PySpark environment we have used in the workshops. The data files should be packaged properly with your code file.

In this component, we need to use Python 3 and PySpark to complete the following data analysis tasks:
Exploratory data analysis
Recommendation engine
Clustering

You need to choose a dataset from Kaggle (https://www.kaggle.com/datasets) to complete these tasks. Remember to include the data set file in your source code submission.

Note: In your notebook, please use a Heading 1 Markdown cell to separate each subtask.

Task I.1: Exploratory data analysis

This subtask requires you to explore your dataset by
• reporting its number of rows and columns,
• cleaning the data (missing values or duplicated records) if necessary,
• selecting 3 columns and drawing 1 plot (e.g. bar chart, histogram, boxplot, etc.) for each to summarise it.

Task I.2: Recommendation engine

This subtask requires you to implement a recommender system based on collaborative filtering with the Alternating Least Squares (ALS) algorithm. You need to include
• Model training and predictions
• Model evaluation using MSE

Task I.3: Classification

This subtask requires you to implement a classification system using logistic regression with the LogisticRegressionWithLBFGS class. You need to include
