Portfolio Details

Project Details

Project information

Category: Classification
Project date: April, 2020
Project URL: GitHub
Blog Post: Medium

Sparkify

Sparkify is a fictional music streaming service, similar to Spotify or Pandora. This project was completed as a requirement of the Udacity Data Scientist Nanodegree.

Project Requirements

The problem statement is Sparkify churn prediction. I use PySpark throughout the project so that it can be deployed on AWS.

Use PySpark as the primary programming language.
Write a blog post discussing the results.
Run the project on the AWS Spark platform with the 12 GB dataset.

Skills Required

Python

PySpark

pandas

Matplotlib

NumPy

Seaborn

DecisionTreeClassifier

GBTClassifier

RandomForestClassifier

Techniques

Major Tasks

Data Understanding: Performed exploratory data analysis on a local machine with a small (128 MB) Sparkify dataset provided by Udacity.
Data Preparation: Handled data challenges such as missing values, data cleaning, and imputing categorical variables, and defined the target variable that identifies churned customers.
Data Visualizations: Provided data visualization for a deeper understanding of Sparkify customers.
Modeling: Trained Decision Tree Classifier, Random Forest Classifier, and Gradient Boosting Classifier models and compared their results before choosing the winning model.
Evaluation Metric: The PySpark ML evaluators do not print a confusion matrix directly, so I use MulticlassClassificationEvaluator to compute accuracy and F1 score. BinaryClassificationEvaluator can also be used to compute the area under the ROC curve and the area under the precision-recall curve.
Results: After comparing all models on the different evaluation metrics, Gradient Boosting outperforms the other two. Its F1 score is 0.90, compared with 0.79 for Random Forest.
Deploy: Run the project on AWS Spark with the 12 GB dataset.
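
The Data Preparation task above mentions identifying a target variable for churn. As a rough illustration only, the sketch below derives a per-user churn label from event rows in plain Python. The field names (userId, page), the "Cancellation Confirmation" page value, and the sample rows are assumptions made for this example; the page above does not state how churn was actually defined in the project.

```python
from collections import defaultdict

# Hypothetical event log rows. The field names and the churn page value
# are assumptions for illustration, not taken from the project itself.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Thumbs Up"},
    {"userId": "", "page": "Home"},  # missing user id, dropped during cleaning
]


def label_churn(rows, churn_page="Cancellation Confirmation"):
    """Return {userId: 1 if the user ever reached churn_page, else 0}."""
    labels = defaultdict(int)
    for row in rows:
        user = row.get("userId")
        if not user:  # skip rows with a missing or empty user id
            continue
        # Reading labels[user] creates the key, so users with no churn
        # event still end up with an explicit label of 0.
        labels[user] = max(labels[user], int(row["page"] == churn_page))
    return dict(labels)


churn_labels = label_churn(events)
print(churn_labels)
```

In PySpark the same idea would typically be expressed as a filter on empty user ids followed by a per-user aggregation, but the exact columns depend on the dataset schema.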

I followed the CRISP-DM process throughout the project, which was completed in Jupyter Notebook.
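
To make the evaluation metrics listed under Major Tasks more concrete, here is a minimal plain-Python sketch of a confusion matrix, accuracy, and a support-weighted F1 score. It assumes, to the best of my understanding, that the "f1" metric of MulticlassClassificationEvaluator is the F1 of each label weighted by that label's frequency. The label lists are made up for illustration and are not the project's results.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs: a confusion matrix in sparse form."""
    return Counter(zip(y_true, y_pred))


def f1_for_label(counts, label):
    """F1 for one label, computed from true/false positives and negatives."""
    tp = counts[(label, label)]
    fp = sum(n for (actual, pred), n in counts.items()
             if pred == label and actual != label)
    fn = sum(n for (actual, pred), n in counts.items()
             if actual == label and pred != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average the per-label F1 scores, weighted by each label's support."""
    counts = confusion_counts(y_true, y_pred)
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_label(counts, label) * n / total
               for label, n in support.items())


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

matrix = confusion_counts(y_true, y_pred)
print(dict(matrix))
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Because churned users are usually a minority, a frequency-weighted F1 can look healthy even when the churn class itself is predicted poorly, so it is worth checking the per-label value from f1_for_label as well.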