Project information

  • Category: Classification
  • Project date: April 2020
  • Project URL: GitHub
  • Blog Post: Medium
Photo by Stephanie McCabe on Unsplash

Sparkify

Sparkify is a fictional music streaming service, similar to Spotify or Pandora. This project was completed as a requirement of the Udacity Data Scientist Nanodegree.

Project Requirements

I chose Sparkify churn prediction as the problem statement and used PySpark throughout the project so that it could be deployed on AWS.

  • Using PySpark (the Python API for Spark) as the primary tool.
  • Writing a blog post discussing the results.
  • Running the project on Spark on AWS with the 12 GB dataset.

Skills Required

Python
PySpark
pandas
Matplotlib
NumPy
Seaborn
DecisionTreeClassifier
GBTClassifier
RandomForestClassifier

Techniques

Major Tasks

  • Data Understanding: Performed exploratory data analysis on a local machine with the small (128 MB) Sparkify dataset provided by Udacity.
  • Data Preparation: Handled data challenges such as missing values, data cleaning, and imputing categorical variables, and identified a target variable to predict churned customers.
  • Data Visualizations: Provided data visualizations for a deeper understanding of Sparkify customers.
  • Modeling: Trained Decision Tree Classifier, Random Forest Classifier, and Gradient Boosting Classifier models and compared the outcomes before choosing the winning model.
  • Evaluation Metric: The evaluators in the DataFrame-based pyspark.ml API do not return a confusion matrix, so we use MulticlassClassificationEvaluator to evaluate accuracy and F1 score. We can also use BinaryClassificationEvaluator to calculate the area under the ROC curve and the area under the precision-recall curve.
  • Results: After comparing all models on the different evaluation metrics, Gradient Boosting performed better than the other two. Its F1 score is 0.90, while the Random Forest F1 score is 0.79.
  • Deploy: Run the project on Spark on AWS with the 12 GB dataset.
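
To make the evaluation step more concrete, here is a minimal plain-Python sketch of what a confusion matrix and a weighted F1 score measure. It is not the project's code: the function names and the toy labels are made up for illustration. As far as I understand, the "f1" metric of MulticlassClassificationEvaluator is this kind of support-weighted average of per-class F1 scores, and in PySpark the same (label, prediction) counts could be obtained by grouping the predictions DataFrame on those two columns.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count each (true label, predicted label) pair.

    The resulting Counter is the confusion matrix in sparse form.
    """
    return Counter(zip(labels, predictions))


def weighted_f1(labels, predictions):
    """Average per-class F1, weighting each class by its share of true labels."""
    counts = confusion_counts(labels, predictions)
    classes = sorted(set(labels) | set(predictions))
    total = len(labels)
    score = 0.0
    for c in classes:
        tp = counts[(c, c)]
        fp = sum(counts[(other, c)] for other in classes if other != c)
        fn = sum(counts[(c, other)] for other in classes if other != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        support = tp + fn  # number of true examples of class c
        score += f1 * support / total
    return score


# Toy data, invented for illustration: 1 = churned, 0 = stayed.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 0, 1]

matrix = confusion_counts(labels, predictions)
score = weighted_f1(labels, predictions)

for (label, prediction), n in sorted(matrix.items()):
    print(f"label={label} prediction={prediction} count={n}")
print(f"weighted F1 = {score:.2f}")
```

Because churned users are usually the minority class, the weighted F1 score is a more informative comparison between models than plain accuracy, which is one reason to report both.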

I followed the CRISP-DM process throughout the project, which was completed in Jupyter Notebook.