
Sparkify
Sparkify is a fictional music streaming service similar to Spotify or Pandora. This project was completed as part of the Udacity Data Scientist Nanodegree requirements.
Project Requirements
The problem statement is Sparkify churn prediction. PySpark is used throughout the project so that it can be deployed on AWS.
- Using PySpark as the primary programming language.
- Writing a blog post discussing the results.
- Running the project on the AWS Spark platform with the 12 GB dataset.
Skills Required
Python
PySpark
pandas
Matplotlib
NumPy
Seaborn
DecisionTreeClassifier
GBTClassifier
RandomForestClassifier
Techniques
Major Tasks
- Data Understanding: Exploratory data analysis on a local machine with the small (128 MB) Sparkify dataset provided by Udacity.
- Data Preparation: Handled missing values, cleaned the data, imputed categorical variables, and defined a target variable that identifies churned customers.
- Data Visualizations: Visualized the data for a deeper understanding of Sparkify customers.
- Modeling: Trained Decision Tree, Random Forest, and Gradient Boosting classifier models and compared their results before choosing the winning model.
- Evaluation Metric: PySpark's DataFrame-based evaluators return a single metric rather than printing a confusion matrix, so I used MulticlassClassificationEvaluator to compute accuracy and F1 score, and BinaryClassificationEvaluator to compute the area under the ROC curve and the area under the precision-recall curve.
- Results: After comparing all models across the evaluation metrics, Gradient Boosting performed better than the other two. Its F1 score is 0.90, while the Random Forest F1 score is 0.79.
- Deploy: Run the project on AWS Spark with the 12 GB dataset.
I followed the CRISP-DM process throughout the project, which was completed in a Jupyter Notebook.