TikTok Capstone for Google Data Analytics (Coursera)

At a Glance

Research Goal:

Build a classification model to make predictions on whether video content is labelled "claim" or "opinion" to make moderation efforts easier.

Tech stack:

Python (numpy, pandas, sklearn, matplotlib, seaborn)

Timeframe:

1 month

Size and source of dataset:

Google-provided dataset, over 19,000 records

Key methods or models:

Correlation matrix, logistic regression

Headline results:

A logistic regression model correctly classified 69% of CLAIM videos and 66% of OPINION videos.

Access:

See the Kaggle project for the full project rundown.

Understanding the Business Question

In this project, I was tasked with building a classification model to predict whether a video posted on TikTok would be categorized as a "CLAIM" or an "OPINION" video, with the goal of making content moderation easier for the moderation team. The model would be trained on the provided dataset.

Importing and Cleaning the Data

After importing the dataset, I examined the data to see what variables I was working with. I also removed any "NULL" values to reduce errors when calculating statistics.
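As a minimal sketch of this inspection and cleaning step, the snippet below uses a tiny stand-in DataFrame rather than the real dataset. The column names are assumptions made for illustration, and dropping whole rows is one possible way to remove missing values, not necessarily the exact approach used in the project.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real dataset; the column names are assumptions.
data = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_view_count": [1000.0, 250.0, np.nan, 4000.0],
    "video_like_count": [100.0, 5.0, np.nan, np.nan],
})

# Inspect the columns and dtypes, then count missing values per column.
data.info()
missing_per_column = data.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
clean = data.dropna(axis=0).reset_index(drop=True)
print(clean.shape)
```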

EDA and Preparing the Data

After the initial cleaning and verification, I ran summary analyses to look for any other anomalies and to check the distribution of the primary variable I would be predicting (claim vs opinion). Knowing how balanced the two classes are is particularly important when training a classification model. I also took this time to remove any variables that were not relevant to the core objective.

I also wanted to create new variables to better measure content performance, for example: likes_per_view, comments_per_view, and shares_per_view. These would help better quantify engagement for content.
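The per-view variables could be derived along these lines. This is a sketch on made-up numbers: the raw column names (video_view_count and so on) are assumptions, and treating a zero view count as missing is a choice made here to avoid division by zero, not something the project states.

```python
import pandas as pd

# Made-up example rows; the raw column names are assumptions.
videos = pd.DataFrame({
    "video_view_count": [1000, 200, 0],
    "video_like_count": [100, 50, 0],
    "video_comment_count": [10, 2, 0],
    "video_share_count": [20, 4, 0],
})

# Replace zero view counts with NaN so the ratios below are NaN
# instead of raising or producing infinities.
views = videos["video_view_count"].where(videos["video_view_count"] > 0)

videos["likes_per_view"] = videos["video_like_count"] / views
videos["comments_per_view"] = videos["video_comment_count"] / views
videos["shares_per_view"] = videos["video_share_count"] / views

print(videos[["likes_per_view", "comments_per_view", "shares_per_view"]])
```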

Preparing Data for Model Training

Before I could train the model, I needed to decide which variables to use as features. To do this, I computed a correlation matrix to check the association between variables.
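A correlation matrix of this kind can be computed with pandas. The sketch below runs on synthetic data in which likes and comments are deliberately generated to move together; the column names and the numbers are assumptions for illustration only and do not reproduce the project's results.

```python
import numpy as np
import pandas as pd

# Synthetic data: likes depend on views, comments depend on likes.
rng = np.random.default_rng(0)
n = 200
views = rng.integers(100, 10_000, size=n)
likes = views * 0.3 + rng.normal(0, 50, size=n)
comments = likes * 0.1 + rng.normal(0, 5, size=n)
duration = rng.integers(5, 60, size=n)

videos = pd.DataFrame({
    "video_view_count": views,
    "video_like_count": likes,
    "video_comment_count": comments,
    "video_duration_sec": duration,
})

# Pairwise Pearson correlation between all numeric columns.
corr = videos.select_dtypes(include="number").corr()
print(corr.round(2))
```

Pairs of variables with a correlation close to 1 or -1 carry largely the same information, which is what makes them candidates for merging or dropping before fitting a logistic regression. The matrix can also be visualised with seaborn's heatmap function.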

Because I was going to use a simple logistic regression model, I decided to group "likes" and "comments" into a single variable, "engagement_score", to help reduce multicollinearity.
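The write-up does not say how the two variables were combined, so the sketch below simply assumes a plain sum of the like and comment counts. The column names are also assumptions; a weighted sum or an average of standardized values would be equally plausible ways to build such a score.

```python
import pandas as pd

# Made-up example rows; column names are assumptions.
videos = pd.DataFrame({
    "video_like_count": [100, 50, 0],
    "video_comment_count": [10, 2, 1],
})

# Assumption: the score is the plain sum of likes and comments.
videos["engagement_score"] = (
    videos["video_like_count"] + videos["video_comment_count"]
)

# Drop the two original columns so the model does not see
# two strongly correlated predictors alongside the combined one.
features = videos.drop(columns=["video_like_count", "video_comment_count"])
print(features)
```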

After computing another correlation matrix, I picked out "engagement_score" and "author_ban_status" as feature variables.

Training the Model

Given that we will be using a simple logistic regression model, we only need one of "claim_status_claim" or "claim_status_opinion", since each is the exact opposite of the other. I'm choosing "claim_status_claim", which is to say that, going forward, data is marked TRUE if the video has been classified as CLAIM, and FALSE if the video has been classified as OPINION.
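One way to arrive at these two indicator columns is one-hot encoding with pandas, sketched below on a three-row example. The name of the original column (claim_status) is an assumption inferred from the names of the encoded columns.

```python
import pandas as pd

# Made-up example; "claim_status" as the source column is an assumption.
videos = pd.DataFrame({
    "claim_status": ["claim", "opinion", "claim"],
    "engagement_score": [110, 52, 300],
})

# One-hot encode the label into boolean indicator columns.
encoded = pd.get_dummies(videos, columns=["claim_status"], dtype=bool)
print(encoded.columns.tolist())

# The two indicators are exact opposites, so keep only one as the target.
y = encoded["claim_status_claim"]
X = encoded.drop(columns=["claim_status_claim", "claim_status_opinion"])
print(y.tolist())
```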

Given the nature of the task, precision is a suitable metric for judging the success of the model, as we care most about how many of the videos the model flags as CLAIM are true positives. On the validation data, the model yielded a precision score of 71.5%.
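The training and validation step can be sketched with scikit-learn as follows. Everything here is illustrative: the data is synthetic, the 60/20/20 split and the encoded column name author_ban_status_banned are assumptions, and the resulting score will not match the figures reported in this write-up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic data: claims (1) tend to have higher engagement and more bans.
rng = np.random.default_rng(42)
n = 500
is_claim = rng.integers(0, 2, size=n)
engagement = rng.normal(loc=1.0 + 2.0 * is_claim, scale=1.0)
banned = (rng.random(n) < 0.1 + 0.2 * is_claim).astype(int)

X = pd.DataFrame({
    "engagement_score": engagement,
    "author_ban_status_banned": banned,  # hypothetical encoded column
})
y = pd.Series(is_claim, name="claim_status_claim")

# Assumed 60/20/20 split: hold back a test set, then carve out validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Precision: of the videos predicted as claims, the share that are claims.
val_precision = precision_score(y_val, model.predict(X_val))
print(f"validation precision: {val_precision:.3f}")
```

The test set is left untouched here on purpose, so that it can give an unbiased estimate once the model choice is final.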

Testing the Model

Next, it's time to test the model using the test data that was set aside when splitting the data.

The model performed slightly worse on all measures using the testing data, resulting in a precision score of 68.6%.

Results

Finally, a classification report was generated from the test predictions. The report shows that the classification model was able to correctly classify 69% of CLAIM videos, and 66% of OPINION videos. These metrics indicate that the model performs reasonably well, but should not be used as the sole classification method.
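To show how such a report is read, here is a small hand-checkable sketch using scikit-learn's classification_report. The eight labels and predictions are invented for illustration and are unrelated to the project's actual results.

```python
from sklearn.metrics import classification_report

# Invented labels: 1 = claim, 0 = opinion.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# Labels are sorted as [0, 1], so target_names follows that order.
names = ["opinion", "claim"]
print(classification_report(y_true, y_pred, target_names=names))

# The same report as a dictionary, for programmatic access.
report = classification_report(
    y_true, y_pred, target_names=names, output_dict=True
)
claim_precision = report["claim"]["precision"]
claim_recall = report["claim"]["recall"]
```

In the report, per-class precision answers "of the videos predicted as this class, how many really were", while recall answers "of the videos that really are this class, how many were found". In this toy example both are 3 out of 4 for each class.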