TikTok Capstone for Google Data Analytics (Coursera)
At a Glance
Research Goal:
Build a classification model to make predictions on whether video content is labelled "claim" or "opinion" to make moderation efforts easier.

Tech stack:
- Python (numpy, pandas, sklearn, matplotlib, seaborn)

Timeframe:
1 month

Size and source of dataset:
Google provided dataset, over 19,000 responses

Key methods or models:
Correlation matrix, logistic regression

Headline results:
A logistic regression model was able to accurately classify 69% of CLAIM videos and 66% of OPINION videos.

Access:
Understanding the Business Question
In this project, I was tasked with building a classification model to predict whether a video posted on TikTok would be categorized as a "CLAIM" or an "OPINION" video, with the goal of making content moderation easier for the moderation team. The task involved using a dataset to train that model.
Importing and Cleaning the Data
After importing the dataset, I examined the data to see what variables I was working with. I also removed any rows containing "NULL" values, to reduce errors when calculating statistics.
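The inspection and null-removal step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the small in-memory table stands in for the real dataset, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset, which would normally be loaded
# from a file (for example with pd.read_csv). Column names are assumed.
raw = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_view_count": [1000.0, 250.0, np.nan, 40.0],
    "video_like_count": [100.0, 5.0, np.nan, 2.0],
})

# Inspect column names, dtypes, and non-null counts.
raw.info()

# Count missing values per column.
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
clean = raw.dropna(axis=0).reset_index(drop=True)
print(clean.shape)
```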


EDA and Preparing the Data
After the initial cleaning and verification, I ran summary analyses to look for any remaining anomalies and to check the distribution of the primary variable I would be predicting (claim vs. opinion). This is particularly important when training a classification model, since a heavily imbalanced target can skew its predictions. I also took this time to remove any variables that were not relevant to the core objective.
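A class-balance check of this kind might look like the sketch below. The toy data, the column names, and the choice of "video_id" as an irrelevant column are all assumptions made for illustration.

```python
import pandas as pd

# Toy data; column names are assumed, not taken from the real dataset.
df = pd.DataFrame({
    "video_id": [1, 2, 3, 4, 5],
    "claim_status": ["claim", "opinion", "claim", "opinion", "claim"],
    "video_view_count": [1000, 250, 40, 900, 310],
})

# Summary statistics for the numeric columns.
print(df.describe())

# Share of each class in the target variable.
class_share = df["claim_status"].value_counts(normalize=True)
print(class_share)

# Drop a column that carries no predictive meaning (hypothetical choice).
df = df.drop(columns=["video_id"])
```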


I also wanted to create new variables to better measure how content performs, for example: likes_per_view, comments_per_view, and shares_per_view. These help quantify engagement relative to how many people saw the content.
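As a rough sketch, the per-view ratios can be derived by dividing each count by the view count. The raw column names below are assumptions; only the three new variable names come from the write-up.

```python
import pandas as pd

# Toy data with assumed raw column names.
df = pd.DataFrame({
    "video_view_count": [1000, 200, 50],
    "video_like_count": [100, 10, 5],
    "video_comment_count": [10, 2, 0],
    "video_share_count": [20, 4, 1],
})

# Each ratio expresses an engagement count relative to views.
df["likes_per_view"] = df["video_like_count"] / df["video_view_count"]
df["comments_per_view"] = df["video_comment_count"] / df["video_view_count"]
df["shares_per_view"] = df["video_share_count"] / df["video_view_count"]

print(df[["likes_per_view", "comments_per_view", "shares_per_view"]])
```

Rows with zero views would produce infinite or undefined ratios, so real data may need those rows filtered or handled first.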

Preparing Data for Model Training
Before training the model, I needed to decide which variables to use as features. To do this, I computed a correlation matrix to check the association between variables.
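A correlation matrix like the one described can be computed directly with pandas. The sketch below uses made-up numbers and assumed column names.

```python
import pandas as pd

# Toy numeric data; values and column names are illustrative only.
df = pd.DataFrame({
    "video_like_count": [10, 20, 30, 40, 50],
    "video_comment_count": [1, 2, 3, 4, 6],
    "video_duration_sec": [30, 12, 45, 8, 20],
})

# Pairwise Pearson correlation between all columns.
corr = df.corr()
print(corr.round(2))
```

Passing `corr` to seaborn's `heatmap` is a common way to visualize the result; pairs with a correlation close to 1 or -1 are candidates for merging or dropping.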

Because I was going to use a simple logistic regression model, I decided to group "likes" and "comments" into a single variable, "engagement_score", to help reduce multicollinearity.
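One way to combine the two variables is sketched below. The write-up does not say how "engagement_score" was calculated, so the simple sum used here is an assumption, as are the raw column names.

```python
import pandas as pd

# Toy data; column names are assumed.
df = pd.DataFrame({
    "video_like_count": [10, 20, 30, 40, 50],
    "video_comment_count": [1, 2, 3, 4, 6],
    "video_duration_sec": [30, 12, 45, 8, 20],
})

# The two counts are strongly correlated with each other.
pair_corr = df["video_like_count"].corr(df["video_comment_count"])
print(round(pair_corr, 3))

# Assumed definition: engagement_score is the sum of likes and comments.
df["engagement_score"] = df["video_like_count"] + df["video_comment_count"]

# Keep the combined variable and drop the two correlated originals.
features = df.drop(columns=["video_like_count", "video_comment_count"])
print(features.corr().round(2))
```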

After running another correlation matrix, I picked out "engagement_score" and "author_ban_status" as feature variables.
Training the Model
Given that we will be using a simple logistic regression model, we only need one of "claim_status_claim" or "claim_status_opinion", since each is the complement of the other. I'm choosing "claim_status_claim", which is to say that, going forward, a row is marked TRUE if the video has been classified as CLAIM and FALSE if it has been classified as OPINION.
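The encoding and training steps might look something like the sketch below. It runs on synthetic data, and the 60/20/20 train/validation/test split, the random seeds, and the category values for "author_ban_status" are assumptions rather than details from the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
n = 500
claim_status = rng.choice(["claim", "opinion"], size=n)
is_claim = claim_status == "claim"
df = pd.DataFrame({
    "claim_status": claim_status,
    "engagement_score": np.where(
        is_claim, rng.normal(5.0, 2.0, n), rng.normal(1.5, 1.0, n)
    ),
    "author_ban_status": rng.choice(["active", "under review", "banned"], size=n),
})

# Encoding the label yields two complementary columns; keep only one.
label_dummies = pd.get_dummies(df["claim_status"], prefix="claim_status")
y = label_dummies["claim_status_claim"].astype(bool)

# Encode the categorical feature, dropping one level to avoid redundancy.
X = pd.get_dummies(
    df[["engagement_score", "author_ban_status"]],
    columns=["author_ban_status"],
    drop_first=True,
).astype(float)

# Assumed split: 60% train, 20% validation, 20% test.
X_tr, X_test, y_tr, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tr, y_tr, test_size=0.25, stratify=y_tr, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_val, y_val))
```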



Given the nature of the task, precision is a suitable metric for judging the success of the model: it measures what share of the videos the model labels as CLAIM are true positives. On the validation data, the model yielded a precision score of 71.5%.
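To make the metric concrete, the sketch below computes precision by hand and with scikit-learn on a small set of made-up labels. The numbers are illustrative and unrelated to the project's results.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = CLAIM, 0 = OPINION.
y_val = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]

# Count true positives and false positives manually.
tp = sum(1 for t, p in zip(y_val, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_val, y_pred) if t == 0 and p == 1)

# precision = TP / (TP + FP)
manual_precision = tp / (tp + fp)

precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
print(manual_precision, precision, recall)
```

Precision answers "of the videos predicted CLAIM, how many really were claims?", while recall answers "of the real claims, how many did the model find?".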
Testing the Model
Next, it's time to test the model using the test data that was set aside when splitting the data.


The model performed slightly worse on all measures using the testing data, resulting in a precision score of 68.6%.
Results
Finally, a classification report was generated from the test-set metrics. The report shows that the classification model was able to correctly classify 69% of CLAIM videos and 66% of OPINION videos. These metrics indicate that the model performs reasonably well, but it should not be used as the sole classification method.
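A per-class report of this kind can be produced with scikit-learn's `classification_report`. The sketch below uses made-up labels, so its numbers do not correspond to the results above.

```python
from sklearn.metrics import classification_report

# Made-up labels: 1 = CLAIM, 0 = OPINION.
y_test = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# target_names follow the sorted label order: 0 -> opinion, 1 -> claim.
names = ["opinion", "claim"]

# Human-readable table with precision, recall, and F1 per class.
print(classification_report(y_test, y_pred, target_names=names))

# The same report as a dictionary, convenient for further processing.
report = classification_report(
    y_test, y_pred, target_names=names, output_dict=True
)
print(report["claim"]["precision"], report["opinion"]["recall"])
```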
