Can You Predict a YouTube Video’s CTR using Machine Learning?

George Paskalev
5 min read · Mar 3, 2021
Photo by CardMapr from Unsplash

Social media strategists have been trying to crack and reverse-engineer the YouTube algorithms for years. Tech companies like VidIQ have been pushing products that help YouTubers optimize their videos with the hope of pleasing the algorithm Gods and getting their content in front of more eyeballs. While all of that hard work has produced highly effective best practices and tools that guide creators on the path to success, a lot of it relies on outdated assumptions about what actually influences a video’s performance.

Most people consider titles and tags to be the leading factors, when in fact tags have little relevance these days. Titles, however, are hugely important: not to trick the algorithms, but to entice the human viewer to click on the video. Combined with thumbnails, titles are the key to a high click-through rate (CTR) for any YouTube video. It wasn’t until fairly recently that YouTube started sharing CTR data with its creators. That particular data point is not public, and very few entities out there have access to CTR data for multiple channels. Through my line of work, I’m fortunate enough to oversee over a thousand well-established YouTube creators, so I decided to explore the possibility of predicting CTR with machine learning models. For starters, I’m using just the title as my predictor; analyzing image data from thumbnails would be far more computationally expensive, and it’s something I’ll dive into in the future.

I’m approaching this task as both a classification and a regression problem. In an ideal world, a regression model would predict the exact click-through rate with a minimal margin of error (realistically, anything within 1% error would be acceptable). As for classification, there are numerous ways to bucket CTRs into classes; I’ll share the steps I went through for my experiment.

Before I dive into the specifics of my modeling, I want to point out that it’s most effective to perform any analysis and training on a group of videos of the same genre/vertical. The dataset I created consists of over 15,000 videos from 50 established channels in the comedy/prank category.

Pre-Processing with NLP Techniques

When you work with YouTube titles, every genre has its own rules and conventions for how creators title their content. The comedy/prank vertical uses controversial language and the occasional emoji, and usually specifies in the title that the video is a prank. These are things you either know from field expertise or learn from an extensive EDA.

To start, I applied some pretty standard NLP cleaning techniques: turned all titles to lowercase, removed punctuation, took out the emojis, then proceeded to tokenization and lemmatization.
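
Here’s a minimal sketch of that cleaning step with NLTK (the helper name and the regex-based emoji removal are my own simplifications, not my exact code):

```python
import re
import string

# Requires: nltk.download("punkt"); nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

def clean_title(title):
    """Lowercase, strip punctuation and emojis, then tokenize and lemmatize."""
    title = title.lower()
    title = title.translate(str.maketrans("", "", string.punctuation))
    title = re.sub(r"[^\x00-\x7f]", " ", title)  # crude emoji/non-ASCII removal
    return [lemmatizer.lemmatize(token) for token in word_tokenize(title)]

print(clean_title("He PRANKED his roommate… again! 😂"))
# ['he', 'pranked', 'his', 'roommate', 'again']
```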

For my EDA, I also dipped into some topic modeling with LDA and NMF to explore the most prevalent themes within my dataset. Here’s the topic distribution I derived using NMF (much faster to run than LDA, and I was happier with the clusters of words it identified):
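
For reference, the NMF step boils down to something like this in scikit-learn (the component count and vectorizer settings here are illustrative, not my exact values):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# `titles` holds the cleaned title strings from the previous step.
vectorizer = TfidfVectorizer(max_features=2000, ngram_range=(1, 2))
X = vectorizer.fit_transform(titles)

nmf = NMF(n_components=8, random_state=42).fit(X)

# Show the top ten terms for each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, component in enumerate(nmf.components_):
    top = [terms[i] for i in component.argsort()[-10:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")
```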

Picking My Classes

As my starting point, I took the average CTR for each channel and pulled statistics on the entire set of channel averages. The mean was 4.5% with a standard deviation of 1.6%, so I set my cutoff at one standard deviation above the mean (roughly 6.1%): anything below it I labeled Low CTR, and everything above it High CTR.
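
In pandas, that labeling is essentially a one-liner (toy numbers below; with my data the cutoff worked out to roughly 6.1%):

```python
import numpy as np
import pandas as pd

# Per-channel average CTRs (in percent) drive the cutoff.
channel_avg_ctr = pd.Series([3.2, 4.1, 4.5, 5.0, 6.8])
cutoff = channel_avg_ctr.mean() + channel_avg_ctr.std()

# Label each individual video against that cutoff.
videos = pd.DataFrame({"ctr": [2.1, 3.8, 5.0, 6.9, 8.2]})
videos["label"] = np.where(videos["ctr"] > cutoff, "High CTR", "Low CTR")
print(videos["label"].value_counts())
```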

It’s worth mentioning that this led to a noticeable class imbalance, with Low CTR becoming the majority class. I eventually took care of that as I went through various iterations of my models.

Classification Models

I ran ten vanilla classifiers as my baseline models: five with a Count Vectorizer and the same five with TF-IDF; TF-IDF performed better. For my primary metric, I focused on accuracy on the test set. Since there’s no particular cost associated with too many false positives or false negatives, I think accuracy is the most appropriate metric for this type of problem. You can find the types of models and their respective results in the table below:
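
For illustration, a single baseline run looks roughly like this (the five model types shown are stand-ins, not necessarily the exact five I used):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# `titles` are the cleaned title strings; `labels` the Low/High CTR classes.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, stratify=labels, random_state=42
)

for Vec in (CountVectorizer, TfidfVectorizer):
    for Model in (LogisticRegression, MultinomialNB, LinearSVC,
                  RandomForestClassifier, SGDClassifier):
        pipe = make_pipeline(Vec(), Model()).fit(X_train, y_train)
        print(Vec.__name__, Model.__name__,
              round(pipe.score(X_test, y_test), 3))  # test-set accuracy
```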

For the subsequent iterations, I followed some pretty standard optimization steps:

  • Ran GridSearch to find optimal hyperparameters for the best-performing vanilla models. In most cases, even with the best params, the models were badly overfit and biased towards the majority class. However, I was able to identify that running the SGDClassifier as a logistic regression (with ‘log’ as its loss function) was more promising than the rest (see the sketch after this list).
  • Removed features (unigrams/bigrams) that could contribute to class bias. There are various ways to go about it. One strategy is to use feature importances from the Random Forest classifier.
  • Played around with various ways to structure my classes, experimenting with binary, ternary, and multi-class classification, always aiming to even out the initial class imbalance.
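
Here’s roughly what that grid search over the SGDClassifier looked like (the parameter grid is illustrative; newer scikit-learn versions spell the loss ‘log_loss’ rather than ‘log’):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # loss="log" makes SGDClassifier a logistic regression
    # trained with stochastic gradient descent.
    ("clf", SGDClassifier(loss="log", random_state=42)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-5, 1e-4, 1e-3],
    "clf__penalty": ["l2", "elasticnet"],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```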

My final classifier ended up being a logistic regression trained with SGD, performing binary classification with a 6% CTR cut-off point between Low and High.

Regression Models

The vanilla regressor I trained was a Random Forest. I used the same pre-processed data that went into training my vanilla classifiers, with TF-IDF as the vectorizer. The initial RMSE came out at 3.512%, which isn’t precise enough.

Next, to switch things up, I explored a different NLP technique known as Word2Vec, which maps each word to its own vector. I also turned each title, in its complete form, into its own vector. Here’s what that looked like after I plotted it:

Had to cover up an inappropriate word. I told you prank content uses provocative language.
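
A minimal gensim sketch of that step; averaging a title’s word vectors to get the title vector is a common approach, and an assumption on my part here:

```python
import numpy as np
from gensim.models import Word2Vec

# `tokenized_titles` is a list of token lists, e.g. [["pranked", "roommate"], ...]
w2v = Word2Vec(sentences=tokenized_titles, vector_size=100,
               window=5, min_count=2, workers=4)

def title_vector(tokens):
    """Average the vectors of the title's in-vocabulary words."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X_w2v = np.vstack([title_vector(t) for t in tokenized_titles])
```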

With the newly pre-processed data, I trained an XGB regressor and another iteration of the Random Forest. The XGB regressor returned an RMSE of 3.48%, and the Random Forest showed a very slight improvement — from 3.512% to 3.50%.
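
The comparison itself is straightforward (default hyperparameters shown here, which is a simplification):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# `ctr_values` holds each video's actual CTR, in percent.
X_train, X_test, y_train, y_test = train_test_split(
    X_w2v, ctr_values, random_state=42
)

for model in (XGBRegressor(random_state=42),
              RandomForestRegressor(random_state=42)):
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(type(model).__name__, f"RMSE: {rmse:.3f}%")
```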

As I mentioned above, I did not expect perfect classification without considering thumbnail data, so I’m happy with the ability to predict whether a video will have a CTR lower or higher than 6% with a little over 70% accuracy. That’s only scratching the surface!
