Recommendation systems guide our experiences all over the Internet. From the products Amazon puts in front of us when we search their platform to the tens of videos YouTube suggests that we watch next, leading to late-night binge sessions, it’s all driven by recommendation algorithms. They play a crucial role in what we purchase and consume and how we shape our opinions on various topics. If you’ve watched the recently popular documentary “The Social Dilemma,” you know that along with their many benefits, recommender systems also come with a price.
There are different types of recommender systems: some are relatively basic and impersonal, recommending items based on popularity, while more personalized suggestions typically come from algorithms utilizing content-based or collaborative filtering methods. Collaborative filtering is the more sophisticated and exciting methodology that relies on past interactions between users and items, as well as user similarities. By contrast, content-based systems take into account the properties or features of the items. In the example of song recommendation, a content-based system will consider whether a song belongs to a specific genre, whether it has explicit lyrics, who the artist is, and so on.
I will walk you through a few approaches to build a simple content-based recommender system for songs. Our system will have no prior knowledge of what music we like or what music we’ve listened to in the past. All that we’ll input as a starting point will be a song title and maybe an artist. The goal is to return a list of recommendations containing sonically similar tracks.
To make things easier, I’m using a Kaggle dataset of 170,000 songs pulled via Spotify’s API. Most of the features describe the song’s sound, for example, tempo, key, valence, etc. I’m skipping the EDA and data preparation part, so let’s get to the meat of things. I’ll use custom functions for each variation of the recommender. Here’s a preview of what our DataFrame looks like:
Recommendations from the Same Artist
Let’s start by getting some recommended songs by the same artist. I’m instantiating a CountVectorizer from scikit-learn, which lets me encode artists’ names as vectors.
The output is a sparse matrix: if you try type(asm), it will show you “scipy.sparse.csr.csr_matrix.” In our case, it’s a pretty large matrix. I converted it into a DataFrame that has thousands of columns to show you how the word frequencies are organized:
Next, I need to use a metric that measures how similar these vectors are. The best choice, in my opinion, is cosine similarity. You can pass either a DataFrame or a sparse matrix to do the calculations. Here’s a fantastic article that teaches you more about this similarity metric, the math behind it, and its application in Python.
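As a quick illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths, so identical vectors score 1 and vectors with no shared tokens score 0. The sketch below uses a small hand-written count table (hypothetical values) and shows that scikit-learn's `cosine_similarity` gives the same answer for a DataFrame and for a sparse matrix.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical token counts: rows are songs, columns are artist-name tokens
counts = pd.DataFrame(
    [[0, 0, 1, 1],   # a song by "The Weeknd"
     [0, 0, 1, 1],   # another song by "The Weeknd"
     [1, 1, 0, 0]],  # a song by "Missy Elliott"
    columns=["elliott", "missy", "the", "weeknd"],
)

# Pairwise similarity between every pair of rows, from a DataFrame...
sim_from_df = cosine_similarity(counts)

# ...and from the equivalent sparse matrix
sim_from_sparse = cosine_similarity(csr_matrix(counts.to_numpy()))

print(np.round(sim_from_df, 2))
```

The result is a square matrix with one row and one column per song, which is the lookup table the recommender needs.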
Let’s explore our recommendation function above. It takes two inputs: a song title and the number of suggestions we want it to return. It first uses the song title to grab the index for that song entry in the DataFrame. Along with it, we also grab the artist’s name to display at the end. Then, we convert the array of similarity scores for that song into a list and sort it from highest to lowest similarity. In this case, the top entries are the ones where the similarity is equal to 1, meaning the artist vectors are identical. Lastly, we grab the top n results and return them as the output of our function.
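For readers who want something to run, the steps described here could be sketched roughly as follows. This is my reconstruction under assumptions, not necessarily the exact function from the article: the toy data, the column names `name` and `artists`, and the function name `recommend_by_artist` are all hypothetical, and I chose to skip the input song itself in the results.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical); a default 0..n-1 index keeps row labels
# aligned with positions in the similarity matrix
songs = pd.DataFrame({
    "name": ["Starboy", "Call Out My Name", "The Hills", "Work It", "Get Ur Freak On"],
    "artists": ["The Weeknd", "The Weeknd", "The Weeknd", "Missy Elliott", "Missy Elliott"],
})

asm = CountVectorizer().fit_transform(songs["artists"])
similarity = cosine_similarity(asm)


def recommend_by_artist(title, n=5):
    """Return up to n other songs whose artist vector is closest to the input song's."""
    matches = songs.index[songs["name"] == title]
    if len(matches) == 0:
        raise ValueError(f"{title!r} was not found in the data")
    idx = matches[0]
    artist = songs.loc[idx, "artists"]

    # Pair every song index with its similarity to the input song,
    # then sort from most to least similar (ties keep their original order)
    scores = sorted(enumerate(similarity[idx]), key=lambda pair: pair[1], reverse=True)

    # Skip the input song itself and keep the top n
    top = [i for i, _ in scores if i != idx][:n]

    print(f"Recommendations based on {title!r} by {artist}:")
    return songs.loc[top, ["name", "artists"]]


recommendations = recommend_by_artist("Starboy", n=2)
print(recommendations)
```

Because every song by the same artist has a similarity of exactly 1, the order among those songs is arbitrary; here it simply follows the order of the rows in the table.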
Let’s test it out by using The Weeknd’s “Starboy” as the input. Here are the recommendations we get back:
Wow, somehow we avoided getting “Blinding Lights” thrown at us! But the results are accurate. This was pretty straightforward. Let’s expand beyond the artist’s own discography.
Recommendations Based on Similar Sound
For this new function, we ask for three inputs: a DataFrame containing songs data, a song title, and an artist name. Since I’m working with a pre-made DataFrame rather than Spotify’s API, the function uses a try statement to check whether our song exists in the data we used as input. If it doesn’t, we get a print statement letting us know.
First, we slice our DataFrame and grab just the information related to the input song. Next, we do another slice to separate all features that describe sound similarity into another DataFrame.
Remember, I mentioned above that the cosine similarity function works with both sparse matrices and DataFrames. I used the DataFrame as the first input and then converted the sound properties data for just the song we are interested in into a NumPy array as the second input. Pay attention to the shapes of the arrays. Converting the song data into a NumPy array results in a shape of (11,), whereas the full DataFrame has a shape of (80733, 11). scikit-learn’s cosine similarity function expects two-dimensional inputs, so the song’s array has to be reshaped to (1, 11) before the call.
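The shape issue is easy to trip over, so here is a small sketch of it. The data is random and the feature names `f0` to `f10` are placeholders I invented; only the shapes mirror the ones discussed above, with 5 rows standing in for the full DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Random stand-in for the sound features: 5 songs, 11 features each
rng = np.random.default_rng(0)
feature_cols = [f"f{i}" for i in range(11)]  # placeholder feature names
sound_df = pd.DataFrame(rng.random((5, 11)), columns=feature_cols)

# A single song pulled out as a NumPy array is one-dimensional
song_vec = sound_df.iloc[0].to_numpy()
print(song_vec.shape)  # (11,)

# cosine_similarity wants 2D inputs, so turn it into a single-row matrix
song_2d = song_vec.reshape(1, -1)
print(song_2d.shape)  # (1, 11)

# One similarity score per song in the DataFrame
scores = cosine_similarity(sound_df, song_2d)
print(scores.shape)  # (5, 1)
```

The result has one row per song and a single column, so flattening it with `.ravel()` gives a plain list of scores that can be attached to the DataFrame as a new column.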
I add a new column with the similarity between our song and every other song to the main DataFrame. Then, I reorder the entire DataFrame by putting the songs with the highest similarity scores at the top. Unlike the previous function, this one has a hardcoded number of desired recommendations (10).
The output is a new DataFrame of the ten most similar songs along with the artist name, the year of release, and the song’s popularity score.
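Putting the pieces of this section together, a function along these lines might look like the sketch below. Treat it as an approximation under assumptions rather than the exact code behind the article: the function name `recommend_by_sound`, the three-feature list, the column names, and the toy songs are all hypothetical, I dropped the input song from the results, and I exposed the number of results as a parameter that defaults to 10.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical subset of sound features (the article's data has 11)
SOUND_FEATURES = ["danceability", "energy", "valence"]


def recommend_by_sound(df, title, artist, n=10):
    """Return the n songs in df whose sound features are closest to the input song."""
    try:
        # First slice: the row for the input song
        song = df[(df["name"] == title) & (df["artists"] == artist)].iloc[0]
    except IndexError:
        print(f"{title} by {artist} was not found in the data.")
        return None

    # Second slice: only the columns that describe the sound
    features = df[SOUND_FEATURES]
    song_vec = song[SOUND_FEATURES].to_numpy(dtype=float).reshape(1, -1)

    # Similarity of every song to the input song, stored on a copy of df
    ranked = df.assign(similarity=cosine_similarity(features, song_vec).ravel())

    # Drop the input song itself and put the closest matches first
    ranked = ranked.drop(index=song.name).sort_values("similarity", ascending=False)
    return ranked[["name", "artists", "year", "popularity", "similarity"]].head(n)


# Toy data (made-up songs and values) to exercise the function
songs = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D"],
    "artists": ["Artist X", "Artist Y", "Artist Z", "Artist W"],
    "year": [1997, 2001, 2010, 2016],
    "popularity": [60, 55, 40, 70],
    "danceability": [0.90, 0.85, 0.20, 0.50],
    "energy": [0.80, 0.75, 0.30, 0.50],
    "valence": [0.70, 0.65, 0.10, 0.90],
})

recs = recommend_by_sound(songs, "Song A", "Artist X")
print(recs)
```

One design note: cosine similarity compares the direction of the feature vectors, so a feature on a much larger scale than the others (tempo in beats per minute, for example) can dominate the score unless the features are scaled first.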
Let’s test our function and look for songs similar to Missy Elliott’s iconic hit “The Rain (Supa Dupa Fly)”:
Not bad. I had to look up some of these songs to ensure they sound similar, and they do. One of the limitations of this approach is that evaluation is pretty subjective. Recommendation systems are typically evaluated with metrics such as precision and recall, but those require knowing which songs a listener actually considers relevant, and we don’t have that information here.
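If we did have relevance labels, for example a list of songs a listener actually liked, precision and recall could be computed over the top k recommendations. The helper below is a hypothetical sketch of that calculation, with made-up song IDs; it is not something I ran against the dataset in this article.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall over the first k recommended items.

    recommended: ranked list of item IDs, best first
    relevant: collection of item IDs the listener actually considers relevant
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four recommendations, three songs the listener likes
recommended = ["song_a", "song_b", "song_c", "song_d"]
relevant = {"song_b", "song_d", "song_e"}

precision, recall = precision_recall_at_k(recommended, relevant, k=4)
print(precision, recall)
```

Precision answers "how many of the recommendations were good?", while recall answers "how many of the good songs did we recommend?".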
Still, this is a simplistic way to build a content-based recommender. Depending on the quality of your data, it can be pretty useful.