Collaborative filtering

A distinction is often made between two forms of data collection for recommendation systems. Explicit feedback relies on users giving direct signals about their preferences, e.g. review ratings, whereas implicit feedback refers to indirect signals of preference, e.g. user watch-time. Traditionally, recommender systems are split into three types:

Collaborative filtering (CF): CF produces recommendations based on the knowledge of users’ attitudes towards items, that is, it uses the “wisdom of the crowd” to recommend items.
Content-based (CB): CB recommender systems focus on the attributes of the items to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.
Hybrid recommendation systems: Hybrid methods combine CB and CF methods.

In many applications, content-based features are not easy to extract, so collaborative filtering approaches are preferred. We therefore explore only collaborative filtering methods from now on.

CF methods typically fall into three types: memory-based, model-based and, more recently, deep-learning-based (Su & Khoshgoftaar, 2009; He et al., 2017). Neighbour-based CF and item-based/user-based top-N recommendations are typical examples of memory-based systems, which use user rating data to compute the similarity between users or items. As mentioned previously, common model-based approaches include Bayesian networks, latent semantic models and Markov decision processes. In this investigation, we use a weighted matrix factorization approach. Later on, we generalize the matrix factorization algorithm via a non-linear neural architecture (a softmax model).
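To make the memory-based idea concrete, the sketch below computes item-item cosine similarities from a small binary feedback matrix and uses them to rank unseen items for one user. The toy matrix and the helper names `item_cosine_similarity` and `recommend_for_user` are assumptions made for illustration; they are not part of this notebook's models.

```python
import numpy as np

# Toy binary feedback matrix (hypothetical data): rows are users, columns are items.
feedback = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

def item_cosine_similarity(matrix):
    """Cosine similarity between every pair of item columns."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items with no interactions
    normalized = matrix / norms
    return normalized.T @ normalized

def recommend_for_user(matrix, user, top_n=2):
    """Rank unseen items by their summed similarity to the items the user already has."""
    similarity = item_cosine_similarity(matrix)
    scores = similarity @ matrix[user]
    scores[matrix[user] > 0] = -np.inf  # never recommend an item the user already has
    return np.argsort(-scores)[:top_n]

item_similarity = item_cosine_similarity(feedback)
print(np.round(item_similarity, 3))
print(recommend_for_user(feedback, user=0))
```

No model is trained here: the recommendations come directly from the stored interactions, which is what distinguishes memory-based from model-based approaches.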

However, our approaches have a number of limitations, such as the inability to model the order of interactions. For instance, Markov chain algorithms (Rendle et al., 2010) can encode not only the same information as traditional CF methods but also the order in which users interacted with the items. Furthermore, the sparsity of the frequency matrix (described later on) makes computations prohibitively expensive in real-world settings without some optimization.

Quick Links:

Setup
Matrix Factorization
- Training
  - Vanilla Model
  - Regularized Model
- Evaluating Embeddings
Demo

Setup

The next few code cells detail the preparatory steps needed to develop our collaborative filtering models: importing the required libraries; re-indexing the ids of users and artists; constructing an indicator variable for the presence of a user-artist interaction; and finding the most assigned tag of each artist.

from __future__ import print_function
import numpy as np
import pandas as pd
import collections
from IPython import display
from matplotlib import pyplot as plt
import sklearn
import sklearn.manifold
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
tf.logging.set_verbosity(tf.logging.ERROR)

# Configure pandas display options.
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.3f}'.format

# Install Altair and activate its colab renderer.
print("Installing Altair...")
# GitHub no longer serves the unauthenticated git:// protocol, so install over https.
!pip install git+https://github.com/altair-viz/altair.git
import altair as alt
alt.data_transformers.enable('default', max_rows=None)
alt.renderers.enable('colab')
print("Done installing Altair.")

# NEEDED FOR GOOGLE COLAB
# import os
# from google.colab import auth
# from google.colab import drive
# import gspread
# from oauth2client.client import GoogleCredentials

# drive.mount('/content/drive/')
# os.chdir("/content/drive/My Drive/DCU/fouth_year/advanced_machine_learning/music-recommodation-system")

Helper functions

def calculate_sparsity(M):
    """
    Computes the fraction of possible user-artist pairs that are observed
    in the frequency matrix. A value close to 0 means a very sparse matrix.
    """
    matrix_size = len(M['userID'].unique()) * len(M['artistID'].unique())  # Number of possible interactions in the matrix
    num_plays = len(M['weight'])  # Number of observed interactions
    sparsity = float(num_plays) / matrix_size
    return sparsity

def build_music_sparse_tensor(music_df):
  """
  Args:
    music_df: a pd.DataFrame with `userID`, `artistID` and `weight` columns.
  Returns:
    a tf.SparseTensor representing the feedback matrix.
  Note:
    relies on the module-level `num_users` and `num_artist` for the shape.
  """
  indices = music_df[['userID', 'artistID']].values
  values = music_df['weight'].values
  return tf.SparseTensor(
      indices=indices,
      values=values,
      dense_shape=[num_users, num_artist])

def preproces_ids(music_df):
  """
  Args:
    music_df: a pd.DataFrame with `userID`, `artistID` and `weight` columns.
  Returns:
    a pd.DataFrame where userIDs and artistIDs are re-indexed to consecutive
      integers starting at 0.
    two dictionaries mapping the new indices back to the original ids
      (one for users, one for artists).
  Note:
    the input DataFrame is modified in place.
  """
  unique_user_ids_list = sorted(music_df['userID'].unique())
  print(unique_user_ids_list[0])  # Smallest original user id, as a sanity check

  # new index -> original id, and original id -> new index
  unique_user_ids = dict(zip(range(0, len(unique_user_ids_list)), unique_user_ids_list))
  unique_user_ids_switched = dict(zip(unique_user_ids_list, range(0, len(unique_user_ids_list))))
  
  unique_artist_ids_list = sorted(music_df['artistID'].unique())
  unique_artist_ids = dict(zip(range(0, len(unique_artist_ids_list) ),unique_artist_ids_list))
  unique_artist_ids_switched = dict(zip(unique_artist_ids_list, range(0, len(unique_artist_ids_list) )))

  music_df['userID'] = music_df['userID'].map(unique_user_ids_switched)
  music_df['artistID'] = music_df['artistID'].map(unique_artist_ids_switched)

  return music_df, unique_user_ids, unique_artist_ids

def split_dataframe(df, holdout_fraction=0.1):
  """Splits a DataFrame into training and test sets.
  Args:
    df: a dataframe.
    holdout_fraction: fraction of dataframe rows to use in the test set.
  Returns:
    train: dataframe for training
    test: dataframe for testing
  """
  test = df.sample(frac=holdout_fraction, replace=False)
  train = df[~df.index.isin(test.index)]
  return train, test
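The helper steps above (re-indexing ids, measuring how full the matrix is, and holding out a test set) can be tried on a toy table. This is a minimal sketch under stated assumptions: the data is made up, `pd.factorize` is swapped in for the dictionary-based re-indexing used in the helpers, and `random_state` is added only to make the toy split repeatable.

```python
import pandas as pd

# Hypothetical toy interactions with non-contiguous original ids.
toy_df = pd.DataFrame({
    'userID':   [2, 2, 7, 7, 9, 9, 9, 12, 12, 15],
    'artistID': [51, 60, 51, 72, 60, 72, 90, 51, 90, 60],
    'weight':   [1] * 10,
})

# Re-index ids to 0..n-1. With sort=True, codes follow the sorted original ids,
# and the returned uniques map each new index back to its original id.
user_codes, user_originals = pd.factorize(toy_df['userID'], sort=True)
artist_codes, artist_originals = pd.factorize(toy_df['artistID'], sort=True)
reindexed = toy_df.assign(userID=user_codes, artistID=artist_codes)

# Fraction of the user-artist grid that is actually observed.
n_users, n_artists = len(user_originals), len(artist_originals)
density = len(reindexed) / (n_users * n_artists)

# Hold out 20% of the rows for testing; the rest is used for training.
test = reindexed.sample(frac=0.2, replace=False, random_state=0)
train = reindexed[~reindexed.index.isin(test.index)]

print(reindexed)
print('density:', density)
print('train rows:', len(train), 'test rows:', len(test))
```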

Traditional recommender system development relies on explicit feedback, and many models were designed to treat recommendation as a regression problem. For instance, the input of the model would be a matrix \(F_{nm}\) denoting users’ (m) preferences for items (n) on a scale. In the classic movie ratings example, this preference would be users giving a 1-to-5 star rating to different movies.

This dataset contains implicit feedback, that is, observed logs of user interactions with items; in this instance, users’ listening counts for artists. However, implicit feedback does not signal negativity in the same way that a 1-star rating would. In our data, a user could listen to a song by an artist only a limited number of times, but that does not necessarily mean that the user has an aversion to that artist, e.g. the song could be part of a playlist curated by another user. Therefore, we construct a binary matrix that has a value of one wherever an interaction is observed (i.e. a listening count has been logged between an artist and a user). Note that a 0 is not used to describe unobserved artist-user interactions. This is for optimization reasons, explained below.
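As a rough illustration of the binary construction, the sketch below turns hypothetical listening counts into a coordinate list in which every observed pair has the value 1 and unobserved pairs are simply absent. The counts, sizes and variable names are assumptions made for the example; the dense array is built only to show what the stored coordinates represent.

```python
import numpy as np

# Hypothetical listening logs: (user index, artist index, play count).
play_counts = [(0, 0, 152), (0, 2, 3), (1, 1, 40), (2, 0, 7), (2, 3, 1)]
n_users, n_artists = 3, 4

# Keep only the coordinates of observed pairs; the play count itself is discarded
# and every observed pair gets the value 1.
indices = np.array([(user, artist) for user, artist, _ in play_counts])
values = np.ones(len(indices))

# Dense view for illustration only. Unobserved cells show up as 0 here,
# but the coordinate form above never stores them.
dense = np.zeros((n_users, n_artists))
dense[indices[:, 0], indices[:, 1]] = values

stored = len(values)
total = n_users * n_artists
print(dense)
print(stored, 'stored values instead of', total)
```

A user with 152 plays and a user with a single play are both recorded as 1, which matches the reasoning above that a low count is not evidence of dislike.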

user_artists = pd.read_csv('data/user_artists.dat', sep='\t')
user_artists['weight'] = 1
artists = pd.read_csv('data/artists.dat', sep='\t')
artists.rename({'id':'artistID'}, inplace=True, axis=1)

user_taggedartists = pd.read_csv(r'data/user_taggedartists-timestamps.dat', sep='\t')
user_taggedartists_years = pd.read_csv(r'data/user_taggedartists.dat', sep='\t')
tags = pd.read_csv(open('data/tags.dat', errors='replace'), sep='\t')
user_taggedartists = pd.merge(user_taggedartists, tags, on=['tagID'])

num_users = user_artists.userID.nunique()
num_artist = artists.artistID.nunique()
collab_filter_df = user_artists

Here, we calculate the top 10 tags by popularity. Then, we assign a tag to an artist if the artist has a top 10 tag. If none of an artist’s tags are in the top 10, we input ‘N/A’. Note that the next cell can take several minutes to compute.

top_10_tags = user_taggedartists['tagValue'].value_counts().index[0:10]
user_taggedartists['top10TagValue'] = None
for index, row in user_taggedartists.iterrows():
  if row['tagValue'] in top_10_tags:
    user_taggedartists.iloc[index, -1] = row['tagValue']
user_taggedartists.fillna('N/A',inplace=True)

artists = pd.merge(user_taggedartists, artists, on=['artistID'], how='right')[['artistID','name','top10TagValue','tagValue']].fillna('N/A')
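# NOTE: the result of this groupby is not assigned, so `artists` is left unchanged;
# drop_duplicates below therefore keeps the first row found for each artist.
artists.groupby(['artistID','name','top10TagValue']).agg(lambda x:x.value_counts().index[0]).reset_index()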
artists = artists.drop_duplicates(subset=['artistID'])
assert artists.artistID.nunique() == num_artist

artists.rename({'tagValue':'mostCommonGenre'},axis=1, inplace=True)
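The row-by-row loop used above to fill `top10TagValue` could probably be replaced by a vectorized expression. The sketch below shows one way to do this on a toy table, assuming the same rule (keep the tag if it is among the most frequent ones, otherwise use 'N/A'). The toy data, the top-2 cutoff and the column name `topTagValue` are illustrative assumptions, and this is an alternative sketch rather than the code that produced the results in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the merged tag table.
toy_tags = pd.DataFrame({
    'artistID': [1, 1, 2, 3, 3, 4],
    'tagValue': ['rock', 'pop', 'rock', 'jazz', 'rock', 'pop'],
})

# Top 2 tags here, rather than top 10, to keep the toy example small.
top_tags = toy_tags['tagValue'].value_counts().index[:2]

# Series.where keeps a value where the condition holds and substitutes 'N/A' elsewhere,
# so no explicit Python loop over the rows is needed.
toy_tags['topTagValue'] = toy_tags['tagValue'].where(
    toy_tags['tagValue'].isin(top_tags), 'N/A')

print(toy_tags)
```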

We require two matrices, or embeddings, to compute a similarity measure (one for queries and one for items), but how do we get these two embeddings?
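One way to obtain such embeddings is to factorize the feedback matrix into a user (query) matrix U and an artist (item) matrix V whose product approximates it. The sketch below does this with plain NumPy gradient descent on a toy matrix, using a squared-error loss in which unobserved pairs receive a small weight. The toy data, the embedding size, the weight of 0.1 and the learning rate are all assumptions chosen for illustration; this is not the TensorFlow model built in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary feedback matrix A (users x artists); hypothetical data.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)
observed = A > 0

embedding_dim = 2        # assumed size of the latent space
unobserved_weight = 0.1  # assumed weight on unobserved pairs
learning_rate = 0.05     # assumed step size
weights = np.where(observed, 1.0, unobserved_weight)

U = rng.normal(scale=0.1, size=(A.shape[0], embedding_dim))  # user (query) embeddings
V = rng.normal(scale=0.1, size=(A.shape[1], embedding_dim))  # artist (item) embeddings

def weighted_loss(U, V):
    """Weighted squared error between A and its low-rank reconstruction U V^T."""
    error = A - U @ V.T
    return np.sum(weights * error ** 2)

initial_loss = weighted_loss(U, V)
for _ in range(500):
    weighted_error = weights * (A - U @ V.T)
    grad_U = -2 * weighted_error @ V    # gradient of the loss with respect to U
    grad_V = -2 * weighted_error.T @ U  # gradient of the loss with respect to V
    U -= learning_rate * grad_U
    V -= learning_rate * grad_V
final_loss = weighted_loss(U, V)

print('loss before:', initial_loss, 'loss after:', final_loss)
print(np.round(U @ V.T, 2))
```

Each row of U is then a query embedding and each row of V an item embedding, and both live in the same low-dimensional space.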

	userID	artistID	weight
count	92834.000	92834.000	92834.000
mean	944.222	3235.737	1.000
std	546.751	4197.217	0.000
min	0.000	0.000	1.000
25%	470.000	430.000	1.000
50%	944.000	1237.000	1.000
75%	1416.000	4266.000	1.000
max	1891.000	17631.000	1.000

	dot score	name	most assigned tag
9437	0.541	The Cure	chillout
86790	0.527	Yellowcard	rock
115825	0.526	Shiny Toy Guns	electronic
49282	0.525	Johnny Cash	new wave
11826	0.525	Jamiroquai	chillout
97327	0.524	Scooter	electronic

	cosine score	name	most assigned tag
9437	1.000	The Cure	chillout
10850	0.967	Placebo	chillout
14553	0.957	Arctic Monkeys	chillout
8273	0.956	Radiohead	chillout
4936	0.952	Depeche Mode	chillout
31876	0.946	Sigur Rós	chillout

	dot score	name	most assigned tag
16680	3.214	The Beatles	chillout
12363	3.212	Muse	chillout
3259	3.176	Coldplay	chillout
9437	3.164	The Cure	chillout
18364	3.157	Nirvana	pop
17472	3.116	The Killers	chillout

	cosine score	name	most assigned tag
9437	1.000	The Cure	chillout
32942	0.965	The Smiths	groove
38968	0.962	U2	electronic
10850	0.958	Placebo	chillout
4936	0.958	Depeche Mode	chillout
43413	0.955	David Bowie	chillout
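The dot and cosine scores reported in the tables above can be computed from item embeddings along the following lines. This is a minimal sketch with made-up two-dimensional embeddings; the function names `dot_scores` and `cosine_scores` are hypothetical and are not taken from the notebook's evaluation code.

```python
import numpy as np

def dot_scores(query_embedding, item_embeddings):
    """Dot product between the query and every item embedding."""
    return item_embeddings @ query_embedding

def cosine_scores(query_embedding, item_embeddings):
    """Dot product divided by the norms, so only the angle between vectors matters."""
    item_norms = np.linalg.norm(item_embeddings, axis=1)
    query_norm = np.linalg.norm(query_embedding)
    return (item_embeddings @ query_embedding) / (item_norms * query_norm)

# Hypothetical 2-d artist embeddings; row 0 is used as the query artist.
V = np.array([
    [1.0, 0.0],
    [3.0, 0.0],
    [0.5, 0.5],
    [0.0, 2.0],
])
query = V[0]

print(dot_scores(query, V))
print(np.round(cosine_scores(query, V), 3))
```

The cosine score of the query artist with itself is always 1, as in the cosine tables above. In this toy example the dot product ranks the second artist above the query artist because its embedding has a larger norm, whereas the cosine score treats the two as equally aligned.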

CA4015 Music Recommendation Task

	score	name	most assigned tag
126491	0.925	Bandas Gaúchas - www.DownsMtv.com	N/A
126539	0.922	Menstruação Anarquika	N/A
126554	0.921	Moreira da Silva	N/A
126582	0.901	Validuaté	N/A
126400	0.900	The Vibrators	punk
126583	0.840	The Saints	punk
126513	0.825	Graforréia Xilarmônica	rock
126540	0.814	The Exploited	uk
103973	0.807	The Animals	rock
126451	0.807	Tim Maia	pop

	name	dot score
3259	Coldplay	3.705
12363	Muse	3.574
37842	Paramore	3.550
24447	Lily Allen	3.520
17832	Green Day	3.485
6543	Lady Gaga	3.485
6543	Lady Gaga	3.474
17278	Kings of Leon	3.471
17472	The Killers	3.466
6543	Lady Gaga	3.464