Collaborative Filtering (Recommender) - Python Code Part 1

Collaborative Filtering Using GraphLab for an Implicit Dataset

(Part 1: Current Users by Segment - Pearson vs Jaccard vs Factorization)

Objective: Apply and compare different collaborative filtering models and select the best recommender model.

Step 1 of 5: Import Relevant Libraries

Note: GraphLab Create requires signing up for an academic license and runs on Python 2 only.

In [9]:
import pandas as pd
import graphlab as gl
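Optional sanity check (a minimal sketch, not part of the original notebook): confirm the interpreter is Python 2 and print the installed GraphLab Create release. The gl.version attribute is assumed here and may be named differently in other releases.

import sys
# GraphLab Create runs on Python 2 only
assert sys.version_info[0] == 2, "GraphLab Create requires Python 2"
print(gl.version)  # installed GraphLab Create version string (attribute assumed)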
    

Step 2 of 5: Load & Prepare the Dataset

Note: Print the first few rows to check that the dataset has been imported correctly.

In [10]:
df = pd.read_csv('dataset\dataset02_master.csv', sep=',')
print(df.head(5))  # check data
    
     Type    Card_ID  SegmentNo Gender  Age Age_Grp  Length_of_Membership_MTH  \
    0  Active  104829316          5      F   40   35-44                         0
    1  Active  101480021          5      M   44   35-44                         0
    2  Active  104219628          5      M   21   15-24                         0
    3  Active  104219628          5      M   21   15-24                         0
    4  Active  106272169          5      M   29   25-34                         0

      Membership_Grp           pdt_type  total_count
    0        <= 1 YR    MP3Players_high          2.0
    1        <= 1 YR    MP3Players_high          2.0
    2        <= 1 YR  MP3Players_medium          2.0
    3        <= 1 YR    MP3Players_high          1.0
    4        <= 1 YR    Hardware_medium          1.0
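As an optional check (a sketch using only pandas and the SegmentNo column shown above), the overall shape and the distribution of rows across segments can be inspected before slicing out a single segment in the next step:

print(df.shape)                        # (rows, columns)
print(df['SegmentNo'].value_counts())  # row count per customer segment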
    

Note 2: Slice the dataset to keep only the relevant columns for the chosen segment (SegmentNo == 2).
Note 3: Convert the pandas DataFrame to an SFrame, which GraphLab requires.

In [11]:
df_seg2_growable = df[df.SegmentNo == 2]
df_seg2_growable = df_seg2_growable[['Card_ID', 'pdt_type']]
df_seg2_growable_SFrame = gl.SFrame(df_seg2_growable)  # convert into SFrame
df_seg2_growable_SFrame.save('dataset\dataset_recommender_2.csv', format='csv')
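An optional profile of the sliced segment (a sketch using plain pandas and the column names above) shows how many distinct users and product types remain:

print(df_seg2_growable['Card_ID'].nunique())               # distinct users in segment 2
print(df_seg2_growable['pdt_type'].nunique())              # distinct product types
print(df_seg2_growable['pdt_type'].value_counts().head())  # most frequent product types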
    

Step 3 of 5: Create Train and Test Datasets

Note: random_split_by_user holds out roughly 50% of the items of a sample of users for testing (item_test_proportion = 0.5); all remaining observations stay in the training set.

In [12]:
train_s2, test_s2 = gl.recommender.util.random_split_by_user(df_seg2_growable_SFrame,
                        user_id='Card_ID', item_id='pdt_type',
                        item_test_proportion=0.5, random_seed=2017)
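A quick check of the resulting sizes (a sketch using standard SFrame/SArray calls) shows how many rows and users ended up in each split:

print(train_s2.num_rows(), test_s2.num_rows())                               # rows in each split
print(len(train_s2['Card_ID'].unique()), len(test_s2['Card_ID'].unique()))   # users in each split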
    

Step 4 of 5: Create 3 Candidate Models

In [13]:
# Model 1 - Pearson similarity score
train_s2_model_pearson = gl.recommender.item_similarity_recommender.create(
                    train_s2, user_id='Card_ID', item_id='pdt_type',
                    similarity_type='pearson')

# Model 2 - Jaccard similarity score
train_s2_model_jaccard = gl.recommender.item_similarity_recommender.create(
                    train_s2, user_id='Card_ID', item_id='pdt_type',
                    similarity_type='jaccard')

# Model 3 - Ranking factorization (implicit ALS solver)
train_s2_model_factorization = gl.recommender.ranking_factorization_recommender.create(
                    train_s2, user_id='Card_ID', item_id='pdt_type',
                    random_seed=2017, solver='ials')
    
Recsys training: model = item_similarity
Preparing data set.
    Data has 5491 observations with 3492 users and 21 items.
    Data prepared in: 0.026568s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 0us                            | 28.5       |
| 0us                            | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 0us                                 | 0                | 0               |
| 0us                                 | 100              | 21              |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 1.00772s
Recsys training: model = item_similarity
Preparing data set.
    Data has 5491 observations with 3492 users and 21 items.
    Data prepared in: 0.032033s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 990us                          | 57.25      |
| 990us                          | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 2.992ms                             | 0                | 0               |
| 4.013ms                             | 100              | 21              |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 0.007006s
Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 5491 observations with 3492 users and 21 items.
    Data prepared in: 0.027032s
Training ranking_factorization_recommender for recommendations.
+------------------------------+--------------------------------------------------+----------+
| Parameter                    | Description                                      | Value    |
+------------------------------+--------------------------------------------------+----------+
| num_factors                  | Factor Dimension                                 | 32       |
| regularization               | L2 Regularization on Factors                     | 1e-009   |
| max_iterations               | Maximum Number of Iterations                     | 25       |
| solver                       | Solver used for training                         | ials     |
+------------------------------+--------------------------------------------------+----------+
+---------+--------------+---------------------------+
| Iter.   | Elapsed time | Estimated Objective Value |
+---------+--------------+---------------------------+
| Initial | 1.013ms      | NA                        |
+---------+--------------+---------------------------+
| 0       | 112.079ms    | 0.0146212                 |
| 1       | 206.16ms     | 2.72342                   |
| 2       | 288.206ms    | 0.507429                  |
| 3       | 354.25ms     | 1.41392                   |
| 4       | 414.295ms    | 0.301729                  |
| 5       | 508.36ms     | 0.277017                  |
| 6       | 567.417ms    | 0.176367                  |
| 7       | 636.451ms    | 0.175236                  |
| 8       | 691.496ms    | 0.192585                  |
| 9       | 758.55ms     | 0.038027                  |
| 10      | 820.585ms    | 0.0887392                 |
| 11      | 889.645ms    | 0.0159575                 |
| 12      | 960.68ms     | 0.129588                  |
| 13      | 1.01s        | 0.134391                  |
| 14      | 1.08s        | 0.0266231                 |
| 15      | 1.14s        | 0.0208979                 |
| 16      | 1.20s        | 0.142289                  |
| 17      | 1.26s        | 0.1924                    |
| 18      | 1.32s        | 0.0588915                 |
| 19      | 1.39s        | 1.2552                    |
| 20      | 1.44s        | 0.431125                  |
| 21      | 1.51s        | 0.117843                  |
| 22      | 1.57s        | 0.110527                  |
| 23      | 1.63s        | 0.174974                  |
| 24      | 1.68s        | 0.88378                   |
| FINAL   | 1.68s        | 0.88378                   |
+---------+--------------+---------------------------+
Optimization Complete: Iteration limit reached.
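For intuition on why the similarity types differ on implicit data: Pearson similarity is designed for explicit ratings, so on effectively binary interaction counts it often adds little, while Jaccard only compares the sets of users behind each pair of items. The toy example below (illustrative only, not part of the notebook) computes a Jaccard score between two items by hand.

# Each item is represented by the set of users who interacted with it (implicit data)
users_item_a = {'u1', 'u2', 'u3', 'u4'}
users_item_b = {'u2', 'u3', 'u5'}

# Jaccard similarity = |intersection| / |union|
jaccard = len(users_item_a & users_item_b) / float(len(users_item_a | users_item_b))
print(jaccard)  # 2 shared users / 5 total users = 0.4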

Step 5 of 5: Compare the Precision and Recall of the 3 Models

In [14]:
x2 = gl.recommender.util.compare_models(test_s2,
        [train_s2_model_pearson, train_s2_model_jaccard, train_s2_model_factorization],
        model_names=["m1", "m2", "m3"])
    
PROGRESS: Evaluate model m1

    Precision and recall summary statistics by cutoff
    +--------+-----------------+----------------+
    | cutoff |  mean_precision |  mean_recall   |
    +--------+-----------------+----------------+
    |   1    |  0.255319148936 | 0.22695035461  |
    |   2    |  0.20780141844  | 0.375177304965 |
    |   3    |  0.191489361702 | 0.50780141844  |
    |   4    |  0.174822695035 | 0.617730496454 |
    |   5    |  0.157730496454 | 0.69219858156  |
    |   6    |  0.139716312057 | 0.73475177305  |
    |   7    |  0.12462006079  | 0.762411347518 |
    |   8    |  0.113120567376 | 0.79219858156  |
    |   9    |  0.104176516942 | 0.823404255319 |
    |   10   | 0.0971631205674 | 0.856737588652 |
    +--------+-----------------+----------------+
    [10 rows x 3 columns]

    PROGRESS: Evaluate model m2

    Precision and recall summary statistics by cutoff
    +--------+-----------------+----------------+
    | cutoff |  mean_precision |  mean_recall   |
    +--------+-----------------+----------------+
    |   1    |  0.175886524823 | 0.149645390071 |
    |   2    |  0.20780141844  | 0.360992907801 |
    |   3    |  0.181087470449 | 0.473758865248 |
    |   4    |  0.163120567376 | 0.568085106383 |
    |   5    |  0.158865248227 | 0.697872340426 |
    |   6    |  0.141843971631 | 0.747517730496 |
    |   7    |  0.127456940223 | 0.783687943262 |
    |   8    |  0.114361702128 | 0.805673758865 |
    |   9    |  0.106540583136 | 0.843262411348 |
    |   10   | 0.0981560283688 | 0.863829787234 |
    +--------+-----------------+----------------+
    [10 rows x 3 columns]

    PROGRESS: Evaluate model m3

    Precision and recall summary statistics by cutoff
    +--------+-----------------+----------------+
    | cutoff |  mean_precision |  mean_recall   |
    +--------+-----------------+----------------+
    |   1    |  0.117730496454 | 0.114893617021 |
    |   2    |  0.126241134752 | 0.237588652482 |
    |   3    | 0.0992907801418 | 0.276595744681 |
    |   4    | 0.0946808510638 | 0.342553191489 |
    |   5    | 0.0814184397163 | 0.368794326241 |
    |   6    | 0.0841607565012 | 0.451773049645 |
    |   7    | 0.0782168186424 | 0.491489361702 |
    |   8    | 0.0739361702128 | 0.528368794326 |
    |   9    | 0.0816390858944 | 0.643262411348 |
    |   10   | 0.0815602836879 | 0.721985815603 |
    +--------+-----------------+----------------+
    [10 rows x 3 columns]

    

Select the Final Model and Create the Results Dataset

Note: Based on the comparison above, the Jaccard item-similarity model (m2) edges out Pearson (m1) at the larger cutoffs and clearly beats the factorization model (m3), so it is retrained on the full segment dataset.

In [15]:
train_s2_model_final = gl.recommender.item_similarity_recommender.create(
                    df_seg2_growable_SFrame, user_id='Card_ID', item_id='pdt_type',
                    similarity_type='jaccard')
    
Recsys training: model = item_similarity
Preparing data set.
    Data has 6295 observations with 3766 users and 21 items.
    Data prepared in: 0.028038s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 983us                          | 79.5       |
| 1.981ms                        | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 2.991ms                             | 0                | 0               |
| 5.985ms                             | 100              | 21              |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 0.008986s

In [16]:
## Output final dataset for visualization
recs_final = train_s2_model_final.recommend()
recs_final.save('dataset\dataset_final_recs2.csv', format='csv')

recommendations finished on 1000/3766 queries. users per second: 200120
recommendations finished on 2000/3766 queries. users per second: 133333
recommendations finished on 3000/3766 queries. users per second: 99950
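As a possible follow-up (a sketch, not executed in this notebook), GraphLab's recommend() also accepts a specific list of users and a list length k, which is handy for spot-checking individual card IDs:

sample_users = df_seg2_growable_SFrame['Card_ID'].unique().head(3)     # three example users from the segment
recs_sample = train_s2_model_final.recommend(users=sample_users, k=5)  # top-5 items per sampled user
recs_sample.print_rows(num_rows=15)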