Collaborative Filtering (Recommender) - Python Code Part 1

Collaborative Filtering Using GraphLab for an Implicit Dataset

(Part 1: Current Users by Segment - Pearson vs Jaccard vs Factorization)

Objective: Apply and compare different collaborative filtering models and select the best recommender model.

Step 1 of 5: Import Relevant Libraries

Note: GraphLab Create requires signing up for an academic license and runs on Python 2 only.

In [9]:
import pandas as pd
import graphlab as gl
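Optional sanity check (a minimal sketch, not part of the original notebook): confirm the interpreter is Python 2 and print the installed GraphLab Create release. The gl.version attribute is assumed here and may be named differently in other releases.

import sys
# GraphLab Create runs on Python 2 only
assert sys.version_info[0] == 2, "GraphLab Create requires Python 2"
print(gl.version)  # installed GraphLab Create version string (attribute assumed)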
    

Step 2 of 5: Load & Prepare the Dataset

Note: Print the first few rows to check that the dataset has been imported correctly.

In [10]:
df = pd.read_csv('dataset\dataset02_master.csv', sep=',')
print(df.head(5))  # check data
    
     Type    Card_ID  SegmentNo Gender  Age Age_Grp  Length_of_Membership_MTH  \
    0  Active  104829316          5      F   40   35-44                         0
    1  Active  101480021          5      M   44   35-44                         0
    2  Active  104219628          5      M   21   15-24                         0
    3  Active  104219628          5      M   21   15-24                         0
    4  Active  106272169          5      M   29   25-34                         0

      Membership_Grp           pdt_type  total_count
    0        <= 1 YR    MP3Players_high          2.0
    1        <= 1 YR    MP3Players_high          2.0
    2        <= 1 YR  MP3Players_medium          2.0
    3        <= 1 YR    MP3Players_high          1.0
    4        <= 1 YR    Hardware_medium          1.0
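As an optional check (a sketch using only pandas and the SegmentNo column shown above), the overall shape and the distribution of rows across segments can be inspected before slicing out a single segment in the next step:

print(df.shape)                        # (rows, columns)
print(df['SegmentNo'].value_counts())  # row count per customer segment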
    

Note 2: Slice the dataset to keep only the relevant columns for the chosen segment (SegmentNo == 2).
Note 3: Convert the pandas DataFrame to an SFrame, which GraphLab requires.

In [11]:
df_seg2_growable = df[df.SegmentNo == 2]
df_seg2_growable = df_seg2_growable[['Card_ID', 'pdt_type']]
df_seg2_growable_SFrame = gl.SFrame(df_seg2_growable)  # convert into SFrame
df_seg2_growable_SFrame.save('dataset\dataset_recommender_2.csv', format='csv')
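An optional profile of the sliced segment (a sketch using plain pandas and the column names above) shows how many distinct users and product types remain:

print(df_seg2_growable['Card_ID'].nunique())               # distinct users in segment 2
print(df_seg2_growable['pdt_type'].nunique())              # distinct product types
print(df_seg2_growable['pdt_type'].value_counts().head())  # most frequent product types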
    

Step 3 of 5: Create Train and Test Datasets

Note: random_split_by_user holds out roughly 50% of the items of a sample of users for testing (item_test_proportion = 0.5); all remaining observations stay in the training set.

In [12]:
train_s2, test_s2 = gl.recommender.util.random_split_by_user(df_seg2_growable_SFrame,
                        user_id='Card_ID', item_id='pdt_type',
                        item_test_proportion=0.5, random_seed=2017)
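A quick check of the resulting sizes (a sketch using standard SFrame/SArray calls) shows how many rows and users ended up in each split:

print(train_s2.num_rows(), test_s2.num_rows())                               # rows in each split
print(len(train_s2['Card_ID'].unique()), len(test_s2['Card_ID'].unique()))   # users in each split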
    

Step 4 of 5: Create 3 Candidate Models

In [13]:
# Model 1 - Pearson similarity score
train_s2_model_pearson = gl.recommender.item_similarity_recommender.create(
                    train_s2, user_id='Card_ID', item_id='pdt_type',
                    similarity_type='pearson')

# Model 2 - Jaccard similarity score
train_s2_model_jaccard = gl.recommender.item_similarity_recommender.create(
                    train_s2, user_id='Card_ID', item_id='pdt_type',
                    similarity_type='jaccard')

# Model 3 - Ranking factorization (implicit ALS solver)
train_s2_model_factorization = gl.recommender.ranking_factorization_recommender.create(
                    train_s2, user_id='Card_ID', item_id='pdt_type',
                    random_seed=2017, solver='ials')
    
Recsys training: model = item_similarity
Preparing data set.
    Data has 5491 observations with 3492 users and 21 items.
    Data prepared in: 0.026568s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 0us                            | 28.5       |
| 0us                            | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 0us                                 | 0                | 0               |
| 0us                                 | 100              | 21              |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 1.00772s
Recsys training: model = item_similarity
Preparing data set.
    Data has 5491 observations with 3492 users and 21 items.
    Data prepared in: 0.032033s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 990us                          | 57.25      |
| 990us                          | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 2.992ms                             | 0                | 0               |
| 4.013ms                             | 100              | 21              |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 0.007006s
Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 5491 observations with 3492 users and 21 items.
    Data prepared in: 0.027032s
Training ranking_factorization_recommender for recommendations.
+------------------------------+--------------------------------------------------+----------+
| Parameter                    | Description                                      | Value    |
+------------------------------+--------------------------------------------------+----------+
| num_factors                  | Factor Dimension                                 | 32       |
| regularization               | L2 Regularization on Factors                     | 1e-009   |
| max_iterations               | Maximum Number of Iterations                     | 25       |
| solver                       | Solver used for training                         | ials     |
+------------------------------+--------------------------------------------------+----------+
+---------+--------------+---------------------------+
| Iter.   | Elapsed time | Estimated Objective Value |
+---------+--------------+---------------------------+
| Initial | 1.013ms      | NA                        |
+---------+--------------+---------------------------+
| 0       | 112.079ms    | 0.0146212                 |
| 1       | 206.16ms     | 2.72342                   |
| 2       | 288.206ms    | 0.507429                  |
| 3       | 354.25ms     | 1.41392                   |
| 4       | 414.295ms    | 0.301729                  |
| 5       | 508.36ms     | 0.277017                  |
| 6       | 567.417ms    | 0.176367                  |
| 7       | 636.451ms    | 0.175236                  |
| 8       | 691.496ms    | 0.192585                  |
| 9       | 758.55ms     | 0.038027                  |
| 10      | 820.585ms    | 0.0887392                 |
| 11      | 889.645ms    | 0.0159575                 |
| 12      | 960.68ms     | 0.129588                  |
| 13      | 1.01s        | 0.134391                  |
| 14      | 1.08s        | 0.0266231                 |
| 15      | 1.14s        | 0.0208979                 |
| 16      | 1.20s        | 0.142289                  |
| 17      | 1.26s        | 0.1924                    |
| 18      | 1.32s        | 0.0588915                 |
| 19      | 1.39s        | 1.2552                    |
| 20      | 1.44s        | 0.431125                  |
| 21      | 1.51s        | 0.117843                  |
| 22      | 1.57s        | 0.110527                  |
| 23      | 1.63s        | 0.174974                  |
| 24      | 1.68s        | 0.88378                   |
| FINAL   | 1.68s        | 0.88378                   |
+---------+--------------+---------------------------+
Optimization Complete: Iteration limit reached.
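For intuition on why the similarity types differ on implicit data: Pearson similarity is designed for explicit ratings, so on effectively binary interaction counts it often adds little, while Jaccard only compares the sets of users behind each pair of items. The toy example below (illustrative only, not part of the notebook) computes a Jaccard score between two items by hand.

# Each item is represented by the set of users who interacted with it (implicit data)
users_item_a = {'u1', 'u2', 'u3', 'u4'}
users_item_b = {'u2', 'u3', 'u5'}

# Jaccard similarity = |intersection| / |union|
jaccard = len(users_item_a & users_item_b) / float(len(users_item_a | users_item_b))
print(jaccard)  # 2 shared users / 5 total users = 0.4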

Step 5 of 5: Compare the Precision and Recall of the 3 Models

In [14]:
x2 = gl.recommender.util.compare_models(test_s2,
        [train_s2_model_pearson, train_s2_model_jaccard, train_s2_model_factorization],
        model_names=["m1", "m2", "m3"])
    
PROGRESS: Evaluate model m1

    Precision and recall summary statistics by cutoff
    +--------+-----------------+----------------+
    | cutoff |  mean_precision |  mean_recall   |
    +--------+-----------------+----------------+
    |   1    |  0.255319148936 | 0.22695035461  |
    |   2    |  0.20780141844  | 0.375177304965 |
    |   3    |  0.191489361702 | 0.50780141844  |
    |   4    |  0.174822695035 | 0.617730496454 |
    |   5    |  0.157730496454 | 0.69219858156  |
    |   6    |  0.139716312057 | 0.73475177305  |
    |   7    |  0.12462006079  | 0.762411347518 |
    |   8    |  0.113120567376 | 0.79219858156  |
    |   9    |  0.104176516942 | 0.823404255319 |
    |   10   | 0.0971631205674 | 0.856737588652 |
    +--------+-----------------+----------------+
    [10 rows x 3 columns]

    PROGRESS: Evaluate model m2

    Precision and recall summary statistics by cutoff
    +--------+-----------------+----------------+
    | cutoff |  mean_precision |  mean_recall   |
    +--------+-----------------+----------------+
    |   1    |  0.175886524823 | 0.149645390071 |
    |   2    |  0.20780141844  | 0.360992907801 |
    |   3    |  0.181087470449 | 0.473758865248 |
    |   4    |  0.163120567376 | 0.568085106383 |
    |   5    |  0.158865248227 | 0.697872340426 |
    |   6    |  0.141843971631 | 0.747517730496 |
    |   7    |  0.127456940223 | 0.783687943262 |
    |   8    |  0.114361702128 | 0.805673758865 |
    |   9    |  0.106540583136 | 0.843262411348 |
    |   10   | 0.0981560283688 | 0.863829787234 |
    +--------+-----------------+----------------+
    [10 rows x 3 columns]

    PROGRESS: Evaluate model m3

    Precision and recall summary statistics by cutoff
    +--------+-----------------+----------------+
    | cutoff |  mean_precision |  mean_recall   |
    +--------+-----------------+----------------+
    |   1    |  0.117730496454 | 0.114893617021 |
    |   2    |  0.126241134752 | 0.237588652482 |
    |   3    | 0.0992907801418 | 0.276595744681 |
    |   4    | 0.0946808510638 | 0.342553191489 |
    |   5    | 0.0814184397163 | 0.368794326241 |
    |   6    | 0.0841607565012 | 0.451773049645 |
    |   7    | 0.0782168186424 | 0.491489361702 |
    |   8    | 0.0739361702128 | 0.528368794326 |
    |   9    | 0.0816390858944 | 0.643262411348 |
    |   10   | 0.0815602836879 | 0.721985815603 |
    +--------+-----------------+----------------+
    [10 rows x 3 columns]

    

Select the Final Model and Create the Results Dataset

Note: Based on the comparison above, the Jaccard item-similarity model (m2) edges out Pearson (m1) at the larger cutoffs and clearly beats the factorization model (m3), so it is retrained on the full segment dataset.

In [15]:
train_s2_model_final = gl.recommender.item_similarity_recommender.create(
                    df_seg2_growable_SFrame, user_id='Card_ID', item_id='pdt_type',
                    similarity_type='jaccard')
    
Recsys training: model = item_similarity
Preparing data set.
    Data has 6295 observations with 3766 users and 21 items.
    Data prepared in: 0.028038s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 983us                          | 79.5       |
| 1.981ms                        | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 2.991ms                             | 0                | 0               |
| 5.985ms                             | 100              | 21              |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 0.008986s

In [16]:
## Output final dataset for visualization
recs_final = train_s2_model_final.recommend()
recs_final.save('dataset\dataset_final_recs2.csv', format='csv')

recommendations finished on 1000/3766 queries. users per second: 200120
recommendations finished on 2000/3766 queries. users per second: 133333
recommendations finished on 3000/3766 queries. users per second: 99950
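As a possible follow-up (a sketch, not executed in this notebook), GraphLab's recommend() also accepts a specific list of users and a list length k, which is handy for spot-checking individual card IDs:

sample_users = df_seg2_growable_SFrame['Card_ID'].unique().head(3)     # three example users from the segment
recs_sample = train_s2_model_final.recommend(users=sample_users, k=5)  # top-5 items per sampled user
recs_sample.print_rows(num_rows=15)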