Building recommendation engine using machine learning

0 / 188
Recommendation Engine

Overview

 

You gonna buy a mobile phone on Amazon and Amazon recommends you to buy a screen protection glass and a case cover along with it. It also gives you discount if you buy all these three things together. If you go to Youtube to watch your favorite song, Youtube recommends you to watch similar other songs which you may like. On Google search, at the bottom of the search results page you get related searches. You get recommendations everywhere; on Netflix, on Facebook, on Twitter, all social networking sites have built their own recommender systems . Not only on the online websites but also in the supermarkets you find recommendations are given directly or indirectly. In the supermarkets the things which are bought together are kept together or they offer deals if you buy things together. How do they know your buying patterns ? How do they know your likes and dislikes? We are going to understand that secret (building recommendation engine using machine learning) in this article.

 

No alt text provided for this image

Recommender system is an important and a popular area of machine learning. It has helped business to increase their sales and profits. We cannot imagine a market place today which is not having any recommender system in place. These companies analyse our transaction data, they try to deduce our buying patterns. From the buying patterns they create association rules. These association rules are then used to form recommendations and deals. There are various data mining algorithms such as k-means, Page Rank, EM data mining, and Apriori which are used to build the recommender systems. In this article we are going to build our recommender system using Apriori algorithm.

 

Apriori algorithm is one of the most popular and classical algorithm in data mining. It uses association rule mining which is a technique to identify underlying relations between different items in the dataset. Let’s first understand association rule mining using a supermarket example. In a supermarket customers can buy variety of items. Usually, there is a pattern in what the customers buy. For example, if someone wants to prepare a bread sandwich then most probably she will buy bread, butter, jam and tomato sauce. The chances that tomato sauce is being bought along with bread and butter is very high. From the supermarket transaction data we can find that there are many transactions in which all the above four things are bought together. So, now we can say that tomato sauce is strongly associated with bread, butter and jam. And, supermarket can create a new rule in their system to keep these four things together because they are frequently bought together. Association rule mining is nothing but a technique to find out such rules.     

 

The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation or IP addresses). Apriori uses a “bottom up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

 

How does Apriori work?

 

With the help of following simple example we will explore the internal steps of Apriori algorithm. Following is a sample super market dataset. We are going to run Apriori on that dataset to mine association rules from it. 

 

No alt text provided for this image

Apriori uses three important computational factors to mine association rules from the input dataset. These three computational factors are, supportconfidence and lift. Let’s see how to compute them below.

 

Support: First step of Apriori

 

This is the first step of Apriori computations. In this step, first, the algorithm identifies all unique items from the input dataset. Following is the list of unique items present in our sample dataset. Along with the unique items list it also finds out the total number of records in the input dataset. In our case total number of records = 5.

 

No alt text provided for this image

The next step is to calculate the support for each item present in the above list. Support refers to the default popularity of an item and can be calculated by finding number of transactions containing a particular item divided by total number of transactions.

 

Suppose we want to find support for item “tomato sauce”. This can be calculated as:

 

Support(“tomato sauce”) = Transactions containing(“tomato sauce”) / Total Transactions;

 

In our case there are three transactions out of total five which has item “tomato sauce” in them. Hence,

 

Support(“tomato sauce”) = 3 / 5 = 0.6;

 

Likewise, Apriori calculates support for all the items. Please see the calculated support for all the items in the table below.

 

No alt text provided for this image

The next step is to calculate support for group of items having two items in a group. When there is more than one item in a group then it is called an itemset. For example itemset (“bread, tomato sauce”). While creating an item set the order of the items does not matter. Itemset (“bread, tomato sauce”) is same as itemset (“tomato sauce, bread”).

 

The formula to calculate the support for an itemset is same as that of formula for an item. For example, support for itemset “bread, tomato sauce” is calculated as:

 

Support(“bread , tomato sauce”) = Transactions containing(“bread , tomato sauce”) / Total Transactions;

 

In our case there are three transactions out of total five which has both the items bread and tomato sauce in them. Hence,

 

Support(“bread, tomato sauce”) = 3 / 5 = 0.6;

 

Likewise Apriori calculates support for all the itemsets of size 2. Please see the calculated support for all the itemsets in the table below.

 

No alt text provided for this image

Similarly Apriori calculates support for all the possible itemsets. It continues to generate itemsets of size 1 to N-1 where N is the total number of items in the largest transaction. In our example all transactions have equal number of items in them, and that value is 3. Hence, in our case value of N is 3, hence, the algorithm will generate itemsets of size 1 and 2 only. It will stop at 2, as in our case N-1 =2.

 

Confidence: Second step of Apriori

 

Confidence of a rule ( bread => tomato sauce ) refers to the likelihood of tomato sauce is also bought if bread is bought. Confidence is an indication of how often the rule has been found to be true. Confidence is calculated by finding the number of transactions where bread and tomato sauce are bought together, divided by total number of transactions where bread is bought. Mathematically, it is represented as 

 

Confidence ( A=>B) = (Transactions containing both (A and B)) / (Transactions containing A);

 

The above equation can also be represented as:

 

Confidence ( A=>B) = Support (A and B))/Support(A);

 

So, the support values calculated in the earlier step can be used here to calculate the confidence.

 

Coming back to our example; there are three transactions in which bread and tomato sauce are bought together and there are four transactions in which bread is brought. Hence ,

 

Confidence ( bread => tomato sauce ) = 3/4 = 0.75

 

Confidence can give some important insights, but it also has a major drawback. It only takes into account the popularity of the itemset A (in our case A is “bread” ) and not the popularity of B ( in our case B is “tomato sauce”) . If B (tomato sauce) is equally popular as A (bread) then there will be a higher probability that a transaction containing A (bread) will also contain B(tomato sauce) thus increasing the confidence. This drawback is overcome in calculating lift.

 

Lift: Third step of Apriori

Lift or lift ratio is the ratio of confidence to expected confidence. Expected confidence is the confidence divided by the frequency of B. The Lift tells us how much better a rule is at predicting the result than just assuming the result in the first place. Greater lift values indicate stronger associations.

 

While calculating lift(A=>B) we take into account the popularities of both the items A and B. This lift ratio shows how likely item B is purchased when A is purchased. Mathematically lift is expressed as

 

Lift(A=>B) = Confidence ( A=>B) / Support(B) ;

 

Or,

 

Lift(A=>B) = Support ( A=>B) / ( Support(A) * Support(B) );

 

Now, let’s go back to our example and calculate lift for (bread => tomato sauce).

 

lift(bread => tomato sauce) = Confidence ( bread => tomato sauce ) / Support(tomato sauce);

 

lift(bread => tomato sauce) = 0.75 / 0.6 = 1.25

 

A lift value greater than 1 means that item B is likely to be bought if item A is bought, while a value less than 1 means that item B is unlikely to be bought if item A is bought.

 

Python Implementation:

 

Now, let’s build a working python program which will perform all the calculations described above. It should list down all the unique items, create unique item sets, calculate support for items and item sets, calculate confidence and lift for all the possible item-itemsets combinations. The program should return us recommendations for given item/item set. 

 

The first thing required in any machine learning/ data mining system is data. We are using a supermarket data to build a recommender system for the super market. Our data is in csv form. Each line in the csv represents a list of items purchased by a customer. 

 

csv data

The input csv file has no header. Total number of items on each line may not be equal. In the above image you can see that customer number thirteen has purchase five items while customer number sixteen has purchased only one item. You can find the sample data file and the entire program here https://github.com/srkhedkar/Recommender

 

We are making use of package apyori to calculate association rule. If it is not installed on your system then install it using command “pip install apyori”

 

Now, let’s write the python code. Here, we have imported the required packages and we have defined a class Recommender and have defined its constructor. While instantiating a Recommender object we need to provide it the data file. 

 

from apyori import apriori
from apyori import load_transactions

class Recommender():

    def __init__(self, inputFile):
        self.AssociationRulesDictionary = {} # holds final output
        self.dataFile = inputFile # input datafile in csv form        
        self.association_rules = [] # holds output from Apriori algo 

 

Next step is to compute the association rules. For that we need to compute support, confidence and lift all these things explained above. Following function is defined to compute all these things.

 

 def computeRules(self):
        """
        Computes all association rules.
        :return:
        """
        with open(self.dataFile ) as fileObj:

            transactions = list(load_transactions(fileObj, delimiter=","))

            # remove empty strings if any
            transactions_filtered = []
            for li in transactions:
                li = list(filter(None, li))
                transactions_filtered.append(li)

            # Following line does all computations
            # lift > 1 shows that there is a positive correlation within the itemset, i.e., items in the
            # itemset, are more likely to be bought together.
            # lift < 1 shows that there is a negative correlation within the itemset, i.e., items in the
            # itemset, are unlikely to be bought together.
            # hence we have set min_lift=1.0 to ignore all rules with lift < 1.0
            self.association_rules = apriori(transactions_filtered,       min_support=0.01, min_confidence=0.01, min_lift=1.0,max_length=None)

 

In the code above you can see that all the important computations are done in the function call apriori(), which comes from the imported package apyori. We do not have to do anything!! All things are taken care by apriori()!! This is the reason I am in love with this language. It is so simple; it is so powerful!! The ecosystem is so rich that we can find most of the things have already been implemented. This reminds me of a famous python joke. One day little Jony was flying up in the sky!! Little Jack asked him from the ground “Hey Jony, how come you are flying so up in the sky? What did you do?”. Little Jony replied “Nothing… I just imported one antigravity module of python!!”. 😀 😀 

 

The apriori() returns us the output in the form shown below. Our next task is to extract the information from this output and store it in a dictionary for easy access.

 

Apriori output:

 

ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({‘Chocolate’, ‘Milk’, ‘eggs’}), confidence=0.16666666666666666, lift=1.0), OrderedStatistic(items_base=frozenset({‘Chocolate’}), items_add=frozenset({‘Milk’, ‘eggs’}), confidence=0.5, lift=1.5), OrderedStatistic(items_base=frozenset({‘Milk’}), items_add=frozenset({‘Chocolate’, ‘eggs’}), confidence=0.5, lift=2.0), OrderedStatistic(items_base=frozenset({‘eggs’}), items_add=frozenset({‘Milk’, ‘Chocolate’}), confidence=0.2222222222222222, lift=1.3333333333333333), OrderedStatistic(items_base=frozenset({‘Milk’, ‘Chocolate’}), items_add=frozenset({‘eggs’}), confidence=1.0, lift=1.3333333333333333), OrderedStatistic(items_base=frozenset({‘eggs’, ‘Chocolate’}), items_add=frozenset({‘Milk’}), confidence=0.6666666666666666, lift=2.0), OrderedStatistic(items_base=frozenset({‘Milk’, ‘eggs’}), items_add=frozenset({‘Chocolate’}), confidence=0.5, lift=1.5)])

 

To extract the rules from the output of apriori() we have defined extractRules(). This function extracts the rules and stores them in AssociationRulesDictionary in the descending sorted order of lift values. 

 

def extractRules(self):

        for item in self.association_rules:
            # first index of the inner list
            # Contains base item and add item

            if len(item[0]) < 2:
                continue

            for k in item[2]:
                baseItemList = list(k[0])
                # if base item set is empty then go to the next record.
                if not baseItemList:
                    continue

                # sort the baseItemList before adding it as a key to the AssociationRules dictionary
                baseItemList.sort()
                baseItemList_key = tuple(baseItemList)

                if baseItemList_key not in self.AssociationRulesDictionary.keys():
                    self.AssociationRulesDictionary[baseItemList_key] = []
                self.AssociationRulesDictionary[baseItemList_key].append((list(k[1]), k[3]))

                # if something goes wrong, then use the following print block to print values
                #print("Base item: ", baseItemList_key)
                #print("Target item: ", list(k[1]))
                #print("Confidence: " + str(k[2]))
                #print("Lift: " + str(k[3]))

        # sort the rules in descending order of lift values.
        for ruleList in self.AssociationRulesDictionary:
            self.AssociationRulesDictionary[ruleList].sort(key=lambda x: x[1], reverse=True)   
     

 

The above defined two methods should not be called outside of the class definition. Me being an old fashioned C++ developer had to stick to this restriction. I leave it to you guys; if you want to call these two functions outside of the class then you may. 

 

We have defined a method studyRules() which calls the above two functions computeRules() and extractRules(). This function studyRules() plays a role of a template method here. In case you want to extend Recommender class and use another algorithm instead of Apriori to compute association rules then studyRules() will help you to get the polymorphic behavior. An object of Recommender class should call studyRules() to compute and extract the association rule from the input data file.

 

def studyRules(self):
    """
    This is a template method for computation and rule extraction.
    :return:
    """
    self.computeRules()
    self.extractRules()

 

Now, we have got all computations done and all association rules extracted into a dictionary. Let’s define some APIs to return recommendations and deals using the computed association rules.

 

def recommend(self, itemList, Num=1):

    """
    itemList is a list of items selected by user
    Num is total recommendations required.
    :param item:
    :return:
    """
    # convert itemList to itemTuple as our dictionary key is a sorted tuple

    itemList.sort()
    itemTuple = tuple(itemList)

    if itemTuple not in self.AssociationRulesDictionary.keys():
        return []

    return self.AssociationRulesDictionary[itemTuple][:Num]

 

recommend() returns recommendations for the selected item/itemset. For example, if you want to find out which is the mostly preferred item along with ‘red wine’ then you can get this value by calling recommend() as shown below

 

objectName.recommend([‘red wine’], 1)

 

In our case, for the data file we are using, it returns us 

 

[([‘spaghetti’], 2.095966120638976)]

 

This tells us that ‘spaghetti’ is the most preferred item along with ‘red wine’.

 

Now, let’s formulate deals from these recommendations. API showDeals() is defined to create deals using the recommendations returned by recommend(). In the showDeals() we are calculating discount percentage by using the lift value of the recommended itemset. API recommend() returns somewhat abstract output; in the showDeals() we are trying to give some meaning to it.

 

def showDeals(self, itemList, Num=1):

    """
    we are converting the recommendations into deals. The lift value is used to calculate discount percentage
    discount percentage = 10 * lift
    itemList is a list of items selected by user
    Num is total deals required.
    :return:
    """
    recommendations = self.recommend(itemList, Num)
    for item in recommendations:
        print( "If you buy ", item[0], " along with ", itemList, " then you will get ", round((item[1] * 10), 2), \

               "% discount on total cost!!" )

 

Now, our entire recommender system is ready!! Let’s play with it to see how it behaves.

 

# Create a recommender Alexa, give her the datafile containing transactions.
Alexa = Recommender("store_data.csv")

# Request Alexa to study rules from the provided datafile
Alexa.studyRules()
 
# Ask Alexa for the deals
Alexa.showDeals(['red wine'], 2)

 

Hey! Alexa! show me the two topmost deals on ‘Red Wine’. And Alexa shows you the following deals.

 

1. If you buy [‘spaghetti’] along with [‘red wine’] then you will get 20.96 % discount on total cost!!

 

2. If you buy [‘mineral water’] along with [‘red wine’] then you will get 16.3 % discount on total cost!!

 

Isn’t it amazing. You can uses the same recommender system to build a video or songs recommender. You just need to provide the corresponding datafile. For example, it would be something like this.

 

Michael = Recommender("songs_data.csv)
Michael.studyRules()
Michael.recommend(['summer of 69', 'Billie Jean'], 5)

 

Here, Michael would recommend five songs which are mostly listened with ‘summer of 69’ and ‘Billie Jean’!!

 

The entire program along with the sample datafile is uploaded on GitHub here https://github.com/srkhedkar/Recommender

 

You can also find the complete program below.

 

Complete Python recommender system script:

 

from apyori import apriori
from apyori import load_transactions


class Recommender():
    def __init__(self, inputFile):
        self.AssociationRulesDictionary = {} # holds final output
        self.dataFile = inputFile # input datafile in csv form
        self.association_rules = [] # holds output from Apriori algo

    def computeRules(self):
        """
        Computes all association rules.
        :return:
        """
        with open(self.dataFile ) as fileObj:

            transactions = list(load_transactions(fileObj, delimiter=","))

            # remove empty strings if any
            transactions_filtered = []
            for li in transactions:
                li = list(filter(None, li))
                transactions_filtered.append(li)

            # Following line does all computations
            # lift > 1 shows that there is a positive correlation within the itemset, i.e., items in the
            # itemset, are more likely to be bought together.
            # lift < 1 shows that there is a negative correlation within the itemset, i.e., items in the
            # itemset, are unlikely to be bought together.
            # hence we have set min_lift=1.0 to ignore all rules with lift < 1.0
            self.association_rules = apriori(transactions_filtered, min_support=0.01, min_confidence=0.01, min_lift=1.0,
                                        max_length=None)

    def extractRules(self):

        for item in self.association_rules:
            # first index of the inner list
            # Contains base item and add item

            if len(item[0]) < 2:
                continue

            for k in item[2]:

                baseItemList = list(k[0])
                # if base item set is empty then go to the next record.
                if not baseItemList:
                    continue

                # sort the baseItemList before adding it as a key to the AssociationRules dictionary
                baseItemList.sort()
                baseItemList_key = tuple(baseItemList)

                if baseItemList_key not in self.AssociationRulesDictionary.keys():
                    self.AssociationRulesDictionary[baseItemList_key] = []

                self.AssociationRulesDictionary[baseItemList_key].append((list(k[1]), k[3]))

                # if something goes wrong, then use the following print block to print values
                #print("Base item: ", baseItemList_key)
                #print("Target item: ", list(k[1]))
                #print("Confidence: " + str(k[2]))
                #print("Lift: " + str(k[3]))

        # sort the rules in descending order of lift values.
        for ruleList in self.AssociationRulesDictionary:
            self.AssociationRulesDictionary[ruleList].sort(key=lambda x: x[1], reverse=True)


    def recommend(self, itemList, Num=1):
        """
        itemList is a list of items selected by user
        Num is total recommendations required.
        :param item:
        :return:
        """

        # convert itemList to itemTuple as our dictionary key is a sorted tuple
        itemList.sort()
        itemTuple = tuple(itemList)

        if itemTuple not in self.AssociationRulesDictionary.keys():
            return []

        return self.AssociationRulesDictionary[itemTuple][:Num]

    def studyRules(self):
        """
        This is a template method for computation and rule extraction.
        :return:
        """
        self.computeRules()
        self.extractRules()

    def showDeals(self, itemList, Num=1):
        """
        we are converting the recommendations into deals. The lift value is used to calculate discount percentage
        discount percentage = 10 * lift
        itemList is a list of items selected by user
        Num is total deals required.
        :return:
        """
        recommendations = self.recommend(itemList, Num)

        for item in recommendations:
            print( "If you buy ", item[0], " along with ", itemList, " then you will get ", round((item[1] * 10), 2), \
                   "% discount on total cost!!" )


Alexa = Recommender("store_data.csv")

Alexa.studyRules()

print (Alexa.recommend(['red wine'], 1))

Alexa.showDeals(['red wine'], 2)



Comments

comments


14+ years of professional experience in developing products like Siemens Teamcenter, Siemens Product Master Manager, Intuit Quicken and Kosmix using Python, C++, JAVA, Oracle SQL. Pythonist | Python Aficionado | Machine Learning Enthusiast | Freelancer

Related Posts