Building a recommendation engine using machine learning
How does Apriori work?
With the help of the following simple example we will explore the internal steps of the Apriori algorithm. Below is a sample supermarket dataset; we will run Apriori on it to mine association rules. Apriori uses three computational measures to mine association rules from the input dataset: support, confidence, and lift. Let's see how to compute each of them below.
Support: First step of Apriori
This is the first step of the Apriori computations. The algorithm first identifies all the unique items in the input dataset; the list of unique items present in our sample dataset is shown below. Along with the unique items list it also finds the total number of records in the input dataset, which in our case is 5.

The next step is to calculate the support for each item in the list. Support refers to the default popularity of an item and is calculated as the number of transactions containing the item divided by the total number of transactions. Suppose we want to find the support for the item "tomato sauce":

Support("tomato sauce") = Transactions containing("tomato sauce") / Total transactions

In our case, three out of the five transactions contain "tomato sauce". Hence, Support("tomato sauce") = 3 / 5 = 0.6. Likewise, Apriori calculates the support for every item; see the calculated values in the table below.

The next step is to calculate the support for groups of two items. A group of more than one item is called an itemset, for example the itemset ("bread, tomato sauce"). The order of items within an itemset does not matter: ("bread, tomato sauce") is the same itemset as ("tomato sauce, bread"). The support formula for an itemset is the same as the formula for a single item:

Support("bread, tomato sauce") = Transactions containing("bread, tomato sauce") / Total transactions

In our case, three out of the five transactions contain both bread and tomato sauce. Hence, Support("bread, tomato sauce") = 3 / 5 = 0.6. Likewise, Apriori calculates the support for all the itemsets of size 2; see the calculated values in the table below.
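The support calculation above is easy to sketch in plain Python. The five transactions below are an assumed reconstruction of the sample dataset (only the bread and tomato sauce counts are taken from the text; the other items are placeholders):

```python
# Assumed reconstruction of the five sample transactions:
# "tomato sauce" appears in 3 of them, "bread" in 4, and both together in 3.
transactions = [
    {"bread", "tomato sauce", "milk"},
    {"bread", "tomato sauce", "eggs"},
    {"bread", "tomato sauce", "butter"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)  # set subset test
    return hits / len(transactions)

print(support({"tomato sauce"}, transactions))           # 0.6
print(support({"bread", "tomato sauce"}, transactions))  # 0.6
```

Note that the same function handles single items and itemsets alike, mirroring how Apriori reuses one support formula for both.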
Similarly, Apriori calculates the support for all the possible itemsets. It continues generating itemsets of size 1 up to N-1, where N is the number of items in the largest transaction. In our example every transaction contains the same number of items, 3, so N = 3 and the algorithm generates itemsets of size 1 and 2 only, stopping at N-1 = 2.
Confidence: Second step of Apriori
The confidence of a rule (bread => tomato sauce) is the likelihood that tomato sauce is also bought when bread is bought; it indicates how often the rule has been found to be true. Confidence is calculated as the number of transactions where bread and tomato sauce are bought together, divided by the number of transactions where bread is bought. Mathematically:

Confidence(A => B) = Transactions containing both A and B / Transactions containing A

which can also be written as:

Confidence(A => B) = Support(A and B) / Support(A)

So the support values calculated in the earlier step can be reused here. Coming back to our example: bread and tomato sauce are bought together in three transactions, and bread is bought in four. Hence, Confidence(bread => tomato sauce) = 3/4 = 0.75.

Confidence can give some important insights, but it has a major drawback: it accounts only for the popularity of the itemset A (in our case "bread") and not for the popularity of B (in our case "tomato sauce"). If B is popular on its own, a transaction containing A is likely to contain B regardless of any real association, which inflates the confidence. This drawback is overcome by calculating lift.
Lift: Third step of Apriori
Lift, or the lift ratio, is the ratio of confidence to expected confidence, where the expected confidence of a rule (A => B) is simply the frequency (support) of B. Lift tells us how much better a rule predicts the result than just assuming the result in the first place; greater lift values indicate stronger associations. Because lift(A => B) takes into account the popularities of both items A and B, it shows how much more likely item B is to be purchased when A is purchased. Mathematically:

Lift(A => B) = Confidence(A => B) / Support(B)

or equivalently:

Lift(A => B) = Support(A and B) / (Support(A) * Support(B))

Now, let's go back to our example and calculate the lift for (bread => tomato sauce):

Lift(bread => tomato sauce) = Confidence(bread => tomato sauce) / Support(tomato sauce) = 0.75 / 0.6 = 1.25

A lift value greater than 1 means that item B is likely to be bought if item A is bought, while a value less than 1 means that item B is unlikely to be bought if item A is bought.
Python Implementation:
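Before bringing in a library, the hand calculations above can be verified in a few lines; this sketch simply re-derives confidence and lift from the counts stated in the worked example:

```python
# Counts from the worked example: 5 transactions in total,
# bread in 4, tomato sauce in 3, both together in 3.
total = 5
n_bread, n_sauce, n_both = 4, 3, 3

support_bread = n_bread / total            # Support(A)
support_sauce = n_sauce / total            # Support(B)
support_both = n_both / total              # Support(A and B)

confidence = support_both / support_bread  # Confidence(A => B)
lift = confidence / support_sauce          # Lift(A => B)

print(round(confidence, 2), round(lift, 2))  # 0.75 1.25
```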
Now, let's build a working Python program that performs all the calculations described above. It should list all the unique items, create the unique itemsets, calculate support for items and itemsets, and calculate confidence and lift for all the possible item/itemset combinations. The program should then return recommendations for a given item or itemset.

The first thing required in any machine learning / data mining system is data. We are using supermarket data to build a recommender system for the supermarket. Our data is in CSV form: each line in the CSV represents the list of items purchased by one customer. The input CSV file has no header, and the number of items per line may vary. In the above image you can see that customer number thirteen has purchased five items while customer number sixteen has purchased only one item. You can find the sample data file and the entire program here: https://github.com/srkhedkar/Recommender

We are making use of the package apyori to calculate the association rules. If it is not installed on your system, install it with the command "pip install apyori". Now, let's write the Python code. Here we have imported the required packages, defined a class Recommender, and defined its constructor. While instantiating a Recommender object we need to provide it with the data file.

from apyori import apriori
from apyori import load_transactions

class Recommender():
    def __init__(self, inputFile):
        self.AssociationRulesDictionary = {}  # holds final output
        self.dataFile = inputFile             # input datafile in csv form
        self.association_rules = []           # holds output from Apriori algo
The next step is to compute the association rules. For that we need to compute the support, confidence, and lift values explained above. The following function is defined to compute all of them.
def computeRules(self):
    """
    Computes all association rules.
    :return:
    """
    with open(self.dataFile) as fileObj:
        transactions = list(load_transactions(fileObj, delimiter=","))

    # remove empty strings if any
    transactions_filtered = []
    for li in transactions:
        li = list(filter(None, li))
        transactions_filtered.append(li)

    # The following line does all the computations.
    # lift > 1 shows a positive correlation within the itemset, i.e., its items
    # are more likely to be bought together.
    # lift < 1 shows a negative correlation within the itemset, i.e., its items
    # are unlikely to be bought together.
    # Hence we set min_lift=1.0 to ignore all rules with lift < 1.0.
    self.association_rules = apriori(transactions_filtered, min_support=0.01,
                                     min_confidence=0.01, min_lift=1.0,
                                     max_length=None)
In the code above you can see that all the important computations are done in the call to apriori(), which comes from the imported package apyori. We do not have to do anything; everything is taken care of by apriori()! This is the reason I am in love with this language. It is so simple, yet so powerful, and the ecosystem is so rich that most things have already been implemented. This reminds me of a famous Python joke. One day little Jony was flying up in the sky. Little Jack asked him from the ground, "Hey Jony, how come you are flying so high up in the sky? What did you do?" Little Jony replied, "Nothing… I just imported the antigravity module of Python!" 😀 😀
apriori() returns its output in the form shown below. Our next task is to extract the information from this output and store it in a dictionary for easy access.
Apriori output:
ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'Chocolate', 'Milk', 'eggs'}), confidence=0.16666666666666666, lift=1.0), OrderedStatistic(items_base=frozenset({'Chocolate'}), items_add=frozenset({'Milk', 'eggs'}), confidence=0.5, lift=1.5), OrderedStatistic(items_base=frozenset({'Milk'}), items_add=frozenset({'Chocolate', 'eggs'}), confidence=0.5, lift=2.0), OrderedStatistic(items_base=frozenset({'eggs'}), items_add=frozenset({'Milk', 'Chocolate'}), confidence=0.2222222222222222, lift=1.3333333333333333), OrderedStatistic(items_base=frozenset({'Milk', 'Chocolate'}), items_add=frozenset({'eggs'}), confidence=1.0, lift=1.3333333333333333), OrderedStatistic(items_base=frozenset({'eggs', 'Chocolate'}), items_add=frozenset({'Milk'}), confidence=0.6666666666666666, lift=2.0), OrderedStatistic(items_base=frozenset({'Milk', 'eggs'}), items_add=frozenset({'Chocolate'}), confidence=0.5, lift=1.5)])
To extract the rules from the output of apriori() we have defined extractRules(). This function extracts the rules and stores them in AssociationRulesDictionary, sorted in descending order of lift values.
def extractRules(self):
    for item in self.association_rules:
        # item[0] holds the full itemset (base items plus add items);
        # skip itemsets with fewer than two items.
        if len(item[0]) < 2:
            continue
        for k in item[2]:
            baseItemList = list(k[0])
            # if the base itemset is empty then go to the next record
            if not baseItemList:
                continue
            # sort baseItemList before using it as a key in the AssociationRules dictionary
            baseItemList.sort()
            baseItemList_key = tuple(baseItemList)
            if baseItemList_key not in self.AssociationRulesDictionary.keys():
                self.AssociationRulesDictionary[baseItemList_key] = []
            self.AssociationRulesDictionary[baseItemList_key].append((list(k[1]), k[3]))
            # if something goes wrong, use the following print block to inspect values
            #print("Base item: ", baseItemList_key)
            #print("Target item: ", list(k[1]))
            #print("Confidence: " + str(k[2]))
            #print("Lift: " + str(k[3]))
    # sort the rules in descending order of lift values
    for ruleList in self.AssociationRulesDictionary:
        self.AssociationRulesDictionary[ruleList].sort(key=lambda x: x[1], reverse=True)
The two methods defined above are not meant to be called outside of the class definition. Being an old-fashioned C++ developer, I stick to this restriction; but I leave it to you guys: if you want to call these two functions outside of the class, you may.
We have defined a method studyRules() which calls the two functions above, computeRules() and extractRules(). studyRules() plays the role of a template method here: in case you want to extend the Recommender class and use another algorithm instead of Apriori to compute association rules, studyRules() will give you the polymorphic behavior. An object of the Recommender class should call studyRules() to compute and extract the association rules from the input data file.
def studyRules(self):
    """
    This is a template method for computation and rule extraction.
    :return:
    """
    self.computeRules()
    self.extractRules()
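The template-method idea can be sketched in isolation. Note that FPGrowthRecommender below is a hypothetical subclass invented for illustration, not part of this article's code:

```python
class Recommender:
    def computeRules(self):
        # default step: mine rules with Apriori (details omitted in this sketch)
        self.association_rules = ["rules from Apriori"]

    def extractRules(self):
        # shared step: turn raw rules into a dictionary (details omitted)
        self.rules_extracted = True

    def studyRules(self):
        # template method: the sequence is fixed, the steps are overridable
        self.computeRules()
        self.extractRules()

# Hypothetical subclass that swaps in a different rule-mining algorithm.
class FPGrowthRecommender(Recommender):
    def computeRules(self):
        self.association_rules = ["rules from FP-Growth"]

miner = FPGrowthRecommender()
miner.studyRules()               # runs the overridden computeRules, then extractRules
print(miner.association_rules)   # ['rules from FP-Growth']
```

Callers keep using studyRules() unchanged; only the overridden step differs.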
Now we have all the computations done and all association rules extracted into a dictionary. Let's define some APIs to return recommendations and deals using the computed association rules.
def recommend(self, itemList, Num=1):
    """
    itemList is a list of items selected by the user.
    Num is the total number of recommendations required.
    :param itemList:
    :return:
    """
    # convert itemList to a tuple, as our dictionary keys are sorted tuples
    itemList.sort()
    itemTuple = tuple(itemList)
    if itemTuple not in self.AssociationRulesDictionary.keys():
        return []
    return self.AssociationRulesDictionary[itemTuple][:Num]
recommend() returns recommendations for the selected item/itemset. For example, if you want to find out which item is most often preferred along with 'red wine', you can get it by calling recommend() as shown below:
objectName.recommend(['red wine'], 1)
In our case, for the data file we are using, it returns us
[(['spaghetti'], 2.095966120638976)]
This tells us that 'spaghetti' is the item most often preferred along with 'red wine'.
Now, let's formulate deals from these recommendations. The API showDeals() is defined to create deals using the recommendations returned by recommend(). In showDeals() we calculate a discount percentage from the lift value of the recommended itemset. recommend() returns somewhat abstract output; in showDeals() we try to give it some meaning.
def showDeals(self, itemList, Num=1):
    """
    We convert the recommendations into deals. The lift value is used to
    calculate the discount percentage: discount percentage = 10 * lift.
    itemList is a list of items selected by the user.
    Num is the total number of deals required.
    :return:
    """
    recommendations = self.recommend(itemList, Num)
    for item in recommendations:
        print("If you buy ", item[0], " along with ", itemList, " then you will get ",
              round((item[1] * 10), 2), "% discount on total cost!!")
Now, our entire recommender system is ready!! Let’s play with it to see how it behaves.
# Create a recommender Alexa, give her the datafile containing transactions.
Alexa = Recommender("store_data.csv")
# Request Alexa to study rules from the provided datafile
Alexa.studyRules()
# Ask Alexa for the deals
Alexa.showDeals(['red wine'], 2)
Hey Alexa! Show me the two topmost deals on 'red wine'. And Alexa shows you the following deals.
1. If you buy ['spaghetti'] along with ['red wine'] then you will get 20.96 % discount on total cost!!
2. If you buy ['mineral water'] along with ['red wine'] then you will get 16.3 % discount on total cost!!
Isn't it amazing? You can use the same recommender system to build a video or song recommender; you just need to provide the corresponding data file. For example, it would look something like this:
Michael = Recommender("songs_data.csv")
Michael.studyRules()
Michael.recommend(['summer of 69', 'Billie Jean'], 5)
Here, Michael would recommend the five songs most often listened to along with 'summer of 69' and 'Billie Jean'!
The entire program along with the sample datafile is uploaded on GitHub here https://github.com/srkhedkar/Recommender
You can also find the complete program below.
Complete Python recommender system script:
from apyori import apriori
from apyori import load_transactions

class Recommender():
    def __init__(self, inputFile):
        self.AssociationRulesDictionary = {}  # holds final output
        self.dataFile = inputFile             # input datafile in csv form
        self.association_rules = []           # holds output from Apriori algo

    def computeRules(self):
        """
        Computes all association rules.
        :return:
        """
        with open(self.dataFile) as fileObj:
            transactions = list(load_transactions(fileObj, delimiter=","))

        # remove empty strings if any
        transactions_filtered = []
        for li in transactions:
            li = list(filter(None, li))
            transactions_filtered.append(li)

        # The following line does all the computations.
        # lift > 1 shows a positive correlation within the itemset, i.e., its items
        # are more likely to be bought together.
        # lift < 1 shows a negative correlation within the itemset, i.e., its items
        # are unlikely to be bought together.
        # Hence we set min_lift=1.0 to ignore all rules with lift < 1.0.
        self.association_rules = apriori(transactions_filtered, min_support=0.01,
                                         min_confidence=0.01, min_lift=1.0,
                                         max_length=None)

    def extractRules(self):
        for item in self.association_rules:
            # item[0] holds the full itemset (base items plus add items);
            # skip itemsets with fewer than two items.
            if len(item[0]) < 2:
                continue
            for k in item[2]:
                baseItemList = list(k[0])
                # if the base itemset is empty then go to the next record
                if not baseItemList:
                    continue
                # sort baseItemList before using it as a key in the AssociationRules dictionary
                baseItemList.sort()
                baseItemList_key = tuple(baseItemList)
                if baseItemList_key not in self.AssociationRulesDictionary.keys():
                    self.AssociationRulesDictionary[baseItemList_key] = []
                self.AssociationRulesDictionary[baseItemList_key].append((list(k[1]), k[3]))
                # if something goes wrong, use the following print block to inspect values
                #print("Base item: ", baseItemList_key)
                #print("Target item: ", list(k[1]))
                #print("Confidence: " + str(k[2]))
                #print("Lift: " + str(k[3]))
        # sort the rules in descending order of lift values
        for ruleList in self.AssociationRulesDictionary:
            self.AssociationRulesDictionary[ruleList].sort(key=lambda x: x[1], reverse=True)

    def recommend(self, itemList, Num=1):
        """
        itemList is a list of items selected by the user.
        Num is the total number of recommendations required.
        :param itemList:
        :return:
        """
        # convert itemList to a tuple, as our dictionary keys are sorted tuples
        itemList.sort()
        itemTuple = tuple(itemList)
        if itemTuple not in self.AssociationRulesDictionary.keys():
            return []
        return self.AssociationRulesDictionary[itemTuple][:Num]

    def studyRules(self):
        """
        This is a template method for computation and rule extraction.
        :return:
        """
        self.computeRules()
        self.extractRules()

    def showDeals(self, itemList, Num=1):
        """
        We convert the recommendations into deals. The lift value is used to
        calculate the discount percentage: discount percentage = 10 * lift.
        itemList is a list of items selected by the user.
        Num is the total number of deals required.
        :return:
        """
        recommendations = self.recommend(itemList, Num)
        for item in recommendations:
            print("If you buy ", item[0], " along with ", itemList, " then you will get ",
                  round((item[1] * 10), 2), "% discount on total cost!!")

Alexa = Recommender("store_data.csv")
Alexa.studyRules()
print(Alexa.recommend(['red wine'], 1))
Alexa.showDeals(['red wine'], 2)