Random Acts of Pizza

By | 2014/07/29

So I stumbled upon this Kaggle competition and decided to give it a try. The original data is in JSON format and can be found on the competition website. It offers a large number of variables, so it is hard to pick just a few of them. My approach was to perform sentiment analysis on the request text (as we used to do with Twitter data) and include the resulting scores alongside the rest of the variables.

The following code extracts the information and stores it in a CSV file using Python.


import os
import json
import numpy as num

# Load the JSON dataset

os.chdir("D:/datasets")
with open("pizza_request_dataset.json") as f:
	t = json.load(f)

# Load sentiment dictionary

scores = {}
with open("AFINN-111.txt") as afinnfile:
	for line in afinnfile:
		term, score = line.strip().split("\t")
		scores[term] = int(score)

# Keyword lists used to build simple topic counts

desire = ["friend","party","birthday","boyfriend","girlfriend","date","drinks","drunk","wasted","invite","invited","celebrate","celebrating","game","games","movie","beer","crave","craving"]
family = ["husband","wife","family","parent","parents","mother","father","mom","mum","son","dad","daughter"]
job = ["job","unemployment","employment","hire","hired","fired","interview","work","paycheck"]
money = ["money","bill","bills","rent","bank","account","paycheck","due","broke","deposit","cash","dollar","dollars","bucks","paid","payed","buy","check","spent","financial","poor","loan","credit","budget","day","now","time","week","until","last","month",
         "tonight","today","next","night","when","tomorrow","first","after","while","before","long","hour","friday","ago","still","past","soon","current","years","never","till","yesterday","morning","evening"]
student = ["college","student","university","finals","study","studying","class","semester","school","roommate","project","tuition","dorm"]

# Create arrays for convenient storage

sentiment = num.zeros(len(t))
desire_vec = num.zeros(len(t))
family_vec = num.zeros(len(t))
job_vec = num.zeros(len(t))
money_vec = num.zeros(len(t))
student_vec = num.zeros(len(t))
upvotes_vec = num.zeros(len(t))
accountagerequest_vec = num.zeros(len(t))
accountageretrieval_vec = num.zeros(len(t))
commentsretrieval_vec = num.zeros(len(t))
flair_vec = num.zeros(len(t))
pizza_vec = num.zeros(len(t))


# Extract the variables of interest from each record

for i in range(len(t)):
	text = t[i]["request_text"]
	upvotes = t[i]["requester_upvotes_minus_downvotes_at_retrieval"]
	accountagerequest = t[i]["requester_account_age_in_days_at_request"]
	accountageretrieval = t[i]["requester_account_age_in_days_at_retrieval"]
	commentsretrieval = t[i]["requester_number_of_comments_at_retrieval"]
	if t[i]["requester_user_flair"] == "schroom":
		flair = 1
	if t[i]["requester_user_flair"] == "PIF":
		flair = 2
	else:
		flair = 0	
	if t[i]["requester_received_pizza"] == True:
		pizza = 1
	else:
		pizza = 0

	# Compute the sentiment score and keyword counts for this request

	words = text.split()
	sen = 0
	des = 0
	fam = 0
	jo = 0
	mon = 0
	stu = 0

	for word in words:
		# Normalise the token so it can match the lowercase dictionaries
		word = word.lower().strip(".,!?;:\"'")
		# Words missing from AFINN simply contribute 0 to the sentiment
		sen += scores.get(word, 0)
		if word in desire:
			des += 1
		if word in family:
			fam += 1
		if word in job:
			jo += 1
		if word in money:
			mon += 1
		if word in student:
			stu += 1

	sentiment[i] = sen
	desire_vec[i] = des
	family_vec[i] = fam
	job_vec[i] = jo
	money_vec[i] = mon
	student_vec[i] = stu
	upvotes_vec[i] = upvotes
	accountagerequest_vec[i] = accountagerequest
	accountageretrieval_vec[i] = accountageretrieval
	commentsretrieval_vec[i] = commentsretrieval
	flair_vec[i] = flair
	pizza_vec[i] = pizza


# Write one comma-separated row of features per request (values are formatted as integers)

num.savetxt('test.txt',
            num.c_[sentiment, desire_vec, family_vec, job_vec,
                   money_vec, student_vec, upvotes_vec,
                   accountagerequest_vec, accountageretrieval_vec,
                   commentsretrieval_vec, flair_vec, pizza_vec],
            fmt='%1d', delimiter=",")
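
As a quick sanity check, the resulting file can be read back into Python. The sketch below uses pandas, and the column names are just labels I chose to match the order of the arrays passed to savetxt; they are not part of the dataset itself.

import pandas as pd

# Columns follow the order of the arrays written out above
cols = ["sentiment", "desire", "family", "job", "money", "student",
        "upvotes", "account_age_request", "account_age_retrieval",
        "comments_retrieval", "flair", "pizza"]

df = pd.read_csv("test.txt", header=None, names=cols)
print(df.shape)
print(df["pizza"].mean())  # fraction of requests that actually got a pizza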

However, the question of which model to use to predict the outcome remains open. LDA and logistic regression perform poorly, so I might give SVMs a try.
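
For reference, here is a minimal sketch of what that next attempt could look like, assuming scikit-learn and the comma-separated file produced above; the train/test split and the SVM parameters are placeholders rather than tuned choices.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Load the features written above; the last column is the pizza outcome
data = np.loadtxt("test.txt", delimiter=",")
X, y = data[:, :-1], data[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# SVMs are sensitive to feature scale, so standardise before fitting
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))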
