Monthly Archives: July 2014

Benjamini–Hochberg procedure

The Benjamini–Hochberg procedure is a method for adjusting the significance threshold when performing multiple hypothesis tests. The motivation is that if you run a lot of hypothesis tests on a single dataset, you’re bound to find something by chance. If the type I error rate is \(\alpha\) in one test, then in \(m\) independent tests where every null hypothesis is true, the probability of rejecting at least one of them becomes \(1-(1-\alpha)^m\). Rather than bounding that probability, the procedure controls the false discovery rate: the expected proportion of rejected null hypotheses that are actually true.
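To get a feel for how quickly that probability grows, here is a minimal Python sketch that evaluates the formula above for a few numbers of tests. It assumes independent tests and \(\alpha = 0.05\); the function name `familywise_error` is mine, chosen just for illustration.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false rejection in n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

# Print the probability for an increasing number of tests
for n_tests in (1, 5, 10, 50):
    print(n_tests, round(familywise_error(0.05, n_tests), 3))
```

With only 10 tests the probability of at least one false positive is already about 0.40.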

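Implementing this procedure in R is quite straightforward. Suppose you have \(m\) null hypotheses \(H_1,…,H_m\) with p-values \(p_1,…,p_m\). Order the p-values increasingly, \(p_{(1)} \le … \le p_{(m)}\), find the largest \(k\) such that \(p_{(k)} \le \frac{k}{m}\alpha\), and reject the \(k\) hypotheses with the smallest p-values.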

## Benjamini–Hochberg procedure (BH step-up procedure)

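BHproc <- function(pvalues,alpha){
	sorted <- sort(pvalues)
	m <- length(pvalues)

	# Step-up rule: find the largest k such that the k-th ordered
	# p-value satisfies p_(k) <= k/m*alpha (k = 0 if there is none)
	below <- which(sorted <= seq_len(m)/m*alpha)
	k <- if (length(below) > 0) max(below) else 0

	cat("Declared significant: the first", k, "tests. Note p-values were ordered\n")
	invisible(k)
}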
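For comparison, here is the same step-up rule as a minimal Python sketch (Python being the language used in the other posts). The function name `benjamini_hochberg` and the example p-values are my own, made up for illustration. The key point is that the loop does not stop at the first p-value that misses its threshold: it keeps the largest rank that passes.

```python
def benjamini_hochberg(pvalues, alpha):
    """Return how many of the ordered p-values the BH step-up rule declares significant."""
    ordered = sorted(pvalues)
    m = len(ordered)
    k = 0
    # Step-up: remember the largest rank whose p-value is at most rank/m * alpha
    for rank, p in enumerate(ordered, start=1):
        if p <= rank / m * alpha:
            k = rank
    return k

# Made-up p-values: the second smallest misses its threshold, but the
# third and fourth pass theirs, so all four are declared significant.
print(benjamini_hochberg([0.04, 0.01, 0.035, 0.03], 0.05))  # prints 4
```

In R itself, the built-in `p.adjust(pvalues, method = "BH")` returns adjusted p-values that can be compared with `alpha` directly.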

Random acts of Pizza

So I stumbled upon this Kaggle competition and decided to give it a try. The original data is in JSON format and can be found on the competition website. It offers a large number of variables, so it is really difficult to select just a few of them. My approach was to compute a sentiment score for each request text (as we used to do with Twitter data) and include it alongside the rest of the variables.

The following Python code extracts the information and stores it in a comma-separated text file.


import os
import json
import numpy as num

# Load file in appropriate format

os.chdir("D:/datasets")
with open("pizza_request_dataset.json") as f:
	t = json.load(f)

# Load sentiment dictionary

scores = {}
with open("AFINN-111.txt") as afinnfile:
	for line in afinnfile:
		# Each line holds a term and its integer score, separated by a tab
		term, score = line.split("\t")
		scores[term] = int(score)

desire = ["friend","party","birthday","boyfriend","girlfriend","date","drinks","drunk","wasted","invite","invited","celebrate","celebrating","game","games","movie","beer","crave","craving"]    
family = ["husband","wife","family","parent","parents","mother","father","mom","mum","son","dad","daughter"]
job = ["job","unemployment","employment","hire","hired","fired","interview","work","paycheck"]
money = ["money","bill","bills","rent","bank","account","paycheck","due","broke","bills","deposit","cash","dollar","dollars","bucks","paid","payed","buy","check","spent","financial","poor","loan","credit","budget","day","now","time","week","until","last","month"
,"tonight","today","next","night","when","tomorrow","first","after","while","before","long","hour","Friday","ago","still","due","past","soon","current","years","never","till","yesterday","morning","evening"]
student = ["college","student","university","finals","study","studying","class","semester","school","roommate","project","tuition","dorm"]

# Create arrays for convenient storage

sentiment = num.zeros(len(t))
desire_vec = num.zeros(len(t))
family_vec = num.zeros(len(t))
job_vec = num.zeros(len(t))
money_vec = num.zeros(len(t))
student_vec = num.zeros(len(t))
upvotes_vec = num.zeros(len(t))
accountagerequest_vec = num.zeros(len(t))
accountageretrieval_vec = num.zeros(len(t))
commentsretrieval_vec = num.zeros(len(t))
flair_vec = num.zeros(len(t))
pizza_vec = num.zeros(len(t))


# Extract interest variables from each record

for i in range(len(t)):
	text = t[i]["request_text"]
	upvotes = t[i]["requester_upvotes_minus_downvotes_at_retrieval"]
	accountagerequest = t[i]["requester_account_age_in_days_at_request"]
	accountageretrieval = t[i]["requester_account_age_in_days_at_retrieval"]
	commentsretrieval = t[i]["requester_number_of_comments_at_retrieval"]
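	# Encode flair as 1 ("shroom"), 2 ("PIF") or 0 (anything else).
	# elif keeps a "shroom" flair from being overwritten by the else branch.
	if t[i]["requester_user_flair"] == "shroom":
		flair = 1
	elif t[i]["requester_user_flair"] == "PIF":
		flair = 2
	else:
		flair = 0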
	if t[i]["requester_received_pizza"] == True:
		pizza = 1
	else:
		pizza = 0

	#Compute sentiment for each text

	words = text.split()
	sen = 0
	des = 0
	fam = 0
	jo = 0
	mon = 0
	stu = 0

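	for word in words:
		# Words missing from the AFINN dictionary contribute 0 to the sentiment,
		# but are still counted towards the word categories below
		sen += scores.get(word, 0)
		if word in desire:
			des += 1
		if word in family:
			fam += 1
		if word in job:
			jo += 1
		if word in money:
			mon += 1
		if word in student:
			stu += 1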

	sentiment[i] = sen
	desire_vec[i] = des
	family_vec[i] = fam
	job_vec[i] = jo
	money_vec[i] = mon
	student_vec[i] = stu
	upvotes_vec[i] = upvotes
	accountagerequest_vec[i] = accountagerequest
	accountageretrieval_vec[i] = accountageretrieval
	commentsretrieval_vec[i] = commentsretrieval
	flair_vec[i] = flair
	pizza_vec[i] = pizza


num.savetxt('test.txt',num.c_[sentiment,desire_vec,family_vec,job_vec,
								money_vec, student_vec,upvotes_vec,
								accountagerequest_vec,accountageretrieval_vec,
								commentsretrieval_vec,flair_vec,pizza_vec],fmt='%1d',delimiter=",")
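To make the per-request features concrete, here is a small self-contained sketch of the same idea on a toy sentence. The function `text_features`, the mini-lexicon and the word lists are all made up for this example; they are not the real AFINN scores or the lists above. Unlike the script, this sketch lowercases the text before matching, which is my own assumption about what is wanted.

```python
def text_features(text, scores, categories):
    """Return (sentiment, {category: count}) for one request text."""
    sentiment = 0
    counts = {name: 0 for name in categories}
    for word in text.lower().split():
        # Words missing from the sentiment lexicon contribute 0
        sentiment += scores.get(word, 0)
        for name, vocabulary in categories.items():
            if word in vocabulary:
                counts[name] += 1
    return sentiment, counts

# Hypothetical mini-lexicon and word lists, for illustration only
toy_scores = {"broke": -1, "love": 3}
toy_categories = {"money": {"rent", "broke"}, "student": {"finals", "tuition"}}

print(text_features("Broke after paying rent and tuition", toy_scores, toy_categories))
# prints (-1, {'money': 2, 'student': 1})
```

Storing each word list as a set keeps the membership test cheap, which starts to matter once the lists grow.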

However, the question of which model to use to predict the outcome remains open. LDA and logistic regression both perform poorly, so I might give SVMs a try.

Messing with the IGN ratings dataset

I saw this Reddit link via @TextMining_r and I couldn’t resist doing some basic experimentation related to console/platform wars. Which platform was the best in its generation? Most argue it is not about the system itself but the games, so here is a magnificent ggplot2 graph showing the mean game score for every platform IGN has ever analysed.

[Figure: gamescore, mean game score per platform]

Of course, this analysis is inherently biased, since not every platform has had the same number of games released, but it is interesting anyway.
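One simple way to keep that bias in view is to report the number of games next to each mean. Here is a minimal sketch in plain Python; the function `platform_summary` and the (platform, score) pairs are made up for illustration and are not taken from the IGN dataset.

```python
from collections import defaultdict

def platform_summary(reviews):
    """Map each platform to (number of games, mean score)."""
    scores_by_platform = defaultdict(list)
    for platform, score in reviews:
        scores_by_platform[platform].append(score)
    return {p: (len(s), sum(s) / len(s)) for p, s in scores_by_platform.items()}

# Made-up reviews: platform "B" has the higher mean, but from a single game
toy_reviews = [("A", 9.0), ("A", 7.0), ("A", 8.0), ("B", 9.5)]
print(platform_summary(toy_reviews))
# prints {'A': (3, 8.0), 'B': (1, 9.5)}
```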

Knapsack 0/1 problem in Python

It’s been a while since I last posted something, but I have decided to start writing on the blog regularly again. Today I bring a very simple implementation of the 0/1 knapsack problem in Python using a dynamic programming approach. The first chunk of code calculates the cost matrix. The second uses the cost matrix to work out which items end up in the knapsack. In both functions, v holds the item values, w the item weights and W the capacity of the knapsack.

import numpy as num

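def knapsack(v,w,W):
	# c[i,j] is the best value achievable with items 0..i and capacity j
	n = len(v)
	c = num.zeros((n,W+1))

	for i in range(0,n):
		for j in range(0,W+1):
			# Best value without item i; the first item has no previous row
			# (c[i-1,j] with i = 0 would silently read the last row instead)
			prev = c[i-1,j] if i>0 else 0
			if w[i]>j:
				c[i,j] = prev
			else:
				rest = c[i-1,j-w[i]] if i>0 else 0
				c[i,j] = max(prev,v[i]+rest)
	return c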

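def items(w,c):
	# Walk the cost matrix backwards and return a 0/1 flag per item
	n = len(c)
	capacity = len(c[0,:])-1
	included = [0]*n

	for i in range(n-1,-1,-1):
		if i==0:
			# The first row has no previous row to compare with
			taken = c[i,capacity]>0
		else:
			# Item i was taken if it changed the optimum at this capacity
			taken = c[i,capacity] != c[i-1,capacity]
		if taken:
			included[i] = 1
			capacity = capacity-w[i]
	return included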
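As a cross-check, here is a self-contained sketch of a common variant of the same recurrence. It uses plain lists and pads the table with an extra row 0 meaning "no items considered yet", so the first item needs no special treatment. The names `knapsack_padded` and `chosen_items` and the example values are mine, chosen just for illustration.

```python
def knapsack_padded(v, w, W):
    """Cost table where row i covers the first i items and row 0 is all zeros."""
    n = len(v)
    c = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(W + 1):
            c[i][j] = c[i - 1][j]  # option 1: leave item i-1 out
            if w[i - 1] <= j:      # option 2: put item i-1 in, if it fits
                c[i][j] = max(c[i][j], v[i - 1] + c[i - 1][j - w[i - 1]])
    return c

def chosen_items(w, c):
    """Walk the table backwards and return a 0/1 flag per item."""
    n = len(c) - 1
    capacity = len(c[0]) - 1
    flags = [0] * n
    for i in range(n, 0, -1):
        # The optimum changed when item i-1 became available, so it was taken
        if c[i][capacity] != c[i - 1][capacity]:
            flags[i - 1] = 1
            capacity -= w[i - 1]
    return flags

values = [60, 100, 120]
weights = [1, 2, 3]
table = knapsack_padded(values, weights, 5)
print(table[-1][-1], chosen_items(weights, table))  # prints 220 [0, 1, 1]
```

The padded row costs one extra line of memory but removes every `i == 0` special case from both functions.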