# Monthly Archives: July 2014

## Benjamini–Hochberg procedure

The Benjamini–Hochberg procedure is a method for adjusting the significance threshold when performing multiple hypothesis tests. The rationale is that if you run many hypothesis tests on a single dataset, you're bound to find something: if the type I error rate of a single test is $$\alpha$$, then across $$k$$ independent tests the probability of rejecting at least one null hypothesis when it's true grows to $$1-(1-\alpha)^k$$.
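To see how quickly this inflates, here is a quick computation with $$\alpha=0.05$$:

```python
alpha = 0.05  # per-test type I error rate

# Probability of at least one false rejection among k independent tests
for k in (1, 5, 20, 100):
    p_any = 1 - (1 - alpha) ** k
    print(f"k={k:3d}  P(at least one false rejection) = {p_any:.3f}")
```

Already at 20 tests there is roughly a 64% chance of at least one spurious rejection, which is what motivates procedures like Benjamini–Hochberg.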

Implementing this procedure in R is quite straightforward. Suppose you have null hypotheses $$H_1,\dots,H_m$$ with p-values $$p_1,\dots,p_m$$.

## Benjamini–Hochberg procedure (BH step-up procedure)

```r
BHproc <- function(pvalues, alpha) {
  sorted <- sort(pvalues)
  m <- length(pvalues)

  # Step-up: find the largest k such that p_(k) <= (k/m) * alpha
  below <- which(sorted <= (1:m) / m * alpha)
  k <- if (length(below) > 0) max(below) else 0

  cat("Declared significant:", k, "first tests. Note p-values were ordered.\n")
  invisible(k)
}
```


## Random Acts of Pizza

So I stumbled upon this Kaggle competition and decided to give it a try. The original data is in JSON format and can be found on the competition website. It offers a vast number of variables, so it is really difficult to select just a few of them. My approach was to perform sentiment analysis (as we used to do with Twitter data) and include the resulting scores alongside the rest of the variables.

The following Python code extracts the information and stores it in a CSV file.


```python
import json
import os

import numpy as num

# Load the dataset
os.chdir("D:/datasets")
with open("pizza_request_dataset.json") as f:
    t = json.load(f)

# Load the AFINN-111 sentiment lexicon (one "word\tscore" pair per line)
scores = {}
with open("AFINN-111.txt") as afinnfile:
    for line in afinnfile:
        term, score = line.split("\t")
        scores[term] = int(score)

# Hand-picked keyword lists per topic
desire = ["friend", "party", "birthday", "boyfriend", "girlfriend", "date",
          "drinks", "drunk", "wasted", "invite", "invited", "celebrate",
          "celebrating", "game", "games", "movie", "beer", "crave", "craving"]
job = ["job", "unemployment", "employment", "hire", "hired", "fired",
       "interview", "work", "paycheck"]
time_words = ["tonight", "today", "next", "night", "when", "tomorrow", "first",
              "after", "while", "before", "long", "hour", "Friday", "ago",
              "still", "due", "past", "soon", "current", "years", "never",
              "till", "yesterday", "morning", "evening"]  # list name lost in the original; unused below
student = ["college", "student", "university", "finals", "study", "studying",
           "class", "semester", "school", "roommate", "project", "tuition",
           "dorm"]
# The "family" and "money" lists were lost from the original post;
# these are placeholder examples in the same spirit.
family = ["family", "wife", "husband", "kids", "son", "daughter", "mother",
          "father", "parents", "baby"]
money = ["money", "rent", "bills", "broke", "cash", "paid", "pay", "budget",
         "bank", "loan"]

# Create arrays for convenient storage
n = len(t)
sentiment = num.zeros(n)
desire_vec = num.zeros(n)
family_vec = num.zeros(n)
job_vec = num.zeros(n)
money_vec = num.zeros(n)
student_vec = num.zeros(n)
accountagerequest_vec = num.zeros(n)
accountageretrieval_vec = num.zeros(n)
flair_vec = num.zeros(n)
pizza_vec = num.zeros(n)

# Extract the variables of interest from each record
for i in range(n):
    text = t[i]["request_text"]
    accountagerequest = t[i]["requester_account_age_in_days_at_request"]
    accountageretrieval = t[i]["requester_account_age_in_days_at_retrieval"]

    user_flair = t[i]["requester_user_flair"]
    if user_flair == "shroom":
        flair = 1
    elif user_flair == "PIF":
        flair = 2
    else:
        flair = 0
    # A set flair is taken to mean the requester got pizza
    pizza = 1 if flair != 0 else 0

    # Compute sentiment and keyword counts for the request text
    sen = des = fam = jo = mon = stu = 0
    for word in text.split():
        sen += scores.get(word, 0)
        if word in desire:
            des += 1
        if word in family:
            fam += 1
        if word in job:
            jo += 1
        if word in money:
            mon += 1
        if word in student:
            stu += 1

    sentiment[i] = sen
    desire_vec[i] = des
    family_vec[i] = fam
    job_vec[i] = jo
    money_vec[i] = mon
    student_vec[i] = stu
    accountagerequest_vec[i] = accountagerequest
    accountageretrieval_vec[i] = accountageretrieval
    flair_vec[i] = flair
    pizza_vec[i] = pizza

num.savetxt('test.txt',
            num.c_[sentiment, desire_vec, family_vec, job_vec, money_vec,
                   student_vec, accountagerequest_vec, accountageretrieval_vec,
                   flair_vec, pizza_vec],
            delimiter=',')
```
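In miniature, the scoring logic works like this (with a tiny inline stand-in for the AFINN-111 lexicon, which really has about 2,500 entries):

```python
# Tiny stand-in for the AFINN-111 lexicon
scores = {"love": 3, "hate": -3, "broke": -2}
desire = ["birthday", "party"]

text = "love pizza on my birthday but hate being broke"
words = text.split()

sentiment = sum(scores.get(w, 0) for w in words)   # 3 - 3 - 2 = -2
desire_hits = sum(w in desire for w in words)      # "birthday" matches

print(sentiment, desire_hits)  # -2 1
```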


However, the question of which model best predicts the outcome remains open. LDA and logistic regression perform poorly, so I might give SVMs a try.
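A minimal scikit-learn sketch of that next step, shown on synthetic data in place of the extracted feature matrix so it runs standalone (the feature shapes and the signal here are made up):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 10-column feature matrix extracted above
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Feature scaling matters for RBF SVMs, hence the pipeline
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
cv_scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
```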

## Messing with the IGN ratings dataset

I saw this Reddit link via @TextMining_r and couldn't resist doing some basic experimentation related to console/platform wars. Which platform was the best of its generation? Most argue it is not about the system itself but the games, so here is a magnificent ggplot2 graph showing the mean game score for every platform IGN has ever reviewed.

Of course, this analysis is inherently biased, since not every platform has had the same number of games released, but it is interesting anyway.
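The aggregation behind that kind of plot is a one-liner; here is a Python/pandas sketch on made-up rows (the original plot was done with ggplot2 in R, and the column names here are assumptions about the IGN dataset):

```python
import pandas as pd

# Made-up miniature of the IGN reviews dataset
df = pd.DataFrame({
    "platform": ["PlayStation 2", "PlayStation 2", "Xbox", "Xbox", "GameCube"],
    "score": [9.0, 7.5, 8.0, 6.5, 8.5],
})

# Mean review score per platform, best first
means = df.groupby("platform")["score"].mean().sort_values(ascending=False)
print(means)
```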

## Knapsack 0/1 problem in Python

It’s been a while since I last posted something, but today I decided to start writing on the blog periodically again. Today, I bring a very simple implementation of the 0/1 knapsack problem in Python, using a dynamic programming approach. The first function computes the cost matrix. The second, given the cost matrix, recovers which items end up in the knapsack.

```python
import numpy as num

def knapsack(v, w, W):
    """Fill the DP cost matrix for values v, weights w, capacity W."""
    n = len(v)
    c = num.zeros((n, W + 1))

    for i in range(n):
        for j in range(W + 1):
            if w[i] > j:
                # Item i does not fit: inherit the best value without it
                # (for i == 0, row -1 wraps to the last row, still all zeros)
                c[i, j] = c[i - 1, j]
            else:
                # Take the better of skipping or including item i
                c[i, j] = max(c[i - 1, j], v[i] + c[i - 1, j - w[i]])
    return c

def items(w, c):
    """Walk the cost matrix backwards to recover the chosen items."""
    i = len(c) - 1
    j = len(c[0, :]) - 1
    chosen = [0] * (i + 1)

    while i >= 0 and j >= 0:
        # Item i was taken if its row improved on the row above
        if (i == 0 and c[i, j] > 0) or c[i, j] != c[i - 1, j]:
            chosen[i] = 1
            j -= w[i]
        i -= 1
    return chosen
```
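As a quick sanity check, the classic three-item instance (values 60/100/120, weights 10/20/30, capacity 50) can be solved by brute-force enumeration; the DP above should agree, picking items 2 and 3 for a total value of 220:

```python
from itertools import combinations

v = [60, 100, 120]
w = [10, 20, 30]
W = 50

# Try every subset of items and keep the best feasible one
best_value, best_items = 0, ()
for r in range(len(v) + 1):
    for combo in combinations(range(len(v)), r):
        if sum(w[i] for i in combo) <= W:
            value = sum(v[i] for i in combo)
            if value > best_value:
                best_value, best_items = value, combo

print(best_value, best_items)  # 220 (1, 2)
```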