
THE APPLY FAMILY:

SAPPLY:

sapply(dataframe$col, FUN = sum/mean/etc)


sapply: Applies a function to each element of the input (each column, if given a data frame) and simplifies the results into a vector.

LAPPLY:

lapply(dataframe$col, FUN = sum/mean/etc)


lapply: Applies a function to each element of the input and returns the results as a list.
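
For a quick illustration of the difference, here is a sketch using the built-in mtcars data (chosen purely as an example):

sapply(mtcars[, c("mpg", "wt")], mean) # named vector of column means
lapply(mtcars[, c("mpg", "wt")], mean) # the same means, returned as a list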

TAPPLY:

tapply(dataframe$col1, dataframe$col2, FUN = sum/mean/etc)


tapply: Applies the function to the data in the first column within each group defined by the factor in the second column.
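
For example, with mtcars (illustrative only), the mean mpg for each cylinder count:

tapply(mtcars$mpg, mtcars$cyl, FUN = mean) # mean mpg for each value of cyl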

AGGREGATE

aggregate(cbind(col1, col2) ~ col3 + 
     col4, data = dataframe, FUN = sum/mean/etc)


aggregate: Pivots the quantitative data in col1 and col2 by the grouping variables col3 and col4, applying the function to each group. With the formula interface, use bare column names and pass the data frame through the data argument.
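
For example, with mtcars (illustrative only):

aggregate(cbind(mpg, hp) ~ cyl + gear, data = mtcars, FUN = mean)
# mean mpg and hp for every combination of cyl and gear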

CUT:

cut2(dataframe$column, g=numberofgroups)


Cuts a continuous variable into a factor with g groups.
*Requires the Hmisc package for cut2.
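
A quick sketch with mtcars (illustrative only):

library(Hmisc) # cut2 comes from the Hmisc package
groups <- cut2(mtcars$mpg, g = 4) # factor splitting mpg into 4 roughly equal groups
table(groups) # counts per group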

SAMPLE:

sample(1:rows, size=number, replace=T/F)


sample: draws size row numbers from 1:rows, with or without replacement. Useful for generating random smaller subsets.
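
For example, drawing a random 10-row subset of mtcars (illustrative only):

i <- sample(1:nrow(mtcars), size = 10, replace = FALSE) # 10 random row numbers
smaller <- mtcars[i, ] # subset those rows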

RANDOMLY SUBSETTING TRAIN AND TEST DATA:

set.seed(numeric) #set a random seed
i <- rbinom(rownum, size=1, prob=.5) #flip coins to assign rows
train <- dataframe[i==1,] #subset for train
test <- dataframe[i==0,] #subset for test

PLOT:

plot(dataframe$col1, dataframe$col2, pch=bullettype, 
    col=color or a descriptive variable, cex = size) #single plot
plot(dataframe[,1:4]) #plot first 4 columns against each other


col: pass a single color or a factor to plot different groups in different colors.
pch: integer values select different bullet (point) styles.
cex: sets point size; pass a variable to scale dot sizes by its value.
Plotting multiple columns at once creates a matrix of plots.
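
For example, coloring by a factor and sizing by a variable with mtcars (illustrative only):

plot(mtcars$wt, mtcars$mpg, pch = 19, 
     col = as.factor(mtcars$cyl), cex = mtcars$hp / 100)
# one color per cylinder count, dot size scaled by horsepower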

OTHER SCATTER PLOTS:

smoothScatter(x, y)
plot(hexbin(x, y)) # requires the hexbin package
qqplot(x, y)


smoothScatter shades point density with a color gradient, and hexbin bins points into hexagons with a legend mapping color to frequency. qqplot plots the quantiles of x against the quantiles of y (samples from the same distribution fall along a 45 degree line).

REGRESSION ON FACTORS:

lm(dataframe$quantitative ~ as.factor(dataframe$factor))
lm(dataframe$quantitative ~ relevel(dataframe$factor, 
     ref ="reference variable")) # sets a reference variable for the lm


Creates a linear regression of a quantitative variable on factor variables. The first factor level is the reference level; use the second example to set a different reference level.
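
For example, regressing mpg on the cylinder factor in mtcars (illustrative only):

fit <- lm(mpg ~ as.factor(cyl), data = mtcars) # level "4" is the reference
fit2 <- lm(mpg ~ relevel(as.factor(cyl), ref = "8"), data = mtcars) # "8" as reference
summary(fit2)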

I am quick to Google and find answers when it comes to doing tasks. I've read studies mentioning that the generation that has grown up with the internet and all this knowledge at our fingertips has become worse at memorizing instructions outright but significantly better at efficiently finding instructions and remembering where to look. As I often find myself knowing where to look for an answer but unable to recall it off the top of my head, I figured this would be a great place to start writing out my personal comprehensive reference guide to data munging in R!

R has a good help function, but I feel this is a good exercise to remember all these commands as well as to build a broad reference dictionary.

SETTING UP WD:

getwd()
setwd("directory/directory")

Shows current working directory and sets a working directory

NAMESPACES:

attach(dataframe)
detach(dataframe)

Attaches the current dataframe's objects to the main namespace, or removes them from it.
*Not sure how useful this is; I think I'll use it sparingly to avoid namespace conflicts.

READING CSV OR TABLE:

read.csv(file = "filename", header=T/F)
read.table(file="filename", sep="separator", header=T/F, 
     stringsAsFactors=T/F, col.names=vectorofnames, 
     row.names=namesorcolumnnumber, strip.white=T/F)

Reads in a CSV or table with different settings.

SUMMARY OPTIONS:

head(dataframe, numofrows)
dim(dataframe)
names(dataframe)
summary(dataframe)
quantile(dataframe$column)
class(dataframe)
sapply(dataframe[1,],class)
unique(dataframe$column)
length(dataframe$column)
table(dataframe$column, useNA="ifany")

head: see first numofrows of dataframe
dim: see dimensions of dataframe
names: see col names
summary: get counts for qualitative variables or numerical summary of a quantitative variable
quantile: get quantiles of a quantitative variable
class: get the class of the dataframe or of a column
sapply: apply class function to first row of dataframe to get classes of all variables
unique: get unique values of a column
length: get length of column
table: create a table of unique values and counts, mainly for qualitative variables; useNA="ifany" shows NA value counts
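
For example, on mtcars (illustrative only):

sapply(mtcars[1, ], class) # class of every column
unique(mtcars$cyl) # 6 4 8
table(mtcars$cyl, useNA = "ifany") # counts of each cylinder value, including NAs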

TESTING DATA:

any(dataframe$column [condition])
all(dataframe$column [condition])

example:
any(data$column > 40) # true/false
all(data$column > 0 & data$column < 40)

any: tests condition for any matches
all: tests condition for all values

SUBSETTING DATA:

subset(dataframe, conditions, select = c(column, column))

subset: subset dataframe by certain conditions and only return selected columns
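
For example, with mtcars (illustrative only):

subset(mtcars, mpg > 20 & cyl == 4, select = c(mpg, wt))
# only 4-cylinder cars above 20 mpg, keeping just the mpg and wt columns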

MERGE:

merge(dataframe1, dataframe2, by.x = "column1", 
      by.y = "column2", all = T/F)

by.x and by.y: specify a merge on columns that do not share the same name
all: specifies an outer join versus an inner join: T includes all records and inserts NA for all missing information
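
A small example (the data frames here are invented for illustration):

df1 <- data.frame(id = c(1, 2, 3), score = c(90, 85, 88))
df2 <- data.frame(key = c(2, 3, 4), grade = c("B", "B+", "A"))
merge(df1, df2, by.x = "id", by.y = "key", all = TRUE)
# outer join on id/key; unmatched rows get NA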

ORDER:

order(dataframe1$column)
dataframe1$column[order(dataframe1$column)]

order returns a vector of row numbers that would sort the column; indexing the column by that result returns the values themselves in sorted order.
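
A tiny example showing the difference:

x <- c(3, 1, 2)
order(x)    # 2 3 1: the row numbers that would sort x
x[order(x)] # 1 2 3: the values themselves in sorted order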

SORTING USING ORDER:

sorted <- dataframe1[order(dataframe1$column, dataframe1$column2),]

Stores into "sorted" the rows of dataframe1 ordered by column, then by column2. Additional sort keys can be added to the order function.
*Be careful to add the "," after the order function or you will get an error.

MELT:

melt(dataframe, id.vars="idcolumn", variable.name="varnames",value.name="values")

Example output:

Input Matrix:
Name    TreatmentA    TreatmentB
John    4             1
Jane    5             2 

Result:
Name    Treatment   Value
John    TreatmentA  4
John    TreatmentB  1
Jane    TreatmentA  5
Jane    TreatmentB  2

This takes a matrix style table and reshapes it to have one observation per row.
*Requires install.packages("reshape2"); the variable.name and value.name arguments shown come from the reshape2 package.
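
A runnable version of the example above, assuming reshape2 is installed:

library(reshape2)
wide <- data.frame(Name = c("John", "Jane"), 
                   TreatmentA = c(4, 5), TreatmentB = c(1, 2))
melt(wide, id.vars = "Name", variable.name = "Treatment", value.name = "Value")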

The River Problem:

You are a person trying to cross a river and reach the land on the other side, however you cannot swim.

There are rocks in the water which you can step on to allow you to cross the river, however you start at speed 0 and can only increase your speed by 1, decrease it by 1, or maintain the same speed for each step that you take. You cannot have a negative speed (go backwards).

Given a string representing rocks (“*”) and water (“~”) write a method to return True or False depending on whether any path exists across the river.

Examples

River: *****~~*~~~*~~~~*~*~~*** -> true
Steps: 011_2__3___4____5_____6 Success

River: ****~~~~~~~~~~~~** -> false
Steps: 01_2__3 Failure

~~~~~~~~~

This problem, courtesy of Sitong, was an interesting programming challenge we tried to tackle yesterday. The problem itself is a bit unique as it represents a sort of tree traversal problem where we have a series of possibilities (accelerate, decelerate, or maintain constant speed) at every step and search the tree for a valid solution. Given that it is a tree traversal problem, the most basic approach would be recursive.

I have always favored iteration over recursion when I program, just because of the way my mind thinks, but recursive solutions tend to be easy to read and relatively short. The catch with recursion that I am always wary of, however, is run-time, but we'll get to that problem later!

The solution that we both came up with (myself with a little prodding from Sitong) was one of tree traversal. My solution in ruby follows:

def can_cross_river?(river, loc, speed)

  # Base-case success: landing at or beyond the end of the string means we reached land
  return true if loc + speed + 1 >= river.length

  # If we landed on another rock, try 3 steps recursively

  # Accelerate first every time so we get the fastest path across
  if river[loc + speed + 1] == "*"
    return true if can_cross_river?(river, loc + speed + 1, speed + 1)
  end
  # If accelerating fails try maintaining speed for this step
  if speed > 0 && river[loc + speed] == "*"
    return true if can_cross_river?(river, loc + speed, speed)
  end
  # If maintaining speed fails try slowing down
  if speed > 1 && river[loc + speed - 1] == "*"
    return true if can_cross_river?(river, loc + speed - 1, speed - 1)
  end
  return false
end

The code checks every possible path, starting with the fastest, and looks for the first successful run. However, when calculating the run time of this approach we ran into some interesting caveats. The initial bound I put on the run time was O(3^n), because the recursion presents 3 possible choices at each step; for n steps this yields 3^n paths. However, that is merely an upper bound, as the tree depth shortens when we accelerate and reach the goal faster. Therefore we will have significantly fewer than 3^n possible paths. I don't think I can mathematically calculate this worst-case bound, but my intuition can be captured in the figure below:

The triangle represents a normal 3^n tree of possibilities but the grey shaded area represents the area of this problem as we accelerate when we take a right branch and reach a true or false conclusion faster. We cannot go slower than 1 speed so we don’t have any branches that go left from the origin.

There are other solutions for this problem and other optimizations, including memoization or caching, however those also have issues. This problem is unusual in that two sub-problems may face the same remainder of the river but approach it at different speeds, so a cache key must include both position and speed. Thus caching involves a huge problem space and questions of efficiency arise. I think this problem was great fodder for the mind and provided a lot of interesting food for thought. I also asked David the same problem and saw a completely different approach to solving it; however, we haven't figured out the details of how to efficiently tackle the problem from his angle yet. Perhaps if we do I will write another post comparing his solution and this one.

Even simple algorithms, such as the early problems of Project Euler, allow for some interesting approaches. The problem asks for the largest prime factor of a number N. My initial approach was to generate the prime factorization of N and take the largest value. To accomplish this, I thought that if you iterate upwards and divide factors out of N, N will slowly shrink until the iterator equals N, the largest remaining factor. The nice part of this approach is that the iteration range continuously shrinks as factors are pulled out, so it only iterates as many times as necessary to hit the largest prime.

The pseudo-code follows:


i = 2
n = N
while i <= n                  # n shrinks as factors are divided out
  while n % i == 0            # Take out the same factor repeatedly
    n = n / i                 # Reduce the iterating range
    largest_prime_factor = i  # The most recent factor is the largest so far
  i += 1
return largest_prime_factor

My code runs very well for numbers that have a large number of factors but gets worse as N has larger prime factors. The worst case for my algorithm is if N is prime.

Now the advantages of collaboration show up, as I shared my solution with Sitong Peng. His solution approached the problem from an entirely different angle. He searched for pairs of numbers where x * y = N and checked the larger factor to see if it was prime. Because he also iterated starting from 2, the first prime cofactor found would be the answer. Sitong's R code translated into pseudo-code, and a step through of the process, follow:


# This block is credit to Sitong Peng

def is_prime?(n)
  if n == 1
    return false
  if n < 4
    return true              # 2 and 3 are prime
  for i in range(2, sqrt(n) + 1)
    if n % i == 0            # divisible by i, so not prime
      return false
  return true

def largest_prime_factor(n)
  i = 2
  if is_prime?(n)
    return n
  while (i < n)
    if n % i == 0
      if is_prime?(n/i)      # check the larger cofactor for primality
        return n/i
    i += 1

# Step through of the process

Take for example 30:
The pairs are:
1 30 # check 30 if prime (no)
2 15 # check 15 if prime (no)
3 10 # check 10 if prime (no)
5 6 # check 6 if prime (no)
6 5 # check 5 if prime (yes) < Answer
10 3
15 2
30 1

An interesting property of these algorithms is that my worst-case scenario happens to be Sitong's best case, as the first number he checks is N itself, while mine would have to iterate from 2 all the way up to N. Sitong's worst case, on the other hand, is a number like 2^k, or any other number whose largest prime factor is small: his algorithm has to iterate until it reaches the large cofactor to find the first prime. Another step through follows:


# Worst case for Sitong's solution

Take for example 16:
1 16 # check 16 if prime (no)
2 8 # check 8 if prime (no)
4 4 # check 4 if prime (no)
8 2 # check 2 if prime (yes)
16 1

With two algorithms that are almost inverses in terms of best and worst case, it only made sense to combine the two into a super algorithm! With a simple 2-line change in the code, a new algorithm was born. Combining my idea of reducing the search space by removing factors with Sitong's prime checking for very large prime factors yielded an algorithm that deals nicely with both of our worst cases. This post looks much longer than our solution, but the process of problem solving that led to the combination was a lot of fun to share. The pseudo-code follows:


# Recycled is_prime function from Sitong

def is_prime?(n)
  if n == 1
    return false
  if n < 4
    return true              # 2 and 3 are prime
  for i in range(2, sqrt(n) + 1)
    if n % i == 0            # divisible by i, so not prime
      return false
  return true

# New solution

def largest_prime_factor(n)
  i = 2
  if is_prime?(n)
    return n
  while (i < n)
    while n % i == 0        # 2 lines of changes: a while loop here grabs all small factors
      n = n/i               # reducing n shrinks the iteration range
      if is_prime?(n)       # the reduced n is now the largest remaining candidate
        return n
    i += 1

To check out our epiphany from Sitong’s perspective check out his post!

Often people shouldn't reinvent the wheel; however, in my learning at Hacker School I feel that in certain areas it is important to try to think up a solution to a well-known (and most likely solved) problem and see what you come up with.

Today, I was thinking about a problem that appears in a possible project Sitong suggested. The project involves creating a type of "heat" map representing the minutes to a destination given walking and public transportation options. Mulling this over in my head surfaced one of the classic problems in algorithms... the shortest path problem. I have little experience with algorithms, so I thought this would be a great problem to come up with a personal solution for before checking the known optimal solution. Hopefully my solution will turn out to be optimal.

In this problem, you are given a graph of nodes and weighted connections; here the weights represent minutes of travel between nodes. The nodes represent travel options between points on a map, and the goal is to find the shortest path from a chosen starting node to a chosen end node.

In my haze of sleep I began to think of solutions for quickly finding the shortest path, and began to envision the graph network as a malleable net. If one picks up the starting node in one hand and the ending node in the other and stretches, the shortest path will appear, as it is the first path that can no longer stretch between the two hands (ignoring, of course, all physical tangling of the net). The problem I face is to quickly calculate the distance from every node on a map to another node, and if this act of "lifting" the nodes out of the net instantly reveals the shortest path, it may somehow be translatable into code.

My current train of thought brings me to trees. Given a graph network, it may be possible to translate that graph into a tree and stretch it out until you find the shortest path between the two nodes…

~ Hacking in Progress 😛 ~

After doing some reading and implementing Dijkstra's algorithm in Ruby, I see the parallels between my visual thoughts and the mathematical explanations. Growing an even frontier of sub-nodes and traveling down all parallel paths until you hit nodes is analogous to lifting a net and having all connections lift off a table until, node by node, the solution appears. I find this connection between my intuition and the derivation fascinating, and I really do wonder if there are other things out there I could find novel solutions to. Looking forward to more late-night epiphanies and other sparks of inspiration!

Getting started with this book isn't easy for someone who hasn't done much work setting up a machine for these databases. I'm going to log my steps so that in the future people (or I) can quickly reference them and see what worked for me!

Using Ubuntu 12.04

The first step is to install the PostgreSQL packages, which come in two pieces: the client package and the contrib package with the additional add-ons that the book requires:

sudo apt-get install postgresql-client
sudo apt-get install postgresql-contrib

PostgreSQL creates a postgres user which manages the server if you run it locally. Ideally you should create a user for yourself:

# Create a superuser for yourself (createuser is a shell command, run as postgres)
sudo -u postgres createuser --superuser [user]
# Open psql as the postgres superuser
sudo -u postgres psql
# Create a password for your user (a psql meta-command)
\password [user]

Exit the psql prompt and return to the terminal. Now I can create my database, log into it with the psql command, and install the extensions:

# Create DB
createdb book
# Open the newly created DB
psql book
# Install the extensions needed for the exercises in Seven Databases in Seven Weeks
create extension tablefunc;
create extension dict_xsyn;
create extension fuzzystrmatch;
create extension pg_trgm;

Another link which was quite helpful for getting psql set up and running:
https://help.ubuntu.com/community/PostgreSQL