APPLY:

SAPPLY:

sapply(dataframe$col, FUN = sum/mean/etc)


sapply: Applies a function over a column (or each column of a dataframe) and returns the results simplified into a vector.
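
For instance, a quick sketch on a made-up data frame (sapply is handiest over a whole dataframe or list):

df <- data.frame(a = 1:4, b = c(2, 4, 6, 8)) # toy data
sapply(df, mean) # named vector: a = 2.5, b = 5.0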

LAPPLY:

lapply(dataframe$col, FUN = sum/mean/etc)


lapply: Applies a function over a column (or each column of a dataframe) and returns the results as a list.
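
The same toy data frame as above; note that a list comes back instead:

df <- data.frame(a = 1:4, b = c(2, 4, 6, 8)) # toy data
lapply(df, mean) # list: $a = 2.5, $b = 5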

TAPPLY:

tapply(dataframe$col1, dataframe$col2, FUN = sum/mean/etc)


tapply: Takes data from the first column and applies the function while subsetting by the factor in column two.
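
A quick sketch using the built-in iris data, grouping a measurement by species:

tapply(iris$Sepal.Length, iris$Species, mean)
#     setosa versicolor  virginica
#      5.006      5.936      6.588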

AGGREGATE:

aggregate(cbind(col1, col2) ~ col3 + col4, 
     data = dataframe, FUN = sum/mean/etc)


aggregate: Pivots the quantitative data in col1 and col2 by the grouping variables col3 and col4, applying some function to each group.
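
A sketch on the built-in iris data (with data = iris supplied, the formula uses bare column names):

aggregate(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris, FUN = mean)
# one row per Species with the mean of each measurement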

CUT:

cut2(dataframe$column, g=numberofgroups)


cut2: Cuts a continuous variable into a factor with g roughly equal-sized groups.
*Requires the Hmisc package; base R's cut() is similar but splits by value range rather than group size.
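
A quick sketch, assuming Hmisc is installed:

library(Hmisc)
groups <- cut2(iris$Sepal.Length, g = 3) # factor with 3 roughly equal-sized groups
table(groups) # counts per group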

SAMPLE:

sample(1:rows, size=number, replace=T/F)


sample: returns a vector of size row numbers, drawn with or without replacement. Useful for generating smaller random subsets.
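
For example, pulling 10 random rows out of the built-in iris data:

idx <- sample(1:nrow(iris), size = 10, replace = FALSE) # 10 distinct row numbers
iris[idx, ] # the random subset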

RANDOMLY SUBSETTING TRAIN AND TEST DATA:

set.seed(numeric) #set a random seed
i <- rbinom(nrow(dataframe), size=1, prob=.5) #flip a coin for each row
train <- dataframe[i==1,] #subset for train
test <- dataframe[i==0,] #subset for test

PLOT:

plot(dataframe$col1, dataframe$col2, pch=bullettype, 
    col=color or a descriptive variable, cex = size) #single plot
plot(dataframe[,1:4]) #plot first 4 columns against each other


The col option can take a factor so that different categories are drawn in different colors, and cex can take a variable to draw different sized dots.
Plotting multiple columns creates a matrix of pairwise plots.
pch takes integer values that select different bullet types.
cex controls the size of the points.
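
A concrete sketch with the built-in iris data (pch = 19 is a solid dot):

plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19, col = iris$Species, cex = 0.8) # one color per species
plot(iris[, 1:4]) # matrix of pairwise scatterplots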

OTHER SCATTER PLOTS:

smoothScatter(x,y)
hexbin(x,y) # requires the hexbin package
qqplot(x,y)


smoothScatter shades points by density, while hexbin provides a legend of point colors for frequency. qqplot plots quantiles of x vs quantiles of y (samples from the same distribution lie on a 45 degree line).
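
A sketch on simulated data (the hexbin lines assume the hexbin package is installed):

x <- rnorm(10000)
y <- x + rnorm(10000)
smoothScatter(x, y) # density shading for overplotted points
library(hexbin)
plot(hexbin(x, y)) # hexagonal bins with a frequency legend
qqplot(x, y) # quantiles of x against quantiles of y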

REGRESSION ON FACTORS:

lm(dataframe$quantitative ~ as.factor(dataframe$factor))
lm(dataframe$quantitative ~ relevel(dataframe$factor, 
     ref = "reference level")) # sets the reference level for the lm


Fits a linear regression of a quantitative variable on a factor. The first factor level is the reference level by default; use the second example to set a different reference level.
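
A sketch with the built-in iris data, where Species is already a factor and setosa is the default reference level:

fit <- lm(Sepal.Length ~ Species, data = iris) # coefficients relative to setosa
fit2 <- lm(Sepal.Length ~ relevel(Species, ref = "virginica"), data = iris) # virginica as reference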

I am quick to Google and find answers when it comes to doing tasks. I’ve read studies mentioning that the generation that has grown up with the internet and all this knowledge at our fingertips has become worse at memorizing instructions outright but significantly better at efficiently finding instructions and remembering reference areas. As I often find myself knowing where to look for an answer but unable to recall it off the top of my head, I figured this would be a great place to start writing out my personal comprehensive reference guide to data munging in R!

R has a good help function, but I feel this is a good exercise to remember all these commands as well as to build a broad reference dictionary.

SETTING UP WD:

getwd()
setwd("directory/directory")

getwd shows the current working directory; setwd sets it.

NAMESPACES:

attach(dataframe)
detach(dataframe)

attach adds the dataframe’s columns to the search path so they can be referenced by name; detach removes them.
*Not sure how useful this is; I think I’ll use it sparingly to avoid namespace conflicts.

READING CSV OR TABLE:

read.csv(file = "filename", header=T/F)
read.table(file="filename", sep="separator", header=T/F, 
     stringsAsFactors=T/F, col.names=columnnames, row.names=rowcolumn, 
     strip.white = T/F)

Reads in a CSV or table with different settings.

SUMMARY OPTIONS:

head(dataframe, numofrows)
dim(dataframe)
names(dataframe)
summary(dataframe)
quantile(dataframe$column)
class(dataframe)
sapply(dataframe[1,],class)
unique(dataframe$column)
length(dataframe$column)
table(dataframe$column, useNA="ifany")

head: see first numofrows of dataframe
dim: see dimensions of dataframe
names: see col names
summary: get counts for qualitative variables or numerical summary of a quantitative variable
quantile: get quantiles of a quantitative variable
class: get class of a dataframe or column
sapply: apply class function to first row of dataframe to get classes of all variables
unique: get unique values of a column
length: get length of column
table: create a table of unique values and counts, mainly for qualitative variables; useNA="ifany" shows NA value counts
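
A few of these run on the built-in iris data for flavor:

head(iris, 3) # first 3 rows
dim(iris) # 150 5
sapply(iris[1, ], class) # class of every column
table(iris$Species, useNA = "ifany") # 50 of each species, no NAs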

TESTING DATA:

any(dataframe$column [condition])
all(dataframe$column [condition])

example:
any(data$column > 40) # true/false
all(data$column > 0 & data$column < 40)

any: tests condition for any matches
all: tests condition for all values

SUBSETTING DATA:

subset(dataframe, conditions, select = c(column, column))

subset: subset dataframe by certain conditions and only return selected columns
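
For example, only the long-sepaled rows of iris, two columns wide:

subset(iris, Sepal.Length > 7, select = c(Species, Sepal.Length))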

MERGE:

merge(dataframe1, dataframe2, by.x = "column1", 
      by.y = "column2", all = T/F)

by.x and by.y: specify a merge on columns that do not share the same name
all: specifies an outer join versus an inner join: T includes all records and inserts NA for all missing information
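
A toy sketch of an outer join on differently named key columns (both frames made up):

a <- data.frame(id = 1:3, x = c("a", "b", "c"))
b <- data.frame(key = 2:4, y = c("d", "e", "f"))
merge(a, b, by.x = "id", by.y = "key", all = TRUE) # unmatched rows filled with NA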

ORDER:

order(dataframe1$column)


Returns a vector of row numbers that puts the specified column in sorted order. Indexing with it, e.g. dataframe1$column[order(dataframe1$column)], returns the sorted values.

SORTING USING ORDER:

sorted <- dataframe1[order(dataframe1$column, dataframe1$column2),]

Stores into “sorted” a copy of dataframe1 with rows ordered by column, then by column2. Additional sort keys can be added in the order function.
*Be careful to add the “,” after the order function or you will get an error.
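
For example, iris by species and then by descending sepal length (note the trailing comma):

sorted <- iris[order(iris$Species, -iris$Sepal.Length), ]
head(sorted)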

MELT:

melt(dataframe, id.vars="idcolumn", variable.name="varnames",value.name="values")

Example output:

Input Matrix:
Name    TreatmentA    TreatmentB
John    4             1
Jane    5             2 

Result:
Name    Treatment   Value
John    TreatmentA  4
John    TreatmentB  1
Jane    TreatmentA  5
Jane    TreatmentB  2

This takes a matrix style table and reshapes it to have one observation per row.
*Requires the reshape2 package: install.packages("reshape2")
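
A sketch reproducing the example above, assuming reshape2 is installed:

library(reshape2)
df <- data.frame(Name = c("John", "Jane"), TreatmentA = c(4, 5), TreatmentB = c(1, 2))
melt(df, id.vars = "Name", variable.name = "Treatment", value.name = "Value") # one observation per row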

I decided yesterday to put in some practice by tackling the Tower of Hanoi problem. The problem is a very common one that I’ve seen all over the web but until now I hadn’t taken the time to work it out.

The problem begins as three pegs on which discs can be placed. The discs are placed in decreasing order of size to create a pyramid of n height on peg 1 and the task is to move the discs from peg 1 to peg 3. The constraint of the problem is that a larger sized disc cannot be placed on top of a smaller disc.

Creating an array for each of the three pegs and notating each disc as an integer from 1 – n depending on size gave me a representation of the problem. The beginning of the problem looked like so:

tower1 = [n,n-1,…,4,3,2,1]
tower2 = []
tower3 = []

I then began to map out the base cases for the problem:

N = 1, move a 1-high tower from tower 1 to tower 3:
move 1 disc from tower 1 to tower 3

N = 2, move a 2-high tower from tower 1 to tower 3:
move a 1-high tower from tower 1 to tower 2
move 1 disc from tower 1 to tower 3
move a 1-high tower from tower 2 to tower 3

N = 3, move a 3-high tower from tower 1 to tower 3:
move a 2-high tower from tower 1 to tower 2
move 1 disc from tower 1 to tower 3
move a 2-high tower from tower 2 to tower 3

If we notate the towers as origin, destination, and buffer, we begin to see the pattern:

origin to buffer
origin to destination
buffer to destination

We also see that the problem lends itself easily to recursion: a 3-high tower contains two 2-high tower sub-problems, a 4-high tower contains two 3-high sub-problems, and so on.

Given this fact, we can put together a recursive solution using plain arrays:

# Define a move
def move(origin, dest)
  if dest.empty? || origin[-1] < dest[-1]
    dest.push(origin.pop)
  end
end

# The recursive solution
def move_tower(n, origin, dest, buffer)
  if n == 1
    move(origin, dest)
  else
    move_tower(n-1, origin, buffer, dest)
    move_tower(1, origin, dest, buffer)
    move_tower(n-1, buffer, dest, origin)
  end
end

tower1 = [6,5,4,3,2,1]
tower2 = []
tower3 = []

move_tower(6, tower1, tower3, tower2)
# Result is tower3 = [6,5,4,3,2,1]

The move method guards against illegal moves by checking that the target stack is either empty or topped by a larger disc than the one being moved. move_tower takes the height and the three towers we created and solves the problem recursively.

The solution was quite fun to puzzle through by finding the pattern in the cases and modeling a solution to reflect that pattern. There is also an iterative solution that takes into account the fact that odd and even height towers have set repeating move sequences. I will try to tackle that one in the future!

The River Problem:

You are a person trying to cross a river and reach the land on the other side, however you cannot swim.

There are rocks in the water which you can step on to cross the river. You start at speed 0, and with each step you take you can increase your speed by 1, decrease it by 1, or keep the same speed. You cannot have a negative speed (go backwards).

Given a string representing rocks ("*") and water ("~"), write a method that returns true or false depending on whether any path exists across the river.

Examples

River: *****~~*~~~*~~~~*~*~~*** -> true
Steps: 011_2__3___4____5_____6 Success

River: ****~~~~~~~~~~~~** -> false
Steps: 01_2__3 Failure

~~~~~~~~~

This problem, courtesy of Sitong, was an interesting programming challenge we tried to tackle yesterday. The problem itself is a bit unique as it represents a sort of tree traversal problem where we have a series of possibilities (accelerate, decelerate, or maintain constant speed) at every step and search the tree for a valid solution. Given that it is a tree traversal problem, the most basic approach would be recursive.

I have always favored iteration over recursion when I program, just because of the way my mind thinks, but recursive solutions tend to be easy to read and relatively short. The catch with recursion that I am always wary of is run-time, but we’ll get to that problem later!

The solution that we both came up with (myself with a little prodding from Sitong) was one of tree traversal. My solution in ruby follows:

def can_cross_river?(river, loc, speed)

  # Base-case success: accelerating carries us onto land past the last rock
  return true if loc + speed + 1 >= river.length

  # If we landed on another rock, try 3 steps recursively

  # Accelerate first every time so we find the fastest path across
  if river[loc + speed + 1] == "*"
    return true if can_cross_river?(river, loc + speed + 1, speed + 1)
  end
  # If accelerating fails, try maintaining speed for this step
  if speed > 0 && river[loc + speed] == "*"
    return true if can_cross_river?(river, loc + speed, speed)
  end
  # If maintaining speed fails, try slowing down
  if speed > 1 && river[loc + speed - 1] == "*"
    return true if can_cross_river?(river, loc + speed - 1, speed - 1)
  end
  false
end

can_cross_river?("*****~~*~~~*~~~~*~*~~***", 0, 0) # => true

The code checks every possible path, starting with the fastest, and looks for the first successful run. However, when calculating the run time of this solution we ran into some interesting caveats. The initial bound when I first looked at the problem was O(3^n), because the recursion offers 3 possible choices at each step, giving 3^n paths for n steps. However, that is merely an upper bound: the tree depth shortens as we accelerate and reach the goal faster, so there are significantly fewer than 3^n possible paths. I don’t think I can calculate this worst-case bound mathematically, but my intuition is captured in the figure below:

The triangle represents a normal 3^n tree of possibilities but the grey shaded area represents the area of this problem as we accelerate when we take a right branch and reach a true or false conclusion faster. We cannot go slower than 1 speed so we don’t have any branches that go left from the origin.

There are other solutions to this problem and other optimizations, including memoization or caching, however those also have issues. This problem is unique in that two sub-problems may face the same remainder of the river but approach it at different speeds, so a cache key must include both position and speed; the problem space balloons and questions of efficiency arise. I think this problem was great fodder for the mind and provided a lot of interesting food for thought. I also asked David the same problem and saw a completely different approach to solving it, however we haven’t figured out the details of how to efficiently tackle the problem from his angle yet. Perhaps if we do I will write another post comparing his solution and this one.

After a few interviews, I’ve decided to begin writing for myself an interview memorandum to reflect upon improvements and strategies for the next ones. In this one I feel like laying out a framework for approaching case interviews.

1. Start off by clarifying goals and results. The most important part for a case is understanding what the desired result is. Whether I’m designing a data system, formulating a solution, or estimating effects, I should understand what the overarching goal is before I even start to walk through my thinking. Who is my interviewer supposed to be? What information can they provide?

2. I tend to work in an iterative fashion as I walk through cases, which lends itself to train-of-thought syndrome. Framework. I should lay out and notate a framework. Start from the beginning. List assumptions and available resources. List the goal and the sub-goals that can be derived from it. Do not begin to talk about a solution before laying down this framework. My initial solution tends to change as I work through a problem, and to mitigate that I can work through the problem on paper so the first solution I explain is at least a fourth or fifth draft in my mind.

3. Immediately consider these questions as I’m working:

1. What are the advantages to this system? How can it be extended? How generalized is it?
2. What are the disadvantages to this system? What can it not measure?
3. How can this system work with extensions? Other systems?
4. If data-based, what do I track and measure and what do I not know? Can I get the information I don't know? Does it fit into the current scheme?
5. Think about the goal. Does this solve that problem? Does it scale to this problem? Are there any tangential goals that might be in question?
6. What company am I interviewing for and what does this question have to do with the company? What are the parallels and what should I focus on as I explain my solution?

4. Breathe. Take time and slow down. I’ve always had a habit on essay tests to just start writing. In class I never take notes and I work from memory. While this approach serves me well internally, I need to highlight my strengths in problem solving in a way that can be followed by the interviewer. Go back to the outline. Go back to the notes. Go back to the goal and review it again. I need to slow down and keep reviewing, move at a deliberate but more coherent and complete pace.

5. When working with data, lay out the specifics. Write down the tables, the relationships, and the storage. This might not need to be conveyed to the interviewer depending on the position, but having it laid out provides the organization for explaining what data can be derived from the system.

6. Define the results and capabilities of the system you’ve designed. Understand its flaws and think through its implementation. Important aspects: scaling, use case, expandability, applications, edge cases, any missing information? Always think about edge cases and the world view of the problem; don’t fall into the trap of assuming the topic, whether a store, business, or other, reflects the first example that comes to mind. Think of other environmental or confounding factors that may change the optimal design.

I may come back and add to and edit this as I learn more. It’s important in all circumstances to always keep learning from any situation. Until the next one!

Even simple algorithms such as the early problems of Project Euler allow for some interesting approaches. The problem asks for the largest prime factor of a number N. My initial approach was to generate the prime factorization of N and take the largest value. To accomplish this, I reasoned that if you iterate upwards and divide factors out of N, N shrinks until the iterator equals N, which is then the largest prime factor. The nice part of this approach is that the iteration range continuously shrinks as factors are pulled out, so it only iterates as many times as necessary to hit the largest prime.

The pseudo-code follows:


i = 2
n = N                         # Start with the full number
while i <= n                  # n shrinks as factors are removed
  while n % i == 0            # Divide out each factor completely
    n = n / i                 # Reduce the iterating range
    largest_prime_factor = i  # Last factor divided out is the largest so far
  i += 1
return largest_prime_factor

My code runs very well for numbers that have a large number of factors but gets worse as N has larger prime factors. The worst case for my algorithm is if N is prime.

Now the advantages of collaboration showed up as I shared my solution with Sitong Peng. His solution approached the problem from an entirely different angle: he searched for pairs of numbers where x * y = N and checked the larger factor of each pair to see if it was prime. Because he also iterated starting from 2, the first prime found would be the answer. Sitong’s R code, translated into pseudo-code, and a step-through of the process follow:


# This block is credit to Sitong Peng

def is_prime?(n)
  if n == 1
    return false
  if n < 4
    return true               # 2 and 3 are prime
  for i in range(2, sqrt(n) + 1)
    if n % i == 0
      return false
  return true

def largest_prime_factor(n)
  if is_prime?(n)
    return n
  i = 2
  while i < n
    if n % i == 0
      if is_prime?(n / i)
        return n / i
    i += 1

# Step through of the process

Take for example 30:
The pairs are:
1 30 # check 30 if prime (no)
2 15 # check 15 if prime (no)
3 10 # check 10 if prime (no)
5 6 # check 6 if prime (no)
6 5 # check 5 if prime (yes) < Answer
10 3
15 2
30 1

An interesting property of this algorithm is that my worst-case scenario happens to be Sitong’s best case, as the first number he checks is N itself while mine would have to iterate all the way from i up to N. Sitong’s worst-case scenario, on the other hand, is a number like 2^n or any number whose largest prime factor is small: his algorithm has to iterate up through the larger complementary factors before finding the first prime. Another step-through follows:


# Worst case for Sitong's solution

Take for example 16:
1 16 # check 16 if prime (no)
2 8 # check 8 if prime (no)
4 4 # check 4 if prime (no)
8 2 # check 2 if prime (yes)
16 1

With two algorithms that are almost inverses in terms of best and worst case, it only makes sense to combine the two into a super algorithm! With a simple 2-line change in the code, a new algorithm was born. Combining my idea of reducing the search space by removing factors with Sitong’s prime checking to look for very large prime factors yielded an algorithm that deals nicely with both of our worst cases. This post looks much longer than our solution, but the process of problem solving that led to the combination was a lot of fun to share. The pseudo-code follows:


# Recycled is_prime function from Sitong

def is_prime?(n)
  if n == 1
    return false
  if n < 4
    return true               # 2 and 3 are prime
  for i in range(2, sqrt(n) + 1)
    if n % i == 0
      return false
  return true

# New solution

def largest_prime_factor(n)
  if is_prime?(n)
    return n
  i = 2
  while i < n
    while n % i == 0          # The 2-line change: divide out each small factor
      n = n / i               # to shrink the iteration range
      if is_prime?(n)         # If what remains is prime, it is the largest factor
        return n
    i += 1

To see our epiphany from Sitong’s perspective, check out his post!

People often shouldn’t reinvent the wheel, however in my learning at Hacker School I feel that in certain areas it is worthwhile to try to think up a solution to a well-known (and most likely solved) problem and see what you come up with.

Today, I was thinking of a problem that would appear in a possible project that Sitong suggested. The project involved creating a type of “heat” map representing the minutes to a destination given walking and public transportation options. Mulling this over in my head, I realized it presents one of the classic problems in algorithms… the shortest path problem. I have little experience with algorithms, so I thought this would be a great problem to come up with a personal solution for before checking the optimal solution. Hopefully my solution will turn out to be optimal.

In this problem, you are given a graph of nodes and weighted connections; here the weights represent minutes of travel between nodes. The nodes represent travel options between points on a map, and the goal is to find the shortest path from an initial starting node to a chosen end node.

In my haze of sleep I began to think of solutions for quickly finding the shortest path, and began to envision the graph network as a malleable net. If one picks up the starting node and ending node, one in each hand, and stretches, the shortest path appears: it is the first path that can no longer stretch between the two hands (ignoring, of course, all physical tangling of the net). The problem I face is to quickly calculate the distance from every node on a map to another node, and if this act of “lifting” the node from the net instantly reveals the shortest path, it may somehow be translatable into code.

My current train of thought brings me to trees. Given a graph network, it may be possible to translate that graph into a tree and stretch it out until you find the shortest path between the two nodes…

~ Hacking in Progress 😛 ~

After doing some reading and implementing Dijkstra’s algorithm in Ruby, I see the parallels between my visual thoughts and the mathematical explanations. Creating an even set of sub-nodes and traveling down all parallel paths until you hit nodes is analogous to lifting a net and having all connections lift off a table until, node by node, the solution appears. I find this connection between my thoughts and the derivation fascinating, and I really do wonder if there are other things out there I could find novel solutions to. Looking forward to more late-night epiphanies and other sparks of inspiration!