`sapply(dataframe$col, FUN = sum/mean/etc)`


sapply: Applies a function to the column and returns a vector with the results.

`lapply(dataframe$col, FUN = sum/mean/etc)`


lapply: Applies a function to the column and returns a list with the results.

`tapply(dataframe$col1, dataframe$col2, FUN = sum/mean/etc)`


tapply: Applies the function to the data in the first column, grouped by the factor in the second column.

```
aggregate(cbind(col1, col2) ~ col3 + col4, data = dataframe,
          FUN = sum/mean/etc)
```


aggregate: Pivots the quantitative data in col1 and col2 by the grouping variables col3 and col4, applying the function to each group.

`cut2(dataframe$column, g=numberofgroups)`


Cut a continuous variable into a factor with g groups.

*Requires install.packages("Hmisc")

`sample(1:rows, size=number, replace=T/F)`


sample: draws `size` row numbers from 1:rows, with or without replacement. Useful for generating random smaller subsets.

```
set.seed(numeric) #set a random seed
i <- rbinom(rownum, size=1, prob=.5) #flip coins to assign rows
train <- dataframe[i==1,] #subset for train
test <- dataframe[i==0,] #subset for test
```


```
plot(dataframe$col1, dataframe$col2, pch=bullettype,
col=color or a descriptive variable, cex = size) #single plot
plot(dataframe[,1:4]) #plot first 4 columns against each other
```


The col option can take a factor so that different categories are drawn in different colors; cex can likewise be given a variable or formula to draw differently sized dots.

plotting multiple columns creates a matrix of plots

pch has integer values to represent different types of bullets

cex scales the size of the plotted points

```
smoothScatter(x, y)
hexbin(x, y)
qqplot(x, y)
```


smoothScatter shades point density with a color gradient, while hexbin provides a legend mapping point colors to frequency. qqplot plots the quantiles of x vs the quantiles of y (samples from similar distributions lie on a 45 degree line).

```
lm(dataframe$quantitative ~ as.factor(dataframe$factor))
lm(dataframe$quantitative ~ relevel(dataframe$factor,
ref ="reference variable")) # sets a reference variable for the lm
```


Creates a linear regression of a quantitative variable on factor variables. The first factor level is the reference level. Use the second example to define a different reference level.

R has a good help function but I feel that this is a good exercise to remember all these commands as well as have a broad reference dictionary.

```
getwd()
setwd("directory/directory")
```

Shows current working directory and sets a working directory

```
attach(dataframe)
detach(dataframe)
```

Attaches the current dataframe’s columns to the search path so they can be referenced by name, or removes them again

*Not sure how useful this is; I think I might use it sparingly to avoid namespace conflicts.

```
read.csv(file = "filename", header = T/F)
read.table(file = "filename", sep = "separator", header = T/F,
           stringsAsFactors = T/F, col.names = T/F, row.names = T/F,
           strip.white = T/F)
```

Reads in a CSV or table with different settings.

```
head(dataframe, numofrows)
dim(dataframe)
names(dataframe)
summary(dataframe)
quantile(dataframe$column)
class(dataframe)
sapply(dataframe[1,],class)
unique(dataframe$column)
length(dataframe$column)
table(dataframe$column, useNA="ifany")
```

head: see first numofrows of dataframe

dim: see dimensions of dataframe

names: see col names

summary: get counts for qualitative variables or numerical summary of a quantitative variable

quantile: get quantiles of a quantitative variable

class: get class of dataframe or column

sapply: apply class function to first row of dataframe to get classes of all variables

unique: get unique values of a column

length: get length of column

table: create a table of unique values and counts, mainly for qualitative variables; useNA="ifany" shows NA value counts

```
any(dataframe$column [condition])
all(dataframe$column [condition])
example:
any(data$column > 40) # true/false
all(data$column > 0 & data$column < 40)
```

any: tests condition for any matches

all: tests condition for all values

```
subset(dataframe, conditions, select = c(column, column))
```

subset: subset dataframe by certain conditions and only return selected columns

```
merge(dataframe1, dataframe2, by.x = "column1",
      by.y = "column2", all = T/F)
```

by.x and by.y: specify a merge on columns that do not share the same name

all: specifies an outer join versus an inner join: T includes all records and inserts NA for all missing information

`dataframe1$column[order(dataframe1$column)]`

order returns the row numbers that sort the column; indexing the column with it, as above, returns its values in sorted order

`sorted <- dataframe1[order(dataframe1$column, dataframe1$column2),]`

Stores into "sorted" dataframe1 sorted by column, then by column2. Additional sort keys can be added inside the order function.

*Be careful to add the "," after the order function or you will get an error.

`melt(dataframe, id.vars="idcolumn", variable.name="varnames",value.name="values")`

Example output:

```
Input Matrix:
Name TreatmentA TreatmentB
John 4 1
Jane 5 2
Result:
Name Treatment Value
John TreatmentA 4
John TreatmentB 1
Jane TreatmentA 5
Jane TreatmentB 2
```

This takes a matrix style table and reshapes it to have one observation per row.

*Requires install.packages("reshape")

The problem begins as three pegs on which discs can be placed. The discs are placed in decreasing order of size to create a pyramid of n height on peg 1 and the task is to move the discs from peg 1 to peg 3. The constraint of the problem is that a larger sized disc cannot be placed on top of a smaller disc.

Creating an array for each of the three pegs and notating each disc as an integer from 1 – n depending on size gave me a representation of the problem. The beginning of the problem looked like so:

tower1 = [n,n-1,…,4,3,2,1]

tower2 = []

tower3 = []

I then began to map out the base cases for the problem:

**N = 1, 1n move from tower 1 to tower 3:**

1 disc from tower 1 to tower 3

**N = 2, 2n move from tower 1 to tower 3:**

1n move from tower 1 to tower 2

1 disc from tower 1 to tower 3

1n move from tower 2 to tower 3

**N = 3, 3n move from tower 1 to tower 3:**

2n move from tower 1 to tower 2

1 disc from tower 1 to tower 3

2n move from tower 2 to tower 3

If we notate the towers as origin, destination, and buffer, we begin to see the pattern:

origin to buffer

origin to destination

buffer to destination

Also we see that the problem lends itself to recursion easily because for a 3 height tower there are just two 2 height tower sub-problems, for a 4 there are two 3 height tower sub-problems, etc.

Given this fact, we can piece together a recursive solution using plain arrays:


```
# Define a move
def move(origin, dest)
  if dest.empty? || origin[-1] < dest[-1]
    dest.push(origin.pop)
  end
end

# The recursive solution
def move_tower(n, origin, dest, buffer)
  if n == 1
    move(origin, dest)
  else
    move_tower(n - 1, origin, buffer, dest)
    move_tower(1, origin, dest, buffer)
    move_tower(n - 1, buffer, dest, origin)
  end
end

tower1 = [6,5,4,3,2,1]
tower2 = []
tower3 = []
move_tower(6, tower1, tower3, tower2)
# Result is tower3 = [6,5,4,3,2,1]
```

The move method ensures we don't make an illegal move by checking that the target stack is either empty or has a larger disc on top than the one we are moving. The solver takes in the height and the three towers and solves the problem recursively.
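As a quick sanity check, here is a self-contained sketch that repeats the solver and verifies it across several tower heights:

```ruby
# Repeats the move/move_tower definitions so this snippet runs on its
# own, then checks the solver for towers of height 1 through 6.
def move(origin, dest)
  dest.push(origin.pop) if dest.empty? || origin[-1] < dest[-1]
end

def move_tower(n, origin, dest, buffer)
  if n == 1
    move(origin, dest)
  else
    move_tower(n - 1, origin, buffer, dest)
    move_tower(1, origin, dest, buffer)
    move_tower(n - 1, buffer, dest, origin)
  end
end

(1..6).each do |height|
  origin = (1..height).to_a.reverse   # e.g. [3, 2, 1]
  dest, buffer = [], []
  move_tower(height, origin, dest, buffer)
  raise "failed at height #{height}" unless dest == (1..height).to_a.reverse
end
```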

The solution was quite fun to puzzle through by finding the pattern in the cases and modeling a solution to reflect that pattern. There is also an iterative solution that takes into account the fact that odd and even numbered towers have different solutions with set repeating moves. I will try to tackle that one in the future!

You are a person trying to cross a river and reach the land on the other side; however, you cannot swim.

There are rocks in the water which you can step on to cross the river. You start at speed 0 and, with each step you take, can increase your speed by 1, decrease it by 1, or keep the same speed. You cannot have a negative speed (go backwards).

Given a string representing rocks (“*”) and water (“~”) write a method to return True or False depending on whether any path exists across the river.

```
Examples
River: *****~~*~~~*~~~~*~*~~*** -> true
Steps: 011_2__3___4____5_____6 Success
River: ****~~~~~~~~~~~~** -> false
Steps: 01_2__3 Failure
```

~~~~~~~~~

This problem, courtesy of Sitong, was an interesting programming challenge we tried to tackle yesterday. The problem itself is a bit unique as it represents a sort of tree traversal problem where we have a series of possibilities (accelerate, decelerate, or maintain constant speed) at every step and search the tree for a valid solution. Given that it is a tree traversal problem, the most basic approach would be recursive.

I have always favored iteration over recursion when I program, just because of the way my mind works, but recursive solutions tend to be easy to read and relatively short. The catch with recursion that I am always wary of is run-time, but we’ll get to that problem later!

The solution that we both came up with (myself with a little prodding from Sitong) was one of tree traversal. My solution in ruby follows:

```
def can_cross_river?(river, loc, speed)
  # Base case: success if we reach land
  return true if loc + speed + 1 > river.length
  # If we landed on another rock, try 3 steps recursively
  # Accelerate first every time so we get the fastest path across
  if river[loc + speed + 1] == "*"
    return true if can_cross_river?(river, loc + speed + 1, speed + 1)
  end
  # If accelerating fails try maintaining speed for this step
  if speed > 0 && river[loc + speed] == "*"
    return true if can_cross_river?(river, loc + speed, speed)
  end
  # If maintaining speed fails try slowing down
  if speed > 1 && river[loc + speed - 1] == "*"
    return true if can_cross_river?(river, loc + speed - 1, speed - 1)
  end
  false
end
```

The code checks every possible path, starting with the fastest, and looks for the first successful run. However, when calculating the run time of this approach we ran into some interesting caveats. The initial bound when I first looked at the problem was O(3^n), because the recursion presents 3 possible choices at each step, and for n steps this yields 3^n paths. However, that is merely an upper bound: the tree depth shortens as we accelerate and reach the goal faster, so there are significantly fewer than 3^n possible paths. I don’t think I can calculate this worst-case bound mathematically, but my intuition is captured in the figure below:

The triangle represents a normal 3^n tree of possibilities but the grey shaded area represents the area of this problem as we accelerate when we take a right branch and reach a true or false conclusion faster. We cannot go slower than 1 speed so we don’t have any branches that go left from the origin.

There are other solutions for this problem and other optimizations, including memoization or caching; however, those also have issues. This problem is unusual in that two sub-problems may face the same remainder of the river but approach it at different speeds. Thus caching involves a huge problem space, and questions of efficiency arise. I think this problem was great fodder for the mind and provided a lot of interesting food for thought. I also asked David the same problem and saw a completely different approach to solving it; however, we haven’t figured out the details of how to efficiently tackle the problem from his angle yet. Perhaps if we do I will write another post comparing his solution and this one.
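As a sketch of the caching idea discussed above (my own sketch, not a worked-out solution from the post), memoizing on the (location, speed) state means each state is solved at most once, at the cost of the large state space:

```ruby
# Memoized variant of the recursive search; results are cached per
# [location, speed] state so repeated states are not re-explored.
def can_cross?(river, loc = 0, speed = 0, memo = {})
  key = [loc, speed]
  return memo[key] if memo.key?(key)
  # Success once even the next accelerating step lands past the river
  return memo[key] = true if loc + speed + 1 > river.length
  # Try accelerating, maintaining, then decelerating, fastest first
  [speed + 1, speed, speed - 1].each do |s|
    next if s < 1
    if river[loc + s] == "*" && can_cross?(river, loc + s, s, memo)
      return memo[key] = true
    end
  end
  memo[key] = false
end
```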

1. Start off by clarifying goals and results. The most important part of a case is understanding what the desired result is. Whether I’m designing a data system, formulating a solution, or estimating effects, I should understand what the overarching goal is before I even start to walk through my thinking. Who is my interviewer supposed to be? What information can they provide?

2. I tend to work in an iterative fashion as I walk through cases, which lends itself to train-of-thought syndrome. Framework. I should lay out and notate a framework. Start from the beginning. List assumptions and available resources. List the goal and the sub-goals that can be derived from it. Do not begin to talk about a solution before laying down this framework. My initial solution tends to change as I work through a problem; to mitigate that I can work through the problem on paper, so the first solution I explain can at least be a fourth or fifth draft in my mind.

3. Immediately consider these questions as I’m working:

1. What are the advantages to this system? How can it be extended? How generalized is it?

2. What are the disadvantages to this system? What can it not measure?

3. How can this system work with extensions? Other systems?

4. If data-based, what do I track and measure and what do I not know? Can I get the information I don't know? Does it fit into the current scheme?

5. Think about the goal. Does this solve that problem? Does it scale to this problem? Are there any tangential goals that might be in question?

6. What company am I interviewing for and what does this question have to do with the company? What are the parallels and what should I focus on as I explain my solution?

4. Breathe. Take time and slow down. I’ve always had a habit on essay tests to just start writing. In class I never take notes and I work from memory. While this approach serves me well internally, I need to highlight my strengths in problem solving in a way that can be followed by the interviewer. Go back to the outline. Go back to the notes. Go back to the goal and review it again. I need to slow down and keep reviewing, move at a deliberate but more coherent and complete pace.

5. When working with data, lay out the specifics. Write down the tables, the relationships, and the storage. This might not need to be conveyed to the interviewer depending on the position, but having this laid out provides the organization for explaining what data can be derived from the system.

6. Define the results and capabilities of the system that you’ve designed. Understand its flaws and think through its implementation. Important aspects: scaling, use case, expandability, applications, edge cases, any missing information? Always think about edge cases and the world view of the problem; don’t fall into the trap of assuming the topic, whether a store, business, or other, reflects the first example that comes to mind. Think of other environmental or confounding factors that may change the optimal design.

I may come back and add to and edit this as I learn more. It’s important in all circumstances to always keep learning from any situation. Until the next one!

The pseudo-code follows:

```
i = 2
n = N
while i <= n            # run i up to the (reduced) n
  while n % i == 0      # take out the same factors
    n = n / i           # reduce iterating range
    largest_prime_factor = i  # the largest factor so far
  i += 1
return largest_prime_factor
```

My code runs very well for numbers that have a large number of factors but gets worse as N has larger prime factors. The worst case for my algorithm is if N is prime.
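A runnable Ruby translation of the pseudo-code above (Ruby chosen to match the other code on this blog):

```ruby
# Trial division: divide out each candidate factor completely, so by the
# time a candidate divides n it must be prime; the last one is largest.
def largest_prime_factor(n)
  largest = 1
  i = 2
  while i <= n
    while n % i == 0
      n /= i
      largest = i
    end
    i += 1
  end
  largest
end
```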

Now the advantages of collaboration show up as I shared my solution with Sitong Peng. His solution approached the problem from an entirely different angle. He searched for pairs of numbers where x * y = N and checked the larger factor to see if it was prime. Because he iterated starting from 2 as well, the first prime factor would be the answer. Sitong’s R code translated into pseudo-code and a step through of the process follows:

```
# This block is credit to Sitong Peng
def is_prime?(n)
  if n == 1
    return false
  if n < 4
    return true
  for i in range(2, sqrt(n) + 1)
    if n % i == 0
      return false
  return true

def largest_prime_factor(n)
  i = 2
  if is_prime?(n)
    return n
  while (i < n)
    if n % i == 0
      if is_prime?(n/i)
        return n/i
    i += 1

# Step through of the process
Take for example 30:
The pairs are:
1 30  # check 30 if prime (no)
2 15  # check 15 if prime (no)
3 10  # check 10 if prime (no)
5 6   # check 6 if prime (no)
6 5   # check 5 if prime (yes) < Answer
10 3
15 2
30 1
```

An interesting property of these algorithms is that my worst-case scenario happens to be Sitong’s best case, as the first number he checks is N itself while mine would have to iterate all the way up to N. Sitong’s worst-case scenario, on the other hand, is a number like 2^n whose largest prime factor is small: his algorithm has to iterate up to the large cofactor to find the first prime. Another step-through follows:

```
# Worst case for Sitong's solution
Take for example 16:
1 16 # check 16 if prime (no)
2 8 # check 8 if prime (no)
4 4 # check 4 if prime (no)
8 2 # check 2 if prime (yes)
16 1
```

With two algorithms that are almost inverses in terms of best and worst case, it only makes sense to combine the two into a super algorithm! With a simple 2-line change in the code, a new algorithm was born. Combining my idea of reducing the search space by removing factors with Sitong’s prime checking to look for very large prime factors yielded an algorithm that deals nicely with both of our worst cases. This post looks much longer than our solution, but the process of problem solving that led to the combination was a lot of fun to share. The pseudo-code follows:

```
# Recycled is_prime function from Sitong
def is_prime?(n)
  if n == 1
    return false
  if n < 4
    return true
  for i in range(2, sqrt(n) + 1)
    if n % i == 0
      return false
  return true

# New solution
def largest_prime_factor(n)
  i = 2
  if is_prime?(n)
    return n
  while (i < n)
    while n % i == 0   # 2 lines of changes: a while loop grabs all small primes
      n = n/i          # reducing n to shrink the iteration
      if is_prime?(n)  # once the reduced n is prime, it is the largest factor
        return n
    i += 1
```
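A runnable Ruby version of the combined algorithm (my translation, not code from either of us at the time):

```ruby
# Combined approach: strip out small factors while checking whether the
# remaining cofactor is prime; the first prime cofactor is the answer.
def prime?(n)
  return false if n < 2
  return true if n < 4
  (2..Integer.sqrt(n)).each { |i| return false if n % i == 0 }
  true
end

def largest_prime_factor(n)
  return n if prime?(n)
  i = 2
  while i < n
    while n % i == 0
      n /= i
      return n if prime?(n)  # what's left is prime, so it's the largest
    end
    i += 1
  end
  n
end
```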

To check out our epiphany from Sitong’s perspective check out his post!

Today, I was thinking of a problem that came up in a possible project that Sitong suggested. The project involved creating a type of “heat” map representing the minutes to a destination given walking and public transportation options. Mulling this over presented one of the classic problems in algorithms: the shortest path problem. I have little experience with algorithms, so I thought this would be a great problem to come up with a personal solution for before checking the optimal solution. Hopefully my solution will turn out to be the optimal one.

In this problem, you are given a graph with nodes and connections with values, in this case the values for the connections represent minutes of travel between each node. The nodes will represent travel options between two points on a map and the goal is to find the shortest path from an initial starting node and a chosen end node.

In my haze of sleep I began to think of solutions for a quick optimization of finding the shortest path and began to envision the graph node network as a malleable net. If one picks up the starting node and ending node in each hand and stretches, the shortest path will appear as the first path that can no longer stretch between the two hands (ignoring, of course, all physical tangling of the net). The problem I face is to quickly calculate the distance from every node on a map to another node, and if this act of “lifting” the node from the net instantly reveals the shortest path, it may somehow be translatable into code.

My current train of thought brings me to trees. Given a graph network, it may be possible to translate that graph into a tree and stretch it out until you find the shortest path between the two nodes…

~ Hacking in Progress ~

After doing some reading and implementing Dijkstra’s algorithm in Ruby, I see the parallels between my visual thoughts and the mathematical explanations. Creating an even set of sub-nodes and traveling down all parallel paths until you hit nodes is analogous to lifting a net and having all connections lift off a table until, node by node, the solution appears. I find this connection between my thoughts and the derivation fascinating and really do wonder if there are other things out there that I could find novel solutions to. Looking forward to more late-night epiphanies and other sparks of inspiration!
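For reference, here is a minimal sketch of Dijkstra's algorithm in Ruby (a fresh sketch, not the implementation mentioned above). The graph is assumed to be a hash of hashes mapping each node to its neighbors and the travel minutes between them:

```ruby
# Dijkstra's algorithm over an adjacency-hash graph, without a priority
# queue: repeatedly visit the closest unvisited node and relax its edges.
def shortest_minutes(graph, source, target)
  dist = Hash.new(Float::INFINITY)
  dist[source] = 0
  unvisited = graph.keys
  until unvisited.empty?
    u = unvisited.min_by { |node| dist[node] }
    break if u == target || dist[u] == Float::INFINITY
    unvisited.delete(u)
    graph[u].each do |neighbor, minutes|
      dist[neighbor] = [dist[neighbor], dist[u] + minutes].min
    end
  end
  dist[target]
end

# Hypothetical travel-time graph for illustration
travel = {
  home:   { bus: 5, walk: 2 },
  walk:   { bus: 1, office: 10 },
  bus:    { office: 3 },
  office: {}
}
shortest_minutes(travel, :home, :office)  # => 6 (home -> walk -> bus -> office)
```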

]]>Using Ubuntu 12.04

The first step is to install the PostgreSQL packages, which come in two pieces: the client and the contrib package with the additional add-ons that the book requires:

```
sudo apt-get install postgresql-client
sudo apt-get install postgresql-contrib
```

Postgresql creates a postgres user which manages the server if you run it locally. Ideally you should create a user for yourself:

```
# Opens psql as the postgres superuser
sudo -u postgres psql
# Creates a user for yourself (run from the shell, not inside psql)
createuser --superuser [user]
# Create a password for your user (inside psql)
\password
```
Exit the psql dashboard and return to terminal. Now I can create and log into my server by using the psql command and install the extensions:

```
# Create DB
createdb book
# Open the newly created DB
psql book
# Install the extensions needed for the exercises in Seven Databases in Seven Weeks
create extension tablefunc;
create extension dict_xsyn;
create extension fuzzystrmatch;
create extension pg_trgm;
```

Another link which was quite helpful for getting psql setup and running:

https://help.ubuntu.com/community/PostgreSQL

In a classic example consider two problems:

You are given $1000 and asked to pick one of two gambles:

A) 50% chance of winning $1000 (and a 50% chance of winning nothing)

B) $500 for sure

You are given $2000 and asked again to pick one of two gambles:

A) 50% chance to lose nothing (and a 50% chance to lose $1000)

B) Losing $500 for sure

For the majority of survey respondents, in the first event they chose to gain $500 for sure, but when the event is framed as a loss, they chose to gamble. This is because of loss-aversion that is exhibited once you have a set reference point. The chance to avoid a loss becomes more tempting because we are so averse to losing once we have the $2000 in our possession even though the expected values of the gambles are identical.
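The identical expected values can be checked with quick arithmetic (assuming the unstated halves of each gamble: winning nothing in 1A and losing $1000 in 2A):

```ruby
# Expected final wealth under each choice; both problems work out to the
# same $1500 and differ only in framing. The 50/50 complements (win
# nothing in 1A, lose $1000 in 2A) are assumptions spelled out here.
ev_1a = 1000 + 0.5 * 1000 + 0.5 * 0    # start $1000, gamble
ev_1b = 1000 + 500                     # start $1000, sure gain
ev_2a = 2000 - 0.5 * 1000 - 0.5 * 0    # start $2000, gamble
ev_2b = 2000 - 500                     # start $2000, sure loss
```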

Some day-to-day observations regarding leadership decisions while working at a public company seem to be neatly explained by prospect theory. Chief among these observations is the cultish obsession with revenue and EBITDA at my company. While financials reflect some measure of success, upper management has come to worship them as harbingers of life and death. One of the challenges I have faced is the battle against upper management over whether to prioritize long-term goals vs short-term gains. In this losing battle, I have seen time and time again a shift in prioritization from long-term progress to short-term goals. The ideas of reference points and loss-aversion within prospect theory provide a possible explanation for these sub-optimal decisions.

Prior to an IPO, a company is often in a mode where it may be pre-revenue or unprofitable. A monetary loss may not be important and can be offset by a more intangible resource gain within the company. Without public reporting, the reference point for the company is more fluid and changes with the needs of the company. The challenge of an IPO is that once a monetary reference point is set, these intangibles which may help the company in the future may become undervalued as they are not counted in the bottom line. The correct time for an IPO often comes when companies are growing the fastest and striving to become more profitable, exacerbating the possibility of missing projections and entering the realm of “loss”. In a public company, the focus is completely different, with forecasting and quarterly announcements of targets. These promises become reference points set in stone for leadership and, when faced with the danger of missing the numbers, loss-aversion pushes decisions that will sacrifice anything to avoid losses. These initiatives tend to be ones that prioritize short-term revenue and padding while ignoring long-term impacts.

A second symptom of this is an obsession with quantifying initiatives by their revenue/margin impact. This focus often de-prioritizes long-term initiatives because they do not have immediate revenue boosts and are more resource intensive. Thus, by avoiding losses relative to a public reference point, leadership trades long-term gains for short-term value.

I believe that strong leadership is able to avoid these pitfalls, whether by understanding the economic concepts that drive their decisions, or by having a broader sense of what is good and bad for the company. As of yet I am still on the lowest rung of the corporate ladder; however, I do wonder if understanding these models can make us immune to their pitfalls. Perhaps that would be an interesting experiment to conduct!

Tenet 1:

In a factor economy understand the entire supply chain.

Frequent and dedicated trading is almost always easiest at the lowest level of a supply chain. The liquidity of these goods and the amount of available supply and demand are almost always highest at the lowest levels of the supply chain because these goods must be combined to create higher level parts. When trading these commodities, the entire chain of production becomes a possible exit point because you have the ability to undertake the crafting to convert that good into a product. This ability to convert the good as well as trade it allows for greater reward because of the higher possibility of arbitrage between the purchase and the many exit goods, as well as reduced risk because the good can be converted to another.

A caveat to trading low level goods is that in games with crafting systems there is a hidden value to most commodities that is priced into the market. When crafting requires repeated creation of products using raw materials, the experience gained from the commodity is often priced into the material. Often this will result in the products created from these materials selling for less than the inputs because of the value of the experience. Being aware of the usefulness of the outputs and extra careful about the dollar value of experience will help in avoiding costly trades.

Tenet 2:

The more liquid the good, the smaller the margin.

Just as demand and supply are highest for the low level commodities, the trading activity in these goods creates the most efficient markets. Because of the constant activity, spreads between the bids and asks tend to be small, and thus the margins for trading these goods tend to be thin. One must be very careful when trading with low margins because the often overlooked costs of commissions and trading fees can turn a profitable trade into one in the red. On the other hand, slow markets tend to have high spreads and thus offer the most opportunity for profit. One tactic that follows from this tenet: when you have time to dedicate to trading, pick highly liquid goods to trade frequently and increase profits; when you are not actively in the market, invest in a few highly illiquid goods to effectively maximize the profit from a single trade while you are away from the market. One caution with this technique is to be sure to pick markets you understand well. In illiquid markets, prices are often very volatile because of the nature of players posting outrageous prices. One must use a lot of caution in picking an investment because this volatility usually regresses to the mean, and thus the profit opportunity will be lower than expected.

Tenet 3:

Always understand the origin of supply and the reason for demand.

Understanding supply for goods can help you foresee and take advantage of supply shocks or booms. In games, all goods can be traced to a specific origin. Understanding the amount of the good available and its relative rarity can help with judging price and maximizing your profit. Equally important is understanding the reason behind demand. If demand is artificially inflated by some circumstance, investing in a good can be dangerous when the market crashes. This particular note is very salient in real world markets as well.

An example of this in action is an early demand spike for a highly rare item. On the supply side, take for example a rare dungeon drop that is found early on after a game’s release. It is almost always good to sell that good immediately because supply will be severely constrained. As the game progresses, the rarity of the item will drop and the value of the good will stabilize at a lower level. The best course of action in this circumstance is to immediately sell the rare good at a very high price (take advantage of the anchoring effect ;)) and then quickly reinvest the money into the market to beat inflation. Inflation early in a game economy is a very real issue because of low currency saturation. As money is collected by the game population, the money supply will grow quickly, and inflation will be high and unstable. In an established market, the urgency of these trades is much lower because of the relatively lower risk of inflation.

~~

These are just a few basic aspects that I keep in mind when thinking about a trade. After a while, most of these considerations become second nature and you will automatically understand when each of these tenets apply and what they might tell you about the good. Most game economies are only mildly sophisticated because of the nature of the market with many participants that are not engaged in actual trading and use the market purely as a tool to sell and buy. This gives the traders of any game a great opportunity to accumulate returns and ultimately those who are the most aware come away with the best earnings!
