Data Analysis and R, A (continuously updating) Personal Reference

APPLY:

SAPPLY:

sapply(dataframe, FUN = mean) # FUN can be sum, mean, etc.


sapply: Applies a function to each element of a vector or list (for a data frame, to each column) and returns the results simplified to a vector where possible.
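
As a small illustration, here is a sketch that assumes the built-in mtcars data set stands in for dataframe (the column choice is arbitrary):

```r
# mtcars ships with R; it stands in for `dataframe` here (an assumption)
col_means <- sapply(mtcars[, c("mpg", "hp", "wt")], FUN = mean)

# The result is a named numeric vector with one entry per column
print(col_means)
```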

LAPPLY:

lapply(dataframe, FUN = mean) # FUN can be sum, mean, etc.


lapply: Applies a function to each element of a vector or list (for a data frame, to each column) and always returns a list with the results.
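
A minimal sketch of the list return value, again assuming mtcars as a stand-in data frame. range is used because it returns two values per column, which a list holds naturally:

```r
# mtcars stands in for `dataframe` (an assumption for illustration)
col_ranges <- lapply(mtcars[, c("mpg", "hp")], FUN = range)

# The result is a list with one element per column
str(col_ranges)

# Individual results are accessed by name
col_ranges$mpg
```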

TAPPLY:

tapply(dataframe$col1, dataframe$col2, FUN = mean) # FUN can be sum, mean, etc.


tapply: Takes the data in the first column, splits it into groups defined by the factor in the second column, and applies the function to each group.
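
For example, assuming mtcars as the data frame, with mpg as the data column and cyl as the grouping factor:

```r
# Mean miles per gallon within each cylinder group
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, FUN = mean)

# One value per level of the grouping column, named by level
print(mpg_by_cyl)
```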

AGGREGATE:

aggregate(cbind(col1, col2) ~ col3 + col4, 
     data = dataframe, FUN = mean) # FUN can be sum, mean, etc.


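aggregate: Creates a pivot-style summary of the quantitative data in col1 and col2, grouped by every combination of col3 and col4, applying the function within each group. With data = dataframe the columns are named directly in the formula.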
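
A sketch of the formula interface, assuming mtcars with mpg and hp as the quantitative columns and cyl and am as the grouping columns:

```r
# Mean mpg and hp for every observed combination of cyl and am
agg <- aggregate(cbind(mpg, hp) ~ cyl + am, data = mtcars, FUN = mean)

# One row per group; grouping columns come first, then the summaries
print(agg)
```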

CUT:

cut2(dataframe$column, g = numberofgroups) # cut2 comes from the Hmisc package: library(Hmisc)


cut2: Cuts a continuous variable into a factor with g groups of roughly equal size (quantile groups).
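
If Hmisc is not installed, a similar grouping can be sketched in base R with cut and quantile. This is a swapped-in approximation rather than cut2 itself, and the labels and boundary handling may differ; mtcars$mpg and g = 4 are assumptions for illustration:

```r
x <- mtcars$mpg
g <- 4  # desired number of groups

# Quantile break points: g groups need g + 1 breaks
breaks <- quantile(x, probs = seq(0, 1, length.out = g + 1))

# include.lowest = TRUE keeps the minimum value inside the first group
groups <- cut(x, breaks = breaks, include.lowest = TRUE)

# Count of observations per group
table(groups)
```

Note that cut fails if the quantile breaks are not unique, which can happen with heavily tied data.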

SAMPLE:

sample(1:rows, size = number, replace = TRUE) # or replace = FALSE


sample: Returns a vector of size row numbers drawn at random, with or without replacement. Useful for generating smaller random subsets.
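
For instance, a sketch that draws 10 random rows from mtcars (the data set, seed, and size are all assumptions for illustration):

```r
set.seed(42)  # arbitrary seed so the draw is reproducible

# 10 distinct row numbers out of all rows
rows <- sample(1:nrow(mtcars), size = 10, replace = FALSE)

# Use the row numbers to pull the random subset
subset_df <- mtcars[rows, ]
dim(subset_df)
```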

RANDOMLY SUBSETTING TRAIN AND TEST DATA:

set.seed(123) #set a random seed (any integer) so the split is reproducible
i <- rbinom(nrow(dataframe), size=1, prob=.5) #flip one fair coin per row
train <- dataframe[i==1,] #rows that came up 1 go to train
test <- dataframe[i==0,] #rows that came up 0 go to test
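
The coin-flip approach gives train and test sets whose sizes vary from run to run. When an exact split is wanted, one alternative is to combine it with sample. The following is a sketch assuming mtcars and a 70/30 split, neither of which comes from the notes above:

```r
set.seed(123)  # arbitrary seed for reproducibility
n <- nrow(mtcars)

# Pick exactly 70% of the row numbers (rounded down) for training
train_rows <- sample(seq_len(n), size = floor(0.7 * n))

train <- mtcars[train_rows, ]   # selected rows
test  <- mtcars[-train_rows, ]  # everything else

c(train = nrow(train), test = nrow(test))
```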

PLOT:

plot(dataframe$col1, dataframe$col2, pch = 19, 
    col = dataframe$factorcol, cex = 1) #single plot; col can be a color or a factor
plot(dataframe[,1:4]) #plot first 4 columns against each other


The col option can take a factor so that different groups are drawn in different colors. cex can also take a vector or an expression of a column to draw dots of different sizes.
Plotting multiple columns creates a matrix of pairwise scatter plots.
pch takes integer values that select different plotting symbols.
cex scales the size of the plotting symbols (1 is the default size).
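
A sketch that puts these options together, assuming mtcars, with cyl as the coloring factor and hp driving the dot size (the scaling by 100 is an arbitrary choice):

```r
cyl_factor <- as.factor(mtcars$cyl)

# Color by cylinder group, size dots by horsepower
plot(mtcars$wt, mtcars$mpg,
     pch = 19,                # solid circle
     col = cyl_factor,        # factor levels map to palette colors 1, 2, 3
     cex = mtcars$hp / 100,   # larger dots for higher horsepower
     xlab = "wt", ylab = "mpg")

# Legend matching the colors to the factor levels
legend("topright", legend = levels(cyl_factor),
       col = seq_along(levels(cyl_factor)), pch = 19, title = "cyl")
```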

OTHER SCATTER PLOTS:

smoothScatter(x,y) #note the capital S
plot(hexbin(x,y)) #requires the hexbin package: library(hexbin)
qqplot(x,y)


smoothScatter shades the plot by point density, and plotting a hexbin object bins the points into hexagons with a legend mapping shade to count. qqplot plots the quantiles of x against the quantiles of y (samples from the same distribution lie close to the 45-degree line).
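
As a quick check of the 45-degree behavior, here is a sketch with two simulated samples from the same normal distribution (the seed and sample size are arbitrary assumptions):

```r
set.seed(1)
x <- rnorm(200)
y <- rnorm(200)

# qqplot draws the plot and also returns the sorted values it plotted
qq <- qqplot(x, y)

# Reference line with intercept 0 and slope 1 (the 45-degree line)
abline(0, 1)
```

With samples of equal length, the plotted points are simply the sorted x values against the sorted y values.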

REGRESSION ON FACTORS:

lm(dataframe$quantitative ~ as.factor(dataframe$factor))
lm(dataframe$quantitative ~ relevel(as.factor(dataframe$factor), 
     ref = "reference level")) # sets the reference level for the lm


Fits a linear regression of a quantitative variable on a factor. By default the first level of the factor is the reference level, and each coefficient is the difference from it. Use the second example to choose a different reference level.
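
A sketch of both calls, assuming mtcars with mpg as the quantitative variable and cyl as the factor. In each fit the intercept is the mean of the reference group:

```r
cyl_factor <- as.factor(mtcars$cyl)

# Default: the first level ("4") is the reference
fit_default <- lm(mtcars$mpg ~ cyl_factor)
coef(fit_default)

# Make "8" the reference level instead
cyl_ref8 <- relevel(cyl_factor, ref = "8")
fit_ref8 <- lm(mtcars$mpg ~ cyl_ref8)
coef(fit_ref8)
```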
