Data Analysis and R, A (continuously updating) Personal Reference

APPLY:

SAPPLY:

``sapply(dataframe, FUN = mean) # or sum, median, etc.``

sapply: Applies a function to each column of the data frame (or to each element of a vector or list) and simplifies the results to a vector where possible.
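A minimal sketch of sapply on a toy data frame. The data frame and its column names (height, weight) are invented for illustration only.

```r
# Toy data frame; the column names are made up for this sketch.
df <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))

# One call per column; the results simplify to a named numeric vector.
col_means <- sapply(df, FUN = mean)

# One call per element when given a single column; here each value is squared.
squares <- sapply(df$height, FUN = function(x) x^2)

print(col_means)
print(squares)
```

Note that passing a single column applies the function to each value separately, so summary functions like mean are normally applied to the whole data frame or a subset of columns.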

LAPPLY:

``lapply(dataframe, FUN = mean) # or sum, median, etc.``

lapply: Applies a function to each column of the data frame (or to each element of a vector or list) and returns a list with one result per element.
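A short sketch of lapply, again on an invented toy data frame. It uses range as the function to show a case where each result has more than one value and a list is a natural container.

```r
# Toy data frame; the column names are made up for this sketch.
df <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))

# One call per column; the results are kept as a named list.
col_ranges <- lapply(df, FUN = range)

# Each list element holds the min and max of one column.
print(col_ranges$height)
print(col_ranges$weight)
```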

TAPPLY:

``tapply(dataframe$col1, dataframe$col2, FUN = mean) # or sum, median, etc.``

tapply: Splits the values in the first column into groups defined by the factor in the second column and applies the function to each group.
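A minimal sketch of tapply computing a mean per group. The data frame and the names score and group are hypothetical.

```r
# Toy data: four scores in two groups (names invented for this sketch).
df <- data.frame(score = c(10, 20, 30, 40),
                 group = c("a", "b", "a", "b"))

# Mean score within each level of group; the result is named by level.
group_means <- tapply(df$score, df$group, FUN = mean)

print(group_means)
```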

AGGREGATE:

``````aggregate(cbind(col1, col2) ~ col3 + col4,
data = dataframe, FUN = sum) # or mean, median, etc.``````

aggregate: Summarizes the quantitative data in col1 and col2 for each combination of col3 and col4 by applying the function to every group, much like a pivot table. When data = dataframe is supplied, the formula refers to the columns by name.
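A small sketch of the formula interface of aggregate. The data frame and its columns (sales, cost, region, year) are invented for illustration.

```r
# Toy data; all names and values are made up for this sketch.
df <- data.frame(
  sales  = c(10, 20, 30, 40),
  cost   = c(1, 2, 3, 4),
  region = c("east", "east", "west", "west"),
  year   = c(2020, 2021, 2020, 2020)
)

# Sum sales and cost for every region/year combination present in the data.
totals <- aggregate(cbind(sales, cost) ~ region + year, data = df, FUN = sum)

print(totals)
```

Only combinations that actually occur in the data appear in the result, so this toy example yields three rows rather than four.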

CUT:

``cut2(dataframe$column, g = numberofgroups) # cut2 comes from the Hmisc package``

Cut a continuous variable into a factor with g quantile groups of roughly equal size.
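For sessions where Hmisc is not installed, a rough base R stand-in is sketched below. This is not cut2 itself: it is an approximation that passes quantile breakpoints to base cut, and it assumes the data have no ties at the breakpoints. The variable names are hypothetical.

```r
set.seed(1)
x <- runif(100)  # toy continuous variable
g <- 4           # desired number of groups

# Quantile breakpoints: g + 1 cut points give g intervals.
breaks <- quantile(x, probs = seq(0, 1, length.out = g + 1))

# include.lowest = TRUE keeps the minimum value in the first group.
groups <- cut(x, breaks = breaks, include.lowest = TRUE)

print(table(groups))
```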

SAMPLE:

``sample(1:rows, size = number, replace = TRUE) # or replace = FALSE``

sample: Draws the requested number (size) of row numbers at random, with or without replacement, and returns them as a vector. Useful for generating smaller random subsets.
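A minimal sketch of using sample to pull a random subset of rows. The data frame and its columns are invented for illustration.

```r
set.seed(42)

# Toy data frame with 10 rows (names made up for this sketch).
df <- data.frame(id = 1:10, value = (1:10)^2)

# Draw 4 distinct row numbers, then use them to index the data frame.
rows <- sample(1:nrow(df), size = 4, replace = FALSE)
subset_df <- df[rows, ]

print(subset_df)
```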

RANDOMLY SUBSETTING TRAIN AND TEST DATA:

``````set.seed(numeric) # set a random seed so the split is reproducible
i <- rbinom(rownum, size = 1, prob = 0.5) # flip a fair coin for each row
train <- dataframe[i == 1, ] # rows that came up 1 go to train
test <- dataframe[i == 0, ] # rows that came up 0 go to test``````
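The same coin-flip split, sketched on the built-in mtcars data set so it can be run as-is. The seed value and the choice of mtcars are assumptions made only for this example.

```r
set.seed(123)

# One fair coin flip per row of mtcars.
i <- rbinom(nrow(mtcars), size = 1, prob = 0.5)

train <- mtcars[i == 1, ]  # rows assigned to the training set
test  <- mtcars[i == 0, ]  # remaining rows form the test set

# Every row lands in exactly one of the two sets.
print(c(train = nrow(train), test = nrow(test)))
```

Because each row is assigned by an independent coin flip, the two sets are only approximately equal in size.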

PLOT:

``````plot(dataframe$col1, dataframe$col2, pch = bullettype,
col = color, cex = size) # single plot; col may also be a factor variable
plot(dataframe[, 1:4]) # plot the first 4 columns against each other``````

The col option can be given a factor so that each group is drawn in a different color. cex can likewise be set from a variable to draw different sized dots.
Plotting multiple columns creates a matrix of pairwise scatter plots.
pch takes integer values that select different point symbols.
cex scales the size of the plotted symbols.
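A runnable sketch of these options on the built-in mtcars data set. The choice of columns and the specific pch and cex values are arbitrary picks for illustration.

```r
# Factor used for coloring: one color per number of cylinders.
cyl_factor <- as.factor(mtcars$cyl)

# Single scatter plot: filled circles (pch = 19), colored by group,
# drawn 20% larger than the default size.
plot(mtcars$wt, mtcars$mpg, pch = 19, col = cyl_factor, cex = 1.2)

# Passing several columns at once draws a matrix of pairwise plots.
plot(mtcars[, 1:4])
```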

OTHER SCATTER PLOTS:

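``````smoothScatter(x, y) # note the capital S
plot(hexbin(x, y)) # requires the hexbin package
qqplot(x, y)``````

smoothScatter draws a smoothed density of the points, with color gradients showing frequency. hexbin bins the points into hexagons, and plotting the result adds a legend that maps color to counts. qqplot plots the quantiles of x against the quantiles of y; samples from the same distribution fall close to a 45 degree line.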
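A small sketch of the quantile-quantile idea using only base R. The two samples are simulated here, so the seed and sample sizes are assumptions for illustration.

```r
set.seed(7)

# Two independent samples from the same (standard normal) distribution.
x <- rnorm(500)
y <- rnorm(500)

# plot.it = FALSE returns the matched quantiles without drawing.
# With equal sample sizes these are simply the two sorted samples.
qq <- qqplot(x, y, plot.it = FALSE)

# Draw the points and add the 45 degree reference line.
plot(qq$x, qq$y, xlab = "quantiles of x", ylab = "quantiles of y")
abline(0, 1)
```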

REGRESSION ON FACTORS:

``````lm(dataframe$quantitative ~ as.factor(dataframe$factor))
lm(dataframe$quantitative ~ relevel(as.factor(dataframe$factor),
ref = "reference level")) # sets the reference level for the lm``````

Fits a linear regression of a quantitative variable on a factor. The first level of the factor is the reference level, and each coefficient is the difference in mean from that level. Use the second example to choose a different reference level.
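A minimal sketch on the built-in mtcars data set, regressing mpg on the number of cylinders treated as a factor. The data set and the choice of "8" as the alternative reference level are assumptions for illustration.

```r
# Default: the first level ("4") is the reference level.
fit <- lm(mpg ~ as.factor(cyl), data = mtcars)
print(coef(fit))  # intercept = mean mpg of 4-cylinder cars

# Make "8" the reference level instead.
cyl_f <- relevel(as.factor(mtcars$cyl), ref = "8")
fit8 <- lm(mtcars$mpg ~ cyl_f)
print(coef(fit8))  # intercept = mean mpg of 8-cylinder cars
```

The intercept in each fit equals the group mean of the reference level, and the other coefficients are differences from it.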