Archive

R

APPLY:

SAPPLY:

sapply(dataframe, FUN = sum/mean/etc)


sapply: Applies a function to each element of a vector or list (each column, when given a dataframe) and returns a vector with the results.

LAPPLY:

lapply(dataframe, FUN = sum/mean/etc)


lapply: Applies a function to each element of a vector or list (each column, when given a dataframe) and returns a list with the results.
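
A minimal sketch of the difference in return types. The data frame `df` and its columns are made up for illustration; passing the whole data frame applies the function once per column.

```r
# Hypothetical toy data frame, used only for illustration
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

col_means_vec  <- sapply(df, mean)  # named numeric vector, one entry per column
col_means_list <- lapply(df, mean)  # named list, one element per column

print(col_means_vec)
print(col_means_list)
```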

TAPPLY:

tapply(dataframe$col1, dataframe$col2, FUN = sum/mean/etc)


tapply: Takes data from the first column and applies the function separately to each group defined by the factor in the second column.
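
A small sketch of a grouped mean with tapply. The column names `score` and `group` are hypothetical.

```r
# Hypothetical toy data: four scores in two groups
df <- data.frame(score = c(10, 20, 30, 40),
                 group = c("x", "y", "x", "y"))

# Mean of score within each level of group
group_means <- tapply(df$score, df$group, FUN = mean)
print(group_means)
```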

AGGREGATE

aggregate(cbind(col1, col2) ~ col3 + col4, 
     data = dataframe, FUN = sum/mean/etc) # with data= set, use bare column names


aggregate: Summarizes the quantitative data in col1 and col2 by applying a function within each combination of col3 and col4, much like a pivot table.
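
A minimal sketch of the formula interface, using bare column names together with `data =`. The data frame and its columns (`sales`, `cost`, `region`, `year`) are invented for the example.

```r
# Hypothetical toy data
df <- data.frame(
  sales  = c(1, 2, 3, 4),
  cost   = c(10, 20, 30, 40),
  region = c("east", "east", "west", "west"),
  year   = c(2020, 2020, 2020, 2020)
)

# Sum sales and cost within each region/year combination
totals <- aggregate(cbind(sales, cost) ~ region + year, data = df, FUN = sum)
print(totals)
```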

CUT:

cut2(dataframe$column, g=numberofgroups)


Cuts a continuous variable into a factor with g quantile-based groups.
*Requires install.packages("Hmisc")
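
If Hmisc is not installed, something similar can be approximated in base R by cutting at quantile breakpoints. This is a stand-in sketch, not cut2 itself, so labels and boundary handling may differ. The vector `x` and the group count `g` are made up.

```r
# Hypothetical continuous variable
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
g <- 4  # number of groups wanted

# Breakpoints at evenly spaced quantiles, then cut into a factor
breaks <- quantile(x, probs = seq(0, 1, length.out = g + 1))
groups <- cut(x, breaks = breaks, include.lowest = TRUE)

print(table(groups))
```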

SAMPLE:

sample(1:rows, size=number, replace=T/F)


sample: Draws a vector of row numbers of the given size, with or without replacement. Useful for generating smaller random subsets.

RANDOMLY SUBSETTING TRAIN AND TEST DATA:

set.seed(numeric) #set a random seed
i <- rbinom(rownum, size=1, prob=.5) #flip coins to assign rows
train <- dataframe[i==1,] #subset for train
test <- dataframe[i==0,] #subset for test
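
The coin-flip approach gives train and test sets of random size. When an exact split size is wanted, sample can pick the training rows instead. A sketch on made-up data, assuming a 70/30 split:

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical toy data frame with 10 rows
df <- data.frame(id = 1:10, y = rnorm(10))

# Draw 7 distinct row numbers for the training set
train_rows <- sample(1:nrow(df), size = 7, replace = FALSE)

train <- df[train_rows, ]   # sampled rows
test  <- df[-train_rows, ]  # everything else
```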

PLOT:

plot(dataframe$col1, dataframe$col2, pch=bullettype, 
    col=color or a descriptive variable, cex = size) #single plot
plot(dataframe[,1:4]) #plot first 4 columns against each other


The col option can be given a factor so that different categories are drawn in different colors. cex can likewise be given a numeric variable to draw different sized dots.
Plotting multiple columns creates a matrix of plots.
pch takes integer values that represent different types of bullets.
cex scales the size of the plotted symbols.

OTHER SCATTER PLOTS:

smoothScatter(x,y)
plot(hexbin(x,y)) # requires install.packages("hexbin")
qqplot(x,y)


smoothScatter shades the plot by point density, and hexbin provides a legend mapping bin colors to counts. qqplot plots quantiles of x against quantiles of y (samples from the same distribution fall on a 45 degree line).

REGRESSION ON FACTORS:

lm(dataframe$quantitative ~ as.factor(dataframe$factor))
lm(dataframe$quantitative ~ relevel(as.factor(dataframe$factor), 
     ref = "reference level")) # sets the reference level for the lm


Fits a linear regression of a quantitative variable on a factor variable. The first factor level is the reference level. Use the second example to choose a different reference level.
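
A small sketch showing how the reference level changes the intercept. The data frame, the column names `y` and `grp`, and the levels are made up.

```r
# Hypothetical toy data: three groups with two observations each
df <- data.frame(
  y   = c(1, 2, 3, 4, 5, 6),
  grp = factor(c("a", "a", "b", "b", "c", "c"))
)

# Default: the first level ("a") is the reference, so the intercept is its mean
fit_default <- lm(y ~ grp, data = df)

# Releveled: "b" becomes the reference, so the intercept is the mean of "b"
fit_releveled <- lm(y ~ relevel(grp, ref = "b"), data = df)

print(coef(fit_default))
print(coef(fit_releveled))
```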

I am quick to Google and find answers when it comes to doing tasks. I’ve read studies suggesting that the generation that has grown up with the internet and all this knowledge at our fingertips has become worse at memorizing instructions outright but significantly better at efficiently finding instructions and remembering where to look them up. As I often find myself knowing where to look for an answer but unable to recall it off the top of my head, I figured this would be a great place to start writing out my personal comprehensive reference guide to data munging in R!

R has a good help function, but I feel that writing this out is a good exercise for remembering all these commands, and it gives me a broad reference dictionary to come back to.

SETTING UP WD:

getwd()
setwd("directory/directory")

Shows current working directory and sets a working directory

NAMESPACES:

attach(dataframe)
detach(dataframe)

Attaches a dataframe’s columns to the search path so they can be referenced by name alone, or removes them from the search path again.
*Not sure how useful this is. I think I will use it sparingly to avoid name conflicts.

READING CSV OR TABLE:

read.csv(file = "filename", header=T/F)
read.table(file="filename", sep="separator", header=T/F, 
     stringsAsFactors=T/F, col.names=vectorofnames, row.names=vectorofnames, 
     strip.white = T/F)

Reads in a CSV or table with different settings.
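
A self-contained sketch that writes a tiny CSV to a temporary file and reads it back. The file contents and column names are invented for the example.

```r
# Write a small hypothetical CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "John,4", "Jane,5"), tmp)

# Read it back, keeping strings as character rather than factor
df <- read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE)
str(df)

unlink(tmp)  # clean up the temporary file
```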

SUMMARY OPTIONS:

head(dataframe, numofrows)
dim(dataframe)
names(dataframe)
summary(dataframe)
quantile(dataframe$column)
class(dataframe)
sapply(dataframe[1,],class)
unique(dataframe$column)
length(dataframe$column)
table(dataframe$column, useNA="ifany")

head: see first numofrows of dataframe
dim: see dimensions of dataframe
names: see col names
summary: get counts for qualitative variables or numerical summary of a quantitative variable
quantile: get quantiles of a quantitative variable
class: get class of dataframe or column
sapply: apply class function to first row of dataframe to get classes of all variables
unique: get unique values of a column
length: get length of column
table: create a table of unique values and their counts, mainly for qualitative variables; useNA="ifany" also shows the count of NA values

TESTING DATA:

any(dataframe$column [condition])
all(dataframe$column [condition])

example:
any(data$column > 40) # true/false
all(data$column > 0 & data$column < 40)

any: tests condition for any matches
all: tests condition for all values

SUBSETTING DATA:

subset(dataframe, conditions, select = c(column, column))

subset: subset dataframe by certain conditions and only return selected columns
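
A short sketch of subset with a compound condition. The data frame and its columns are hypothetical.

```r
# Hypothetical toy data
df <- data.frame(name  = c("John", "Jane", "Ann"),
                 age   = c(25, 42, 37),
                 score = c(4, 5, 3))

# Keep rows where age is over 30, and return only name and score
older <- subset(df, age > 30 & score >= 3, select = c(name, score))
print(older)
```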

MERGE:

merge(dataframe1, dataframe2, by.x = "column1", 
      by.y = "column2", all = T/F) # by.x and by.y take column names as strings

by.x and by.y: specify a merge on columns that do not share the same name
all: specifies an outer join (T) versus an inner join (F); T keeps all records from both dataframes and inserts NA wherever information is missing
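
A sketch of an inner and an outer join on key columns with different names. Both data frames and their columns are made up.

```r
# Hypothetical toy tables whose key columns have different names
people <- data.frame(person_id = c(1, 2, 3),
                     name = c("John", "Jane", "Ann"))
scores <- data.frame(id = c(2, 3, 4),
                     score = c(5, 3, 9))

# Inner join: only keys present in both tables
inner <- merge(people, scores, by.x = "person_id", by.y = "id")

# Outer join: all keys, with NA where one side has no match
outer <- merge(people, scores, by.x = "person_id", by.y = "id", all = TRUE)

print(inner)
print(outer)
```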

ORDER:

order(dataframe1$column)
dataframe1$column[order(dataframe1$column)]

order returns a vector of row numbers in sorted order of the specified column; indexing the column with that vector returns the sorted values

SORTING USING ORDER:

sorted <- dataframe1[order(dataframe1$column, dataframe1$column2),]

Stores into "sorted" the rows of dataframe1 ordered by column, then by column2. Additional levels of sorting can be added in the order function.
*Be careful to add the "," after the order function, or R will try to select columns instead of rows, which usually ends in an error.
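
A small sketch of a two-level sort. The data frame and its columns `grp` and `val` are hypothetical.

```r
# Hypothetical toy data
df <- data.frame(grp = c("b", "a", "b", "a"),
                 val = c(2, 4, 1, 3))

# Row numbers that sort by grp first, then by val within grp
idx <- order(df$grp, df$val)

# Note the trailing comma: index rows, keep all columns
sorted <- df[idx, ]
print(sorted)
```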

MELT:

melt(dataframe, id.vars="idcolumn", variable.name="varnames",value.name="values")

Example output:

Input Matrix:
Name    TreatmentA    TreatmentB
John    4             1
Jane    5             2 

Result:
Name    Treatment   Value
John    TreatmentA  4
John    TreatmentB  1
Jane    TreatmentA  5
Jane    TreatmentB  2

This takes a matrix style table and reshapes it to have one observation per row.
*Requires install.packages("reshape2"); the variable.name and value.name arguments belong to reshape2's melt
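
For comparison, the same wide-to-long reshaping can be sketched in base R with stack. This is a stand-in for melt, not melt itself, and the rows come out grouped by treatment rather than by name.

```r
# The example table from above, as a data frame
wide <- data.frame(Name = c("John", "Jane"),
                   TreatmentA = c(4, 5),
                   TreatmentB = c(1, 2))

# stack() puts all values in one column and the source column name in "ind"
stacked <- stack(wide[, c("TreatmentA", "TreatmentB")])

# Rebuild the long table, repeating the names once per treatment column
long <- data.frame(
  Name      = rep(wide$Name, times = 2),
  Treatment = stacked$ind,
  Value     = stacked$values
)
print(long)
```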