I am quick to Google answers when I need to get a task done. I’ve read studies suggesting that the generation that grew up with the internet, with all this knowledge at our fingertips, has become worse at memorizing instructions outright but significantly better at finding instructions efficiently and remembering where to look. Since I often find myself knowing where to look for an answer but unable to recall it off the top of my head, I figure this is a great place to start writing my personal comprehensive reference guide to data munging in R!
R has a good help function, but I feel that writing this out is a good exercise for remembering all these commands as well as for building a broad reference dictionary.
SETTING UP WD:
getwd() shows the current working directory and setwd("path") sets a working directory
attach(dataframe) attaches the current dataframe’s objects to the main namespace and detach(dataframe) removes them from the current namespace
*Not sure how useful this is; I think I might use it sparingly to avoid namespace conflicts.
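As a quick illustration of the working directory and attach/detach steps, here is a minimal sketch in base R. The data frame `scores` and its columns are made up for the example, and the temporary directory is only a stand-in for a real project folder.

```r
# Remember the starting directory so it can be restored afterwards.
old_wd <- getwd()
setwd(tempdir())   # stand-in for a real project folder
print(getwd())
setwd(old_wd)

# Hypothetical data frame used only for illustration.
scores <- data.frame(name = c("John", "Jane"), score = c(4, 5))

# attach() puts the columns on the search path, so `score` works on its own.
attach(scores)
mean_score <- mean(score)
detach(scores)     # remove the columns from the search path again
print(mean_score)
```

Pairing every attach() with a detach() keeps column names from lingering and clashing with other objects later in the session.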
READING CSV OR TABLE:
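read.csv(file = "filename", header = T/F)
read.table(file = "filename", sep = "separator", header = T/F, stringsAsFactors = T/F, col.names = vectorofnames, row.names = vectorofnames, strip.white = T/F)
Reads in a CSV or table with different settings. Note that col.names and row.names take vectors of names rather than T/F.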
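To make the reading step concrete, here is a small sketch that writes a throwaway CSV to a temporary file and reads it back both ways. The file contents and the column names `person` and `points` are invented for the example.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "John,4", "Jane,5"), csv_path)

# read.csv assumes a comma separator; header = TRUE uses the first line as names.
from_csv <- read.csv(file = csv_path, header = TRUE, stringsAsFactors = FALSE)

# read.table needs the separator spelled out. Here the header line is skipped
# and replacement column names are supplied through col.names.
from_table <- read.table(file = csv_path, sep = ",", header = FALSE, skip = 1,
                         col.names = c("person", "points"),
                         stringsAsFactors = FALSE, strip.white = TRUE)

print(from_csv)
print(from_table)
```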
head(dataframe, numofrows)
dim(dataframe)
names(dataframe)
summary(dataframe)
quantile(dataframe$column)
class(dataframe)
sapply(dataframe[1,], class)
unique(dataframe$column)
length(dataframe$column)
table(dataframe$column, useNA = "ifany")
head: see first numofrows of dataframe
dim: see dimensions of dataframe
names: see col names
summary: get counts for qualitative variables or numerical summary of a quantitative variable
quantile: get quantiles of a quantitative variable
class: get class of dataframe or column
sapply: apply class function to first row of dataframe to get classes of all variables
unique: get unique values of a column
length: get length of column
table: create a table of unique values and their counts, mainly for qualitative variables; useNA = "ifany" shows NA value counts
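Here is a short sketch that runs each of the inspection functions above on a small made-up data frame. The name `people` and its columns are hypothetical.

```r
# Hypothetical data frame with one missing group value.
people <- data.frame(
  name  = c("John", "Jane", "Ann", "Bob"),
  group = c("a", "b", "a", NA),
  score = c(4, 5, 1, 2),
  stringsAsFactors = FALSE
)

print(head(people, 2))          # first two rows
print(dim(people))              # number of rows, then number of columns
print(names(people))            # column names
print(summary(people$score))    # numerical summary of a quantitative column
print(quantile(people$score))   # quantiles of the same column
print(class(people))            # class of the whole data frame

col_classes <- sapply(people[1, ], class)   # class of every column
print(col_classes)

print(unique(people$group))     # distinct values, NA included
print(length(people$score))     # number of entries in the column

group_counts <- table(people$group, useNA = "ifany")   # counts, NA included
print(group_counts)
```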
any(dataframe$column [condition])
all(dataframe$column [condition])
example:
any(data$column > 40) # true/false
all(data$column > 0 & data$column < 40)
any: tests condition for any matches
all: tests condition for all values
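A tiny sketch of both checks on a made-up vector of ages, including how a missing value changes the answer:

```r
ages <- c(12, 25, 39, 41)

any_over_40  <- any(ages > 40)               # TRUE: 41 is over 40
all_in_range <- all(ages > 0 & ages < 40)    # FALSE: 41 is out of range

# With a missing value the result can be NA unless na.rm = TRUE is set.
with_na    <- all(c(1, NA) > 0)                  # NA
without_na <- all(c(1, NA) > 0, na.rm = TRUE)    # TRUE

print(c(any_over_40, all_in_range, with_na, without_na))
```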
subset(dataframe, conditions, select = c(column, column))
subset: subset dataframe by certain conditions and only return selected columns
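For example, with a hypothetical `people` data frame, the following sketch keeps only the adults who scored above 2 and returns just two of the columns:

```r
# Made-up data for illustration.
people <- data.frame(
  name  = c("John", "Jane", "Ann", "Bob"),
  age   = c(30, 17, 45, 22),
  score = c(4, 5, 1, 3),
  stringsAsFactors = FALSE
)

# Conditions are written with bare column names; select picks the columns to keep.
adults <- subset(people, age >= 18 & score > 2, select = c(name, score))
print(adults)
```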
merge(dataframe1, dataframe2, by.x = "column1", by.y = "column2", all = T/F)
by.x and by.y: specify a merge on columns that do not share the same name; pass the column names as strings, not dataframe1$column
all: specifies an outer join versus an inner join: T includes all records and inserts NA for all missing information
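To see the difference between the inner and outer join, here is a minimal sketch with two made-up data frames whose key columns have different names (`patient_id` and `pid` are hypothetical).

```r
patients <- data.frame(patient_id = c(1, 2, 3),
                       name = c("John", "Jane", "Ann"),
                       stringsAsFactors = FALSE)
visits <- data.frame(pid = c(1, 2, 4),
                     dose = c(10, 20, 40))

# Inner join (all = FALSE is the default): only ids present in both tables.
inner <- merge(patients, visits, by.x = "patient_id", by.y = "pid")

# Outer join: every id from either table, with NA where information is missing.
outer <- merge(patients, visits, by.x = "patient_id", by.y = "pid", all = TRUE)

print(inner)
print(outer)
```

The key column in the result takes its name from by.x, so both outputs have a `patient_id` column and no `pid` column.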
SORTING USING ORDER:
order(dataframe1$column)
Returns a vector of row numbers in sorted order of the specified column
sorted <- dataframe1[order(dataframe1$column, dataframe1$column2),]
Stores in “sorted” the rows of dataframe1 ordered by column, then by column2. Additional levels of sorting can be added in the order function.
*Be careful to add the “,” after the order function; without it R treats the result as column indices, which usually fails with an “undefined columns selected” error.
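A small sketch of what order returns and how it is used to sort rows, on a made-up data frame:

```r
people <- data.frame(
  group = c("b", "a", "b", "a"),
  score = c(2, 5, 1, 3),
  stringsAsFactors = FALSE
)

# order() returns row numbers, not sorted values: by group first, then by score.
row_order <- order(people$group, people$score)
print(row_order)

# Use the row numbers in the row position; note the trailing comma.
sorted <- people[row_order, ]
print(sorted)

# decreasing = TRUE reverses the sort direction.
by_score_desc <- people[order(people$score, decreasing = TRUE), ]
print(by_score_desc)
```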
melt(dataframe, id.vars="idcolumn", variable.name="varnames",value.name="values")
Input Matrix:
Name  TreatmentA  TreatmentB
John  4           1
Jane  5           2
Result:
Name  Treatment   Value
John  TreatmentA  4
John  TreatmentB  1
Jane  TreatmentA  5
Jane  TreatmentB  2
This takes a matrix style table and reshapes it to have one observation per row.
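The reshaping above can be sketched as follows. I am assuming that melt here comes from the reshape2 package, which may not be installed, so the call is guarded and simply skipped when the package is missing.

```r
# The input table from the example above.
wide <- data.frame(
  Name = c("John", "Jane"),
  TreatmentA = c(4, 5),
  TreatmentB = c(1, 2),
  stringsAsFactors = FALSE
)

# Assumption: melt() is reshape2::melt; skip the step if reshape2 is missing.
if (requireNamespace("reshape2", quietly = TRUE)) {
  long <- reshape2::melt(wide, id.vars = "Name",
                         variable.name = "Treatment", value.name = "Value")
  print(long)
}
```

The output has one row per Name and Treatment pair, although melt stacks the rows one treatment column at a time, so the row order may differ from the table shown above.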