A Basic Regression in R

Here is a basic R program for running a linear regression.  Along the way I'll show some common modifications that aren't obvious how to add.

 

First, we import the data, in this case from a comma-separated values (CSV) file.

emp <- read.table("/Vols/duphenix/Docs/self_emp/employ.csv",header=TRUE,sep="," )

I'll explain each piece:

  1. emp is just a container name (an R object).  I'll use it every time I want to refer to or use the raw data.
  2. <- assigns whatever follows it to the container name.
  3. read.table() is the function that actually goes out and gets the data.
  4. "/Vols/duphenix/Docs/self_emp/employ.csv" is the path to the file containing the data.  On Windows machines, this would begin with C: (or whatever drive letter) followed by the file path.  Inside an R string, write the separators as forward slashes (C:/...) or doubled backslashes (C:\\...), because a single backslash starts an escape sequence.
  5. header=TRUE tells read.table that the first line of the csv file contains the variable name for each column.  The other option would be to pass the names yourself through the col.names argument (row.names, despite the similar name, labels the rows, not the columns).  For now it is much easier to just use header=TRUE and keep the variable names in the csv file.
  6. sep="," lets read.table know that the separator between columns is a comma.  "\t" would tell read.table that it is a tab-delimited file (no matter what the file extension is).
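
To try the import step without my file, here is a minimal sketch that writes a tiny CSV to a temporary location and reads it back.  The column names (hours, income) and the values are made up for illustration; only read.table(), read.csv(), tempfile() and writeLines() are real base R functions.

```r
# Hypothetical data, written to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("hours,income", "40,52000", "35,41000", "50,63000"), csv_path)

# Same call shape as in the text: header row present, comma-separated columns.
demo <- read.table(csv_path, header = TRUE, sep = ",")

# read.csv() is a wrapper around read.table() whose defaults already
# include header = TRUE and sep = ",".
demo_csv <- read.csv(csv_path)

# Compare the two imports; for a simple file like this they should agree.
all.equal(demo, demo_csv)
```

For plain CSV files either call works; read.table() is the one to reach for when the separator is something else.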

You can use,

print(emp)

to print the data in the emp container we just made with read.table, for instance to verify that the data imported correctly.  For a large dataset, head(emp) prints only the first six rows.

summary(emp)

Will print summary statistics for the data: by default, for each numeric column, the minimum, first quartile, median, mean, third quartile and maximum.  You can get individual summary statistics from other functions, such as mean() or median().
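
As a rough sketch of getting the individual statistics, the following uses a made-up numeric vector in place of a real column (with the data above you would pass something like a column of emp instead).  All of the functions are in base R.

```r
# Hypothetical numeric column standing in for one variable of the dataset.
income <- c(41000, 52000, 63000, 48000, 57000)

mean(income)       # arithmetic mean
median(income)     # middle value
range(income)      # minimum and maximum
quantile(income)   # minimum, quartiles and maximum by default
sd(income)         # sample standard deviation
```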

names(emp)

Will print all the variable names from the emp dataset, which can be handy when you need to use them later in the program.

Now that we have the data loaded and a list of the variable names, we can get to the actual regression.

The most basic linear regression in R is run with the lm() function.

lm(emp$dependent_var ~ emp$independent_var_1 + emp$independent_var_2) 

In this case emp is the dataset, the $ is the operator that picks a column out of it, and dependent_var is the dependent variable (or response variable, or regressand, etc.).  The ~ tells the lm function that the independent variables (or explanatory variables, or regressors, etc.) follow.  The next two terms, emp$independent_var_1 and emp$independent_var_2, are the two independent variables, joined by +.  You could use as many as you wanted here, depending either on your experimental design or on your theoretical background.
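
A variation worth knowing: lm() also takes a data argument, so the formula can name the columns directly without repeating the dataset prefix.  The sketch below uses simulated data with made-up column names (y, x1, x2), not the employment file, and stores the fit so it can be inspected afterwards.

```r
# Hypothetical data; the column names and coefficients are made up for illustration.
set.seed(1)
demo <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
demo$y <- 2 + 1.5 * demo$x1 - 0.5 * demo$x2 + rnorm(50, sd = 0.1)

# The data argument lets the formula refer to columns by name,
# without the demo$ prefix on every term.
fit <- lm(y ~ x1 + x2, data = demo)

coef(fit)      # estimated intercept and slopes
summary(fit)   # standard errors, t statistics, R-squared
```

Saving the result in an object such as fit is what makes the follow-up calls possible; calling lm() on its own only prints the coefficients.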

Some variations should be mentioned here.  If you needed to force the intercept to 0, for theoretical or logical reasons, you could rewrite the line as follows,

lm(emp$dependent_var ~-1 + emp$independent_var_1 + emp$independent_var_2)

with the -1 forcing the intercept to zero (writing 0 + in place of -1 + does the same thing).  You could also use the more flexible generalized linear model function, glm().  By default it uses the gaussian family, which gives the same coefficient estimates as lm(), but you can specify a different family of distributions, such as binomial or poisson.  The following uses glm() with the family written out explicitly; because it keeps the -1, its coefficients match those of the zero-intercept lm() call above.

glm(emp$dependent_var ~-1 + emp$independent_var_1 + emp$independent_var_2, family = gaussian)
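
To check the claim that the two functions agree, here is a small sketch on simulated data (the data frame and its column names are hypothetical).  It fits the same zero-intercept model both ways and compares the coefficients.

```r
# Hypothetical data with no true intercept, made up for illustration.
set.seed(2)
demo <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
demo$y <- 3 * demo$x1 + demo$x2 + rnorm(40)

# Zero-intercept fits: ordinary least squares, then a gaussian-family GLM.
fit_lm  <- lm(y ~ -1 + x1 + x2, data = demo)
fit_glm <- glm(y ~ -1 + x1 + x2, data = demo, family = gaussian)

# Compare the coefficient vectors up to numerical tolerance.
all.equal(coef(fit_lm), coef(fit_glm))
```

The printed summaries still differ in layout, since glm() reports deviance where lm() reports R-squared, but the estimated coefficients are the part that should match.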