(options)

Summer06: Notes on Bayesian Data Analysis

Homework for Thursday June 22
Hunting the pot data!

Grab the spss file pot.sav with,

 library(foreign)
 pot ← read.spss(“pot.sav”)
 attach(pot)

Now you have 1135 variables (!) to play with… go huntting… try to understand what’s in there… suggest models and questions. Display summaries that shed light on the dataset….


Homework for Tuesday June 20
Consider the following Bioassay data. Dose, x_{ i} Number of Number of (log g/ml) animals, n_{ i} deaths, y_{ i}
  −0.863               5            0
  −0.296               5            1
  −0.053               5            3
   0.727               5            5

Assume the standard logistic regression model. i.e., the data is

(x_{ i},n_{ i},y_{ i});\mbox{\ \ for\ } i=1,\ldots,k,

where the x_{ i} represents the ith of the k dose levels (measured on a log scale) given to n_{ i} animals, of which y_{ i} respond with a positive outcome. Every subject is assumed to respond either with a positive or a negative outcome independently of the other subjects in the experiment. The chance that the ith subject will respond with a positive outcome is \theta_{ i} and it is assumed to depend on the dose x_{ i} according to the standard,

\log(\frac{\theta_{ i}}{1-\theta_{ i}}) = \alpha + \beta x_{ i}

known as the logistic regression model. Put the usual (gaussian) ignorant priors on \alpha and \beta and get their posterior distribution by simulation. Find an estimate for the so called LD50 parameter i.e., the dose level at which the probability of death is 50%. It is easy to see that in the logistic regression model,

\mbox{LD50} = - \frac{\alpha}{\beta}

Show nice pictures.

Solution:

Grab all the files that start with “b” (for bio) from http://acccn.net/cr569/Rstuff PLUS the file http://acccn.net/cr569/Rstuff/MCbio.R Then just:

 source(“bio.R”)

to get the picture:

The red line is the logistic curve corresponding to the mean alpha and beta. The green curve shows the posterior distribution for LD50.


Homework due Tuesday June 13
Consider the following (hypothetical) data:
NAME EXAM_1 EXAM_2 EXAM_3 EXAM_4 Best-3-Ave Alice 95 65 0 65 75 Bob 65 48 25 40 51 Fed 93 NA 45 63 NA Monica 31 84 2 45 53 Carlos 51 40 5 45 45 Mitch 58 50 14 73 60 Ozlem 7 NA 0 53 NA Nicole 55 50 35 65 57
  • Create a grades.txt file with the above data.
  • Notice that EXAM_2 scores for Fed and Ozlem are NA (Not Available).
  • You need to Use R + BRugs to impute the NAs.

There are many ways to fill up the NAs with estimates based on the available data. We’ll use the following simple method: Fit a straight line (y = Exam_2 and x = ave of the other exams) with the 6 people for which there is complete data. Then use this line to inpute the NAs. R has all the necessary commands to do that in the usual (non-bayesian) way. Figure how to do that first.

Then do it the bayesian way! i.e. assume:

y_{ i} = \alpha + \beta x_{ i} + \epsilon_{ i}

where the \epsilon_{ i} for i=1,\ldots,n are assumed to be independent and from the same normal distribution with mean zero and precision \tau. This simple linear regression (theory) has three parameters: the intercept \alpha the slope \beta and the precision \tau. Put ignorant priors on these parameters and use Brugs to compute the posterior distribution. Then think about how to use the posterior to impute the NAs. Show pictures.


Hints

Homework 1:

Commands to look at… try “help(command)”:

 setwd(“C:/where/you/want/to/change/”)
 read.table
 lm
 plot
 abline

Load BRugs with:

 library(BRugs)

Here is a model for the regression:

 model {
   for(i in 1:N) {
     mu[i] ← alpha + beta*(x[i] - x.bar)
     y[i] ~ dnorm(mu[i],tau)
   }
   x.bar ← mean(x[])
   alpha ~ dnorm(0.0,1.0E-4)
   beta ~ dnorm(0.0,1.0E-4)
   tau ~ dgamma(1.0E-3,1.0E-3)
   sigma ← 1.0/sqrt(tau)
 }

Notice that the \alpha in the original model is

 alpha - beta * x.bar

in the bugs model above. This centering of the x usually helps the Monte Carlo. Notice that the posterior distribution of the parameters (when using the above ignorant priors) recovers the usual least squares values of elementary statistics. The outcome of “lm” already has the values of the standard regression coefficients. Check the “help”.

Recall the typical sequence:

 modelCheck(“gradesmodel.txt”)
 modelData(“gradesdata.txt”)
 modelCompile(numChains=2)
 modelInits(“gradesinits1.txt”)
 modelInits(“gradesinits2.txt”)
 modelUpdate(1000)
 samplesSet(c(“alpha”,”beta”,”tau”,”mu”))
 samplesStats(“*”)
 samplesHistory(“*”,mfrow=c(4,2))
 samplesDensity(“beta”)

The commands in grades.R produce the following picture:

What are all those lines?

The yellow line is the *Wrong* regression line obtained if we allow the outlier “Monica” to be included. Standard least squares regression is very non-robust and a single oulier is able (like “Monica”) to swing the regression line wildly. Without the outlier the other five points for which we have Exam2 produce the red line that looks reasonable. Recall that “a” is the average of exams 1,3 and 4. The scores in exam2 are predicted for the NA data “Fed” and “Oz” and they are shown with error bars ( 68% confidence prediction intervals ). The dotted grey fences represent plus and minus the root mean square for prediction from the red line. Notice how the prediction becomes more uncertain as we move away from the safe center and away from the observed data. This is all standard simple regression. The bayesian picture with ignorant priors is identical but with a more reasonable interpretation. Can you show it with the Monte Carlo samples? Also, it is not difficult to “robustify” the regression by just changing the model for the error from gaussian “dnorm” to double exponential “ddexp”. Try it! Does that allow you to keep “Monica” in the picture?… see you… happy clicking…

I forgot… the command errbar.R is from the R-package “Hmisc”. I used it to plot the error bars for “Fed” and “Oz”. I also wrote the little functions p.err.R and afence.R that allow to plot the prediction fences.

Just grab:

grds.R Now you only have to set the working directory (edit the second line in that file). Then just,

 source(“grds.R”)

should do it!… Notice that the program stops after the (by now) classic regression picture. Click the right mouse button “stop”. After the 100 random lines from the posterior are produced then click on “Fed” and “Oz” to get their labels. You’ll get:


Page last modified on April 24, 2009, at 01:22 PM