(options)
Summer06: Notes on Bayesian Data Analysis
- Homework for Thursday June 22
- Hunting the pot data!
Grab the spss file pot.sav with,
library(foreign) pot ← read.spss(“pot.sav”) attach(pot)
Now you have 1135 variables (!) to play with… go huntting… try to understand what’s in there… suggest models and questions. Display summaries that shed light on the dataset….
- Homework for Tuesday June 20
- Consider the following Bioassay data.
Dose,
Number of Number of
(log g/ml) animals,
deaths,
−0.863 5 0 −0.296 5 1 −0.053 5 3 0.727 5 5
Assume the standard logistic regression model. i.e., the data is

where the
represents the ith of the k dose levels (measured on a log scale)
given to
animals, of which
respond with a positive outcome.
Every subject is assumed to respond either with a positive or a negative outcome independently of the
other subjects in the experiment. The chance that the ith subject will respond with a positive outcome
is
and it is assumed to depend on the dose
according to the standard,

known as the logistic regression model. Put the usual (gaussian) ignorant priors on
and
and get their posterior distribution by simulation. Find an estimate for the so called LD50
parameter i.e., the dose level at which the probability of death is 50%. It is easy to see that in the logistic regression model,

Show nice pictures.
Solution:
Grab all the files that start with “b” (for bio) from http://acccn.net/cr569/Rstuff PLUS the file http://acccn.net/cr569/Rstuff/MCbio.R Then just:
source(“bio.R”)
to get the picture:

The red line is the logistic curve corresponding to the mean alpha and beta. The green curve shows the posterior distribution for LD50.
- Homework due Tuesday June 13
- Consider the following (hypothetical) data: NAME EXAM_1 EXAM_2 EXAM_3 EXAM_4 Best-3-Ave Alice 95 65 0 65 75 Bob 65 48 25 40 51 Fed 93 NA 45 63 NA Monica 31 84 2 45 53 Carlos 51 40 5 45 45 Mitch 58 50 14 73 60 Ozlem 7 NA 0 53 NA Nicole 55 50 35 65 57
- Create a grades.txt file with the above data.
- Notice that EXAM_2 scores for Fed and Ozlem are NA (Not Available).
- You need to Use R + BRugs to impute the NAs.
There are many ways to fill up the NAs with estimates based on the available data. We’ll use the following simple method: Fit a straight line (y = Exam_2 and x = ave of the other exams) with the 6 people for which there is complete data. Then use this line to inpute the NAs. R has all the necessary commands to do that in the usual (non-bayesian) way. Figure how to do that first.
Then do it the bayesian way! i.e. assume:

where the
for
are assumed to be independent and from the same normal
distribution with mean zero and precision
. This simple linear regression (theory) has three parameters:
the intercept
the slope
and the precision
. Put ignorant priors on these parameters and use Brugs to compute the posterior
distribution. Then think about how to use the posterior to impute the NAs. Show pictures.
Hints
Homework 1:
Commands to look at… try “help(command)”:
setwd(“C:/where/you/want/to/change/”) read.table lm plot abline
Load BRugs with:
library(BRugs)
Here is a model for the regression:
model {
for(i in 1:N) {
mu[i] ← alpha + beta*(x[i] - x.bar)
y[i] ~ dnorm(mu[i],tau)
}
x.bar ← mean(x[])
alpha ~ dnorm(0.0,1.0E-4)
beta ~ dnorm(0.0,1.0E-4)
tau ~ dgamma(1.0E-3,1.0E-3)
sigma ← 1.0/sqrt(tau)
}
Notice that the
in the original model is
alpha - beta * x.bar
in the bugs model above. This centering of the x usually helps the Monte Carlo. Notice that the posterior distribution of the parameters (when using the above ignorant priors) recovers the usual least squares values of elementary statistics. The outcome of “lm” already has the values of the standard regression coefficients. Check the “help”.
Recall the typical sequence:
modelCheck(“gradesmodel.txt”) modelData(“gradesdata.txt”) modelCompile(numChains=2) modelInits(“gradesinits1.txt”) modelInits(“gradesinits2.txt”) modelUpdate(1000) samplesSet(c(“alpha”,”beta”,”tau”,”mu”)) samplesStats(“*”) samplesHistory(“*”,mfrow=c(4,2)) samplesDensity(“beta”)
The commands in grades.R produce the following picture:

What are all those lines?
The yellow line is the *Wrong* regression line obtained if we allow the outlier “Monica” to be included. Standard least squares regression is very non-robust and a single oulier is able (like “Monica”) to swing the regression line wildly. Without the outlier the other five points for which we have Exam2 produce the red line that looks reasonable. Recall that “a” is the average of exams 1,3 and 4. The scores in exam2 are predicted for the NA data “Fed” and “Oz” and they are shown with error bars ( 68% confidence prediction intervals ). The dotted grey fences represent plus and minus the root mean square for prediction from the red line. Notice how the prediction becomes more uncertain as we move away from the safe center and away from the observed data. This is all standard simple regression. The bayesian picture with ignorant priors is identical but with a more reasonable interpretation. Can you show it with the Monte Carlo samples? Also, it is not difficult to “robustify” the regression by just changing the model for the error from gaussian “dnorm” to double exponential “ddexp”. Try it! Does that allow you to keep “Monica” in the picture?… see you… happy clicking…
I forgot… the command errbar.R is from the R-package “Hmisc”. I used it to plot the error bars for “Fed” and “Oz”. I also wrote the little functions p.err.R and afence.R that allow to plot the prediction fences.
Just grab:
grds.R Now you only have to set the working directory (edit the second line in that file). Then just,
source(“grds.R”)
should do it!… Notice that the program stops after the (by now) classic regression picture. Click the right mouse button “stop”. After the 100 random lines from the posterior are produced then click on “Fed” and “Oz” to get their labels. You’ll get:

