Does Geometry underlie Inference?
A version of the general geometric prior has been around for almost 20 years if not more. In 2002 Rodriguez proposed a geometric theory of ignorance that is able to produce priors and models from objective data.
YES!
- Rodriguez 2002. It and bit from not!
- Ariel Caticha 2006. Yes, but not YOU, ME!
NO!
- Radford Neal 2002. Prior for mixtures does not make sense.
- John Skilling 2006. Geometry does NOT underlie inference.
To John et al.:
Georg Cantor’s words, from the beginning of the twentieth century, resonate in my mind:
“My theory stands firm as a rock; every arrow directed against it will return quickly to its archer. How do I know this? Because I have studied it from all sides for many years; because I have examined all objections which have ever been made against [it]…”
Skilling says:
“I am a Bayesian. I am trying to infer parameter $\theta$. What prior should I assign?”
I say: WRONG Q!
Try this Q:
“I have a brain. I observe $x$. I want to explain $x$. I want to predict future $x'$. How?”
Notice the difference between the two initial set-ups.
Skilling’s is all about $\theta$; mine is all about $x$.
My Q is considerably simpler.
In my set-up there are no English priests, no ghostly parameters, no priors. Only some-thing, rather than no-thing (ignorance).
I claim that our ability to pose the Q is in itself a proof that it can, at least partially, be answered.
Whatever $x$ is or means, brains have survived to Q.
Brains and Q are dialectic couples. We seem to be in a place
where explanations are possible and useful.
Let’s label a particular explanation for $x$ and $x'$ with a parameter $\theta$. I only need to bother about labels if I want to distinguish among multiple explanations.
Ok. But what do I mean by an explanation?
What’s $\theta$ a label for?
For me an explanation for possible observations $x$ is the
total collection of partial truth values of statements about $x$
in a given domain of discourse in
.
By domain of discourse in
I mean a given Boolean Field in
.
Bring in R. T. Cox and finally claim that an explanation is nothing but a probability distribution for $x$. I often refer to $\theta$ as a theory. If we assume the theory to be $\theta$ then we can obtain the probabilities of different $x$,

$p(x|\theta)$

For a fixed $x$, any function of $\theta$ proportional to the above is commonly known as the Likelihood $L(\theta)$.
It follows easily from Kraft’s inequality that there is a one-to-one
correspondence between probability distributions and (prefix) codes. Thus,
one could think of an explanation for $x$ as a code or compression of $x$.
So, what’s a prior again?
By now it should be no surprise that a prior is ALWAYS a probability distribution over probability distributions!
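The correspondence between probability distributions and prefix codes can be checked on a toy case. The following is a minimal sketch, assuming a hypothetical four-outcome distribution with dyadic probabilities (the outcomes and numbers are illustrative and not from the discussion). It maps the distribution to Shannon code lengths, checks Kraft's inequality, and maps the lengths back to a distribution.

```python
import math

# Hypothetical toy distribution over four observations (illustrative only).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Distribution -> code: Shannon code lengths l(x) = ceil(-log2 p(x)).
lengths = {x: math.ceil(-math.log2(px)) for x, px in p.items()}

# Kraft's inequality: a prefix code with these lengths exists
# if and only if the sum of 2^(-l) over all outcomes is at most 1.
kraft_sum = sum(2.0 ** -l for l in lengths.values())

# Code -> distribution: the lengths define q(x) proportional to 2^(-l(x)).
q = {x: 2.0 ** -l / kraft_sum for x, l in lengths.items()}

print(lengths)    # code length assigned to each observation
print(kraft_sum)  # must not exceed 1
print(q == p)     # the round trip recovers p for this dyadic example
```

For non-dyadic probabilities the rounding in `ceil` makes the round trip only approximate, so the exact match here depends on the chosen toy numbers.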
The parameters $\theta$ are only labels. They label probability
distributions. They are only a choice of coordinates. A language. The
probabilities for the different explanations (theories, codes, prob
distributions… all the same) MUST not change under reparameterizations.
It is a pure mathematics fact that collections of probability distributions
that admit smooth parameterizations are geometric objects
known as Riemannian manifolds with the metric given by Fisher information.
For these regular cases, priors (and posteriors) become scalar density fields on the manifold.
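To see concretely that the volume element of the Fisher metric assigns the same mass to a set of distributions whatever labels are used, here is a small numerical sketch. It assumes a Bernoulli model, which is not part of the original discussion, and compares the volume of the same set of distributions computed in the coordinate `theta` and in the log-odds coordinate `phi` (both names are chosen here for illustration).

```python
import math

def fisher_theta(theta):
    # Fisher information of a Bernoulli(theta) model in the theta coordinate.
    return 1.0 / (theta * (1.0 - theta))

def fisher_phi(phi):
    # Same model in the log-odds coordinate phi = log(theta / (1 - theta)).
    theta = 1.0 / (1.0 + math.exp(-phi))
    return theta * (1.0 - theta)

def volume(fisher, lo, hi, steps=100_000):
    # Midpoint-rule integral of sqrt(Fisher information) over [lo, hi].
    h = (hi - lo) / steps
    return sum(math.sqrt(fisher(lo + (i + 0.5) * h)) for i in range(steps)) * h

def logit(theta):
    return math.log(theta / (1.0 - theta))

a, b = 0.2, 0.7  # an arbitrary set of explanations, named by theta
vol_theta = volume(fisher_theta, a, b)
vol_phi = volume(fisher_phi, logit(a), logit(b))

# Closed form of the theta integral, used as a cross-check.
exact = 2.0 * (math.asin(math.sqrt(b)) - math.asin(math.sqrt(a)))
print(vol_theta, vol_phi, exact)
```

A flat density in `theta` would not pass this test: it gives a different mass to the same set once it is rewritten in `phi`, which is the sense in which the parameters are only a choice of coordinates.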
When the proposed collection of explanations (statistical model, hypothesis
space) is not a Riemannian manifold with Fisher information as the metric,
we say the model is NON-REGULAR. In these non-regular hypothesis spaces
the standard geometry breaks down and consequently the inference also shows
peculiarities and singularities and often requires the invention of new
methods of analysis. For example, it is well-known (for those of us that
know it!) that in finite dimensional regular models asymptotic estimation
follows the $\sqrt{n}$-law. In most (but not all!) regular infinite
dimensional models the rates of convergence are slower than $1/\sqrt{n}$,
and in most (but not all!) non-regular finite dimensional models the
rates are faster than $1/\sqrt{n}$.
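The contrast in rates can be illustrated by simulation. The sketch below rests on assumptions that are not in the original text: the regular model is Normal(theta, 1) estimated by the sample mean, and the non-regular model is Uniform(0, theta) estimated by the sample maximum (a standard non-regular example, not necessarily one of Skilling's). Multiplying the sample size by 16 should shrink the first error by a factor of about 4 and the second by a factor of about 16.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

theta = 1.0  # hypothetical true parameter value

def mean_abs_error(estimator, sampler, n, trials=500):
    # Average |estimate - theta| over repeated samples of size n.
    total = 0.0
    for _ in range(trials):
        sample = [sampler() for _ in range(n)]
        total += abs(estimator(sample) - theta)
    return total / trials

def regular_error(n):
    # Regular model: Normal(theta, 1), estimated by the sample mean.
    return mean_abs_error(lambda s: sum(s) / len(s),
                          lambda: random.gauss(theta, 1.0), n)

def nonregular_error(n):
    # Non-regular model: Uniform(0, theta), estimated by the sample maximum.
    return mean_abs_error(max, lambda: random.uniform(0.0, theta), n)

n_small, n_large = 100, 1600  # sample size grows by a factor of 16
regular_ratio = regular_error(n_small) / regular_error(n_large)
nonregular_ratio = nonregular_error(n_small) / nonregular_error(n_large)
print(regular_ratio)     # near 4 when the error scales like 1/sqrt(n)
print(nonregular_ratio)  # near 16 when the error scales like 1/n
```

The uniform model is non-regular because its support depends on the parameter, so the usual Fisher information argument does not apply.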
The two examples chosen by Skilling are non-regular: the Fisher information either does not exist, is singular (no inverse), or is infinite.
The fact that there are valid non-regular inference problems does not invalidate the ignorant prior any more than the existence of distributions without expectation invalidates the concept of expected value or of entropy.
It works in theory and it works in practice.
sure,
Maximum Entropy loudly and fast!
.. but wait there is a bit more…
It and bit from not!
To Johnny Cash:
The appendix shows that you didn’t lose. Now please, read the theory.
John’s Answer: 081506
CR says:
Hi John,
Thank you very much for continuing with this very enjoyable
badminton-over-internet kind of game.
I can’t resist running to hit the (.. how do you call the “ball” in
badminton?.. I’ll google it later…) little flying thing with feathers to
the other side….
(More follows within your text…)
On 8/15/06, John Skilling <skilling@eircom.net> wrote:
> Hello again Carlos,
>
> Thank you for your implicit acknowledgment that I have a brain. I am
----The ref to my (and implicitly your) brain is not just for decoration.
> trying to use it. I want to know about theta.
>
> Asking for the prior pi(theta) is not a “WRONG Question”. It is MY
> question, and I am entitled to ask it. Yes, I observe data x. Yes,
> I want to predict future x’. My way of doing that is to estimate the
> parameter theta that underlies both x and x’. The future x’ could be
> simply theta, or could be some subsidiary property. Either way,
> theta is sufficient. I want it. I have the likelihood Pr(x|theta).
> I need the prior pi(theta). You tell me I can use the likelihood to
> get the prior. I am doubting that.
>
--- this is like saying: “I want the cylindrical coordinates of this point
p relative to this particular choice of origin, x,theta, and z axis”.
--- sure you can ask for that, but *DO NOT* forget that it is the
location of p that you really care about!
> I don’t care whether theta is called a parameter, a label, a theory,
> a model, an explanation, or a statement having partial truth value,
> and neither do I care whether or not its distribution is classified
> as a prefix code. The erudition of these varied interpretations may
> look impressive, but it is beside the point. I want theta. Simple.
>
>
… hmmm I think you forgot it.
Please stop here for a moment.
Please listen carefully and don’t rush and fall into an old loop.
The point is trivial but subtle:
“Parameters are ghosts”
My mentioning of label, theory, model, explanation, or prefix code was not
an attempt to display erudition (like a Spanish toreador’s request for an Olé…)
but to explain the above statement.
Once you get that, you’ll see that your (excellent) concerns about gij
should not really bother you.
Think about the hypothesis space as a surface (model M). You are in fact
free to deform a very small region of M a LOT and create a new M’. Like the
creation of a black hole in GR but let’s not get dramatic (yet). Everywhere
except in that small region, M and M’ coincide. The uniform prior (vol
element) will be the same everywhere except in that small deformed
region. Now let’s play GOD…. Let’s say you are god and you manufacture a
big! x by choosing from a distribution inside that very deformed region of
M’. Then it is clear that the different uniform priors for M and M’ will
become important and will justifiably change your inferences….
Another source of confusion: “Likelihood”
Let’s agree to call “Likelihood”
any function L(theta) proportional to p(x|theta) PROVIDED THAT x IS THE
ACTUAL OBSERVED DATA. With this definition my ig priors DO NOT DEPEND ON
THE LIKELIHOOD. However, I am still happy to violate (most versions of) the
so-called “likelihood principle”. Priors depend on the hypothesis space in
the trivial way that they are prob dist over it. Again think of the
surface M above. This is so trivial it hardly needs to be mentioned BUT the
pop canon blinds the believers…
Another point. We could still handle what you call “discrete” problems by
embedding them in continuous spaces but I do agree (sort of viscerally)
with your expectation that ultimately the continuous Riemannian geometry
should give way to something else… (non-commutative geometry, loops,
strings or Wolfram science or even just 110…)
SHUTTLE!!!!
I love it.
Cheers!
- Here is the complete thread of the conversation
- Ariel Caticha’s response: ME 10 Commandments.
Reply to A. Caticha 10 Commandments
A. Caticha’s Reply
… and Carlos says:
I was eagerly waiting for the reply. A couple of things:
- w.r.t.
notice my “Ent Priors for discrete…” 2002, page4.
- w.r.t. Gödel’s theorem: It applies to ALL axiomatic systems for induction, deduction or whatever.
- w.r.t. compound decisions check Zhang’s recent paper for an overview and LOTS of references. In particular check the second paragraph… “[compound decision problems] demonstrate against naive intuition, that stochastically independent experiments are not necessarily noninformative to each other!!!”. Sorry, Axiom 3 must go.
- w.r.t. your argument for ruling out all the Renyi entropies except one: See above. But, it doesn’t work even if I accept, for the sake of argument, the axiom. Here is why:
Your axioms are not able to choose a single entropy S, just the whole Renyi family of entropies. So what I don’t understand (or rather, what I understand to be incorrect) is why you then assume that, for a given problem, there MUST be only one member of the family that works. Nothing in your axioms tells you HOW to pick one single S among the Renyi family. What happens is that there is more than one way of measuring preference. It depends on what you, the user, want. The correct value of the parameter is not necessarily in the problem; it is in the user.
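That different members of the Renyi family can express different preferences is easy to check numerically. The sketch below is only an illustration: it uses plain Renyi entropies of two hypothetical four-outcome distributions (the numbers are made up for the example, and relative entropies against a prior would need an analogous check). The two distributions are ranked one way at order 1 (Shannon) and the opposite way at order 2.

```python
import math

def renyi_entropy(p, alpha):
    # Renyi entropy (in nats) of a discrete distribution p, for alpha > 0.
    support = [pi for pi in p if pi > 0.0]
    if alpha == 1.0:
        # The alpha -> 1 limit is the Shannon entropy.
        return -sum(pi * math.log(pi) for pi in support)
    return math.log(sum(pi ** alpha for pi in support)) / (1.0 - alpha)

# Two hypothetical distributions over four outcomes (illustrative only).
p = [0.7, 0.1, 0.1, 0.1]
q = [0.5, 0.5, 0.0, 0.0]

for alpha in (0.5, 1.0, 2.0):
    # Print the order alpha followed by the entropies of p and q.
    print(alpha, renyi_entropy(p, alpha), renyi_entropy(q, alpha))
```

At order 1 the distribution `p` has the larger entropy, while at order 2 it is `q`, so which one counts as "more spread out" depends on the order the user picks.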
There is nothing weird about that. Think of distances, say on the real line. If I ask you: How far away is 3 from 7.5? the naive standard answer is |3 - 7.5|, but that’s just one choice among many. Nothing in the structure of the real numbers demands the use of one metric over another. In some cases the better choice is the dyadic metric, for example. Or to put it more simply: How far is London from NY? It is $800 away, and it is 7 hours away, and so on…
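The 3 versus 7.5 example can be made concrete. The sketch below assumes that the alternative metric meant here is the 2-adic metric on the rationals (that reading is an assumption, and the helper name `two_adic_abs` is hypothetical). It computes the distance between the same two points under the usual absolute value and under the 2-adic absolute value.

```python
from fractions import Fraction

def two_adic_abs(x):
    # 2-adic absolute value |x|_2 = 2^(-v), where 2^v is the exact power
    # of 2 in x (v is negative when 2 divides the denominator).
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    num, den, v = abs(x.numerator), x.denominator, 0
    while num % 2 == 0:
        num //= 2
        v += 1
    while den % 2 == 0:
        den //= 2
        v -= 1
    return Fraction(2) ** (-v)

a, b = Fraction(3), Fraction(15, 2)  # the points 3 and 7.5
usual = abs(a - b)                   # ordinary distance on the real line
two_adic = two_adic_abs(a - b)       # distance under the 2-adic metric
print(usual, two_adic)
```

Both functions satisfy the axioms of a metric on the rationals, yet they disagree about how far apart the same two numbers are.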
Axiom 3 must go. Another way of seeing that it is faulty is to realize that the notion of independent subsystems is not coordinate invariant, and that, combined with your explicit demand of coordinate invariance in Axiom 2, is a recipe for disaster. One has to be very careful when claiming anything about total probabilistic systems obtained from simpler ones. Our intuitions are often wrong. One of the most dramatic demonstrations of this is provided by the so-called (Juan) Parrondo’s paradoxes. I think we’ve talked about them. Remember? You have two separate games. In each game you expect to lose, but if you are allowed to play both then you can WIN!
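The effect can be computed exactly for small games. The sketch below uses the commonly quoted textbook version of Parrondo's games, which is an assumption here and not something specified in this exchange: game A wins with probability 1/2 - eps; game B wins with probability 1/10 - eps when the capital is a multiple of 3 and 3/4 - eps otherwise; the mixed strategy picks A or B by a fair coin at every play. The long-run gain per play follows from the stationary distribution of the capital modulo 3.

```python
def stationary(p, iters=2000):
    # p[i] = probability of winning (capital + 1) when capital mod 3 == i.
    # Power iteration for the stationary distribution of capital mod 3.
    pi = [1.0 / 3.0] * 3
    for _ in range(iters):
        pi = [pi[(i - 1) % 3] * p[(i - 1) % 3]            # arrived by a win
              + pi[(i + 1) % 3] * (1.0 - p[(i + 1) % 3])  # arrived by a loss
              for i in range(3)]
    return pi

def gain_per_play(p):
    # Long-run expected change in capital per play.
    pi = stationary(p)
    return sum(w * (2.0 * pw - 1.0) for w, pw in zip(pi, p))

eps = 0.005  # small bias against the player (illustrative value)
game_a = [0.5 - eps] * 3
game_b = [0.1 - eps, 0.75 - eps, 0.75 - eps]
mixed = [(a + b) / 2.0 for a, b in zip(game_a, game_b)]  # fair coin: A or B

gain_a = gain_per_play(game_a)
gain_b = gain_per_play(game_b)
gain_mixed = gain_per_play(mixed)
print(gain_a, gain_b, gain_mixed)
```

Under these assumed parameters each game alone has a negative long-run gain, while the random mixture has a positive one.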
What I’d do is to replace Axiom3 with the missing other, and in my opinion, most important symmetry of statistical
then you’ll recover the only thing that you are actually using from your faulty Axiom3. You may end up with not just Renyi entropies but mixtures of them as well.
- John Skilling comes back: 082606
- Carlos shows off his DG: 082606
- Skilling replies to AC: EGP 082706
- AC replies to CR: 082706
- JS to CR: 082806
- JS to AC: “Against prior(L)!” 082906
- AC to JS, II: 083006
Raping the Likelihood Principle
- JS “Against Fisher Mass”: 083006
