
Does Geometry underlie Inference?

A version of the general geometric prior has been around for almost 20 years if not more. In 2002 Rodriguez proposed a geometric theory of ignorance that is able to produce priors and models from objective data.

YES!

NO!


To John et al.:

Georg Cantor’s words, from the beginning of the twentieth century, resonate in my mind:

 “My theory stands firm as a rock;
 every arrow directed against it
 will return quickly to its archer.
 How do I know this?
 Because I have studied it from all sides
 for many years; because I have examined
 all objections which have ever been made
 against [it]…”

Skilling says:

 “I am a Bayesian.
 I am trying to infer parameter \theta \in \Theta.
 What prior \pi(\theta) should I assign?”

I say: WRONG Q!

Try this Q:

 “I have a brain.
 I observe x.
 I want to explain x.
 I want to predict future x'.
 How?”

Notice the difference between the two initial set-ups. Skilling’s is all about \theta; mine is all about x.

My Q is considerably simpler. In my set-up there are no English priests, no ghostly parameters, no priors. Only some-thing, x rather than no-thing, O (ignorance).

I claim that our ability to pose the Q is in itself a proof that it can, at least partially, be answered.

Whatever x is or means, brains have survived to Q. Brains and Q are dialectic couples. We seem to be in a place where explanations are possible and useful.

Let’s label a particular explanation for x and x' with a parameter \theta. I only need to bother about labels if I want to distinguish among multiple explanations \theta \in \Theta.

Ok. But what do I mean by an explanation?

What’s \theta a label for?

For me an explanation for possible observations x\in X is the total collection of partial truth values of statements about x in a given domain of discourse in X.

By domain of discourse in X I mean a given Boolean Field in X.

Bring in R. T. Cox and finally claim that an explanation is nothing but a probability distribution for x. I often refer to \theta as a theory. If we assume the theory to be \theta then we can obtain the probabilities of different x\in X,

Pr(x | \theta)

For a fixed x, any function of \theta proportional to the above is commonly known as the Likelihood L.

It follows easily from Kraft’s inequality that there is a one-to-one correspondence between probability distributions and (prefix) codes. Thus, one could think of an explanation for x as a code or compression of x.
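A minimal sketch of the correspondence, in Python. The distribution `p` and the function names are assumptions made up for illustration, not part of the original exchange. The rule l_i = ceil(-log2 p_i) is the standard Shannon code-length construction; the sketch only checks Kraft's inequality on the lengths and does not build the codewords themselves.

```python
import math

def shannon_code_lengths(p):
    """Integer code lengths l_i = ceil(-log2 p_i) for a distribution p."""
    return [math.ceil(-math.log2(pi)) for pi in p]

def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality: sum_i 2^(-l_i)."""
    return sum(2.0 ** -l for l in lengths)

def implied_distribution(lengths):
    """Going the other way: code lengths define q_i proportional to 2^(-l_i)."""
    total = kraft_sum(lengths)
    return [2.0 ** -l / total for l in lengths]

# Hypothetical distribution over four outcomes (an assumption for illustration).
p = [0.5, 0.25, 0.125, 0.125]
lengths = shannon_code_lengths(p)

# Kraft's inequality (sum <= 1) guarantees a prefix code with these lengths exists.
print(lengths, kraft_sum(lengths), implied_distribution(lengths))
```

For a distribution whose probabilities are powers of 1/2, as here, the round trip from distribution to lengths and back is exact. For other distributions the integer rounding makes the match approximate, so the correspondence is best read as holding up to that rounding.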

So, what’s a prior \pi(\theta) again?

By now it should be no surprise that a prior is ALWAYS a probability distribution over probability distributions!

The parameters \theta are only labels. They label probability distributions. They are only a choice of coordinates. A language. The probabilities for the different explanations (theories, codes, prob distributions… all the same) MUST not change under reparameterizations.

It is a pure mathematics fact that collections of probability distributions that admit smooth parameterizations are geometric objects known as Riemannian manifolds with the metric given by Fisher information. For these regular cases, priors (and posteriors) \pi(\theta) become scalar density fields on the manifold.
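A small numerical sketch of that invariance, assuming the simplest regular model: a single Bernoulli trial. The model, the interval 0.2 <= p <= 0.6, the helper names, and the range [-5, 5] used for the "flat in phi" comparison are all assumptions chosen for illustration. The Fisher information is 1/(p(1-p)) in the coordinate p and p(1-p) in the coordinate phi = logit(p).

```python
import math

def integrate(f, a, b, n=20000):
    """Midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(phi):
    return 1.0 / (1.0 + math.exp(-phi))

# Fisher information of a Bernoulli model in two coordinate systems.
def fisher_p(p):          # coordinate p = Pr(x = 1)
    return 1.0 / (p * (1.0 - p))

def fisher_phi(phi):      # coordinate phi = logit(p)
    p = sigmoid(phi)
    return p * (1.0 - p)

# Volume (integral of sqrt of Fisher information) of the same set of
# distributions, those with 0.2 <= p <= 0.6, measured in each coordinate system.
a, b = 0.2, 0.6
vol_p = integrate(lambda p: math.sqrt(fisher_p(p)), a, b)
vol_phi = integrate(lambda phi: math.sqrt(fisher_phi(phi)), logit(a), logit(b))

# A density that is flat in the coordinate gives a coordinate-dependent answer:
# flat in p on (0, 1) versus flat in phi on the (arbitrary) interval [-5, 5].
flat_p = b - a
flat_phi = (logit(b) - logit(a)) / 10.0

print(vol_p, vol_phi, flat_p, flat_phi)
```

The two volumes agree to numerical precision because the volume element belongs to the set of distributions, not to the labels. The two "flat" answers disagree because flatness is a statement about a particular choice of coordinates.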

When the proposed collection of explanations (statistical model, hypothesis space) is not a Riemannian manifold with Fisher information as the metric, we say the model is NON-REGULAR. In these non-regular hypothesis spaces the standard geometry breaks down and consequently the inference also shows peculiarities and singularities and often requires the invention of new methods of analysis. For example, it is well-known (for those of us that know it!) that in finite dimensional regular models asymptotic estimation follows the \sqrt{n}-law. In most (but not all!) regular infinite dimensional models the rates of convergence are slower than \sqrt{n} and in most (but not all!) non-regular finite dimensional models the rates are faster than \sqrt{n}.
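A rough simulation of the faster-than-\sqrt{n} behaviour, assuming the textbook non-regular model Uniform(0, \theta). This model is an illustrative choice and is not claimed to be one of Skilling's examples. The function names, sample sizes, trial count, and seed are likewise assumptions. The maximum of the sample is compared with twice the sample mean, which behaves like an ordinary \sqrt{n}-rate estimator.

```python
import random

def mean_abs_error(estimator, n, theta=1.0, trials=200, seed=0):
    """Average |estimate - theta| over repeated samples of size n from Uniform(0, theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        total += abs(estimator(xs) - theta)
    return total / trials

def max_estimator(xs):        # maximum likelihood estimate of theta
    return max(xs)

def moment_estimator(xs):     # twice the sample mean, a sqrt(n)-rate estimator
    return 2.0 * sum(xs) / len(xs)

# The sample size grows by a factor of 25 between the two runs.
n_small, n_large = 100, 2500
ratio_max = mean_abs_error(max_estimator, n_small) / mean_abs_error(max_estimator, n_large)
ratio_mom = mean_abs_error(moment_estimator, n_small) / mean_abs_error(moment_estimator, n_large)

# How much each estimator's error shrinks when n grows 25-fold.
print(ratio_max, ratio_mom)
```

Theory suggests the error of the maximum shrinks roughly like 1/n, so its ratio should land near 25, while the moment estimator's ratio should land near \sqrt{25} = 5. The simulated values fluctuate around those figures.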

The two examples chosen by Skilling are non-regular. Fisher info either does not exist, is singular (no inverse), or is infinite.

The fact that there are valid non-regular inference problems does not invalidate the ignorant prior any more than the existence of distributions without expectation invalidates the concept of expected value or of entropy.

It works in theory and it works in practice.

sure,

Maximum Entropy loudly and fast!

.. but wait there is a bit more…

It and bit from not!

To Johnny Cash: The appendix shows that you didn’t lose. Now please, read the theory.


John’s Answer: 081506

CR says:

 Hi John,

 Thank you very much for continuing with this very enjoyable 
 badminton-over-internet  kind of game. 

 I can’t resist running to hit the (.. how do you call the “ball” in
 badminton?.. I’ll google it later…) little flying thing with feathers to
 the other side….

 (More follows within your text…)

 On 8/15/06, John Skilling <skilling@eircom.net> wrote:
 > Hello again Carlos,
 > 
 > Thank you for your implicit acknowledgment that I have a brain.  I am

 ----The ref to my (and implicitly your) brain is not just for decoration.

 > trying to use it.  I want to know about theta.
 > 



 > Asking for the prior pi(theta) is not a “WRONG Question”.  It is MY
 > question, and I am entitled to ask it.  Yes, I observe data x.  Yes,
 > I want to predict future x’.  My way of doing that is to estimate the
 > parameter theta that underlies both x and x’.  The future x’ could be
 > simply theta, or could be some subsidiary property.  Either way,
 > theta is sufficient.  I want it.  I have the likelihood Pr(x|theta).
 > I need the prior pi(theta).  You tell me I can use the likelihood to
 > get the prior.  I am doubting that.
 > 

 --- this is like saying: “I want the cylindrical coordinates of this point
      p relative to this particular choice of origin, x,theta, and z axis”.

 --- sure you can ask for that, but *DO NOT* forget that it is the
      location of p that you really care about!

 > I don’t care whether theta is called a parameter, a label, a theory,
 > a model, an explanation, or a statement having partial truth value,
 > and neither do I care whether or not its distribution is classified
 > as a prefix code.  The erudition of these varied interpretations may
 > look impressive, but it is beside the point.  I want theta.  Simple.
 > 
 > 
 … hmmm I think you forgot it.

 Please stop here for a moment. 
 Please listen carefully and don’t rush and fall into an old loop.

 The point is trivial but subtle:

                                        “Parameters are ghosts”

 my mentioning of label, theory, model, explanation, or prefix code was not
 an attempt to display erudition (as a Spanish toreador’s request for Olé…)
 but to explain the above statement.

 Once you get that, you’ll see that your (excellent) concerns about gij
 should not really bother you.

 Think about the hypothesis space as a surface (model M). You are in fact
 free to deform a very small region of M a LOT and create a new M’. Like the
 creation of a black hole in GR but let’s not get dramatic (yet). Everywhere
 except in that small region, M and M’ coincide. The uniform prior (vol
 element) will be the same everywhere except in that small deformed
 region. Now let’s play GOD…. Let’s say you are god and you manufacture a
 big! x by choosing from a distribution inside that very deformed region of
 M’. Then it is clear that the different uniform priors for M and M’ will
 become important and will justifiably change your inferences….

 Another source of confusion: “Likelihood” 

 Let’s agree to call “Likelihood”
 any function L(theta) proportional to p(x|theta) PROVIDED THAT x IS THE
 ACTUAL OBSERVED DATA.  With this definition my ig priors DO NOT DEPEND ON
 THE LIKELIHOOD.  However, I am still happy to rape (most versions) the so
 called “likelihood principle”.  Priors depend on the hypothesis space in
 the trivial way that they are prob dist over it.  Again think of the
 surface M above. This is so trivial it hardly needs to be mentioned BUT the
 pop canon blinds the believers…

 Another point.  We could still handle what you call “discrete” problems by
 embedding them in continuous spaces but I do agree (sort of viscerally)
 with your expectation that ultimately the continuous Riemannian geometry
 should give way to something else… (non-commutative geometry, loops,
 strings or Wolfram science or even just 110…)

 SHUTTLE!!!!
 I love it.

 Cheers!

Reply to A. Caticha 10 Commandments

Why ME, ME, ME?

A. Caticha’s Reply

Reply to whymes 082506

… and Carlos says:

I was eagerly waiting for the reply. A couple of things:

  1. w.r.t. \alpha notice my “Ent Priors for discrete…” 2002, page 4.
  2. w.r.t. Gödel’s theorem: It applies to ALL axiomatic systems for induction, deduction or whatever.
  3. w.r.t. compound decisions check Zhang’s recent paper for an overview and LOTS of references. In particular check the second paragraph… “[compound decision problems] demonstrate against naive intuition, that stochastically independent experiments are not necessarily noninformative to each other!!!”. Sorry, Axiom 3 must go.
  4. w.r.t. your argument for ruling out all the Renyi entropies except one: See above. But, it doesn’t work even if I accept, for the sake of argument, the axiom. Here is why:

Your axioms are not able to choose a single entropy S, just the whole Renyi family of entropies. So what I don’t understand (or rather, what I understand to be incorrect) is why you then assume that for a given problem there MUST be only one member of the family that works. Nothing in your axioms tells you HOW to pick one single S among the Renyi family. What happens is that there is more than one way of measuring preference. It depends on what you, the user, want. The correct value of the parameter is not necessarily in the problem; it is in the user.
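A small sketch of how members of the Renyi family can disagree, assuming the standard definition H_\alpha(p) = log(\sum_i p_i^\alpha) / (1 - \alpha), with \alpha = 1 taken as the Shannon limit. The two distributions `p` and `q` are made-up examples. Plain entropies are used here for simplicity rather than the relative entropies of the argument above, so this only illustrates the general point that different orders can rank the same pair differently.

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (in nats); alpha = 1 is the Shannon limit."""
    p = [pi for pi in p if pi > 0.0]
    if abs(alpha - 1.0) < 1e-12:
        return -sum(pi * math.log(pi) for pi in p)
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

# Two hypothetical distributions (assumptions for illustration only).
p = [0.5, 0.5]
q = [0.7, 0.1, 0.1, 0.1]

# Compare the two distributions under several orders alpha.
for alpha in (0.5, 1.0, 2.0):
    print(alpha, renyi_entropy(p, alpha), renyi_entropy(q, alpha))
```

With these numbers the order-1 (Shannon) entropy ranks q above p, while the order-2 entropy ranks p above q. Which ranking counts as right is not settled by the family itself.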

There is nothing weird about that. Think of distances, say on the real line. If I ask you: How far away is 3 from 7.5? the naive standard answer is |3 - 7.5|, but that’s just one choice among many. Nothing in the structure of the real numbers demands the use of one metric over another. In some cases the better choice is the dyadic metric, for example. Or, to put it more simply: How far is London from NY? It is $800 away, and it is 7 hours away, and it is etc…
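A minimal sketch of the two answers to "how far is 3 from 7.5?", reading the dyadic metric as the 2-adic metric (an assumption about what is meant). That metric is defined on the rationals, which is enough for these two numbers. The function name is hypothetical.

```python
from fractions import Fraction

def two_adic_abs(x):
    """2-adic absolute value |x|_2 = 2^(-v), where 2^v is the power of 2 dividing x."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    v = 0
    num, den = x.numerator, x.denominator
    while num % 2 == 0:      # factors of 2 in the numerator raise v
        num //= 2
        v += 1
    while den % 2 == 0:      # factors of 2 in the denominator lower v
        den //= 2
        v -= 1
    return Fraction(2) ** (-v)

a, b = Fraction(3), Fraction(15, 2)
usual = abs(a - b)             # ordinary distance |3 - 7.5|
dyadic = two_adic_abs(a - b)   # 2-adic distance between the same two numbers
print(usual, dyadic)
```

The difference is -9/2, so the ordinary distance is 9/2 and the 2-adic distance is 2. Both satisfy the axioms of a metric on the rationals, and neither is forced by the numbers themselves.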

Axiom 3 must go. Another way of seeing that it is faulty is to realize that the notion of independent subsystems is not coordinate invariant, and that, combined with your explicit demand of coordinate invariance in Axiom 2, is a recipe for disaster. One has to be very careful when claiming anything about total probabilistic systems obtained from simpler ones. Our intuitions are often wrong. Among the most dramatic demonstrations of this are the so-called (Juan) Parrondo’s paradoxes. I think we’ve talked about them. Remember? You have two separate games. In each game you expect to lose, but if you are allowed to play both then you can WIN!
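A short sketch of the paradox, assuming the commonly quoted version of Parrondo's games with \epsilon = 0.005: game A wins with probability 1/2 - \epsilon; game B wins with probability 1/10 - \epsilon when the capital is a multiple of 3 and 3/4 - \epsilon otherwise. These numbers and the function name `drift` are assumptions not taken from the exchange above. Instead of simulating, the code finds the long-run expected gain per round from the stationary distribution of the capital modulo 3.

```python
def drift(p):
    """Long-run expected gain per round; p[i] is the win probability when capital % 3 == i."""
    pi = [1.0 / 3.0] * 3
    for _ in range(5000):                            # power iteration to the stationary distribution
        new = [0.0, 0.0, 0.0]
        for i in range(3):
            new[(i + 1) % 3] += pi[i] * p[i]         # win: capital goes up by 1
            new[(i - 1) % 3] += pi[i] * (1 - p[i])   # lose: capital goes down by 1
        pi = new
    return sum(pi[i] * (2 * p[i] - 1) for i in range(3))

eps = 0.005
game_a = [0.5 - eps] * 3                             # same coin whatever the capital
game_b = [0.1 - eps, 0.75 - eps, 0.75 - eps]         # bad coin when capital % 3 == 0
mixed = [0.5 * (x + y) for x, y in zip(game_a, game_b)]  # pick A or B by a fair coin each round

print(drift(game_a), drift(game_b), drift(mixed))
```

Under these assumptions the first two values are negative and the third is positive: each game loses on its own, and the random mixture wins.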

What I’d do is to replace Axiom 3 with the missing other and, in my opinion, most important symmetry of statistical inference: Sufficiency. I think that if you demand your preferences to be invariant under sufficient reductions of data then you’ll recover the only thing that you are actually using from your faulty Axiom 3. You may end up with not just Renyi entropies but mixtures of them as well.



Raping the Likelihood Principle

083006

  • JS “Against Fisher Mass”: 083006


Page last modified on April 24, 2009, at 01:22 PM