Sum-of-squares: proofs, beliefs, and algorithms — Boaz Barak and David Steurer


\[ \newcommand{\undefined}{} \newcommand{\hfill}{} \newcommand{\qedhere}{\square} \newcommand{\qed}{\square} \newcommand{\ensuremath}[1]{#1} \newcommand{\bbA}{\mathbb A} \newcommand{\bbB}{\mathbb B} \newcommand{\bbC}{\mathbb C} \newcommand{\bbD}{\mathbb D} \newcommand{\bbE}{\mathbb E} \newcommand{\bbF}{\mathbb F} \newcommand{\bbG}{\mathbb G} \newcommand{\bbH}{\mathbb H} \newcommand{\bbI}{\mathbb I} \newcommand{\bbJ}{\mathbb J} \newcommand{\bbK}{\mathbb K} \newcommand{\bbL}{\mathbb L} \newcommand{\bbM}{\mathbb M} \newcommand{\bbN}{\mathbb N} \newcommand{\bbO}{\mathbb O} \newcommand{\bbP}{\mathbb P} \newcommand{\bbQ}{\mathbb Q} \newcommand{\bbR}{\mathbb R} \newcommand{\bbS}{\mathbb S} \newcommand{\bbT}{\mathbb T} \newcommand{\bbU}{\mathbb U} \newcommand{\bbV}{\mathbb V} \newcommand{\bbW}{\mathbb W} \newcommand{\bbX}{\mathbb X} \newcommand{\bbY}{\mathbb Y} \newcommand{\bbZ}{\mathbb Z} \newcommand{\sA}{\mathscr A} \newcommand{\sB}{\mathscr B} \newcommand{\sC}{\mathscr C} \newcommand{\sD}{\mathscr D} \newcommand{\sE}{\mathscr E} \newcommand{\sF}{\mathscr F} \newcommand{\sG}{\mathscr G} \newcommand{\sH}{\mathscr H} \newcommand{\sI}{\mathscr I} \newcommand{\sJ}{\mathscr J} \newcommand{\sK}{\mathscr K} \newcommand{\sL}{\mathscr L} \newcommand{\sM}{\mathscr M} \newcommand{\sN}{\mathscr N} \newcommand{\sO}{\mathscr O} \newcommand{\sP}{\mathscr P} \newcommand{\sQ}{\mathscr Q} \newcommand{\sR}{\mathscr R} \newcommand{\sS}{\mathscr S} \newcommand{\sT}{\mathscr T} \newcommand{\sU}{\mathscr U} \newcommand{\sV}{\mathscr V} \newcommand{\sW}{\mathscr W} \newcommand{\sX}{\mathscr X} \newcommand{\sY}{\mathscr Y} \newcommand{\sZ}{\mathscr Z} \newcommand{\sfA}{\mathsf A} \newcommand{\sfB}{\mathsf B} \newcommand{\sfC}{\mathsf C} \newcommand{\sfD}{\mathsf D} \newcommand{\sfE}{\mathsf E} \newcommand{\sfF}{\mathsf F} \newcommand{\sfG}{\mathsf G} \newcommand{\sfH}{\mathsf H} \newcommand{\sfI}{\mathsf I} \newcommand{\sfJ}{\mathsf J} \newcommand{\sfK}{\mathsf K} \newcommand{\sfL}{\mathsf L} \newcommand{\sfM}{\mathsf M} \newcommand{\sfN}{\mathsf N} \newcommand{\sfO}{\mathsf O} \newcommand{\sfP}{\mathsf P} \newcommand{\sfQ}{\mathsf Q} \newcommand{\sfR}{\mathsf R} \newcommand{\sfS}{\mathsf S} \newcommand{\sfT}{\mathsf T} \newcommand{\sfU}{\mathsf U} \newcommand{\sfV}{\mathsf V} \newcommand{\sfW}{\mathsf W} \newcommand{\sfX}{\mathsf X} \newcommand{\sfY}{\mathsf Y} \newcommand{\sfZ}{\mathsf Z} \newcommand{\cA}{\mathcal A} \newcommand{\cB}{\mathcal B} \newcommand{\cC}{\mathcal C} \newcommand{\cD}{\mathcal D} \newcommand{\cE}{\mathcal E} \newcommand{\cF}{\mathcal F} \newcommand{\cG}{\mathcal G} \newcommand{\cH}{\mathcal H} \newcommand{\cI}{\mathcal I} \newcommand{\cJ}{\mathcal J} \newcommand{\cK}{\mathcal K} \newcommand{\cL}{\mathcal L} \newcommand{\cM}{\mathcal M} \newcommand{\cN}{\mathcal N} \newcommand{\cO}{\mathcal O} \newcommand{\cP}{\mathcal P} \newcommand{\cQ}{\mathcal Q} \newcommand{\cR}{\mathcal R} \newcommand{\cS}{\mathcal S} \newcommand{\cT}{\mathcal T} \newcommand{\cU}{\mathcal U} \newcommand{\cV}{\mathcal V} \newcommand{\cW}{\mathcal W} \newcommand{\cX}{\mathcal X} \newcommand{\cY}{\mathcal Y} \newcommand{\cZ}{\mathcal Z} \newcommand{\bfA}{\mathbf A} \newcommand{\bfB}{\mathbf B} \newcommand{\bfC}{\mathbf C} \newcommand{\bfD}{\mathbf D} \newcommand{\bfE}{\mathbf E} \newcommand{\bfF}{\mathbf F} \newcommand{\bfG}{\mathbf G} \newcommand{\bfH}{\mathbf H} \newcommand{\bfI}{\mathbf I} \newcommand{\bfJ}{\mathbf J} \newcommand{\bfK}{\mathbf K} \newcommand{\bfL}{\mathbf L} \newcommand{\bfM}{\mathbf M} 
\newcommand{\bfN}{\mathbf N} \newcommand{\bfO}{\mathbf O} \newcommand{\bfP}{\mathbf P} \newcommand{\bfQ}{\mathbf Q} \newcommand{\bfR}{\mathbf R} \newcommand{\bfS}{\mathbf S} \newcommand{\bfT}{\mathbf T} \newcommand{\bfU}{\mathbf U} \newcommand{\bfV}{\mathbf V} \newcommand{\bfW}{\mathbf W} \newcommand{\bfX}{\mathbf X} \newcommand{\bfY}{\mathbf Y} \newcommand{\bfZ}{\mathbf Z} \newcommand{\rmA}{\mathrm A} \newcommand{\rmB}{\mathrm B} \newcommand{\rmC}{\mathrm C} \newcommand{\rmD}{\mathrm D} \newcommand{\rmE}{\mathrm E} \newcommand{\rmF}{\mathrm F} \newcommand{\rmG}{\mathrm G} \newcommand{\rmH}{\mathrm H} \newcommand{\rmI}{\mathrm I} \newcommand{\rmJ}{\mathrm J} \newcommand{\rmK}{\mathrm K} \newcommand{\rmL}{\mathrm L} \newcommand{\rmM}{\mathrm M} \newcommand{\rmN}{\mathrm N} \newcommand{\rmO}{\mathrm O} \newcommand{\rmP}{\mathrm P} \newcommand{\rmQ}{\mathrm Q} \newcommand{\rmR}{\mathrm R} \newcommand{\rmS}{\mathrm S} \newcommand{\rmT}{\mathrm T} \newcommand{\rmU}{\mathrm U} \newcommand{\rmV}{\mathrm V} \newcommand{\rmW}{\mathrm W} \newcommand{\rmX}{\mathrm X} \newcommand{\rmY}{\mathrm Y} \newcommand{\rmZ}{\mathrm Z} \newcommand{\paren}[1]{( #1 )} \newcommand{\Paren}[1]{\left( #1 \right)} \newcommand{\bigparen}[1]{\bigl( #1 \bigr)} \newcommand{\Bigparen}[1]{\Bigl( #1 \Bigr)} \newcommand{\biggparen}[1]{\biggl( #1 \biggr)} \newcommand{\Biggparen}[1]{\Biggl( #1 \Biggr)} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\Abs}[1]{\left\lvert #1 \right\rvert} \newcommand{\bigabs}[1]{\bigl\lvert #1 \bigr\rvert} \newcommand{\Bigabs}[1]{\Bigl\lvert #1 \Bigr\rvert} \newcommand{\biggabs}[1]{\biggl\lvert #1 \biggr\rvert} \newcommand{\Biggabs}[1]{\Biggl\lvert #1 \Biggr\rvert} \newcommand{\card}[1]{\lvert #1 \rvert} \newcommand{\Card}[1]{\left\lvert #1 \right\rvert} \newcommand{\bigcard}[1]{\bigl\lvert #1 \bigr\rvert} \newcommand{\Bigcard}[1]{\Bigl\lvert #1 \Bigr\rvert} \newcommand{\biggcard}[1]{\biggl\lvert #1 \biggr\rvert} \newcommand{\Biggcard}[1]{\Biggl\lvert #1 \Biggr\rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\Norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\bignorm}[1]{\bigl\lVert #1 \bigr\rVert} \newcommand{\Bignorm}[1]{\Bigl\lVert #1 \Bigr\rVert} \newcommand{\biggnorm}[1]{\biggl\lVert #1 \biggr\rVert} \newcommand{\Biggnorm}[1]{\Biggl\lVert #1 \Biggr\rVert} \newcommand{\iprod}[1]{\langle #1 \rangle} \newcommand{\Iprod}[1]{\left\langle #1 \right\rangle} \newcommand{\bigiprod}[1]{\bigl\langle #1 \bigr\rangle} \newcommand{\Bigiprod}[1]{\Bigl\langle #1 \Bigr\rangle} \newcommand{\biggiprod}[1]{\biggl\langle #1 \biggr\rangle} \newcommand{\Biggiprod}[1]{\Biggl\langle #1 \Biggr\rangle} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\Set}[1]{\left\lbrace #1 \right\rbrace} \newcommand{\bigset}[1]{\bigl\lbrace #1 \bigr\rbrace} \newcommand{\Bigset}[1]{\Bigl\lbrace #1 \Bigr\rbrace} \newcommand{\biggset}[1]{\biggl\lbrace #1 \biggr\rbrace} \newcommand{\Biggset}[1]{\Biggl\lbrace #1 \Biggr\rbrace} \newcommand{\bracket}[1]{\lbrack #1 \rbrack} \newcommand{\Bracket}[1]{\left\lbrack #1 \right\rbrack} \newcommand{\bigbracket}[1]{\bigl\lbrack #1 \bigr\rbrack} \newcommand{\Bigbracket}[1]{\Bigl\lbrack #1 \Bigr\rbrack} \newcommand{\biggbracket}[1]{\biggl\lbrack #1 \biggr\rbrack} \newcommand{\Biggbracket}[1]{\Biggl\lbrack #1 \Biggr\rbrack} \newcommand{\ucorner}[1]{\ulcorner #1 \urcorner} \newcommand{\Ucorner}[1]{\left\ulcorner #1 \right\urcorner} \newcommand{\bigucorner}[1]{\bigl\ulcorner #1 \bigr\urcorner} \newcommand{\Bigucorner}[1]{\Bigl\ulcorner #1 \Bigr\urcorner} 
\newcommand{\biggucorner}[1]{\biggl\ulcorner #1 \biggr\urcorner} \newcommand{\Biggucorner}[1]{\Biggl\ulcorner #1 \Biggr\urcorner} \newcommand{\ceil}[1]{\lceil #1 \rceil} \newcommand{\Ceil}[1]{\left\lceil #1 \right\rceil} \newcommand{\bigceil}[1]{\bigl\lceil #1 \bigr\rceil} \newcommand{\Bigceil}[1]{\Bigl\lceil #1 \Bigr\rceil} \newcommand{\biggceil}[1]{\biggl\lceil #1 \biggr\rceil} \newcommand{\Biggceil}[1]{\Biggl\lceil #1 \Biggr\rceil} \newcommand{\floor}[1]{\lfloor #1 \rfloor} \newcommand{\Floor}[1]{\left\lfloor #1 \right\rfloor} \newcommand{\bigfloor}[1]{\bigl\lfloor #1 \bigr\rfloor} \newcommand{\Bigfloor}[1]{\Bigl\lfloor #1 \Bigr\rfloor} \newcommand{\biggfloor}[1]{\biggl\lfloor #1 \biggr\rfloor} \newcommand{\Biggfloor}[1]{\Biggl\lfloor #1 \Biggr\rfloor} \newcommand{\lcorner}[1]{\llcorner #1 \lrcorner} \newcommand{\Lcorner}[1]{\left\llcorner #1 \right\lrcorner} \newcommand{\biglcorner}[1]{\bigl\llcorner #1 \bigr\lrcorner} \newcommand{\Biglcorner}[1]{\Bigl\llcorner #1 \Bigr\lrcorner} \newcommand{\bigglcorner}[1]{\biggl\llcorner #1 \biggr\lrcorner} \newcommand{\Bigglcorner}[1]{\Biggl\llcorner #1 \Biggr\lrcorner} \newcommand{\e}{\varepsilon} \newcommand{\eps}{\varepsilon} \newcommand{\from}{\colon} \newcommand{\super}[2]{#1^{(#2)}} \newcommand{\varsuper}[2]{#1^{\scriptscriptstyle (#2)}} \newcommand{\tensor}{\otimes} \newcommand{\eset}{\emptyset} \newcommand{\sse}{\subseteq} \newcommand{\sst}{\substack} \newcommand{\ot}{\otimes} \newcommand{\Esst}[1]{\bbE_{\substack{#1}}} \newcommand{\vbig}{\vphantom{\bigoplus}} \newcommand{\seteq}{\mathrel{\mathop:}=} \newcommand{\defeq}{\stackrel{\mathrm{def}}=} \newcommand{\Mid}{\mathrel{}\middle|\mathrel{}} \newcommand{\Ind}{\mathbf 1} \newcommand{\bits}{\{0,1\}} \newcommand{\sbits}{\{\pm 1\}} \newcommand{\R}{\mathbb R} \newcommand{\Rnn}{\R_{\ge 0}} \newcommand{\N}{\mathbb N} \newcommand{\Z}{\mathbb Z} \newcommand{\Q}{\mathbb Q} \newcommand{\mper}{\,.} \newcommand{\mcom}{\,,} \DeclareMathOperator{\Id}{Id} \DeclareMathOperator{\cone}{cone} \DeclareMathOperator{\vol}{vol} \DeclareMathOperator{\val}{val} \DeclareMathOperator{\opt}{opt} \DeclareMathOperator{\Opt}{Opt} \DeclareMathOperator{\Val}{Val} \DeclareMathOperator{\LP}{LP} \DeclareMathOperator{\SDP}{SDP} \DeclareMathOperator{\Tr}{Tr} \DeclareMathOperator{\Inf}{Inf} \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\argmax}{arg\,max} \DeclareMathOperator{\argmin}{arg\,min} \DeclareMathOperator{\qpoly}{qpoly} \DeclareMathOperator{\qqpoly}{qqpoly} \DeclareMathOperator{\conv}{conv} \DeclareMathOperator{\Conv}{Conv} \DeclareMathOperator{\supp}{supp} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\mspan}{span} \DeclareMathOperator{\mrank}{rank} \DeclareMathOperator{\E}{\mathbb E} \DeclareMathOperator{\pE}{\tilde{\mathbb E}} \DeclareMathOperator{\Pr}{\mathbb P} \DeclareMathOperator{\Span}{Span} \DeclareMathOperator{\Cone}{Cone} \DeclareMathOperator{\junta}{junta} \DeclareMathOperator{\NSS}{NSS} \DeclareMathOperator{\SA}{SA} \DeclareMathOperator{\SOS}{SOS} \newcommand{\iprod}[1]{\langle #1 \rangle} \newcommand{\R}{\mathbb{R}} \newcommand{\cE}{\mathcal{E}} \newcommand{\E}{\mathbb{E}} \newcommand{\pE}{\tilde{\mathbb{E}}} \newcommand{\N}{\mathbb{N}} \renewcommand{\P}{\mathcal{P}} \notag \]
\[ \newcommand{\sleq}{\ensuremath{\preceq}} \newcommand{\sgeq}{\ensuremath{\succeq}} \newcommand{\diag}{\ensuremath{\mathrm{diag}}} \newcommand{\support}{\ensuremath{\mathrm{support}}} \newcommand{\zo}{\ensuremath{\{0,1\}}} \newcommand{\pmo}{\ensuremath{\{\pm 1\}}} \newcommand{\uppersos}{\ensuremath{\overline{\mathrm{sos}}}} \newcommand{\lambdamax}{\ensuremath{\lambda_{\mathrm{max}}}} \newcommand{\rank}{\ensuremath{\mathrm{rank}}} \newcommand{\Mslow}{\ensuremath{M_{\mathrm{slow}}}} \newcommand{\Mfast}{\ensuremath{M_{\mathrm{fast}}}} \newcommand{\Mdiag}{\ensuremath{M_{\mathrm{diag}}}} \newcommand{\Mcross}{\ensuremath{M_{\mathrm{cross}}}} \newcommand{\eqdef}{\ensuremath{ =^{def}}} \newcommand{\threshold}{\ensuremath{\mathrm{threshold}}} \newcommand{\vbls}{\ensuremath{\mathrm{vbls}}} \newcommand{\cons}{\ensuremath{\mathrm{cons}}} \newcommand{\edges}{\ensuremath{\mathrm{edges}}} \newcommand{\cl}{\ensuremath{\mathrm{cl}}} \newcommand{\xor}{\ensuremath{\oplus}} \newcommand{\1}{\ensuremath{\mathrm{1}}} \notag \]
\[ \newcommand{\transpose}[1]{\ensuremath{#1{}^{\mkern-2mu\intercal}}} \newcommand{\dyad}[1]{\ensuremath{#1#1{}^{\mkern-2mu\intercal}}} \newcommand{\nchoose}[1]{\ensuremath{{n \choose #1}}} \newcommand{\generated}[1]{\ensuremath{\langle #1 \rangle}} \notag \]

The Bayesian interpretation of pseudo-distributions.

Consider an optimization problem of the form \(\min_{x\in\bits^n} f(x)\). When we run the degree \(d\) sos algorithm on this problem, we obtain the minimum value \(\alpha\in\R\) for which there is a degree \(d\) pseudo-distribution \(\mu\) such that \(\pE_\mu f = \alpha\). Our notation strongly encourages us to pretend that \(\mu\) is an actual distribution over \(x\in\bits^n\) such that \(f(x)=\alpha\). (Satisfying \(\pE f = \alpha\) and being supported on \(\{ x : f(x)=\alpha \}\) is not the same thing in general, but these notions do coincide when \(\alpha\) is the global minimum of \(f\).) Indeed, many properties of pseudo-distributions, such as satisfying the Cauchy-Schwarz and Hölder inequalities, are most naturally explained by the phenomenon of “inheriting” traits of actual distributions. However, often we have strong reasons to believe that no such distribution exists, or that if it does, its moments look nothing like \(\mu\)’s. How are we to interpret \(\mu\) in such cases?
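To make the algorithm concrete, here is a minimal sketch of its degree \(2\) version for a quadratic \(f\), written as a semidefinite program over the moment matrix of \(\mu\). This is only an illustration, and it assumes the cvxpy package (with an SDP-capable solver such as SCS) is available.

```python
# Degree-2 sos relaxation of min_{x in {0,1}^n} f(x) for a quadratic f.
# Illustrative example: f(x) = (x_0 + x_1 + x_2 - 2)^2, whose true minimum is 0.
import cvxpy as cp

n = 3
# M is the moment matrix of the pseudo-distribution, indexed by the monomials
# (1, x_0, ..., x_{n-1}): M[0,0] = pE 1, M[0,i+1] = pE x_i, M[i+1,j+1] = pE x_i x_j.
M = cp.Variable((n + 1, n + 1), symmetric=True)
constraints = [M >> 0, M[0, 0] == 1]                                # psd + normalization
constraints += [M[i + 1, i + 1] == M[0, i + 1] for i in range(n)]   # x_i^2 = x_i on {0,1}

s = sum(M[0, i + 1] for i in range(n))                              # pE sum_i x_i
q = sum(M[i + 1, j + 1] for i in range(n) for j in range(n))        # pE (sum_i x_i)^2
prob = cp.Problem(cp.Minimize(q - 4 * s + 4), constraints)          # pE (sum_i x_i - 2)^2
prob.solve()
print(prob.value, M.value[0, 1:])    # minimal pE f, and the "beliefs" pE x_i
```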

Let us elaborate on this a bit more. In many natural settings, the global minimum of \(f\) would be achieved at a single point. For example, if we think of \(f\) as counting the number of violations of some constraints or equations, then often these equations are overconstrained in the sense that the number of constraints is large enough to completely determine the optimum point. In optimization problems arising from machine learning, \(x^*\in\bits^n\) could denote the unknown parameters of the model, and \(f\) would be a function derived from the observed data. (E.g., \(x^*\) can denote the parameters of an unknown neural network and \(f\) measures the total deviation of the predictions of the candidate network \(x\) from the correct labels on a large number of samples.) As we accumulate enough data, eventually it would completely determine the unknown parameters (with arbitrarily high accuracy) in a statistical sense. So, if \(\mu\) were truly supported on points achieving the global minimum, then \(\mu\) would simply be the distribution putting all the weight on the single point \(x^*\), which in particular means that \(\pE_\mu x_i = x^*_i\) for every \(i\). But if deriving the parameters from the observations is computationally hard, then we shouldn’t be able to “read off” \(x^*\) from the expectations, and so it will not hold that \(\pE_\mu x_i\) is equal to either \(0\) or \(1\) for all \(i\).

As another example, consider the case where \(f(x)\) counts the number of violations of the constraints of some particular 3SAT formula \(\varphi\) by the assignment \(x\in\bits^n\). The formula \(\varphi\) is satisfiable if and only if \(\min_x f(x)=0\), but since this problem is NP-hard, there should exist some \(f\)’s of this form such that \(\min_x f(x)>0\) but for every constant \(d\) there is a degree \(d\) pseudo-distribution \(\mu\) such that \(\pE_\mu f = 0\). Thus this pseudo-distribution pretends to be a distribution over elements (i.e., satisfying assignments) that don’t exist. This is why we sometimes refer to such pseudo-distributions as akin to being supported over “unicorns”.

Bayesian probability theory

“The probability of winning a battle” … has no place in our theory … because we cannot think of a collective to which it belongs. The theory of probability cannot be applied to this problem any more than the physical concept of work can be applied to … the “work” done by an actor reciting his part. Richard von Mises, 1928

The theory of inverse probability is founded upon an error, and must be wholly rejected. Fisher, 1925

If anyone wishes to study the properties of frequencies in random experiments he is, of course, perfectly free to do so; and we wish him every success. But … why does he insist on appropriating the word “probability” which had already a long-established and very different technical meaning? E.T. Jaynes, 1978

I am unable to see why “objectivity” requires us to interpret every probability as a frequency in some random experiment; particularly when … probabilities appearing in most problems are … frequencies only in an … imaginary universe invented just for the purpose of allowing a frequency interpretation. E.T. Jaynes, 1976

“The terms “certain” and “probable” describe the various degrees of rational beliefs about a proposition … All propositions are true or false but the knowledge we have of them depends on our circumstances.” John Maynard Keynes, 1921

What kind of a probability theory would allow you to assign a value different from \(0\) or \(1\) to an event that clearly either happened or didn’t? It turns out that this is related to a longstanding question in the philosophy of probability, and in particular to the debate between the Bayesian and frequentist interpretations of probability. (Obviously we cannot do justice to this deep issue in these lecture notes. Two classic books on this topic from the frequentist and Bayesian viewpoints respectively are Feller (1968) and Jaynes (2003). This talk by Michael Jordan is also a good starting point for this discussion.) Probability theory initially arose in the 17th century from Pascal’s study of gambling. The concern there was how to make the best possible bets. Making bets (like investing in stocks) is less about the inherent uncertainty of the outcome and more about our information about this outcome: different bettors have different knowledge about the horse, team, company, or whatever we want to bet on. Indeed probability theory, as studied before the 19th century, was mostly based on the viewpoint of the observer, which generally can be subjective, though in some cases, such as tossing a die, essentially all observers have the same information. That is, for all observers all six sides are symmetric, which suggests assigning a probability of \(1/6\) to each particular outcome. This idea, known as the Principle of Indifference, is a cornerstone of classical probability theory as originated by Bernoulli and Laplace, and allowed them to argue even about the probabilities of events that do not come from a well defined sample space, such as “the probability that the sun will rise tomorrow”.

However, in the late 19th and 20th centuries, researchers such as Fisher, Neyman and Pearson found the symmetry/subjective based approach to probability lacking in rigor, and moved to the sample space or frequentist approach, which is how we typically learn probability in mathematics today. (In particular, the application of probability theory to events that are not repeatable experiments can sometimes lead to somewhat unsettling results such as the so called “doomsday argument”, by which one argues that humanity is not likely to survive in similar form for an extended period of time. This is because about 100 billion people have ever been born so far, and if you assume a uniform prior on the person you happened to be among all the people that ever existed or will exist, then this latter total should not be much more than 200 billion.)

In this approach, a probability distribution is defined by a function \(\mu\) that assigns to any element \(x\) in some sample space \(S\) some probability \(\mu(x)\in[0,1]\) (where these numbers sum, or integrate, to \(1\)), and the probability of some event \(A\) is obtained by summing up \(\mu(x)\) for all \(x\in A\).

The frequentist viewpoint is particularly well suited for hypothesis testing. This is the setting where we have some hypothesis \(H_0\) (known as the “null hypothesis”) and can set up an experiment that corresponds to a sample space in which, if \(H_0\) is true, then the probability that some event \(A\) occurs is at most \(1/2\). If we then repeat the experiment \(k\) times, and in all the cases the event \(A\) occurs, then we can decide that \(H_0\) has been “refuted”, and we would be wrong at most a \(2^{-k}\) fraction of the times. The value \(2^{-k}\) is known as the “\(p\)-value”, and in many scientific settings the threshold is set to \(0.05\). For example, if we are given a coin and our hypothesis is that it is a fair coin, then we can toss it \(5\) times and check whether we got all heads; if so, we can conclude that it is not a fair coin with a \(p\)-value of \(1/2^5 = 1/32 \sim 0.03\). (This example also shows the importance of deciding on the test to perform in advance: clearly any particular outcome of the five coin tosses has probability \(2^{-5}\), and so the mere fact that the outcome was “unlikely” is no evidence that the coin wasn’t fair.) This means that if we use this as our standard test for fairness of coins, then the frequency with which we would make an error calling a coin unfair when it is fair would be at most \(1/32\).
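This frequentist guarantee is easy to check empirically; the following small simulation (assuming numpy is available) flags a fair coin as unfair in only about a \(1/32\) fraction of trials:

```python
# Simulate the "five tosses, all heads" test on many fair coins and measure
# how often it wrongly declares a fair coin unfair.
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000
tosses = rng.integers(0, 2, size=(trials, 5))        # 0/1 tosses of fair coins
false_rejection_rate = np.mean(tosses.sum(axis=1) == 5)
print(false_rejection_rate)                          # ~0.031, matching 2**-5
```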

Indeed, for frequentist statisticians, probability does not make sense if we can’t set up a potentially repeatable experiment with a well defined sample space. Note that the frequentist outlook accepts the fact that sometimes (indeed up to a \(p\) fraction of the times, even if other sources of error are accounted for) we will come to the wrong conclusion. The frequentist outlook is well suited for experiment design in science, in the sense that it is not about using all available information to deduce whether or not the hypothesis is true, but rather about designing an experiment that can be rigorously analyzed to give evidence about its truth. If we had to bet on the hypothesis’ truth, we might want to use more of the information to get the best possible estimate for its likelihood, but the frequentist approach is not designed to achieve a betting strategy that makes the best possible guess using the given data. Rather, it is about designing an experiment that would gather more data until we can make a high confidence prediction. (One analogy is to consider the problem of trying to decide on the merits of a legal case. The Bayesian approach would be to take all the possible information and beliefs we have access to into account and then make the best possible prediction. The frequentist approach is to define rigorous “rules of evidence” that may sometimes not allow us to derive the truth, but can be shown to be correct most of the time. One can see that the Bayesian approach is more appropriate if we are interested in making a subjective judgement in a “one off” case that is as accurate as can be given our prior knowledge and the data we have, while the frequentist approach is better suited for deciding on procedures that will allow us to reach some mutually agreed upon objective conclusions in many cases.)

Not all situations in which we want to apply probability, whether it is betting, investing, or making decisions, fall cleanly into the frequentist framework. Sometimes we do need to make predictions outside a controlled experimental environment, and we want to do the best job we can with the data we have. These kinds of “messy” situations occur more often in modern applications and have led to a sort of partial resurgence of the “subjective”/“betting” view of probability theory, which is now commonly known as Bayesian probability. The reason for the name Bayesian is that such calculations almost always rely on Bayes’ theorem, which states that if you made some observation \(A\), then the probability you should assign to an unknown event \(B\) conditioned on this observation should be \[ \Pr[A \cap B]/\Pr[A] \label{eq:bayes} \]

In the context of subjective or betting probabilities, one initially has a prior distribution \(\mu\) over the set of possible outcomes; when we learn that some event \(A\) occurred and need to make a bet on \(B\), we use \eqref{eq:bayes} to derive the odds.
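As a toy illustration (all numbers are made up), here is what such a Bayesian update looks like for a discrete prior:

```python
# Discrete Bayes update: given prior beliefs Pr[B=b] and likelihoods Pr[A | B=b],
# compute the posterior Pr[B=b | A] = Pr[A and B=b] / Pr[A].
def bayes_update(prior, likelihood):
    joint = {b: prior[b] * likelihood[b] for b in prior}   # Pr[A and B=b]
    pr_a = sum(joint.values())                             # Pr[A]
    return {b: joint[b] / pr_a for b in joint}

# Betting on whether a coin is fair or heads-biased, after observing one heads:
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": 0.5, "biased": 0.9}                  # Pr[heads | hypothesis]
print(bayes_update(prior, likelihood))                     # fair: ~0.36, biased: ~0.64
```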

Note that in the context of betting, it may make perfect sense to give odds that do not correspond to zero or one probabilities even for events that have been fully determined. For example, if I had to bet on whether my great great grandfather had blue eyes, I would probably try to estimate this probability based on some prior estimate for the prevalence of blue eyes, adjusting it based on observations of the eye color of his descendants. However, clearly he either had blue eyes or didn’t, and so this probability is really due to my ignorance rather than to any inherent unpredictability of this event. More commonly, whenever we assign probabilities to events such as “X will win the upcoming election” or “company Y will perform well in the stock market”, there isn’t any well defined sample space from which this event is drawn in repeated experiments. Rather, the way to interpret these probabilities is that they represent the best odds we can give to those events given the information at our disposal.

A Bayesian and a frequentist go to the race tracks

The key philosophical difference between Bayesians and frequentists is whether the probability of an event \(A\) corresponds to the beliefs of an observer on the likelihood of \(A\) or to the frequency with which \(A\) would occur in repeated identical but independent experiments. What is more interesting to us is how this philosophical difference is manifested mathematically.

Suppose that Betty the Bayesian and Frida the frequentist go to the horse race tracks, where they can bet on various outcomes of a horse race (who will come in first, what the winning time will be, the gap between first and second place, etc.). We can model the outcome of such a race as a random variable \(X\) that is sampled from some distribution \(p(X|\theta)\) where \(\theta\) is some set of (unknown) parameters capturing the inherent abilities of the horses, jockeys, etc. Betty and Frida observe various partial information (e.g., results of past races, which we can think of as some random variable \(Y\) correlated with \(X\)) and then need to make bets on it. We can think of a “bet” as some function \(h\) that maps \(X\) to \(\R\), where \(h(x)\) is the payout of the bet when the outcome of \(X\) is \(x\). Let us say that each bettor’s goal is to come up with prices \(\Phi_{Betty}(h)\) and \(\Phi_{Frida}(h)\) for every potential bet \(h\) such that Betty (resp. Frida) would be willing to either buy or sell the bet \(h\) at the price \(\Phi_{Betty}(h)\) (resp. \(\Phi_{Frida}(h)\)). One can see that the best price (the one guaranteed not to lose in expectation regardless of whether you buy or sell) for the bet \(h\) would be \(\E_{x \sim p(X|\theta)} h(x)\), but the problem is that we don’t know \(\theta\).

Betty would handle this as follows: she will try to encode all her prior knowledge and assumptions on the abilities of the horses into a prior distribution \(\cD\) on the possible values of \(\theta\). (The reliance on a subjective prior might make us uncomfortable, but note that in most cases the choice of the model for \(X\) (i.e., the distribution \(p(X|\theta)\)) is already subjective as well. In practice people often do the equivalent of “looking for a coin under the streetlight” and change the model so it becomes more computationally tractable.) Then she updates this prior based on the observations \(y\) to obtain the posterior distribution \(\cD'=\cD|Y=y\). Betty’s price \(\Phi_{Betty}(h)\) is the expectation of \(h(x)\) when we sample \(\theta\) conditioned on \(y\) and then \(x\) conditioned on \(\theta\). Betty’s price is only as good as the prior she uses, and we have no guarantee on what her performance would be for arbitrary \(\theta\). That is, there is no guarantee of calibration between \(\Phi_{Betty}(h)\) and the empirical average of \(h(x)\) if we were to sample \(x\) many times from \(p(X|\theta)\). (Betty’s prices do satisfy average case calibration in the sense that if \(\theta\) is sampled from Betty’s prior \(\cD\) and then \(x,y\) are sampled conditioned on \(\theta\), then the expectation of \(\Phi_{Betty}(h)\) (which is a function of \(y\)) would be the same as the expectation of \(h(x)\).) But at least Betty’s prices are coherent in the sense that they correspond to some distribution over the \(x\)’s. In particular, if \(h(x) \geq h'(x)\) for every \(x\), then we are guaranteed that Betty’s price for \(h\) will always be at least as large as her price for \(h'\), and so there is no “dutch book” or “arbitrage” strategy against Betty that is guaranteed to make money off her regardless of what the value of \(\theta\) is.
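In code, Betty’s procedure might look like the following toy sketch (all numbers are made up): a two-horse race where \(\theta\) is which horse is stronger, \(y\) is the result of one past race, and the bet \(h\) pays \(1\) if horse A wins today.

```python
# Betty's pricing: update the prior over theta using the observation y, then
# price the bet h at its expectation under the posterior predictive distribution.
prior = {"A_strong": 0.5, "B_strong": 0.5}        # prior over theta
p_A_wins = {"A_strong": 0.7, "B_strong": 0.4}     # Pr[horse A wins a race | theta]

# Observation y: horse A won the past race (races independent given theta).
post = {t: prior[t] * p_A_wins[t] for t in prior}
z = sum(post.values())
post = {t: post[t] / z for t in post}             # Pr[theta | y], by Bayes' rule

# Price of the bet that pays 1 if A wins today:
price = sum(post[t] * p_A_wins[t] for t in post)
print(price)                                      # ~0.59 = E[h] under the posterior
```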

For Frida, the parameters \(\theta\) and the random variable \(X\) play very different roles. The choice of \(\theta\) already happened, and so from a frequentist viewpoint there is no sense in assigning it a probability distribution. Rather, Frida will try to come up with some estimate \(\Phi_{Frida}(h)\) that is guaranteed to have some calibration, in the sense of a bound on the difference between \(\Phi_{Frida}(h)\) and \(\E_{x\sim p(X|\theta)} h(x)\) that holds for every \(\theta\) from a set \(\Theta\) of potentially allowable \(\theta\)’s, and such that if we can think of the data \(Y\) as coming from independent measurements \(Y_1,\ldots,Y_n\), then as \(n\) goes to infinity this difference tends to zero. Unlike in the Bayesian case, there is no universal prescription for how to come up with the frequentist estimate, and so Frida’s focus is on getting some estimation procedure that she can analyze to rigorously guarantee the calibration property. Since Frida’s estimation procedure might not correspond to taking the expectation of \(h\) under any distribution, it might be incoherent, and in particular there might be two functions \(h,h'\) such that \(h(x) \geq h'(x)\) for every \(x\) but \(\Phi_{Frida}(h) < \Phi_{Frida}(h')\). So there could be a “dutch book”/“arbitrage” strategy against Frida where we would buy \(h\) bets from her and sell her \(h'\) bets and be guaranteed to make money regardless of the value of \(\theta\).

The discussion so far ignored the question of efficiency. For Frida, requiring the map \(h \mapsto \Phi_{Frida}(h)\) to be efficient restricts the set of estimation procedures that she can use. For Betty, this is more complicated, since this restriction can (and often does) rule out the unique estimation procedure she is prescribed. In such a case, typically Betty would compute her estimate using an efficiently sampleable distribution \(\cD'\) that is not the true posterior \(\cD|Y=y\) but is hopefully somewhat related to it. She could obtain \(\cD'\) by simplifying the model or her prior (i.e., in effect “forgetting” some information, such as higher order correlations, that is hard to incorporate in a computationally efficient manner), or by running a random walk type algorithm that would sample from the true posterior in the limit, but stopping it before it does so. This may make her estimate even less calibrated than it was before, but, since it comes from a probability distribution, it would still be coherent. While for inefficient algorithms, restricting attention to estimation procedures that are based on actual distributions is without loss of generality, as we’ll see, in the efficient case this can come at a significant cost. (This is similar, and related, to the fact that in mechanism design in economics, by the revelation principle we can always have a truthful mechanism if we don’t care about efficiency, but for efficient mechanisms this is not necessarily the case (Conitzer and Sandholm 2004).) In particular, as we will see, it is possible that no sampleable distribution will match the observations \(y\), and hence, by observing \(y\), we could have a “dutch book”/“arbitrage” strategy that makes money off Betty without any risk.

The sum of squares algorithm can be thought of as some “hybrid” between the frequentist and Bayesian approaches. As in the Bayesian case, the sos algorithm is a single algorithm, and we can write an sos program such that the pseudo-distributions solving it would correspond to the Bayesian procedure as the degree \(d\) goes to infinity. (The key notion here is that since the set of pseudo-distributions over pairs \((\theta,x)\) that satisfy the observations \(y\) is convex, we can minimize over it the convex function corresponding to a distance from our prior.) However, when we want an efficient algorithm, sos turns into a frequentist procedure, since rather than coming up with an actual probability distribution \(\cD'\) that “approximately” matches the observations, we come up with a pseudo-distribution that exactly matches them. Since it is a pseudo-distribution, it does not (and generally will not) satisfy the coherence property, but it will satisfy degree \(d\) pseudo coherence, in the sense that if there is a degree \(d\) proof that \(h \geq h'\) then our price for \(h\) will always be at least as large as our price for \(h'\). Moreover, it will also satisfy a degree \(d\) pseudo calibration condition, in the sense that if we have a degree \(d\) sos proof that conditioned on \(y\), for every \(\theta\), \(h(x) \in [\alpha,\beta]\), then the estimate will also fall in this interval. We now elaborate on this more, using another example: that of the clique problem.

Contrasting sos and Bayesian analysis: an example

Consider the following: we are given a graph \(G\) on \(n\) vertices that represents some data we have gathered. For example, the vertices of \(G\) might be certain proteins, and edges correspond to interactions between them that have been observed in patients with disease X. Suppose that we posit that the underlying mechanism of the disease is captured by a set of \(k\) proteins that all interact with one another, that is, a clique in the graph \(G\), which we represent as a vector \(x\in\bits^n\) such that \(\sum x_i = k\) and \(x_ix_j =0\) if \(i\) is not connected to \(j\).

Now suppose that we have \(N\) different candidate drugs, each of which will either cure the disease or not depending on the underlying mechanism. So we can think of these drugs as \(N\) functions \(f_1,\ldots,f_N\from \bits^n\to\bits\) where \(f_\ell(x)=1\) if and only if drug \(\ell\) will cure the disease if its mechanism is \(x\). We now want to understand which drug it is most worthwhile to perform an experiment on.

In the classical Bayesian approach, we assume that there is some underlying prior distribution \(p\) on pairs \((G,x)\) where \(x\) is a \(k\)-clique in \(G\). (In the notation above, the clique \(x\) would correspond to the hidden parameters \(\theta\), while the graph would correspond to the observations \(y\).) For example, we can take the “maximum entropy” prior in which \(x\) is a uniformly random vector in \(\bits^n\) of weight \(k\), and \(G\) is a random Erdős–Rényi graph in which \(x\) is “planted” as a clique. Once we observe \(G\), we then define the posterior distribution \(p(x|G)=p(x,G)/p(G)\) using Bayes’ law. Now we can define the expected utility of drug \(\ell\) as \(\E_{x\sim p(x|G)} f_\ell(x)\).

The problem is of course that we can’t efficiently sample from this posterior distribution on cliques. In fact, there is not much unique about this problem: it is very often the case that posterior distributions are computationally hard to sample from. For this reason, Bayesian analysts often use efficient approximations for the posterior. For example, one might sample \(x\) from a random walk that converges to \(p(x|G)\) in an exponential number of steps, but stop it much before convergence happens, or one might use some other type of algorithm to obtain some approximate distribution. Regardless, we can abstract this as sampling \(x\) by running \(P(G)\) where \(P(\cdot)\) is some efficient probabilistic algorithm, and then estimating the utility of drug \(\ell\) as \(\E_{x\leftarrow P(G)} f_\ell(x)\). The problem is that since the clique problem is NP-hard, with very high probability the output of \(P(G)\) will not be a \(k\)-clique, and in fact we can assume (using standard hardness of approximation results) that the set corresponding to \(x\) contains at least \(k/2\) non-edges. Using this observation, we can find a set \(S\) of about \(2n^2/k\) non-edges of \(G\) such that with probability close to \(1\) the following event \(E\) holds if we sample \(x\) from \(P(G)\): \[ E \;=\; \{ \exists (i,j)\in S \text{ s.t. } x_ix_j = 1 \} \] Now suppose that there is a drug that succeeds if and only if \(E\) occurs. Then while we can plainly see that \(E\) will never happen for the actual posterior, the approximate Bayesian algorithm will think that the drug succeeds with probability close to \(1\)! Note that this cannot be fixed by changing the notion of approximate distribution: every efficient probabilistic model will have the same issue!

The sos approach is different. Rather than producing an actual distribution that attempts to approximate the true posterior, we produce a pseudo-distribution that exactly matches the low degree statistics of the posterior that we can easily derive from the data. In particular, the sos pseudo-distribution will satisfy \(\pE x_ix_j = 0\) for every non-edge \((i,j)\) of the graph, as well as \(\pE (\sum x_i-k)^2 = 0\), even though the only actual distributions over \(\bits^n\) that satisfy these conditions are supported over \(k\)-cliques. If the graph has a unique \(k\)-clique \(x^*\), then the only true distribution supported on \(k\)-cliques satisfies \(\E x_i = x^*_i\) for every \(i\), and so in particular \(\E x_i\) equals either zero or one. Generally speaking, initially the sos pseudo-distribution will not satisfy this property, as it will reflect the uncertainty that a computationally bounded observer has about the clique, but as we increase the degree parameter \(d\), for every \(i\) the value of \(\pE x_i\) will become closer and closer to either zero or one, until eventually it will converge to \(x^*_i\); see the figure below:

A random graph with a hidden clique. The sum-of-squares algorithm maintains a set of beliefs about which vertices belong to the hidden clique. Despite learning no new information, as we invest more computation time, the algorithm reduces uncertainty in the beliefs by making them consistent with increasingly powerful proof systems. Initially the beliefs have maximum uncertainty and correspond to the uniform distribution but they eventually converge on the correct hidden clique (red edges).
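The moment constraints just described translate directly into a semidefinite program. Here is a minimal sketch, along the lines of the earlier one; the helper name clique_beliefs is ours for illustration, the cvxpy and networkx packages are assumed available, and vertices are assumed to be labeled \(0,\ldots,n-1\).

```python
# Degree-2 pseudo-distribution for "x is a k-clique in G":
#   pE x_i x_j = 0 for every non-edge (i,j),  pE (sum_i x_i - k)^2 = 0.
import cvxpy as cp
import networkx as nx

def clique_beliefs(G: nx.Graph, k: int):
    n = G.number_of_nodes()
    M = cp.Variable((n + 1, n + 1), symmetric=True)   # moments of (1, x_1, ..., x_n)
    cons = [M >> 0, M[0, 0] == 1]
    cons += [M[i + 1, i + 1] == M[0, i + 1] for i in range(n)]        # x_i^2 = x_i
    cons += [M[i + 1, j + 1] == 0                                      # non-edges
             for i in range(n) for j in range(i + 1, n) if not G.has_edge(i, j)]
    s = sum(M[0, i + 1] for i in range(n))            # pE sum_i x_i
    q = sum(M[i + 1, j + 1] for i in range(n) for j in range(n))  # pE (sum_i x_i)^2
    cons += [q - 2 * k * s + k * k == 0]              # pE (sum_i x_i - k)^2 = 0
    cp.Problem(cp.Minimize(0), cons).solve()          # feasibility problem
    return M.value[0, 1:]                             # the beliefs pE x_i (None if infeasible)
```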

For every function \(f\), if our observations imply, using a low degree sos argument, that \(f(x) \leq \alpha\), then our estimate \(\pE f(x)\) will be at most \(\alpha\). This yields a strategy that, in a way analogous to the optimal Bayesian posterior, can’t be “easily dutch booked”, in the sense that one could not find \(f,g\) such that there is a simple proof that \(f \leq g\) but \(\pE f > \pE g\), where “simple” is defined as low degree sos. (See also our discussion on economics below.) As a computationally bounded strategy, sos can and will be wrong in its predictions, but at least it is consistently wrong within a well defined system of making inferences, and it provides some non-trivial rigorous guarantees about its quality.

The economic view of Bayesian probabilities

There is a deep relation between Bayesian beliefs and rational strategy. The standard way to model an economic agent is as a computationally unbounded entity whose goal is to maximize its expected utility. This expectation is taken over the beliefs of the agent, which are based on combining a prior with the observations. Computational issues arise in economic settings as well.

One natural such setting is in the context of prediction markets (or stock markets more generally). Suppose there is a marketplace to place bets on some events \(A_1,\ldots,A_n\) (for example, event \(A_i\) could be that candidate X wins in the \(i\)-th state in an election, or that company \(i\) outperforms its expectations) where each bet will pay one dollar if the corresponding event materializes. The question of setting the right price for each of these events is highly non-trivial when they are correlated. For example, if we know that \(A_2\) is implied by \(A_1\), then we must set the price for \(A_2\) at least as high as that for \(A_1\), as otherwise we create an arbitrage opportunity (also known as a dutch book) to make guaranteed profit without risk. Indeed, the non-existence of such opportunities is a cornerstone of economics known as the Efficient Market Hypothesis; it can be proven if all participants are unbiased rational agents, but the extent to which it applies in the real world is a subject of intense debate. However, what if an implication such as \(A_1 \Rightarrow A_2\) is computationally hard to verify? For example, it may be that both \(A_1\) and \(A_2\) have some complex dependency structure on some underlying assets, in a way that the statement \(A_1 \Rightarrow A_2\) would end up being equivalent to some SAT formula being unsatisfiable. In such a case we cannot expect the prices set by the market to always respect this implication.

SOS pseudo-distributions can be thought of as ensuring that there is no computationally easy dutch book strategy. For example, think of a stock market that is based on some underlying set of events captured by \(x \in \bits^n\), and where you can buy, for every function \(f:\bits^n\rightarrow\R\), a stock that will pay you \(f(x)\) dollars when \(x\) is revealed. An arbitrage or dutch book strategy could arise when there are two functions such that \(f(x) \geq g(x)\) for every \(x\) but the stock corresponding to \(f\) is cheaper than the stock corresponding to \(g\). While in general we cannot guarantee that there is no such arbitrage, if we set the price of the “\(f\)-stock” to be \(\pE f\), then at least we know that if the \(f\)-stock is cheaper than the \(g\)-stock, then there is no short sos proof that \(f\geq g\).
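The dutch book itself is mechanical once such a mispricing exists; the toy sketch below (with hypothetical prices) shows the guaranteed profit:

```python
# If f dominates g pointwise but the market prices f below g, then buying f and
# selling g costs price_f - price_g < 0 now and pays f(x) - g(x) >= 0 later:
# guaranteed profit, whatever x turns out to be.
def dutch_book_profit(f, g, price_f, price_g, outcomes):
    if all(f(x) >= g(x) for x in outcomes) and price_f < price_g:
        return price_g - price_f       # cash pocketed up front, no downside risk
    return 0.0                         # no pointwise-dominance arbitrage here

outcomes = [(a, b) for a in (0, 1) for b in (0, 1)]       # x in {0,1}^2
f = lambda x: x[0] + x[1]              # f(x) >= g(x) for every x
g = lambda x: x[0] * x[1]
print(dutch_book_profit(f, g, price_f=0.7, price_g=0.9, outcomes=outcomes))  # 0.2
```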

The existence of arbitrage in markets is an actual issue, and there are documented instances of arbitrage opportunities in prediction markets and stock markets (and there are companies that spend immense computational resources on finding them and quickly exploiting them). In fact, finding and removing arbitrage opportunities is the main algorithmic challenge in creating combinatorial prediction markets that allow participants to place bets on arbitrarily complex combinations of underlying basic events; see also Kroer et al. (2016).

A recent paper by Garrabrant et al. (2016) proposes a more general, Turing machine based, way of defining probabilities/prices for a “prediction market for mathematical statements” in a way that ensures there is no computationally easy arbitrage/dutch book strategy, regardless of the algorithm used to find it. Trying to compare the two approaches, as well as using other algorithmic frameworks as a source for defining computationally efficient Bayesian probabilities, is a very interesting research direction.

Other algorithms as a source for Bayesian probabilities.

It is an interesting and yet largely unexplored area to understand to what extent algorithms other than the sos algorithm give rise to internally consistent “probabilities” that can be interpreted as describing the beliefs of computationally bounded agents. When we run an algorithm for the task of recovering an unknown set of parameters \(x\) from observations \(y\), even if the algorithm is not successful, it might be possible to interpret its internal state as codifying some partial information about \(x\). Some algorithms, such as belief propagation, come with a fairly explicit mapping between the internal state and Bayesian probabilities. Similarly, algorithms based on Markov chain Monte Carlo, or statistical-physics inspired algorithms that converge as a certain “temperature parameter” decreases, can be interpreted as getting closer and closer to the ground truth. Recently, people have been able to interpret the internal state (i.e., intermediate neurons) of deep neural networks as encoding certain “beliefs” about the observed data (e.g., see Yosinski et al. (2015)).

Computational Bayesian probabilities and cryptography.

People sometimes describe a cryptographic scheme as offering, say, “\(128\) bits of security”, which roughly speaking means that one could not break the scheme using much fewer than \(2^{128}\) computational operations or with probability much better than \(2^{-128}\). This does not necessarily mean that the secret key is \(128\) bits long, though it definitely needs to be at least that long; for example, the security level of the \(1024\) bit RSA encryption system is typically estimated at about \(80\) bits (Barker et al. 2007).

Cryptography is typically most interested in “zero one scenarios”, where one learns essentially nothing about the underlying secrets using much fewer than \(2^{128}\) operations, and everything after spending more than that. The notion of pseudo-distributions allows us to talk about “softer” scenarios where one can learn non-trivial information about the unknown data.

Beyond philosophy: the Bayes-Marley approach for analyzing SOS

How do we actually use this Bayesian perspective when we design algorithms or lower bounds? The intuition is the following: while SOS is not coherent (i.e., does not correspond to actual distributions), it should be hard for any bounded observer to “catch” it being incoherent (at least up to some small error). So, when we design SOS based algorithms, we can pretend that these pseudo-distributions are actual distributions, as long as we limit ourselves to “bounded reasoning”. Technically, “bounded reasoning” corresponds to SOS proofs of low degree, but in practice it seems that most mathematical arguments we use in these contexts (with the important exception of non constructive techniques such as the probabilistic method) can be “SOS’ed” with low degree. So, the paradigm for designing algorithms is:

Pretend you are working with actual distributions, avoid using the probabilistic method, and every little thing is going to be alright

When designing lower bounds, we need to construct pseudo-distributions. These are not actual distributions, but it should not be easy for bounded observers to “catch” us in a discrepancy. So we need to respect any correlations that a bounded observer could establish. That is, our approach is:

If you can prove that the moments of a true posterior would have to satisfy property \(P\), you’d better ensure that your pseudo distribution satisfies \(P\) as well.

The nice thing about this is that often the properties that a bounded observer can ascertain essentially fix the choice of the pseudo-distribution, and hence all that is left is the (often highly non-trivial) task of proving that this choice satisfies the pseudo-distribution constraints.

At the moment this might seem somewhat vague and abstract, but we will see actual examples of both upper and lower bounds using this approach.

References

Barker, Elaine, William Barker, William Burr, William Polk, and Miles Smid. 2007. “NIST Special Publication 800-57.” NIST Special Publication 800 (57): 1–142.

Conitzer, Vincent, and Tuomas Sandholm. 2004. “Computational Criticisms of the Revelation Principle.” In Proceedings of the 5th ACM Conference on Electronic Commerce (EC-2004), New York, NY, USA, May 17–20, 2004, 262–63. doi:10.1145/988772.988824.

Feller, William. 1968. An Introduction to Probability Theory and Its Applications: Volume I. 3rd ed. John Wiley & Sons, London-New York-Sydney-Toronto.

Garrabrant, Scott, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor. 2016. “Logical Induction.” arXiv preprint arXiv:1609.03543.

Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge University Press.

Kroer, Christian, Miroslav Dudík, Sébastien Lahaie, and Sivaraman Balakrishnan. 2016. “Arbitrage-Free Combinatorial Market Making via Integer Programming.” arXiv preprint arXiv:1606.02825.

Yosinski, Jason, Jeff Clune, Thomas Fuchs, and Hod Lipson. 2015. “Understanding Neural Networks Through Deep Visualization.” In ICML Workshop on Deep Learning.