Limitations of the sum of squares algorithm
In the last lecture we saw how the sum of squares algorithm can achieve non-trivial performance guarantees for several interesting problems. But it is not all-powerful. In this lecture we will see some negative results for the sum of squares algorithm, showing lower bounds on the degree needed to certify bounds for certain problems. In many cases we do not know of any algorithms that do better, but in some cases we do, and we will see examples of both types. In the standard parlance, these negative results are known as integrality gaps, since they are instances in which there is a gap between the value that the pseudo-distribution “pretends” is achievable and the true objective value. (The name “integrality gap”, as well as the related notion of a “rounding algorithm”, arises from the setting of using a linear program as a relaxation of an integer linear program, where the optimal value of the linear programming relaxation is known as the fractional value, and the optimal value of the integer linear program is known as the integral value.)
The cycle as an integrality gap for Cheeger’s Inequality and Max-Cut
Recall that the discrete Cheeger’s inequality states that every \(d\)-regular graph \(G\) with adjacency matrix \(A\) satisfies \(\lambda \geq \Omega(\varphi(G)^2)\), where \(\lambda\) is the second smallest eigenvalue of the normalized Laplacian \(L_G = I - \tfrac{1}{d}A\). It turns out the humble cycle shows that this bound is tight.
Let \(C_n\) be the cycle on \(n\) vertices and let \(L_{C_n}\) be its normalized Laplacian. Then the second smallest eigenvalue of \(L_{C_n}\) is at most \(O(1/n^2)\).
The lemma shows that Cheeger’s inequality is tight for the cycle \(C_n\): every subset \(S\subseteq V(C_n)\) with \(1\le \card{S}\le n-1\) has at least one edge leaving it, and therefore the expansion of \(C_n\) satisfies \(\varphi(C_n) \geq \Omega(1/n)\), which matches the lemma’s bound of \(\lambda \le O(1/n^2) = O(\varphi(C_n)^2)\).
Let \(\omega=e^{2\pi {\mathrm{i}}/n}\in {\mathbb C}\) be a primitive \(n\)-th root of unity. The vector \(v = (\omega^0,\omega^1,\dots,\omega^{n-1})\) is orthogonal to the all-ones vector \(\Ind\), \[ \iprod{v,\Ind} = \sum_{\ell=0}^{n-1} \omega^\ell = \frac {\omega^0 - \omega^n}{1- \omega} = 0\,, \] using the formula for the sum of a geometric progression. At the same time, we can upper bound the quadratic form of \(L_{C_n}\) at \(v\), \[ \begin{aligned} \iprod{v,L_{C_n} v} & = \tfrac12\sum_{\ell=0}^{n-1} \abs{\omega^\ell - \omega^{\ell+1}}^2 \\ & = \tfrac12\abs{1-\omega}^2 \cdot n\\ & \le O(1/n^2)\cdot \norm{v}^2\,. \end{aligned} \] The last step uses \(\abs{1-\omega}^2= O(1/n^2)\) and \(\norm{v}^2=n\). It follows that the second smallest eigenvalue of \(L_{C_n}\) is \(O(1/n^2)\). (In fact, the second smallest eigenvalue of \(L_{C_n}\) is exactly \(\iprod{v,L_{C_n} v} / \norm{v}^2\), and \(v\) is an eigenvector corresponding to this eigenvalue.)
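As a quick numerical sanity check (not part of the proof), the following snippet builds the normalized Laplacian of \(C_n\) and confirms that \(n^2\) times its second smallest eigenvalue stays bounded; the helper function and parameter choices are ours.

```python
import numpy as np

def cycle_normalized_laplacian(n):
    """Normalized Laplacian L = I - A/d of the n-cycle (here d = 2)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
    return np.eye(n) - A / 2

for n in [10, 100, 1000]:
    lam2 = np.sort(np.linalg.eigvalsh(cycle_normalized_laplacian(n)))[1]
    # The lemma predicts lam2 = O(1/n^2); in fact lam2 = 1 - cos(2*pi/n).
    print(n, lam2, n ** 2 * lam2)   # n^2 * lam2 stays bounded (~2*pi^2)
```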
In one of the exercises in the chapter about Max Cut you showed that if a graph \(G\) satisfies \({\mathrm{maxcut}}(G) \leq 1 -\e\) then every degree-\(2\) pseudo-distribution \(\mu\) satisfies \(\pE_\mu f_G(x) \leq (1 - \Omega(\e^2))\cdot \card{E}\). The following lemma shows that this tradeoff is tight for cycles with an odd number of vertices. Concretely, if \(n\) is odd then \(C_n\) is not bipartite, and therefore \({\mathrm{maxcut}}(C_n) \leq 1 - 1/n\) (at most \(n-1\) of the \(n\) edges can be cut). In contrast, there are degree-\(2\) pseudo-distributions \(\mu\) such that \(\pE_\mu f_{C_n}(x)\ge (1-O(1/n^2))\cdot n\).
Let \(n\in \N\) be odd. Then, there exists a degree-\(2\) pseudo-distribution \(\mu\) such that \(\pE_\mu f_{C_n}(x) \geq (1 - O(1/n^2))\cdot n\).
Let \(\omega=e^{2\pi {\mathrm{i}}/n}\in {\mathbb C}\) be as before, and suppose \(n=2k+1\). Let \(u\in {\mathbb C}^n\) be the vector \[ u = (\omega^0,\omega^k, \omega^{2k},\dots,\omega^{(n-1) k})\,. \] Let \(v,w\in \R^n\) be the real and imaginary parts of \(u\), so that \(u=v+{\mathrm{i}}\cdot w\). Let \(X\) be the \(n\)-by-\(n\) positive-semidefinite matrix, \[ X = \dyad v + \dyad w\,. \] Since \(X_{ii}=v_i^2 + w_i^2=\abs{u_i}^2=1\), the diagonal of \(X\) is the all-ones vector \(\Ind\).
We have seen that we can specify a pseudo-distribution \(\mu\) by specifying its pseudo-expectation operator. Specifically, we will fix the pseudo-expectation operator such that \[ \begin{aligned} \pE_{\mu(x)} x &= \tfrac12 \Ind\,, \\ \pE_{\mu(x)} \dyad x &= \tfrac 14 \cdot \dyad{\Ind} + \tfrac14 \cdot X\,. \end{aligned} \]
We leave it as an exercise to the reader to verify that the above is indeed a valid degree-\(2\) pseudo-expectation. (Hint: this holds because the diagonal of the second moment \(\tfrac 14 \cdot \dyad{\Ind} + \tfrac14 \cdot X\) agrees with the mean \(\tfrac12 \Ind\), and the formal covariance \(\pE \dyad{\Paren{x-\tfrac12 \Ind}}=\tfrac14 \cdot X\) is positive semidefinite. See the exercise below.)
It remains to estimate the pseudo-expectation of \(f_{C_n}\) with respect to \(\mu\), \[ \begin{aligned} \pE_{\mu(x)} f_{C_n}(x) & = \tfrac 14 \sum_{\set{i,j}\in E(C_n)} (v_i - v_j)^2 + (w_i-w_j)^2\\ & = \tfrac 14 \sum_{\set{i,j}\in E(C_n)} \abs{u_i - u_j}^2\\ & = \tfrac 14 n \cdot \abs{1-\omega^{k}}^2\ge n \cdot \Paren{1-O\Paren{1/n^2}}\,. \end{aligned} \] In the last step, we use that \[ \begin{split} \abs{1-\omega^{k}}^2 & = (1-\omega^{-k})(1-\omega^k) =2-\omega^{k}-\omega^{-k}\\ & = 2 - 2\cos\tfrac{2\pi k}{n} = 2 + 2 \cos\tfrac\pi n \ge 4-O\Paren{\tfrac 1 {n^2}}\,, \end{split} \] where the middle equality uses \(\tfrac{2\pi k}{n} = \pi - \tfrac\pi n\), which holds since \(n=2k+1\).
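To make the construction concrete, here is a short numerical check (with an arbitrary odd \(n\) of our choosing) that the moment matrix \(\tfrac14\dyad{\Ind}+\tfrac14 X\) is a valid degree-\(2\) second moment and that the resulting pseudo-expectation of \(f_{C_n}\) exceeds the true max cut value of \(n-1\) edges:

```python
import numpy as np

n = 101                                   # any odd n
k = (n - 1) // 2
omega = np.exp(2j * np.pi / n)
u = omega ** (k * np.arange(n))           # u_ell = omega^{ell * k}
v, w = u.real, u.imag
X = np.outer(v, v) + np.outer(w, w)       # psd with all-ones diagonal
M = 0.25 * np.ones((n, n)) + 0.25 * X     # pseudo second-moment matrix

# Validity: diagonal agrees with the mean 1/2, and the formal covariance
# M - (mean)(mean)^T = X/4 is psd.
cov = M - 0.25 * np.ones((n, n))
print(np.allclose(np.diag(M), 0.5), np.linalg.eigvalsh(cov).min() >= -1e-9)

# Pseudo-expectation of f_{C_n}(x) = sum over edges of (x_i - x_j)^2:
edges = [(i, (i + 1) % n) for i in range(n)]
val = sum(M[i, i] + M[j, j] - 2 * M[i, j] for i, j in edges)
print(val, n - 1)                         # exceeds the true max cut n - 1
```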
Show that for every vector \(v\in \R^n\) and every matrix \(M\in \R^{n\times n}\) with \(\diag(M)=v\) and \(M-\dyad v\succeq 0\), there exists a degree-2 pseudo-distribution \(\mu\) over the hypercube \(\bits^n\) with \(\pE_{\mu(x)} x=v\) and \(\pE_{\mu(x)}\dyad x = M\).
Eigenvalues of Cayley graphs
The above ad-hoc computations of eigenvalues are actually special cases of a more general theory. Let \((\Gamma,\cdot)\) be a group and \(S\subseteq \Gamma\). The Cayley graph corresponding to \(\Gamma\) and \(S\), which we denote by \(G(\Gamma,S)\), has vertex set \(\Gamma\) and an edge \(\set{a,b}\) for every \(a,b \in \Gamma\) such that \(ab^{-1} \in S\cup S^{-1}\). The \(n\)-cycle is simply the Cayley graph corresponding to the group \(\Z_n\) (integers in \(\set{0,\ldots,n-1}\) with addition modulo \(n\)) and the set \(S = \set{1}\).
It turns out that for every Abelian group \(\Gamma\) and Cayley graph \(G(\Gamma,S)\), we can explicitly calculate the eigenvectors and eigenvalues of \(G(\Gamma,S)\). The following series of exercises works this out, first for cyclic groups and then for every Abelian group:
Let \(n\in\N\) and \(\omega = e^{2\pi {\mathrm{i}}/n}\). For every \(\alpha \in\Z_n\), we define \(\chi^\alpha\in {\mathbb C}^n\) to be the vector \(\chi^\alpha_j = \omega^{\alpha \cdot j}\). Prove that if \(S\subseteq \Z_n\), and \(A\) is the adjacency matrix of \(G=G(\Z_n,S)\), then \(A\chi^\alpha = \lambda_\alpha \cdot \chi^\alpha\) where \(\lambda_\alpha = \sum_{j\in S}\omega^{\alpha \cdot j}\). (In particular, \(\chi^\alpha\) is an eigenvector of \(A\) with eigenvalue \(\lambda_\alpha\).)
Adjacency matrices of (potentially weighted) Cayley graphs over the group \(\Z_n\) are known as circulant matrices.
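To see the cyclic case in action, the following sketch (the choices of \(n\) and \(S\) are arbitrary) verifies numerically that the characters \(\chi^\alpha\) are eigenvectors of a circulant adjacency matrix. Note that since the graph is undirected, the eigenvalue sums over \(S\cup S^{-1}\); this matches the formula \(\lambda_\alpha = \sum_{j\in S}\omega^{\alpha\cdot j}\) whenever \(S\) is symmetric.

```python
import numpy as np

n, S = 12, [1, 3, 4]                  # an arbitrary choice of n and S
omega = np.exp(2j * np.pi / n)

# Adjacency matrix of G(Z_n, S): edge {a, b} whenever a - b is in S or -S.
A = np.zeros((n, n))
for j in S:
    for a in range(n):
        A[a, (a + j) % n] = A[(a + j) % n, a] = 1

for alpha in range(n):
    chi = omega ** (alpha * np.arange(n))            # the character chi^alpha
    # Undirected edges contribute both j and -j to the eigenvalue.
    lam = sum(omega ** (alpha * j) + omega ** (-alpha * j) for j in S)
    assert np.allclose(A @ chi, lam * chi)
print("all characters are eigenvectors")
```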
Let \(\Gamma\) be an Abelian group of the form \(\Gamma = \Z_{n_1}\times \cdots \times \Z_{n_\ell}\), and let \(\omega_t = e^{2\pi{\mathrm{i}}/n_t}\). For every \(\alpha=(\alpha_1,\ldots,\alpha_\ell) \in \Gamma\), we define a vector \(\chi^{\alpha}\in {\mathbb C}^\Gamma\) (called character) such that for every \(j=(j_1,\ldots,j_\ell)\in \Gamma\) \[ \chi^{\alpha}_j = \prod_{t=1}^\ell \omega_t^{\alpha_t\cdot j_t}\,. \] Prove that if \(S\subseteq \Gamma\) and \(A\) is the adjacency matrix of \(G(\Gamma,S)\), then every vector \(\chi^\alpha\) is an eigenvector of \(A\) with eigenvalue \(\lambda_\alpha = \sum_{j\in S} \prod_{t=1}^\ell \omega_t^{\alpha_t \cdot j_t}\), so that \(A\chi^{\alpha} = \lambda_\alpha\cdot \chi^{\alpha}\).
One of the most common instantiations of this result in computer science is the case that \(\Gamma\) is the Boolean cube \(\bits^\ell\) with the XOR operation. In this case we can think of \(\Gamma\) as \(\Z_2^\ell\), and then, since \(e^{2\pi {\mathrm{i}}/2}=-1\), we get that the eigenvectors of the adjacency matrix of a graph \(G(\bits^\ell,S)\) have the form \(\chi^\alpha\in\R^{\bits^\ell}\) where \(\chi^\alpha_\beta = (-1)^{\iprod{\alpha,\beta}}\) with \(\alpha,\beta\in\bits^\ell\). The corresponding eigenvalue is \(\sum_{\beta\in S} (-1)^{\iprod{\alpha,\beta}}\).
The map that transforms a vector \(v\in{\mathbb C}^\Gamma\) from its representation in the standard basis into its representation in the basis of characters of \(\Gamma\) (which are the eigenvectors of \(G(\Gamma,S)\)) is known as the Fourier transform. For more on this topic, let us point again to the textbook (O’Donnell 2014).
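Here is an analogous numerical check for the Boolean cube (the dimension \(\ell\) and the generator set \(S\) are arbitrary choices of ours), confirming that the characters \(\chi^\alpha_\beta=(-1)^{\iprod{\alpha,\beta}}\) diagonalize the adjacency matrix of \(G(\bits^\ell,S)\):

```python
import numpy as np
from itertools import product

ell = 4
verts = list(product([0, 1], repeat=ell))            # vertices: {0,1}^ell
S = [(1, 0, 0, 0), (0, 1, 0, 0), (1, 1, 1, 0)]       # arbitrary generators

A = np.zeros((2 ** ell, 2 ** ell))
for ai, a in enumerate(verts):
    for s in S:
        b = tuple(x ^ y for x, y in zip(a, s))       # neighbor: a XOR s
        A[ai, verts.index(b)] = 1                    # symmetric, since s = -s

for alpha in verts:
    chi = np.array([(-1) ** sum(x * y for x, y in zip(alpha, beta))
                    for beta in verts])              # chi^alpha, a +-1 vector
    lam = sum((-1) ** sum(x * y for x, y in zip(alpha, s)) for s in S)
    assert np.allclose(A @ chi, lam * chi)
print("all", 2 ** ell, "characters are eigenvectors")
```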
A sharper integrality gap for Max Cut
The odd cycle shows that, at least for degree-\(2\) pseudo-distributions, our analysis was tight up to a constant factor, but it does not yield the optimal constant. However, there is a more sophisticated example, due to Feige and Schechtman (2002), that yields a tight bound. Recall that the approximation ratio was \(\alpha_{{\mathrm{GW}}} = \min_{0< x \leq 1} \tfrac{\arccos(1-2x)}{\pi x} \approx 0.878\) and that the minimum was achieved at \(x_{{\mathrm{GW}}} \approx 0.845\).
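These constants are easy to reproduce numerically. The following computation (the grid resolution is an arbitrary choice of ours) recovers \(\alpha_{{\mathrm{GW}}}\) and \(x_{{\mathrm{GW}}}\), along with the value \(\arccos(1-2x_{{\mathrm{GW}}})/\pi = \alpha_{{\mathrm{GW}}}\cdot x_{{\mathrm{GW}}}\) that will reappear below:

```python
import numpy as np

# Locate alpha_GW = min_{0 < x <= 1} arccos(1 - 2x)/(pi x) on a fine grid.
xs = np.linspace(1e-6, 1.0, 1_000_000)
ratios = np.arccos(1 - 2 * xs) / (np.pi * xs)
i = ratios.argmin()
alpha_gw, x_gw = ratios[i], xs[i]
print(alpha_gw, x_gw)                          # ~0.8786 and ~0.8446
# By definition of the minimizer, arccos(1 - 2 x_GW)/pi = alpha_GW * x_GW:
print(np.arccos(1 - 2 * x_gw) / np.pi, alpha_gw * x_gw)   # both ~0.742
```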
For every \(\e>0\), there exists a graph \(G=(V,E)\) such that \({\mathrm{maxcut}}(G) \leq \alpha_{{\mathrm{GW}}}\cdot x_{{\mathrm{GW}}}+\e\) and there is a degree \(2\) pseudo-distribution \(\mu\) with \(\pE_\mu f_G(x) \geq x_{{\mathrm{GW}}}\).
Let’s think of \(\e\) as some small \(o(1)\) value that we will fix later. Looking at the analysis of the GW rounding algorithm, we see that to prove the theorem we need to come up with a graph \(G\) on \(n\) vertices and a degree-\(2\) pseudo-distribution \(\mu\) on \(\bits^n\) such that \({\mathrm{maxcut}}(G) \leq \alpha_{{\mathrm{GW}}}x_{{\mathrm{GW}}}+o(1)\) but for almost all edges \(\{i,j\}\) of \(G\), \(\pE_{\mu} (x_i-x_j)^2 \geq x_{{\mathrm{GW}}} -o(1)\). By the same calculations we did before, if we assume \(\pE_\mu x_i = 1/2\), then this corresponds to the covariance of \(x_i\) and \(x_j\) satisfying \(\pE_{\mu} (x_i-1/2)(x_j-1/2)\leq (\rho_{{\mathrm{GW}}}+o(1))\cdot\tfrac14\), where \(\rho_{{\mathrm{GW}}} = 1 -2x_{{\mathrm{GW}}}\) (note that \(\rho_{{\mathrm{GW}}}\) is roughly \(-0.69\)). Typically, we think of the graph as fixed and then we come up with the pseudo-distribution, but for this proof we will do this the other way around. We will first come up with \(\mu\) and then define the graph \(G\) to correspond to those pairs \(\{i,j\}\) in which the normalized covariance of \(x_i\) and \(x_j\) is roughly equal to \(\rho_{{\mathrm{GW}}}\). Moreover, borrowing an idea from the rounding algorithm, we will let \(\mu\) be an actual distribution, but one over \(\R^n\) instead of \(\bits^n\). In fact, \(\mu\) will correspond to an actual multivariate Gaussian distribution over \(\R^n\), with \(\E_\mu x_i = 1/2\) for all \(i\) (and hence \(\E_\mu (x_i-1/2)^2 = 1/4\) for all \(i\)).

How do we come up with such a distribution? First note that we can make the number of vertices \(n\) as large as we like as a function of our desired accuracy \(\e\), and hence we will think of \(n\) as very large. In fact, we will think of \(n\) as very, very large: so large that it is practically infinite or even continuous! Concretely, we will identify the vertices of the graph \(G\) with the \((d-1)\)-dimensional unit sphere \(\cS^{d-1}\) in \(\R^d\) (for some dimension parameter \(d\) depending on the desired accuracy); that is, the vertex set is the set of all \(v\in\R^d\) with \(\norm{v}=1\). (The one-dimensional sphere is a circle, the two-dimensional sphere is the boundary of a three-dimensional ball, and so on.)
The set \(E\) of edges will be the set of pairs of vectors \((u,v) \in \cS^{d-1}\times\cS^{d-1}\) such that \(\iprod{u,v} \leq \rho_{{\mathrm{GW}}}+\e\). We can think of the max cut value of \(G\) as the maximum, over all measurable subsets \(S\) of \(\cS^{d-1}\), of the measure of \(E \cap (S \times (\cS^{d-1} \setminus S))\). Ultimately, we will obtain a finite graph by sampling \(n\) such vectors, but as long as \(n\) is large enough (\(n\gg 2^d\) will do) this finite graph will inherit both the max cut value and the pseudo-distribution value. However, the heart of the argument happens in the continuous setting, so you can ignore the final sampling stage for the moment.
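For intuition, one can sample a finite version of this graph directly. The following sketch (the dimension, number of vertices, and \(\e\) are arbitrary choices of ours) draws random unit vectors and keeps the pairs with inner product at most \(\rho_{{\mathrm{GW}}}+\e\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 6, 2000, 0.05
rho_gw = 1 - 2 * 0.8446                      # rho_GW = 1 - 2 x_GW ~ -0.689

# n uniformly random unit vectors in R^d: the vertices of the finite graph.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Edges: pairs whose inner product is at most rho_GW + eps.
G = V @ V.T
I, J = np.triu_indices(n, k=1)
mask = G[I, J] <= rho_gw + eps
print(mask.sum(), "edges; mean inner product on edges:", G[I, J][mask].mean())
```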
We now need to come up with a collection of correlated random variables \(\{ X_v \}_{v\in\cS^{d-1}}\) such that for every \(v\), \(\E X_v = 1/2\), and for every edge \((u,v)\), the normalized covariance of \(X_u\) and \(X_v\) is at most \(\rho_{{\mathrm{GW}}}+O(\e)\). This collection will be very simple: we choose a random standard Gaussian \(g\in\R^d\) (i.e., \(g\) is chosen with \(g_i \sim N(0,1)\) independently for all \(i\)), and for every \(v\in\cS^{d-1}\), we define \(X_v = 1/2 + \iprod{v,g}/2\). Note that \(\E X_v = 1/2\) and that \(\E (X_v-1/2)^2 = \E \iprod{v,g}^2/4 = 1/4\). (Both equations follow from the rotational symmetry of a standard Gaussian, which means that without loss of generality \(v=e_1\), in which case \(\iprod{g,v}=g_1\) is simply a one-dimensional standard Gaussian.)

Now the normalized covariance (subtracting the expectation and dividing by the standard deviation) of \(X_u\) and \(X_v\) corresponds to \(\E \iprod{u,g}\iprod{v,g}\), which by standard manipulations is the same as \(\E \Tr(uv^\top gg^\top)\) (thinking of \(u,v,g\) as column vectors, so that \(uv^\top\) and \(gg^\top\) are \(d\times d\) matrices). But by linearity of trace and expectation this is the same as \(\Tr(uv^\top \E gg^\top)\), and since for a standard Gaussian \(g\), \(\E gg^\top = \Id\), we get that this normalized covariance is equal to \(\Tr(uv^\top)= \iprod{u,v}\), which is at most \(\rho_{{\mathrm{GW}}}+\e\) by our definition of the edge set.
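The trace identity above is also easy to confirm by simulation. In the following sketch (the dimension and sample size are ours), the empirical value of \(\E \iprod{u,g}\iprod{v,g}\) matches \(\iprod{u,v}\):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
u = rng.standard_normal(d); u /= np.linalg.norm(u)
v = rng.standard_normal(d); v /= np.linalg.norm(v)

g = rng.standard_normal((1_000_000, d))   # a million samples of g ~ N(0, Id)
emp = np.mean((g @ u) * (g @ v))          # empirical E <u,g><v,g>
print(emp, u @ v)                         # agree up to ~1/sqrt(10^6) error
```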
The above shows that we have a degree-\(2\) pseudo-distribution \(\mu\) satisfying, for every edge \(\{u,v\} \in E\), \(\pE_\mu (X_u-X_v)^2 \geq x_{{\mathrm{GW}}}-\e\). But we still need to show that the true maximum cut value is at most \(\alpha_{{\mathrm{GW}}}x_{{\mathrm{GW}}} + o(1)\). Luckily, here we can “stand on the shoulders of giants” and use previously known results. Specifically, by the geometric nature of this graph, intuitively the maximal cuts should be obtained via a geometric partition. Indeed, Borell (1975) (and, independently, Sudakov and Cirel\('\)son (1974)) proved that over the unit sphere, when we define the edge set in such geometric terms, the sets that maximize the cut are always spherical caps. That is, the optimal set \(S\) is of the form \(\{ v\in \R^d : \iprod{v,a_0} \geq b_0 \}\) for some \(a_0 \in \R^d\) and \(b_0 \in \R\). Specifically, in this case, one can show that the bipartition maximizing the number of cut edges is a balanced one (where \(S\) and \(\cS^{d-1}\setminus S\) have the same measure) and hence \(b_0=0\). By the rotational symmetry of the sphere (and normalizing \(a_0\)), we can assume without loss of generality that \(a_0\) is simply the first standard basis vector \((1,0,\ldots,0)\). Now, since a random edge \((u,v)\in E\) is chosen by letting \(v\) be a random vector with correlation roughly \(\rho_{{\mathrm{GW}}}\) with \(u\), and since the first coordinate of a random unit vector has (essentially) the Gaussian distribution with mean zero and variance \(1/d\), computing the value of the cut reduces to computing the probability that two \(\rho_{{\mathrm{GW}}}\)-correlated Gaussians disagree in their sign. This latter quantity is exactly what we computed in the last lecture: \(\arccos(\rho_{{\mathrm{GW}}})/\pi = \alpha_{{\mathrm{GW}}}\cdot x_{{\mathrm{GW}}}\).
Based on our construction, we can see that if the final graph has \(n\) vertices, then there is a unit vector \(v_i\) associated with each vertex \(i\), and the pseudo-expectation operator is defined as \(\pE x_ix_j = \tfrac14+\tfrac14\iprod{v_i,v_j}\). The resulting matrix is the sum of the psd all-\(\tfrac14\) matrix plus \(\tfrac14\) times the Gram matrix (i.e., matrix of dot products) of the vectors \(\{v_1,\ldots,v_n\}\). It is not hard to verify that such a matrix is psd; see also the exercises below:
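A quick numerical check of this claim (with arbitrary \(n\), \(d\), and random unit vectors of our choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 6
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # one unit vector per vertex

M = 0.25 * np.ones((n, n)) + 0.25 * (V @ V.T)   # pseudo second-moment matrix
print(np.allclose(np.diag(M), 0.5))             # matches E x_i = E x_i^2 = 1/2
print(np.linalg.eigvalsh(M).min() >= -1e-9)     # psd: sum of two psd matrices
```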
Prove that an \(n\times n\) matrix \(M\) is a psd matrix of rank at most \(d\) if and only if there exist \(v_1,\ldots,v_n \in \R^d\) such that \(M_{i,j}=\iprod{v_i,v_j}\).
Suppose that \(M,N\) are two \(n\times n\) psd matrices such that \(M_{i,j}=\iprod{v_i,v_j}\) and \(N_{i,j}=\iprod{u_i,u_j}\). Show explicitly a tuple of vectors \((w_1,\ldots,w_n)\) such that the psd matrix \(L=M+N\) satisfies \(L_{i,j}=\iprod{w_i,w_j}\).
Isoperimetry, extremal questions, and sum of squares
The result of (Borell 1975; Sudakov and Cirel\('\)son 1974) we used above is one in a long line of work on isoperimetric inequalities and their many generalizations. The classical isoperimetric problem is to prove that among all shapes in the plane with a given area, the disk is the one that minimizes the length of its boundary. This question has been generalized to many other geometric spaces and notions of volume and boundary. Indeed, it corresponds to the question we have already seen of finding the least expanding set in a graph: if we think of a very fine grid graph that discretizes the plane, then the isoperimetric problem corresponds to proving that a (discretized) disk is the set of vertices that minimizes the expansion.
Isoperimetric questions themselves are just a special case of more general questions of finding and characterizing extremal objects. The general setting can be thought of as follows. We have:
- Some “universe” \(\cU\) of possible objects (e.g., all subsets of a graph, all closed curves in the plane, all functions or vectors in some space).
- Some “objective” function \(F:\cU\rightarrow\R\) (e.g., the ratio of boundary to volume, or the sum of violations of some constraints).
- Some “nice” family \(\cS \subseteq \cU\) (e.g., shifts of circles, spherical caps, codewords).
The type of theorems we might want to prove (in increasing order of difficulty, and, often, usefulness) would be:
- An optimality theorem: If \(S\in\cU\) is a global minimum of \(F(\cdot)\) then \(S\in\cS\). As in the isoperimetric case, such optimality theorems are often phrased as inequalities of the form \(F(S) \geq \alpha^*\) for every \(S\in\cU\), where \(\alpha^*\) is the (typically easily computable) minimum of \(F(S)\) among all the “nice” \(S\in\cS\).
- A stability theorem: If \(S\in\cU\) is a “near global minimum” (i.e., \(F(S)\) is close to \(\min_{S'\in\cU} F(S')\)) then \(S\) is “close” to some “nice” \(S^*\in\cS\). The notion of “close” here of course needs to be defined and depends on the context.
- An inverse theorem: If \(S\in\cU\) has “non-trivial \(F(\cdot)\) value” (i.e., \(F(S)\) is significantly smaller than the expected value of \(F(S')\) for a random \(S'\)) then \(S\) is “somewhat correlated” with some “nice” \(S^*\in \cS\). Again, the notion of “somewhat correlated” is context-dependent. Sometimes inverse theorems come with a “list decoding” variant, in which the condition is that the non-trivial value of \(F(S)\) can be explained by expressing \(S\) as a combination of a small number of “nice” objects in \(\cS\); again, what counts as “small” and which combinations are allowed depends on the context.
A related notion is that of structure vs. randomness as discussed by Tao (see for example (Tao 2007b; Tao 2007a; Tao 2008)). Given some object \(S\in\cU\) and some family of tests/objectives \(\cF\), we want to decompose \(S\) into a “structured part” that is some combination of objects from the “nice” family \(\cS\), and a “random part” which we can think of as a “noise” object \(N\) such that for every \(F\in\cF\), \(F(N)\) is close to the expectation of \(F(S')\) over a random \(S'\in\cU\).
There are often algorithmic questions associated with such theorems. One such question is the “decoding” task of finding the “nice” \(S^*\) that is close to an \(S\) with small \(F(\cdot)\) value. Another is the task of, given some description of \(\cU\) and \(F(\cdot)\), certifying an optimality theorem. Indeed, we will see that a very interesting question is often whether such an inequality has a low degree sum of squares proof. Often, an algorithmic proof of an optimality theorem will imply at least a stability theorem if not stronger results. Indeed such algorithmic proofs often give an explicit process for optimizing \(F\) that given any starting point \(S\) ends up in an optimum point. Typically, if the starting point \(S\) already had a pretty good \(F(\cdot)\) value then the algorithm would presumably not take too many steps and hence its final output will be “close” to the initial point.
At this point, when we’ve seen only one concrete example, this discussion might feel somewhat abstract, but many important results, including hypercontractivity, the “invariance principle”, Brascamp-Lieb inequalities, results on list decoding, the Gowers norms, and others can be thought of as falling into this general framework. We will see several other examples of such results in this course.
Beyond degree \(2\)
The examples above show that the rounding algorithms of the previous lecture are tight with respect to degree-\(2\) pseudo-distributions. But of course, we can run the sum of squares algorithm with larger degree. We do pay a price in the running time, but it remains polynomial for every constant degree, and as long as the degree is significantly smaller than \(n\) it is still significantly faster than brute force.
Could it be that using a larger, but still small, degree, we can beat the guarantees for max cut, graph expansion, or Boolean quadratic forms that are achieved respectively by Goemans-Williamson, Cheeger, or Grothendieck? The short answer is that we do not know. It is known that if Khot’s Unique Games Conjecture is true (or the closely related Small Set Expansion Hypothesis), then no polynomial-time (or even \(2^{n^{o(1)}}\)-time) algorithm can beat those guarantees. In particular, for every \(d = n^{o(1)}\) these conjectures predict that we can obtain instances with the same gaps as we showed in this lecture, but with respect not merely to degree \(2\) but to the value achieved by degree-\(d\) pseudo-distributions. However, even for \(d=O(1)\) (even \(d=4\)) this is still wide open. What we do know is that the same examples that we saw in this lecture do not yield such gaps. Here is one example:
Let \(n\) be odd and \(C_n\) be the cycle on \(n\) vertices. Prove that for every degree-\(6\) pseudo-distribution \(\mu\) over \(\bits^n\), \(\pE_\mu f_{C_n} \leq (1-1/n)\card{E}\). (Hint: Start by showing the square triangle inequality for degree-\(6\) pseudo-distributions. That is, prove that for every degree-\(6\) pseudo-distribution \(\mu\) over \(\bits^n\) and every \(i,j,k\in [n]\), \(\pE_\mu (x_i-x_k)^2 \leq \pE_\mu \Paren{(x_i-x_j)^2+(x_j-x_k)^2}\).)
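As a warm-up for the hint, one can verify by brute force that the square triangle inequality holds pointwise on the hypercube (the exercise, of course, asks for the stronger statement that it admits a degree-\(6\) sum of squares proof):

```python
from itertools import product

# Check (x_i - x_k)^2 <= (x_i - x_j)^2 + (x_j - x_k)^2 on all of {0,1}^3.
assert all((xi - xk) ** 2 <= (xi - xj) ** 2 + (xj - xk) ** 2
           for xi, xj, xk in product([0, 1], repeat=3))
print("square triangle inequality holds pointwise on the hypercube")
```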
References
Borell, Christer. 1975. “The Brunn-Minkowski Inequality in Gauss Space.” Invent. Math. 30 (2): 207–16.
Feige, Uriel, and Gideon Schechtman. 2002. “On the Optimality of the Random Hyperplane Rounding Technique for MAX CUT.” Random Struct. Algorithms 20 (3): 403–40.
O’Donnell, Ryan. 2014. Analysis of Boolean Functions. Cambridge University Press, New York. doi:10.1017/CBO9781139814782.
Sudakov, V. N., and B. S. Cirel\('\)son. 1974. “Extremal Properties of Half-Spaces for Spherically Invariant Measures.” Zap. Naučn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. (LOMI) 41: 14–24, 165.
Tao, Terence. 2007a. “Structure and Randomness in Combinatorics.” In FOCS, 3–15. IEEE Computer Society.
———. 2007b. “The Dichotomy Between Structure and Randomness, Arithmetic Progressions, and the Primes.” In International Congress of Mathematicians. Vol. I, 581–608. Eur. Math. Soc., Zürich. doi:10.4171/022-1/22.
———. 2008. Structure and Randomness. American Mathematical Society, Providence, RI. doi:10.1090/mbk/059.