Bretagnolle–Huber inequality


In information theory, the Bretagnolle–Huber inequality bounds the total variation distance between two probability distributions $P$ and $Q$ by a concave and bounded function of the Kullback–Leibler divergence $D_{KL}(P \| Q)$. The bound can be viewed as an alternative to the well-known Pinsker's inequality: when $D_{KL}(P \| Q)$ is large (larger than 2, for instance[1]), Pinsker's inequality is vacuous, while the Bretagnolle–Huber bound remains bounded and hence non-vacuous. It is used in statistics and machine learning to prove information-theoretic lower bounds that rely on hypothesis testing.[2] (The Bretagnolle–Huber–Carol inequality is a related variant: a concentration inequality for multinomially distributed random variables that bounds the total variation distance.)

Formal statement

Preliminary definitions

Let $P$ and $Q$ be two probability distributions on a measurable space $(\mathcal{X}, \mathcal{F})$. Recall that the total variation distance between $P$ and $Q$ is defined by

d_{TV}(P,Q) = \sup_{A \in \mathcal{F}} \{ |P(A) - Q(A)| \}.

The Kullback–Leibler divergence is defined as follows:

D_{KL}(P \| Q) = \begin{cases} \int_{\mathcal{X}} \log\left(\frac{dP}{dQ}\right) dP & \text{if } P \ll Q, \\ +\infty & \text{otherwise.} \end{cases}

In the above, the notation $P \ll Q$ stands for absolute continuity of $P$ with respect to $Q$, and $\frac{dP}{dQ}$ stands for the Radon–Nikodym derivative of $P$ with respect to $Q$.
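For concreteness, the sketch below computes both quantities for two distributions with finite support. It is an illustration added here, not part of the cited sources, and the function names tv_distance and kl_divergence are arbitrary choices.

    import numpy as np

    def tv_distance(p, q):
        # Total variation distance between two discrete distributions:
        # sup_A |P(A) - Q(A)| = (1/2) * sum_x |p(x) - q(x)|.
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return 0.5 * np.abs(p - q).sum()

    def kl_divergence(p, q):
        # D_KL(P || Q) = sum_x p(x) log(p(x)/q(x)), with the convention 0 log 0 = 0,
        # and +infinity when P is not absolutely continuous with respect to Q.
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        if np.any((q == 0) & (p > 0)):
            return np.inf
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    p = [0.5, 0.5]   # a fair coin
    q = [0.8, 0.2]   # a biased coin
    print(tv_distance(p, q), kl_divergence(p, q))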

General statement

The Bretagnolle–Huber inequality says:

d_{TV}(P,Q) \le \sqrt{1 - \exp(-D_{KL}(P \| Q))} \le 1 - \frac{1}{2}\exp(-D_{KL}(P \| Q)).
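As a numerical illustration (not taken from the cited sources), the following sketch compares the Bretagnolle–Huber bound with Pinsker's bound $d_{TV}(P,Q) \le \sqrt{D_{KL}(P \| Q)/2}$ for a pair of Bernoulli distributions whose divergence is large; the chosen parameters are arbitrary.

    import numpy as np

    def kl_bernoulli(p1, p2):
        # D_KL(Ber(p1) || Ber(p2))
        return p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2))

    p1, p2 = 0.5, 0.999                            # means far apart, so the divergence is large
    tv = abs(p1 - p2)                              # exact total variation distance for Bernoullis
    kl = kl_bernoulli(p1, p2)

    bretagnolle_huber = np.sqrt(1 - np.exp(-kl))   # always below 1
    pinsker = np.sqrt(kl / 2)                      # exceeds 1 here, hence vacuous
    print(tv, bretagnolle_huber, pinsker)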

Alternative version

The following version is directly implied by the bound above, but some authors[2] prefer stating it this way. Let $A$ be any event. Then

P(A) + Q(\bar{A}) \ge \frac{1}{2}\exp(-D_{KL}(P \| Q))

where $\bar{A} = \Omega \setminus A$ is the complement of $A$. Indeed, by definition of the total variation distance, for any event $A$,

Q(A) - P(A) \le d_{TV}(P,Q) \le 1 - \frac{1}{2}\exp(-D_{KL}(P \| Q)) = Q(A) + Q(\bar{A}) - \frac{1}{2}\exp(-D_{KL}(P \| Q))

Rearranging, we obtain the claimed lower bound on $P(A) + Q(\bar{A})$.
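This event-wise form can be checked by brute force on a small finite space, as in the following sketch (an added illustration under the assumption of full support; the example distributions and tolerance are arbitrary).

    import itertools
    import numpy as np

    p = np.array([0.6, 0.3, 0.1])    # distribution P on {0, 1, 2}
    q = np.array([0.2, 0.3, 0.5])    # distribution Q on the same space

    kl = float(np.sum(p * np.log(p / q)))   # D_KL(P || Q); both supports are full
    lower = 0.5 * np.exp(-kl)

    # Enumerate every event A (every subset of the support) and verify the bound.
    for r in range(len(p) + 1):
        for A in itertools.combinations(range(len(p)), r):
            A = list(A)
            complement = [i for i in range(len(p)) if i not in A]
            assert p[A].sum() + q[complement].sum() >= lower - 1e-12
    print("P(A) + Q(complement of A) >= exp(-KL)/2 holds for every event A")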

Proof

We prove the main statement following the ideas in Tsybakov's book (Lemma 2.6, page 89),[3] which differ from the original proof[4] (see C. Canonne's note[1] for a modernized transcription of their argument). The proof is in two steps: 1. Prove, using Cauchy–Schwarz, that the total variation distance is related to the Bhattacharyya coefficient (the right-hand side of the inequality):

1 - d_{TV}(P,Q)^2 \ge \left(\int \sqrt{PQ}\right)^2

2. Prove by a clever application of Jensen’s inequality that

\left(\int \sqrt{PQ}\right)^2 \ge \exp(-D_{KL}(P \| Q))
  • Step 1:
First notice that
d_{TV}(P,Q) = 1 - \int \min(P,Q) = \int \max(P,Q) - 1
To see this, denote $A^* = \arg\max_{A \subseteq \Omega} |P(A) - Q(A)|$ and, without loss of generality, assume that $P(A^*) > Q(A^*)$ such that $d_{TV}(P,Q) = P(A^*) - Q(A^*)$. Then we can rewrite
d_{TV}(P,Q) = \int_{A^*} \max(P,Q) - \int_{A^*} \min(P,Q)
And then, adding and removing $\int_{\overline{A^*}} \max(P,Q)$ or $\int_{\overline{A^*}} \min(P,Q)$, we obtain both identities.
Then
1 - d_{TV}(P,Q)^2 = (1 - d_{TV}(P,Q))(1 + d_{TV}(P,Q)) = \int \min(P,Q) \int \max(P,Q) \ge \left(\int \sqrt{\min(P,Q)\max(P,Q)}\right)^2 = \left(\int \sqrt{PQ}\right)^2
because $PQ = \min(P,Q)\max(P,Q)$.
  • Step 2:
We write $(\cdot)^2 = \exp(2\log(\cdot))$ and apply Jensen's inequality:
\left(\int \sqrt{PQ}\right)^2 = \exp\left(2\log\left(\int \sqrt{PQ}\right)\right) = \exp\left(2\log\left(\int P\sqrt{\frac{Q}{P}}\right)\right) = \exp\left(2\log\left(\mathbb{E}_P\left[\sqrt{\left(\frac{P}{Q}\right)^{-1}}\right]\right)\right) \ge \exp\left(\mathbb{E}_P\left[\log\left(\frac{Q}{P}\right)\right]\right) = \exp(-D_{KL}(P \| Q))
Combining the results of steps 1 and 2 leads to the claimed bound on the total variation.
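Both intermediate inequalities can be observed numerically on discrete distributions. The sketch below is an added illustration (with arbitrary example distributions) that computes the Bhattacharyya coefficient $\int \sqrt{PQ}$ and checks steps 1 and 2 as well as the combined bound.

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.3, 0.4, 0.3])

    tv = 0.5 * np.abs(p - q).sum()        # total variation distance
    bc = np.sqrt(p * q).sum()             # Bhattacharyya coefficient, i.e. the integral of sqrt(PQ)
    kl = float(np.sum(p * np.log(p / q)))

    assert 1 - tv**2 >= bc**2 - 1e-12               # step 1: 1 - d_TV^2 >= (int sqrt(PQ))^2
    assert bc**2 >= np.exp(-kl) - 1e-12             # step 2: (int sqrt(PQ))^2 >= exp(-KL)
    assert tv <= np.sqrt(1 - np.exp(-kl)) + 1e-12   # combined Bretagnolle–Huber bound
    print(tv, np.sqrt(1 - np.exp(-kl)))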

Examples of applications

Sample complexity of biased coin tosses

Source:[1] The question is: how many coin tosses do I need to distinguish a fair coin from a biased one? Assume you have two coins, a fair coin (Bernoulli distributed with mean $p_1 = 1/2$) and an $\varepsilon$-biased coin ($p_2 = 1/2 + \varepsilon$). Then, in order to identify the biased coin with probability at least $1 - \delta$ (for some $\delta > 0$), the number of coin tosses $n$ must satisfy

n \ge \frac{1}{2\varepsilon^2}\log\left(\frac{1}{2\delta}\right).

In order to obtain this lower bound, we impose that the total variation distance between the two sequences of $n$ samples is at least $1 - 2\delta$. This is because the total variation distance upper bounds the probability of under- or over-estimating the coins' means. Denote by $P_1^n$ and $P_2^n$ the respective joint distributions of the $n$ coin tosses for each coin. Then we have

(1 - 2\delta)^2 \le d_{TV}(P_1^n, P_2^n)^2 \le 1 - e^{-D_{KL}(P_1^n \| P_2^n)} = 1 - e^{-n D_{KL}(P_1 \| P_2)} = 1 - e^{-\frac{n}{2}\log\left(\frac{1}{1 - 4\varepsilon^2}\right)}

The result is obtained by rearranging the terms.
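The rearrangement can be traced in code. The sketch below is an added illustration: it returns the lower bound on $n$ implied directly by the displayed chain of inequalities (before any simplification of the logarithms), for arbitrary example values of $\varepsilon$ and $\delta$.

    import numpy as np

    def sample_lower_bound(eps, delta):
        # From (1 - 2*delta)^2 <= 1 - exp(-n * KL), with
        # KL = D_KL(Ber(1/2) || Ber(1/2 + eps)) = (1/2) * log(1 / (1 - 4 * eps**2)),
        # rearranging gives n >= log(1 / (1 - (1 - 2*delta)**2)) / KL.
        kl = 0.5 * np.log(1.0 / (1.0 - 4.0 * eps**2))
        return np.log(1.0 / (1.0 - (1.0 - 2.0 * delta)**2)) / kl

    print(sample_lower_bound(eps=0.01, delta=0.05))   # roughly 8.3e3 tosses for this choice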

Information-theoretic lower bound for k-armed bandit games

In multi-armed bandit problems, a lower bound on the minimax regret of any bandit algorithm can be proved using Bretagnolle–Huber and its consequence for hypothesis testing (see Chapter 15 of Bandit Algorithms[2]).

History

The result was first proved in 1979 by Jean Bretagnolle and Catherine Huber, and published in the proceedings of the Strasbourg Probability Seminar.[4] Alexandre Tsybakov's book[3] features an early re-publication of the inequality and its attribution to Bretagnolle and Huber, which is presented as an early and less general version of Assouad's lemma (see notes 2.8). A constant improvement on Bretagnolle–Huber was proved in 2014 as a consequence of an extension of Fano's Inequality.[5]

References

  1. Canonne, Clément (2022). "A short note on an inequality between KL and TV". arXiv:2202.07198 [math.PR].
  2. Lattimore, Tor; Szepesvari, Csaba (2020). Bandit Algorithms (PDF). Cambridge University Press. Retrieved 18 August 2022.
  3. Tsybakov, Alexandre B. (2010). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer. doi:10.1007/b13794. ISBN 978-1-4419-2709-5. OCLC 757859245. S2CID 42933599.
  4. Bretagnolle, J.; Huber, C. (1978). "Estimation des densités : Risque minimax". Séminaire de Probabilités XII. Lecture Notes in Mathematics. Vol. 649. Berlin, Heidelberg: Springer. pp. 342–363. doi:10.1007/bfb0064610. ISBN 978-3-540-08761-8. S2CID 122597694. Retrieved 2022-08-20.
  5. Gerchinovitz, Sébastien; Ménard, Pierre; Stoltz, Gilles (2020-05-01). "Fano's Inequality for Random Variables". Statistical Science. 35 (2). arXiv:1702.05985. doi:10.1214/19-sts716. ISSN 0883-4237. S2CID 15808752.