inftheory.tex

   1 %% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-
   2
   3 \documentclass[10pt,a4paper,twoside]{article}
   4 \usepackage[paperheight=18cm,paperwidth=10cm,top=5mm,bottom=20mm,right=5mm,left=5mm]{geometry}
   5 %\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
   6 \usepackage[utf8]{inputenc}
   7 \usepackage{amsmath,amssymb,dsfont}
   8 \usepackage[pdftex]{graphicx}
   9 \usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue]{hyperref}
  10 \usepackage{tikz}
  11 \usetikzlibrary{arrows,arrows.meta,calc}
  12 \usetikzlibrary{patterns,backgrounds}
  13 \usetikzlibrary{positioning,fit}
  14 \usetikzlibrary{shapes.geometric,shapes.multipart}
  15 \usetikzlibrary{patterns.meta,decorations.pathreplacing,calligraphy}
  16 \usetikzlibrary{tikzmark}
  17 \usetikzlibrary{decorations.pathmorphing}
  18 \usepackage[round]{natbib}
  19 %\usepackage{cmbright}
  20 %\usepackage{showframe}
  21
  22 \usepackage{mleftright}
  23
  24 \newcommand{\setmuskip}[2]{#1=#2\relax}
  25 \setmuskip{\thinmuskip}{1.5mu} % by default it is equal to 3 mu
  26 \setmuskip{\medmuskip}{2mu} % by default it is equal to 4 mu
  27 \setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu
  28
  29 \setlength{\parindent}{0cm}
  30 \setlength{\parskip}{12pt}
  31 %\renewcommand{\baselinestretch}{1.3}
  32 %\setlength{\tabcolsep}{0pt}
  33 %\renewcommand{\arraystretch}{1.0}
  34
  35 \def\argmax{\operatornamewithlimits{argmax}}
  36 \def\argmin{\operatornamewithlimits{argmin}}
  37 \def\expect{\mathds{E}}
  38
  39 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  40 %% The \todo command
  41 \newcounter{nbdrafts}
  42 \setcounter{nbdrafts}{0}
  43 \makeatletter
  44 \newcommand{\checknbdrafts}{
  45 \ifnum \thenbdrafts > 0
  46 \@latex@warning@no@line{*WARNING* The document contains \thenbdrafts \space draft note(s)}
  47 \fi}
  48 \newcommand{\todo}[1]{\addtocounter{nbdrafts}{1}{\color{red} #1}}
  49 \makeatother
  50 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  51
  52 \begin{document}
  53
  54 Information Theory is awesome so here is a TL;DR about Shannon's entropy.
  55
  56 This field is about quantifying the amount of ``information'' contained
  57 in a signal and how much can be transmitted under certain conditions.
  58
  59 What makes it awesome IMO is that it is very intuitive, and like thermodynamics in Physics it give exact bounds about what is possible or not.
  60
  61 \section{Shannon's Entropy}
  62
  63 This is the key concept from which everything is defined.
  64
  65 Imagine that you have a distribution of probabilities p on a finite
  66 set of symbols and that you generate a stream of symbols by sampling
  67 them one after another independently with that distribution.
  68
  69 To transmit that stream, for instance with bits over a communication
  70 line, you can design a coding that takes into account that the symbols
  71 are not all as probable, and decode on the other side.
  72
  73 For instance if $P('\!\!A')=1/2$, $P('\!\!B')=1/4$, and
  74 $P('\!\!C')=1/4$ you would transmit ``0'' for a ``A'' and ``10'' for a
  75 ``B'' and ``11'' for a ``C'', 1.5 bits on average.
  76
  77 If the symbol is always the same, you transmit nothing, if they are
  78 equiprobable you need $\log_2$(nb symbols) etc.
  79
  80 Shannon's Entropy (in base 2) is the minimum number of bits you have
  81 to emit on average to transmit that stream.
  82
  83 It has a simple formula:
  84 %
  85 \[
  86  H(p) = - \sum_k p(k) \log_2 p(k)
  87 \]
  88 %
  89 where by convention $o \log_2 0 = 0$.
  90
  91 It is often seen as a measure of randomness since the more
  92 deterministic the distribution is, the less you have to emit.
  93
  94 The codings above are "Huffman coding", which reaches the Entropy
  95 bound only for some distributions. The "Arithmetic coding" does it
  96 always.
  97
  98 From this perspective, many quantities have an intuitive
  99 value. Consider for instance sending pairs of symbols (X, Y).
 100
 101 If these two symbols are independent, you cannot do better than sending one and the other separately, hence
 102 %
 103 \[
 104 H(X, H) = H(X) + H(Y).
 105 \]
 106
 107 However, imagine that the second symbol is a function of the first Y=f(X). You just have to send X since Y can be computed from it on the other side.
 108
 109 Hence in that case
 110 %
 111 \[
 112 H(X, Y) = H(X).
 113 \]
 114
 115 An associated quantity is the mutual information between two random
 116 variables, defined with
 117 %
 118 \[
 119 I(X;Y) = H(X) + H(Y) - H(X,Y),
 120 \]
 121 %
 122 that quantifies the amount of information shared by the two variables.
 123
 124 \section{Conditional Entropy}
 125
 126 Conditional entropy is the average of the entropy of the conditional distribution:
 127 %
 128 \begin{align*}
 129 &H(X \mid Y)\\
 130  &= \sum_y p(Y=y) H(X \mid Y=y)\\
 131        &= \sum_y P(Y=y) \sum_x P(X=x \mid Y=y) \log P(X=x \mid Y=y)
 132 \end{align*}
 133
 134 Intuitively it is the [minimum average] number of bits required to describe X given that Y is known.
 135
 136 So in particular, if X and Y are independent, getting the value of $Y$
 137 does not help at all, so you still have to send all the bits for $X$,
 138 hence
 139 %
 140 \[
 141   H(X \mid Y)=H(X)
 142 \]
 143
 144 if X is a deterministic function of Y then
 145 %
 146 \[
 147   H(X \mid Y)=0.
 148 \]
 149
 150 And if you send the bits for Y and then the bits to describe X given
 151 that Y, you have sent (X, Y). Hence we have the chain rule:
 152 %
 153 \[
 154 H(X, Y) = H(Y) + H(X \mid Y).
 155 \]
 156
 157 And then we get
 158 %
 159 \begin{align*}
 160 I(X;Y) &= H(X) + H(Y) - H(X,Y)\\
 161        &= H(X) + H(Y) - (H(Y) + H(X \mid Y))\\
 162        &= H(X) - H(X \mid Y).
 163 \end{align*}
 164
 165 \section{Kullback-Leibler divergence}
 166
 167 Imagine that you encode your stream thinking it comes from
 168 distribution $q$ while it comes from $p$. You would emit more bits than
 169 the optimal $H(p)$, and that supplement is $D_{KL}(p||q)$ the
 170 Kullback-Leibler divergence between $p$ and $q$.
 171
 172 In particular if $p=q$
 173 %
 174 \[
 175  D_{KL}(p\|q)=0,
 176 \]
 177 %
 178 and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot encode it and
 179 %
 180 \[
 181  D_{KL}(p\|q)=+\infty.
 182 \]
 183
 184 Its formal expression is
 185 %
 186 \[
 187 D_{KL}(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
 188 \]
 189 %
 190 that can be understood as a value called the cross-entropy between $p$ and $q$
 191 %
 192 \[
 193 H(p,q) = -\sum_x p(x) \log q(x)
 194 \]
 195 %
 196 minus the entropy of p
 197 \[
 198 H(p) = -\sum_x p(x) \log p(x).
 199 \]
 200
 201 Notation horror: if $X$ and $Y$ are random variables $H(X, Y)$ is the entropy of their joint law, and if $p$ and $q$ are distributions, $H(p,q)$ is the cross-entropy between them.
 202 \end{document}