inftheory.tex

   1 %% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-
   2
   3 \documentclass[10pt,a4paper,twoside]{article}
   4 \usepackage[paperheight=18cm,paperwidth=10cm,top=5mm,bottom=20mm,right=5mm,left=5mm]{geometry}
   5 %\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
   6 \usepackage[utf8]{inputenc}
   7 \usepackage{amsmath,amssymb,dsfont}
   8 \usepackage[pdftex]{graphicx}
   9 \usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue]{hyperref}
  10 \usepackage{tikz}
  11 \usetikzlibrary{arrows,arrows.meta,calc}
  12 \usetikzlibrary{patterns,backgrounds}
  13 \usetikzlibrary{positioning,fit}
  14 \usetikzlibrary{shapes.geometric,shapes.multipart}
  15 \usetikzlibrary{patterns.meta,decorations.pathreplacing,calligraphy}
  16 \usetikzlibrary{tikzmark}
  17 \usetikzlibrary{decorations.pathmorphing}
  18 \usepackage[round]{natbib}
  19 %\usepackage{cmbright}
  20 %\usepackage{showframe}
  21
  22 \usepackage{mleftright}
  23
  24 \newcommand{\setmuskip}[2]{#1=#2\relax}
  25 \setmuskip{\thinmuskip}{1.5mu} % by default it is equal to 3 mu
  26 \setmuskip{\medmuskip}{2mu} % by default it is equal to 4 mu
  27 \setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu
  28
  29 \setlength{\parindent}{0cm}
  30 \setlength{\parskip}{12pt}
  31 %\renewcommand{\baselinestretch}{1.3}
  32 %\setlength{\tabcolsep}{0pt}
  33 %\renewcommand{\arraystretch}{1.0}
  34
  35 \def\argmax{\operatornamewithlimits{argmax}}
  36 \def\argmin{\operatornamewithlimits{argmin}}
  37 \def\expect{\mathds{E}}
  38
  39 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  40 %% The \todo command
  41 \newcounter{nbdrafts}
  42 \setcounter{nbdrafts}{0}
  43 \makeatletter
  44 \newcommand{\checknbdrafts}{
  45 \ifnum \thenbdrafts > 0
  46 \@latex@warning@no@line{*WARNING* The document contains \thenbdrafts \space draft note(s)}
  47 \fi}
  48 \newcommand{\todo}[1]{\addtocounter{nbdrafts}{1}{\color{red} #1}}
  49 \makeatother
  50 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  51
  52 \begin{document}
  53
  54 Information Theory is awesome so here is a TL;DR about Shannon's entropy.
  55
  56 This field is about quantifying the amount ``of information'' contained
  57 in a signal and how much can be transmitted under certain conditions.
  58
  59 What makes it awesome IMO is that it is very intuitive, and like thermodynamics in Physics it give exact bounds about what is possible or not.
  60
  61 \section{Shannon's Entropy}
  62
  63 This is the key concept from which everything is defined.
  64
  65 Imagine that you have a distribution of probabilities p on a finite set of symbols and that you generate a stream of symbols by sampling them one after another independently with that distribution.
  66
  67 To transmit that stream, for instance with bits over a communication line, you can design a coding that takes into account that the symbols are not all as probable, and decode on the other side.
  68
  69 For instance if $P('\!\!A')=1/2$, $P('\!\!B')=1/4$, and
  70 $P('\!\!C')=1/4$ you would transmit ``0'' for a ``A'' and ``10'' for a
  71 ``B'' and ``11'' for a ``C'', 1.5 bits on average.
  72
  73 If the symbol is always the same, you transmit nothing, if they are equiprobable you need $\log_2$(nb symbols) etc.
  74
  75 Shannon's Entropy (in base 2) is the minimum number of bits you have to emit on average to transmit that stream.
  76
  77 It has a simple formula:
  78 %
  79 \[
  80  H(p) = - \sum_k p(k) \log_2 p(k)
  81 \]
  82 %
  83 where by convention $o \log_2 0 = 0$.
  84
  85 It is often seen as a measure of randomness since the more deterministic the distribution is, the less you have to emit.
  86
  87 The codings above are "Huffman coding", which reaches the Entropy
  88 bound only for some distributions. The "Arithmetic coding" does it
  89 always.
  90
  91 From this perspective, many quantities have an intuitive
  92 value. Consider for instance sending pairs of symbols (X, Y).
  93
  94 If these two symbols are independent, you cannot do better than sending one and the other separately, hence
  95 %
  96 \[
  97 H(X, H) = H(X) + H(Y).
  98 \]
  99
 100 However, imagine that the second symbol is a function of the first Y=f(X). You just have to send X since Y can be computed from it on the other side.
 101
 102 Hence in that case
 103 %
 104 \[
 105 H(X, Y) = H(X).
 106 \]
 107
 108 An associated quantity is the mutual information between two random
 109 variables, defined with
 110 %
 111 \[
 112 I(X;Y) = H(X) + H(Y) - H(X,Y),
 113 \]
 114 %
 115 that quantifies the amount of information shared by the two variables.
 116
 117 \section{Conditional Entropy}
 118
 119 Okay given the visible interest for the topic, an addendum: Conditional entropy is the average of the entropy of the conditional distribution:
 120 %
 121 \begin{align*}
 122 &H(X \mid Y)\\
 123  &= \sum_y p(Y=y) H(X \mid Y=y)\\
 124        &= \sum_y P(Y=y) \sum_x P(X=x \mid Y=y) \log P(X=x \mid Y=y)
 125 \end{align*}
 126
 127 Intuitively it is the [minimum average] number of bits required to describe X given that Y is known.
 128
 129 So in particular, if X and Y are independent
 130 %
 131 \[
 132   H(X \mid Y)=H(X)
 133 \]
 134
 135 if X is a deterministic function of Y then
 136 %
 137 \[
 138   H(X \mid Y)=0
 139 \]
 140
 141 And since if you send the bits for Y and then the bits to describe X given that X is known you have sent (X, Y), we have the chain rule:
 142 %
 143 \[
 144 H(X, Y) = H(Y) + H(X \mid Y).
 145 \]
 146
 147 And then we get
 148 %
 149 \begin{align*}
 150 I(X;Y) &= H(X) + H(Y) - H(X,Y)\\
 151        &= H(X) + H(Y) - (H(Y) + H(X \mid Y))\\
 152        &= H(X) - H(X \mid Y).
 153 \end{align*}
 154
 155 \section{Kullback-Leibler divergence}
 156
 157 Imagine that you encode your stream thinking it comes from
 158 distribution $q$ while it comes from $p$. You would emit more bits than
 159 the optimal $H(p)$, and that supplement is $D_{KL}(p||q)$ the
 160 Kullback-Leibler divergence between $p$ and $q$.
 161
 162 In particular if $p=q$
 163 %
 164 \[
 165  D_{KL}(p\|q)=0,
 166 \]
 167 %
 168 and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot encode it and
 169 %
 170 \[
 171  D_{KL}(p\|q)=+\infty.
 172 \]
 173
 174 Its formal expression is
 175 %
 176 \[
 177 D_{KL}(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
 178 \]
 179 %
 180 that can be understood as a value called the cross-entropy between $p$ and $q$
 181 %
 182 \[
 183 H(p,q) = -\sum_x p(x) \log q(x)
 184 \]
 185 %
 186 minus the entropy of p
 187 \[
 188 H(p) = -\sum_x p(x) \log p(x).
 189 \]
 190
 191 Notation horror: if $X$ and $Y$ are random variables $H(X, Y)$ is the entropy of their joint law, and if $p$ and $q$ are distributions, $H(p,q)$ is the cross-entropy between them.
 192 \end{document}