inftheory.tex

   1 %% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-
   2
   3 %% Any copyright is dedicated to the Public Domain.
   4 %% https://creativecommons.org/publicdomain/zero/1.0/
   5 %% Written by Francois Fleuret <francois@fleuret.org>
   6
   7 \documentclass[10pt,a4paper,twoside]{article}
   8 \usepackage[paperheight=18cm,paperwidth=10cm,top=5mm,bottom=20mm,right=5mm,left=5mm]{geometry}
   9 %\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
  10 \usepackage[utf8]{inputenc}
  11 \usepackage{amsmath,amssymb,dsfont}
  12 \usepackage[pdftex]{graphicx}
  13 \usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue]{hyperref}
  14 \usepackage{tikz}
  15 \usetikzlibrary{arrows,arrows.meta,calc}
  16 \usetikzlibrary{patterns,backgrounds}
  17 \usetikzlibrary{positioning,fit}
  18 \usetikzlibrary{shapes.geometric,shapes.multipart}
  19 \usetikzlibrary{patterns.meta,decorations.pathreplacing,calligraphy}
  20 \usetikzlibrary{tikzmark}
  21 \usetikzlibrary{decorations.pathmorphing}
  22 \usepackage[round]{natbib}
  23
  24 \usepackage{mleftright}
  25
  26 \newcommand{\setmuskip}[2]{#1=#2\relax}
  27 \setmuskip{\thinmuskip}{1.5mu} % by default it is equal to 3 mu
  28 \setmuskip{\medmuskip}{2mu} % by default it is equal to 4 mu
  29 \setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu
  30
  31 \setlength{\parindent}{0cm}
  32 \setlength{\parskip}{12pt}
  33 %\renewcommand{\baselinestretch}{1.3}
  34 %\setlength{\tabcolsep}{0pt}
  35 %\renewcommand{\arraystretch}{1.0}
  36
  37 \def\argmax{\operatornamewithlimits{argmax}}
  38 \def\argmin{\operatornamewithlimits{argmin}}
  39
  40 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  41
  42 \def\given{\,\middle\vert\,}
  43 \def\proba{\operatorname{P}}
  44 \newcommand{\seq}{{S}}
  45 \newcommand{\expect}{\mathds{E}}
  46 \newcommand{\variance}{\mathds{V}}
  47 \newcommand{\empexpect}{\hat{\mathds{E}}}
  48 \newcommand{\mutinf}{\mathds{I}}
  49 \newcommand{\empmutinf}{\hat{\mathds{I}}}
  50 \newcommand{\entropy}{\mathds{H}}
  51 \newcommand{\empentropy}{\hat{\mathds{H}}}
  52 \newcommand{\ganG}{\mathbf{G}}
  53 \newcommand{\ganD}{\mathbf{D}}
  54 \newcommand{\ganF}{\mathbf{F}}
  55
  56 \newcommand{\dkl}{\mathds{D}_{\mathsf{KL}}}
  57 \newcommand{\djs}{\mathds{D}_{\mathsf{JS}}}
  58
  59 \newcommand*{\vertbar}{\rule[-1ex]{0.5pt}{2.5ex}}
  60 \newcommand*{\horzbar}{\rule[.5ex]{2.5ex}{0.5pt}}
  61
  62 \def\positionalencoding{\operatorname{pos-enc}}
  63 \def\concat{\operatorname{concat}}
  64 \def\crossentropy{\LL_{\operatorname{ce}}}
  65
  66 \begin{document}
  67
  68 \vspace*{1ex}
  69
  70 \begin{center}
  71 {\Large Some bits of Information Theory}
  72
  73 \today
  74 \end{center}
  75
  76 Information Theory is awesome so here is a TL;DR about Shannon's entropy.
  77
  78 The field is originally about quantifying the amount of
  79 ``information'' contained in a signal and how much can be transmitted
  80 under certain conditions.
  81
  82 What makes it awesome IMO is that it is very intuitive, and like
  83 thermodynamics in Physics, it gives exact bounds about what is possible
  84 or not.
  85
  86 \section{Shannon's Entropy}
  87
  88 Shannon's entropy is the key concept from which everything is defined.
  89
  90 Imagine that you have a distribution of probabilities $p$ on a finite
  91 set of symbols, and that you generate a stream of symbols by sampling
  92 them one after another independently with that distribution.
  93
  94 To transmit that stream, for instance with bits over a communication
  95 line, you can design a coding that takes into account that the symbols
  96 are not all as probable, and decode on the other side.
  97
  98 For instance if $\proba('\!\!A')=1/2$, $\proba('\!\!B')=1/4$, and
  99 $\proba('\!\!C')=1/4$ you would transmit ``0'' for a ``A'' and ``10'' for a
 100 ``B'' and ``11'' for a ``C'', 1.5 bits on average.
 101
 102 If the symbol is always the same, you transmit nothing, if they are
 103 equiprobable you need $\log_2$(nb symbols) etc.
 104
 105 Shannon's Entropy (in base 2) is the minimum number of bits you have
 106 to emit on average per symbol to transmit that stream.
 107
 108 It has a simple analytical form:
 109 %
 110 \[
 111  \entropy(p) = - \sum_k p(k) \log_2 p(k)
 112 \]
 113 %
 114 where by convention $0 \log_2 0 = 0$.
 115
 116 It is often seen as a measure of randomness since the more
 117 deterministic the distribution is, the less you have to emit.
 118
 119 The examples above correspond to "Huffman coding", which reaches the
 120 Entropy bound only for some distributions. A more sophisticated scheme
 121 called "Arithmetic coding" does it always.
 122
 123 From this perspective, many quantities have an intuitive
 124 value. Consider for instance sending pairs of symbols $(X, Y)$.
 125
 126 If these two symbols are independent, you cannot do better than
 127 sending one and the other separately, hence
 128 %
 129 \[
 130 \entropy(X, Y) = \entropy(X) + \entropy(Y).
 131 \]
 132
 133 However, imagine that the second symbol is a function of the first
 134 Y=f(X). You just have to send $X$ since $Y$ can be computed from it on the
 135 other side.
 136
 137 Hence in that case
 138 %
 139 \[
 140 \entropy(X, Y) = \entropy(X).
 141 \]
 142
 143 An associated quantity is the mutual information between two random
 144 variables, defined with
 145 %
 146 \[
 147 \mutinf(X;Y) = \entropy(X) + \entropy(Y) - \entropy(X,Y),
 148 \]
 149 %
 150 that quantifies the amount of information shared by the two variables.
 151
 152 \section{Conditional Entropy}
 153
 154 Conditional entropy is the average of the entropy of the conditional distribution:
 155 %
 156 \begin{align*}
 157  & \entropy(X \mid Y)                        \\
 158  & = \sum_y \proba(Y=y) \entropy(X \mid Y=y) \\
 159  & = \sum_y \proba(Y=y) \sum_x \proba(X=x \mid Y=y) \log \proba(X=x \mid Y=y)
 160 \end{align*}
 161
 162 Intuitively it is the [minimum average] number of bits required to describe $X$ given that $Y$ is known.
 163
 164 So in particular, if $X$ and $Y$ are independent, getting the value of $Y$
 165 does not help at all, so you still have to send all the bits for $X$,
 166 hence
 167 %
 168 \[
 169   \entropy(X \mid Y)=\entropy(X),
 170 \]
 171 %
 172 and if $X$ is a deterministic function of $Y$ then
 173 %
 174 \[
 175   \entropy(X \mid Y)=0.
 176 \]
 177
 178 And if you send the bits for $Y$ and then the bits to describe $X$ given
 179 that $Y$, you have sent $(X, Y)$. Hence we have the chain rule:
 180 %
 181 \[
 182 \entropy(X, Y) = \entropy(Y) + \entropy(X \mid Y).
 183 \]
 184
 185 And then we get
 186 %
 187 \begin{align*}
 188 I(X;Y) &= \entropy(X) + \entropy(Y) - \entropy(X,Y)\\
 189        &= \entropy(X) + \entropy(Y) - (\entropy(Y) + \entropy(X \mid Y))\\
 190        &= \entropy(X) - \entropy(X \mid Y).
 191 \end{align*}
 192
 193 \section{Kullback-Leibler divergence}
 194
 195 Imagine that you encode your stream thinking it comes from
 196 distribution $q$ while it comes from $p$. You would emit more bits
 197 than the optimal $\entropy(p)$, and that excess of bits is
 198 $\dkl(p||q)$ the Kullback-Leibler divergence between $p$ and $q$.
 199
 200 In particular if $p=q$
 201 %
 202 \[
 203  \dkl(p\|q)=0,
 204 \]
 205 %
 206 and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot encode it and
 207 %
 208 \[
 209  \dkl(p\|q)=+\infty.
 210 \]
 211
 212 Its formal expression is
 213 %
 214 \[
 215 \dkl(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
 216 \]
 217 %
 218 that can be understood as a value called the cross-entropy between $p$ and $q$
 219 %
 220 \[
 221 \entropy(p,q) = -\sum_x p(x) \log q(x)
 222 \]
 223 %
 224 minus the entropy of p
 225 \[
 226 \entropy(p) = -\sum_x p(x) \log p(x).
 227 \]
 228
 229 Notation horror: if $X$ and $Y$ are random variables $\entropy(X, Y)$ is the
 230 entropy of their joint law, and if $p$ and $q$ are distributions,
 231 $\entropy(p,q)$ is the cross-entropy between them.
 232
 233 \end{document}