Update.

author François Fleuret <francois@fleuret.org>

Fri, 19 Jan 2024 19:36:38 +0000 (20:36 +0100)

committer François Fleuret <francois@fleuret.org>

Fri, 19 Jan 2024 19:36:38 +0000 (20:36 +0100)
author François Fleuret <francois@fleuret.org>
Fri, 19 Jan 2024 19:36:38 +0000 (20:36 +0100)
committer François Fleuret <francois@fleuret.org>
Fri, 19 Jan 2024 19:36:38 +0000 (20:36 +0100)
diff --git a/inftheory.tex b/inftheory.tex

index 1933ff4..7c34fce 100644 (file)
--- a/inftheory.tex
+++ b/inftheory.tex
@@ -36,13 +36,42 @@
  
  \def\argmax{\operatornamewithlimits{argmax}}
  \def\argmin{\operatornamewithlimits{argmin}}
  
  \def\argmax{\operatornamewithlimits{argmax}}
  \def\argmin{\operatornamewithlimits{argmin}}
-\def\expect{\mathds{E}}
  
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  
  
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  
+\def\given{\,\middle\vert\,}
+\def\proba{\operatorname{P}}
+\newcommand{\seq}{{S}}
+\newcommand{\expect}{\mathds{E}}
+\newcommand{\variance}{\mathds{V}}
+\newcommand{\empexpect}{\hat{\mathds{E}}}
+\newcommand{\mutinf}{\mathds{I}}
+\newcommand{\empmutinf}{\hat{\mathds{I}}}
+\newcommand{\entropy}{\mathds{H}}
+\newcommand{\empentropy}{\hat{\mathds{H}}}
+\newcommand{\ganG}{\mathbf{G}}
+\newcommand{\ganD}{\mathbf{D}}
+\newcommand{\ganF}{\mathbf{F}}
+
+\newcommand{\dkl}{\mathds{D}_{\mathsf{KL}}}
+\newcommand{\djs}{\mathds{D}_{\mathsf{JS}}}
+
+\newcommand*{\vertbar}{\rule[-1ex]{0.5pt}{2.5ex}}
+\newcommand*{\horzbar}{\rule[.5ex]{2.5ex}{0.5pt}}
+
+\def\positionalencoding{\operatorname{pos-enc}}
+\def\concat{\operatorname{concat}}
+\def\crossentropy{\LL_{\operatorname{ce}}}
+
  \begin{document}
  
  \begin{document}
  
+\vspace*{1ex}
+
+\begin{center}
+{\Large Some bits of Information Theory}
+
  \today
  \today
+\end{center}
  
  Information Theory is awesome so here is a TL;DR about Shannon's entropy.
  
  
  Information Theory is awesome so here is a TL;DR about Shannon's entropy.
  
@@ -66,8 +95,8 @@ To transmit that stream, for instance with bits over a communication
  line, you can design a coding that takes into account that the symbols
  are not all as probable, and decode on the other side.
  
  line, you can design a coding that takes into account that the symbols
  are not all as probable, and decode on the other side.
  
-For instance if $P('\!\!A')=1/2$, $P('\!\!B')=1/4$, and
-$P('\!\!C')=1/4$ you would transmit ``0'' for a ``A'' and ``10'' for a
+For instance if $\proba('\!\!A')=1/2$, $\proba('\!\!B')=1/4$, and
+$\proba('\!\!C')=1/4$ you would transmit ``0'' for a ``A'' and ``10'' for a
  ``B'' and ``11'' for a ``C'', 1.5 bits on average.
  
  If the symbol is always the same, you transmit nothing, if they are
  ``B'' and ``11'' for a ``C'', 1.5 bits on average.
  
  If the symbol is always the same, you transmit nothing, if they are
@@ -79,7 +108,7 @@ to emit on average per symbol to transmit that stream.
  It has a simple analytical form:
  %
  \[
  It has a simple analytical form:
  %
  \[
- H(p) = - \sum_k p(k) \log_2 p(k)
+ \entropy(p) = - \sum_k p(k) \log_2 p(k)
  \]
  %
  where by convention $0 \log_2 0 = 0$.
  \]
  %
  where by convention $0 \log_2 0 = 0$.
@@ -92,30 +121,30 @@ Entropy bound only for some distributions. A more sophisticated scheme
  called "Arithmetic coding" does it always.
  
  From this perspective, many quantities have an intuitive
  called "Arithmetic coding" does it always.
  
  From this perspective, many quantities have an intuitive
-value. Consider for instance sending pairs of symbols (X, Y).
+value. Consider for instance sending pairs of symbols $(X, Y)$.
  
  If these two symbols are independent, you cannot do better than
  sending one and the other separately, hence
  %
  \[
  
  If these two symbols are independent, you cannot do better than
  sending one and the other separately, hence
  %
  \[
-H(X, H) = H(X) + H(Y).
+\entropy(X, Y) = \entropy(X) + \entropy(Y).
  \]
  
  However, imagine that the second symbol is a function of the first
  \]
  
  However, imagine that the second symbol is a function of the first
-Y=f(X). You just have to send X since Y can be computed from it on the
+Y=f(X). You just have to send $X$ since $Y$ can be computed from it on the
  other side.
  
  Hence in that case
  %
  \[
  other side.
  
  Hence in that case
  %
  \[
-H(X, Y) = H(X).
+\entropy(X, Y) = \entropy(X).
  \]
  
  An associated quantity is the mutual information between two random
  variables, defined with
  %
  \[
  \]
  
  An associated quantity is the mutual information between two random
  variables, defined with
  %
  \[
-I(X;Y) = H(X) + H(Y) - H(X,Y),
+\mutinf(X;Y) = \entropy(X) + \entropy(Y) - \entropy(X,Y),
  \]
  %
  that quantifies the amount of information shared by the two variables.
  \]
  %
  that quantifies the amount of information shared by the two variables.
@@ -125,80 +154,80 @@ that quantifies the amount of information shared by the two variables.
  Conditional entropy is the average of the entropy of the conditional distribution:
  %
  \begin{align*}
  Conditional entropy is the average of the entropy of the conditional distribution:
  %
  \begin{align*}
-&H(X \mid Y)\\
- &= \sum_y p(Y=y) H(X \mid Y=y)\\
-       &= \sum_y P(Y=y) \sum_x P(X=x \mid Y=y) \log P(X=x \mid Y=y)
+ & \entropy(X \mid Y)                        \\
+ & = \sum_y \proba(Y=y) \entropy(X \mid Y=y) \\
+ & = \sum_y \proba(Y=y) \sum_x \proba(X=x \mid Y=y) \log \proba(X=x \mid Y=y)
  \end{align*}
  
  \end{align*}
  
-Intuitively it is the [minimum average] number of bits required to describe X given that Y is known.
+Intuitively it is the [minimum average] number of bits required to describe $X$ given that $Y$ is known.
  
  
-So in particular, if X and Y are independent, getting the value of $Y$
+So in particular, if $X$ and $Y$ are independent, getting the value of $Y$
  does not help at all, so you still have to send all the bits for $X$,
  hence
  %
  \[
  does not help at all, so you still have to send all the bits for $X$,
  hence
  %
  \[
-  H(X \mid Y)=H(X)
+  \entropy(X \mid Y)=\entropy(X),
  \]
  \]
-
-if X is a deterministic function of Y then
+%
+and if $X$ is a deterministic function of $Y$ then
  %
  \[
  %
  \[
-  H(X \mid Y)=0.
+  \entropy(X \mid Y)=0.
  \]
  
  \]
  
-And if you send the bits for Y and then the bits to describe X given
-that Y, you have sent (X, Y). Hence we have the chain rule:
+And if you send the bits for $Y$ and then the bits to describe $X$ given
+that $Y$, you have sent $(X, Y)$. Hence we have the chain rule:
  %
  \[
  %
  \[
-H(X, Y) = H(Y) + H(X \mid Y).
+\entropy(X, Y) = \entropy(Y) + \entropy(X \mid Y).
  \]
  
  And then we get
  %
  \begin{align*}
  \]
  
  And then we get
  %
  \begin{align*}
-I(X;Y) &= H(X) + H(Y) - H(X,Y)\\
-       &= H(X) + H(Y) - (H(Y) + H(X \mid Y))\\
-       &= H(X) - H(X \mid Y).
+I(X;Y) &= \entropy(X) + \entropy(Y) - \entropy(X,Y)\\
+       &= \entropy(X) + \entropy(Y) - (\entropy(Y) + \entropy(X \mid Y))\\
+       &= \entropy(X) - \entropy(X \mid Y).
  \end{align*}
  
  \section{Kullback-Leibler divergence}
  
  Imagine that you encode your stream thinking it comes from
  \end{align*}
  
  \section{Kullback-Leibler divergence}
  
  Imagine that you encode your stream thinking it comes from
-distribution $q$ while it comes from $p$. You would emit more bits than
-the optimal $H(p)$, and that supplement is $D_{KL}(p||q)$ the
-Kullback-Leibler divergence between $p$ and $q$.
+distribution $q$ while it comes from $p$. You would emit more bits
+than the optimal $\entropy(p)$, and that excess of bits is
+$\dkl(p||q)$ the Kullback-Leibler divergence between $p$ and $q$.
  
  In particular if $p=q$
  %
  \[
  
  In particular if $p=q$
  %
  \[
- D_{KL}(p\|q)=0,
+ \dkl(p\|q)=0,
  \]
  %
  and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot encode it and
  %
  \[
  \]
  %
  and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot encode it and
  %
  \[
- D_{KL}(p\|q)=+\infty.
+ \dkl(p\|q)=+\infty.
  \]
  
  Its formal expression is
  %
  \[
  \]
  
  Its formal expression is
  %
  \[
-D_{KL}(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
+\dkl(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
  \]
  %
  that can be understood as a value called the cross-entropy between $p$ and $q$
  %
  \[
  \]
  %
  that can be understood as a value called the cross-entropy between $p$ and $q$
  %
  \[
-H(p,q) = -\sum_x p(x) \log q(x)
+\entropy(p,q) = -\sum_x p(x) \log q(x)
  \]
  %
  minus the entropy of p
  \[
  \]
  %
  minus the entropy of p
  \[
-H(p) = -\sum_x p(x) \log p(x).
+\entropy(p) = -\sum_x p(x) \log p(x).
  \]
  
  \]
  
-Notation horror: if $X$ and $Y$ are random variables $H(X, Y)$ is the
+Notation horror: if $X$ and $Y$ are random variables $\entropy(X, Y)$ is the
  entropy of their joint law, and if $p$ and $q$ are distributions,
  entropy of their joint law, and if $p$ and $q$ are distributions,
-$H(p,q)$ is the cross-entropy between them.
+$\entropy(p,q)$ is the cross-entropy between them.
  
  \end{document}
  
  \end{document}
author	François Fleuret <francois@fleuret.org>
	Fri, 19 Jan 2024 19:36:38 +0000 (20:36 +0100)
committer	François Fleuret <francois@fleuret.org>
	Fri, 19 Jan 2024 19:36:38 +0000 (20:36 +0100)