Bartlett’s identities are relationships among the expectations of the \(n\)-th derivatives of the log-likelihood function.
1. First-order Bartlett’s identity
Let us assume that a random variable \(X\) has a strictly positive density \(f(x;\theta)\) parameterized by \(\theta\).
Then, the density satisfies the normalization condition \(\int_{\mathbb{R}}f(x;\theta)\,dx=1\).
Differentiating this condition with respect to \(\theta\), and assuming the derivative and the integral can be interchanged, we obtain
\[ \frac{\partial}{\partial\theta}\int f(x;\theta)dx= \int \frac{\partial}{\partial\theta}f(x;\theta)dx =\int f'(x;\theta)dx =0. \] From the above equality, we have the following relationship:
\[ \int f'(x;\theta)dx = \int\frac{f'(x;\theta)}{f(x;\theta)} \cdot f(x;\theta)dx = \mathbb{E}_{f(x;\theta)}\left\{ \frac{\partial}{\partial\theta}\left( \log(f(x;\theta))\right) \right\} =0 \]
Defining \(\ell(x;\theta) = \log f(x;\theta)\), we have
\[ \mathbb{E}_{x\sim f(x;\theta)}\left[\frac{\partial}{\partial\theta}\ell(x;\theta)\right] = 0. \]
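As a quick sanity check, this identity can be verified by Monte Carlo. The sketch below is our own toy example (not part of the derivation above): it takes \(f(x;\theta)=\mathcal{N}(x;\theta,1)\), for which \(\frac{\partial}{\partial\theta}\ell(x;\theta)=x-\theta\), and confirms that the sample mean of the score under \(f\) is close to zero.

```python
# Monte Carlo check of the first-order Bartlett identity for a toy Gaussian
# model f(x; theta) = N(x; theta, 1). The score is d/dtheta log f = x - theta.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)   # x ~ f(x; theta)

score = x - theta                 # d/dtheta log f(x; theta)
print(score.mean())               # ~0 up to Monte Carlo error
```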
2. Second-order Bartlett’s identity
Similar to the first-order identity, we start with the following:
\[ \int \frac{\partial^2}{\partial \theta^2}f(x;\theta)\,dx = 0. \]
Also, we have
\[ \frac{\partial^2}{\partial\theta^2} \ell(x;\theta) = \frac{f''(x;\theta)}{f(x;\theta)} - \left(\frac{f'(x;\theta)}{f(x;\theta)}\right)^2 \]
Taking the expectation of both sides of this equality, we have
\[ \begin{aligned} \mathbb{E}\left[ \frac{\partial^2}{\partial\theta^2}\ell(x;\theta)\right] = \int f''(x;\theta)dx - \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log f(x;\theta)\right)^2\right] \end{aligned} \]
Since \(\int f''(x;\theta)\,dx = 0\), the first term on the right-hand side vanishes. Finally, we can rewrite the resulting equation in vector form (for a multivariate \(\theta\)):
\[\mathbb{E}\left[\nabla_\theta^2\ell(x;\theta)\right] + \mathbb{E}\left[\nabla_\theta \ell(x;\theta)\nabla_\theta\ell(x;\theta)^T\right]=0.\]
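The second-order identity can be checked the same way with the same toy Gaussian model, \(f(x;\theta)=\mathcal{N}(x;\theta,1)\): here \(\frac{\partial^2}{\partial\theta^2}\ell(x;\theta)=-1\) and \(\left(\frac{\partial}{\partial\theta}\ell(x;\theta)\right)^2=(x-\theta)^2\), so the two expectations should cancel.

```python
# Monte Carlo check of the second-order Bartlett identity for the same toy
# Gaussian model f(x; theta) = N(x; theta, 1).
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)   # x ~ f(x; theta)

d1 = x - theta                    # d/dtheta log f(x; theta)
d2 = -np.ones_like(x)             # d^2/dtheta^2 log f(x; theta)
print(d2.mean() + (d1 ** 2).mean())   # ~0: E[d2] cancels E[d1^2]
```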
3. Application to the NLL Loss
Paper: https://arxiv.org/abs/1612.00796
The article discusses the relationship between the Fisher Information Matrix and the Hessian of the negative log-likelihood in neural networks. The Hessian is difficult to compute because it requires second derivatives with respect to a large parameter vector, so it is approximated by the Fisher Information Matrix. The article lays out the definitions and mathematical expressions of both quantities, highlighting their similarities and differences, and also examines the special case where the activation function belongs to the canonical exponential family, which yields further insight into the relationship between the Fisher Information Matrix and the Hessian.
Notation: Let us consider an NN model that estimates \(p(y|x;\theta)\), where the ground-truth distributions are denoted by \(q_x\), \(q_{y|x}\), …
Fisher Information Matrix (FIM): The Fisher Information Matrix is defined by
\[ F = \mathbb{E}_{x\sim q_x}\left[\mathbb{E}_{y\sim p_{y|x}}\left[ \nabla_\theta\log p(y|x;\theta)\nabla_\theta\log p(y|x;\theta)^T\right]\right]. \] From the second-order Bartlett identity, we have
\[ F = - \mathbb{E}_{x\sim q_x}\left[\mathbb{E}_{y\sim p_{y|x}}\left[ \nabla^2_\theta\log p(y|x;\theta)\right]\right]. \]
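As a sanity check, the equality between these two expressions can be verified numerically for a small classifier. The sketch below is our own toy setup (not code from the paper): a linear softmax model \(p(y|x;W)=\mathrm{softmax}(Wx)_y\), for which the inner expectation over \(y\sim p_{y|x}\) can be computed exactly by summing over the classes.

```python
# Check that the outer-product and negative-Hessian forms of the FIM agree
# for a toy linear softmax classifier p(y|x; W) = softmax(W x)_y.
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 4                       # number of classes, input dimension
W = rng.normal(size=(K, d))       # model parameters
xs = rng.normal(size=(8, d))      # a few inputs standing in for x ~ q_x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

F_outer = np.zeros((K * d, K * d))   # E[ grad log p (grad log p)^T ]
F_hess  = np.zeros((K * d, K * d))   # -E[ Hessian of log p ]
for x in xs:
    p = softmax(W @ x)
    # Inner expectation over y ~ p(y|x): exact sum over the K classes.
    for y in range(K):
        g = np.kron(np.eye(K)[y] - p, x)   # grad_W log p(y|x), flattened
        F_outer += p[y] * np.outer(g, g)
    # Negative Hessian of log p(y|x) w.r.t. vec(W); it does not depend on y here.
    F_hess += np.kron(np.diag(p) - np.outer(p, p), np.outer(x, x))
F_outer /= len(xs)
F_hess  /= len(xs)

print(np.allclose(F_outer, F_hess))  # True: both expressions give the same FIM
```

Note that for this model the Hessian of \(\log p(y|x;W)\) happens not to depend on \(y\), while the outer product of the score does, so the equality only emerges after taking the inner expectation over \(y\sim p_{y|x}\).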
Hessian Matrix of the NLL loss: Let us consider the NLL loss \(\mathcal{L}(\theta) = -\mathbb{E}_{q}\left[\log p(y|x;\theta)\right]\), where the expectation is taken over the data distribution \(q\). Then, its Hessian matrix is
\[ H = -\mathbb{E}_q\left[\nabla^2_\theta \log p(y|x;\theta)\right]. \]
The key difference between the FIM and the Hessian of the NLL loss is the probability distribution over \(y\) used in the inner expectation: \(p(y|x)\) for the FIM versus \(q(y|x)\) for the Hessian.
Thus, if \(p(y|x)\approx q(y|x)\), we can approximate \(H\) by \(F\).
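To make the \(p(y|x)\)-versus-\(q(y|x)\) distinction concrete, below is a small numerical sketch (our own toy construction, not taken from the paper). The model is a zero-mean Gaussian \(p(y;\theta)=\mathcal{N}(y;0,e^{\theta})\) with a single log-variance parameter \(\theta\); here \(\nabla^2_\theta\log p\) depends on \(y\), so \(F\) (inner expectation over \(y\sim p\)) and \(H\) (inner expectation over \(y\sim q\)) agree only when \(q\) matches \(p\).

```python
# Toy illustration of H ≈ F when p ≈ q, for a one-parameter model
# p(y; theta) = N(y; 0, exp(theta)) with theta the log-variance.
#   d/dtheta     log p = -1/2 + y^2 / (2 exp(theta))
#   d^2/dtheta^2 log p = -y^2 / (2 exp(theta))
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                       # model variance = exp(0) = 1
n = 1_000_000

def score(y):
    return -0.5 + y ** 2 / (2.0 * np.exp(theta))

def neg_second_deriv(y):
    return y ** 2 / (2.0 * np.exp(theta))

# Fisher information: inner expectation over y ~ p (the model's distribution).
y_model = rng.normal(0.0, np.exp(theta / 2.0), size=n)
F = np.mean(score(y_model) ** 2)

# Hessian of the NLL: inner expectation over y ~ q (the data distribution).
y_q_match    = rng.normal(0.0, 1.0, size=n)   # q matches p
y_q_mismatch = rng.normal(0.0, 2.0, size=n)   # q has a different variance
H_match    = np.mean(neg_second_deriv(y_q_match))
H_mismatch = np.mean(neg_second_deriv(y_q_mismatch))

print(F, H_match, H_mismatch)     # F ≈ H_match ≈ 0.5, but H_mismatch ≈ 2.0
```

When the data distribution matches the model, both quantities are close to \(1/2\); when it does not, the Hessian of the NLL drifts away from the Fisher information, which is exactly why the approximation \(H\approx F\) relies on \(p(y|x)\approx q(y|x)\).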