1. Introduction
Planned experiments (to graph):
- MNIST/CIFAR10
- Accumulation
- Accuracy
- small/medium/large variance
- various \(\epsilon\) and \(\delta\)
- various hyperparameters
- Concept 1: Propose a sign-sampling mechanism based on the Logistic mechanism –> extension to DP-SGD
- Concept 2: Propose a Logistic-mechanism-based DP-SGD
Toy-example figure: the figure demonstrates the error rate of the DP-SignSGD algorithm under various noise distributions (\(\epsilon=2.0\), \(\delta=10^{-4}\)).
1.1. Motivation of DP-SignLoSGD
Our focus here is to find an additive noise mechanism in the sense of \(\epsilon\)-DP. For the exponential mechanism, it is well known that the mechanism achieves good utility when the utility function is given and is computationally cheap.
Query space: for each element of the gradient, the output of the query is \(-1\) or \(1\); that is, the query output space is \(\{-1, 1\}\).
Here, since our focus is to minimize the loss function \(L\), the desired problem is formulated as \[ \begin{align} \arg\max_{\mathbf{r}} ~~~~ & \mathbf{r}^T\cdot \mathbf{g}\\ \text{s.t.} ~~~~ & r_i\in\{-1, 1\}, \end{align} \] i.e., we seek the sign vector best aligned with the gradient, which the update then follows in the descent direction.
Here, the optimal solution is exactly SignSGD, \(\mathbf{r}=\mathtt{sign}(\mathbf{g})\); however, by plugging this utility into the exponential mechanism, we find that
\[ \Pr (r_i \mid g_i) \propto \exp(\sigma r_ig_i). \] Thus, we have \[ \Pr(r_i=1\mid g_i) = \frac{\exp(\sigma g_i)}{\exp(\sigma g_i)+\exp(-\sigma g_i)}. \] Since we want to realize this as additive noise applied before the sign operator, the above formulation can be rewritten via the event \(r_i =1\) if \(n_i+g_i >0\), and \(r_i=-1\) otherwise: \[ \Pr(n_i+g_i>0\mid g_i) =1-\Pr(n_i\le -g_i) = \frac{\exp(\sigma g_i)}{\exp(\sigma g_i)+\exp(-\sigma g_i)}. \] Then, writing \(F\) for the CDF of \(n_i\), we have \[ 1 -F(-g_i) = \frac{\exp(\sigma g_i)}{\exp(\sigma g_i)+\exp(-\sigma g_i)}. \] Applying \(\frac{\partial }{\partial g_i}\) to both sides (and using the symmetry of the resulting density), we have \[ f_{n_i}(g_i) = \frac{2\sigma\cdot\exp(-2\sigma g_i)}{\left(1+\exp(-2\sigma g_i)\right)^2}. \] Contribution 1: We confirm that \(n_i\sim\mathrm{Logistic}(\mu=0, s=\frac{1}{2\sigma})\), whose distribution is independent of the value of the gradient.
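As a quick sanity check, the following is a minimal sketch (ours, not from an existing library) that empirically verifies the claim: adding \(n_i\sim\mathrm{Logistic}(0, \frac{1}{2\sigma})\) before the sign reproduces the exponential-mechanism probability above. The parameter values are arbitrary.

```python
# Minimal sketch: check that sign(g + n) with n ~ Logistic(0, 1/(2*sigma))
# matches Pr(r=1 | g) = exp(sigma*g) / (exp(sigma*g) + exp(-sigma*g)).
import numpy as np

rng = np.random.default_rng(0)
sigma, g = 2.0, 0.3                      # arbitrary illustration values
s = 1.0 / (2.0 * sigma)                  # logistic scale from the derivation

n = rng.logistic(loc=0.0, scale=s, size=1_000_000)
empirical = np.mean(g + n > 0)
closed_form = np.exp(sigma * g) / (np.exp(sigma * g) + np.exp(-sigma * g))
print(f"empirical: {empirical:.4f}  closed form: {closed_form:.4f}")
```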
In order to ensure differential privacy, we clip the gradient so that \(\Vert\mathbf{g}\Vert_\infty\le G_\infty\). The sensitivity of the input is then \(G_\infty\), and the mechanism ensures \((4\sigma G_\infty)\)-DP.
2. Privacy Budget
Let us consider a question: what is the worst case? Here, we have the DP-SignLoSGD update as
\[ \hat{\mathbf{g}} = \mathtt{sign}\left( \sum_{i=1}^{k} \mathbf{g}^{(i)} + \mathrm{Logistic}(0,s=\frac{1}{2\sigma}) \right), \]
where \(\Vert\mathbf{g}^{(i)}\Vert_\infty \le G_\infty\).
From the above equation, the worst case for the absence of one example (the \(k\)-th) is: \(\sum_{i=1}^{k-1}\mathbf{g}^{(i)} = -\mathbf{1} \cdot q \cdot\frac{G_\infty}{2}\) and \(\sum_{i=1}^{k} \mathbf{g}^{(i)} = \mathbf{1} \cdot q \cdot \frac{G_\infty}{2}\). The update procedure is defined by
\[ \theta \leftarrow \theta - \alpha \cdot \hat{\mathbf{g}}. \]
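For concreteness, here is a minimal sketch of one such step under the definitions above (the function name and toy values are ours, not an established API):

```python
# Minimal sketch of one DP-SignLoSGD step: clip per-example gradients in the
# infinity norm, sum them, add elementwise Logistic(0, 1/(2*sigma)) noise,
# take the sign, and apply a descent update.
import numpy as np

def dp_sign_lo_sgd_step(theta, per_example_grads, alpha, sigma, g_inf, rng):
    clipped = np.clip(per_example_grads, -g_inf, g_inf)   # ||g^(i)||_inf <= G_inf
    total = clipped.sum(axis=0)
    noise = rng.logistic(loc=0.0, scale=1.0 / (2.0 * sigma), size=total.shape)
    g_hat = np.sign(total + noise)
    return theta - alpha * g_hat

rng = np.random.default_rng(0)
theta = np.zeros(4)
grads = rng.normal(size=(8, 4))          # 8 toy per-example gradients
theta = dp_sign_lo_sgd_step(theta, grads, alpha=0.01, sigma=2.0, g_inf=0.1, rng=rng)
```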
Here, DP-SignSGD is implemented with (1) a clipping procedure and (2) additive noise. For the worst-case example, we can obtain the moments accountant as follows.
Since the additive noise on each gradient element is independent, the log-moment is additive across the \(n\) coordinates, and we can compute it as follows.
2.1. Manual Integration
- Let \(G=G_\infty \cdot q\).

\[ \alpha_\mathcal{M}(\lambda) = \log\mathbb{E}_{o\sim\mathcal{M}(d)}\left[\exp\left(\lambda\log\frac{\Pr(\mathcal{M}(d)=o)}{\Pr(\mathcal{M}(d')=o)}\right)\right] \]
\[ = n\cdot\log\left(\frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\exp(\lambda G\sigma)+ \frac{\exp(-G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\exp(-\lambda G\sigma)\right). \]
Let us assume that \(\sigma > 0\). Then we have, \[ \begin{align} \alpha_\mathcal{M}(\lambda) &= n\cdot \log\left(\frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\right) + n\lambda G\sigma \\ & + n\log\left( 1 + \exp(-(1+2\lambda)G\sigma)\right) \\ & \le n\lambda G\sigma \end{align} \]
The last inequality holds because \(\frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\left(1+\exp(-(1+2\lambda)G\sigma)\right)\le 1\) for \(\lambda\ge 0\), since \(\exp(-(1+2\lambda)G\sigma)\le\exp(-G\sigma)\).
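A small numerical check of the closed-form moment against the linear bound \(n\lambda G\sigma\) (a sketch with arbitrary values, not code from the paper):

```python
# Evaluate the worst-case moment alpha_M(lambda) in closed form and compare
# it against the linear upper bound n * lambda * G * sigma.
import numpy as np

def alpha_moment(lam, G, sigma, n=1):
    a = G * sigma / 2.0
    p = np.exp(a) / (np.exp(a) + np.exp(-a))      # Pr(M(d) = 1)
    # The privacy loss is +G*sigma on outcome +1 and -G*sigma on outcome -1.
    return n * np.log(p * np.exp(lam * G * sigma)
                      + (1 - p) * np.exp(-lam * G * sigma))

G, sigma = 0.5, 1.0
for lam in (1, 2, 4, 8):
    print(lam, alpha_moment(lam, G, sigma), lam * G * sigma)  # exact vs. bound
```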
Thus, the proposed algorithm is (theoretically) no longer a relaxed \((\epsilon,\delta)\)-DP mechanism: since the bound on \(\alpha_\mathcal{M}(\lambda)\) is linear in \(\lambda\), the accountant reduces to pure \(\epsilon\)-DP composition. Also, \(kn\lambda G \sigma \le \epsilon\) is required.
Variance of the logistic noise: \(s^2 \pi^2 /3\). –> The linear scaling is a problem.
2.1.1. Another bound
\[ \begin{align} &\log\left(\frac{\exp(\lambda G\sigma)\exp(G\sigma/2)+\exp(-\lambda G\sigma)\exp(-G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\right)\\ &< \log\left(\exp(\lambda G\sigma) + \exp(-\lambda G \sigma)\right)\\ & \le \log \left(2\exp\left(\frac{\lambda^2G^2\sigma^2}{2}\right)\right) \\ &\le \log\left(\exp\left(\frac{\lambda(\lambda+1)G^2\sigma^2}{2}\right)\right)\\ & = \frac{\lambda(\lambda+1)G^2\sigma^2}{2}. \end{align} \]
Condition: the last inequality holds if \(2\le\exp(\lambda G^2\sigma^2/2)\); for \(\lambda\ge 1\), it suffices that \(2\log 2 \le G^2\sigma^2\). Thus, the noise scale must satisfy \(s^2 \propto G^2\).
- (o) First inequality: \(\exp(x) + \exp(-x)\le 2e^{x^2/2}\) (holds for all \(x\))
- (x) Second inequality: \(e< (1+x)^{2/x}< e^2\) for \(x\in(0,1]\) –> \(e^{x/2}-1\le x\)

2.2. Follow the original paper
We note that

- \(\nu_0 =\mathcal{M}(d)\sim \mathtt{sign}\left(\mathrm{Logistic}\left(\frac{G}{2}, s=\frac{1}{2\sigma}\right)\right)\)
- \(\nu_1 = \mathcal{M}(d')\sim \mathtt{sign}\left(\mathrm{Logistic}\left(-\frac{G}{2}, s=\frac{1}{2\sigma}\right)\right)\)

\[ \Pr(\mathcal{M}(d)=1) = \frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)} \]
We have
\[ \alpha_\mathcal{M}(\lambda) = n\cdot\log\alpha_{\mathcal{M}_i}(\lambda) \]
\[ \begin{align} \alpha_{\mathcal{M}_i}(\lambda) &= \mathbb{E}_{z\sim\nu_0}\left[\left(\frac{\nu_0(z)}{\nu_1(z)}\right)^{\lambda}\right]\\ &= \mathbb{E}_{z\sim\nu_1}\left[ \left(\frac{\nu_0(z)}{\nu_1(z)}\right)^{\lambda+1} \right] \\ &= \mathbb{E}_{z\sim\nu_1}\left[ \left(1+ \frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^{\lambda+1}\right]\\ &=\sum_{t=0}^{\lambda+1} \binom{\lambda+1}{t}\,\mathbb{E}_{z\sim\nu_1}\left[\left(\frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^t\right] \end{align} \]
Expanding the above sum term by term, we have the following cases.
2.2.1. Case 1. \(t=0\)
\[ \binom{\lambda+1}{0}\,\mathbb{E}_{z\sim\nu_1}\left[\left(\frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^0\right] = 1 \]
2.2.2. Case 2. \(t=1\).
\[ \binom{\lambda+1}{1}\,\mathbb{E}_{z\sim\nu_1}\left[\left(\frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^1\right] = (\lambda+1)\sum_{z\in\{-1,1\}} \left(\nu_0(z) - \nu_1(z)\right) = 0. \]

2.2.3. Case 3. \(t=2\)
\[ \begin{align} &\binom{\lambda+1}{2}\,\mathbb{E}_{z\sim\nu_1}\left[\left(\frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^2\right] \\ =& \frac{\lambda(\lambda+1)}{2}\left(\exp(G\sigma/2) - \exp(-G\sigma/2)\right)^2\\ =& \frac{\lambda(\lambda+1)}{2}\cdot \left( (\exp(G\sigma)-1) + (\exp(-G\sigma) -1)\right) \\ \le& \frac{\lambda(\lambda+1)}{2}\cdot\left( 2\exp\left(\frac{G^2\sigma^2}{2}\right)-2\right)\\ \le& \lambda(\lambda+1)\,G^2\sigma^2 \end{align} \]
- First inequality: \(\exp(x) + \exp(-x)\le 2e^{x^2/2}\) (holds for all \(x\); here \(x=G\sigma\))
- Second inequality: \(e< (1+x)^{2/x}< e^2\) for \(x\in(0,1]\) –> \(e^{x/2}-1\le x\) (here \(x=G^2\sigma^2\))
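The following sketch (ours; arbitrary values) numerically compares the exact per-coordinate moment with the binomial expansion truncated at \(t=2\) and the bound above:

```python
# Compare the exact moment E_{z~nu1}[(nu0/nu1)^(lambda+1)] with its binomial
# expansion truncated at t = 2 (cases 1-3) and the bound 1 + lam*(lam+1)*(G*sigma)^2.
import numpy as np
from math import comb

G, sigma, lam = 0.5, 1.0, 4
a = G * sigma / 2.0
p = np.exp(a) / (np.exp(a) + np.exp(-a))
nu0 = {1: p, -1: 1 - p}       # M(d):  sign(Logistic(+G/2, 1/(2*sigma)))
nu1 = {1: 1 - p, -1: p}       # M(d'): sign(Logistic(-G/2, 1/(2*sigma)))

exact = sum(nu1[z] * (nu0[z] / nu1[z]) ** (lam + 1) for z in (1, -1))
truncated = sum(comb(lam + 1, t)
                * sum(nu1[z] * ((nu0[z] - nu1[z]) / nu1[z]) ** t for z in (1, -1))
                for t in range(3))
print(exact, truncated, 1 + lam * (lam + 1) * (G * sigma) ** 2)
```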
2.2.4. Remainder terms
N/A
3. Implementation
Before manual integration, we recall the definition of Rényi differential privacy. A mechanism \(f\) is \((\alpha,\epsilon)\)-RDP if
\[ D_\alpha(f(D)\Vert f(D')) \le \epsilon, \] where \(D_\alpha(P\Vert Q) = \frac{1}{\alpha-1}\log\mathbb{E}_{x\sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^\alpha\right]\).
The Rényi divergence can be rewritten as
\[ D_\alpha(P\Vert Q) = \frac{1}{\alpha-1} \log\mathbb{E}_{x\sim Q}\exp\left(\alpha\log\left(\frac{P(x)}{Q(x)}\right)\right). \]
As written above, this formulation is the same as the moments accountant, except for the substitution \(\alpha=\lambda+1\) and the \(\frac{1}{\alpha-1}\) scaling.
A simple update rule also works: \[ \begin{align} &\epsilon(\delta) = \min_\lambda \frac{\log(1/\delta)+\alpha_\mathcal{M}(\lambda)}{\lambda},\\ &\delta (\epsilon) = \min_\lambda \exp(\alpha_\mathcal{M}(\lambda)-\lambda\epsilon). \end{align} \]
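A minimal sketch of this conversion (the helper names and the linear worst-case moment are illustrative assumptions, not an existing API):

```python
# Convert accumulated moments to (epsilon, delta) by minimizing over integer
# lambda, following the update rule above.
import numpy as np

def eps_from_delta(alpha_fn, delta, lambdas=range(1, 65)):
    return min((np.log(1.0 / delta) + alpha_fn(lam)) / lam for lam in lambdas)

def delta_from_eps(alpha_fn, eps, lambdas=range(1, 65)):
    return min(np.exp(alpha_fn(lam) - lam * eps) for lam in lambdas)

# Example with the linear worst-case bound alpha_M(lambda) = k*n*lambda*G*sigma:
k, n, G, sigma = 100, 10, 0.01, 0.5
alpha_fn = lambda lam: k * n * lam * G * sigma
print(eps_from_delta(alpha_fn, delta=1e-4))
```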
3.1. Manually measure the privacy loss
Let us define the obtained gradient vector as \(\mathbf{v}=[v_1, ..., v_m]\), in which \(v_j = \frac{1}{k}\sum_{i=1}^{k}v_j^{(i)}\). The maximum amount of each gradient element is defined by \(\bar{\mathbf{v}}=[\bar{v}_1,...,\bar{v}_m]\), where \(\bar{v}_j = \max_i v_{j}^{(i)}\).
\[ \alpha_\mathcal{M}(\lambda) = \log\mathbb{E}_{o\sim\mathcal{M}(d)}\left[\exp\left(\lambda\log\frac{\Pr(\mathcal{M}(d)=o)}{\Pr(\mathcal{M}(d')=o)}\right)\right] \]
\[ = n\cdot\log\left(\frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\exp(\lambda G\sigma)+ \frac{\exp(-G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\exp(-\lambda G\sigma)\right). \]
A closed-form solution can be obtained, so we do not need to numerically compute the privacy loss.
Reference implementations:

- https://github.com/jxbz/signSGD/blob/master/signSGD_zeros.ipynb
- https://github.com/dayu11/Differentially-Private-Deep-Learning/tree/main/vision/DP-SGD
The logistic distribution can be sampled from the uniform distribution: for \(X\sim\mathrm{Uniform}(0,1)\),
\[ \mu+s(\log(X) - \log(1-X)) \]
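A one-line check of this inverse-CDF sampler (a sketch; the variance should match \(s^2\pi^2/3\) from above):

```python
# Inverse-CDF sampling of Logistic(mu, s) from Uniform(0, 1).
import numpy as np

rng = np.random.default_rng(0)
mu, s = 0.0, 0.25
x = rng.uniform(size=1_000_000)
samples = mu + s * (np.log(x) - np.log(1.0 - x))
print(samples.var(), s**2 * np.pi**2 / 3)   # empirical vs. theoretical variance
```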
4. Performance comparison
5. Limitations
Compared to the GDP algorithm, our method is limited…