1. Introduction
Planned experiments (to graph):
- MNIST/CIFAR10
- Accumulation
- Accuracy
- small/medium/large variance
- various \(\epsilon\) and \(\delta\)
- various hyperparameters
- Concept 1: Propose a sign-sampling mechanism based on the Logistic mechanism –> extension to DP-SGD
- Concept 2: Propose a Logistic-mechanism-based DP-SGD
Toy-example figure: the figure demonstrates the error rate of the DP-SignSGD algorithm under various noise distributions (\(\epsilon=2.0\), \(\delta=10^{-4}\)).
1.1. Motivation of DP-SignLoSGD
Our focus here is to find an additive noise mechanism in the sense of \(\epsilon\)-DP. For the exponential mechanism, it is well known that the mechanism achieves good utility when the utility function is given and is computationally cheap.
Query space: for each element of the gradient, the output of the query is \(-1\) or \(1\); that is, the query output space is \(\{-1, 1\}\).
Here, since our focus is to minimize the loss function \(L\), the desired problem is formulated as \[ \begin{align} \arg\max_{\mathbf{r}} ~~~~ & \mathbf{r}^T\cdot \mathbf{g}\\ \text{s.t.} ~~~~ & r_i\in\{-1, 1\}, \end{align} \] i.e., we seek the sign vector best aligned with the gradient, which the update then follows in the descent direction.
Here, the optimal solution is exactly SignSGD, \(\mathbf{r}=\mathtt{sign}(\mathbf{g})\); however, by plugging this utility into the exponential mechanism, we find that
\[ \Pr (r_i \mid g_i) \propto \exp(\sigma r_ig_i). \] Thus, we have \[ \Pr(r_i=1\mid g_i) = \frac{\exp(\sigma g_i)}{\exp(\sigma g_i)+\exp(-\sigma g_i)}. \] Since we want to realize this as additive noise applied before the sign operator, the above formulation can be rewritten via the event \(r_i =1\) if \(n_i+g_i >0\), and \(r_i=-1\) otherwise: \[ \Pr(n_i+g_i>0\mid g_i) =1-\Pr(n_i\le -g_i) = \frac{\exp(\sigma g_i)}{\exp(\sigma g_i)+\exp(-\sigma g_i)}. \] Then, writing \(F\) for the CDF of \(n_i\), we have \[ 1 -F(-g_i) = \frac{\exp(\sigma g_i)}{\exp(\sigma g_i)+\exp(-\sigma g_i)}. \] Applying \(\frac{\partial }{\partial g_i}\) to both sides (and using the symmetry of the resulting density), we have \[ f_{n_i}(g_i) = \frac{2\sigma\cdot\exp(-2\sigma g_i)}{\left(1+\exp(-2\sigma g_i)\right)^2}. \] Contribution 1: We confirm that \(n_i\sim\mathrm{Logistic}(\mu=0, s=\frac{1}{2\sigma})\), whose distribution is independent of the value of the gradient.
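As a quick sanity check, the following is a minimal sketch (ours, not from an existing library) that empirically verifies the claim: adding \(n_i\sim\mathrm{Logistic}(0, \frac{1}{2\sigma})\) before the sign reproduces the exponential-mechanism probability above. The parameter values are arbitrary.

```python
# Minimal sketch: check that sign(g + n) with n ~ Logistic(0, 1/(2*sigma))
# matches Pr(r=1 | g) = exp(sigma*g) / (exp(sigma*g) + exp(-sigma*g)).
import numpy as np

rng = np.random.default_rng(0)
sigma, g = 2.0, 0.3                      # arbitrary illustration values
s = 1.0 / (2.0 * sigma)                  # logistic scale from the derivation

n = rng.logistic(loc=0.0, scale=s, size=1_000_000)
empirical = np.mean(g + n > 0)
closed_form = np.exp(sigma * g) / (np.exp(sigma * g) + np.exp(-sigma * g))
print(f"empirical: {empirical:.4f}  closed form: {closed_form:.4f}")
```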
In order to ensure differential privacy, we clip the gradient so that \(\Vert\mathbf{g}\Vert_\infty\le G_\infty\). The sensitivity of the input is then \(G_\infty\), and the mechanism ensures \((4\sigma G_\infty)\)-DP.
2. Privacy Budget
Let us consider a question: what is the worst case? Here, we have the DP-SignLoSGD update as
\[ \hat{\mathbf{g}} = \mathtt{sign}\left( \sum_{i=1}^{k} \mathbf{g}^{(i)} + \mathrm{Logistic}(0,s=\frac{1}{2\sigma}) \right), \]
where \(\Vert\mathbf{g}^{(i)}\Vert_\infty \le G_\infty\).
From the above equation, the worst case for the absence of one example (the \(k\)-th) is: \(\sum_{i=1}^{k-1}\mathbf{g}^{(i)} = -\mathbf{1} \cdot q \cdot\frac{G_\infty}{2}\) and \(\sum_{i=1}^{k} \mathbf{g}^{(i)} = \mathbf{1} \cdot q \cdot \frac{G_\infty}{2}\). The update procedure is defined by
\[ \theta \leftarrow \theta - \alpha \cdot \hat{\mathbf{g}}. \]
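For concreteness, here is a minimal sketch of one such step under the definitions above (the function name and toy values are ours, not an established API):

```python
# Minimal sketch of one DP-SignLoSGD step: clip per-example gradients in the
# infinity norm, sum them, add elementwise Logistic(0, 1/(2*sigma)) noise,
# take the sign, and apply a descent update.
import numpy as np

def dp_sign_lo_sgd_step(theta, per_example_grads, alpha, sigma, g_inf, rng):
    clipped = np.clip(per_example_grads, -g_inf, g_inf)   # ||g^(i)||_inf <= G_inf
    total = clipped.sum(axis=0)
    noise = rng.logistic(loc=0.0, scale=1.0 / (2.0 * sigma), size=total.shape)
    g_hat = np.sign(total + noise)
    return theta - alpha * g_hat

rng = np.random.default_rng(0)
theta = np.zeros(4)
grads = rng.normal(size=(8, 4))          # 8 toy per-example gradients
theta = dp_sign_lo_sgd_step(theta, grads, alpha=0.01, sigma=2.0, g_inf=0.1, rng=rng)
```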
Here, DP-SignSGD is implemented with (1) a clipping procedure and (2) additive noise. For the worst-case example, we can obtain the moments accountant as follows.
Since the additive noise on each gradient element is independent, the log-moment is additive across the \(n\) coordinates, and we can compute it as follows.
2.1. Manual Integration
- Let \(G=G_\infty \cdot q\).

\[ \alpha_\mathcal{M}(\lambda) = \log\mathbb{E}_{o\sim\mathcal{M}(d)}\left[\exp\left(\lambda\log\frac{\Pr(\mathcal{M}(d)=o)}{\Pr(\mathcal{M}(d')=o)}\right)\right] \]
\[ = n\cdot\log\left(\frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\exp(\lambda G\sigma)+ \frac{\exp(-G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\exp(-\lambda G\sigma)\right). \]
Let us assume that \(\sigma > 0\). Then we have, \[ \begin{align} \alpha_\mathcal{M}(\lambda) &= n\cdot \log\left(\frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\right) + n\lambda G\sigma \\ & + n\log\left( 1 + \exp(-(1+2\lambda)G\sigma)\right) \\ & \le n\lambda G\sigma \end{align} \]
The last inequality holds because \(\frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\left(1+\exp(-(1+2\lambda)G\sigma)\right)\le 1\) for \(\lambda\ge 0\), since \(\exp(-(1+2\lambda)G\sigma)\le\exp(-G\sigma)\).
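A small numerical check of the closed-form moment against the linear bound \(n\lambda G\sigma\) (a sketch with arbitrary values, not code from the paper):

```python
# Evaluate the worst-case moment alpha_M(lambda) in closed form and compare
# it against the linear upper bound n * lambda * G * sigma.
import numpy as np

def alpha_moment(lam, G, sigma, n=1):
    a = G * sigma / 2.0
    p = np.exp(a) / (np.exp(a) + np.exp(-a))      # Pr(M(d) = 1)
    # The privacy loss is +G*sigma on outcome +1 and -G*sigma on outcome -1.
    return n * np.log(p * np.exp(lam * G * sigma)
                      + (1 - p) * np.exp(-lam * G * sigma))

G, sigma = 0.5, 1.0
for lam in (1, 2, 4, 8):
    print(lam, alpha_moment(lam, G, sigma), lam * G * sigma)  # exact vs. bound
```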
Thus, the proposed algorithm is (theoretically) no longer a relaxed \((\epsilon,\delta)\)-DP mechanism: since the bound on \(\alpha_\mathcal{M}(\lambda)\) is linear in \(\lambda\), the accountant reduces to pure \(\epsilon\)-DP composition. Also, \(kn\lambda G \sigma \le \epsilon\) is required.
Variance of the logistic noise: \(s^2 \pi^2 /3\). –> The linear scaling is a problem.
2.1.1. Another bound
\[ \begin{align} &\log\left(\frac{\exp(\lambda G\sigma)\exp(G\sigma/2)+\exp(-\lambda G\sigma)\exp(-G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\right)\\ &< \log\left(\exp(\lambda G\sigma) + \exp(-\lambda G \sigma)\right)\\ & \le \log \left(2\exp\left(\frac{\lambda^2G^2\sigma^2}{2}\right)\right) \\ &\le \log\left(\exp\left(\frac{\lambda(\lambda+1)G^2\sigma^2}{2}\right)\right)\\ & = \frac{\lambda(\lambda+1)G^2\sigma^2}{2}. \end{align} \]
Condition: the last inequality holds if \(2\le\exp(\lambda G^2\sigma^2/2)\); for \(\lambda\ge 1\), it suffices that \(2\log 2 \le G^2\sigma^2\). Thus, the noise scale must satisfy \(s^2 \propto G^2\).
- (o) First inequality: \(\exp(x) + \exp(-x)\le 2e^{x^2/2}\) (holds for all \(x\))
- (x) Second inequality: \(e< (1+x)^{2/x}< e^2\) for \(x\in(0,1]\) –> \(e^{x/2}-1\le x\)

2.2. Follow the original paper
We note that

- \(\nu_0 =\mathcal{M}(d)\sim \mathtt{sign}\left(\mathrm{Logistic}\left(\frac{G}{2}, s=\frac{1}{2\sigma}\right)\right)\)
- \(\nu_1 = \mathcal{M}(d')\sim \mathtt{sign}\left(\mathrm{Logistic}\left(-\frac{G}{2}, s=\frac{1}{2\sigma}\right)\right)\)

\[ \Pr(\mathcal{M}(d)=1) = \frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)} \]
We have
\[ \alpha_\mathcal{M}(\lambda) = n\cdot\log\alpha_{\mathcal{M}_i}(\lambda) \]
\[ \begin{align} \alpha_{\mathcal{M}_i}(\lambda) &= \mathbb{E}_{z\sim\nu_0}\left[\left(\frac{\nu_0(z)}{\nu_1(z)}\right)^{\lambda}\right]\\ &= \mathbb{E}_{z\sim\nu_1}\left[ \left(\frac{\nu_0(z)}{\nu_1(z)}\right)^{\lambda+1} \right] \\ &= \mathbb{E}_{z\sim\nu_1}\left[ \left(1+ \frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^{\lambda+1}\right]\\ &=\sum_{t=0}^{\lambda+1} \binom{\lambda+1}{t}\,\mathbb{E}_{z\sim\nu_1}\left[\left(\frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^t\right] \end{align} \]
Expanding the above sum term by term, we have the following cases.
2.2.1. Case 1. \(t=0\)
\[ \binom{\lambda+1}{0}\,\mathbb{E}_{z\sim\nu_1}\left[\left(\frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^0\right] = 1 \]
2.2.2. Case 2. \(t=1\).
\[ \binom{\lambda+1}{1}\,\mathbb{E}_{z\sim\nu_1}\left[\left(\frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^1\right] = (\lambda+1)\sum_{z\in\{-1,1\}} \left(\nu_0(z) - \nu_1(z)\right) = 0. \]

2.2.3. Case 3. \(t=2\)
\[ \begin{align} &\binom{\lambda+1}{2}\,\mathbb{E}_{z\sim\nu_1}\left[\left(\frac{\nu_0(z)-\nu_1(z)}{\nu_1(z)}\right)^2\right] \\ =& \frac{\lambda(\lambda+1)}{2}\left(\exp(G\sigma/2) - \exp(-G\sigma/2)\right)^2\\ =& \frac{\lambda(\lambda+1)}{2}\cdot \left( (\exp(G\sigma)-1) + (\exp(-G\sigma) -1)\right) \\ \le& \frac{\lambda(\lambda+1)}{2}\cdot\left( 2\exp\left(\frac{G^2\sigma^2}{2}\right)-2\right)\\ \le& \lambda(\lambda+1)\,G^2\sigma^2 \end{align} \]
- First inequality: \(\exp(x) + \exp(-x)\le 2e^{x^2/2}\) (holds for all \(x\); here \(x=G\sigma\))
- Second inequality: \(e< (1+x)^{2/x}< e^2\) for \(x\in(0,1]\) –> \(e^{x/2}-1\le x\) (here \(x=G^2\sigma^2\))
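The following sketch (ours; arbitrary values) numerically compares the exact per-coordinate moment with the binomial expansion truncated at \(t=2\) and the bound above:

```python
# Compare the exact moment E_{z~nu1}[(nu0/nu1)^(lambda+1)] with its binomial
# expansion truncated at t = 2 (cases 1-3) and the bound 1 + lam*(lam+1)*(G*sigma)^2.
import numpy as np
from math import comb

G, sigma, lam = 0.5, 1.0, 4
a = G * sigma / 2.0
p = np.exp(a) / (np.exp(a) + np.exp(-a))
nu0 = {1: p, -1: 1 - p}       # M(d):  sign(Logistic(+G/2, 1/(2*sigma)))
nu1 = {1: 1 - p, -1: p}       # M(d'): sign(Logistic(-G/2, 1/(2*sigma)))

exact = sum(nu1[z] * (nu0[z] / nu1[z]) ** (lam + 1) for z in (1, -1))
truncated = sum(comb(lam + 1, t)
                * sum(nu1[z] * ((nu0[z] - nu1[z]) / nu1[z]) ** t for z in (1, -1))
                for t in range(3))
print(exact, truncated, 1 + lam * (lam + 1) * (G * sigma) ** 2)
```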
2.2.4. Remainder terms
N/A
3. Implementation
Before manual integration, we recall the definition of Rényi differential privacy. A mechanism \(f\) is \((\alpha,\epsilon)\)-RDP if
\[ D_\alpha(f(D)\Vert f(D')) \le \epsilon, \] where \(D_\alpha(P\Vert Q) = \frac{1}{\alpha-1}\log\mathbb{E}_{x\sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^\alpha\right]\).
The Rényi divergence can be rewritten as
\[ D_\alpha(P\Vert Q) = \frac{1}{\alpha-1} \log\mathbb{E}_{x\sim Q}\exp\left(\alpha\log\left(\frac{P(x)}{Q(x)}\right)\right). \]
As written above, this formulation is the same as the moments accountant, except for the substitution \(\alpha=\lambda+1\) and the \(\frac{1}{\alpha-1}\) scaling.
A simple update rule also works: \[ \begin{align} &\epsilon(\delta) = \min_\lambda \frac{\log(1/\delta)+\alpha_\mathcal{M}(\lambda)}{\lambda},\\ &\delta (\epsilon) = \min_\lambda \exp(\alpha_\mathcal{M}(\lambda)-\lambda\epsilon). \end{align} \]
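A minimal sketch of this conversion (the helper names and the linear worst-case moment are illustrative assumptions, not an existing API):

```python
# Convert accumulated moments to (epsilon, delta) by minimizing over integer
# lambda, following the update rule above.
import numpy as np

def eps_from_delta(alpha_fn, delta, lambdas=range(1, 65)):
    return min((np.log(1.0 / delta) + alpha_fn(lam)) / lam for lam in lambdas)

def delta_from_eps(alpha_fn, eps, lambdas=range(1, 65)):
    return min(np.exp(alpha_fn(lam) - lam * eps) for lam in lambdas)

# Example with the linear worst-case bound alpha_M(lambda) = k*n*lambda*G*sigma:
k, n, G, sigma = 100, 10, 0.01, 0.5
alpha_fn = lambda lam: k * n * lam * G * sigma
print(eps_from_delta(alpha_fn, delta=1e-4))
```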
3.1. Manually measure the privacy loss
Let us define the obtained gradient vector as \(\mathbf{v}=[v_1, ..., v_m]\), in which \(v_j = \frac{1}{k}\sum_{i=1}^{k}v_j^{(i)}\). The maximum amount of each gradient element is defined by \(\bar{\mathbf{v}}=[\bar{v}_1,...,\bar{v}_m]\), where \(\bar{v}_j = \max_i v_{j}^{(i)}\).
\[ \alpha_\mathcal{M}(\lambda) = \log\mathbb{E}_{o\sim\mathcal{M}(d)}\left[\exp\left(\lambda\log\frac{\Pr(\mathcal{M}(d)=o)}{\Pr(\mathcal{M}(d')=o)}\right)\right] \]
\[ = n\cdot\log\left(\frac{\exp(G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\exp(\lambda G\sigma)+ \frac{\exp(-G\sigma/2)}{\exp(G\sigma/2)+\exp(-G\sigma/2)}\cdot\exp(-\lambda G\sigma)\right). \]
A closed-form solution can be obtained, so we do not need to numerically compute the privacy loss.
Reference implementations:

- https://github.com/jxbz/signSGD/blob/master/signSGD_zeros.ipynb
- https://github.com/dayu11/Differentially-Private-Deep-Learning/tree/main/vision/DP-SGD
The logistic distribution can be sampled from the uniform distribution: for \(X\sim\mathrm{Uniform}(0,1)\),
\[ \mu+s(\log(X) - \log(1-X)) \]
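A one-line check of this inverse-CDF sampler (a sketch; the variance should match \(s^2\pi^2/3\) from above):

```python
# Inverse-CDF sampling of Logistic(mu, s) from Uniform(0, 1).
import numpy as np

rng = np.random.default_rng(0)
mu, s = 0.0, 0.25
x = rng.uniform(size=1_000_000)
samples = mu + s * (np.log(x) - np.log(1.0 - x))
print(samples.var(), s**2 * np.pi**2 / 3)   # empirical vs. theoretical variance
```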
4. Performance comparison
5. Limitations
Compared to the GDP algorithm, our method is limited…