The so-called "ones trick" for JAGS and BUGS models allows the user to sample from distributions that are not
in the standard list. Here I show another application for "unbalanced" missing data.
More examples using the ones (or zeros) trick can be found in The BUGS Book, and in particular in Chapter 9.

Suppose that we want to find an association between a trait \(x\) (possibly continuous) and another categorical trait \(k \in \{1, 2,\dots, K\}\). We have \(N\) observations \(x_i\), however, some of the \(k_i\) are missing. Whenever there is no information on the missing \(k_i\) whatsoever, we can simply write the following JAGS model:

Suppose now that the missing \(k_i\)s are not completely unknown. In stead, suppose that we know that some of the \(k_i\) must belong to a group \(g_i\). The group \(g_i\) is encoded as a binary vector, where a \(g_{i,j} = 1\) indicates that the trait value \(j\) is in the group, and \(0\) that it is not. In particular, when \(k_i\) is known, then \[ g_{i,j} = \left\{ \begin{array}{ll} 1 & {\rm if\ } j = k_i \\ 0 & {\rm otherwise} \end{array} \right. \] If \(k_i\) is completely unknown, then we just let \(g_{i,j} = 1\) for each \(j\).

Similarly, it might have been tempting to write

As an example, I generated some random \(k_i\)s sampled from \({\rm Categorical}(p)\) (I did not bother to sample any \(x_i\)s). I have taken \(K = 15\), \(N = 2000\), and \(3\) randomly chosen groups. For a \(1000\) observations, I "forget" the actual \(k_i\), and only know the group \(g_i\). The goal is to recover \(p\) and the missing \(k_i\)s.

Using a chain of length \(20000\) (using the first \(10000\) as a burn-in and \(1/10\)-th thinning) we get the following result:

Suppose that we want to find an association between a trait \(x\) (possibly continuous) and another categorical trait \(k \in \{1, 2,\dots, K\}\). We have \(N\) observations \(x_i\), however, some of the \(k_i\) are missing. Whenever there is no information on the missing \(k_i\) whatsoever, we can simply write the following JAGS model:

data { /* alpha is the parameter for the Dirichlet prior of p, * with p the estimated frequency of k */ for ( j in 1:K ) { alpha[j] <- 1 } } model { for ( i in 1:N ) { /* Some model for x, given k */ x[i] ~ dnorm(xhat[k[i]], tau_x) /* Sample missing values of k, and inform * the distribution p with the known k's */ k[i] ~ dcat(p) } /* Model the distribution p_k of the trait k */ p ~ ddirch(alpha) /* The precision of the likelihood for x */ tau_x ~ dgamma(0.01, 0.01) /* Give the xhat's a shared prior to regularize them */ for ( j in 1:K ) { xhat[j] ~ dnorm(mu_xhat, tau_xhat) } mu_xhat ~ dnorm(0, 0.01) tau_xhat ~ dgamma(0.01, 0.01) }The data file must contain the vector \(k\), with NA in the place of the missing values.

Suppose now that the missing \(k_i\)s are not completely unknown. In stead, suppose that we know that some of the \(k_i\) must belong to a group \(g_i\). The group \(g_i\) is encoded as a binary vector, where a \(g_{i,j} = 1\) indicates that the trait value \(j\) is in the group, and \(0\) that it is not. In particular, when \(k_i\) is known, then \[ g_{i,j} = \left\{ \begin{array}{ll} 1 & {\rm if\ } j = k_i \\ 0 & {\rm otherwise} \end{array} \right. \] If \(k_i\) is completely unknown, then we just let \(g_{i,j} = 1\) for each \(j\).

data { for ( j in 1:K ) { alpha[j] <- 1 } /* for the "ones trick", * we need a 1 for every observation */ for ( i in 1:N ) { ones[i] <- 1 } } model { for ( i in 1:N ) { x[i] ~ dnorm(xhat[k[i]], tau_x) /* sample missing k's using the group-vector g */ k[i] ~ dcat(g[i,]) /* in order to inform p correctly, * we need to multiply the posterior with p[k[i]], * which can be done by observing a 1, * assumed to be Bernoulli(p[k[i]]) distributed */ ones[i] ~ dbern(p[k[i]]) } p ~ ddirch(alpha) tau_x ~ dgamma(0.01, 0.01) for ( j in 1:K ) { xhat[j] ~ dnorm(mu_xhat, tau_xhat) } mu_xhat ~ dnorm(0, 0.01) tau_xhat ~ dgamma(0.01, 0.01) }In stead of using the "ones trick", it would have been more clear-cut if we were able to write

k[i] ~ dcat(g[i,]) /* sampling statement */ k[i] ~ dcat(p) /* statement to inform p */but in this way JAGS can not tell that it is to sample from \({\rm Categorical}(g_i)\), and not from \({\rm Categorical}(p)\).

Similarly, it might have been tempting to write

k[i] ~ dcat(g[i,] * p) /* point-wise vector multiplication */This would correctly prefer the common \(k\,\)s in group \(g_i\) over the rare \(k\,\)s, but the fact that \(k_i\) must come from the group \(g_i\) is not used to inform \(p\).

As an example, I generated some random \(k_i\)s sampled from \({\rm Categorical}(p)\) (I did not bother to sample any \(x_i\)s). I have taken \(K = 15\), \(N = 2000\), and \(3\) randomly chosen groups. For a \(1000\) observations, I "forget" the actual \(k_i\), and only know the group \(g_i\). The goal is to recover \(p\) and the missing \(k_i\)s.

Using a chain of length \(20000\) (using the first \(10000\) as a burn-in and \(1/10\)-th thinning) we get the following result: