We propose a method for identifying models with good predictive performance in the family of Bayesian log-linear mixed models with Dirichlet process random effects for count data. The wide applicability of these models makes the assessment of their performance crucial in many fields, including disclosure risk estimation, which is the focus of the present work.
Rather than assessing models on the whole contingency table, we target the specific objective of the analysis and propose a two-stage model selection procedure designed to limit a form of bias that arises during model selection. Our proposal combines two different criteria: at the first stage, a path in the model search space is identified through a strongly penalized log-likelihood; at the second, a small number of semi-parametric models is evaluated through a context-dependent score-based information criterion. Tested on a variety of contingency tables, our method proves able to identify models with good predictive performance in a few steps, even for large tables with many sampling and structural zeros. We carefully discuss the proposed method in the context of the literature on model assessment and situate the illustrative application within the recent debate on statistical disclosure limitation. Finally, we provide examples of further applications in other research areas.
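The two-stage structure described above can be illustrated with a deliberately simplified sketch. Everything here is an assumption for exposition, not the paper's method: plain Poisson GLMs stand in for the Bayesian log-linear mixed models with Dirichlet process random effects, a greedy forward search with an inflated BIC-style penalty stands in for the strongly penalized log-likelihood path, and held-out predictive log-likelihood stands in for the context-dependent score-based information criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative only): Poisson counts driven by 2 of 6 candidate
# predictors, standing in for terms of a log-linear model.
n, p = 300, 6
X = rng.normal(size=(n, p))
beta_true = np.array([0.8, -0.6, 0.0, 0.0, 0.0, 0.0])
y = rng.poisson(np.exp(0.5 + X @ beta_true))

def fit_poisson(Xs, y, iters=50):
    """Newton-Raphson fit of a Poisson GLM with intercept; returns coefficients."""
    Z = np.column_stack([np.ones(len(y)), Xs])
    b = np.zeros(Z.shape[1])
    for _ in range(iters):
        eta = np.clip(Z @ b, -20, 20)          # guard against overflow
        mu = np.exp(eta)
        b += np.linalg.solve(Z.T @ (Z * mu[:, None]), Z.T @ (y - mu))
    return b

def loglik(b, Xs, y):
    """Poisson log-likelihood up to an additive constant."""
    eta = np.clip(np.column_stack([np.ones(len(y)), Xs]) @ b, -20, 20)
    return float(np.sum(y * eta - np.exp(eta)))

# Stage 1: greedy forward path under a strongly penalized log-likelihood.
# The penalty is deliberately stronger than BIC's log(n), so the path is short.
penalty = 3.0 * np.log(n)

def pen_score(model):
    Xs = X[:, model]
    return loglik(fit_poisson(Xs, y), Xs, y) - penalty * len(model)

selected = []
current = pen_score(selected)
path = [list(selected)]
while len(selected) < p:
    cand = {j: pen_score(selected + [j]) for j in range(p) if j not in selected}
    j_best = max(cand, key=cand.get)
    if cand[j_best] <= current:                # no improvement: path ends
        break
    selected.append(j_best)
    current = cand[j_best]
    path.append(list(selected))

# Stage 2: re-score the few models on the path with a second criterion,
# here held-out predictive log-likelihood on a validation split.
tr, va = np.arange(n) < 200, np.arange(n) >= 200

def heldout(model):
    b = fit_poisson(X[tr][:, model], y[tr])
    return loglik(b, X[va][:, model], y[va])

best = max(path, key=heldout)
print("path:", path, "chosen model:", best)
```

Because the stage-1 penalty is strong, only a handful of nested models survive to stage 2, which is the point of the two-stage design: the expensive, objective-specific criterion is computed on a small candidate set rather than on the full search space.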