The model provides an embedding or feature representation of the data of all taxpayers. These features are then used to train a separate classifier. The representation learned allows related observations to cluster in a hidden (latent) space.

\newline\newline

A deep generative model of both audited and unaudited taxpayer data provides a more robust set of hidden (latent) features. The generative model used is:

\newline

\newline

\begin{math}

p(\textbf{z}) = \mathcal{N}(\textbf{z}|\textbf{0},\textbf{I});\quad p_\theta(\textbf{x}|\textbf{z}) = f(\textbf{x};\textbf{z},\boldsymbol\theta), \quad(1)

\end{math}

\newline\newline

where \begin{math} f(\textbf{x};\textbf{z},\boldsymbol\theta) \end{math} is a Gaussian distribution whose parameters are formed by non-linear functions (deep neural networks), with parameters \begin{math} \boldsymbol\theta \end{math}, of a set of hidden (latent) variables \textbf{z}.

\newline\newline

Approximate samples from the posterior distribution \ensuremath{p(\textbf{z}|\textbf{x})} over the hidden (latent) variables (the distribution representing the updated beliefs about the latent variables after the model has seen the data) are used as features to train a classifier, such as a support vector machine (SVM), that predicts whether a material audit yield will result if a taxpayer is audited (y). This approach enables the classification to be performed in a lower-dimensional space, since we typically use hidden (latent) variables whose dimensionality is much smaller than that of the observations.

These low dimensional embeddings should now also be more easily separable since we make use of independent hidden (latent) Gaussian posteriors whose parameters are formed by a sequence of non-linear transformations of the data.
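As a concrete sketch of this two-stage pipeline, the snippet below trains a linear classifier on low-dimensional embeddings rather than the raw observations. Everything here is illustrative: the encoder is a fixed random map standing in for the learned inference network of Model 1, the data is synthetic, and a minimal hinge-loss SVM replaces a full SVM library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained inference network q_phi(z|x): a fixed random
# linear map plus tanh. In the actual model this is the deep encoder
# learned by Model 1; everything here is illustrative only.
W_enc = rng.normal(size=(20, 4))

def encode(x):
    return np.tanh(x @ W_enc)  # latent posterior means, 20-dim -> 4-dim

# Toy "audited taxpayer" data: 200 observations, labels in {-1, +1}.
X = rng.normal(size=(200, 20))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)

Z = encode(X)  # classification happens in the 4-dim latent space

# Minimal linear SVM: subgradient descent on the regularized hinge loss.
w, b = np.zeros(Z.shape[1]), 0.0
lam, lr = 1e-3, 0.1
for _ in range(500):
    margins = y * (Z @ w + b)
    viol = margins < 1  # points inside or beyond the margin
    if viol.any():
        w -= lr * (lam * w - (y[viol, None] * Z[viol]).mean(axis=0))
        b -= lr * (-y[viol].mean())
    else:
        w -= lr * lam * w

train_acc = float(np.mean(np.sign(Z @ w + b) == y))
```

The classifier never sees the 20-dimensional input; it operates entirely on the 4-dimensional embeddings, which is the dimensionality reduction the text describes.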

\newline\newline

\textbf{Generative semi-supervised model (Model 2): }

A probabilistic model describes the data as being generated by a hidden (latent) class variable y in addition to a continuous hidden (latent) variable \textbf{z}. The model used is:

\newline

\newline

\begin{math}

p(y) = Cat(y|\boldsymbol\pi);\quad p(\textbf{z}) = \mathcal{N}(\textbf{z}|\textbf{0},\textbf{I});\quad p_\theta(\textbf{x}|y,\textbf{z}) = f(\textbf{x}; y, \textbf{z}, \boldsymbol\theta), \quad(2)

\end{math}

\newline

\newline

where \ensuremath{Cat(y|\boldsymbol\pi)} is the multinomial (categorical) distribution; the class label y is treated as a hidden (latent) variable if no class label is available, and \textbf{z} is an additional hidden (latent) variable. These hidden (latent) variables are marginally independent.

As in Model 1, \begin{math} f(\textbf{x};y,\textbf{z},\boldsymbol\theta) \end{math} is a Gaussian distribution, parameterized by a non-linear function (a deep neural network) of the hidden (latent) variables.

\newline\newline

Since most labels y are unobserved, we integrate over the class of any unlabeled data during inference, thus performing classification as inference (deriving the label from the observed data and the model). The inferred posterior distribution is then used to impute the missing labels.
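Numerically, treating the missing label as latent means the bound for an unlabeled point is a mixture of the per-class labelled bounds, weighted by the inferred class probabilities, minus their entropy. The sketch below uses purely illustrative numbers; \ensuremath{q_\phi(y|\textbf{x})} and the per-class losses would come from the trained networks, and \ensuremath{\mathcal{L}(\textbf{x},y)} is treated as a loss (the negative labelled bound).

```python
import numpy as np

# Illustrative quantities for one unlabeled taxpayer x:
q_y = np.array([0.7, 0.3])     # q_phi(y|x): inferred class probabilities
L_xy = np.array([12.4, 15.1])  # labelled loss L(x, y) evaluated at y = 0, 1

# Entropy of q_phi(y|x); it rewards keeping the class belief uncertain.
entropy = -np.sum(q_y * np.log(q_y))

# Unlabeled loss: class-weighted labelled losses minus the entropy term.
U_x = np.sum(q_y * L_xy) - entropy

# "Classification as inference": impute the most probable class label.
y_hat = int(np.argmax(q_y))
```

Note that the entropy term makes the unlabeled loss strictly smaller than the plain class-weighted average, so the objective never pays full price for a label it is unsure about.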

\newline\newline

\textbf{Stacked generative semi-supervised model: }

The two models can be stacked: first, \textbf{Model 1} learns a new hidden (latent) representation \textbf{z$_1$} using the generative model, and afterwards the generative semi-supervised \textbf{Model 2} is trained using \textbf{z$_1$} instead of the raw data (\textbf{x}).

The outcome is a deep generative model with two layers:

\newline\newline

\begin{math} p_\theta(\textbf{x}, y, \textbf{z}_1, \textbf{z}_2) = p(y)\,p(\textbf{z}_2)\,p_\theta(\textbf{z}_1|y, \textbf{z}_2)\,p_\theta(\textbf{x}|\textbf{z}_1) \end{math}

\newline\newline

where the priors \ensuremath{p(y)} and \ensuremath{p(\textbf{z}_2)} equal those of \ensuremath{y} and \textbf{z} above, and both \ensuremath{p_\theta(\textbf{z}_1|y, \textbf{z}_2)} and \ensuremath{p_\theta(\textbf{x}|\textbf{z}_1)} are parameterized as deep neural networks.
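The factorization above can be read as an ancestral sampling procedure: draw y and \textbf{z$_2$} from their priors, then \textbf{z$_1$} given both, then \textbf{x} given \textbf{z$_1$}. The sketch below follows exactly that order; the two conditional "networks" are random linear maps standing in for the learned deep networks, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ancestral sampling from the two-layer factorization
# p(x, y, z1, z2) = p(y) p(z2) p_theta(z1|y, z2) p_theta(x|z1).
pi = np.array([0.5, 0.5])          # p(y): class prior
d2, d1, dx = 2, 4, 10              # illustrative dimensions

A = rng.normal(size=(d2 + 2, d1))  # stand-in for the z1-network
B = rng.normal(size=(d1, dx))      # stand-in for the x-network

def sample():
    y = rng.choice(2, p=pi)                  # y  ~ Cat(pi)
    z2 = rng.normal(size=d2)                 # z2 ~ N(0, I)
    onehot = np.eye(2)[y]                    # condition on the class label
    mu1 = np.tanh(np.concatenate([z2, onehot]) @ A)
    z1 = mu1 + 0.1 * rng.normal(size=d1)     # z1 ~ p_theta(z1|y, z2)
    x = np.tanh(z1 @ B) + 0.1 * rng.normal(size=dx)  # x ~ p_theta(x|z1)
    return x, y, z1, z2

x, y, z1, z2 = sample()
```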

The computation of the exact posterior distribution is intractable because of the non-linear, non-conjugate dependencies between the random variables. To allow for scalable inference and parameter learning, recent advances in variational inference (Kingma and Welling, 2014; Rezende et al., 2014) are utilized. A fixed-form distribution \ensuremath{q_\phi(\textbf{z}|\textbf{x})} with parameters \ensuremath{\phi} is introduced to approximate the true posterior distribution \begin{math} p(\textbf{z}|\textbf{x}) \end{math}.

\newline\newline

The variational principle is used to derive a lower bound on the marginal likelihood of the model. Maximizing this variational bound drives the approximate posterior towards the true posterior. The approximate posterior distribution \begin{math} q_\phi(\cdot) \end{math} is constructed as an inference or recognition model (Dayan, 2000; Kingma and Welling, 2014; Rezende et al., 2014; Stuhlmuller et al., 2013).

\newline\newline

The use of an inference network introduces a set of global variational parameters \begin{math} \phi \end{math}, allowing for fast inference at both training and test time, because the cost of inference is amortized over the posterior estimates for all hidden (latent) variables through the parameters of the inference network. An inference network is introduced for all hidden (latent) variables; these networks are parameterized as deep neural networks, whose outputs construct the parameters of the distribution \ensuremath{q_\phi(\cdot)}.

\newline\newline

For the latent-feature discriminative model (Model 1), we use a Gaussian inference network \begin{math} q_\phi(\textbf{z}|\textbf{x}) \end{math} for the hidden (latent) variable \textbf{z}. For the generative semi-supervised model (Model 2), we introduce an inference model for the hidden (latent) variables \textbf{z} and \ensuremath{y}, which is assumed to have the factorized form
\begin{math} q_\phi(\textbf{z}, y|\textbf{x}) = q_\phi(\textbf{z}|\textbf{x})q_\phi(y|\textbf{x}) \end{math}, specified as Gaussian and multinomial distributions respectively.

\newline\newline

\textbf{Model 1:} \ensuremath{q _\phi(\textbf{z}|\textbf{x}) = \mathcal{N} (\textbf{z}| \boldsymbol\mu _\phi(\textbf{x}), diag( \boldsymbol\sigma^2 _\phi(\textbf{x})))}, \quad(3)

\newline

\textbf{Model 2:} \ensuremath{q_\phi(\textbf{z}|y,\textbf{x}) = \mathcal{N}(\textbf{z}|\boldsymbol\mu_\phi(y,\textbf{x}), diag(\boldsymbol\sigma^2_\phi(\textbf{x})));\ q_\phi(y|\textbf{x}) = Cat(y|\boldsymbol\pi_\phi(\textbf{x}))}, \quad(4)

\newline\newline


where:

\ensuremath{\boldsymbol\sigma _\phi(\textbf{x})} is a vector of standard deviations,

\ensuremath{\boldsymbol\pi _\phi(\textbf{x})} is a probability vector,

functions \ensuremath{\boldsymbol\mu_\phi(\textbf{x})}, \ensuremath{\boldsymbol\sigma_\phi(\textbf{x})} and \ensuremath{\boldsymbol\pi_\phi(\textbf{x})} are represented as \textbf{MLPs} (multilayer perceptrons).
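A minimal numerical sketch of the Model 1 inference network of Eq. (3): a small MLP with two output heads for \ensuremath{\boldsymbol\mu_\phi(\textbf{x})} and \ensuremath{\log\boldsymbol\sigma^2_\phi(\textbf{x})}, from which \textbf{z} is drawn by the reparameterization \ensuremath{\textbf{z} = \boldsymbol\mu + \boldsymbol\sigma \odot \boldsymbol\epsilon}. The weights are random placeholders for the learned parameters \ensuremath{\phi}, and the layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

dx, dh, dz = 10, 8, 3                       # illustrative dimensions
W1 = rng.normal(scale=0.3, size=(dx, dh))   # shared hidden layer weights
W_mu = rng.normal(scale=0.3, size=(dh, dz)) # head producing mu_phi(x)
W_ls = rng.normal(scale=0.3, size=(dh, dz)) # head producing log sigma^2_phi(x)

def infer(x):
    h = np.tanh(x @ W1)                     # hidden layer of the MLP
    mu = h @ W_mu                           # mu_phi(x)
    log_var = h @ W_ls                      # log sigma^2_phi(x)
    eps = rng.normal(size=mu.shape)         # eps ~ N(0, I)
    z = mu + np.exp(0.5 * log_var) * eps    # reparameterized sample of z
    return z, mu, log_var

x = rng.normal(size=(5, dx))                # batch of 5 observations
z, mu, log_var = infer(x)
```

Predicting the log-variance rather than the variance keeps the standard deviation positive without any constraint on the network output, which is the usual design choice for Gaussian inference networks.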

\newline\newline

\textbf{Generative Semi-supervised Model Objective}

When the label corresponding to a data point is observed, the variational bound is:

\newline\newline

\begin{math}

\log p_\theta(\textbf{x}, y) \geq \mathbb{E}_{q_\phi(\textbf{z}|\textbf{x},y)}\left[\log p_\theta(\textbf{x}|y,\textbf{z}) + \log p_\theta(y) + \log p(\textbf{z}) - \log q_\phi(\textbf{z}|\textbf{x},y)\right] = -\mathcal{L}(\textbf{x},y), \quad(5)

\end{math}

\newline\newline

The objective function is minimized using AdaGrad, a gradient-descent-based optimization algorithm that automatically tunes the learning rate based on the geometry of the data it observes. AdaGrad is designed to perform well with datasets that have infrequently occurring features.
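The per-parameter adaptation AdaGrad performs can be shown on a toy objective; the update divides each gradient component by the root of its accumulated squared gradients, so parameters that rarely receive gradient signal take proportionally larger steps. The quadratic loss below is illustrative only, not the variational objective itself.

```python
import numpy as np

# Toy objective: f(w) = 0.5 * ||w - target||^2, gradient = w - target.
target = np.array([3.0, -1.0])
w = np.zeros(2)
G = np.zeros(2)        # running sum of squared gradients, per parameter
eta, eps = 0.5, 1e-8   # base learning rate, numerical safeguard

for _ in range(200):
    grad = w - target
    G += grad ** 2
    # AdaGrad step: each coordinate gets its own effective learning rate.
    w -= eta * grad / (np.sqrt(G) + eps)
```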

\chapter{Evaluation}

The model was used to analyze taxpayer data from the Cyprus Tax Department database in order to identify taxpayers yielding material additional tax in the event of a VAT audit.

The deep generative model for semi-supervised learning enables increased efficiency in the audit selection process. Its input includes both audited (supervised) and unaudited (unsupervised) taxpayer data. Its output is a collection of binary labels, one per taxpayer, with value good (1) or bad (0). A taxpayer expected to yield material additional tax after an audit is classified as good (1).

\newline\newline

Nearly all the VAT returns of the last few years were processed in order to generate the features used by the model. These were selected based on the advice of experienced field auditors, data analysts and rules from rule-based models. Some of the selected fields were further processed to generate extra fields. The selected features broadly relate to business characteristics such as the location of the business, the type of business and figures from its tax returns.

For data preparation, the data was cleaned; for example, we removed taxpayers with little or no tax history, mainly new businesses.

\newline\newline

The details of the criteria used to select the features, the feature processing, the newly generated features, the number of features and the cleansing process cannot be disclosed due to the confidential nature of audit selection. Publication of the features could also compromise future audit selection and would be unlawful.

For modelling, taxpayer data from the tax department registry (such as economic activity) and from the tax returns \ensuremath{(X)}, together with actual audit results \ensuremath{(Y)}, appear as pairs:

\newline\newline

\ensuremath{(X, Y) = \{(\textbf{x}_1, y_1), \ldots, (\textbf{x}_N, y_N)\}}

\newline\newline

with the \ensuremath{i}th observation \ensuremath{\textbf{x}_i} and the corresponding class label \ensuremath{y_i \in \{1, \ldots, L\}} for the taxpayers audited.

\newline\newline

For each observation we infer corresponding hidden (latent) variables denoted by \textbf{z}. In semi-supervised classification, where both audited and unaudited taxpayers are utilized, only a subset of all the taxpayers have corresponding class labels (audit results). The empirical distributions over the labelled (audited) and unlabelled (not audited) subsets are denoted \ensuremath{p(x, y)} and \ensuremath{p(x)} respectively.

\newline\newline

For building the model, TensorFlow was used, an open-source software library for high-performance numerical computation, running on top of the Python programming language. The hardware used was a custom-built machine of the Cyprus Tax Department with an NVIDIA 10-series Graphics Processing Unit. The performance was measured using k-fold cross-validation on the training data.
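The k-fold procedure mentioned above can be sketched as follows: partition the audited taxpayers into k folds, train on k-1 folds, evaluate on the held-out fold, and average the fold accuracies. The data is synthetic and the "model" is a trivial majority-class predictor standing in for the full network; only the cross-validation mechanics are the point here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy audited data: 100 taxpayers, 5 features, binary audit outcomes.
X = rng.normal(size=(100, 5))
y = (rng.random(100) < 0.6).astype(int)

k = 5
idx = rng.permutation(len(y))       # shuffle before splitting into folds
folds = np.array_split(idx, k)

accs = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    # "Train" the stand-in model: memorize the majority class.
    majority = int(np.bincount(y[train]).argmax())
    # Evaluate on the held-out fold only.
    accs.append(float(np.mean(y[test] == majority)))

cv_accuracy = float(np.mean(accs))  # averaged over the k held-out folds
```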

\newline\newline

The model was trained on actual tax audit data collected in prior years (supervised) and on actual data of unaudited taxpayers (unsupervised). The threshold over which an audit yield is classified as material was set following internal guidelines. The same model was used for large, medium and small taxpayers, irrespective of the economic activity classification (NACE code). The predictions made by the model matched the actual audit findings with an accuracy of 78.4\%. The results compare favorably with peer results using Data Mining Based Tax Audit Selection, which reported an accuracy of 51\% (Kuo-Wei Hsu et al., 2015).

\section{The confusion matrix}

The confusion matrix in Table 1 represents the classification of the model on the training data set. Columns correspond to predictions and rows to actual outcomes. The top-left element indicates correctly classified good cases; the top-right element indicates the tax audits lost (i.e. cases predicted as bad turning out to be good). The bottom-left element indicates tax audits incorrectly predicted as good, and the bottom-right element indicates correctly predicted bad tax audits.

The confusion matrix indicates that the model is balanced. The actual numbers are not disclosed for confidentiality reasons; instead, they are presented as percentages.
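The layout and normalization described above can be reproduced as follows. The counts are purely illustrative (the real figures are confidential); they are chosen only so that the diagonal reproduces the reported 78.4\% overall accuracy, with rows for actual outcomes and columns for predictions.

```python
import numpy as np

# Illustrative counts only -- the real audit counts are confidential.
# Rows: actual good / actual bad; columns: predicted good / predicted bad.
counts = np.array([[390, 110],   # [correct good, lost audits (good predicted bad)]
                   [106, 394]])  # [bad predicted good, correct bad]

# Normalize to percentages of the total, as presented in Table 1.
percent = 100.0 * counts / counts.sum()

# Overall accuracy: the diagonal (correct predictions) over the total.
accuracy = 100.0 * np.trace(counts) / counts.sum()
```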