Compact Muon Solenoid
LHC, CERN

CMS-PAS-MLG-24-002
Wasserstein normalized autoencoder
Abstract: A novel approach to unsupervised jet tagging is presented for the CMS experiment at the CERN LHC. The Wasserstein normalized autoencoder (WNAE) is a normalized probabilistic model that minimizes the Wasserstein distance between the probability distribution of the training data and the Boltzmann distribution of the reconstruction error of the autoencoder. Trained on jets of particles from simulated standard model processes, the WNAE is shown to learn the probability distribution of the input data, in a fully unsupervised fashion, in order to effectively identify new physics jets as anomalies. This algorithm has been developed and applied in the context of a recent search for semivisible jets. The model consistently demonstrates stable, convergent training and achieves strong classification performance across a wide range of signals, improving upon standard normalized autoencoders, while remaining agnostic to the signal. The WNAE directly tackles the problem of outlier reconstruction, a common failure mode of autoencoders in anomaly detection tasks.
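As a compact reference for the quantities appearing in the figures below, the construction described in the abstract can be sketched as follows; the notation and the omission of a temperature parameter are simplifications and do not necessarily match the exact conventions of the analysis:

$$ E_\theta(x) = \left\| x - f_\theta(x) \right\|^2, \qquad p_\theta(x) = \frac{e^{-E_\theta(x)}}{\int e^{-E_\theta(x')}\,\mathrm{d}x'}, \qquad \mathcal{L}_{\text{WNAE}} = W\!\left(p_{\text{data}},\, p_\theta\right), $$

where $ f_\theta $ denotes the autoencoder, $ E_\theta $ its reconstruction error (the energy), $ p_\theta $ the corresponding Boltzmann distribution, $ p_{\text{data}} $ the probability distribution of the training data, and $ W $ the Wasserstein distance.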
Figures

Figure 1:
Schematic visualization of the outlier reconstruction failure mode. Signal events drawn from the hatched area are reconstructed well by the autoencoder (AE), despite not being part of the training set, and thus will not be separated from the background. The AE training is assumed to have converged such that the background is reconstructed well.

Figure 2:
Left: the reconstruction error (upper panel) and the AUC scores (lower panel) for the AE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background, evaluated during each training epoch on $ \mathrm{t} \overline{\mathrm{t}} $ background jets and signal models with $ m_{\Phi} = $ 2000 GeV and $ r_{\text{inv}} = $ 0.3 (upper) or $ r_{\text{inv}} = $ 0.1, 0.3, 0.5, 0.7 (lower). Right: the AUC scores for the same AE, evaluated for the epoch with the minimal background reconstruction error, for the classification of several SVJ signal hypotheses against the $ \mathrm{t} \overline{\mathrm{t}} $ background. The AUC scores are close to 0.5, indicating that the AE is unable to discriminate between the SVJ signal and the $ \mathrm{t} \overline{\mathrm{t}} $ background.
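As an illustration of how AUC scores such as those shown here can be computed from per-jet reconstruction errors, the following minimal sketch uses scikit-learn; the array names err_bkg and err_sig are illustrative, and the actual evaluation code of the analysis may differ.

import numpy as np
from sklearn.metrics import roc_auc_score

# err_bkg, err_sig: per-jet reconstruction errors for background and signal
# jets (illustrative names); the reconstruction error serves as the anomaly
# score, so larger values should indicate more signal-like jets
scores = np.concatenate([err_bkg, err_sig])
labels = np.concatenate([np.zeros(len(err_bkg)), np.ones(len(err_sig))])
auc = roc_auc_score(labels, scores)  # 0.5 corresponds to no discrimination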

Figure 3:
Left: NAE training showing the divergence of the loss function, in terms of positive and negative energy (upper panel), and the AUC for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel). Right: from the same training, the positive and negative energies are shown before the divergence, illustrating their differences.

Figure 4:
Histograms of the input feature $ \tau_{3} $ for positive, negative, and signal samples, before (epoch 274) and after (epochs 275--279) the start of the divergence of the NAE loss.

Figure 5:
Positive and negative energies (upper panel) and AUC for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel) during training with the loss function in Eq. xxxxx.

Figure 6:
Schematic representation of the mode collapse when using the loss function described in Eq. xxxxx. Left: before the mode collapse, $ \mathcal{E} $ and $ \mathcal{B} $ overlap, while $ \mathcal{E} $ and $ \mathcal{S} $ do not. Right: after the mode collapse, $ \mathcal{E} $ expands and can partially include $ \mathcal{S} $, reducing the difference in AE reconstruction error and thus lowering the AUC score. $ E_+ $ and $ E_- $ respectively denote the positive and negative energies. In both cases, the difference between the positive and the negative energies is zero.

Figure 7:
Upper: the positive and negative energies (upper panel), and the Wasserstein distance between the positive and negative samples together with the AUC scores for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel), during the training of an NAE with the loss function in Eq. xxxxx. Lower left: the AUC scores for an NAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested against a grid of possible SVJ signals, before the increase of the Wasserstein distance (at epoch 3000). Lower right: the AUC scores for the same NAE after the increase in the Wasserstein distance (at epoch 10000).

Figure 8:
Flowchart of the Wasserstein normalized autoencoder training. The positive examples are passed through the autoencoder, and the negative examples are generated via Markov chain Monte Carlo (MCMC). The Wasserstein distance is calculated between the positive and negative examples, and the gradients are backpropagated through the entire MCMC chain.
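The flowchart can be summarized by the following minimal sketch of one WNAE training step in PyTorch. The Langevin-type MCMC update, the number of steps, the step size, the noise scale, and the per-feature one-dimensional Wasserstein estimator are illustrative assumptions rather than the exact choices of the analysis; the essential features are that the negative examples are sampled by MCMC from the model distribution and that the loss gradients flow through the entire chain.

import torch

def reco_error(ae, x):
    # Per-example squared reconstruction error, playing the role of the energy
    return ((ae(x) - x) ** 2).sum(dim=1)

def wasserstein_1d(x_pos, x_neg):
    # Differentiable stand-in for the Wasserstein distance: per-feature
    # one-dimensional optimal transport between two equal-size batches
    pos_sorted, _ = torch.sort(x_pos, dim=0)
    neg_sorted, _ = torch.sort(x_neg, dim=0)
    return (pos_sorted - neg_sorted).abs().mean()

def wnae_step(ae, optimizer, x_pos, n_steps=10, step_size=1e-2, noise=1e-2):
    # Negative examples: start from uniform noise and run a short Langevin
    # MCMC chain toward low reconstruction error (high model probability)
    x_neg = torch.rand_like(x_pos).requires_grad_(True)
    for _ in range(n_steps):
        energy = reco_error(ae, x_neg).sum()
        # create_graph=True keeps the chain in the computational graph,
        # so the loss gradients are backpropagated through all MCMC steps
        (grad_x,) = torch.autograd.grad(energy, x_neg, create_graph=True)
        x_neg = x_neg - step_size * grad_x + noise * torch.randn_like(x_neg)
    # Wasserstein distance between the positive and negative examples
    loss = wasserstein_1d(x_pos, x_neg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()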

Figure 9:
Left: the Wasserstein distance between pairs of the positive, negative, and signal samples during the WNAE training. Right: the AUC scores from the same WNAE for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $.

Figure 10:
The AUC scores for a WNAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested on a grid of possible SVJ signal models.

Figure 11:
The distributions of half of the input variables, $ \tau_{2} $, $ \tau_{3} $, EFP1, and $ C_2^{(0.5)} $, for the positive, negative, and signal samples, at the start (upper) and at the end (lower) of the WNAE training.

Figure 12:
The distributions of the other half of the input variables, axis major, axis minor, $ p_{\mathrm{T}}^{\mathrm{D}} $, and soft-drop mass, for the positive, negative, and signal samples, at the start (upper) and at the end (lower) of the WNAE training.

Figure 13:
The Wasserstein distance between the positive and negative samples and the AUC score during the training of a WNAE on an SVJ signal ($ m_{\Phi} = $ 2000 GeV, $ r_{\text{inv}} = $ 0.3), with the $ \mathrm{t} \overline{\mathrm{t}} $ background used for testing.
Summary
Autoencoder-based anomaly detection relies on learning a reconstruction error such that phase space regions with low probability density have high reconstruction error and can be identified as anomalous. However, standard autoencoders are prone to learning to reconstruct outliers, because they are free to minimize the reconstruction error outside the training phase space. The normalized autoencoder paradigm promotes the autoencoder reconstruction error to an energy function in the framework of energy-based models, in order to define a normalized probabilistic model. This is achieved by minimizing the negative log-likelihood of the training data, given the energy-based model probability. In practice, this construction presents a number of failure modes, such as divergence of the loss function and phase space degeneracy, which can lead phase space regions distinct from the training data to have low reconstruction error. The Wasserstein normalized autoencoder, an improvement over normalized autoencoders, is introduced to solve these failure modes. This is achieved by using the Wasserstein distance to quantify the difference between the probability distribution of the training data and the Boltzmann distribution of the energy function of the model. Using simulated samples from the CMS experiment, the classification of out-of-distribution examples by the Wasserstein normalized autoencoder is shown to be on par with or better than that of the normalized autoencoder. Furthermore, the Wasserstein distance is found to be a robust metric for defining a stopping condition for the training in a fully signal-agnostic fashion.
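As an illustration of the signal-agnostic stopping condition mentioned above, the following sketch monitors the Wasserstein distance between positive and negative samples during training and stops once it no longer decreases. The distance is evaluated here with the POT optimal-transport library; the patience scheme and the helper functions train_one_epoch and draw_samples are hypothetical placeholders, not the implementation used in the analysis.

import numpy as np
import ot  # POT: Python optimal transport

def wasserstein_distance(x_pos, x_neg):
    # Empirical Wasserstein distance between two point clouds with uniform
    # weights, computed with the exact optimal-transport solver
    a = np.full(len(x_pos), 1.0 / len(x_pos))
    b = np.full(len(x_neg), 1.0 / len(x_neg))
    cost = ot.dist(x_pos, x_neg)  # pairwise squared Euclidean distances
    return np.sqrt(ot.emd2(a, b, cost))

best, wait, patience = np.inf, 0, 20
for epoch in range(10000):
    train_one_epoch()              # hypothetical: one epoch of WNAE training
    x_pos, x_neg = draw_samples()  # hypothetical: positive and negative batches
    w = wasserstein_distance(x_pos, x_neg)
    if w < best:
        best, wait = w, 0          # improvement: checkpoint the model here
    else:
        wait += 1
    if wait >= patience:           # stop once the distance stops decreasing
        break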