CMS-PAS-MLG-24-002
Wasserstein normalized autoencoder  
CMS Collaboration  
26 September 2024  
Abstract: A novel approach to unsupervised jet tagging is presented for the CMS experiment at the CERN LHC. The Wasserstein normalized autoencoder (WNAE) is a normalized probabilistic model that minimizes the Wasserstein distance between the probability distribution of the training data and the Boltzmann distribution of the reconstruction error of the autoencoder. Trained on jets of particles from simulated standard model processes, the WNAE is shown to learn the probability distribution of the input data, in a fully unsupervised fashion, in order to effectively identify new physics jets as anomalies. This algorithm has been developed and applied in the context of a recent search for semivisible jets. The model consistently demonstrates stable, convergent training and achieves strong classification performance across a wide range of signals, improving upon standard normalized autoencoders, while remaining agnostic to the signal. The WNAE directly tackles the problem of outlier reconstruction, a common failure mode of autoencoders in anomaly detection tasks.  
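In schematic form (illustrative notation, not the note's exact equations), the autoencoder reconstruction error is promoted to an energy function that defines a normalized Boltzmann density,
$$ E_\theta(x) = \lVert x - f_\theta(x) \rVert^2, \qquad p_\theta(x) = \frac{e^{-E_\theta(x)}}{\int e^{-E_\theta(x')} \, \mathrm{d}x'}, $$
where $ f_\theta $ denotes the autoencoder; the WNAE is then trained by minimizing the Wasserstein distance $ W(p_{\text{data}}, p_\theta) $ between the distribution of the training data and $ p_\theta $.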
Figures  
Figure 1:
Schematic visualization of the outlier reconstruction failure mode. Signal events drawn from the hatched area are reconstructed well by the AE, despite not being part of the training set, and thus will not be separated from the background. The AE training is assumed to have converged such that the background is reconstructed well. 
Figure 2:
Left: the reconstruction error (upper panel) and the AUC scores (lower panel) for the AE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background, evaluated during each training epoch on $ \mathrm{t} \overline{\mathrm{t}} $ background jets and signal models with $ m_{\Phi} = $ 2000 GeV and $ r_{\text{inv}} = $ 0.3 (upper) or $ r_{\text{inv}} = $ 0.1, 0.3, 0.5, 0.7 (lower). Right: the AUC scores for the same AE, evaluated for the epoch with the minimal background reconstruction error, for the classification of several SVJ signal hypotheses against the $ \mathrm{t} \overline{\mathrm{t}} $ background. The AUC scores are close to 0.5, indicating that the AE is unable to discriminate between the SVJ signal and the $ \mathrm{t} \overline{\mathrm{t}} $ background.
Figure 3:
Left: NAE training showing the divergence of the loss function, in terms of positive and negative energy (upper panel), and the AUC for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel). Right: from the same training, the positive and negative energies are shown before the divergence, illustrating their differences. 
Figure 4:
Histograms of the input feature $ \tau_{3} $ for positive, negative, and signal samples, before (epoch 274) and after (epochs 275–279) the start of the divergence of the NAE loss.
Figure 5:
Positive and negative energies (upper panel) and AUC for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel) during training with the loss function in Eq. xxxxx. 
Figure 6:
Schematic representation of the mode collapse when using the loss function described in Eq. xxxxx. Left: before the mode collapse, $ \mathcal{E} $ and $ \mathcal{B} $ overlap, while $ \mathcal{E} $ and $ \mathcal{S} $ do not. Right: after the mode collapse, $ \mathcal{E} $ expands and can partially include $ \mathcal{S} $, reducing the difference in AE reconstruction error and thus lowering the AUC score. $ E_+ $ and $ E_- $ respectively denote the positive and negative energies. In both cases, the difference between the positive and the negative energies is zero.
Figure 7:
Upper: the positive and negative energies (upper panel), and the Wasserstein distance between the positive and negative samples together with the AUC scores for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel), during the training of an NAE with the loss function in Eq. xxxxx. Lower left: the AUC scores for an NAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested against a grid of possible SVJ signals, before the increase of the Wasserstein distance (at epoch 3000). Lower right: the AUC scores for the same NAE after the increase in Wasserstein distance (at epoch 10000).
Figure 8:
Flowchart of the Wasserstein normalized autoencoder training. The positive examples are passed through the autoencoder, and the negative examples are generated via MCMC. The Wasserstein distance is calculated between the positive and negative examples, and the gradients are backpropagated through the entire MCMC chain. 
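As an illustration of this flowchart, a minimal training step might be organized as follows. This is a sketch, not the CMS implementation: the autoencoder module, the Langevin-dynamics hyperparameters (chain length, step size, noise scale), and the use of the POT optimal-transport library for the empirical Wasserstein distance are all assumptions made for the example.

import torch
import ot  # POT, the Python optimal-transport library

def energy(x, model):
    # Reconstruction error promoted to an energy: E(x) = ||x - AE(x)||^2
    return ((x - model(x)) ** 2).sum(dim=1)

def wnae_training_step(x_pos, model, optimizer,
                       n_mcmc_steps=10, step_size=1e-2, noise_scale=1e-2):
    # Negative examples are generated by a Langevin MCMC chain started from
    # noise; create_graph=True keeps the computation graph, so the loss
    # gradient is backpropagated through the entire chain, as in the flowchart.
    x_neg = torch.rand_like(x_pos).requires_grad_(True)
    for _ in range(n_mcmc_steps):
        grad_x = torch.autograd.grad(energy(x_neg, model).sum(),
                                     x_neg, create_graph=True)[0]
        x_neg = (x_neg - step_size * grad_x
                 + noise_scale * torch.randn_like(x_neg))

    # Empirical Wasserstein distance between the positive and negative
    # batches, with a squared-Euclidean ground cost.
    a = torch.full((x_pos.shape[0],), 1.0 / x_pos.shape[0])
    b = torch.full((x_neg.shape[0],), 1.0 / x_neg.shape[0])
    cost = torch.cdist(x_pos, x_neg) ** 2
    loss = ot.emd2(a, b, cost)  # differentiable w.r.t. the cost matrix

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Details of the actual analysis, such as the MCMC initialization strategy and any regularization terms, are omitted here.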
Figure 9:
Left: the Wasserstein distance between pairs of the positive, negative, and signal samples during the WNAE training. Right: the AUC scores from the same WNAE for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $. 
Figure 10:
The AUC scores for a WNAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested on a grid of possible SVJ signal models. 
Figure 11:
The distributions of half of the input variables, $ \tau_{2} $, $ \tau_{3} $, EFP1, and $ C_2^{(0.5)} $, for the positive, negative, and signal samples, at the start (upper) and at the end (lower) of the WNAE training. 
Figure 12:
The distributions of the other half of the input variables, axis major, axis minor, $ p_{\mathrm{T}}^{\mathrm{D}} $, and soft-drop mass, for the positive, negative, and signal samples, at the start (upper) and at the end (lower) of the WNAE training.
Figure 13:
The Wasserstein distance between the positive and negative samples and the AUC score during the training of a WNAE on an SVJ signal ($ m_{\Phi} = $ 2000 GeV, $ r_{\text{inv}} = $ 0.3), with the $ \mathrm{t} \overline{\mathrm{t}} $ background used for testing. 
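Continuing the sketch above, monitoring this distance suggests a simple signal-agnostic stopping condition: retain the model state with the smallest positive-negative Wasserstein distance and stop once it has not improved for some number of epochs. The helper names (wnae_training_step from the earlier sketch, a hypothetical next_batch loader, n_epochs) and the patience value are illustrative assumptions.

best_distance = float("inf")
best_state, patience, since_best = None, 50, 0
for epoch in range(n_epochs):
    # next_batch() is a hypothetical loader yielding background (positive) examples
    distance = wnae_training_step(next_batch(), model, optimizer)
    if distance < best_distance:
        best_distance, since_best = distance, 0
        best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    else:
        since_best += 1
        if since_best >= patience:
            break  # the distance stopped improving
model.load_state_dict(best_state)  # restore the best checkpoint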
Summary 
Autoencoder-based anomaly detection relies on learning a reconstruction error such that phase space regions with low probability density have high reconstruction error and can be identified as anomalous. However, standard autoencoders are prone to learn to reconstruct outliers because they are free to minimize the reconstruction error outside the training phase space. The normalized autoencoder paradigm promotes the autoencoder reconstruction error to an energy function in the framework of energy-based models, in order to define a normalized probabilistic model. This is achieved by minimizing the negative log-likelihood of the training data, given the energy-based model probability. In practice, this construction presents a number of failure modes, such as divergence of the loss function and phase space degeneracy, leading phase space regions distinct from the training data to have low reconstruction error. The Wasserstein normalized autoencoder, an improvement over normalized autoencoders, is introduced to solve the aforementioned failure modes. This is achieved by using the Wasserstein distance to quantify the difference between the probability distribution of the training data and the Boltzmann distribution of the energy function of the model. Using simulated samples from the CMS experiment, the classification of out-of-distribution examples by the Wasserstein normalized autoencoder is shown to be on par with or better than that of the normalized autoencoder. Furthermore, the Wasserstein distance is found to be a robust metric to define a stopping condition for the training in a fully signal-agnostic fashion.
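In the same illustrative notation as above, the normalized autoencoder minimizes the negative log-likelihood
$$ \mathcal{L}_{\text{NAE}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ E_\theta(x) \right] + \log Z_\theta, \qquad Z_\theta = \int e^{-E_\theta(x)} \, \mathrm{d}x, $$
whose gradient
$$ \nabla_\theta \mathcal{L}_{\text{NAE}} = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \nabla_\theta E_\theta(x) \right] - \mathbb{E}_{x \sim p_\theta} \left[ \nabla_\theta E_\theta(x) \right] $$
is the difference between the positive energy term, estimated on the training data, and the negative energy term, estimated on MCMC samples drawn from $ p_\theta $. The Wasserstein normalized autoencoder replaces this objective with the Wasserstein distance $ W(p_{\text{data}}, p_\theta) $ between the two distributions.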