CMS-PAS-MLG-24-002
Wasserstein normalized autoencoder
CMS Collaboration
26 September 2024
Abstract: A novel approach to unsupervised jet tagging is presented for the CMS experiment at the CERN LHC. The Wasserstein normalized autoencoder (WNAE) is a normalized probabilistic model that minimizes the Wasserstein distance between the probability distribution of the training data and the Boltzmann distribution of the reconstruction error of the autoencoder. Trained on jets of particles from simulated standard model processes, the WNAE is shown to learn the probability distribution of the input data in a fully unsupervised fashion, allowing it to effectively identify new physics jets as anomalies. This algorithm has been developed and applied in the context of a recent search for semivisible jets. The model consistently demonstrates stable, convergent training and achieves strong classification performance across a wide range of signals, improving upon standard normalized autoencoders while remaining agnostic to the signal. The WNAE directly tackles the problem of outlier reconstruction, a common failure mode of autoencoders in anomaly detection tasks.
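In the energy-based-model language used here, the construction can be stated schematically as follows, where $ f_\theta $ denotes the autoencoder and its reconstruction error is promoted to an energy function (the notation below is illustrative and does not reproduce the note's own equations):
$$ E_\theta(x) = \lVert x - f_\theta(x) \rVert^2, \qquad p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \qquad Z_\theta = \int e^{-E_\theta(x)} \, \mathrm{d}x . $$
The WNAE training then minimizes the Wasserstein distance $ W(p_{\text{data}}, p_\theta) $ between the probability distribution of the training data and this Boltzmann distribution.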
Links: CDS record (PDF); CADI line (restricted)
Figures
Figure 1:
Schematic visualization of the outlier reconstruction failure mode. Signal events drawn from the hatched area are reconstructed well by the AE, despite not being part of the training set, and thus will not be separated from the background. The AE training is assumed to have converged such that the background is reconstructed well.
Figure 2:
Left: the reconstruction error (upper panel) and the AUC scores (lower panel) for the AE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background, evaluated at each training epoch on $ \mathrm{t} \overline{\mathrm{t}} $ background jets and on signal models with $ m_{\Phi} = $ 2000 GeV and $ r_{\text{inv}} = $ 0.3 (upper) or $ r_{\text{inv}} = $ 0.1, 0.3, 0.5, 0.7 (lower). Right: the AUC scores for the same AE, evaluated at the epoch with the minimal background reconstruction error, for the classification of several SVJ signal hypotheses against the $ \mathrm{t} \overline{\mathrm{t}} $ background. The AUC scores are close to 0.5, indicating that the AE is unable to discriminate between the SVJ signal and the $ \mathrm{t} \overline{\mathrm{t}} $ background.
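The AUC scores quoted in these figure captions use the jet reconstruction error as the anomaly score. A minimal sketch of this evaluation is given below, assuming the per-jet reconstruction errors are already available as NumPy arrays; the array names and the random stand-in values are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative stand-ins for the per-jet reconstruction errors of
# background (label 0) and signal (label 1) jets.
rng = np.random.default_rng(0)
err_bkg = rng.random(1000)
err_sig = rng.random(1000)

scores = np.concatenate([err_bkg, err_sig])
labels = np.concatenate([np.zeros_like(err_bkg), np.ones_like(err_sig)])

# An AUC close to 0.5 means the reconstruction error carries no
# signal/background separation, the failure mode shown in the right panel.
print(f"AUC = {roc_auc_score(labels, scores):.3f}")
```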
Figure 3:
Left: NAE training showing the divergence of the loss function, in terms of positive and negative energy (upper panel), and the AUC for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel). Right: from the same training, the positive and negative energies are shown before the divergence, illustrating their differences.
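For context, the positive and negative energies shown here are the two terms of the NAE loss. Schematically (a standard energy-based-model relation, restated here for orientation rather than copied from the note, with $ E_\theta $ the reconstruction-error energy):
$$ \mathcal{L}_{\text{NAE}} = \underbrace{\mathbb{E}_{x \sim p_{\text{data}}}\left[ E_\theta(x) \right]}_{E_+} - \underbrace{\mathbb{E}_{x' \sim p_\theta}\left[ E_\theta(x') \right]}_{E_-}, $$
where the negative term is estimated with MCMC samples drawn from the model distribution. The divergence in the left panel corresponds to a runaway of these two terms.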
Figure 4:
Histograms of the input feature $ \tau_{3} $ for positive, negative, and signal samples, before (epoch 274) and after (epochs 275--279) the start of the divergence of the NAE loss.
Figure 5:
Positive and negative energies (upper panel) and AUC for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel) during training with the loss function in Eq. xxxxx.
Figure 6:
Schematic representation of the mode collapse when using the loss function described in Eq. xxxxx. Left: before the mode collapse, $ \mathcal{E} $ and $ \mathcal{B} $ overlap, while $ \mathcal{E} $ and $ \mathcal{S} $ do not. Right: after the mode collapse, $ \mathcal{E} $ expands and can partially include $ \mathcal{S} $, reducing the difference in AE reconstruction error and thus lowering the AUC score. $ E_+ $ and $ E_- $ respectively denote the positive and negative energies. In both cases, the difference between the positive and the negative energies is zero.
Figure 7:
Upper: the positive and negative energies (upper panel), and the Wasserstein distance between the positive and negative samples, together with the AUC scores for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel), during the training of an NAE with the loss function in Eq. xxxxx. Lower left: the AUC scores for an NAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested against a grid of possible SVJ signals, before the increase of the Wasserstein distance (at epoch 3000). Lower right: the AUC scores for the same NAE after the increase in the Wasserstein distance (at epoch 10000).
Figure 8:
Flowchart of the Wasserstein normalized autoencoder training. The positive examples are passed through the autoencoder, and the negative examples are generated via MCMC. The Wasserstein distance is calculated between the positive and negative examples, and the gradients are backpropagated through the entire MCMC chain.
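A minimal, illustrative sketch of one such training step is shown below. It is not the implementation used in the note: the Langevin MCMC kernel, the noise initialization of the chain, all hyperparameter values, and the function names are assumptions, and the exact Wasserstein distance is computed here with the POT library (a recent version with PyTorch tensor support is assumed).

```python
import torch
import ot  # POT: Python Optimal Transport


def energy(model, x):
    # Reconstruction error of the autoencoder, promoted to the EBM energy.
    return ((x - model(x)) ** 2).sum(dim=1)


def langevin_negatives(model, x_init, n_steps=10, step_size=0.01, noise=0.005):
    # Generate negative examples with Langevin MCMC on the energy surface.
    # create_graph=True keeps every step differentiable, so the loss
    # gradients are backpropagated through the entire MCMC chain.
    x = x_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy(model, x).sum(), x, create_graph=True)[0]
        x = x - step_size * grad + noise * torch.randn_like(x)
    return x


def wnae_loss(model, x_pos):
    # The negative chain starts from Gaussian noise in this sketch; a replay
    # buffer or an on-manifold initialization could be used instead.
    x_neg = langevin_negatives(model, torch.randn_like(x_pos))
    # Exact Wasserstein distance between the positive and negative batches
    # with uniform weights, differentiable with respect to the cost matrix.
    cost = torch.cdist(x_pos, x_neg) ** 2
    a = torch.full((x_pos.shape[0],), 1.0 / x_pos.shape[0])
    b = torch.full((x_neg.shape[0],), 1.0 / x_neg.shape[0])
    return ot.emd2(a, b, cost)
```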
Figure 9:
Left: the Wasserstein distance between pairs of the positive, negative, and signal samples during the WNAE training. Right: the AUC scores from the same WNAE for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $.
Figure 10:
The AUC scores for a WNAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested on a grid of possible SVJ signal models.
Figure 11:
The distributions of half of the input variables, $ \tau_{2} $, $ \tau_{3} $, EFP1, and $ C_2^{(0.5)} $, for the positive, negative, and signal samples, at the start (upper) and at the end (lower) of the WNAE training.
Figure 12:
The distributions of the other half of the input variables, axis major, axis minor, $ p_{\mathrm{T}}^{\mathrm{D}} $, and soft drop mass, for the positive, negative, and signal samples, at the start (upper) and at the end (lower) of the WNAE training.
Figure 13:
The Wasserstein distance between the positive and negative samples and the AUC score during the training of a WNAE on an SVJ signal ($ m_{\Phi} = $ 2000 GeV, $ r_{\text{inv}} = $ 0.3), with the $ \mathrm{t} \overline{\mathrm{t}} $ background used for testing.
Summary
Autoencoder-based anomaly detection relies on learning a reconstruction error such that phase space regions with low probability density have high reconstruction error and can be identified as anomalous. However, standard autoencoders are prone to reconstructing outliers well, because nothing constrains their reconstruction error outside the training phase space. The normalized autoencoder paradigm promotes the autoencoder reconstruction error to an energy function in the framework of energy-based models, thereby defining a normalized probabilistic model, which is trained by minimizing the negative log-likelihood of the training data under the energy-based model probability. In practice, this construction presents a number of failure modes, such as divergence of the loss function and phase space degeneracy, which cause phase space regions distinct from the training data to have low reconstruction error. The Wasserstein normalized autoencoder, an improvement over normalized autoencoders, is introduced to resolve these failure modes. It uses the Wasserstein distance to quantify the difference between the probability distribution of the training data and the Boltzmann distribution of the energy function of the model. Using simulated samples from the CMS experiment, the classification of out-of-distribution examples by the Wasserstein normalized autoencoder is shown to be on par with or better than that of the normalized autoencoder. Furthermore, the Wasserstein distance is found to be a robust metric with which to define a stopping condition for the training in a fully signal-agnostic fashion.
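As an illustration of such a signal-agnostic stopping condition, the Wasserstein distance between the positive and negative samples can be recorded at each epoch and the training halted once it stops improving. The helper below is a generic early-stopping sketch with placeholder thresholds, not the prescription used in the note.

```python
def should_stop(w_history, patience=50, min_delta=1e-3):
    # Stop when the positive-negative Wasserstein distance has not improved
    # by more than min_delta for `patience` consecutive epochs.
    if len(w_history) <= patience:
        return False
    best_before = min(w_history[:-patience])
    best_recent = min(w_history[-patience:])
    return best_recent > best_before - min_delta
```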