CMS-PAS-MLG-24-002
Wasserstein normalized autoencoder
CMS Collaboration
26 September 2024
Abstract: A novel approach to unsupervised jet tagging is presented for the CMS experiment at the CERN LHC. The Wasserstein normalized autoencoder (WNAE) is a normalized probabilistic model that minimizes the Wasserstein distance between the probability distribution of the training data and the Boltzmann distribution of the reconstruction error of the autoencoder. Trained on jets of particles from simulated standard model processes, the WNAE is shown to learn the probability distribution of the input data in a fully unsupervised fashion, allowing it to effectively identify new physics jets as anomalies. This algorithm has been developed and applied in the context of a recent search for semivisible jets. The model consistently demonstrates stable, convergent training and achieves strong classification performance across a wide range of signals, improving upon standard normalized autoencoders while remaining agnostic to the signal. The WNAE directly tackles the problem of outlier reconstruction, a common failure mode of autoencoders in anomaly detection tasks.
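In the energy-based-model language used here, the construction can be stated schematically as follows, where $ f_\theta $ denotes the autoencoder and its reconstruction error is promoted to an energy function (the notation below is illustrative and does not reproduce the note's own equations):
$$ E_\theta(x) = \lVert x - f_\theta(x) \rVert^2, \qquad p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \qquad Z_\theta = \int e^{-E_\theta(x)} \, \mathrm{d}x . $$
The WNAE training then minimizes the Wasserstein distance $ W(p_{\text{data}}, p_\theta) $ between the probability distribution of the training data and this Boltzmann distribution.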
Links: CDS record (PDF); CADI line (restricted)
Figures
Figure 1:
Schematic visualization of the outlier reconstruction failure mode. Signal events drawn from the hatched area are reconstructed well by the AE, despite not being part of the training set, and thus will not be separated from the background. The AE training is assumed to have converged such that the background is reconstructed well.
Figure 2:
Left: the reconstruction error (upper panel) and the AUC scores (lower panel) for the AE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background, evaluated at each training epoch on $ \mathrm{t} \overline{\mathrm{t}} $ background jets and on signal models with $ m_{\Phi} = $ 2000 GeV and $ r_{\text{inv}} = $ 0.3 (upper) or $ r_{\text{inv}} = $ 0.1, 0.3, 0.5, 0.7 (lower). Right: the AUC scores for the same AE, evaluated at the epoch with the minimal background reconstruction error, for the classification of several SVJ signal hypotheses against the $ \mathrm{t} \overline{\mathrm{t}} $ background. The AUC scores are close to 0.5, indicating that the AE is unable to discriminate between the SVJ signal and the $ \mathrm{t} \overline{\mathrm{t}} $ background.
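The AUC scores quoted in these figure captions use the jet reconstruction error as the anomaly score. A minimal sketch of this evaluation is given below, assuming the per-jet reconstruction errors are already available as NumPy arrays; the array names and the random stand-in values are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative stand-ins for the per-jet reconstruction errors of
# background (label 0) and signal (label 1) jets.
rng = np.random.default_rng(0)
err_bkg = rng.random(1000)
err_sig = rng.random(1000)

scores = np.concatenate([err_bkg, err_sig])
labels = np.concatenate([np.zeros_like(err_bkg), np.ones_like(err_sig)])

# An AUC close to 0.5 means the reconstruction error carries no
# signal/background separation, the failure mode shown in the right panel.
print(f"AUC = {roc_auc_score(labels, scores):.3f}")
```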
Figure 3:
Left: NAE training showing the divergence of the loss function, in terms of positive and negative energy (upper panel), and the AUC for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel). Right: from the same training, the positive and negative energies are shown before the divergence, illustrating their differences.
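For context, the positive and negative energies shown here are the two terms of the NAE loss. Schematically (a standard energy-based-model relation, restated here for orientation rather than copied from the note, with $ E_\theta $ the reconstruction-error energy):
$$ \mathcal{L}_{\text{NAE}} = \underbrace{\mathbb{E}_{x \sim p_{\text{data}}}\left[ E_\theta(x) \right]}_{E_+} - \underbrace{\mathbb{E}_{x' \sim p_\theta}\left[ E_\theta(x') \right]}_{E_-}, $$
where the negative term is estimated with MCMC samples drawn from the model distribution. The divergence in the left panel corresponds to a runaway of these two terms.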
Figure 4:
Histograms of the input feature $ \tau_{3} $ for positive, negative, and signal samples, before (epoch 274) and after (epochs 275--279) the start of the divergence of the NAE loss.
Figure 5:
Positive and negative energies (upper panel) and AUC for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel) during training with the loss function in Eq. xxxxx.
Figure 6:
Schematic representation of the mode collapse when using the loss function described in Eq. xxxxx. Left: before the mode collapse, $ \mathcal{E} $ and $ \mathcal{B} $ overlap, while $ \mathcal{E} $ and $ \mathcal{S} $ do not. Right: after the mode collapse, $ \mathcal{E} $ expands and can partially include $ \mathcal{S} $, reducing the difference in AE reconstruction error and thus lowering the AUC score. $ E_+ $ and $ E_- $ respectively denote the positive and negative energies. In both cases, the difference between the positive and the negative energies is zero.
Figure 7:
Upper: the positive and negative energies (upper panel), and the Wasserstein distance between the positive and negative samples, together with the AUC scores for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $ (lower panel), during the training of an NAE with the loss function in Eq. xxxxx. Lower left: the AUC scores for an NAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested against a grid of possible SVJ signals, before the increase of the Wasserstein distance (at epoch 3000). Lower right: the AUC scores for the same NAE after the increase in the Wasserstein distance (at epoch 10000).
Figure 8:
Flowchart of the Wasserstein normalized autoencoder training. The positive examples are passed through the autoencoder, and the negative examples are generated via MCMC. The Wasserstein distance is calculated between the positive and negative examples, and the gradients are backpropagated through the entire MCMC chain.
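A minimal, illustrative sketch of one such training step is shown below. It is not the implementation used in the note: the Langevin MCMC kernel, the noise initialization of the chain, all hyperparameter values, and the function names are assumptions, and the exact Wasserstein distance is computed here with the POT library (a recent version with PyTorch tensor support is assumed).

```python
import torch
import ot  # POT: Python Optimal Transport


def energy(model, x):
    # Reconstruction error of the autoencoder, promoted to the EBM energy.
    return ((x - model(x)) ** 2).sum(dim=1)


def langevin_negatives(model, x_init, n_steps=10, step_size=0.01, noise=0.005):
    # Generate negative examples with Langevin MCMC on the energy surface.
    # create_graph=True keeps every step differentiable, so the loss
    # gradients are backpropagated through the entire MCMC chain.
    x = x_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy(model, x).sum(), x, create_graph=True)[0]
        x = x - step_size * grad + noise * torch.randn_like(x)
    return x


def wnae_loss(model, x_pos):
    # The negative chain starts from Gaussian noise in this sketch; a replay
    # buffer or an on-manifold initialization could be used instead.
    x_neg = langevin_negatives(model, torch.randn_like(x_pos))
    # Exact Wasserstein distance between the positive and negative batches
    # with uniform weights, differentiable with respect to the cost matrix.
    cost = torch.cdist(x_pos, x_neg) ** 2
    a = torch.full((x_pos.shape[0],), 1.0 / x_pos.shape[0])
    b = torch.full((x_neg.shape[0],), 1.0 / x_neg.shape[0])
    return ot.emd2(a, b, cost)
```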
Figure 9:
Left: the Wasserstein distance between pairs of the positive, negative, and signal samples during the WNAE training. Right: the AUC scores from the same WNAE for several signal hypotheses with fixed mediator mass, $ m_{\Phi} = $ 2000 GeV, but varying invisible fraction $ r_{\text{inv}} $.
Figure 10:
The AUC scores for a WNAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested on a grid of possible SVJ signal models.
Figure 11:
The distributions of half of the input variables, $ \tau_{2} $, $ \tau_{3} $, EFP1, and $ C_2^{(0.5)} $, for the positive, negative, and signal samples, at the start (upper) and at the end (lower) of the WNAE training.
Figure 12:
The distributions of the other half of the input variables, axis major, axis minor, $ p_{\mathrm{T}}^{\mathrm{D}} $, and soft drop mass, for the positive, negative, and signal samples, at the start (upper) and at the end (lower) of the WNAE training.
Figure 13:
The Wasserstein distance between the positive and negative samples and the AUC score during the training of a WNAE on an SVJ signal ($ m_{\Phi} = $ 2000 GeV, $ r_{\text{inv}} = $ 0.3), with the $ \mathrm{t} \overline{\mathrm{t}} $ background used for testing.
Summary
Autoencoder-based anomaly detection relies on learning a reconstruction error such that phase space regions with low probability density have high reconstruction error and can be identified as anomalous. However, standard autoencoders are prone to reconstructing outliers well, because nothing constrains their reconstruction error outside the training phase space. The normalized autoencoder paradigm promotes the autoencoder reconstruction error to an energy function in the framework of energy-based models, thereby defining a normalized probabilistic model, which is trained by minimizing the negative log-likelihood of the training data under the energy-based model probability. In practice, this construction presents a number of failure modes, such as divergence of the loss function and phase space degeneracy, which cause phase space regions distinct from the training data to have low reconstruction error. The Wasserstein normalized autoencoder, an improvement over normalized autoencoders, is introduced to resolve these failure modes. It uses the Wasserstein distance to quantify the difference between the probability distribution of the training data and the Boltzmann distribution of the energy function of the model. Using simulated samples from the CMS experiment, the classification of out-of-distribution examples by the Wasserstein normalized autoencoder is shown to be on par with or better than that of the normalized autoencoder. Furthermore, the Wasserstein distance is found to be a robust metric with which to define a stopping condition for the training in a fully signal-agnostic fashion.
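As an illustration of such a signal-agnostic stopping condition, the Wasserstein distance between the positive and negative samples can be recorded at each epoch and the training halted once it stops improving. The helper below is a generic early-stopping sketch with placeholder thresholds, not the prescription used in the note.

```python
def should_stop(w_history, patience=50, min_delta=1e-3):
    # Stop when the positive-negative Wasserstein distance has not improved
    # by more than min_delta for `patience` consecutive epochs.
    if len(w_history) <= patience:
        return False
    best_before = min(w_history[:-patience])
    best_recent = min(w_history[-patience:])
    return best_recent > best_before - min_delta
```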