Compact Muon Solenoid
LHC, CERN

CMS-MLG-24-002; CERN-EP-2025-209
Wasserstein normalized autoencoder for anomaly detection
Submitted to Machine Learning: Science and Technology
Abstract: A novel anomaly detection algorithm is presented. The Wasserstein normalized autoencoder (WNAE) is a normalized probabilistic model that minimizes the Wasserstein distance between the learned probability distribution---a Boltzmann distribution where the energy is the reconstruction error of the autoencoder---and the distribution of the training data. This algorithm has been developed and applied to the identification of semivisible jets---conical sprays of visible standard model particles and invisible dark matter states---with the CMS experiment at the CERN LHC. Trained on jets of particles from simulated standard model processes, the WNAE is shown to learn the probability distribution of the input data in a fully unsupervised fashion, such that it effectively identifies new physics jets as anomalies. The model consistently demonstrates stable, convergent training and achieves strong classification performance across a wide range of signals, improving upon standard normalized autoencoders, while remaining agnostic to the signal. The WNAE directly tackles the problem of outlier reconstruction, a common failure mode of autoencoders in anomaly detection tasks.
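Schematically, in notation of our own choosing (the paper's exact conventions may differ), the construction described in the abstract reads: the autoencoder $ f_{\theta} $ defines an energy equal to its reconstruction error, $ E_{\theta}(x) = \lVert x - f_{\theta}(x) \rVert^{2} $; this energy induces a normalized Boltzmann model, $ p_{\theta}(x) = e^{-E_{\theta}(x)} / \int e^{-E_{\theta}(x')} \, \mathrm{d}x' $; and the WNAE is trained by minimizing the Wasserstein distance to the data distribution, $ \theta^{*} = \arg\min_{\theta} W(p_{\text{data}}, p_{\theta}) $.
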
Figures

Figure 1:
Schematic visualization of the outlier reconstruction failure mode. Signal samples drawn from the hatched area are reconstructed well by the AE, despite not being part of the training set, and thus are not separated from the background. The AE training is assumed to have converged such that the background is reconstructed well.

Figure 2:
An illustration of collider SVJ production. The dashed black arrows indicate stable, undetectable DM candidate particles. Figure adapted from Ref. [51].

Figure 3:
Left: the reconstruction error (upper panel) and the AUC scores (lower panel) for the AE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background, evaluated during each training epoch on $ \mathrm{t} \overline{\mathrm{t}} $ background jets and signal models with $ m_{\Phi} = $ 2000 GeV and $ r_{\text{inv}} = $ 0.3 (upper) or $ r_{\text{inv}} = 0.1, 0.3, 0.5, $ 0.7 (lower). Right: The AUC scores for the same AE, evaluated for the epoch with the minimal background reconstruction error, for the classification of several SVJ signal hypotheses against the $ \mathrm{t} \overline{\mathrm{t}} $ background. The AUC scores are close to 0.5, indicating that the AE is unable to discriminate between the SVJ signal and the $ \mathrm{t} \overline{\mathrm{t}} $ background.

Figure 4:
Left: NAE training showing the divergence of the loss function, in terms of positive and negative energy (upper panel), and the AUC for several signal hypotheses with fixed $ m_{\Phi} = $ 2000 GeV and varying $ r_{\text{inv}} $ values (lower panel). Right: the positive and negative energies from the upper panel of the left plot, shown for $ \text{epoch} < $ 250---before the divergence---and 0.18 $ < \text{energy} < $ 1.4 on a linear scale, to illustrate their differences.

Figure 5:
Distributions of the input feature $ \tau_{3} $ for positive, negative, and signal samples, before (epoch 274) and after (epochs 275--279) the start of the divergence of the NAE loss. The signal distributions are overlaid for illustration; signal samples are not used during the training. All distributions are normalized such that their integral is 100.

Figure 6:
As a function of epoch during the training of an NAE with the loss function from Eq. \eqref{eq:nae_loss_logcosh}: the positive and negative energies and the value of the loss function (upper panel); the Wasserstein distance between negative and positive samples and the AUC for several signal hypotheses with fixed $ m_{\Phi} = $ 2000 GeV and varying $ r_{\text{inv}} $ (lower panel).

Figure 7:
Schematic representation of the mode collapse when using the loss function described in Eq. \eqref{eq:nae_loss_logcosh}. The energy landscape is shown before (left) and after (right) the mode collapse. On the right, the reconstruction errors for the signal and background supports are completely overlapping on the vertical axis. The symbols $ E_+ $ and $ E_- $ denote the positive and negative energies, respectively.

Figure 8:
Left: the AUC scores for an NAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested against a grid of possible SVJ signals, before the increase of the Wasserstein distance (at epoch 3000). Right: the AUC scores for the same NAE after the increase in Wasserstein distance (at epoch 10000).

Figure 9:
Flowchart of the Wasserstein normalized autoencoder training. The negative examples are generated via MCMC using the energy function of the model, which is the Boltzmann-distributed reconstruction error. The energy function is computed from random input feature values $ X_n $ and the corresponding reconstructed feature values $ \widetilde{X}_n $, obtained by passing the inputs through the autoencoder. The positive examples are compared to the negative examples through the Wasserstein distance. The gradients are backpropagated through the entire MCMC chain.
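As a rough illustration of this flowchart, the following minimal PyTorch sketch performs one WNAE-style training step. It is not the CMS implementation: the network sizes, the Langevin MCMC settings, and the per-feature 1D Wasserstein distance (a crude stand-in for the full multivariate optimal-transport distance used in the paper) are assumptions made purely for illustration.

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features=8, n_latent=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 32), nn.ReLU(), nn.Linear(32, n_features))
    def energy(self, x):
        # Reconstruction error of the autoencoder, used as the energy of the Boltzmann model
        return ((x - self.decoder(self.encoder(x))) ** 2).mean(dim=1)

def langevin_negatives(model, x_init, n_steps=10, step_size=1e-2, noise=1e-2):
    # Draw negative samples from the model distribution with a short Langevin MCMC chain;
    # create_graph=True keeps the chain differentiable so the loss can be backpropagated through it
    x = x_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(model.energy(x).sum(), x, create_graph=True)[0]
        x = x - step_size * grad + noise * torch.randn_like(x)
    return x

def wasserstein_1d(a, b):
    # Per-feature 1D Wasserstein-1 distance between equal-size batches, averaged over features
    return (torch.sort(a, dim=0).values - torch.sort(b, dim=0).values).abs().mean()

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x_pos = torch.randn(256, 8)                            # placeholder for background jet features
x_neg = langevin_negatives(model, torch.rand(256, 8))  # negative samples generated from the model
loss = wasserstein_1d(x_pos, x_neg)                    # WNAE-style objective: match the two distributions
optimizer.zero_grad()
loss.backward()                                        # gradients flow through the entire MCMC chain
optimizer.step()

In the analysis itself, the Wasserstein distance is presumably evaluated with an optimal-transport solver (the POT library is cited as Ref. [30]), and the MCMC hyperparameters are those listed in Table C1.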

Figure 10:
Left: the Wasserstein distance between pairs of the positive, negative, and signal samples during the WNAE training. Right: the Wasserstein distance between negative and positive samples and the AUC scores from the same WNAE for several signal hypotheses with fixed $ m_{\Phi} = $ 2000 GeV and varying $ r_{\text{inv}} $.

Figure 11:
The AUC scores for a WNAE trained on the $ \mathrm{t} \overline{\mathrm{t}} $ background and tested on a grid of possible SVJ signal models.

Figure 12:
The distributions of half of the input variables, $ \tau_{2} $, $ \tau_{3} $, EFP1, and $ C_2^{(0.5)} $, for the positive, negative, and signal samples, at the start (upper) and at the end (lower) of the WNAE training. The signal distributions are overlaid for illustration; signal samples are not used during the training. All distributions are normalized such that their integral is 100.

Figure 13:
The Wasserstein distance between the positive and negative samples and the AUC score during the training of a WNAE on an SVJ signal ($ m_{\Phi} = $ 2000 GeV, $ r_{\text{inv}} = $ 0.3), with the $ \mathrm{t} \overline{\mathrm{t}} $ background used for testing.

Figure B1:
The learning rate during the training of the WNAE from Section 4.3.
Tables

Table B1:
The hyperparameters of the learning rate scheduler.

Table C1:
The hyperparameters of the MCMC.
Summary
Anomaly detection using autoencoders (AEs) relies on learning a reconstruction function that gives high reconstruction error to phase space regions with low probability density, such that they can be identified as anomalous. However, standard AEs are prone to learn to reconstruct outliers because they are free to minimize the reconstruction error outside the training phase space. In addition, they may exhibit complexity bias, learning to identify examples as anomalous only if their feature distribution is more complex than the training data.

The normalized autoencoder (NAE) paradigm promotes the AE reconstruction error to an energy function in the framework of energy-based models, in order to define a normalized probabilistic model. This is achieved by minimizing the negative log-likelihood of the training data from the energy-based model probability. In practice, this method presents a number of failure modes, such as divergence of the loss function and phase space degeneracy, leading to low reconstruction error for phase space regions distinct from the training data.

The Wasserstein normalized autoencoder (WNAE), an improvement over the NAE, is introduced to solve these failure modes. This is achieved by directly minimizing the Wasserstein distance between the probability distribution of the training data and the Boltzmann distribution of the energy function of the model. This Wasserstein distance is found to be highly correlated with signal identification performance while still being fully signal-agnostic, preserving the unsupervised nature of the approach.

The performance is studied in the context of a search for new physics with the CMS experiment, using top-antitop quark production as the standard model background and nonresonant semivisible jet production from a strongly coupled dark sector as the proposed signal model. The classification of the signal events as outliers by the WNAE is shown to be on par with or better than that of the NAE. Further, the WNAE approach is found to mitigate complexity bias, as it can effectively identify top quark jets as anomalous when trained on semivisible jet signal events.

Though simulated samples were used to develop the WNAE, in practice it may be preferable to use observed data directly for training, in order to limit biases arising from differences between simulation and observation. In this case, the training data may contain anomalies, which would reduce the anomaly detection performance of the WNAE. The WNAE can be straightforwardly trained using observed data from a control region with no anomalous examples, if such a region can be defined and follows the same probability distribution as the observed data where the WNAE will be applied. When no assumption at all can be made about the nature of the anomalies, as in the case of triggering at a high-energy physics experiment, alternative solutions may exist. The WNAE associates low probability density regions with high reconstruction error; because anomalies necessarily have low probability density, they still tend to have relatively high reconstruction error even when included in the training data set. Therefore, the training data set can be iteratively refined by selecting a given fraction of examples with the lowest reconstruction error, in order to reduce the proportion of anomalous data. This would result in a self-supervised training for the WNAE. We leave the development of such a procedure for future work.
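To make the idea in the last paragraph concrete, the sketch below shows one hypothetical refinement loop; the actual procedure is explicitly left for future work, and both the energy method (from the earlier sketch) and the train_wnae routine passed in as an argument are assumed placeholders.

import torch

def refine_training_set(model, data, train_wnae, keep_fraction=0.9, n_iterations=5):
    # Iteratively drop the examples with the highest reconstruction error (the most
    # anomaly-like ones) and retrain, so that contamination from anomalies shrinks.
    for _ in range(n_iterations):
        with torch.no_grad():
            energies = model.energy(data)        # reconstruction error per example
        n_keep = int(keep_fraction * len(data))
        keep = torch.argsort(energies)[:n_keep]  # keep the best-reconstructed fraction
        data = data[keep]
        train_wnae(model, data)                  # user-supplied WNAE training routine
    return data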
References
1 T. Heimel, G. Kasieczka, T. Plehn, and J. M. Thompson QCD or what? SciPost Phys. 6 (2019) 030 1808.08979
2 M. Farina, Y. Nakai, and D. Shih Searching for new physics with deep autoencoders PRD 101 (2020) 075021 1808.08992
3 T. Finke et al. Autoencoders for unsupervised anomaly detection in high energy physics JHEP 06 (2021) 161 2104.09051
4 S. Yoon, Y.-K. Noh, and F. Park Autoencoding under normalization constraints in Proceedings of the 38th International Conference on Machine Learning, p. 12087. 2021 2105.05735
5 T. Cohen, M. Lisanti, and H. K. Lou Semivisible jets: Dark matter undercover at the LHC PRL 115 (2015) 171804 1503.00009
6 CMS Collaboration The CMS experiment at the CERN LHC JINST 3 (2008) S08004
7 CMS Collaboration Development of the CMS detector for the CERN LHC Run 3 JINST 19 (2024) P05064 CMS-PRF-21-001 2309.05466
8 CMS Collaboration Performance of the CMS Level-1 trigger in proton-proton collisions at $ \sqrt{s} = $ 13 TeV JINST 15 (2020) P10017 CMS-TRG-17-001 2006.10165
9 CMS Collaboration The CMS trigger system JINST 12 (2017) P01020 CMS-TRG-12-001 1609.02366
10 CMS Collaboration Performance of the CMS high-level trigger during LHC Run 2 JINST 19 (2024) P11021 CMS-TRG-19-001 2410.17038
11 CMS Collaboration Electron and photon reconstruction and identification with the CMS experiment at the CERN LHC JINST 16 (2021) P05014 CMS-EGM-17-001 2012.06888
12 CMS Collaboration Performance of the CMS muon detector and muon reconstruction with proton-proton collisions at $ \sqrt{s}= $ 13 TeV JINST 13 (2018) P06015 CMS-MUO-16-001 1804.04528
13 CMS Collaboration Description and performance of track and primary-vertex reconstruction with the CMS tracker JINST 9 (2014) P10009 CMS-TRK-11-001 1405.6569
14 CMS Collaboration Particle-flow reconstruction and global event description with the CMS detector JINST 12 (2017) P10003 CMS-PRF-14-001 1706.04965
15 CMS Collaboration Jet energy scale and resolution in the CMS experiment in pp collisions at 8 TeV JINST 12 (2017) P02014 CMS-JME-13-004 1607.03663
16 CMS Collaboration Performance of missing transverse momentum reconstruction in proton-proton collisions at $ \sqrt{s} = $ 13 TeV using the CMS detector JINST 14 (2019) P07004 CMS-JME-17-001 1903.06078
17 L. V. Kantorovich Mathematical methods of organizing and planning production Management Science 6 (1939) 366
18 L. N. Vaserstein Markov processes over denumerable products of spaces describing large systems of automata Problems of Information Transmission 5 (1969) 47
19 M. A. Kramer Autoassociative neural networks Comput. Chem. Eng. 16 (1992) 313
20 P. Smolensky Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations ch. Information processing in dynamical systems: Foundations of harmony theory. The MIT Press, 1986
link
21 G. Hinton Training products of experts by minimizing contrastive divergence Neural Comput. 14 (2002) 1771
22 Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton Energy-based models for sparse overcomplete representations J. Mach. Learn. Res. 4 (2003) 1235
23 E. T. Jaynes Information theory and statistical mechanics PR 106 (1957) 620
24 D. P. Kingma and M. Welling Auto-encoding variational Bayes in 2nd International Conference on Learning Representations. 2014 1312.6114
25 A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow Adversarial autoencoders in International Conference on Learning Representations. 2016 1511.05644
26 I. J. Goodfellow et al. Generative adversarial nets in Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc. 2014 1406.2661
27 I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf Wasserstein auto-encoders in International Conference on Learning Representations. 2018 1711.01558
28 L. V. Kantorovich and S. Rubinshtein On a space of totally additive functions Vestnik of the St. Petersburg University: Mathematics 13 (1958) 52
29 M. Arjovsky, S. Chintala, and L. Bottou Wasserstein GAN in Proceedings of the 34th International Conference on Machine Learning, volume 70, p. 214. 2017 1701.07875
30 R. Flamary et al. POT: Python optimal transport J. Mach. Learn. Res. 22 (2021) 1
31 K. Fatras et al. Learning with minibatch Wasserstein: asymptotic and gradient properties in Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108, 2019 1910.04091
32 E. Nalisnick et al. Do deep generative models know what they don't know? in International Conference on Learning Representations. 2019 1810.09136
33 Z. Xiao, Q. Yan, and Y. Amit Likelihood regret: An out-of-distribution detection score for variational auto-encoder in Advances in Neural Information Processing Systems, volume 33, p. 20685. 2020 2003.02977
34 S. Pidhorskyi, R. Almohsen, and G. Doretto Generative probabilistic novelty detection with adversarial autoencoders in Proceedings of the 32nd International Conference on Neural Information Processing Systems, p. 6823. 2018 1807.02588
35 S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon GANomaly: Semi-supervised anomaly detection via adversarial training in Asian Conference on Computer Vision, p. 622, Springer. 2018 1805.06725
36 T. Schlegl et al. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks Medical Image Analysis 54 (2019) 30
37 Y. Du and I. Mordatch Implicit generation and modeling with energy-based models in Proceedings of the 33rd International Conference on Neural Information Processing Systems, p. 324. 2019 1903.08689
38 A. Gandrakota Realtime anomaly detection at the L1 trigger of CMS experiment PoS ICHEP, 2025 2411.19506
39 CMS Collaboration Model-agnostic search for dijet resonances with anomalous jet substructure in proton-proton collisions at $ \sqrt{s} = $ 13 TeV Rept. Prog. Phys. 88 (2025) 067802 CMS-EXO-22-026 2412.03747
40 ATLAS Collaboration Search for new phenomena in two-body invariant mass distributions using unsupervised machine learning for anomaly detection at $ \sqrt{s}= $ 13 TeV with the ATLAS detector PRL 132 (2024) 081801 2307.01612
41 B. M. Dillon et al. A normalized autoencoder for LHC triggers SciPost Phys. Core 6 (2023) 074 2206.14225
42 V. C. Rubin, N. Thonnard, and W. K. Ford, Jr. Rotational properties of 21 SC galaxies with a large range of luminosities and radii, from NGC 4605 (R = 4 kpc) to UGC 2885 (R = 122 kpc) Astrophys. J. 238 (1980) 471
43 M. Persic, P. Salucci, and F. Stel The universal rotation curve of spiral galaxies: I. The dark matter connection Mon. Not. Roy. Astron. Soc. 281 (1996) 27 astro-ph/9506004
44 D. Clowe et al. A direct empirical proof of the existence of dark matter Astrophys. J. 648 (2006) L109 astro-ph/0608407
45 DES Collaboration Dark Energy Survey year 1 results: curved-sky weak lensing mass map Mon. Not. Roy. Astron. Soc. 475 (2018) 3165 1708.01535
46 Planck Collaboration Planck 2018 results. VI. Cosmological parameters Astron. Astrophys. 641 (2020) A6 1807.06209
47 M. J. Strassler and K. M. Zurek Echoes of a hidden valley at hadron colliders PLB 651 (2007) 374 hep-ph/0604261
48 CMS Collaboration Search for resonant production of strongly coupled dark matter in proton-proton collisions at 13 TeV JHEP 06 (2022) 156 CMS-EXO-19-020 2112.11125
49 ATLAS Collaboration Search for new physics in final states with semivisible jets or anomalous signatures using the ATLAS detector PRD 112 (2025) 012021 2505.01634
50 T. Cohen, M. Lisanti, H. K. Lou, and S. Mishra-Sharma LHC searches for dark sector showers JHEP 11 (2017) 196 1707.05326
51 E. Bernreuther, F. Kahlhoefer, M. Krämer, and P. Tunney Strongly interacting dark sectors in the early universe and at the LHC through a simplified portal JHEP 01 (2020) 162 1907.04346
52 J. Alwall et al. The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations JHEP 07 (2014) 079 1405.0301
53 T. Sjostrand et al. An introduction to PYTHIA 8.2 Comput. Phys. Commun. 191 (2015) 159 1410.3012
54 GEANT4 Collaboration GEANT4---a simulation toolkit NIM A 506 (2003) 250
55 NNPDF Collaboration Parton distributions from high-precision collider data EPJC 77 (2017) 663 1706.00428
56 M. Cacciari, G. P. Salam, and G. Soyez The anti-$ k_{\mathrm{T}} $ jet clustering algorithm JHEP 04 (2008) 063 0802.1189
57 M. Cacciari, G. P. Salam, and G. Soyez FastJet user manual EPJC 72 (2012) 1896 1111.6097
58 CMS Collaboration Performance of quark/gluon discrimination in 8 TeV pp data CMS Physics Analysis Summary CMS-PAS-JME-13-002, 2013
59 P. T. Komiske, E. M. Metodiev, and J. Thaler Energy flow polynomials: A complete linear basis for jet substructure JHEP 04 (2018) 013 1712.07124
60 A. J. Larkoski, G. P. Salam, and J. Thaler Energy correlation functions for jet substructure JHEP 06 (2013) 108 1305.0007
61 A. J. Larkoski, S. Marzani, G. Soyez, and J. Thaler Soft drop JHEP 05 (2014) 146 1402.2657
62 J. Thaler and K. Van Tilburg Identifying boosted objects with N-subjettiness JHEP 03 (2011) 015 1011.2268
63 F. Pedregosa et al. Scikit-learn: Machine learning in Python J. Mach. Learn. Res. 12 (2011) 2825 1201.0490
64 F. Canelli et al. Autoencoders for semivisible jet detection JHEP 02 (2022) 074 2112.02864
65 CMS Collaboration Source code repository gitlab
66 A. Paszke et al. PyTorch: An imperative style, high-performance deep learning library in Proceedings of the 33rd International Conference on Neural Information Processing Systems, volume 32, p. 721. 2019 1912.01703
67 T. Tieleman Training restricted Boltzmann machines using approximations to the likelihood gradient in Proceedings of the 25th International Conference on Machine Learning, p. 1064. 2008