CMS-PAS-MLG-23-003
Extension and first application of the ABCDisCo method with LHC data: ABCDisCoTEC
CMS Collaboration
27 March 2025
Abstract: The ``ABCD method'' provides a reliable framework to estimate backgrounds using observed data by partitioning events into one signal-enhanced region (A) and three background-enhanced control regions (B, C, and D) via two statistically independent variables. In practice, even slight correlations between the two variables can significantly undermine the method's performance. Thus, choosing appropriate variables by hand can present a formidable challenge, especially when background and signal differ only subtly. To address this issue, the ABCDisCo method (ABCD with distance correlation) was developed to construct two artificial variables from the output scores of a neural network trained to maximize signal-background discrimination while minimizing correlations using the distance correlation measure. However, relying solely on minimizing the distance correlation can introduce undesirable characteristics in the resulting distributions, which may compromise the validity of the background prediction obtained using this method. The ABCDisCo training enhanced with closure (ABCDisCoTEC) method is introduced to provide a novel solution to this issue by directly minimizing the nonclosure, expressed as a dedicated differentiable loss term. This extended method is applied to a data set of proton-proton collisions at a center-of-mass energy of 13 TeV recorded by the CMS detector at the CERN LHC. Additionally, given the complexity of the minimization problem with constraints on multiple loss terms, the modified differential method of multipliers is applied and shown to greatly improve the stability and robustness of the ABCDisCoTEC method, compared to grid search hyperparameter optimization procedures.
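For reference, the relations underlying the method take the standard ABCD form; below is a minimal statement, assuming the conventional definition of the nonclosure as the relative deviation of the predicted yield from the observed one:

N_A^{\text{pred}} = \frac{N_B \, N_C}{N_D},
\qquad
\text{nonclosure} = \left| \frac{N_A^{\text{pred}}}{N_A} - 1 \right| = \left| \frac{N_B \, N_C}{N_D \, N_A} - 1 \right|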
Links: CDS record (PDF); CADI line (restricted)
Figures
Figure 1:
Schematic illustration of idealized signal (red) and background (grey) distributions in the ABCD plane, represented as Gaussian kernel density estimators (KDEs).
Figure 2:
Diagrammatic layout of the ABCDisCoTEC NN model. Features x_i (blue) are initially fed into a hidden layer containing M nodes. The output of this hidden layer is fed into two binary classifiers, each with one hidden layer f and an output layer S^{\mathrm{NN}}_{1} or S^{\mathrm{NN}}_{2}. Both outputs are used to compute the loss function L_{\text{total}}. Colored arrows indicate the direction of propagation from inputs to final network output. Black arrows represent the backpropagation carried out during training.
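As a rough PyTorch sketch of the topology in Fig. 2 (layer widths, activations, and head depth are illustrative assumptions; the note specifies only the overall structure):

```python
# Minimal sketch of the two-headed ABCDisCoTEC architecture of Fig. 2.
# Layer sizes and activations are illustrative, not the note's values.
import torch
import torch.nn as nn

class ABCDisCoTECNet(nn.Module):
    def __init__(self, n_features: int, M: int = 64):
        super().__init__()
        # Shared hidden layer receiving the input features x_i
        self.shared = nn.Sequential(nn.Linear(n_features, M), nn.ReLU())
        # Two binary classifiers, each with one hidden layer f and a
        # sigmoid output producing S_NN1 or S_NN2 in [0, 1]
        def head():
            return nn.Sequential(
                nn.Linear(M, M), nn.ReLU(),
                nn.Linear(M, 1), nn.Sigmoid(),
            )
        self.head1, self.head2 = head(), head()

    def forward(self, x):
        h = self.shared(x)
        return self.head1(h).squeeze(-1), self.head2(h).squeeze(-1)
```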
Figure 3:
An example of the sigmoid function in Eq. (7) with choices for the scale parameter a of 10 (upper left), 100 (upper right), and the limit a\rightarrow\infty (lower). The boundaries are b_{1,2} = 0.5. The sigmoid distribution shown in the upper right is used for the ABCD region event counts in the following studies.
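To make the role of a concrete, here is a hedged sketch of how sigmoid-relaxed boundaries yield differentiable ABCD region counts and a differentiable nonclosure term; the exact forms of Eq. (7) and of the note's nonclosure loss are assumed, not reproduced.

```python
# Hedged sketch of the sigmoid relaxation of hard ABCD boundary cuts.
import torch

def soft_region_counts(s1, s2, b1=0.5, b2=0.5, a=100.0):
    """Differentiable ABCD region yields from per-event scores s1, s2.

    A hard cut s > b is replaced by the weight sigmoid(a * (s - b));
    a -> infinity recovers the step function of Fig. 3 (lower).
    """
    w1 = torch.sigmoid(a * (s1 - b1))  # weight for passing boundary b1
    w2 = torch.sigmoid(a * (s2 - b2))  # weight for passing boundary b2
    N_A = (w1 * w2).sum()              # pass both cuts: signal region
    N_B = (w1 * (1 - w2)).sum()        # pass b1 only
    N_C = ((1 - w1) * w2).sum()        # pass b2 only
    N_D = ((1 - w1) * (1 - w2)).sum()  # fail both cuts
    return N_A, N_B, N_C, N_D

def nonclosure_loss(s1, s2, **kw):
    # Relative deviation of the ABCD prediction, kept differentiable
    N_A, N_B, N_C, N_D = soft_region_counts(s1, s2, **kw)
    return torch.abs(N_B * N_C / (N_D * N_A) - 1.0)
```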
Figure 4:
Schematic illustration of the training path of the NN in the space of the learned parameters for the \lambda method (blue) and MDMM (orange). The case of an equality constraint on a generic loss component L_2 (L_2 = \epsilon_2) is shown. Both convex (left) and nonconvex (right) Pareto fronts, represented by dashed black lines, are shown. The dashed red line represents the subspace on which the condition L_2 = \epsilon_2 is met.
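As a rough illustration of the difference, a minimal PyTorch sketch of one MDMM training step under the equality constraint L_2 = \epsilon_2 follows the general recipe of Platt and Barr [8]. All names (model, loss functions, optimizers) are hypothetical placeholders, and the quadratic damping term with coefficient c is one common stabilization choice, not necessarily the note's. The multiplier lam is created with requires_grad=True and registered with its own optimizer; flipping the sign of its gradient turns the descent step into the required ascent step.

```python
# Hedged sketch of one MDMM update, following Platt and Barr [8].
import torch

def mdmm_step(model, opt_theta, lam, opt_lam, x, y,
              loss_bce, loss_l2, eps2, c=1.0):
    """Gradient descent on the network weights, ascent on lam."""
    s1, s2 = model(x)                  # the two discriminant scores
    g = loss_l2(s1, s2) - eps2         # constraint violation L2 - eps_2
    total = loss_bce(s1, s2, y) + lam * g + 0.5 * c * g ** 2
    opt_theta.zero_grad()
    opt_lam.zero_grad()
    total.backward()
    lam.grad.neg_()                    # sign flip: gradient ascent on lam
    opt_theta.step()
    opt_lam.step()
    return total.detach()
```

With c = 0 this reduces to the basic differential method of multipliers; the damping term penalizes constraint violations and helps convergence near nonconvex fronts.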
Figure 5:
The values of the individual loss components scaled by the corresponding \lambda value, computed on the training set (left), and the total loss computed on both the training and test sets (right), as a function of epoch for the stealth SUSY case. The network training is stopped after 100 epochs, after which neither the training nor the test sample loss decreases. The differences in magnitude between the loss components result, in part, from the magnitudes of their respective hyperparameters; the total loss decreases smoothly.
Figure 6:
The distributions of background (left) and signal (right) events in the two-dimensional discriminant plane for the final training of the ABCDisCoTEC model with the stealth SUSY training set. White bins contain no events.
Figure 7:
The ROC curves for S^{\mathrm{NN}}_{1} (left) and S^{\mathrm{NN}}_{2} (right) from the ABCDisCoTEC model using the stealth SUSY training set, for different values of the top squark mass in units of GeV. The performance is measured using an eight-fold cross-validation; the shaded regions represent one standard deviation around the mean true positive rate at each false positive rate across the folds. The black dashed line represents the ROC curve for random classification. The values in brackets are the areas under the ROC curves.
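One plausible way to produce such bands, sketched under the assumption that the folds are combined by interpolating each fold's true positive rate onto a common false positive rate grid (scikit-learn assumed):

```python
# Hedged sketch of cross-validated ROC bands as in Fig. 7.
import numpy as np
from sklearn.metrics import roc_curve

def roc_band(y_folds, score_folds, fpr_grid=np.linspace(0, 1, 101)):
    """Mean and std of the TPR over folds, on a common FPR grid."""
    tprs = []
    for y, s in zip(y_folds, score_folds):      # one entry per fold
        fpr, tpr, _ = roc_curve(y, s)
        tprs.append(np.interp(fpr_grid, fpr, tpr))
    tprs = np.asarray(tprs)
    return fpr_grid, tprs.mean(axis=0), tprs.std(axis=0)
```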
Figure 8:
Comparison of ROC curves from a single binary classifier DNN, from each discriminant of the ABCDisCoTEC model, and from the two ABCDisCoTEC discriminants combined, for all top squark masses. The combined entry is computed via the distance of the two discriminant values from (0,0). The black dashed line represents the ROC curve for random classification.
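A minimal sketch of that combination on toy scores (scikit-learn assumed for the AUC; the data here are synthetic, purely for illustration):

```python
# Sketch of the combined discriminant used for the last ROC entry in
# Fig. 8: the Euclidean distance of (S_NN1, S_NN2) from the origin.
import numpy as np
from sklearn.metrics import roc_auc_score

def combined_score(s1: np.ndarray, s2: np.ndarray) -> np.ndarray:
    return np.hypot(s1, s2)  # sqrt(s1^2 + s2^2)

# Toy example: labels y in {0, 1}, higher score = more signal-like
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
s1 = np.clip(rng.normal(0.3 + 0.3 * y, 0.2), 0, 1)
s2 = np.clip(rng.normal(0.3 + 0.3 * y, 0.2), 0, 1)
print("AUC(combined) =", roc_auc_score(y, combined_score(s1, s2)))
```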
Figure 9:
Comparison of the nonclosure and normalized significance from the three loss function variations (DisCo only, nonclosure only, and DisCo and nonclosure) for the stealth SUSY training set using a scan of boundaries over multiple trainings. The significance values from Eq. (13) are normalized by the maximum significance value of the three training configurations.
Figure 10:
Examples of the ``four corners'' failure mode: background (left) and signal (right) distributions in which the distance correlation is approximately zero. A relatively large \lambda_{\text{DisCo}} value may result in convergence to this failure mode. White bins contain no events.
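For context, the quantity being minimized is the sample distance correlation of Székely et al. [4]; a generic differentiable implementation (not the note's code) is sketched below. Configurations like the four-corners one can drive this quantity toward zero while remaining useless for an ABCD estimate, which motivates the additional nonclosure loss.

```python
# Hedged sketch of the (biased) sample distance correlation of two
# 1D score tensors, kept differentiable for use as a loss term.
import torch

def distance_correlation(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    def doubly_centered(z):
        d = torch.abs(z.unsqueeze(0) - z.unsqueeze(1))  # pairwise |z_i - z_j|
        return (d - d.mean(dim=0, keepdim=True)
                  - d.mean(dim=1, keepdim=True) + d.mean())
    A, B = doubly_centered(x), doubly_centered(y)
    dcov2_xy = (A * B).mean()   # squared distance covariance
    dcov2_xx = (A * A).mean()
    dcov2_yy = (B * B).mean()
    return torch.sqrt(dcov2_xy.clamp(min=0.0)
                      / torch.sqrt(dcov2_xx * dcov2_yy).clamp(min=1e-12))
```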
Figure 11:
Example of the ``hard edge'' failure mode for background events (left) and signal events (right). If the distance correlation loss is evaluated only with background-labeled events, then the interplay between the DisCo and BCE losses for the classifiers can yield a solution where decorrelation is only achieved in a subregion of the two-dimensional plane. White bins contain no events.
Figure 12:
The nonclosure (left) and raw statistical significance (N_{\text{sig}}/\sqrt{\smash[b]{N_{\text{bkg}}}}) (right) as a function of the placement of the ABCD region boundaries for the stealth SUSY training set. The dashed black lines represent the optimal boundaries for both nonclosure and statistical significance.
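A hedged sketch of such a scan with hard cuts, assuming the significance is evaluated with the background yield in the A region (NumPy arrays of per-event scores assumed):

```python
# Sketch of the boundary scan behind Fig. 12: for each candidate
# boundary pair, compute the nonclosure and N_sig / sqrt(N_bkg).
import numpy as np

def scan_boundaries(s1_bkg, s2_bkg, s1_sig, s2_sig,
                    grid=np.linspace(0.1, 0.9, 17)):
    results = []
    for b1 in grid:
        for b2 in grid:
            hi1, hi2 = s1_bkg > b1, s2_bkg > b2
            N_A = np.sum(hi1 & hi2)     # background yield in each region
            N_B = np.sum(hi1 & ~hi2)
            N_C = np.sum(~hi1 & hi2)
            N_D = np.sum(~hi1 & ~hi2)
            if min(N_A, N_D) == 0:
                continue                # prediction undefined: skip
            nonclosure = abs(N_B * N_C / (N_D * N_A) - 1.0)
            N_sig = np.sum((s1_sig > b1) & (s2_sig > b2))
            significance = N_sig / np.sqrt(N_A)
            results.append((b1, b2, nonclosure, significance))
    return results
```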
Figure 13:
Scans of the Pareto front in the L_{\text{nonclosure}} vs. L_{\text{BCE}} plane with the \lambda (left) and MDMM (right) methods. All trainings are conducted with PYTORCH. Each line shows the training of a model with different values of the \lambda or \epsilon parameters, respectively.
Figure 14:
The significance and nonclosure values from scanning the ABCD boundaries, with training using the \lambda method (top) and MDMM (bottom). Multiple trainings with different initial weights and biases are shown for each of the two training approaches. Each point represents a given training and a given choice of boundaries; the color of the point represents the nonclosure mean (left) and variance (right) from all boundary choices for the training used for that point.
Figure 15:
Visual representations of the three VRs (VR I, VR II, and VR III). The solid blue lines indicate the bin boundaries from the full ABCD plane. The dashed lines represent the starting points of the ABCD subregion edges that are moved during the validation procedure, as described in the text. These lines divide the VRs into regions dA, dB, dC, and dD; the latter three are used to predict the number of events in dA, following the ABCD prescription.
Figure 16:
The nonclosure in simulation and observed data when iteratively increasing the sub-ABCD region boundaries as described in the text, for VR I, VR II, and VR III. The scans of VR I and VR II start at the outer boundary value of 0.4, while the scan for VR III starts at the boundary value of 0.6. Observed data are shown only for boundary values where the signal contamination is below 5% in the analogue of the A region.
Summary
The ABCD background estimation technique uses three control regions, defined by boundaries in the space of two statistically independent variables, to predict the background in the signal region. There has been ongoing interest in automating this technique with machine learning, such as the ABCDisCo method [3], which uses the distance correlation (DisCo) metric to train a neural network to produce two independent discriminants. In this note, we introduce the ABCDisCo training enhanced with closure (ABCDisCoTEC) method, which enhances this automation by adding a novel differentiable loss term. This term quantifies the deviation in the background prediction (``nonclosure''), more directly addressing the primary objective of the ABCD technique. The result, from a case study using a high-energy physics data set with a simulated stealth supersymmetry signal, is a pair of decorrelated discriminants with strong signal-background separation, reducing the systematic uncertainty from deviations in the background estimation and improving the significance for a potential observation of physics beyond the standard model. Strategies for validating the method in proton-proton collision data and deriving applicable systematic uncertainties were presented. These achievements represent promising steps towards a generalized strategy to automate background estimation in high-energy physics.

Additionally, an alternative approach to training the ABCDisCoTEC neural network was evaluated, using the modified differential method of multipliers (MDMM) [8]. The MDMM approach promotes the relative weights of the additional loss terms to learnable parameters, specifies constraints on each additional loss term directly, and guarantees the convergence of the training. This method provides further benefits when applied to ABCDisCoTEC by more quickly and directly finding optimal regions in hyperparameter space, without requiring exhaustive manual searches.

Several potential modifications or extensions to the loss function could further improve the performance of the ABCDisCoTEC method. For example, incorporating a normalized measure of the signal yield in the control regions as an additional loss term could directly constrain the allowed contamination [3]. This modification could improve the network's sensitivity to the signal while simplifying validation by reducing high signal contamination in certain regions. Another promising extension involves augmenting the nonclosure loss term based on the extended ABCD method [31], which uses additional control regions to mitigate nonclosure effects from minor correlations between the discriminants. By following the approach employed here, the extended nonclosure loss could be made differentiable by relaxing hard boundaries using sigmoid functions.

In conclusion, the combination of the ABCDisCoTEC method and MDMM represents another step forward in the optimization and automation of the traditional ABCD background estimation method. This technique has promising applications to a wide variety of high-energy physics analyses and the potential for further refinement in the future.
References
[1] W. Buttinger, "Background estimation with the ABCD method featuring the TRooFit toolkit".
[2] O. Behnke, K. Kröninger, G. Schott, and T. Schörner-Sadenius, eds., "Data analysis in high energy physics: a practical guide to statistical methods", Wiley-VCH, Weinheim, Germany, 2013.
[3] G. Kasieczka, B. Nachman, M. D. Schwartz, and D. Shih, "Automating the ABCD method with machine learning", Phys. Rev. D 103 (2021) 035021, arXiv:2007.14400.
[4] G. J. Székely, M. L. Rizzo, and N. K. Bakirov, "Measuring and testing dependence by correlation of distances", Ann. Stat. 35 (2007) 2769, arXiv:0803.4101.
[5] CMS Collaboration, "The CMS experiment at the CERN LHC", JINST 3 (2008) S08004.
[6] L. Evans and P. Bryant, "LHC machine", JINST 3 (2008) S08001.
[7] CMS Collaboration, "Search for top squarks in final states with many light-flavor jets and 0, 1, or 2 charged leptons in proton-proton collisions at \sqrt{s} = 13 TeV", CMS Physics Analysis Summary CMS-PAS-SUS-23-001, 2024.
[8] J. Platt and A. Barr, "Constrained differential optimization", in Proceedings of the 1st International Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, 1987.
[9] K. Pearson, "Note on regression and inheritance in the case of two parents", Proceedings of the Royal Society of London, 1895.
[10] A. V. Lotov and K. Miettinen, "Visualizing the Pareto frontier", Springer Berlin Heidelberg, 2008.
[11] J. Fan, M. Reece, and J. T. Ruderman, "Stealth supersymmetry", JHEP 11 (2011) 012, arXiv:1105.5135.
[12] J. Fan, M. Reece, and J. T. Ruderman, "A stealth supersymmetry sampler", JHEP 07 (2012) 196, arXiv:1201.4875.
[13] J. Fan et al., "Stealth supersymmetry simplified", JHEP 07 (2016) 016, arXiv:1512.05781.
[14] CMS Collaboration, "Search for top squarks in final states with two top quarks and several light-flavor jets in proton-proton collisions at \sqrt{s} = 13 TeV", Phys. Rev. D 104 (2021) 032006, CMS-SUS-19-004, arXiv:2102.06976.
[15] F. Chollet et al., "Keras".
[16] M. Abadi et al., "TensorFlow: a system for large-scale machine learning", in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16), USENIX Association, Savannah, GA, USA, 2016, arXiv:1605.08695.
[17] A. Paszke et al., "PyTorch: an imperative style, high-performance deep learning library", in Proceedings of the 33rd International Conference on Neural Information Processing Systems, Curran Associates Inc., 2019.
[18] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization", arXiv:1412.6980.
[19] P. Nason, "A new method for combining NLO QCD with shower Monte Carlo algorithms", JHEP 11 (2004) 040, arXiv:hep-ph/0409146.
[20] S. Frixione, P. Nason, and C. Oleari, "Matching NLO QCD computations with parton shower simulations: the POWHEG method", JHEP 11 (2007) 070, arXiv:0709.2092.
[21] S. Alioli, P. Nason, C. Oleari, and E. Re, "A general framework for implementing NLO calculations in shower Monte Carlo programs: the POWHEG BOX", JHEP 06 (2010) 043, arXiv:1002.2581.
[22] S. Frixione, P. Nason, and G. Ridolfi, "A positive-weight next-to-leading-order Monte Carlo for heavy flavour hadroproduction", JHEP 09 (2007) 126, arXiv:0707.3088.
[23] M. Czakon and A. Mitov, "Top++: a program for the calculation of the top-pair cross-section at hadron colliders", Comput. Phys. Commun. 185 (2014) 2930, arXiv:1112.5675.
[24] J. Alwall et al., "The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations", JHEP 07 (2014) 079, arXiv:1405.0301.
[25] C. Borschensky et al., "Squark and gluino production cross sections in pp collisions at \sqrt{s} = 13, 14, 33 and 100 TeV", Eur. Phys. J. C 74 (2014) 3174, arXiv:1407.5066.
[26] W. Beenakker et al., "NNLL-fast: predictions for coloured supersymmetric particle production at the LHC with threshold and Coulomb resummation", JHEP 12 (2016) 133, arXiv:1607.07741.
[27] T. Sjöstrand et al., "An introduction to PYTHIA 8.2", Comput. Phys. Commun. 191 (2015) 159, arXiv:1410.3012.
[28] NNPDF Collaboration, "Parton distributions from high-precision collider data", Eur. Phys. J. C 77 (2017) 663, arXiv:1706.00428.
[29] CMS Collaboration, "Extraction and validation of a new set of CMS PYTHIA8 tunes from underlying-event measurements", Eur. Phys. J. C 80 (2020) 4, CMS-GEN-17-001, arXiv:1903.12179.
[30] GEANT4 Collaboration, "GEANT4---a simulation toolkit", Nucl. Instrum. Meth. A 506 (2003) 250.
[31] S. Choi and H. Oh, "Improved extrapolation methods of data-driven background estimations in high energy physics", Eur. Phys. J. C 81 (2021) 643, arXiv:1906.10831.