CMS-PAS-MLG-23-003
Extension and first application of the ABCDisCo method with LHC data: ABCDisCoTEC
CMS Collaboration
27 March 2025
Abstract: The ``ABCD method'' provides a reliable framework to estimate backgrounds using observed data by partitioning events into one signal-enhanced region (A) and three background-enhanced control regions (B, C, and D) via two statistically independent variables. In practice, even slight correlations between the two variables can significantly undermine the method's performance. Thus, choosing appropriate variables by hand can present a formidable challenge, especially when background and signal differ only subtly. To address this issue, the ABCDisCo method (ABCD with distance correlation) was developed to construct two artificial variables from the output scores of a neural network trained to maximize signal-background discrimination while minimizing correlations using the distance correlation measure. However, relying solely on minimizing the distance correlation can introduce undesirable characteristics in the resulting distributions, which may compromise the validity of the background prediction obtained using this method. The ABCDisCo training enhanced with closure (ABCDisCoTEC) method is introduced to provide a novel solution to this issue by directly minimizing the nonclosure, expressed as a dedicated differentiable loss term. This extended method is applied to a data set of proton-proton collisions at a center-of-mass energy of 13 TeV recorded by the CMS detector at the CERN LHC. Additionally, given the complexity of the minimization problem with constraints on multiple loss terms, the modified differential method of multipliers is applied and shown to greatly improve the stability and robustness of the ABCDisCoTEC method, compared to grid search hyperparameter optimization procedures.
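For reference, the relations underlying the method take the standard ABCD form; below is a minimal statement, assuming the conventional definition of the nonclosure as the relative deviation of the predicted yield from the observed one:

N_A^{\text{pred}} = \frac{N_B \, N_C}{N_D},
\qquad
\text{nonclosure} = \left| \frac{N_A^{\text{pred}}}{N_A} - 1 \right| = \left| \frac{N_B \, N_C}{N_D \, N_A} - 1 \right|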
Links: CDS record (PDF); CADI line (restricted)
Figures
Figure 1:
Schematic illustration of idealized signal (red) and background (grey) distributions in the ABCD plane, represented as Gaussian kernel density estimators (KDEs).
Figure 2:
Diagrammatic layout of the ABCDisCoTEC NN model. Features x_i (blue) are initially fed into a hidden layer containing M nodes. The output of this hidden layer is fed into two binary classifiers, each with one hidden layer f and an output layer S^{\mathrm{NN}}_{1} or S^{\mathrm{NN}}_{2}. Both outputs are used to compute the loss function L_{\text{total}}. Colored arrows indicate the direction of propagation from inputs to final network output. Black arrows represent the backpropagation carried out during training.
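As a rough PyTorch sketch of the topology in Fig. 2 (layer widths, activations, and head depth are illustrative assumptions; the note specifies only the overall structure):

```python
# Minimal sketch of the two-headed ABCDisCoTEC architecture of Fig. 2.
# Layer sizes and activations are illustrative, not the note's values.
import torch
import torch.nn as nn

class ABCDisCoTECNet(nn.Module):
    def __init__(self, n_features: int, M: int = 64):
        super().__init__()
        # Shared hidden layer receiving the input features x_i
        self.shared = nn.Sequential(nn.Linear(n_features, M), nn.ReLU())
        # Two binary classifiers, each with one hidden layer f and a
        # sigmoid output producing S_NN1 or S_NN2 in [0, 1]
        def head():
            return nn.Sequential(
                nn.Linear(M, M), nn.ReLU(),
                nn.Linear(M, 1), nn.Sigmoid(),
            )
        self.head1, self.head2 = head(), head()

    def forward(self, x):
        h = self.shared(x)
        return self.head1(h).squeeze(-1), self.head2(h).squeeze(-1)
```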
Figure 3:
An example of the sigmoid function in Eq. (7) with choices for the scale parameter a of 10 (upper left), 100 (upper right), and the limit a\rightarrow\infty (lower). The boundaries are b_{1,2} = 0.5. The sigmoid distribution shown in the upper right is used for the ABCD region event counts in the following studies.
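To make the role of a concrete, here is a hedged sketch of how sigmoid-relaxed boundaries yield differentiable ABCD region counts and a differentiable nonclosure term; the exact forms of Eq. (7) and of the note's nonclosure loss are assumed, not reproduced.

```python
# Hedged sketch of the sigmoid relaxation of hard ABCD boundary cuts.
import torch

def soft_region_counts(s1, s2, b1=0.5, b2=0.5, a=100.0):
    """Differentiable ABCD region yields from per-event scores s1, s2.

    A hard cut s > b is replaced by the weight sigmoid(a * (s - b));
    a -> infinity recovers the step function of Fig. 3 (lower).
    """
    w1 = torch.sigmoid(a * (s1 - b1))  # weight for passing boundary b1
    w2 = torch.sigmoid(a * (s2 - b2))  # weight for passing boundary b2
    N_A = (w1 * w2).sum()              # pass both cuts: signal region
    N_B = (w1 * (1 - w2)).sum()        # pass b1 only
    N_C = ((1 - w1) * w2).sum()        # pass b2 only
    N_D = ((1 - w1) * (1 - w2)).sum()  # fail both cuts
    return N_A, N_B, N_C, N_D

def nonclosure_loss(s1, s2, **kw):
    # Relative deviation of the ABCD prediction, kept differentiable
    N_A, N_B, N_C, N_D = soft_region_counts(s1, s2, **kw)
    return torch.abs(N_B * N_C / (N_D * N_A) - 1.0)
```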
Figure 4:
Schematic illustration of the training path of the NN in the space of the learned parameters for the \lambda method (blue) and MDMM (orange). The case of an equality constraint on a generic loss component L_2 (L_2 = \epsilon_2) is shown. Both convex (left) and nonconvex (right) Pareto fronts, represented by dashed black lines, are shown. The dashed red line represents the subspace on which the condition L_2 = \epsilon_2 is met.
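As a rough illustration of the difference, a minimal PyTorch sketch of one MDMM training step under the equality constraint L_2 = \epsilon_2 follows the general recipe of Platt and Barr [8]. All names (model, loss functions, optimizers) are hypothetical placeholders, and the quadratic damping term with coefficient c is one common stabilization choice, not necessarily the note's. The multiplier lam is created with requires_grad=True and registered with its own optimizer; flipping the sign of its gradient turns the descent step into the required ascent step.

```python
# Hedged sketch of one MDMM update, following Platt and Barr [8].
import torch

def mdmm_step(model, opt_theta, lam, opt_lam, x, y,
              loss_bce, loss_l2, eps2, c=1.0):
    """Gradient descent on the network weights, ascent on lam."""
    s1, s2 = model(x)                  # the two discriminant scores
    g = loss_l2(s1, s2) - eps2         # constraint violation L2 - eps_2
    total = loss_bce(s1, s2, y) + lam * g + 0.5 * c * g ** 2
    opt_theta.zero_grad()
    opt_lam.zero_grad()
    total.backward()
    lam.grad.neg_()                    # sign flip: gradient ascent on lam
    opt_theta.step()
    opt_lam.step()
    return total.detach()
```

With c = 0 this reduces to the basic differential method of multipliers; the damping term penalizes constraint violations and helps convergence near nonconvex fronts.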
Figure 5:
The values of the individual loss components scaled by the corresponding \lambda value, computed on the training set (left), and the total loss computed on both the training and test sets (right), as a function of epoch for the stealth SUSY case. The network training is stopped after 100 epochs, after which neither the training nor the test sample loss decreases. The differences in magnitude between the loss components result, in part, from the magnitudes of their respective hyperparameters; the total loss decreases smoothly.
Figure 6:
The distributions of background (left) and signal (right) events in the two-dimensional discriminant plane for the final training of the ABCDisCoTEC model with the stealth SUSY training set. White bins contain no events.
Figure 7:
The ROC curves for S^{\mathrm{NN}}_{1} (left) and S^{\mathrm{NN}}_{2} (right) from the ABCDisCoTEC model using the stealth SUSY training set, for different values of the top squark mass in units of GeV. The performance is measured using an eight-fold cross-validation; the shaded regions represent one standard deviation around the mean true positive rate at each false positive rate across the folds. The black dashed line represents the ROC curve for random classification. The values in brackets are the areas under the ROC curves.
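One plausible way to produce such bands, sketched under the assumption that the folds are combined by interpolating each fold's true positive rate onto a common false positive rate grid (scikit-learn assumed):

```python
# Hedged sketch of cross-validated ROC bands as in Fig. 7.
import numpy as np
from sklearn.metrics import roc_curve

def roc_band(y_folds, score_folds, fpr_grid=np.linspace(0, 1, 101)):
    """Mean and std of the TPR over folds, on a common FPR grid."""
    tprs = []
    for y, s in zip(y_folds, score_folds):      # one entry per fold
        fpr, tpr, _ = roc_curve(y, s)
        tprs.append(np.interp(fpr_grid, fpr, tpr))
    tprs = np.asarray(tprs)
    return fpr_grid, tprs.mean(axis=0), tprs.std(axis=0)
```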
Figure 8:
Comparison of ROC curves from a single binary classifier DNN, from each discriminant of the ABCDisCoTEC model, and from the two ABCDisCoTEC discriminants combined, for all top squark masses. The combined entry is computed via the distance of the two discriminant values from (0,0). The black dashed line represents the ROC curve for random classification.
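A minimal sketch of that combination on toy scores (scikit-learn assumed for the AUC; the data here are synthetic, purely for illustration):

```python
# Sketch of the combined discriminant used for the last ROC entry in
# Fig. 8: the Euclidean distance of (S_NN1, S_NN2) from the origin.
import numpy as np
from sklearn.metrics import roc_auc_score

def combined_score(s1: np.ndarray, s2: np.ndarray) -> np.ndarray:
    return np.hypot(s1, s2)  # sqrt(s1^2 + s2^2)

# Toy example: labels y in {0, 1}, higher score = more signal-like
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
s1 = np.clip(rng.normal(0.3 + 0.3 * y, 0.2), 0, 1)
s2 = np.clip(rng.normal(0.3 + 0.3 * y, 0.2), 0, 1)
print("AUC(combined) =", roc_auc_score(y, combined_score(s1, s2)))
```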
Figure 9:
Comparison of the nonclosure and normalized significance from the three loss function variations (DisCo only, nonclosure only, and DisCo and nonclosure) for the stealth SUSY training set using a scan of boundaries over multiple trainings. The significance values from Eq. (13) are normalized by the maximum significance value of the three training configurations.
Figure 10:
Examples of the ``four corners'' failure mode: background (left) and signal (right) distributions in which the distance correlation is approximately zero. A relatively large \lambda_{\text{DisCo}} value may result in convergence to this failure mode. White bins contain no events.
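For context, the quantity being minimized is the sample distance correlation of Székely et al. [4]; a generic differentiable implementation (not the note's code) is sketched below. Configurations like the four-corners one can drive this quantity toward zero while remaining useless for an ABCD estimate, which motivates the additional nonclosure loss.

```python
# Hedged sketch of the (biased) sample distance correlation of two
# 1D score tensors, kept differentiable for use as a loss term.
import torch

def distance_correlation(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    def doubly_centered(z):
        d = torch.abs(z.unsqueeze(0) - z.unsqueeze(1))  # pairwise |z_i - z_j|
        return (d - d.mean(dim=0, keepdim=True)
                  - d.mean(dim=1, keepdim=True) + d.mean())
    A, B = doubly_centered(x), doubly_centered(y)
    dcov2_xy = (A * B).mean()   # squared distance covariance
    dcov2_xx = (A * A).mean()
    dcov2_yy = (B * B).mean()
    return torch.sqrt(dcov2_xy.clamp(min=0.0)
                      / torch.sqrt(dcov2_xx * dcov2_yy).clamp(min=1e-12))
```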
Figure 11:
Example of the ``hard edge'' failure mode for background events (left) and signal events (right). If the distance correlation loss is evaluated only with background-labeled events, then the interplay between the DisCo and BCE losses for the classifiers can yield a solution where decorrelation is only achieved in a subregion of the two-dimensional plane. White bins contain no events.
Figure 12:
The nonclosure (left) and raw statistical significance (N_{\text{sig}}/\sqrt{\smash[b]{N_{\text{bkg}}}}) (right) as a function of the placement of the ABCD region boundaries for the stealth SUSY training set. The dashed black lines represent the optimal boundaries for both nonclosure and statistical significance.
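A hedged sketch of such a scan with hard cuts, assuming the significance is evaluated with the background yield in the A region (NumPy arrays of per-event scores assumed):

```python
# Sketch of the boundary scan behind Fig. 12: for each candidate
# boundary pair, compute the nonclosure and N_sig / sqrt(N_bkg).
import numpy as np

def scan_boundaries(s1_bkg, s2_bkg, s1_sig, s2_sig,
                    grid=np.linspace(0.1, 0.9, 17)):
    results = []
    for b1 in grid:
        for b2 in grid:
            hi1, hi2 = s1_bkg > b1, s2_bkg > b2
            N_A = np.sum(hi1 & hi2)     # background yield in each region
            N_B = np.sum(hi1 & ~hi2)
            N_C = np.sum(~hi1 & hi2)
            N_D = np.sum(~hi1 & ~hi2)
            if min(N_A, N_D) == 0:
                continue                # prediction undefined: skip
            nonclosure = abs(N_B * N_C / (N_D * N_A) - 1.0)
            N_sig = np.sum((s1_sig > b1) & (s2_sig > b2))
            significance = N_sig / np.sqrt(N_A)
            results.append((b1, b2, nonclosure, significance))
    return results
```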
Figure 13:
Scans of the Pareto front in the L_{\text{nonclosure}} vs. L_{\text{BCE}} plane with the \lambda (left) and MDMM (right) methods. All trainings are conducted with PYTORCH. Each line shows the training of a model with different values of the \lambda or \epsilon parameters, respectively.
Figure 14:
The significance and nonclosure values from scanning the ABCD boundaries, with training using the \lambda method (top) and MDMM (bottom). Multiple trainings with different initial weights and biases are shown for each of the two training approaches. Each point represents a given training and a given choice of boundaries; the color of the point represents the nonclosure mean (left) and variance (right) from all boundary choices for the training used for that point.
Figure 15:
Visual representations of the three VRs (VR I, VR II, and VR III). The solid blue lines indicate the bin boundaries from the full ABCD plane. The dashed lines represent the starting points of the ABCD subregion edges that are moved during the validation procedure, as described in the text. These lines divide the VRs into regions dA, dB, dC, and dD; the latter three are used to predict the number of events in dA, following the ABCD prescription.
Figure 16:
The nonclosure in simulation and observed data when iteratively increasing the sub-ABCD region boundaries as described in the text, for VR I, VR II, and VR III. The scans of VR I and VR II start at the outer boundary value of 0.4, while the scan for VR III starts at the boundary value of 0.6. Observed data are shown only for boundary values where the signal contamination is below 5% in the analogue of the A region.
Summary
The ABCD background estimation technique uses three control regions, defined by boundaries in the space of two statistically independent variables, to predict the background in the signal region. There has been ongoing interest in automating this technique with machine learning, such as the ABCDisCo method [3], which uses the distance correlation (DisCo) metric to train a neural network to produce two independent discriminants. In this note, we introduce the ABCDisCo training enhanced with closure (ABCDisCoTEC) method, which enhances this automation by adding a novel differentiable loss term. This term quantifies the deviation in the background prediction (``nonclosure''), more directly addressing the primary objective of the ABCD technique. The result, from a case study using a high-energy physics data set with a simulated stealth supersymmetry signal, is a pair of decorrelated discriminants with strong signal-background separation, reducing the systematic uncertainty from deviations in the background estimation and improving the significance for a potential observation of physics beyond the standard model. Strategies for validating the method in proton-proton collision data and deriving applicable systematic uncertainties were presented. These achievements represent promising steps towards a generalized strategy to automate background estimation in high-energy physics.

Additionally, an alternative approach to training the ABCDisCoTEC neural network was evaluated, using the modified differential method of multipliers (MDMM) [8]. The MDMM approach promotes the relative weights of the additional loss terms to learnable parameters, specifies constraints on each additional loss term directly, and guarantees the convergence of the training. This method provides further benefits when applied to ABCDisCoTEC by more quickly and directly finding optimal regions in hyperparameter space, without requiring exhaustive manual searches.

Several potential modifications or extensions to the loss function could further improve the performance of the ABCDisCoTEC method. For example, incorporating a normalized measure of the signal yield in the control regions as an additional loss term could directly constrain the allowed contamination [3]. This modification could improve the network's sensitivity to the signal while simplifying validation by reducing high signal contamination in certain regions. Another promising extension involves augmenting the nonclosure loss term based on the extended ABCD method [31], which uses additional control regions to mitigate nonclosure effects from minor correlations between the discriminants. By following the approach employed here, the extended nonclosure loss could be made differentiable by relaxing hard boundaries using sigmoid functions.

In conclusion, the combination of the ABCDisCoTEC method and MDMM represents another step forward in the optimization and automation of the traditional ABCD background estimation method. This technique has promising applications to a wide variety of high-energy physics analyses and the potential for further refinement in the future.
References
[1] W. Buttinger, "Background estimation with the ABCD method featuring the TRooFit toolkit".
[2] O. Behnke, K. Kröninger, G. Schott, and T. Schörner-Sadenius, eds., "Data analysis in high energy physics: a practical guide to statistical methods", Wiley-VCH, Weinheim, Germany, 2013.
[3] G. Kasieczka, B. Nachman, M. D. Schwartz, and D. Shih, "Automating the ABCD method with machine learning", Phys. Rev. D 103 (2021) 035021, arXiv:2007.14400.
[4] G. J. Székely, M. L. Rizzo, and N. K. Bakirov, "Measuring and testing dependence by correlation of distances", Ann. Stat. 35 (2007) 2769, arXiv:0803.4101.
[5] CMS Collaboration, "The CMS experiment at the CERN LHC", JINST 3 (2008) S08004.
[6] L. Evans and P. Bryant, "LHC machine", JINST 3 (2008) S08001.
[7] CMS Collaboration, "Search for top squarks in final states with many light-flavor jets and 0, 1, or 2 charged leptons in proton-proton collisions at \sqrt{s} = 13 TeV", CMS Physics Analysis Summary CMS-PAS-SUS-23-001, 2024.
[8] J. Platt and A. Barr, "Constrained differential optimization", in Proceedings of the 1st International Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, 1987.
[9] K. Pearson, "Note on regression and inheritance in the case of two parents", Proceedings of the Royal Society of London, 1895.
[10] A. V. Lotov and K. Miettinen, "Visualizing the Pareto frontier", Springer Berlin Heidelberg, 2008.
[11] J. Fan, M. Reece, and J. T. Ruderman, "Stealth supersymmetry", JHEP 11 (2011) 012, arXiv:1105.5135.
[12] J. Fan, M. Reece, and J. T. Ruderman, "A stealth supersymmetry sampler", JHEP 07 (2012) 196, arXiv:1201.4875.
[13] J. Fan et al., "Stealth supersymmetry simplified", JHEP 07 (2016) 016, arXiv:1512.05781.
[14] CMS Collaboration, "Search for top squarks in final states with two top quarks and several light-flavor jets in proton-proton collisions at \sqrt{s} = 13 TeV", Phys. Rev. D 104 (2021) 032006, CMS-SUS-19-004, arXiv:2102.06976.
[15] F. Chollet et al., "Keras".
[16] M. Abadi et al., "TensorFlow: a system for large-scale machine learning", in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16), USENIX Association, Savannah, GA, USA, 2016, arXiv:1605.08695.
[17] A. Paszke et al., "PyTorch: an imperative style, high-performance deep learning library", in Proceedings of the 33rd International Conference on Neural Information Processing Systems, Curran Associates Inc., 2019.
[18] D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization", arXiv:1412.6980.
[19] P. Nason, "A new method for combining NLO QCD with shower Monte Carlo algorithms", JHEP 11 (2004) 040, arXiv:hep-ph/0409146.
[20] S. Frixione, P. Nason, and C. Oleari, "Matching NLO QCD computations with parton shower simulations: the POWHEG method", JHEP 11 (2007) 070, arXiv:0709.2092.
[21] S. Alioli, P. Nason, C. Oleari, and E. Re, "A general framework for implementing NLO calculations in shower Monte Carlo programs: the POWHEG BOX", JHEP 06 (2010) 043, arXiv:1002.2581.
[22] S. Frixione, P. Nason, and G. Ridolfi, "A positive-weight next-to-leading-order Monte Carlo for heavy flavour hadroproduction", JHEP 09 (2007) 126, arXiv:0707.3088.
[23] M. Czakon and A. Mitov, "Top++: a program for the calculation of the top-pair cross-section at hadron colliders", Comput. Phys. Commun. 185 (2014) 2930, arXiv:1112.5675.
[24] J. Alwall et al., "The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations", JHEP 07 (2014) 079, arXiv:1405.0301.
[25] C. Borschensky et al., "Squark and gluino production cross sections in pp collisions at \sqrt{s} = 13, 14, 33 and 100 TeV", Eur. Phys. J. C 74 (2014) 3174, arXiv:1407.5066.
[26] W. Beenakker et al., "NNLL-fast: predictions for coloured supersymmetric particle production at the LHC with threshold and Coulomb resummation", JHEP 12 (2016) 133, arXiv:1607.07741.
[27] T. Sjöstrand et al., "An introduction to PYTHIA 8.2", Comput. Phys. Commun. 191 (2015) 159, arXiv:1410.3012.
[28] NNPDF Collaboration, "Parton distributions from high-precision collider data", Eur. Phys. J. C 77 (2017) 663, arXiv:1706.00428.
[29] CMS Collaboration, "Extraction and validation of a new set of CMS PYTHIA8 tunes from underlying-event measurements", Eur. Phys. J. C 80 (2020) 4, CMS-GEN-17-001, arXiv:1903.12179.
[30] GEANT4 Collaboration, "GEANT4---a simulation toolkit", Nucl. Instrum. Meth. A 506 (2003) 250.
[31] S. Choi and H. Oh, "Improved extrapolation methods of data-driven background estimations in high energy physics", Eur. Phys. J. C 81 (2021) 643, arXiv:1906.10831.