Compact Muon Solenoid
LHC, CERN

CMS-PAS-MLG-23-003
Extension and first application of the ABCDisCo method with LHC data: ABCDisCoTEC
Abstract: The ``ABCD method'' provides a reliable framework to estimate backgrounds using observed data by partitioning events into one signal-enhanced region (A) and three background-enhanced control regions (B, C, and D) via two statistically independent variables. In practice, even slight correlations between the two variables can significantly undermine the method's performance. Thus, choosing appropriate variables by hand can present a formidable challenge, especially when background and signal differ only subtly. To address this issue, the ABCDisCo method (ABCD with distance correlation) was developed to construct two artificial variables from the output scores of a neural network trained to maximize signal-background discrimination while minimizing correlations using the distance correlation measure. However, relying solely on minimizing the distance correlation can introduce undesirable characteristics in the resulting distributions, which may compromise the validity of the background prediction obtained using this method. The ABCDisCo training enhanced with closure (ABCDisCoTEC) method is introduced to provide a novel solution to this issue by directly minimizing the nonclosure, expressed as a dedicated differentiable loss term. This extended method is applied to a data set of proton-proton collisions at a center-of-mass energy of 13 TeV recorded by the CMS detector at the CERN LHC. Additionally, given the complexity of the minimization problem with constraints on multiple loss terms, the modified differential method of multipliers is applied and shown to greatly improve the stability and robustness of the ABCDisCoTEC method, compared to grid search hyperparameter optimization procedures.
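As a concrete illustration of the prediction described above, the following minimal Python sketch computes the standard ABCD background estimate N_A^{pred} = N_B N_C / N_D and its relative nonclosure. The uniformly distributed toy variables, boundaries, and event counts are hypothetical stand-ins for the two discriminants, not quantities from this note.

```python
import numpy as np

# Hypothetical background-only events in the plane of two variables s1 and s2,
# independent by construction, partitioned by boundaries b1 and b2 into:
#   A: s1 > b1, s2 > b2   (signal region)
#   B: s1 > b1, s2 <= b2
#   C: s1 <= b1, s2 > b2
#   D: s1 <= b1, s2 <= b2
rng = np.random.default_rng(7)
s1, s2 = rng.random(100_000), rng.random(100_000)
b1 = b2 = 0.5

N_A = np.sum((s1 > b1) & (s2 > b2))
N_B = np.sum((s1 > b1) & (s2 <= b2))
N_C = np.sum((s1 <= b1) & (s2 > b2))
N_D = np.sum((s1 <= b1) & (s2 <= b2))

# If s1 and s2 are independent, the control regions predict region A.
N_A_pred = N_B * N_C / N_D
nonclosure = abs(N_A_pred - N_A) / N_A
print(f"observed {N_A}, predicted {N_A_pred:.1f}, nonclosure {nonclosure:.3f}")
```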
Figures

Figure 1:
Schematic illustration of idealized signal (red) and background (grey) distributions in the ABCD plane, represented as Gaussian kernel density estimators (KDEs).

Figure 2:
Diagrammatic layout of the ABCDisCoTEC NN model. Features x_i (blue) are initially fed into a hidden layer containing M nodes. The resulting output of this hidden layer is fed into two binary classifiers, each with one hidden layer f and output layer S^{\mathrm{NN}}_{1} or S^{\mathrm{NN}}_{2}. Both outputs are used to compute the loss function L_{\text{total}}. Colored arrows indicate the direction of propagation from inputs to final network output. Black arrows represent the backpropagation carried out during training.
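A minimal PyTorch sketch of this layout is given below; the layer sizes, activations, and class name are illustrative assumptions, not the trained configuration used in the note.

```python
import torch
import torch.nn as nn

class ABCDisCoTECNet(nn.Module):
    """Shared hidden layer feeding two independent classifier heads (cf. Figure 2)."""

    def __init__(self, n_features: int, m_hidden: int = 64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_features, m_hidden), nn.ReLU())
        # Each head: one hidden layer f and a single sigmoid output node.
        self.head1 = nn.Sequential(nn.Linear(m_hidden, m_hidden), nn.ReLU(),
                                   nn.Linear(m_hidden, 1), nn.Sigmoid())
        self.head2 = nn.Sequential(nn.Linear(m_hidden, m_hidden), nn.ReLU(),
                                   nn.Linear(m_hidden, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor):
        h = self.shared(x)  # features x_i -> hidden layer with M nodes
        return self.head1(h).squeeze(-1), self.head2(h).squeeze(-1)  # S_NN1, S_NN2
```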

Figure 3:
An example of the sigmoid function in Eq. (7) with choices for the scale parameter a of 10 (upper left), 100 (upper right), and the limit a\rightarrow\infty (lower). The boundaries are b_{1,2} = 0.5. The sigmoid distribution shown in the upper right is used for the ABCD region event counts in the following studies.
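Because a hard cut is not differentiable, Eq. (7) replaces the step function at each boundary with a sigmoid of adjustable steepness. The sketch below assumes Eq. (7) has the form \sigma(a(s-b)) and uses it to build soft ABCD counts and a differentiable nonclosure term; the note's exact nonclosure definition may differ.

```python
import torch

def soft_count_weight(s, boundary, a=100.0):
    """Differentiable stand-in for the step function s > boundary (cf. Eq. (7));
    a = 100 approximates the hard cut well, as in the upper right panel of Figure 3."""
    return torch.sigmoid(a * (s - boundary))

def soft_abcd_counts(s1, s2, b1=0.5, b2=0.5, a=100.0):
    w1, w2 = soft_count_weight(s1, b1, a), soft_count_weight(s2, b2, a)
    n_a = (w1 * w2).sum()
    n_b = (w1 * (1 - w2)).sum()
    n_c = ((1 - w1) * w2).sum()
    n_d = ((1 - w1) * (1 - w2)).sum()
    return n_a, n_b, n_c, n_d

def nonclosure_loss(s1, s2, **kw):
    """Differentiable analogue of |N_B N_C / (N_D N_A) - 1|, usable as a loss term."""
    n_a, n_b, n_c, n_d = soft_abcd_counts(s1, s2, **kw)
    return torch.abs(n_b * n_c / (n_d * n_a) - 1.0)
```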

Figure 4:
Schematic illustration of the training path of the NN in the space of the learned parameters for the \lambda method (blue) and MDMM (orange). The case of an equality constraint on a generic loss component L_2 (L_2 = \epsilon_2) is shown. Both convex (left) and nonconvex (right) Pareto fronts, represented by dashed black lines, are shown. The dashed red line represents the subspace on which the condition L_2 = \epsilon_2 is met.
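A minimal sketch of one MDMM update for such an equality constraint follows; the placeholder network, loss terms, learning rates, and damping coefficient c are illustrative assumptions. The multiplier is learned by gradient ascent while the weights descend, and the quadratic damping term is the "modified" ingredient that stabilizes convergence near nonconvex Pareto fronts.

```python
import torch

model = torch.nn.Linear(4, 1)             # placeholder network
lam = torch.zeros(1, requires_grad=True)  # Lagrange multiplier, now learnable
c, eps2 = 1.0, 0.01                       # damping and constraint target L2 = eps2

opt_w = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_lam = torch.optim.SGD([lam], lr=1e-2)

def mdmm_step(loss1: torch.Tensor, loss2: torch.Tensor) -> torch.Tensor:
    g = loss2 - eps2                      # constraint violation
    total = loss1 + lam * g + 0.5 * c * g * g
    opt_w.zero_grad(); opt_lam.zero_grad()
    total.backward()
    lam.grad.neg_()                       # ascent on the multiplier, descent on weights
    opt_w.step(); opt_lam.step()
    return total.detach()
```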

Figure 5:
The values of the individual loss components scaled by the corresponding \lambda value computed on the training set (left) and the total loss computed on both the training and test sets (right) as a function of epoch for the stealth SUSY case. The network training is stopped after 100 epochs, after which neither the training nor the test sample loss decreases. The differences in magnitude among the loss components result, in part, from the magnitudes of their respective \lambda hyperparameters; the total loss decreases smoothly.
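The loss components shown here can be assembled as in the following sketch of the \lambda-method total loss; disco is a differentiable sample distance correlation written for this illustration, nonclosure_loss is the sketch given after Figure 3, and the \lambda weights are placeholders rather than the values used in the note.

```python
import torch
import torch.nn.functional as F

def disco(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Differentiable sample distance correlation (DisCo, [4]) of two 1D tensors."""
    def centered(z):
        d = torch.cdist(z[:, None], z[:, None])  # pairwise |z_i - z_j|
        return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()
    a, b = centered(x), centered(y)
    return torch.sqrt((a * b).mean() / torch.sqrt((a * a).mean() * (b * b).mean()))

def total_loss(s1, s2, labels, lam_disco=1.0, lam_closure=1.0):
    """Lambda-method total loss whose scaled components appear in Figure 5.
    labels: float tensor of 0 (background) / 1 (signal)."""
    bce = F.binary_cross_entropy(s1, labels) + F.binary_cross_entropy(s2, labels)
    bkg = labels < 0.5  # decorrelation terms evaluated on background-labeled events
    return (bce
            + lam_disco * disco(s1[bkg], s2[bkg])
            + lam_closure * nonclosure_loss(s1[bkg], s2[bkg]))
```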

Figure 6:
The distributions of background (left) and signal (right) events in the two-dimensional discriminant plane for the final training of the ABCDisCoTEC model with the stealth SUSY training set. White bins contain no events.

Figure 7:
The ROC curves for S^{\mathrm{NN}}_{1} (left) and S^{\mathrm{NN}}_{2} (right) from the ABCDisCoTEC model using the stealth SUSY training set, for different values of the top squark mass in units of GeV. The performance is measured using an eight-fold cross-validation; the shaded regions represent one standard deviation of the average true positive rate at each false positive rate in the test. The black dashed line represents the ROC curve for random classification. The values in brackets are the areas under the ROC curves.

Figure 8:
Comparison of ROC curves, combined for all top squark masses, from a single binary classifier DNN, from each discriminant of the ABCDisCoTEC model, and from the two ABCDisCoTEC discriminants combined. The combined discriminant is computed as the distance of the two discriminant values from (0,0). The black dashed line represents the ROC curve for random classification.
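The combination used for the last entry can be written as a one-line score; the function and argument names below are illustrative.

```python
import numpy as np

def combined_score(s_nn1, s_nn2):
    """Single discriminant from the two outputs: the Euclidean distance of the
    point (S_NN1, S_NN2) from the origin (0,0), as used for Figure 8."""
    return np.hypot(s_nn1, s_nn2)
```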

Figure 9:
Comparison of the nonclosure and normalized significance from the three loss function variations (DisCo only, nonclosure only, and DisCo and nonclosure) for the stealth SUSY training set using a scan of boundaries over multiple trainings. The significance values from Eq. (13) are normalized by the maximum significance value of the three training configurations.

Figure 10:
Examples of the ``four corners'' failure mode: background (left) and signal (right) distributions in which the distance correlation is approximately zero. A relatively large \lambda_{\text{DisCo}} value may result in convergence to this failure mode. White bins contain no events.
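This behavior can be reproduced with the disco sketch given after Figure 5: two toy discriminants clustered at the four corners of the unit square (an idealized, factorizing stand-in for the distributions in Figure 10) have a distance correlation near zero, even though almost no events populate the boundary region, so the ABCD counts depend sharply on the boundary placement.

```python
import torch

# Toy 'four corners' configuration: independent corner assignments plus a
# small smearing, evaluated with the disco function sketched after Figure 5.
g = torch.Generator().manual_seed(0)
c1 = torch.randint(0, 2, (2000,), generator=g).float()
c2 = torch.randint(0, 2, (2000,), generator=g).float()
s1 = c1 + 0.05 * torch.randn(2000, generator=g)
s2 = c2 + 0.05 * torch.randn(2000, generator=g)
print(disco(s1, s2))  # approximately zero
```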

Figure 11:
Example of the ``hard edge'' failure mode for background events (left) and signal events (right). If the distance correlation loss is evaluated only with background-labeled events, then the interplay between the DisCo and BCE losses for the classifiers can yield a solution where decorrelation is only achieved in a subregion of the two-dimensional plane. White bins contain no events.

Figure 12:
The nonclosure (left) and raw statistical significance (N_{\text{sig}}/\sqrt{N_{\text{bkg}}}) (right) as a function of the placement of the ABCD region boundaries for the stealth SUSY training set. The dashed black lines represent the optimal boundaries for both nonclosure and statistical significance.
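A boundary scan of this kind can be sketched as follows; the inputs are per-event discriminant values for background and signal events, and the grid, names, and nonclosure definition are illustrative assumptions.

```python
import numpy as np

def scan_boundaries(s1_bkg, s2_bkg, s1_sig, s2_sig,
                    grid=np.linspace(0.1, 0.9, 17)):
    """Scan ABCD boundary placements, recording the background nonclosure and
    the raw significance N_sig / sqrt(N_bkg) in region A (cf. Figure 12)."""
    results = []
    for b1 in grid:
        for b2 in grid:
            n_a = np.sum((s1_bkg > b1) & (s2_bkg > b2))
            n_b = np.sum((s1_bkg > b1) & (s2_bkg <= b2))
            n_c = np.sum((s1_bkg <= b1) & (s2_bkg > b2))
            n_d = np.sum((s1_bkg <= b1) & (s2_bkg <= b2))
            if n_a == 0 or n_d == 0:
                continue  # degenerate boundary choice
            nonclosure = abs(n_b * n_c / (n_d * n_a) - 1.0)
            n_sig = np.sum((s1_sig > b1) & (s2_sig > b2))
            results.append((b1, b2, nonclosure, n_sig / np.sqrt(n_a)))
    return results
```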

Figure 13:
Scans of the Pareto front in the L_{\text{nonclosure}} vs. L_{\text{BCE}} plane with the \lambda (left) and MDMM (right) methods. All trainings are conducted with PYTORCH. Each line shows the training of a model with different values of the \lambda or \epsilon parameters, respectively.

Figure 14:
The significance and nonclosure values from scanning the ABCD boundaries, with training using the \lambda method (top) and MDMM (bottom). Multiple trainings with different initial weights and biases are shown for each of the two training approaches. Each point represents a given training and a given choice of boundaries; the color of the point represents the nonclosure mean (left) and variance (right) from all boundary choices for the training used for that point.

Figure 15:
Visual representations of the three VRs (VR I, VR II, and VR III). The solid blue lines indicate the bin boundaries from the full ABCD plane. The dashed lines represent the starting points of the ABCD subregion edges that are moved during the validation procedure, as described in the text. These lines divide the VRs into regions dA, dB, dC, and dD; the latter three are used to predict the number of events in dA, following the ABCD prescription.

Figure 16:
The nonclosure in simulation and observed data when iteratively increasing the sub-ABCD region boundaries as described in the text, for VR I, VR II, and VR III. The scans of VR I and VR II start at the outer boundary value of 0.4, while the scan for VR III starts at the boundary value of 0.6. Observed data are shown only for boundary values where the signal contamination is below 5% in the analogue of the A region.
Summary
The ABCD background estimation technique uses three control regions, defined by boundaries in the space of two statistically independent variables, to predict the background in the signal region. There has been ongoing interest in automating this technique with machine learning, such as the ABCDisCo method [3], which uses the distance correlation (DisCo) metric to train a neural network to produce two independent discriminants.

In this note, we introduce the ABCDisCo training enhanced with closure (ABCDisCoTEC) method, which enhances this automation by adding a novel differentiable loss term. This term quantifies the deviation in the background prediction (``nonclosure''), more directly addressing the primary objective of the ABCD technique. The result, from a case study using a high-energy physics data set with a simulated stealth supersymmetry signal, is a pair of decorrelated discriminants with strong signal-background separation, reducing the systematic uncertainty from deviations in the background estimation and improving the significance for a potential observation of physics beyond the standard model. Strategies for validating the method in proton-proton collision data and deriving applicable systematic uncertainties were presented. These achievements represent promising steps towards a generalized strategy to automate background estimation in high-energy physics.

Additionally, an alternative approach to train the ABCDisCoTEC neural network was evaluated, using the modified differential method of multipliers (MDMM) [8]. The MDMM approach promotes the relative weights of the additional loss terms to learnable parameters, specifies constraints on each additional loss term directly, and guarantees the convergence of the training. This method provides further benefits when applied to ABCDisCoTEC by more quickly and directly finding optimal regions in hyperparameter space, without requiring exhaustive manual searches.

Several potential modifications or extensions to the loss function could further improve the performance of the ABCDisCoTEC method. For example, incorporating a normalized measure of the signal yield in the control regions as an additional loss term could directly constrain the allowed contamination [3]. This modification could improve the network's sensitivity to the signal while simplifying validation by reducing high signal contamination in certain regions. Another promising extension involves augmenting the nonclosure loss term based on the extended ABCD method [31], which uses additional control regions to mitigate nonclosure effects from minor correlations between the discriminants. By following the approach employed here, the extended nonclosure loss could be made differentiable by relaxing hard boundaries using sigmoid functions.

In conclusion, the combination of the ABCDisCoTEC method and MDMM represents another step forward in the optimization and automation of the traditional ABCD background estimation method. This technique has promising applications to a wide variety of high-energy physics analyses and the potential for further refinement in the future.
References
1 W. Buttinger, Background estimation with the ABCD method featuring the TRooFit toolkit.
2 O. Behnke, K. Kröninger, G. Schott, and T. Schörner-Sadenius, eds., Data analysis in high energy physics: a practical guide to statistical methods, Wiley-VCH, Weinheim, Germany, 2013.
3 G. Kasieczka, B. Nachman, M. D. Schwartz, and D. Shih, Automating the ABCD method with machine learning, PRD 103 (2021) 035021, arXiv:2007.14400.
4 G. J. Székely, M. L. Rizzo, and N. K. Bakirov, Measuring and testing dependence by correlation of distances, Ann. Stat. 35 (2007) 2769, arXiv:0803.4101.
5 CMS Collaboration, The CMS experiment at the CERN LHC, JINST 3 (2008) S08004.
6 L. Evans and P. Bryant, LHC machine, JINST 3 (2008) S08001.
7 CMS Collaboration, Search for top squarks in final states with many light-flavor jets and 0, 1, or 2 charged leptons in proton-proton collisions at \sqrt{s} = 13 TeV, CMS Physics Analysis Summary CMS-PAS-SUS-23-001, 2024.
8 J. Platt and A. Barr, Constrained differential optimization, in Proceedings of the 1st International Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, 1987.
9 K. Pearson, Note on regression and inheritance in the case of two parents, Proceedings of the Royal Society of London, 1895.
10 A. V. Lotov and K. Miettinen, Visualizing the Pareto frontier, Springer, Berlin, Heidelberg, 2008.
11 J. Fan, M. Reece, and J. T. Ruderman, Stealth supersymmetry, JHEP 11 (2011) 012, arXiv:1105.5135.
12 J. Fan, M. Reece, and J. T. Ruderman, A stealth supersymmetry sampler, JHEP 07 (2012) 196, arXiv:1201.4875.
13 J. Fan et al., Stealth supersymmetry simplified, JHEP 07 (2016) 016, arXiv:1512.05781.
14 CMS Collaboration, Search for top squarks in final states with two top quarks and several light-flavor jets in proton-proton collisions at \sqrt{s} = 13 TeV, PRD 104 (2021) 032006, CMS-SUS-19-004, arXiv:2102.06976.
15 F. Chollet et al., Keras.
16 M. Abadi et al., TensorFlow: a system for large-scale machine learning, in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16), USENIX Association, Savannah, GA, USA, 2016, arXiv:1605.08695.
17 A. Paszke et al., PyTorch: an imperative style, high-performance deep learning library, in Proceedings of the 33rd International Conference on Neural Information Processing Systems, Curran Associates Inc., 2019.
18 D. P. Kingma and J. Ba, Adam: a method for stochastic optimization, arXiv:1412.6980.
19 P. Nason, A new method for combining NLO QCD with shower Monte Carlo algorithms, JHEP 11 (2004) 040, arXiv:hep-ph/0409146.
20 S. Frixione, P. Nason, and C. Oleari, Matching NLO QCD computations with parton shower simulations: the POWHEG method, JHEP 11 (2007) 070, arXiv:0709.2092.
21 S. Alioli, P. Nason, C. Oleari, and E. Re, A general framework for implementing NLO calculations in shower Monte Carlo programs: the POWHEG BOX, JHEP 06 (2010) 043, arXiv:1002.2581.
22 S. Frixione, P. Nason, and G. Ridolfi, A positive-weight next-to-leading-order Monte Carlo for heavy flavour hadroproduction, JHEP 09 (2007) 126, arXiv:0707.3088.
23 M. Czakon and A. Mitov, Top++: a program for the calculation of the top-pair cross-section at hadron colliders, Comput. Phys. Commun. 185 (2014) 2930, arXiv:1112.5675.
24 J. Alwall et al., The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations, JHEP 07 (2014) 079, arXiv:1405.0301.
25 C. Borschensky et al., Squark and gluino production cross sections in pp collisions at \sqrt{s} = 13, 14, 33 and 100 TeV, EPJC 74 (2014) 3174, arXiv:1407.5066.
26 W. Beenakker et al., NNLL-fast: predictions for coloured supersymmetric particle production at the LHC with threshold and Coulomb resummation, JHEP 12 (2016) 133, arXiv:1607.07741.
27 T. Sjöstrand et al., An introduction to PYTHIA 8.2, Comput. Phys. Commun. 191 (2015) 159, arXiv:1410.3012.
28 NNPDF Collaboration, Parton distributions from high-precision collider data, EPJC 77 (2017) 663, arXiv:1706.00428.
29 CMS Collaboration, Extraction and validation of a new set of CMS PYTHIA8 tunes from underlying-event measurements, EPJC 80 (2020) 4, CMS-GEN-17-001, arXiv:1903.12179.
30 GEANT4 Collaboration, GEANT4---a simulation toolkit, NIM A 506 (2003) 250.
31 S. Choi and H. Oh, Improved extrapolation methods of data-driven background estimations in high energy physics, EPJC 81 (2021) 643, arXiv:1906.10831.