Compact Muon Solenoid
LHC, CERN

CMS-PAS-BTV-16-001
Identification of c-quark jets at the CMS experiment
Abstract: Accurate identification of jets originating from b quarks is of primary importance in many measurements and searches at the LHC. A charm tagger, identifying jets initiated by charm quarks, would be of similar importance. This note presents a technique in which an MVA-based discriminator is used to select charm jets, together with its expected performance in simulation and its calibration in W+c and top quark pair events. The datasets used for the calibration of the algorithm are from proton-proton collisions at 13 TeV, recorded by the CMS experiment during the first year of the LHC Run II.
Figures

Figure 1-a:
Agreement between data and simulated samples for some of the input variables: 3D flight distance significance of the secondary vertex (a), the number of secondary vertices associated with the jet (b), the secondary vertex mass (c), and a variable called "massVertexEnergyFraction", defined as the product of the vertex mass and the ratio of the vertex energy to the jet energy (d).

Figure 1-b:
Agreement between data and simulated samples for some of the input variables: 3D flight distance significance of the secondary vertex (a), the number of secondary vertices associated with the jet (b), the secondary vertex mass (c), and a variable called "massVertexEnergyFraction", defined as the product of the vertex mass and the ratio of the vertex energy to the jet energy (d).

Figure 1-c:
Agreement between data and simulated samples for some of the input variables: 3D flight distance significance of the secondary vertex (a), the number of secondary vertices associated with the jet (b), the secondary vertex mass (c), and a variable called "massVertexEnergyFraction", defined as the product of the vertex mass and the ratio of the vertex energy to the jet energy (d).

Figure 1-d:
Agreement between data and simulated samples for some of the input variables: 3D flight distance significance of the secondary vertex (a), the number of secondary vertices associated with the jet (b), the secondary vertex mass (c), and a variable called "massVertexEnergyFraction", defined as the product of the vertex mass and the ratio of the vertex energy to the jet energy (d).

Figure 2:
Performance of the CvsL tagger compared between the default BDT options (in red) and the optimized BDT options (in green). The dashed grey line (the diagonal in this plane) represents a tagger with a random choice.
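
As an illustration of how a performance curve of this kind is built, the sketch below scans a threshold over a discriminator and records the signal efficiency against the background mistag rate. The score lists and thresholds are hypothetical values chosen for the example, not taken from the note; a tagger with no separating power would give points on the diagonal.

```python
def roc_points(signal_scores, background_scores, thresholds):
    """Return (background mistag rate, signal efficiency) pairs, one per threshold."""
    points = []
    for cut in thresholds:
        # Fraction of jets of each class with a discriminator value above the cut
        sig_eff = sum(s > cut for s in signal_scores) / len(signal_scores)
        bkg_eff = sum(b > cut for b in background_scores) / len(background_scores)
        points.append((bkg_eff, sig_eff))
    return points


# Hypothetical discriminator outputs (assumed to lie in [-1, 1])
charm_scores = [0.8, 0.6, 0.4, 0.1, -0.2]
light_scores = [0.5, 0.0, -0.3, -0.6, -0.9]

curve = roc_points(charm_scores, light_scores, thresholds=[-1.0, 0.0, 0.45, 1.0])
for mistag, eff in curve:
    print(f"light mistag rate = {mistag:.2f}, charm efficiency = {eff:.2f}")
```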

Figure 3-a:
a: Overlay of the BDT discriminator output for the different flavours for the CvsL (top) and for the CvsB (bottom) discriminators (normalized for each flavour). b: ROC curves showing the final performance of the CvsL (blue full line and axis) and CvsB (red full line and axis) trainings, validated on the ${\mathrm {t}\overline {\mathrm {t}}}$ sample. For comparison the performance of the CSVv2 and cMVAv2 [1] algorithms (used for b tagging) is also shown by the dotted lines.

Figure 3-b:
a: Overlay of the BDT discriminator output for the different flavours for the CvsL (top) and for the CvsB (bottom) discriminators (normalized for each flavour). b: ROC curves showing the final performance of the CvsL (blue full line and axis) and CvsB (red full line and axis) trainings, validated on the ${\mathrm {t}\overline {\mathrm {t}}}$ sample. For comparison the performance of the CSVv2 and cMVAv2 [1] algorithms (used for b tagging) is also shown by the dotted lines.

Figure 4-a:
Agreement between data and simulated samples for the CvsL and CvsB discriminator distributions.

Figure 4-b:
Agreement between data and simulated samples for the CvsL and CvsB discriminator distributions.

Figure 5-a:
a: Two-dimensional scatter overlay of the BDT discriminators for b (red), c (green), and light jets (blue). The CvsL discriminator is shown on the x-axis and the CvsB discriminator on the y-axis. b: Relation between the bottom and light mistag efficiencies for different fixed values of the charm efficiency.

Figure 5-b:
a: Two-dimensional scatter overlay of the BDT discriminators for b (red), c (green), and light jets (blue). The CvsL discriminator is shown on the x-axis and the CvsB discriminator on the y-axis. b: Relation between the bottom and light mistag efficiencies for different fixed values of the charm efficiency.

Figure 6-a:
Data (dots) to Monte Carlo (stacked solid histogram, divided by jet flavour) comparison of the discriminator distributions for negative (negative side of the distribution) and positive taggers (positive side) for the CvsL and CvsB discriminators (a and b plot, respectively). A jet-trigger ${p_{\mathrm {T}}}$ threshold of 40 GeV is required for both data and simulation. The simulation is normalised to the number of entries in the data.

Figure 6-b:
Data (dots) to Monte Carlo (stacked solid histogram, divided by jet flavour) comparison of the discriminator distributions for negative (negative side of the distribution) and positive taggers (positive side) for the CvsL and CvsB discriminators (a and b plot, respectively). A jet-trigger ${p_{\mathrm {T}}}$ threshold of 40 GeV is required for both data and simulation. The simulation is normalised to the number of entries in the data.

Figure 7-a:
Misidentification probability in data and simulation (top), breakdown of the systematic uncertainty contributions (middle), and data-to-simulation scale factor of the light misidentification probability (bottom) for the loose (a), medium (b) and tight (c) working points. The solid curve in the bottom plots represents a fit to the observed data, while the dashed curves show the combined statistical and systematic uncertainties on the measurements.

Figure 7-b:
Misidentification probability in data and simulation (top), breakdown of the systematic uncertainty contributions (middle), and data-to-simulation scale factor of the light misidentification probability (bottom) for the loose (a), medium (b) and tight (c) working points. The solid curve in the bottom plots represents a fit to the observed data, while the dashed curves show the combined statistical and systematic uncertainties on the measurements.

Figure 7-c:
Misidentification probability in data and simulation (top), breakdown of the systematic uncertainty contributions (middle), and data-to-simulation scale factor of the light misidentification probability (bottom) for the loose (a), medium (b) and tight (c) working points. The solid curve in the bottom plots represents a fit to the observed data, while the dashed curves show the combined statistical and systematic uncertainties on the measurements.

Figure 8:
Left and middle: leading-order production of the W+c signal, in which the W boson and the charm quark have opposite-sign charges (OS). Right: production of the W+charm final state through the gluon splitting process. In gluon splitting there is an additional charm quark with the same sign as the W boson (SS).

Figure 9-a:
Distribution of the c tagger discriminators (CvsB and CvsL) after applying the W+charm selection and OS-SS subtraction. The plots correspond to the sum of the electron and muon decay channels of the W.

Figure 9-b:
Distribution of the c tagger discriminators (CvsB and CvsL) after applying the W+charm selection and OS-SS subtraction. The plots correspond to the sum of the electron and muon decay channels of the W.
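
The OS-SS subtraction can be sketched as a bin-by-bin difference of two histograms. The version below assumes unweighted event counts with independent Poisson uncertainties, and the bin contents are hypothetical; it illustrates the idea rather than the implementation used in the note. Processes that populate the OS and SS samples equally cancel in the difference.

```python
import math


def os_minus_ss(os_counts, ss_counts):
    """Bin-by-bin OS-SS subtraction.

    Returns a list of (value, uncertainty) tuples, one per bin, assuming
    independent Poisson-distributed counts in the two samples.
    """
    if len(os_counts) != len(ss_counts):
        raise ValueError("OS and SS histograms must have the same binning")
    result = []
    for n_os, n_ss in zip(os_counts, ss_counts):
        value = n_os - n_ss
        # Variances of independent counts add, also for a difference
        uncertainty = math.sqrt(n_os + n_ss)
        result.append((value, uncertainty))
    return result


# Hypothetical discriminator histograms with three bins
os_hist = [120, 340, 90]
ss_hist = [100, 140, 65]

subtracted = os_minus_ss(os_hist, ss_hist)
for value, unc in subtracted:
    print(f"{value} +/- {unc:.1f}")
```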

Figure 10-a:
c tagging efficiencies ($\epsilon_{\mathrm{c}}$) and scale factors ($SF_{\mathrm{c}}$) as a function of $p_{\mathrm{T}}^{\text{jet}}$ for the loose, medium, and tight c tagger working points (from a to c, respectively) using W+c events.

Figure 10-b:
c tagging efficiencies ($\epsilon_{\mathrm{c}}$) and scale factors ($SF_{\mathrm{c}}$) as a function of $p_{\mathrm{T}}^{\text{jet}}$ for the loose, medium, and tight c tagger working points (from a to c, respectively) using W+c events.

Figure 10-c:
c tagging efficiencies ($\epsilon_{\mathrm{c}}$) and scale factors ($SF_{\mathrm{c}}$) as a function of $p_{\mathrm{T}}^{\text{jet}}$ for the loose, medium, and tight c tagger working points (from a to c, respectively) using W+c events.
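
A scale factor of this kind is the ratio of the tagging efficiency in data to that in simulation. The sketch below computes both for a single bin from hypothetical jet counts, with a simple binomial uncertainty on each efficiency; it deliberately ignores the OS-SS subtraction and all systematic uncertainties, so it is only a simplified illustration of the definition.

```python
import math


def efficiency(n_tagged, n_total):
    """Tagging efficiency with a simple binomial uncertainty."""
    eff = n_tagged / n_total
    err = math.sqrt(eff * (1.0 - eff) / n_total)
    return eff, err


def scale_factor(eff_data, err_data, eff_mc, err_mc):
    """Data-to-simulation scale factor with uncorrelated error propagation."""
    sf = eff_data / eff_mc
    rel_err = math.sqrt((err_data / eff_data) ** 2 + (err_mc / eff_mc) ** 2)
    return sf, sf * rel_err


# Hypothetical counts of c jets in one jet-pT bin
eff_data, err_data = efficiency(n_tagged=360, n_total=1000)
eff_mc, err_mc = efficiency(n_tagged=4000, n_total=10000)

sf_c, sf_c_err = scale_factor(eff_data, err_data, eff_mc, err_mc)
print(f"SF_c = {sf_c:.3f} +/- {sf_c_err:.3f}")
```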

Figure 11-a:
Data and MC distributions of the invariant mass of the hadronically decaying W (a), of the top quark (b), and of the combined discriminant ${\lambda _{M}}$ (c) for every permutation in pre-selected events (i.e. those passing the trigger and object selection). The lower panels show the distribution of the MC-to-data ratio. These distributions were obtained by processing 2.3 fb$^{-1}$ of 2015 data.

Figure 11-b:
Data and MC distributions of the invariant mass of the hadronically decaying W (a), of the top quark (b), and of the combined discriminant ${\lambda _{M}}$ (c) for every permutation in pre-selected events (i.e. those passing the trigger and object selection). The lower panels show the distribution of the MC-to-data ratio. These distributions were obtained by processing 2.3 fb$^{-1}$ of 2015 data.

Figure 11-c:
Data and MC distributions of the invariant mass of the hadronically decaying W (a), of the top quark (b), and of the combined discriminant ${\lambda _{M}}$ (c) for every permutation in pre-selected events (i.e. those passing the trigger and object selection). The lower panels show the distribution of the MC-to-data ratio. These distributions were obtained by processing 2.3 fb$^{-1}$ of 2015 data.

Figure 12:
Data and MC distributions of ${\lambda _{M}}$ after the full selection, including the selection of the best permutation and the b tagging of two of the jets in the permutation. The different simulated processes contributing are shown in different colours. The major contributions are semi-leptonic top quark pair decays with the hadronic W properly matched to the generator particles (violet), wrongly matched semi-leptonic top quark decays (red), and non-semi-leptonic top quark decays (azure). Minor contributions are also present from single top events (green), vector-boson plus jets (labeled as "V+jets", in yellow), and multi-jet production ("QCD", in blue).

Figure 13-a:
(upper panels) Data-to-simulation scale factor of the charm tagging efficiency for the c-tagging WP (loose on a, medium on b, tight on c) as measured with the two methods, with (thick error bar) statistical error and (narrow error bar) combined statistical and systematic uncertainties. The combined SF value with its overall uncertainty is displayed as a hatched area. (lower panels) Same combined SF value with the result of a linear fit function superimposed (solid curve). The combined statistical and systematic uncertainty is centred around the fit result (points with error bars). The last bin includes the overflow.

Figure 13-b:
(upper panels) Data-to-simulation scale factor of the charm tagging efficiency for the c-tagging WP (loose on a, medium on b, tight on c) as measured with the two methods, with (thick error bar) statistical error and (narrow error bar) combined statistical and systematic uncertainties. The combined SF value with its overall uncertainty is displayed as a hatched area. (lower panels) Same combined SF value with the result of a linear fit function superimposed (solid curve). The combined statistical and systematic uncertainty is centred around the fit result (points with error bars). The last bin includes the overflow.

Figure 13-c:
(upper panels) Data-to-simulation scale factor of the charm tagging efficiency for the c-tagging WP (loose on a, medium on b, tight on c) as measured with the two methods, with (thick error bar) statistical error and (narrow error bar) combined statistical and systematic uncertainties. The combined SF value with its overall uncertainty is displayed as a hatched area. (lower panels) Same combined SF value with the result of a linear fit function superimposed (solid curve). The combined statistical and systematic uncertainty is centred around the fit result (points with error bars). The last bin includes the overflow.
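
A linear fit to scale factors with per-bin uncertainties can be sketched with closed-form weighted least squares, as below. The bin centres, scale-factor values and uncertainties are hypothetical, and the note does not state which fitting procedure was used, so this is one plausible way to obtain such a curve, not a description of the actual fit.

```python
def weighted_linear_fit(x, y, sigma):
    """Fit y = intercept + slope * x, weighting each point by 1 / sigma^2."""
    w = [1.0 / s ** 2 for s in sigma]
    s_w = sum(w)
    s_x = sum(wi * xi for wi, xi in zip(w, x))
    s_y = sum(wi * yi for wi, yi in zip(w, y))
    s_xx = sum(wi * xi * xi for wi, xi in zip(w, x))
    s_xy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = s_w * s_xx - s_x ** 2
    slope = (s_w * s_xy - s_x * s_y) / delta
    intercept = (s_xx * s_y - s_x * s_xy) / delta
    return intercept, slope


# Hypothetical jet-pT bin centres (GeV), scale factors and total uncertainties.
# The values lie exactly on SF = 1.0 - 0.001 * pT to make the result easy to check.
pt_centres = [40.0, 75.0, 150.0]
sf_values = [0.96, 0.925, 0.85]
sf_errors = [0.05, 0.04, 0.08]

intercept, slope = weighted_linear_fit(pt_centres, sf_values, sf_errors)
print(f"SF(pT) = {intercept:.3f} + ({slope:.5f}) * pT")
```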
Tables

Table 1:
Explanation of the BDT options that are explicitly set during the c tagger training, and the values used. To ease reproducibility, the setting names used by TMVA are listed under the BDT Option column.

Table 2:
Definitions of the three working points with the corresponding cuts on the discriminator values and the global efficiencies, obtained from simulated ${\mathrm {t}\overline {\mathrm {t}}}$ samples, for each flavour.
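
To make the idea of a working point concrete, the sketch below treats it as a pair of lower thresholds, one on each discriminator, and computes the resulting efficiency per jet flavour. The threshold values, the jet list, and the assumption that a jet must exceed both cuts are illustrative only; the actual cut values and efficiencies are those given in the table.

```python
from collections import namedtuple

# A working point modelled as a pair of lower cuts (assumed convention)
WorkingPoint = namedtuple("WorkingPoint", ["cvsl_min", "cvsb_min"])

# Hypothetical thresholds, not the values used in the note
EXAMPLE_WP = WorkingPoint(cvsl_min=-0.1, cvsb_min=0.1)


def is_c_tagged(cvsl, cvsb, wp):
    """A jet is c tagged if it passes the cuts on both discriminators."""
    return cvsl > wp.cvsl_min and cvsb > wp.cvsb_min


def efficiency_by_flavour(jets, wp):
    """jets: iterable of (flavour, cvsl, cvsb). Returns {flavour: efficiency}."""
    passed, total = {}, {}
    for flavour, cvsl, cvsb in jets:
        total[flavour] = total.get(flavour, 0) + 1
        if is_c_tagged(cvsl, cvsb, wp):
            passed[flavour] = passed.get(flavour, 0) + 1
    return {f: passed.get(f, 0) / n for f, n in total.items()}


# Hypothetical simulated jets: (true flavour, CvsL output, CvsB output)
jets = [
    ("c", 0.5, 0.4), ("c", 0.2, -0.3),
    ("b", 0.6, -0.5), ("b", 0.3, 0.2),
    ("light", -0.6, 0.5), ("light", -0.4, 0.3),
]

effs = efficiency_by_flavour(jets, EXAMPLE_WP)
print(effs)
```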

Table 3:
Efficiencies and $SF_{\mathrm{c}}$ for the c-tagger algorithm using W+c events. $SF_{\mathrm{c}}$ and the sum in quadrature of the statistical and systematic uncertainties are shown in the third column.

Table 4:
Measured scale factors for the efficiency of tagging charm jets in semi-leptonic $\mathrm{t\bar{t}}$ events. The quoted uncertainty includes both statistical and systematic uncertainty. The last column shows the statistical-only uncertainty.

Table 5:
Combined $SF_{\mathrm{c}}$ values
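
The reference list includes the paper by Lyons, Gibaut and Clifford on combining correlated estimates, which describes the best linear unbiased estimate (BLUE). Assuming a combination of that kind, the two-measurement case can be sketched as follows. The input values and the correlation coefficient are hypothetical, and the note's actual treatment of correlations is not reproduced here.

```python
import math


def blue_combine(x1, s1, x2, s2, rho=0.0):
    """Combine two estimates of one quantity with the BLUE method.

    x1, x2: measured values; s1, s2: their total uncertainties;
    rho: correlation coefficient between the two measurements.
    Returns (combined value, combined uncertainty, (w1, w2)).
    """
    cov = rho * s1 * s2
    denom = s1 ** 2 + s2 ** 2 - 2.0 * cov
    if denom <= 0.0:
        raise ValueError("degenerate covariance matrix")
    # Weights minimise the variance under the constraint w1 + w2 = 1
    w1 = (s2 ** 2 - cov) / denom
    w2 = 1.0 - w1
    value = w1 * x1 + w2 * x2
    variance = (s1 ** 2 * s2 ** 2 - cov ** 2) / denom
    return value, math.sqrt(variance), (w1, w2)


# Hypothetical scale factors from two calibration methods, taken as uncorrelated
sf, sf_err, weights = blue_combine(1.00, 0.10, 0.90, 0.05, rho=0.0)
print(f"combined SF = {sf:.3f} +/- {sf_err:.3f}, weights = {weights}")
```

With uncorrelated inputs the weights reduce to the familiar inverse-variance weights, so the more precise measurement dominates the combination.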
Summary
A new tool for identifying charm jets is presented. The tagger has been developed using two boosted decision trees, trained and tested on simulated datasets. The light-to-charm mistagging rate has been measured in multi-jet events. The performance of the tagger has been measured in W+c and semi-leptonic top quark pair events, from which scale factors for the charm identification efficiency are derived. The precision achieved with the first method ranges between 6% and 21%, while the second provides a more precise evaluation of the scale factor, of the order of 5%, but without jet $p_{\mathrm{T}}$ dependence. The W+c jet population is strongly skewed towards low-$p_{\mathrm{T}}$ jets. The datasets for the efficiency calibrations have been extracted from proton-proton collisions collected by the CMS experiment in 2015.
References
1 CMS Collaboration Identification of b quark jets at the CMS Experiment in the LHC Run 2 Technical Report CMS-PAS-BTV-15-001, CERN, Geneva
2 L. Breiman, J. Friedman, R. Olshen, and C. Stone Classification and Regression Trees Wadsworth and Brooks, Monterey, CA
3 CMS Collaboration The CMS experiment at the CERN LHC JINST 3 (2008) S08004 CMS-00-001
4 P. Nason A New method for combining NLO QCD with shower Monte Carlo algorithms JHEP 11 (2004) 040 hep-ph/0409146
5 S. Frixione, P. Nason, and C. Oleari Matching NLO QCD computations with Parton Shower simulations: the POWHEG method JHEP 11 (2007) 070 0709.2092
6 S. Alioli, P. Nason, C. Oleari, and E. Re A general framework for implementing NLO calculations in shower Monte Carlo programs: the POWHEG BOX JHEP 06 (2010) 043 1002.2581
7 J. M. Campbell, R. K. Ellis, P. Nason, and E. Re Top-pair production and decay at NLO matched with parton showers JHEP 04 (2015) 114 1412.1828
8 E. Re Single-top Wt-channel production matched with parton showers using the POWHEG method EPJC71 (2011) 1547 1009.2450
9 J. Alwall et al. The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations JHEP 07 (2014) 079 1405.0301
10 T. Sjostrand, S. Mrenna, and P. Z. Skands A Brief Introduction to PYTHIA 8.1 CPC 178 (2008) 852--867 0710.3820
11 CMS Collaboration Event generator tunes obtained from underlying event and multiparton scattering measurements CMS-GEN-14-001 1512.00815
12 CMS Collaboration Particle-Flow Event Reconstruction in CMS and Performance for Jets, Taus, and MET Technical Report CMS-PAS-PFT-09-001, CERN, Geneva, Apr 2009
13 CMS Collaboration Commissioning of the Particle-Flow reconstruction in Minimum-Bias and Jet Events from pp Collisions at 7 TeV Technical Report CMS-PAS-PFT-10-002, CERN, Geneva
14 CMS Collaboration, F. Beaudette The CMS Particle Flow Algorithm in Proceedings, International Conference on Calorimetry for the High Energy Frontier (CHEF 2013), pp. 295--304 2013 1401.8155
15 M. Cacciari, G. P. Salam, and G. Soyez The anti-$ k_t $ jet clustering algorithm JHEP 04 (2008) 063 0802.1189
16 The CMS collaboration Determination of jet energy calibration and transverse momentum resolution in CMS Journal of Instrumentation 6 (2011), no. 11, P11002
17 CMS Collaboration Measurement of $ B\bar{B} $ Angular Correlations based on Secondary Vertex Reconstruction at $ \sqrt{s}=7 $ TeV JHEP 03 (2011) 136 CMS-BPH-10-010 1102.3194
18 R. Fruhwirth, W. Waltenberger, and P. Vanlaer Adaptive Vertex Fitting CMS-NOTE-2007-008
19 CMS Collaboration Performance of b tagging at $ \sqrt{s}=8 $ TeV in multijet, ttbar and boosted topology events CMS-PAS-BTV-13-001
20 A. Hoecker et al. TMVA: Toolkit for Multivariate Data Analysis PoS ACAT (2007) 040 physics/0703039
21 R. Brun and F. Rademakers ROOT: An object oriented data analysis framework NIMA389 (1997) 81--86
22 CMS Collaboration Measurement of associated W + charm production in pp collisions at $ \sqrt{s} $ = 7 TeV JHEP 02 (2014) 013 CMS-SMP-12-002 1310.1138
23 CMS Collaboration Measurement of the inclusive and differential tt production cross sections in lepton + jets final states at 13 TeV Technical Report CMS-PAS-TOP-16-008, CERN, Geneva
24 L. Lyons, D. Gibaut, and P. Clifford How to combine correlated estimates of a single physical quantity NIMA 270 (1988) 110