CMS-PAS-BTV-16-001

Identification of c-quark jets at the CMS experiment
CMS Collaboration
August 2016

Abstract: An accurate identification of jets originating from b quarks is of primary importance in many measurements and searches at the LHC. A charm tagger, identifying jets initiated by charm quarks, would be of similar importance. This note presents a technique that uses an MVA-based discriminator to select charm jets, together with its expected performance in simulation and its calibration in W+c and top quark pair events. The datasets used for the calibration of the algorithm are from proton-proton collisions at 13 TeV, recorded by the CMS experiment during the first year of the LHC Run II.


Figures
Figure 1: Agreement between data and simulated samples for some of the input variables: 3D flight distance significance of the secondary vertex (a), the number of secondary vertices associated to the jet (b), the secondary vertex mass (c), and a variable called "massVertexEnergyFraction", defined as the product of the vertex mass and the fraction of the vertex energy with respect to the jet energy (d).
Figure 2: Performance of the CvsL tagger compared between the default BDT options (in red) and the optimized BDT options (in green). The dashed grey line (the diagonal in this plane) represents a tagger that chooses at random.
Figure 3: (a) Overlay of the BDT discriminator output for the different flavours for the CvsL (top) and for the CvsB (bottom) discriminators (normalized for each flavour). (b) ROC curves showing the final performance of the CvsL (blue full line and axis) and CvsB (red full line and axis) trainings, validated on the $\mathrm{t\bar{t}}$ sample. For comparison, the performance of the CSVv2 and cMVAv2 [1] algorithms (used for b tagging) is also shown by the dotted lines.
Figure 4: Agreement between data and simulated samples for the CvsL and CvsB discriminator distributions.
Figure 5: (a) Two-dimensional scatter overlay of the BDT discriminators for b (red), c (green), and light jets (blue). The CvsL discriminator is shown on the x-axis and the CvsB discriminator is shown on the y-axis. (b) Relation between bottom and light mistag efficiency for different values of a constant charm efficiency.
Figure 6: Comparison of data (dots) and Monte Carlo simulation (stacked solid histogram, split by jet flavour) for the discriminator distributions of the negative taggers (negative side of the distribution) and positive taggers (positive side), for the CvsL (a) and CvsB (b) discriminators. A jet-trigger $p_{\mathrm{T}}$ threshold of 40 GeV is required for both data and simulation. The simulation is normalised to the number of entries in the data.
Figure 7: Misidentification probability in data and simulation (top), breakdown of the contributions to the systematic uncertainty (middle), and data-to-simulation scale factor of the light misidentification probability (bottom) for the loose (a), medium (b) and tight (c) working points. The solid curve in the bottom plots represents a fit to the observed data, while the dashed curves show the combined statistical and systematic uncertainties on the measurements.
Figure 8: Left and middle: leading-order production of the W+c signal with opposite-sign charges (OS). Right: production of the W+charm final state through the gluon splitting process. In gluon splitting there is an additional charm quark with the same sign as the W boson (SS).
Figure 9: Distribution of the c tagger discriminators (CvsB and CvsL) after applying the W+charm selection and OS-SS subtraction. The plots correspond to the sum of the electron and muon decays of the W.
Figure 10: c tagging efficiencies ($\epsilon_{\mathrm{c}}$) and scale factors ($SF_{\mathrm{c}}$) as a function of $p_{\mathrm{T}}^{\text{jet}}$ for the loose, medium, and tight c tagger working points (from a to c, respectively) using W+c events.
Figure 11: Data and MC distributions of the invariant mass of the hadronically decaying W (a), of the top quark (b), and of the combined discriminant $\lambda_{M}$ (c) for every permutation in pre-selected events (i.e. those passing the trigger and object selection). The lower panels show the ratio of MC to data. These distributions were obtained by processing 2.3 fb$^{-1}$ of 2015 data.
Figure 12: Data and MC distributions of $\lambda_{M}$ after the full selection, including the selection of the best permutation and the b tagging identification of two of the jets of the permutation. The different simulated processes contributing are shown with different colours. The major contributions are semi-leptonic top quark pair decays with the hadronic W properly matched to the generator particles (violet), wrongly matched semi-leptonic top quark decays (red), and non-semi-leptonic top quark decays (azure). Minor contributions are also present due to single top events (green), vector-boson plus jets (labelled "V+jets", in yellow), and multi-jet production ("QCD", in blue).
Figure 13: (upper panels) Data-to-simulation scale factor of the charm tagging efficiency for the c-tagging WP (loose in a, medium in b, tight in c) as measured with the two methods, with (thick error bar) statistical error and (narrow error bar) combined statistical and systematic uncertainties. The combined SF value with its overall uncertainty is displayed as a hatched area. (lower panels) Same combined SF value with the result of a linear fit function superimposed (solid curve). The combined statistical and systematic uncertainty is centred around the fit result (points with error bars). The last bin includes the overflow.
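
The OS-SS subtraction mentioned in the captions of Figures 8 and 9 can be illustrated with a minimal sketch. It assumes that backgrounds such as gluon splitting populate the opposite-sign and same-sign samples equally, so that subtracting the SS histogram from the OS histogram bin by bin leaves the W+c signal. The function names, the binning, and the toy values are hypothetical and are not taken from the note.

```python
import math

def histogram(values, edges):
    """Count values in half-open bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def os_minus_ss(values_os, values_ss, edges):
    """Bin-by-bin OS-SS yields and their Poisson uncertainties.

    Assumes unweighted, statistically independent OS and SS samples,
    so the variance of each difference is N_OS + N_SS.
    """
    n_os = histogram(values_os, edges)
    n_ss = histogram(values_ss, edges)
    diff = [a - b for a, b in zip(n_os, n_ss)]
    err = [math.sqrt(a + b) for a, b in zip(n_os, n_ss)]
    return diff, err

# Toy discriminator values (hypothetical, for illustration only)
toy_os = [0.1, 0.2, 0.6, 0.7, 0.8, 0.9]
toy_ss = [0.15, 0.65]
diff, err = os_minus_ss(toy_os, toy_ss, edges=[0.0, 0.5, 1.0])
```

With real data the inputs would typically be weighted histograms, in which case the uncertainty would be built from the sums of squared weights instead of the raw counts.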

Tables
Table 1: Explanation of the BDT options that are explicitly set during the c tagger training, and the values used. To ease reproducibility, the setting names used by TMVA are listed under the BDT Option column.
Table 2: Definitions of the three working points with the corresponding cuts on the discriminator values and the global efficiencies, obtained from simulated $\mathrm{t\bar{t}}$ samples, for each flavour.
Table 3: Efficiencies and $SF_{\mathrm{c}}$ for the c-tagger algorithm using W+c events. $SF_{\mathrm{c}}$ and the sum in quadrature of statistical and systematic uncertainties are shown in the third column.
Table 4: Measured scale factors for the efficiency of tagging charm jets in semi-leptonic $\mathrm{t\bar{t}}$ events. The quoted uncertainty includes both statistical and systematic uncertainty. The last column shows the statistical-only uncertainty.
Table 5: Combined $SF_{\mathrm{c}}$ values.
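
As an illustration of how working points, per-flavour efficiencies and scale factors relate to each other, the sketch below assumes that a jet is c tagged when both the CvsL and the CvsB discriminators exceed their thresholds. That rule is an assumption made for this example, and the threshold values and toy jets are hypothetical placeholders; the actual cuts are those listed in Table 2.

```python
from collections import defaultdict

# Hypothetical thresholds for illustration only; the real cuts are in Table 2.
CUT_CVSL = 0.0
CUT_CVSB = 0.0

def is_c_tagged(cvsl, cvsb, cut_cvsl=CUT_CVSL, cut_cvsb=CUT_CVSB):
    """Assumed tagging rule: both discriminators must exceed their cuts."""
    return cvsl > cut_cvsl and cvsb > cut_cvsb

def efficiency_by_flavour(jets, cut_cvsl=CUT_CVSL, cut_cvsb=CUT_CVSB):
    """Fraction of tagged jets per true flavour.

    `jets` is an iterable of (flavour, cvsl, cvsb) tuples.
    """
    total = defaultdict(int)
    passed = defaultdict(int)
    for flavour, cvsl, cvsb in jets:
        total[flavour] += 1
        if is_c_tagged(cvsl, cvsb, cut_cvsl, cut_cvsb):
            passed[flavour] += 1
    return {flavour: passed[flavour] / total[flavour] for flavour in total}

def scale_factor(eff_data, eff_mc):
    """Data-to-simulation scale factor: ratio of the two efficiencies."""
    return eff_data / eff_mc

# Toy simulated jets: (true flavour, CvsL, CvsB), hypothetical values
toy_jets = [
    ("c", 0.5, 0.3), ("c", -0.8, 0.4),
    ("b", 0.6, -0.7),
    ("light", -0.9, 0.2), ("light", 0.1, 0.1),
]
effs = efficiency_by_flavour(toy_jets)
```

Tightening either threshold lowers the charm efficiency together with the b and light mistag rates, which is the trade-off that the loose, medium and tight working points span.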

Summary

A new tool for identifying charm jets is presented. The tagger has been developed using two boosted decision trees, trained and tested on simulated input datasets. The light-to-charm mistagging rate has been measured in multi-jet events. The performance of the tagger has been measured in W+c and semi-leptonic top quark pair events, deriving scale factors for the charm identification. The precision achieved with the first method ranges between 6% and 21%, while the second method provides a more precise evaluation of the scale factor, of the order of 5%, but without jet $p_{\mathrm{T}}$ dependence. The W+c jet population is strongly skewed towards low-$p_{\mathrm{T}}$ jets. The datasets for the efficiency calibrations have been extracted from proton-proton collisions collected by the CMS experiment in 2015.
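
A combination of two scale-factor measurements, such as the ones from the W+c and top quark pair methods, can be sketched as a minimum-variance weighted average of the kind described in reference [24]. The sketch below assumes exactly two measurements whose total uncertainties have a known correlation coefficient `rho`. It is not a description of the procedure actually used in the note, and the input numbers are hypothetical.

```python
import math

def blue_combine(x1, sigma1, x2, sigma2, rho=0.0):
    """Combine two estimates of one quantity with minimum variance.

    rho is the assumed correlation coefficient between the two total
    uncertainties. Returns (combined value, uncertainty, weight of x1).
    """
    denom = sigma1**2 + sigma2**2 - 2.0 * rho * sigma1 * sigma2
    if denom <= 0.0:
        raise ValueError("degenerate inputs: cannot determine weights")
    # Weights minimise the variance of w1*x1 + w2*x2 subject to w1 + w2 = 1
    w1 = (sigma2**2 - rho * sigma1 * sigma2) / denom
    w2 = 1.0 - w1
    combined = w1 * x1 + w2 * x2
    variance = (sigma1**2 * sigma2**2 * (1.0 - rho**2)) / denom
    return combined, math.sqrt(variance), w1

# Hypothetical scale factors, for illustration only
sf, sf_err, w1 = blue_combine(1.00, 0.20, 0.90, 0.10, rho=0.0)
```

For uncorrelated inputs this reduces to the familiar inverse-variance weighted mean, so the more precise measurement dominates the combined value.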

References
1	CMS Collaboration	Identification of b quark jets at the CMS Experiment in the LHC Run 2	Technical Report CMS-PAS-BTV-15-001, CERN, Geneva
2	L. Breiman, J. Friedman, R. Olshen, and C. Stone	Classification and Regression Trees	Wadsworth and Brooks, Monterey, CA
3	CMS Collaboration	The CMS experiment at the CERN LHC	JINST 3 (2008) S08004	CMS-00-001
4	P. Nason	A New method for combining NLO QCD with shower Monte Carlo algorithms	JHEP 11 (2004) 040	hep-ph/0409146
5	S. Frixione, P. Nason, and C. Oleari	Matching NLO QCD computations with Parton Shower simulations: the POWHEG method	JHEP 11 (2007) 070	0709.2092
6	S. Alioli, P. Nason, C. Oleari, and E. Re	A general framework for implementing NLO calculations in shower Monte Carlo programs: the POWHEG BOX	JHEP 06 (2010) 043	1002.2581
7	J. M. Campbell, R. K. Ellis, P. Nason, and E. Re	Top-pair production and decay at NLO matched with parton showers	JHEP 04 (2015) 114	1412.1828
8	E. Re	Single-top Wt-channel production matched with parton showers using the POWHEG method	EPJC 71 (2011) 1547	1009.2450
9	J. Alwall et al.	The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations	JHEP 07 (2014) 079	1405.0301
10	T. Sjostrand, S. Mrenna, and P. Z. Skands	A Brief Introduction to PYTHIA 8.1	CPC 178 (2008) 852--867	0710.3820
11	CMS Collaboration	Event generator tunes obtained from underlying event and multiparton scattering measurements		CMS-GEN-14-001 1512.00815
12	CMS Collaboration	Particle-Flow Event Reconstruction in CMS and Performance for Jets, Taus, and MET	Technical Report CMS-PAS-PFT-09-001, CERN, Geneva, April 2009
13	CMS Collaboration	Commissioning of the Particle-Flow reconstruction in Minimum-Bias and Jet Events from pp Collisions at 7 TeV	Technical Report CMS-PAS-PFT-10-002, CERN, Geneva
14	CMS Collaboration, F. Beaudette	The CMS Particle Flow Algorithm	in Proceedings, International Conference on Calorimetry for the High Energy Frontier (CHEF 2013), pp. 295--304 2013	1401.8155
15	M. Cacciari, G. P. Salam, and G. Soyez	The anti-$ k_t $ jet clustering algorithm	JHEP 04 (2008) 063	0802.1189
16	CMS Collaboration	Determination of jet energy calibration and transverse momentum resolution in CMS	JINST 6 (2011) P11002
17	CMS Collaboration	Measurement of $ B\bar{B} $ Angular Correlations based on Secondary Vertex Reconstruction at $ \sqrt{s}=7 $ TeV	JHEP 03 (2011) 136	CMS-BPH-10-010 1102.3194
18	R. Fruhwirth, W. Waltenberger, and P. Vanlaer	Adaptive Vertex Fitting	CMS-NOTE-2007-008
19	CMS Collaboration	Performance of b tagging at $ \sqrt{s}=8 $ TeV in multijet, $ \mathrm{t\bar{t}} $ and boosted topology events	CMS-PAS-BTV-13-001
20	A. Hoecker et al.	TMVA: Toolkit for Multivariate Data Analysis	PoS ACAT (2007) 040	physics/0703039
21	R. Brun and F. Rademakers	ROOT: An object oriented data analysis framework	NIMA 389 (1997) 81--86
22	CMS Collaboration	Measurement of associated W + charm production in pp collisions at $ \sqrt{s} $ = 7 TeV	JHEP 02 (2014) 013	CMS-SMP-12-002 1310.1138
23	CMS Collaboration	Measurement of the inclusive and differential $ \mathrm{t\bar{t}} $ production cross sections in lepton + jets final states at 13 TeV	Technical Report CMS-PAS-TOP-16-008, CERN, Geneva
24	L. Lyons, D. Gibaut, and P. Clifford	How to combine correlated estimates of a single physical quantity	NIMA 270 (1988) 110