Is tokenization needed for masked particle modeling?

Matthew Leigh, Samuel Klein, François Charton, Tobias Golling, Lukas Heinrich, Michael Kagan, Inês Ochoa, Margarita Osadchy

Research output: Contribution to journal › Article › peer-review

Abstract

In this work, we significantly enhance masked particle modeling (MPM), a self-supervised learning scheme for constructing highly expressive representations of unordered sets, relevant to developing foundation models for high-energy physics. In MPM, a model is trained to recover the missing elements of a set, a learning objective that requires no labels and can be applied directly to experimental data. We achieve significant performance improvements over previous work on MPM by addressing inefficiencies in the implementation and incorporating a more powerful decoder. We compare several pre-training tasks and introduce new reconstruction methods that utilize conditional generative models without data tokenization or discretization. We show that these new methods outperform the tokenized learning objective of the original MPM on a new test bed for foundation models for jets, which includes a wide variety of downstream tasks relevant to jet physics, such as classification, secondary vertex finding, and track identification.
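The masking objective described above can be illustrated with a minimal sketch: hide a random subset of particles in a jet (an unordered set of feature vectors) and score a model's reconstruction on only the hidden slots. This is a simplified regression stand-in, not the paper's actual implementation; all function and variable names here are illustrative assumptions.

```python
import numpy as np

def mask_particles(jet, mask_frac=0.4, rng=None):
    """Randomly hide a fraction of the particles in a jet.

    `jet` is an (n_particles, n_features) array. Returns the masked jet,
    the boolean mask, and the original features of the hidden particles.
    Names are illustrative, not from the paper's code.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(jet)
    n_mask = max(1, int(round(mask_frac * n)))
    idx = rng.choice(n, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    masked = jet.copy()
    masked[mask] = 0.0  # replace hidden particles with a null placeholder
    return masked, mask, jet[mask]

def mpm_loss(pred, target):
    """Mean-squared reconstruction error on the masked slots only --
    a continuous (untokenized) stand-in for the learning objective."""
    return float(np.mean((pred - target) ** 2))

# Toy jet: 6 particles, 3 features each (e.g. pT, eta, phi).
rng = np.random.default_rng(42)
jet = rng.normal(size=(6, 3))
masked, mask, target = mask_particles(jet, mask_frac=0.5, rng=rng)

# A trivial "model" that always predicts zeros for the hidden slots;
# a real model would condition on the visible particles.
pred = np.zeros_like(target)
loss = mpm_loss(pred, target)
```

In the paper's continuous variants, the zero-prediction placeholder above would be replaced by a conditional generative model decoding the hidden particles from the visible ones, avoiding any tokenization of the inputs.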

Original language: English
Article number: 025075
Journal: Machine Learning: Science and Technology
Volume: 6
Issue number: 2
DOIs
State: Published - 30 Jun 2025

Bibliographical note

Publisher Copyright:
© 2025 The Author(s). Published by IOP Publishing Ltd.

Keywords

  • conditional generative models
  • high-energy physics
  • jet
  • jet physics
  • self-supervised learning

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Artificial Intelligence

