Representation learning for detector data
Detectors produce large amounts of feature-dense data. Deep learning excels at extracting features from these data under fully supervised training, i.e., when learning to predict a specific ground truth. But what if generic features could be extracted that are relevant not just to one task but to any number of downstream tasks? Self-supervised learning (SSL) aims to distill a representation of the data that reduces its dimensionality while preserving its most salient features. Unlike supervised models, an SSL model can be trained using data-augmentation strategies in place of traditional truth labels, opening the possibility of training on huge unlabeled datasets. In [1], we applied SSL to hadronic jets in a simulated detector similar to those at the LHC. Our main takeaway is that a generic representation comprising only 8 numbers can provide a strong pre-trained foundation for downstream tasks such as jet classification and anomaly detection. In future work, we plan to extend this approach by exploring different augmentation methods, downstream particle-reconstruction tasks, and the possibility of training on raw detector data rather than simulation.
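
To make the augmentation-based pre-training idea concrete, the following is a minimal sketch of a contrastive SSL setup, assuming PyTorch, a fixed number of jet constituents per jet, and a SimCLR-style loss. The encoder architecture, the toy augmentations, and all hyperparameters here are illustrative placeholders, not the exact configuration used in [1].

```python
# Minimal contrastive self-supervised pre-training sketch for jets (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JetEncoder(nn.Module):
    """Maps a fixed-size set of jet constituents to an 8-dimensional representation."""
    def __init__(self, n_constituents=30, n_features=3, rep_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                   # (B, N, F) -> (B, N*F)
            nn.Linear(n_constituents * n_features, 128),
            nn.ReLU(),
            nn.Linear(128, rep_dim),                        # the 8-number representation
        )

    def forward(self, x):
        return self.net(x)

def augment(jets):
    """Toy augmentation: random azimuthal rotation plus mild smearing.
    Realistic augmentations would be chosen to respect detector symmetries."""
    phi = torch.rand(jets.size(0), 1) * 2 * torch.pi
    rotated = jets.clone()
    # assume constituent features are (pT, eta, phi); rotate phi and wrap into [-pi, pi)
    rotated[..., 2] = torch.remainder(jets[..., 2] + phi + torch.pi, 2 * torch.pi) - torch.pi
    return rotated + 0.01 * torch.randn_like(rotated)

def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style contrastive loss: two views of the same jet attract, others repel."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.T / temperature
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Pre-training loop on unlabeled jets (random stand-in data here).
encoder = JetEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for step in range(100):
    jets = torch.randn(256, 30, 3)                          # placeholder for unlabeled jets
    loss = nt_xent(encoder(augment(jets)), encoder(augment(jets)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After pre-training along these lines, the frozen 8-dimensional outputs of the encoder could be fed to a small classifier or an anomaly-detection method for the downstream tasks mentioned above.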