Information
- Publication Type: Conference Paper
- Date: October 2025
- ISBN: 979-8-4007-2035-2
- Open Access: yes
- Location: Dublin
- Lecturer: Nidham Tekaya
- Event: ACM International Conference on Multimedia 2025
- DOI: 10.1145/3746027.3758163
- Booktitle: MM '25: Proceedings of the 33rd ACM International Conference on Multimedia
- Number of Pages: 10
- Conference date: 27 October 2025 – 31 October 2025
- Pages: 12371 – 12380
- Keywords: Multimodal representations, Vision-language models, Time modeling, Time estimation, Benchmark dataset
Abstract
Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs using a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive or superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
Weblinks
- https://tekayanidham.github.io/timeline-page/
- Entry in reposiTUm (TU Wien Publication Database)
- DOI: 10.1145/3746027.3758163
BibTeX
@inproceedings{tekaya-2025-amo,
title = "A Matter of Time: Revealing the Structure of Time in
Vision-Language Models",
author = "Nidham Tekaya and Manuela Waldner and Matthias Zeppelzauer",
year = "2025",
abstract = "Large-scale vision-language models (VLMs) such as CLIP have
gained popularity for their generalizable and expressive
multimodal representations. By leveraging large-scale
training data with diverse textual metadata, VLMs acquire
open-vocabulary capabilities, solving tasks beyond their
training scope. This paper investigates the temporal
awareness of VLMs, assessing their ability to position
visual content in time. We introduce TIME10k, a benchmark
dataset of over 10,000 images with temporal ground truth,
and evaluate the time-awareness of 37 VLMs using a novel
methodology. Our investigation reveals that temporal
information is structured along a low-dimensional,
non-linear manifold in the VLM embedding space. Based on
this insight, we propose methods to derive an explicit
``timeline'' representation from the embedding space. These
representations model time and its chronological progression
and thereby facilitate temporal reasoning tasks. Our
timeline approaches achieve competitive or superior accuracy
compared to a prompt-based baseline while being
computationally efficient. All code and data are available
at https://tekayanidham.github.io/timeline-page/.",
month = oct,
isbn = "979-8-4007-2035-2",
location = "Dublin",
event = "ACM International Conference on Multimedia 2025",
doi = "10.1145/3746027.3758163",
booktitle = "MM '25: Proceedings of the 33rd ACM International Conference
on Multimedia",
pages = "10",
pages = "12371--12380",
keywords = "Multimodal representations, Vision-language models, Time
modeling, Time estimation, Benchmark dataset",
url = "https://www.cg.tuwien.ac.at/research/publications/2025/tekaya-2025-amo/",
}