SOMD 2026: Software Mention Detection and Coreference Resolution
Task Definition
Software is a crucial entity in scientific research, and identifying such artifacts enables an understanding of provenance and of the methods involved in data handling. One challenge in tracking software usage is that different names are often used for the same software due to abbreviations, geographical differences, or spelling variations [2]. The task becomes even harder across multiple documents, each containing author-specific variations. This underlines the importance of automatically identifying and disambiguating informally and inconsistently mentioned software. With the following three subtasks, we aim to advance the resolution of software mentions across multiple documents (cross-document coreference resolution). We additionally address challenges such as noisy data resulting from automatic mention prediction and the computational complexity that affects runtime as the volume of target texts grows.
Subtask 1: Coreference resolution over gold-standard mentions across multiple documents. Given all gold-standard annotated software mentions (including their metadata and sentences), the objective of this task is to generate clusters in which each cluster contains the mentions referring to the same underlying software.
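To make the clustering objective concrete, here is a minimal illustrative baseline (not the official one): normalize each mention's surface form and group exact matches. The (mention_id, surface) input format is a hypothetical assumption for this sketch; the actual dataset schema is documented on the Codabench page.

```python
from collections import defaultdict

def normalize(mention: str) -> str:
    # Lowercase and keep only alphanumeric characters so trivial
    # variants like "SPSS" and "spss " fall into the same cluster.
    return "".join(ch for ch in mention.lower() if ch.isalnum())

def cluster_by_surface_form(mentions):
    # `mentions` is assumed to be (mention_id, surface_string) pairs;
    # the real schema is defined by the released dataset.
    clusters = defaultdict(list)
    for mention_id, surface in mentions:
        clusters[normalize(surface)].append(mention_id)
    return list(clusters.values())

# Toy example: three documents mention two distinct software products.
mentions = [("d1-m1", "SPSS"), ("d2-m1", "spss"), ("d3-m1", "Stata")]
print(cluster_by_surface_form(mentions))  # [['d1-m1', 'd2-m1'], ['d3-m1']]
```

Such a baseline misses abbreviations and misspellings, which is precisely where more sophisticated systems are expected to improve.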
Subtask 2: Coreference resolution over predicted mentions across multiple documents. In this subtask, we provide software mentions and their metadata that were automatically extracted using a baseline model. The challenge is to resolve all coreferences by creating clusters of mentions referring to the same software. This reflects real-world coreference resolution, where upstream pipelines (such as entity and metadata extraction) are imperfect.
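With predicted mentions, exact matching breaks down because extracted spans may be truncated or misspelled. One possible way to tolerate such noise, sketched here under the assumption that mentions are plain strings, is greedy single-pass clustering with a character-level similarity threshold (the 0.8 cutoff is an arbitrary illustrative choice):

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    # Character-level ratio in [0, 1]; tolerant of small prediction
    # errors such as "Nunpy" for "NumPy".
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def greedy_cluster(surfaces):
    # Assign each surface form to the first cluster whose representative
    # is sufficiently similar; otherwise open a new cluster.
    clusters = []  # list of (representative, members) pairs
    for s in surfaces:
        for rep, members in clusters:
            if similar(rep, s):
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return [members for _, members in clusters]

print(greedy_cluster(["NumPy", "numpy", "Nunpy", "PyTorch"]))
# [['NumPy', 'numpy', 'Nunpy'], ['PyTorch']]
```

Greedy thresholding is order-dependent and can over-merge; it is meant only to illustrate how noise tolerance changes the problem relative to Subtask 1.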
Subtask 3: Coreference resolution over predicted mentions across multiple documents at scale. For this subtask, we provide predicted software mentions and metadata at a larger scale. Participants are expected to resolve coreferences by creating clusters of mentions referring to the same software. Since the provided corpus contains many more entity variants and numerous possible software identities, the computational runtime challenge increases. This requires models to scale effectively, maintain accuracy, and distinguish among an increasingly dense field of similar or overlapping software mentions.
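Comparing every mention pair is O(n²) and becomes prohibitive at this scale. A standard mitigation (not prescribed by the task) is blocking: compute a cheap key per mention and run the expensive pairwise comparison only within blocks. The key function below (first three alphanumeric characters) is a deliberately crude placeholder:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(surface: str) -> str:
    # Cheap key: the first three alphanumeric characters of the
    # lowercased mention. Real systems would use stronger keys
    # (e.g., character n-grams or locality-sensitive hashing).
    alnum = "".join(ch for ch in surface.lower() if ch.isalnum())
    return alnum[:3]

def candidate_pairs(surfaces):
    # Yield only within-block pairs instead of all O(n^2) pairs.
    blocks = defaultdict(list)
    for s in surfaces:
        blocks[blocking_key(s)].append(s)
    for block in blocks.values():
        yield from combinations(block, 2)

surfaces = ["NumPy", "numpy 1.21", "SciPy", "scipy", "Stata 17"]
print(list(candidate_pairs(surfaces)))
# [('NumPy', 'numpy 1.21'), ('SciPy', 'scipy')]
```

Blocking trades recall for runtime: mentions that never share a block can never end up in the same cluster, so the key must be chosen to keep truly coreferent pairs co-blocked.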
Important Dates
- Registration Opens: January 14, 2026
- Train/Test Data Release (All Subtasks): January 20, 2026
- Competition Phase: January 20 – February 20, 2026 (via Codabench)
- System Paper & Code Submission Deadline: February 27, 2026
- Notification of Acceptance: March 10, 2026
- Camera-Ready Papers Due: March 27, 2026
- Workshop Date: May 12, 2026
Participation
Participants can take part in one or more of the three subtasks independently. Each subtask is hosted on the same competition platform but as a separate competition.
Registration
To participate, teams must register on the competition platform via the participation link.
After registration, participants will gain access to:
- The training and test data
- Submission instructions for each subtask
- The evaluation leaderboard(s)
Participation is open to both academic and industry teams. Teams may consist of one or more members.
Dataset
The dataset has been released on the Codabench competition page.
Shared task organisers:
- Sharmila Upadhyaya (GESIS Leibniz Institut für Sozialwissenschaften, Germany)
- Wolfgang Otto (GESIS Leibniz Institut für Sozialwissenschaften, Germany)
- Julia Matela (Wismar University of Applied Sciences, Germany)
- Frank Krueger (Wismar University of Applied Sciences, Germany)
- Stefan Dietze (GESIS Leibniz Institut für Sozialwissenschaften, Cologne & Heinrich-Heine-University Düsseldorf, Germany)
References
[1] Krüger, Frank, et al. SOMD@NSLP2024: Overview and Insights from the Software Mention Detection Shared Task. In Natural Scientific Language Processing and Research Knowledge Graphs. Springer, 2024. https://doi.org/10.1007/978-3-031-65794-8_17
[2] Schindler, David, et al. SoMeSci: A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles. arXiv:2108.09070, 2021. https://arxiv.org/abs/2108.09070
[3] Upadhyaya, Sharmila, et al. SOMD2025: A Challenging Shared Task for Software Related Information Extraction. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025). ACL, 2025. https://aclanthology.org/2025.sdp-1.13/