Applied Track
Lessons Learned from Language-Based User Modeling in Music Recommender Systems
Abstract
Music recommender systems play a central role in helping users navigate massive music catalogues. Traditionally, these systems rely on opaque latent embeddings derived from implicit behavioral signals such as clicks, listening histories, or co-consumption patterns. Recent advances in natural language processing enable a shift toward representing user preferences explicitly in natural language. Rather than modeling users with hard-to-interpret latent vectors, emerging systems leverage textual descriptions of preferences, conversational queries, or interpretable language-based user profiles to guide retrieval and ranking. This paradigm shift improves both interoperability across systems and the transparency of user representations. In this demonstration, we present a series of works exploring the transition from embedding-based user representations to language-based user modeling for music recommendation within a music streaming platform ecosystem, addressing real-world data and deployment constraints. We investigate both long-term music preferences (e.g., LLM-generated user profiles summarizing consumption behavior) and short-term intent expressed through natural language queries over a multimodal music catalogue. Drawing on several of our recently published works, we highlight practical challenges that arise when replacing or complementing latent representations with explicit language-based ones. First, language-based user profiles are highly sensitive to the stylistic and representational biases of the LLMs that generate them, affecting both user self-recognition and recommendation performance. Second, natural language queries capture short-term intent and may express preference signals beyond simple item descriptors, while long-term preferences remain encoded in listening histories. This motivates recommendation models that jointly represent query intent and behavioral user representations. 
Third, evaluation metrics commonly used in NLP, particularly for generative tasks, provide new perspectives for assessing music recommender systems beyond traditional retrieval-based metrics. The goal of this demonstration is to share lessons learned from building and evaluating language-based user modeling systems using real-world music streaming data and user interactions. We use these insights to outline a roadmap and identify open challenges at the intersection of NLP and recommender systems as language becomes a central interface for expressing user preferences.
Avani: An AI-Powered Multilingual Audio Intelligence System for Ground-Level Agricultural Data Collection and Policy Insights
Abstract
Agricultural policymaking often operates at a distance from the realities experienced by farmers. Ground-level data collection about plantation cycles, crop health, and harvest conditions remains a major bottleneck for public agencies and agrarian institutions. Current survey and reporting mechanisms are frequently manual, sporadic, and administratively burdensome. As a result, the data collected is often incomplete, delayed, or biased, leading to inaccurate representations of agricultural conditions. These flawed datasets are subsequently used to design policies, subsidies, and market interventions. The result is a structural disconnect between farmers’ lived realities and policymakers’ assumptions, producing distorted schemes, ineffective support mechanisms, and systemic inefficiencies across the agricultural ecosystem. Avani addresses this problem through an AI-powered, audio-based data collection and analysis system designed to capture reliable, high-frequency insights directly from farmers. Instead of relying on infrequent surveys or field visits, farmers receive automated phone calls in their regional language at predefined intervals. During these calls, they respond to structured questions related to crop conditions, inputs, water access, labour availability, pest outbreaks, and production constraints. The conversational format lowers barriers to participation, especially for farmers with limited literacy or digital access, enabling scalable and inclusive data collection. The collected audio responses are processed through a multilingual natural language processing pipeline that converts unstructured speech into structured policy-relevant insights. The pipeline integrates speech recognition, translation, segmentation, and information extraction modules to transform raw audio responses into analysable datasets. Responses are first transcribed from regional languages into text, then translated into a standardised representation to enable cross-region analysis. 
Subsequent processing stages identify key entities, thematic signals, and quantitative indicators related to crop health, resource constraints, and farmer sentiment. By combining qualitative narrative signals with structured data extraction, the system enables the generation of both statistical summaries and contextual insights for decision-makers. The resulting intelligence is aggregated and visualised for government agencies, agricultural organisations, and supply-chain stakeholders. These insights help identify emerging risks such as crop disease outbreaks, irrigation shortages, or labour constraints at an early stage. Policymakers can use this information to design targeted interventions, adjust procurement strategies, and improve the allocation of subsidies and support programs. Avani is currently deployed across four Indian states, serving more than 1,000 farmers across over 30 crop types. The system continuously collects and processes multilingual audio data at scale, demonstrating the viability of speech-based NLP pipelines for large-scale rural data acquisition. Expansion plans include deployment in China and Peru, with the system expected to support more than 50 crop types by 2027. Initial evaluations indicate strong adoption and perceived value among both farmers and decision-makers. In user feedback surveys, 92 percent of participating farmers reported that the system allowed their voices and experiences to be represented in agricultural decision-making. Among policymakers and institutional stakeholders, 82 percent reported that the insights generated were useful for planning and policy design. By improving the fidelity, frequency, and accessibility of ground-level agricultural data, Avani reduces information asymmetry between farmers and policymakers. 
The system demonstrates how multilingual NLP and speech technologies can support large-scale participatory data collection, enabling evidence-based policymaking while strengthening environmental, economic, and social sustainability in agriculture.
From RAG to Neuro-Symbolic Agentic Systems: Lessons Learned on Reproducibility
Abstract
Reproducible natural language processing becomes hardest precisely where modern large language model (LLM) systems become most useful: when we move from "answer a question" to "reason, decide, and act" under enterprise constraints. In this talk, we share practical takeaways on improving reproducibility while building a domain-specific reasoning pipeline that can recommend and compose tool-based workflows under tight validation constraints. Concretely, we evolved Hexagon's ERDAS IMAGINE (now Octave Imagine) Spatial Model Editor [1] — an interactive canvas for designing and running custom remote-sensing and geospatial workflows ("Spatial Models") by connecting predefined processing tools ("operators") that perform specific operations on maps, images, and other features. We show how we expanded AI assistance from a highly reproducible support chatbot based on Retrieval Augmented Generation (RAG) into a production-grade, neuro-symbolic, agentic system that, in a first stage, recommends individual operators and, in a second stage, assembles complete processing subgraphs. We start with a customer support chatbot grounded in manuals and FAQs, where prompts that tightly bind generations to retrieved snippets turn the LLM largely into a question-conditioned summarizer, yielding low run-to-run variance and predictable compliance behavior. We then show why this strategy breaks down once the task shifts from retrieval to composition: recommending individual operators (≈700 in Spatial Model Editor) and, ultimately, assembling a processing subgraph with correct ordering and type-compatible connections. Our operator recommender still relies on RAG over example Spatial Models, but coverage is limited relative to the space of use cases, so the system must lean on model judgment. We therefore moved towards a neuro-symbolic approach: the LLM outputs a domain-specific language (DSL), followed by deterministic schema validation, lightweight verification and error-guided correction. 
In parallel, we benchmarked multiple LLMs for operator recommendation by rerunning identical inputs. In line with reports that explicit reasoning models show higher run-to-run variability [2], we observed substantially higher output standard deviation than for non-reasoning LLMs, up to three times in our setting. These benchmarks also guided cost/latency reductions, e.g., via token-efficient encodings (TOON [3]) and lower-cost models when quality was unchanged. Subgraph recommendation uses the same RAG architecture but runs more powerful reasoning LLMs in a loop with a heuristic validation agent, enabling the system to plan a typed, executable graph. Where workflows are known, we found "orchestrator agents" to be counterproductive: specifying execution order in natural language increases variance, latency, and cost compared to explicit, code-defined execution paths. More broadly, we avoid LLM invocations whenever tool inputs/outputs do not require interpretation. In practice, "out-of-the-box" ReAct [4]-style multi-agent setups often add unnecessary LLM calls by repeatedly translating between natural language, tool calls, and intermediate results. Instead, we use agents primarily to construct and iteratively refine a DSL graph that can be passed directly between components, eliminating LLM-based tool-calling conversions and any subsequent summarization or interpretation of DSL outputs. Taken together, these findings led us to a fully neuro-symbolic design. In our setting, state-of-the-art reasoning models are already strong enough to produce a correct subgraph with very few refinement steps, as long as residual non-determinism is bounded by deterministic validation and repair.
[1] Hexagon, "ERDAS IMAGINE: Your Complete Remote Sensing Solution", https://hexagon.com/products/erdas-imagine
[2] J. Yuan et al., "Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference", arXiv:2506.09501, 2025.
[3] TOON, "Token-Oriented Object Notation", https://github.com/toon-format/toon
[4] S. Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models", arXiv:2210.03629, 2023.
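The validate-and-repair idea in this abstract (an LLM emits a DSL graph, deterministic checks find errors, and the errors are fed back for correction) can be illustrated with a minimal sketch. Everything named here is hypothetical: the operator catalogue, the `Node` structure, and `fake_llm`, which stands in for a real model call. It is not the Spatial Model Editor DSL or the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical operator catalogue: name -> (input types, output type).
OPERATORS = {
    "RasterInput": ((), "raster"),
    "NDVI": (("raster",), "raster"),
    "Threshold": (("raster",), "mask"),
    "RasterOutput": (("mask",), "none"),
}


@dataclass
class Node:
    id: str
    op: str
    inputs: tuple  # ids of upstream nodes, in argument order


def validate(graph: list) -> list:
    """Deterministic checks: known operators, inputs defined earlier, matching types."""
    errors = []
    seen = {}  # node id -> output type of nodes accepted so far
    for node in graph:
        if node.op not in OPERATORS:
            errors.append(f"{node.id}: unknown operator '{node.op}'")
            continue
        expected, out_type = OPERATORS[node.op]
        if len(node.inputs) != len(expected):
            errors.append(
                f"{node.id}: expected {len(expected)} inputs, got {len(node.inputs)}"
            )
        else:
            for src, want in zip(node.inputs, expected):
                have = seen.get(src)
                if have is None:
                    errors.append(f"{node.id}: input '{src}' is not defined earlier")
                elif have != want:
                    errors.append(f"{node.id}: input '{src}' is {have}, expected {want}")
        seen[node.id] = out_type
    return errors


def generate_with_repair(generate: Callable, task: str, max_rounds: int = 3):
    """Call the generator until the graph validates, passing errors back as feedback."""
    feedback = []
    for round_no in range(1, max_rounds + 1):
        graph = generate(task, feedback)
        feedback = validate(graph)
        if not feedback:
            return graph, round_no
    raise ValueError(f"no valid graph after {max_rounds} rounds: {feedback}")


def fake_llm(task: str, feedback: list) -> list:
    """Stand-in for an LLM call: the first draft has a type error, the retry fixes it."""
    head = [Node("a", "RasterInput", ()), Node("b", "NDVI", ("a",))]
    if not feedback:
        return head + [Node("c", "RasterOutput", ("b",))]
    return head + [Node("t", "Threshold", ("b",)), Node("c", "RasterOutput", ("t",))]


graph, rounds = generate_with_repair(fake_llm, "vegetation mask")
print(rounds, [node.op for node in graph])
```

Because the validator is plain code, the same graph always yields the same verdict, so the remaining run-to-run variance is confined to the generation step.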
Neurosymbolic Artificial Intelligence for Explainable and Reliable Assistants in High-Stake Environments
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, yet their limited explainability and susceptibility to hallucinations, bias, and non-deterministic outputs hinder their deployment in high-stakes environments. These shortcomings led to the initiation of the SYMBOL (Neurosymbolic AI for Confidential and Reliable Assistants in High-Stake Environments) project, which adopts a neurosymbolic architecture that combines sub-symbolic language models with symbolic reasoning over formally structured domain knowledge. The project aims to decouple natural language understanding from deterministic data processing and reasoning. LLMs are used to interpret user requests, mapping them to formal domain concepts and actions. This allows SYMBOL to conduct critical retrieval, aggregation, and inference operations with symbolic AI that draws upon knowledge graphs, API functions, and database tables. This separation reduces the cognitive and reasoning burden on the LLM and considerably increases reliability, traceability, and explainability. SYMBOL's reasoning capabilities rely on a knowledge graph that provides a formal description of the wealth management domain, the industry partner's software API, and its database schemas. Each semantic concept in the graph is aligned with its corresponding representation in the operational system landscape. For example, entities in the graph are linked to specific database tables or views, while actions and business processes are associated with API functions. This tight coupling bridges the semantic gap between user intent and executable system operations. The LLM maps user intent to the corresponding concepts within the knowledge graph, which refer to data analytics operations that allow deterministic execution by symbolic AI components. 
Tasks that would require complex retrieval operations such as joins across multiple tables, data pre-processing and summarization are instead conducted via symbolic AI in a reliable and traceable way. Beyond enabling symbolic reasoning, the knowledge graph also adds a validation layer through the use of constraint definitions encoded in the Shapes Constraint Language (SHACL). These constraints formally specify structural requirements for entities and relationships, cardinality restrictions, value ranges, and domain-specific integrity rules. After symbolic reasoning and data retrieval are performed, SHACL rules ensure that outputs conform to domain rules and regulatory requirements. By combining flexible natural language interaction with symbolic inference, SYMBOL ensures reliable, traceable, and explainable results. The presented approach considerably improves explainability, since all requests can be traced to symbolic reasoning paths with built-in explainability. The Innosuisse-funded SYMBOL project focuses on the development of Neurosymbolic AI for regulatory compliance and decision support in wealth management systems. Nevertheless, the underlying architecture is designed to generalize to other regulated and high-stake domains, such as finance, medicine, and law.
AutoConcourse: A Computational Pipeline for Automatic Generation of Q-Concourse Statements
Abstract
Crafting a high-quality concourse is central to Q-methodology, yet manual collection and phrasing of candidate statements remains slow, costly, and difficult to standardize across languages and issues. We present AutoConcourse, an end-to-end retrieval-and-generation pipeline that assembles large issue-specific corpora, organizes them via a flexible anchor framework, embeds text with sentence-embedding models, synthesizes concise, Q-ready statements with large language models (LLMs), and applies redundancy and balance controls prior to expert curation. We apply AutoConcourse to the generation of survey statements for the Deliberative Reasoning Index (DRI), a core outcome measure for deliberative processes whose adoption is constrained by high preparation costs. DRI relies on a set of statements representative of the public discourse on a given topic, derived from Q-methodology's concourse. We contribute: (1) a generalizable pipeline for automatic concourse generation, (2) quantitative criteria for diversity tailored to Q-methodology, (3) empirical comparisons across anchor-only and no-anchor (unsupervised seeding) conditions, quantifying anchor-induced effects on diversity, and (4) a reproducible implementation with documented prompts, anchors, and configuration files. We explicitly note that the supervised nature of this pipeline can introduce coder priors and bias.
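As an illustration of what a redundancy control over embedded statements could look like, the following minimal sketch greedily drops any candidate whose cosine similarity to an already kept statement exceeds a threshold. The greedy strategy, the function names, and the threshold of 0.9 are assumptions made for this example; the abstract does not specify how AutoConcourse implements this step, and the toy vectors stand in for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def drop_redundant(statements, embeddings, max_similarity=0.9):
    """Keep a statement only if it is not too similar to any statement kept before it."""
    kept = []  # indices of retained statements
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= max_similarity for j in kept):
            kept.append(i)
    return [statements[i] for i in kept]


statements = [
    "Public transport should be free.",
    "Buses and trains should cost nothing.",
    "Car traffic should be taxed more heavily.",
]
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # placeholder vectors
print(drop_redundant(statements, toy_embeddings))
```

A greedy filter of this kind is order-dependent, so in practice the candidates would need a deliberate ordering (for example by source quality) before filtering.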
Holistic Evaluation Framework: Agentic AI in High-Stakes Investigative Environments
Abstract
As agentic AI systems transition into production-ready applications, robust evaluation becomes critical, especially in domains demanding high accuracy and strict data governance. We present a multi-dimensional evaluation framework for Thomson Reuters' CLEAR Investigate, an autonomous investigative agent used by law enforcement agencies and corporate investigators to detect fraud, uncover hidden connections, and help prevent money laundering. CLEAR Investigate answers complex investigative queries by orchestrating access to multiple structured and unstructured data sources and systematically cross‑referencing information across domains. Existing evaluations of AI agents often prioritize aggregated task scores, with limited attention to whether answers are properly evidenced, equitably delivered, or safe in the presence of sensitive data and adversarial inputs. For investigative agents, these gaps translate into concrete risks: unsupported allegations, uneven treatment across cases, and exposure of confidential information. Our framework evaluates agents across six dimensions: Task Performance & Reliability, Grounding & Attribution, Fairness & Bias, Safety & Security, Robustness & Consistency, and Efficiency & Resource Use, jointly capturing whether answers are accurate, evidenced, stable, fair, and operationally viable in real investigative use. Implementation combines automated evaluation methods, including LLM‑as‑judge and rule‑based checks, with expert‑curated test datasets aligned with investigative scenarios. This enabled rapid iteration, supporting several development cycles and thousands of tests. Human annotators evaluated responses against dimension‑specific criteria to provide ground truth and calibrate automated metrics. 
Compared to earlier agent versions, acceptance rates increased by more than 40% over thousands of test cases spanning diverse investigative scenarios, while evaluation throughput increased substantially by reducing manual annotation dependency. Our results indicate that multi-dimensional evaluation frameworks are not just desirable but necessary for deploying agentic AI in high-stakes, data-sensitive environments. Although developed for investigative applications, the framework applies broadly to agentic systems where reliability, safety, and ethics must be evaluated alongside task performance. This approach enables faster iteration while maintaining the level of assurance required for deployment in regulated and professional settings.
Probabilistic Testing for NLP Systems: A Reproducible Approach to Evaluating Stochastic Behaviour in Real-World Applications
Abstract
Natural language processing systems are increasingly deployed in settings where output quality matters operationally, yet the systems themselves are often stochastic. This is particularly true for LLM-based applications, but the problem also arises more broadly wherever NLP pipelines depend on probabilistic models, retrieval components, ranking, sampling, or other non-deterministic mechanisms. In such contexts, traditional pass/fail testing and one-shot benchmark scores are often insufficient: they obscure variance, conceal instability, and provide limited guidance when a system appears to “sometimes work”. This talk presents a probabilistic testing approach for NLP systems, embodied by the PUnit framework. Rather than treating a single execution as decisive, the approach evaluates behaviour over repeated trials. For binary success/failure criteria, individual outcomes can be modelled as Bernoulli variables and summarised by the observed success rate. This makes it possible to assess whether an NLP system still behaves within an expected range, whether a drop in quality/accuracy is likely to have occurred (following a change of model, training data update etc.), and whether differences between versions are meaningful or merely noise. The PUnit approach is relevant to a wide spectrum of NLP applications, including text classification, information extraction, retrieval-augmented generation, summarisation, translation, and agentic workflows built on top of language models. The emphasis is not on replacing existing benchmarks, but on complementing them with an evaluation layer that is more suitable for production-facing, stochastic systems. In particular, the method supports a more reproducible and transparent interpretation of quality by making uncertainty, sample size, and tolerance explicit. 
The talk will outline the underlying testing model, show how probabilistic verdicts can be integrated into engineering workflows, and discuss how this style of evaluation can help bridge the gap between academic NLP metrics and the realities of deployed systems. The talk will incorporate a live demonstration of the PUnit probabilistic testing framework.
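The repeated-trial idea described above can be sketched in a few lines: run a binary check many times, summarise the Bernoulli outcomes by the observed success rate, and attach an interval so that sample size and tolerance are explicit. This is not the PUnit API; the function names, the use of a Wilson score interval, and the three-way verdict are assumptions chosen for illustration.

```python
import math
from typing import Callable


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def probabilistic_verdict(trial: Callable[[], bool], trials: int,
                          min_success_rate: float, z: float = 1.96) -> dict:
    """Run `trial` repeatedly and compare the interval with the required rate."""
    successes = sum(1 for _ in range(trials) if trial())
    low, high = wilson_interval(successes, trials, z)
    if low >= min_success_rate:
        verdict = "pass"          # even the pessimistic bound meets the target
    elif high < min_success_rate:
        verdict = "fail"          # even the optimistic bound misses the target
    else:
        verdict = "inconclusive"  # more trials are needed to decide
    return {"rate": successes / trials, "interval": (low, high), "verdict": verdict}


# Hypothetical check: 8 successes in 10 runs against a required rate of 0.8.
outcomes = iter([True] * 8 + [False] * 2)
print(probabilistic_verdict(lambda: next(outcomes), trials=10, min_success_rate=0.8))
```

The "inconclusive" outcome is the point of the exercise: with only ten runs, an observed rate of 0.8 is compatible with a true rate well below the target, so the sketch refuses to call it either way.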
Modern OCR in Resource-Constrained Environments
Abstract
Optical Character Recognition (OCR) plays a crucial role in automating the processing of scanned documents in many real-world applications. Designing such systems that provide good quality while operating under resource constraints remains challenging. In this talk, we present insights from our experience building OCR-based document processing solutions for productive use in real-world applications. We describe how OCR pipelines can evolve from traditional character recognition approaches, often relying on dictionaries and rule-based post-processing, toward more recent solutions that leverage open-source frameworks, vision models, and large language models to perform contextual correction and improve extraction quality and robustness. Through concrete examples, we compare different approaches and discuss their respective strengths, limitations, and performance in practical scenarios. We then discuss how such solutions can be designed and operated in resource-constrained environments. Commercial systems often need to operate under practical constraints such as limited computing and memory resources, noisy scans, and heterogeneous document formats. We present engineering trade-offs involved in designing systems that remain efficient, reliable, and maintainable under these conditions, and explore ways to combine visual and textual signals to extract structured information from scanned documents while maintaining a consistent and reliable user experience.
Red Teaming GPT-OSS-20B: Evaluating Jailbreak Susceptibility and Bias Across English and Swiss German
Abstract
This project evaluates the safety alignment of the open-weight reasoning model gpt-oss-20b against adversarial guardrail jailbreaks and inherent societal biases. While modern LLMs are extensively tuned for harmlessness, our modular testing framework demonstrates that they remain highly susceptible to contextual manipulation. Adversarial techniques embedding harmful intent within mathematical proofs, scientific citations, or iterative reasoning successfully bypassed guardrails, achieving Attack Success Rates (ASR) up to 67.28%. Furthermore, non-adversarial probing revealed pervasive systemic risks, with the model defaulting to stereotypical assumptions in 35.78% of ambiguous scenarios. A supplementary evaluation in Swiss German confirmed these vulnerabilities persist across regional linguistic contexts. These findings highlight that current alignment mechanisms have not fully resolved jailbreaks and inherent bias, posing critical challenges for automated decision-making.
AI for improved Public Participation and Responsiveness in Urban Planning
Abstract
We present an AI-assisted document processing system designed to support urban planning workflows in Swiss municipalities, with a focus on handling public objection letters (“Einsprachen”). Public participation is a central element of democratic planning processes, but it also introduces significant administrative overhead. Municipal staff must process large volumes of documents, extract relevant information, and prepare appropriate responses. Our work addresses this bottleneck by introducing a structured, AI-supported pipeline that improves efficiency while preserving transparency and human oversight. The proposed system transforms unstructured objection letters, which are often submitted as PDF documents, into structured outputs. These documents may include scanned pages, inconsistent formatting, or embedded non-relevant content, making manual processing labor-intensive. The system first recovers and normalizes the textual content, ensuring that the original meaning is preserved while removing obvious artifacts such as formatting noise or scanning distortions. It then extracts metadata such as sender details and subject, as well as the main arguments raised in the letter, and converts them into a structured representation suitable for further processing. Building on these structured data, the system generates a preliminary response draft in a formal administrative style. This draft summarizes the citizen's main concerns and acts as the basis for the final response. A central methodological contribution of this work is the combination of a deterministic workflow design with probabilistic language model components. Rather than relying on an open-ended conversational agent, the system follows a predefined sequence of processing steps. This design ensures robustness and predictable behavior. At the same time, on-premise large language models are applied in text normalization, information extraction, and controlled text generation. 
Transparency and controllability are key design principles throughout the system. Intermediate outputs, such as extracted and cleaned text, are made visible to users, allowing them to verify and correct results before proceeding. This human-in-the-loop approach ensures that the system remains assistive, without replacing the essential human role in collecting and interpreting citizen input, which is critical for informing and improving planning processes based on the received feedback. The modular structure of the pipeline allows individual components to be adapted or extended, for example, by incorporating additional domain knowledge. For instance, we extend the pipeline with a domain-adapted Retrieval-Augmented Generation (RAG) component to support targeted question answering over regulatory documents. Queries are assigned to common administrative tasks, such as finding facts, retrieving normative guidelines, and comparing requests. For more complex questions, the system performs lightweight decomposition (e.g., retrieving multiple passages before comparing them). Retrieval combines semantic similarity with lexical matching and structural signals (e.g., section titles) to prioritize authoritative content such as guideline sections defining planning criteria. Answer generation is performed under constrained prompting, ensuring that outputs are grounded in the retrieved context. Depending on the task, the system produces concise factual answers, structured lists, or explicit comparisons in a formal administrative style. The presented prototype demonstrates how applied AI can effectively support real-world administrative processes by addressing a clearly defined workflow. While broader visions of AI-assisted urban planning include tasks such as drafting regulations and knowledge retrieval, this work focuses on a concrete and immediately relevant use case: improving the efficiency and consistency of handling citizen submissions.
Building a RAG-Based Agentic Knowledge Assistant for a Swiss Energy Supplier
Abstract
Corporate knowledge is often distributed across various systems with diverse document formats. While public AI services are widely used by employees, they pose significant data privacy risks. This contribution describes the development, evaluation, and deployment of the "Primeo Knowledge Assistant," an agentic Retrieval-Augmented Generation (RAG) system designed to provide employees with source-referenced answers. It is tailored to the Swiss energy sector and applied to Primeo Energy's company-internal and public data. The system, which has been put into production, uses a modular architecture built with LangChain and Microsoft Azure, ensuring adherence to Swiss data protection regulations and minimizing cross-border data transfer risks. We included two different knowledge bases: (a) an HTML dataset containing short internal news and (b) a dataset of long, public PDF documents relevant to the energy sector. A core challenge was the high-fidelity extraction of data from PDFs containing tables and multi-column layouts. This was addressed using Docling, IBM’s layout-aware document parsing library. For the PDF corpus, we applied Anthropic's contextual retrieval to enrich chunk-level representations with surrounding document context. In our evaluation, contextual retrieval substantially improved retrieval performance (Recall@1 = 0.60) compared to standard chunk-based indexing (Recall@1 = 0.36). For short texts such as those in the internal news dataset, contextual retrieval is not appropriate. To increase retrieval precision there, we applied a hybrid search pipeline combined with semantic reranking. On the internal news dataset, this significantly improved the retrieval Hit Rate@1: a baseline hybrid search achieved 0.75, which increased to 0.86 when using the small bge-reranker-base model and further to 0.89 with the Azure semantic reranker. 
These results show that lightweight rerankers already provide substantial gains, while managed cloud-based rerankers can further improve precision. A key technical highlight is the transition to an agentic design using the Model Context Protocol (MCP). By treating different knowledge bases as specialized tools with semantically rich docstrings, the agent can autonomously decide when to query internal news or technical, public energy sector documentation. In addition, we implemented automatic extraction of document metadata from PDF content, which helped the agent to filter for relevant chunks. Targeted prompt engineering, which forces the agent to reflect on the tool output and to adapt its queries if the output is not helpful, led to a tool correctness score of 84.8%. For end-to-end evaluation, we created a dataset with gold-standard question-answer pairs and used metrics such as hit rate for retrieval and an LLM judge via the DeepEval and RAGAS frameworks. We observed substantial discrepancies in scores for the same metric between these two libraries, resulting in a mean absolute difference of 0.28 for answer relevancy and 0.31 for faithfulness. Manual review of the conversation traces revealed a preference for RAGAS. Overall, our results demonstrate that careful document ingestion, contextual retrieval, semantic reranking, and agentic orchestration can substantially improve the performance of enterprise RAG systems even under strict regulatory constraints. Although regional data protection requirements significantly restrict the availability of the latest large language models in Swiss or European deployments, our experiments show that substantial quality improvements can still be achieved through architectural and retrieval optimizations.
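For readers unfamiliar with the two quantities reported in this abstract, the following minimal sketch shows how a Hit Rate@k over a set of questions and a mean absolute difference between two judges' scores can be computed. The function names and the toy data are assumptions for illustration; the abstract does not describe the authors' evaluation code, and the numbers below are not their results.

```python
def hit_rate_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of queries whose top-k retrieved ids contain a relevant id."""
    if not ranked_ids:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if set(ranked[:k]) & set(relevant)
    )
    return hits / len(ranked_ids)


def mean_absolute_difference(scores_a, scores_b):
    """Average per-sample gap between two scorers rating the same samples."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(scores_a, scores_b)) / len(scores_a)


# Toy retrieval results for four questions (ids are placeholders).
ranked = [["d1", "d2"], ["d3", "d4"], ["d5", "d6"], ["d7", "d8"]]
gold = [{"d1"}, {"d4"}, {"d9"}, {"d7"}]
print(hit_rate_at_k(ranked, gold, k=1), hit_rate_at_k(ranked, gold, k=2))

# Toy per-sample scores from two hypothetical judge frameworks.
judge_a = [0.9, 0.5, 0.7]
judge_b = [0.6, 0.8, 0.7]
print(mean_absolute_difference(judge_a, judge_b))
```

When each question has exactly one gold chunk, Hit Rate@1 and Recall@1 coincide, which makes the two retrieval figures in the abstract directly comparable under that assumption.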
Multimodal RAG System for Students
Abstract
Retrieval Augmented Generation (RAG) systems have become a practical backbone for domain-specific conversational assistants, yet many deployments treat the knowledge base as text-only. In educational settings this is a critical limitation: lecture materials rely heavily on diagrams, charts, and tables to convey concepts. We present a production case study of upgrading the FHNW AI Tutor from a text-only to a multimodal RAG system that makes visual and tabular content searchable by converting it into text representations during ingestion while preserving a unified, text-based retrieval stack. Our contributions are twofold. First, we describe a parallelized ingestion pipeline that processes a corpus of 44'663 chunks (81.1% text, 14.0% image descriptions, 4.9% table summaries) from 2'176 heterogeneous sources, including uploaded PDFs, slides, Jupyter notebooks, and web URLs as well as extracted repositories and ZIP files. The ingestion pipeline comprises five stages: discovery, registration, extraction, chunking, and embedding. Each stage supports distributed execution via Celery workers. Checksum-based change detection avoids redundant reprocessing. During extraction, Docling performs layout analysis to identify text, figures, and tables and converts formulae to LaTeX. The Qwen3.5-35 VLLM classifies and describes pedagogically useful figures after removal of purely decorative elements; the same VLLM summarizes tables, also using surrounding context. All content is locally contextualized and chunked via Docling's HybridChunker, embedded into a shared 2'560-dimensional space using Qwen3-Embedding-4B, and stored in PostgreSQL with pgvector HNSW indexing. Second, we evaluate the multimodal system against the text-only baseline and two reference systems in a controlled production environment. We construct a synthetic evaluation dataset of 175 question–answer pairs structured as a matrix of 7 core teaching modules × 5 difficulty scenarios × 5 samples per cell. 
The five scenarios combine three difficulty levels with three media types: easy/text, medium/text, medium/image, medium/table, and hard/text. Generation proceeds in three stages. First, for each cell, chunks matching the target module and media type are sampled from the corpus along with their 3 nearest neighbors by embedding similarity to provide multi-chunk context. If the LLM deems a chunk non-educational, it is skipped and a replacement chunk along with its neighbors is sampled. An LLM then generates one student-style question per chunk (5-20 words), conditioned on a scenario-specific prompt that defines the expected cognitive level. Second, each base question undergoes one evolution step randomly selected from seven rewriting strategies: concretizing, constrained, comparative, reasoning, hypothetical, in-breadth, and multi-context. The same LLM then generates a reference answer (150-500 words) grounded in the sampled contexts. Third, a two-stage quality filter uses the LLM as a judge to evaluate each pair on language, clarity, groundedness, and coherence. Pairs that fail are resampled from the same matrix cell and re-judged for up to three attempts. For retrieval evaluation, we benchmark 16 configurations spanning 2 embedding models (BGE-m3, Qwen3-Embedding-4B), 2 rerankers (BGE-Reranker-v2-m3, Qwen3-Reranker-4B), and 4 search modes (semantic, hybrid RRF at three weight ratios). We report Hit@$k$ and MRR at the chunk and document levels. For end-to-end evaluation, we compare four systems: V1 (text-only RAG), V2 (multimodal RAG), GPT-5 zero-shot, and GPT-5 with web search. We use LLM-based scoring with Ragas metrics including answer correctness, answer similarity, response relevancy, faithfulness, and context relevance. As the LLM judge, we use gpt-oss-120b to ensure unbiased evaluation and reproducibility. We find that the multimodal RAG outperforms the text-only RAG in answer correctness (+15.7\%) and faithfulness (+30.2\%).
Both our text-only and multimodal RAG reach substantially higher answer correctness than GPT-5 zero-shot (+28.8\% and +50.9\%) and GPT-5 with web search (+21.4\% and +40.5\%), demonstrating the value of domain-specific retrieval over general-purpose LLMs.
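To make the retrieval metrics concrete, the following minimal sketch computes Hit@$k$ and MRR over a handful of queries. It assumes exactly one relevant chunk per question; the function names and the toy chunk identifiers are hypothetical and are not taken from the system described above.

```python
from typing import List, Sequence, Tuple


def hit_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> float:
    """Return 1.0 if the relevant item is among the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1/rank (1-based) of the relevant item, or 0.0 if it is absent."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], str]], k: int) -> Tuple[float, float]:
    """Average Hit@k and MRR over (ranked_ids, relevant_id) pairs."""
    n = len(runs)
    hit = sum(hit_at_k(ranked, rel, k) for ranked, rel in runs) / n
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n
    return hit, mrr


# Hypothetical toy data: the chunk ids are illustrative only.
runs = [
    (["c3", "c1", "c7"], "c1"),  # relevant chunk at rank 2
    (["c2", "c9", "c4"], "c4"),  # relevant chunk at rank 3
    (["c5", "c6", "c8"], "c0"),  # relevant chunk not retrieved
]
hit2, mrr = evaluate(runs, k=2)
```

A document-level variant could be obtained by mapping each chunk id to its source document before ranking, which is again an assumption about how the two levels relate.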
NLP-based Quality Monitoring Support in Aviation Safety
Abstract
Air transportation requires the highest safety-management standards, which rely crucially on reporting any abnormal event at any phase of activity and at any level of gravity. Our research investigated how NLP can support human experts at Edelweiss Air in event classification and hence help avoid new threats to safety in the future. The research was based on over 100,000 pre-processed textual reports collected between 2017 and 2025 across 215 event classes and compared classical machine learning (ML) with state-of-the-art deep learning (DL) approaches using an 80/10/10 train/validation/test data partitioning scheme. Excluding full text information and only considering six categorical features related to safety reports, XGBoost achieved good model performance (Test: F1 = 0.776, Accuracy = 0.799) and clearly outperformed Logistic Regression, Support Vector Machine and Random Forest models. With textual report information extracted from contextual embeddings with either BERT or Qwen-3, classical ML training exhibited severe overfitting, particularly for tree-based gradient boosting ensemble models. Similar findings resulted when using non-contextual, TF-IDF-based embeddings. Our best run combined Qwen-3 embeddings with a CatBoost classifier (Test: F1 = 0.642, Accuracy = 0.666). Alternatively, we trained DL models, i.e., Transformer-based encoder models including BERT-base and the domain-specific SafeAeroBERT model, with extensive hyperparameter optimization. The best BERT-base run achieved a validation F1 score of 0.730, while SafeAeroBERT substantially underperformed despite domain-specific pretraining (validation F1 = 0.491). Possible reasons include differences in SafeAeroBERT's pretraining dataset characteristics relative to our study or insufficient training data. The study showed that classical ML training driven by text-only report information underperformed fine-tuned state-of-the-art transformer-based classifiers.
The overall best performance was achieved with classical ML, where categorical and text features were combined for the top-performing classifier families. XGBoost model performance, measured via F1 score on the test set, improved substantially to 0.859, compared to 0.776 with purely categorical input features (see above). Overall, the study revealed that the predictive power of text-only models is insufficient, while text combined with categorical metadata unlocks substantial additional performance. Our study also confirmed human experts' observations that an individual aviation safety report may express different event classes. We carefully analyzed predictions of our best BERT model, selected reports with confidence scores split across multiple plausible labels for human expert review, and curated a multi-label golden set of about 600 reports to anchor evaluation for the follow-up project. We plan to use an active learning framework and manually generate multi-label annotations for reports where models are most uncertain or where top-2 predictions are close. Initial findings show that human expert manual effort can be further reduced by including LLM suggestions and model predictions to pre-fill candidate labels, which human experts then need to verify, correct, or expand. This research shows that a smart combination of classical ML and DL approaches has high potential in specialized industries like air transportation and for tasks like aviation safety monitoring. Success, however, depends strongly on domain-specific model fine-tuning with custom data. The acquisition of such data involves substantial human expert effort, which might be reduced by natural language model-based approaches.
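One simple way to pick reports "where top-2 predictions are close" is margin-based selection, sketched below. This is an illustrative assumption, not the project's actual selection rule; the threshold value, the function names, and the probability rows are all hypothetical.

```python
from typing import List, Sequence


def top2_margin(probs: Sequence[float]) -> float:
    """Difference between the largest and second-largest class probability."""
    ordered = sorted(probs, reverse=True)
    return ordered[0] - ordered[1]


def select_for_review(prob_rows: List[Sequence[float]],
                      margin_threshold: float = 0.1) -> List[int]:
    """Return indices of reports whose top-2 margin is below the threshold,
    ordered from the smallest margin (most ambiguous) upwards."""
    margins = [(top2_margin(row), idx) for idx, row in enumerate(prob_rows)]
    return [idx for margin, idx in sorted(margins) if margin < margin_threshold]


# Hypothetical class probabilities for three reports (three classes each).
prob_rows = [
    [0.90, 0.05, 0.05],  # confident prediction, not selected
    [0.48, 0.45, 0.07],  # two classes nearly tied
    [0.40, 0.35, 0.25],  # moderately ambiguous
]
selected = select_for_review(prob_rows, margin_threshold=0.1)
```

Reports returned first are the most ambiguous ones and would be the natural candidates for multi-label annotation by a human expert.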
Boosting innovation process using LLM agents
Abstract
PICC SOLUTION S.A. provides a solution for capitalizing on a company’s know-how. This is achieved through the use of Knowledge Maps (KMAPs), which capture and structure the organization’s knowledge. PICC also provides a framework to support the application of the TRIZ methodology in order to develop innovative solutions to problems identified through the KMAP. The goal of this project is to use Large Language Models (LLMs) to assist PICC users in applying the TRIZ methodology, enabling the rapid generation of innovative solutions. LLM assistance takes the form of suggested outputs for the different steps of the innovation process. These outputs can be either adapted by the user or directly used as input for the next step.
FoegBERT: Fine-Tuning a Multilingual Language Model for News Quality Assessment
Abstract
We present the ongoing development of FoegBERT, a fine-tuned language model for assessing journalistic quality in news media coverage across seven classification tasks. FoegBERT builds on a pre-trained model developed for the Swiss national languages. We fine-tuned the model using manually annotated data from a continuous research project on media quality.
Scientific Track
Reinforcement Learning for Latent-Space Thinking in LLMs
Abstract
Chain-of-Thought (CoT) reasoning typically utilizes the discrete language space for thinking, which is inherently inefficient, as many generated tokens only enforce linguistic rules that are not required for reasoning. To bypass this, latent-space thinking allows models to think using the continuous embedding space. While existing methods for training those models show domain-specific gains, they fail to maintain performance in complex tasks, such as mathematical reasoning. We experimentally demonstrate that the Coconut approach, a form of supervised fine-tuning for latent-space thinking, is highly sensitive to design choices and exhibits several inherent limitations. To address these issues, we investigate reinforcement learning (RL) techniques — an underexplored direction in latent-space thinking — including GRPO and design a novel Latent RL method for directly optimizing the latent thinking steps. Our experimental results reveal that these RL-trained models still lag behind traditional language-space CoT models in the mathematical reasoning domain. We make our codebase publicly available.
Enhancing Retrieval via Cognitively Motivated Document Expansion
Abstract
This study examines the potential of utilizing the capabilities of large language models (LLMs) to enhance performance in document retrieval tasks. Using human-written prompts based on the 5E Instructional Model from the field of educational psychology, we generate alternative versions of documents in a given corpus via an LLM, tapping into its vast knowledge base. These generated texts can then be used in retrieval tasks, complementing or replacing the original corpus before applying fusion algorithms to combine the results. While the generated texts individually do not outperform the original corpus, fusing retrieval results from multiple generated corpora with those of the original corpus often leads to performance improvements. This suggests that LLM-generated documents, while not a substitute for the original, can complement it to enhance retrieval performance.
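The abstract does not name the fusion algorithms used. As an illustration, the sketch below applies reciprocal rank fusion (RRF), one common choice, to rankings obtained from the original corpus and from two generated corpora. The document ids, the variable names, and the constant k = 60 are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for one query.
original = ["d1", "d2", "d3"]    # retrieved from the original corpus
expanded_a = ["d2", "d1", "d4"]  # retrieved from one generated corpus
expanded_b = ["d2", "d3", "d1"]  # retrieved from another generated corpus
fused = reciprocal_rank_fusion([original, expanded_a, expanded_b])
```

In this toy case d2 overtakes d1 after fusion because two of the three rankings place it first, which mirrors the idea that generated corpora can complement the original ranking rather than replace it.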
An Efficient Approach for Answering Not Readily Attainable Questions for RAG-based Applications
Abstract
Retrieval-augmented generation (RAG) is an established method for addressing challenges in applying large language models (LLMs), such as ensuring timeliness, incorporating domain-specific expertise, and minimizing hallucinations. However, the effective application of data-augmented LLMs remains challenging due to, e.g., reliance on retriever performance, token-limit restrictions for the input, or the inherent difficulty of global questions directed at large text corpora. Despite various efforts to address these challenges, there are still instances where finding correct answers to certain questions remains elusive. Moreover, as more modules are added to the RAG pipeline, its complexity and latency increase, so that the achieved performance improvements may become less practically significant. Based on these observations, we propose an efficient approach to addressing the issue of not readily attainable questions in a pragmatic way: by collecting questions with incorrectly generated answers, preparing the correct answers offline, and prepending a module for semantic search among the prepared question-answer pairs to the RAG system. If we consider a traditional RAG system an open-book exam, this QA search module can be likened to an open-question exam, similar to a driver's license test.
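The prepended QA search module can be sketched as a lookup that runs before the RAG pipeline and falls back to it when no prepared question is similar enough. In this minimal sketch, token-overlap (Jaccard) similarity stands in for the semantic search described above; the threshold, the function names, and the example QA pair are hypothetical.

```python
from typing import Callable, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of tokens."""
    return set(text.lower().replace("?", " ").split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for embedding similarity."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def answer(question: str,
           qa_pairs: List[Tuple[str, str]],
           rag_fallback: Callable[[str], str],
           threshold: float = 0.6) -> Tuple[str, str]:
    """Return (answer, source): a prepared answer if a stored question is
    similar enough, otherwise the output of the regular RAG pipeline."""
    best_q, best_a = max(qa_pairs, key=lambda qa: jaccard(question, qa[0]),
                         default=(None, None))
    if best_q is not None and jaccard(question, best_q) >= threshold:
        return best_a, "qa_store"
    return rag_fallback(question), "rag"


# Hypothetical prepared QA pair and a dummy RAG pipeline.
qa_pairs = [("How many modules does the course have?", "Seven.")]
rag = lambda q: "generated: " + q
hit = answer("How many modules does the course have", qa_pairs, rag)
miss = answer("What is RAG?", qa_pairs, rag)
```

Because the lookup is a single similarity search over a small store, it adds little latency compared with adding further modules inside the RAG pipeline itself.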
Concept Extraction and Webb’s Depth of Knowledge: Optimizing LLM Pipelines for Educational Assessment
Abstract
This study identifies an optimal LLM pipeline for automated exercise generation in higher education. We empirically compare two context preparation methods (Sliding Window vs. Concept Extraction) in combination with two instructional frameworks (Bloom’s Revised Taxonomy vs. Webb’s Depth of Knowledge). Through a mixed-methods evaluation with 21 university course coordinators, we find that Concept Extraction combined with Webb’s Depth of Knowledge yields the highest pedagogical quality, especially for technical disciplines. While human oversight remains necessary to mitigate out-of-scope hallucinations, these pipelines serve as efficient drafting engines for scalable, high-quality academic assessments.
Optimizing Large Language Models for Robust Domain-Specific Text-to-SQL: From Prompting to Preference Alignment
Abstract
This work explores the optimization of Large Language Models (LLMs) for the task of generating SQL queries from natural language (NL2SQL), a critical capability for democratizing access to domain-specific data. While recent benchmarks show promising results for LLMs, deployment in real-world analytical processing requires strict adherence to SQL grammar, deep domain understanding, and robustness against out-of-scope queries. We present a comprehensive study evaluating three stages of optimization: (1) advanced prompting strategies including Chain-of-Thought and multi-turn conversational handling; (2) constrained decoding to enforce syntactic validity; and (3) Reinforcement Learning with AI Feedback (RLAIF). We specifically compare Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO) using a novel reward modeling approach based on execution and semantic principles. Our results on the Spider and BIRD benchmarks reveal that while complex agentic prompting strategies can induce hallucinations in mid-sized models (7B-34B), monolithic alignment via ORPO provides a stable, efficient alternative to expensive inference-time scaling, offering a highly reproducible pipeline for adapting open-weights models to complex data environments.
Call Support Copilot: A Reproducible Multimodal System for Speech Emotion Recognition, Intent Understanding, and Agent Assistance
Abstract
We present Call Support Copilot, a reproducible multimodal system that integrates automatic speech recognition, speech emotion recognition, machine translation, spoken language understanding, and client knowledge retrieval in a single dashboard for customer support agents. Built from publicly accessible pretrained models and standard benchmarks, the system transcribes speech with Whisper-family ASR, detects caller affect in valence-arousal-dominance terms, classifies intents from a banking-domain inventory of 77 categories, and retrieves client records from a database. Evaluation shows strong component performance: 6.6\% word error rate on LibriSpeech, 91.7\% macro-F1 on SUPERB ER session1 (IEMOCAP subset, $n$=6), 42.98 BLEU for German–English translation, and 87.0\% accuracy on BANKING77 intent classification. End-to-end benchmarking of the core pipeline achieves faster-than-real-time throughput with mean real-time factor 0.67–0.71. All model identifiers, configurations, and evaluation scripts are documented in an accompanying repository, supporting reproducibility in line with the SwissText 2026 theme.
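The real-time factor (RTF) is the ratio of processing time to audio duration, so values below 1 mean faster-than-real-time processing. The sketch below averages per-utterance RTFs; whether the reported mean is computed per utterance or over total durations is an assumption here, and the timings are hypothetical.

```python
from typing import List, Tuple


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def mean_rtf(measurements: List[Tuple[float, float]]) -> float:
    """Average the per-utterance RTF over (processing_seconds, audio_seconds) pairs."""
    rtfs = [real_time_factor(p, a) for p, a in measurements]
    return sum(rtfs) / len(rtfs)


# Hypothetical timings: 6 s to process 10 s of audio, 16 s to process 20 s.
measurements = [(6.0, 10.0), (16.0, 20.0)]
avg = mean_rtf(measurements)
```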
Text vs. Phoneme Intermediates for Low-Resource Swiss German
Abstract
Building text-to-speech (TTS) systems for low-resource languages such as Swiss German is challenging due to limited paired data and the lack of standardized orthography. In practical Swiss settings, user input is typically written in High German, motivating pipelines that map High German text to Swiss German speech via an intermediate representation. We compare three approaches: (i) direct synthesis from High German (DE-TTS), (ii) High German$\rightarrow$Swiss German text translation followed by synthesis (CH-TTS), and (iii) High German$\rightarrow$phoneme conversion followed by synthesis (PH-TTS). Using the SwissDial dataset, we fine-tune two TTS backbones, SpeechT5 and Orpheus, and evaluate the systems with closed-loop STT metrics (WER/SacreBLEU) and human MOS. Objective transcript-overlap metrics reliably penalize PH-TTS but fail to reflect human preference between DE-TTS and CH-TTS. MOS consistently ranks CH-TTS highest for both backbones, with Orpheus achieving near-original quality and showing robustness when training data is halved; notably, under the half-data setting PH-TTS comes close to DE-TTS, suggesting that phoneme intermediates may be more competitive in lower-resource regimes. Our analysis indicates that current phoneme-based TTS is limited by noisy phoneme supervision and representation mismatch, and we outline directions to make phoneme intermediates competitive in low-resource dialect TTS.
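In a closed-loop evaluation, the synthesized audio is transcribed by an STT system and the transcript is compared with the input text. The sketch below computes word error rate (WER) as word-level edit distance divided by the reference length; it is a plain reference implementation rather than the evaluation code of the study, and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical input text and STT transcript of the synthesized speech.
wer_deletion = word_error_rate("das ist ein test", "das ist test")
wer_substitution = word_error_rate("das ist ein test", "das war ein test")
```

Because such overlap metrics only compare word sequences, they cannot capture dialect naturalness, which is consistent with the finding that they do not reflect human preference between DE-TTS and CH-TTS.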
Extending the Contact Hypothesis: Cross-Linguistic Evaluation of Religion and Nationality Bias When Prompting LLMs in German and Icelandic
Abstract
Large Language Models (LLMs) can reproduce social biases, yet many bias evaluations remain English-centric. We extend the Contact Hypothesis framework presented in previous work to German and Icelandic, focusing on religion and nationality. Evaluating GPT models (3.5, 4, 4-turbo, 4o, 5), we find that positive contact reduces bias in the answers of the LLMs, while negative contact amplifies it, with cross-linguistic differences in magnitude and salience. Our results support the cross-linguistic robustness of contact-based probing and underscore the need for culturally contextualized evaluations. In addition to these insights, our contribution includes the dataset, which is made available on GitHub for further research.
Robust Language Identification for Romansh Varieties
Abstract
The Romansh language has several regional varieties, called *idioms*, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Data and models will be released upon acceptance.
Skill Extraction from Resumes and Job Offers across Six Languages
Abstract
We comprehensively evaluate multiple skill extraction approaches, including rule-based, semantic, and supervised methods, using resumes and job offers in English, French, German, Italian, Spanish, and Portuguese. Due to inherent privacy concerns in Human Resources (HR) data and the high cost of manual annotations, research on identifying relevant skills for the job market remains limited, often restricted to specific domains, datasets, and entity types, and is available in only a few languages. In the context of an industrial project, we have annotated 1,200 job offers and resumes across diverse domains and six languages, through a multidisciplinary collaboration among HR researchers, NLP researchers, and HR tech professionals. Our evaluation assesses the effectiveness of these systems in a multilingual, multidomain setting, capturing both standardized job offers and highly variable resumes. The results show that supervised models achieve F1 scores of up to 0.6, while rule-based methods offer better interpretability. Furthermore, we find large differences between how skills are formulated in job offers and in resumes, the latter being understudied in academic research.
RUMLEM: A Dictionary-Based Lemmatizer for Romansh
Abstract
Lemmatization – the task of mapping an inflected word form to its dictionary form – is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77–84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
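The abstract does not spell out how the variety is identified. One plausible reading is sketched below: a text is assigned to the variety whose database covers the largest share of its tokens. This is an assumption, and the lexicon contents and variety names are placeholders rather than real Romansh data.

```python
from typing import Dict, List, Set, Tuple


def coverage(tokens: List[str], lexicon: Set[str]) -> float:
    """Share of tokens that are found in the lexicon."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in lexicon) / len(tokens)


def classify_variety(text: str,
                     lexicons: Dict[str, Set[str]]) -> Tuple[str, Dict[str, float]]:
    """Pick the variety with the highest coverage (ties: alphabetical order)."""
    tokens = text.lower().split()
    scores = {name: coverage(tokens, lexicon) for name, lexicon in lexicons.items()}
    best = max(sorted(scores), key=lambda name: scores[name])
    return best, scores


# Placeholder lexicons: "form1" etc. stand in for real word forms.
lexicons = {
    "variety_a": {"form1", "form2", "form3"},
    "variety_b": {"form3", "form4"},
}
label, scores = classify_variety("form1 form2 form3 unknown", lexicons)
```

A minimum-coverage threshold on the best score could, under the same assumption, serve as a simple Romansh vs. non-Romansh decision.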
Graph-Augmented LLMs for Swiss MP Ideology Prediction
Abstract
Approximating the ideological position of Members of Parliament (MPs) is a fundamental task in political science, helping researchers understand legislative behavior, party alignment, and policy preferences. While Large Language Models (LLMs) have shown promising results in estimating MPs’ ideological stances, there are more actors and elements in the parliamentary system, and relations between them, that could provide a wider and more informative picture. However, due to the complexity of integrating them in the prediction task, these additional elements are generally ignored. In this work, we propose an LLM framework, \textit{PG-RAG}, that implements a retrieval-augmented generation pipeline: it first queries a political knowledge graph (KG) and then integrates the resulting graph-structured information into the context. This allows for capturing both textual semantics and inter-MP relationships, another relevant information source in any parliamentary system. We evaluate the approach on the task of ideology prediction, using data from a Swiss parliamentary dataset. When comparing graph-augmented models against several state-of-the-art baselines, the results demonstrate that incorporating this enriched information, which encodes information about different entities and relations, improves prediction performance. These results help to highlight the value of domain-specific relational information in modeling political behavior.
Extracting Article-Level Legal Dependencies from Swiss Federal Law using LLMs
Abstract
Understanding dependencies between legal provisions is essential for analyzing statutory corpora; yet, such relationships are rarely available in machine-readable form. We present a hybrid pipeline for extracting article-level dependencies from Swiss federal legislation on Fedlex, combining deterministic XML preprocessing with large language model (LLM)–based semantic resolution. Additionally, we release three complementary data splits—document-level JSON, structured citation candidates, and LLM-based article assignments—to support downstream legal NLP research. We evaluate our approach on 2,103 SR documents, yielding over 63,000 citation instances. While LLMs are effective at resolving semantically complex references, we observe substantial limitations in structured output reliability: approximately 21\% of generated items violate the expected schema, with most errors being unrecoverable. Our findings highlight a key challenge in applying LLMs to structured legal information extraction and provide a new resource for tasks such as legal knowledge graph construction, citation analysis, and benchmarking structured prediction in the legal domain.
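A schema-violation rate of the kind reported above can be measured by validating every generated item against the expected structure. The sketch below shows one way to do this; the two field names are hypothetical stand-ins and not the schema used in the released data.

```python
import json
from typing import List

# Hypothetical expected schema: field name -> required Python type.
REQUIRED_FIELDS = {"source_article": str, "target_article": str}


def is_valid_item(raw: str) -> bool:
    """True if raw is a JSON object containing all required fields with the right types."""
    try:
        item = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(item, dict):
        return False
    return all(isinstance(item.get(field), expected)
               for field, expected in REQUIRED_FIELDS.items())


def violation_rate(raw_items: List[str]) -> float:
    """Share of generated items that violate the expected schema."""
    if not raw_items:
        return 0.0
    invalid = sum(1 for raw in raw_items if not is_valid_item(raw))
    return invalid / len(raw_items)


# Hypothetical LLM outputs: one valid item and three kinds of violation.
outputs = [
    '{"source_article": "Art. 1", "target_article": "Art. 5"}',
    '{"source_article": "Art. 2"}',  # missing field
    'not json',                      # unparseable output
    '["Art. 3"]',                    # wrong top-level type
]
rate = violation_rate(outputs)
```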
Data Augmentation for Historical NER: A Systematic Comparison of Lexical and LLM-based Approaches
Abstract
Named Entity Recognition (NER) on historical materials suffers significant performance degradation relative to modern text, owing to optical character recognition (OCR) errors, language evolution, and scarce annotated training data. Although various remedies have been explored to increase robustness and generalisation, data augmentation techniques, despite their proven effectiveness on modern NER benchmarks, remain largely unexplored in the historical setting. This article investigates data augmentation strategies for historical NER through a systematic comparison of two complementary approaches: intrinsic augmentation via mention replacement and extrinsic augmentation through large language model (LLM)-based corpus annotation. We experiment across varying augmentation variants and corpus sizes on French and German Swiss historical newspapers. Our results show contrasting behaviors: mention replacement yields stable improvements across settings, whereas LLM-based silver data is most useful at moderate scale and when quality-filtered, and degrades as additional pseudo-labeled data is introduced. Overall, simple lexical augmentation emerges as the more robust strategy for historical NER, while LLM-based approaches remain sensitive to annotation noise and data shift.
Code-Switching Detection in Multilingual Child Speech with SwissBERT
Abstract
Code‑switching is widespread in multilingual speech, yet its automatic detection remains challenging, especially for low‑resource languages. In Switzerland, a multilingual context with multiple languages and Swiss German varieties, these challenges are amplified by variable orthography and limited annotated data. We present a supervised word‑level language-identification system for code‑switching detection in multilingual everyday child and adult speech, obtained by fine‑tuning SwissBERT. We constructed a multilingual dataset of four languages and an “other” category, implemented controlled subword–label alignment, and evaluated performance using token‑level F1. To contextualize SwissBERT’s performance, we additionally fine‑tuned mBERT as a multilingual baseline. SwissBERT achieves robust word‑level predictions and outperforms mBERT. We release the full training pipeline and evaluation scripts to facilitate reproducibility.
Which Skills Debate Reaches the Public? Comparing Scientific Literature and Media Coverage of AI and LLM Skill Impacts (2022–2025)
Abstract
We compare scientific review literature and Swiss multilingual media coverage on how large language models (LLMs) affect skills in education and work from 2022 to 2025. Our corpus consists of 246 English review documents (168 education, 78 workplace) and 4,610 Swissdox news paragraphs (2,915 German, 1,695 French). Using a reproducible pipeline that combines conceptual mapping and BERTopic modeling, we find a sharp divergence in thematic structure. In the education reviews, a single pedagogical topic accounts for 88.69% of the corpus; in workplace reviews, it remains dominant at 51.28%. In Swiss media, however, the dominant topic is a broad and generalized AI-skills discourse (52.99%), while the education-centered topic accounts for only 1.52%. Conceptual maps show in greater detail that the media coverage foregrounds AI capabilities, job loss, and replacement, while giving limited attention to themes central in the literature, including AI literacy, reflective use, metacognition, and pedagogical integration. Overall, the results indicate that media discourse recontextualizes scientific debates into simplified, product-centered narratives, frequently using "ChatGPT" as shorthand for AI more generally.
The Same Email, Signed Differently: Testing Negotiation Bias and Recommendation Stability in LLMs
Abstract
Large language models (LLMs) are increasingly involved in hiring communication, both as tools that help applicants draft negotiation emails and as systems used to evaluate them. Such mediation risks introducing variability and hidden dependencies into high-stakes outcomes like salary expectations and hiring decisions. We study this bidirectional setting with a two-stage analysis across providers and English/German contexts, using 2,880 Stage~1 observations and 1,441 paired Stage~2 evaluations. We find no strong or consistent pooled gender effects. Instead, provider differences dominate, while scalar ratings are stable on average and categorical recommendations are less robust.
A Bounded Coordination-Support Capability for Multi-Party Settings: Task-State Monitoring in Firefighter Incident Command
Abstract
Many collaboration settings require digital support systems for several humans who coordinate through ongoing communication. We study one such application in firefighter incident command: a dashboard that monitors the state of predefined operational tasks from radio transcripts. Building such a dashboard raises a practical design question: how much transcript structure is actually needed for LLM-based task-state monitoring? More specifically, we examine whether additional transcript structure materially improves monitoring performance, even though it is difficult to obtain reliably from radio communication and increases complexity and latency. We evaluate this question on source-grounded synthetic firefighter scenarios under transcript conditions that vary speaker identity and utterance boundaries, with incremental inference as the deployment-facing condition and full-transcript inference as an offline reference. Across repeated runs, incremental monitoring remains strong across all transcript conditions. Differences between transcript structures are small, continuous transcripts remain competitive, and the main weaknesses are identifying the responsible unit and capturing the completion result, which remain broadly similar across conditions. These results suggest that for this bounded dashboard-support capability, neither speaker identities nor semantically precise utterance boundaries are a primary requirement in the controlled setting studied here.
Can Large Language Models Replace Statistical Software?
Abstract
Statistical hypothesis testing is a cornerstone of evidence-based medicine and clinical research. Despite its central importance, previous research has consistently shown substantial deficits in statistical literacy among healthcare professionals. At the same time, large language models (LLMs) have demonstrated remarkable capabilities in scientific reasoning and data analysis. This study examines whether LLMs can serve as viable substitutes for conventional statistical software in guiding users through the selection, execution, and interpretation of hypothesis tests. Using a standardized prompt based on real survey data on the association between kick-scooter riding and knee pain in children, we evaluated seven LLMs and compared their outputs with statistical software results. Our findings indicate that none of the evaluated models can currently be considered a viable substitute. Although all models selected the appropriate test, substantial variation was observed in the quality of their explanations and in test execution. Gemini 3.1 Pro Preview, Claude Opus 4.6, and ChatGPT 5.4 Thinking performed strongly in test selection and result interpretation, with Gemini producing the most structured responses. However, none matched statistical software's result in test execution.
Automated German Alt Text Generation for News Charts
Abstract
This paper investigates whether a Multimodal Large Language Model (MLLM) can automatically generate well-formed German chart alt texts that meet the requirements of visually impaired persons, follow the accessibility guidelines for chart alt texts, and match the quality of manually authored gold standards. Focusing on bar, line, and stacked bar charts from the Neue Zürcher Zeitung, we define an alt text structure, construct a gold standard corpus, and evaluate MLLM-generated chart alt texts in terms of clarity, conciseness, meaningfulness, and output consistency.
Corpus Track
A Dataset of Latin Etymologies Extracted from Wiktionary
Abstract
We present a curated resource of Latin etymological chains automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology. We also present the Python pipeline the data was generated with, as it can be reused to extract Wiktionary’s etymologies for other languages. The etymology chains cover Latin words and their attested or reconstructed ancestors in languages such as Proto-Indo-European, Proto-Italic, Ancient Greek, Hebrew, Egyptian, and others. To address the structural noise and editorial heterogeneity of Wiktionary etymology data, we have introduced strong rule-based filters throughout the pipeline, especially in the curation stage. After validation, the resulting dataset contains 9,684 curated etymological chains, which can be used to support research in Historical Linguistics, Computational Etymology and language learning, among other applications.
How Good is AI on Swiss Voting Booklets? A Multilingual OCR and Alignment Benchmark
Abstract
Swiss federal voting booklets are an interesting resource for natural language processing due to their high editing standards and coverage of the four national languages of Switzerland (German, French, Italian, and Romansh Grischun). In this paper, we present VotingBooklets, an automatically extracted and aligned dataset, as well as VotingBooklets-Diamond, a subset that was manually corrected and verified by multiple annotators. We use the latter to benchmark a range of open and closed AI systems on two interdependent tasks: optical character recognition (OCR) and cross-lingual text alignment. Gemini 2.5 Flash Lite achieves the best OCR performance across all conditions, while a hybrid alignment approach using Sentence-SwissBERT for initial embedding-based alignment and Gemini for targeted post-hoc correction of low-confidence pairs yields the most accurate results. Applying these systems to the full collection of Swiss federal voting booklets, we release a large-scale four-language parallel corpus as a resource for low-resource NLP, multilingual representation learning, and the computational study of Swiss political discourse.
Demo Track
Text Lab: A Zero-Footprint, GUI-Driven NLP Suite for HPC Environments
Abstract
The rapid adoption of NLP tools in academia is hindered by data privacy regulations. Researchers handling sensitive or protected personal data, such as medical records or sociological interviews, cannot ethically use commercial cloud-based AI APIs. While local HPC clusters offer secure compute, their steep technical learning curve poses a barrier for non-technical researchers. We present Text Lab, a production-ready AI suite deployed via Open OnDemand on the University of Bern’s UBELIX supercomputer. Text Lab provides a "Zero-Footprint" architecture with an intuitive Streamlit GUI, bridging the gap between secure HPC infrastructure and NLP accessibility. Core features include: (1) Batch Transcription: utilizing WhisperX and a custom fine-tuned model optimized specifically for Swiss German dialects; (2) OCR: integrating EasyOCR, PaddleOCR, and local Vision-Language Models (GLM-OCR, OlmOCR) for digitizing complex archival and multi-column PDFs; (3) Sandboxed AI Data Visualization: an agentic Model Context Protocol (MCP) deployment allowing local LLMs to generate and execute Python visualization code securely. Operating on a strict session-based paradigm, Text Lab utilizes shared read-only model caches while executing all tasks in isolated, self-cleaning temporary directories. We will demonstrate live batch transcription and complex OCR extraction workflows during the session.
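The session-based, self-cleaning execution model mentioned above can be sketched with Python's standard `tempfile` module. This is a minimal illustration under stated assumptions: `run_in_session` and `word_count` are hypothetical names, and the sketch does not reflect how Text Lab is actually implemented.

```python
import tempfile
from pathlib import Path

def run_in_session(task, input_text):
    """Run `task` inside a fresh temporary directory that is deleted afterwards.

    `task` is a hypothetical callable that receives the working directory;
    only its return value survives the session.
    """
    with tempfile.TemporaryDirectory(prefix="session_") as workdir:
        work = Path(workdir)
        (work / "input.txt").write_text(input_text, encoding="utf-8")
        result = task(work)
    # The directory and everything written into it are gone at this point.
    return result, work

def word_count(work):
    """Toy task: count words and leave an intermediate file behind."""
    text = (work / "input.txt").read_text(encoding="utf-8")
    (work / "scratch.tmp").write_text("intermediate data", encoding="utf-8")
    return len(text.split())

count, used_dir = run_in_session(word_count, "sensitive interview transcript")
```

Because the context manager removes the directory even if the task raises an exception, no input or intermediate file outlives the session.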
Human Evaluation of Translation Made Trivial
Abstract
Human evaluation is the gold standard for multilingual NLP, but it is often skipped in practice and replaced with automatic metrics, because setting it up with existing tools is notoriously complex and slow and carries substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and provides support for evaluating multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and is extensible to allow prototyping of new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and active learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.
GRAINS: A Live Demo of Explainable Graph-Augmented RAG for Cleantech Innovation Scouting
Abstract
GRAINS is an evidence-grounded NLP system for cleantech innovation scouting across heterogeneous sources such as media articles, patent abstracts, and scholarly topic taxonomies. The demo shows how hybrid retrieval (BM25 + dense), optional graph expansion, cross-encoder reranking, and diversification can be combined with grounded follow-up answering to support explainable technology scouting. Users can issue a query, inspect retrieved evidence, view source metadata such as CPC classes and OpenAlex topics, and compare retrieval settings in an interactive interface. The system is designed for analysts who need auditable recommendations rather than score-only outputs, with a concrete use case in battery recycling and second-life battery value chains. In addition to the interface, the demo highlights the underlying end-to-end architecture and the evaluation setup used to assess retrieval quality, diversity, and grounding reliability. The main contribution of the demo is the fusion of heterogeneous domain sources in a joint decision-making process that supports innovation scouting in an explainable and traceable fashion. We provide a short demo here: https://andrigerber.dev/demo/thesis. For more technical details, please consult https://andrigerber.github.io/MT/.
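One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below shows the idea on toy document ids. The abstract does not state which fusion method GRAINS uses, so RRF, the function name, and the ids are assumptions made for illustration only.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a lexical and a dense retriever.
bm25_ranking = ["patent_7", "article_2", "article_9"]
dense_ranking = ["article_2", "topic_4", "patent_7"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that appear near the top of both lists rise in the fused ranking, which is why rank-based fusion is often preferred when the two retrievers produce scores on incomparable scales.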
GOOSVC: A Version Control System for Collaborative Human–AI Workflows
Abstract
AI-assisted projects increasingly involve interactions with multiple AI assistants, parallel exploration of alternatives, and the continuous creation and reuse of artefacts such as texts, images, audio, or video. Current AI interfaces support prompting and conversation, but they provide only limited means to manage several assistant contexts within one project, control the exchange of artefacts between them, reconstruct how a specific result was produced, or revisit alternative development paths in a systematic way. With GOOSVC, we presented the core elements of a version control system dedicated to such applications that addresses this gap. This demonstration presents the complete GOOSVC system with a particular focus on its collaborative user interface for direct human–AI project work. Users can manage projects with multiple parallel conversations involving different AI assistants, organize generated and imported artefacts in a file-system-like structure, inspect project-wide branches, and audit the context in which each artefact version was created. Artefacts can be shared across assistant contexts while preserving traceability, and alternative paths can be explored at project level. In addition, we show how GOOSVC can be integrated into existing AI-enabled applications as a backend for versioning interactions and artefacts. Compared with current approaches, GOOSVC considerably simplifies the coordination of multiple assistants for AI-assisted production workflows.
Making Disagreement Explicit: Multi-Perspective Reasoning in AI Systems
Abstract
In NLP, disagreement in annotation is often treated as noise or model error, reflecting the assumption that rational evaluation reduces to consensus and collapsing divergence into a single ground truth. However, in political or ethical domains, disagreement can reflect divergent definitions, assumptions or inferential norms, rather than irrationality. For instance, 'taxation is theft' may be judged true in one worldview and false in another, with both positions internally consistent. Such cases suggest that disagreement may reflect structured perspectival divergence rather than error. This raises the challenge of how to develop and evaluate systems that make reasoning across multiple worldviews explicit. To explore this question, we developed prototype systems that generate multi-perspective analyses. The systems produce branching reasoning paths for different 'conceptual teams', which we take to be agents in agreement about the meaning of contested concepts, and make their assumptions explicit, instead of collapsing interpretations into a single label. Expert assessment and user studies suggest that this increases clarity about sources and forms of disagreement, while also revealing a challenge: visually symmetric representations may inadvertently signal epistemic equivalence.
Similar, but why? A Toolkit for Explaining Text Similarity
Abstract
Explaining text similarity and developing interpretable models are emerging research challenges. We release XPLAINSIM, a Python package that unifies three complementary approaches for explaining textual similarity in an easily accessible way: 1. a token attribution method that explains how individual word interactions contribute to the predicted similarity of any embedding model; 2. a method for inferring structured neural embedding spaces that capture explainable aspects of text; and 3. a symbolic approach that explains textual similarity transparently through parsed meaning representations. We demonstrate the value of our package through intuitive examples and three focused empirical research studies. The first study evaluates interpretability methods for constructing cross-lingual token alignments. The second investigates how modern information retrieval methods handle stop words. The third sheds more light on a long-standing question in computational linguistics: the distinction between relatedness and similarity. XPLAINSIM is available at https://github.com/flipz357/XPLAINSIM.
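The intuition behind attributing a similarity score to token interactions can be shown in a deliberately simplified special case: when sentence embeddings are the mean of token vectors and similarity is a dot product, the score decomposes exactly into a sum over token pairs. The sketch below is not the method implemented in XPLAINSIM and does not use its API. All names and the toy vectors are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_pool(token_vecs):
    """Average a list of token vectors into one sentence vector."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]

def pairwise_attributions(tokens_a, tokens_b):
    """Contribution of each token pair to the mean-pooled dot-product similarity.

    Entry [i][j] equals dot(a_i, b_j) / (len_a * len_b); all entries
    sum to dot(mean_pool(a), mean_pool(b)).
    """
    scale = 1.0 / (len(tokens_a) * len(tokens_b))
    return [[scale * dot(a, b) for b in tokens_b] for a in tokens_a]

# Toy token vectors for two short texts.
tokens_a = [[1.0, 0.0], [0.0, 2.0]]
tokens_b = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

similarity = dot(mean_pool(tokens_a), mean_pool(tokens_b))
attributions = pairwise_attributions(tokens_a, tokens_b)
total = sum(sum(row) for row in attributions)
```

Real embedding models are nonlinear, so an exact decomposition like this one does not hold in general, and more elaborate attribution techniques are needed there.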
Glatteis: A modular multilingual geoparser
Abstract
Glatteis is a Python package built for geoparsing: extracting place names from text and associating them with real-world locations. While dozens of excellent geoparsing tools exist already, Glatteis has two unique features: multilingual modularity and extensibility under the same framework, and a bring-your-own-geodata approach. Different methods of place name recognition (e.g. Reading, UK vs. Reading a book) and resolution (e.g. Zurich, CH vs. Zurich, Montana) are better suited for different languages and media types, and Glatteis allows users to mix and match between SpaCy, StanzaNLP, or a local LLM for each step. Along with a preset location list, users can add their own geodata, which is helpful if, for example, a user is certain that the texts to be examined will include many Swiss Gemeinden. The output of Glatteis can be set to be either a simple list of place names or a subset of the geographic data provided. This eases cases where the output of the geoparser feeds into a larger spatial analysis workflow. This geoparser was previously applied to a dataset of German-, French- and Italian-language Swiss news articles (Vogler et al., 2023a; Vogler et al., 2023b), finding a declining number of unique place names in the articles and more mentions per article. Earlier versions were used to examine Chinese (Weston and Rauchfleisch, 2021) and Indonesian language texts.
Supporting AI discovery with the AI Radar
Abstract
Many companies struggle with identifying suitable use cases for artificial intelligence. Their discovery efforts are often biased toward familiar solutions, limited internal experience, or rapidly changing technological trends. As a result, companies frequently overlook large parts of their potential AI opportunity space. To address this challenge, we introduce the AI Radar, a system designed to support systematic AI use case discovery. The system continuously extracts real-world AI implementations from public data sources through a dynamic pipeline that identifies, structures, and enriches case information. Each use case is classified using a curated domain taxonomy covering AI solution types, business functions, industries, and other contextual attributes. The resulting dataset currently includes around 250 structured AI use cases and over 2,000 real-world implementations across industries. The use cases can be explored through a hybrid user interface: a dashboard enables filtering, searching, and systematic exploration of AI use cases, while a conversational interface allows users to adapt discovery results to their specific organizational context and explore tailored AI opportunities. A preview of the AI Radar can be found here: https://ai-radar.app. In this demo, we will present the data pipeline, the underlying taxonomy and dataset, and the hybrid user experience that enables organizations to systematically explore the AI landscape, supporting more informed AI opportunity discovery.