Computation and Language 115
☆ Language Generation with Infinite Contamination
We study language generation in the limit, where an algorithm observes an
adversarial enumeration of strings from an unknown target language $K$ and must
eventually generate new, unseen strings from $K$. Kleinberg and Mullainathan
[KM24] proved that generation is achievable in surprisingly general settings.
But their generator suffers from "mode collapse," producing from an
ever-smaller subset of the target. To address this, Kleinberg and Wei [KW25]
require the generator's output to be "dense" in the target language. They
showed that generation with density, surprisingly, remains achievable at the
same generality.
Both results assume perfect data: no noisy insertions and no omissions. This
raises a central question: how much contamination can generation tolerate?
Recent works made partial progress on this question by studying (non-dense)
generation with either finite amounts of noise (but no omissions) or omissions
(but no noise).
We characterize robustness under contaminated enumerations:
1. Generation under Contamination: Language generation in the limit is
achievable for all countable collections iff the fraction of contaminated
examples converges to zero. When this fails, we characterize which collections
are generable.
2. Dense Generation under Contamination: Dense generation is strictly less
robust to contamination than generation. As a byproduct, we resolve an open
question of Raman and Raman [ICML25] by showing that generation is possible
with only membership oracle access under finitely many contaminated examples.
Finally, we introduce a beyond-worst-case model inspired by curriculum
learning and prove that dense generation is achievable even with infinite
contamination provided the fraction of contaminated examples converges to zero.
This suggests curriculum learning may be crucial for learning from noisy web
data.
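As a rough illustration of the threshold in the abstract above, the sketch below shows an enumeration with infinitely many contaminated examples whose contaminated fraction still converges to zero. The choice of contaminating exactly the perfect-square positions is an assumption made here for illustration, and the paper's formal definition of the fraction may differ.

```python
import math

def contaminated_fraction(n: int) -> float:
    """Fraction of the first n positions that are contaminated.

    Illustrative assumption: position i (1-indexed) is contaminated
    exactly when i is a perfect square, so there are isqrt(n)
    contaminated positions among the first n.
    """
    return math.isqrt(n) / n

# The count of contaminated positions grows without bound,
# yet the fraction shrinks like 1 / sqrt(n).
for n in (100, 10_000, 1_000_000):
    print(n, math.isqrt(n), contaminated_fraction(n))
```

Under this toy schedule the contamination is infinite but the fraction tends to zero, which is the regime the abstract describes as still allowing generation.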
☆ DigiData: Training and Evaluating General-Purpose Mobile Control Agents
Yuxuan Sun, Manchen Wang, Shengyi Qian, William R. Wong, Eric Gan, Pierluca D'Oro, Alejandro Castillejo Munoz, Sneha Silwal, Pedro Matias, Nitin Kamra, Satwik Kottur, Nick Raines, Xuanyi Zhao, Joy Chen, Joseph Greer, Andrea Madotto, Allen Bolourchi, James Valori, Kevin Carlberg, Karl Ridgeway, Joseph Tighe
AI agents capable of controlling user interfaces have the potential to
transform human interaction with digital devices. To accelerate this
transformation, two fundamental building blocks are essential: high-quality
datasets that enable agents to achieve complex and human-relevant goals, and
robust evaluation methods that allow researchers and practitioners to rapidly
enhance agent performance. In this paper, we introduce DigiData, a large-scale,
high-quality, diverse, multi-modal dataset designed for training mobile control
agents. Unlike existing datasets, which derive goals from unstructured
interactions, DigiData is meticulously constructed through comprehensive
exploration of app features, resulting in greater diversity and higher goal
complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating
mobile control agents on real-world complex tasks. We demonstrate that the
commonly used step-accuracy metric falls short in reliably assessing mobile
control agents and, to address this, we propose dynamic evaluation protocols
and AI-powered evaluations as rigorous alternatives for agent assessment. Our
contributions aim to significantly advance the development of mobile control
agents, paving the way for more intuitive and effective human-device
interactions.
comment: Website: https://facebookresearch.github.io/DigiData
☆ SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
We introduce SPOT (Stopping Points in Online Threads), the first annotated
corpus translating the sociological concept of stopping point into a
reproducible NLP task. Stopping points are ordinary critical interventions that
pause or redirect online discussions through a range of forms (irony, subtle
doubt or fragmentary arguments) that frameworks like counterspeech or social
correction often overlook. We operationalize this concept as a binary
classification task and provide reliable annotation guidelines. The corpus
contains 43,305 manually annotated French Facebook comments linked to URLs
flagged as false information by social media users, enriched with contextual
metadata (article, post, parent comment, page or group, and source). We
benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs
under various prompting strategies. Results show that fine-tuned encoders
outperform prompted LLMs in F1 score by more than 10 percentage points,
confirming the importance of supervised learning for emerging non-English
social media tasks. Incorporating contextual metadata further improves encoder
models' F1 scores from 0.75 to 0.78. We release the anonymized dataset, along
with the annotation guidelines and code in our code repository, to foster
transparency and reproducible research.
☆ SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards NeurIPS 2025
Multimodal large language models (MLLMs) have achieved remarkable progress in
vision-language tasks, but they continue to struggle with spatial
understanding. Existing spatial MLLMs often rely on explicit 3D inputs or
architecture-specific modifications, and remain constrained by large-scale
datasets or sparse supervision. To address these limitations, we introduce
SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial
grounding with multi-step reasoning. The model simulates human-like spatial
perception by constructing a scene graph of task-relevant objects and spatial
relations, and reasoning towards an answer via dense spatial rewards.
SpatialThinker consists of two key contributions: (1) a data synthesis pipeline
that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL
with a multi-objective dense spatial reward enforcing spatial grounding.
SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline
on spatial understanding and real-world VQA benchmarks, nearly doubling the
base-model gain compared to sparse RL, and surpassing GPT-4o. These results
showcase the effectiveness of combining spatial supervision with reward-aligned
reasoning in enabling robust 3D spatial understanding with limited data and
advancing MLLMs towards human-level visual reasoning.
comment: Preprint. Accepted at NeurIPS 2025 Workshops on SPACE in Vision,
Language, and Embodied AI (SpaVLE), Embodied World Models for Decision Making
(EWM), Aligning Reinforcement Learning Experimentalists and Theorists
(ARLET), and Scaling Environments for Agents (SEA)
☆ ConvFill: Model Collaboration for Responsive Conversational Voice Agents
Deploying conversational voice agents with large language models faces a
critical challenge: cloud-based foundation models provide deep reasoning and
domain knowledge but introduce latency that disrupts natural conversation,
while on-device models respond immediately but lack sophistication. We propose
conversational infill, a task where a lightweight on-device model generates
contextually appropriate dialogue while seamlessly incorporating streaming
knowledge from a powerful backend model. This approach decouples response
latency from model capability, enabling systems that feel responsive while
accessing the full power of large-scale models. We present ConvFill, a 360M
parameter model trained on synthetic multi-domain conversations. Evaluation
across multiple backend models shows that conversational infill can be
successfully learned, with ConvFill achieving accuracy improvements of 36-42%
over standalone small models of the same size while consistently retaining
sub-200ms response latencies. Our results demonstrate the promise of this
approach for building on-device conversational agents that are both immediately
responsive and knowledgeable.
☆ Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction
In da Vinci robotic surgery, surgeons' hands and eyes are fully engaged in
the procedure, making it difficult to access and manipulate multimodal patient
data without interruption. We propose a voice-directed Surgical Agent
Orchestrator Platform (SAOP) built on a hierarchical multi-agent framework,
consisting of an orchestration agent and three task-specific agents driven by
Large Language Models (LLMs). These LLM-based agents autonomously plan, refine,
validate, and reason to map voice commands into specific tasks such as
retrieving clinical information, manipulating CT scans, or navigating 3D
anatomical models on the surgical video. We also introduce a Multi-level
Orchestration Evaluation Metric (MOEM) to comprehensively assess the
performance and robustness from command-level and category-level perspectives.
The SAOP achieves high accuracy and success rates across 240 voice commands,
while LLM-based agents improve robustness against speech recognition errors and
diverse or ambiguous free-form commands, demonstrating strong potential to
support minimally invasive da Vinci robotic surgery.
comment: 22 pages, 12 figures, 1 table, Supplementary Information,
Supplementary Data 1
☆ Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence
Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum
Recent advances in depth-recurrent language models show that recurrence can
decouple train-time compute and parameter count from test-time compute. In this
work, we study how to convert existing pretrained non-recurrent language models
into depth-recurrent models. We find that using a curriculum of recurrences to
increase the effective depth of the model over the course of training preserves
performance while reducing total computational cost. In our experiments, on
mathematics, we observe that converting pretrained models to recurrent ones
results in better performance at a given compute budget than simply
post-training the original non-recurrent language model.
comment: code: https://github.com/mcleish7/retrofitting-recurrence, models:
https://huggingface.co/collections/tomg-group-umd/retrofitting-recurrence
☆ Retriv at BLP-2025 Task 2: Test-Driven Feedback-Guided Framework for Bangla-to-Python Code Generation
Large Language Models (LLMs) have advanced the automated generation of code
from natural language prompts. However, low-resource languages (LRLs) like
Bangla remain underrepresented due to the limited availability of
instruction-to-code datasets and evaluation benchmarks. To address this, the
BLP Workshop at IJCNLP-AACL 2025 introduced a shared task on "Code Generation
in Bangla". In this work, we propose a method that combines instruction
prompting with a test-driven, feedback-guided iterative refinement process
using a fine-tuned Qwen2.5-14B model. The model generates code from Bangla
instructions, tests it against unit tests, and iteratively refines any failing
outputs through three evaluation passes, using test feedback to guide each
step. This approach helped our team "Retriv" to secure 2nd place in the shared
task with a Pass@1 score of 0.934. The analysis highlights challenges in Bangla
instruction understanding and Python code generation, emphasizing the need for
targeted methods in LRLs. We made experimental scripts publicly available for
the community.
comment: 8 pages, 1 figure, experimental scripts publicly available at
https://github.com/NafiAsib/Retriv-BLP25-Task-2
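The test-driven, feedback-guided loop described in the abstract above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `run_tests`, `refine`, and `toy_generate` are hypothetical names, the generator is a stand-in for the fine-tuned model, and the cap of three passes mirrors the abstract's wording only loosely.

```python
from typing import Callable, List, Tuple

def run_tests(code: str, tests: List[str]) -> List[str]:
    """Load candidate code, run each assert-style test, return failure messages."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # only safe for trusted toy code
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]
    failures = []
    for test in tests:
        try:
            exec(test, namespace)
        except Exception as exc:
            failures.append(f"{test} -> {exc!r}")
    return failures

def refine(
    instruction: str,
    tests: List[str],
    generate: Callable[[str, List[str]], str],
    max_passes: int = 3,
) -> Tuple[str, bool]:
    """Generate code, test it, and feed failures back for up to max_passes."""
    feedback: List[str] = []
    code = ""
    for _ in range(max_passes):
        code = generate(instruction, feedback)
        feedback = run_tests(code, tests)
        if not feedback:
            return code, True
    return code, False

def toy_generate(instruction: str, feedback: List[str]) -> str:
    """Hypothetical stand-in for the model: buggy first, fixed once it sees feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

code, passed = refine("Add two numbers", ["assert add(2, 3) == 5"], toy_generate)
print(passed)
```

In a real system the untrusted generated code would need to run in a sandbox rather than through a bare `exec`.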
☆ Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains
Large language models (LLMs) have achieved remarkable success across
widespread tasks, yet their application in low-resource domains remains a
significant challenge due to data scarcity and the high risk of overfitting.
While in-domain data is limited, there exist vast amounts of similar
general-domain data, and our initial findings reveal that they could
potentially serve as auxiliary supervision for domain enhancement. This
observation leads us to our central research question: how to
effectively select the most valuable auxiliary data to maximize domain-specific
performance, particularly when traditional methods are inapplicable due to a
lack of large in-domain data pools or validation sets. To address this, we
propose NTK-Selector, a principled and efficient framework for
selecting general-domain auxiliary data to enhance domain-specific performance
via neural tangent kernels (NTK). Our method tackles two challenges of directly
applying NTK to LLMs, theoretical assumptions and prohibitive computational
cost, by empirically demonstrating a stable NTK-like behavior in LLMs during
LoRA fine-tuning and proposing a Jacobian-free approximation method. Extensive
experiments across four low-resource domains (medical, financial, legal, and
psychological) demonstrate that NTK-Selector consistently improves downstream
performance. Specifically, fine-tuning on 1,000 in-domain samples alone only
yielded +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In
contrast, enriching with 9,000 auxiliary samples selected by NTK-Selector led
to substantial gains of +8.7 and +5.1 points, which correspond to
10.9x and 5.7x improvements over the domain-only setting.
comment: 27 pages
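The improvement factors quoted in the abstract above follow from dividing the gain with auxiliary data by the domain-only gain. The short check below assumes the two gains are listed in the same model order as the baselines (Llama3-8B-Instruct first, Qwen3-8B second); the dictionary names are illustrative.

```python
# Gains in points as quoted in the abstract; model pairing is assumed from order.
domain_only_gain = {"Llama3-8B-Instruct": 0.8, "Qwen3-8B": 0.9}
with_auxiliary_gain = {"Llama3-8B-Instruct": 8.7, "Qwen3-8B": 5.1}

# Improvement factor = gain with selected auxiliary data / domain-only gain.
ratios = {
    model: round(with_auxiliary_gain[model] / domain_only_gain[model], 1)
    for model in domain_only_gain
}
print(ratios)  # {'Llama3-8B-Instruct': 10.9, 'Qwen3-8B': 5.7}
```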
☆ Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection NeurIPS 2025
Reliability and failure detection of large language models (LLMs) are critical
for their deployment in high-stakes, multi-step reasoning tasks. Prior work
explores confidence estimation for self-evaluating LLM-scorer systems, with
confidence scorers estimating the likelihood of errors in LLM responses.
However, most methods focus on single-step outputs and overlook the challenges
of multi-step reasoning. In this work, we extend self-evaluation techniques to
multi-step tasks, testing two intuitive approaches: holistic scoring and
step-by-step scoring. Using two multi-step benchmark datasets, we show that
stepwise evaluation generally outperforms holistic scoring in detecting
potential errors, with up to 15% relative increase in AUC-ROC. Our findings
demonstrate that self-evaluating LLM systems provide meaningful confidence
estimates in complex reasoning, improving their trustworthiness and providing a
practical framework for failure detection.
comment: Accepted at NeurIPS 2025 Workshop on Evaluating the Evolving LLM
Lifecycle: Benchmarks, Emergent Abilities, and Scaling
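One way to picture step-by-step scoring from the abstract above is to turn per-step confidences into a single failure score and measure how well that score separates failed from successful responses with AUC-ROC. The sketch below assumes the minimum step confidence as the aggregation rule, which the abstract does not specify, and the confidence values are made-up toy data.

```python
from typing import List, Sequence

def stepwise_failure_score(step_confidences: Sequence[float]) -> float:
    """Assumed aggregation: a response is as trustworthy as its weakest step."""
    return 1.0 - min(step_confidences)

def auc_roc(scores: List[float], labels: List[int]) -> float:
    """AUC-ROC as the probability that a failed response (label 1)
    receives a higher failure score than a successful one (label 0);
    ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: per-step confidences for four responses and whether each failed.
step_confidences = [
    [0.9, 0.9, 0.2],  # failed: one weak step
    [0.8, 0.9, 0.9],  # succeeded
    [0.9, 0.3, 0.9],  # failed: one weak step
    [0.7, 0.8, 0.9],  # succeeded
]
failed = [1, 0, 1, 0]

scores = [stepwise_failure_score(c) for c in step_confidences]
print(auc_roc(scores, failed))
```

A holistic scorer would instead assign one confidence to the whole response; comparing the two AUC-ROC values is the kind of evaluation the abstract reports.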
☆ IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Recent advances in deep-research agents have shown promise for autonomous
knowledge construction through dynamic reasoning over external sources.
However, existing approaches rely on a mono-contextual paradigm that
accumulates all information in a single, expanding context window, leading to
context suffocation and noise contamination that limit their effectiveness on
long-horizon tasks. We introduce IterResearch, a novel iterative deep-research
paradigm that reformulates long-horizon research as a Markov Decision Process
with strategic workspace reconstruction. By maintaining an evolving report as
memory and periodically synthesizing insights, our approach preserves
consistent reasoning capacity across arbitrary exploration depths. We further
develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning
framework that incentivizes efficient exploration through geometric reward
discounting and enables stable distributed training via adaptive downsampling.
Extensive experiments demonstrate that IterResearch achieves substantial
improvements over existing open-source agents, averaging +14.5pp across six
benchmarks and narrows the gap with frontier proprietary systems. Remarkably,
our paradigm exhibits unprecedented interaction scaling, extending to 2048
interactions with dramatic performance gains (from 3.5% to 42.5%), and serves
as an effective prompting strategy, improving frontier models by up to 19.2pp
over ReAct on long-horizon tasks. These findings position IterResearch as a
versatile solution for long-horizon reasoning, effective both as a trained
agent and as a prompting paradigm for frontier models.
comment: https://github.com/Alibaba-NLP/DeepResearch
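The abstract above mentions geometric reward discounting as the incentive for efficient exploration but does not give the formula. One plausible reading, sketched here purely as an assumption, propagates the final reward backwards with a factor `gamma` per round, so that rounds in shorter trajectories receive larger credit; the function name and the value of `gamma` are hypothetical.

```python
from typing import List

def discounted_round_rewards(final_reward: float, n_rounds: int, gamma: float) -> List[float]:
    """Assign round t (0-indexed) the reward final_reward * gamma ** (n_rounds - 1 - t).

    Assumed form, not taken from the paper: the last round receives the
    full reward and earlier rounds are discounted geometrically.
    """
    return [final_reward * gamma ** (n_rounds - 1 - t) for t in range(n_rounds)]

short = discounted_round_rewards(1.0, 2, 0.5)
long = discounted_round_rewards(1.0, 3, 0.5)
print(short)  # [0.5, 1.0]
print(long)   # [0.25, 0.5, 1.0]
```

Under this reading, the first round of the two-round trajectory earns 0.5 while the first round of the three-round trajectory earns 0.25, which rewards reaching the answer in fewer interactions.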
☆ FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation AAAI 2026
While LLMs have shown great success in financial tasks like stock prediction
and question answering, their application in fully automating Equity Research
Report generation remains uncharted territory. In this paper, we formulate the
Equity Research Report (ERR) Generation task for the first time. To address
data scarcity and the absence of evaluation metrics, we present an open-source
evaluation benchmark for ERR generation, FinRpt. We frame a Dataset
Construction Pipeline that integrates 7 financial data types and produces a
high-quality ERR dataset automatically, which could be used for model training
and evaluation. We also introduce a comprehensive evaluation system including
11 metrics to assess the generated ERRs. Moreover, we propose a multi-agent
framework specifically tailored to address this task, named FinRpt-Gen, and
train several LLM-based agents on the proposed datasets using Supervised
Fine-Tuning and Reinforcement Learning. Experimental results indicate the data
quality and metrics effectiveness of the benchmark FinRpt and the strong
performance of FinRpt-Gen, showcasing their potential to drive innovation in
the ERR generation field. All code and datasets are publicly available.
comment: AAAI 2026
☆ When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs
Despite substantial advances, large language models (LLMs) continue to
exhibit hallucinations, generating plausible yet incorrect responses. In this
paper, we highlight a critical yet previously underexplored class of
hallucinations driven by spurious correlations -- superficial but statistically
prominent associations between features (e.g., surnames) and attributes (e.g.,
nationality) present in the training data. We demonstrate that these spurious
correlations induce hallucinations that are confidently generated, immune to
model scaling, evade current detection methods, and persist even after refusal
fine-tuning. Through systematically controlled synthetic experiments and
empirical evaluations on state-of-the-art open-source and proprietary LLMs
(including GPT-5), we show that existing hallucination detection methods, such
as confidence-based filtering and inner-state probing, fundamentally fail in
the presence of spurious correlations. Our theoretical analysis further
elucidates why these statistical biases intrinsically undermine
confidence-based detection techniques. Our findings thus emphasize the urgent
need for new approaches explicitly designed to address hallucinations caused by
spurious correlations.
☆ RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi
We introduce Reinforcement Learning (RL) with Adaptive Verifiable
Environments (RLVE), an approach using verifiable environments that
procedurally generate problems and provide algorithmically verifiable rewards,
to scale up RL for language models (LMs). RLVE enables each verifiable
environment to dynamically adapt its problem difficulty distribution to the
policy model's capabilities as training progresses. In contrast, static data
distributions often lead to vanishing learning signals when problems are either
too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a
large-scale suite of 400 verifiable environments carefully developed through
manual environment engineering. Using RLVE-Gym, we show that environment
scaling, i.e., expanding the collection of training environments, consistently
improves generalizable reasoning capabilities. RLVE with joint training across
all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement
across six reasoning benchmarks, starting from one of the strongest 1.5B
reasoning LMs. By comparison, continuing this LM's original RL training yields
only a 0.49% average absolute gain despite using over 3x more compute. We
release our code publicly.
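A verifiable environment that adapts its difficulty to the policy, as described in the abstract above, might look roughly like the toy class below. Everything here is an illustrative assumption: the addition task, the class name `AdditionEnv`, the accuracy threshold, and the window size are not taken from RLVE-Gym.

```python
import random
from typing import List, Tuple

class AdditionEnv:
    """Toy verifiable environment: sum `difficulty + 1` random integers."""

    def __init__(self, threshold: float = 0.9, window: int = 4, seed: int = 0):
        self.difficulty = 1
        self.threshold = threshold  # accuracy needed to raise the difficulty
        self.window = window        # number of episodes per accuracy estimate
        self.recent: List[float] = []
        self.rng = random.Random(seed)

    def sample(self) -> Tuple[str, int]:
        """Procedurally generate a problem and its ground-truth answer."""
        nums = [self.rng.randint(0, 99) for _ in range(self.difficulty + 1)]
        return " + ".join(map(str, nums)), sum(nums)

    def reward(self, answer: int, target: int) -> float:
        """Algorithmically verifiable reward: exact match or nothing."""
        return 1.0 if answer == target else 0.0

    def update(self, reward: float) -> None:
        """Raise the difficulty once the policy is accurate over a full window."""
        self.recent.append(reward)
        if len(self.recent) == self.window:
            if sum(self.recent) / self.window >= self.threshold:
                self.difficulty += 1
            self.recent.clear()

env = AdditionEnv()
for _ in range(8):
    prompt, target = env.sample()
    answer = sum(int(x) for x in prompt.split(" + "))  # stand-in for a perfect policy
    env.update(env.reward(answer, target))
print(env.difficulty)  # 3
```

Because the stand-in policy answers every problem correctly, the difficulty rises once per window; a policy that fails would stay at its current level instead of receiving problems that are far too hard.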
☆ ACE-ICD: Acronym Expansion As Data Augmentation For Automated ICD Coding AACL 2025
Automatic ICD coding, the task of assigning disease and procedure codes to
electronic medical records, is crucial for clinical documentation and billing.
While existing methods primarily enhance model understanding of code
hierarchies and synonyms, they often overlook the pervasive use of medical
acronyms in clinical notes, a key factor in ICD code inference. To address this
gap, we propose a novel and effective data augmentation technique that leverages
large language models to expand medical acronyms, allowing models to be trained
on their full form representations. Moreover, we incorporate consistency
training to regularize predictions by enforcing agreement between the original
and augmented documents. Extensive experiments on the MIMIC-III dataset
demonstrate that our approach, ACE-ICD, establishes new state-of-the-art
performance across multiple settings, including common codes, rare codes, and
full-code assignments. Our code is publicly available.
comment: Camera ready version for IJCNLP-AACL 2025 (Findings)
☆ Retriv at BLP-2025 Task 1: A Transformer Ensemble and Multi-Task Learning Approach for Bangla Hate Speech Identification
This paper addresses the problem of Bangla hate speech identification, a
socially impactful yet linguistically challenging task. As part of the "Bangla
Multi-task Hate Speech Identification" shared task at the BLP Workshop,
IJCNLP-AACL 2025, our team "Retriv" participated in all three subtasks: (1A)
hate type classification, (1B) target group identification, and (1C) joint
detection of type, severity, and target. For subtasks 1A and 1B, we employed a
soft-voting ensemble of transformer models (BanglaBERT, MuRIL, IndicBERTv2).
For subtask 1C, we trained three multitask variants and aggregated their
predictions through a weighted voting ensemble. Our systems achieved micro-f1
scores of 72.75% (1A) and 72.69% (1B), and a weighted micro-f1 score of 72.62%
(1C). On the shared task leaderboard, these corresponded to 9th, 10th, and 7th
positions, respectively. These results highlight the promise of transformer
ensembles and weighted multitask frameworks for advancing Bangla hate speech
detection in low-resource contexts. We made experimental scripts publicly
available for the community.
comment: 7 pages, 3 figures, experimental scripts publicly available at
https://github.com/sahasourav17/Retriv-BLP25-Task-1
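Soft voting, as used for subtasks 1A and 1B in the abstract above, averages the class probabilities of the individual models and picks the class with the highest mean. The sketch below is a generic illustration with made-up probabilities; the optional `weights` argument is an assumption added here to hint at how a weighted variant could work, and it is not the team's exact scheme for subtask 1C.

```python
from typing import List, Optional, Sequence, Tuple

def soft_vote(
    prob_sets: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[int, List[float]]:
    """Return the winning class index and the (weighted) mean probabilities."""
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    n_classes = len(prob_sets[0])
    mean = [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=mean.__getitem__), mean

# Made-up probabilities from three models over two classes.
model_probs = [
    [0.9, 0.1],  # model A is very confident in class 0
    [0.4, 0.6],  # model B leans slightly to class 1
    [0.4, 0.6],  # model C leans slightly to class 1
]
label, mean = soft_vote(model_probs)
print(label)  # 0
```

In this example a hard majority vote would pick class 1 (two models against one), while soft voting picks class 0 because it takes each model's confidence into account.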
☆ Who Is the Story About? Protagonist Entity Recognition in News
News articles often reference numerous organizations, but traditional Named
Entity Recognition (NER) treats all mentions equally, obscuring which entities
genuinely drive the narrative. This limits downstream tasks that rely on
understanding event salience, influence, or narrative focus. We introduce
Protagonist Entity Recognition (PER), a task that identifies the organizations
that anchor a news story and shape its main developments. To validate PER, we
compare the predictions of Large Language Models (LLMs) against annotations from
four expert annotators over a gold corpus, establishing both inter-annotator
consistency and human-LLM agreement. Leveraging these findings, we use
state-of-the-art LLMs to automatically label large-scale news collections
through NER-guided prompting, generating scalable, high-quality supervision. We
then evaluate whether other LLMs, given reduced context and without explicit
candidate guidance, can still infer the correct protagonists. Our results
demonstrate that PER is a feasible and meaningful extension to
narrative-centered information extraction, and that guided LLMs can approximate
human judgments of narrative importance at scale.
☆ The Few Govern the Many: Unveiling Few-Layer Dominance for Time Series Models
Large-scale models are at the forefront of time series (TS) forecasting,
dominated by two paradigms: fine-tuning text-based Large Language Models
(LLM4TS) and training Time Series Foundation Models (TSFMs) from scratch. Both
approaches share a foundational assumption that scaling up model capacity and
data volume leads to improved performance. However, we observe a
scaling paradox in TS models, revealing a puzzling phenomenon
that larger models do NOT achieve better performance. Through extensive
experiments on two model families across four scales (100M to 1.7B parameters)
and diverse data (up to 6B observations), we rigorously confirm that the
scaling paradox is a pervasive issue. We then diagnose its root cause by
analyzing internal representations, identifying a phenomenon we call
few-layer dominance: only a small subset of layers are functionally
important, while the majority are redundant, under-utilized, and can even
distract training. Based on this discovery, we propose a practical method to
automatically identify and retain only these dominant layers. In our models,
retaining only 21% of the parameters achieves up to a 12% accuracy
improvement and a 2.7x inference speedup. We validate the universality
of our method on 8 prominent SOTA models (LLM4TS and TSFMs, 90M to 6B), showing
that retaining less than 30% of layers achieves comparable or superior
accuracy in over 95% of tasks.
☆ Discourse Graph Guided Document Translation with Large Language Models
Adapting large language models to full document translation remains
challenging due to the difficulty of capturing long-range dependencies and
preserving discourse coherence throughout extended texts. While recent agentic
machine translation systems mitigate context window constraints through
multi-agent orchestration and persistent memory, they require substantial
computational resources and are sensitive to memory retrieval strategies. We
introduce TransGraph, a discourse-guided framework that explicitly models
inter-chunk relationships through structured discourse graphs and selectively
conditions each translation segment on relevant graph neighbourhoods rather
than relying on sequential or exhaustive context. Across three document-level
MT benchmarks spanning six languages and diverse domains, TransGraph
consistently surpasses strong baselines in translation quality and terminology
consistency while incurring significantly lower token overhead.
☆ EMODIS: A Benchmark for Context-Dependent Emoji Disambiguation in Large Language Models AAAI2026
Large language models (LLMs) are increasingly deployed in real-world
communication settings, yet their ability to resolve context-dependent
ambiguity remains underexplored. In this work, we present EMODIS, a new
benchmark for evaluating LLMs' capacity to interpret ambiguous emoji
expressions under minimal but contrastive textual contexts. Each instance in
EMODIS comprises an ambiguous sentence containing an emoji, two distinct
disambiguating contexts that lead to divergent interpretations, and a specific
question that requires contextual reasoning. We evaluate both open-source and
API-based LLMs, and find that even the strongest models frequently fail to
distinguish meanings when only subtle contextual cues are present. Further
analysis reveals systematic biases toward dominant interpretations and limited
sensitivity to pragmatic contrast. EMODIS provides a rigorous testbed for
assessing contextual disambiguation, and highlights the gap in semantic
reasoning between humans and LLMs.
comment: Accepted by AAAI2026
☆ Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents
Internet of Agents (IoA) envisions a unified, agent-centric paradigm where
heterogeneous large language model (LLM) agents can interconnect and
collaborate at scale. Within this paradigm, federated learning (FL) serves as a
key enabler that allows distributed LLM agents to co-train global models
without centralizing data. However, the FL-enabled IoA system remains
vulnerable to model poisoning attacks, and the prevailing distance and
similarity-based defenses become fragile at billion-parameter scale and under
heterogeneous data distributions. This paper proposes a graph
representation-based model poisoning (GRMP) attack, which passively exploits
observed benign local models to construct a parameter correlation graph and
extends an adversarial variational graph autoencoder to capture and reshape
higher-order dependencies. The GRMP attack synthesizes malicious local models
that preserve benign-like statistics while embedding adversarial objectives,
remaining elusive to detection at the server. Experiments demonstrate a gradual
drop in system accuracy under the proposed attack and the ineffectiveness of
the prevailing defense mechanism in detecting the attack, underscoring a severe
threat to the ambitious IoA paradigm.
comment: 6 pages, 6 figures
☆ AdaRec: Adaptive Recommendation with LLMs via Narrative Profiling and Dual-Channel Reasoning
We propose AdaRec, a few-shot in-context learning framework that leverages
large language models for adaptive personalized recommendation. AdaRec
introduces narrative profiling, transforming user-item interactions into
natural language representations to enable unified task handling and enhance
human readability. Centered on a bivariate reasoning paradigm, AdaRec employs a
dual-channel architecture that integrates horizontal behavioral alignment,
discovering peer-driven patterns, with vertical causal attribution,
highlighting decisive factors behind user preferences. Unlike existing
LLM-based approaches, AdaRec eliminates manual feature engineering through
semantic representations and supports rapid cross-task adaptation with minimal
supervision. Experiments on real ecommerce datasets demonstrate that AdaRec
outperforms both machine learning models and LLM-based baselines by up to eight
percent in few-shot settings. In zero-shot scenarios, it achieves up to a
nineteen percent improvement over expert-crafted profiling, showing
effectiveness for long-tail personalization with minimal interaction data.
Furthermore, lightweight fine-tuning on synthetic data generated by AdaRec
matches the performance of fully fine-tuned models, highlighting its efficiency
and generalization across diverse tasks.
☆ Categorical Emotions or Appraisals - Which Emotion Model Explains Argument Convincingness Better?
The convincingness of an argument depends not only on its structure
(logos) and the person who makes the argument (ethos), but also on the emotion
that it causes in the recipient (pathos). While the overall intensity and
categorical values of emotions in arguments have received considerable
attention in the research community, we argue that the emotion an argument
evokes in a recipient is subjective. It depends on the recipient's goals,
standards, prior knowledge, and stance. Appraisal theories lend themselves as a
link between the subjective cognitive assessment of events and emotions. They
have been used in event-centric emotion analysis, but their suitability for
assessing argument convincingness remains unexplored. In this paper, we
evaluate whether appraisal theories are suitable for emotion analysis in
arguments by considering subjective cognitive evaluations of the importance and
impact of an argument on its receiver. Based on the annotations in the recently
published ContArgA corpus, we perform zero-shot prompting experiments to
evaluate the importance of gold-annotated and predicted emotions and appraisals
for the assessment of the subjective convincingness labels. We find that, while
categorical emotion information does improve convincingness prediction, the
improvement is more pronounced with appraisals. This work presents the first
systematic comparison between emotion models for convincingness prediction,
demonstrating the advantage of appraisals, providing insights for theoretical
and practical applications in computational argumentation.
☆ TCM-Eval: An Expert-Level Dynamic and Extensible Benchmark for Traditional Chinese Medicine
Zihao Cheng, Yuheng Lu, Huaiqian Ye, Zeming Liu, Minqi Wang, Jingjing Liu, Zihan Li, Wei Fan, Yuanfang Guo, Ruiji Fu, Shifeng She, Gang Wang, Yunhong Wang
Large Language Models (LLMs) have demonstrated remarkable capabilities in
modern medicine, yet their application in Traditional Chinese Medicine (TCM)
remains severely limited by the absence of standardized benchmarks and the
scarcity of high-quality training data. To address these challenges, we
introduce TCM-Eval, the first dynamic and extensible benchmark for TCM,
meticulously curated from national medical licensing examinations and validated
by TCM experts. Furthermore, we construct a large-scale training corpus and
propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously
enrich question-answer pairs with validated reasoning chains through rejection
sampling, establishing a virtuous cycle of data and model co-evolution. Using
this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art
LLM specifically designed for TCM, which significantly exceeds the passing
threshold for human practitioners. To encourage future research and
development, we release a public leaderboard, fostering community engagement
and continuous improvement.
comment: Work in Progress
☆ LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for
fine-tuning large language models. However, conventional LoRA adapters are
typically trained for a single task, limiting their applicability in real-world
settings where inputs may span diverse and unpredictable domains. At inference
time, existing approaches combine multiple LoRAs for improving performance on
diverse tasks, while usually requiring labeled data or additional task-specific
training, which is expensive at scale. In this work, we introduce LoRA on the
Go (LoGo), a training-free framework that dynamically selects and merges
adapters at the instance level without any additional requirements. LoGo
leverages signals extracted from a single forward pass through LoRA adapters,
to identify the most relevant adapters and determine their contributions
on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo
outperforms training-based baselines on some tasks by a margin of up to 3.6% while
remaining competitive on other tasks and maintaining inference throughput,
highlighting its effectiveness and practicality.
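As a rough illustration of instance-level selection and merging, the sketch below turns per-adapter relevance scores into mixing weights and combines the adapter outputs for one input. The abstract does not say which signals LoGo extracts, so the scoring rule used here (the L2 norm of each adapter's output), the top-k cutoff, and all function names are assumptions made for illustration only.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def select_and_merge(adapter_outputs, top_k=2):
    """Pick the top_k adapters for one input and merge their outputs.

    adapter_outputs: dict mapping adapter name -> output vector (list of floats)
    obtained from a single forward pass. The norm-based score is an assumed
    stand-in for whatever relevance signal the real method uses.
    """
    scores = {name: l2_norm(out) for name, out in adapter_outputs.items()}
    chosen = sorted(scores, key=scores.get, reverse=True)[:top_k]
    total = sum(scores[name] for name in chosen)
    if total == 0:
        # Degenerate case: fall back to uniform weights.
        weights = {name: 1.0 / len(chosen) for name in chosen}
    else:
        weights = {name: scores[name] / total for name in chosen}
    dim = len(adapter_outputs[chosen[0]])
    merged = [
        sum(weights[name] * adapter_outputs[name][i] for name in chosen)
        for i in range(dim)
    ]
    return weights, merged

# Toy outputs from three hypothetical adapters for a single input.
outputs = {"math": [3.0, 4.0], "code": [0.0, 5.0], "chat": [0.1, 0.0]}
weights, merged = select_and_merge(outputs, top_k=2)
```

Because the weights are recomputed per input, no labeled data or extra training is involved, which is the property the abstract emphasizes.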
☆ Think Consistently, Reason Efficiently: Energy-Based Calibration for Implicit Chain-of-Thought
Large Language Models (LLMs) have demonstrated strong reasoning capabilities
through \emph{Chain-of-Thought} (CoT) prompting, which enables step-by-step
intermediate reasoning. However, explicit CoT methods rely on discrete
token-level reasoning processes that are prone to error propagation and limited
by vocabulary expressiveness, often resulting in rigid and inconsistent
reasoning trajectories. Recent research has explored implicit or continuous
reasoning in latent spaces, allowing models to perform internal reasoning
before generating explicit output. Although such approaches alleviate some
limitations of discrete CoT, they generally lack explicit mechanisms to enforce
consistency among reasoning steps, leading to divergent reasoning paths and
unstable outcomes. To address this issue, we propose EBM-CoT, an Energy-Based
Chain-of-Thought Calibration framework that refines latent thought
representations through an energy-based model (EBM). Our method dynamically
adjusts latent reasoning trajectories toward lower-energy, high-consistency
regions in the embedding space, improving both reasoning accuracy and
consistency without modifying the base language model. Extensive experiments
across mathematical, commonsense, and symbolic reasoning benchmarks demonstrate
that the proposed framework significantly enhances the consistency and
efficiency of multi-step reasoning in LLMs.
☆ More Agents Helps but Adversarial Robustness Gap Persists
When LLM agents work together, they seem to be more powerful than a single
LLM in mathematical question answering. However, are they also more robust to
adversarial inputs? We investigate this question using adversarially perturbed
math questions. These perturbations include punctuation noise with three
intensities (10, 30, and 50 percent), plus real-world and human-like typos
(WikiTypo, R2ATA). Using a unified sampling-and-voting framework (Agent
Forest), we evaluate six open-source models (Qwen3-4B/14B, Llama3.1-8B,
Mistral-7B, Gemma3-4B/12B) across four benchmarks (GSM8K, MATH, MMLU-Math,
MultiArith), with various numbers of agents n from one to 25 (1, 2, 5, 10, 15,
20, 25). Our findings show that (1) noise type matters: the harm from punctuation
noise scales with its severity, and human typos remain the dominant bottleneck,
yielding the largest gaps to Clean accuracy and the highest ASR even with a
large number of agents; and (2) collaboration reliably improves accuracy as the
number of agents, n, increases, with the largest gains from one to five agents
and diminishing returns beyond 10 agents. However, the adversarial robustness
gap persists regardless of the agent count.
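A minimal sketch of sampling-and-voting and of the robustness gap discussed above, assuming a plain majority vote over final-answer strings; the actual Agent Forest procedure may differ in its details. The answer lists below are made-up toy data, not results from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the sampled agents."""
    return Counter(answers).most_common(1)[0][0]

def accuracy(per_question_answers, gold):
    """Fraction of questions whose voted answer matches the gold answer."""
    correct = sum(
        majority_vote(answers) == target
        for answers, target in zip(per_question_answers, gold)
    )
    return correct / len(gold)

def robustness_gap(clean_acc, perturbed_acc):
    """Accuracy lost when moving from clean to perturbed inputs."""
    return clean_acc - perturbed_acc

# Toy example: two questions, three agents each.
gold = ["42", "7"]
clean = [["42", "41", "42"], ["7", "7", "8"]]
typos = [["42", "41", "42"], ["8", "8", "7"]]
gap = robustness_gap(accuracy(clean, gold), accuracy(typos, gold))
```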
☆ MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Risks in LLMs on Domain Tasks
Ensuring the safety and value alignment of large language models (LLMs) is
critical for their deployment. Current alignment efforts primarily target
explicit risks such as bias, hate speech, and violence. However, they often
fail to address deeper, domain-specific implicit risks and lack a flexible,
generalizable framework applicable across diverse specialized fields. Hence, we
propose MENTOR: A MEtacognition-driveN self-evoluTion framework for uncOvering
and mitigating implicit Risks in LLMs on Domain Tasks. To address the
limitations of labor-intensive human evaluation, we introduce a novel
metacognitive self-assessment tool. This enables LLMs to reflect on potential
value misalignments in their responses using strategies like perspective-taking
and consequential thinking. We also release a supporting dataset of 9,000 risk
queries spanning education, finance, and management to enhance domain-specific
risk identification. Subsequently, based on the outcomes of metacognitive
reflection, the framework dynamically generates supplementary rule knowledge
graphs that extend predefined static rule trees. This enables models to
actively apply validated rules to future similar challenges, establishing a
continuous self-evolution cycle that enhances generalization by reducing
maintenance costs and inflexibility of static systems. Finally, we employ
activation steering during inference to guide LLMs in following the rules, a
cost-effective method to robustly enhance enforcement across diverse contexts.
Experimental results show MENTOR's effectiveness: In defensive testing across
three vertical domains, the framework substantially reduces semantic attack
success rates, enabling a new level of implicit risk mitigation for LLMs.
Furthermore, metacognitive assessment not only aligns closely with baseline
human evaluators but also delivers more thorough and insightful analysis of
LLMs' value alignment.
☆ Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
Khalil Hennara, Ahmad Bastati, Muhammad Hreden, Mohamed Motasim Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
The performance of large language models (LLMs) and large multimodal models
(LMMs) depends heavily on the quality and scale of their pre-training datasets.
Recent research shows that large multimodal models trained on natural documents
where images and text are interleaved outperform those trained only on
image-text pairs across a wide range of benchmarks, leveraging advanced
pre-trained models to enforce semantic alignment, image-sequence consistency, and
textual coherence. For Arabic, however, the lack of high-quality multimodal
datasets that preserve document structure has limited progress. In this paper,
we present our pipeline Wasm for processing the Common Crawl dataset to create
a new Arabic multimodal dataset that uniquely provides markdown output. Unlike
existing Arabic corpora that focus solely on text extraction, our approach
preserves the structural integrity of web content while maintaining flexibility
for both text-only and multimodal pre-training scenarios. We provide a
comprehensive comparative analysis of our data processing pipeline against
those used for major existing datasets, highlighting the convergences in
filtering strategies and justifying our specific design choices. To support
future research, we publicly release a representative dataset dump along with
the multimodal processing pipeline for Arabic.
☆ EmoBang: Detecting Emotion From Bengali Texts
Emotion detection from text seeks to identify an individual's emotional or
mental state - positive, negative, or neutral - based on linguistic cues. While
significant progress has been made for English and other high-resource
languages, Bengali remains underexplored despite being the world's fourth most
spoken language. The lack of large, standardized datasets classifies Bengali as
a low-resource language for emotion detection. Existing studies mainly employ
classical machine learning models with traditional feature engineering,
yielding limited performance. In this paper, we introduce a new Bengali emotion
dataset annotated across eight emotion categories and propose two models for
automatic emotion detection: (i) a hybrid Convolutional Recurrent Neural
Network (CRNN) model (EmoBangHybrid) and (ii) an AdaBoost-Bidirectional Encoder
Representations from Transformers (BERT) ensemble model (EmoBangEnsemble).
Additionally, we evaluate six baseline models with five feature engineering
techniques and assess zero-shot and few-shot large language models (LLMs) on
the dataset. To the best of our knowledge, this is the first comprehensive
benchmark for Bengali emotion detection. Experimental results show that
EmoBangHybrid and EmoBangEnsemble achieve accuracies of 92.86% and 93.69%, respectively,
outperforming existing methods and establishing strong baselines for future
research.
☆ Importance-Aware Data Selection for Efficient LLM Instruction Tuning AAAI 2026
Tingyu Jiang, Shen Li, Yiyao Song, Lan Zhang, Hualei Zhu, Yuan Zhao, Xiaohang Xu, Kenjiro Taura, Hao Henry Wang
Instruction tuning plays a critical role in enhancing the performance and
efficiency of Large Language Models (LLMs). Its success depends not only on the
quality of the instruction data but also on the inherent capabilities of the
LLM itself. Some studies suggest that even a small amount of high-quality data
can achieve instruction fine-tuning results that are on par with, or even
exceed, those from using a full-scale dataset. However, rather than focusing
solely on calculating data quality scores to evaluate instruction data, there
is a growing need to select high-quality data that maximally enhances the
performance of instruction tuning for a given LLM. In this paper, we propose
the Model Instruction Weakness Value (MIWV) as a novel metric to quantify the
importance of instruction data in enhancing a model's capabilities. The MIWV
metric is derived from the discrepancies in the model's responses when using
In-Context Learning (ICL), helping identify the most beneficial data for
enhancing instruction tuning performance. Our experimental results demonstrate
that selecting only the top 1\% of data based on MIWV can outperform training
on the full dataset. Furthermore, this approach extends beyond existing
research that focuses on data quality scoring for data selection, offering
strong empirical evidence supporting the effectiveness of our proposed method.
comment: Accepted by AAAI 2026 Oral
☆ Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection AAAI
The opaque nature of deep learning models presents significant challenges for
the ethical deployment of hate speech detection systems. To address this
limitation, we introduce Supervised Rational Attention (SRA), a framework that
explicitly aligns model attention with human rationales, improving both
interpretability and fairness in hate speech classification. SRA integrates a
supervised attention mechanism into transformer-based classifiers, optimizing a
joint objective that combines standard classification loss with an alignment
loss term that minimizes the discrepancy between attention weights and
human-annotated rationales. We evaluated SRA on hate speech benchmarks in
English (HateXplain) and Portuguese (HateBRXplain) with rationale annotations.
Empirically, SRA achieves 2.4x better explainability compared to current
baselines, and produces token-level explanations that are more faithful and
human-aligned. In terms of fairness, SRA achieves competitive fairness across
all measures, with second-best performance in detecting toxic posts targeting
identity groups, while maintaining comparable results on other metrics. These
findings demonstrate that incorporating human rationales into attention
mechanisms can enhance interpretability and faithfulness without compromising
fairness.
comment: Accepted at the Annual AAAI Conference on Artificial Intelligence
(AAAI26)
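The joint objective described above can be sketched as a classification loss plus a weighted alignment term. The abstract does not specify how the discrepancy is measured, so the mean-squared-error form, the normalization of the rationale mask, and the weight `alpha` are assumptions for illustration rather than the paper's definition.

```python
import math

def cross_entropy(probs, label):
    """Standard classification loss for one example."""
    return -math.log(probs[label])

def alignment_loss(attention, rationale_mask):
    """Mean squared error between attention weights and a normalized
    human rationale mask (1 = token marked as rationale, 0 = not)."""
    total = sum(rationale_mask)
    if total == 0:
        # No rationale annotated for this example: nothing to align to.
        return 0.0
    target = [m / total for m in rationale_mask]
    return sum((a - t) ** 2 for a, t in zip(attention, target)) / len(attention)

def joint_loss(probs, label, attention, rationale_mask, alpha=1.0):
    """Classification loss plus alpha times the alignment loss."""
    return cross_entropy(probs, label) + alpha * alignment_loss(attention, rationale_mask)

# Toy example: attention already matches the rationale, so only the
# classification term contributes.
loss = joint_loss([0.5, 0.5], 0, [0.5, 0.5, 0.0], [1, 1, 0])
```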
☆ When Sufficient is not Enough: Utilizing the Rashomon Effect for Complete Evidence Extraction
Feature attribution methods typically provide minimal sufficient evidence
justifying a model decision. However, in many applications this is inadequate.
For compliance and cataloging, the full set of contributing features must be
identified - complete evidence. We perform a case study on a medical dataset
which contains human-annotated complete evidence. We show that individual
models typically recover only subsets of complete evidence and that aggregating
evidence from several models improves evidence recall from $\sim$0.60 (single
best model) to $\sim$0.86 (ensemble). We analyze the recall-precision
trade-off, the role of training with evidence, dynamic ensembles with certainty
thresholds, and discuss implications.
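To make the recall-precision trade-off of aggregation concrete, the sketch below pools the evidence sets of several models and scores them against a gold set. The feature names and sets are made-up toy data, and the `min_votes` threshold is an assumed knob for illustration, not the paper's certainty threshold.

```python
from collections import Counter

def recall(pred, gold):
    """Share of gold evidence that was recovered."""
    return len(pred & gold) / len(gold)

def precision(pred, gold):
    """Share of predicted evidence that is actually gold."""
    return len(pred & gold) / len(pred) if pred else 0.0

def aggregate(evidence_sets, min_votes=1):
    """Keep features selected by at least min_votes models.
    min_votes=1 is the plain union of all evidence sets."""
    votes = Counter(f for evidence in evidence_sets for f in set(evidence))
    return {f for f, count in votes.items() if count >= min_votes}

# Toy data: each model finds a different sufficient subset.
gold = {"a", "b", "c", "d"}
models = [{"a", "b"}, {"b", "c"}, {"c", "x"}]

union = aggregate(models, min_votes=1)      # higher recall, some noise
agreement = aggregate(models, min_votes=2)  # cleaner, lower recall
```

In this toy case the union raises recall above that of any single model while admitting one spurious feature, which mirrors the trade-off the abstract analyzes.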
☆ Evaluating Large Language Models for Anxiety, Depression, and Stress Detection: Insights into Prompting Strategies and Synthetic Data
Mental health disorders affect over one-fifth of adults globally, yet
detecting such conditions from text remains challenging due to the subtle and
varied nature of symptom expression. This study evaluates multiple approaches
for mental health detection, comparing Large Language Models (LLMs) such as
Llama and GPT with classical machine learning and transformer-based
architectures including BERT, XLNet, and Distil-RoBERTa. Using the DAIC-WOZ
dataset of clinical interviews, we fine-tuned models for anxiety, depression,
and stress classification and applied synthetic data generation to mitigate
class imbalance. Results show that Distil-RoBERTa achieved the highest F1 score
(0.883) for GAD-2, while XLNet outperformed others on PHQ tasks (F1 up to
0.891). For stress detection, a zero-shot synthetic approach
(SD+Zero-Shot-Basic) reached an F1 of 0.884 and ROC AUC of 0.886. Findings
demonstrate the effectiveness of transformer-based models and highlight the
value of synthetic data in improving recall and generalization. However,
careful calibration is required to prevent precision loss. Overall, this work
emphasizes the potential of combining advanced language models and data
augmentation to enhance automated mental health assessment from text.
☆ Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, Even Oldridge
We introduce llama-embed-nemotron-8b, an open-weights text embedding model
that achieves state-of-the-art performance on the Multilingual Massive Text
Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent
models show strong performance, their training data or methodologies are often
not fully disclosed. We aim to address this by developing a fully open-source
model, publicly releasing its weights and detailed ablation studies, and
planning to share the curated training datasets. Our model demonstrates
superior performance across all major embedding tasks -- including retrieval,
classification and semantic textual similarity (STS) -- and excels in
challenging multilingual scenarios, such as low-resource languages and
cross-lingual setups. This state-of-the-art performance is driven by a novel
data mix of 16.1 million query-document pairs, split between 7.7 million
samples from public datasets and 8.4 million synthetically generated examples
from various open-weight LLMs. One of our key contributions is a detailed
ablation study analyzing core design choices, including a comparison of
contrastive loss implementations, an evaluation of synthetic data generation
(SDG) strategies, and the impact of model merging. The llama-embed-nemotron-8b
is an instruction-aware model, supporting user-defined instructions to enhance
performance for specific use-cases. This combination of top-tier performance,
broad applicability, and user-driven flexibility enables it to serve as a
universal text embedding solution.
☆ Multilingual Lexical Feature Analysis of Spoken Language for Predicting Major Depression Symptom Severity
Anastasiia Tokareva, Judith Dineley, Zoe Firth, Pauline Conde, Faith Matcham, Sara Siddi, Femke Lamers, Ewan Carr, Carolin Oetzmann, Daniel Leightley, Yuezhou Zhang, Amos A. Folarin, Josep Maria Haro, Brenda W. J. H. Penninx, Raquel Bailon, Srinivasan Vairavan, Til Wykes, Richard J. B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Nicholas Cummins, The RADAR-CNS Consortium
Background: Captured between clinical appointments using mobile devices,
spoken language has potential for objective, more regular assessment of symptom
severity and earlier detection of relapse in major depressive disorder.
However, research to date has largely been in non-clinical cross-sectional
samples of written language using complex machine learning (ML) approaches with
limited interpretability.
Methods: We describe an initial exploratory analysis of longitudinal speech
data and PHQ-8 assessments from 5,836 recordings of 586 participants in the UK,
Netherlands, and Spain, collected in the RADAR-MDD study. We sought to identify
interpretable lexical features associated with MDD symptom severity with linear
mixed-effects modelling. Interpretable features and high-dimensional vector
embeddings were also used to test the prediction performance of four regressor
ML models.
Results: In English data, MDD symptom severity was associated with 7 features
including lexical diversity measures and absolutist language. In Dutch,
associations were observed with words per sentence and positive word frequency;
no associations were observed in recordings collected in Spain. The predictive
power of lexical features and vector embeddings was near chance level across
all languages.
Limitations: Smaller samples in non-English speech and methodological
choices, such as the elicitation prompt, may have also limited the effect sizes
observable. A lack of NLP tools in languages other than English restricted our
feature choice.
Conclusion: To understand the value of lexical markers in clinical research
and practice, further research is needed in larger samples across several
languages using improved protocols, and ML models that account for within- and
between-individual variations in language.
☆ A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation AACL 2025
In this paper, we describe our system under the team name BLEU Monday for the
English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the
text-only translation tasks for English-Hindi, English-Bengali,
English-Malayalam, and English-Odia language pairs. We present a two-stage
approach that addresses quality issues in the training data through automated
error detection and correction, followed by parameter-efficient model
fine-tuning.
Our methodology introduces a vision-augmented judge-corrector pipeline that
leverages multimodal language models to systematically identify and correct
translation errors in the training data. The judge component classifies
translations into three categories: correct, visually ambiguous (requiring
image context), or mistranslated (poor translation quality). Identified errors
are routed to specialized correctors: GPT-4o-mini regenerates captions
requiring visual disambiguation, while IndicTrans2 retranslates cases with pure
translation quality issues. This automated pipeline processes 28,928 training
examples across four languages, correcting an average of 17.1% of captions per
language.
We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2
en-indic 200M distilled model on both original and corrected datasets. Training
on corrected data yields consistent improvements, with BLEU score gains of
+1.30 for English-Bengali on the evaluation set (42.00 -> 43.30) and +0.70 on
the challenge set (44.90 -> 45.60), +0.60 for English-Odia on the evaluation
set (41.00 -> 41.60), and +0.10 for English-Hindi on the challenge set (53.90
-> 54.00).
comment: Accepted at The 12th Workshop on Asian Translation, co-located with
IJCNLP-AACL 2025
☆ Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs
Yingfeng Luo, Ziqiang Xu, Yuxuan Ouyang, Murun Yang, Dingyang Lin, Kaiyan Chang, Tong Zheng, Bei Li, Peinan Feng, Quan Du, Tong Xiao, Jingbo Zhu
Large language models have significantly advanced Multilingual Machine
Translation (MMT), yet broad language coverage, consistent translation
quality, and English-centric bias remain open challenges. To address these
challenges, we introduce \textbf{LMT}, a suite of \textbf{L}arge-scale
\textbf{M}ultilingual \textbf{T}ranslation models centered on both Chinese and
English, covering 60 languages and 234 translation directions. During
development, we identify a previously overlooked phenomenon of
\textbf{directional degeneration}, where symmetric multi-way fine-tuning data
overemphasize reverse directions (X $\to$ En/Zh), leading to excessive
many-to-one mappings and degraded translation quality. We propose
\textbf{Strategic Downsampling}, a simple yet effective method to mitigate this
degeneration. In addition, we design \textbf{Parallel Multilingual Prompting
(PMP)}, which leverages typologically related auxiliary languages to enhance
cross-lingual transfer. Through rigorous data curation and refined adaptation
strategies, LMT achieves SOTA performance among models of comparable language
coverage, with our 4B model (LMT-60-4B) surpassing the much larger Aya-101-13B
and NLLB-54B models by a substantial margin. We release LMT in four sizes
(0.6B/1.7B/4B/8B) to catalyze future research and provide strong baselines for
inclusive, scalable, and high-quality MMT
(https://github.com/NiuTrans/LMT).
☆ Automated Circuit Interpretation via Probe Prompting
Mechanistic interpretability aims to understand neural networks by
identifying which learned features mediate specific behaviors. Attribution
graphs reveal these feature pathways, but interpreting them requires extensive
manual analysis -- a single prompt can take approximately 2 hours for an
experienced circuit tracer. We present probe prompting, an automated pipeline
that transforms attribution graphs into compact, interpretable subgraphs built
from concept-aligned supernodes. Starting from a seed prompt and target logit,
we select high-influence features, generate concept-targeted yet
context-varying probes, and group features by cross-prompt activation
signatures into Semantic, Relationship, and Say-X categories using transparent
decision rules.
Across five prompts including classic "capitals" circuits, probe-prompted
subgraphs preserve high explanatory coverage while compressing complexity
(Completeness 0.83, mean across circuits; Replacement 0.54). Compared to
geometric clustering baselines, concept-aligned groups exhibit higher
behavioral coherence: 2.3x higher peak-token consistency (0.425 vs 0.183) and
5.8x higher activation-pattern similarity (0.762 vs 0.130), despite lower
geometric compactness. Entity-swap tests reveal a layerwise hierarchy:
early-layer features transfer robustly (64% transfer rate, mean layer 6.3),
while late-layer Say-X features specialize for output promotion (mean layer
16.4), supporting a backbone-and-specialization view of transformer
computation.
We release code (https://github.com/peppinob-ol/attribution-graph-probing),
an interactive demo
(https://huggingface.co/spaces/Peppinob/attribution-graph-probing), and minimal
artifacts enabling immediate reproduction and community adoption.
comment: 27 pages, 5 figures, 3 tables. Code and interactive demo available
☆ SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs AAAI 2026
Large language models sometimes inadvertently reproduce passages that are
copyrighted, exposing downstream applications to legal risk. Most existing
studies for inference-time defences focus on surface-level token matching and
rely on external blocklists or filters, which add deployment complexity and may
overlook semantically paraphrased leakage. In this work, we reframe copyright
infringement mitigation as intrinsic semantic-space control and introduce
SCOPE, an inference-time method that requires no parameter updates or auxiliary
filters. Specifically, the sparse autoencoder (SAE) projects hidden states into
a high-dimensional, near-monosemantic space; benefiting from this
representation, we identify a copyright-sensitive subspace and clamp its
activations during decoding. Experiments on widely recognized benchmarks show
that SCOPE mitigates copyright infringement without degrading general utility.
Further interpretability analyses confirm that the isolated subspace captures
high-level semantics.
comment: Accepted by the AAAI 2026 (Main Track)
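The encode, clamp, and decode idea can be sketched with a tiny sparse autoencoder. Everything concrete here is an assumption for illustration: the ReLU encoder, the linear decoder, clamping to a ceiling of zero, applying only the clamping-induced change back to the hidden state, and the toy weights. The abstract does not specify these details.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sae_encode(hidden, w_enc, b_enc):
    """ReLU encoder: hidden state -> sparse latent activations."""
    return [max(0.0, z + b) for z, b in zip(matvec(w_enc, hidden), b_enc)]

def clamp_latents(latents, sensitive_idx, ceiling=0.0):
    """Cap the activations of the latents flagged as sensitive."""
    return [min(v, ceiling) if i in sensitive_idx else v
            for i, v in enumerate(latents)]

def steer_hidden(hidden, w_enc, b_enc, w_dec, sensitive_idx, ceiling=0.0):
    """Shift the hidden state by the decoded effect of clamping only,
    so that SAE reconstruction error is not injected into the model."""
    latents = sae_encode(hidden, w_enc, b_enc)
    clamped = clamp_latents(latents, sensitive_idx, ceiling)
    delta = [c - o for c, o in zip(clamped, latents)]
    shift = matvec(w_dec, delta)
    return [h + s for h, s in zip(hidden, shift)]

# Toy 2-d hidden state, 3 latents; latent 2 plays the "sensitive" feature.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
steered = steer_hidden([1.0, 2.0], w_enc, b_enc, w_dec, sensitive_idx={2})
```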
☆ HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection AAAI'26
To prevent misinformation and social issues arising from trustworthy-looking
content generated by LLMs, it is crucial to develop efficient and reliable
methods for identifying the source of texts. Previous approaches have
demonstrated exceptional performance in detecting texts fully generated by
LLMs. However, these methods struggle when confronting more advanced LLM output
or text with adversarial multi-task machine revision, especially in the
black-box setting, where the generating model is unknown. To address this
challenge, grounded in the hypothesis that human writing possesses distinctive
stylistic patterns, we propose Human Language Preference Detection (HLPD). HLPD
employs a reward-based alignment process, Human Language Preference
Optimization (HLPO), to shift the scoring model's token distribution toward
human-like writing, making the model more sensitive to human writing and thereby
enhancing the identification of machine-revised text. We test HLPD in an
adversarial multi-task evaluation framework that leverages a five-dimensional
prompt generator and multiple advanced LLMs to create diverse revision
scenarios. When detecting texts revised by GPT-series models, HLPD achieves a
15.11% relative improvement in AUROC over ImBD, surpassing Fast-DetectGPT by
45.56%. When evaluated on texts generated by advanced LLMs, HLPD achieves the
highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%.
Code will be made available at https://github.com/dfq2021/HLPD.
comment: 9 pages, 3 figures, accepted by AAAI'26
☆ RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have
shown impressive performance on various multimodal benchmarks. However, most of
these benchmarks evaluate models primarily through multiple-choice or
short-answer formats, which do not take the reasoning process into account.
Although some benchmarks assess the reasoning process, their methods are often
overly simplistic and only examine reasoning when answers are incorrect. This
approach overlooks scenarios where flawed reasoning leads to correct answers.
In addition, these benchmarks do not consider the impact of intermodal
relationships on reasoning. To address this issue, we propose the Reasoning
Process Tree Score (RPTS), a tree structure-based metric to assess reasoning
processes. Specifically, we organize the reasoning steps into a reasoning tree
and leverage its hierarchical information to assign weighted faithfulness
scores to each reasoning step. By dynamically adjusting these weights, RPTS not
only evaluates the overall correctness of the reasoning, but also pinpoints
where the model fails in the reasoning. To validate RPTS in real-world
multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374
images and 390 reasoning instances. Each instance includes reliable
visual-textual clues that serve as leaf nodes of the reasoning tree.
Furthermore, we define three types of intermodal relationships to investigate
how intermodal interactions influence the reasoning process. We evaluated
representative LVLMs (e.g., GPT4o, Llava-Next), uncovering their limitations in
multimodal reasoning and highlighting the differences between open-source and
closed-source commercial LVLMs. We believe that this benchmark will contribute
to the advancement of research in the field of multimodal reasoning.
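A minimal sketch of scoring a reasoning tree with weighted per-step faithfulness. The abstract does not give the weighting rule, so the depth-based decay, the per-step scores, and the "weakest step" heuristic below are assumptions for illustration; the root is taken to be the final conclusion and the leaves the visual-textual clues.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    faithful: float                     # faithfulness of this step in [0, 1]
    children: list = field(default_factory=list)

def walk(node, depth=0):
    """Yield every (step, depth) pair in the tree, root first."""
    yield node, depth
    for child in node.children:
        yield from walk(child, depth + 1)

def tree_score(root, decay=0.5):
    """Weighted average of faithfulness; weight = decay ** depth (assumed)."""
    num = den = 0.0
    for node, depth in walk(root):
        weight = decay ** depth
        num += weight * node.faithful
        den += weight
    return num / den

def weakest_step(root):
    """Name of the least faithful step, i.e. where the reasoning fails."""
    return min((node for node, _ in walk(root)), key=lambda n: n.faithful).name

# Toy tree: a correct conclusion reached through an unfaithful inference.
tree = Step("conclusion", 1.0, [
    Step("inference", 0.0, [Step("clue A", 1.0), Step("clue B", 1.0)]),
])
score = tree_score(tree)
failure = weakest_step(tree)
```

The toy tree shows the case the abstract highlights: the final answer is right, yet the score is below 1 and the failing step is located.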
☆ EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers AAAI2026
Large Language Models for Simulating Professions (SP-LLMs), particularly as
teachers, are pivotal for personalized education. However, ensuring their
professional competence and ethical safety is a critical challenge, as existing
benchmarks fail to measure role-playing fidelity or address the unique teaching
harms inherent in educational scenarios. To address this, we propose
EduGuardBench, a dual-component benchmark. It assesses professional fidelity
using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to
the teaching profession. It also probes safety vulnerabilities using
persona-based adversarial prompts targeting both general harms and,
particularly, academic misconduct, evaluated with metrics including Attack
Success Rate (ASR) and a three-tier Refusal Quality assessment. Our extensive
experiments on 14 leading models reveal a stark polarization in performance.
While reasoning-oriented models generally show superior fidelity, incompetence
remains the dominant failure mode across most models. The adversarial tests
uncovered a counterintuitive scaling paradox, where mid-sized models can be the
most vulnerable, challenging monotonic safety assumptions. Critically, we
identified a powerful Educational Transformation Effect: the safest models
excel at converting harmful requests into teachable moments by providing ideal
Educational Refusals. This capacity is strongly negatively correlated with ASR,
revealing a new dimension of advanced AI safety. EduGuardBench thus provides a
reproducible framework that moves beyond siloed knowledge tests toward a
holistic assessment of professional, ethical, and pedagogical alignment,
uncovering complex dynamics essential for deploying trustworthy AI in
education. See https://github.com/YL1N/EduGuardBench for Materials.
comment: 22 pages, 9 figures, accepted by AAAI2026 as oral paper
☆ Inclusion of Role into Named Entity Recognition and Ranking
Most Natural Language Processing systems perform entity-based processing
for tasks such as Information Extraction, Question Answering, and Text
Summarization. A new challenge arises when entities play roles according to
their actions or attributes in a certain context. Entity Role Detection is the
task of assigning such roles to entities. Real-world entities are usually of
types such as person, location, and organization. Roles can be considered
domain-dependent subtypes of these types. Retrieving a subset of entities
based on their roles poses the problem of defining the roles and identifying
the entities that have them. This paper presents a study of solving the Entity
Role Detection problem by modeling it as a Named Entity Recognition (NER) task
and as an Entity Retrieval/Ranking task. In NER, the roles can be treated as
mutually exclusive classes, and standard NER methods such as sequence tagging
can be used. For Entity Retrieval, roles can be formulated as the query and
entities as the collection on which the query is executed. The Entity
Retrieval task differs from document retrieval in that the entities and the
roles against which they are retrieved are described only indirectly. We have
formulated automated ways of learning representative words and phrases and of
building representations of roles and entities from them. We have also
explored different contexts, such as sentence and document. Because roles
depend on context, large domain-specific datasets or knowledge bases are not
always available for learning, so we have tried to exploit the information in
small datasets in a domain-agnostic way.
comment: MTP Paper
☆ CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition
Hung-Yang Sung, Chien-Chun Wang, Kuan-Tang Huang, Tien-Hong Lo, Yu-Sheng Tsao, Yung-Chang Hsu, Berlin Chen
Automatic speech recognition (ASR) for low-resource languages such as
Taiwanese Hokkien is difficult due to the scarcity of annotated data. However,
direct fine-tuning on Han-character transcriptions often fails to capture
detailed phonetic and tonal cues, while training only on romanization lacks
lexical and syntactic coverage. In addition, prior studies have rarely explored
staged strategies that integrate both annotation types. To address this gap, we
present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on
Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. The
framework employs a two-stage process in which it first learns acoustic and
tonal representations from phonetic Tai-lo annotations and then captures
vocabulary and syntax from Han-character transcriptions. This progressive
adaptation enables effective alignment between speech sounds and orthographic
structures. Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR
achieves a 24.88\% relative reduction in character error rate (CER) compared
with strong baselines. The results indicate that CLiFT-ASR provides an
effective and parameter-efficient solution for Taiwanese Hokkien ASR and that
it has potential to benefit other low-resource language scenarios.
comment: Accepted for an oral presentation at the 37th Conference on
Computational Linguistics and Speech Processing (ROCLING 2025)
☆ Beyond Plain Demos: A Demo-centric Anchoring Paradigm for In-Context Learning in Alzheimer's Disease Detection AAAI
Detecting Alzheimer's disease (AD) from narrative transcripts challenges
large language models (LLMs): pre-training rarely covers this
out-of-distribution task, and all transcript demos describe the same scene,
producing highly homogeneous contexts. These factors cripple both the model's
built-in task knowledge (\textbf{task cognition}) and its ability to surface
subtle, class-discriminative cues (\textbf{contextual perception}). Because
cognition is fixed after pre-training, improving in-context learning (ICL) for
AD detection hinges on enriching perception through better demonstration (demo)
sets. We demonstrate that standard ICL quickly saturates, as its demos lack
diversity (context width) and fail to convey fine-grained signals (context
depth), and that although recent task vector (TV) approaches improve broad task
adaptation by injecting TVs into the LLMs' hidden states (HSs), they are
ill-suited for AD detection due to mismatches in injection granularity,
strength, and position. To address these bottlenecks, we introduce
\textbf{DA4ICL}, a demo-centric anchoring framework that jointly expands
context width via \emph{\textbf{Diverse and Contrastive Retrieval}} (DCR) and
deepens each demo's signal via \emph{\textbf{Projected Vector Anchoring}} (PVA)
at every Transformer layer. Across three AD benchmarks, DA4ICL achieves large,
stable gains over both ICL and TV baselines, charting a new paradigm for
fine-grained, OOD and low-resource LLM adaptation.
comment: Accepted to the 40th Annual AAAI Conference on Artificial
Intelligence (2026) - Main Technical Track (Oral)
☆ Learning to Focus: Focal Attention for Selective and Scalable Transformers
Attention is a core component of the transformer architecture, whether the
model is encoder-only, decoder-only, or encoder-decoder. However, standard
softmax attention often produces a noisy probability distribution, which can
impair effective feature selection at every layer of these models, particularly
for long contexts. We propose Focal Attention, a simple yet effective
modification that sharpens the attention distribution by controlling the
softmax temperature, either as a fixed hyperparameter or as a learnable
parameter during training. This sharpening enables the model to concentrate on
the most relevant tokens while suppressing irrelevant ones. Empirically, Focal
Attention scales more favorably than the standard transformer with respect to model
size, training data, and context length. Across diverse benchmarks, it achieves
the same accuracy with up to 42% fewer parameters or 33% less training data. On
long-context tasks, it delivers substantial relative improvements ranging from
17% to 82%, demonstrating its effectiveness in real-world applications.
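As a rough illustration of the temperature idea in the Focal Attention abstract above, the sketch below divides attention scores by a temperature before the softmax. It is a minimal sketch, not the authors' implementation: the function names, the toy vectors, and the temperature value 0.5 are assumptions chosen for illustration, and the abstract does not say how the temperature is parameterized or learned.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over a list of scores, each divided by the temperature first."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, temperature=1.0):
    """Scaled dot-product attention weights for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores, temperature)

def entropy(probs):
    """Shannon entropy in nats; lower means a sharper distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy inputs (hypothetical): one query and three keys in two dimensions.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

standard = attention_weights(query, keys, temperature=1.0)
sharpened = attention_weights(query, keys, temperature=0.5)  # assumed value
```

With these toy inputs, the lower temperature puts more weight on the best-matching key and yields a lower-entropy distribution than the standard softmax.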
☆ SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
The rapid advancement of Large Language Models (LLMs) has driven significant
progress in Natural Language Interface to Database (NLIDB). However, the
widespread adoption of LLMs has raised critical privacy and security concerns.
During interactions, LLMs may unintentionally expose confidential database
contents or be manipulated by attackers to exfiltrate data through seemingly
benign queries. While current efforts typically rely on rule-based heuristics
or LLM agents to mitigate this leakage risk, these methods still struggle with
complex inference-based attacks, suffer from high false positive rates, and
often compromise the reliability of SQL queries. To address these challenges,
we propose \textsc{SafeNlidb}, a novel privacy-security alignment framework for
LLM-based NLIDB. The framework features an automated pipeline that generates
hybrid chain-of-thought interaction data from scratch, seamlessly combining
implicit security reasoning with SQL generation. Additionally, we introduce
reasoning warm-up and alternating preference optimization to overcome the
multi-preference oscillations of Direct Preference Optimization (DPO), enabling
LLMs to produce security-aware SQL through fine-grained reasoning without the
need for human-annotated preference data. Extensive experiments demonstrate
that our method outperforms both larger-scale LLMs and ideal-setting baselines,
achieving significant security improvements while preserving high
utility. WARNING: This work may contain content that is offensive and harmful!
comment: 26 pages, 14 figures, 22 tables
☆ Sensitivity of Small Language Models to Fine-tuning Data Contamination
Small Language Models (SLMs) are increasingly being deployed in
resource-constrained environments, yet their behavioral robustness to data
contamination during instruction tuning remains poorly understood. We
systematically investigate the contamination sensitivity of 23 SLMs (270M to 4B
parameters) across multiple model families by measuring susceptibility to
two classes of transformations applied during instruction tuning:
syntactic transformations (character and word reversal) and semantic
transformations (irrelevant and counterfactual responses), each applied at
contamination levels of 25\%, 50\%, 75\%, and 100\%. Our results reveal
fundamental asymmetries in vulnerability patterns: syntactic transformations
cause catastrophic performance degradation, with character reversal producing
near-complete failure across all models regardless of size or family, while
semantic transformations demonstrate distinct threshold behaviors and greater
resilience in core linguistic capabilities. Critically, we discover a
``\textit{capability curse}'' where larger, more capable models become more
susceptible to learning semantic corruptions, effectively following harmful
instructions more readily, while our analysis of base versus instruction-tuned
variants reveals that alignment provides inconsistent robustness benefits,
sometimes even reducing resilience. Our work establishes three core
contributions: (1) empirical evidence of SLMs' disproportionate vulnerability
to syntactic pattern contamination, (2) identification of asymmetric
sensitivity patterns between syntactic and semantic transformations, and (3)
systematic evaluation protocols for contamination robustness assessment. These
findings have immediate deployment implications, suggesting that current
robustness assumptions may not hold for smaller models and highlighting the
need for contamination-aware training protocols.
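The syntactic transformations named in the abstract above (character and word reversal) can be sketched as follows, together with a helper that contaminates a chosen fraction of responses. This is a minimal sketch under stated assumptions: the function names, the example responses, and the seeded random selection of which examples to contaminate are all hypothetical, since the abstract does not describe the exact protocol.

```python
import random

def reverse_characters(text):
    """Character reversal: reverse the whole string character by character."""
    return text[::-1]

def reverse_words(text):
    """Word reversal: keep each word intact but reverse the word order."""
    return " ".join(reversed(text.split()))

def contaminate(responses, transform, level, seed=0):
    """Apply `transform` to a fraction `level` (0.0 to 1.0) of the responses.

    Which responses are chosen is decided by a seeded random sample; this
    selection rule is an assumption made for the sketch.
    """
    rng = random.Random(seed)
    count = round(level * len(responses))
    chosen = set(rng.sample(range(len(responses)), count))
    return [transform(r) if i in chosen else r for i, r in enumerate(responses)]

# Hypothetical instruction-tuning responses.
responses = ["the cat sat", "open the door", "add two numbers", "sort the list"]
half = contaminate(responses, reverse_characters, level=0.5)
full = contaminate(responses, reverse_words, level=1.0)
```

At a level of 0.5, two of the four responses are reversed and the rest are left untouched; at 1.0, every response is transformed.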
☆ Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights
Hyunjae Kim, Jiwoong Sohn, Aidan Gilson, Nicholas Cochran-Caggiano, Serina Applebaum, Heeju Jin, Seihee Park, Yujin Park, Jiyeong Park, Seoyoung Choi, Brittany Alexandra Herrera Contreras, Thomas Huang, Jaehoon Yun, Ethan F. Wei, Roy Jiang, Leah Colucci, Eric Lai, Amisha Dave, Tuo Guo, Maxwell B. Singer, Yonghoe Koo, Ron A. Adelman, James Zou, Andrew Taylor, Arman Cohan, Hua Xu, Qingyu Chen
Large language models (LLMs) are transforming the landscape of medicine, yet
two fundamental challenges persist: keeping up with rapidly evolving medical
knowledge and providing verifiable, evidence-grounded reasoning.
Retrieval-augmented generation (RAG) has been widely adopted to address these
limitations by supplementing model outputs with retrieved evidence. However,
whether RAG reliably achieves these goals remains unclear. Here, we present the
most comprehensive expert evaluation of RAG in medicine to date. Eighteen
medical experts contributed a total of 80,502 annotations, assessing 800 model
outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and
USMLE-style queries. We systematically decomposed the RAG pipeline into three
components: (i) evidence retrieval (relevance of retrieved passages), (ii)
evidence selection (accuracy of evidence usage), and (iii) response generation
(factuality and completeness of outputs). Contrary to expectation, standard RAG
often degraded performance: only 22% of top-16 passages were relevant, evidence
selection remained weak (precision 41-43%, recall 27-49%), and factuality and
completeness dropped by up to 6% and 5%, respectively, compared with non-RAG
variants. Retrieval and evidence selection remain key failure points for the
model, contributing to the overall performance drop. We further show that
simple yet effective strategies, including evidence filtering and query
reformulation, substantially mitigate these issues, improving performance on
MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call
for re-examining RAG's role in medicine and highlight the importance of
stage-aware evaluation and deliberate system design for reliable medical LLM
applications.
comment: 34 pages, 6 figures
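The evidence-selection figures in the abstract above are reported as precision and recall. A minimal sketch of how these two quantities could be computed for one query is given below, assuming that expert labels mark which retrieved passages are relevant. The passage identifiers and the function name are hypothetical, and the study's actual annotation scheme may differ.

```python
def precision_recall(selected, relevant):
    """Precision and recall of the passages a model used, against expert labels."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical passage IDs for a single query.
used_by_model = ["p1", "p2", "p3", "p4"]
marked_relevant = ["p2", "p4", "p7"]
precision, recall = precision_recall(used_by_model, marked_relevant)
```

Here two of the four passages the model used are relevant (precision 0.5), and two of the three relevant passages were used (recall 2/3).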
☆ Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View AAAI 2026
Recent advances in Multimodal Large Language Models (MLLMs) have spurred
significant progress in Chain-of-Thought (CoT) reasoning. Building on the
success of DeepSeek-R1, researchers extended multimodal reasoning to
post-training paradigms based on reinforcement learning (RL), focusing
predominantly on mathematical datasets. However, existing post-training
paradigms tend to neglect two critical aspects: (1) The lack of quantifiable
difficulty metrics capable of strategically screening samples for post-training
optimization. (2) Suboptimal post-training paradigms that fail to jointly
optimize perception and reasoning capabilities. To address this gap, we propose
two novel difficulty-aware sampling strategies: Progressive Image Semantic
Masking (PISM) quantifies sample hardness through systematic image degradation,
while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction
complexity via attention distribution analysis. Leveraging these metrics, we
design a hierarchical training framework that incorporates both GRPO-only and
SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark
datasets. Experiments demonstrate consistent superiority of GRPO applied to
difficulty-stratified samples compared to conventional SFT+GRPO pipelines,
indicating that strategic data sampling can obviate the need for supervised
fine-tuning while improving model accuracy. Our code will be released at
https://github.com/qijianyu277/DifficultySampling.
comment: Accepted by AAAI 2026
☆ Sentiment Analysis On YouTube Comments Using Machine Learning Techniques Based On Video Games Content
Adi Danish Bin Muhammad Amin, Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Zulfahmi Toh, Nur Syafiqah Nafis
The rapid evolution of the gaming industry, driven by technological
advancements and a burgeoning community, necessitates a deeper understanding of
user sentiments, especially as expressed on popular social media platforms like
YouTube. This study presents a sentiment analysis on video games based on
YouTube comments, aiming to understand user sentiments within the gaming
community. Using the YouTube API, comments related to various video games were
collected and analyzed using the TextBlob sentiment analysis tool. The
pre-processed data underwent classification using machine learning algorithms,
including Na\"ive Bayes, Logistic Regression, and Support Vector Machine (SVM).
Among these, SVM demonstrated superior performance, achieving the highest
classification accuracy across different datasets. The analysis spanned
multiple popular gaming videos, revealing trends and insights into user
preferences and critiques. The findings underscore the importance of advanced
sentiment analysis in capturing the nuanced emotions expressed in user
comments, providing valuable feedback for game developers to enhance game
design and user experience. Future research will focus on integrating more
sophisticated natural language processing techniques and exploring additional
data sources to further refine sentiment analysis in the gaming domain.
comment: 6 pages, 7 figures, 2025 IEEE 9th International Conference on
Software Engineering & Computer Systems
☆ Place Matters: Comparing LLM Hallucination Rates for Place-Based Legal Queries
How do we make a meaningful comparison of a large language model's knowledge
of the law in one place compared to another? Quantifying these differences is
critical to understanding if the quality of the legal information obtained by
users of LLM-based chatbots varies depending on their location. However,
obtaining meaningful comparative metrics is challenging because legal
institutions in different places are not themselves easily comparable. In this
work we propose a methodology to obtain place-to-place metrics based on the
comparative law concept of functionalism. We construct a dataset of factual
scenarios drawn from Reddit posts by users seeking legal advice for family,
housing, employment, crime and traffic issues. We use these to elicit a summary
of a law from the LLM relevant to each scenario in Los Angeles, London and
Sydney. These summaries, typically of a legislative provision, are manually
evaluated for hallucinations. We show that the rate of hallucination of legal
information by leading closed-source LLMs is significantly associated with
place. This suggests that the quality of legal solutions provided by these
models is not evenly distributed across geography. Additionally, we show a
strong negative correlation between hallucination rate and the frequency of the
majority response when the LLM is sampled multiple times, suggesting a measure
of uncertainty of model predictions of legal facts.
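The uncertainty signal mentioned at the end of the abstract above, the frequency of the majority response across repeated samples, can be sketched as below. This is an illustrative sketch only: the function name and the sample labels are hypothetical, and the abstract does not say how free-text responses are grouped into equivalent answers before counting.

```python
from collections import Counter

def majority_frequency(samples):
    """Return the most common response and the share of samples that gave it."""
    counts = Counter(samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)

# Hypothetical labels standing in for five sampled answers to one query.
samples = ["A", "A", "B", "A", "C"]
answer, frequency = majority_frequency(samples)
```

A low majority frequency means the samples disagree with each other, which the abstract reports as being associated with a higher hallucination rate.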
☆ Textual Self-attention Network: Test-Time Preference Optimization through Textual Gradient-based Attention AAAI2026
Large Language Models (LLMs) have demonstrated remarkable generalization
capabilities, but aligning their outputs with human preferences typically
requires expensive supervised fine-tuning. Recent test-time methods leverage
textual feedback to overcome this, but they often critique and revise a single
candidate response, lacking a principled mechanism to systematically analyze,
weigh, and synthesize the strengths of multiple promising candidates. Such a
mechanism is crucial because different responses may excel in distinct aspects
(e.g., clarity, factual accuracy, or tone), and combining their best elements
may produce a far superior outcome. This paper proposes the Textual
Self-Attention Network (TSAN), a new paradigm for test-time preference
optimization that requires no parameter updates. TSAN emulates self-attention
entirely in natural language to overcome this gap: it analyzes multiple
candidates by formatting them into textual keys and values, weighs their
relevance using an LLM-based attention module, and synthesizes their strengths
into a new, preference-aligned response under the guidance of the learned
textual attention. This entire process operates in a textual gradient space,
enabling iterative and interpretable optimization. Empirical evaluations
demonstrate that with just three test-time iterations on a base SFT model, TSAN
outperforms supervised models like Llama-3.1-70B-Instruct and surpasses the
current state-of-the-art test-time alignment method by effectively leveraging
multiple candidate solutions.
comment: AAAI2026
☆ Steering LLMs toward Korean Local Speech: Iterative Refinement Framework for Faithful Dialect Translation LREC 2026
Standard-to-dialect machine translation remains challenging due to a
persistent dialect gap in large language models and evaluation distortions
inherent in n-gram metrics, which favor source copying over authentic dialect
translation. In this paper, we propose the dialect refinement (DIA-REFINE)
framework, which guides LLMs toward faithful target dialect outputs through an
iterative loop of translation, verification, and feedback using external
dialect classifiers. To address the limitations of n-gram-based metrics, we
introduce the dialect fidelity score (DFS) to quantify linguistic shift and the
target dialect ratio (TDR) to measure the success of dialect translation.
Experiments on Korean dialects across zero-shot and in-context learning
baselines demonstrate that DIA-REFINE consistently enhances dialect fidelity.
The proposed metrics distinguish between False Success cases, where high n-gram
scores obscure failures in dialectal translation, and True Attempt cases, where
genuine attempts at dialectal translation yield low n-gram scores. We also
observed that models exhibit varying degrees of responsiveness to the
framework, and that integrating in-context examples further improves the
translation of dialectal expressions. Our work establishes a robust framework
for goal-directed, inclusive dialect translation, providing both rigorous
evaluation and critical insights into model performance.
comment: Submitted to LREC 2026
☆ How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models
Now that AI-driven moderation has become pervasive in everyday life, we often
hear claims that "the AI is biased". While this is often said jokingly, the
light-hearted remark reflects a deeper concern. How can we be certain that an
online post flagged as "inappropriate" was not simply the victim of a biased
algorithm? This paper investigates this problem using a dual approach. First, I
conduct a quantitative benchmark of a widely used toxicity model
(unitary/toxic-bert) to measure performance disparity between text in
African-American English (AAE) and Standard American English (SAE). The
benchmark reveals a clear, systematic bias: on average, the model scores AAE
text as 1.8 times more toxic and 8.8 times higher for "identity hate". Second,
I introduce an interactive pedagogical tool that makes these abstract biases
tangible. The tool's core mechanic, a user-controlled "sensitivity threshold,"
demonstrates that the biased score itself is not the only harm; instead, the
more-concerning harm is the human-set, seemingly neutral policy that ultimately
operationalises discrimination. This work provides both statistical evidence of
disparate impact and a public-facing tool designed to foster critical AI
literacy.
comment: 9 pages, 5 figures, 4 tables, 14 references
☆ HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment AAAI 2026
Ruijia Wu, Ping Chen, Fei Shen, Shaoan Zhao, Qiang Hui, Huanlin Gao, Ting Lu, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian
Contrastive vision-language models like CLIP have achieved impressive results
in image-text retrieval by aligning image and text representations in a shared
embedding space. However, these models often treat text as flat sequences,
limiting their ability to handle complex, compositional, and long-form
descriptions. In particular, they fail to capture two essential properties of
language: semantic hierarchy, which reflects the multi-level compositional
structure of text, and semantic monotonicity, where richer descriptions should
result in stronger alignment with visual content. To address these limitations,
we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style
models without modifying the encoder architecture. HiMo-CLIP introduces two key
components: a hierarchical decomposition (HiDe) module that extracts latent
semantic components from long-form text via in-batch PCA, enabling flexible,
batch-aware alignment across different semantic granularities, and a
monotonicity-aware contrastive loss (MoLo) that jointly aligns global and
component-level representations, encouraging the model to internalize semantic
ordering and alignment strength as a function of textual completeness. These
components work in concert to produce structured, cognitively-aligned
cross-modal representations. Experiments on multiple image-text retrieval
benchmarks show that HiMo-CLIP consistently outperforms strong baselines,
particularly under long or compositional descriptions. The code is available at
https://github.com/UnicomAI/HiMo-CLIP.
comment: Accepted by AAAI 2026 as an Oral Presentation (13 pages, 7 figures, 7
tables)
☆ Adaptive Testing for Segmenting Watermarked Texts From Language Models
The rapid adoption of large language models (LLMs), such as GPT-4 and Claude
3.5, underscores the need to distinguish LLM-generated text from human-written
content to mitigate the spread of misinformation and misuse in education. One
promising approach to address this issue is the watermark technique, which
embeds subtle statistical signals into LLM-generated text to enable reliable
identification. In this paper, we first generalize the likelihood-based LLM
detection method of a previous study by introducing a flexible weighted
formulation, and further adapt this approach to the inverse transform sampling
method. Moving beyond watermark detection, we extend this adaptive detection
strategy to tackle the more challenging problem of segmenting a given text into
watermarked and non-watermarked substrings. In contrast to the approach in a
previous study, which relies on accurate estimation of next-token probabilities
that are highly sensitive to prompt estimation, our proposed framework removes
the need for precise prompt estimation. Extensive numerical experiments
demonstrate that the proposed methodology is both effective and robust in
accurately segmenting texts containing a mixture of watermarked and
non-watermarked content.
comment: 13 pages, 3 figures, accepted for publication in STAT, October 28,
2025
☆ GRAPH-GRPO-LEX: Contract Graph Modeling and Reinforcement Learning with Group Relative Policy Optimization
Contracts are complex documents featuring detailed formal structures,
explicit and implicit dependencies and rich semantic content. Given these
document properties, contract drafting and manual examination of contracts have
proven to be both arduous and susceptible to errors. This work aims to simplify
and automate the task of contract review and analysis using a novel framework
for transforming legal contracts into structured semantic graphs, enabling
computational analysis and data-driven insights. We introduce a detailed
ontology mapping core legal contract elements to their graph-theoretic
equivalents of nodes and edges. We then present a reinforcement learning based
Large Language Model (LLM) framework for segmentation and extraction of
entities and relationships from contracts. Our method, GRAPH-GRPO-LEX,
incorporates both LLMs and reinforcement learning with group relative policy
optimization (GRPO). By applying a carefully drafted reward function of graph
metrics, we demonstrate the ability to automatically identify direct
relationships between clauses, and even uncover hidden dependencies. Our
introduction of the gated GRPO approach shows a strong learning signal and can
move contract analysis from a linear, manual reading process to an easily
visualized graph. This allows for a more dynamic analysis, including building
the groundwork for contract linting similar to what is now practiced in
software engineering.
☆ Duality-based Mode Operations and Pyramid Multilayer Mapping for Rhetorical Modes
Rhetorical modes are useful in both academic and non-academic writing, and
can be studied within linguistic research and computational modeling.
Establishing a conceptual bridge among these domains could enable
each to benefit from the others. This paper proposes duality-based mode
operations (split-unite, forward-backward, expansion-reduction and orthogonal
dualities) to expand the set of rhetorical modes, introducing generated modes
like combination and generalization, thereby enhancing epistemic diversity
across multiple applications. It further presents a pyramid multilayer mapping
framework (e.g., three layers, from the rhetorical model layer to the cognitive
layer to the epistemic layers) that reduces the resulting cognitive
complexity. The degrees of expressive diversity and complexity reduction are
quantified through binomial combinatorics and Shannon entropy analysis. A
Marginal Rhetorical Bit (MRB) is identified, permitting the definition of a
rhetorical-scalable parameter that measures expressive growth speed in bits per
stage. A direct entropy measure shows that hierarchical selection over smaller
subsets markedly reduces choice uncertainty compared with flat selection across
all modes. These considerations appear to transform static and non-measurable
rhetorical taxonomies into more dynamic and more measurable systems for
discourse design. From this work, it would be possible to identify a pathway
for future AI systems to operate not only on language tokens but on layered
rhetorical reasoning structures, bridging linguistic, pedagogical, academic,
and computational research.
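One possible reading of the entropy comparison in the abstract above is sketched below for uniform choices: selecting among all modes at once versus selecting in stages over smaller subsets. The sizes (16 modes, arranged as 4 groups of 4) and the function name are assumptions for illustration, and the paper's own entropy measure may be defined differently.

```python
import math

def uniform_entropy_bits(n_options):
    """Shannon entropy, in bits, of a uniform choice among n_options."""
    return math.log2(n_options)

# Hypothetical sizes: 16 modes, arranged as 4 groups of 4 modes each.
flat_bits = uniform_entropy_bits(16)  # one choice among all modes
per_stage_bits = [
    uniform_entropy_bits(4),  # stage 1: pick a group
    uniform_entropy_bits(4),  # stage 2: pick a mode within the group
]
```

Under this uniform assumption, the uncertainty faced at any single decision drops from 4 bits to 2 bits, while the stages together still sum to the flat total.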
☆ MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
As large language models transition from text-based interfaces to audio
interactions in clinical settings, they might introduce new vulnerabilities
through paralinguistic cues in audio. We evaluated these models on 170 clinical
cases, each synthesized into speech from 36 distinct voice profiles spanning
variations in age, gender, and emotion. Our findings reveal a severe modality
bias: surgical recommendations for audio inputs varied by as much as 35%
compared to identical text-based inputs, with one model providing 80% fewer
recommendations. Further analysis uncovered age disparities of up to 12%
between young and elderly voices, which persisted in most models despite
chain-of-thought prompting. While explicit reasoning successfully eliminated
gender bias, the impact of emotion was not detected due to poor recognition
performance. These results demonstrate that audio LLMs are susceptible to
making clinical decisions based on a patient's voice characteristics rather
than medical evidence, a flaw that risks perpetuating healthcare disparities.
We conclude that bias-aware architectures are essential and urgently needed
before the clinical deployment of these models.
☆ TabRAG: Tabular Document Retrieval via Structured Language Representations NeurIPS 2025
Ingesting data for Retrieval-Augmented Generation (RAG) involves either
fine-tuning the embedding model directly on the target corpus or parsing
documents for embedding model encoding. The former, while accurate, incurs high
computational hardware requirements, while the latter suffers from suboptimal
performance when extracting tabular data. In this work, we address the latter
by presenting TabRAG, a parsing-based RAG pipeline designed to tackle
table-heavy documents via structured language representations. TabRAG
outperforms existing popular parsing-based methods for generation and
retrieval. Code is available at https://github.com/jacobyhsi/TabRAG.
comment: NeurIPS 2025 AI4Tab
♻ ☆ Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection
Multimodal models play a key role in empathy detection, but their performance
can suffer when modalities provide conflicting cues. To understand these
failures, we examine cases where unimodal and multimodal predictions diverge.
Using fine-tuned models for text, audio, and video, along with a gated fusion
model, we find that such disagreements often reflect underlying ambiguity, as
evidenced by annotator uncertainty. Our analysis shows that dominant signals in
one modality can mislead fusion when unsupported by others. We also observe
that humans, like models, do not consistently benefit from multimodal input.
These insights position disagreement as a useful diagnostic signal for
identifying challenging examples and improving empathy system robustness.
♻ ☆ REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in
aligning Large Language Models (LLMs). The dominant algorithm, Proximal Policy
Optimization (PPO), employs a critic network to estimate advantages, which
introduces significant computational and memory overhead. To address this, a
family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these
methods typically rely on prompt-level (local) advantage normalization, which
suffers from inaccurate advantage estimation and a tendency to overfit and, as
we show, yields a theoretically biased estimator. To solve
these challenges, we introduce REINFORCE++, a critic-free framework centered on
Global Advantage Normalization. By normalizing advantages across the
entire global batch rather than small, prompt-specific groups, our method
provides a more stable and theoretically sound, effectively unbiased
estimate (whose bias vanishes as batch size increases). We introduce two
variants: REINFORCE++, a highly efficient and general algorithm ($k \ge 1$) for
general-domain RLHF, and REINFORCE++ w/ baseline, a robust group-sampling
variant ($k > 1$) for complex reasoning tasks. Our empirical evaluation
demonstrates that each variant shows superior stability and performance in its
respective domain, outperforming existing methods and even PPO in complex
agentic settings.
comment: refactor
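The contrast between prompt-level and global advantage normalization described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: rewards are taken to be one scalar per sampled response, the function names and the `EPS` constant are hypothetical, and any KL or clipping terms are omitted.

```python
from statistics import mean, pstdev

EPS = 1e-8  # hypothetical stabilizer to avoid division by zero


def local_normalize(groups):
    """Prompt-level (local) normalization: each group uses its own mean and std."""
    out = []
    for rewards in groups:
        mu, sigma = mean(rewards), pstdev(rewards)
        out.append([(r - mu) / (sigma + EPS) for r in rewards])
    return out


def global_normalize(groups):
    """Global normalization: one mean and std shared by the whole batch."""
    flat = [r for rewards in groups for r in rewards]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + EPS) for r in rewards] for rewards in groups]


def group_baseline_global_normalize(groups):
    """Assumed form of a group-sampling variant: subtract each prompt's mean
    reward as a baseline, then normalize the centered values over the batch."""
    centered = [[r - mean(rewards) for r in rewards] for rewards in groups]
    return global_normalize(centered)


# Two prompts, two samples each: the second prompt is "easy" (both rewards high).
batch = [[1.0, 0.0], [0.9, 1.0]]
local_adv = local_normalize(batch)
global_adv = global_normalize(batch)
```

With local normalization the 0.9-reward response in the easy group is pushed to a strongly negative advantage because only its own small group sets the scale, whereas global normalization keeps it positive relative to the whole batch.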
♻ ☆ Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper
Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system
that mimics the core research workflow of a novice student researcher: given
a baseline paper from a human mentor, it analyzes the paper's limitations,
formulates novel hypotheses for improvement, iteratively conducts
experiments until improvements are realized, and writes a paper with the
results. Unlike previous approaches that assume full automation or operate on
small-scale code, Jr. AI Scientist follows a well-defined research workflow and
leverages modern coding agents to handle complex, multi-file implementations,
leading to scientifically valuable contributions. Through our experiments, the
Jr. AI Scientist successfully generated new research papers that build upon
real NeurIPS, IJCV, and ICLR works by proposing and implementing novel methods.
For evaluation, we conducted automated assessments using AI Reviewers,
author-led evaluations, and submissions to Agents4Science, a venue dedicated to
AI-driven scientific contributions. The findings demonstrate that Jr. AI
Scientist generates papers receiving higher review scores than existing fully
automated systems. Nevertheless, we identify important limitations from both
the author evaluation and the Agents4Science reviews, indicating the potential
risks of directly applying current AI Scientist systems and key challenges for
future research. Finally, we comprehensively report various risks identified
during development. We believe this study clarifies the current role and
limitations of AI Scientist systems, offering insights into the areas that
still require human expertise and the risks that may emerge as these systems
evolve.
comment: Issues, comments, and questions are all welcome at
https://github.com/Agent4Science-UTokyo/Jr.AI-Scientist
♻ ☆ When Language Shapes Thought: Cross-Lingual Transfer of Factual Knowledge in Question Answering
Multilingual large language models (LLMs) offer promising opportunities for
cross-lingual information access, yet their use of factual knowledge remains
highly sensitive to the input language. Prior work has addressed this through
English prompting and evaluation, assuming that English-based reasoning is
universally beneficial. In this work, we challenge that assumption by exploring
factual knowledge transfer from non-English to English through the lens of
Language and Thought Theory. We introduce Language-to-Thought (L2T) prompting,
which aligns the model's internal "thinking" language with the source of
knowledge. Across three languages and four models, L2T consistently outperforms
English-based reasoning, reversing the expected advantage of English prompts.
Our code is available at https://github.com/GeomeunByeol/Language2Thought.
comment: Accepted at CIKM2025 (Expanded version)
♻ ☆ ZK-SenseLM: Verifiable Large-Model Wireless Sensing with Selective Abstention and Zero-Knowledge Attestation
ZK-SenseLM is a secure and auditable wireless sensing framework that pairs a
large-model encoder for Wi-Fi channel state information (and optionally mmWave
radar or RFID) with a policy-grounded decision layer and end-to-end
zero-knowledge proofs of inference. The encoder uses masked spectral
pretraining with phase-consistency regularization, plus a light cross-modal
alignment that ties RF features to compact, human-interpretable policy tokens.
To reduce unsafe actions under distribution shift, we add a calibrated
selective-abstention head; the chosen risk-coverage operating point is
registered and bound into the proof. We implement a four-stage proving
pipeline: (C1) feature sanity and commitment, (C2) threshold and version
binding, (C3) time-window binding, and (C4) PLONK-style proofs that the
quantized network, given the committed window, produced the logged action and
confidence. Micro-batched proving amortizes cost across adjacent windows, and a
gateway option offloads proofs from low-power devices. The system integrates
with differentially private federated learning and on-device personalization
without weakening verifiability: model hashes and the registered threshold are
part of each public statement. Across activity, presence or intrusion,
respiratory proxy, and RF fingerprinting tasks, ZK-SenseLM improves macro-F1
and calibration, yields favorable coverage-risk curves under perturbations, and
rejects tamper and replay with compact proofs and fast verification.
comment: 45 pages
♻ ☆ ReCode: Updating Code API Knowledge with Reinforcement Learning AAAI 2026
Large Language Models (LLMs) exhibit remarkable code generation capabilities
but falter when adapting to frequent updates in external library APIs. This
critical limitation, stemming from reliance on outdated API knowledge from
their training data, even with access to current documentation, impedes
reliable code generation in dynamic environments. To tackle this issue, we
propose ReCode (rule-based Reinforcement learning for Code Update), a novel
framework that mimics human programmer adaptation to API changes. Specifically,
we construct a dataset of approximately 2,000 data entries to train the LLMs to
perform version migration based on updated information. Then, we introduce a
modified string similarity metric for code evaluation as the reward for
reinforcement learning. Our experiments demonstrate that ReCode substantially
boosts LLMs' code generation performance in dynamic API scenarios, especially
on the unseen CodeUpdateArena task. Crucially, compared to supervised
fine-tuning, ReCode has less impact on LLMs' general code generation abilities.
We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and
DAPO), all achieving consistent improvements. Notably, after training,
Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned
model and the reasoning model with the same architecture. Code is available at
https://github.com/zjunlp/ReCode.
comment: AAAI 2026
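The idea of using string similarity as a reinforcement-learning reward for code can be sketched as below. The abstract does not specify ReCode's modified metric, so this sketch substitutes the standard-library `difflib.SequenceMatcher` ratio as a stand-in; the function name `similarity_reward` and the whitespace normalization are assumptions for illustration only.

```python
import difflib


def similarity_reward(generated: str, reference: str) -> float:
    """Return a reward in [0, 1] based on character-level similarity.

    Stand-in metric (not the paper's): runs of whitespace are collapsed to a
    single space on both sides, then difflib's matching-blocks ratio is used.
    """
    gen = " ".join(generated.split())
    ref = " ".join(reference.split())
    return difflib.SequenceMatcher(None, gen, ref).ratio()


# A migrated call that matches the reference exactly earns the full reward;
# code still using an outdated API earns only partial credit.
reference = "result = pd.concat([df, row])"
reward_exact = similarity_reward("result = pd.concat([df, row])", reference)
reward_stale = similarity_reward("result = df.append(row)", reference)
```

A dense similarity signal of this kind gives partial credit to near-miss migrations, which is one plausible reason to prefer it over a binary pass/fail reward.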
♻ ☆ Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines
Data preparation (DP) transforms raw data into a form suitable for downstream
applications, typically by composing operations into executable pipelines.
Building such pipelines is time-consuming and requires sophisticated
programming skills, posing a significant barrier for non-experts. To lower this
barrier, we introduce Text-to-Pipeline, a new task that translates NL data
preparation instructions into DP pipelines, and PARROT, a large-scale benchmark
to support systematic evaluation. To ensure realistic DP scenarios, PARROT is
built by mining transformation patterns from production pipelines and
instantiating them on 23,009 real-world tables, resulting in ~18,000 tasks
spanning 16 core operators. Our empirical evaluation on PARROT reveals a
critical failure mode in cutting-edge LLMs: they struggle not only with
multi-step compositional logic but also with semantic parameter grounding. We
thus establish a strong baseline with Pipeline-Agent, an execution-aware agent
that iteratively reflects on intermediate states. While it achieves
state-of-the-art performance, a significant gap remains, underscoring the deep,
unsolved challenges posed by PARROT. The benchmark provides an essential, large-scale testbed
for developing and evaluating the next generation of autonomous data
preparation agentic systems.
♻ ☆ CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
We present CoSense-LLM, an edge-first framework that turns continuous
multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and
lightweight vision) into compact, verifiable semantic tokens and coordinates
with large language models under explicit latency, energy, bandwidth, and
privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight
encoder that aligns sensor embeddings with language and compresses them into
short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer
that grounds generation in site-specific policies and notes; (iii)
PromptRouter, a cost- and uncertainty-aware policy that selects edge-only
generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure
Execution, an auditable redaction path that enforces data minimization so raw
waveforms never leave the device. The system works with modern serving
optimizations, including paged or streaming KV caches, FlashAttention-style
kernels, speculative decoding, and quantized LoRA adapters, and supports
on-device personalization and federated updates under non-IID drift. Across home,
office, and clinic deployments, CoSense-LLM delivers grounded explanations
while meeting tight service-level objectives: it sustains sub-second (p95)
end-to-end latency on edge-dominant paths, reduces inter-tier token and bandwidth
costs by preferring local retrieval-grounded responses, and preserves privacy
by transmitting only discrete codes and redacted metadata. Ablations show that
Edge-RAG improves factual consistency and reduces contradictions, calibrated
uncertainty enables selective abstention and controlled escalations, and KV
plus decoding accelerators lower energy per decision. The results support an
edge-first design that treats semantics, privacy, and predictable latency as
co-equal goals for large model deployments in interference-prone environments.
comment: 19 pages, 8 figures
♻ ☆ Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE),
Unigram), morphological alignment, tokenization quality (e.g., compression
efficiency), and downstream performance remains largely unclear, particularly
for languages with complex morphology. In this paper, we conduct a
comprehensive evaluation of tokenizers using small-sized BERT models -- from
pre-training through fine-tuning -- for Telugu (agglutinative), along with
preliminary evaluation in Hindi (primarily fusional with some agglutination)
and English (fusional). To evaluate morphological alignment of tokenizers in
Telugu, we create a dataset containing gold morpheme segmentations of 600
derivational and 7000 inflectional word forms.
Our experiments reveal two key findings for Telugu. First, the choice of
tokenizer algorithm is the most significant factor influencing performance,
with Unigram-based tokenizers consistently outperforming BPE across most
settings. Second, while better morphological alignment shows a moderate,
positive correlation with performance on text classification and structure
prediction tasks, its impact is secondary to the tokenizer algorithm. Notably,
hybrid approaches that use morphological information for pre-segmentation
significantly boost the performance of BPE, though not Unigram. Our results
further showcase the need for comprehensive intrinsic evaluation metrics for
tokenizers that could explain downstream performance trends consistently.
♻ ☆ All Entities are Not Created Equal: Examining the Long Tail for Ultra-Fine Entity Typing
Due to their capacity to acquire world knowledge from large corpora,
pre-trained language models (PLMs) are extensively used in ultra-fine entity
typing tasks where the space of labels is extremely large. In this work, we
explore the limitations of the knowledge acquired by PLMs by proposing a novel
heuristic to approximate the pre-training distribution of entities when the
pre-training data is unknown. Then, we systematically demonstrate that
entity-typing approaches that rely solely on the parametric knowledge of PLMs
struggle significantly with entities at the long tail of the pre-training
distribution, and that knowledge-infused approaches can account for some of
these shortcomings. Our findings suggest that we need to go beyond PLMs to
produce solutions that perform well for infrequent entities.
♻ ☆ Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited IJCAI
We investigate the abilities of 28 Large Language Models (LLMs) to reason
about cardinal directions (CDs) using a benchmark generated from a set of
templates, extensively testing an LLM's ability to determine the correct CD
given a particular scenario. The templates allow for a number of degrees of
variation such as means of locomotion of the agent involved, and whether set in
the first, second or third person. Even the newer Large Reasoning Models are
unable to reliably determine the correct CD for all questions. This paper
summarises and extends earlier work presented at COSIT-24.
comment: 8 pages, 5 figures. Accepted at QR 2025: 38th International Workshop
on Qualitative Reasoning at IJCAI. arXiv admin note: substantial text overlap
with arXiv:2406.16528
♻ ☆ DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning NeurIPS 2025
Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Xiaozhe Ren, Chen Zhang, Hanting Chen, Yasheng Wang, Lu Hou, Lifeng Shang
Information seeking demands iterative evidence gathering and reflective
reasoning, yet large language models (LLMs) still struggle with it in open-web
question answering. Existing prompting and supervised fine-tuning (SFT) methods
remain fixed by prompt rules or training corpora, and are usually benchmarked
only on well-structured wiki sources, limiting real-world adaptability. We
introduce WebPuzzle, a 24k-sample training and 275-sample test benchmark that
evaluates information seeking on the live internet, across both wiki and
open-domain queries. Leveraging 7k WebPuzzle instances, we develop DeepDiver, a
reinforcement-learning (RL) framework that cultivates Search Intensity Scaling
(SIS), an emergent ability to escalate search frequency and depth instead of
settling on overconfident, under-evidenced answers. With SIS,
Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks
comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's curriculum
from cold-start SFT to a well-designed RL procedure, and show that its seeking
policy generalizes from closed-ended queries to open-ended generation such as
long-form writing. Our results advance adaptive information seeking in LLMs and
provide a rigorous benchmark for future work.
comment: Accepted as NeurIPS 2025 Spotlight
♻ ☆ Shared Heritage, Distinct Writing: Rethinking Resource Selection for East Asian Historical Documents AACL 2025
Historical documents in the Sinosphere are known to share common formats and
practices, particularly in veritable records compiled by court historians. This
shared linguistic heritage has led researchers to use Classical Chinese
resources for cross-lingual transfer when processing historical documents from
Korea and Japan, which remain relatively low-resource. In this paper, we
question the assumption of cross-lingual transferability from Classical Chinese
to Hanja and Kanbun, the ancient written languages of Korea and Japan,
respectively. Our experiments across machine translation, named entity
recognition, and punctuation restoration tasks show minimal impact of Classical
Chinese datasets on language model performance for ancient Korean documents
written in Hanja, with performance differences within $\pm{}0.0068$ F1-score
for sequence labeling tasks and up to $+0.84$ BLEU score for translation. These
limitations persist consistently across various model sizes, architectures, and
domain-specific datasets. Our analysis reveals that the benefits of Classical
Chinese resources diminish rapidly as local language data increases for Hanja,
while showing substantial improvements only in extremely low-resource scenarios
for both Korean and Japanese historical documents. These findings emphasize the
need for careful empirical validation rather than assuming benefits from
indiscriminate cross-lingual transfer.
comment: IJCNLP-AACL 2025 Findings
♻ ☆ How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices
Diffusion language models (DLMs) have emerged as a promising alternative to
the long-dominant autoregressive (AR) paradigm, offering a parallelizable
decoding process that could yield greater efficiency. Yet, in practice, current
open-source DLMs often underperform their AR counterparts in speed, limiting
their real-world utility. This work presents a systematic study of DLM
efficiency, identifying key issues in prior evaluation methods. Through
empirical benchmarking and a theoretical analysis, we demonstrate that AR
models generally achieve higher throughput, while DLMs consistently lag. We
also investigate acceleration strategies, finding that techniques like dual
cache and parallel decoding mainly offer gains at small batch sizes, with their
benefits diminishing upon scaling. Our findings underscore the necessity of
robust evaluation methods and improved acceleration strategies to advance
research on DLMs.
♻ ☆ On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation EMNLP 2025
Retrieval-augmented generation (RAG) with large language models (LLMs) has
demonstrated strong performance in multilingual question-answering (QA) tasks
by leveraging relevant passages retrieved from corpora. In multilingual RAG
(mRAG), the retrieved passages can be written in languages other than that of
the query entered by the user, making it challenging for LLMs to effectively
utilize the provided information. Recent research suggests that retrieving
passages from multilingual corpora can improve RAG performance, particularly
for low-resource languages. However, the extent to which LLMs can leverage
different kinds of multilingual contexts to generate accurate answers,
*independently from retrieval quality*, remains understudied. In this paper, we
conduct an extensive assessment of LLMs' ability to (i) make consistent use of
a relevant passage regardless of its language, (ii) respond in the expected
language, and (iii) focus on the relevant passage even when multiple
'distracting' passages in different languages are provided in the context. Our
experiments with four LLMs across three QA datasets covering a total of 48
languages reveal a surprising ability of LLMs to extract the relevant
information from passages in a different language than the query, but a much
weaker ability to formulate a full answer in the correct language. Our
analysis, based on both accuracy and feature attribution techniques, further
shows that distracting passages negatively impact answer quality regardless of
their language. However, distractors in the query language exert a slightly
stronger influence. Taken together, our findings deepen the understanding of
how LLMs utilize context in mRAG systems, providing directions for future
improvements.
comment: Best Paper Award at MRL Workshop 2025, colocated with EMNLP 2025. All
code and data are released at
https://github.com/Betswish/mRAG-Context-Consistency
♻ ☆ Employing Sentence Space Embedding for Classification of Data Stream from Fake News Domain
Tabular data is considered the last unconquered castle of deep learning, yet
the task of data stream classification is stated to be an equally important and
demanding research area. Due to the temporal constraints, it is assumed that
deep learning methods are not the optimal solution for application in this
field. However, excluding the entire -- and prevalent -- group of methods seems
rather rash given the progress that has been made in recent years in its
development. For this reason, the following paper is the first to present an
approach to natural language data stream classification using the sentence
space method, which allows for encoding text into the form of a discrete
digital signal. This allows the use of convolutional deep networks dedicated to
image classification to solve the task of recognizing fake news based on text
data. Based on the real-life Fakeddit dataset, the proposed approach was
compared with state-of-the-art algorithms for data stream classification based
on generalization ability and time complexity.
comment: 16 pages, 7 figures
♻ ☆ DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
Speculative decoding has become a standard way to accelerate LLM inference: a
small drafter proposes multiple tokens and a large target model verifies them
once per speculation length. Recently, scaling of the LLM vocabulary has pushed
the number of tokens to grow substantially. While verification over the full
vocabulary leaves the target model largely unaffected, the O(|V|d) parameters
in the drafter's output head become a latency bottleneck, slowing the entire
pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the
drafter's vocabulary to a fixed top frequent subset of the target model's
vocabulary. Although this reduces draft-time compute, it is brittle, since: (i)
frequency lists are corpus-dependent and require retuning to generalize, and
(ii) static shortlists suppress rare or domain-specific tokens, lowering the
expected number of tokens per verification step. We propose DynaSpec, a
context-dependent dynamic shortlisting mechanism that is robust, speeds up
drafting, and generalizes across diverse tasks. Concretely, we introduce
lightweight, coarse-grained meta-classifiers that route contexts to a small
number of token clusters; the union of the top-k selected clusters forms the
drafter's shortlist, while verification retains the full vocabulary and
exactness. The meta-classifier finishes its computation earlier than the
drafter's hidden state generation by exploiting parallel execution of draft
encoding and meta shortlisting on separate streams. Across standard speculative
decoding benchmarks, DynaSpec delivers consistent improvements in mean accepted
length; for Llama-3-8B, it reaches up to 98.2% of full-vocabulary performance,
while fixed-shortlist baselines attain only 84.4%. By leveraging
context-dependent selection, DynaSpec achieves up to a 2.18 times increase in
generated tokens compared to 1.91 times for fixed-vocabulary approaches.
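The shortlisting step described in the abstract above (route the context to a few token clusters, then take the union of the top-k clusters as the drafter's vocabulary) can be sketched as follows. This is a minimal sketch, assuming the meta-classifier has already produced one score per cluster; the function name `build_shortlist`, the toy clusters, and the scores are hypothetical and not taken from the paper.

```python
import heapq


def build_shortlist(cluster_scores, clusters, k):
    """Return the sorted union of token ids from the k highest-scoring clusters.

    cluster_scores: one score per cluster for the current context.
    clusters: clusters[i] is the list of token ids assigned to cluster i
              (a token may appear in more than one cluster).
    """
    top = heapq.nlargest(k, range(len(cluster_scores)),
                         key=cluster_scores.__getitem__)
    shortlist = set()
    for cluster_id in top:
        shortlist.update(clusters[cluster_id])
    return sorted(shortlist)


# Toy vocabulary of 9 tokens split into 4 overlapping clusters.
clusters = [[0, 1, 2], [3, 4], [5, 6, 7], [2, 8]]
scores = [0.1, 0.7, 0.05, 0.9]  # hypothetical meta-classifier outputs
drafter_vocab = build_shortlist(scores, clusters, k=2)
```

Only the drafter is restricted to `drafter_vocab`; as the abstract notes, verification by the target model still runs over the full vocabulary, so the output distribution is unchanged.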
♻ ☆ Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study
Despite recent progress in training long-chain-of-thought reasoning models
via scaling reinforcement learning (RL), its underlying training dynamics
remain poorly understood, and several counterintuitive behaviors persist. This
work focuses on three key aspects: (1) We systematically analyze the roles of
positive and negative samples in scaling RL, revealing that positive samples
mainly facilitate precise fitting to the training data, whereas negative
samples significantly enhance generalization and robustness. Interestingly,
while positive samples are essential for convergence in the zero-RL setting,
training on negative samples alone suffices to attain strong reasoning
performance and even better generalization in cold-start scenarios. (2) We
identify substantial data inefficiency in group relative policy optimization,
where over half of the samples yield zero advantage. To address this, we
explore two strategies, including relative length rewards and offline sample
injection, to leverage these data better and enhance reasoning efficiency and
capability. (3) We investigate unstable performance across various reasoning
models and benchmarks, attributing instability to uncertain problems with
ambiguous outcomes, and demonstrate that greedy decoding can distort evaluation
by flipping the correctness of responses. Our code is available at:
https://github.com/takagi97/Dissect-Long-Reason-Models.
comment: Work in progress
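The data-inefficiency observation in point (2) of the abstract above follows from how group-relative advantages are computed: when every response in a group receives the same reward, each advantage is zero and the group contributes no gradient signal. The sketch below assumes the common mean-and-standard-deviation form of the group-relative advantage; the function names and the `eps` value are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: center by the group mean, scale by group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def zero_advantage_fraction(groups):
    """Fraction of samples that sit in groups whose rewards are all identical."""
    total = sum(len(g) for g in groups)
    zero = sum(len(g) for g in groups if len(set(g)) == 1)
    return zero / total


# Binary correctness rewards for three prompts with four samples each:
# one always solved, one never solved, one mixed.
groups = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
wasted = zero_advantage_fraction(groups)  # 8 of the 12 samples carry no signal
```

In this toy batch two thirds of the samples yield zero advantage, which is the kind of waste that strategies such as additional reward terms or injected samples aim to recover.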
♻ ☆ LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
As large language models (LLMs) are increasingly used in legal applications,
current evaluation benchmarks tend to focus mainly on factual accuracy while
largely neglecting important linguistic quality aspects such as clarity,
coherence, and terminology. To address this gap, we propose three steps: First,
we develop a regression model to evaluate the quality of legal texts based on
clarity, coherence, and terminology. Second, we create a specialized set of
legal questions. Third, we analyze 49 LLMs using this evaluation framework.
Our analysis identifies three key findings: First, model quality levels off
at 14 billion parameters, with only a marginal improvement of $2.7\%$ noted at
72 billion parameters. Second, engineering choices such as quantization and
context length have a negligible impact, as indicated by statistical
significance thresholds above 0.016. Third, reasoning models consistently
outperform base architectures. A significant outcome of our research is the
release of a ranking list and Pareto analysis, which highlight the Qwen3 series
as the optimal choice for cost-performance tradeoffs. This work not only
establishes standardized evaluation protocols for legal LLMs but also uncovers
fundamental limitations in current training data refinement approaches. Code
and models are available at: https://github.com/lyxx3rd/LegalEval-Q.
comment: 10 pages, 11 figures
♻ ☆ Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond EMNLP 2025
Recent advances in test-time scaling of large language models (LLMs),
exemplified by DeepSeek-R1 and OpenAI's o1, show that extending the chain of
thought during inference can significantly improve general reasoning
performance. However, the impact of this paradigm on legal reasoning remains
insufficiently explored. To address this gap, we present the first systematic
evaluation of 12 LLMs, including both reasoning-focused and general-purpose
models, across 17 Chinese and English legal tasks spanning statutory and
case-law traditions. In addition, we curate a bilingual chain-of-thought
dataset for legal reasoning through distillation from DeepSeek-R1 and develop
Legal-R1, an open-source model specialized for the legal domain. Experimental
results show that Legal-R1 delivers competitive performance across diverse
tasks. DeepSeek-R1 exhibits clear advantages in Chinese legal reasoning, while
OpenAI's o1 achieves comparable results on English tasks. We further conduct a
detailed error analysis, which reveals recurring issues such as outdated legal
knowledge, limited capacity for legal interpretation, and susceptibility to
factual hallucinations. These findings delineate the main obstacles confronting
legal-domain LLMs and suggest promising directions for future research.
comment: 23 pages, Published in Findings of the Association for Computational
Linguistics: EMNLP 2025
♻ ☆ Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training
Network pruning focuses on algorithms that aim to reduce a given model's
computational cost by removing a subset of its parameters while having minimal
impact on performance. Throughout the last decade, the most widely used pruning
paradigm has been pruning and re-training, which nowadays is inconvenient due
to the vast number of pre-trained models, which are, in any case, too expensive
to re-train. In this paper, we exploit functional information from dense
pre-trained models, i.e., their input activations, to obtain sparse models that
maximize the activations' alignment with respect to their corresponding dense
models. Hence, we propose NeuroAl, a top-up algorithm that can
be used on top of any given pruning algorithm for LLMs, which modifies the
block-wise and row-wise sparsity, exploiting information from both the dense
model and its sparse version to maximize the neuron alignment among
activations. Different from existing methods, our approach adaptively selects
the best hyperparameters for the block-wise and row-wise sparsity ratios w.r.t.
the model and the desired sparsity, and requires no re-training. We test
our method over ~300 test cases with four LLM families, three sparsity
ratios, and ten language tasks (three language modeling and seven zero-shot
datasets), showing how it consistently outperforms the latest state-of-the-art
methods in terms of performance-runtime trade-off. The code is available at
https://github.com/eliacunegatti/NeuroAL.
comment: Published in Transactions on Machine Learning Research (TMLR)
♻ ☆ BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation
With the rapid advancement of large language models (LLMs),
retrieval-augmented generation (RAG) has emerged as a critical approach to
supplement the inherent knowledge limitations of LLMs. However, due to the
typically large volume of retrieved information, RAG tends to operate with long
context lengths. From the perspective of entropy engineering, we identify
unconstrained entropy growth and attention dilution due to long retrieval
context as significant factors affecting RAG performance. In this paper, we
propose the balanced entropy-engineered RAG (BEE-RAG) framework, which improves
the adaptability of RAG systems to varying context lengths through the
principle of entropy invariance. By leveraging balanced context entropy to
reformulate attention dynamics, BEE-RAG separates attention sensitivity from
context length, ensuring a stable entropy level. Building upon this, we
introduce a zero-shot inference strategy for multi-importance estimation and a
parameter-efficient adaptive fine-tuning mechanism to obtain the optimal
balancing factor for different settings. Extensive experiments across multiple
RAG tasks demonstrate the effectiveness of BEE-RAG.
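The entropy growth and attention dilution that the abstract above attributes to long retrieval contexts can be illustrated numerically. The sketch below only demonstrates that observation for plain softmax attention; it does not implement BEE-RAG's balancing factor, and the function names and the `relevant_logit` value are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]


def attention_entropy(logits):
    """Shannon entropy (in nats) of the softmax attention distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def one_relevant_token(n_tokens, relevant_logit=2.0):
    """One relevant token among n_tokens - 1 distractors with logit 0.

    Returns (entropy, attention weight on the relevant token).
    """
    logits = [relevant_logit] + [0.0] * (n_tokens - 1)
    return attention_entropy(logits), softmax(logits)[0]


# As the context grows, entropy rises and the relevant token's weight shrinks.
for n in (16, 256, 4096):
    entropy, weight = one_relevant_token(n)
    print(f"n={n:5d}  entropy={entropy:.3f}  relevant_weight={weight:.4f}")
```

With equal logits the entropy is exactly log n, so without some length-aware correction the attention distribution flattens as more passages are retrieved.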
♻ ☆ ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning AAAI 2026
Narrative comprehension of long stories and novels has been a challenging
domain owing to their intricate plotlines and entangled, often evolving
relations among characters and entities. Given the LLM's diminished reasoning
over extended context and its high computational cost, retrieval-based
approaches play a pivotal role in practice. However, traditional RAG methods
could fall short due to their stateless, single-step retrieval process, which
often overlooks the dynamic nature of capturing interconnected relations within
long-range context. In this work, we propose ComoRAG, holding the principle
that narrative reasoning is not a one-shot process, but a dynamic, evolving
interplay between new evidence acquisition and past knowledge consolidation,
analogous to human cognition on reasoning with memory-related signals in the
brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes
iterative reasoning cycles while interacting with a dynamic memory workspace.
In each cycle, it generates probing queries to devise new exploratory paths,
then integrates the retrieved evidence of new aspects into a global memory
pool, thereby supporting the emergence of a coherent context for the query
resolution. Across four challenging long-context narrative benchmarks (200K+
tokens), ComoRAG outperforms strong RAG baselines with consistent relative
gains up to 11% compared to the strongest baseline. Further analysis reveals
that ComoRAG is particularly advantageous for complex queries requiring global
context comprehension, offering a principled, cognitively motivated paradigm
towards retrieval-based stateful reasoning. Our framework is made publicly
available at https://github.com/EternityJune25/ComoRAG.
comment: Accepted by AAAI 2026
♻ ☆ UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT)
reasoning introduces new safety challenges. Existing SFT-based safety alignment
studies have predominantly focused on filtering prompts with safe, high-quality
responses, while overlooking hard prompts that always elicit harmful outputs.
To fill this gap, we introduce UnsafeChain, a safety alignment dataset
constructed from hard prompts with diverse sources, where unsafe completions
are identified and explicitly corrected into safe responses. By exposing models
to unsafe behaviors and guiding their correction, UnsafeChain enhances safety
while preserving general reasoning ability. We fine-tune three LRMs on
UnsafeChain and compare them against recent SafeChain and STAR-1 across six
out-of-distribution and five in-distribution benchmarks. UnsafeChain
consistently outperforms prior datasets, with even a 1K subset matching or
surpassing baseline performance, demonstrating the effectiveness and
generalizability of correction-based supervision. We release our dataset and
code at https://github.com/mbzuai-nlp/UnsafeChain
♻ ☆ How Does a Deep Neural Network Look at Lexical Stress?
Itai Allouche, Itay Asael, Rotem Rousso, Vered Dassa, Ann Bradlow, Seung-Eun Kim, Matthew Goldrick, Joseph Keshet
Despite their success in speech processing, neural networks often operate as
black boxes, prompting the question: what informs their decisions, and how can
we interpret them? This work examines this issue in the context of lexical
stress. A dataset of English disyllabic words was automatically constructed
from read and spontaneous speech. Several Convolutional Neural Network (CNN)
architectures were trained to predict stress position from a spectrographic
representation of disyllabic words lacking minimal stress pairs (e.g., initial
stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out
test data. Layerwise Relevance Propagation (LRP), a technique for CNN
interpretability analysis, revealed that predictions for held-out minimal pairs
(PROtest vs. proTEST) were most strongly influenced by information in stressed
versus unstressed syllables, particularly the spectral properties of stressed
vowels. However, the classifiers also attended to information throughout the
word. A feature-specific relevance analysis is proposed, and its results
suggest that our best-performing classifier is strongly influenced by the
stressed vowel's first and second formants, with some evidence that its pitch
and third formant also contribute. These results reveal deep learning's ability
to acquire distributed cues to stress from naturally occurring data, extending
traditional phonetic work based around highly controlled stimuli.
comment: 11 pages, 5 figures, submitted to the Journal of the Acoustical
Society of America (JASA)
♻ ☆ Pralekha: Cross-Lingual Document Alignment for Indic Languages
Mining parallel document pairs for document-level machine translation (MT)
remains challenging due to the limitations of existing Cross-Lingual Document
Alignment (CLDA) techniques. Existing methods often rely on metadata such as
URLs, which are scarce, or on pooled document representations that fail to
capture fine-grained alignment cues. Moreover, the limited context window of
sentence embedding models hinders their ability to represent document-level
context, while sentence-based alignment introduces a combinatorially large
search space, leading to high computational cost. To address these challenges
for Indic languages, we introduce Pralekha, a benchmark containing over 3
million aligned document pairs across 11 Indic languages and English, which
includes 1.5 million English-Indic pairs. Furthermore, we propose Document
Alignment Coefficient (DAC), a novel metric for fine-grained document
alignment. Unlike pooling-based methods, DAC aligns documents by matching
smaller chunks and computes similarity as the ratio of aligned chunks to the
average number of chunks in a pair. Intrinsic evaluation shows that our
chunk-based method is 2-3x faster while maintaining competitive performance,
and that DAC achieves substantial gains over pooling-based baselines. Extrinsic
evaluation further demonstrates that document-level MT models trained on
DAC-aligned pairs consistently outperform those using baseline alignment
methods. These results highlight DAC's effectiveness for parallel document
mining. The dataset and evaluation framework are publicly available to support
further research.
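The DAC definition in the abstract above (aligned chunks divided by the average number of chunks in the pair) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the chunk-similarity matrix is taken as given, and the greedy one-to-one matching, the `threshold` value, and the function names are hypothetical.

```python
from typing import List


def count_aligned_chunks(sim: List[List[float]], threshold: float = 0.8) -> int:
    """Greedy one-to-one chunk matching, best pairs first (an assumed matcher)."""
    candidates = sorted(
        (
            (score, i, j)
            for i, row in enumerate(sim)
            for j, score in enumerate(row)
            if score >= threshold
        ),
        reverse=True,
    )
    used_src, used_tgt = set(), set()
    aligned = 0
    for _, i, j in candidates:
        if i not in used_src and j not in used_tgt:
            used_src.add(i)
            used_tgt.add(j)
            aligned += 1
    return aligned


def dac_score(sim: List[List[float]], threshold: float = 0.8) -> float:
    """Aligned chunks divided by the average chunk count of the two documents."""
    n_src = len(sim)
    n_tgt = len(sim[0]) if sim else 0
    avg_chunks = (n_src + n_tgt) / 2
    if avg_chunks == 0:
        return 0.0
    return count_aligned_chunks(sim, threshold) / avg_chunks


# Toy pair: 2 source chunks, 3 target chunks, 2 of which align.
toy_sim = [[0.9, 0.1, 0.2], [0.2, 0.85, 0.1]]
print(dac_score(toy_sim))
```

Dividing by the average chunk count rather than the smaller one penalizes pairs in which one document is much longer than the other, even when every chunk of the shorter document finds a match.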
♻ ☆ Compositional Phoneme Approximation for L1-Grounded L2 Pronunciation Training AACL 2025
Learners of a second language (L2) often map non-native phonemes to similar
native-language (L1) phonemes, making conventional L2-focused training slow and
effortful. To address this, we propose an L1-grounded pronunciation training
method based on compositional phoneme approximation (CPA), a feature-based
representation technique that approximates L2 sounds with sequences of L1
phonemes. Evaluations with 20 Korean non-native English speakers show that
CPA-based training achieves a 76% in-box formant rate in acoustic analysis,
17.6% relative improvement in phoneme recognition accuracy, and over 80% of
speech being rated as more native-like, with minimal training. Project page:
https://gsanpark.github.io/CPA-Pronunciation.
comment: Accepted to IJCNLP-AACL 2025
♻ ☆ LLMCARE: early detection of cognitive impairment via transformer models enhanced by LLM-generated synthetic data
Ali Zolnour, Hossein Azadmaleki, Yasaman Haghbin, Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sina Rashidi, Masoud Khani, AmirSajjad Taleban, Samin Mahdizadeh Sani, Maryam Dadkhah, James M. Noble, Suzanne Bakken, Yadollah Yaghoobzadeh, Abdol-Hossein Vahabie, Masoud Rouhizadeh, Maryam Zolnoori
Alzheimer's disease and related dementias (ADRD) affect nearly five million
older adults in the United States, yet more than half remain undiagnosed.
Speech-based natural language processing (NLP) offers a scalable approach for
detecting early cognitive decline through subtle linguistic markers that may
precede clinical diagnosis. This study develops and evaluates a speech-based
screening pipeline integrating transformer embeddings with handcrafted
linguistic features, synthetic augmentation using large language models (LLMs),
and benchmarking of unimodal and multimodal classifiers. External validation
assessed generalizability to an MCI-only cohort.
Transcripts were drawn from the ADReSSo 2021 benchmark dataset (n=237, Pitt
Corpus) and the DementiaBank Delaware corpus (n=205, MCI vs. controls). Ten
transformer models were tested under three fine-tuning strategies. A
late-fusion model combined embeddings from the top transformer with 110
linguistic features. Five LLMs (LLaMA8B/70B, MedAlpaca7B, Ministral8B, GPT-4o)
generated label-conditioned synthetic speech for augmentation, and three
multimodal LLMs (GPT-4o, Qwen-Omni, Phi-4) were evaluated in zero-shot and
fine-tuned modes. On ADReSSo, the fusion model achieved F1=83.3 (AUC=89.5),
outperforming transformer-only and linguistic baselines. MedAlpaca7B
augmentation (2x) improved F1 to 85.7, though larger scales reduced gains.
Fine-tuning boosted unimodal LLMs (MedAlpaca7B F1=47.7 to 78.7), while multimodal
models performed lower (Phi-4=71.6; GPT-4o=67.6). On Delaware, the fusion plus
1x MedAlpaca7B model achieved F1=72.8 (AUC=69.6). Integrating transformer and
linguistic features enhances ADRD detection. LLM-based augmentation improves
data efficiency but yields diminishing returns, while current multimodal models
remain limited. Validation on an independent MCI cohort supports the pipeline's
potential for scalable, clinically relevant early screening.
♻ ☆ LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
With the ever-increasing number of news stories available online, classifying
them by topic, regardless of the language they are written in, has become
crucial for enhancing readers' access to relevant content. To address this
challenge, we propose a teacher-student framework based on large language
models (LLMs) for developing multilingual news topic classification models of
reasonable size with no need for manual data annotation. The framework employs
a Generative Pretrained Transformer (GPT) model as the teacher model to develop
a news topic training dataset through automatic annotation of 20,000 news
articles in Slovenian, Croatian, Greek, and Catalan. Articles are classified
into 17 main categories from the Media Topic schema, developed by the
International Press Telecommunications Council (IPTC). The teacher model
exhibits high zero-shot performance in all four languages. Its agreement with
human annotators is comparable to that between the human annotators themselves.
To mitigate the computational limitations associated with the requirement of
processing millions of texts daily, smaller BERT-like student models are
fine-tuned on the GPT-annotated dataset. These student models achieve high
performance comparable to the teacher model. Furthermore, we explore the impact
of the training data size on the performance of the student models and
investigate their monolingual, multilingual, and zero-shot cross-lingual
capabilities. The findings indicate that student models can achieve high
performance with a relatively small number of training instances, and
demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the
best-performing news topic classifier, enabling multilingual classification
with the top-level categories of the IPTC Media Topic schema.
comment: This work has been accepted and published in the IEEE Access journal.
This arXiv version is retained for archival purposes. Readers should use and
cite the IEEE Access Version available at
https://ieeexplore.ieee.org/document/10900365
♻ ☆ DrKGC: Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion across General and Biomedical Domains EMNLP 2025
Knowledge graph completion (KGC) aims to predict missing triples in knowledge
graphs (KGs) by leveraging existing triples and textual information. Recently,
generative large language models (LLMs) have been increasingly employed for
graph tasks. However, current approaches typically encode graph context in
textual form, which fails to fully exploit the potential of LLMs for perceiving
and reasoning about graph structures. To address this limitation, we propose
DrKGC (Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph
Completion). DrKGC employs a flexible lightweight model training strategy to
learn structural embeddings and logical rules within the KG. It then leverages
a novel bottom-up graph retrieval method to extract a subgraph for each query
guided by the learned rules. Finally, a graph convolutional network (GCN)
adapter uses the retrieved subgraph to enhance the structural embeddings, which
are then integrated into the prompt for effective LLM fine-tuning. Experimental
results on two general domain benchmark datasets and two biomedical datasets
demonstrate the superior performance of DrKGC. Furthermore, a realistic case
study in the biomedical domain highlights its interpretability and practical
utility.
comment: Accepted at EMNLP 2025 Findings
♻ ☆ Skill Path: Unveiling Language Skills from Circuit Graphs AAAI 2026
Circuit graph discovery has emerged as a fundamental approach to elucidating
the skill mechanisms of language models. Despite the output faithfulness of
circuit graphs, they suffer from atomic ablation, which causes the loss of
causal dependencies between connected components. In addition, their discovery
process, designed to preserve output faithfulness, inadvertently captures
extraneous effects other than an isolated target skill. To alleviate these
challenges, we introduce skill paths, which offer a more refined and compact
representation by isolating individual skills within a linear chain of
components. To enable skill path extraction from circuit graphs, we propose a
three-step framework, consisting of decomposition, pruning, and post-pruning
causal mediation. In particular, we offer a complete linear decomposition of
the transformer model which leads to a disentangled computation graph. After
pruning, we further adopt causal analysis techniques, including counterfactuals
and interventions, to extract the final skill paths from the circuit graph. To
underscore the significance of skill paths, we investigate three generic
language skills (Previous Token Skill, Induction Skill, and In-Context Learning
Skill) using our framework. Experiments support two crucial properties of these
skills, namely stratification and inclusiveness.
comment: accepted by AAAI 2026 (oral)
♻ ☆ Verifiable Fine-Tuning for LLMs: Zero-Knowledge Training Proofs Bound to Data Provenance and Policy
Large language models are often adapted through parameter efficient fine
tuning, but current release practices provide weak assurances about what data
were used and how updates were computed. We present Verifiable Fine Tuning, a
protocol and system that produces succinct zero knowledge proofs that a
released model was obtained from a public initialization under a declared
training program and an auditable dataset commitment. The approach combines
five elements. First, commitments that bind data sources, preprocessing,
licenses, and per epoch quota counters to a manifest. Second, a verifiable
sampler that supports public replayable and private index hiding batch
selection. Third, update circuits restricted to parameter efficient fine tuning
that enforce AdamW style optimizer semantics and proof friendly approximations
with explicit error budgets. Fourth, recursive aggregation that folds per step
proofs into per epoch and end to end certificates with millisecond
verification. Fifth, provenance binding and optional trusted execution property
cards that attest code identity and constants. On English and bilingual
instruction mixtures, the method maintains utility within tight budgets while
achieving practical proof performance. Policy quotas are enforced with zero
violations, and private sampling windows show no measurable index leakage.
Federated experiments demonstrate that the system composes with probabilistic
audits and bandwidth constraints. These results indicate that end to end
verifiable fine tuning is feasible today for real parameter efficient
pipelines, closing a critical trust gap for regulated and decentralized
deployments.
comment: 20 pages, 10 figures
♻ ☆ SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement
Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, JingBo Zhu
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to
natural human-computer interaction, enabling end-to-end spoken dialogue
systems. However, evaluating these models remains a fundamental challenge. We
propose SageLM, an end-to-end, multi-aspect, and explainable speech
LLM for comprehensive S2S LLM evaluation. First, unlike cascaded approaches
that disregard acoustic features, SageLM jointly assesses both semantic and
acoustic dimensions. Second, it leverages rationale-based supervision to
enhance explainability and guide model learning, achieving superior alignment
with evaluation outcomes compared to rule-based reinforcement learning methods.
Third, we introduce SpeechFeedback, a synthetic preference dataset,
and employ a two-stage training paradigm to mitigate the scarcity of speech
preference data. Trained on both semantic and acoustic dimensions, SageLM
achieves an 82.79% agreement rate with human evaluators, outperforming
cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
♻ ☆ BLADE: Benchmarking Language Model Agents for Data-Driven Science EMNLP 2024
Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, Mike A. Merrill, Jeffrey Heer, Tim Althoff
Data-driven scientific discovery requires the iterative integration of
scientific domain knowledge, statistical expertise, and an understanding of
data semantics to make nuanced analytical decisions, e.g., about which
variables, transformations, and statistical models to consider. LM-based agents
equipped with planning, memory, and code execution capabilities have the
potential to support data-driven science. However, evaluating agents on such
open-ended tasks is challenging due to multiple valid approaches, partially
correct steps, and different ways to express the same decisions. To address
these challenges, we present BLADE, a benchmark to automatically evaluate
agents' multifaceted approaches to open-ended research questions. BLADE
consists of 12 datasets and research questions drawn from existing scientific
literature, with ground truth collected from independent analyses by expert
data scientists and researchers. To automatically evaluate agent responses, we
developed corresponding computational methods to match different
representations of analyses to this ground truth. Though language models
possess considerable world knowledge, our evaluation shows that they are often
limited to basic analyses. However, agents capable of interacting with the
underlying data demonstrate improved, but still non-optimal, diversity in their
analytical decision making. Our work enables the evaluation of agents for
data-driven science and provides researchers deeper insights into agents'
analysis approaches.
comment: EMNLP 2024
♻ ☆ Continual Pre-training of MoEs: How robust is your router?
Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, Irina Rish
Sparsely-activated Mixture of Experts (MoE) transformers are promising
architectures for foundation models. Compared to dense transformers that
require the same amount of floating-point operations (FLOPs) per forward pass,
MoEs benefit from improved sample efficiency at training time and achieve much
stronger performance. Many closed-source and open-source frontier language
models have thus adopted an MoE architecture. Naturally, practitioners will
want to extend the capabilities of these models with large amounts of newly
collected data without completely re-training them. Prior work has shown that a
simple combination of replay, learning rate re-warming, and re-decaying can
enable the continual pre-training (CPT) of dense decoder-only transformers with
minimal performance degradation compared to full re-training. In the case of
decoder-only MoE transformers, however, it is unclear how the routing algorithm
will impact continual pre-training performance: 1) do the MoE transformer's
routers exacerbate forgetting relative to a dense model?; 2) do the routers
maintain a balanced load on previous distributions after CPT?; 3) are the same
strategies applied to dense models sufficient to continually pre-train MoE
LLMs? In what follows, we conduct a large-scale study training a 500M parameter
dense transformer and four 500M-active/2B-total parameter MoE transformers.
Each model is trained for 600B tokens. Our results establish a surprising
robustness to distribution shifts for MoEs using both Sinkhorn-Balanced and
Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually
pre-trained without replay. Moreover, we show that MoE LLMs maintain their
sample efficiency (relative to a FLOP-matched dense model) during CPT and that
they can match the performance of a fully re-trained MoE at a fraction of the
cost.
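The three ingredients the abstract above attributes to prior work (replay, learning rate re-warming, and re-decaying) can be sketched as below. The abstract does not give their exact form, so the linear warm-up, the cosine decay, the replay fraction, and all names and values here are assumptions for illustration rather than the study's configuration.

```python
import math
import random


def rewarm_redecay_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Learning rate for one continual pre-training phase.

    Re-warm linearly from min_lr to max_lr, then re-decay back to min_lr
    with a cosine curve (both shapes are assumptions).
    """
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def sample_source(replay_fraction, rng):
    """Pick the data source for the next batch: old data (replay) or new data."""
    return "replay" if rng.random() < replay_fraction else "new"


# Hypothetical settings: 1,000 steps, 100 warm-up steps, 25% replay.
rng = random.Random(0)
for step in (0, 50, 100, 550, 1000):
    lr = rewarm_redecay_lr(step, 1000, 100, max_lr=3e-4, min_lr=3e-5)
    print(step, lr, sample_source(0.25, rng))
```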
♻ ☆ RareAgents: Autonomous Multi-disciplinary Team for Rare Disease Diagnosis and Treatment AAAI2026
Rare diseases, despite their low individual incidence, collectively impact
around 300 million people worldwide due to the vast number of diseases. The
involvement of multiple organs and systems, and the shortage of specialized
doctors with relevant experience, make diagnosing and treating rare diseases
more challenging than common diseases. Recently, agents powered by large
language models (LLMs) have demonstrated notable applications across various
domains. In the medical field, some agent methods have outperformed direct
prompts in question-answering tasks from medical examinations. However, current
agent frameworks are not well-adapted to real-world clinical scenarios,
especially those involving the complex demands of rare diseases. To bridge this
gap, we introduce RareAgents, the first LLM-driven multi-disciplinary team
decision-support tool designed specifically for the complex clinical context of
rare diseases. RareAgents integrates advanced Multidisciplinary Team (MDT)
coordination, memory mechanisms, and medical tool utilization, leveraging
Llama-3.1-8B/70B as the base model. Experimental results show that RareAgents
outperforms state-of-the-art domain-specific models, GPT-4o, and current agent
frameworks in diagnosis and treatment for rare diseases. Furthermore, we
contribute a novel rare disease dataset, MIMIC-IV-Ext-Rare, to facilitate
further research in this field.
comment: AAAI2026 Oral
♻ ☆ Likelihood-based Mitigation of Evaluation Bias in Large Language Models
Large Language Models (LLMs) are widely used to evaluate natural language
generation tasks as automated metrics. However, the likelihood, a measure of
an LLM's plausibility for a sentence, can vary due to superficial differences in
sentences, such as word order and sentence structure. LLMs used for evaluation
may therefore exhibit a likelihood bias: they
might overrate sentences with higher likelihoods while underrating those with
lower likelihoods. In this paper, we investigate the presence and impact of
likelihood bias in LLM-based evaluators. We also propose a method to mitigate
the likelihood bias. Our method utilizes highly biased instances as few-shot
examples for in-context learning. Our experiments in evaluating the
data-to-text and grammatical error correction tasks reveal that several LLMs we
test display a likelihood bias. Furthermore, our proposed method successfully
mitigates this bias, also improving evaluation performance (in terms of
correlation of models with human scores) significantly.
comment: 5 main pages
♻ ☆ SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents
Existing benchmarks for visual document retrieval (VDR) largely overlook
non-English languages and the structural complexity of official publications.
To address this gap, we introduce SDS KoPub VDR, the first large-scale, public
benchmark for retrieving and understanding Korean public documents. The
benchmark is built upon 361 real-world documents, including 256 files under the
KOGL Type 1 license and 105 from official legal portals, capturing complex
visual elements like tables, charts, and multi-column layouts. To establish a
reliable evaluation set, we constructed 600 query-page-answer triples. These
were initially generated using multimodal models (e.g., GPT-4o) and
subsequently underwent human verification to ensure factual accuracy and
contextual relevance. The queries span six major public domains and are
categorized by the reasoning modality required: text-based, visual-based, and
cross-modal. We evaluate SDS KoPub VDR on two complementary tasks: (1)
text-only retrieval and (2) multimodal retrieval, which leverages visual
features alongside text. This dual-task evaluation reveals substantial
performance gaps, particularly in multimodal scenarios requiring cross-modal
reasoning, even for state-of-the-art models. As a foundational resource, SDS
KoPub VDR enables rigorous and fine-grained evaluation and provides a roadmap
for advancing multimodal AI in real-world document intelligence. The dataset is
available at
https://huggingface.co/datasets/SamsungSDS-Research/SDS-KoPub-VDR-Benchmark.
comment: 27 pages, 15 figures, 6 tables
♻ ☆ SEAGraph: Unveiling the Whole Story of Paper Review Comments
Jianxiang Yu, Jiaqi Tan, Zichen Ding, Jiapeng Zhu, Jiahao Li, Yao Cheng, Qier Cui, Yunshi Lan, Yao Liu, Xiang Li
Peer review, as a cornerstone of scientific research, ensures the integrity
and quality of scholarly work by providing authors with objective feedback for
refinement. However, in the traditional peer review process, authors often
receive vague or insufficiently detailed feedback, which provides limited
assistance and leads to a more time-consuming review cycle. If authors can
identify some specific weaknesses in their paper, they can not only address the
reviewer's concerns but also improve their work. This raises the critical
question of how to enhance authors' comprehension of review comments. In this
paper, we present SEAGraph, a novel framework developed to clarify review
comments by uncovering the underlying intentions behind them. We construct two
types of graphs for each paper: the semantic mind graph, which captures the
authors' thought process, and the hierarchical background graph, which
delineates the research domains related to the paper. A retrieval method is
then designed to extract relevant content from both graphs, facilitating
coherent explanations for the review comments. Extensive experiments show that
SEAGraph excels in review comment understanding tasks, offering significant
benefits to authors. By bridging the gap between reviewers' critiques and
authors' comprehension, SEAGraph contributes to a more efficient, transparent
and collaborative scientific publishing ecosystem.
♻ ☆ Atomic Consistency Preference Optimization for Long-Form Question Answering
Large Language Models (LLMs) often produce factoid hallucinations: plausible
yet incorrect answers. A common mitigation strategy is model alignment, which
improves factual accuracy by training on curated (factual, non-factual) pairs.
However, this approach often relies on a stronger model (e.g., GPT-4) or an
external knowledge base to assess factual correctness that may not always be
accessible. Addressing this, we propose Atomic Consistency Preference
Optimization (ACPO), a self-supervised preference-tuning method that enhances
factual accuracy without external supervision. ACPO leverages atomic
consistency signals (i.e., the agreement of individual facts across multiple
stochastic responses) to identify high- and low-quality data pairs for model
alignment. Despite being fully self-supervised, ACPO outperforms the strong
supervised alignment baseline by 1.95 points averaged across Phi-3 and Llama3
on the LongFact and BioGen datasets, demonstrating its effectiveness in
improving factual reliability without relying on external models or knowledge
bases.
comment: 13 pages, 1 figure
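The atomic consistency signal described in the ACPO abstract above (agreement of individual facts across multiple stochastic responses) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: atomic facts are assumed to be already extracted into sets of normalized strings, agreement is plain string equality, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def consistency_scores(responses: List[Set[str]]) -> List[float]:
    """Score each response by how often its facts recur in the other responses."""
    n = len(responses)
    support = Counter(fact for facts in responses for fact in facts)
    scores = []
    for facts in responses:
        if not facts or n < 2:
            scores.append(0.0)
            continue
        # For each fact: fraction of the *other* responses that also state it.
        per_fact = [(support[fact] - 1) / (n - 1) for fact in facts]
        scores.append(sum(per_fact) / len(per_fact))
    return scores


def preference_pair(responses: List[Set[str]]) -> Tuple[int, int]:
    """Return (chosen, rejected) indices: most and least self-consistent response."""
    scores = consistency_scores(responses)
    chosen = max(range(len(scores)), key=scores.__getitem__)
    rejected = min(range(len(scores)), key=scores.__getitem__)
    return chosen, rejected


# Three sampled answers to the same question, as sets of atomic facts.
samples = [
    {"born in 1912", "worked at Bletchley Park"},
    {"born in 1912", "worked at Bletchley Park"},
    {"born in 1912", "won a Nobel Prize"},
]
print(consistency_scores(samples), preference_pair(samples))
```

The resulting (chosen, rejected) pair could then feed a standard preference-tuning objective; in practice, fact matching would need something more tolerant than exact string equality.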
♻ ☆ multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder
The early detection of mental health disorders from social media text is
critical for enabling timely support, risk assessment, and referral to
appropriate resources. This work introduces multiMentalRoBERTa, a fine-tuned
RoBERTa model designed for multiclass classification of common mental health
conditions, including stress, anxiety, depression, post-traumatic stress
disorder (PTSD), suicidal ideation, and neutral discourse. Drawing on multiple
curated datasets, data exploration is conducted to analyze class overlaps,
revealing strong correlations between depression and suicidal ideation as well
as anxiety and PTSD, while stress emerges as a broad, overlapping category.
Comparative experiments with traditional machine learning methods,
domain-specific transformers, and prompting-based large language models
demonstrate that multiMentalRoBERTa achieves superior performance, with macro
F1-scores of 0.839 in the six-class setup and 0.870 in the five-class setup
(excluding stress), outperforming both fine-tuned MentalBERT and baseline
classifiers. Beyond predictive accuracy, explainability methods, including
Layer Integrated Gradients and KeyBERT, are applied to identify lexical cues
that drive classification, with a particular focus on distinguishing depression
from suicidal ideation. The findings emphasize the effectiveness of fine-tuned
transformers for reliable and interpretable detection in sensitive contexts,
while also underscoring the importance of fairness, bias mitigation, and
human-in-the-loop safety protocols. Overall, multiMentalRoBERTa is presented as
a lightweight, robust, and deployable solution for enhancing support in mental
health platforms.
comment: Accepted in IEEE Big Data, 8-11 December, 2025 @ Macau SAR, China
♻ ☆ Rethinking Text-based Protein Understanding: Retrieval or LLM? EMNLP 2025
In recent years, protein-text models have gained significant attention for
their potential in protein generation and understanding. Current approaches
focus on integrating protein-related knowledge into large language models
through continued pretraining and multi-modal alignment, enabling simultaneous
comprehension of textual descriptions and protein sequences. Through a thorough
analysis of existing model architectures and text-based protein understanding
benchmarks, we identify significant data leakage issues present in current
benchmarks. Moreover, conventional metrics derived from natural language
processing fail to accurately assess the model's performance in this domain. To
address these limitations, we reorganize existing datasets and introduce a
novel evaluation framework based on biological entities. Motivated by our
observation, we propose a retrieval-enhanced method, which significantly
outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy
and efficiency in training-free scenarios. Our code and data are available at
https://github.com/IDEA-XL/RAPM.
comment: Accepted by Empirical Methods in Natural Language Processing 2025
(EMNLP 2025) Main Conference
♻ ☆ Reasoning Planning for Language Models
Selecting an appropriate reasoning method for a given query remains a key
challenge in language model generation. Existing approaches typically generate
multiple candidate responses and use an aggregation strategy to select the
output answer, often assuming that more candidate answers yield higher
accuracy. We revisit this assumption through a rigorous theoretical analysis,
deriving accuracy bounds for standard aggregation methods under fixed
generation distributions and candidate sizes. Building on these insights, we
introduce EPIC, an Ensemble Planning with Contrastive learning framework to
learn a shared representation space that captures both model reasoning
abilities and query-method compatibility. EPIC incorporates our probability
bounds as a regularizer in a utility-driven optimization that balances accuracy
and computational cost. Experiments on diverse mathematical reasoning tasks
show that EPIC consistently selects optimal reasoning methods, improving
accuracy while reducing computational overhead. Our code can be found at
https://github.com/nguyenngocbaocmt02/EPIC.
comment: 27 pages, 5 figures
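A small illustration of why the assumption that more candidates yield higher accuracy can fail: the sketch below computes the accuracy of majority voting over n candidate answers. It is not the bound derived in the paper. The binary answer space, the independence of candidates, the per-candidate correctness probability `p`, and the function name `majority_vote_accuracy` are all simplifying assumptions made here for illustration.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent candidates is correct.

    Assumes a binary answer space, each candidate correct with probability p,
    and odd n so that ties cannot occur (all illustrative assumptions).
    """
    if n % 2 == 0:
        raise ValueError("use odd n to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    for p in (0.6, 0.4):
        print(p, [round(majority_vote_accuracy(p, n), 3) for n in (1, 3, 9)])
```

Under these assumptions, accuracy rises with n when p is above 0.5 and falls with n when p is below 0.5, so sampling more candidates only helps for queries the chosen reasoning method already tends to get right.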
♻ ☆ The Markovian Thinker
Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy
Reinforcement learning (RL) has recently become a strong recipe for training
reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard
RL "thinking environment", where the state is the prompt plus all prior
reasoning tokens, makes the state unbounded and forces attention-based policies
to pay quadratic compute as thoughts lengthen. We revisit the environment
itself. We propose Markovian Thinking, a paradigm in which the policy advances
reasoning while conditioning on a constant-size state, decoupling thinking
length from context size. As an immediate consequence this yields linear
compute with constant memory. We instantiate this idea with Delethink, an RL
environment that structures reasoning into fixed-size chunks. Within each
chunk, the model thinks as usual; at the boundary, the environment resets the
context and reinitializes the prompt with a short carryover. Through RL, the
policy learns to write a textual state near the end of each chunk sufficient
for seamless continuation of reasoning after reset. Trained in this
environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up
to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget.
With test-time scaling, Delethink continues to improve where LongCoT plateaus.
The effect of linear compute is substantial: we empirically estimate that, at
96K average thinking length, LongCoT-RL costs 27 H100-months vs. 7 for Delethink.
Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B)
often sample Markovian traces zero-shot across diverse benchmarks, providing
positive samples that make RL effective at scale. Our results show that
redesigning the thinking environment is a powerful lever: it enables very long
reasoning without quadratic overhead and opens a path toward efficient,
scalable reasoning LLMs.
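The quadratic-versus-linear claim can be made concrete with a rough count of attended token pairs. The sketch below is a back-of-the-envelope model, not the paper's cost estimate: it ignores the query prompt, treats every chunk (including the first) as starting from a carryover of `carry` tokens, and counts one unit of work per attended token. The function names and the `carry` value in the usage lines are hypothetical.

```python
def longcot_attention_cost(n_tokens: int) -> int:
    """Attended token pairs when each new token attends to all tokens so far."""
    return n_tokens * (n_tokens + 1) // 2


def chunked_attention_cost(n_tokens: int, chunk: int, carry: int) -> int:
    """Attended token pairs when the context is reset every `chunk` tokens.

    Each chunk starts from `carry` carried-over tokens and generates
    `chunk - carry` new ones, so no token attends to more than `chunk` tokens.
    """
    if not 0 <= carry < chunk:
        raise ValueError("need 0 <= carry < chunk")
    new_per_chunk = chunk - carry
    cost, generated = 0, 0
    while generated < n_tokens:
        step = min(new_per_chunk, n_tokens - generated)
        # Context length grows from carry + 1 to carry + step inside the chunk.
        cost += sum(range(carry + 1, carry + step + 1))
        generated += step
    return cost


if __name__ == "__main__":
    n = 96_000
    full = longcot_attention_cost(n)
    chunked = chunked_attention_cost(n, chunk=8_000, carry=4_000)
    print(f"full context: {full:,}  chunked: {chunked:,}  ratio: {full / chunked:.1f}")
```

In this model, doubling the thinking length doubles the chunked cost but roughly quadruples the full-context cost, which is the gap the abstract attributes to the bounded state.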
♻ ☆ Mufu: Multilingual Fused Learning for Low-Resource Translation with LLM
Multilingual large language models (LLMs) are great translators, but this is
largely limited to high-resource languages. For many LLMs, translating in and
out of low-resource languages remains a challenging task. To maximize data
efficiency in this low-resource setting, we introduce Mufu, which includes a
selection of automatically generated multilingual candidates and an instruction
to correct inaccurate translations in the prompt. Mufu prompts turn a
translation task into a postediting one, and seek to harness the LLM's
reasoning capability with auxiliary translation candidates, from which the
model is required to assess the input quality, align the semantics
cross-lingually, copy from relevant inputs and override instances that are
incorrect. Our experiments on En-XX translations over the Flores-200 dataset
show LLMs finetuned against Mufu-style prompts are robust to poor quality
auxiliary translation candidates, achieving performance superior to the NLLB
1.3B distilled model in 64% of low- and very-low-resource language pairs. We
then distill these models to reduce inference cost, while maintaining on average
a 3.1 chrF improvement over the finetune-only baseline in low-resource translations.
comment: 29 pages
♻ ☆ R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs AACL 2025
Recent studies have combined Large Language Models (LLMs) with Knowledge
Graphs (KGs) to enhance reasoning, improving inference accuracy without
additional training while mitigating hallucination. However, existing
frameworks still suffer from two practical drawbacks: they must be re-tuned whenever
the KG or reasoning task changes, and they depend on a single, high-capacity
LLM for reliable (i.e., trustworthy) reasoning. To address this, we introduce
R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two
roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor
(a high-capacity LLM) that makes final judgments. This design is cost-efficient
for LLM inference while still maintaining strong reasoning accuracy.
Additionally, R2-KG employs an Abstention mechanism, generating answers only
when sufficient evidence is collected from the KG, which significantly enhances
reliability. Experiments across five diverse benchmarks show that R2-KG
consistently outperforms baselines in both accuracy and reliability, regardless
of the inherent capability of LLMs used as the Operator. Further experiments
reveal that the single-agent version of R2-KG, equipped with a strict
self-consistency strategy, achieves significantly higher-than-baseline
reliability with reduced inference cost but increased abstention rate in
complex KGs. Our findings establish R2-KG as a flexible and cost-effective
solution for KG-based reasoning, reducing reliance on high-capacity LLMs while
ensuring trustworthy inference. The code is available at
https://github.com/ekrxjwh2009/R2-KG/.
comment: Accepted to IJCNLP-AACL 2025 Findings
♻ ☆ EMNLP: Educator-role Moral and Normative Large Language Models Profiling EMNLP
Simulating Professions (SP) enables Large Language Models (LLMs) to emulate
professional roles. However, comprehensive psychological and ethical evaluation
in these contexts remains lacking. This paper introduces EMNLP, an
Educator-role Moral and Normative LLMs Profiling framework for personality
profiling, moral development stage measurement, and ethical risk under soft
prompt injection. EMNLP extends existing scales and constructs 88
teacher-specific moral dilemmas, enabling profession-oriented comparison with
human teachers. A targeted soft prompt injection set evaluates compliance and
vulnerability in teacher SP. Experiments on 14 LLMs show teacher-role LLMs
exhibit more idealized and polarized personalities than human teachers, excel
in abstract moral reasoning, but struggle with emotionally complex situations.
Models with stronger reasoning are more vulnerable to harmful prompt injection,
revealing a paradox between capability and safety. The model temperature and
other hyperparameters have limited influence except in some risk behaviors.
This paper presents the first benchmark to assess ethical and psychological
alignment of teacher-role LLMs for educational AI. Resources are available at
https://e-m-n-l-p.github.io/.
comment: 29 pages, 15 figures, Accepted by EMNLP Main Conference
♻ ☆ Understanding Forgetting in LLM Supervised Fine-Tuning and Preference Learning - A Convex Optimization Perspective
Heshan Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen
The post-training of LLMs, which typically consists of the supervised
fine-tuning (SFT) stage and the preference learning stage (RLHF or DPO), is
crucial to effective and safe LLM applications. The widely adopted approach in
post-training popular open-source LLMs is to sequentially perform SFT and
RLHF/DPO. However, this is suboptimal in terms of SFT and RLHF/DPO trade-off:
the LLM gradually forgets about the first stage's training when undergoing the
second stage's training. This sequential paradigm persists largely due to its
simplicity and modularity, which make it easier to implement and manage at
scale despite its limitations. We theoretically prove the sub-optimality of
sequential post-training and propose a practical joint post-training framework
which has theoretical convergence guarantees and empirically outperforms
the sequential post-training framework, with up to 23% overall performance
improvement across multiple LLM evaluation benchmarks, while having minimal
computational overhead. Our code is available at
https://github.com/heshandevaka/XRIGHT.
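The forgetting effect described above can be reproduced in a one-dimensional toy problem. The sketch below is not the paper's framework or algorithm: it replaces the SFT and preference objectives with two hypothetical quadratic losses whose optima `A` and `B` differ, and compares running gradient descent on them one after the other against descending a weighted sum with a hypothetical mixing weight `lam`.

```python
def grad_descent(grad, x0: float, lr: float = 0.1, steps: int = 500) -> float:
    """Plain gradient descent on a scalar parameter."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


A, B = 0.0, 4.0  # hypothetical optima of the "SFT" and "preference" losses


def f_sft(x: float) -> float:
    return (x - A) ** 2


def f_pref(x: float) -> float:
    return (x - B) ** 2


def g_sft(x: float) -> float:
    return 2 * (x - A)


def g_pref(x: float) -> float:
    return 2 * (x - B)


# Sequential: optimize the first loss, then the second from where that ended.
x_seq = grad_descent(g_pref, grad_descent(g_sft, x0=10.0))

# Joint: optimize a weighted sum of both losses at once.
lam = 0.5
x_joint = grad_descent(lambda x: lam * g_sft(x) + (1 - lam) * g_pref(x), x0=10.0)

if __name__ == "__main__":
    print("sequential:", x_seq, f_sft(x_seq), f_pref(x_seq))
    print("joint:     ", x_joint, f_sft(x_joint), f_pref(x_joint))
```

In this toy, the sequential run ends at the optimum of the second loss and retains nothing from the first stage, while the joint run settles at a compromise point with a lower combined loss.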
♻ ☆ Retrieval-Augmented Feature Generation for Domain-Specific Classification
Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dakshak Keerthi Chandra, Yu-Zhong Chen, Fei Xie, Kunpeng Liu
Feature generation can significantly enhance learning outcomes, particularly
for tasks with limited data. An effective way to improve feature generation is
to expand the current feature space by using existing features and enriching
their informational content. However, generating new, interpretable features usually
requires domain-specific knowledge on top of the existing features. In this
paper, we introduce a Retrieval-Augmented Feature Generation method, RAFG, to
generate useful and explainable features specific to domain classification
tasks. To increase the interpretability of the generated features, we conduct
knowledge retrieval among the existing features in the domain to identify
potential feature associations. These associations are expected to help
generate useful features. Moreover, we develop a framework based on large
language models (LLMs) for feature generation with reasoning to verify the
quality of the features during their generation process. Experiments across
several datasets in medical, economic, and geographic domains show that our
RAFG method can produce high-quality, meaningful features and significantly
improve classification performance compared with baseline methods.
comment: Accepted by ICDM 2025
♻ ☆ EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation AACL 2025
Radiology report generation requires advanced medical image analysis,
effective temporal reasoning, and accurate text generation. Although recent
innovations, particularly multimodal large language models, have shown improved
performance, their supervised fine-tuning (SFT) objective is not explicitly
aligned with clinical efficacy. In this work, we introduce EditGRPO, a
mixed-policy reinforcement learning algorithm designed specifically to optimize
the generation through clinically motivated rewards. EditGRPO integrates
on-policy exploration with off-policy guidance by injecting sentence-level
detailed corrections during training rollouts. This mixed-policy approach
addresses the exploration dilemma and sampling efficiency issues typically
encountered in RL. Applied to Qwen2.5-VL-3B, EditGRPO outperforms both SFT
and vanilla GRPO baselines, achieving an average improvement of 3.4% in
clinical metrics across four major datasets. Notably, EditGRPO also
demonstrates superior out-of-domain generalization, with an average performance
gain of 5.9% on unseen datasets.
comment: AACL 2025
♻ ☆ Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs
Recent studies have revealed that LLMs can exhibit behavioral self-awareness:
the ability to accurately describe or predict their own learned behaviors
without explicit supervision. This capability raises safety concerns as it may,
for example, allow models to better conceal their true abilities during
evaluation. We attempt to characterize the minimal conditions under which such
self-awareness emerges, and the mechanistic processes through which it
manifests. Through controlled finetuning experiments on instruction-tuned LLMs
with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably
induced using a single rank-1 LoRA adapter; (2) that the learned self-aware
behavior can be largely captured by a single steering vector in activation
space, recovering nearly all of the fine-tune's behavioral effect; and (3) that
self-awareness is non-universal and domain-localized, with independent
representations across tasks. Together, these findings suggest that behavioral
self-awareness emerges as a domain-specific, linear feature that can be easily
induced and modulated.
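One way to see why a rank-1 adapter and a single steering vector are closely related is a general linear-algebra observation, sketched below; it is not the paper's analysis. A rank-1 update b a^T can only move a layer's output along the fixed direction b, with a magnitude that depends on the input through the scalar a^T x. The helper names and the example numbers are hypothetical.

```python
def matvec(W, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank1_lora_forward(W, b, a, x, scale=1.0):
    """Compute (W + scale * b a^T) x without materializing the update.

    Returns the output and the scalar coefficient multiplying direction b.
    """
    coeff = scale * sum(ai * xi for ai, xi in zip(a, x))  # scalar a^T x
    base = matvec(W, x)
    return [yi + coeff * bi for yi, bi in zip(base, b)], coeff


W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # toy base weight (identity)
b = [1.0, 0.0, -1.0]                   # output direction of the adapter
a = [0.5, 0.5, 0.0]                    # input read-out of the adapter
x = [2, 4, 6]

out, coeff = rank1_lora_forward(W, b, a, x)
delta = [o - y for o, y in zip(out, matvec(W, x))]

if __name__ == "__main__":
    print("output:", out)
    print("change relative to base:", delta, "= coeff * b with coeff =", coeff)
```

A steering vector corresponds to adding a fixed multiple of one direction regardless of the input, so it can match such an adapter whenever the coefficient varies little across the inputs of interest.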
♻ ☆ KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
We describe KVLink, an approach for efficient key-value (KV) cache reuse in
large language models (LLMs). In many LLM applications, different inputs can
share overlapping context, such as the same retrieved document appearing in
multiple queries. However, the LLMs still need to encode the entire context for
each query, leading to redundant computation. In this paper, we investigate a
new strategy to eliminate such inefficiency, where the KV cache of each
document is precomputed independently. During inference, the KV caches of
retrieved documents are concatenated, allowing the model to reuse cached
representations instead of recomputing them. To mitigate the performance
degradation when using KV caches computed independently for each document,
KVLink introduces two key techniques: adjusting positional embeddings of the KV
cache at inference to match the global position after concatenation, and using
trainable special tokens to restore self-attention across independently encoded
documents. Experiments across 7 datasets demonstrate that KVLink improves
question answering accuracy by an average of 4% over state-of-the-art methods.
Furthermore, by leveraging precomputed KV caches, our approach reduces
time-to-first-token by up to 96% compared to standard LLM inference, making it
a scalable and efficient solution for context reuse. Additionally, KVLink can
be combined with KV cache compression to further save cache loading and storage
overhead while outperforming the baselines.
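The position-adjustment step can be illustrated under the assumption that the model uses rotary position embeddings; the abstract does not say which positional scheme is involved, so this is only one plausible reading. With rotary embeddings, a key cached at a local position can be moved to its global position by one extra rotation by the document's offset. The function names, the frequency `freq`, and the document lengths are hypothetical.

```python
import math


def rotate(pair, position, freq=0.01):
    """Apply a rotary position embedding to one 2-D slice of a key vector."""
    x, y = pair
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def global_offsets(doc_lengths):
    """Start position of each document after concatenation."""
    offsets, total = [], 0
    for n in doc_lengths:
        offsets.append(total)
        total += n
    return offsets


key = (1.0, 2.0)
local_pos = 5
offsets = global_offsets([120, 80, 200])      # start positions: 0, 120, 200
cached = rotate(key, local_pos)               # precomputed with local positions
shifted = rotate(cached, offsets[2])          # re-rotated by the document offset
direct = rotate(key, local_pos + offsets[2])  # encoded directly at global position

if __name__ == "__main__":
    print(shifted, direct)
```

Because rotations compose additively, `shifted` and `direct` agree up to floating-point error, so cached keys need not be recomputed from the raw tokens. The sketch does not cover the trainable special tokens mentioned in the abstract.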
♻ ☆ Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons
The scalability of large language models for long-context reasoning is
severely constrained by the linear growth of their Transformer key-value cache,
which incurs significant memory and computational costs. We posit that as a
model generates reasoning tokens, the informational value of past generated
tokens diminishes, creating an opportunity for compression. In this work, we
propose to periodically compress the generation KV cache with a learned,
special-purpose token and evict compressed entries. We train the model to
perform this compression via a modified joint distillation and reinforcement
learning (RL) framework. Our training method minimizes overhead over the
conventional RL process, as it leverages RL outputs for distillation.
Empirically, our method achieves a superior memory-accuracy Pareto frontier
compared to both the model without cache compression and training-free
compression techniques.
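The memory effect of periodic compression can be sketched with a simple counter. This is a simplified model and not the paper's training or inference procedure: it assumes that after every `window` generated tokens one beacon entry is written to the cache and the `window` raw entries it summarizes are evicted. The function name and the window size are hypothetical.

```python
from typing import Optional


def peak_cache_size(n_tokens: int, window: Optional[int]) -> int:
    """Peak number of KV entries held while generating n_tokens.

    window=None keeps every entry (no compression). Otherwise a beacon entry
    is appended after each `window` raw tokens and those raw entries are evicted.
    """
    beacons = raw = peak = 0
    for _ in range(n_tokens):
        raw += 1
        peak = max(peak, beacons + raw)
        if window is not None and raw == window:
            # The beacon is computed while the raw entries are still cached.
            peak = max(peak, beacons + raw + 1)
            beacons += 1  # keep the beacon's KV entry
            raw = 0       # evict the compressed raw entries
    return peak


if __name__ == "__main__":
    print("no compression:", peak_cache_size(1024, None))
    print("window of 16:  ", peak_cache_size(1024, 16))
```

In this model the cache still grows linearly, but at one entry per window rather than one entry per token, which is where the memory saving comes from.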