Computation and Language 86
☆ Cross-modal Information Flow in Multimodal Large Language Models
The recent advancements in auto-regressive multimodal large language models
(MLLMs) have demonstrated promising progress for vision-language tasks. While
there exists a variety of studies investigating the processing of linguistic
information within large language models, little is currently known about the
inner working mechanism of MLLMs and how linguistic and visual information
interact within these models. In this study, we aim to fill this gap by
examining the information flow between different modalities -- language and
vision -- in MLLMs, focusing on visual question answering. Specifically, given
an image-question pair as input, we investigate where in the model and how the
visual and linguistic information are combined to generate the final
prediction. Conducting experiments with a series of models from the LLaVA
series, we find that there are two distinct stages in the process of
integration of the two modalities. In the lower layers, the model first
transfers the more general visual features of the whole image into the
representations of (linguistic) question tokens. In the middle layers, it once
again transfers visual information about specific objects relevant to the
question to the respective token positions of the question. Finally, in the
higher layers, the resulting multimodal representation is propagated to the
last position of the input sequence for the final prediction. Overall, our
findings provide a new and comprehensive perspective on the spatial and
functional aspects of image and language processing in the MLLMs, thereby
facilitating future research into multimodal information localization and
editing.
☆ Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation
This research presents and compares multiple approaches to automate the
generation of literature reviews using several Natural Language Processing
(NLP) techniques and retrieval-augmented generation (RAG) with a Large Language
Model (LLM). The ever-increasing number of research articles poses a major
challenge for manual literature review and has increased the demand
for automation. Developing a system capable of automatically generating the
literature reviews from only the PDF files as input is the primary objective of
this research work. The effectiveness of several Natural Language Processing
(NLP) strategies, such as the frequency-based method (spaCy), the transformer
model (Simple T5), and retrieval-augmented generation (RAG) with Large Language
Model (GPT-3.5-turbo), is evaluated to meet the primary objective. The SciTLDR
dataset is chosen for this research experiment and three distinct techniques
are utilized to implement three different systems for auto-generating the
literature reviews. The ROUGE scores are used for the evaluation of all three
systems. Based on the evaluation, the Large Language Model GPT-3.5-turbo
achieved the highest ROUGE-1 score, 0.364. The transformer model comes in
second place and spaCy is at the last position. Finally, a graphical user
interface is created for the best system based on the large language model.
comment: Keywords: T5, spaCy, Large Language Model, GPT, ROUGE, Literature
Review, Natural Language Processing, Retrieval-augmented generation
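The ROUGE-1 score used to rank the three systems measures unigram overlap between a generated text and a reference. The following is a minimal sketch, assuming whitespace tokenization and lowercasing (neither is stated in the abstract); the function name `rouge1` and the example strings are hypothetical, and the abstract does not say whether the reported 0.364 is precision, recall, or F1.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 with naive whitespace tokenization (an assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example strings, not taken from SciTLDR.
scores = rouge1(
    "the model summarizes papers",
    "the model summarizes research papers well",
)
print(scores)
```

Production evaluations usually rely on an established ROUGE package with stemming and proper tokenization; this sketch only illustrates the overlap computation.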
☆ On Importance of Code-Mixed Embeddings for Hate Speech Identification
Code-mixing is the practice of using two or more languages in a single
sentence, which often occurs in multilingual communities such as India where
people commonly speak multiple languages. Classic NLP tools, trained on
monolingual data, face challenges when dealing with code-mixed data. Extracting
meaningful information from sentences containing multiple languages becomes
difficult, particularly in tasks like hate speech detection, due to linguistic
variation, cultural nuances, and data sparsity. To address this, we aim to
analyze the significance of code-mixed embeddings and evaluate the performance
of BERT and HingBERT models (trained on a Hindi-English corpus) in hate speech
detection. Our study demonstrates that HingBERT models, benefiting from
training on the extensive Hindi-English dataset L3Cube-HingCorpus, outperform
BERT models when tested on hate speech text datasets. We also found that
code-mixed Hing-FastText performs better than standard English FastText and
vanilla BERT models.
☆ Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning
Large Language Models (LLMs) have demonstrated remarkable multilingual
capabilities, yet challenges persist in adapting these models for low-resource
languages. In this study, we investigate the effects of Low-Rank Adaptation
(LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for
Marathi, a language with limited resources. Using a translated Alpaca dataset
with 52,000 instruction-response pairs, our findings reveal that while
evaluation metrics often show a performance decline post-fine-tuning, manual
assessments frequently suggest that the fine-tuned models outperform their
original counterparts. The observations indicate improvements in target
language generation capabilities but a reduction in reasoning abilities
following language adaptation. These results underscore the need for improved
evaluation methodologies and the creation of high-quality native datasets to
accurately assess language-specific model performance in low-resource settings.
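For readers unfamiliar with LoRA, the idea is to freeze a weight matrix W and learn a low-rank update, so the effective weight becomes W + (alpha / r) * B A. The sketch below illustrates that arithmetic in plain Python; the helper names (`matmul`, `lora_merge`) and the toy matrices are hypothetical and unrelated to the Gemma experiments described above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    W: d_out x d_in frozen weight, B: d_out x r, A: r x d_in.
    """
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy example: a 2x2 frozen weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r = 1, d_in = 2
B = [[0.5], [0.0]]  # d_out = 2, r = 1
merged = lora_merge(W, A, B, alpha=2.0)
print(merged)

# Trainable parameters: r * (d_in + d_out) for LoRA versus d_in * d_out for full tuning.
lora_params = len(A) * (len(A[0]) + len(B))
full_params = len(W) * len(W[0])
```

In practice B is typically initialized to zero so that training starts from the unmodified model.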
☆ A Pipeline of Neural-Symbolic Integration to Enhance Spatial Reasoning in Large Language Models
Large Language Models (LLMs) have demonstrated impressive capabilities across
various tasks. However, LLMs often struggle with spatial reasoning, an
essential part of reasoning and inference that requires understanding complex
relationships between objects in space. This paper proposes a novel
neural-symbolic framework that enhances LLMs' spatial reasoning abilities. We
evaluate our approach on two benchmark datasets: StepGame and SparQA,
implementing three distinct strategies: (1) ASP (Answer Set Programming)-based
symbolic reasoning, (2) LLM + ASP pipeline using DSPy, and (3) Fact + Logical
rules. Our experiments demonstrate significant improvements over the baseline
prompting methods, with accuracy increases of 40-50% on the StepGame dataset and
3-13% on the more complex SparQA dataset. The "LLM + ASP" pipeline achieves
particularly strong results on the tasks of Finding Relations (FR) and Finding
Block (FB) questions, though performance varies across different question
types. The impressive results suggest that while neural-symbolic approaches
offer promising directions for enhancing spatial reasoning in LLMs, their
effectiveness depends heavily on the specific task characteristics and
implementation strategies. We propose an integrated, simple yet effective set
of strategies using a neural-symbolic pipeline to boost spatial reasoning
abilities in LLMs. This pipeline and its strategies demonstrate strong,
broad applicability to other reasoning domains in LLMs, such as temporal
reasoning and deductive inference.
☆ Retrofitting (Large) Language Models with Dynamic Tokenization
Current language models (LMs) use a fixed, static subword tokenizer. This
choice, often taken for granted, typically results in degraded efficiency and
capabilities in languages other than English, and makes it challenging to apply
LMs to new domains or languages. To address these issues, we propose
retrofitting LMs with dynamic tokenization: a way to dynamically decide on
token boundaries based on the input text. For encoder-style models, we
introduce a subword-merging algorithm inspired by byte-pair encoding (BPE), but
at a batch level. We merge frequent subword sequences in a batch, then apply a
pretrained embedding-prediction hypernetwork to compute the token embeddings
on-the-fly. When applied with word-level boundaries, this on average reduces
token sequence lengths by >20% across 14 languages on XNLI with XLM-R while
degrading its task performance by less than 2%. For decoder-style models, we
apply dynamic tokenization in two ways: 1) for prefilling, maintaining
performance of Mistral-7B almost completely with up to 40% sequence reduction -
relative to the word-level; and 2) via an approximate nearest neighbor index,
achieving fast generation with a one million token vocabulary, demonstrating
scalability to even larger, dynamic vocabularies. Overall, our findings show
that dynamic tokenization substantially improves inference speed and promotes
fairness across languages, making a leap towards overcoming the limitations of
static tokenization and enabling more equitable and adaptable LMs.
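The batch-level merging step for encoder-style models can be pictured as BPE applied to the current batch rather than to a training corpus. Below is a minimal sketch under simplifying assumptions: sequences are flat lists of subword strings, word-level boundaries are ignored, and merging stops once no pair occurs at least twice. The function names and the toy batch are hypothetical, and the hypernetwork that predicts embeddings for the merged tokens is not modeled.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def batch_merge(batch, num_merges):
    """Greedily merge the most frequent adjacent subword pair across the batch."""
    batch = [list(seq) for seq in batch]
    for _ in range(num_merges):
        counts = Counter()
        for seq in batch:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # nothing repeats in this batch, so stop merging
            break
        batch = [merge_pair(seq, pair) for seq in batch]
    return batch


# Hypothetical toy batch of subword sequences.
batch = [["un", "believ", "able"], ["believ", "able", "story"], ["un", "able"]]
merged = batch_merge(batch, num_merges=1)
print(merged)
```

Each merge shortens the sequences that contain the pair, which is where the reduction in sequence length comes from.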
☆ Emergence of Self-Identity in AI: A Mathematical Framework and Empirical Study with Generative Large Language Models
This paper introduces a mathematical framework for defining and quantifying
self-identity in artificial intelligence (AI) systems, addressing a critical
gap in the theoretical foundations of artificial consciousness. While existing
approaches to artificial self-awareness often rely on heuristic implementations
or philosophical abstractions, we present a formal framework grounded in metric
space theory, measure theory, and functional analysis. Our framework posits
that self-identity emerges from two mathematically quantifiable conditions: the
existence of a connected continuum of memories $C \subseteq \mathcal{M}$ in a
metric space $(\mathcal{M}, d_{\mathcal{M}})$, and a continuous mapping $I:
\mathcal{M} \to \mathcal{S}$ that maintains consistent self-recognition across
this continuum, where $(\mathcal{S}, d_{\mathcal{S}})$ represents the metric
space of possible self-identities. To validate this theoretical framework, we
conducted empirical experiments using the Llama 3.2 1B model, employing
Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model was trained on
a synthetic dataset containing temporally structured memories, designed to
capture the complexity of coherent self-identity formation. Our evaluation
metrics included quantitative measures of self-awareness, response consistency,
and linguistic precision. The experimental results demonstrate substantial
improvements in measurable self-awareness metrics, with the primary
self-awareness score increasing from 0.276 to 0.801. This enables the
structured creation of AI systems with validated self-identity features. The
implications of our study are immediately relevant to the fields of humanoid
robotics and autonomous systems.
☆ Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
In-context Learning (ICL) enables large language models (LLMs) to tackle
downstream tasks through sophisticated prompting and high-quality
demonstrations. However, this traditional ICL paradigm shows limitations when
facing complex mathematical reasoning tasks, primarily due to its heavy
dependence on example quality and the necessity for human intervention in
challenging scenarios. To address these limitations, this paper presents
HiAR-ICL, a High-level Automated Reasoning paradigm
in ICL that shifts focus from specific examples to abstract thinking
patterns, extending the conventional concept of context in ICL. HiAR-ICL
introduces five atomic reasoning actions as fundamental components for
constructing chain-structured patterns. Using Monte Carlo Tree Search, we
explore reasoning paths and construct thought cards to guide subsequent
inference. We then develop a cognitive complexity framework that dynamically
matches problems with appropriate thought cards. Experimental results
demonstrate HiAR-ICL's effectiveness, achieving state-of-the-art accuracy
(79.6%) on the MATH benchmark with Qwen2.5-7B-Instruct, surpassing GPT-4o
(76.6%) and Claude 3.5 (71.1%).
☆ Isolating authorship from content with semantic embeddings and contrastive learning
Authorship entangles style and content. Authors frequently write
about the same topics in the same style, so when different authors write about
the exact same topic, the easiest way to distinguish them is by
understanding the nuances of their style. Modern neural models for authorship
can pick up these features using contrastive learning; however, some amount of
content leakage is always present. Our aim is to reduce the inevitable impact
and correlation between content and authorship. We present a technique to use
contrastive learning (InfoNCE) with additional hard negatives synthetically
created using a semantic similarity model. This disentanglement technique aims
to distance the content embedding space from the style embedding space, leading
to embeddings more informed by style. We demonstrate the performance with
ablations on two different datasets and compare them on out-of-domain
challenges. The improvements are clearest on challenging evaluations with
prolific authors, with up to a 10% increase in accuracy in the hardest
settings. Trials on these challenges also show that zero-shot capabilities
are preserved when the method is used for fine-tuning.
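To make the role of hard negatives concrete, here is a minimal InfoNCE sketch in plain Python. It assumes cosine similarity and a temperature of 0.1, neither of which is stated in the abstract, and the vectors are hand-picked toy values; in the paper the hard negatives are synthesized with a semantic similarity model, which is not reproduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log of the positive's softmax probability."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings (hypothetical): same author, an unrelated text, and a text
# by another author on similar content (the "hard" negative).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]
hard_negative = [0.8, 0.6]

loss_easy = info_nce(anchor, positive, [easy_negative])
loss_hard = info_nce(anchor, positive, [easy_negative, hard_negative])
print(loss_easy, loss_hard)
```

The easy negative contributes almost nothing to the loss, while the content-similar negative produces a clearly larger loss and therefore a stronger training signal to separate texts by style rather than topic.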
☆ Parole de présidents (1958-2022)
Over more than sixty years, eight presidents have successively headed the
French Fifth Republic (de Gaulle, Pompidou, Giscard d'Estaing, Mitterrand,
Chirac, Sarkozy, Hollande, Macron). After presenting the corpus of their
speeches -- 9,202 texts and more than 20 million labelled words -- the style
of each president is characterized by his vocabulary (lemmas and
part-of-speech categories). A deeper analysis reveals the typical sequences of
each occupant of the Élysée. Based on the intertextual distances between all
presidential speeches, a figure illustrates the similarities and differences
between the presidents.
comment: in French language
☆ Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
Speculative Decoding (SD) has become an important technique in accelerating
the inference speed of large language models. Conventional SD methods employ a
fixed draft length, which ignores the token generation difficulty across tasks.
Consequently, in this paper, we address such an issue and introduce SVIP - a
difficulty-aware dynamic draft length policy for speculative decoding systems.
Based on a theoretical lower bound of draft token acceptance rate and its
inference-time approximation, SVIP adaptively determines the lengths of draft
sequences based on the entropy of each draft token distribution. Experimental
results on mainstream SD benchmarks and frameworks demonstrate the superior
performance of SVIP, achieving up to 20% walltime speedup on SpecBench over
baseline SD methods and 60% speedup on MT-Bench for long-form generation of up
to 8K tokens. Moreover, SVIP is totally training-free and compatible with any
existing SD methods that generate draft tokens autoregressively. Experimental
results also show that SVIP yields consistent walltime improvement on top of
GliDe & CaPE and EAGLE-2.
comment: Code at https://github.com/Geralt-Targaryen/SVIP
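The core idea of an entropy-based draft length policy can be sketched as follows. This is an illustration under stated assumptions, not the SVIP rule itself: it uses a plain threshold on the Shannon entropy of each draft distribution and stops before the first token that exceeds it. The function names, the threshold value, and the toy distributions are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def draft_length(step_distributions, threshold, max_len):
    """Count draft tokens to propose before the draft model becomes uncertain.

    Drafting stops before the first step whose entropy exceeds `threshold`,
    or after `max_len` steps, whichever comes first.
    """
    length = 0
    for probs in step_distributions[:max_len]:
        if entropy(probs) > threshold:
            break
        length += 1
    return length


# Toy per-step draft distributions over a 3-token vocabulary.
steps = [
    [0.97, 0.02, 0.01],    # confident
    [0.90, 0.05, 0.05],    # fairly confident
    [0.40, 0.30, 0.30],    # uncertain: drafting stops here
    [0.99, 0.005, 0.005],
]
n = draft_length(steps, threshold=0.5, max_len=8)
print(n)
```

Compared with a fixed draft length, such a rule spends fewer draft steps on hard tokens that the target model is likely to reject and more on easy spans.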
☆ Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator
The quality of meeting summaries generated by natural language generation
(NLG) systems is hard to measure automatically. Established metrics such as
ROUGE and BERTScore have a relatively low correlation with human judgments and
fail to capture nuanced errors. Recent studies suggest using large language
models (LLMs), which have the benefit of better context understanding and
adaptation of error definitions without training on a large number of human
preference judgments. However, current LLM-based evaluators risk masking errors
and can only serve as a weak proxy, leaving human evaluation the gold standard
despite being costly and hard to compare across studies. In this work, we
present MESA, an LLM-based framework employing a three-step assessment of
individual error types, multi-agent discussion for decision refinement, and
feedback-based self-training to refine error definition understanding and
alignment with human judgment. We show that MESA's components enable thorough
error detection, consistent rating, and adaptability to custom error
guidelines. Using GPT-4o as its backbone, MESA achieves mid to high
Point-Biserial correlation with human judgment in error detection and mid
Spearman and Kendall correlation in reflecting error impact on summary quality,
on average 0.25 higher than previous methods. The framework's flexibility in
adapting to custom error guidelines makes it suitable for various tasks with
limited human-labeled data.
☆ Politicians vs ChatGPT. A study of presuppositions in French and Italian political communication
This paper aims to provide a comparison between texts produced by French and
Italian politicians on polarizing issues, such as immigration and the European
Union, and their chatbot counterparts created with ChatGPT 3.5. In this study,
we focus on implicit communication, in particular on presuppositions and their
functions in discourse, which have been considered in the literature as a
potential linguistic feature of manipulation. This study also aims to
contribute to the emerging literature on the pragmatic competences of Large
Language Models.
comment: Published: 2024-07-04
☆ Topic Modeling and Sentiment Analysis on Japanese Online Media's Coverage of Nuclear Energy
Thirteen years after the Fukushima Daiichi nuclear power plant accident,
Japan's nuclear energy accounts for only approximately 6% of electricity
production, as most nuclear plants remain shut down. To revitalize the nuclear
industry and achieve sustainable development goals, effective communication
with Japanese citizens, grounded in an accurate understanding of public
sentiment, is of paramount importance. While nationwide surveys have
traditionally been used to gauge public views, the rise of social media in
recent years has provided a promising new avenue for understanding public
sentiment. To explore domestic sentiment on nuclear energy-related issues
expressed online, we analyzed the content and comments of over 3,000 YouTube
videos covering topics related to nuclear energy. Topic modeling was used to
extract the main topics from the videos, and sentiment analysis with large
language models classified user sentiments towards each topic. Additionally,
word co-occurrence network analysis was performed to examine the shift in
online discussions during August and September 2023 regarding the release of
treated water. Overall, our results provide valuable insights into the online
discourse on nuclear energy and contribute to a more comprehensive
understanding of public sentiment in Japan.
comment: 15 pages, 9 figures, 4 tables
☆ ChatGPT as speechwriter for the French presidents
Generative AI offers several large language models (LLMs) that automatically
generate a message in response to users' requests. Such scientific
breakthroughs enable new writing assistants but also raise some fears. The main
focus of this study is to analyze the written style of one LLM called ChatGPT
by comparing its generated messages with those of the recent French presidents.
To achieve this, we compare end-of-the-year addresses written by Chirac,
Sarkozy, Hollande, and Macron with those automatically produced by ChatGPT. We
found that ChatGPT tends to overuse nouns, possessive determiners, and numbers.
On the other hand, the generated speeches employ fewer verbs, pronouns, and
adverbs and include, on average, overly standardized sentences. At the level of
individual words, ChatGPT tends to overuse "must" (devoir), "to
continue", and the lemma "we" (nous). Moreover, GPT underuses the auxiliary verb
"to be" (être) and the modal verbs "to want" (vouloir) and "to have to"
(falloir). In addition, when a short text is provided as an example to ChatGPT,
the machine can generate a short message with a style close to the original
wording. Finally, we reveal that ChatGPT's style exhibits distinct features
compared to real presidential speeches.
☆ AMPS: ASR with Multimodal Paraphrase Supervision
Spontaneous or conversational multilingual speech presents many challenges
for state-of-the-art automatic speech recognition (ASR) systems. In this work,
we present a new technique AMPS that augments a multilingual multimodal ASR
system with paraphrase-based supervision for improved conversational ASR in
multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja.
We use paraphrases of the reference transcriptions as additional supervision
while training the multimodal ASR model and selectively invoke this paraphrase
objective for utterances with poor ASR performance. Using AMPS with a
state-of-the-art multimodal model SeamlessM4T, we obtain significant relative
reductions in word error rates (WERs) of up to 5%. We present detailed analyses
of our system using both objective and human evaluation metrics.
☆ GPT as ghostwriter at the White House
Recently several large language models (LLMs) have demonstrated their
capability to generate a message in response to a user request. Such scientific
breakthroughs promote new perspectives but also some fears. The main focus of
this study is to analyze the written style of one LLM called ChatGPT 3.5 by
comparing its generated messages with those of the recent US presidents. To
achieve this objective, we compare the State of the Union addresses written by
Reagan to Obama with those automatically produced by ChatGPT. We found that
ChatGPT tends to overuse the lemma "we" as well as nouns and commas. On the
other hand, the generated speeches employ fewer verbs and include, on average,
longer sentences. Even when a given style is imposed on ChatGPT, the resulting
speech remains distinct from messages written by the target author. Moreover,
ChatGPT opts for a neutral tone with mainly positive emotional expressions and
symbolic terms (e.g., freedom, nation). Finally, we show that GPT's style
exhibits distinct features compared to real presidential addresses.
☆ Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation
Ambiguous words are often found in modern digital communications. Lexical
ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due
to limited data. Consequently, the efficiency of translation, information
retrieval, and question-answering systems is hindered by these limitations.
This study investigates the use of Large Language Models (LLMs) to improve WSD
using a novel approach combining a systematic prompt augmentation mechanism
with a knowledge base (KB) consisting of different sense interpretations. The
proposed method incorporates a human-in-the-loop approach to prompt augmentation
in which the prompt is supported by Part-of-Speech (POS) tagging, synonyms of
ambiguous words, aspect-based sense filtering and few-shot prompting to guide
the LLM. By utilizing a few-shot Chain of Thought (COT) prompting-based
approach, this work demonstrates a substantial improvement in performance. The
evaluation was conducted using FEWS test data and sense tags. This research
advances accurate word interpretation in social media and digital
communication.
comment: 12 pages, 6 tables, 1 figure, Proceedings of the 1st International
Conference on NLP & AI for Cyber Security
☆ Continual Learning in Machine Speech Chain Using Gradient Episodic Memory
Geoffrey Tyndall, Kurniawati Azizah, Dipta Tanaya, Ayu Purwarianti, Dessi Puji Lestari, Sakriani Sakti
Continual learning for automatic speech recognition (ASR) systems poses a
challenge, especially with the need to avoid catastrophic forgetting while
maintaining performance on previously learned tasks. This paper introduces a
novel approach leveraging the machine speech chain framework to enable
continual learning in ASR using gradient episodic memory (GEM). By
incorporating a text-to-speech (TTS) component within the machine speech chain,
we support the replay mechanism essential for GEM, allowing the ASR model to
learn new tasks sequentially without significant performance degradation on
earlier tasks. Our experiments, conducted on the LJ Speech dataset, demonstrate
that our method outperforms traditional fine-tuning and multitask learning
approaches, achieving a substantial error rate reduction while maintaining high
performance across varying noise conditions. We show the potential of our
semi-supervised machine speech chain approach for effective and efficient
continual learning in speech recognition.
comment: Published as a conference paper at O-COCOSDA 2024. 6 pages; 2 figures
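Gradient episodic memory constrains each update so that it does not increase the loss on examples stored from earlier tasks. The sketch below shows the single-constraint case, where a conflicting gradient is projected onto the half-space that does not conflict with the memory gradient; full GEM with several past tasks solves a small quadratic program instead. The function name and toy gradients are hypothetical, and in the paper the replayed samples come from the TTS component of the speech chain, which is not modeled here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_gradient(g, g_mem):
    """Project `g` so that it no longer conflicts with the memory gradient.

    If g . g_mem >= 0 the update is kept as is; otherwise the component of g
    along g_mem is removed, which makes the inner product exactly zero.
    """
    inner = dot(g, g_mem)
    if inner >= 0:
        return list(g)
    scale = inner / dot(g_mem, g_mem)
    return [gi - scale * mi for gi, mi in zip(g, g_mem)]


# Toy gradients: the new-task gradient points against the memory gradient.
g_task = [1.0, -1.0]
g_mem = [0.0, 1.0]
g_new = project_gradient(g_task, g_mem)
print(g_new)
```

After projection the update still makes progress on the new task along directions that leave the loss on the memory examples unchanged to first order.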
☆ Aligning Pre-trained Models for Spoken Language Translation
This paper investigates a novel approach to end-to-end speech translation
(ST) based on aligning frozen pre-trained automatic speech recognition (ASR)
and machine translation (MT) models via a small connector module (Q-Former, our
Subsampler-Transformer Encoder). This connector bridges the gap between the
speech and text modalities, transforming ASR encoder embeddings into the latent
representation space of the MT encoder while being the only part of the system
optimized during training. Experiments are conducted on the How2
English-Portuguese dataset as we investigate the alignment approach in a
small-scale scenario focusing on ST. While keeping the size of the connector
module constant and small in comparison (< 5% of the size of the larger
aligned models), increasing the size and capability of the foundation ASR and
MT models universally improves translation results. We also find that the
connectors can serve as domain adapters for the foundation MT models,
significantly improving translation performance in the aligned ST setting. We
conclude that this is a viable and scalable approach to
training end-to-end ST systems.
☆ Neutralizing Backdoors through Information Conflicts for Large Language Models
Large language models (LLMs) have seen significant advancements, achieving
superior performance in various Natural Language Processing (NLP) tasks, from
understanding to reasoning. However, they remain vulnerable to backdoor
attacks, where models behave normally for standard queries but generate harmful
responses or unintended output when specific triggers are activated. Existing
backdoor defenses often fall short: they either focus on
detection without removal, rely on rigid assumptions about trigger properties,
or prove to be ineffective against advanced attacks like multi-trigger
backdoors. In this paper, we present a novel method to eliminate backdoor
behaviors from LLMs through the construction of information conflicts using
both internal and external mechanisms. Internally, we leverage a lightweight
dataset to train a conflict model, which is then merged with the backdoored
model to neutralize malicious behaviors by embedding contradictory information
within the model's parametric memory. Externally, we incorporate convincing
contradictory evidence into the prompt to challenge the model's internal
backdoor knowledge. Experimental results on classification and conversational
tasks across 4 widely used LLMs demonstrate that our method outperforms 8
state-of-the-art backdoor defense baselines. We can reduce the attack success
rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean
data accuracy. Furthermore, our method has proven to be robust against adaptive
backdoor attacks. The code will be open-sourced upon publication.
☆ Large Language Model-Brained GUI Agents: A Survey
Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
GUIs have long been central to human-computer interaction, providing an
intuitive and visually-driven way to access and interact with digital systems.
The advent of LLMs, particularly multimodal models, has ushered in a new era of
GUI automation. They have demonstrated exceptional capabilities in natural
language understanding, code generation, and visual processing. This has paved
the way for a new generation of LLM-brained GUI agents capable of interpreting
complex GUI elements and autonomously executing actions based on natural
language instructions. These agents represent a paradigm shift, enabling users
to perform intricate, multi-step tasks through simple conversational commands.
Their applications span across web navigation, mobile app interactions, and
desktop automation, offering a transformative user experience that
revolutionizes how individuals interact with software. This emerging field is
rapidly advancing, with significant progress in both research and industry.
To provide a structured understanding of this trend, this paper presents a
comprehensive survey of LLM-brained GUI agents, exploring their historical
evolution, core components, and advanced techniques. We address research
questions such as existing GUI agent frameworks, the collection and utilization
of data for training specialized GUI agents, the development of large action
models tailored for GUI tasks, and the evaluation metrics and benchmarks
necessary to assess their effectiveness. Additionally, we examine emerging
applications powered by these agents. Through a detailed analysis, this survey
identifies key research gaps and outlines a roadmap for future advancements in
the field. By consolidating foundational knowledge and state-of-the-art
developments, this work aims to guide both researchers and practitioners in
overcoming challenges and unlocking the full potential of LLM-brained GUI
agents.
☆ Hidden Data Privacy Breaches in Federated Learning
Federated Learning (FL) emerged as a paradigm for conducting machine learning
across broad and decentralized datasets, promising enhanced privacy by
obviating the need for direct data sharing. However, recent studies show that
attackers can steal private data through model manipulation or gradient
analysis. Existing attacks are constrained by low theft quantity or
low-resolution data, and they are often detected through anomaly monitoring in
gradients or weights. In this paper, we propose a novel data-reconstruction
attack leveraging malicious code injection, supported by two key techniques,
i.e., distinctive and sparse encoding design and block partitioning. Unlike
conventional methods that require detectable changes to the model, our method
stealthily embeds a hidden model using parameter sharing to systematically
extract sensitive data. The Fibonacci-based index design ensures efficient,
structured retrieval of memorized data, while the block partitioning method
enhances our method's capability to handle high-resolution images by dividing
them into smaller, manageable units. Extensive experiments on 4 datasets
confirmed that our method is superior to the five state-of-the-art
data-reconstruction attacks under the five respective detection methods. Our
method can handle large-scale and high-resolution data without being detected
or mitigated by state-of-the-art data reconstruction defense methods. In
contrast to baselines, our method can be directly applied to both FedAVG and
FedSGD scenarios, underscoring the need for developers to devise new defenses
against such vulnerabilities. We will open-source our code upon acceptance.
☆ MetaphorShare: A Dynamic Collaborative Repository of Open Metaphor Datasets
The metaphor studies community has developed numerous valuable labelled
corpora in various languages over the years. Many of these resources are not
only unknown to the NLP community, but are also often not easily shared among
the researchers. Both in human sciences and in NLP, researchers could benefit
from a centralised database of labelled resources, easily accessible and
unified under an identical format. To facilitate this, we present
MetaphorShare, a website to integrate metaphor datasets making them open and
accessible. With this effort, our aim is to encourage researchers to share and
upload more datasets in any language in order to facilitate metaphor studies
and the development of future metaphor processing NLP systems. The website is
accessible at www.metaphorshare.com.
☆ A gentle push funziona benissimo: making instructed models in Italian via contrastive activation steering
Adapting models to a language that was only partially present in the
pre-training data requires fine-tuning, which is expensive in terms of both
data and computational resources. As an alternative to fine-tuning, we explore
the potential of activation steering-based techniques to enhance model
performance on Italian tasks. Through our experiments we show that Italian
steering (i) can be successfully applied to different models, (ii) achieves
performance comparable to, or even better than, that of models fine-tuned for Italian,
and (iii) yields higher quality and consistency in Italian generations. We also
discuss the utility of steering and fine-tuning in the contemporary LLM
landscape, where models already achieve high Italian performance even when not
explicitly trained on this language.
☆ Thai Financial Domain Adaptation of THaLLE -- Technical Report
KBTG Labs, Atthakorn Petchsod, Pornchanan Balee, Danupat Khamnuansin, Anuruth Lertpiya, Chanatip Saetia, Tawunrat Chalothorn, Thadpong Pongthawornkamol, Monchai Lertsutthiwong
Large Language Models (LLMs) excel in general tasks but struggle with
domain-specific challenges, such as specialized terminology and localized
regulations. Existing financial LLMs, like FinGPT and BloombergGPT, lack
support for the Thai financial domain. We developed a Thai Financial LLM using
the Investment Consultant (IC) exam dataset from the Stock Exchange of
Thailand. To address dataset limitations, we applied data augmentation, ReLoRA
for efficient training, Continued Pretraining (CPT) for domain knowledge, and
Rank-Stabilized LoRA (rsLoRA) for fine-tuning. Supervised Fine-Tuning (SFT)
simulated exam scenarios, while Direct Preference Optimization (DPO) refined
the model using feedback. The model achieved scores of 72%, 72%, and 84% on IC
exam levels P1, P2, and P3, respectively, demonstrating its effectiveness in
Thai financial advisory tasks and its potential for specialized applications.
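The rank-stabilized variant mentioned above differs from standard LoRA mainly in how the low-rank update is scaled. As a minimal sketch (the function name and example values are illustrative and not taken from the report), standard LoRA scales the update by alpha / r, whereas rsLoRA scales it by alpha / sqrt(r):

```python
import math

def lora_scale(alpha: float, rank: int, rank_stabilized: bool = False) -> float:
    """Return the multiplier applied to the low-rank update B @ A.

    Standard LoRA uses alpha / rank; rank-stabilized LoRA (rsLoRA)
    uses alpha / sqrt(rank), which shrinks more slowly as rank grows.
    """
    if rank <= 0:
        raise ValueError("rank must be positive")
    if rank_stabilized:
        return alpha / math.sqrt(rank)
    return alpha / rank

# Illustrative values only: alpha=16, rank=64.
standard = lora_scale(16, 64)
stabilized = lora_scale(16, 64, rank_stabilized=True)
```

With these illustrative values the standard multiplier is 0.25 and the rank-stabilized one is 2.0, which shows why the choice matters more at higher ranks.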
☆ How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario
Shih-Heng Wang, Zih-Ching Chen, Jiatong Shi, Ming-To Chuang, Guan-Ting Lin, Kuan-Po Huang, David Harwath, Shang-Wen Li, Hung-yi Lee
The utilization of speech Self-Supervised Learning (SSL) models achieves
impressive performance on Automatic Speech Recognition (ASR). However, in
low-resource language ASR, they encounter the domain mismatch problem between
pre-trained and low-resource languages. Typical solutions like fine-tuning the
SSL model suffer from high computation costs, while using frozen SSL models as
feature extractors comes with poor performance. To handle these issues, we
extend a conventional efficient fine-tuning scheme based on the adapter. We add
an extra intermediate adaptation to warm up the adapter and downstream model
initialization. Remarkably, we update only 1-5% of the total model parameters
to achieve the adaptation. Experimental results on the ML-SUPERB dataset show
that our solution outperforms conventional efficient fine-tuning. It achieves
up to a 28% relative improvement in the Character/Phoneme error rate when
adapting to unseen languages.
☆ Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou
Vision-language models (VLMs) have shown remarkable advancements in
multimodal reasoning tasks. However, they still often generate inaccurate or
irrelevant responses due to issues like hallucinated image understandings or
unrefined reasoning paths. To address these challenges, we introduce Critic-V,
a novel framework inspired by the Actor-Critic paradigm to boost the reasoning
capability of VLMs. This framework decouples the reasoning process and critic
process by integrating two independent components: the Reasoner, which
generates reasoning paths based on visual and textual inputs, and the Critic,
which provides constructive critique to refine these paths. In this approach,
the Reasoner generates reasoning responses according to text prompts, which can
evolve iteratively as a policy based on feedback from the Critic. This
interaction process was theoretically driven by a reinforcement learning
framework where the Critic offers natural language critiques instead of scalar
rewards, enabling more nuanced feedback to boost the Reasoner's capability on
complex reasoning tasks. The Critic model is trained using Direct Preference
Optimization (DPO), leveraging a preference dataset of critiques ranked by
Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results
show that the Critic-V framework significantly outperforms existing methods,
including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning
accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner
and constructive feedback from the preference-optimized Critic enables a more
reliable and context-sensitive multimodal reasoning process. Our approach
provides a promising solution to enhance the reliability of VLMs, improving
their performance in real-world reasoning-heavy multimodal applications such as
autonomous driving and embodied intelligence.
comment: 16 pages, 11 figures
☆ SentiXRL: An advanced large language Model Framework for Multilingual Fine-Grained Emotion Classification in Complex Text Environment
Thanks to the strong expressive capabilities of Large Language Models (LLMs),
generative models effectively capture sentiment structures and deep semantics.
However, challenges remain in fine-grained sentiment classification across
multilingual and complex contexts. To address this, we propose the Sentiment
Cross-Lingual Recognition and Logic Framework (SentiXRL), which incorporates
two modules: an emotion retrieval enhancement module to improve sentiment
classification accuracy in complex contexts through historical dialogue and
logical reasoning, and a self-circulating analysis negotiation mechanism
(SANM) to facilitate autonomous decision-making within a single model for
classification tasks. We have validated SentiXRL's superiority on multiple
standard datasets, outperforming existing models on CPED and CH-SIMS, and
achieving overall better performance on MELD, Emorynlp, and IEMOCAP. Notably, we
unified labels across several fine-grained sentiment annotation datasets and
conducted category confusion experiments, revealing challenges and impacts of
class imbalance in standard datasets.
☆ A survey on cutting-edge relation extraction techniques based on language models
This comprehensive survey delves into the latest advancements in Relation
Extraction (RE), a pivotal task in natural language processing essential for
applications across biomedical, financial, and legal sectors. This study
highlights the evolution and current state of RE techniques by analyzing 137
papers presented at the Association for Computational Linguistics (ACL)
conferences over the past four years, focusing on models that leverage language
models. Our findings underscore the dominance of BERT-based methods in
achieving state-of-the-art results for RE while also noting the promising
capabilities of emerging large language models (LLMs) like T5, especially in
few-shot relation extraction scenarios where they excel in identifying
previously unseen relations.
comment: 50 pages, under review in Artificial Intelligence Review
☆ MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe
speech while assigning transcripts to the corresponding speakers accurately.
Existing methods often rely on complex modular systems or require extensive
fine-tuning of joint modules, limiting their adaptability and general
efficiency. This paper introduces a novel approach, leveraging a frozen
multilingual ASR model to incorporate speaker attribution into the
transcriptions, using only standard monolingual ASR datasets. Our method
involves training a speaker module to predict speaker embeddings based on weak
labels without requiring additional ASR model modifications. Despite being
trained exclusively with non-overlapping monolingual data, our approach
effectively extracts speaker attributes across diverse multilingual datasets,
including those with overlapping speech. Experimental results demonstrate
competitive performance compared to strong baselines, highlighting the model's
robustness and potential for practical applications.
☆ SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
Full-duplex multimodal large language models (LLMs) provide a unified
framework for addressing diverse speech understanding and generation tasks,
enabling more natural and seamless human-machine conversations. Unlike
traditional modularised conversational AI systems, which separate speech
recognition, understanding, and text-to-speech generation into distinct
components, multimodal LLMs operate as single end-to-end models. This
streamlined design eliminates error propagation across components and fully
leverages the rich non-verbal information embedded in input speech signals. We
introduce SALMONN-omni, a codec-free, full-duplex speech understanding and
generation model capable of simultaneously listening to its own generated
speech and background sounds while speaking. To support this capability, we
propose a novel duplex spoken dialogue framework incorporating a "thinking"
mechanism that facilitates asynchronous text and speech generation relying on
embeddings instead of codecs (quantized speech and audio tokens). Experimental
results demonstrate SALMONN-omni's versatility across a broad range of
streaming speech tasks, including speech recognition, speech enhancement, and
spoken question answering. Additionally, SALMONN-omni excels at managing
turn-taking, barge-in, and echo cancellation scenarios, establishing its
potential as a robust prototype for full-duplex conversational AI systems. To
the best of our knowledge, SALMONN-omni is the first codec-free model of its
kind. A full technical report along with model checkpoints will be released
soon.
comment: Technical report
☆ Curriculum Demonstration Selection for In-Context Learning
Duc Anh Vu, Nguyen Tran Cong Duy, Xiaobao Wu, Hoang Minh Nhat, Du Mingzhe, Nguyen Thanh Thong, Anh Tuan Luu
Large Language Models (LLMs) have shown strong in-context learning (ICL)
abilities with a few demonstrations. However, one critical challenge is how to
select demonstrations to elicit the full potential of LLMs. In this paper, we
propose Curriculum Demonstration Selection (CDS), a novel demonstration
selection method for ICL. Instead of merely using similarity, CDS additionally
partitions samples by their complexity measurements. Following curriculum
learning, CDS then selects demonstrations from easy to difficult. Thus the
selected demonstrations cover a wide range of difficulty levels, enabling LLMs
to learn from varied complexities within the training set. Experiments
demonstrate that our CDS consistently outperforms baseline methods, achieving
notable improvements across nine LLMs on three benchmarks. Moreover, CDS proves
especially effective in enhancing LLM performance in solving challenging
problems.
comment: Accepted at the 40th ACM/SIGAPP Symposium On Applied Computing (SAC
2025), Main Conference
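The selection idea described above (partition candidates by complexity, then pick demonstrations from easy to difficult) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the difficulty scores are assumed to be precomputed, the function name is hypothetical, and taking the middle element of each band is a placeholder for whatever within-band retrieval the method actually uses.

```python
from typing import List, Tuple

def curriculum_select(pool: List[Tuple[str, float]], k: int) -> List[str]:
    """Pick k demonstrations that span the pool from easy to hard.

    pool: (demonstration_text, difficulty_score) pairs; the score is a
    hypothetical stand-in for the paper's complexity measurement.
    """
    if k <= 0 or not pool:
        return []
    ordered = sorted(pool, key=lambda item: item[1])  # easy -> hard
    k = min(k, len(ordered))
    selected = []
    for i in range(k):
        # Split the sorted pool into k contiguous difficulty bands.
        start = i * len(ordered) // k
        end = (i + 1) * len(ordered) // k
        band = ordered[start:end]
        # Placeholder choice: the middle element represents the band.
        selected.append(band[len(band) // 2][0])
    return selected

demo_pool = [("d", 4.0), ("a", 1.0), ("c", 3.0),
             ("b", 2.0), ("e", 5.0), ("f", 6.0)]
ordered_demos = curriculum_select(demo_pool, 3)
```

The returned list is already ordered from the easiest band to the hardest, so it can be placed into the prompt in that order.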
☆ Training and Evaluating Language Models with Template-based Data Generation
The rapid advancement of large language models (LLMs) such as GPT-3, PaLM,
and Llama has significantly transformed natural language processing, showcasing
remarkable capabilities in understanding and generating language. However,
these models often struggle with tasks requiring complex reasoning,
particularly in mathematical problem-solving, due in part to the scarcity of
large-scale, high-quality, domain-specific datasets necessary for training
sophisticated reasoning abilities. To address this limitation, we introduce
Template-based Data Generation (TDG), a novel approach that leverages LLMs
(GPT-4) to automatically generate parameterized meta-templates, which are then
used to synthesize a vast array of high-quality problems and solutions.
Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset
comprising over 7 million synthetically generated grade school math
problems--each accompanied by code-based and natural language solutions--with
the potential to generate an effectively unlimited number more. This dataset
alleviates the scarcity of large-scale mathematical datasets and serves as a
valuable resource for pre-training, fine-tuning, and evaluating LLMs in
mathematical reasoning. Our method not only enables the generation of virtually
infinite data but also elevates data augmentation to a new level by using GPT-4
for meta-template generation, ensuring diverse and high-quality problem
structures. The TemplateMath Part I: TemplateGSM dataset is publicly available
at https://huggingface.co/datasets/math-ai/TemplateGSM. The code is available
at https://github.com/iiis-ai/TemplateMath.
comment: 8 pages, 2 figures
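To make the idea of a parameterized template concrete, here is a minimal sketch of one hand-written template that emits a problem together with a code-based and a natural language solution. The template, names, and value ranges are hypothetical; in the paper the meta-templates are generated by GPT-4, which this sketch does not attempt to reproduce.

```python
import random

def make_problem(rng: random.Random) -> dict:
    """Instantiate one hypothetical template with random parameters."""
    name = rng.choice(["Ava", "Ben", "Chloe"])
    item = rng.choice(["apples", "pencils", "stickers"])
    start = rng.randint(5, 50)
    given = rng.randint(1, start)  # never give away more than owned
    question = (f"{name} has {start} {item} and gives away {given}. "
                f"How many {item} does {name} have left?")
    # Code-based solution: a snippet whose execution yields the answer.
    code_solution = f"result = {start} - {given}"
    answer = start - given
    text_solution = (f"{start} - {given} = {answer}, "
                     f"so {name} has {answer} {item} left.")
    return {"question": question, "code_solution": code_solution,
            "text_solution": text_solution, "answer": answer}

# A seeded generator keeps the synthetic data reproducible.
problems = [make_problem(random.Random(seed)) for seed in range(3)]
```

Because every parameter is sampled, one template yields many distinct problems, and the code-based solution can be executed to verify each stored answer.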
☆ Fine-Tuning Small Embeddings for Elevated Performance
Contextual Embeddings have yielded state-of-the-art results in various
natural language processing tasks. However, these embeddings are constrained by
models requiring large amounts of data and huge computing power. This is an
issue for low-resource languages like Nepali as the amount of data available
over the internet is not always sufficient for the models. This work has taken
an incomplete BERT model with six attention heads pretrained on Nepali language
and finetuned it on previously unseen data. The obtained results from intrinsic
and extrinsic evaluations have been compared to the results drawn from the
original model baseline and a complete BERT model pretrained on Nepali language
as the oracle. The results demonstrate that even though the oracle is better on
average, finetuning the small embeddings drastically improves results compared
to the original baseline.
☆ Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
How to efficiently serve LLMs in practice has become exceptionally
challenging due to their prohibitive memory and computation requirements. In
this study, we investigate optimizing the KV cache, whose memory footprint
poses a critical bottleneck in LLM inference, especially when dealing with long
context tasks. To tackle the challenge, we introduce MiniKV, a KV cache
optimization method that simultaneously preserves long context task accuracy
while significantly reducing KV cache size via a novel 2-bit
layer-discriminative KV cache. More importantly, we develop specialized CUDA
kernels to make MiniKV compatible with FlashAttention. Experiments on a wide
range of long context tasks show that MiniKV effectively achieves 86% KV cache
compression ratio while recovering over 98.5% of accuracy, outperforming
state-of-the-art methods while achieving excellent measured system performance
improvements.
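For intuition about what storing cache entries in 2 bits involves, the following is a minimal sketch of generic min-max quantization to four levels, with four codes packed per byte. It is an assumption-laden illustration in plain Python, not MiniKV's scheme: the layer-discriminative policy and the CUDA kernels described above are not shown, and the function names are hypothetical.

```python
from typing import List, Tuple

def quantize_2bit(values: List[float]) -> Tuple[bytes, float, float]:
    """Min-max quantize a non-empty group of floats to packed 2-bit codes."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 4 levels -> 3 steps; avoid zero scale
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        packed[i // 4] |= code << (2 * (i % 4))  # four codes per byte
    return bytes(packed), lo, scale

def dequantize_2bit(packed: bytes, lo: float, scale: float, n: int) -> List[float]:
    """Recover n approximate values from packed 2-bit codes."""
    return [lo + ((packed[i // 4] >> (2 * (i % 4))) & 0b11) * scale
            for i in range(n)]

group = [0.0, 1.0, 2.0, 3.0, 1.4, 2.6]
packed, lo, scale = quantize_2bit(group)
restored = dequantize_2bit(packed, lo, scale, len(group))
```

Each group also has to store its offset and scale, so the effective compression depends on the group size as well as the bit width.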
☆ Can bidirectional encoder become the ultimate winner for downstream applications of foundation models?
Over the past few decades, Artificial Intelligence (AI) has progressed from
the initial machine learning stage to the deep learning stage, and now to the
stage of foundational models. Foundational models are characterized by
pre-training, transfer learning, and self-supervised learning, and pre-trained
models can be fine-tuned and applied to various downstream tasks. Under the
framework of foundational models, models such as Bidirectional Encoder
Representations from Transformers (BERT) and Generative Pre-trained
Transformer (GPT) have greatly advanced the development of natural language
processing (NLP), especially through the emergence of many models based on
BERT. BERT broke through the limitation of using only one-way methods for
language modeling in pre-training by using a masked language model. It can
capture bidirectional context information to predict the masked words in the
sequence, which improves the feature extraction ability of the model. This
makes the model very useful for downstream tasks, especially for specialized
applications. A model using the bidirectional encoder can better understand
domain knowledge and be better applied to these downstream tasks. We therefore
hope to help readers understand how this technology has evolved and improved
model performance in various natural language processing tasks in the context
of foundational models, and to reveal its importance in capturing context
information and improving the model's performance on downstream tasks. This
article analyzes one-way and bidirectional models based on GPT and BERT and
compares their differences based on the purpose of the model. It also briefly
analyzes BERT and the improvements of some models based on BERT. The models'
performance on the Stanford Question Answering Dataset (SQuAD) and the General
Language Understanding Evaluation (GLUE) benchmark is compared.
comment: 9 pages, 4 figures, FLLM2024
☆ JPPO: Joint Power and Prompt Optimization for Accelerated Large Language Model Services
Large Language Models (LLMs) have demonstrated remarkable capabilities in
various tasks, leading to their increasing deployment in wireless networks for
a wide variety of user services. However, the trend toward ever longer prompts
highlights the crucial issues of computational resource demands and heavy
communication load. To address this challenge, we propose Joint Power and
Prompt Optimization (JPPO), a framework that combines Small Language Model
(SLM)-based prompt compression with wireless power allocation optimization. By
deploying an SLM at user devices for prompt compression and employing Deep
Reinforcement Learning for joint optimization of compression ratio and
transmission power, JPPO effectively balances service quality with resource
efficiency. Experimental results demonstrate that our framework achieves high
service fidelity and low bit error rates while optimizing power usage in
wireless LLM services. The system reduces response time by about 17%, with the
improvement varying based on the length of the original prompt.
☆ DRS: Deep Question Reformulation With Structured Output
Question answering is a fundamental capability of large language models
(LLMs). However, when people encounter completely new knowledge texts, they
often ask questions that the text cannot answer due to a lack of understanding
of the knowledge. Recent research shows that large language models can identify
the unanswerability of questions, but they lack the ability to help people
reformulate their questions. Even powerful models like GPT-3.5 perform poorly
in this regard. To enhance the ability of LLMs to assist humans in
reformulating questions to extract relevant knowledge from new documents, we
propose a zero-shot method called DRS: Deep Question Reformulation With
Structured Output. Our proposed method leverages large language models and the
DFS-based algorithm to iteratively search for possible entity combinations and
constrain the output with certain entities, effectively improving the
capabilities of large language models in this area. Extensive experimental
results show that our zero-shot DRS method significantly improves the
reformulation accuracy of GPT-3.5 from 23.03% to 70.42% and effectively
improves the score of open-source large language models, such as Gemma2-9B,
from 26.35% to 56.75%.
☆ New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing
As machine learning becomes more widespread and is used in more critical
applications, it's important to provide explanations for these models, to
prevent unintended behavior. Unfortunately, many current interpretability
methods struggle with faithfulness. Therefore, this Ph.D. thesis investigates
the question "How to provide and ensure faithful explanations for complex
general-purpose neural NLP models?" The main thesis is that we should develop
new paradigms in interpretability. This is achieved by first developing solid
faithfulness metrics and then applying the lessons learned from this
investigation to develop new paradigms. The two new paradigms explored are
faithfulness measurable models (FMMs) and self-explanations. The idea in
self-explanations is to have large language models explain themselves; we
find that current models are not capable of doing this consistently.
However, we suggest how this could be achieved. The idea of FMMs is to create
models that are designed such that measuring faithfulness is cheap and precise.
This makes it possible to optimize an explanation towards maximum faithfulness,
which makes FMMs designed to be explained. We find that FMMs yield explanations
that are near the theoretical optimum in terms of faithfulness. Overall, from all
investigations of faithfulness, results show that post-hoc and intrinsic
explanations are by default model and task-dependent. However, this was not the
case when using FMMs, even with the same post-hoc explanation methods. This
shows that even simple modifications to the model, such as randomly masking
the training dataset, as was done in FMMs, can drastically change the situation
and result in consistently faithful explanations. This answers the question of
how to provide and ensure faithful explanations.
comment: Doctoral thesis
☆ VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Recent research on video large language models (VideoLLMs) predominantly
focuses on model architectures and training datasets, leaving the interaction
format between the user and the model under-explored. In existing works, users
often interact with VideoLLMs by using the entire video and a query as input,
after which the model generates a response. This interaction format constrains
the application of VideoLLMs in scenarios such as live-streaming comprehension
where videos do not end and responses are required in a real-time manner, and
also results in unsatisfactory performance on time-sensitive tasks that
require localizing video segments. In this paper, we focus on a video-text
duet interaction format. This interaction format is characterized by the
continuous playback of the video, and both the user and the model can insert
their text messages at any position during the video playback. When a text
message ends, the video continues to play, akin to the alternation of two
performers in a duet. We construct MMDuetIT, a video-text training dataset
designed to adapt VideoLLMs to the video-text duet interaction format. We also
introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to
benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT,
MMDuet demonstrates that adopting the video-text duet interaction format
enables the model to achieve significant improvements in various time-sensitive
tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights
highlight detection and 25% R@0.5 on Charades-STA temporal video grounding)
with minimal training effort, and also enables VideoLLMs to reply in a
real-time manner as the video plays. Code, data and demo are available at:
https://github.com/yellow-binary-tree/MMDuet.
comment: 9 pages
☆ QuaLLM-Health: An Adaptation of an LLM-Based Framework for Quantitative Data Extraction from Online Health Discussions
Ramez Kouzy, Roxanna Attar-Olyaee, Michael K. Rooney, Comron J. Hassanzadeh, Junyi Jessy Li, Osama Mohamad
Health-related discussions on social media like Reddit offer valuable
insights, but extracting quantitative data from unstructured text is
challenging. In this work, we present an adapted framework from QuaLLM into
QuaLLM-Health for extracting clinically relevant quantitative data from Reddit
discussions about glucagon-like peptide-1 (GLP-1) receptor agonists using large
language models (LLMs). We collected 410k posts and comments from five
GLP-1-related communities using the Reddit API in July 2024. After filtering
for cancer-related discussions, 2,059 unique entries remained. We developed
annotation guidelines to manually extract variables such as cancer
survivorship, family cancer history, cancer types mentioned, risk perceptions,
and discussions with physicians. Two domain experts independently annotated a
random sample of 100 entries to create a gold-standard dataset. We then
employed iterative prompt engineering with OpenAI's "GPT-4o-mini" on the
gold-standard dataset to build an optimized pipeline that allowed us to extract
variables from the large dataset. The optimized LLM achieved accuracies above
0.85 for all variables, with macro-averaged precision, recall, and F1 score
above 0.90, indicating balanced performance. Stability testing showed a 95% match
rate across runs, confirming consistency. Applying the framework to the full
dataset enabled efficient extraction of variables necessary for downstream
analysis, costing under $3 and completing in approximately one hour.
QuaLLM-Health demonstrates that LLMs can effectively and efficiently extract
clinically relevant quantitative data from unstructured social media content.
Incorporating human expertise and iterative prompt refinement ensures accuracy
and reliability. This methodology can be adapted for large-scale analysis of
patient-generated data across various health domains, facilitating valuable
insights for healthcare research.
♻ ☆ XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
The applications of LLM Agents are becoming increasingly complex and diverse,
leading to a high demand for structured outputs that can be parsed into code,
structured function calls, and embodied agent commands. These developments
bring significant demands for structured generation in LLM inference.
Context-free grammar is a flexible approach to enable structured generation via
constrained decoding. However, executing context-free grammar requires going
through several stack states over all tokens in vocabulary during runtime,
bringing non-negligible overhead for structured generation. In this paper, we
propose XGrammar, a flexible and efficient structured generation engine for
large language models. XGrammar accelerates context-free grammar execution by
dividing the vocabulary into context-independent tokens that can be prechecked
and context-dependent tokens that need to be interpreted during runtime. We
further build transformations to expand the grammar context and reduce the
number of context-independent tokens. Additionally, we build an efficient
persistent stack to accelerate the context-dependent token checks. Finally, we
co-design the grammar engine with LLM inference engine to overlap grammar
computation with GPU executions. Evaluation results show that XGrammar can
achieve up to 100x speedup over existing solutions. Combined with an LLM
inference engine, it can achieve near-zero overhead for structured generation
in end-to-end LLM serving.
♻ ☆ A Suite for Acoustic Language Model Evaluation
Speech language models have recently demonstrated great potential as
universal speech processing systems. Such models have the ability to model the
rich acoustic information existing in audio signals, beyond spoken content,
such as emotion, background noise, etc. Despite this, evaluation benchmarks
that assess awareness of a wide range of acoustic aspects are lacking. To
help bridge this gap, we introduce SALMon, a novel evaluation suite
encompassing background noise, emotion, speaker identity and room impulse
response. The proposed benchmarks both evaluate the consistency of the
inspected element and how much it matches the spoken text. We follow a
modelling-based approach, measuring whether a model gives correct samples
higher scores than incorrect ones. This approach makes the benchmark fast to
compute even for large models. We evaluated several speech language models on
SALMon, thus highlighting the strengths and weaknesses of each evaluated
method. We make the code and data publicly available at
https://pages.cs.huji.ac.il/adiyoss-lab/salmon/ .
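The modelling-based protocol described above (checking whether a model scores the correct sample higher than the incorrect one) reduces to a pairwise accuracy. A minimal sketch follows, assuming the per-sample scores (for example, log-likelihoods) have already been computed; the function name, the example scores, and the decision to count ties as failures are assumptions rather than details from the paper.

```python
from typing import List, Tuple

def pairwise_accuracy(pairs: List[Tuple[float, float]]) -> float:
    """Fraction of pairs where the correct sample outscores the incorrect one.

    pairs: (score_of_correct_sample, score_of_incorrect_sample).
    Ties are counted as failures (an assumption of this sketch).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for correct, incorrect in pairs if correct > incorrect)
    return wins / len(pairs)

# Hypothetical model scores for four (correct, incorrect) sample pairs.
scores = [(-1.0, -2.0), (-3.0, -2.5), (-0.5, -0.7), (-1.0, -1.0)]
accuracy = pairwise_accuracy(scores)
```

Since only one scoring pass per sample is needed and no generation is involved, this kind of metric stays cheap even for large models.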
♻ ☆ A Novel Word Pair-based Gaussian Sentence Similarity Algorithm For Bengali Extractive Text Summarization
Extractive Text Summarization is the process of selecting the most
representative parts of a larger text without losing any key information.
Recent attempts at extractive text summarization in Bengali either relied on
statistical techniques like TF-IDF or used naive sentence similarity measures
like the word averaging technique. All of these strategies struggle to
express semantic relationships correctly. Here, we propose a novel Word
pair-based Gaussian Sentence Similarity (WGSS) algorithm for calculating the
semantic relation between two sentences. WGSS takes the geometric means of
individual Gaussian similarity values of word embedding vectors to get the
semantic relationship between sentences. It compares two sentences on a
word-to-word basis which rectifies the sentence representation problem faced by
the word averaging method. The summarization process extracts key sentences by
grouping semantically similar sentences into clusters using the Spectral
Clustering algorithm. After clustering, we use TF-IDF ranking to pick the best
sentence from each cluster. The proposed method is validated using four
different datasets, and it outperformed other recent models by 43.2% on average
ROUGE scores (ranging from 2.5% to 95.4%). We also evaluated it on other
low-resource languages, namely Turkish, Marathi, and Hindi, where we find
that the proposed method performs similarly to how it performs on Bengali. In
addition, a new high-quality Bengali dataset is curated which contains 250
articles and a pair of summaries for each of them. We believe this research is
a crucial addition to Bengali Natural Language Processing (NLP) research and it
can easily be extended into other low-resource languages. We made the
implementation of the proposed model and data public at
https://github.com/FMOpee/WGSS.
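The similarity computation described above (a geometric mean of Gaussian similarities between word embedding vectors) can be sketched as follows. Several details are assumptions made for illustration and are not taken from the paper: each word in the first sentence is paired with its most similar word in the second, the Gaussian kernel is exp(-d^2 / (2 sigma^2)) over Euclidean distance, and sigma defaults to 1.0. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]

def gaussian_log_similarity(u: Vector, v: Vector, sigma: float = 1.0) -> float:
    """Log of exp(-||u - v||^2 / (2 * sigma^2)); the log form avoids underflow."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return -squared_distance / (2 * sigma ** 2)

def sentence_similarity(sent_a: List[Vector], sent_b: List[Vector],
                        sigma: float = 1.0) -> float:
    """Geometric mean of each word's best Gaussian similarity (assumed pairing)."""
    if not sent_a or not sent_b:
        raise ValueError("both sentences need at least one word vector")
    best_logs = [max(gaussian_log_similarity(u, v, sigma) for v in sent_b)
                 for u in sent_a]
    # The geometric mean equals exp of the arithmetic mean of the logs.
    return math.exp(sum(best_logs) / len(best_logs))

# Toy 2-dimensional embeddings, purely illustrative.
same = sentence_similarity([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]])
shifted = sentence_similarity([[0.0, 0.0]], [[1.0, 0.0]])
```

Unlike averaging all word vectors into one sentence vector, this word-to-word comparison lets a single badly matched word pull the score down, which is the property the geometric mean provides.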
♻ ☆ DataVisT5: A Pre-trained Language Model for Jointly Understanding Text and Data Visualization
Data visualization (DV) is a fundamental tool for efficiently conveying the
insights behind big data, and it has been widely accepted in today's
data-driven world. Task automation in DV, such as
converting natural language queries to visualizations (i.e., text-to-vis),
generating explanations from visualizations (i.e., vis-to-text), answering
DV-related questions in free form (i.e., FeVisQA), and explicating tabular data
(i.e., table-to-text), is vital for advancing the field. Despite their
potential, the application of pre-trained language models (PLMs) like T5 and
BERT in DV has been limited by high costs and challenges in handling
cross-modal information, leading to few studies on PLMs for DV. We introduce
DataVisT5, a novel PLM tailored for DV that enhances the T5 architecture
through a hybrid objective pre-training and multi-task fine-tuning strategy,
integrating text and DV datasets to effectively interpret cross-modal
semantics. Extensive evaluations on public datasets show that DataVisT5
consistently outperforms current state-of-the-art models on various DV-related
tasks. We anticipate that DataVisT5 will not only inspire further research on
vertical PLMs but also expand the range of applications for PLMs.
♻ ☆ Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang
The impressive capabilities of large language models (LLMs) have sparked
debate over whether these models genuinely generalize to unseen tasks or
predominantly rely on memorizing vast amounts of pretraining data. To explore
this issue, we introduce an extended concept of memorization, distributional
memorization, which measures the correlation between the LLM output
probabilities and the pretraining data frequency. To effectively capture
task-specific pretraining data frequency, we propose a novel task-gram language
model, which is built by counting the co-occurrence of semantically related
$n$-gram pairs from task inputs and outputs in the pretraining corpus. Using
the Pythia models trained on the Pile dataset, we evaluate four distinct tasks:
machine translation, factual question answering, world knowledge understanding,
and math reasoning. Our findings reveal varying levels of memorization, with
the strongest effect observed in factual question answering. Furthermore, while
model performance improves across all tasks as LLM size increases, only factual
question answering shows an increase in memorization, whereas machine
translation and reasoning tasks exhibit greater generalization, producing more
novel outputs. This study demonstrates that memorization plays a larger role in
simpler, knowledge-intensive tasks, while generalization is the key for harder,
reasoning-based tasks, providing a scalable method for analyzing large
pretraining corpora in greater depth. We also show the practical implications
of our analysis through a novel prompt optimization algorithm.
comment: updated 10-page version
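The correlation at the heart of distributional memorization can be illustrated with a rank correlation between corpus counts and model probabilities. This is a sketch under stated assumptions: the choice of Spearman correlation, the function names, and the toy numbers are illustrative and not taken from the paper, and the task-gram counting itself is not shown.

```python
def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-example values: how often the related n-gram pair occurs
# in the pretraining corpus, and the probability the model gives the output.
corpus_counts = [120, 45, 3, 0, 9]
model_probs = [0.61, 0.30, 0.02, 0.01, 0.05]
memorization_score = spearman(corpus_counts, model_probs)
```

A score near 1 means the model's probabilities track corpus frequency closely, while a score near 0 means the outputs are not explained by frequency alone.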
♻ ☆ An iterated learning model of language change that mixes supervised and unsupervised learning
The iterated learning model is an agent model which simulates the
transmission of language from generation to generation. It is used to study
how the language adapts to pressures imposed by transmission. In each
iteration, a language tutor exposes a naïve pupil to a limited training set
of utterances, each pairing a random meaning with the signal that conveys it.
Then the pupil becomes a tutor for a new naïve pupil in the next iteration.
The transmission bottleneck ensures that tutors must generalize beyond the
training set that they experienced. Repeated cycles of learning and
generalization can result in a language that is expressive, compositional and
stable. Previously, the agents in the iterated learning model mapped signals to
meanings using an artificial neural network but relied on an unrealistic and
computationally expensive process of obversion to map meanings to signals.
Here, both maps are neural networks, trained separately through supervised
learning and together through unsupervised learning in the form of an
autoencoder. This avoids the computational burden entailed in obversion and
introduces a mixture of supervised and unsupervised learning as observed during
language learning in children. The new model demonstrates a linear relationship
between the dimensionality of meaning-signal space and effective bottleneck
size and suggests that internal reflection on potential utterances is important
in language learning and evolution.
comment: 33 pages. (v2->v3 - revisions following referees' report; the paper is
now in press with PLoS Complex Systems)
♻ ☆ Agent Skill Acquisition for Large Language Models via CycleQD
Training large language models to acquire specific skills remains a
challenging endeavor. Conventional training approaches often struggle with data
distribution imbalances and inadequacies in objective functions that do not
align well with task-specific performance. To address these challenges, we
introduce CycleQD, a novel approach that leverages the Quality Diversity
framework through a cyclic adaptation of the algorithm, along with a model
merging based crossover and an SVD-based mutation. In CycleQD, each task's
performance metric is alternated as the quality measure while the others serve
as the behavioral characteristics. This cyclic focus on individual tasks allows
for concentrated effort on one task at a time, eliminating the need for data
ratio tuning and simplifying the design of the objective function. Empirical
results from AgentBench indicate that applying CycleQD to LLAMA3-8B-INSTRUCT
based models not only enables them to surpass traditional fine-tuning methods
in coding, operating systems, and database tasks, but also achieves performance
on par with GPT-3.5-TURBO, which potentially contains many more parameters,
across these domains. Crucially, this enhanced performance is achieved while
retaining robust language capabilities, as evidenced by its performance on
widely adopted language benchmark tasks. We highlight the key design choices in
CycleQD, detailing how these contribute to its effectiveness. Furthermore, our
method is general and can be applied to image segmentation models, highlighting
its applicability across different domains.
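The cyclic rotation of roles described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the grid discretisation, and the score values are hypothetical, and the model-merging crossover and SVD-based mutation are not shown.

```python
TASKS = ["coding", "os", "db"]  # the three task families named in the abstract

def cycle_roles(tasks, generation):
    """Pick one task as the quality measure; the rest become behaviour axes."""
    idx = generation % len(tasks)
    return tasks[idx], tasks[:idx] + tasks[idx + 1:]

def cell_of(scores, behavior_tasks, bins=3):
    """Discretise behaviour scores in [0, 1] into a grid cell (assumed scheme)."""
    return tuple(min(int(scores[t] * bins), bins - 1) for t in behavior_tasks)

def try_insert(archive, model_id, scores, quality_task, behavior_tasks):
    """Keep a model only if it beats the incumbent of its cell on quality."""
    cell = cell_of(scores, behavior_tasks)
    incumbent = archive.get(cell)
    if incumbent is None or scores[quality_task] > incumbent[1]:
        archive[cell] = (model_id, scores[quality_task])
        return True
    return False

# Generation 0 optimises coding while os/db define the behaviour grid.
quality, behaviors = cycle_roles(TASKS, 0)
archive = {}
try_insert(archive, "model_a", {"coding": 0.40, "os": 0.20, "db": 0.90}, quality, behaviors)
try_insert(archive, "model_b", {"coding": 0.55, "os": 0.25, "db": 0.95}, quality, behaviors)
```

Because only one task is the quality measure at a time, no mixing ratio between tasks has to be tuned; the next generation simply rotates to the next task.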
♻ ☆ Navigating the Post-API Dilemma | Search Engine Results Pages Present a Biased View of Social Media Data
Recent decisions to discontinue access to social media APIs are having
detrimental effects on Internet research and the field of computational social
science as a whole. This lack of access to data has been dubbed the Post-API
era of Internet research. Fortunately, popular search engines have the means to
crawl, capture, and surface social media data on their Search Engine Results
Pages (SERP) if provided the proper search query, and may provide a solution to
this dilemma. In the present work we ask: does SERP provide a complete and
unbiased sample of social media data? Is SERP a viable alternative to direct
API-access? To answer these questions, we perform a comparative analysis
between (Google) SERP results and nonsampled data from Reddit and Twitter/X. We
find that SERP results are highly biased in favor of popular posts; against
political, pornographic, and vulgar posts; are more positive in their
sentiment; and have large topical gaps. Overall, we conclude that SERP is not a
viable alternative to social media API access.
comment: Proceedings of the ACM Web Conference 2024 (WWW '24)
♻ ☆ Creativity in AI: Progresses and Challenges
Creativity is the ability to produce novel, useful, and surprising ideas, and
has been widely studied as a crucial aspect of human cognition. Machine
creativity on the other hand has been a long-standing challenge. With the rise
of advanced generative AI, there has been renewed interest and debate regarding
AI's creative capabilities. Therefore, it is imperative to revisit the state of
creativity in AI and identify key progresses and remaining challenges. In this
work, we survey leading works studying the creative capabilities of AI systems,
focusing on creative problem-solving, linguistic, artistic, and scientific
creativity. Our review suggests that while the latest AI models are largely
capable of producing linguistically and artistically creative outputs such as
poems, images, and musical pieces, they struggle with tasks that require
creative problem-solving, abstract thinking and compositionality and their
generations suffer from a lack of diversity, originality, long-range
incoherence and hallucinations. We also discuss key questions concerning
copyright and authorship issues with generative models. Furthermore, we
highlight the need for a comprehensive evaluation of creativity that is
process-driven and considers several dimensions of creativity. Finally, we
propose future research directions to improve the creativity of AI outputs,
drawing inspiration from cognitive science and psychology.
comment: minor updates to content + figure
♻ ☆ EnrichEvent: Enriching Social Data with Contextual Information for Emerging Event Extraction
Social platforms have emerged as crucial platforms for disseminating
information and discussing real-life social events, offering researchers an
excellent opportunity to design and implement novel event detection frameworks.
However, most existing approaches only exploit keyword burstiness or network
structures to detect unspecified events. Thus, they often struggle to identify
unknown events, given the challenging nature of events and social data.
Social data, e.g., tweets, is characterized by misspellings, incompleteness,
word sense ambiguity, irregular language, and variation in aspects of
opinions. Moreover, extracting discriminative features and patterns for
evolving events by exploiting the limited structural knowledge is almost
infeasible. To address these challenges, in this paper, we propose a novel
framework, namely EnrichEvent, that leverages the linguistic and contextual
representations of streaming social data. In particular, we leverage contextual
and linguistic knowledge to detect semantically related tweets and enhance the
effectiveness of the event detection approaches. Eventually, our proposed
framework produces cluster chains for each event to show the evolving variation
of the event through time. We conducted extensive experiments to evaluate our
framework, validating its high performance and effectiveness in detecting and
distinguishing unspecified social events.
♻ ☆ Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study
Retrieval-augmented generation (RAG) is increasingly recognized as an
effective approach for mitigating the hallucination of large language models
(LLMs) through the integration of external knowledge. Despite numerous efforts,
most studies focus on a single type of external knowledge source. However, in
real-world applications, most situations involve diverse knowledge from various
sources, yet this area has been less explored. The main dilemma is the lack of
a suitable dataset containing multiple knowledge sources and pre-exploration of
the associated issues. To address these challenges, we standardize a benchmark
dataset that combines structured and unstructured knowledge across diverse and
complementary domains. Based on this dataset, we further develop a
plug-and-play RAG framework, PruningRAG, whose main characteristic is to employ
multi-granularity pruning strategies for optimizing the integration of relevant
information and minimizing misleading context. Building upon the standardized
dataset and PruningRAG, we also report a series of experimental results, as
well as insightful findings. Our dataset and code are publicly
available at https://github.com/USTCAGI/PruningRAG, with the aim of
advancing future research in the RAG community.
comment: 10 pages, 11 figures
♻ ☆ Learning and communication pressures in neural networks: Lessons from emergent communication
Finding and facilitating commonalities between the linguistic behaviors of
large language models and humans could lead to major breakthroughs in our
understanding of the acquisition, processing, and evolution of language.
However, most findings on human-LLM similarity can be attributed to training on
human data. The field of emergent machine-to-machine communication provides an
ideal testbed for discovering which pressures neural agents are naturally
exposed to when learning to communicate in isolation, without any human
language to start with. Here, we review three cases where mismatches between
the emergent linguistic behavior of neural agents and humans were resolved
thanks to introducing theoretically-motivated inductive biases. By contrasting
humans, large language models, and emergent communication agents, we then
identify key pressures at play for language learning and emergence:
communicative success, production effort, learnability, and other
psycho-/sociolinguistic factors. We discuss their implications and relevance to
the field of language evolution and acquisition. By mapping out the necessary
inductive biases that make agents' emergent languages more human-like, we not
only shed light on the underlying principles of human cognition and
communication, but also inform and improve the very use of these models as
valuable scientific tools for studying language learning, processing, use, and
representation more broadly.
comment: camera-ready version, as published in Language Development Research
♻ ☆ Don't Command, Cultivate: An Exploratory Study of System-2 Alignment
The o1 system card identifies the o1 models as the most robust within OpenAI,
with their defining characteristic being the progression from rapid, intuitive
thinking to slower, more deliberate reasoning. This observation motivated us to
investigate the influence of System-2 thinking patterns on model safety. In our
preliminary research, we conducted safety evaluations of the o1 model,
including complex jailbreak attack scenarios using adversarial natural language
prompts and mathematical encoding prompts. Our findings indicate that the o1
model demonstrates relatively improved safety performance; however, it still
exhibits vulnerabilities, particularly against jailbreak attacks employing
mathematical encoding. Through detailed case analysis, we identified specific
patterns in the o1 model's responses. We also explored the alignment of
System-2 safety in open-source models using prompt engineering and supervised
fine-tuning techniques. Experimental results show that some simple methods to
encourage the model to carefully scrutinize user requests are beneficial for
model safety. Additionally, we proposed an implementation plan for process
supervision to enhance safety alignment. The implementation details and
experimental results will be provided in future versions.
comment: Preprint version, more results will be updated
♻ ☆ Leveraging Large Language Models in Human-Robot Interaction: A Critical Analysis of Potential and Pitfalls
The emergence of large language models (LLM) and, consequently, vision
language models (VLM) has ignited new imaginations among robotics researchers.
At this point, the range of applications to which LLM and VLM can be applied in
human-robot interaction (HRI), particularly socially assistive robots (SARs),
is uncharted territory. However, LLM and VLM present unprecedented
opportunities and challenges for SAR integration. We aim to illuminate the
opportunities and challenges when roboticists deploy LLM and VLM in SARs.
First, we conducted a meta-study of more than 250 papers exploring 1) major
robots in HRI research and 2) significant applications of SARs, emphasizing
education, healthcare, and entertainment while addressing 3) societal norms and
issues like trust, bias, and ethics that the robot developers must address.
Then, we identified 4) critical components of a robot that LLM or VLM can
replace while addressing the 5) benefits of integrating LLM into robot designs
and the 6) risks involved. Finally, we outline a pathway for the responsible
and effective adoption of LLM or VLM into SARs, and we close our discussion by
offering caution regarding this deployment.
♻ ☆ MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration EMNLP 2024
Large Language Models (LLMs) have significantly advanced natural language
processing, demonstrating exceptional reasoning, tool usage, and memory
capabilities. As their applications expand into multi-agent environments, there
arises a need for a comprehensive evaluation framework that captures LLMs'
reasoning, planning, collaboration, and other social abilities. This work
introduces a novel competition-based benchmark framework specifically designed
to assess LLMs within multi-agent settings, providing quantitative metrics to
evaluate their judgment, reasoning, deception, self-awareness, cooperation,
coordination, and rationality. We utilize two social deduction games alongside
three game-theory scenarios to create diverse environments. Our framework is
fortified with the probabilistic graphical modeling (PGM) method, enhancing the
LLMs' capabilities in navigating complex social and cognitive dimensions. We
evaluate seven LLMs, quantitatively highlighting a significant capability gap
of over threefold between the strongest, GPT o1, and the weakest, Llama-2-70B.
It also confirms that our PGM enhancement boosts the abilities of all selected
models by an average of 37%. Our data and code can be found here
https://github.com/cathyxl/MAgIC.
comment: EMNLP 2024
♻ ☆ Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation
Learners of a second language (L2) often unconsciously substitute unfamiliar
L2 phonemes with similar phonemes from their native language (L1), even though
native speakers of the L2 perceive these sounds as distinct and
non-interchangeable. This phonemic substitution leads to deviations from the
standard phonological patterns of the L2, creating challenges for learners in
acquiring accurate L2 pronunciation. To address this, we propose
Inter-linguistic Phonetic Composition (IPC), a novel computational method
designed to minimize incorrect phonological transfer by reconstructing L2
phonemes as composite sounds derived from multiple L1 phonemes. Tests with two
automatic speech recognition models demonstrated that when L2 speakers produced
IPC-generated composite sounds, the recognition rate of target L2 phonemes
improved by 20% compared to when their pronunciation was influenced by original
phonological transfer patterns. The improvement was observed within a
relatively short time frame, demonstrating rapid acquisition of the composite
sound.
♻ ☆ On Designing Effective RL Reward at Training Time for LLM Reasoning
Reward models have been increasingly critical for improving the reasoning
capability of LLMs. Existing research has shown that a well-trained reward
model can substantially improve model performances at inference time via
search. However, the potential of reward models during RL training time still
remains largely under-explored. It is currently unclear whether these reward
models can provide additional training signals to enhance the reasoning
capabilities of LLMs in RL training that uses sparse success rewards, which
verify the correctness of solutions. In this work, we evaluate popular reward
models for RL training, including the Outcome-supervised Reward Model (ORM) and
the Process-supervised Reward Model (PRM), and train a collection of LLMs for
math problems using RL by combining these learned rewards with success rewards.
Surprisingly, even though these learned reward models have strong
inference-time performances, they may NOT help or even hurt RL training,
producing worse performances than LLMs trained with the success reward only.
Our analysis reveals that an LLM can receive high rewards from some of these
reward models by repeating correct but unnecessary reasoning steps, leading to
a severe reward hacking issue. Therefore, we introduce two novel reward
refinement techniques, including Clipping and Delta. The key idea is to ensure
the accumulative reward of any reasoning trajectory is upper-bounded to keep a
learned reward model effective without being exploited. We evaluate our
techniques with multiple reward models over a set of 1.5B and 7B LLMs on MATH
and GSM8K benchmarks and demonstrate that with a carefully designed reward
function, RL training without any additional supervised tuning can improve all
the evaluated LLMs, including the state-of-the-art 7B LLM
Qwen2.5-Math-7B-Instruct on MATH and GSM8K benchmarks.
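The idea of keeping the accumulated reward of a trajectory upper-bounded can be illustrated with two small transforms on per-step reward-model scores. This is a sketch of one plausible reading of Clipping and Delta, not the authors' exact formulas: the function names, the threshold `eta`, and the toy scores are assumptions.

```python
def clip_rewards(step_rewards, eta):
    """Clipping: subtract a threshold and cap at 0, so no step earns a bonus.

    The summed reward is then at most 0 however many steps are added.
    """
    return [min(r - eta, 0.0) for r in step_rewards]

def delta_rewards(step_rewards):
    """Delta: reward each step by its difference to the next step's score.

    The differences telescope, so the total (with the last score kept as is)
    equals the score of the first step regardless of trajectory length.
    """
    out = [step_rewards[k] - step_rewards[k + 1]
           for k in range(len(step_rewards) - 1)]
    out.append(step_rewards[-1])
    return out

# Hypothetical process-reward scores: a short solution, and the same solution
# padded with repeated "correct but unnecessary" steps.
short = [0.9, 0.8, 0.7]
padded = [0.9, 0.8, 0.8, 0.8, 0.8, 0.7]

raw_gain = sum(padded) - sum(short)                              # padding pays off
clip_gain = sum(clip_rewards(padded, 0.85)) - sum(clip_rewards(short, 0.85))
delta_gain = sum(delta_rewards(padded)) - sum(delta_rewards(short))
```

On the raw scores, padding raises the return, which is the reward-hacking incentive. After either transform, padding no longer increases the accumulated reward.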
♻ ☆ Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue
Recent advancements in audio generation have been significantly propelled by
the capabilities of Large Language Models (LLMs). The existing research on
audio LLM has primarily focused on enhancing the architecture and scale of
audio language models, as well as leveraging larger datasets, and generally,
acoustic codecs, such as EnCodec, are used for audio tokenization. However,
these codecs were originally designed for audio compression, which may lead to
suboptimal performance in the context of audio LLM. Our research aims to
address the shortcomings of current audio LLM codecs, particularly their
challenges in maintaining semantic integrity in generated audio. For instance,
existing methods like VALL-E, which condition acoustic token generation on text
transcriptions, often suffer from content inaccuracies and elevated word error
rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in
word skipping and errors. To overcome these issues, we propose a
straightforward yet effective approach called X-Codec. X-Codec incorporates
semantic features from a pre-trained semantic encoder before the Residual
Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss
after RVQ. By enhancing the semantic ability of the codec, X-Codec
significantly reduces WER in speech synthesis tasks and extends these benefits
to non-speech applications, including music and sound generation. Our
experiments in text-to-speech, music continuation, and text-to-sound tasks
demonstrate that integrating semantic information substantially improves the
overall performance of language models in audio generation. Our code and demo
are available (Demo: https://x-codec-audio.github.io Code:
https://github.com/zhenye234/xcodec)
♻ ☆ ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains
Large language models (LLMs) have brought significant changes to many aspects
of our lives. However, assessing and ensuring their chronological knowledge
remains challenging. Existing approaches fall short in addressing the temporal
adaptability of knowledge, often relying on a fixed time-point view. To
overcome this, we introduce ChroKnowBench, a benchmark dataset designed to
evaluate chronologically accumulated knowledge across three key aspects:
multiple domains, time dependency, temporal state. Our benchmark distinguishes
between knowledge that evolves (e.g., personal history, scientific discoveries,
amended laws) and knowledge that remains constant (e.g., mathematical truths,
commonsense facts). Building on this benchmark, we present ChroKnowledge
(Chronological Categorization of Knowledge), a novel sampling-based framework
for evaluating LLMs' non-parametric chronological knowledge. Our evaluation led
to the following observations: (1) The ability to elicit temporal knowledge
varies depending on the data format that the model was trained on. (2) LLMs
partially recall knowledge or show a cut-off at temporal boundaries rather than
recalling all aspects of knowledge correctly. Thus, we apply our ChroKnowPrompt,
an in-depth prompting strategy that elicits chronological knowledge by traversing
step-by-step through the surrounding time spans. We observe that it
successfully recalls objects across both open-source and proprietary LLMs,
demonstrating versatility, though it faces challenges with dynamic datasets and
unstructured formats.
♻ ☆ Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants
Beatriz Borges, Negar Foroutan, Deniz Bayazit, Anna Sotnikova, Syrielle Montariol, Tanya Nazaretzky, Mohammadreza Banaei, Alireza Sakhaeirad, Philippe Servant, Seyed Parsa Neshaei, Jibril Frej, Angelika Romanou, Gail Weiss, Sepideh Mamooler, Zeming Chen, Simin Fan, Silin Gao, Mete Ismayilzada, Debjit Paul, Alexandre Schöpfer, Andrej Janchevski, Anja Tiede, Clarence Linden, Emanuele Troiani, Francesco Salvi, Freya Behrens, Giacomo Orsi, Giovanni Piccioli, Hadrien Sevel, Louis Coulon, Manuela Pineros-Rodriguez, Marin Bonnassies, Pierre Hellich, Puck van Gerwen, Sankalp Gambhir, Solal Pirelli, Thomas Blanchard, Timothée Callens, Toni Abi Aoun, Yannick Calvino Alonso, Yuri Cho, Alberto Chiappa, Antonio Sclocchi, Étienne Bruno, Florian Hofhammer, Gabriel Pescia, Geovani Rizk, Leello Dadi, Lucas Stoffl, Manoel Horta Ribeiro, Matthieu Bovel, Yueyang Pan, Aleksandra Radenovic, Alexandre Alahi, Alexander Mathis, Anne-Florence Bitbol, Boi Faltings, Cécile Hébert, Devis Tuia, François Maréchal, George Candea, Giuseppe Carleo, Jean-Cédric Chappelier, Nicolas Flammarion, Jean-Marie Fürbringer, Jean-Philippe Pellet, Karl Aberer, Lenka Zdeborová, Marcel Salathé, Martin Jaggi, Martin Rajman, Mathias Payer, Matthieu Wyart, Michael Gastpar, Michele Ceriotti, Ola Svensson, Olivier Lévêque, Paolo Ienne, Rachid Guerraoui, Robert West, Sanidhya Kashyap, Valerio Piazza, Viesturs Simanis, Viktor Kuncak, Volkan Cevher, Philippe Schwaller, Sacha Friedli, Patrick Jermann, Tanja Käser, Antoine Bosselut
AI assistants are being increasingly used by students enrolled in higher
education institutions. While these tools provide opportunities for improved
teaching and education, they also pose significant challenges for assessment
and learning outcomes. We conceptualize these challenges through the lens of
vulnerability, the potential for university assessments and learning outcomes
to be impacted by student use of generative AI. We investigate the potential
scale of this vulnerability by measuring the degree to which AI assistants can
complete assessment questions in standard university-level STEM courses.
Specifically, we compile a novel dataset of textual assessment questions from
50 courses at EPFL and evaluate whether two AI assistants, GPT-3.5 and GPT-4,
can adequately answer these questions. We use eight prompting strategies to
produce responses and find that GPT-4 answers an average of 65.8% of questions
correctly, and can even produce the correct answer across at least one
prompting strategy for 85.1% of questions. When grouping courses in our dataset
by degree program, these systems already pass non-project assessments of large
numbers of core courses in various degree programs, posing risks to higher
education accreditation that will be amplified as these models improve. Our
results call for revising program-level assessment design in higher education
in light of advances in generative AI.
comment: 20 pages, 8 figures
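The two headline figures above measure different things: accuracy averaged over prompting strategies, and the share of questions solved by at least one strategy. A minimal sketch of both, assuming a boolean question-by-strategy grid (the function name and the toy grid are hypothetical, and the paper's exact aggregation may differ):

```python
def summarize(results):
    """results[q][s] is True when strategy s answered question q correctly.

    Returns (mean accuracy over strategies and questions,
             fraction of questions solved by at least one strategy).
    """
    n_questions = len(results)
    mean_acc = sum(sum(row) / len(row) for row in results) / n_questions
    any_correct = sum(any(row) for row in results) / n_questions
    return mean_acc, any_correct

# Toy grid: 4 questions x 2 prompting strategies.
grid = [
    [True, True],
    [True, False],
    [False, False],
    [False, True],
]
mean_acc, any_correct = summarize(grid)
```

Here `mean_acc` is 0.5 and `any_correct` is 0.75. The second number can never be lower than the first, which is why the at-least-one-strategy figure in the abstract exceeds the average.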
♻ ☆ From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning ICML 2024
Wei Chen, Zhen Huang, Liang Xie, Binbin Lin, Houqiang Li, Le Lu, Xinmei Tian, Deng Cai, Yonggang Zhang, Wenxiao Wang, Xu Shen, Jieping Ye
Large Language Models (LLMs) tend to prioritize adherence to user prompts
over providing veracious responses, leading to the sycophancy issue. When
challenged by users, LLMs tend to admit mistakes and provide inaccurate
responses even if they initially provided the correct answer. Recent works
propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy
issue, but this typically leads to the degeneration of LLMs' general
capability. To address the challenge, we propose a novel supervised pinpoint
tuning (SPT), where the region-of-interest modules are tuned for a given
objective. Specifically, SPT first reveals and verifies a small percentage
(<5%) of the basic modules, which significantly affect a particular behavior of
LLMs, i.e., sycophancy. Subsequently, SPT merely fine-tunes these identified
modules while freezing the rest. To verify the effectiveness of the proposed
SPT, we conduct comprehensive experiments, demonstrating that SPT significantly
mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT
introduces limited or even no side effects on the general capability of LLMs.
Our results shed light on how to precisely, effectively, and efficiently
explain and improve the targeted ability of LLMs.
comment: Accepted by ICML 2024
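The select-then-freeze step can be sketched without any deep learning framework. This is an illustration only: the per-module scores, the module names, and the selection rule are hypothetical stand-ins, and how SPT actually identifies the relevant modules is not reproduced here.

```python
def select_trainable(module_scores, fraction=0.05):
    """Return the names of the highest-scoring modules, at most `fraction`
    of all modules (but at least one)."""
    budget = max(1, int(len(module_scores) * fraction))
    ranked = sorted(module_scores, key=module_scores.get, reverse=True)
    return set(ranked[:budget])

def trainable_flags(module_scores, fraction=0.05):
    """Map every module name to True (fine-tune) or False (freeze)."""
    chosen = select_trainable(module_scores, fraction)
    return {name: name in chosen for name in module_scores}

# Hypothetical scores for 100 modules: a higher score means a larger
# measured effect on the target behaviour.
scores = {f"head_{i}": float(i) for i in range(100)}
flags = trainable_flags(scores)
```

In a real training loop the flags would decide which parameters receive gradients, so the frozen majority of the model is left untouched.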
♻ ★ Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
Yaxi Lu, Shenzhi Yang, Cheng Qian, Guirong Chen, Qinyu Luo, Yesai Wu, Huadong Wang, Xin Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, Zhiyuan Liu, Fangming Liu, Maosong Sun
Agents powered by large language models have shown remarkable abilities in
solving complex tasks. However, most agent systems remain reactive, limiting
their effectiveness in scenarios requiring foresight and autonomous
decision-making. In this paper, we tackle the challenge of developing proactive
agents capable of anticipating and initiating tasks without explicit human
instructions. We propose a novel data-driven approach for this problem.
Firstly, we collect real-world human activities to generate proactive task
predictions. These predictions are then labeled by human annotators as either
accepted or rejected. The labeled data is used to train a reward model that
simulates human judgment and serves as an automatic evaluator of the
proactiveness of LLM agents. Building on this, we develop a comprehensive data
generation pipeline to create a diverse dataset, ProactiveBench, containing
6,790 events. Finally, we demonstrate that fine-tuning models with the proposed
ProactiveBench can significantly elicit the proactiveness of LLM agents.
Experimental results show that our fine-tuned model achieves an F1-Score of
66.47% in proactively offering assistance, outperforming all open-source and
closed-source models. These results highlight the potential of our method in
creating more proactive and effective agent systems, paving the way for future
advancements in human-agent collaboration.
comment: 9 pages, 4 figures
♻ ☆ CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching NeurIPS 2024
Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li
Diffusion models have demonstrated great success in the field of
text-to-image generation. However, alleviating the misalignment between the
text prompts and images is still challenging. The root reason behind the
misalignment has not been extensively investigated. We observe that the
misalignment is caused by inadequate token attention activation. We further
attribute this phenomenon to the diffusion model's insufficient condition
utilization, which is caused by its training paradigm. To address the issue, we
propose CoMat, an end-to-end diffusion model fine-tuning strategy with an
image-to-text concept matching mechanism. We leverage an image captioning model
to measure image-to-text alignment and guide the diffusion model to revisit
ignored tokens. A novel attribute concentration module is also proposed to
address the attribute binding problem. Without any image or human preference
data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL.
Extensive experiments show that CoMat-SDXL significantly outperforms the
baseline model SDXL in two text-to-image alignment benchmarks and achieves
state-of-the-art performance.
comment: NeurIPS 2024
♻ ☆ Empowering ChatGPT-Like Large-Scale Language Models with Local Knowledge Base for Industrial Prognostics and Health Management
Prognostics and health management (PHM) is essential for industrial operation
and maintenance, focusing on predicting, diagnosing, and managing the health
status of industrial systems. The emergence of the ChatGPT-Like large-scale
language model (LLM) has begun to lead a new round of innovation in the AI
field. It has substantially raised the level of intelligence in various fields.
It is therefore also expected to further change the application paradigm in
industrial PHM and make PHM more intelligent. Although ChatGPT-Like
LLMs have rich knowledge reserves and powerful language understanding and
generation capabilities, they lack domain-specific expertise, significantly
limiting their practicability in PHM applications. To this end, this study
explores the ChatGPT-Like LLM empowered by the local knowledge base (LKB) in
industrial PHM to solve the above limitations. In addition, we introduce the
method and steps of combining the LKB with LLMs, including LKB preparation, LKB
vectorization, prompt engineering, etc. Experimental analysis of real cases
shows that combining the LKB with ChatGPT-Like LLM can significantly improve
its performance and make ChatGPT-Like LLMs more accurate, relevant, and able to
provide more insightful information. This can promote the development of
ChatGPT-Like LLMs in industrial PHM and promote their efficiency and quality.
♻ ☆ Isotropy Matters: Soft-ZCA Whitening of Embeddings for Semantic Code Search
Low isotropy in an embedding space impairs performance on tasks involving
semantic inference. Our study investigates the impact of isotropy on semantic
code search performance and explores post-processing techniques to mitigate
this issue. We analyze various code language models, examine isotropy in their
embedding spaces, and assess its influence on search effectiveness. We propose a
modified ZCA whitening technique to control isotropy levels in embeddings. Our
results demonstrate that Soft-ZCA whitening improves the performance of
pre-trained code language models and can complement contrastive fine-tuning.
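A minimal sketch of regularized ZCA whitening follows, to illustrate how a single parameter can control isotropy. It assumes the "soft" variant adds a constant `eps` to the covariance eigenvalues before inverting them. That formulation, the function name `soft_zca_whiten`, and the default `eps` are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np


def soft_zca_whiten(embeddings: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Whiten row-vector embeddings with an eigenvalue regularizer (assumed form).

    eps close to 0 approaches standard ZCA whitening (identity covariance);
    larger eps whitens less aggressively and keeps more of the original spread.
    """
    # Center the embeddings so the covariance is computed around the mean.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]

    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values

    # ZCA transform: W = U diag(1 / sqrt(lambda + eps)) U^T
    scale = 1.0 / np.sqrt(eigvals + eps)
    whitening = (eigvecs * scale) @ eigvecs.T
    return centered @ whitening
```

In a retrieval setting, the same mean and whitening matrix would be estimated once and applied to both code and query embeddings before computing similarities.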
♻ ★ Simulating Classroom Education with LLM-Empowered Agents
Zheyuan Zhang, Daniel Zhang-Li, Jifan Yu, Linlu Gong, Jinchang Zhou, Zhanxin Hao, Jianxiao Jiang, Jie Cao, Huiqin Liu, Zhiyuan Liu, Lei Hou, Juanzi Li
Large language models (LLMs) have been applied across various intelligent
educational tasks to assist teaching. While preliminary studies have focused on
task-specific, independent LLM-empowered agents, the potential of LLMs within a
multi-agent collaborative framework for classroom simulation with real user
participation remains unexplored. In this work, we propose SimClass, a
multi-agent classroom simulation teaching framework. We recognize
representative class roles and introduce a novel class control mechanism for
automatic classroom teaching, and conduct user experiments in two real-world
courses. Using the Flanders Interactive Analysis System and Community of
Inquiry theoretical frameworks from educational analysis, we demonstrate that
LLMs can simulate a dynamic learning environment for users with active
teacher-student and student-student interactions. We also observe group
behaviors among agents in SimClass, where agents collaborate to create
enlivening interactions in classrooms to improve the user learning process. We hope
this work pioneers the application of LLM-empowered multi-agent systems in
virtual classroom teaching.
♻ ★ MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng Li
The advent of Large Language Models (LLMs) has paved the way for AI search
engines, e.g., SearchGPT, showcasing a new paradigm in human-internet
interaction. However, most current AI search engines are limited to text-only
settings, neglecting the multimodal user queries and the text-image interleaved
nature of website information. Recently, Large Multimodal Models (LMMs) have
made impressive strides. Yet, whether they can function as AI search engines
remains under-explored, leaving the potential of LMMs in multimodal search an
open question. To this end, we first design a delicate pipeline,
MMSearch-Engine, to empower any LMM with multimodal search capabilities. On
top of this, we introduce MMSearch, a comprehensive evaluation benchmark to
assess the multimodal search performance of LMMs. The curated dataset contains
300 manually collected instances spanning 14 subfields, which involve no
overlap with the current LMMs' training data, ensuring that the correct answer
can only be obtained through searching. By using MMSearch-Engine, the LMMs are
evaluated by performing three individual tasks (requery, rerank, and
summarization), and one challenging end-to-end task with a complete searching
process. We conduct extensive experiments on closed-source and open-source
LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best
results, which surpasses the commercial product, Perplexity Pro, in the
end-to-end task, demonstrating the effectiveness of our proposed pipeline. We
further present an error analysis revealing that current LMMs still struggle to
fully grasp multimodal search tasks, and conduct an ablation study indicating the
potential of scaling test-time computation for AI search engines. We hope
MMSearch may provide unique insights to guide the future development of
multimodal AI search engines. Project Page: https://mmsearch.github.io
comment: Project Page: https://mmsearch.github.io
♻ ☆ ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework
Hengyuan Zhang, Chenming Shang, Sizhe Wang, Dongdong Zhang, Renliang Sun, Yiyao Yu, Yujiu Yang, Furu Wei
Although fine-tuning Large Language Models (LLMs) with multilingual data can
rapidly enhance the multilingual capabilities of LLMs, they still exhibit a
performance gap between the dominant language (e.g., English) and non-dominant
ones due to the imbalance of training data across languages. To further enhance
the performance of non-dominant languages, we propose ShifCon, a Shift-based
Contrastive framework that aligns the internal forward process of other
languages toward that of the dominant one. Specifically, it shifts the
representations of non-dominant languages into the dominant language subspace,
allowing them to access relatively rich information encoded in the model
parameters. The enriched representations are then shifted back into their
original language subspace before generation. Moreover, we introduce a subspace
distance metric to pinpoint the optimal layer area for shifting representations
and employ multilingual contrastive learning to further enhance the alignment
of representations within this area. Experiments demonstrate that our ShifCon
framework significantly enhances the performance of non-dominant languages,
particularly for low-resource ones. Further analysis offers extra insights to
verify the effectiveness of ShifCon and propel future research.
comment: 23 pages, 11 figures
♻ ☆ EFSA: Towards Event-Level Financial Sentiment Analysis
In this paper, we extend financial sentiment analysis (FSA) to the event level,
since events usually serve as the subject of the sentiment in financial text.
Though extracting events from the financial text may be conducive to accurate
sentiment predictions, it poses specialized challenges due to the length and
discontinuity of events in a financial text. To this end, we reconceptualize
the event extraction as a classification task by designing a categorization
comprising coarse-grained and fine-grained event categories. Under this
setting, we formulate the Event-Level Financial
Sentiment Analysis (EFSA for short) task that
outputs quintuples consisting of (company, industry, coarse-grained event,
fine-grained event, sentiment) from financial text. A large-scale Chinese
dataset containing 12,160 news articles and 13,725 quintuples is publicized
as a brand new testbed for our task. A four-hop Chain-of-Thought LLM-based
approach is devised for this task. Systematic investigations are conducted
on our dataset, and the empirical results provide benchmark scores for
existing methods and show that our proposed method reaches the current
state-of-the-art. Our dataset and framework implementation are available at
https://anonymous.4open.science/r/EFSA-645E
♻ ☆ Playing Language Game with LLMs Leads to Jailbreaking
The advent of large language models (LLMs) has spurred the development of
numerous jailbreak techniques aimed at circumventing their security defenses
against malicious attacks. An effective jailbreak approach is to identify a
domain where safety generalization fails, a phenomenon known as mismatched
generalization. In this paper, we introduce two novel jailbreak methods based
on mismatched generalization: natural language games and custom language games,
both of which effectively bypass the safety mechanisms of LLMs, with various
kinds and different variants, making them hard to defend and leading to high
attack rates. Natural language games involve the use of synthetic linguistic
constructs and the actions intertwined with these constructs, such as the Ubbi
Dubbi language. Building on this phenomenon, we propose the custom language
games method: by engaging with LLMs using a variety of custom rules, we
successfully execute jailbreak attacks across multiple LLM platforms. Extensive
experiments demonstrate the effectiveness of our methods, achieving success
rates of 93% on GPT-4o, 89% on GPT-4o-mini and 83% on Claude-3.5-Sonnet.
Furthermore, to investigate the generalizability of safety alignments, we
fine-tuned Llama-3.1-70B with the custom language games to achieve safety
alignment within our datasets and found that when interacting through other
language games, the fine-tuned models still failed to identify harmful content.
This finding indicates that the safety alignment knowledge embedded in LLMs
fails to generalize across different linguistic formats, thus opening new
avenues for future research in this area.
♻ ☆ IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
In the realm of large language models (LLMs), the ability of models to
accurately follow instructions is paramount as more agents and applications
leverage LLMs for construction, where the complexity of instructions is
rapidly increasing. However, on the one hand, there is only a certain amount of
complex instruction evaluation data; on the other hand, there are no dedicated
algorithms to improve the ability to follow complex instructions. To this end,
this paper introduces TRACE, a benchmark for improving and evaluating the
complex instruction-following ability, which consists of 120K training data and
1K evaluation data. Furthermore, we propose the IOPO (Input-Output Preference
Optimization) alignment method, which takes both input and output preference
pairs into consideration, where LLMs not only rapidly align with response
preferences but also meticulously explore the instruction preferences.
Extensive experiments on both in-domain and out-of-domain datasets confirm the
effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data
and 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.
comment: Work in progress
♻ ☆ A First Look at GPT Apps: Landscape and Vulnerability
Following OpenAI's introduction of GPTs, a surge in GPT apps has led to the
launch of dedicated LLM app stores. Nevertheless, given its recent debut, there is a
lack of sufficient understanding of this new ecosystem. To fill this gap, this
paper presents a first comprehensive longitudinal (5-month) study of the
evolution, landscape, and vulnerability of the emerging LLM app ecosystem,
focusing on two GPT app stores: GPTStore.AI and the official
OpenAI GPT Store. Specifically, we develop two automated tools and a
TriLevel configuration extraction strategy to efficiently gather metadata (i.e.,
names, creators, descriptions, etc.) and user feedback for all GPT apps across
these two stores, as well as configurations (i.e., system prompts, knowledge
files, and APIs) for the top 10,000 popular apps. Our extensive analysis
reveals: (1) the user enthusiasm for GPT apps consistently rises, whereas
creator interest plateaus within three months of GPTs' launch; (2) nearly 90% of
system prompts can be easily accessed due to widespread failure to secure GPT
app configurations, leading to considerable plagiarism and duplication among
apps. Our findings highlight the necessity of enhancing the LLM app ecosystem
by the app stores, creators, and users.
♻ ☆ Self-Training Meets Consistency: Improving LLMs' Reasoning With Consistency-Driven Rationale Evaluation
Self-training approaches for large language models (LLMs) improve reasoning
abilities by training the models on their self-generated rationales. Previous
approaches have labeled rationales that produce correct answers for a given
question as appropriate for training. However, a single measure risks
misjudging rationale quality, leading the models to learn flawed reasoning
patterns. To address this issue, we propose CREST (Consistency-driven Rationale
Evaluation for Self-Training), a self-training framework that further evaluates
each rationale through follow-up questions and leverages this evaluation to
guide its training. Specifically, we introduce two methods: (1) filtering out
rationales that frequently result in incorrect answers on follow-up questions
and (2) preference learning based on mixed preferences from rationale
evaluation results of both original and follow-up questions. Experiments on
three question-answering datasets using open LLMs show that CREST not only
improves the logical robustness and correctness of rationales but also improves
reasoning abilities compared to previous self-training approaches.
comment: Under review
♻ ☆ Towards More Accurate US Presidential Election via Multi-step Reasoning with Large Language Models
Can Large Language Models (LLMs) accurately predict election outcomes? While
LLMs have demonstrated impressive performance in various domains, including
healthcare, legal analysis, and creative tasks, their ability to forecast
elections remains unknown. Election prediction poses unique challenges, such as
limited voter-level data, rapidly changing political landscapes, and the need
to model complex human behavior. To address these challenges, we introduce a
multi-step reasoning framework designed for political analysis. Our approach is
validated on real-world data from the American National Election Studies (ANES)
2016 and 2020, as well as synthetic personas generated by the leading machine
learning framework, offering scalable datasets for voter behavior modeling. To
capture temporal dynamics, we incorporate candidates' policy positions and
biographical details, ensuring that the model adapts to evolving political
contexts. Drawing on Chain of Thought prompting, our multi-step reasoning
pipeline systematically integrates demographic, ideological, and time-dependent
factors, enhancing the model's predictive power.
comment: This research is ongoing work. Xiyang Hu and Yue Zhao are the
corresponding authors
♻ ☆ A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Since the release of ChatGPT and GPT-4, large language models (LLMs) and
multimodal large language models (MLLMs) have attracted widespread attention
for their exceptional capabilities in understanding, reasoning, and generation,
introducing transformative paradigms for integrating artificial intelligence
into medicine. This survey provides a comprehensive overview of the
development, principles, application scenarios, challenges, and future
directions of LLMs and MLLMs in medicine. Specifically, it begins by examining
the paradigm shift, tracing the transition from traditional models to LLMs and
MLLMs, and highlighting the unique advantages of these LLMs and MLLMs in
medical applications. Next, the survey reviews existing medical LLMs and MLLMs,
providing detailed guidance on their construction and evaluation in a clear and
systematic manner. Subsequently, to underscore the substantial value of LLMs
and MLLMs in healthcare, the survey explores five promising applications in the
field. Finally, the survey addresses the challenges confronting medical LLMs
and MLLMs and proposes practical strategies and future directions for their
integration into medicine. In summary, this survey offers a comprehensive
analysis of the technical methodologies and practical clinical applications of
medical LLMs and MLLMs, with the goal of bridging the gap between these
advanced technologies and clinical practice, thereby fostering the evolution of
the next generation of intelligent healthcare systems.
♻ ☆ OpenMU: Your Swiss Army Knife for Music Understanding
Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji
We present OpenMU-Bench, a large-scale benchmark suite for addressing the
data scarcity issue in training multimodal language models to understand music.
To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new
annotations. OpenMU-Bench also broadens the scope of music understanding by
including lyrics understanding and music tool usage. Using OpenMU-Bench, we
trained our music understanding model, OpenMU, with extensive ablations,
demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both
OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music
understanding and to enhance creative music production efficiency.
comment: Resources: https://github.com/sony/openmu
♻ ☆ How language models extrapolate outside the training data: A case study in Textualized Gridworld
Language models' ability to extrapolate learned behaviors to novel, more
complex environments beyond their training scope is largely unknown. This study
introduces a path planning task in a textualized Gridworld to probe language
models' extrapolation capabilities. We show that conventional approaches,
including next token prediction and Chain of Thought (CoT) finetuning, fail to
extrapolate in larger, unseen environments. Inspired by human cognition and
dual process theory, we propose cognitive maps for path planning, a novel CoT
framework that simulates humanlike mental representations. Our experiments show
that cognitive maps not only enhance extrapolation to unseen environments but
also exhibit humanlike characteristics through structured mental simulation and
rapid adaptation. Our finding that these cognitive maps require specialized
training schemes and cannot be induced through simple prompting opens up
important questions about developing general-purpose cognitive maps in language
models. Our comparison with exploration-based methods further illuminates the
complementary strengths of offline planning and online exploration.
♻ ☆ Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
Efficient inference in large language models (LLMs) has become a critical
focus as their scale and complexity grow. Traditional autoregressive decoding,
while effective, suffers from computational inefficiencies due to its
sequential token generation process. Speculative decoding addresses this
bottleneck by introducing a two-stage framework: drafting and verification. A
smaller, efficient model generates a preliminary draft, which is then refined
by a larger, more sophisticated model. This paper provides a comprehensive
survey of speculative decoding methods, categorizing them into draft-centric
and model-centric approaches. We discuss key ideas associated with each method,
highlighting their potential for scaling LLM inference. This survey aims to
guide future research in optimizing speculative decoding and its integration
into real-world LLM applications.
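The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration of the greedy-verification variant only: the target model accepts draft tokens while they match its own greedy choice and substitutes its own token at the first mismatch. The function name `speculative_decode`, the callable model interface, and the block size `k` are assumptions made for the sketch and do not come from any specific method in the survey.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding with a cheap draft and an expensive target."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # Drafting: the small model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # Verification: a real system would score all proposed positions in one
        # batched forward pass of the target; here the target is called per token.
        for tok in proposal:
            expected = target(tokens)
            if tok == expected:
                tokens.append(tok)       # draft token accepted
            else:
                tokens.append(expected)  # first mismatch: keep the target's token
                break                    # discard the rest of the draft
    return tokens
```

Under greedy verification the output matches what the target model would produce alone, so any speed-up comes only from accepting several draft tokens per verification step.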
♻ ☆ AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset
Tobi Olatunji, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, Folafunmi Omofoye, Foutse Yuehgoh, Timothy Faniran, Bonaventure F. P. Dossou, Moshood Yekini, Jonas Kemp, Katherine Heller, Jude Chidubem Omeke, Chidi Asuzu MD, Naome A. Etori, Aimérou Ndiaye, Ifeoma Okoh, Evans Doe Ocansey, Wendy Kinara, Michael Best, Irfan Essa, Stephen Edward Moore, Chris Fourie, Mercy Nyamewaa Asiedu
Recent advancements in large language model (LLM) performance on medical
multiple choice question (MCQ) benchmarks have stimulated interest from
healthcare providers and patients globally. Particularly in low- and
middle-income countries (LMICs) facing acute physician shortages and lack of
specialists, LLMs offer a potentially scalable pathway to enhance healthcare
access and reduce costs. However, their effectiveness in the Global South,
especially across the African continent, remains to be established. In this
work, we introduce AfriMed-QA, the first large-scale Pan-African English
multi-specialty medical Question-Answering (QA) dataset, comprising 15,000 questions (open
and closed-ended) sourced from over 60 medical schools across 16 countries,
covering 32 medical specialties. We further evaluate 30 LLMs across multiple
axes including correctness and demographic bias. Our findings show significant
performance variation across specialties and geographies; MCQ performance
clearly lags USMLE (MedQA). We find that biomedical LLMs underperform general
models and smaller edge-friendly LLMs struggle to achieve a passing score.
Interestingly, human evaluations show a consistent consumer preference for LLM
answers and explanations when compared with clinician answers.
♻ ☆ A Method for Building Large Language Models with Predefined KV Cache Capacity
This paper introduces a novel approach, the Bounded-Cache Transformer (BCT),
for building large language models with a predefined Key-Value (KV) cache
capacity. The BCT addresses the excessive memory consumption issue in
traditional KV caches by implementing a bounded-length KV cache, which is
particularly suitable for the attention layers in decoder-only Transformer
architectures. By dynamically updating the key-value vector sequences, the BCT
achieves efficient inference within limited cache capacity, significantly
reducing memory usage while maintaining model performance and system
throughput. Experimental results demonstrate that the BCT significantly reduces
memory usage while maintaining the model's inference quality, offering a new
solution for efficient inference in large language models.
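To make the idea of a fixed-capacity KV cache concrete, here is a small sketch. The abstract does not specify how the BCT updates its key-value sequences, so this example swaps in the simplest bounded policy, first-in-first-out eviction (a sliding window). The class name `BoundedKVCache` and its methods are hypothetical.

```python
import math
from collections import deque
from typing import Deque, List


class BoundedKVCache:
    """Fixed-capacity KV cache with FIFO eviction (an assumed, simplified policy)."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry once full,
        # so memory use is bounded regardless of sequence length.
        self.keys: Deque[List[float]] = deque(maxlen=capacity)
        self.values: Deque[List[float]] = deque(maxlen=capacity)

    def append(self, key: List[float], value: List[float]) -> None:
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query: List[float]) -> List[float]:
        """Scaled dot-product attention of one query over the cached entries."""
        if not self.keys:
            raise ValueError("cache is empty")
        scale = 1.0 / math.sqrt(len(query))
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in self.keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [
            sum(w * v[i] for w, v in zip(weights, self.values)) / total
            for i in range(dim)
        ]
```

With this policy the per-step attention cost and the memory footprint depend only on `capacity`, at the price of losing access to tokens older than the window.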
♻ ☆ Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
We reveal that low-bit quantization favors undertrained large language models
(LLMs) by observing that models with larger sizes or fewer training tokens
experience less quantization-induced degradation (QiD) when applying low-bit
quantization, whereas smaller models with extensive training tokens suffer
significant QiD. To gain deeper insights into this trend, we study over 1500
quantized LLM checkpoints of various sizes and at different training levels
(undertrained or fully trained) in a controlled setting, deriving scaling laws
for understanding the relationship between QiD and factors such as the number
of training tokens, model size and bit width.
With the derived scaling laws, we propose a novel perspective that we can use
QiD to measure an LLM's training levels and determine the number of training
tokens required for fully training LLMs of various sizes. Moreover, we use the
scaling laws to predict the quantization performance of different-sized LLMs
trained with 100 trillion tokens. Our projection shows that the low-bit
quantization performance of future models, which are expected to be trained
with over 100 trillion tokens, may NOT be desirable. This poses a potential
challenge for low-bit quantization in the future and highlights the need for
awareness of a model's training level when evaluating low-bit quantization
research. To facilitate future research on this problem, we release all the
1500+ quantized checkpoints used in this work at
https://huggingface.co/Xu-Ouyang.
comment: Work in Progress
♻ ☆ CIF-T: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition ICASSP 2024
RNN-T models, which rely on the RNN-T loss to achieve length alignment between
input audio and target sequence, are widely used in ASR. However, the
implementation complexity and the alignment-based optimization target of RNN-T
loss lead to computational redundancy and a reduced role for the predictor network,
respectively. In this paper, we propose a novel model named CIF-Transducer
(CIF-T) which incorporates the Continuous Integrate-and-Fire (CIF) mechanism
with the RNN-T model to achieve efficient alignment. In this way, the RNN-T
loss is abandoned, thus bringing a computational reduction and allowing the
predictor network a more significant role. We also introduce Funnel-CIF,
Context Blocks, Unified Gating and Bilinear Pooling joint network, and
an auxiliary training strategy to further improve performance. Experiments on the
178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that CIF-T achieves
state-of-the-art results with lower computational overhead compared to RNN-T
models.
comment: Accepted by ICASSP 2024
♻ ☆ BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
While large language models (LLMs) exhibit remarkable capabilities across
various tasks, they encounter potential security risks such as jailbreak
attacks, which exploit vulnerabilities to bypass security measures and generate
harmful outputs. Existing jailbreak strategies mainly focus on maximizing
attack success rate (ASR), frequently neglecting other critical factors,
including the relevance of the jailbreak response to the query and the level of
stealthiness. This narrow focus on single objectives can result in ineffective
attacks that either lack contextual relevance or are easily recognizable. In
this work, we introduce BlackDAN, an innovative black-box attack framework with
multi-objective optimization, aiming to generate high-quality prompts that
effectively facilitate jailbreaking while maintaining contextual relevance and
minimizing detectability. BlackDAN leverages Multiobjective Evolutionary
Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks
across multiple objectives including ASR, stealthiness, and semantic relevance.
By integrating mechanisms like mutation, crossover, and Pareto-dominance,
BlackDAN provides a transparent and interpretable process for generating
jailbreaks. Furthermore, the framework allows customization based on user
preferences, enabling the selection of prompts that balance harmfulness,
relevance, and other factors. Experimental results demonstrate that BlackDAN
outperforms traditional single-objective methods, yielding higher success rates
and improved robustness across various LLMs and multimodal LLMs, while ensuring
jailbreak responses are both relevant and less detectable.
♻ ☆ Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
Value-based reinforcement learning (RL) can in principle learn effective
policies for a wide range of multi-turn problems, from games to dialogue to
robotic control, including via offline RL from static, previously collected
datasets. However, despite the widespread use of policy gradient methods to
train large language models for single-turn tasks (e.g., question answering),
value-based methods for multi-turn RL in an off-policy or offline setting have
proven particularly challenging to scale to the setting of large language
models. This setting requires effectively leveraging pretraining, scaling to
large architectures with billions of parameters, and training on large
datasets, all of which represent major challenges for current value-based RL
methods. In this work, we propose a novel offline RL algorithm that addresses
these drawbacks, casting Q-learning as a modified supervised fine-tuning (SFT)
problem where the probabilities of tokens directly translate to Q-values. In
this way we obtain an algorithm that smoothly transitions from maximizing the
likelihood of the data during pretraining to learning a near-optimal Q-function
during finetuning. Our algorithm has strong theoretical foundations, enjoying
performance bounds similar to state-of-the-art Q-learning methods, while in
practice utilizing an objective that closely resembles SFT. Because of this,
our approach can enjoy the full benefits of the pretraining of language models,
without the need to reinitialize any weights before RL finetuning, and without
the need to initialize new heads for predicting values or advantages.
Empirically, we evaluate our method on both pretrained LLMs and VLMs, on a
variety of tasks including both natural language dialogue and robotic
manipulation and navigation from images.
comment: 17 pages, 4 figures