[论文翻译]用于精神障碍检测的少样本学习:一种结合医学知识注入的连续多提示工程方法


原文地址:https://arxiv.org/abs/2401.12988


ABSTRACT

摘要

This study harnesses state-of-the-art AI technology for detecting mental disorders through user-generated textual content. Existing studies typically rely on fully supervised machine learning, which presents challenges such as the labor-intensive manual process of annotating extensive training data for each research problem and the need to design specialized deep learning architectures for each task. We propose a novel method to address these challenges by leveraging large language models and continuous multi-prompt engineering, which offers two key advantages: (1) developing personalized prompts that capture each user's unique characteristics and (2) integrating structured medical knowledge into prompts to provide context for disease detection and facilitate predictive modeling. We evaluate our method using three widely prevalent mental disorders as research cases. Our method significantly outperforms existing methods, including feature engineering, architecture engineering, and discrete prompt engineering. Meanwhile, our approach demonstrates success in few-shot learning, i.e., requiring only a minimal number of training examples. Moreover, our method can be generalized to other rare mental disorder detection tasks with few positive labels. In addition to its technical contributions, our method has the potential to enhance the well-being of individuals with mental disorders and offer a cost-effective, accessible alternative for stakeholders beyond traditional mental disorder screening methods.

本研究利用前沿AI技术,通过用户生成文本内容检测精神障碍。现有研究通常依赖全监督机器学习,存在两大挑战:(1) 为每个研究问题标注大量训练数据需要耗费大量人工;(2) 需为每项任务设计专用深度学习架构。我们提出创新方法,通过大语言模型和连续多提示工程解决这些问题,其优势在于:(1) 开发能捕捉用户独特性征的个性化提示;(2) 将结构化医学知识融入提示,为疾病检测提供背景并辅助预测建模。我们选取三种高发精神障碍作为研究案例进行评估,本方法在特征工程、架构工程和离散提示工程等现有方法中显著胜出。同时,该方法在少样本学习场景中表现优异,仅需极少量训练样本即可实现。此外,本方法可推广至其他阳性标签稀缺的罕见精神障碍检测任务。除技术贡献外,该方法有望提升精神障碍患者福祉,并为利益相关方提供比传统筛查更经济、可及的替代方案。

Keywords: prompt engineering; large language model; machine learning; computational design science; mental health management

关键词:提示工程 (Prompt Engineering);大语言模型 (Large Language Model);机器学习 (Machine Learning);计算设计科学 (Computational Design Science);心理健康管理 (Mental Health Management)

1. INTRODUCTION

1. 引言

Mental disorders are a major global health burden. There are over 150 recognized core mental health conditions (APA 2022), and approximately 1 in 8 people worldwide (970 million) live with a mental disorder (WHO 2022a). Recent studies indicate a significant 25% increase in anxiety and depression, two key mental disorders, since the onset of the COVID-19 pandemic in 2020 (WHO 2022a). More importantly, despite considerable research efforts, mental disorders are difficult to detect for reasons such as the lack of a reliable laboratory test for diagnosis and insufficient behavioral data in electronic health records (EHR) for effective detection (WHO 2022b). Moreover, mental disorders continue to be underdiagnosed due to factors such as lack of awareness of their symptoms, myths and misunderstandings, stigma leading individuals to hide their issues and delay seeking help, and barriers to healthcare access (Patel et al. 2018). Using user-generated content as a supplement to existing mental disorder screening methods is considered a promising approach to combating mental disorders and has far-reaching health and societal implications (Guntuku et al. 2017, Wongkoblap et al. 2017). Information systems (IS) research closely follows this direction and emphasizes the use of user-generated content for mental disorder detection (Chau et al. 2020, D. Zhang et al. 2024, W. Zhang et al. 2024).

精神疾病是全球主要的健康负担。目前公认的核心心理健康问题超过150种(APA 2022),全球约八分之一人口(9.7亿)患有精神疾病(WHO 2022a)。最新研究表明,自2020年新冠疫情爆发以来,焦虑症和抑郁症这两种关键精神障碍的发病率显著上升了25%(WHO 2022a)。更重要的是,尽管研究投入巨大,精神疾病仍难以检测——原因包括缺乏可靠的实验室诊断测试,以及电子健康档案(EHR)中行为数据不足导致有效检测困难(WHO 2022b)。此外,由于症状认知不足、误解与偏见使患者隐瞒问题并延误求助,以及医疗资源获取障碍等因素(Patel et al. 2018),精神疾病仍存在普遍漏诊现象。利用用户生成内容作为现有精神疾病筛查方法的补充手段,被视为对抗精神疾病的有效途径,具有深远的健康与社会意义(Guntuku et al. 2017, Wongkoblap et al. 2017)。信息系统(IS)研究密切关注这一方向,强调利用用户生成内容进行精神疾病检测(Chau et al. 2020, D. Zhang et al. 2024, W. Zhang et al. 2024)。

Previous work in data-driven healthcare studies demonstrates that patients with chronic diseases, including mental disorders, consistently share their symptoms, life events associated with their conditions, and details of their treatments through user-generated textual content online (Abbasi et al. 2019, Chau et al. 2020, Zhang and Ram 2020). Among the various healthcare studies that leverage user-generated content, research on mental disorders is particularly well-suited for utilizing this type of data because the current diagnosis of mental disorders relies on self-reported symptoms and life events in natural languages (APA 2022); and decades of scientific research have shown that user-generated text can reveal people's psychological states (e.g., emotions, moods) as well as behaviors and activities often associated with mental disorders (Tausczik and Pennebaker 2010, W. Zhang et al. 2024). Hence, using AI to analyze user-generated content holds great potential for enhancing the detection and management of mental disorders by extracting valuable insights from individuals' firsthand experiences (Bardhan et al. 2020). For instance, online platforms can leverage mental disorder detection techniques to develop new services featuring personalized recommendations for users (e.g., encouraging individuals to seek help and treatments, promoting educational content and tools, offering treatment options, and fostering social support). Public administration can employ these techniques to strategically allocate resources to areas with high incidence rates, thereby enhancing the overall effectiveness of chronic disease programs. Policymakers can monitor large-scale user-generated textual content and facilitate the creation of evidence-based policies tailored to the specific needs of different patient cohorts.

先前在数据驱动的医疗健康研究中发现,慢性疾病(包括精神障碍)患者会持续通过在线用户生成文本内容分享症状、病情相关生活事件及治疗细节(Abbasi等2019,Chau等2020,Zhang和Ram2020)。在利用用户生成内容的各类医疗研究中,精神障碍研究特别适合采用此类数据,因为当前精神障碍诊断依赖于自然语言描述的自我报告症状和生活事件(APA2022);数十年科学研究表明,用户生成文本能揭示人们的心理状态(如情绪、心境)以及常与精神障碍相关的行为活动(Tausczik和Pennebaker2010,W.Zhang等2024)。因此,运用AI分析用户生成内容,通过从个体第一手经验中提取有价值洞见,对提升精神障碍检测与管理具有巨大潜力(Bardhan等2020)。例如,在线平台可运用精神障碍检测技术开发新服务,为用户提供个性化推荐(如鼓励寻求帮助和治疗、推送教育内容与工具、提供治疗方案、促进社会支持);公共管理部门可借此策略性地向高发地区分配资源,从而提升慢性病防治整体效能;政策制定者能通过监测大规模用户生成文本,推动制定基于证据、针对不同患者群体需求的精准政策。

However, existing methods for detecting mental disorders through user-generated content show limitations in real-world applicability and generalizability due to their heavy reliance on fully supervised learning. Each mental disorder has unique characteristics and features; these distinctions can include variations in the symptoms patients report, the life events that led to the development or progression of the disease, and the specific treatments applied. As a result, researchers have to follow a labor-intensive process to analyze and predict outcomes for these chronic diseases. For instance, they must collect and label data, creating a dataset where examples are categorized or identified according to the specific disease they relate to. However, a dataset labeled for one chronic disease is difficult to reuse for another disease under a fully supervised learning model, resulting in the need to create multiple datasets for different mental disorders. Furthermore, a customized machine learning model for each individual disease has to be designed and fine-tuned, which involves meticulously optimizing the algorithms, parameters, and features to make the model effective for that particular disease. This process is highly costly in terms of time, effort, and resources, significantly hampering the applicability of the resulting prediction model.

然而,现有通过用户生成内容研究心理障碍的方法因严重依赖全监督学习,在实际应用和泛化能力方面存在局限。每种心理障碍都具有独特的特征,这些差异可能包括患者报告的症状差异、导致疾病发生或进展的生活事件差异,以及所采用的具体治疗方式差异。因此,研究人员必须遵循劳动密集型流程来分析预测这些慢性疾病的结果。例如:需要收集标注数据,创建按相关疾病分类识别的数据集。但基于全监督学习模型难以将特定慢性疾病数据集复用于其他疾病,导致需要为不同心理障碍创建多个数据集。此外,必须为每种疾病单独设计定制机器学习模型并进行微调,这涉及精心优化算法、参数和特征以使模型对该疾病有效。这一过程在时间、人力和资源方面成本极高,严重阻碍了所生成预测模型的应用性。

This study aims to address this research gap by developing a generalizable and adaptable method capable of detecting multiple mental disorders without constructing a large amount of training data or designing a customized model for each disease. With the emergence of LLMs in AI and their remarkable abilities across various downstream tasks, the learning paradigms in NLP-related tasks have evolved from traditional feature engineering and architecture engineering to more advanced learning paradigms, including fine-tuning and prompt engineering (Liu et al. 2023). The foundation for such an approach is grounded in LLMs that have already employed extensive training data, computing power, and algorithmic capabilities to achieve highly promising results across various domains, surpassing the performance of individually-trained, task-specific models. Given these technological advances, leveraging the immense potential of LLMs and prompt engineering for mental disorder detection using user-generated content can minimize the cost of training and developing disease- or problem-specific models. This approach offers a more effective and efficient alternative than traditional methods.

本研究旨在通过开发一种通用且可适应的方法来填补这一研究空白,该方法能够检测多种精神障碍,而无需构建大量训练数据或为每种疾病设计定制模型。随着大语言模型(LLM)在人工智能领域的出现及其在各种下游任务中的卓越表现,自然语言处理相关任务的学习范式已从传统的特征工程和架构工程演变为更先进的学习范式,包括微调(fine-tuning)和提示工程(prompt engineering) (Liu et al. 2023)。这种方法的基础在于大语言模型已经通过大量训练数据、计算能力和算法能力,在各个领域取得了超越单独训练的特定任务模型的优异表现。鉴于这些技术进步,利用大语言模型和提示工程的巨大潜力,通过用户生成内容进行精神障碍检测,可以最大限度地降低训练和开发针对特定疾病或问题模型的成本。这种方法比传统方法提供了更有效、更高效的替代方案。

Nevertheless, challenges remain in utilizing prompt engineering and LLMs for mental disorder detection. Most current studies on prompt engineering in IS focus on discrete prompts, that is, how to tweak the natural language used in prompts for downstream tasks. However, we argue that this approach is not the most effective in the context of mental disorder detection: a binary classification task where classification performance is paramount, and human comprehension of the prompt itself is not crucial. Moreover, detecting mental disorders using user-generated content presents two unique challenges. (1) Heterogeneity among diseases, and in how each individual presents their condition, poses significant challenges for prompt engineering. Different mental disorders exhibit distinct characteristics, including unique symptoms, risk factors, and treatment approaches, all of which shape the content of user-generated material. Additionally, each patient has a unique persona, characterized by individual patterns, habits, and disease progression, which collectively influence the creation of user-generated content. As a result, user-generated content can be lengthy, noisy, and highly complex, making it challenging for LLMs to efficiently detect various mental disorders at the subject level. This task extends beyond simply identifying explicit mentions of mental disorders and presents significant challenges for LLMs in comprehending nuances and in extracting, understanding, and inferring the implicit information related to mental disorders. (2) Incorporating structured medical domain knowledge into prompt engineering to enhance an LLM's predictive performance remains under-explored. Within the medical domain, there is a wealth of medical knowledge that can be closely linked to the content reported in user-generated content and pertains to mental disorders, which can provide significant assistance in employing LLMs for mental disorder detection. However, existing medical knowledge is often organized in tree or network structures (e.g., ontologies, one of the most prevalent forms of domain knowledge). Current discrete prompt methods, which refine questions in natural language, lack effective mechanisms to leverage such structured knowledge.

然而,在利用提示工程和大语言模型进行精神障碍检测方面仍存在挑战。目前信息系统领域关于提示工程的研究大多集中于离散提示(discrete prompts)——即如何调整自然语言提示以适应下游任务。但我们认为,这种方法在精神障碍检测(二元分类任务)中并非最优解:当分类性能至关重要而人类对提示本身的理解无关紧要时,现有方法存在局限。此外,利用用户生成内容检测精神障碍面临两大独特挑战:(1) 疾病异质性与个体表现差异对提示工程构成重大挑战。不同精神障碍具有独特的症状特征、风险因素和治疗方式,这些都会影响用户生成内容。同时,每位患者的个人特质(包括行为模式、习惯和病程发展)会共同塑造其生成内容。这导致用户生成内容往往冗长、含噪且高度复杂,使得大语言模型难以在个体层面有效检测多种精神障碍——该任务不仅需要识别明确提及的精神障碍,更要求模型理解细微差异、提取并推断与精神障碍相关的隐含信息。(2) 如何将结构化医学知识融入提示工程以提升大语言模型预测性能仍待探索。医学领域存在大量与用户生成内容密切相关且涉及精神障碍的专业知识,这些知识能显著提升检测效果。但现有医学知识通常以树状或网状结构组织(如本体论这类典型领域知识形式),而当前基于自然语言问题优化的离散提示方法缺乏有效利用此类结构化知识的机制。

These two unique challenges in detecting mental disorders using user-generated content motivate us to propose a novel method that aims to: (1) maximize the performance of binary classification for mental disorder detection; (2) account for significant individual differences in both the disorder and the individuals during the prompt engineering process to improve predictive performance; and (3) enhance the effectiveness of LLM predictions by injecting medical knowledge in the form of ontologies during the prompting process. Specifically, our proposed framework utilizes continuous ensemble prompt engineering techniques to interact with LLMs and generate accurate mental disorder detection results. We incorporate prefix-tuning to create personalized prompts tailored to individual patients. Additionally, to account for the unique characteristics of each mental disorder and leverage medical knowledge, we integrate a novel rule-based prompting method that incorporates disease-related medical ontologies.

利用用户生成内容检测心理障碍时面临的这两大独特挑战,促使我们提出一种新方法,旨在:(1) 最大化心理障碍检测二元分类的性能;(2) 在提示工程过程中兼顾障碍与个体的显著个体差异,以提升预测性能;(3) 通过在提示过程中注入本体形式的医学知识,增强大语言模型预测的有效性。具体而言,我们提出的框架采用连续集成提示工程技术与大语言模型交互,生成准确的心理障碍检测结果。通过融入前缀调优技术,我们为个体患者创建个性化提示。此外,为兼顾每种心理障碍的独特性并利用医学知识,我们整合了一种新型基于规则的提示方法,该方法融合了疾病相关的医学本体。
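To make the idea of injecting ontology-format medical knowledge into a prompt concrete, the following is a minimal, hypothetical sketch. The ontology fragment, the template wording, and the function names are illustrative assumptions for exposition only, not the paper's actual knowledge base or prompting artifact.

```python
# Hypothetical sketch: flattening a small tree-structured disease ontology
# into rule-style text that is prepended to a cloze-style prompt.
# Ontology content and templates are illustrative placeholders.

ontology = {
    "depression": {
        "symptoms": ["persistent sadness", "loss of interest", "sleep disturbance"],
        "risk_factors": ["chronic stress", "family history"],
    },
}

def ontology_to_rules(disease: str, onto: dict) -> str:
    """Turn one disease node of the ontology into natural-language rules."""
    node = onto[disease]
    return (f"Known symptoms of {disease}: {', '.join(node['symptoms'])}. "
            f"Known risk factors of {disease}: {', '.join(node['risk_factors'])}.")

def build_prompt(user_text: str, disease: str, onto: dict) -> str:
    # Knowledge rules + user-generated content + an unfilled [mask] slot.
    return (f"{ontology_to_rules(disease, onto)} "
            f"Post: \"{user_text}\" "
            f"Does the author show signs of {disease}? [mask]")

prompt = build_prompt("I can't sleep and nothing feels fun anymore.",
                      "depression", ontology)
print(prompt)
```

In the actual framework this knowledge injection is combined with continuous, prefix-tuned prompts rather than plain text; the sketch only conveys how structured knowledge can be linearized into a prompt.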

Our key contributions are twofold. From the healthcare domain perspective, we propose a novel approach using prompt engineering and LLMs for the detection of mental disorders through user-generated textual content and achieve few-shot learning. The key advantage lies in eliminating the need for a substantial amount of labeled training data or customized architecture engineering for each specific disease or research problem. From the methodology perspective, we have two innovations. (1) We propose an ensemble prompt method, synergizing prefix tuning and rule-based prompt engineering to address challenges in healthcare: personalized prompts and medical knowledge injection, which enhance method accuracy and efficacy. (2) We propose a new rule-based prompt method that efficiently tackles complex detection problems, integrating ontology-format domain knowledge, and its design principles can be extended to other problem domains, maximizing the potential of LLMs for real-world problem-solving.

我们的核心贡献体现在两方面。从医疗领域视角,我们提出了一种创新方法,通过提示工程 (prompt engineering) 和大语言模型,基于用户生成文本内容实现精神障碍检测,并达成少样本学习。其核心优势在于无需为每种特定疾病或研究问题准备大量标注训练数据或定制架构工程。从方法论视角,我们有两项创新:(1) 提出集成提示方法,协同前缀调优 (prefix tuning) 与基于规则的提示工程,解决医疗领域两大挑战:个性化提示与医学知识注入,从而提升方法准确性与有效性;(2) 提出新型基于规则的提示方法,通过整合本体格式领域知识高效解决复杂检测问题,其设计原则可拓展至其他问题领域,最大化释放大语言模型解决现实问题的潜力。

We position our work as computational design science research (Gregor and Hevner 2013, Hevner et al. 2004). In the context of machine learning in IS research (Padmanabhan et al. 2022), our work represents a Type I contribution focused on method development. Specifically, we introduce a new continuous ensemble prompt engineering method for personalized context and medical knowledge injection. Given that mental disorders are one of the major contributors to the overall global disease burden, our approach addresses this societal challenge by providing a tailored machine learning framework along with accompanying algorithms (Padmanabhan et al. 2022). In line with design research pathways for artificial intelligence (Abbasi et al. 2024), our "artifact typology" is a new "predictive model". The "abstraction spectrum" of our work includes: (1) emphasizing individual differences in the prompt-based prediction process to enhance accuracy, and (2) incorporating existing domain knowledge in ontology format into the prompting process to significantly improve performance. Both contribute valuable "salient design insights" for future research.

我们将本研究定位为计算设计科学研究 (Gregor and Hevner 2013, Hevner et al. 2004)。在信息系统研究中机器学习应用的背景下 (Padmanabhan et al. 2022),我们的工作属于专注于方法开发的I类贡献。具体而言,我们提出了一种新型连续集成提示工程方法,用于个性化上下文和医学知识注入。鉴于精神障碍是全球疾病负担的主要诱因之一,我们的方法通过提供定制化机器学习框架及配套算法 (Padmanabhan et al. 2022) 来应对这一社会挑战。遵循人工智能领域的设计研究路径 (Abbasi et al., 2024),我们的"人工制品类型学"是一种新型"预测模型"。本研究的"抽象谱系"包括:(1) 在基于提示的预测过程中强调个体差异以提升准确性,(2) 将以本体形式存在的领域知识融入提示过程可显著提升性能。二者均为未来研究提供了宝贵的"关键设计洞见"。

Practically, our work has significant implications for mental disorder detection. It provides an accurate detection method that can provide complementary information to existing mental disorder screening procedures. For public health management, our method enables large-scale analyses of a population's mental health beyond what has previously been possible with traditional methods.

实践中,我们的研究对精神障碍检测具有重大意义。它提供了一种精准的检测方法,可为现有精神障碍筛查流程提供补充信息。在公共卫生管理层面,该方法实现了超越传统手段的大规模人群心理健康分析能力。

2. RELATED WORK

2. 相关工作

Our work aims to leverage user-generated text content for detecting mental disorders, framing it as a binary classification problem. We also propose a novel prompting method that utilizes LLMs and overcomes the limitations of current mental disorder detection approaches, which often rely heavily on large amounts of labeled training data. We begin by reviewing the evolution of supervised machine learning techniques in natural language processing (NLP) and explaining why prompt-based methods hold promise in this research area. Next, we provide an overview of existing prompting techniques, justifying the motivations behind our research design and emphasizing the novelty of our proposed method.

我们的工作旨在利用用户生成的文本内容检测心理障碍,并将其视为二元分类问题。我们还提出了一种新颖的提示方法,利用大语言模型克服当前心理障碍检测方法的局限性,这些方法通常严重依赖大量带标签的训练数据。首先,我们回顾了自然语言处理 (NLP) 中监督机器学习技术的演变,并解释了为什么基于提示的方法在该研究领域具有前景。接着,我们概述了现有的提示技术,论证了研究设计背后的动机,并强调了我们提出的方法的新颖性。

2.1. Paradigms in NLP-related Supervised Machine Learning

2.1. NLP相关监督式机器学习范式

Supervised learning, a subcategory of machine learning and AI, has found extensive applications across diverse domains, facilitating tasks such as classifications, detections, and predictions. It is characterized by its use of labeled training datasets to supervise algorithms that produce outcomes accurately. In NLP (i.e., textual content-related machine learning), supervised learning has its paradigms and has evolved through various stages (Figure 1): from feature engineering and architecture engineering to pre-training and fine-tuning, and finally, to pre-training and prompt engineering (Liu et al. 2023).

监督学习 (Supervised learning) 作为机器学习和人工智能的子领域,已在分类、检测和预测等任务中展现出广泛应用。其核心特征是通过标注训练数据集来指导算法生成准确结果。在自然语言处理 (NLP) (即与文本内容相关的机器学习)领域,监督学习形成了独特范式并经历了多个演进阶段 (图 1):从特征工程与架构工程,到预训练与微调,最终发展为预训练与提示工程 (Liu et al. 2023)。

Until recently, most studies have focused on fully supervised learning. Since fully supervised learning requires a substantial amount of labeled data to train high-performing models, and large-scale labeled data for specific NLP or healthcare-related tasks are limited, researchers primarily focused on feature engineering before the advent of deep learning. Feature engineering involves extracting meaningful features from data using domain knowledge. For instance, Chau et al. (2020) identify emotional distress in user-generated content by employing a combination of feature extraction, feature selection, rules derived from domain experts, and machine learning classification.

直到最近,大多数研究都集中在全监督学习上。由于全监督学习需要大量标注数据来训练高性能模型,而针对特定自然语言处理(NLP)或医疗健康相关任务的大规模标注数据有限,在深度学习兴起之前,研究人员主要专注于特征工程。特征工程指利用领域知识从数据中提取有意义的特征。例如,Chau等人 (2020) 通过结合特征提取、特征选择、领域专家推导的规则与机器学习分类方法,来识别用户生成内容中的情绪困扰。

With the emergence of deep learning, which has the capacity to automatically extract features from data without feature engineering, researchers shifted their focus to model architecture engineering. These approaches involve designing appropriate deep learning structures to introduce inductive biases into models, facilitating the learning of useful features. A notable work is Yang et al. (2022), in which the authors develop a deep learning architecture for personality detection using user-generated content. Their research design is deliberately crafted to incorporate advanced deep learning architecture engineering, including transfer learning and hierarchical attention network architectures, alongside concepts from relevant psycho linguistic theories.

随着深度学习(Deep Learning)的出现,其能够无需特征工程即可从数据中自动提取特征,研究人员将注意力转向了模型架构工程。这些方法通过设计合适的深度学习结构,为模型引入归纳偏置,从而促进有用特征的学习。Yang等人(2022)的研究是其中的代表性工作,作者开发了一种用于人格检测的深度学习架构,利用用户生成内容。他们的研究设计精心结合了先进的深度学习架构工程,包括迁移学习和分层注意力网络架构,同时融合了相关心理语言学理论的概念。


Figure 1. Different NLP Supervised Learning Paradigms and Their Key Scholarly Contributions

图 1: 不同NLP监督学习范式及其关键学术贡献

Since 2018, NLP-related machine learning models have transitioned to a new paradigm known as pre-train and fine-tune (Devlin et al. 2018), where a fixed-architecture language model (e.g., BERT, T5, and GPT) can be pre-trained on a massive amount of text data. Pre-training typically involves tasks such as completing contextual sentences (e.g., fill-in-the-blank tasks), which do not require expert knowledge and can be performed directly on pre-existing large-scale data (i.e., self-supervised learning). The pre-trained model is then adapted to downstream tasks by fine-tuning (i.e., introducing additional parameters). This shift led researchers to focus on objective engineering, designing better objective functions for both pre-training and fine-tuning tasks (Sanh et al. 2020).

自2018年起,NLP相关的机器学习模型转向了一种称为预训练与微调(pre-train and fine-tune)的新范式(Devlin等人2018),其中采用固定架构的语言模型(例如BERT、T5和GPT)可以在海量文本数据上进行预训练。预训练通常涉及完成上下文句子等任务(例如填空任务),这些任务不需要专业知识,可以直接在已有的大规模数据上执行(即自监督学习)。随后通过微调(即引入额外参数)使预训练模型适配下游任务。这一转变促使研究者聚焦于目标工程,涉及为预训练和微调任务设计更好的目标函数 (Sanh et al. 2020)。

Table 1. Examples of Prompts in NLP Tasks

| Task | Original input | Prompt | Input to an LLM | Feedback from an LLM |
| --- | --- | --- | --- | --- |
| Sentiment prediction | I missed the bus today. | I felt so [mask]. | I missed the bus today. I felt so [mask]. | The LLM fills in the [mask] with an emotion word, e.g., frustrating. |
| Translation | I missed the bus today. | English: French: [mask] | English: I missed the bus today. French: [mask] | The LLM fills in the [mask] with the corresponding French sentence, e.g., J'ai raté le bus aujourd'hui. |

表 1: NLP任务中的提示词示例

| 任务 | 原始输入 | 提示词 | 大语言模型输入 | 大语言模型反馈 |
| --- | --- | --- | --- | --- |
| 情感预测 | I missed the bus today. | I felt so [mask]. | I missed the bus today. I felt so [mask]. | 模型用情感词(如 frustrating)填充 [mask]。 |
| 翻译 | I missed the bus today. | English: French: [mask] | English: I missed the bus today. French: [mask] | 模型用对应的法语句子(如 J'ai raté le bus aujourd'hui)填充 [mask]。 |

During the process of objective engineering, using different prompts to frame the same input (templates $T$ surrounding the input) could facilitate various tasks and improve predictive performance. Table 1 provides examples using discrete prompts (natural language). However, it is important to note that the templates $T$ surrounding the input can also be continuous prompts (i.e., numeric vectors). Subsequently, it was discovered that even for the same task, using different prompts can result in variance in prediction performance. That is, not only how the language model is trained but also how the prompt is designed can have a significant impact on performance. Therefore, many researchers have shifted their focus to prompt engineering, exploring the design of effective prompts for downstream tasks (Liu et al. 2023).

在目标工程过程中,使用不同的提示词(围绕输入的模板$T$)可以促进多种任务并提升预测性能。表1展示了使用离散提示(自然语言)的示例。但需注意,围绕输入的模板$T$也可以是连续提示(即数值向量)。后续研究发现,即使针对同一任务使用不同提示词,也会导致预测性能的波动。这表明不仅大语言模型的训练方式,提示词的设计同样对性能有重大影响。因此,许多研究者将重心转向提示工程,探索针对下游任务的有效提示设计(Liu et al. 2023)。

Although both fine-tuning and prompt engineering leverage the capabilities of LLMs for various downstream prediction tasks, researchers have found that prompt engineering can outperform fine-tuning in terms of predictions. This is because the objectives of pre-training of LLMs (e.g., masked language modeling) and fine-tuning (e.g., binary classification) are not always aligned, making it difficult to fully exploit the knowledge embedded in pre-trained LLMs, which can lead to suboptimal performance on downstream tasks (Wang et al. 2022). Moreover, empirical evidence suggests that prompt learning is computationally efficient, as it is well-suited for zero-shot or few-shot learning scenarios where data is limited (Gao et al. 2021).

尽管微调 (fine-tuning) 和提示工程 (prompt engineering) 都能利用大语言模型的能力完成各种下游预测任务,但研究人员发现提示工程在预测效果上可能优于微调。这是因为大语言模型的预训练目标(如掩码语言建模)与微调目标(如二元分类)并不总是一致,导致难以充分挖掘预训练模型中嵌入的知识,从而影响下游任务性能 (Wang et al. 2022)。此外,实证研究表明提示学习具有计算效率优势,尤其适合数据有限的零样本或少样本学习场景 (Gao et al. 2021)。

2.2. Prompt Engineering

2.2. 提示工程

We first elucidate the key differences between prompt engineering and other NLP-related supervised learning paradigms. Feature engineering, architecture engineering, and fine-tuning share a common pattern: training a machine learning model to process labeled training examples $(x, y)$ and predict an output $y$ as $p(\boldsymbol{y}|\boldsymbol{x})$. In contrast, prompt engineering follows a distinct learning process: using an LLM, it directly models the probability of an outcome $z$ (see Table 1: Feedback from an LLM). To leverage these models for prediction tasks, the original input $x$ undergoes modification using a template $T$ to create a new input $x'$ (i.e., the prompt). The template $T$ can be either discrete (natural language) or continuous (numeric vectors).

我们首先阐明提示工程 (prompt engineering) 与其他NLP相关监督学习范式之间的关键区别。特征工程、架构工程和微调遵循相同范式:通过训练机器学习模型处理带标签的训练样本 ( $x,y)$ 并预测输出 $y$ 作为 $p(\boldsymbol{y}|\boldsymbol{x})$ 。而提示工程采用截然不同的学习过程:利用大语言模型直接建模结果 $z$ 的概率 (见表1: LM反馈)。为将这些模型应用于预测任务,原始输入 $x$ 会通过模板 $T$ 修改为新输入 $x'$ (即提示)。模板 $T$ 可以是离散的 (自然语言) 或连续的 (数值向量)。

To clarify the operational mechanisms of prompt engineering, we employ a discrete, human-readable prompt as an illustrative example. This new input $x'$ to an LLM has unfilled slots $[mask]$, and the LLM is then utilized to probabilistically fill in the missing information $[mask]$, resulting in $z$, from which the ultimate output $y$ can be derived through $p(\boldsymbol{\hat{y}}|z)$. Take the sentiment prediction task in Table 1 as an example. The original input $x$ is the text "I missed the bus today" or its vector representation. The corresponding label, denoted as $y$, is "negative (sentiment)." In learning paradigms other than prompt engineering, $y$ is directly derived through $p(y\,|\,\text{"I missed the bus today"})$. However, in prompt engineering, researchers first design a prompt denoted as $x'$ (i.e., a new input to an LLM): "I missed the bus today. I felt so [mask]." The LLM fills the unfilled slots of $x'$, resulting in $z$, "frustrating." The prediction result $y$ is determined by the value of $z$. As $z$ is closely associated with a negative sentiment, the prediction outcome $y$ is consequently classified as "negative." The determination of which words are more closely associated with "negative" or "positive" can be either pre-defined or learned automatically, which is referred to as the verbalizer $V$.

为阐明提示工程(prompt engineering)的运作机制,我们采用一个离散化、人类可读的提示作为示例。大语言模型的新输入$x'$包含未填充的槽位$[mask]$,随后利用大语言模型以概率方式填补缺失信息$[mask]$得到$z$,最终输出$y$可通过$p(\boldsymbol{\hat{y}}|z)$推导得出。以表1中的情感预测任务为例,原始输入$x$是文本"I missed the bus today"或其向量表示,对应标签$y$为"negative"(情感)。在非提示工程的学习范式中,$y$直接通过P("I missed the bus today")推导获得;而在提示工程中,研究者首先设计提示$x'$(即大语言模型的新输入):"I missed the bus today. I felt so [mask]."大语言模型填补$x'$的未填充槽位得到$z$:"frustrating"。预测结果$y$由$z$的值决定,由于$z$与负面情感高度关联,因此预测结果$y$被判定为"negative"。至于哪些词汇与"负面"或"正面"关联更紧密,既可以预先定义,也可以自动学习,这一映射被称为 verbalizer $V$。
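The cloze-prompt workflow above (prompt construction, mask filling, then verbalizer mapping) can be sketched as follows. The fill-in distribution is a toy stand-in for a real LLM's output, and the verbalizer word lists are illustrative assumptions:

```python
# Minimal sketch of the discrete cloze-prompt pipeline:
# x -> x' = f_prompt(x) -> z (mask fill) -> y via verbalizer V.

def fake_llm_fill(prompt: str) -> dict:
    """Toy stand-in for an LLM: returns a probability distribution
    over candidate words for the [mask] slot."""
    assert "[mask]" in prompt
    return {"frustrating": 0.62, "annoying": 0.25, "great": 0.13}

# Verbalizer V: maps answer words z to class labels y (here, pre-defined).
VERBALIZER = {
    "negative": {"frustrating", "annoying", "sad"},
    "positive": {"great", "happy", "relieved"},
}

def predict_sentiment(x: str) -> str:
    x_prime = f"{x} I felt so [mask]."   # the prompt x'
    z_dist = fake_llm_fill(x_prime)      # LLM fills the [mask]
    scores = {label: 0.0 for label in VERBALIZER}
    for z, p in z_dist.items():          # aggregate probability mass per label
        for label, words in VERBALIZER.items():
            if z in words:
                scores[label] += p
    return max(scores, key=scores.get)

print(predict_sentiment("I missed the bus today."))  # negative
```

A learned verbalizer would replace the fixed word sets with weights estimated from labeled examples; the control flow stays the same.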

Table 2. Representative Prompt Engineering Methods and Comparison with Our Method

(a) Classification based on Prompt Design, along four dimensions: shape of prompts (cloze vs. prefix), manual vs. automated prompts, discrete vs. continuous prompts, and static vs. dynamic prompts. Methods compared: LAMA (Petroni et al. 2019), TemplateNER (Cui et al. 2021), GPT-3 (Brown et al. 2020), Prefix-Tuning (Li and Liang 2021), Prompt tuning (Lester et al. 2021), AutoPrompt (Shin et al. 2020), and Ours.

(b) Classification based on Multi-prompt Learning, along four strategies: prompt ensemble, prompt augmentation, prompt composition, and prompt decomposition. Methods compared: BARTScore (Yuan et al. 2021), GPT-3 (Brown et al. 2020), PTR (Han et al. 2022), TemplateNER (Cui et al. 2021), and Ours.

表 2: 代表性提示工程方法及与本文方法的对比

(a) 基于提示设计的分类,涵盖四个维度:提示形态(完形填空式/前缀式)、人工/自动提示、离散/连续提示、静态/动态提示。对比方法:LAMA (Petroni et al. 2019)、TemplateNER (Cui et al. 2021)、GPT-3 (Brown et al. 2020)、Prefix-Tuning (Li and Liang 2021)、Prompt tuning (Lester et al. 2021)、AutoPrompt (Shin et al. 2020) 以及本文方法。

(b) 基于多提示学习的分类,涵盖四种策略:提示集成、提示增强、提示组合与提示分解。对比方法:BARTScore (Yuan et al. 2021)、GPT-3 (Brown et al. 2020)、PTR (Han et al. 2022)、TemplateNER (Cui et al. 2021) 以及本文方法。

The objective of prompt engineering is to develop a prompting function, denoted as $x'=f_{prompt}(x)$ , to achieve optimal performance in the subsequent task. Prompt engineering can significantly enhance the efficiency and effectiveness of the prediction process since it enables LLMs to undergo pre-training on vast amounts of pre-existing textual data. Moreover, by defining $f_{prompt}(x)$ , the model can facilitate few-shot or even zero-shot learning, seamlessly adapting to new scenarios with minimal or no labeled data.

提示工程的目标是开发一个提示函数,记为 $x'=f_{prompt}(x)$ ,以在后续任务中实现最佳性能。提示工程可以显著提高预测过程的效率和效果,因为它使大语言模型能够对大量现有文本数据进行预训练。此外,通过定义 $f_{prompt}(x)$ ,该模型可以促进少样本甚至零样本学习,只需极少或无需标注数据即可无缝适应新场景。

As the literature underscores, the design of a prompt can have substantial influence on the overall performance of a prompt-based method (Liu et al. 2023). Therefore, various prompt engineering methods have been proposed, which can be categorized based on the shape of prompts, manual/automated prompts, discrete/continuous prompts, and static/dynamic prompts, each with distinct characteristics and associated pros and cons. The choice of prompt engineering method depends on both the task at hand and the specific LLMs employed to address the task. We provide a summary of representative studies in Table 2. Recently, many studies have highlighted the significant improvement in the effectiveness of prompt engineering methods through the utilization of multiple prompts—a concept known as multi-prompt engineering. Several key strategies for multi-prompt learning have been identified, including prompt ensembling, prompt augmentation, prompt composition, and prompt decomposition (Liu et al. 2023).

文献研究表明,提示(prompt)设计对基于提示的方法整体性能具有显著影响(Liu et al. 2023)。因此,研究者提出了多种提示工程方法,这些方法可按提示形态、人工/自动提示、离散/连续提示、静态/动态提示等维度进行分类,各类方法具有不同特性及优缺点。提示工程方法的选择需同时考虑具体任务和所用大语言模型(LLM)。我们在表2中总结了代表性研究。近期许多研究表明,通过使用多重提示(即多提示工程)可显著提升提示工程方法的有效性。目前已识别出多提示学习的若干关键策略,包括提示集成(prompt ensembling)、提示增强(prompt augmentation)、提示组合(prompt composition)和提示分解(prompt decomposition)(Liu et al. 2023)。
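Among the multi-prompt strategies above, prompt ensembling is the one most directly relevant to our method. A minimal sketch of the idea, querying a (stand-in) LLM with several templates for the same input and averaging the resulting label probabilities, could look like this. The templates and toy distributions are illustrative assumptions:

```python
# Sketch of prompt ensembling: multiple templates for one input x,
# predictions aggregated by averaging. fake_llm_fill stands in for
# an LLM plus a verbalizer that already maps answers to class labels.

TEMPLATES = [
    "{x} I felt so [mask].",
    "{x} Overall it was [mask].",
]

def fake_llm_fill(prompt: str) -> dict:
    # Toy stand-in: a real LLM would return a different distribution
    # per template; here we vary it by which template was used.
    if "felt" in prompt:
        return {"negative": 0.7, "positive": 0.3}
    return {"negative": 0.6, "positive": 0.4}

def ensemble_predict(x: str) -> str:
    totals: dict = {}
    for t in TEMPLATES:
        for label, p in fake_llm_fill(t.format(x=x)).items():
            totals[label] = totals.get(label, 0.0) + p
    avg = {k: v / len(TEMPLATES) for k, v in totals.items()}  # mean over templates
    return max(avg, key=avg.get)

print(ensemble_predict("I missed the bus today."))  # negative
```

Averaging is only one aggregation choice; weighted voting or taking the maximum over templates are common alternatives.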

Although prompt engineering has shown significant potential across different tasks and scenarios, many challenges remain (Liu et al. 2023). Two of the most significant technical challenges in this field are as follows. (1) Prompt design for complex tasks: the formulation and design of prompts for complex tasks are not straightforward (Liu et al. 2023). In particular, prompt design for mental disorder detection using textual data is under-explored. Each patient possesses unique characteristics and patterns, including but not limited to linguistic styles (such as a tendency to complain and convey setbacks, or a tendency to endure and face challenges positively), habits of using social media, the extent to which one is willing to openly discuss one's own illness, a unique course of illness progression, and so on. Moreover, different types of mental disorders exhibit distinct (but sometimes similar) symptoms, risk factors, and treatments. (2) Prompt engineering with structured domain knowledge (Han et al. 2022): in many NLP tasks, inputs may exhibit various structures (e.g., syntax trees or relational structures from relationship extraction); effectively expressing these structures in prompt engineering poses a significant challenge. In mental disorder management, chronic disease management, and healthcare in general, a substantial volume of medical knowledge exists in structured formats (e.g., ontologies, which are tree or network structures). Leveraging this existing domain knowledge can greatly enhance disease detection using textual data. However, this area remains largely under-explored, presenting a potentially crucial and interesting avenue for research. These two challenges represent the core issues that this work aims to address. In the following sections, we will explore our proposed solutions within our research context.

尽管提示工程(prompt engineering)在不同任务和场景中展现出巨大潜力,但仍存在诸多挑战(Liu et al. 2023)。该领域最显著的两个技术挑战如下:(1) 复杂任务的提示设计:针对复杂任务的提示制定与设计并非易事(Liu et al. 2023)。特别是在利用文本数据进行精神障碍检测时,提示设计研究尚不充分。每位患者都具有独特特征和行为模式,包括但不限于:语言风格(如倾向于抱怨、诉说挫折;或倾向于忍耐、积极面对挑战)、社交媒体使用习惯、对公开讨论自身疾病的接受程度、独特的病情发展轨迹等。此外,不同类型的精神障碍表现出独特(但有时相似)的症状、风险因素和治疗方案。(2) 结合结构化领域知识的提示工程(Han et al. 2022):在许多NLP任务中,输入可能呈现多种结构(如句法树或关系抽取得到的关系结构);如何在提示工程中有效表达这些结构构成重大挑战。在精神障碍管理、慢性病管理及医疗健康领域,大量医学知识以结构化形式存在(如本体论中的树状或网状结构)。利用现有领域知识可显著提升基于文本数据的疾病检测效果。然而这一领域仍存在大量研究空白,可能成为关键而有趣的研究方向。这两大挑战正是本研究旨在解决的核心问题。在接下来的章节中,我们将在研究背景下探讨所提出的解决方案。

2.2.1. Mental Disorder Detection and Continuous Prompt Engineering

2.2.1. 精神障碍检测与连续提示工程

One method of classifying prompts is to categorize them as either continuous or discrete. A discrete prompt modifies the input to an LLM using natural language. In contrast, a continuous prompt operates directly in the model's embedding space, allowing it to (1) relax the constraint that template embeddings $T$ must correspond to natural language (e.g., English) words and (2) eliminate the restriction that the template $T$ is parameterized by the pre-trained LLM's parameters.

一种对提示进行分类的方法是将它们划分为连续型或离散型。离散型提示通过自然语言修改大语言模型(LLM)的输入。相比之下,连续型提示直接在模型的嵌入空间中操作,使其能够:(1) 放宽模板嵌入 $T$ 必须对应自然语言(如英语)单词的约束;(2) 消除模板 $T$ 由预训练大语言模型参数参数化的限制

As mentioned earlier, this study focuses on detecting mental disorders using user-generated content. The problem is a binary classification task. We argue that continuous prompts are more effective than discrete ones for our research problem for the following reasons: (1) Enhanced expressiveness: continuous prompts leverage high-dimensional vector representations, enabling the model to learn latent semantic features tailored to the classification task (Lester et al. 2021). Unlike discrete prompts, continuous prompts capture nonlinear relationships that rigid textual templates (discrete prompts) often miss (Liu et al. 2024). (2) Flexibility via optimization: continuous prompts consist of trainable parameters optimized through backpropagation, allowing them to align directly with the prediction task's loss function (Lester et al. 2021). In contrast, discrete prompts rely on heuristic tuning, which may lead to misalignment with the LLMs' internal representations and the objectives of the prediction task (Shin et al. 2020). (3) Mitigating ambiguity in classifications: binary tasks, such as mental disorder detection (positive/negative), require probabilistic boundaries. Continuous prompts excel in these domains by providing probabilities as feedback from LLMs. In our work, we leverage these probabilities for result interpretation, which can offer significant benefits to stakeholders (see Section 4.5). In contrast, discrete prompts impose rigid mappings, only yielding binary outcomes (positive/negative) (Shin et al. 2020). (4) Computational efficiency and performance improvement: although discrete

如前所述,本研究聚焦于利用用户生成内容检测心理障碍。该问题属于二分类任务。我们认为连续提示(continuous prompts)比离散提示更适合本研究问题,原因如下:(1) 表现力增强:连续提示利用高维向量表示,使模型能够学习针对分类任务的潜在语义特征(Lester et al. 2021)。与离散提示不同,连续提示能捕捉刚性文本模板(离散提示)常忽略的非线性关系(Liu et al. 2024)。(2) 通过优化实现灵活性:连续提示由可训练参数组成,通过反向传播进行优化,使其能直接对齐预测任务的损失函数(Lester et al. 2021)。相比之下,离散提示依赖启发式调优,可能导致与大语言模型内部表征及预测任务目标不一致(Shin et al. 2020)。(3) 缓解分类模糊性:心理障碍检测(阳性/阴性)等二分类任务需要概率边界。连续提示通过提供大语言模型的概率反馈,在此类任务中表现优异。我们在工作中利用这些概率进行结果解释,可为利益相关者带来显著价值(参见第4.5节)。而离散提示采用刚性映射,仅产生二元结果(阳性/阴性)(Shin et al. 2020)。(4) 计算效率与性能提升:尽管离散

prompts may seem straightforward because they are easily understandable by humans, their heuristic tuning (e.g., word tweaking for better performance) can be time-consuming. In contrast, continuous prompts automate the tuning process with minimal computational overhead (adding only minimal trainable parameters, e.g., prefix embeddings, and avoiding full model fine-tuning). This is critical for binary tasks with limited labeled data (Lester et al. 2021).

提示词看似简单易懂,但其启发式调优(如调整用词以提升效果)可能耗时。相比之下,连续提示通过最小计算开销(仅添加少量可训练参数,如前缀嵌入,避免全模型微调)实现了调优自动化。这对于标注数据有限的二元分类任务至关重要(Lester et al. 2021)。

Continuous prompts have one primary limitation: they are typically represented as high-dimensional numerical vectors, which makes it difficult to directly understand or interpret their specific meanings. This contrasts with the more intuitive nature of discrete prompts. However, this limitation is not critical in the context of our research, binary classification. The understandability of the prompt per se (the input) is not the ultimate goal. Our goal is to predict mental disorders (the output). Even with continuous prompts, we can still offer explainable insights for this prediction outcome, which we will show in Section 4.5.

连续提示有一个主要限制:它们通常表示为高维数值向量,这使得难以直接理解或解释其具体含义。这与离散提示更直观的特性形成对比。然而,这些限制在我们的研究——二元分类背景下并不关键。提示本身(输入)的可理解性并非最终目标,我们的目标是预测精神障碍(输出)。即使使用连续提示,我们仍可为该预测结果提供可解释的洞见,这将在第4.5节展示。

Moreover, there are two key research challenges in our research problem: (1) addressing heterogeneity among individuals and disorders, and (2) incorporating structured medical knowledge into prompts to enhance LLM prediction. We aim to tackle this research problem by enhancing the performance of LLMs in binary mental disorder detection tasks through a prompt ensemble approach that combines two continuous prompt engineering techniques: prefix tuning and rule-based prompting.

此外,我们的研究问题存在两个关键挑战:(1) 解决个体与疾病间的异质性问题,(2) 将结构化医学知识融入提示词以增强大语言模型预测性能。我们计划通过提示集成方法提升大语言模型在二元精神障碍检测任务中的表现,该方法结合了两种连续提示工程技术:前缀调优和基于规则的提示构建。

2.2.2. Prefix Tuning for Personalized Prompts

2.2.2. 个性化提示的前缀调优 (Prefix Tuning)

Prefix tuning is a continuous prompt method that optimizes a sequence of trainable vectors prepended to the input embeddings (Li and Liang 2021). The underlying intuition of this method is that providing an appropriate context to an LLM can influence the encoding of $x$ and direct the LLM on what information to extract from $x$. Therefore, the context can guide the LLM to effectively solve downstream tasks. Nevertheless, it is not clear whether such a context exists or how to identify such a context for each individual $x$. Therefore, the authors propose the prefix tuning method to automatically optimize continuous prefixes for inputs as the context. Formally, for a training example $(x,y)$, they define

前缀调优是一种连续提示方法,它优化了预置在输入嵌入前的可训练向量序列 (Li and Liang 2021)。该方法的核心思想在于:通过为大语言模型提供适当的上下文,可以影响 $x$ 的编码过程,并指导模型从 $x$ 中提取哪些信息。因此,这种上下文能引导大语言模型有效解决下游任务。然而,目前尚不清楚这种上下文是否存在,或如何为每个独立的 $x$ 识别这种上下文。为此,作者提出前缀调优方法来自动优化作为上下文输入的连续前缀。形式上,对于训练样本 $(x,y)$ ,他们定义

$$
f_{prompt}=[Prefix;x;Prefix';y]
$$

$$
f_{prompt}=[Prefix;x;Prefix';y]
$$

where $Prefix$ and $Prefix'$ are placeholders for values associated with the training example $(x,y)$. The $Prefix$ and $Prefix'$ for all training examples consist of a trainable matrix $P_{\theta}[i,:]$, where $i \in P_{idx}$ and $P_{idx}$ denotes the sequence of prefix indices. Therefore, the feedback from the LLM is

其中 Prefix 和 Prefix' 是与训练样本 $(x,y)$ 相关联的值的占位符。所有训练样本的 Prefix 和 $Prefix^{\prime}$ 由一个可训练矩阵 $P_{\theta}[i,:]$ 组成,其中 $i \in P_{idx}$ 且 $P_{idx}$ 表示前缀索引序列。因此,大语言模型 (LLM) 的反馈为
$$
z_i =\begin{cases} P_{\theta}[i,:], & \text{if } i \in P_{idx},\\ LM_{\phi}(f_{prompt}(x), z_{<i}), & \text{o.w.} \end{cases}
$$

$$
z_i =\begin{cases} P_{\theta}[i,:], & \text{if } i \in P_{idx},\\ LM_{\phi}(f_{prompt}(x), z_{<i}), & \text{o.w.} \end{cases}
$$

$z_{i}$ is a function of the trainable $P_{\theta}$ (of dimension $\left|P_{idx}\right|\times dim(z_{i})$). When $i\in P_{idx}$, $z_{i}$ is copied directly from $P_{\theta}$; when $i\notin P_{idx}$, $z_{i}$ still depends on $P_{\theta}$, as it is the prefix context and subsequent feedback from $LM_{\Phi}$ relies on the activations of the preceding feedback. Empirically, $P_{\theta}[i,:]$ is reparameterized from a smaller matrix $P'_{\theta}[i,:]$ (of dimension $\left|P_{idx}\right|\times k$, where $k$ is a hyper-parameter) using a feedforward neural network for stable training: $P_{\theta}[i,:]=MLP_{\theta}(P'_{\theta}[i,:])$. The learning goal is

$z_{i}$ 表示关于可训练变量 $P_{\theta}$ 的函数(维度为 $\left|P_{idx}\right|\times dim(z_{i})$)。当 $i\in P_{idx}$ 时,$z_{i}$ 直接从 $P_{\theta}$ 复制;当 $i\notin P_{idx}$ 时,$z_{i}$ 将 $P_{\theta}$ 作为前缀上下文,而 $LM_{\Phi}$ 的后续反馈依赖于先前反馈的激活。经验上,$P_{\theta}[i,:]$ 由维度为 $\left|P_{idx}\right|\times k$(其中 $k$ 为超参数)的较小矩阵 $P'_{\theta}[i,:]$ 通过前馈神经网络重参数化以实现稳定训练:$P_{\theta}[i,:]=MLP_{\theta}(P'_{\theta}[i,:])$。学习目标是

$$
\max_{\theta} \log p_{\phi}(y|x) = \sum_{i \in y_{idx}} \log p_{\phi}\big(f_{prompt}(x, y)_{i} \mid z_{<i}\big)
$$

$$
\max_{\theta} \log p_{\phi}(y|x) = \sum_{i \in y_{idx}} \log p_{\phi}\big(f_{prompt}(x, y)_{i} \mid z_{<i}\big)
$$

where $p_{\phi}$ is the LLM distribution and $\phi$ denotes the LLM parameters, which are kept fixed; only the prefix parameters $\theta$ are trained.

其中 $p_{\phi}$ 是大语言模型 (LLM) 的分布,$\phi$ 是固定不变的大语言模型参数;仅前缀参数 $\theta$ 参与训练。

One significant advantage of this method is that it provides flexibility and adaptability to individual input $x$. Given that the prepended vectors (i.e., $Prefix$ and $Prefix'$) are automatically updated during training, each prefix vector (i.e., $P_{\theta}[i,:]$) is customized for individual input $x$ simultaneously. We exploit this feature of prefix tuning to generate personalized prompts for user textual data. In the context of mental disorder detection using user-generated content, it is desirable to provide a distinct prompt for each user for optimal performance since different users have unique characteristics and underlying patterns. Thus, prefix tuning represents a promising research direction for developing continuous prompts optimized for each individual, thereby

这种方法的一个显著优势在于它为单个输入$x$提供了灵活性和适应性。由于前置向量(即Prefix和Prefix')在训练过程中会自动更新,每个前缀向量(即$P_{\theta}[i,:]$)都能同时为单个输入$x$进行定制。我们利用前缀调优的这一特性来为用户文本数据生成个性化提示。在使用用户生成内容进行精神障碍检测的场景中,由于不同用户具有独特的特征和潜在模式,为每个用户提供不同的提示以获得最佳性能是可取的。因此,前缀调优代表了一个有前景的研究方向,可以开发针对每个个体优化的连续提示,从而...

enhancing the performance of mental disorder detection. Specifically, our intentions are twofold: (1) to refine the design of $f_{p r o m p t}$ (Eq. 1) to better fulfill the role of a personalized prompt tailored to individual users for mental disorder detection; and (2) to seamlessly integrate the learning objective of prefix tuning (Eq. 3) with other prompt learning goals through a multi-prompt approach (i.e., prompt ensemble) to more effectively address the challenges associated with mental disorder detection.

提升心理障碍检测的性能。具体而言,我们的目标有两个:(1) 改进 $f_{p r o m p t}$ (公式1) 的设计,使其更好地作为针对个体用户量身定制的个性化提示 (prompt) 用于心理障碍检测;(2) 通过多提示方法 (即提示集成) 将前缀调优 (prefix tuning) (公式3) 的学习目标与其他提示学习目标无缝结合,以更有效地应对心理障碍检测相关的挑战。
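As a concrete illustration of Eqs. 1–3, the following minimal Python sketch (hypothetical shapes and names, not the authors' implementation) prepends MLP-reparameterized prefix vectors to frozen input embeddings:

```python
import math
import random

random.seed(0)

PREFIX_LEN, SMALL_DIM, EMB_DIM = 3, 4, 8  # |P_idx|, k, dim(z_i): hypothetical sizes

# Smaller trainable matrix P'_theta of shape |P_idx| x k (Li and Liang 2021)
P_prime = [[random.uniform(-1, 1) for _ in range(SMALL_DIM)]
           for _ in range(PREFIX_LEN)]

# One-layer stand-in for MLP_theta, mapping k -> dim(z_i)
W = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(SMALL_DIM)]

def mlp(row):
    """P_theta[i,:] = MLP_theta(P'_theta[i,:]): linear map plus tanh."""
    return [math.tanh(sum(r * W[j][d] for j, r in enumerate(row)))
            for d in range(EMB_DIM)]

def build_prefixed_input(x_embeddings):
    """Form [Prefix; x] of Eq. 1 by prepending the reparameterized prefix."""
    P_theta = [mlp(row) for row in P_prime]
    return P_theta + x_embeddings

x = [[0.0] * EMB_DIM for _ in range(5)]  # frozen embeddings of a 5-token input
z = build_prefixed_input(x)
print(len(z))  # PREFIX_LEN + 5 = 8
```

During training, only `P_prime` and `W` (the prefix side) would receive gradients; the LLM's own parameters stay fixed, matching the objective above.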

2.2.3. Knowledge Injection Through Rule-based Prompts

2.2.3. 基于规则提示的知识注入

The rule-based prompt is another continuous prompt method that incorporates logical rules (e.g., "If the text contains word $X$, it likely belongs to class $Y$") into the prompt tuning process. It uses continuous prompts as its foundation while constraining their optimization through rule-based losses (Han et al. 2022). The rule-based prompt is proposed to address the limitations of other widely used prompt engineering methods in addressing complex text classification tasks: (1) manual prompt design is both laborious and prone to errors, and (2) for auto-generated prompts, validating their efficacy is a resource-intensive and time-consuming process.

基于规则的提示 (rule-based prompt) 是另一种连续提示方法,它将逻辑规则 (例如"如果文本包含单词 $X$,则很可能属于类别 $Y$") 融入提示调优过程。该方法以连续提示为基础,同时通过基于规则的损失函数来约束其优化 (Han et al. 2022)。提出基于规则的提示是为了解决其他广泛使用的提示工程方法在处理复杂文本分类任务时的局限性:(1) 手动设计提示既费力又容易出错;(2) 对于自动生成的提示,验证其有效性是一个资源密集且耗时的过程。

The essence of the rule-based prompt to solve challenging classification tasks is threefold. First, for a highly challenging text classification problem (i.e., given $(x,y)$, predict $p(\hat{y}|x)$), the rule-based prompt breaks down the classification question into several simpler sub-classification tasks, namely, breaking down $p(y|x)$ into $p(y^{1}|x)...p(y^{f}|x)...p(y^{k}|x)$, where $k$ indicates the number of subtasks. Then, the rule-based prompt method incorporates logical rules to compose task-specific prompts from several simpler sub-prompts and accomplish the complex classification task. Formally, for each sub-classification task $p(y^{f}|x)$, the rule-based prompt method sets a template $T^{f}(x)$ and a set of verbalizer words $V^{f}=\{v_{1},...,v_{n}\}$. The template $T^{f}(x)$ and verbalizer $V^{f}$ constitute the prompting function $f_{prompt}^{f}(x)$. The logical rule is defined as

基于规则的提示方法解决复杂分类任务的核心可归纳为三点。首先,针对高难度文本分类问题(即给定$(x,y)$并预测$p(\hat{y}|x)$),该方法将分类问题拆解为多个更简单的子分类任务,即将$p(y|x)$分解为$p(y^{1}|x)...p(y^{f}|x)...p(y^{k}|x)$,其中$k$表示子任务数量。随后,该方法通过逻辑规则将若干简单子提示组合成任务特定的提示,最终完成复杂分类任务。形式上,对于每个子分类任务$p(y^{f}|x)$,规则提示方法会设定模板$T^{f}(x)$和对应的标签词集$V^{f}=\{v_{1},...,v_{n}\}$。模板$T^{f}(x)$与标签词集$V^{f}$共同构成提示函数$f_{prompt}^{f}(x)$。逻辑规则定义为

$$
p(y^{1}|x)\wedge p(y^{2}|x)\wedge\cdots\wedge p(y^{f}|x)\wedge\cdots\wedge p(y^{k}|x)\to p(y|x)
$$

$$
p(y^{1}|x)\wedge p(y^{2}|x)\wedge\cdots\wedge p(y^{f}|x)\wedge\cdots\wedge p(y^{k}|x)\to p(y|x)
$$

Second, the rule-based prompt method incorporates prior knowledge for each sub-classification task, reducing the laborious and error-prone nature of manual prompt construction and mitigating the uncertainties associated with auto-generated prompts. Formally, when constructing each sub-prompt $f_{prompt}^{f}(x)$ for each sub-classification task $p(y^{f}|x)$, prior knowledge can be injected into both the design of $T^{f}(x)$ and the verbalizer $V^{f}$ to facilitate the prediction and performance of $p(y^{f}|x)$. For instance, consider a classical sub-classification problem in named entity recognition. Let $T^{f}(x)=$ "$x$ is the [mask] entity" and $V^{f}=\{$"person", "organization", ...$\}$. For this subtask of named entity recognition, the templates and verbalizers can be meticulously customized to assist an LLM in accurately identifying the entity category. For a classical relation prediction problem, let $T^{f}(x)=$ "$x$ $entity_{1}$ [mask] $entity_{2}$" and $V^{f}=\{$"was born in", "is parent of", ...$\}$. Again, the templates and verbalizers for this sub-classification problem can be tailored to assist an LLM in completing a relation prediction task.

其次,基于规则的提示方法为每个子分类任务融入了先验知识,既减少了人工构建提示的繁琐与易错性,也降低了自动生成提示的不确定性。具体而言,在构建每个子分类任务 $p(y^{f}|x)$ 的子提示 $f_{prompt}^{f}(x)$ 时,可通过设计 $T^{f}(x)$ 和标签词集 $V^{f}$ 来注入先验知识,从而提升 $p(y^{f}|x)$ 的预测效果。例如命名实体识别中的经典子分类问题,设 $T^{f}(x)=$"$x$ 是[mask]实体"且 $V^{f}=\{$"person", "organization", ...$\}$,通过精心定制模板与标签词集可帮助大语言模型准确识别实体类别。对于经典关系预测问题,设 $T^{f}(x)=$"$x$ $entity_{1}$ [mask] $entity_{2}$"且 $V^{f}=\{$"was born in", "is parent of", ...$\}$,同样可通过定制该子分类问题的模板与标签词集,辅助大语言模型完成关系预测任务。
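The template-plus-verbalizer construction described above can be sketched as follows (a toy illustration with hypothetical templates; in a real system an LLM would fill each [mask] from the corresponding verbalizer set):

```python
# Hypothetical sub-prompts in the style of Han et al. (2022): each sub-task f pairs
# a template T^f(x) containing a [mask] slot with a verbalizer set V^f.
sub_prompts = [
    {"template": "{x} is the [mask] entity.",
     "verbalizer": {"person", "organization", "location"}},
    {"template": "{x} entity_1 [mask] entity_2.",
     "verbalizer": {"was born in", "is parent of"}},
]

def compose_prompt(x):
    """Aggregate the sub-templates into one task-specific prompt (the [.;.;.] of Eq. 5)."""
    text = " ".join(sp["template"].format(x=x) for sp in sub_prompts)
    verbalizers = [sp["verbalizer"] for sp in sub_prompts]
    return text, verbalizers

prompt_text, verbalizers = compose_prompt("Mark Twain")
print(prompt_text)
# Mark Twain is the [mask] entity. Mark Twain entity_1 [mask] entity_2.
```

Each [mask] position is scored only against its own verbalizer set, which is how the logical rule constrains the LLM's predictions per sub-task.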

Lastly, the rule-based prompt method composes sub-prompts of various sub-problems into a complete task-specific prompt,

最后,基于规则的提示方法将各种子问题的子提示组合成一个完整的任务特定提示。

$$
f_{prompt}(x)=\begin{cases} T(x)=[T^{1}(x);...;T^{f}(x);...;T^{k}(x)],\\ V[mask]_{1} = \{v^{1}_{1}, v^{1}_{2},...\},\ V[mask]_{2} = \{v^{2}_{1}, v^{2}_{2},...\},\ ...,\ V[mask]_{k} = \{v^{k}_{1}, v^{k}_{2},...\}. \end{cases}
$$

$$
f_{prompt}(x)=\begin{cases} T(x)=[T^{1}(x);...;T^{f}(x);...;T^{k}(x)],\\ V[mask]_{1} = \{v^{1}_{1}, v^{1}_{2},...\},\ V[mask]_{2} = \{v^{2}_{1}, v^{2}_{2},...\},\ ...,\ V[mask]_{k} = \{v^{k}_{1}, v^{k}_{2},...\}. \end{cases}
$$

where $[\cdot;\cdot;\cdot]$ is the aggregation function of sub-templates. The learning objective of the rule-based prompt method is

其中 $[\cdot;\cdot;\cdot]$ 是子模板的聚合函数。基于规则的提示方法的学习目标是

$$
\max_{\Phi}\log p_{\Phi}(y|x)=\log\prod_{f=1}^{r} p_{\Phi}\Big([mask]_{f}=LM_{\Phi}(y)\mid T(x)\Big)
$$

$$
\max_{\Phi}\log p_{\Phi}(y|x)=\log\prod_{f=1}^{r} p_{\Phi}\Big([mask]_{f}=LM_{\Phi}(y)\mid T(x)\Big)
$$

where $r$ is the number of masked positions in $T(x)$, and $[mask]_{f}=LM_{\Phi}(y)$ maps the label $y$ to the set of label words $V[mask]_{f}$.

其中 $r$ 是 $T(x)$ 中掩码位置的数量,而 $\left[mask\right]_{f}=LM_{\Phi}(y)$ 的作用是将标签 $y$ 映射到标签词集合 $V[mask]_{f}$。

In our research context, predicting whether an individual has a specific mental disorder by directly utilizing an LLM and ultra-long user-generated content as inputs (since the task is at the individual level) presents a significant challenge. As mentioned, various mental disorders exhibit distinct or sometimes similar symptoms, risk factors, and treatments. Therefore, the rule-based prompt method is an efficient way to design sub-prompts to capture different aspects of mental disorders (e.g., symptoms, risk factors, and treatments) to simplify the detection task using user-generated content. Furthermore, the rule-based prompt method is an ideal method to incorporate the existing domain knowledge which is widely available and essential in mental disorder diagnosis and healthcare. Hence, in this study, we attempt to encode and incorporate existing medical knowledge by proposing a new rule-based prompt engineering method for improved mental disorder detection performance. Specifically, our key innovations include: (1) modifying the logic rules implemented in the original method (Eq. 4) and the learning goal of the rule-based prompt method (Eq. 6) to transfer it to the mental disorder detection task; (2) exploring an effective mechanism to inject existing medical knowledge of mental disorder detection in the rule-based prompt engineering process (Eq. 5), and (3) seamlessly integrating the learning objective of the rule-based prompt method (Eq. 6) with other prompt learning goals through a multi-prompt approach (i.e., prompt ensemble and prompt composition) to more effectively address the challenges associated with mental disorder detection.

在我们的研究背景下,直接利用大语言模型(LLM)和超长用户生成内容作为输入(由于任务处于个体层面)来预测个体是否患有特定精神障碍,是一项重大挑战。如前所述,各类精神障碍表现出不同或有时相似的症状、风险因素和治疗方法。因此,基于规则的提示(prompt)方法能高效设计子提示来捕捉精神障碍的不同方面(如症状、风险因素和治疗方案),从而简化基于用户生成内容的检测任务。此外,基于规则的提示方法也是整合现有领域知识的理想方式——这些知识在精神障碍诊断和医疗保健中既广泛可得又至关重要。为此,本研究尝试通过提出新型基于规则的提示工程方法,对现有医学知识进行编码整合,以提升精神障碍检测性能。具体而言,我们的核心创新包括:(1) 改进原方法中的逻辑规则(公式4)和基于规则提示方法的学习目标(公式6),使其适配精神障碍检测任务;(2) 探索在基于规则的提示工程过程中注入现有精神障碍检测医学知识的有效机制(公式5);(3) 通过多提示方法(即提示集成和提示组合)将基于规则提示方法的学习目标(公式6)与其他提示学习目标无缝整合,以更有效应对精神障碍检测的相关挑战。
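A minimal sketch of how such ontology-driven sub-prompts might be instantiated, with one sub-prompt per aspect of the disorder ontology (the aspect names and template wording are hypothetical illustrations, not the paper's actual prompts):

```python
# Hypothetical adaptation of the rule-based method to a disorder d_j: one sub-prompt
# per ontology aspect (symptoms, risk factors, treatments); wording is illustrative.
ASPECT_TEMPLATES = {
    "symptom":     'Post: "{post}". It mentions a symptom of {disorder}: [mask].',
    "risk_factor": 'Post: "{post}". It mentions a risk factor for {disorder}: [mask].',
    "treatment":   'Post: "{post}". It mentions a treatment for {disorder}: [mask].',
}

def build_subprompts(post, disorder):
    """Instantiate one sub-prompt p(y^f | x) for each aspect of the ontology O_j."""
    return {aspect: tmpl.format(post=post, disorder=disorder)
            for aspect, tmpl in ASPECT_TEMPLATES.items()}

prompts = build_subprompts("I can't sleep and feel worthless", "depression")
print(len(prompts))  # one sub-prompt per ontology aspect -> 3
```

Each sub-prompt yields a separate probability from the LLM, and the logical rule then combines the three sub-task outcomes into the final disorder-level prediction.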

2.3. Key Novelties

2.3. 关键创新点

From the perspective of design science, we make three technical contributions with our main IT artifact developed for mental disorder detection using textual data. First, we present a novel framework grounded in LLMs and prompt engineering, facilitating the few-shot detection of multiple mental disorders through user-generated text content. Notably, this framework confers a significant advantage by obviating the necessity for an extensive volume of labeled training data or the intricate engineering of customized architectures for each distinct disease or research problem. The proposed framework can be extended to tasks related to detecting other mental disorders and chronic diseases, especially those exhibiting discernible characteristics within

从设计科学的角度出发,我们为基于文本数据的心理障碍检测开发了主要IT构件,并做出三项技术贡献。首先,我们提出一个基于大语言模型和提示工程的新框架,通过用户生成文本内容实现多种心理障碍的少样本检测。值得注意的是,该框架具有显著优势:既无需大量标注训练数据,也不必为每种特定疾病或研究问题复杂地定制架构。所提框架可扩展至其他心理障碍和慢性疾病的检测任务,尤其是那些在...

user-generated textual content. Second, within our framework, we propose a multi-prompt engineering approach, effectively synergizing various continuous prompt engineering techniques, including prefix tuning and rule-based prompt engineering. This strategic amalgamation is specifically tailored to address the unique technical challenges within the healthcare domain. It involves the utilization of personalized prompts and the integration of existing medical domain knowledge, thereby markedly enhancing the accuracy and efficacy of our method. Third, as an integral component of our framework, we propose a new rule-based prompt engineering method, adept at efficiently dissecting complex textual content-based detection problems. This method seamlessly integrates domain knowledge existing in the ontology format—one of the widely adopted formats for domain knowledge. The design principle extends to other research problems necessitating the decomposition of challenging tasks and maximizes the utilization of LLMs' potential to address real-world challenges.

用户生成的文本内容。其次,在我们的框架中,我们提出了一种多提示工程方法,有效协同了包括前缀调优和基于规则的提示工程在内的多种连续提示工程技术。这种策略性融合专门针对医疗领域的独特技术挑战而设计,涉及个性化提示的使用以及现有医学领域知识的整合,从而显著提升了我们方法的准确性和有效性。第三,作为我们框架的一个组成部分,我们提出了一种新的基于规则的提示工程方法,能够高效剖析基于复杂文本内容的检测问题。该方法无缝整合了以本体论格式(领域知识广泛采用的格式之一)存在的领域知识。其设计原则可延伸至其他需要分解复杂任务的研究问题,并最大限度地利用大语言模型的潜力来解决现实世界的挑战。

3. RESEARCH DESIGN

3. 研究设计


Figure 2. Research Design

图 2: 研究设计

In this study, we introduce a novel multi-prompt engineering method for detecting mental disorders through user-generated textual content. The innovative design of our multi-prompt engineering method aims to tackle two technical challenges: (1) personalized prompts for individual users and each mental disorder, capturing the unique characteristics and underlying patterns of each user, and (2) integrating prompts with structured medical knowledge to contextualize the task, which instructs the LLMs on the learning objectives and operationalizes prediction goals. Subsequently, the outcomes of the prompts serve as the input for an LLM, which determines whether the targeted user exhibits signs of a mental disorder. The flowchart of our method is shown in Figure 2.

本研究提出了一种新颖的多提示工程方法,用于通过用户生成的文本内容检测心理障碍。我们多提示工程方法的创新设计旨在解决两个技术挑战:(1) 为每位用户和每种心理障碍定制个性化提示,捕捉每位用户的独特特征和潜在模式;(2) 将提示与结构化医学知识相结合,使任务情境化,从而指导大语言模型的学习目标并操作化预测任务。随后,这些提示的结果将作为大语言模型的输入,用于判断目标用户是否表现出心理障碍迹象。图2展示了我们方法的流程图。

3.1. Problem Formulation

3.1. 问题表述

We focus on user-generated textual content on online platforms (e.g., Reddit, Twitter, etc.), which potentially encompasses each user's self-reported information relevant to mental disorder detection. Each textual post is denoted as $x_{i}$. We collect data from a user base $U$ with $l$ users. For a given period of time, we observe the user-generated content of the focal user $u\in U$ from $N_{u}$ text posts, denoted by $x_{u}=(x_{1},x_{2},...,x_{N_{u}})$, ordered in time, with $N_{u}=\{1,2,3,...\}$ as each user has an arbitrary number of posts. Each user $u$ may suffer from one or more mental disorders in a disease set $D=\{d_{1},...,d_{j},...,d_{N}\}$. For each disease $d_{j}$, an ontology $O_{j}$ can be constructed to depict the symptoms, risk factors, and treatments of the disease $d_{j}$. Given a user's text posts $(x_{1},x_{2},...,x_{N_{u}})$ and a target disease $d_{j}$, we aim to design a new multi-prompt function $f_{prompt}(x_{1},x_{2},...,x_{N_{u}})$ to address two technical challenges: personalized prompts and prompts integrated with medical knowledge $O_{j}$. As the foundation of our approach, we build upon LLMs, denoted as $LM_{\Phi}$ with parameters $\Phi$. The prediction outcome is $y_{d_{j}}=\{0,1\}$, where 1 suggests that the focal user $u$ suffers or will suffer from the target disease $d_{j}$. Formally, the mental disorder $d_{j}$ detection problem is a binary probabilistic classification problem (Eq. 7), which applies to all diseases in $D$.

我们关注在线平台(如Reddit、Twitter等)上的用户生成文本内容,这些内容可能包含每位用户自我报告的与精神障碍检测相关的信息。每个文本帖子表示为$x_{i}$。我们从包含$l$个用户的用户群$U$中收集数据。在给定时间段内,我们观察焦点用户$u\in U$生成的$N_{u}$个文本帖子,按时间顺序记为$x_{u}=(x_{1},x_{2},...,x_{N_{u}})$,且$N_{u}=\{1,2,3,...\}$表示每位用户的帖子数量不定。每位用户$u$可能患有疾病集合$D=\{d_{1},...,d_{j},...,d_{N}\}$中的一种或多种精神障碍。对于每种疾病$d_{j}$,可以构建本体$O_{j}$来描述该疾病的症状、风险因素和治疗方法。给定用户的文本帖子$(x_{1},x_{2},...,x_{N_{u}})$和目标疾病$d_{j}$,我们旨在设计一个新的多提示函数$f_{prompt}(x_{1},x_{2},...,x_{N_{u}})$来解决两个技术挑战:个性化提示和整合医学知识$O_{j}$的提示。作为方法基础,我们基于参数为$\Phi$的大语言模型$LM_{\Phi}$构建方案。预测结果为$y_{d_{j}}=\{0,1\}$,其中1表示焦点用户$u$患有或将患有目标疾病$d_{j}$。形式上,精神障碍$d_{j}$检测问题是一个二元概率分类问题(公式7),适用于集合$D$中的所有疾病。

$$
\widehat{y_{d_{j}}}=\arg\max_{y_{d_{j}}} p\Big(y_{d_{j}}\mid LM_{\Phi}\big(f_{prompt}(x_{1},x_{2},...,x_{N_{u}})\big)\Big)
$$

$$
\widehat{y_{d_{j}}}=\arg\max_{y_{d_{j}}} p\Big(y_{d_{j}}\mid LM_{\Phi}\big(f_{prompt}(x_{1},x_{2},...,x_{N_{u}})\big)\Big)
$$
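For intuition, Eq. 7 reduces to the following decision rule in the binary case (a trivial sketch; `prob_positive` stands in for the probability the LLM assigns to the positive label given the prompted input):

```python
# Eq. 7 as a decision rule (minimal sketch): choose the label y in {0, 1} with the
# higher probability returned by the LLM for the prompted input.
def detect(prob_positive):
    """argmax over y of p(y | LM_Phi(f_prompt(x_1, ..., x_Nu)))."""
    return 1 if prob_positive >= 0.5 else 0

print(detect(0.73), detect(0.20))  # 1 0
```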

3.2. Multi-prompt Engineering with Personalization and Knowledge Injection

3.2 多提示工程与个性化及知识注入

As outlined in the literature review, the objective of the prompt engineering function, $f_{p r o m p t}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)$ is to leverage the capabilities of LLMs while streamlining the complexity of disease- or problem-specific model design. The key scholarly contribution of this study lies in the development of $f_{p r o m p t}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)$.

如文献综述所述,提示工程函数 $f_{p r o m p t}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)$的目标是充分利用大语言模型的能力,同时简化针对特定疾病或问题的模型设计复杂性。本研究的关键学术贡献在于开发了 $f_{p r o m p t}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)$。

3.2.1. Automated Continuous Dynamic Prefix Tuning for Personalized Prompts

3.2.1. 个性化提示的自动连续动态前缀调优

When patients describe their experiences on social media, their patterns and personas are distinct from each other. This motivates us to design a personalized prompt for each user for the mental disorder detection problem. We leverage the attributes of prefix tuning (Li and Liang 2021), where each prefix vector is customized for individual input simultaneously, and adapt it to our multi-prompt method to achieve this goal. Specifically, we designate a one-dimensional vector $v$ of length $k$ for each user $u$:

当患者在社交媒体上描述他们的经历时,他们的行为模式和人物特征各不相同。这促使我们为每位用户设计个性化的提示(prompt)来解决心理健康障碍检测问题。我们借鉴了前缀调优(prefix tuning)(Li and Liang 2021)的属性(每个前缀向量同时为单个输入定制),并将其适配到我们的多提示方法中以实现这一目标。具体而言,我们为每个用户$u$指定一个长度为$k$的一维向量$v$

$$
f_{prompt\_prefix}(x)=[v\oplus x]
$$

$$
f_{prompt\_prefix}(x)=[v\oplus x]
$$

where $\oplus$ denotes concatenation and $k$ is a hyper-parameter. In the user base $U$ with $l$ users, a trainable matrix $P$ (of dimension $l\times(k+LM_{\Phi}\text{-}tokenizing(x))$) will be parameterized using a feedforward neural network, $P_{\theta}=MLP_{\theta}$, during training, where $\theta$ denotes the trainable parameters of the $MLP$, which is also used to create unseen users' $f_{prompt\_prefix}(x)$. Each row of $P$ can be trained (i.e., reparameterized) simultaneously during the training process to reflect each user's unique characteristics, allowing for personalized prompts. Consequently, the expected feedback from the LLM is

其中 $\oplus$ 表示拼接操作,$k$ 是一个超参数。在包含 $l$ 个用户的用户库 $U$ 中,一个可训练的矩阵 $P$(维度为 $l\times(k+LM_{\Phi}\text{-}tokenizing(x))$)将通过前馈神经网络进行参数化,即 $P_{\theta}=MLP_{\theta}$。在训练过程中,$\theta$ 是 $MLP$ 的可训练参数,并用于生成未见用户的 $f_{prompt\_prefix}(x)$。矩阵 $P$ 的每一行都可以在训练过程中同时进行训练(即重新参数化),以反映每个用户的独特特征,从而实现个性化提示。因此,大语言模型的预期反馈为
$$
z_i =\begin{cases} P_{\theta}[u, i], & \text{if } i \leq k, \\ LM_{\phi}(f_{prompt}(x), z_{<i}), & \text{o.w.} \end{cases}
$$

$$
z_i =\begin{cases} P_{\theta}[u, i], & \text{if } i \leq k, \\ LM_{\phi}(f_{prompt}(x), z_{<i}), & \text{o.w.} \end{cases}
$$

where $i$ indexes the $i$-th position of $z$. When $i\leq k$, $z_{i}$ is copied directly from $P_{\theta}$; when $i>k$, $z_{i}$ still depends on $P_{\theta}$, as $f_{prompt}(x)$ (Section 3.2.3) relies on the activations of the preceding feedback.

其中 $i$ 表示 $z$ 的第 $i$ 个位置。当 $i\leq k$ 时,$z_{i}$ 直接从 $P_{\theta}$ 复制;当 $i>k$ 时,$z_{i}$ 仍依赖于 $P_{\theta}$,因为 $f_{prompt}(x)$(第3.2.3节)依赖于前序反馈的激活状态。
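The per-user prefix lookup can be sketched as follows (a toy illustration with hypothetical dimensions; real prefix entries are embedding vectors, and rows for unseen users would come from the MLP reparameterization, which is omitted here):

```python
import random

random.seed(1)

K = 4  # prefix length k (hyperparameter)
users = ["u1", "u2", "u3"]

# Rows of the trainable matrix P: one prefix vector per seen user (hypothetical
# scalar "embeddings" for brevity).
P = {u: [random.uniform(-1, 1) for _ in range(K)] for u in users}

def prompt_prefix(user, token_ids):
    """f_prompt_prefix(x) = [v (+) x]: the user's prefix row concatenated with x."""
    return P[user] + list(token_ids)

z = prompt_prefix("u1", [101, 102, 103])
print(len(z))  # K + 3 = 7
```

Because each user indexes a distinct row of `P`, the first $k$ positions of $z$ differ per user, which is exactly what makes the prompt personalized.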

3.2.2. A New Rule-Based Prompt for Injecting Structural Medical Knowledge

3.2.2 注入结构化医学知识的新型基于规则的提示

The mental disorder detection task is at the subject level using all the posts in a time period. Therefore, compared to other NLP tasks, the unique challenge of this task lies in the input to a machine learning prediction model, $x=\bigg(x_{_ 1},x_{_ 2},...,x_{_ {N_{u}}}\bigg),$ which contains ultra-long user-generated unstructured text content with variable lengths (see section 4.1 Table 4, around 1,583,227 ~ 44,077,018 tokens per user). However, extracting valid information from these data is challenging for traditional machine learning models that rely on feature engineering. On the other hand, conventional end-to-end deep learning models may not be able to remember and learn from ultra-long dependencies from $x=\bigg(x_{_ 1},x_{_ 2},...,x_{_ {N_{u}}}\bigg)$ (e.g., RNNs) or be constrained by a fixed sequence length established during training (e.g., Transformers), making handling a large amount of arbitrary number text posts challenging.

精神障碍检测任务是在一段时间内使用所有帖子在受试者层面进行的。因此,与其他NLP任务相比,该任务的独特挑战在于机器学习预测模型的输入$x=\bigg(x_{_ 1},x_{_ 2},...,x_{_ {N_{u}}}\bigg)$包含超长的用户生成非结构化文本内容,且长度可变(参见第4.1节表4,每位用户约1,583,227~44,077,018个token)。然而,对于依赖特征工程的传统机器学习模型而言,从这些数据中提取有效信息具有挑战性。另一方面,传统的端到端深度学习模型可能无法记忆和学习$x=\bigg(x_{_ 1},x_{_ 2},...,x_{_ {N_{u}}}\bigg)$中的超长依赖关系(例如RNN),或受限于训练时建立的固定序列长度(例如Transformer),这使得处理大量任意数量的文本帖子具有挑战性。

For standard prompt engineering methods, even with the assistance of highly potent LLMs, it remains a challenging task. This is due to (1) the presence of a significant amount of noise in the user-generated content (e.g., the text content can be unrelated to mental disorders, or similar symptoms shared across various mental disorders), making the prediction task difficult. (2) Considering the limited memory capacity of LLMs based on the number of parameters, most LLMs are insufficient to handle the extensive input required for user-level mental disorder detection. Even if the LLMs can handle such inputs, these models typically have a large number of parameters, imposing significant costs in applications. (3) If we employ a method, such as a discrete prompt that utilizes natural language and expects only binary outcomes (e.g., 1/0), it

对于标准的提示工程方法,即使借助高性能大语言模型,这仍然是一项具有挑战性的任务。原因在于:(1) 用户生成内容中存在大量噪声(例如文本内容可能与精神障碍无关,或不同精神障碍表现出相似症状),导致预测任务困难。(2) 考虑到基于参数数量的大语言模型记忆容量有限,大多数大语言模型不足以处理用户级精神障碍检测所需的大量输入。即便模型能够处理此类输入,这些模型通常参数量庞大,会带来高昂的应用成本。(3) 若采用离散提示等自然语言处理方法且仅预期二元输出(如1/0)

becomes challenging for stakeholders to determine which specific part of the user-generated content leads the LLM to conclude that the subject has a particular mental disorder. Motivated by these challenges, we propose a new rule-based prompt engineering method with three design principles, differing from existing approaches.

对于利益相关者来说,确定用户生成内容中哪一部分导致大语言模型得出受试者患有特定精神障碍的结论变得具有挑战性。基于这些挑战,我们提出了一种新的基于规则的提示工程方法,该方法遵循三个设计原则,与现有方法不同。

Design principle 1. Instead of using $x=(x_{1},x_{2},...,x_{N_{u}})$ as the input, we expand the individual elements $x_{i}$ within $x$ and concatenate them into a long list of tokens, $x=\{t_{1},t_{2},...,t_{|x|}\}$, where $t_{j}$ indicates a single token and $|x|$ is the magnitude of $x$. We then design a sliding window $\bar{x}$ and assume $m$ sliding windows in total. Therefore, a sliding window $\bar{x}$ of size $w$ can be represented as $x[i;\ i+w]$, for $i=\{0,w,2w,...,(m-1)w\}$. The number $m$ and the size $w$ of moving windows are hyper-parameters correlated and constrained by $|x|$. Empirically, let $|\bar{x}|\ll|x|$, and the content in the sliding window $\bar{x}=\{t_{i},...,t_{i+w-1}\}$ is used as the new input of $f_{prompt}(\bar{x})$ and $LM_{\Phi}$, significantly reducing the negative impacts caused by ultra-long sequences in user-level disease detection tasks.

设计原则1:我们不直接使用$x=(x_{1},x_{2},...,x_{N_{u}})$作为输入,而是将$x$中的各个元素$x_{i}$展开并拼接成一个长token序列$x=\{t_{1},t_{2},...,t_{|x|}\}$,其中$t_{j}$表示单个token,$|x|$表示$x$的规模。随后设计一个滑动窗口$\bar{x}$,假设共有$m$个滑动窗口。因此,大小为$w$的滑动窗口$\bar{x}$可表示为$x[i;\ i+w]$,其中$i=\{0,w,2w,...,(m-1)w\}$。移动窗口的数量$m$和大小$w$是与$|x|$相关且相互约束的超参数。经验表明,当$|\bar{x}|\ll|x|$时,将滑动窗口内容$\bar{x}=\{t_{i},...,t_{i+w-1}\}$作为$f_{prompt}(\bar{x})$和大语言模型的新输入,能显著降低用户级疾病检测任务中超长序列带来的负面影响。
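Design principle 1 amounts to a simple non-overlapping chunking step, which can be sketched as:

```python
def sliding_windows(tokens, w):
    """Split the concatenated token list x = {t_1, ..., t_|x|} into windows
    x[i : i + w] for i = 0, w, 2w, ..., so each window is far shorter than |x|."""
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]

tokens = list(range(10))               # toy concatenation of one user's posts
windows = sliding_windows(tokens, w=4)
print(windows)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Here the number of windows $m$ falls out of $|x|$ and $w$ (the last window may be shorter), and each window is then prompted and scored independently by the LLM.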

Another significant advantage of this approach is that the $LM_{\Phi}$ returns a probability for each sliding window $\bar{x}$, which reflects Eq. 7, i.e., the probability that the LLM considers $\bar{x}$ to be related to a mental disorder. In Section 4.5, we leverage this feature to interpret our prediction results. Additionally, through Design principles 2 and 3 discussed below, we can further determine which aspects of existing medical knowledge are related to $\bar{x}$, thus facilitating stakeholders in utilizing our method and results.

该方法另一个显著优势在于 $L M_{\Phi}$ 会为每个 $x$ 返回反映式(7)的概率值,即大语言模型判定 $x$ 与精神障碍相关的概率。在第4.5节中,我们利用这一特性来解释预测结果。此外,通过下文讨论的设计原则2和3,还能进一步确定现有医学知识中哪些方面与x相关,从而帮助利益相关方运用我们的方法和成果。

Design principle 2. Instead of directly relying on $L M_{\Phi}$ to determine whether a user $u$ has a mental disorder, we can break down this task into three sub-tasks: assessing whether a focal user $u$ exhibits: (1) symptoms, such as anxiety, fatigue, low mood, reduced self-esteem, change in appetite or sleep, suicide attempt, etc. (APA 2022, Martin et al. 2006, Rush et al. 2003); (2) major life event changes, such as divorce, body shape change, violence, abuse, drug or alcohol use, and so on (Beck and Alford 2014); and (3) treatments, such as medication, therapy, or a combination of these two (Beck and Alford 2014). The construction of this ontology is supported by medical literature and has been rigorously validated through quantitative methods and expert evaluation, as detailed in Appendix 1.

设计原则2:我们不是直接依赖 $L M_{\Phi}$ 来判断用户 $u$ 是否存在精神障碍,而是将该任务分解为三个子任务:评估目标用户 $u$ 是否表现出 (1) 症状 (如焦虑、疲劳、情绪低落、自尊心下降、食欲或睡眠改变、自杀企图等) (APA 2022, Martin et al. 2006, Rush et al. 2003);(2) 重大生活事件变化 (如离婚、体型改变、暴力、虐待、吸毒或酗酒等) (Beck and Alford 2014);(3) 治疗方式 (如药物治疗、心理治疗或二者结合) (Beck and Alford 2014)。该本体的构建得到医学文献支持,并通过定量方法和专家评估进行严格验证 (详见附录1)。

We decompose the task of detecting subject-level mental disorders into three subtasks, relying on the following assumptions: if the subject $u$ displays an increased number of symptoms, self-reports a greater number of life events that may cause or exacerbate disease $d_{_ {j}},$ or discusses the use of treatments associated with the disease $d_{_ {j}},$ the accumulation of such mentions suggests a higher likelihood that the subject $u$ currently suffers from or will suffer from the target disease $d_{_ {j}}$.

我们将检测个体层面精神障碍的任务分解为三个子任务,基于以下假设:若受试者$u$表现出更多症状、自述更多可能引发或加剧疾病$d_{_ {j}}$的生活事件、或提及与疾病$d_{_ {j}}$相关的治疗手段,这些叙述的累积表明受试者$u$当前或未来罹患目标疾病$d_{_ {j}}$的可能性更高。

Meanwhile, the reason for decomposing the task into these three subtasks is that, in user-generated text, these three aspects are often self-reported by users with mental disorders and are detectable (Coppersmith et al. 2015, Nadeem 2016, W. Zhang et al. 2024). It is worth noting that there are other indicators for detecting mental disorders, such as family history, genetics, and poor nutrition. However, since our research context revolves around user-generated text content, we concentrate on factors that are detectable from text.

同时,将任务分解为这三个子任务的原因是,在用户生成的文本中,这三个方面通常由心理健康障碍患者自行报告且可被检测到 (Coppersmith et al. 2015, Nadeem 2016, W. Zhang et al. 2024)。值得注意的是,检测心理健康障碍还存在其他指标,如家族史、遗传因素和营养不良等。但由于我们的研究场景围绕用户生成的文本内容展开,因此重点关注可被文本检测的因素。

We aim to design an inclusive method that can (1) identify users with easily noticeable signs of disease $d_{j}$ who are currently undergoing treatment and (2) provide early detection for users at risk of disease $d_{j}$ in the future, which includes detecting symptoms and life events that may exacerbate depression. In the disease detection task, treatment entities play a significant role. This is because, within the population of individuals discussing disease $d_{j}$, there is a subset of users who have already received clinical diagnoses and have undergone various treatments. When a user openly discusses treatments of disease $d_{j}$, it strongly indicates that the user likely has disease $d_{j}$. To ensure the comprehensiveness of our mental disorder detection method, which aims to identify as many patients as possible, treatment-related entities can be highly effective. On the other hand, for the early disease detection task, our method leverages other disease-related factors, including symptoms and major life event changes.

我们旨在设计一种包容性方法,能够:(1) 识别当前正在接受治疗且具有明显疾病$d_{_ j{}}$症状的用户,(2) 为未来可能患$d_{_ j{}}$疾病风险的用户提供早期检测,包括检测可能加剧抑郁的症状和生活事件。在疾病检测任务中,治疗实体起着重要作用。这是因为在讨论疾病$d_{_ {j}}$的人群中,存在一部分已获得临床诊断并接受过各种治疗的用户。当用户公开讨论$d_{j}$疾病的治疗时,强烈表明该用户很可能患有$d_{_j}$疾病。为确保我们旨在识别尽可能多患者的精神障碍检测方法的全面性,治疗相关实体可以非常有效。另一方面,在早期疾病检测任务中,我们的方法利用其他疾病相关因素进行早期检测,包括症状和重大生活事件变化。

Formally, we define the following logic rule:

我们定义如下逻辑规则:

$$
p(y_{d_{j}}^{symptom}|\overline{x})\vee p(y_{d_{j}}^{life\_event}|\overline{x})\vee p(y_{d_{j}}^{treatment}|\overline{x})\rightarrow p(y_{d_{j}}|\overline{x})
$$

$$
p(y_{d_{j}}^{symptom}|\overline{x})\vee p(y_{d_{j}}^{life\_event}|\overline{x})\vee p(y_{d_{j}}^{treatment}|\overline{x})\rightarrow p(y_{d_{j}}|\overline{x})
$$

where $\vee$ is the logical connective "or". The logical "or" ($\vee$) is inclusive, meaning that at least one of the $p(y_{d_{j}}^{f}\mid\overline{x})$, where $f\in\{symptom, life\_event, treatment\}$, must be true for the compound proposition, $p(y_{d_{j}}|\overline{x})=1$, to be true. Practically, $p(y_{d_{j}}^{f}|\overline{x})$ represents the probability feedback from $LM_{\Phi}$, indicating whether $LM_{\Phi}$ judges $\overline{x}$ to be associated with the sub-task $f$. In actual calculations, for a user $u$, if more $\overline{x}$ within $x$ are determined by $LM_{\Phi}$ to be a "symptom," "life event," or "treatment" of mental disorder $d_{j}$, our framework will predict a correspondingly higher probability that this focal user $u$ has mental disorder $d_{j}$.

其中 $\vee$ 是逻辑连接词"或"。逻辑"或"($\vee$)是包含性的,这意味着要使复合命题 $p(y_{d_{j}}|\overline{x})=1$ 为真,至少需要 $p(y_{d_{j}}^{f}\mid\overline{x})$ 中的一个为真,其中 $f\in\{$症状(symptom)、生活事件(life_event)、治疗(treatment)$\}$。实际上,$p(y_{d_{j}}^{f}|\overline{x})$ 表示来自 $LM_{\Phi}$ 的概率反馈,表明 $LM_{\Phi}$ 是否判断 $\overline{x}$ 与子任务 $f$ 相关联。在实际计算中,对于用户 $u$,如果 $x$ 中有更多 $\overline{x}$ 被 $LM_{\Phi}$ 判定为精神障碍 $d_{j}$ 的"症状""生活事件"或"治疗",我们的框架预测该焦点用户 $u$ 患有精神障碍 $d_{j}$ 的概率相应更高。
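A minimal sketch of applying the inclusive-or rule to the per-window sub-task probabilities returned by $LM_{\Phi}$ (the 0.5 threshold and the probability values are illustrative assumptions, not from the paper):

```python
def or_rule(window_probs, threshold=0.5):
    """Inclusive logical 'or' over the three sub-task probabilities:
    a window x_bar supports p(y_dj | x_bar) if at least one of
    p(y^symptom), p(y^life_event), p(y^treatment) exceeds the threshold."""
    return any(p >= threshold for p in window_probs.values())

# Illustrative LM feedback for one window (cf. the divorce example below).
probs = {"symptom": 0.596, "life_event": 0.8789, "treatment": 0.0001}
flag = or_rule(probs)  # True: this window is evidence for the disorder
```

Because the rule is inclusive, a window mentioning only a treatment (and no symptoms) still counts as evidence, which matches the comprehensiveness goal stated above.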

Design principle 3. Having $LM_{\Phi}$ directly determine whether $\overline{x}$ is a "symptom," "life event," or "treatment" of a mental disorder $d_{j}$ remains a challenging task, as the features of $d_{j}$ can be highly specific and complex, yet certain features of different mental disorders can be remarkably similar. For instance, feelings of excessive guilt or self-blame are often linked with depression but can also manifest in other disorders, including PTSD, anorexia, and self-harm. Accurately distinguishing between different mental disorders and providing effective interventions is crucial. For depression, it is important to create a friendly environment and offer information related to treatment. For PTSD, it is essential to avoid triggering information related to trauma. In the case of anorexia nervosa and self-harm, users may be at risk of life-threatening situations, necessitating more immediate help and intervention.

设计原则3:让 $L M_{\Phi}$ 直接判断 $x$ 是精神障碍 $d_{_ j{}}$ 的"症状"、"生活事件"还是"治疗手段"仍具挑战性,因为 $d_{_j{}}$ 的特征可能高度特异且复杂,但某些精神障碍的特征有时会惊人地相似。例如,过度愧疚或自责感通常与抑郁症相关,但也可能出现在PTSD、厌食症和自残等其他障碍中。准确区分不同精神障碍并提供有效干预至关重要:针对抑郁症需营造友好环境并提供治疗相关信息;处理PTSD时必须避免触发创伤相关信息;对于神经性厌食症和自残行为,用户可能面临生命危险,需要更即时的帮助与干预。

We can further enhance the performance of $LM_{\Phi}$ by employing prompt engineering to clearly instruct $LM_{\Phi}$ on the specific characteristics of the {symptom, life_event, treatment} associated with $d_{j}$. It is noteworthy that the specificity of these three aspects in various mental disorders is well-documented in the medical literature. If appropriately integrated, such existing medical knowledge can significantly alleviate the challenges faced by $LM_{\Phi}$ and predictive models in detecting mental disorders at the user level using user-generated text content. The injection of medical knowledge into prompt design, therefore, is a significant and promising direction.

我们可以通过提示工程 (prompt engineering) 进一步优化 $L M_{_ \phi}$ 的表现,明确指导 $L M_{\Phi}$ 识别与 $d_{_ j}$ 相关的 {症状、生活事件、治疗方式} 具体特征。值得注意的是,这三种要素在不同精神障碍中的特异性已在医学文献中得到充分记载。若能恰当整合,这类现有医学知识将大幅缓解 $L M_{\Phi}$ 和预测模型在用户层级通过生成文本内容检测精神障碍时所面临的挑战。因此,将医学知识注入提示设计是一个重要且前景广阔的研究方向。

Accordingly, to leverage medical domain knowledge for mental disorder detection, we adhere to previous studies and adopt the established mental disorder ontology $O_{j}$ for each disease $d_{j}$ that explicitly explains the terminologies used in disease $d_{j}$'s diagnosis and treatments (W. Zhang et al. 2024). The ontology $O_{j}$ focuses on specific aspects of disease $d_{j}$, particularly the medical terminologies used in diagnosing disease $d_{j}$ that are possible to detect from user-generated textual content, formally denoted as $O_{j\_f}$, $f\in\{symptom, life\_event, treatment\}$. The purpose of the $O_{j\_f}$ ontology is to facilitate the detection of symptoms, major life events, and treatments from user-generated text content. Based on an extensive literature review (APA 2022, Beck and Alford 2014, Martin et al. 2006, Rush et al. 2003), a list of concepts $o_{j_i}$ related to $d_{j}$'s diagnosis (e.g., dejected mood, self-blame, fatal illness, psychotherapy, etc.) is compiled. Next, we organize the terminologies $o_{j_i}$ into three classes for each disease $d_{j}$: symptom ($O_{j\_symptom}$, a collection of symptoms), life event ($O_{j\_life\_event}$, major life event changes that may exacerbate $d_{j}$), or treatment ($O_{j\_treatment}$, medications and therapies). Meanwhile, we determine the relationships between terminologies $o_{j_i}$ and classes $O_{j\_k}$ as $o_{j_i}$ : relation $O_{j\_k}$ (e.g., for "depression" and one of its symptoms "dejected mood", $o_{depression\_dejected\ mood}$ : is_a $O_{depression\_symptom}$).

为利用医学领域知识进行精神障碍检测,我们遵循先前研究,采用既定的精神障碍本体$O_{j}$来描述每种疾病$d_{j}$的诊断和治疗术语(W. Zhang et al. 2024)。该本体$O_{j}$聚焦于疾病$d_{j}$的特定方面,尤其是可从用户生成文本内容中检测到的诊断术语,形式化表示为$O_{j\_f}$,$f\in\{$症状(symptom)、生活事件(life event)、治疗(treatment)$\}$。$O_{j\_f}$本体的目的是从用户生成文本中识别症状、重大生活事件和治疗方案。

基于广泛文献综述(APA 2022, Beck and Alford 2014, Martin et al. 2006, Rush et al. 2003),我们汇编了与$d_{j}$诊断相关的概念列表$o_{j_i}$(如情绪低落、自责、绝症、心理治疗等)。随后将这些术语$o_{j_i}$按三类组织:症状($O_{j\_symptom}$,症状集合)、生活事件($O_{j\_life\_event}$,可能加剧病情的重大生活事件变化)和治疗方案($O_{j\_treatment}$,药物与疗法)。同时建立术语$o_{j_i}$与类别$O_{j\_k}$的关系:$o_{j_i}$ : relation $O_{j\_k}$(例如"抑郁症"与其症状"情绪低落"的关系表示为 $o_{depression\_dejected\ mood}$ : is_a $O_{depression\_symptom}$)。
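The ontology $O_{j}$ can be represented as a simple mapping from concept classes to terminology lists; the entries below are a hypothetical subset drawn from the depression examples in the text, not the full validated ontology:

```python
# O_j for d_j = "depression": three concept classes, each holding
# terminologies o_ji related to diagnosis.
O_depression = {
    "symptom":    ["anxiety", "dejected mood", "fatigue", "self-blame"],
    "life_event": ["divorce", "domestic violence"],
    "treatment":  ["supportive psychotherapy", "abilify"],
}

def relation(concept, klass, ontology):
    """Check o_ji : is_a O_jk, e.g. 'dejected mood' is_a O_depression_symptom."""
    return concept in ontology.get(klass, [])
```

The class membership encoded here is exactly what the verbalizer consumes later when mapping `[mask]` positions to label words.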

Adhering to the three design principles, we formulate our new rule-based prompting function as follows:

遵循这三项设计原则,我们将新的基于规则的提示函数定义如下:

$$
f_{prompt\_rule}(\bar{x}) =
\begin{cases}
T_{d_j}(\bar{x}) = [T_{d_j}^{symptom}(\bar{x});T_{d_j}^{life\_event}(\bar{x});T_{d_j}^{treatment}(\bar{x})], \\
V_{d_j}[mask]_{symptom} = \{o_{j_i}\}, o_{j_i} \in O_{j\_symptom}, \text{ when } p(y_{d_j}^{symptom} = 1|\bar{x}), \\
V_{d_j}[mask]_{life\_event} = \{o_{j_i}\}, o_{j_i} \in O_{j\_life\_event}, \text{ when } p(y_{d_j}^{life\_event} = 1|\bar{x}), \\
V_{d_j}[mask]_{treatment} = \{o_{j_i}\}, o_{j_i} \in O_{j\_treatment}, \text{ when } p(y_{d_j}^{treatment} = 1|\bar{x})
\end{cases}
$$

$$
f_{prompt\_rule}(\bar{x}) =
\begin{cases}
T_{d_j}(\bar{x}) = [T_{d_j}^{symptom}(\bar{x});T_{d_j}^{life\_event}(\bar{x});T_{d_j}^{treatment}(\bar{x})], \\
V_{d_j}[mask]_{symptom} = \{o_{j_i}\}, o_{j_i} \in O_{j\_symptom}, \text{ when } p(y_{d_j}^{symptom} = 1|\bar{x}), \\
V_{d_j}[mask]_{life\_event} = \{o_{j_i}\}, o_{j_i} \in O_{j\_life\_event}, \text{ when } p(y_{d_j}^{life\_event} = 1|\bar{x}), \\
V_{d_j}[mask]_{treatment} = \{o_{j_i}\}, o_{j_i} \in O_{j\_treatment}, \text{ when } p(y_{d_j}^{treatment} = 1|\bar{x})
\end{cases}
$$

where $T_{d_j}^{f}(\overline{x})=$ "$\overline{x}$ : relation $[mask]_{f}$ of $d_{j}$ $f$", $V_{d_j}[mask]_{f}=\{o_{j_i}\}$ denotes all the concepts in the ontology $O_{j}$ that belong to class $f$, and $f\in\{symptom, life\_event, treatment\}$.

其中 $T_{d_j}^{f}(\overline{x})=$ "$\overline{x}$ : relation $[mask]_{f}$ of $d_{j}$ $f$",$V_{d_j}[mask]_{f}=\{o_{j_i}\}$ 表示本体 $O_{j}$ 中属于类 $f$ 的所有概念,且 $f\in\{$症状, 生活事件, 治疗$\}$。

Existing medical knowledge, represented as ontology $O_{j}$, is injected into $f_{prompt\_rule}(\cdot)$ in two ways and aids $f_{prompt\_rule}(\overline{x})$ in instructing the $LM_{\Phi}$ to better accomplish the user-level mental disorder detection task: (1) the relation (: relation) between concept $o_{j_i}$ and concept class $O_{j\_k}$ is injected into the prompt template $T_{d_{j}}(\overline{x})$, therefore directly instructing the $LM_{\Phi}$ learning objective (i.e., filling the $[mask]_{f}$), connecting text $\overline{x}$, disease $d_{j}$, and the three aspects of disease $f$; (2) the concepts $o_{j_i}$ of ontology $O_{j}$ are injected into the verbalizer $V_{d_{j}}$, which projects the original prediction goals (i.e., whether $\overline{x}$ is a "symptom," "life event," or "treatment" of a mental disorder $d_{j}$) to a set of label words (i.e., $o_{j_i}$). As the prediction goal is a binary classification problem, the verbalizer words for negative examples are designed using manual verbalization methods, incorporating the most frequent words with the highest sentiment tendency.

现有医学知识以本体 $O_{j}$ 形式通过两种方式注入 $f_{prompt\_rule}(\cdot)$,辅助 $f_{prompt\_rule}(\overline{x})$ 指导 $LM_{\Phi}$ 更好地完成用户级精神障碍检测任务:(1) 概念 $o_{j_i}$ 与概念类 $O_{j\_k}$ 的关系 (: relation) 被注入提示模板 $T_{d_{j}}(\overline{x})$,从而直接指导 $LM_{\Phi}$ 的学习目标 (即填充 $[mask]_{f}$),连接文本 $\overline{x}$、疾病 $d_{j}$ 及疾病三方面特征 $f$;(2) 本体 $O_{j}$ 的概念 $o_{j_i}$ 被注入词表器 $V_{d_{j}}$,将原始预测目标 (即 $\overline{x}$ 是否为精神障碍 $d_{j}$ 的"症状""生活事件"或"治疗") 映射到一组标签词 (即 $o_{j_i}$)。由于预测目标是二分类问题,负例的词表器词汇采用人工词表化方法设计,结合了情感倾向最高的高频词。

Take “depression” as an example,

以“抑郁症”为例,

$$
\begin{align}
T_{depression}(\bar{x}) &= "\bar{x} \text{ is a } [mask]_{symptom} \text{ of depression symptom}; \bar{x} \text{ is a } [mask]_{life\_event} \text{ of depression life event}; \bar{x} \text{ is a } [mask]_{treatment} \text{ of depression treatment.}" \\
V_{depression}[mask]_{symptom} &= \{anxiety, dejected\ mood,...\} \\
V_{depression}[mask]_{life\_event} &= \{divorce, domestic\ violence,...\} \\
V_{depression}[mask]_{treatment} &= \{supportive\ psychotherapy, abilify,...\}
\end{align}
$$

$$
\begin{align}
T_{depression}(\bar{x}) &= "\bar{x} \text{ is a } [mask]_{symptom} \text{ of depression symptom}; \bar{x} \text{ is a } [mask]_{life\_event} \text{ of depression life event}; \bar{x} \text{ is a } [mask]_{treatment} \text{ of depression treatment.}" \\
V_{depression}[mask]_{symptom} &= \{anxiety, dejected\ mood,...\} \\
V_{depression}[mask]_{life\_event} &= \{divorce, domestic\ violence,...\} \\
V_{depression}[mask]_{treatment} &= \{supportive\ psychotherapy, abilify,...\}
\end{align}
$$
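Instantiating the depression template and verbalizer can be sketched as follows (the exact string layout and helper names are illustrative assumptions, not the paper's implementation):

```python
def build_template(x_bar, disease, classes):
    """Concatenate one cloze-style sub-template per class f:
    '<x_bar> is a [mask]_f of <disease> <f>.'"""
    parts = [f'"{x_bar}" is a [mask]_{f} of {disease} {f.replace("_", " ")}'
             for f in classes]
    return "; ".join(parts) + "."

verbalizer = {  # V_depression[mask]_f: label words drawn from the ontology
    "symptom": ["anxiety", "dejected mood"],
    "life_event": ["divorce", "domestic violence"],
    "treatment": ["supportive psychotherapy", "abilify"],
}
template = build_template("I feel so lost after my divorce",
                          "depression", list(verbalizer))
```

Each `[mask]_f` position is then scored by the LLM against the label words in `verbalizer[f]`, turning the cloze fill-in into the three sub-task probabilities.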

Essentially, $f_{prompt\_rule}(\overline{x})$ instructs the $LM_{\Phi}$ to evaluate whether $\overline{x}$ discloses disease $d_{j}$'s symptoms, life events, or treatments. This task is much simpler compared to the original user-level mental disorder detection task and, therefore, aids the $LM_{\Phi}$ in performance improvement. For instance,

本质上,$f_{prompt_rule}(\overline{{x}})$ 指示 $LM_{_ \phi}$ 评估 $x$ 是否披露了疾病 $d_{_ j{}}$ 的症状、生活事件或治疗方法。相比原始的用户级心理检测任务,这项任务要简单得多,因此有助于提升 $LM_{_\phi}$ 的性能。例如,

consider the case where $\overline{x}=$ "I feel so lost after my divorce...". The $f_{prompt\_rule}(\overline{x})$ directs the $LM_{\Phi}$ to discern whether $\overline{x}$ represents a symptom, major life event, or a treatment associated with depression. The feedback from the $LM_{\Phi}$ yields probabilities: $p(y_{depression}^{symptom}|\overline{x})=0.596$, $p(y_{depression}^{life\_event}|\overline{x})=0.8789$, and $p(y_{depression}^{treatment}|\overline{x})=0.0001$.

考虑这样一个案例:$\overline{x}=$ "I feel so lost after my divorce..."。$f_{prompt\_rule}(\overline{x})$ 指导 $LM_{\Phi}$ 判断 $\overline{x}$ 代表的是抑郁症的症状、重大生活事件还是相关治疗。$LM_{\Phi}$ 反馈的概率结果为:$p(y_{depression}^{symptom}|\overline{x})=0.596$,$p(y_{depression}^{life\_event}|\overline{x})=0.8789$,$p(y_{depression}^{treatment}|\overline{x})=0.0001$。

This approach significantly reduces the difficulty for the $LM_{\Phi}$ in determining whether a given $\overline{x}$ is related to $d_{j}$'s symptoms, life events, or treatments, allowing the $LM_{\Phi}$ to focus on key areas already summarized by existing medical knowledge, as the verbalizers are derived from the disease ontology, helping the $LM_{\Phi}$ transform inputs into prediction labels (Liu et al. 2023).

这种方法显著降低了 $L M_{_ \phi}$ 判断给定 $x$ 是否与 $d_{_ j}^{_ {\perp}}$ 的症状、生活事件或治疗相关的难度,使 $L M_{_ \phi}$ 能够专注于现有医学知识已总结的关键领域,因为表述词源自疾病本体,帮助 $L M_{\Phi}$ 将输入转化为预测标签 (Liu et al. 2023)。

3.2.3. Prompt Ensembling of Multi-prompt Engineering for Mental Disorder Detection

3.2.3. 多提示工程的心理障碍检测提示集成

The prompt engineering methods $f_{prompt\_prefix}(\cdot)$ and $f_{prompt\_rule}(\cdot)$ each focus on constructing a single prompt, for different motivations, for the mental disorder detection task using user-generated textual content. We now employ the prompt ensemble method to generate our multi-prompt function $f_{prompt}(\cdot)$ for two reasons: (1) both $f_{prompt\_prefix}(\cdot)$ and $f_{prompt\_rule}(\cdot)$ are crucial in the context of mental disorder detection, and we need to combine them to accomplish the task together; (2) a significant body of research has demonstrated that the use of multiple prompts can further improve the efficacy of prompting methods (Liu et al. 2023).

提示工程方法 prompt prefix() 和 f p promptrue() 专注于为利用用户生成文本内容进行精神障碍检测任务的不同动机构建单一提示。我们现采用提示集成方法生成多提示函数 fprompt(),原因有二:(1) 在精神障碍检测场景中,fp prompt prefix() 和 fp prompt rue() 都至关重要,需要协同完成任务;(2) 大量研究表明,使用多重提示能进一步提升提示方法的效能 (Liu et al. 2023)。

While there are several methods for creating a multi-prompt function, we have opted for the prompt ensemble method, which involves utilizing multiple prompts for a given input during the inference phase to make predictions. It serves three purposes: (1) leveraging the complementary advantages of both $f_{prompt\_prefix}(\cdot)$ and $f_{prompt\_rule}(\cdot)$; (2) addressing the challenges of prompt engineering by eliminating the need to select a single best-performing prompt; and (3) stabilizing performance on downstream tasks (Liu et al. 2023). Formally,

虽然创建多提示函数有几种方法,但我们选择了提示集成方法,即在推理阶段对给定输入使用多个提示进行预测。该方法有三个目的:(1) 利用 $f_{_ {p r o m p t_{-p r e f i x}}}(\cdot)$ 和 $f_{p r o m p t_r u l e}(\cdot)$ 的互补优势;(2) 通过无需选择单一最佳性能提示来解决提示工程面临的挑战;(3) 稳定下游任务的性能 (Liu et al. 2023)。形式上,

$$
f_{prompt}(\bar{x}) =\begin{cases} T_{d_j}(\bar{x}) = [v \oplus T_{d_j}^{symptom}(\bar{x});T_{d_j}^{life\_event}(\bar{x});T_{d_j}^{treatment}(\bar{x})] \\
V_{d_j} = \{V_{d_j}[mask]_{symptom}, V_{d_j}[mask]_{life\_event}, V_{d_j}[mask]_{treatment}\}
\end{cases}
$$

$$
f_{prompt}(\bar{x}) =\begin{cases} T_{d_j}(\bar{x}) = [v \oplus T_{d_j}^{symptom}(\bar{x});T_{d_j}^{life\_event}(\bar{x});T_{d_j}^{treatment}(\bar{x})] \\
V_{d_j} = \{V_{d_j}[mask]_{symptom}, V_{d_j}[mask]_{life\_event}, V_{d_j}[mask]_{treatment}\}
\end{cases}
$$

Note: if the $\bar{x} = \{t_1, t_2, \cdots, t_{|\bar{x}|}\}$ originates from the same $x = \{t_1, t_2, \cdots, t_{|x|}\}$, these $\bar{x}$ share the same prefix vector $v$.

注意:若 $\bar{x}=\{t_{1},t_{2},...,t_{|\bar{x}|}\}$ 源自相同的 $x=\{t_{1},t_{2},...,t_{|x|}\}$,则这些 $\bar{x}$ 共享相同的前缀向量 $v$。
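As a minimal illustration of the ensemble (the $\oplus$ concatenation is modeled as simple pairing, and all names are hypothetical assumptions for this sketch):

```python
def ensemble_prompt(prefix_v, rule_template):
    """f_prompt(x_bar): the continuous prefix v (shared by every window
    x_bar drawn from the same user x) is paired with the rule-based
    template T_dj(x_bar) from Design principles 2-3."""
    return {"prefix": prefix_v, "template": rule_template}

v_user = [0.12, -0.40, 0.07]  # toy trainable prefix embedding for one user
windows = ["window 1 text", "window 2 text"]
prompts = [ensemble_prompt(v_user,
                           f'"{w}" is a [mask]_symptom of depression symptom.')
           for w in windows]
shared = all(p["prefix"] is v_user for p in prompts)  # one v per user
```

The key design point captured here is that the personalized prefix is user-level while the rule template is window-level, so the two prompt components vary at different granularities.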

The input to the $LM_{\Phi}$ is the numerical vector representation of $T_{d_{j}}(\overline{x})$, which depends on the tokenizing method of $LM_{\Phi}$; namely,

输入到 $L M_{\Phi}$ 的是数值向量表示 $T_{d_{j}}(\overline{{x}})$ ,这取决于 $L M_{\phi}$ 的分词方法;即,

$[v\oplus LM\_tokenizing(T_{d_{j}}^{symptom}(\overline{x});T_{d_{j}}^{life\_event}(\overline{x});T_{d_{j}}^{treatment}(\overline{x}))]$. The expected feedback from the model is

$[v\oplus LM\_tokenizing(T_{d_{j}}^{symptom}(\overline{x});T_{d_{j}}^{life\_event}(\overline{x});T_{d_{j}}^{treatment}(\overline{x}))]$。模型预期的反馈是

$$
z_i =\begin{cases}
P_{\theta}[:k], & \text{if } i \leq k, \\
[mask]_ i z_{<i}, & \text{o.w.}
\end{cases}
$$

$$
z_i =\begin{cases}
P_{\theta}[:k], & \text{if } i \leq k, \\
[mask]_ i z_{<i}, & \text{o.w.}
\end{cases}
$$

The information within the mask is contingent upon two key factors: (1) $P_{\theta}$, serving as the prefix context, where all subsequent feedback hinges on the activations from the preceding feedback; (2) the template $T_{d_{j}}(\overline{x})$, which provides direct instructions and context to the $LM_{\Phi}$, constraining how $LM_{\Phi}$ fills in the content of $[mask]$ in $T_{d_{j}}(\overline{x})$.

掩码内的信息取决于两个关键因素:(1) $P_{\theta}$ 作为前缀上下文,后续所有反馈都依赖于先前反馈的激活;(2) 模板 $T_{d_{j}}(\overline{{x}})$ 为 $L M_{\phi}$ 提供直接指令和上下文,约束 $L M_{\Phi}$ 如何填充 $T_{d_{j}}(\overline{{x}})$ 中 [mask] 的内容

The learning goal of our multi-prompt learning method is:

我们多提示学习方法的学习目标是:

$$
p(y_{d_{j}}|x)=\frac{1}{\lambda m}\sum_{f=1}^{r}\sum_{i=1}^{m}p_{\phi}\Big([mask]_{f}=LM_{\phi}\big(y_{d_{j}}\big)\,\Big|\,T_{d_{j}}(\overline{x})\Big)
$$

$$
p(y_{d_{j}}|x)=\frac{1}{\lambda m}\sum_{f=1}^{r}\sum_{i=1}^{m}p_{\phi}\Big([mask]_{f}=LM_{\phi}\big(y_{d_{j}}\big)\,\Big|\,T_{d_{j}}(\overline{x})\Big)
$$

where $r$ is the number of masked positions in $f_{prompt}(\overline{x})$ (in our context, $r=3$), $[mask]_{f}=LM_{\Phi}\big(y_{d_{j}}\big)$ maps the class $y_{d_{j}}$ to the set of label words $V_{d_{j}}[mask]_{f}$, and $m$ is the number of sliding windows $\overline{x}$ in $x$.

其中 $r$ 为 $f_{prompt}(\overline{x})$ 中掩码位置的数量 (在我们的上下文中 $r=3$),$[mask]_{f}=LM_{\Phi}\big(y_{d_{j}}\big)$ 的作用是将类别 $y_{d_{j}}$ 映射到标签词集合 $V_{d_{j}}[mask]_{f}$,而 $m$ 是 $x$ 中滑动窗口 $\overline{x}$ 的数量。

The normalization term $\frac{1}{\lambda m}$ is introduced in Eq. 14 for two reasons: (1) $p(\cdot)$ represents probability feedback from $LM_{\Phi}$, and the sum of multiple probabilities could exceed 1. The upper limit of $\sum_{f=1}^{r}\sum_{i=1}^{m}p(\cdot)$ is $m \times r$ (if each $p(\cdot)$ returns a value of 1). Since $r$ is a very small number in our setting, we simplify the upper limit to $m$; the lower limit of this summation of multiple

归一化项 $\frac{1}{\lambda m}$ 在公式 14 中引入有两个原因:(1) $p(\cdot)$ 表示来自 $LM_{\Phi}$ 的概率反馈,多个概率之和可能超过 1。$\sum_{f=1}^{r}\sum_{i=1}^{m}p(\cdot)$ 的上限为 $m \times r$ (若每个 $p(\cdot)$ 返回值为 1)。由于 $r$ 在我们的设置中是非常小的数值,我们将上限简化为 $m$;该多重求和的下限

probabilities is 0 (if each $p(\cdot)$ returns a value of 0). Consequently, if we normalize $\sum_{i=1}^{m}p(\cdot)$ to the range [0, 1], $(\sum_{i=1}^{m}p(\cdot)-lower\_limit)/(upper\_limit-lower\_limit)\times(1-0)+0=\frac{1}{m}\sum_{i=1}^{m}p(\cdot)$. (2) Additionally, since we employ a sliding window $\overline{x}$ to break down $x$ and thereby simplify the mental disorder detection task, a possibility arises: multiple sliding window $\overline{x}$ instances may be describing the same symptoms, life events, or treatment for a disease $d_{j}$. Therefore, we also use $\frac{1}{m}$ as a penalization factor. Overall, we incorporate the normalization term $\frac{1}{\lambda m}$ into our learning goal, where $\lambda$ is a hyperparameter.

概率之和的下限为 0 (若每个 $p(\cdot)$ 返回值为 0)。因此,若将 $\sum_{i=1}^{m}p(\cdot)$ 归一化至 [0, 1] 范围,则 $(\sum_{i=1}^{m}p(\cdot)-lower\_limit)/(upper\_limit-lower\_limit)\times(1-0)+0=\frac{1}{m}\sum_{i=1}^{m}p(\cdot)$。(2) 此外,由于我们采用滑动窗口 $\overline{x}$ 来分解 $x$ 以简化精神障碍检测任务,可能出现多个滑动窗口 $\overline{x}$ 实例描述同一疾病 $d_{j}$ 的症状、生活事件或治疗的情况。因此,我们也将 $\frac{1}{m}$ 作为惩罚因子。总体而言,我们将归一化项 $\frac{1}{\lambda m}$ 纳入学习目标,其中 $\lambda$ 为超参数。
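The normalized aggregation of Eq. 14 can be sketched numerically as follows ($\lambda$ and the per-window probabilities are made-up values for illustration only):

```python
def user_score(window_probs, lam=1.0):
    """p(y_dj | x) = (1 / (lam * m)) * sum_f sum_i p_f_i, where m is the
    number of sliding windows and each inner dict holds the r = 3
    sub-task probabilities for one window."""
    m = len(window_probs)
    total = sum(p for probs in window_probs for p in probs.values())
    return total / (lam * m)

windows = [
    {"symptom": 0.6, "life_event": 0.9, "treatment": 0.0},
    {"symptom": 0.1, "life_event": 0.0, "treatment": 0.0},
]
score = user_score(windows, lam=2.0)  # (0.6 + 0.9 + 0.1) / (2 * 2) ≈ 0.4
```

Dividing by $m$ both rescales the sum toward [0, 1] and penalizes repeated mentions of the same evidence across windows, as described above.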

4. EVALUATION

4. 评估

To evaluate the performance of the proposed method, we conduct the following examinations: (1) Comparison with benchmarks: we demonstrate the advantages of LLM-based prompt engineering over other machine learning paradigms in our research context. (2) Comparison with other prompt strategies: we highlight the benefits of continuous prompts over discrete prompts in maximizing the capabilities of LLMs within our binary mental disorder detection/classification task. (3) Ablation studies: we analyze the contribution of each component in our ensemble prompt to overall performance. (4) We present three experiments to illustrate the unique advantages of prompt engineering: few-shot learning, early identification, and generalizability to scenarios with very limited labeled data. (5) Post-analysis to explain the effectiveness of our method.

为评估所提方法的性能,我们进行了以下实验:(1) 基准对比:在本文研究场景中,验证基于大语言模型(LLM)的提示工程相比其他机器学习范式的优势。(2) 提示策略对比:在二元精神障碍检测/分类任务中,证明连续提示相较离散提示更能充分释放大语言模型的潜力。(3) 消融实验:分析集成提示中各组件对整体性能的贡献度。(4) 通过少样本学习、早期识别和标签数据极稀缺场景下的泛化能力三项实验,展示提示工程的独特优势。(5) 后效分析以阐释本方法的有效性。

4.1. Experimental Setup

4.1. 实验设置

All prompt engineering methods rely on pre-trained LLMs to accomplish downstream tasks. Therefore, the major characteristics of pre-trained LLMs, including their main training objective, type of text noising, auxiliary training objective, attention mask, and typical architecture, can significantly influence the performance of downstream tasks in the pre-train and prompt engineering learning paradigm (Liu et al. 2023). Thus, we review mainstream LLM architectures and describe how we chose one for our experiments. Most widely used LLMs adopt the Transformer architecture, which can be classified into three types based on their characteristics (see Table 3): decoder models, encoder models, and encoder-decoder models.

所有提示工程方法都依赖于预训练的大语言模型来完成下游任务。因此,预训练大语言模型的主要特征——包括其核心训练目标、文本噪声类型、辅助训练目标、注意力掩码和典型架构——会显著影响预训练与提示工程学习范式中的下游任务性能 (Liu et al. 2023)。为此,我们回顾主流大语言模型架构,并说明实验中的选择依据。当前广泛采用的大语言模型基于Transformer架构,根据特性可分为三类(见表3): 解码器模型、编码器模型和编码器-解码器模型。

Table 3. Summary of Mainstream Transformer-based LLMs

| Category | Training | Examples |
| --- | --- | --- |
| Decoder models (auto-regressive models) | Use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. Pretraining: predicting the next word in the sentence. | CTRL, GPT, GPT-2, TransformerXL, DeepSeek-R1 |
| Encoder models (auto-encoding models) | Use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence (i.e., bi-directional attention). Pretraining: corrupting a given sentence (e.g., masking random words) and tasking the model with finding or reconstructing the initial sentence. | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa |
| Encoder-decoder models (sequence-to-sequence models) | Use both parts of the Transformer architecture. At each stage, the encoder attention layers can access all the words in the initial sentence, whereas the decoder attention layers can only access the words positioned before a given word in the input. Pretraining: using the objectives of encoder or decoder models, or replacing random words and predicting the masked words. | BART, Marian, T5 |

表 3: 主流基于Transformer的大语言模型总结

| 类别 | 训练方式 | 示例 |
| --- | --- | --- |
| 解码器模型 (自回归模型) | 仅使用Transformer模型的解码器部分。在每个阶段,对于给定单词,注意力层只能访问句子中位于它之前的单词。预训练目标:预测句子中的下一个单词。 | CTRL, GPT, GPT-2, TransformerXL, DeepSeek-R1 |
| 编码器模型 (自编码模型) | 仅使用Transformer模型的编码器部分。在每个阶段,注意力层可以访问初始句子中的所有单词(即双向注意力)。预训练目标:破坏给定句子(如随机掩码单词)并让模型寻找或重建初始句子。 | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa |
| 编码器-解码器模型 (序列到序列模型) | 使用Transformer架构的两部分。在每个阶段,编码器注意力层可以访问初始句子中的所有单词,而解码器注意力层只能访问输入中位于给定单词之前的单词。预训练目标:使用编码器或解码器模型的目标,或替换随机单词并预测被掩码的单词。 | BART, Marian, T5 |

Given the specific nature of our experimental environment, prompt design (continuous prompt, prefix tuning, and rule-based prompt), prediction task (binary classification for subject-level mental disorder detection), and research context (chronic disease predictive analytics), we employed an exclusion-based approach to select appropriate LLMs. (1) We exclude encoder models (e.g., BERT and RoBERTa) since our prompt design incorporates prefix-tuning to achieve personalized prompts. The implementation of prefix tuning requires LLMs to allow the injection of past key values of Hugging Face LLMs (caching and reusing the intermediate hidden states from the previous steps) into the model. Encoder models on Hugging Face do not permit such operations. (2) We exclude decoder models because previous studies show that for supervised learning tasks with input-label pair data (as opposed to self-supervised learning), encoder-decoder models offer advantages by requiring fewer parameters and achieving better results compared to decoder-only models (Jiang et al. 2023). (3) We exclude non-open-source LLMs (e.g., GPT-4o) since our approach relies on continuous prompt engineering, which requires open-source LLMs to operate in the embedding space (However, non-open-source LLMs with discrete prompt can still be used as benchmark models). Moreover, since healthcare analytics problems often involve user-generated data or fine-grained patient-level information, utilizing such models could pose potential privacy issues. As a result, we selected FLAN-T5 (one of the best-performing open-source encoder-decoder models) as the base LLM for our experiments.

鉴于实验环境的特殊性、提示设计(连续提示、前缀调优和基于规则的提示)、预测任务(针对主体层面精神障碍检测的二元分类)以及研究背景(慢性病预测分析),我们采用基于排除的方法来选择合适的LLM。(1)我们排除了编码器模型(如BERT和RoBERTa),因为我们的提示设计结合了前缀调优以实现个性化提示。前缀调优的实施要求LLM允许将Hugging Face LLM的过去键值(缓存并重用先前步骤的中间隐藏状态)注入模型中,而Hugging Face上的编码器模型不支持此类操作。(2)我们排除了仅解码器模型,因为先前研究表明,对于带有输入-标签对数据的监督学习任务(与自监督学习相对),编码器-解码器模型相比仅解码器模型具有参数需求更少且效果更好的优势(Jiang et al. 2023)。(3)我们排除了非开源LLM(如GPT-4o),因为我们的方法依赖于连续提示工程,这需要开源LLM在嵌入空间中运行(不过,使用离散提示的非开源LLM仍可作为基准模型)。此外,由于医疗健康分析问题常涉及用户生成数据或细粒度患者层面信息,使用此类模型可能带来潜在的隐私问题。因此,我们选择FLAN-T5(性能最佳的开源编码器-解码器模型之一)作为实验的基础LLM。

For evaluation, we mainly use three datasets (Table 4) from the eRisk database (Losada et al. 2018, Parapar et al. 2021). Specifically, we selected the detection of depression, anorexia, and pathological gambling as the primary tasks given their prevalence and broad societal impact.

在评估中,我们主要使用来自eRisk数据库 (Losada et al. 2018, Parapar et al. 2021) 的三个数据集 (表4)。具体而言,我们选择抑郁症、厌食症和病理性赌博检测作为主要任务,考虑到它们的普遍性和广泛社会影响。

To assess the usability of the methods to new users, we use 60% of the data for training, 20% for validation, and 20% for testing. The reported results are the average performances of 10 experiment runs. We also report the standard deviation of these experiments to show our results' statistical significance. In our evaluations, we report AUC, F1-score, precision, and recall. The goal of the proposed method is to achieve the highest AUC and F1-score. It is important to note that in our research context, a model with high precision but low recall, or vice versa, does not necessarily indicate good overall performance. When precision (exactness: correctly identifying true patients) is low, false-positive patients will suffer from unnecessary mental burden and diagnostic costs. When recall (completeness: capturing as many patients as possible) is low, the practical utility of the model is compromised.

为了评估这些方法对新用户的可用性,我们使用60%的数据进行训练,20%用于验证,20%用于测试。报告结果是10次实验运行的平均性能。我们还报告了这些实验的标准差,以展示结果的统计显著性。在评估中,我们报告了AUC、F1分数、精确率和召回率。所提方法的目标是实现最高的AUC和F1分数。需要注意的是,在我们的研究背景下,一个高精确率但低召回率的模型,或者相反,并不一定代表整体性能良好。当精确率(准确性:正确识别真实患者)较低时,假阳性患者将承受不必要的心理负担和诊断成本。当召回率(完整性:尽可能多地捕捉患者)较低时,模型的实用性就会受到影响。
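The reported metrics follow their standard definitions; a self-contained sketch on toy binary labels (the label vectors are illustrative, not the paper's data):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision (exactness), recall (completeness), and their harmonic
    mean F1, for binary labels where 1 = has the mental disorder."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
# tp = 1, fp = 1, fn = 1, so precision = recall = f1 = 0.5
```

This makes the trade-off in the text concrete: lowering the decision threshold raises recall at the cost of more false positives (lower precision), which is why AUC and F1 are the headline metrics.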

Table 4. Datasets Summary

| Dataset | | # of subjects | # of posts | # of words | Avg # of posts per subject | Avg # of days from first to last post |
| --- | --- | --- | --- | --- | --- | --- |
| Depression | P | 214 | 90,222 | 2,480,216 | 421 | 676 |
| | N | 1,493 | 986,360 | 22,461,242 | 660 | 664 |
| Anorexia | P | 134 | 42,493 | 1,583,227 | 317 | 679 |
| | N | 1,153 | 781,768 | 16,781,263 | 678 | 848 |
| Pathological gambling | P | 245 | 69,301 | 2,119,872 | 282 | 545 |
| | N | 4,182 | 2,088,002 | 44,077,018 | 499 | 663 |

Note: P = positive examples; N = negative examples.

表 4: 数据集摘要

| 数据集 | | 受试者数量 | 帖子数量 | 单词数量 | 每受试者平均帖子数 | 首次到最后一次发帖平均天数 |
| --- | --- | --- | --- | --- | --- | --- |
| Depression | P | 214 | 90,222 | 2,480,216 | 421 | 676 |
| | N | 1,493 | 986,360 | 22,461,242 | 660 | 664 |
| Anorexia | P | 134 | 42,493 | 1,583,227 | 317 | 679 |
| | N | 1,153 | 781,768 | 16,781,263 | 678 | 848 |
| Pathological gambling | P | 245 | 69,301 | 2,119,872 | 282 | 545 |
| | N | 4,182 | 2,088,002 | 44,077,018 | 499 | 663 |

注:P = 正例;N = 负例。

4.2. Comparison with Benchmark Methods

4.2. 与基准方法的比较

4.2.1. Comparison with Existing Mental Disorder Detection Methods

4.2.1. 与现有精神障碍检测方法的对比

We begin by comparing our proposed method with state-of-the-art approaches from other machine learning paradigms other than prompt engineering. The goal is to highlight the advantages of LLM-based prompt engineering over these alternative paradigms in the context of our research, which focuses on detecting mental disorders through user-generated content. Table 5 reports our results and compares with state-of-the-art methods for benchmarking (Benton et al. 2017, Chau et al. 2020, Choudhury et al. 2013, Coppersmith et al. 2014, Khan et al. 2021, Lin et al. 2020, Malviya et al. 2021, Preotiuc-Pietro et al. 2015, Reece et al. 2017).

我们首先将提出的方法与除提示工程(prompt engineering)外其他机器学习范式的最新方法进行比较,旨在凸显基于大语言模型的提示工程在我们研究背景下的优势——该研究专注于通过用户生成内容检测心理障碍。表5报告了我们的实验结果,并与以下基准测试的最新方法进行了对比(Benton et al. 2017, Chau et al. 2020, Choudhury et al. 2013, Coppersmith et al. 2014, Khan et al. 2021, Lin et al. 2020, Malviya et al. 2021, Preotiuc-Pietro et al. 2015, Reece et al. 2017)。

Table 5. Comparison with State-of-the-art Mental Disorder Detection Studies

| Dataset | Category | Model | AUC | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| Depression | Traditional Machine Learning with Feature Engineering | Choudhury et al. (2013) | 0.569 ± 0.001 | 0.588 ± 0.002 | 0.716 ± 0.001 | 0.569 ± 0.003 |
| | | Coppersmith et al. (2014) | 0.685 ± 0.002 | 0.705 ± 0.001 | 0.735 ± 0.003 | 0.685 ± 0.002 |
| | | Preotiuc-Pietro et al. (2015) | 0.723 ± 0.003 | 0.760 ± 0.002 | 0.820 ± 0.001 | 0.720 ± 0.001 |
| | | Benton et al. (2017) | 0.716 ± 0.029 | 0.722 ± 0.026 | 0.730 ± 0.022 | 0.716 ± 0.029 |
| | | Reece et al. (2017) | 0.729 ± 0.012 | 0.717 ± 0.011 | 0.708 ± 0.013 | 0.729 ± 0.011 |
| | | Chau et al. (2020) | 0.623 ± 0.006 | 0.573 ± 0.005 | 0.570 ± 0.002 | 0.622 ± 0.005 |
| | Deep Learning | CNN-based (Lin et al. 2020) | 0.710 ± 0.005 | 0.711 ± 0.004 | 0.728 ± 0.018 | 0.710 ± 0.005 |
LSTM-based(Khanetal.20