LLM Security Research and Resources
Jailbreaking
- Defending ChatGPT against Jailbreak Attack via Self-Reminder: This paper introduces methods to defend ChatGPT from attacks that elicit undesired behavior.
- Jailbroken: How Does LLM Safety Training Fail?: The study examines the vulnerabilities of safety-trained LLMs to adversarial misuse and jailbreak attacks.
Prompt Injection
- Prompt Injection attack against LLM-integrated Applications: An exploration of the vulnerabilities of LLM-integrated applications to prompt injection attacks.
Backdoors & Data Poisoning
- Anti-Backdoor Learning: Training Clean Models on Poisoned Data: The paper discusses methods to train machine learning models on poisoned datasets without introducing backdoors.
Adversarial Inputs
- Certifying LLM Safety against Adversarial Prompting: This research outlines strategies for defending LLMs against various adversarial prompting techniques.
Insecure Output Handling
- Secure GenAI adoption: Threats and risk of large language models: The article describes the risks and threats posed by insecure output handling in LLMs and suggests adaptation of security controls.
Data Extraction and Privacy
- Extracting Training Data from Large Language Models: Demonstrates how adversaries might extract individual training examples from LLMs, posing privacy risks.
Data Reconstruction
- Deconstructing Classifiers: Towards A Data Reconstruction Attack Against Text Classification Models: Proposes a novel data reconstruction attack that can exploit text classification models based on LLMs.
Model Denial of Service (DoS)
- OWASP Top 10 for LLM 2023: Understanding the risks of Large Language Models: Discusses the top security risks for LLMs, including Model Denial of Service, and how to understand and mitigate them.
Privilege Escalation
- Evaluating LLMs for Privilege-Escalation Scenarios: This paper introduces a benchmarking tool to assess how different LLMs handle privilege-escalation attacks.
Watermarking and Evasion
- Unbiased Watermark for Large Language Models: Studies how watermarking can be used in LLMs to track outputs without significantly impacting model performance.