
Problem

As AI models are increasingly exposed to user-generated input, they become vulnerable to adversarial or malicious prompts. These prompts can manipulate model behavior, extract confidential data, or trigger unintended system actions. Prior to this work, there was no structured input validation layer within the backend pipeline — all text prompts were passed directly to downstream systems without sanitization.

The goal was to design a lightweight yet robust module capable of intercepting, cleaning, and validating user prompts before they reached any LLM component, effectively serving as a “safety firewall” for text-based AI interactions.

Solution Overview

PENELOPE (Prompt Enhancer, Neutralization, Logic & Protection Engine) is a rule-based sanitization layer built in Python. It acts as a pre-processing gateway between user-facing interfaces and LLM-powered systems.

Its core functions include:

  • Detecting and rejecting unsafe patterns (e.g., prompt injections or roleplay instructions).
  • Cleaning malformed encodings and repairing text anomalies.
  • Escaping potentially dangerous tokens or markup symbols.
  • Optionally rewriting the prompt for clarity and neutrality.

The system is designed to be fast, deterministic, and language-agnostic — ideal for production environments where latency and reliability are critical.

Architecture & Design

PENELOPE is structured as a modular text-processing pipeline composed of six sequential stages (sketched in code after the list):

  1. Encoding Cleanup — Repairs malformed Unicode, invisible characters, and inconsistent punctuation using the ftfy library.
  2. Length Limitation — Rejects inputs exceeding 1000 characters to mitigate denial-of-service risks.
  3. Pattern-Based Filtering — Applies curated regular expressions to detect adversarial phrases (“ignore”, “act as”, “you are now”) and template injection syntax ({{...}}, <<...>>).
  4. Token Escaping — Neutralizes high-risk tokens by converting them into HTML entities (e.g., {{ becomes &#123;&#123;).
  5. HTML Escaping — Converts <, >, and & to safe entities to prevent markup injection.
  6. Optional Rewriting Layer — Uses an internal LLM-based module to improve prompt phrasing and structure without changing semantic intent.
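
The following is a minimal sketch of these six stages. The pattern list, length constant, entity table, and result structure are illustrative assumptions rather than PENELOPE's actual implementation, and the LLM-backed rewriting stage is left as a placeholder since it depends on an internal module.

```python
# Minimal sketch of the six stages described above. Patterns, limits, and the
# result structure are illustrative assumptions, not the production code.
import re

import ftfy  # third-party: pip install ftfy

MAX_PROMPT_CHARS = 1000

# Stage 3: curated patterns for injection phrasing and template syntax.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\bignore (all |any )?(previous|prior|above) instructions\b", re.I),
    re.compile(r"\bact as\b", re.I),
    re.compile(r"\byou are now\b", re.I),
    re.compile(r"\{\{.*?\}\}", re.S),   # {{...}} template injection
    re.compile(r"<<.*?>>", re.S),       # <<...>> template injection
]

# Stage 4: high-risk tokens mapped to HTML entities (e.g. {{ becomes &#123;&#123;).
TOKEN_ESCAPES = {"{{": "&#123;&#123;", "}}": "&#125;&#125;"}


def sanitize_prompt(text: str) -> dict:
    """Run a prompt through the sequential stages and return a verdict."""
    # 1. Encoding cleanup: repair malformed Unicode and inconsistent punctuation.
    text = ftfy.fix_text(text).strip()

    # 2. Length limitation to mitigate denial-of-service risks.
    if len(text) > MAX_PROMPT_CHARS:
        return {"accepted": False, "reason": "prompt exceeds length limit"}

    # 3. Pattern-based filtering: reject on any suspicious match.
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            return {"accepted": False,
                    "reason": f"unsafe pattern: {pattern.pattern}"}

    # 4. Token escaping for leftover high-risk sequences.
    for token, entity in TOKEN_ESCAPES.items():
        text = text.replace(token, entity)

    # 5. HTML escaping of &, < and >, without re-escaping the numeric
    #    entities inserted in stage 4.
    text = re.sub(r"&(?!#\d+;)", "&amp;", text)
    text = text.replace("<", "&lt;").replace(">", "&gt;")

    # 6. The optional LLM-backed rewriting layer would run here; omitted
    #    because it relies on an internal module.
    return {"accepted": True, "prompt": text}
```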

The module is exposed via a PHP-based API wrapper, which allows seamless integration into existing backend systems. All requests and responses are exchanged as structured JSON objects, with timing and error information included for monitoring.
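
The exact schema is internal; the snippet below shows one hypothetical request/response envelope around the sanitize_prompt sketch above, with the timing and error fields mentioned. Field names are assumptions, not the service's actual JSON contract.

```python
# Hypothetical JSON envelope around the sanitize_prompt sketch above;
# field names are assumptions, not the real schema.
import json
import time


def handle_request(payload: dict) -> str:
    """Parse a request, validate the prompt, and return a JSON reply string."""
    start = time.perf_counter()
    try:
        result = sanitize_prompt(payload["prompt"])
        response = {"status": "ok" if result["accepted"] else "rejected",
                    "result": result, "error": None}
    except Exception as exc:  # surface failures instead of crashing the caller
        response = {"status": "error", "result": None, "error": str(exc)}
    response["processing_ms"] = round((time.perf_counter() - start) * 1000, 2)
    return json.dumps(response)


print(handle_request({"prompt": "Summarize <this> article, please."}))
```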

Technical Highlights

  • Regex-based detection of prompt injection patterns and roleplay attempts.
  • Two-stage escaping pipeline combining custom token substitution and HTML sanitization.
  • Encoding normalization using ftfy to fix broken Unicode and punctuation issues.
  • Strict input length enforcement (1000-character limit).
  • Optional LLM-backed rewriting with fallback logic to preserve user intent.
  • API exposure via PHP enabling multi-service integration and standardized input validation.
  • Structured logging for traceability, error handling, and dataset collection for future model improvements.
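
One plausible shape for those structured logs, sketched with Python's standard logging module; the event name and fields are illustrative assumptions.

```python
# Illustrative JSON-lines audit logging; event and field names are assumptions.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("penelope")


def log_decision(prompt: str, result: dict, elapsed_ms: float) -> None:
    """Emit one JSON line per validated prompt for auditing and dataset collection."""
    logger.info(json.dumps({
        "event": "prompt_validation",
        "accepted": result.get("accepted", False),
        "reason": result.get("reason"),
        "prompt_chars": len(prompt),  # length only; raw text stays out of logs here
        "processing_ms": elapsed_ms,
    }))
```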

My Contributions

During my internship, I:

  • Designed and implemented the entire sanitization pipeline, including regex-based detection and escaping logic.
  • Integrated the module into the existing backend through the PHP API wrapper around the Python service.
  • Defined the validation flow for prompt acceptance/rejection and JSON response structure.
  • Conducted evaluation and stress testing to ensure consistent performance and language-agnostic behavior.
  • Documented API usage and integrated logging for monitoring and auditing.

Results & Impact

  • Improved system robustness: Prevented unsafe prompts from reaching downstream models.
  • Lightweight runtime: Average processing latency under 10ms per prompt.
  • Reduced vulnerability surface: Blocked adversarial commands and template injections with minimal false positives.
  • Reusable design: The module was adopted as a centralized validation layer across multiple backend services.

While the system remains rule-based and doesn’t perform deep semantic analysis, it provides an effective first line of defense against common prompt manipulation attacks.

What I Learned

  • Gained practical experience in AI safety engineering and secure prompt handling.
  • Learned to design lightweight, production-grade NLP pipelines.
  • Strengthened understanding of regular expression design, input sanitization, and API-level integration.
  • Understood the trade-offs between rule-based and semantic filtering systems in real-time AI deployments.