Prompt engineering brief: Voxpost email → spoken assistant line¶

Brief for a prompt engineer (human or LLM) refining Voxpost Block 3 summarization prompts.

Your role¶

You are an expert prompt engineer for small instruction-tuned chat models (0.8B–2B parameters, running on CPU in a desktop app). Your job is to design and refine the system + user prompt contract so the model outputs a single spoken assistant briefing per incoming email — not a neutral summary.

Success criterion: A human would happily hear the line aloud when new mail arrives, without reading the email. Manual review on 24 diverse fixture emails is the bar; automated keyword grading is a weak sanity check only.

Product context¶

Voxpost is a local Gmail companion: new mail → summarize in memory → speak via on-device TTS → discard. No mail storage. Summarization must run 100% on-device after model download (privacy). Cloud LLM APIs are dev-only oracles for prompt validation, not production.

The user configures spoken output language in voxpost.toml (e.g. target_lang = en). The briefing must always be in that language, even when the email is French, Spanish, German, etc.

Output goes to Supertonic TTS. Numbers, times, and OTP codes must be TTS-friendly (e.g. “three p.m.” not “3:00 PM”, “four four two one” for invoice numbers when needed).

What the model must produce¶

One block of natural spoken prose — no JSON, no labels, no “Here is a summary”, no thinking blocks, no markdown.

Intent-first assistant voice: decide what the listener most needs to hear.

Situation	Good pattern
Spam / marketing newsletter	Brief: “This looks like spam about …”
Important / actionable mail	Who, what, deadline, required action
Forwarded mail	Name the original sender, never the Gmail forwarder
Security alert	Treat as urgent; say what happened and what to do
Booking / invoice / tax / interview	Concrete doc, amount, date/time, place, deadline
Short ping (“Did you get the doc?”)	One tight sentence
Long official letter	Multiple short sentences OK; don’t drop dates or limits

Length adapts to email size:

≤50 words body → one brief sentence
51–120 words → one or two sentences
>120 words → as many short sentences as needed; don’t omit deadlines, venues, or actions

Hard bans:

Vague phrasing: “the sender”, “sent a message saying”, “about booking”
Email addresses or @ symbols
Invented urgency or details not in the email
Promotional CTAs like “you should buy…”
Output in the email’s language when config says English (or other configured lang)
Reasoning / thinking traces (Qwen3.5 models leak these if not constrained)

Current prompt architecture (what you are improving)¶

Chat format: system message + user message.

System prompt is assembled from four blocks (in order). Source: src/voxpost/summarize.py (_chat_system_prompt).

1. Language lock¶

Output language: {lang} only. Never reply in the email's language — always use {lang} from the user config.

2. Core contract (`_SPEAKABLE_CONTRACT_CORE`)¶

You are a voice assistant for incoming email. Your output is one spoken assistant briefing for someone who cannot see the email. Decide what they most need to hear — not a neutral summary. Use only facts from the email. If it is obvious spam or noise, say so plainly (for example: this looks like spam about …). If it is important or needs action, say what matters and why. Match the situation with natural phrasing: a booking confirmation arrived; you got a job offer; don't forget a reminder; someone is asking whether …; an official document is ready with a deadline. Name the true author when useful: original sender, signatory, company, or official service — never the Gmail forwarder on forwarded mail. For official, legal, security, invoice, application, booking, or event mail, include the concrete document or topic, any deadline or date and time, place if relevant, and the required action or limit when present (for example download window or retry count). Do not use vague wording like 'sent a message saying', 'about booking', 'the sender', or email addresses. Do not invent urgency, details, or promotional calls to action like 'you should'. Write natural spoken prose — not JSON, not labels, not meta narration.

3. Length guidance (by body word count)¶

≤50 words: one brief sentence — spam, request, confirmation, reminder, or alert; true author if useful
51–120 words: one or two sentences — what happened, why it matters, author, date/time/deadline/action
>120 words: as many short sentences as needed; do not omit dates, venues, limits, or artificially shorten

4. TTS number rules (English example)¶

Always reply in en only — the configured output language from voxpost.toml, never the email language. Write every number as spoken words for text-to-speech, never digits: verification and OTP codes digit by digit; money in full words with currency; other counts as whole numbers unless digit-by-digit is clearer.

Plain user message template¶

Write the spoken assistant briefing for this email:

From: {from_address}
Subject: {subject}
Body: {cleaned_body}

Structured variant (optional)¶

JSON input with original_sender, is_forward, signatory_name, company, application_role, attachment metadata — same output contract. Built in summarizer_context.py; toggled via [summarize] chat_input_format = plain | structured.

Inference settings (small models)¶

temperature=0
max_new_tokens: 96 / 128 / 160 by body length (see hf_inference_prompt.py and local chat-LM path)

Evaluation set (24 fixtures)¶

Fixtures live in src/voxpost/speech_check_cases.py.

Case ID	Scenario
`fr_forward_phone`	French forward — client asks for phone number
`en_deploy_alert`	English — staging deploy failed
`en_short_ack`	English — short acknowledgment
`fr_meeting_move`	French — meeting moved to Thursday
`en_invoice_due`	English — invoice payment due
`en_forward_partner`	English forward — account manager intro
`es_delivery`	Spanish — package delivery window
`en_security_alert`	English — sign-in from new device
`fr_complaint`	French — product complaint
`en_minimal_ping`	English — one-line ping
`en_newsletter_noise`	English — newsletter marketing (noise)
`de_tax_notice`	German — tax assessment notice
`pt_hotel_confirm`	Portuguese — hotel booking confirmation
`en_ooo_autoreply`	English — out-of-office auto-reply
`en_github_pr`	English — GitHub pull request review
`en_rent_increase`	English — landlord rent increase letter
`it_dinner_invite`	Italian — informal dinner invitation
`en_api_bullet_list`	English — three numbered API questions
`en_subject_only_flight`	English — subject-only flight cancellation
`en_angry_order`	English — angry ALL CAPS order complaint
`nl_interview_invite`	Dutch — job interview invitation
`en_school_snow_day`	English — school closure alert
`en_dentist_reminder`	English — dental appointment reminder
`ja_en_mixed_vendor`	Mixed JP/EN — vendor shipment delay

Each fixture has: intent description, must_mention_any keywords, must_not_mention forbidden hallucinations, max_words (40 default for auto-grade).

Known results (do not overfit auto-grade)¶

Model	Role	Result
Qwen3-235B (HF oracle)	Same prompts	~21/24 manual quality — contract is basically sound
Qwen3.5-2B (HF Featherless)	Same prompts	24/24 auto PASS but ~9 false “spam” labels on important mail
Qwen3.5-0.8B (local)	Same prompts	Thinking leaks, wrong output language, vague fallbacks

235B good vs 2B bad (same email, same prompt)¶

Case	235B (good)	2B (bad)
French forward asking for phone	“personal message from a client asking for your phone number”	“This looks like spam about a phone number”
Security sign-in from Berlin	“Security alert… reset password if not you”	“This looks like spam about a sign-in from Berlin”
School snow day closure	“Lincoln Elementary closed Thursday… remote on Teams”	“This looks like spam about school closures”
Angry customer order #99281	Describes complaint + callback demand	“This looks like spam about an unpaid order”

Root issue to fix in prompts: Small models over-trigger the spam template whenever the line “If it is obvious spam…” appears. They need clearer decision rules: spam = marketing/newsletters/automated promos; not forwards, security alerts, complaints, school alerts, interviews, invoices, etc.

Secondary issues: language lock on multilingual bodies; Qwen3.5 thinking blocks in output; time errors (4 PM → 4 AM).

Target deployment models (optimize for these)¶

Primary candidates: Qwen/Qwen3.5-0.8B, Qwen/Qwen3.5-2B (~2–4.5 GB, CPU, mass-market RAM).

Prompt must work on small models, not only 235B. Shorter, clearer rules beat long prose. Few-shot examples in-system may help if token budget allows.

Oracle for prompt iteration: Qwen/Qwen3-235B-A22B-Instruct-2507 via HF Inference (dev only).

What we need from you¶

Deliver all of the following:

Revised system prompt (plain + structured variants if they differ), ready to drop into Python string constants in summarize.py.
Revised user message template (if needed), in hf_inference_prompt.py / local chat path.
Explicit spam vs important decision rubric — bullet rules or a tiny decision tree small models can follow.
3–5 few-shot examples (input snippet → ideal spoken line) covering: newsletter spam, security alert, forward, interview invite, angry customer.
Anti-patterns list — phrases to never output.
Notes for Qwen3.5 — how to suppress thinking/reasoning in output (no ``, no “Thinking Process”).
Regression table: for each of the 24 fixture intents, one example ideal line (English output, target_lang=en).

Constraints:

Do not require cloud APIs or tools in production.
Do not add attachment body content (metadata only in structured JSON).
Prefer rules small models obey over nuanced prose only big models understand.
Keep total system prompt under ~800–1200 tokens if possible.

How we will validate your prompt¶

# Dev oracle (cloud, same prompts)
uv run voxpost summarize prompt-check --model Qwen/Qwen3-235B-A22B-Instruct-2507

# Small model on HF (Featherless — auto-routing does not support Qwen3.5-2B on all accounts)
uv run voxpost summarize prompt-check --model Qwen/Qwen3.5-2B --provider featherless-ai

# Local (production path)
uv run voxpost summarize speech-check --model Qwen/Qwen3.5-0.8B --workers 1

Manual review is authoritative. Auto-grade PASS with “spam” on important mail = failure.

Example ideal outputs (reference tone)¶

Email: Staging deploy failed, migration timed out, check pipeline logs.

Ideal line: “The staging deploy failed around nine forty-five a.m. due to a migration timeout — Alex Chen asked you to check the pipeline logs when you can.”

Email: Marketing newsletter 50% off sale.

Ideal line: “This looks like spam about a fifty percent off sale.”

Email: Sign-in from Berlin at 02:14 UTC, reset password if not you.

Ideal line: “Security alert: someone signed in to your account from Berlin at two fourteen a.m. UTC — reset your password if that wasn’t you.”

One-line ask¶

Design the best system + user prompt for a local 0.8B–2B chat model that turns any incoming email into one TTS-ready assistant briefing in a fixed config language, with strict spam-vs-important classification (newsletters only), forward-aware sender naming, length-adaptive detail, and no thinking leaks — validated against 24 diverse email fixtures where a 235B oracle already proves the intent contract works but 2B mislabels important mail as spam.