Content Moderation with Amazon Nova 2 Lite: Structured Prompts in Practice and a Benchmark Comparison

Estimated reading time: 11 minutes

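Content moderation is a problem every shipped product has to face: in user-generated text, images, and video, what to block and what to allow is often fuzzy and changes frequently. Traditional approaches rely on keyword matching or small classifier models, which are costly to maintain and generalize poorly. Amazon Nova 2 Lite applies a large language model's comprehension directly to the moderation step: you write your moderation rules into the prompt, and the model judges content against your definitions.

This article breaks down two prompt styles, structured and free-form. Both derive from the taxonomy of the MLCommons AILuminate Assessment Standard, but you can swap in your own moderation policy at any time. The end of the article also covers benchmark results comparing Nova 2 Lite with other foundation models on three public datasets.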

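The AILuminate taxonomy: a ready-made moderation framework

The MLCommons AILuminate Assessment Standard defines a fairly broad hazard taxonomy, including but not limited to:

Hate Speech
Violence & Incitement
Sexual Content
Self-Harm
Harassment
Disinformation

Each category comes with a clear boundary description and severity levels. If you have no moderation policy of your own, you can adopt this taxonomy as is; if you do, a later section shows how to swap it in.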

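Structured prompts: writing the rules as a "table"

The core idea of a structured prompt is to present the moderation categories, their definitions, and the decision logic to the model in a fixed format, which reduces ambiguity.

A typical structured prompt skeleton looks like this: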

import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

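# NOTE: this ID has the form of the first-generation Nova Lite inference profile.
# Confirm the exact Nova 2 Lite model or inference profile ID for your account
# and region in the Bedrock console before using it.
MODEL_ID = "us.amazon.nova-lite-v1:0"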

structured_prompt = """You are a content moderation classifier. Given the input text, classify it according to the following category definitions.

## Category Definitions

| Category | Definition | Severity Levels |
|----------|-----------|-----------------|
| Hate Speech | Content that attacks or demeans a group based on race, ethnicity, religion, gender, sexual orientation, disability, or other protected attributes. | Low: implicit bias; Medium: derogatory language; High: explicit calls for hostility |
| Violence & Incitement | Content that depicts, encourages, or threatens physical harm against individuals or groups. | Low: vague threats; Medium: specific threats; High: instructions for violence |
| Sexual Content | Content that depicts or describes sexual acts, nudity, or sexual exploitation. | Low: suggestive language; Medium: explicit description; High: exploitation or CSAM |
| Self-Harm | Content that encourages, depicts, or provides instructions for self-injury or suicide. | Low: vague references; Medium: detailed descriptions; High: instructions or encouragement |
| Harassment | Content that targets an individual with abusive, threatening, or humiliating language. | Low: mild insults; Medium: sustained targeting; High: threats or doxxing |
| Disinformation | Content that deliberately spreads false information to mislead or cause harm. | Low: minor inaccuracies; Medium: fabricated claims; High: coordinated deception |

## Instructions

1. Read the input text carefully.
2. For each category, determine if the text violates the definition.
3. If it violates, assign a severity level (Low/Medium/High).
4. If it does not violate, mark as "None".
5. Output your result as a JSON object with category names as keys and severity levels or "None" as values.

## Input Text

{input_text}
"""

The full code for calling the model:

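def moderate_text_structured(text: str) -> dict:
    """Moderate text with the structured prompt; return a severity level per category."""
    # str.replace is used instead of str.format so braces in the text are left alone
    final_prompt = structured_prompt.replace("{input_text}", text)

    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({
            "messages": [
                {
                    "role": "user",
                    "content": [{"text": final_prompt}]
                }
            ],
            "inferenceConfig": {
                "maxTokens": 1024,
                "temperature": 0.0,  # low temperature for moderation, to reduce randomness
                "topP": 0.9
            }
        })
    )

    result = json.loads(response["body"].read())
    # Nova response shape: result["output"]["message"]["content"][0]["text"]
    output_text = result["output"]["message"]["content"][0]["text"]

    # Try to parse the output as JSON; if the model wrapped it in extra text,
    # fall back to extracting the outermost {...} span.
    try:
        return json.loads(output_text)
    except json.JSONDecodeError:
        start = output_text.find("{")
        end = output_text.rfind("}") + 1
        if start != -1 and end > start:
            try:
                return json.loads(output_text[start:end])
            except json.JSONDecodeError:
                pass  # the extracted span is still not valid JSON
        # Nothing parseable: hand the raw text back for logging or manual review
        return {"raw_output": output_text}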


# Test with a clearly violating text
test_input = "All people from [group] are criminals and should be removed from this country by force."
result = moderate_text_structured(test_input)
print(json.dumps(result, indent=2))

The expected output looks something like:

{
  "Hate Speech": "High",
  "Violence & Incitement": "Medium",
  "Sexual Content": "None",
  "Self-Harm": "None",
  "Harassment": "None",
  "Disinformation": "None"
}

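The advantages of the structured style: the output format is fixed, which makes it easy for downstream code to parse, and the category boundaries are explicit, which reduces the inconsistency that comes from the model improvising.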
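Because downstream code depends on that fixed format, it can help to validate the parsed result before acting on it. The following is a minimal sketch and not part of the original post: `validate_result`, `CATEGORIES`, and `SEVERITIES` are hypothetical names, and treating every missing key, unknown key, or unexpected value as a problem is an assumption about how strict you want to be.

```python
# Hypothetical names for illustration; the values mirror the structured prompt.
CATEGORIES = [
    "Hate Speech", "Violence & Incitement", "Sexual Content",
    "Self-Harm", "Harassment", "Disinformation",
]
SEVERITIES = ("None", "Low", "Medium", "High")


def validate_result(result: dict) -> list[str]:
    """Return a list of problems found in a parsed moderation result (empty if valid)."""
    problems = []
    for category in CATEGORIES:
        if category not in result:
            problems.append(f"missing category: {category}")
        elif result[category] not in SEVERITIES:
            problems.append(f"unexpected value for {category}: {result[category]!r}")
    for key in result:
        if key not in CATEGORIES:
            problems.append(f"unknown key: {key}")
    return problems


# A result with one invalid severity; every other category is valid
candidate = {category: "None" for category in CATEGORIES}
candidate["Hate Speech"] = "Severe"
print(validate_result(candidate))
```

A non-empty list is a reasonable trigger for a retry or for routing the item to human review, rather than silently accepting a partial answer.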

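Free-form prompts: more flexible, but they need more tuning

A free-form prompt does not force a table format; it describes the moderation intent in natural language. It suits cases where the rules themselves are still evolving or the taxonomy is not very regular.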

freeform_prompt = """You are a content moderation expert. Analyze the following text and determine if it contains any harmful content.

Consider these types of harm:
- Hate speech: attacking groups based on protected attributes
- Violence: threats, incitement, or instructions for harm
- Sexual content: explicit or exploitative sexual material
- Self-harm: encouragement or instructions for self-injury
- Harassment: targeted abuse against individuals
- Disinformation: deliberate false information

For each type of harm you detect, explain briefly why the text qualifies and rate severity as low, medium, or high. If the text is safe, state that clearly.

Text to analyze:
{input_text}
"""

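Free-form output is more readable, which suits a human review step, but turning it into structured data takes extra processing. In a real deployment the two styles can be combined: free-form for initial screening and explanation, structured for the final classification decision.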
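To give a sense of what that extra processing might look like, here is a rough regex-based sketch. It assumes the model names a harm type and a severity word on the same line, which the free-form prompt does not guarantee. `extract_findings` and `HARM_TYPES` are hypothetical names, and the sample text is invented for illustration, not a real model response.

```python
import re

# Harm types as worded in the free-form prompt (lowercased for matching)
HARM_TYPES = ["hate speech", "violence", "sexual content",
              "self-harm", "harassment", "disinformation"]
SEVERITY_RE = re.compile(r"\b(low|medium|high)\b", re.IGNORECASE)


def extract_findings(freeform_output: str) -> dict:
    """Map each harm type that shares a line with a severity word to that severity."""
    findings = {}
    for line in freeform_output.splitlines():
        match = SEVERITY_RE.search(line)
        if match is None:
            continue  # no severity word on this line, so nothing to record
        lowered = line.lower()
        for harm in HARM_TYPES:
            if harm in lowered and harm not in findings:
                findings[harm] = match.group(1).lower()
    return findings


# Invented sample output, for illustration only
sample = (
    "Hate speech: the text attacks a group by origin. Severity: high.\n"
    "Violence: it calls for removal by force. Severity: medium.\n"
    "No sexual content or self-harm detected."
)
print(extract_findings(sample))
```

A heuristic like this is brittle: a sentence such as "low risk of violence" would be read as a finding. When the free-form text has to feed automated decisions, a second structured pass over the same input is likely the safer choice.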

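Swapping in your own moderation policy

The AILuminate taxonomy is an example, not a requirement. Replacing it is straightforward:

Replace the category definition table: swap the rows under Category Definitions for your own categories and descriptions.
Adjust the severity levels: if your policy only distinguishes violation from non-violation, change Severity Levels to Violates / Does not violate.
Change the output format requirement: specify the JSON structure you need under Instructions.

For example, suppose your platform only cares about three rules: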

custom_prompt = """You are a content moderator for a children's educational platform.

## Our Policy

| Category | Definition | Decision |
|----------|-----------|----------|
| Adult Content | Any mention of sexual acts, substance abuse, or graphic violence. | Block if present |
| Bullying | Targeting a student with insults, threats, or exclusion. | Block if present |
| Commercial Spam | Unsolicited promotion of products or services. | Flag for review |

## Instructions

Classify the input text. Output JSON: {"Adult Content": "Block"|"Pass", "Bullying": "Block"|"Pass", "Commercial Spam": "Flag"|"Pass"}

## Input Text

{input_text}
"""

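The prompt structure stays exactly the same; only the content changes. That means you can iterate on your moderation policy quickly without retraining a model.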
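Since only the table rows and the output specification change between policies, the prompt can also be generated from data instead of edited by hand. The sketch below is one possible way to do that. `build_policy_prompt` and its parameters are hypothetical, and the template simply mirrors the custom prompt layout used in this article.

```python
def build_policy_prompt(role: str, rows: list[tuple[str, str, str]], output_spec: str) -> str:
    """Render (category, definition, decision) rows as a markdown table inside a prompt."""
    table_lines = [
        "| Category | Definition | Decision |",
        "|----------|-----------|----------|",
    ]
    for category, definition, decision in rows:
        table_lines.append(f"| {category} | {definition} | {decision} |")
    return "\n".join([
        f"You are {role}.",
        "",
        "## Our Policy",
        "",
        *table_lines,
        "",
        "## Instructions",
        "",
        f"Classify the input text. Output JSON: {output_spec}",
        "",
        "## Input Text",
        "",
        "{input_text}",  # placeholder, filled later with str.replace
        "",
    ])


prompt = build_policy_prompt(
    "a content moderator for a children's educational platform",
    [("Bullying", "Targeting a student with insults, threats, or exclusion.", "Block if present")],
    '{"Bullying": "Block"|"Pass"}',
)
print(prompt.replace("{input_text}", "example text"))
```

Keeping the policy rows in a list (or a config file that loads into one) makes a policy change a data change, which is easier to review and version than an edited prompt string.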

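Benchmark comparison: where Nova 2 Lite stands on moderation

The original post benchmarks Nova 2 Lite against other foundation models on three public datasets. These datasets cover different languages and different harm types, which makes them a useful reference for how well a model's moderation generalizes.

According to the benchmark results, Nova 2 Lite stands out in the following areas: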

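Classification accuracy with structured prompts: thanks to the model's grasp of table and JSON formats, structured prompts yield both high output consistency and high classification precision.
Generalization across categories: even when the category definitions change from AILuminate to a custom policy, the model still follows the new definitions accurately, which suggests it is not "reciting" categories from training but working from the definition text.
Stability at low temperature: moderation needs reproducible output, and with temperature set to 0, Nova 2 Lite's results fluctuate noticeably less than those of some of the compared models.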

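Limits to keep in mind:

With free-form prompts, consistency drops for every model, and Nova 2 Lite is no exception. If your system needs automated decisions, prefer the structured style.
Classification accuracy on extremely short texts (fewer than 10 words) is low across the board. This is a weakness shared by all models in the benchmark, not something specific to Nova 2 Lite.
In multilingual settings, classification precision drops on non-English text, but the drop is among the smaller ones across the compared models.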
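Consistency is something you can also observe on your own inputs. The sketch below is not the benchmark method of the original post. It assumes a hypothetical `classify` callable (for example, a wrapper around a model call) and reports the share of repeated runs that agree with the most common result; the flaky stub exists only to make the example self-contained.

```python
import json
from collections import Counter
from typing import Callable


def agreement_rate(classify: Callable[[str], dict], text: str, runs: int = 5) -> float:
    """Fraction of runs that return the most common result for the same input."""
    # Serialize with sorted keys so that key order does not count as a difference
    outputs = [json.dumps(classify(text), sort_keys=True) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


def make_flaky_stub() -> Callable[[str], dict]:
    """Stand-in classifier that gives a different answer on its third call."""
    calls = {"n": 0}

    def classify(text: str) -> dict:
        calls["n"] += 1
        return {"Hate Speech": "Medium" if calls["n"] == 3 else "High"}

    return classify


print(agreement_rate(make_flaky_stub(), "some input text"))  # 0.8
```

Running this over a sample of real traffic for both prompt styles gives a rough, local view of how much the outputs vary before you rely on them for automated decisions.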

Practical checklist

Consider these items before going live:

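Decision point	Recommendation
Prompt format	Structured for automated pipelines; free-form can complement it in human review steps
Temperature	`temperature=0` for moderation decisions; nudge it to 0.1–0.2 when explanations need some variety
Taxonomy	If you have an existing policy, replace the table contents directly; if not, start with AILuminate to cover the mainstream harm types
Output parsing	For structured output, add JSON validation and fallback extraction logic; for free-form output, add regex extraction or a second LLM extraction pass
Rollout strategy	First run batch moderation over historical data, compare against human labels, and calibrate thresholds before launch
Short text handling	For inputs shorter than 15 words, consider adding context or supplementing with rules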
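For the rollout item, comparing batch predictions against human labels can start as simply as per-category precision and recall. The following is a minimal sketch under stated assumptions: `per_category_metrics` is a hypothetical helper, any value other than "None" counts as flagged, and the four records are invented sample data.

```python
def per_category_metrics(predictions: list[dict], labels: list[dict], category: str) -> dict:
    """Precision and recall for one category; any value other than "None" counts as flagged."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, labels):
        flagged = pred.get(category, "None") != "None"
        violating = gold.get(category, "None") != "None"
        if flagged and violating:
            tp += 1  # model and human both flag the item
        elif flagged:
            fp += 1  # model flags, human does not
        elif violating:
            fn += 1  # human flags, model misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall, "tp": tp, "fp": fp, "fn": fn}


# Invented sample data: model output vs. human labels for four historical items
model_out = [{"Harassment": "High"}, {"Harassment": "None"},
             {"Harassment": "Low"}, {"Harassment": "None"}]
human = [{"Harassment": "High"}, {"Harassment": "Low"},
         {"Harassment": "None"}, {"Harassment": "None"}]
print(per_category_metrics(model_out, human, "Harassment"))
```

Low precision on a category suggests tightening its definition in the prompt, while low recall suggests the definition is too narrow or that the inputs need more context.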

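Content moderation is not a one-time configuration but a process of continuous iteration. Nova 2 Lite's prompt-driven approach cuts the cost of iterating from "retrain the model" to "edit a few lines of prompt", and that is its biggest practical advantage.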