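Extracting Data from Four Types of Financial Documents with Amazon Bedrock Data Automation

Estimated reading time: 12 minutes

Financial document processing has long been one of the hardest parts of an enterprise data pipeline. Bank statements come in many layouts, tax forms pack fields tightly, and contract clauses nest deeply. Traditional OCR can recover the text, but it struggles to turn that text into structured data. Amazon Bedrock Data Automation (BDA) offers custom extraction for this kind of workload: it returns the fields you care about rather than a page of raw text. This article looks at how it behaves on four common financial documents: bank statements, W-2 forms, 1099-B forms, and vendor contracts.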

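Extraction challenges by document type

Bank statements: many rows, inconsistent formats. Column names, date formats, and sign conventions for amounts differ from bank to bank, so hand-written regular expressions can hardly cover every variant. With BDA you define the fields you need (transaction date, description, amount, balance) and let the model align statements in different formats to them.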

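W-2 forms: a fixed layout with dense fields. A W-2 has numbered boxes 1 through 20 plus lettered boxes a through f, and some boxes carry sub-codes (Box 12 is the usual example). Extraction has to capture each value and also map it to the right box and meaning. A custom blueprint in BDA assigns a semantic label to each box, which avoids ending up with a number without knowing whether it is wages or federal tax withheld.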

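1099-B forms: records of securities transactions. The main difficulty is that one form may list several transactions, each with a date, description, quantity, amount, and other fields, so the data has to be grouped by row rather than flattened per page. BDA supports defining a grouped structure so that each transaction row is emitted as one repeating unit.

Vendor contracts: the typical unstructured case. Clause numbers, dates, amounts, and responsible parties are scattered across paragraphs and sometimes nested inside sub-clauses. In BDA you can define target fields such as party name, effective date, termination date, and payment amount, and the model locates and extracts them from the full text.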

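Custom blueprints: telling BDA what you want

The core idea of BDA is that you define the output schema and it finds the data in the document. The JSON below sketches a custom blueprint for bank statements as a simplified field list. The schema that the BDA console and API accept is a JSON Schema document (fields live under properties, each with a type and an instruction), so treat this layout as a design outline and check the current BDA blueprint documentation before submitting it: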

{
  "document": {
    "extraction": {
      "granularity": {
        "type": "PAGE"
      },
      "outputFormat": {
        "type": "JSON"
      }
    },
    "blueprint": {
      "name": "bank-statement-extraction",
      "fields": [
        {
          "name": "account_holder",
          "type": "string",
          "description": "Account holder name on the bank statement"
        },
        {
          "name": "account_number",
          "type": "string",
          "description": "Account number, typically last 4 digits shown"
        },
        {
          "name": "statement_period",
          "type": "string",
          "description": "Statement period start and end dates"
        },
        {
          "name": "transactions",
          "type": "array",
          "description": "List of individual transactions",
          "items": {
            "type": "object",
            "fields": [
              {
                "name": "date",
                "type": "string",
                "description": "Transaction date"
              },
              {
                "name": "description",
                "type": "string",
                "description": "Transaction description or memo"
              },
              {
                "name": "amount",
                "type": "number",
                "description": "Transaction amount, negative for debits"
              },
              {
                "name": "balance",
                "type": "number",
                "description": "Running balance after transaction"
              }
            ]
          }
        },
        {
          "name": "opening_balance",
          "type": "number",
          "description": "Opening balance at start of statement period"
        },
        {
          "name": "closing_balance",
          "type": "number",
          "description": "Closing balance at end of statement period"
        }
      ]
    }
  }
}

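A few design points:

transactions is defined as an array whose items are objects, so BDA emits one group per row instead of concatenating all the dates into a single string.
Every field carries a description that helps the model understand its meaning. A hint such as "amount, negative for debits" directly affects extraction accuracy.
granularity is set to PAGE, which suits statements of one or a few pages; for a very long contract, switch it to DOCUMENT.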
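
The simplified field list can be translated mechanically into a JSON-Schema-style document. The sketch below is one way to do that. The helper name to_blueprint_schema is hypothetical, and the output keys (class, definitions, properties, inferenceType, instruction) are an assumption based on the layout described in the BDA blueprint documentation, so compare the result with a blueprint exported from the console before relying on it.

```python
import json


def to_blueprint_schema(name, description, fields):
    """Convert a simplified field list into a JSON-Schema-style blueprint dict.

    Assumption: the target layout uses "properties" for top-level fields and
    "definitions" plus "$ref" for repeated rows.
    """
    definitions = {}

    def convert(field):
        instruction = field.get("description", field["name"])
        if field["type"] == "array":
            # Each repeated row becomes a named definition referenced via $ref.
            def_name = field["name"].title().replace("_", "") + "Item"
            definitions[def_name] = {
                "type": "object",
                "properties": {f["name"]: convert(f) for f in field["items"]["fields"]},
            }
            return {
                "type": "array",
                "instruction": instruction,
                "items": {"$ref": f"#/definitions/{def_name}"},
            }
        return {"type": field["type"], "inferenceType": "explicit", "instruction": instruction}

    properties = {f["name"]: convert(f) for f in fields}
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "class": name,
        "description": description,
        "type": "object",
        "definitions": definitions,
        "properties": properties,
    }


simple_fields = [
    {"name": "account_holder", "type": "string",
     "description": "Account holder name on the bank statement"},
    {"name": "transactions", "type": "array",
     "description": "List of individual transactions",
     "items": {"type": "object", "fields": [
         {"name": "date", "type": "string", "description": "Transaction date"},
         {"name": "amount", "type": "number",
          "description": "Transaction amount, negative for debits"},
     ]}},
]

schema = to_blueprint_schema("bank-statement", "Monthly bank statement", simple_fields)
print(json.dumps(schema, indent=2))
```

Keeping the simplified list as the source of truth and generating the submitted schema from it means each field description is written once and reused as the extraction instruction.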

The blueprint for a W-2 has a similar structure, but the fields are flatter:

{
  "fields": [
    { "name": "employee_name", "type": "string" },
    { "name": "employer_ein", "type": "string", "description": "Employer Identification Number" },
    { "name": "box1_wages", "type": "number", "description": "Wages, tips, other compensation (Box 1)" },
    { "name": "box2_federal_tax", "type": "number", "description": "Federal income tax withheld (Box 2)" },
    { "name": "box12_codes", "type": "array", "items": { "type": "object", "fields": [
      { "name": "code", "type": "string" },
      { "name": "value", "type": "number" }
    ]}}
  ]
}

Box 12 is the most error-prone part of a W-2: it can hold several code-and-value pairs, and only an array captures them all.
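
Once the pairs arrive as an array, a small post-processing step can index them by code and flag anything unexpected. The following is a minimal sketch: the function name index_box12 is hypothetical, the input shape mirrors the box12_codes field above, and the code table is only an illustrative subset of the IRS list.

```python
from decimal import Decimal

# Illustrative subset of IRS Box 12 codes; the full list is in the W-2 instructions.
KNOWN_BOX12_CODES = {
    "C": "Taxable cost of group-term life insurance over $50,000",
    "D": "Elective deferrals to a 401(k) plan",
    "W": "Employer contributions to a health savings account",
    "DD": "Cost of employer-sponsored health coverage",
}


def index_box12(pairs):
    """Return ({code: Decimal total}, [unrecognized codes]) from extracted pairs."""
    totals, unknown = {}, []
    for pair in pairs:
        code = str(pair.get("code", "")).strip().upper()
        value = Decimal(str(pair.get("value", 0)))
        if code not in KNOWN_BOX12_CODES:
            # Keep unrecognized codes for manual review instead of dropping them silently.
            unknown.append(code)
            continue
        totals[code] = totals.get(code, Decimal("0")) + value
    return totals, unknown


# Made-up sample values in the shape of the box12_codes array.
extracted = [
    {"code": "d", "value": 5000.0},
    {"code": "DD", "value": 8200.5},
    {"code": "ZZ", "value": 10},
]
totals, unknown = index_box12(extracted)
print(totals, unknown)
```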

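Running the full extraction flow with the Python SDK

The script below walks through the full flow: create the blueprint, invoke extraction, and parse the result. Configure AWS credentials and a region first (BDA is currently available only in some regions).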

import boto3
import json
import time

bda = boto3.client("bedrock-data-automation", region_name="us-east-1")

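# 1. Create the blueprint.
# BDA expects a JSON Schema document: fields go under "properties" and repeated
# rows go under "definitions". Key names follow the BDA blueprint docs; verify
# them against the current documentation.
def leaf(json_type, instruction):
    return {"type": json_type, "inferenceType": "explicit", "instruction": instruction}

blueprint_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "class": "bank-statement",
    "description": "Bank statement with a list of transactions",
    "type": "object",
    "definitions": {
        "Transaction": {
            "type": "object",
            "properties": {
                "date": leaf("string", "Transaction date"),
                "description": leaf("string", "Transaction description or memo"),
                "amount": leaf("number", "Transaction amount, negative for debits"),
                "balance": leaf("number", "Running balance after transaction"),
            },
        }
    },
    "properties": {
        "account_holder": leaf("string", "Account holder name on the bank statement"),
        "account_number": leaf("string", "Account number, typically last 4 digits shown"),
        "transactions": {
            "type": "array",
            "instruction": "List of individual transactions",
            "items": {"$ref": "#/definitions/Transaction"},
        },
        "opening_balance": leaf("number", "Opening balance at start of statement period"),
        "closing_balance": leaf("number", "Closing balance at end of statement period"),
    },
}

blueprint_response = bda.create_blueprint(
    blueprintName="bank-statement-extraction",
    type="DOCUMENT",
    blueprintStage="LIVE",
    schema=json.dumps(blueprint_schema),
)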

blueprint_arn = blueprint_response["blueprint"]["blueprintArn"]
print(f"Blueprint ARN: {blueprint_arn}")

# 2. Create a Data Automation project and attach the blueprint
project_response = bda.create_data_automation_project(
    projectName="financial-doc-project",
    projectStage="LIVE",
    customOutputConfiguration={
        "blueprints": [
            # Each entry takes the blueprint ARN (optionally a version and stage);
            # there is no document-type key here.
            {"blueprintArn": blueprint_arn, "blueprintStage": "LIVE"}
        ]
    },
    standardOutputConfiguration={
        "document": {
            "extraction": {
                "granularity": {"types": ["PAGE"]},
                "boundingBox": {"state": "DISABLED"},
            }
        }
    },
)

project_arn = project_response["projectArn"]
print(f"Project ARN: {project_arn}")

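# 3. Trigger extraction (the document is assumed to be in S3 already).
# Invocation and status calls belong to the separate runtime client.
region = "us-east-1"
bda_runtime = boto3.client("bedrock-data-automation-runtime", region_name=region)
s3 = boto3.client("s3")
bucket = "my-financial-docs-bucket"
key = "bank-statements/sample_statement.pdf"

# Assumption: the default US cross-region profile; adjust the ARN for your region.
account_id = boto3.client("sts").get_caller_identity()["Account"]
profile_arn = f"arn:aws:bedrock:{region}:{account_id}:data-automation-profile/us.data-automation-v1"

response = bda_runtime.invoke_data_automation_async(
    inputConfiguration={"s3Uri": f"s3://{bucket}/{key}"},
    outputConfiguration={"s3Uri": f"s3://{bucket}/output/"},
    dataAutomationConfiguration={
        "dataAutomationProjectArn": project_arn,
        "stage": "LIVE",
    },
    dataAutomationProfileArn=profile_arn,
)

invocation_arn = response["invocationArn"]
print(f"Invocation ARN: {invocation_arn}")

# 4. Poll until the job reaches a terminal status
status = "Created"
while status in ("Created", "InProgress"):
    time.sleep(5)
    check = bda_runtime.get_data_automation_status(invocationArn=invocation_arn)
    status = check["status"]
    print(f"Status: {status}")


def read_s3_json(uri):
    """Load a JSON object from an s3://bucket/key URI."""
    bucket_name, object_key = uri.removeprefix("s3://").split("/", 1)
    body = s3.get_object(Bucket=bucket_name, Key=object_key)["Body"].read()
    return json.loads(body.decode("utf-8"))


if status == "Success":
    # The status response points at job_metadata.json, which lists the output
    # files. The key names below follow the documented output layout; verify
    # them against a real job before relying on them.
    metadata = read_s3_json(check["outputConfiguration"]["s3Uri"])
    segment = metadata["output_metadata"][0]["segment_metadata"][0]
    result = read_s3_json(segment["custom_output_path"])["inference_result"]

    print("=== Extraction result ===")
    print(f"Account holder: {result.get('account_holder')}")
    print(f"Opening balance: {result.get('opening_balance')}")
    print(f"Closing balance: {result.get('closing_balance')}")
    print(f"Transaction count: {len(result.get('transactions', []))}")

    for tx in result.get("transactions", [])[:5]:
        print(f"  {tx['date']} | {tx['description']} | {tx['amount']} | {tx['balance']}")
else:
    print(f"Extraction failed, status: {status}")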

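What to change before running:

region_name: confirm which region has BDA enabled for your account.
bucket and key: replace with your own S3 bucket and document path.
Blueprint schema: adjust the field definitions to your document type.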
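
A polling loop with no upper bound can hang if a job never leaves its in-progress state. The sketch below wraps polling in a helper with a timeout. The name wait_for_job and its parameters are hypothetical, and the default status strings ("Success", "ServiceError", "ClientError") are an assumption about what the runtime API returns, so pass your own sets if your SDK version reports different values. The status getter is injected so the helper can be exercised without AWS access.

```python
import time


def wait_for_job(get_status, timeout_s=600, interval_s=5,
                 success=("Success",), failures=("ServiceError", "ClientError"),
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it reports success, a failure status, or the timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in success:
            return status
        if status in failures:
            raise RuntimeError(f"Job failed with status {status}")
        if clock() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout_s}s")
        sleep(interval_s)


# Simulated run: the fake getter steps through three statuses without sleeping.
fake_statuses = iter(["Created", "InProgress", "Success"])
final = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)
```

With a real client, get_status would be a small lambda that calls the status API and returns the status field of the response.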

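Extraction quality and pitfalls in practice

According to the test results in the original article, a few points stand out in how BDA performs on these four document types:

Bank statements: formats vary widely across banks, but custom fields combined with semantic descriptions align differing column names reasonably well. For the sign of an amount (debit versus credit), state the rule explicitly in the description; otherwise the model may treat a refund as income.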

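W-2 and 1099-B: these are highly structured and give the best extraction accuracy. The main challenges are the multi-value pairs in Box 12 and the multi-row transactions on a 1099-B; once they are defined as arrays, grouping is mostly correct. An occasional error is an amount extracted as a string (such as "$1,234.56"); declaring type: number in the blueprint and adding a description that says "no currency symbol" addresses it.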
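
Even with type: number in the blueprint, a defensive parser downstream is cheap insurance against string amounts. This is a minimal sketch: parse_amount is a hypothetical helper, and it assumes US-style formatting (comma as thousands separator, parentheses or a leading minus for negatives).

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(raw):
    """Convert values like '$1,234.56', '(45.00)' or 12.5 into a Decimal."""
    if isinstance(raw, (int, float, Decimal)):
        return Decimal(str(raw))
    text = str(raw).strip()
    # Accounting notation wraps negatives in parentheses; also accept a leading minus.
    negative = text.startswith(("-", "$-", "("))
    digits = re.sub(r"[^0-9.]", "", text)
    try:
        value = Decimal(digits)
    except InvalidOperation:
        raise ValueError(f"Not a recognizable amount: {raw!r}") from None
    return -value if negative else value


for sample in ["$1,234.56", "(45.00)", 12.5, "-$7"]:
    print(sample, "->", parse_amount(sample))
```

Decimal is used instead of float so that cents are not distorted by binary rounding before the data reaches a ledger.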

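Vendor contracts: accuracy varies the most. The longer the contract and the deeper the clause nesting, the more likely the model is to miss a key clause. In practice, split the contract into its key sections and extract each one separately instead of running the whole contract in one pass, and make field descriptions as specific as possible, for example "payment amount, usually found in Section 3 or the Payment Terms paragraph".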
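
Extracting section by section leaves several partial results that have to be merged afterwards. One possible merge rule is sketched below: keep the first non-empty value for each field, remember which section it came from, and record disagreements for review. The function name merge_section_results, the input shape, and the sample values are all made up for illustration.

```python
def merge_section_results(section_results):
    """Merge per-section extraction dicts.

    section_results is a list of (section_label, fields_dict) tuples.
    Returns (merged, conflicts): merged maps each field to its first non-empty
    value and source section; conflicts lists later values that disagree.
    """
    merged, conflicts = {}, {}
    for section, fields in section_results:
        for name, value in fields.items():
            if value in (None, "", []):
                continue  # the section did not contain this field
            if name not in merged:
                merged[name] = {"value": value, "source": section}
            elif merged[name]["value"] != value:
                conflicts.setdefault(name, []).append((section, value))
    return merged, conflicts


# Made-up sample: two sections of one contract, extracted separately.
results = [
    ("Section 1 Parties", {"party_name": "Acme Supply Co.",
                           "effective_date": "2024-01-01",
                           "payment_amount": None}),
    ("Section 3 Payment Terms", {"party_name": "",
                                 "effective_date": "2024-02-01",
                                 "payment_amount": 12000}),
]
merged, conflicts = merge_section_results(results)
print(merged)
print(conflicts)
```

Recording conflicts rather than silently overwriting makes it easier to route ambiguous contracts to manual review.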

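Getting-started checklist

Confirm region and quotas: BDA is not available in every region, so check availability in the AWS console first.
Start with the most structured documents: run W-2 or 1099-B first to validate the blueprint syntax and the SDK call flow, then move on to bank statements and contracts.
Field descriptions are the key lever: do not write them too loosely. "Amount" is weaker than "transaction amount, negative for debits, no $ symbol".
Use array types for multi-row and multi-value cases: statement transaction rows, 1099-B transaction rows, and W-2 Box 12 pairs all need an array of objects; otherwise the data is flattened and loses its structure.
Process long contracts in sections: for contracts over 10 pages, consider splitting them by section into several PDFs, extracting each, and merging the results downstream.
Do not skip result validation: even when accuracy is high, sample and manually review financial data before loading it, especially amount and tax ID fields.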
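
Part of that validation can be automated for bank statements, because the numbers must reconcile. The sketch below checks that the opening balance plus all transaction amounts equals the closing balance, and that each row's running balance follows from the previous one. The function name reconcile_statement is hypothetical; it assumes the field names from the bank statement blueprint above, amounts that are negative for debits, and transactions listed in chronological order.

```python
from decimal import Decimal


def reconcile_statement(result, tolerance=Decimal("0.01")):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    opening = Decimal(str(result["opening_balance"]))
    closing = Decimal(str(result["closing_balance"]))
    transactions = result.get("transactions", [])

    # Check 1: opening balance plus all amounts should equal the closing balance.
    total = sum((Decimal(str(tx["amount"])) for tx in transactions), Decimal("0"))
    if abs(opening + total - closing) > tolerance:
        problems.append(f"opening {opening} + transactions {total} != closing {closing}")

    # Check 2: each running balance should equal the previous balance plus the amount.
    previous = opening
    for index, tx in enumerate(transactions):
        expected = previous + Decimal(str(tx["amount"]))
        reported = Decimal(str(tx["balance"]))
        if abs(expected - reported) > tolerance:
            problems.append(f"row {index}: expected balance {expected}, got {reported}")
        previous = reported  # restart from the reported value so one error is flagged once
    return problems


# Made-up sample in the shape of the extraction result.
sample = {
    "opening_balance": 100.0,
    "closing_balance": 70.5,
    "transactions": [
        {"amount": -50.0, "balance": 50.0},
        {"amount": 20.5, "balance": 70.5},
    ],
}
print(reconcile_statement(sample))
```

A non-empty result does not say which value is wrong, only that the statement does not add up, which is usually enough to route it to manual review.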

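BDA turns the job of digging data out of documents from writing regular expressions into writing schemas. The barrier is lower, but the quality of the schema design directly determines the quality of the output. Time spent writing a clear blueprint pays off far more than patching data afterwards.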