For data extraction from images such as receipts, ID cards, or paper forms, traditional methods often use OCR (Optical Character Recognition) combined with rules or regex to extract data. This is complicated and difficult when data formats change.

In reality, we have another option: LLM Multimodal which can "understand images" and "answer questions" directly.

Concept

Multimodal LLMs can process both text and images simultaneously, allowing us to input images along with prompts such as:

This image is a restaurant receipt. Please extract data into JSON: restaurant name, date, food items, total amount

Code Example

Model

For this round, we'll try using gemma3-12b-vision

You can follow the Getting Started in the README for anyone who wants to try deploying the model themselves

Code

client-gemma3.ts

Set a clear prompt about what this image is, what fields are needed with examples, what to do if not found, and specify that it must return in JSON format only. Here we'll also tell it to look at the category to determine what it is. If it's invoice, then extract the fields we want, but if not, tell us what it is.

This is an image of a document. Please analyze the document and return a JSON object that strictly follows the schema below:

{
  "category": string,                    // Document category: "invoice", "receipt", or "other-[description]"
  "result": object | null           // Invoice data if category is "invoice", otherwise null
}

If the document category is "invoice", return the result as:
{
  "category": "invoice",
  "result": {
    "invoice_number": string,              // e.g. "INV-2024-00123"
    "invoice_date": string,             // Date as shown in document (any format: DD/MM/YYYY, MM-DD-YYYY, YYYY-MM-DD, etc.)
    "due_date": string | null,        // Date as shown in document (any format) or null if not found
    "total_amount": number | null,    // Final total amount (e.g. 1234.50) in INR
    "items": [
      {
        "description": string,        // Name of the product or service
        "quantity": number | null,
        "unit_price": number | null,
        "line_total": number | null
      }
    ]
  }
}

If the document category is NOT "invoice", return the result as:
{
  "category": "receipt" | "other-[description]",
  "result": null
}

DOCUMENT CATEGORY GUIDELINES:
- "invoice": Only tax invoices, billing invoices, commercial invoices
- "receipt": Payment receipts, transaction receipts
- "other-[description]": For any other content, describe what you see after "other-"
  Examples: "other-id_card", "other-child_photo", "other-certificate", "other-bill", "other-menu", "other-text_document"

IMPORTANT INSTRUCTIONS:
- Only extract invoice data if the document category is clearly "invoice"
- For receipts, use exactly "receipt" as the category
- For anything else, use "other-" followed by a brief English description of what you see
- For dates, extract them exactly as they appear in the document - do not convert or reformat
- For all non-invoice documents, set result to null
- Do not guess or fabricate values
- Return ONLY the JSON object directly. Do not wrap it in markdown code blocks (\`\`\`json\`\`\`)
- Do not include any explanation, comments, or extra text before or after the JSON
- Your response should start with { and end with }
- The response must be valid JSON that can be parsed directly

Pass image through image_url type in content message back and forth

  const { data } = await axios.post(
    `${process.env.BASE_URL}/chat/completions`,
    {
      model: "gemma3-12b-vision",
      messages: [
        {
          role: "user",
          content: [
            {
              type: "text",
              text: prompt,
            },
            {
              type: "image_url",
              image_url: { url: `data:image/jpeg;base64,${image}` },
            },
          ],
        },
      ],
    },
    {
      headers: {
        Authorization: `Bearer ${process.env.API_KEY}`,
        "Content-Type": "application/json",
      },
    }
  );

client-openai.ts

For cases using other Models that support OpenAI Completions and Multimodal

Result

{
  "category": "invoice",
  "result": {
    "invoice_number": "INV-005",
    "invoice_date": "Jun 22, 2021",
    "due_date": "Jun 27, 2021",
    "total_amount": 1564,
    "items": [
      {
        "description": "Desktop furniture",
        "quantity": 1,
        "unit_price": 232,
        "line_total": 232
      },
      {
        "description": "Plumbing and electrical services",
        "quantity": 2,
        "unit_price": 514,
        "line_total": 1028
      },
      {
        "description": "Water tank repair works",
        "quantity": 2,
        "unit_price": 152,
        "line_total": 304
      }
    ]
  }
}

{
  "category": "invoice",
  "result": {
    "invoice_number": "100",
    "invoice_date": "1/30/23",
    "due_date": "1/30/23",
    "total_amount": 420,
    "items": [
      {
        "description": "20\" x 30\" hanging frames",
        "quantity": 10,
        "unit_price": 15,
        "line_total": 150
      },
      {
        "description": "5\" x 7\" standing frames",
        "quantity": 50,
        "unit_price": 5,
        "line_total": 250
      }
    ]
  }

{
  "category": "receipt",
  "result": null
}

Advantages

Works with multiple formats (even if layout changes)
Can specify desired data through prompts
No need to write regex or document-specific templates
Can add conditions like if it's a then extract b
Can perform classification/validation

Cautions

Relatively high cost (charged by token)
Accuracy depends on prompt and image clarity
Should have fallback or validation in case model gives wrong answer

Compared to OCR Approach

OCR: Convert image to text
OCR Post-processing: Improve text obtained from OCR to be more accurate
Text Extraction & Cleaning: Extract important text and clean data
Document Structure Understanding: Understand document structure (receipts, invoices, approval reports)
Named Entity Recognition (NER): Identify company name, date, amount, document number

Conclusion

Using Multimodal LLM for data extraction from images is an excellent approach, especially when documents have many fields and we only want to extract specific fields, layouts may not be consistent, or there are certain conditions for data extraction. We can control it through prompts without building complex parsing systems. This means we no longer need to stick to the idea that this type of work must only use OCR and then sit and write regex or analyze further to extract the data we really want.

The examples given are simple data extraction, but in reality, there are many other approaches we can take. We can add certain conditions, use it for Classification/Validation.

Float16

Managed GPU resource platform for developers. Experience the cheapest serverless GPU in spot mode and the fastest GPU endpoint in deploy mode.

Connect with us:

Medium : Float16.cloud
Facebook : Float16.cloud
X : Float16.cloud
Discord : Float16.cloud
Youtube : Float16.cloud

Data Extraction from Images Using Gemma3