Extract, Batch & Auto-Split: The Complete Guide to Document Extraction

Every document that enters Crystl passes through an extraction engine: a vision-language model that reads the file as a human would and pulls out the exact fields you need. There are three ways to trigger that process, each designed for a different situation. This guide walks through all of them: Extract (one document at a time), Batch (up to 50 files in one go), and Auto-Split (one bundled PDF that contains several documents). By the end you will know which mode to reach for, how the upload process works in each, and what every option actually does.

The Three Extraction Modes

Mode	Scope	Highlights
📄 Extract	Single document	One file at a time; full control over type; instant results; best for one-offs
📦 Batch	Up to 50 documents	2–50 files at once; mixed document types; background processing; live progress tracking
✂️ Auto-Split	Bundled PDF packets	One PDF, many docs; auto-detects boundaries; extracts each segment; KYC / compliance packs

Extract — One Document at a Time

The Extract tab is the starting point. You upload a single file, configure a handful of options, click the button, and within seconds your document's data appears as structured fields — ready to copy, export, or act on.

Here is the full process from upload to result:

01 Upload: PDF · PNG · JPG · TIFF, drag & drop or click
02 Doc Type: Invoice, Contract, ID Document, Bank Statement, or Auto-detect ✦
03 Engine: Fast ⚡ (2–5 sec, default) or Moderate 🎯 (5–15 sec, higher accuracy)
04 Results: extracted fields with confidence scores, for example:

Field	Value	Confidence
Invoice No.	#INV-2025-001	98%
Vendor	Acme Corp Ltd	94%
Total	$4,250.00	99%
Due Date	2025-02-15	91%

Step 1 — Upload your file

Drag your document onto the upload area or click to browse. Crystl accepts PDF, PNG, JPG, TIFF, and BMP files. Multi-page PDFs are fully supported — all pages are read and extracted together as one document.

Step 2 — Choose a document type

The document type tells Crystl which fields to look for. When you pick Invoice, Crystl extracts invoice number, vendor, line items, totals, and due date. When you pick ID Document, it extracts name, ID number, date of birth, expiry, and nationality.

Crystl ships with a library of system document profiles — Invoice, Contract, ID Document, Bank Statement, Receipt, Medical Document, Form, and more. If your organisation has created custom document profiles, those appear at the top of the list.

Leaving the field blank triggers Auto-detect. Crystl runs a classification pass first to identify what kind of document you have, then extracts accordingly. Auto-detect is convenient but it costs one additional page from your monthly quota per file, and classification is slightly less precise than telling Crystl the type upfront.

Rule of thumb: if you know the document type, pick it. Reserve Auto-detect for ad-hoc one-offs where you genuinely are not sure.

Step 3 — Add custom instructions (optional)

The instructions field lets you guide the extraction in plain English. Examples of what you can write here:

"Extract all line items as a table with quantity, description, and unit price."
"The date format used in this document is DD/MM/YYYY."
"Focus on the second signatory's details, not the first."

Instructions are passed directly to the engine along with your document. They do not change which document profile is used — they give the model additional context on top of the document profile's standard field list.

Step 4 — Extract and review results

Click Extract Document. The result panel shows every extracted field alongside a confidence score, a percentage that tells you how certain the model is about that value. Fields at 90% or above are shown in green, fields from 60% up to 90% in amber, and fields below 60% in red as a signal for manual review.
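The colour bands can be expressed as a small lookup. The following is only a sketch for anyone post-processing exported results: `confidence_band` is a hypothetical helper, not part of Crystl, and it assumes scores arrive as percentages between 0 and 100.

```python
def confidence_band(score: float) -> str:
    """Map a confidence percentage (0-100) to the colour band described in this guide."""
    if not 0 <= score <= 100:
        raise ValueError(f"confidence must be between 0 and 100, got {score}")
    if score >= 90:
        return "green"   # 90% and above
    if score >= 60:
        return "amber"   # 60% up to (not including) 90%
    return "red"         # below 60%: flag for manual review


# Sample values taken from the results illustration earlier in this guide.
sample = {"Invoice No.": 98, "Vendor": 94, "Total": 99, "Due Date": 91}
for field, score in sample.items():
    print(f"{field}: {score}% -> {confidence_band(score)}")
```

All four sample fields land in the green band; a value such as 62 would come back as amber.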

Underneath the fields you will also see the total processing time (typically 2–10 seconds) and how many pages were consumed from your quota.

Export the results as Excel, Word, or Markdown using the buttons at the top of the results panel. If something looks wrong, click Report an Issue — your feedback goes directly to the Crystl support team.

Choosing Your Engine

Crystl offers two AI engines. Your organisation admin sets a default; if the Allow provider override setting is enabled, you can switch per-extraction.

Engine	Speed	Best for	Complex tables	Handwriting	Use when
Fast ⚡ (default)	2–5 seconds	Invoices, forms, receipts	Good	Basic	Processing volume is high or speed matters
Moderate 🎯	5–15 seconds	Contracts, dense tables	Excellent	Strong	Accuracy on complex or dense documents is the priority

In practice: use Fast for anything that follows a predictable structure — invoices, receipts, standard forms, ID documents. Switch to Moderate when you are dealing with multi-page contracts, complex data tables, dense financial statements, or documents with handwritten annotations.

Batch — Process Up to 50 Files at Once

Batch is Extract at scale. Instead of uploading one file and waiting, you drop up to 50 files at once, configure them as a group (with per-file overrides if needed), submit the job, and watch a live progress feed as Crystl works through the queue in the background.

Uploading a batch

Switch to the Batch tab and drop your files onto the upload area. You can mix PDFs and images in any supported format, up to 50 files per job. The maximum number of pages per document depends on your plan.

Setting document types per file

After upload, each file appears in the list with its own document-type selector. You have three options:

Set all to one type — use the "Set all to…" dropdown to apply one document profile to every file at once. Best when you are processing a uniform stack (e.g., 40 invoices).
Override per file — click the selector on any individual file to give it a different type. Useful in mixed batches.
Leave blank — any file without a type triggers auto-detection. Crystl will flag how many files are set to auto-detect and remind you that each one costs an extra page from your quota.
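The three options above amount to a simple precedence rule: a per-file override wins, then the "Set all to…" value, and anything still without a type falls back to auto-detect. Here is a minimal sketch of that rule, assuming file names as keys; `resolve_batch_types` and its parameters are hypothetical names for illustration, not a Crystl API.

```python
from typing import Optional

MIN_FILES_PER_JOB = 2   # Batch is described as 2-50 files at once
MAX_FILES_PER_JOB = 50


def resolve_batch_types(
    files: list[str],
    set_all_to: Optional[str] = None,
    overrides: Optional[dict[str, str]] = None,
) -> tuple[dict[str, Optional[str]], int]:
    """Return (document type per file, number of files left on auto-detect).

    A type of None means the file was left blank and will be auto-detected.
    """
    if not MIN_FILES_PER_JOB <= len(files) <= MAX_FILES_PER_JOB:
        raise ValueError(f"a batch needs 2-50 files, got {len(files)}")
    overrides = overrides or {}
    # Per-file override first, then the "Set all to..." value, else None.
    resolved = {name: overrides.get(name, set_all_to) for name in files}
    auto_detect_count = sum(1 for doc_type in resolved.values() if doc_type is None)
    return resolved, auto_detect_count


files = [
    "sarah_johnson.pdf",
    "tom_richards.pdf",
    "aisha_diallo.pdf",
    "mark_chen.pdf",
    "lisa_park.pdf",
]
resolved, auto_count = resolve_batch_types(files, overrides={"aisha_diallo.pdf": "Invoice"})
print(f"{auto_count} auto-detect files -> +{auto_count} pages used for classification")
```

With one file set to Invoice and four left blank, the sketch reports four extra classification pages.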

Example: Batch Job #b-00291 (KYC Onboarding · 5 files), status Processing, progress 3 of 5 complete

File	Document type	Status
sarah_johnson.pdf	Detected: Passport	97%
tom_richards.pdf	Detected: Bank Statement	91%
aisha_diallo.pdf	Invoice	95%
mark_chen.pdf	Auto-detect	processing
lisa_park.pdf	Auto-detect	queued

4 auto-detect files → +4 pages used for classification

Live progress tracking

Once you click Extract N files, the job is submitted and processing starts immediately. The progress panel updates in real time via a WebSocket connection: you see each file tick from Queued to Processing to Success (or Failed, with an error message). The overall progress bar shows completed versus total and is colour-coded green when every file succeeds and amber if some files failed.
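To make the progress-bar logic concrete, here is a rough sketch of how the per-file states could roll up into the overall figure. It is an illustration only: `summarise_progress` is a hypothetical function, it assumes that failed files count as completed, and the "neutral" colour for a job that is still running is an assumption, since this guide only specifies the colours for finished jobs.

```python
from collections import Counter


def summarise_progress(statuses: list[str]) -> dict:
    """Roll per-file states (queued, processing, success, failed) into one summary."""
    counts = Counter(statuses)
    completed = counts["success"] + counts["failed"]  # assumption: failed files are finished too
    total = len(statuses)
    if completed < total:
        colour = "neutral"  # assumption: colour while the job is still running
    elif counts["failed"] > 0:
        colour = "amber"    # finished, but some files failed
    else:
        colour = "green"    # finished and every file succeeded
    return {"completed": completed, "total": total, "colour": colour}


# A five-file job partway through its queue.
print(summarise_progress(["success", "success", "success", "processing", "queued"]))
```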

Per-file actions

Each completed file offers three actions:

Export (Excel / Word / Markdown) — download that file's extracted data immediately. The files are generated as soon as extraction completes and are available via a secure link.
Re-extract — if the wrong document type was used (or auto-detect misclassified), click Re-extract, choose the correct type, and Crystl fetches the original file from storage and re-runs extraction. The result in the batch updates in real time.
Report an Issue — available for both successful and failed files. Describe what went wrong and the team follows up.

Auto-Split — One PDF, Multiple Documents

Auto-Split solves a problem that comes up constantly in KYC and compliance workflows: a client scans several different documents into a single PDF file. One PDF, several document types. Crystl detects where each one begins and ends, extracts each segment with the right document profile, and returns them all in a single structured result.

When to use Auto-Split

KYC bundles: passport + bank statement + proof of address in one scan
Compliance packs: multiple contracts or forms submitted as a single file
Scanned mail packs: a stack of incoming documents scanned in sequence

If you know the PDF contains only one document type, use Extract or Batch instead — Auto-Split is overkill and costs more pages (see below).

How it works — two passes

Auto-Split runs two AI passes on your document. Understanding this is important because it directly affects your page quota:

Example: 📎 client_bundle.pdf (4 pages · mixed document types)

Pass 1: classify each page (Fast engine)

Page	Classification	Boundary decision
1	Passport	Start of new document
2	Passport	Continuation of page 1
3	Bank Statement	New document boundary
4	Utility Bill	New document boundary

Pass 2: extract each segment (your chosen engine)

Segment	Document type	Pages	Confidence
1	Passport	1–2	96%
2	Bank Statement	3	92%
3	Utility Bill	4	88%

Pass 1 — Classification. Crystl converts the PDF to page images and sends each page to the Fast engine in sequence. Each page is classified in context of the previous one so the model can detect "this is a continuation" versus "this is a new document starting." The result is a list of boundary decisions and a grouping of pages into segments.

Pass 2 — Extraction. For each identified segment, Crystl extracts structured fields using the auto-detected document profile and the engine you selected (Fast or Moderate). Pages within a segment run concurrently to keep total time down.
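The hand-off between the two passes is essentially a grouping step: each page carries a "new document" or "continuation" decision, and consecutive pages are collected into segments. The sketch below illustrates that idea under stated assumptions; `PageDecision`, `Segment`, and `group_into_segments` are hypothetical names, and the real classifier output format is not documented here.

```python
from dataclasses import dataclass


@dataclass
class PageDecision:
    page: int              # 1-based page number in the bundled PDF
    doc_type: str          # classification from pass 1
    is_new_document: bool  # True if this page starts a new document


@dataclass
class Segment:
    doc_type: str
    pages: list[int]


def group_into_segments(decisions: list[PageDecision]) -> list[Segment]:
    """Group consecutive pages into segments using the pass 1 boundary decisions."""
    segments: list[Segment] = []
    for decision in decisions:
        # The first page always opens a segment, even if not flagged as a boundary.
        if decision.is_new_document or not segments:
            segments.append(Segment(decision.doc_type, [decision.page]))
        else:
            segments[-1].pages.append(decision.page)
    return segments


# The four-page bundle used as the example in this section.
bundle = [
    PageDecision(1, "Passport", True),
    PageDecision(2, "Passport", False),
    PageDecision(3, "Bank Statement", True),
    PageDecision(4, "Utility Bill", True),
]
for number, segment in enumerate(group_into_segments(bundle), start=1):
    print(f"Segment {number}: {segment.doc_type}, pages {segment.pages}")
```

For the example bundle this yields three segments: Passport on pages 1 and 2, Bank Statement on page 3, and Utility Bill on page 4.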

⚠ Quota note: Auto-Split uses 2× pages

A 10-page PDF costs 20 pages of quota: 10 for the classification pass and 10 for the extraction pass. This is expected and intentional, because every page is processed twice, once to detect document boundaries and once to extract fields. Keep it in mind when processing large bundles.

Results and per-segment actions

The result panel lists each detected segment as a collapsible card showing the document type, page range, confidence score, and all extracted fields. You can:

Re-extract a segment with a different type — if the classifier got a segment wrong, click Re-extract, choose the correct document profile, and only that segment is re-run. The rest stay untouched.
Export all segments at once — Excel (multiple worksheets, one per segment type), Word (sections per segment), or Markdown (headers per segment).
Report an Issue — target a specific segment or the whole document.

Understanding Pages & Quota

Every plan includes a monthly page quota. One page of a document = one page consumed. Here is how each mode counts:

Mode	Pages consumed	Auto-detect surcharge
Extract	1 page per document page	+1 page per file
Batch	1 page per document page, per file	+1 page per auto-detect file
Auto-Split	2 pages per document page (2 passes)	Included — all pages are classified
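The table can be turned into a quick estimate before you submit a job. This is a back-of-the-envelope sketch, not billing logic: `pages_consumed` is a hypothetical helper, and the mode names are just labels chosen for the example.

```python
from typing import Optional


def pages_consumed(
    mode: str,
    page_counts: list[int],
    auto_detect: Optional[list[bool]] = None,
) -> int:
    """Estimate quota pages for a job.

    page_counts holds the number of pages in each file; auto_detect holds one
    flag per file saying whether its document type was left blank.
    """
    if auto_detect is None:
        auto_detect = [False] * len(page_counts)
    if len(auto_detect) != len(page_counts):
        raise ValueError("auto_detect needs one flag per file")
    document_pages = sum(page_counts)
    if mode == "auto-split":
        # Two passes over every page; classification is already included.
        return 2 * document_pages
    if mode in ("extract", "batch"):
        # One page per document page, plus one page per auto-detect file.
        return document_pages + sum(1 for flag in auto_detect if flag)
    raise ValueError(f"unknown mode: {mode!r}")


print(pages_consumed("extract", [3], [True]))            # 3-page file, auto-detect
print(pages_consumed("batch", [1] * 50, [True] * 50))    # 50 one-page files, all auto-detect
print(pages_consumed("batch", [1] * 50))                 # same batch with explicit types
print(pages_consumed("auto-split", [10]))                # the 10-page example above
```

The four calls print 4, 100, 50, and 20 respectively, which matches the 50-page saving from setting types upfront and the 2× cost of Auto-Split.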

Org admins receive email alerts at 80% and 100% of the monthly limit. You can check current usage at any time from the organisation settings page.
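If you track usage on your side as well, the two alert levels are easy to mirror. The sketch below is an assumption about how such a check could look, not a description of how Crystl sends its emails; `alerts_crossed` and the 1,000-page limit are made up for illustration.

```python
ALERT_THRESHOLDS = (80, 100)  # percent of the monthly limit, as stated in this guide


def alerts_crossed(used_before: int, used_after: int, monthly_limit: int) -> list[int]:
    """Return the alert thresholds (in percent) crossed by a jump in usage."""
    crossed = []
    for threshold in ALERT_THRESHOLDS:
        cutoff = monthly_limit * threshold / 100
        # A threshold is crossed when usage was below it and is now at or above it.
        if used_before < cutoff <= used_after:
            crossed.append(threshold)
    return crossed


# Hypothetical plan with a 1,000-page monthly limit.
print(alerts_crossed(700, 820, 1000))   # crosses the 80% mark
print(alerts_crossed(790, 1000, 1000))  # crosses both marks in one job
```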

Tips for Better Results

Always specify the document type when you know it

Auto-detect is a convenience, not a replacement for an explicit type. When you process a batch of 50 invoices, set them all to Invoice upfront — you save 50 quota pages and the extraction is faster and more accurate.

Use Fast for volume, Moderate for precision

A 2-second extraction is fine for a standard invoice. A dense 30-page contract with complex clauses and nested tables is worth the extra 10 seconds that Moderate takes. Mixing engines within a batch is also possible, but only if per-file engine override is enabled for your org; contact support to have it turned on.

Check confidence scores, not just values

An extracted value can look right and still be wrong. A field showing 62% confidence on a critical financial amount deserves a second look. Build a review step into your workflow for any field below 80%.
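A review step like this can be as simple as filtering on the confidence score. The following is a minimal sketch, assuming each extracted field is available as a (value, confidence) pair; `fields_needing_review` and the sample data are hypothetical and do not reflect Crystl's export format.

```python
REVIEW_THRESHOLD = 80  # percent; the cut-off suggested in this guide


def fields_needing_review(
    fields: dict[str, tuple[str, float]],
    threshold: float = REVIEW_THRESHOLD,
) -> list[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return [
        name
        for name, (_value, confidence) in fields.items()
        if confidence < threshold
    ]


# Illustrative data only: a confident vendor name and a shaky total.
extracted = {
    "Vendor": ("Acme Corp Ltd", 94),
    "Total": ("$4,250.00", 62),
    "Due Date": ("2025-02-15", 80),
}
print(fields_needing_review(extracted))
```

Here only Total is flagged; Due Date sits exactly at 80% and passes, so lower the comparison to `<=` if you prefer to review borderline values as well.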

Use Re-extract instead of starting over

If the wrong document type was applied to a file — in a batch or in Auto-Split — use the Re-extract button. It fetches the original file from storage and re-runs extraction with your corrected document profile. You do not pay for the pages again.

Scan quality matters more than file size

A high-resolution scan at 300 DPI will extract better than a blurry phone photo, regardless of file size. If you are getting low confidence scores consistently on a particular source, the scan quality is usually the first thing to investigate.

Common Issues & What to Do

Symptom	Likely cause	What to try
Fields come back empty	Wrong document type selected	Re-extract with the correct type or Auto-detect
Low confidence across all fields	Poor scan quality	Re-scan at higher resolution; try Moderate engine
Auto-Split merges two documents	Similar document types on adjacent pages	Re-extract that segment with the correct type manually
Batch file shows "Failed"	Corrupted file, encrypted PDF, or engine timeout	Check the error message; re-upload a clean copy
Quota exceeded mid-batch	Monthly page limit reached	Wait for the reset date or upgrade your plan
Numbers are slightly off	Ambiguous layout or blurred digits	Switch to Moderate engine; report the issue

Extract, Batch, and Auto-Split are three tools that cover the full range of document workloads — from one-off lookups to nightly processing runs. Pick the mode that fits your workflow, set the right document type, and let the engines do the rest.
