Tutorial · March 18, 2026

Extract, Batch & Auto-Split: The Complete Guide to Document Extraction

Everything you need to know about Crystl's three extraction modes — process one document at a time, run 50 files overnight in a batch, or drop a bundled compliance PDF and let Auto-Split handle the rest.

Crystl Team
March 18, 2026
10 min read

Every document that enters Crystl passes through an extraction engine: a vision-language model that reads the file as a human would and pulls out the exact fields you need. There are three ways to trigger that process, each designed for a different situation. This guide walks through all of them: Extract (one document at a time), Batch (up to 50 files in one go), and Auto-Split (one bundled PDF that contains several documents). By the end you will know which mode to reach for, how the upload process works in each, and what every option actually does.

The Three Extraction Modes

📄 Extract (single document)
  • One file at a time
  • Full control over type
  • Instant results
  • Best for one-offs

📦 Batch (up to 50 documents)
  • 2–50 files at once
  • Mixed document types
  • Background processing
  • Live progress tracking

✂️ Auto-Split (bundled PDF packets)
  • One PDF, many docs
  • Auto-detects boundaries
  • Extracts each segment
  • KYC / compliance packs

Extract — One Document at a Time

The Extract tab is the starting point. You upload a single file, configure a handful of options, click the button, and within seconds your document's data appears as structured fields — ready to copy, export, or act on.

Here is the full process from upload to result:

  • 01 Upload: PDF · PNG · JPG · TIFF. Drag & drop or click.
  • 02 Doc Type: Invoice, Contract, ID Document, Bank Statement, or Auto-detect ✦.
  • 03 Engine: Fast ⚡ (2–5 sec, default) or Moderate 🎯 (5–15 sec, higher accuracy).
  • 04 Results (sample): Invoice No. #INV-2025-001 (98%), Vendor Acme Corp Ltd (94%), Total $4,250.00 (99%), Due Date 2025-02-15 (91%).

Step 1 — Upload your file

Drag your document onto the upload area or click to browse. Crystl accepts PDF, PNG, JPG, TIFF, and BMP files. Multi-page PDFs are fully supported — all pages are read and extracted together as one document.

Step 2 — Choose a document type

The document type tells Crystl which fields to look for. When you pick Invoice, Crystl extracts invoice number, vendor, line items, totals, and due date. When you pick ID Document, it extracts name, ID number, date of birth, expiry, and nationality.

Crystl ships with a library of system document profiles — Invoice, Contract, ID Document, Bank Statement, Receipt, Medical Document, Form, and more. If your organisation has created custom document profiles, those appear at the top of the list.

Leaving the field blank triggers Auto-detect. Crystl runs a classification pass first to identify what kind of document you have, then extracts accordingly. Auto-detect is convenient, but it costs one additional page from your monthly quota per file, and a detected type is slightly less reliable than one you specify upfront.

Rule of thumb: if you know the document type, pick it. Reserve Auto-detect for ad-hoc one-offs where you genuinely are not sure.

Step 3 — Add custom instructions (optional)

The instructions field lets you guide the extraction in plain English. Examples of what you can write here:

  • "Extract all line items as a table with quantity, description, and unit price."
  • "The date format used in this document is DD/MM/YYYY."
  • "Focus on the second signatory's details, not the first."

Instructions are passed directly to the engine along with your document. They do not change which document profile is used — they give the model additional context on top of the document profile's standard field list.

Step 4 — Extract and review results

Click Extract Document. The result panel shows every extracted field alongside a confidence score, a percentage that tells you how certain the model is about that value. Fields at 90 % or above are shown in green, fields from 60 % up to 90 % in amber, and fields below 60 % in red as a signal for manual review.
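
If you post-process exported results yourself, the colour bands can be reproduced in a few lines. This is a minimal sketch, not Crystl's own code: the function name is made up for illustration, and it assumes that a score of exactly 90 counts as green.

```python
def confidence_band(score_percent: float) -> str:
    """Map a confidence percentage to the colour band used in the results panel."""
    if not 0 <= score_percent <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if score_percent >= 90:  # assumption: exactly 90 is treated as green
        return "green"
    if score_percent >= 60:
        return "amber"
    return "red"  # signal for manual review


# Sample scores from the results panel in the walkthrough.
sample_scores = {"Invoice No.": 98, "Vendor": 94, "Total": 99, "Due Date": 91}
bands = {name: confidence_band(score) for name, score in sample_scores.items()}
```

With the sample scores, every field lands in the green band.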

Underneath the fields you will also see the total processing time (typically 2–10 seconds) and how many pages were consumed from your quota.

Export the results as Excel, Word, or Markdown using the buttons at the top of the results panel. If something looks wrong, click Report an Issue — your feedback goes directly to the Crystl support team.


Choosing Your Engine

Crystl offers two AI engines. Your organisation admin sets a default; if the Allow provider override setting is enabled, you can switch per-extraction.

| | Fast ⚡ (default) | Moderate 🎯 |
|---|---|---|
| Speed | 2–5 seconds | 5–15 seconds |
| Best for | Invoices, forms, receipts | Contracts, dense tables |
| Complex tables | Good | Excellent |
| Handwriting | Basic | Strong |
| Use when | Processing volume is high or speed matters | Accuracy on complex or dense documents is the priority |

In practice: use Fast for anything that follows a predictable structure — invoices, receipts, standard forms, ID documents. Switch to Moderate when you are dealing with multi-page contracts, complex data tables, dense financial statements, or documents with handwritten annotations.
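
The rule of thumb can be written down as a small helper, which is handy if you route documents to Crystl from your own scripts. Treat this as a sketch under stated assumptions: `DocumentTraits` and `suggest_engine` are hypothetical names, not Crystl types, and the traits are ones you would have to supply yourself.

```python
from dataclasses import dataclass


@dataclass
class DocumentTraits:
    """Hypothetical description of a document; not a Crystl data type."""
    page_count: int = 1
    is_contract: bool = False
    has_complex_tables: bool = False
    is_dense_financial_statement: bool = False
    has_handwriting: bool = False


def suggest_engine(doc: DocumentTraits) -> str:
    """Return "moderate" for the cases the guide lists, otherwise "fast"."""
    needs_moderate = (
        (doc.is_contract and doc.page_count > 1)  # multi-page contracts
        or doc.has_complex_tables
        or doc.is_dense_financial_statement
        or doc.has_handwriting
    )
    return "moderate" if needs_moderate else "fast"


invoice = DocumentTraits(page_count=1)
contract = DocumentTraits(page_count=30, is_contract=True, has_complex_tables=True)
```

Here `suggest_engine(invoice)` gives "fast" and `suggest_engine(contract)` gives "moderate".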


Batch — Process Up to 50 Files at Once

Batch is Extract at scale. Instead of uploading one file and waiting, you drop up to 50 files at once, configure them as a group (with per-file overrides if needed), submit the job, and watch a live progress feed as Crystl works through the queue in the background.

Uploading a batch

Switch to the Batch tab and drop your files onto the upload area. You can mix PDFs and images in any supported format, up to 50 files per job. The maximum number of pages per document depends on your plan.
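
If you assemble batches from a folder, a quick local check before uploading can save a failed submission. The sketch below is illustrative only: `check_batch` is a hypothetical helper, the 2-file minimum comes from the mode summary at the top of this guide, and treating `.jpeg` and `.tif` as accepted aliases is an assumption.

```python
from pathlib import Path

# Formats listed in the Extract section. The ".jpeg" and ".tif" aliases are an
# assumption, not something the guide states.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
MIN_BATCH_FILES = 2
MAX_BATCH_FILES = 50


def check_batch(filenames: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the batch looks submittable."""
    problems = []
    if len(filenames) < MIN_BATCH_FILES:
        problems.append(f"a batch needs at least {MIN_BATCH_FILES} files")
    if len(filenames) > MAX_BATCH_FILES:
        problems.append(
            f"batch has {len(filenames)} files; the limit is {MAX_BATCH_FILES}"
        )
    for name in filenames:
        if Path(name).suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"unsupported file type: {name}")
    return problems
```

The per-document page limit is plan-specific, so it is left out of the check.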

Setting document types per file

After upload, each file appears in the list with its own document-type selector. You have three options:

  • Set all to one type — use the "Set all to…" dropdown to apply one document profile to every file at once. Best when you are processing a uniform stack (e.g., 40 invoices).
  • Override per file — click the selector on any individual file to give it a different type. Useful in mixed batches.
  • Leave blank — any file without a type triggers auto-detection. Crystl will flag how many files are set to auto-detect and remind you that each one costs an extra page from your quota.

Example progress panel: Batch Job #b-00291 (KYC Onboarding, 5 files), status Processing, 3 of 5 complete.

| File | Document type | Confidence / status |
|---|---|---|
| sarah_johnson.pdf | Detected: Passport | 97% |
| tom_richards.pdf | Detected: Bank Statement | 91% |
| aisha_diallo.pdf | Invoice | 95% |
| mark_chen.pdf | Auto-detect | processing |
| lisa_park.pdf | Auto-detect | queued |

4 auto-detect files → +4 pages used for classification
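
The three options combine in a simple order of precedence, which also explains the auto-detect count in the example job. This sketch is one plausible reading of that behaviour rather than Crystl's implementation; the function names are hypothetical, and `None` stands for a file left on auto-detect.

```python
from typing import Optional


def resolve_types(
    filenames: list[str],
    set_all_to: Optional[str] = None,
    overrides: Optional[dict[str, Optional[str]]] = None,
) -> dict[str, Optional[str]]:
    """Per-file type: an override wins, then "Set all to...", else None (auto-detect)."""
    overrides = overrides or {}
    return {name: overrides.get(name, set_all_to) for name in filenames}


def auto_detect_surcharge(types: dict[str, Optional[str]]) -> int:
    """One extra quota page per file left on auto-detect."""
    return sum(1 for doc_type in types.values() if doc_type is None)


# The five files from the example job: only one has an explicit type.
files = ["sarah_johnson.pdf", "tom_richards.pdf", "aisha_diallo.pdf",
         "mark_chen.pdf", "lisa_park.pdf"]
types = resolve_types(files, overrides={"aisha_diallo.pdf": "Invoice"})
extra_pages = auto_detect_surcharge(types)  # 4 files on auto-detect, so 4 extra pages
```

Setting all five files to a known type upfront would bring the surcharge down to zero.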

Live progress tracking

Once you click Extract N files, the job is submitted and processing starts immediately. The progress panel updates in real time via a WebSocket connection: you see each file tick from Queued to Processing to Success (or Failed, with an error message). The overall progress bar shows completed versus total files and is colour-coded green when every file succeeds and amber if any file fails.
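
The status flow is easy to model if you want to mirror the progress panel in your own tooling. The sketch below is an assumption-heavy illustration: the status names follow the panel, but the transition table, the choice to count failed files as completed, and the helper names are all hypothetical.

```python
from collections import Counter

# Status names follow the progress panel; the transition table is an assumption.
ALLOWED_TRANSITIONS = {
    "queued": {"processing"},
    "processing": {"success", "failed"},
    "success": set(),
    "failed": set(),
}


def advance(current: str, new: str) -> str:
    """Validate a status change for one file and return the new status."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current} to {new}")
    return new


def progress_summary(statuses: dict[str, str]) -> dict:
    """Summarise a job: finished files, total files, and the progress bar colour."""
    counts = Counter(statuses.values())
    finished = counts["success"] + counts["failed"]  # assumption: failures count as done
    colour = "amber" if counts["failed"] else "green"
    return {"completed": finished, "total": len(statuses), "colour": colour}
```

A job with one failed file reports amber even while other files are still queued.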

Per-file actions

Each completed file offers three actions:

  • Export (Excel / Word / Markdown) — download that file's extracted data immediately. The files are generated as soon as extraction completes and are available via a secure link.
  • Re-extract — if the wrong document type was used (or auto-detect misclassified), click Re-extract, choose the correct type, and Crystl fetches the original file from storage and re-runs extraction. The result in the batch updates in real time.
  • Report an Issue — available for both successful and failed files. Describe what went wrong and the team follows up.

Auto-Split — One PDF, Multiple Documents

Auto-Split solves a problem that comes up constantly in KYC and compliance workflows: a client scans several different documents into a single PDF file. One PDF, several document types. Crystl detects where each one begins and ends, extracts each segment with the right document profile, and returns them all in a single structured result.

When to use Auto-Split

  • KYC bundles: passport + bank statement + proof of address in one scan
  • Compliance packs: multiple contracts or forms submitted as a single file
  • Scanned mail packs: a stack of incoming documents scanned in sequence

If you know the PDF contains only one document type, use Extract or Batch instead — Auto-Split is overkill and costs more pages (see below).

How it works — two passes

Auto-Split runs two AI passes on your document. Understanding this is important because it directly affects your page quota:

Example: 📎 client_bundle.pdf (4 pages, mixed document types)

Pass 1: classify each page (Fast engine), boundary detection

| Page | Classification | Boundary decision |
|---|---|---|
| 1 | Passport | Start of new document |
| 2 | Passport | Continuation of page 1 |
| 3 | Bank Statement | New document boundary |
| 4 | Utility Bill | New document boundary |

Pass 2: extract each segment (your chosen engine)

| Segment | Document type | Pages | Confidence |
|---|---|---|---|
| 1 | Passport | 1–2 | 96% |
| 2 | Bank Statement | 3 | 92% |
| 3 | Utility Bill | 4 | 88% |

Pass 1 — Classification. Crystl converts the PDF to page images and sends each page to the Fast engine in sequence. Each page is classified in context of the previous one so the model can detect "this is a continuation" versus "this is a new document starting." The result is a list of boundary decisions and a grouping of pages into segments.

Pass 2 — Extraction. For each identified segment, Crystl extracts structured fields using the auto-detected document profile and the engine you selected (Fast or Moderate). Pages within a segment run concurrently to keep total time down.
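
The step between the two passes, turning per-page boundary decisions into segments, can be sketched as follows. This is an illustration of the grouping idea only: `PageDecision` and `group_into_segments` are hypothetical names, and the real classifier output may look different.

```python
from dataclasses import dataclass


@dataclass
class PageDecision:
    """Outcome of pass 1 for one page (hypothetical structure)."""
    page: int
    doc_type: str
    starts_new_document: bool


def group_into_segments(decisions: list[PageDecision]) -> list[dict]:
    """Group consecutive pages into segments, opening a new one at each boundary."""
    segments: list[dict] = []
    for decision in decisions:
        if decision.starts_new_document or not segments:
            segments.append({"doc_type": decision.doc_type, "pages": [decision.page]})
        else:
            segments[-1]["pages"].append(decision.page)
    return segments


# The four-page client bundle from the example in this section.
decisions = [
    PageDecision(1, "Passport", True),
    PageDecision(2, "Passport", False),
    PageDecision(3, "Bank Statement", True),
    PageDecision(4, "Utility Bill", True),
]
segments = group_into_segments(decisions)
```

For the example bundle this yields three segments: pages 1 and 2 as Passport, page 3 as Bank Statement, and page 4 as Utility Bill.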

Quota note: Auto-Split uses 2× pages

A 10-page PDF costs 20 pages of quota: 10 for the classification pass and 10 for the extraction pass. This is expected and intentional. Every page is read twice, once to decide where each document begins and ends and once to extract its fields. Keep it in mind when processing large bundles.

Results and per-segment actions

The result panel lists each detected segment as a collapsible card showing the document type, page range, confidence score, and all extracted fields. You can:

  • Re-extract a segment with a different type — if the classifier got a segment wrong, click Re-extract, choose the correct document profile, and only that segment is re-run. The rest stay untouched.
  • Export all segments at once — Excel (multiple worksheets, one per segment type), Word (sections per segment), or Markdown (headers per segment).
  • Report an Issue — target a specific segment or the whole document.
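
To give a feel for the "headers per segment" layout of the Markdown export, here is a rough sketch that renders segments in that spirit. It is not Crystl's export format: the heading style, the bullet layout, the `segments_to_markdown` name, and the placeholder field values are all assumptions for illustration.

```python
def segments_to_markdown(segments: list[dict]) -> str:
    """Render each segment as a Markdown section with one bullet per field."""
    lines = []
    for seg in segments:
        first, last = seg["pages"][0], seg["pages"][-1]
        page_label = f"page {first}" if first == last else f"pages {first}-{last}"
        lines.append(f"## {seg['doc_type']} ({page_label})")
        for name, value in seg["fields"].items():
            lines.append(f"- {name}: {value}")
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"


# Placeholder values, not real extraction output.
example = [
    {"doc_type": "Passport", "pages": [1, 2], "fields": {"Name": "PLACEHOLDER"}},
    {"doc_type": "Utility Bill", "pages": [4], "fields": {"Address": "PLACEHOLDER"}},
]
markdown = segments_to_markdown(example)
```

Each segment becomes its own section, so a reviewer can jump straight to the document they care about.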

Understanding Pages & Quota

Every plan includes a monthly page quota. One page of a document = one page consumed. Here is how each mode counts:

| Mode | Pages consumed | Auto-detect surcharge |
|---|---|---|
| Extract | 1 page per document page | +1 page per file |
| Batch | 1 page per document page, per file | +1 page per auto-detect file |
| Auto-Split | 2 pages per document page (2 passes) | Included (all pages are classified) |

Org admins receive email alerts at 80 % and 100 % of the monthly limit. You can check current usage at any time from the organisation settings page.
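
For planning purposes, the counting rules and alert thresholds above fit in a short calculator. This is a back-of-the-envelope sketch with hypothetical function and mode names; your actual usage is whatever the organisation settings page reports.

```python
from typing import Optional


def pages_consumed(mode: str, page_counts: list[int], auto_detect_files: int = 0) -> int:
    """Quota pages for one job, following the counting rules for each mode.

    page_counts holds the page count of each file in the job.
    auto_detect_files is how many of those files were left on auto-detect.
    """
    total_pages = sum(page_counts)
    if mode in ("extract", "batch"):
        return total_pages + auto_detect_files
    if mode == "auto_split":
        return 2 * total_pages  # classification pass + extraction pass
    raise ValueError(f"unknown mode: {mode}")


def alert_level(used: int, monthly_limit: int) -> Optional[str]:
    """Return which email alert threshold has been reached, if any."""
    ratio = used / monthly_limit
    if ratio >= 1.0:
        return "100%"
    if ratio >= 0.8:
        return "80%"
    return None


auto_split_cost = pages_consumed("auto_split", [10])            # 20
typed_batch_cost = pages_consumed("batch", [1] * 50)            # 50
auto_batch_cost = pages_consumed("batch", [1] * 50, 50)         # 100
```

The last two lines show why explicit types matter: fifty single-page invoices cost 50 pages with a type set and 100 pages on auto-detect.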


Tips for Better Results

Always specify the document type when you know it

Auto-detect is a convenience, not a replacement for an explicit type. When you process a batch of 50 invoices, set them all to Invoice upfront — you save 50 quota pages and the extraction is faster and more accurate.

Use Fast for volume, Moderate for precision

A 2-second extraction is fine for a standard invoice. A dense 30-page contract with complex clauses and nested tables is worth the extra 10 seconds that Moderate takes. Mixing engines within one batch is possible, but per-file engine override has to be enabled for your org first; contact support if you need it.

Check confidence scores, not just values

An extracted value can look right and still be wrong. A field showing 62 % confidence on a critical financial amount deserves a second look. Build a review step into your workflow for any field below 80 %.
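
A review step can be as simple as filtering exported fields by confidence before they reach downstream systems. The sketch below assumes you already have field values and scores in a dictionary; the data shape and the `fields_needing_review` helper are hypothetical, and the sample values are illustrative.

```python
REVIEW_THRESHOLD = 80  # percent, following the recommendation above


def fields_needing_review(
    fields: dict[str, tuple[str, float]],
    threshold: float = REVIEW_THRESHOLD,
) -> list[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return [name for name, (_value, score) in fields.items() if score < threshold]


# Illustrative data: each field maps to (extracted value, confidence in percent).
extracted = {
    "Vendor": ("Acme Corp Ltd", 94),
    "Total": ("$4,250.00", 62),
    "Due Date": ("2025-02-15", 91),
}
to_review = fields_needing_review(extracted)
```

Only the Total field is flagged here, which is exactly the kind of value worth a second look.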

Use Re-extract instead of starting over

If the wrong document type was applied to a file — in a batch or in Auto-Split — use the Re-extract button. It fetches the original file from storage and re-runs extraction with your corrected document profile. You do not pay for the pages again.

Scan quality matters more than file size

A high-resolution scan at 300 DPI will extract better than a blurry phone photo, regardless of file size. If you are getting low confidence scores consistently on a particular source, the scan quality is usually the first thing to investigate.


Common Issues & What to Do

| Symptom | Likely cause | What to try |
|---|---|---|
| Fields come back empty | Wrong document type selected | Re-extract with the correct type or Auto-detect |
| Low confidence across all fields | Poor scan quality | Re-scan at higher resolution; try Moderate engine |
| Auto-Split merges two documents | Similar document types on adjacent pages | Re-extract that segment with the correct type manually |
| Batch file shows "Failed" | Corrupted file, encrypted PDF, or engine timeout | Check the error message; re-upload a clean copy |
| Quota exceeded mid-batch | Monthly page limit reached | Wait for the reset date or upgrade your plan |
| Numbers are slightly off | Ambiguous layout or blurred digits | Switch to Moderate engine; report the issue |

Extract, Batch, and Auto-Split are three tools that cover the full range of document workloads, from one-off lookups to nightly processing runs. Pick the mode that fits your workflow, set the right document type, and let the engines do the rest.

Ready to try it?

Start extracting data from documents today

Process your first 100 pages free. No credit card required.
