Train-your-own Splitting Model#
Our API Hub offers a pre-built generic splitting workflow which delivers excellent results for many situations. However, business use cases vary a lot, and so do the types of documents that need to be split apart. With our train-your-own splitting capabilities, you can solve any splitting use case by fine-tuning our splitting base model to your specific needs and get your customized splitting workflow up and running in no time.
Create a workflow#
To set up the model for training, start by creating your custom workflow by clicking on Train Your Own Model +.
You will be guided through the following steps:
Metadata#
Define the name of your workflow. Optionally, you may add a description to help you remember the purpose of this workflow and its intended use. You may also set a thumbnail image that helps you identify the workflow in the list of workflows.
Document Specification#
In this step you can tell us more about your documents, so that we can better steer the model training process. Some of the options are already pre-selected, but you may change them according to your needs.
When documents are photographed or small documents are scanned to A4, the files often contain unwanted background that surrounds the document. Cropping is the process of removing these unwanted outer areas, which can drastically improve your workflow’s quality. For digitally born documents, cropping is usually not necessary.
If you choose Latin, your model can understand English, German, French, and Spanish. If your documents contain Japanese characters, choose Japanese instead; this option also covers documents that mix Japanese and English, as it supports both languages.
If you choose "only printed text", handwritten text will be ignored and not used in your model. Similarly, if you choose "only handwritten text", all printed text will be ignored.
Finally, create your workflow by clicking on Create Workflow. You may of course step back and change settings while doing so, but once the workflow is created, its settings can no longer be changed.
Workflow Dashboard#
After creating your workflow, you will be redirected to the dashboard, where you see multiple sections to interact with your workflow.
Most importantly, in the upper section, to the right, you can upload your documents and start training your model.
Below you may drop documents or browse your computer to upload them. The documents will be processed with the current state of your model.
When creating the workflow, your model will default to a generic base model, which might already deliver good results for your documents.
However, to improve the model's performance, you may follow the Upload Training Data link above, where you can upload your own documents and annotate them to train the model on your specific data.
In the lower section, you can see usage statistics of your workflow over time. This includes usage through the API.
Training Data#
The training data view will guide you in uploading your documents and annotating them, thereby building a comprehensive dataset for training your model.
In this view, you can create templates through the + Create Template button, which will be explained next.
Templates and Documents for Workflow Training#
Templates can be used to organize your documents. For example, if you receive several payslips in one document and want to split them apart, you could create a template for that. Another template could be used for documents that contain both an invoice and a delivery note that you want to split.
You can also upload all the different cases that occur without organizing them by template. However, using templates helps you manage larger amounts of data and allows us to better train and evaluate your workflow per template.
There are two ways to upload your documents:
Upload to a Template
- The documents can be assigned to a template, see the example above.
Upload Without Template
- Simply upload the documents without assigning a template. Our AI will still be trained, even without one.
Upload modes for splitting workflows#
Now, when uploading documents, you have two further options to choose from:
Documents are already split
- The documents should be uploaded as individual files, each containing one logical unit, so that we can combine them into random sequences and train the AI to split accurately. It is essential that these are truly individual documents.
Example: You upload 12 separate files, each containing the payslip for one month.
Documents are merged together
- The document contains several individual logical units. Here, you can annotate and indicate to the system where splitting should occur. The AI also learns from this.
Example: A 12-page PDF containing payslips from January to December should be split into 12 individual files, with one file per month.
Building a good dataset#
The more documents you upload and annotate, the better the model will be trained. The tool will guide you on how many documents you should upload to achieve a good model.
Training a model#
Heading over to the Training tab, you can start training your model. Before you start, you can see how many documents you have uploaded and how many are still needed to achieve a good model.
After starting the training, it will take some time for the training process to complete. You can see the progress in the Training History section.
You will be notified by email once the training is complete.
If you have previously trained a model in this workflow, you can compare the performance of the new model with the previous one. This way, you can iteratively improve your model by adding more data and retraining.
Processing your documents#
After training your model, you can start processing your documents. You can upload them directly in the Dashboard tab as described above.
OpenAPI Documentation#
For API usage, we recommend having a look at the Documentation tab, where you can find OpenAPI-based documentation customized for your workflow.
Of particular interest for splitting is the following endpoint:
POST /processing/{workflow_key}
It is used to process a document with the trained model. The workflow key is the UUID of your workflow, which you can find in the URL of the dashboard.
A typical result looks like this:
{
"processing_id": "61726269-7472-4172-b920-62797465732e",
"workflow_id": "56af509f-349c-45d5-9214-3c0ff4ec75e7",
"workflow_name": "My Custom Splitting Workflow",
"available_results": [
"document-splitting",
"ocr",
"page-images",
"thumbnail",
"pdf",
"hocr",
"sub-pdfs"
],
"document_splitting": {
"schema_version": 1,
"sub_documents": [
{
"name": "sub_document_1",
"pages": [
1,
2
]
},
{
"name": "sub_document_2",
"pages": [
3
]
}
],
"split_point_confidences": [
0.8,
0.9
]
},
"ocr": {
"pages": [
{
"width": 1000,
"height": 1414,
"bboxes": [
{
"id": 1,
"x1": 0,
"y1": 22,
"x2": 100,
"y2": 46,
"text": "Natif",
"text_entropy": 0.02
}
],
"fulltext": "..."
},
{
"width": 1000,
"height": 1414,
"bboxes": [
{
"id": 1,
"x1": 10,
"y1": 22,
"x2": 140,
"y2": 46,
"text": "Natif",
"text_entropy": 0.02
}
],
"fulltext": "..."
},
{
"width": 1000,
"height": 1414,
"bboxes": [
{
"id": 1,
"x1": 20,
"y1": 22,
"x2": 190,
"y2": 46,
"text": "Natif",
"text_entropy": 0.02
}
],
"fulltext": "..."
}
]
},
"page_images": [
"/processing/results/d99263bb-5c83-4a78-8bc9-2a15e2411389/page-images/1",
"/processing/results/d99263bb-5c83-4a78-8bc9-2a15e2411389/page-images/2",
"/processing/results/d99263bb-5c83-4a78-8bc9-2a15e2411389/page-images/3"
],
"thumbnail": "/processing/results/49b4e3fb-6986-4ed8-8a1d-edfc8d1a5c8c/thumbnail",
"pdf": "/processing/results/06947a1e-d157-4c91-8103-4ca587476a51/pdf",
"hocr": "/processing/results/a4a666da-9314-4239-b5fd-ffb143fa53dd/hocr",
"sub_pdfs": [
"/processing/results/c37a4cb2-15fa-4872-9df6-22b481e59978/sub-pdfs/0",
"/processing/results/c37a4cb2-15fa-4872-9df6-22b481e59978/sub-pdfs/1"
]
}
The document_splitting key contains the information about the split points and the confidence of the model in the predicted split points. The confidences can be particularly useful for a human-in-the-loop process; see below for more information.
The ocr key contains the text extracted from the document.
The sub_pdfs key contains the links to the sub-documents that were split from the original document.
Code Snippets#
Along with the specific OpenAPI documentation, you can find code snippets for different programming languages to help you get started with the API.
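For illustration, here is a minimal Python sketch of the processing call above, using the requests library. The base URL, the authentication header, and the multipart field name are assumptions; take the exact values from the Documentation tab of your workflow:

import requests

BASE_URL = "https://api.example.com"  # assumed; use your actual API host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme
WORKFLOW_KEY = "56af509f-349c-45d5-9214-3c0ff4ec75e7"  # UUID from the dashboard URL

# Send a document to the trained splitting workflow.
with open("merged_payslips.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/processing/{WORKFLOW_KEY}",
        headers=HEADERS,
        files={"file": f},  # assumed multipart field name
    )
response.raise_for_status()
result = response.json()

# Inspect the predicted sub-documents and the split point confidences.
for sub in result["document_splitting"]["sub_documents"]:
    print(sub["name"], "pages:", sub["pages"])
print("confidences:", result["document_splitting"]["split_point_confidences"])

# Download each sub-document as its own PDF via the returned links.
for i, path in enumerate(result["sub_pdfs"]):
    pdf = requests.get(f"{BASE_URL}{path}", headers=HEADERS)
    pdf.raise_for_status()
    with open(f"sub_document_{i + 1}.pdf", "wb") as out:
        out.write(pdf.content)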
Active Learning#
Active learning is a feature that helps you to improve your model iteratively. If you have a model that is already trained and used for processing documents, it can be useful to upload the results of the processing back to the training data, in particular for the documents that were not split correctly.
For this, we offer a dedicated API endpoint that lets you upload processing results back into the training data, so that you can make use of our annotation tool and improve your model.
POST /processing/feedback/{processing_id}
The processing_id is the UUID of the processing result that you want to give feedback on.
Following the example above, you can provide feedback in the request body in the following JSON format:
{
"description": "The splitting went wrong for this document",
"tag": "Invoices",
"expected_sub_documents": [
{
"name": "sub_document_1",
"pages": [
1
]
},
{
"name": "sub_document_2",
"pages": [
2,
3
]
}
]
}
This will immediately add the document to the training data with the given feedback, without the need for further annotation. By setting "tag": "Invoices", the document will be added to the template Invoices.
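Putting this together, a feedback call might look like the following Python sketch using the requests library; the base URL and the authentication header are assumptions, so substitute the values for your own account:

import requests

BASE_URL = "https://api.example.com"  # assumed; use your actual API host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme

# UUID returned by the processing endpoint for the document in question
processing_id = "61726269-7472-4172-b920-62797465732e"

# Feedback body from the example above: describe the problem, assign the
# template, and state the expected split.
feedback = {
    "description": "The splitting went wrong for this document",
    "tag": "Invoices",
    "expected_sub_documents": [
        {"name": "sub_document_1", "pages": [1]},
        {"name": "sub_document_2", "pages": [2, 3]},
    ],
}

response = requests.post(
    f"{BASE_URL}/processing/feedback/{processing_id}",
    headers=HEADERS,
    json=feedback,
)
response.raise_for_status()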
In case you do not have the means to annotate the documents yourself, you can also provide the feedback without specifying the expected_sub_documents, for example:
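{
  "description": "The splitting went wrong for this document",
  "tag": "Invoices"
}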
The document will then be added to the training data and queued for manual annotation.
Finally, you can head back to the Training Data view to further annotate, or go directly to the Training tab and retrain your model with the new data.
Human-in-the-loop for splitting workflows#
While we provide state-of-the-art splitting results, we cannot guarantee that our AI is always perfectly accurate. Therefore, we encourage you to implement additional cross-checks of splitting outputs against your business data as part of a Human-in-the-loop process.
You can do so either by integrating our Stand-Alone Interface, or by using your own interfaces and integrating our verification API endpoint, both of which are described in Verification.
For your splitting workflow, the specific endpoint for verification is
POST /processing/results/{processing_id}/document-splitting/verification
with processing_id being the job ID returned by the processing endpoint. Set the request body as described in Verification.
This will mark the document as verified. Now, in both cases - Stand-Alone Interface or verification API - the processing of results can continue by checking the verification status of the document with the same endpoint using a GET request:
GET /processing/results/{processing_id}/document-splitting/verification
This will return a positive response once the document is verified, and a response indicating pending verification as long as it is not.
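For illustration, a minimal Python sketch that polls this endpoint until the document has been verified; the base URL, the authentication header, and the "verified" field in the response are assumptions, so adapt them to the actual schema described in Verification:

import time

import requests

BASE_URL = "https://api.example.com"  # assumed; use your actual API host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme
processing_id = "61726269-7472-4172-b920-62797465732e"  # job id from processing

VERIFICATION_URL = (
    f"{BASE_URL}/processing/results/{processing_id}"
    "/document-splitting/verification"
)

# Poll the verification status until the document has been verified.
while True:
    response = requests.get(VERIFICATION_URL, headers=HEADERS)
    response.raise_for_status()
    # "verified" is an assumed field name for illustration; check the
    # Verification documentation for the actual response schema.
    if response.json().get("verified"):
        break
    time.sleep(10)  # wait before checking again

print("Document verified, downstream processing can continue.")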
Considering split point confidences to decide on manual verification#
As mentioned above, the document_splitting result contains the information about the split points and the confidence of the model in the predicted split points.
These can be useful to decide automatically whether a splitting result should be verified by a human or not.
Consider the following document_splitting result:
...
"document_splitting": {
"schema_version": 1,
"sub_documents": [
{
"name": "sub_document_1",
"pages": [
1,
2
]
},
{
"name": "sub_document_2",
"pages": [
3
]
}
],
"split_point_confidences": [
0.6,
0.9
]
},
...
A value approaching 1.0 indicates high confidence, while a value approaching 0.0 indicates low confidence.
This means the model is 90% confident that the split point between sub_document_1 and sub_document_2 is correct, while it is only 60% confident that not having a split point between pages 1 and 2 is correct.
Based on our evaluations we recommend a confidence threshold of 0.7 to balance manual effort in correction against errors remaining after correction.
In our example, this means we would recommend manually verifying whether pages 1 and 2 should indeed not be split.
If you find the results of your model are not accurate, you can use the feedback endpoint as described in the Active Learning section to add the document to the training data of your workflow for further improvement.
Note that in a stream, every split/merge prediction point has its own confidence score. Manual correction effort can focus on low confidence split/merge points instead of going through the whole stream.
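To make this concrete, here is a small Python helper that routes a splitting result to manual verification whenever any split/merge point falls below the recommended threshold; the function and threshold constant are illustrative, not part of the API:

# Recommended default threshold, see above.
CONFIDENCE_THRESHOLD = 0.7

def needs_manual_verification(document_splitting: dict) -> bool:
    """Return True if any split/merge point confidence is below the threshold."""
    confidences = document_splitting.get("split_point_confidences", [])
    return any(c < CONFIDENCE_THRESHOLD for c in confidences)

# With the result above, confidences [0.6, 0.9]: 0.6 < 0.7, so this
# document would be routed to a human reviewer.
example_result = {
    "sub_documents": [
        {"name": "sub_document_1", "pages": [1, 2]},
        {"name": "sub_document_2", "pages": [3]},
    ],
    "split_point_confidences": [0.6, 0.9],
}
assert needs_manual_verification(example_result)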
Credit cost#
A Freemium account allows for up to 100 pages per month; the cost is 25 credits per split output document.