Train-your-own Splitting Model#
Our API Hub offers a pre-built generic splitting workflow which delivers excellent results for many situations. However, business use cases vary a lot, and so do the types of documents that need to be split apart. With our train-your-own splitting capabilities, you can solve any splitting use case by fine-tuning our splitting base model to your specific needs and get your customized splitting workflow up and running in no time.
Create a workflow#
To set up the model for training, start by creating your custom workflow by clicking on Train Your Own Model +.
You will be guided through the following steps:
Metadata#
Define the name of your workflow. Optionally, you may add a description to help you remember the purpose of this workflow and its intended use. You may also set a thumbnail image that helps you identify the workflow in the list of workflows.
Document Specification#
In this step you can tell us more about your documents, so that we can better steer the model training process. Some of the options are already pre-selected, but you may change them according to your needs.
When documents are photographed or small documents are scanned to A4, the files often contain unwanted background that surrounds the document. Cropping is the process of removing these unwanted outer areas, which can drastically improve your workflow’s quality. For digitally born documents, cropping is usually not necessary.
If you choose Latin, your model can understand English, German, French, and Spanish. If your documents contain Japanese characters, choose Japanese instead; this option also covers documents that mix Japanese and English, as it supports both languages.
If you choose "only printed text", handwritten text will be ignored and not used in your model. Similarly, if you choose "only handwritten text", all printed text will be ignored.
Finally, create your workflow by clicking on Create Workflow. You may of course step back and change settings while doing so, but once the workflow is created, its settings can no longer be changed.
Workflow Dashboard#
After creating your workflow, you will be redirected to the dashboard, where you see multiple sections to interact with your workflow.
Most importantly, in the upper section, to the right, you can upload your documents and start training your model.
Below you may drop documents or browse your computer to upload them. The documents will be processed with the current state of your model.
When creating the workflow, your model will default to a generic base model, which might already deliver good results for your documents.
However, to improve the model's performance, you may follow the Upload Training Data link above, where you can upload your own documents and annotate them to train the model on your specific data.
In the lower section, you can see usage statistics of your workflow over time. This includes usage through the API.
Training Data#
The training data view will guide you in uploading your documents and annotating them, thereby building a comprehensive dataset for training your model.
In this view, you can create templates through the + Create Template button, which will be explained next.
Templates and Documents for Workflow Training#
Templates can be used to organize your documents. For example, if you receive several payslips in one document and want to split them apart, you could create a template for that. Another template could be used for documents that contain both an invoice and a delivery note that you want to split.
You can also upload all the different cases that occur without organizing them by template. However, using templates helps you manage larger amounts of data and allows us to better train and evaluate your workflow per template.
There are two ways to upload your documents:
Upload to a Template
- The documents can be assigned to a template, see the example above.
Upload Without Template
- Simply upload the documents without assigning a template. Our AI will still be trained, even without one.
Upload modes for splitting workflows#
Now, when uploading documents, you have two further options to choose from:
Documents are already split
- The documents should be uploaded as individual files, each containing one logical unit, so that we can combine them into random sequences and train the AI to split accurately. It is essential that these are truly individual documents.
Example: You upload 12 separate files, each containing the payslip for one month.
Documents are merged together
- The document contains several individual logical units. Here, you can annotate and indicate to the system where splitting should occur. The AI also learns from this.
Example: A 12-page PDF containing payslips from January to December should be split into 12 individual files, with one file per month.
Building a good dataset#
The more documents you upload and annotate, the better the model will be trained. The tool will guide you on how many documents you should upload to achieve a good model.
Training a model#
Heading over to the Training tab, you can start training your model. Before you start, you can see how many documents you have uploaded and how many are still needed to achieve a good model.
After starting the training, it will take some time for the training process to complete. You can see the progress in the Training History section.
You will be notified by email once the training is complete.
If you have previously trained a model in this workflow, you can compare the performance of the new model with the previous one. This way, you can iteratively improve your model by adding more data and retraining.
Processing your documents#
After training your model, you can start processing your documents. You can upload them directly in the Dashboard tab as described above.
OpenAPI Documentation#
For API usage, we recommend having a look at the Documentation tab, where you can find OpenAPI-based documentation customized for your workflow.
Of particular interest for splitting is the following endpoint:
POST /processing/{workflow_key}
It is used to process a document with the trained model. The workflow key is the UUID of your workflow, which you can find in the URL of the dashboard.
A typical result looks like this:
{
"processing_id": "61726269-7472-4172-b920-62797465732e",
"workflow_id": "56af509f-349c-45d5-9214-3c0ff4ec75e7",
"workflow_name": "My Custom Splitting Workflow",
"available_results": [
"document-splitting",
"ocr",
"page-images",
"thumbnail",
"pdf",
"hocr",
"sub-pdfs"
],
"document_splitting": {
"schema_version": 1,
"sub_documents": [
{
"name": "sub_document_1",
"pages": [
1,
2
]
},
{
"name": "sub_document_2",
"pages": [
3
]
}
],
"split_point_confidences": [
0.8,
0.9
]
},
"ocr": {
"pages": [
{
"width": 1000,
"height": 1414,
"bboxes": [
{
"id": 1,
"x1": 0,
"y1": 22,
"x2": 100,
"y2": 46,
"text": "Natif",
"text_entropy": 0.02
}
],
"fulltext": "..."
},
{
"width": 1000,
"height": 1414,
"bboxes": [
{
"id": 1,
"x1": 10,
"y1": 22,
"x2": 140,
"y2": 46,
"text": "Natif",
"text_entropy": 0.02
}
],
"fulltext": "..."
},
{
"width": 1000,
"height": 1414,
"bboxes": [
{
"id": 1,
"x1": 20,
"y1": 22,
"x2": 190,
"y2": 46,
"text": "Natif",
"text_entropy": 0.02
}
],
"fulltext": "..."
}
]
},
"page_images": [
"/processing/results/d99263bb-5c83-4a78-8bc9-2a15e2411389/page-images/1",
"/processing/results/d99263bb-5c83-4a78-8bc9-2a15e2411389/page-images/2",
"/processing/results/d99263bb-5c83-4a78-8bc9-2a15e2411389/page-images/3"
],
"thumbnail": "/processing/results/49b4e3fb-6986-4ed8-8a1d-edfc8d1a5c8c/thumbnail",
"pdf": "/processing/results/06947a1e-d157-4c91-8103-4ca587476a51/pdf",
"hocr": "/processing/results/a4a666da-9314-4239-b5fd-ffb143fa53dd/hocr",
"sub_pdfs": [
"/processing/results/c37a4cb2-15fa-4872-9df6-22b481e59978/sub-pdfs/0",
"/processing/results/c37a4cb2-15fa-4872-9df6-22b481e59978/sub-pdfs/1"
]
}
The document_splitting key contains the information about the split points and the confidence of the model in the predicted split points. The confidences can be particularly useful for a human-in-the-loop process; see below for more information.
The ocr key contains the text extracted from the document.
The sub_pdfs key contains the links to the sub-documents that were split from the original document.
Code Snippets#
Along with the specific OpenAPI documentation, you can find code snippets for different programming languages to help you get started with the API.
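For illustration, here is a minimal Python sketch of the processing call above, using the requests library. The base URL, the authentication header, and the multipart field name are assumptions; take the exact values from the Documentation tab of your workflow:

import requests

BASE_URL = "https://api.example.com"  # assumed; use your actual API host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme
WORKFLOW_KEY = "56af509f-349c-45d5-9214-3c0ff4ec75e7"  # UUID from the dashboard URL

# Send a document to the trained splitting workflow.
with open("merged_payslips.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/processing/{WORKFLOW_KEY}",
        headers=HEADERS,
        files={"file": f},  # assumed multipart field name
    )
response.raise_for_status()
result = response.json()

# Inspect the predicted sub-documents and the split point confidences.
for sub in result["document_splitting"]["sub_documents"]:
    print(sub["name"], "pages:", sub["pages"])
print("confidences:", result["document_splitting"]["split_point_confidences"])

# Download each sub-document as its own PDF via the returned links.
for i, path in enumerate(result["sub_pdfs"]):
    pdf = requests.get(f"{BASE_URL}{path}", headers=HEADERS)
    pdf.raise_for_status()
    with open(f"sub_document_{i + 1}.pdf", "wb") as out:
        out.write(pdf.content)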
Active Learning#
Active learning is a feature that helps you to improve your model iteratively. If you have a model that is already trained and used for processing documents, it can be useful to upload the results of the processing back to the training data, in particular for the documents that were not split correctly.
For this, we offer a dedicated API endpoint that lets you upload processing results back into the training data, so that you can make use of our annotation tool and improve your model.
POST /processing/feedback/{processing_id}
The processing_id is the UUID of the processing result that you want to give feedback on.
Following the example above, you can provide feedback in the request body in the following JSON format:
{
"description": "The splitting went wrong for this document",
"tag": "Invoices",
"expected_sub_documents": [
{
"name": "sub_document_1",
"pages": [
1
]
},
{
"name": "sub_document_2",
"pages": [
2,
3
]
}
]
}
This will immediately add the document to the training data with the given feedback, without the need for further annotation. By setting "tag": "Invoices", the document will be added to the template Invoices.
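Putting this together, a feedback call might look like the following Python sketch using the requests library; the base URL and the authentication header are assumptions, so substitute the values for your own account:

import requests

BASE_URL = "https://api.example.com"  # assumed; use your actual API host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme

# UUID returned by the processing endpoint for the document in question
processing_id = "61726269-7472-4172-b920-62797465732e"

# Feedback body from the example above: describe the problem, assign the
# template, and state the expected split.
feedback = {
    "description": "The splitting went wrong for this document",
    "tag": "Invoices",
    "expected_sub_documents": [
        {"name": "sub_document_1", "pages": [1]},
        {"name": "sub_document_2", "pages": [2, 3]},
    ],
}

response = requests.post(
    f"{BASE_URL}/processing/feedback/{processing_id}",
    headers=HEADERS,
    json=feedback,
)
response.raise_for_status()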
In case you do not have the means to annotate the documents yourself, you can also provide the feedback without specifying the expected_sub_documents, for example:
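{
  "description": "The splitting went wrong for this document",
  "tag": "Invoices"
}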
The document will then be added to the training data and queued for manual annotation.
Finally, you can head back to the Training Data view to further annotate, or go directly to the Training tab and retrain your model with the new data.
Human-in-the-loop for splitting workflows#
While we provide state-of-the-art splitting results, we cannot guarantee that our AI is always perfectly accurate. Therefore, we encourage you to implement additional cross-checks of splitting outputs against your business data as part of a Human-in-the-loop process.
You can do so either by integrating our Stand-Alone Interface, or by using your own interfaces and integrating our verification API endpoint, both of which are described in Verification.
For your splitting workflow, the specific endpoint for verification is
POST /processing/results/{processing_id}/document-splitting/verification
with processing_id being the job ID returned by the processing endpoint. Set the request body as described in Verification.
This will mark the document as verified. Now, in both cases - Stand-Alone Interface or verification API - the processing of results can continue by checking the verification status of the document with the same endpoint using a GET request:
GET /processing/results/{processing_id}/document-splitting/verification
This will return a positive response once the document is verified, and a response indicating pending verification as long as it is not.
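For illustration, a minimal Python sketch that polls this endpoint until the document has been verified; the base URL, the authentication header, and the "verified" field in the response are assumptions, so adapt them to the actual schema described in Verification:

import time

import requests

BASE_URL = "https://api.example.com"  # assumed; use your actual API host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme
processing_id = "61726269-7472-4172-b920-62797465732e"  # job id from processing

VERIFICATION_URL = (
    f"{BASE_URL}/processing/results/{processing_id}"
    "/document-splitting/verification"
)

# Poll the verification status until the document has been verified.
while True:
    response = requests.get(VERIFICATION_URL, headers=HEADERS)
    response.raise_for_status()
    # "verified" is an assumed field name for illustration; check the
    # Verification documentation for the actual response schema.
    if response.json().get("verified"):
        break
    time.sleep(10)  # wait before checking again

print("Document verified, downstream processing can continue.")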
Considering split point confidences to decide on manual verification#
As mentioned above, the document_splitting result contains the information about the split points and the confidence of the model in the predicted split points.
These can be useful to decide automatically whether a splitting result should be verified by a human or not.
Consider the following document_splitting result:
...
"document_splitting": {
"schema_version": 1,
"sub_documents": [
{
"name": "sub_document_1",
"pages": [
1,
2
]
},
{
"name": "sub_document_2",
"pages": [
3
]
}
],
"split_point_confidences": [
0.6,
0.9
]
},
...
A value approaching 1.0 indicates high confidence, while a value approaching 0.0 indicates low confidence.
This means the model is 90% confident that the split point between sub_document_1 and sub_document_2 is correct, while it is only 60% confident that not having a split point between pages 1 and 2 is correct.
Based on our evaluations we recommend a confidence threshold of 0.7 to balance manual effort in correction against errors remaining after correction.
In our example, this means we would recommend manually verifying whether pages 1 and 2 should indeed not be split.
If you find the results of your model are not accurate, you can use the feedback endpoint as described in the Active Learning section to add the document to the training data of your workflow for further improvement.
Note that in a stream, every split/merge prediction point has its own confidence score. Manual correction effort can focus on low confidence split/merge points instead of going through the whole stream.
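To make this concrete, here is a small Python helper that routes a splitting result to manual verification whenever any split/merge point falls below the recommended threshold; the function and threshold constant are illustrative, not part of the API:

# Recommended default threshold, see above.
CONFIDENCE_THRESHOLD = 0.7

def needs_manual_verification(document_splitting: dict) -> bool:
    """Return True if any split/merge point confidence is below the threshold."""
    confidences = document_splitting.get("split_point_confidences", [])
    return any(c < CONFIDENCE_THRESHOLD for c in confidences)

# With the result above, confidences [0.6, 0.9]: 0.6 < 0.7, so this
# document would be routed to a human reviewer.
example_result = {
    "sub_documents": [
        {"name": "sub_document_1", "pages": [1, 2]},
        {"name": "sub_document_2", "pages": [3]},
    ],
    "split_point_confidences": [0.6, 0.9],
}
assert needs_manual_verification(example_result)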
Credit cost#
A Freemium account allows for up to 100 pages per month; the cost is 25 credits per split output document.