Overview

In this guide, we'll walk through the steps to create your own custom document extraction model. We'll use Utility Bills as our example, but feel free to follow along with any type of document you'd like.

You'll learn how to do the following:

  1. Create a new custom model
  2. Specify what information you'd like your model to extract
  3. Train your custom model to increase its accuracy

By the end of this guide you'll have a highly accurate custom document extraction model that can be used to process documents in your product using our REST APIs. Let's get started!

Create your custom model

To begin building your new custom model, go to the Library page:


Simply click the Create Custom Model button to begin building your new model.

Once done, you'll land in the interactive Model Training interface. From this single page you'll be able to do a number of things, including:

  1. Edit Model Name: Update the name of your model
  2. Manage Training Documents: Upload and delete documents from your training dataset
  3. Label Training Documents: Label your training documents so that they can be used to train your Custom Model.

First, let's give your new model a name. You can do so by clicking the Edit icon in the top left corner, and entering a helpful name. We'll name ours Utility Bills:


Upload training documents

You'll notice that, by default, your model needs training before it is ready to use:


This means that it has not been trained yet, and as such accuracy won't be as high as we'd like. Let's fix that!

In order to train, we'll need to label at least 5 example documents for our model to learn from. Drag and drop (or click to browse) your sample documents onto the document uploader in the middle of the screen. Once your documents finish the initial OCR processing, they'll be ready to label:


📘

How many documents should I upload for training?

The short answer is it depends on what your documents look like.

As a general rule of thumb, we advise using 5 training documents to get started. After using those 5 to train and test your model (more on this later) you can always come back and label more examples to further increase your accuracy.

For subsequent rounds of training, we recommend labeling 5-10 additional documents, retraining, and then testing the accuracy again. You can repeat this process until your model reaches the accuracy you want!
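To make "testing the accuracy" concrete between training rounds, one option is to compare the values your model extracts against values you have checked by hand. The following is a minimal sketch, not part of the platform: the function name, the input format (one dictionary of field name to value per test document), and the second document's amounts are all made up for illustration.

```python
from collections import defaultdict


def field_accuracy(predictions, expected):
    """Return the fraction of correctly extracted values per field name.

    Both arguments are parallel lists holding one {field_name: value} dict
    per test document. This input format is an assumption of this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, truth in zip(predictions, expected):
        for name, true_value in truth.items():
            total[name] += 1
            # A missing or empty prediction counts as a miss.
            predicted_value = (predicted.get(name) or "").strip()
            if predicted_value == true_value.strip():
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Hand-checked values for two test documents (second document is made up).
expected = [
    {"Account Number": "0151096524-5", "Total Amount Due": "$79.69"},
    {"Account Number": "0151096524-5", "Total Amount Due": "$102.10"},
]
# Values extracted by the model for the same two documents (illustrative).
predictions = [
    {"Account Number": "0151096524-5", "Total Amount Due": "$79.69"},
    {"Account Number": "0151096524-5", "Total Amount Due": "$120.10"},
]

scores = field_accuracy(predictions, expected)
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
```

Fields that score low after a round are good candidates for extra labeled examples in the next round.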

Adding fields to your model

The next step is to define what information you want your model to extract. We call the pieces of data to extract from your documents Fields. You can see that your model doesn't have any fields yet, which means nothing will be extracted.


To add your first field, click the Add button.


You'll notice that you can train your model to extract a number of different types of information:

Text Fields:
This is the default field type. It can be used to extract names, dates, addresses, amounts and more. When in doubt, your field type should be a Text Field.

Checkboxes:
This field type enables you to extract the selection status of a Checkbox from your documents. The value for the Checkbox field type is either Selected or Not Selected.

📘

What if I have many different checkboxes?

If you have many different checkboxes (or radio buttons) on your document, you'll want to add a separate field for each checkbox (or radio button) that appears on the document. This means you'll get the selection state for all of them in your API response.

Signatures:
This field type enables you to extract whether or not a document is signed in a specific location. The value for the Signature field type is either Present or Not Present.

Tables:
This field type enables you to extract tabular information from your documents. You can define your table to extract as many different columns as you'd like!
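Because Checkbox and Signature fields report their state as text (Selected / Not Selected and Present / Not Present), you will probably want to convert those values to booleans in your own code. The sketch below assumes these fields arrive as entries with fieldName and value keys, like the text fields in an API response; the field names used here are hypothetical.

```python
# Value strings taken from the field type descriptions above.
TRUE_VALUES = {"Selected", "Present"}
FALSE_VALUES = {"Not Selected", "Not Present"}


def to_bool(value):
    """Convert a Checkbox or Signature value string to a boolean."""
    cleaned = value.strip()
    if cleaned in TRUE_VALUES:
        return True
    if cleaned in FALSE_VALUES:
        return False
    # Fail loudly rather than guess at an unexpected value.
    raise ValueError(f"Unexpected checkbox/signature value: {value!r}")


# Hypothetical fields; your own model defines the real names.
form_fields = [
    {"fieldName": "Paperless Billing", "value": "Selected"},
    {"fieldName": "Auto Pay", "value": "Not Selected"},
    {"fieldName": "Customer Signature", "value": "Present"},
]

flags = {field["fieldName"]: to_bool(field["value"]) for field in form_fields}
print(flags)
```

Raising on an unknown value is a deliberate choice in this sketch: it surfaces surprises early instead of silently treating them as unchecked.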

Go ahead and add as many different fields to your model as you'd like.

For simplicity, we only added the following three fields to our Utility Bill model:

  1. Account Number: the number associated with a given Utility Bill
  2. Statement Date: the date the Utility Bill was issued
  3. Total Amount Due: the total amount due for this Utility Bill

Here is an example of what an API response for this custom model would look like:

{
  "documentId": "ee56aa3b-7eff-4e24-b875-cd806015ecce",
  "documentStatus": "Completed",
  "fileName": "utility_bill0.jpeg",
  "mimeType": "image/jpeg",
  "documentType": "Utility Bills",
  "confidenceScore": "High",
  "formFields": [
    {
      "fieldName": "Account Number",
      "value": "0151096524-5",
      "confidenceScore": "High"
    },
    {
      "fieldName": "Statement Date",
      "value": "07/13/2018",
      "confidenceScore": "High"
    },
    {
      "fieldName": "Total Amount Due",
      "value": "$79.69",
      "confidenceScore": "High"
    }
  ],
  "tables": []
}

Keep this API response in mind while defining your fields, as it's how you'll integrate your model into your product!
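To illustrate how the field names you choose show up in your integration, here is a minimal sketch that turns the response above into a lookup by field name and converts two values into typed Python objects. The helper name is our own, and the date and currency formats (MM/DD/YYYY and a leading dollar sign) are assumptions based only on this one sample.

```python
from datetime import datetime
from decimal import Decimal

# The sample response from above, trimmed to the keys this sketch uses.
response = {
    "documentStatus": "Completed",
    "documentType": "Utility Bills",
    "formFields": [
        {"fieldName": "Account Number", "value": "0151096524-5",
         "confidenceScore": "High"},
        {"fieldName": "Statement Date", "value": "07/13/2018",
         "confidenceScore": "High"},
        {"fieldName": "Total Amount Due", "value": "$79.69",
         "confidenceScore": "High"},
    ],
    "tables": [],
}


def fields_by_name(response):
    """Map each field name to its extracted value (hypothetical helper)."""
    return {field["fieldName"]: field["value"] for field in response["formFields"]}


values = fields_by_name(response)

# Assumption: dates look like MM/DD/YYYY, as in the sample.
statement_date = datetime.strptime(values["Statement Date"], "%m/%d/%Y").date()

# Assumption: amounts are US dollars with an optional thousands separator.
amount_due = Decimal(values["Total Amount Due"].replace("$", "").replace(",", ""))

print(values["Account Number"], statement_date, amount_due)
```

Since your code looks values up by field name, renaming a field in your model later means updating the integration too.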

📘

How many fields can I add?

You can add as many Fields to your model as you'd like! That being said, we recommend trying your first model out with a few fields just to get comfortable with the platform. You can always create a new model with your full schema later!

Once you have added all the fields you need for now, let's start labeling those documents.

Labeling your sample documents

Labeling is extremely easy with the labeling interface. All you need to do is select which field you'd like to label first by clicking on the field on the right panel. You'll know the field is active if it has a colored background, rather than a white one.

Once you have your active field, simply click (or drag) on the text in the documents to select the value that should be extracted for that field. If done correctly, you'll notice the selected values underneath the field's name as well as a colored block on top of the value on the document:


That's all it takes! Once you've labeled the values for each of your fields, you've officially labeled your first document!

📘

Hotkeys

There are a number of hotkeys available to make your labeling experience even easier! They are:

  1. Move to next field: Tab automatically moves to the next field on the right panel
  2. Add row to a table: Cmd+A adds a new row to a table. This hotkey only works if you are actively labeling a table field.

Once you've finished your first document, you can click on the next document on the left panel to begin labeling it.


While labeling, you may have noticed the border colors changing on the list of documents on the left. Let's go over that quickly:

  1. Blue: A blue border indicates that this is the document you are actively labeling
  2. Yellow: A yellow border indicates that this document has not yet been labeled and is ready to be looked at
  3. Green: A green border indicates that this document has already been labeled

Training your model

After you have labeled at least five documents, you are ready to train your model! You should now see the blue Train button enabled and the green progress bar complete:


Go ahead and click the Train button to kick off your first model training. Great work! You are only seconds away from testing your first Custom Extraction Model.

Generally, training is quite fast and should complete in about 15-30 seconds. As you add more documents to your training set, training may take a bit longer, but never more than a couple of minutes.


Once training has completed, your custom model is ready for use!

Next steps

Now that you have a trained custom model, typically you'll want to try it out using the REST API. Check out this guide to help you get started!
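Once you start receiving responses, you may want a simple rule for deciding which documents a person should double-check. The sketch below flags a response unless the document and every field report High confidence and the status is Completed. Those are the only values shown in this guide, so the Low value in the sample is a placeholder, and the function names are our own.

```python
def low_confidence_fields(response, trusted=("High",)):
    """Return the names of fields whose confidenceScore is not trusted."""
    return [
        field["fieldName"]
        for field in response.get("formFields", [])
        if field.get("confidenceScore") not in trusted
    ]


def needs_review(response):
    """Return True if a person should double-check this response."""
    if response.get("documentStatus") != "Completed":
        return True
    if response.get("confidenceScore") != "High":
        return True
    return bool(low_confidence_fields(response))


# Illustrative response; "Low" is a placeholder value, not confirmed here.
sample = {
    "documentStatus": "Completed",
    "confidenceScore": "High",
    "formFields": [
        {"fieldName": "Account Number", "value": "0151096524-5",
         "confidenceScore": "High"},
        {"fieldName": "Total Amount Due", "value": "$79.69",
         "confidenceScore": "Low"},
    ],
}

print(needs_review(sample), low_confidence_fields(sample))
```

Treating anything other than High as untrusted keeps the rule safe even if the API returns confidence values this sketch does not know about.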