📘

Estimated Time: 15 Minutes

The Basics

In this Getting Started Guide, you'll learn how to automate document processing by uploading documents to Butler's REST API using Queues.

Uploading documents to the API is a simple two-step workflow:

  1. Upload documents to begin an extraction job
  2. Get the extraction results for use in downstream workflows

We'll create a simple Python script to upload a document to Butler for processing and then print the extraction results.

Note: This is a follow-up guide to Extracting data from your documents, so check that out if you haven't already!

Step 1: Prerequisites

Before getting started, you'll need to collect a few things:

API Key
If you haven't already, generate your API Key and copy it for use within the Python script. See here to learn how to do that.

Queue Id
You'll need the API Id of the Queue you'd like to upload documents to for processing. To find it, go to the Queues page and copy the Id.

Step 2: Prepare environment

First, make sure to install all the necessary libraries for our script:

# We'll use requests for making all the necessary API calls
pip install requests

Once done, create a new Python script named process_docs.py, import the necessary libraries, and define some helpful variables to use throughout:

# Import necessary python libraries
import os
import requests
import time
import mimetypes
 
# Specify variables for use in script below
api_base_url = 'https://app.butlerlabs.ai/api'

# Use the API key you grabbed in Step 1 to define headers for authorization 
api_key = 'MY_API_KEY'
auth_headers = {
  'Authorization': f'Bearer {api_key}'
}
# Use the Queue API Id you grabbed in Step 1
queue_id = 'MY_QUEUE_ID'
# Specify the path to the file you would like to process
file_location = 'PATH_TO_MY_FILE'
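
Hardcoding the key is fine for a quick test, but you may prefer to keep it out of the script. Here is a minimal sketch that reads the same values from the environment. The variable names BUTLER_API_KEY and BUTLER_QUEUE_ID and the load_settings helper are assumptions made for this example, not names defined by Butler:

```python
import os


def load_settings(env=None):
    """Read the API key and Queue Id from environment variables.

    BUTLER_API_KEY and BUTLER_QUEUE_ID are hypothetical names chosen for
    this sketch; use whatever names suit your setup.
    """
    env = os.environ if env is None else env
    required = ('BUTLER_API_KEY', 'BUTLER_QUEUE_ID')
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f'Missing environment variables: {", ".join(missing)}')

    api_key = env['BUTLER_API_KEY']
    return {
        # Same header shape as auth_headers in the script above
        'auth_headers': {'Authorization': f'Bearer {api_key}'},
        'queue_id': env['BUTLER_QUEUE_ID'],
    }
```

In process_docs.py you could then call load_settings() once and use the returned auth_headers and queue_id in place of the hardcoded values.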

Step 3: Upload document to API for processing

The first API call to make is to the /queues/{queueId}/uploads endpoint. This endpoint enables us to upload files to our Queue, which will start an asynchronous extraction job.

It returns an uploadId that we'll use to later retrieve the extraction results.

# Specify the API URL
upload_url = f'{api_base_url}/queues/{queue_id}/uploads'

# Prepare file for upload
mime_type = mimetypes.guess_type(file_location)[0]

# Upload file to api; the with block closes the file even if the request fails
print(f'Uploading {file_location} to Butler for processing')
with open(file_location, 'rb') as file:
  files_to_upload = [
    ('files', (file_location, file, mime_type))
  ]
  upload_json = requests.post(
    upload_url,
    headers=auth_headers,
    files=files_to_upload
  ).json()

print(upload_json)
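
One detail worth knowing: mimetypes.guess_type returns None for extensions it does not recognize. The sketch below shows one way to build the upload entry with a fallback type and a bare file name. The build_file_part helper is hypothetical, and whether Butler prefers a bare file name over a full path is an assumption based on the sample response in this guide:

```python
import mimetypes
import os


def build_file_part(path, file_obj):
    """Build one ('files', (name, fileobj, mime)) entry for requests' files= argument."""
    # Fall back to a generic binary type when the extension is unknown
    mime_type = mimetypes.guess_type(path)[0] or 'application/octet-stream'
    # Send only the file name, not the full local path (assumption, see lead-in)
    return ('files', (os.path.basename(path), file_obj, mime_type))
```

With this helper, files_to_upload becomes [build_file_part(file_location, file)].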

If done correctly, you should see a JSON response that looks something like this:

{
  "uploadId": "dd47aead-3143-42ef-9423-42asa3675ed6",
  "documents": [
    {
      "filename": "my_file.pdf",
      "documentId": "f1ea4e64-b514-4b51-aba7-65545ba243a6"
    }
  ]
}

Notice the uploadId property on the response. This is the Id that we'll use to get the extracted results.
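
If you upload several files at once, it can help to keep track of which documentId belongs to which file. Here is a small sketch that pulls both pieces out of a response shaped like the sample above. The summarize_upload helper is hypothetical, and the sample data is copied from this guide:

```python
def summarize_upload(upload_json):
    """Return (upload_id, {filename: document_id}) from an upload response."""
    upload_id = upload_json['uploadId']
    documents = {
        doc['filename']: doc['documentId']
        for doc in upload_json.get('documents', [])
    }
    return upload_id, documents


# Sample response copied from this guide
sample_upload = {
    'uploadId': 'dd47aead-3143-42ef-9423-42asa3675ed6',
    'documents': [
        {
            'filename': 'my_file.pdf',
            'documentId': 'f1ea4e64-b514-4b51-aba7-65545ba243a6',
        }
    ],
}

upload_id, documents = summarize_upload(sample_upload)
print(upload_id)
print(documents)
```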

Step 4: Getting the extraction results

We'll use the /queues/{queueId}/extraction_results endpoint to retrieve the extraction results.

Processing a document can take a few seconds. In general, it may take around 30 seconds to process up to 5 pages of data (although it's often much faster).

You'll want to poll this results endpoint until processing has completed for each document. You can use the documentStatus property to check the status of the extraction results for any single document.

extraction_results_url = f'{api_base_url}/queues/{queue_id}/extraction_results'

# Prepare query parameters
upload_id = upload_json['uploadId']
params = { 'uploadId': upload_id }

# Poll on extraction results until the extraction job has completed
# We'll set a placeholder for extraction_results
extraction_results = {'documentStatus': 'UploadingFile'}
while extraction_results['documentStatus'] != 'Completed':
  results_json = requests.get(
    extraction_results_url,
    headers=auth_headers,
    params=params
  ).json()
 
  # items contains the list of extraction results for all documents you 
  # uploaded. For this guide, we'll assume you only uploaded a single doc
  extraction_results = results_json['items'][0]
  status = extraction_results['documentStatus']

  if status != 'Completed':
    print('Upload still processing. Sleeping for 10 seconds...')
    time.sleep(10)
  else:
    print('Upload complete. Extraction results ready')

You can see we use the documentStatus property to check the status of the document's extraction results.

If the extraction job completed successfully, you should see output like the following in your shell:

Upload still processing. Sleeping for 10 seconds...
Upload complete. Extraction results ready

📘

Document Status

There are a few different Status values for a document's extraction results. For more details on how to use this endpoint in a production setting, see here.
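
The loop in this guide keeps polling until the status is Completed, so it would never exit if a document ended up in some other final state. Here is a minimal sketch of a polling helper with a timeout. The poll_until_complete function and its fetch_item argument are hypothetical names for this example, and the sketch only recognizes 'Completed', the one final status shown in this guide:

```python
import time


def poll_until_complete(fetch_item, interval_seconds=10, timeout_seconds=300,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_item() until its documentStatus is 'Completed' or time runs out.

    fetch_item is a caller-supplied function (not part of Butler's API) that
    returns the extraction result dict for a single document.
    """
    deadline = clock() + timeout_seconds
    while True:
        item = fetch_item()
        if item['documentStatus'] == 'Completed':
            return item
        if clock() >= deadline:
            raise TimeoutError(
                f"Still {item['documentStatus']} after {timeout_seconds} seconds"
            )
        sleep(interval_seconds)
```

In the script, fetch_item could be a small function that performs the requests.get call shown earlier and returns results_json['items'][0]. Any failure status your Queue can report would need its own check.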

Once the extraction results are ready, let's print out the results!

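# Print out the extraction results
file_name = extraction_results['fileName']
print(f'\nExtracted data from {file_name}:')

# The sample response below lists fields under extractedFields, so read
# that key first and fall back to formFields if it is absent
fields = extraction_results.get('extractedFields') or extraction_results.get('formFields', [])
for field in fields:
  field_name = field['fieldName']
  extracted_value = field['value']

  print(f'{field_name}: {extracted_value}')

The full response from the extraction results endpoint looks something like this:

{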
  "items": [
    {
      "documentId": "c90418ca-038d-4839-8274-b468273cb230",
      "documentStatus": "Completed",
      "fileName": "utility_bill3.pdf",
      "mimeType": "application/pdf",
      "documentType": "Utility Bills",
      "confidenceScore": "High",
      "extractedFields": [
        {
          "fieldName": "Account Number",
          "value": "1234 5678 900",
          "confidenceScore": "High"
        }
      ],
      "tables": []
    }
  ],
  "hasNext": false,
  "hasPrevious": false,
  "totalCount": 1
}

You'll notice that the response includes some metadata about the file as well as the extracted results in the extractedFields property. You can then use these extracted values in any downstream workflows you'd like.
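
For downstream use, a flat dictionary is often easier to work with than a list of field objects. Here is a minimal sketch, assuming each field has the fieldName, value, and confidenceScore keys shown in the sample response. The fields_to_dict helper and its confidence filter are hypothetical:

```python
def fields_to_dict(fields, confidence=None):
    """Map fieldName -> value, optionally keeping only one confidence level."""
    return {
        field['fieldName']: field['value']
        for field in fields
        if confidence is None or field.get('confidenceScore') == confidence
    }


# Field list copied from the sample response in this guide
sample_fields = [
    {
        'fieldName': 'Account Number',
        'value': '1234 5678 900',
        'confidenceScore': 'High',
    }
]

extracted = fields_to_dict(sample_fields)
high_only = fields_to_dict(sample_fields, confidence='High')
print(extracted)
```

The confidence filter does an exact match on the confidenceScore string, since 'High' is the only value this guide shows.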

Assuming your script ran successfully, you should see the following output in your shell:

Extracted data from utility_bill3.pdf:
Account Number: 1234 5678 900

If you reached this point, congrats! You just processed your first document with Butler!

Where to next

You now have all the tools you need to add document processing into any product or workflow. We can’t wait to see what you’re going to build!

If you haven't already, make sure to check out the Getting Started Guide for Extracting data from your documents to learn how to create additional document types and train them to reach high accuracy.

If you're ready to build your production integration, check out the API Reference Section for more details on the specific endpoints.