Automate Sensitive Data Extraction

Hong Lin Published May 17, 2023 #iDox.ai#

Reading contracts is a common but time-consuming task for businesses. In this article, we demonstrate how to use the AI-powered iDox.ai API to extract sensitive data from contracts, such as organization names, person names, dates, and person roles. We use RapidAPI to call the iDox.ai API. After you finish all the steps, you can also redact sensitive data from a document, as in the screenshot below.

Redacted PDF

RapidAPI is an API platform for developers to find and test APIs to build their products. It also allows developers to share their API services.

The iDox.ai Document API provides several intelligent services for document processing. A few of them are listed below:

  • Discover, highlight, and redact sensitive data in PDF and MS Word (DOC and DOCX) files.
  • Classify a document by its contents.
  • Identify the type of contract.
  • Check completeness of a contract.

The general flow for using these services is:

  1. Call upload document API to upload a file and get a job Id.
  2. Call find job status API with the job Id to get a document Id.
  3. Call any of the document intelligence APIs with the document Id.
  4. Call download document API to get the processed document.
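The four steps above can be sketched as a small driver function. This is only an illustration: the four callables are hypothetical wrappers around the corresponding API calls, and the sketch assumes that the job status call may need to be repeated until a document Id is available and that the download call takes the document Id.

```python
import time
from typing import Callable, Optional, Tuple


def run_document_workflow(
    upload: Callable[[str], str],
    find_job_status: Callable[[str], Optional[str]],
    process: Callable[[str], dict],
    download: Callable[[str], bytes],
    path: str,
    poll_interval: float = 1.0,
    max_polls: int = 30,
) -> Tuple[dict, bytes]:
    """Run upload -> job status -> processing -> download.

    Every callable is a hypothetical wrapper around one API call;
    none of them is part of an official client library.
    """
    # Step 1: upload the file and get a job Id.
    job_id = upload(path)

    # Step 2: ask for the job status until a document Id is returned
    # (assumption: the Id may not be available on the first call).
    doc_id = None
    for _ in range(max_polls):
        doc_id = find_job_status(job_id)
        if doc_id is not None:
            break
        time.sleep(poll_interval)
    if doc_id is None:
        raise TimeoutError(f"no document Id for job {job_id}")

    # Step 3: call one of the document intelligence APIs.
    result = process(doc_id)

    # Step 4: download the processed document.
    processed = download(doc_id)
    return result, processed
```

Passing the API calls in as functions keeps the flow independent of any particular HTTP client, so the same driver can be exercised with stubs before real requests are wired in.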

Create an iDox.ai API developer account

Before starting to use the iDox.ai API, you need to obtain an API key by creating a developer account on iDox.ai.

1. Visit the iDox.ai developer center. Click Create a Developer Account.

iDox.ai Developer Center

2. Register a developer account. After sign-up, you will receive an activation email with a link to verify your registration.

Fill in Your Email and Password to Sign-Up
Account Activation Email for Registration Verification

3. To get an API key, you have to create an organization and then a project. Click New Organization in the left sidebar to create an organization. Then click Create a new project to create a project under the organization.

Create Your Organization
Create Your First Project

4. After a project is created, click the project panel to go to the project view. In the project view, click Key Management in the left sidebar to open the key management view. Click Generate to create an API key.

Copy the API Key

Step-by-Step Guide to Building a Sensitive Data Extraction Workflow

Next, visit the RapidAPI Hub and search for iDox or PII to find the iDox.ai API documentation. You need a RapidAPI account to test APIs.

1. Select Upload Document. Copy the API key you created in the iDox.ai developer console and paste it into the idox-api-engine-key field in the Header Parameters section.

2. Scroll down to the Request Body section and click Choose File to add a file for analysis. The file can be a PDF, DOC, or DOCX. In this example, we upload a PDF file because the highlight feature currently supports only PDF. Click Test Endpoint to make a request.

Paste Your iDox.ai API Key and Add a File
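Outside the RapidAPI console, the same request could be assembled in code. The sketch below only builds the request object and does not send it. The idox-api-engine-key header comes from the step above; the URL, the multipart encoding, and the form field name "file" are assumptions for illustration and should be checked against the API documentation.

```python
import urllib.request
import uuid


def build_upload_request(
    url: str, api_key: str, file_name: str, file_bytes: bytes
) -> urllib.request.Request:
    """Build (but do not send) a multipart upload request.

    Assumptions: the endpoint accepts multipart/form-data and the
    file is sent in a form field named "file".
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (
            'Content-Disposition: form-data; name="file"; '
            f'filename="{file_name}"\r\n'
        ).encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        # Header name taken from the Header Parameters section.
        "idox-api-engine-key": api_key,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder URL and key; replace both with real values before sending.
request = build_upload_request(
    "https://api.example.invalid/upload", "YOUR_API_KEY", "contract.pdf", b"%PDF-1.7"
)
print(request.get_method(), len(request.data))
```

To actually send the request, it could be passed to `urllib.request.urlopen`.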

This API triggers several backend services to process the uploaded file. The text contents are extracted. If it is a scanned file, the OCR service will be triggered. The layout analysis service applies AI to detect the boundary of each paragraph.

If the upload is successful, a response body similar to the image below is returned. Copy the jobId value from the response body; you need it for the Find job status API in the next step.

Copy jobId from the Response Body

3. Move to the find job status view. Paste the jobId from the previous step into the job-id field under Path Variables. Click Test Endpoint to make a request.

Paste jobId You Obtained in the Previous Step

The response body contains a docId. You need this docId for many of the document processing APIs.
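When scripting these steps, the jobId and docId values have to be read out of the JSON responses. The exact response layout appears only in the screenshots, so the minimal sketch below searches the parsed JSON for a key at any nesting level. The two sample responses are invented for illustration and are not real API output.

```python
import json
from typing import Any, Optional


def find_key(data: Any, key: str) -> Optional[Any]:
    """Return the first value stored under `key` in nested JSON data."""
    if isinstance(data, dict):
        if key in data:
            return data[key]
        for value in data.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(data, list):
        for item in data:
            found = find_key(item, key)
            if found is not None:
                return found
    return None


# Invented sample responses; the real layout may differ.
upload_response = json.loads('{"result": {"jobId": "job-123"}}')
status_response = json.loads('{"result": {"status": "done", "docId": "doc-456"}}')

job_id = find_key(upload_response, "jobId")
doc_id = find_key(status_response, "docId")
print(job_id, doc_id)
```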

4. Move to the Classification group. We use the Classify document API to let AI determine the type of the uploaded file. Paste the docId from the previous step into the path variable document-id. Click Test Endpoint to make a request.

Paste docId You Obtained in the Previous Step

Depending on the type of your file, the API returns different results.

Invoice Type Return

5. Move to the PII discovery group. There are four PII discovery API variants. For this example, we use the document PII discovery API to find PII in the uploaded file. Paste the docId from Step 3 into the path variable document-id.

Paste docId You Obtained in the Previous Step

Before clicking Send, let us review the supported entity categories by clicking the Document icon in the right sidebar. They are listed in the Entity Categories table. You can find a full list of entity categories in the documentation on Postman.

You Can Configure Up to 31 types of PII Entities.

Scroll down to the Request Body section at the bottom to add the detection configuration. Add the entity categories that you want to detect to categoriesFilter. In this example, to detect possible sensitive data in a contract, we use Person, PersonType, Email, Organization, Currency, and DateTime to detect person names, person roles, email addresses, organization names, currency amounts, dates, and times.

Configure Entity Categories in Request Body.
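The detection configuration can also be produced programmatically. The sketch below assumes that categoriesFilter is a top-level list of category names in the request body; the screenshot is the authoritative reference for the actual layout, and the function name is made up for this example.

```python
import json
from typing import Iterable

# Categories used in this article for contract data.
CONTRACT_CATEGORIES = [
    "Person", "PersonType", "Email", "Organization", "Currency", "DateTime",
]


def build_discovery_body(categories: Iterable[str]) -> str:
    """Serialize a PII discovery request body.

    Assumption: the body is a JSON object whose categoriesFilter
    field holds a list of category names.
    """
    # Drop duplicates while keeping the order in which they were given.
    unique = list(dict.fromkeys(categories))
    return json.dumps({"categoriesFilter": unique}, indent=2)


print(build_discovery_body(CONTRACT_CATEGORIES))
```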

Click Test Endpoint and wait a few seconds. The response body contains a list of the detected paragraphs and PII entities. The example below shows some of the detected entities. The contract we used can be found on the website of the U.S. Securities and Exchange Commission (SEC).

Detected Organization Names.

Compared to reading contracts yourself, capturing this data with our APIs saves a lot of time.

With the iDox.ai API and other APIs, you can build various contract processing automation workflows. For example, you can build a contract signing workflow with a PDF eSigning API, or redact sensitive data in a contract before sharing it with outsiders.

Bonus content

We are just one step away from redacting sensitive data.

In this step, we redact the detected sensitive data in the contract. Move to the redact document API. Before using this API, we need to rearrange the output of the previous step into JSON and add it to the request body. A sample JSON is shown below. The highlight variable controls whether the text is redacted (false) or highlighted (true). Create a similar JSON with the entities returned from the previous step.

Configure Keywords You Want to Redact in Request Body
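Rearranging the detection output by hand gets tedious for long contracts, so it could be scripted. In the sketch below, only the highlight field comes from the text above. The "keywords" field name and the entity layout (dictionaries with "text" and "category") are hypothetical placeholders; adapt them to the sample JSON in the screenshot and to the actual response.

```python
import json
from typing import Iterable, Mapping, Optional, Set


def build_redaction_body(
    entities: Iterable[Mapping[str, str]],
    highlight: bool = False,
    categories: Optional[Set[str]] = None,
) -> str:
    """Collect unique entity texts into a redaction request body.

    Hypothetical layout: each entity has "text" and "category" keys,
    and the body lists the texts under "keywords".
    """
    keywords = []
    seen = set()
    for entity in entities:
        # Optionally keep only the selected categories.
        if categories is not None and entity.get("category") not in categories:
            continue
        text = entity.get("text", "").strip()
        if text and text not in seen:
            seen.add(text)
            keywords.append(text)
    # highlight=False redacts the text, highlight=True highlights it.
    return json.dumps({"highlight": highlight, "keywords": keywords}, indent=2)


# Invented entities for illustration only.
detected = [
    {"text": "Acme Corp", "category": "Organization"},
    {"text": "Jane Doe", "category": "Person"},
    {"text": "Acme Corp", "category": "Organization"},
]
print(build_redaction_body(detected, highlight=False))
```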

Paste your docId into document-id and the JSON text into the Request Body section. Unfortunately, unlike Postman, RapidAPI cannot visualize the returned PDF, so only the text contents of the redacted PDF file are returned after you click Test Endpoint. Below is a screenshot of the same configuration in Postman.

Redacted PDF

Please visit the iDox.ai Developer Center to create an account. You can use 1000 cool APIs for free. You can also access API documentation on Postman.

You can start to experience our smart document technologies with iDox.ai Suite, which is a bundle of web apps, without writing any code. It’s free to start with the basic plan.

Feel free to drop your suggestions, feedback, and questions in our Slack support channel. You can also leave your messages down below. We love checking them out.
