PDF.co REST API for Data Extraction

Whether you are aware of it or not, APIs are everywhere today. They control how we get information from social media sites, how we navigate with GPS, and how we book tickets online. Many applications and projects use a REST API for its scalability and portability, among other benefits.

Here's what we will cover:

API Key Features
Getting Started with PDF.co API
PDF.co API Documentation
PDF.co API Details
How to Proceed with PDF.co Web API

API Key Features

With this in mind, we at ByteScout introduce the PDF.co RESTful Web API. In this session, we will take a brief look at it. The PDF.co Web API offers a rich set of tools for document manipulation, data extraction, data conversion, and more. You can also split and merge PDF documents with it. One of its major features is built-in AI-powered OCR.

This means it can extract text not only from scanned images but also from unstructured documents and images, using AI and machine learning. You can generate barcodes, or read barcodes and QR codes, from a variety of file formats such as PNG, JPEG, TIFF, or PDF. The PDF.co Web API can automatically decode all popular barcode types, such as EAN, Code 39, Code 128, UPC, and many others. You can extract data from PDF, DOCX, RTF, or XLS files with formatting preserved. You can also extract structured data from tables or form fields: the API automatically detects the table structure and extracts the data from it. You can convert your PDF file to TXT, CSV, XLS, JSON, and so on.

You can also convert a document or image file to PDF, for example HTML to PDF or JPG to PDF. The PDF.co Web API also includes a document parser feature that provides template-based data extraction: you design a template and then use it for accurate data extraction from documents. We will study the document parser template editor tool later in this course. The PDF.co Web API is also available in an on-premise version and can be customized. This means you can install the PDF.co API on your own in-house server and use it with your local files without an Internet connection. Now, as a customer, the first question that comes to mind is: why should I choose the PDF.co Web API? The first and most important reason is security.

Getting Started with PDF.co API

This API runs on secure Amazon AWS infrastructure. All data transfers are encrypted with SSL/TLS. Second, the API provides an asynchronous mode, which means you can process large files with hundreds of pages in the background. Most importantly, we provide free technical and developer support at any time. The PDF.co Web API also comes with rich documentation, and we provide thousands of premade source code samples for easy implementation in your favorite programming language. On this GitHub site, you can find sample source code for the PDF.co Web API in your favorite programming language. The samples cover a number of functions and options.

Open the GitHub repository and you will find a number of samples in languages such as C#, JavaScript, PHP, and Python. They show the functions and options to use when calling the API to implement features such as PDF to CSV, PDF to Excel, or, in this case, PDF to text. Now let's see the API documentation. By clicking on REST API docs, you will be redirected to the API documentation.

Here we provide detailed documentation for each API endpoint, where you can find the resource description, source code, and a sample link. You can contact our dedicated support team at any time about any issue through the link or email mentioned there. Then you can see the endpoints and methods. Clicking a link takes you to its detail section. In the previous session, we saw how PDF Multitool can read data from scanned images and generate a text file. You can achieve the same functionality using the PDF.co Web API. Find the endpoint for it and click on this link (/pdf/convert/to/text).

PDF.co API Documentation

Now let’s get an overview of the PDF to text API in this documentation. This is the endpoint, or URL, that we need to use in our application. These are the parameters you need to pass when you invoke the API. The async parameter is optional. If you have a large file and want to process it asynchronously, pass the value true in this parameter. Once it is set to true, you will get a job id as output. You can then check the status of the processing job by using this URL (/job/check).
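The asynchronous flow described here can be sketched in Python as a simple polling loop. This is a minimal sketch under stated assumptions, not an official client: the base URL, the `jobid` query parameter, the `status` field, and its in-progress value `working` are assumptions that should be checked against the REST API docs. `fetch_json` is a hypothetical helper that you supply to perform the HTTP call and return the decoded JSON.

```python
import time
from typing import Callable, Dict

# Assumed base URL; confirm it in the REST API docs.
JOB_CHECK_URL = "https://api.pdf.co/v1/job/check"


def wait_for_job(
    job_id: str,
    fetch_json: Callable[[str, Dict[str, str]], dict],
    poll_seconds: float = 3.0,
    max_attempts: int = 20,
) -> dict:
    """Poll /job/check until the background job is no longer in progress.

    fetch_json(url, params) is a caller-supplied function that performs the
    HTTP request and returns the decoded JSON body.
    """
    for _ in range(max_attempts):
        result = fetch_json(JOB_CHECK_URL, {"jobid": job_id})
        # "working" is the assumed status value for a job still in progress.
        if result.get("status") != "working":
            return result
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} checks")
```

Passing the HTTP call in as a function keeps the loop independent of any particular HTTP library and makes it easy to try out with a stub before spending API credits.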

The encrypt parameter is optional. If you want the output file to be encrypted, pass true in this parameter. The URL parameter is required: here you specify the URL of the source PDF file from which you want to extract text. Then you specify the name of the output file. If your PDF file is password protected, specify the password in the password parameter. The remaining parameters are optional. If you want to extract a specific page or page range, pass the parameter in this format. If you pass nothing in this parameter, the API will read all pages of the specified PDF file.
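To make the parameter list concrete, here is a minimal Python sketch that assembles a request body from the parameters described above. The exact JSON field names (`url`, `name`, `password`, `pages`, `async`, `encrypt`) and the default output name are assumptions for illustration; the documentation page is the authority on spelling and on the page-range format, which is passed through here unchanged.

```python
from typing import Optional


def build_text_request(
    url: str,
    name: str = "result.txt",  # hypothetical default output filename
    password: Optional[str] = None,
    pages: Optional[str] = None,
    run_async: bool = False,
    encrypt: bool = False,
) -> dict:
    """Assemble a request body for /pdf/convert/to/text (field names assumed)."""
    if not url:
        raise ValueError("url is required: it points to the source PDF file")
    body = {"url": url, "name": name, "async": run_async, "encrypt": encrypt}
    # Optional fields are omitted when unused, so the API applies its defaults
    # (for example, reading all pages when no page range is given).
    if password:
        body["password"] = password
    if pages:
        body["pages"] = pages
    return body
```

For example, `build_text_request("https://example.com/sample.pdf", run_async=True)` would produce a body that asks for background processing of every page of a hypothetical sample file.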

When you invoke this API, you will get output like this in JSON format. The URL in the response points to your converted text file. If you copy this URL and open it in your browser, you can see your PDF content in the converted text file. As with any other API, you also need to set your API key in the request header.
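The request header and the JSON response can be sketched with Python's standard library as follows. This is an illustrative sketch only: the base URL, the `x-api-key` header name, and the `error`, `message`, and `url` response fields are assumptions to verify against the documentation.

```python
import json
import urllib.request

# Assumed full endpoint URL; confirm the base URL in the REST API docs.
API_URL = "https://api.pdf.co/v1/pdf/convert/to/text"


def make_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries the API key header."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        # "x-api-key" is the assumed header name for the API key.
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def result_url(response_text: str) -> str:
    """Return the converted file's URL from a JSON response body."""
    payload = json.loads(response_text)
    if payload.get("error"):
        raise RuntimeError(payload.get("message", "conversion failed"))
    return payload["url"]
```

To actually send the request, pass the object returned by `make_request` to `urllib.request.urlopen`, read and decode the response body, and hand the text to `result_url`.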

PDF.co API Details

Now you have a better understanding of this API. The easiest way to get started with the PDF.co Web API is to use the Postman request collection. Postman is a free-to-download tool for making HTTP requests. You can download the API request collection from this link. Here you can see the request collection in JSON. Save it to your local drive so that you can import it into Postman.
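Because the request collection is plain JSON, you can also inspect it outside Postman. The sketch below assumes the file follows the public Postman Collection format (a top-level `item` list, folders that nest further `item` lists, and a `request` entry with `method` and `url`); the structure of the actual PDF.co collection has not been verified here.

```python
from typing import Iterator, Tuple


def iter_requests(items: list, prefix: str = "") -> Iterator[Tuple[str, str, str]]:
    """Yield (path, method, url) for every request in a Postman collection."""
    for item in items:
        path = f"{prefix}/{item.get('name', '')}"
        if "item" in item:
            # A folder: recurse into its children.
            yield from iter_requests(item["item"], path)
        elif "request" in item:
            request = item["request"]
            if isinstance(request, str):
                # Short form: the request is just a URL string.
                yield path, "GET", request
                continue
            url = request.get("url", "")
            # The URL may be stored as an object with a "raw" string.
            if isinstance(url, dict):
                url = url.get("raw", "")
            yield path, request.get("method", "GET"), url
```

After loading the saved file with `json.load`, calling `iter_requests(collection["item"])` and filtering for URLs that end in `/pdf/convert/to/text` is one way to locate the PDF to text request.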

Now open Postman. To import the request collection, click the Import button. Browse to the local drive path where you saved the JSON file, select it, and click the Open button. You can also import this JSON file directly from this link: just paste the URL here and click the Continue button, and it will be imported into Postman. Now let us find the API URL that converts PDF documents into text. Click on this link (/pdf/convert/to/text). Now let's go to the body part and go through the input parameters once again. This is the URL of the PDF file that we want to convert into a text file.

How to Proceed with PDF.co Web API

Open this PDF file in the browser. This is the PDF file that we want to convert into text. Here I specify the output filename. If your PDF is password protected, specify the password in the password parameter. In the same way, you can specify the page number or page range in this parameter. In our case, let’s leave it blank. In the header part, you need to set your registered API key. I have already set mine.

Now invoke the API by clicking the Send button. We have our API response: as you can see, this is the converted filename, and the converted text file resides at this URL. Copy this URL to open the converted text file. Here we have our PDF data in text format. This is how you can test different PDF.co APIs as your requirements dictate. In the request collection, you can see the different use cases of the PDF.co API.
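The last step, opening the result URL to read the converted text, can also be done in code. The following is a small sketch that assumes the converted file is served as UTF-8 text at the URL returned in the response; the `opener` argument is a hypothetical hook that defaults to the standard library's `urlopen` and can be swapped for a stub.

```python
import urllib.request
from typing import Callable


def read_converted_text(
    file_url: str,
    opener: Callable = urllib.request.urlopen,
    encoding: str = "utf-8",  # assumed encoding of the converted file
) -> str:
    """Download the converted text file and return its contents as a string."""
    with opener(file_url) as response:
        return response.read().decode(encoding)
```

Calling `read_converted_text` with the URL copied from the API response should return the same text you see when opening that URL in the browser.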
