Extracting Table Data from PDFs in C# .NET

Positiwise
Oct 9, 2023

PDF is a ubiquitous file format for representing documents in a fixed layout. However, unless a file has been explicitly tagged, the format records where text, lines, and images are drawn on the page, not logical structures such as tables. This presents challenges for tasks like analyzing PDF documents programmatically or converting them to other formats like Excel.

In this blog post, we will look at how to extract table data from PDF documents into a structure such as a .NET DataTable using C#. Extracting tabular data from PDFs allows us to analyze and process the structured information contained within PDF tables.

Understanding PDF Table Structure

Before we dive into code, it’s important to understand the underlying structure of tables in PDF format. PDF documents represent visual content using a series of drawing commands that render text, lines, and shapes onto virtual “pages.”

Tables in PDF are rendered as a collection of these drawing primitives with no inherent logical grouping or semantic meaning. They are just rectangular shapes, lines, and text rendered in a tabular layout.

To extract the table data programmatically, we need to analyze the visual layout and recognize common patterns that indicate a table structure, such as:

  • Rectangular “cells” formed by lines
  • Repeating row/column patterns
  • Text flowing left-to-right or top-to-bottom within cells
  • Alignment and spacing that suggest a tabular structure

Table extraction libraries such as Tabula analyze these visual cues and recognize underlying tables by detecting common tabular patterns in the drawing commands. General-purpose PDF libraries usually stop at exposing the raw text and graphics.
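The alignment cue can be turned into a simple heuristic. The sketch below is not taken from any library: it assumes text fragments have already been extracted together with the position of their top-left corner, and the names Fragment and TextGrid are hypothetical. It groups fragments into rows by similar vertical position and into columns by similar left edge, assuming C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input type: one piece of text and the position of its top-left corner.
public record Fragment(string Text, double X, double Y);

public static class TextGrid
{
    // Walks the values in ascending order and starts a new cluster whenever a value
    // lies more than 'tolerance' beyond the first value of the current cluster.
    public static List<double> Cluster(IEnumerable<double> values, double tolerance)
    {
        var anchors = new List<double>();
        foreach (double v in values.OrderBy(v => v))
        {
            if (anchors.Count == 0 || v - anchors[anchors.Count - 1] > tolerance)
                anchors.Add(v);
        }
        return anchors;
    }

    // Index of the cluster anchor closest to 'value'.
    private static int Nearest(List<double> anchors, double value) =>
        Enumerable.Range(0, anchors.Count)
                  .OrderBy(i => Math.Abs(anchors[i] - value))
                  .First();

    // Builds a rows x columns grid of cell text from positioned fragments.
    public static string[,] Build(IReadOnlyList<Fragment> fragments, double tolerance = 3.0)
    {
        List<double> rowAnchors = Cluster(fragments.Select(f => f.Y), tolerance);
        List<double> colAnchors = Cluster(fragments.Select(f => f.X), tolerance);
        var grid = new string[rowAnchors.Count, colAnchors.Count];
        foreach (Fragment f in fragments)
        {
            int r = Nearest(rowAnchors, f.Y);
            int c = Nearest(colAnchors, f.X);
            grid[r, c] = string.IsNullOrEmpty(grid[r, c]) ? f.Text : grid[r, c] + " " + f.Text;
        }
        return grid;
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample data with Y growing downward. This is an assumption of the sketch;
        // PDF itself places the origin at the bottom-left of the page.
        var fragments = new List<Fragment>
        {
            new("Name", 50, 100), new("Qty", 200, 100),
            new("Apple", 50, 120), new("3", 201, 121),
        };
        string[,] grid = TextGrid.Build(fragments);
        for (int r = 0; r < grid.GetLength(0); r++)
            Console.WriteLine($"{grid[r, 0]} | {grid[r, 1]}");
    }
}
```

Clustering on the left edge only works for left-aligned columns. Right-aligned numeric columns would need the right edge or the center instead, and the tolerance has to be tuned to the font size of the document.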

Using Tabula to Extract Tables

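For this blog post, we will use the open-source Tabula library for extracting table data from PDFs in C#. Tabula itself is a popular Java library (tabula-java) that works well for basic table extraction tasks; the Tabula package on NuGet is a .NET port of it.

The Java version depends on Apache PDFBox, but the .NET port is built on PdfPig instead. PdfPig provides the low-level PDF parsing, and NuGet installs it as a dependency of Tabula.

// Add the NuGet reference (PdfPig is pulled in as a dependency)
PM> Install-Package Tabula

With the library referenced, we can now write code to extract table data from a sample PDF file. The snippet follows the usage shown in the port's README at the time of writing, so class names may differ between versions:

// Requires: using System; using System.Linq; using Tabula; using Tabula.Extractors; using UglyToad.PdfPig;
// Load PDF document (ClipPaths is the option the port's README recommends)
using (PdfDocument pdf = PdfDocument.Open("sample.pdf", new ParsingOptions { ClipPaths = true }))
{
    // Wrap the first page for Tabula (page numbers start at 1)
    PageArea page = new ObjectExtractor(pdf).Extract(1);
    // Extract all tables drawn with ruling lines
    IExtractionAlgorithm extractor = new SpreadsheetExtractionAlgorithm();
    // Print extracted table data, one tab-separated line per row
    foreach (Table table in extractor.Extract(page))
    {
        foreach (var row in table.Rows)
        {
            Console.WriteLine(string.Join("\t", row.Select(cell => cell.GetText())));
        }
    }
}

The ObjectExtractor class wraps a PdfPig page for Tabula, and SpreadsheetExtractionAlgorithm analyzes the ruling lines on the page to recognize table structures. It returns the extracted tables, which contain the cell text values organized into rows and columns. For tables laid out without ruling lines, the port also provides BasicExtractionAlgorithm, which infers columns from text alignment.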

We can iterate through the extracted tables and print out or further process the tabular data. This provides a simple way to parse tables from PDFs programmatically in .NET.
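To get from extracted rows to the .NET DataTable mentioned in the introduction, something like the following could be used. It is a minimal sketch that assumes the cell text has already been copied into lists of strings and that the first row holds the column headers; TableConversion.ToDataTable is a hypothetical helper, not part of Tabula.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class TableConversion
{
    // Converts rows of cell text into a DataTable, using the first row as column names.
    public static DataTable ToDataTable(IReadOnlyList<IReadOnlyList<string>> rows)
    {
        var result = new DataTable();
        if (rows.Count == 0) return result;

        // Header row: generate a name for blank headers and keep every name unique.
        for (int c = 0; c < rows[0].Count; c++)
        {
            string name = rows[0][c]?.Trim();
            if (string.IsNullOrEmpty(name)) name = $"Column{c + 1}";
            while (result.Columns.Contains(name)) name += "_";
            result.Columns.Add(name, typeof(string));
        }

        // Data rows: pad short rows with empty strings and ignore surplus cells.
        for (int r = 1; r < rows.Count; r++)
        {
            DataRow dataRow = result.NewRow();
            for (int c = 0; c < result.Columns.Count; c++)
                dataRow[c] = c < rows[r].Count ? (rows[r][c] ?? "") : "";
            result.Rows.Add(dataRow);
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for text taken from an extracted table.
        var rows = new List<IReadOnlyList<string>>
        {
            new[] { "Name", "Qty" },
            new[] { "Apple", "3" },
        };
        DataTable table = TableConversion.ToDataTable(rows);
        foreach (DataRow row in table.Rows)
            Console.WriteLine($"{row["Name"]}: {row["Qty"]}");
    }
}
```

Every column is typed as string here. Converting numeric or date columns is left to the caller, because extracted PDF text often contains stray whitespace or locale-specific formatting that needs cleaning first.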

Handling Complex PDF Tables

While Tabula is simple to use and sufficient for many cases, it may struggle with more complex PDF table layouts containing merged cells, spanned rows/columns, nested headers, etc. To handle these, we need lower-level access to the page content than a ready-made table extractor provides.

One option is to use PDFClown, an open-source .NET library for advanced PDF parsing and rendering. It offers low-level access to the PDF content structure and exposes it as an object model.

We can write custom logic on top of PDFClown to recognize complex tabular patterns and reconstruct the table structure from the content drawing commands.

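Here is an outline of the approach. It is pseudocode rather than working code: the type and method names used (PdfText, PdfLine, BuildTable, and so on) are placeholders for illustration, not PDFClown's actual API.

// Parse the PDF and get the first page's content
PdfDocument pdf = PdfReader.Open("sample.pdf");
PdfPage page = pdf.Pages[0];
var texts = new List<PdfText>();
var lines = new List<PdfLine>();
// Iterate over content objects, collecting the pieces a table is made of
foreach (var obj in page.Content.Elements)
{
    if (obj is PdfText text)
    {
        // Record the text and its position as a candidate cell value
        texts.Add(text);
    }
    else if (obj is PdfLine line)
    {
        // Record the line and its position as a candidate cell border
        lines.Add(line);
    }
}
// After the whole page is scanned, reconstruct the table structure:
// build a grid from the borders, assign text to cells, handle merged cells
Table table = BuildTable(texts, lines);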

We traverse each content element, analyzing properties like position, size, and neighboring elements to understand the visual layout. Custom table logic then reconstructs the underlying data structure based on patterns detected in the visual rendering.

Because it works on the raw PDF content primitives, this approach can be adapted to table structures that a general-purpose extractor misses, at the cost of writing and tuning the detection logic yourself.
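The reconstruction step itself can be sketched independently of any PDF library. The example below assumes the ruling lines have already been reduced to the x positions of vertical lines and the y positions of horizontal lines, and that each piece of text comes with the center point of its bounding box. TextItem and LatticeGrid are hypothetical names, and the code assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input type: a piece of text and the center point of its bounding box.
public record TextItem(string Text, double CenterX, double CenterY);

public static class LatticeGrid
{
    // Returns the index i such that edges[i] <= value < edges[i + 1], or -1 if outside.
    private static int FindBand(IReadOnlyList<double> edges, double value)
    {
        for (int i = 0; i + 1 < edges.Count; i++)
            if (value >= edges[i] && value < edges[i + 1]) return i;
        return -1;
    }

    // verticalXs: x positions of vertical ruling lines (column borders).
    // horizontalYs: y positions of horizontal ruling lines (row borders).
    public static string[,] Build(IEnumerable<double> verticalXs,
                                  IEnumerable<double> horizontalYs,
                                  IEnumerable<TextItem> items)
    {
        List<double> xs = verticalXs.Distinct().OrderBy(x => x).ToList();
        List<double> ys = horizontalYs.Distinct().OrderBy(y => y).ToList();
        int rows = Math.Max(ys.Count - 1, 0), cols = Math.Max(xs.Count - 1, 0);
        var grid = new string[rows, cols];

        foreach (TextItem item in items)
        {
            int r = FindBand(ys, item.CenterY);
            int c = FindBand(xs, item.CenterX);
            if (r < 0 || c < 0) continue; // text outside the ruled area
            grid[r, c] = grid[r, c] is null ? item.Text : grid[r, c] + " " + item.Text;
        }
        return grid;
    }
}

public static class Program
{
    public static void Main()
    {
        // A 2 x 2 ruled table; Y grows downward here (an assumption of this sketch).
        var xs = new double[] { 40, 150, 260 };
        var ys = new double[] { 90, 110, 130 };
        var items = new List<TextItem>
        {
            new("Name", 95, 100), new("Qty", 205, 100),
            new("Apple", 95, 120), new("3", 205, 120),
        };
        string[,] grid = LatticeGrid.Build(xs, ys, items);
        Console.WriteLine($"{grid[1, 0]} | {grid[1, 1]}");
    }
}
```

This simplified version treats every ruling line as spanning the whole table, so merged cells are not detected. Handling them would mean keeping each line's start and end points and checking whether a border segment is actually present between two neighboring cells.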

Final Words

In this blog post, we covered how .NET developers can programmatically extract tabular data from PDF documents using C#. We looked at the underlying structure of tables in PDF format and some common patterns used to recognize tables visually.

We demonstrated a simple extraction approach using the Tabula library and a more advanced technique using the low-level PDF parsing capabilities of PDFClown. Being able to understand and extract structured data like tables from PDFs opens many possibilities for further processing and analysis of PDF documents.

The techniques shown can be useful for ASP.NET developers working with PDF files in their .NET applications. By leveraging libraries like Tabula and PDFClown from C#, developers can more easily extract tables and other structured data from PDFs for additional processing and integration into their apps.
