Discover our automatic data extraction service

API & Documentation

 How does the PDF2Data service work? The ideal solution.

 Example of invoice, template and generated XML result + short explanatory video.

 Documentation and a sample project are available here: PDF2Data Java API v1.5.2 and documentation

 

  Supported file types

Electronic documents

File type     Description                      Supported   Notes
pdf           Portable Document Format         Yes         Version 1.7 or earlier (including multipage). Only unsecured PDFs (not password-protected) are supported.
doc / docx    Microsoft Word                   Yes*        Only ".doc" is supported at the moment; ".docx" support is coming soon.
xls / xlsx    Microsoft Excel                  Yes*        Only the A4 page format is supported at the moment. Information beyond the page borders is moved to the next page.
ppt / pptx    Microsoft PowerPoint             Yes*
rtf           Rich Text Format                 Soon
odt           OpenDocument Text                Yes*
ods           OpenDocument Spreadsheet         Yes*
odp           OpenDocument Presentation        Yes*
sxw           OpenOffice.org 1.0 Text          Soon
sxc           OpenOffice.org 1.0 Spreadsheet   Soon
sxi           OpenOffice.org 1.0 Presentation  Soon
wpd           WordPerfect                      Soon
txt           Plain Text                       Yes*
tsv           Tab Separated Values             Soon
html          HyperText Markup Language        Soon

* Experimental.

OCR documents.

The standard accepted quality is 300 dpi in black and white, grayscale (preferred) or color.

For better results, we recommend scanning documents at 400 or 500 dpi in grayscale (preferred) or color.

Supported OCR languages: click to see the list.

File type     Description                       Supported   Notes
png           Portable Network Graphics         Yes         Black and white, gray, color
jpeg / jpg    Joint Photographic Experts Group  Yes         Gray, color
jp2 / jpc     JPEG 2000                         Yes         Gray - Part1, color - Part1
pdf           Portable Document Format          Yes         Version 1.7 or earlier (including multipage)
tiff / tif    Tagged Image File Format          Yes         Black and white: uncompressed, CCITT3, CCITT4, Packbits, ZIP, LZW
                                                            Gray: uncompressed, Packbits, JPEG, ZIP, LZW
                                                            24-bit color: uncompressed, JPEG, ZIP, LZW
                                                            1-, 4-, 8-bit palette: uncompressed, Packbits, ZIP, LZW
                                                            (including multipage TIFF)
gif           Graphics Interchange Format       Yes         Black and white: LZW-compressed
                                                            2-, 3-, 4-, 5-, 6-, 7-, 8-bit palette: LZW-compressed
djvu / djv    DjVu                              Yes         Black and white, gray, color
jb2           JBIG2                             Yes         Black and white

 

  Requesting available page credits

 URL: http://pdf2data.cloudforpeople.com/api/getCredits

Parameter Type Description
 api_key string

User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.

This parameter is required.

XML result:

The server returns:

<result>
     <userId>215</userId>
     <userOcrPages>1111</userOcrPages>
     <userElPages>2222</userElPages>
     <userComPages>3333</userComPages>
</result>

In case of error:

see the file status.xml

<status>
           <error code="1800">Internal error.</error>
</status>
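
A minimal Python sketch of a getCredits call, using only the standard library. The URL, the "api_key" parameter and the element names come from the description above; sending the key as a GET query parameter and the helper names are assumptions, since the HTTP method is not stated for this call.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API_BASE = "http://pdf2data.cloudforpeople.com/api"


def build_credits_url(api_key):
    """Build the getCredits URL (assumption: api_key goes in the query string)."""
    return API_BASE + "/getCredits?" + urllib.parse.urlencode({"api_key": api_key})


def parse_credits(xml_text):
    """Turn a <result> reply into a dict of integers; raise on a <status> reply."""
    root = ET.fromstring(xml_text)
    if root.tag == "status":
        error = root.find("error")
        if error is not None:
            raise RuntimeError("PDF2Data error %s: %s" % (error.get("code"), error.text))
        raise RuntimeError("Unexpected status reply")
    return {child.tag: int(child.text) for child in root}


def get_credits(api_key):
    """Fetch and parse the credits (this performs a real HTTP request)."""
    with urllib.request.urlopen(build_credits_url(api_key)) as response:
        return parse_credits(response.read().decode("utf-8"))
```

Keeping the parsing separate from the network call makes it possible to check the parser against the sample reply without contacting the server.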

 

 

  Requesting list of available templates from PDF2Data server

 URL: http://pdf2data.cloudforpeople.com/api/listTemplates

Parameter Type Description
 api_key string

User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.

This parameter is required.

XML result:

The server returns:

<templates>
    <template id="1" key="Name of key1">Name1</template>
    <template id="2" key="Name of key2">Name2</template>
    <template id="3" key="Name of key3">Name3</template>
</templates>

In case of error:

see the file status.xml

<status>
           <error code="1800">Internal error.</error>
</status>
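
The template list can be read with a few lines of Python. This is only a sketch: the element and attribute names are taken from the sample reply above, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_templates(xml_text):
    """Return a list of (id, key, name) tuples from a <templates> reply."""
    root = ET.fromstring(xml_text)
    if root.tag != "templates":
        raise ValueError("expected <templates>, got <%s>" % root.tag)
    return [
        (int(t.get("id")), t.get("key"), (t.text or "").strip())
        for t in root.findall("template")
    ]


# The sample reply from this section.
SAMPLE = """<templates>
    <template id="1" key="Name of key1">Name1</template>
    <template id="2" key="Name of key2">Name2</template>
    <template id="3" key="Name of key3">Name3</template>
</templates>"""

for template_id, key, name in parse_templates(SAMPLE):
    print(template_id, key, name)
```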

 

 

  Requesting the Template Schema from PDF2Data server

 URL: http://pdf2data.cloudforpeople.com/api/getTemplateSchema

Parameter Type Description
 api_key string

User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.

This parameter is required.

 template_id integer

Template ID.

This parameter is required.

XML result:

The server returns:

<result>
    .... the content of TemplateSchema
</result>

In case of error:

see the file status.xml

<status>
           <error code="1800">Internal error.</error>
</status>
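
A short sketch of how a client might prepare this request and tell a schema reply from an error reply. The two parameters and the root element names are from this section; passing them as a GET query string and the function names are assumptions.

```python
import urllib.parse
import xml.etree.ElementTree as ET

API_BASE = "http://pdf2data.cloudforpeople.com/api"


def build_schema_url(api_key, template_id):
    """Build the getTemplateSchema URL (assumption: GET query parameters)."""
    query = urllib.parse.urlencode(
        {"api_key": api_key, "template_id": int(template_id)}
    )
    return API_BASE + "/getTemplateSchema?" + query


def is_error_reply(xml_text):
    """True when the reply is a <status> document that contains an <error>."""
    root = ET.fromstring(xml_text)
    return root.tag == "status" and root.find("error") is not None
```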

 

 

  Submitting a document to the PDF2Data server for recognition

 Submitting a single document for recognition:

 URL: http://pdf2data.cloudforpeople.com/api/recognize

 Required: HTTP method "POST" and content type "multipart/form-data"

Parameter Type Description
 api_key string

User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.

This parameter is required.

 template_id integer

Template ID from the list of available templates. If you have created a matching template on our server, you can specify "-1" as template_id, and the system will automatically recognize which template to apply to the document. If no appropriate template exists for the document, the server returns the message "Invalid template_ID".

We recommend specifying template_id as "-1".

This parameter is required.

 file byte array

File.

This parameter is required.

 scanned string

Indicates whether the document is scanned.

Can be "true", "false" or "auto". If you do not specify this parameter, the system uses "auto".

We recommend specifying "scanned=auto" or omitting this parameter.

This parameter is NOT required.

 language string

The language to use for OCR processing. You can specify one primary and one secondary OCR language, separated by a comma. See the list of available languages here.

If you do not specify the language, the primary and secondary languages set in your Control Panel on the PDF2Data web interface are used.

This parameter is NOT required.

 mimeType string

MIME type.

This parameter is NOT required.

 batch Boolean

Used only for batch submission. Set to "true" if the file is composed of many invoices. This function is experimental.

This parameter is NOT required.

 split integer

Used only for batch submission. Specifies the splitting step:

0 - split the file by separator sheet (download)

1 - split the file page by page

"n" - split the file every "n" pages

This parameter is NOT required.

XML result:

 After the document is submitted, the server returns a "document ID", which must be stored temporarily in order to request the recognition result later.

 NB: the recognition result may be ready "instantly" (3 - 10 sec.) or later (5 min - 24 h). The time depends on the type of document (Electronic/OCR) and on whether it is human-controlled or not.

The server returns:

In case of a single document:

<result>

      <documentId>1</documentId>

</result>

 

If the document is "in progress", or in case of error:

see the file status.xml

<status>
          <info code="1100">In progress.</info>

          <error code="1800">Internal error.</error>
</status>
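
The submission described above could be sketched in Python as follows, using only the standard library. The URL, the POST method, the multipart/form-data content type, the field names "api_key", "template_id" and "file", and the <documentId> element come from this section. The helper names, the random boundary, and the "application/octet-stream" type of the file part are assumptions.

```python
import urllib.request
import uuid
import xml.etree.ElementTree as ET

RECOGNIZE_URL = "http://pdf2data.cloudforpeople.com/api/recognize"


def encode_multipart(fields, file_name, file_bytes):
    """Encode text fields plus one part named "file" as multipart/form-data.

    Returns (body, content_type_header).
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields.items():
        lines += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"' % name).encode("utf-8"),
            b"",
            str(value).encode("utf-8"),
        ]
    lines += [
        delimiter,
        ('Content-Disposition: form-data; name="file"; filename="%s"'
         % file_name).encode("utf-8"),
        b"Content-Type: application/octet-stream",  # assumption
        b"",
        file_bytes,
        delimiter + b"--",
        b"",
    ]
    return b"\r\n".join(lines), "multipart/form-data; boundary=" + boundary


def parse_document_id(xml_text):
    """Extract the integer document ID from a <result> reply."""
    root = ET.fromstring(xml_text)
    if root.tag != "result":
        raise RuntimeError("no document ID in <%s> reply" % root.tag)
    return int(root.findtext("documentId"))


def submit(api_key, file_name, file_bytes, template_id=-1):
    """Submit one document (this performs a real HTTP request)."""
    body, content_type = encode_multipart(
        {"api_key": api_key, "template_id": template_id}, file_name, file_bytes
    )
    request = urllib.request.Request(
        RECOGNIZE_URL, data=body,
        headers={"Content-Type": content_type}, method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return parse_document_id(response.read().decode("utf-8"))
```

The default template_id of -1 follows the recommendation above; the optional parameters (scanned, language, mimeType, batch, split) can be passed as extra entries in the fields dictionary.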

 

 

 Getting the status of a document from the PDF2Data server

 URL: http://pdf2data.cloudforpeople.com/api/getStatus

Parameter Type Description
 api_key string

User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.

This parameter is required.

 document_id integer

Document ID.

This parameter is required.

XML result:

The server returns:

see the file status.xml

 

<status>
          <info code="1100">In progress.</info>

          <error code="1800">Internal error.</error>
</status>
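
Because a result may take anywhere from seconds to hours, a client will usually poll getStatus. The sketch below parses a status reply and polls through a caller-supplied function, so the HTTP details stay outside it. Codes 1100 and 1800 are the ones shown above; treating any reply without info code 1100 as "no longer in progress", and the function names, polling interval and attempt limit, are assumptions.

```python
import time
import xml.etree.ElementTree as ET

IN_PROGRESS = 1100  # info code shown in the sample status reply


def parse_status(xml_text):
    """Return a list of (kind, code, message) tuples from a <status> reply."""
    root = ET.fromstring(xml_text)
    if root.tag != "status":
        raise ValueError("expected <status>, got <%s>" % root.tag)
    return [
        (item.tag, int(item.get("code")), (item.text or "").strip())
        for item in root
    ]


def wait_until_done(fetch_status, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Poll fetch_status() until the reply no longer reports "in progress".

    fetch_status is any callable that returns the XML text of a getStatus reply.
    """
    for _attempt in range(max_attempts):
        entries = parse_status(fetch_status())
        errors = [entry for entry in entries if entry[0] == "error"]
        if errors:
            raise RuntimeError("PDF2Data error %d: %s" % errors[0][1:])
        in_progress = any(
            kind == "info" and code == IN_PROGRESS for kind, code, _msg in entries
        )
        if not in_progress:
            return entries
        sleep(interval)
    raise TimeoutError("document is still in progress")
```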

  

 Getting results from PDF2Data server (for both single documents and batch)

 URL: http://pdf2data.cloudforpeople.com/api/getResult

Parameter Type Description
 api_key string

User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.

This parameter is required.

 document_id integer

Document ID.

This parameter is required.

 export_format string

Format of the returned result: standard XML (if not specified), a CSV file, or a personalized format. To activate a personalized format, please contact us; we can support a large variety of personalized export formats.

This parameter is NOT required.

XML result:

The server returns (example only):

If the result is ready "instantly":

see the file result.xml

 

<result>
    <userName>usermail(at)gmail.com</userName>
    <documentName>DocumentName.pdf</documentName>

    <documentID>DocumentID</documentID>
    <templateName>TemplateName</templateName>

    <documentType>Electronic</documentType>
    <documentPages>1</documentPages>
    <documentPaidPages>1</documentPaidPages>
    <userDefinedDocumentType>Invoice</userDefinedDocumentType>

    <creationTime>12312341234</creationTime>

    <associations>
        <pair name="pair1">
            <label>label1</label>
            <value>value1</value>
        </pair>
        
        <table name="table1">
            <columns>
                <column name="column1">Column1</column>
                <column name="column2">Column2</column>
                <column name="column3">Column3</column>
            </columns>
            <row>
                <value>v11</value>
                <value>v12</value>
                <value>v13</value>
            </row>
            <row>
                <value>v21</value>
                <value>v22</value>
                <value>v23</value>
            </row>
            <row>
                <value>v31</value>
                <value>v32</value>
                <value>v33</value>
            </row>
        </table>
                    
        <textbox name="textbox1">textbox1</textbox>

    </associations>
</result>
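
As an illustration of how the standard XML result might be consumed, the following sketch collects the pairs, tables and textboxes from the <associations> element. The element and attribute names are taken from the example above; the function name and the shape of the returned dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_result(xml_text):
    """Collect pairs, tables and textboxes from a <result> reply."""
    root = ET.fromstring(xml_text)
    pairs, tables, textboxes = {}, {}, {}
    associations = root.find("associations")
    if associations is not None:
        for pair in associations.findall("pair"):
            pairs[pair.get("name")] = (pair.findtext("label"), pair.findtext("value"))
        for table in associations.findall("table"):
            columns = [c.get("name") for c in table.findall("columns/column")]
            # Each row becomes a dict keyed by the column "name" attribute.
            tables[table.get("name")] = [
                dict(zip(columns, [v.text for v in row.findall("value")]))
                for row in table.findall("row")
            ]
        for box in associations.findall("textbox"):
            textboxes[box.get("name")] = box.text
    return {
        "documentName": root.findtext("documentName"),
        "pairs": pairs,
        "tables": tables,
        "textboxes": textboxes,
    }


# A shortened version of the example result above.
SAMPLE = """<result>
    <documentName>DocumentName.pdf</documentName>
    <associations>
        <pair name="pair1"><label>label1</label><value>value1</value></pair>
        <table name="table1">
            <columns>
                <column name="column1">Column1</column>
                <column name="column2">Column2</column>
            </columns>
            <row><value>v11</value><value>v12</value></row>
            <row><value>v21</value><value>v22</value></row>
        </table>
        <textbox name="textbox1">textbox1</textbox>
    </associations>
</result>"""

parsed = parse_result(SAMPLE)
print(parsed["pairs"]["pair1"])
print(parsed["tables"]["table1"][1]["column2"])
```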

 

If the document is "in progress", or in case of error:

see the file status.xml

 

<status>
          <info code="1100">In progress.</info>

          <error code="1800">Internal error.</error>
</status>

 

 Error and Info codes

 The list of Error and Info codes is available here.

Code Description
Code

Examples: the document size is > 20 MB; the document format is not supported; the document is "secured"; or other causes.

Code Server error.
XML result:

The server returns:

see the file status.xml

 

<status>
          <info code="1100">In progress.</info>

          <error code="1800">Internal error.</error>
</status>

 

 

Goodies:

For XML viewing/editing: Notepad++.

For timestamp conversion: EpochConverter.

We recommend the ultimate IDE for developers — Eclipse.
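
The timestamp conversion mentioned above can also be done in code. This sketch converts a <creationTime> value to a UTC date. The unit of that field is not stated in this documentation, so the unit is a parameter and its default ("ms") is an assumption; the function name is hypothetical.

```python
from datetime import datetime, timezone


def creation_time_to_utc(value, unit="ms"):
    """Convert an epoch timestamp to an aware UTC datetime.

    unit is "ms" (milliseconds) or "s" (seconds); which one the server
    uses is an assumption to verify against a real result.
    """
    if unit == "ms":
        seconds = int(value) / 1000
    elif unit == "s":
        seconds = int(value)
    else:
        raise ValueError('unit must be "ms" or "s"')
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(creation_time_to_utc("12312341234").isoformat())
```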