16.0.1
📅 2021-02-26
New features
Export OCR results to ALTO XML format
Version 16.0.1 introduces a new export format: ALTO XML (Analyzed Layout and Text Object).
This open XML schema aims to provide a standardized way of describing OCR and layout information for digitized material. See https://www.loc.gov/standards/alto for more details.
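To give an idea of what ALTO describes, the sketch below formats a single recognized word as an ALTO `String` element, which carries the text in `CONTENT` and its bounding box in `HPOS`, `VPOS`, `WIDTH` and `HEIGHT`. The helpers `escapeXmlAttribute` and `makeAltoString` are hypothetical and are not part of the iDRS SDK; the exact set of elements and attributes written by the SDK, and the ALTO schema version it targets, are not stated in these notes.

```cpp
#include <cassert>
#include <string>

// Escape the characters that cannot appear verbatim in a double-quoted
// XML attribute value.
std::string escapeXmlAttribute(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:  out += c;        break;
        }
    }
    return out;
}

// Hypothetical helper (NOT an iDRS SDK function): builds one ALTO <String>
// element for a word and its bounding box. Coordinates are assumed to be
// expressed in the measurement unit declared elsewhere in the ALTO file.
std::string makeAltoString(const std::string& word,
                           int hpos, int vpos, int width, int height) {
    assert(width >= 0 && height >= 0);
    return "<String CONTENT=\"" + escapeXmlAttribute(word) + "\""
         + " HPOS=\"" + std::to_string(hpos) + "\""
         + " VPOS=\"" + std::to_string(vpos) + "\""
         + " WIDTH=\"" + std::to_string(width) + "\""
         + " HEIGHT=\"" + std::to_string(height) + "\"/>";
}
```

In a complete ALTO file, such elements are nested inside `TextLine`, `TextBlock` and `Page` elements under the `Layout` section.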
This format is available as an additional export type: IDRS_EXPORT_TYPE::IDRS_EXPORT_FORMAT_XML_ALTO (C++) or ExportType.XmlAlto (.NET). It can therefore be passed as an argument to the CExport class constructor.
In addition, a new member has been added to the CExport class to allow appending the export of a new page to an existing ALTO XML document. You can set it via the method CExport::SetAppendMode() (C++) or the property CExport.AppendMode (.NET).
Improvements
PDF loading with graphical zone coordinates
The SDK can now retrieve the location of graphical zones and segments when loading a PDF’s content.
Graphical zones are loaded whenever text loading is requested. To reflect this, the methods CImageLoadOptionsPdf::Get/SetLoadTextContent (C++) and the property CImageLoadOptionsPdf.LoadTextContent (.NET) have been renamed to CImageLoadOptionsPdf::Get/SetLoadPageContent and CImageLoadOptionsPdf.LoadPageContent, respectively.
PDF loading resolution
You can now select the resolution at which a PDF input page is rasterized. This can be useful to reduce output size (lower resolution) or to maximize quality (higher resolution).
To do so, use the method CImageLoadOptionsPdf::SetLoadingResolution() (C++) or the property CImageLoadOptionsPdf.LoadingResolution (.NET).
The default value is 300 dpi, as used in the previous version of the SDK; it ensures the best compromise between size and quality.
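As a rough guide to the size/quality trade-off, the sketch below estimates the pixel dimensions of a rasterized page from its size in PDF points (1 point = 1/72 inch in the default PDF user space) and the chosen resolution. The type `RasterSize` and the function `rasterSizeForPage` are hypothetical and not part of the iDRS SDK; the SDK's exact rounding behavior is an assumption here.

```cpp
#include <cassert>
#include <cmath>

// Pixel dimensions of a rasterized page.
struct RasterSize {
    long width;
    long height;
};

// Hypothetical helper (NOT an iDRS SDK function): converts a page size in
// PDF points (1/72 inch) to pixels at the given resolution.
// Rounding to the nearest pixel is an assumption for illustration.
RasterSize rasterSizeForPage(double widthPt, double heightPt, int dpi) {
    assert(dpi > 0);
    return { std::lround(widthPt * dpi / 72.0),
             std::lround(heightPt * dpi / 72.0) };
}
```

For a US Letter page (612 x 792 points), this gives 2550 x 3300 pixels at 300 dpi and 1275 x 1650 pixels at 150 dpi, so halving the resolution divides the pixel count by four.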
Fixed bugs
| ID | Description |
|---|---|
| IDRSRD-5921 | The iDRS should retrieve graphical zones coordinates when loading PDF’s content |
| IDRSRD-5911 | The iDRS fails to export OCR results to XML FMT for specific documents |
| IDRSRD-5902 | The iDRS should allow an integrator to choose PDF loading resolution |
| IDRSRD-5899 | The iDRS does not properly detect font size for Korean language |
| IDRSRD-5895 | The iDRS does not serialize CPageParagraphStyle.FontStyle member properly |
| IDRSRD-5888 | The iDRS loads Pdf text layer with incorrect font sizes |
| IDRSRD-5887 | The iDRS is not rasterizing Pdfs having forms with fillable fields |
| IDRSRD-5874 | The iDRS should propose exporting OCR results to ALTO standard XML format |
| IDRSRD-5868 | The iDRS charset limitation feature is broken for Korean language |
| IDRSRD-5861 | The iDRS fails to create PDF document when recognizing a specific Korean image |
| IDRSRD-5749 | The iDRS finds an incorrect orientation for specific Greek and Hebrew images |