16.0.5

📅 2021-07-22

New features

N/A

Improvements

High Quality OCR for Latin alphabet

High Quality OCR support is extended to most languages that use the Latin alphabet.
This change is transparent to you and improves OCR accuracy, especially on challenging documents such as newspapers, magazines, poor-quality scans, or photos.

This applies to the following languages:

Albanian, Azeri Latin, Basque, Breton, Bosnian Latin, Catalan, Cebuano, Corsican, Croatian, Czech, Danish, Esperanto, Estonian, Faroese, Finnish, Frisian, Galician, Greenlandic, Haitian Creole, Hungarian, Icelandic, Irish Gaelic, Kurdish, Latvian, Lithuanian, Maltese, Norwegian, Polish, Rhaeto-Romance, Romanian, Sardinian, Slovak, Scottish Gaelic, Slovenian, Swedish, Turkish, and Welsh.

Mandatory files
The list of files to redistribute with your application has been updated for the languages listed above. See "Files required for your application" for more details.

Improved PDF image loading

The quality of the image generated by the iDRS when rasterizing a PDF page is improved, resulting in a better visual appearance and improved OCR.

However, this improvement requires extra processing to detect content that was originally black and white.
For that reason, a new enum, CImageLoadOptionsPdf.eBlackAndWhiteDetectionMode, is introduced.

Possible values are:

  • BLACK_AND_WHITE_DETECTION_DISABLED (C++) or eBlackAndWhiteDetectionMode.Disabled (.NET)
    Black and white content is not detected; pages are loaded as greyscale only. This is the fastest mode.

  • BLACK_AND_WHITE_DETECTION_FAST (C++) or eBlackAndWhiteDetectionMode.Fast (.NET)
    The PDF page is inspected and loaded as black and white only if it contains black and white images. In all other cases, the page is loaded as greyscale.

  • BLACK_AND_WHITE_DETECTION_ACCURATE (C++) or eBlackAndWhiteDetectionMode.Accurate (.NET)
    The PDF page raster is analyzed in detail to detect black and white content. Because this may involve two rasterizations (with and without smoothing), this mode is the slowest but the most accurate at detecting black and white pages. This is the default mode.

The black and white detection mode can be read or changed with the methods CImageLoadOptionsPdf::GetBlackAndWhiteDetectionMode() and CImageLoadOptionsPdf::SetBlackAndWhiteDetectionMode() (C++) or the property CImageLoadOptionsPdf.BlackAndWhiteDetectionMode (.NET).
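
The trade-off between the three modes can be sketched as follows. This is a minimal, self-contained illustration and does not use the real iDRS headers: the types SketchBlackAndWhiteDetectionMode and SketchImageLoadOptionsPdf and the helper ChooseDetectionMode are hypothetical stand-ins. Only the enumerator names, the Get/Set method names, and the default value are taken from the notes above; the actual enum type, namespace, and object creation in the SDK may differ.

```cpp
#include <cassert>

// Stand-in for the documented enum. The enumerator names come from these
// release notes; the type name and numeric values are assumptions.
enum SketchBlackAndWhiteDetectionMode {
    BLACK_AND_WHITE_DETECTION_DISABLED,  // no detection, greyscale only, fastest
    BLACK_AND_WHITE_DETECTION_FAST,      // black and white only if the page contains such images
    BLACK_AND_WHITE_DETECTION_ACCURATE   // detailed raster analysis, slowest, documented default
};

// Stand-in for CImageLoadOptionsPdf, reduced to the single setting discussed
// here. This is not the SDK class.
class SketchImageLoadOptionsPdf {
public:
    SketchBlackAndWhiteDetectionMode GetBlackAndWhiteDetectionMode() const { return mode_; }
    void SetBlackAndWhiteDetectionMode(SketchBlackAndWhiteDetectionMode mode) { mode_ = mode; }

private:
    // Mirrors the documented default mode.
    SketchBlackAndWhiteDetectionMode mode_ = BLACK_AND_WHITE_DETECTION_ACCURATE;
};

// Hypothetical helper: picks a mode from two application-level preferences.
inline SketchBlackAndWhiteDetectionMode ChooseDetectionMode(bool wantBlackAndWhite,
                                                            bool speedCritical) {
    if (!wantBlackAndWhite) {
        return BLACK_AND_WHITE_DETECTION_DISABLED;
    }
    return speedCritical ? BLACK_AND_WHITE_DETECTION_FAST
                         : BLACK_AND_WHITE_DETECTION_ACCURATE;
}

// Small demonstration of reading the default and then overriding it.
inline void DemoDetectionModes() {
    SketchImageLoadOptionsPdf options;
    assert(options.GetBlackAndWhiteDetectionMode() == BLACK_AND_WHITE_DETECTION_ACCURATE);

    options.SetBlackAndWhiteDetectionMode(ChooseDetectionMode(true, true));
    assert(options.GetBlackAndWhiteDetectionMode() == BLACK_AND_WHITE_DETECTION_FAST);
}
```

An application that does not change the setting keeps the accurate mode. One that loads many PDF pages and can accept greyscale output for every page might select the disabled mode to avoid the extra processing.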

Deprecated/removed features

Output formats WordML and XPS

The output formats WordML and XPS are removed from the iDRS API, so such documents can no longer be created.

The XML format of Microsoft Office Word 2003, or WordML, was superseded in 2007 by the Office Open XML formats (DOCX, XLSX, PPTX).

Microsoft XML Paper Specification, or XPS, is also deprecated because it has low business value compared to its immediate competitor, PDF.

Added/removed resources

N/A

Fixed bugs

ID           Description
IDRSRD-5666  The iDRS PDF loading may erase some parts of the text on the rasterized image
IDRSRD-5747  The iDRS takes a long time to analyze a specific image
IDRSRD-5925  The iDRS can create invalid PDF files when integrators specify custom fonts with PostScript names containing spaces
IDRSRD-5927  The iDRS can recognize diacritics without base characters, leading to PDF creation failure
IDRSRD-5955  The High Quality OCR engine does not find all characters on a specific image
IDRSRD-5958  The iDRS fails to create the output PDF when the OCR engine recognizes Arial Unicode symbols
IDRSRD-5970  The iDRS should allow creating an image with dimensions larger than the OCR limitations
IDRSRD-5971  The page analysis takes too much time when processing a specific image
IDRSRD-5977  The iDRS is not able to load a specific PDF
IDRSRD-5980  The iDRS license installer does not check for the correct Visual Studio redistributable
IDRSRD-5981  DOCX files created with Editable display do not indicate the expected document language when no text is selected
IDRSRD-5983  Implementations of IFontProviderCallback provided by integrators via the .NET API are not called by the iDRS
IDRSRD-5984  The iDRS does not set the BaseLine property in CPageTextLine when loading content from a PDF file
IDRSRD-5985  The iDRS may leak memory when the idrsbarcodeext engine encounters a timeout
IDRSRD-5986  The iDRS cannot load a specific PNG image
IDRSRD-5987  The iDRS does not include information about the PDF extension in the output PDF files
IDRSRD-5989  The iDRS generates a non-compliant PDF/A-1b document
IDRSRD-5991  When the iDRS updates an existing PDF with several signatures, all signatures have the same title, which is incorrect
IDRSRD-5992  The iDRS does not properly load the text layer of a specific PDF document
IDRSRD-5993  The iDRS can request font data with incorrect bold and italic properties when generating a PDF document
IDRSRD-6004  The PDF loading with page content throws an exception when a PDF object has coordinates outside the page bounds
IDRSRD-6007  The PDF loading with page content throws an exception when a text element is outside the page bounds
IDRSRD-6009  The iDRS sets the DropCapFont property for a paragraph when loading page content
IDRSRD-6017  The iDRS cannot use the CPageResultsParser on a CPage without a source image

Known issues

ID           Description
IDRSRD-6019  When the iDRS applies several signatures to a PDF, in some cases only the last one is valid
IDRSRD-5968  The iDRS should apply all supported features when creating PDF output with the IStreamFactory interface