16.2.5

📅 2025-03-07

`New Features`

New Page Segmentation

This release comes with a brand new page segmentation algorithm, which brings several improvements compared to the previous one, and increases the quality of text and layout detection in general.

The change is fully transparent. It is not necessary to update your integration to benefit from it.

Detection of documents with complex layouts

The new page segmentation is expected to perform generally better than the previous one on documents with complex, unstructured or semi-structured layouts.

This typically covers the following types of documents:

Invoices
Camera images
Flyers
Magazines

Detection of table content

The detection of table structure, including content organized as a table, is improved with this new algorithm. Any type of document can benefit from this improvement, although it will be most visible on invoices, which usually follow an implicit grid-like layout.

Detection of inverted text

Detection of inverted text (i.e. light text on dark background) is also improved. This type of pattern was a known weakness of the previous segmentation algorithm, and is now properly supported by the new one.

Increased maximum page size supported by OCR engine

The new page segmentation implementation made it possible to raise the maximum page size supported by the OCR engine from 75 to 559 million pixels. This new limit allows, for instance, A0 pages scanned at 600 dpi to be recognized.

From an integration point of view, the class CImageLimits has been updated to reflect this new threshold.
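The A0 example can be checked with a little arithmetic. The sketch below is not iDRS code: the function names are hypothetical, the limits are taken as exactly 75,000,000 and 559,000,000 pixels (the release note only says "75" and "559 million", so the exact values are an assumption), and A0 is taken as 841 x 1189 mm.

```python
import math

MM_PER_INCH = 25.4
# Assumed exact values; the release note only states "75" and "559 million pixels".
OLD_LIMIT_PIXELS = 75_000_000
NEW_LIMIT_PIXELS = 559_000_000


def scanned_pixel_count(width_mm: float, height_mm: float, dpi: int) -> int:
    """Return the pixel count of a page of the given size scanned at `dpi`."""
    # Round each dimension up, since a scanner cannot produce a partial pixel row.
    width_px = math.ceil(width_mm / MM_PER_INCH * dpi)
    height_px = math.ceil(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px


def fits_limit(pixel_count: int, limit: int = NEW_LIMIT_PIXELS) -> bool:
    """Return True when the page does not exceed the given pixel limit."""
    return pixel_count <= limit


# ISO A0 is 841 mm x 1189 mm.
a0_600dpi = scanned_pixel_count(841, 1189, 600)
print(a0_600dpi, fits_limit(a0_600dpi), fits_limit(a0_600dpi, OLD_LIMIT_PIXELS))
```

An A0 page at 600 dpi comes to roughly 558 million pixels, which is above the old limit and just under the new one.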

Detection of overlapping zones

Lastly, unlike the previous algorithm, the new segmentation algorithm is able to detect overlapping content. This case is rather frequent, for instance on magazine pages with text printed on top of a picture, or with pictures inside table cells.

The output module of iDRS has been adapted to properly support overlapping elements in all affected output formats.

In a nutshell, this additional capability brings a better layout decomposition and will especially improve the visual quality of word processor outputs (docx, rtf, …).

New XHTML output

The previous iDRS HTML output has been fully replaced by a new XHTML output. The brand new XHTML writer used for this format provides the following improvements over the previous one:

Compliance with XHTML standard
Support of overlapping zones
More precise positioning of elements
Optimized CSS management
Better handling of UTF-8 characters

`Improvements`

Performance of High Quality OCR for Japanese language

The Japanese High Quality OCR network has been updated to consume less memory and to run faster. We measured time savings of up to 40% on internal test sets.

Support of Hanja characters in Korean

This version includes an updated Korean engine that supports Hanja characters. Hanja are Chinese characters that were used as the writing script for the Korean language before the widespread adoption of Hangul. They are still used in modern Korean, for example, to represent names.

Improved DOCX Editable and Exact outputs

This iDRS release includes a significant number of fixes and fine-tuning adjustments for DOCX output, especially for the Editable and Exact layouts.

As a result, thanks to this fine-tuning effort and the new segmentation update, the quality of the DOCX conversion performed by iDRS has been significantly improved.

`Additional Notes`

Charset Limitation

The behavior of the charset limitation feature has changed:

The OCR engine now interprets an entire line using only the characters included in the charset, instead of replacing excluded characters with a reject character.

As a result:

Lines may or may not contain reject characters.
A line may be dropped entirely if the OCR engine cannot interpret a significant portion of it because the characters it would need are excluded from the charset.

We recommend that you use the charset limitation feature only for minor adjustments, i.e. forbidding a small number of characters.

Example: if you select a charset that includes only digits (a very restrictive charset), lines containing digits may be missed completely if they also contain a significant number of non-digit characters.
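One possible way to work with this behavior, sketched here as an assumption rather than an iDRS recommendation, is to run recognition with the full charset and filter the recognized text afterwards in your own code. The helper below is hypothetical, is not part of the iDRS API, and only operates on plain strings that you have already obtained from the OCR results.

```python
import re


def extract_digit_runs(line: str, min_length: int = 1) -> list[str]:
    """Return the runs of ASCII digits found in a recognized text line."""
    # [0-9] is used instead of \d so that only ASCII digits are matched.
    runs = re.findall(r"[0-9]+", line)
    return [run for run in runs if len(run) >= min_length]


recognized_line = "Invoice 2025-0307 total 1499 EUR"
print(extract_digit_runs(recognized_line))
```

Because the line is recognized with the complete charset, the surrounding non-digit characters no longer put the whole line at risk of being dropped.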

`Bug Fixes`

Internal ID	Description	Service desk IDs
IDRSRD-9741	word spacing of DOCX output can be improved for Thai justified text
IDRSRD-9727	Docx conversion results on customer test set should be improved
IDRSRD-9726	Text lines sometimes are incorrectly merged as paragraphs in DOCX output
IDRSRD-9723	Zonal OCR of a specific image may return empty results depending on the zone size
IDRSRD-9711	Crash when loading a corrupted jpg image
IDRSRD-9705	Page analysis allowed languages are not taken into account when language detection is turned off, resulting in reduced orientation detection accuracy
IDRSRD-9698	The iDRS creates DOCX outputs with incorrect URI links
IDRSRD-9695	The iDRS generates overlapping text results on a specific Japanese image
IDRSRD-9688	The iDRS sometimes sets incorrect textbox right indent for Docx Editable output
IDRSRD-9683	The iDRS XHTML NoLayout output can be improved
IDRSRD-9681	OCR accuracy on 100dpi images is degraded with new page segmentation
IDRSRD-9679	OCR engine freeze on an Arabic image
IDRSRD-9677	The iDRS creates incorrect DOCX output when containing Top to Bottom text
IDRSRD-9656	The iDRS new segmentation finds text columns with zones going upwards
IDRSRD-9653	The iDRS SDK crashes when running OCR on a specific document
IDRSRD-9647	Crash when running OCR intel arch on arm macOS
IDRSRD-9640	The new segmentation crashes when running OCR on chinese followed by japanese
IDRSRD-9634	OCR engine returns some 0-sized elements when recognizing Arabic and Farsi documents
IDRSRD-9614	CPageProcessing must be optimized
IDRSRD-9604	iDRS16 .NET generates new object ids for provided array elements
IDRSRD-9593	Update the iDRS to have new segmentation and overlapping zones activated by default
IDRSRD-9587	The iDRS uses incorrect indentation when creating DOCX output with right-to-left text
IDRSRD-9585	The iDRS should expose an option to downscale input if needed, when outputting Word document
IDRSRD-9580	Text display of DOCX created with iDRS can be improved
IDRSRD-9569	The new segmentation doesn’t recognize underscore symbol	ISD-35641
IDRSRD-9551	Japanese HQOCR misses several characters with inverted colors on a specific image
IDRSRD-9536	The new segmentation considers isolated dash characters as graphics
IDRSRD-9535	Detection of table header row is incorrect on a specific image
IDRSRD-9526	The new segmentation often misses comma signs	ISD-35429
IDRSRD-9521	The iDRS cannot load large pdf document	ISD-34080
IDRSRD-9495	Header row of clear table is not properly recognized
IDRSRD-9487	The new page segmentation crashes when processing a specific image	ISD-35253
IDRSRD-9486	String class should support conversion from/to utf16-encoded strings using char16_t and wchar_t data types
IDRSRD-9469	Detection of graphic lines is inaccurate
IDRSRD-9458	Memory consumption of HQOCR Japanese is huge on a specific image
IDRSRD-9398	Re-introduce support of Hanja in Korean OCR	ISD-36142
IDRSRD-9358	Memory consumption of iDRS PDF loading can be improved
IDRSRD-9336	Processing time for Japanese language is degraded with iDRS 16, compared to iDRS 15
IDRSRD-9323	iDRS detects justified Korean text as several paragraphs containing a single character	ISD-34474
IDRSRD-9301	The new page segmentation makes substitution errors between 'O' and '0'	ISD-34256
IDRSRD-9280	The iDRS doesn’t properly write tabulation entries in DOCX Editable and Exact layouts
IDRSRD-9190	The iDRS does not recognize clear text next to graphic zone
IDRSRD-9179	The iDRS fails to detect columns on specific Korean image	ISD-33936
IDRSRD-8301	The iDRS misrecognizes . (dot) in specific Japanese documents
IDRSRD-6495	The iDRS orientation detection gives wrong results on specific files.
IDRSRD-6450	The iDRS 16 should be updated to zlib 1.3.1	ISD-35377
IDRSRD-6433	The iDRS incorrectly detects vertical text with zonal OCR, on a specific image
IDRSRD-6378	Layout of docx output created by iDRS can be improved, when the document contains narrow columns of text
IDRSRD-6354	Creation of XLSX with layout RecreateInput fails for a specific image when using new page segmentation
IDRSRD-6352	The iDRS new page segmentation outputs extremely small font size for Hebrew characters
IDRSRD-6350	The iDRS new page segmentation doesn’t handle a clearscan Hebrew document
IDRSRD-6345	The iDRS detects non-existing I2OF5 barcodes on a specific document	ISD-9456
IDRSRD-6335	The iDRS wrongly detects the text of a specific table cell	ISD-31823
IDRSRD-6331	The iDRS does not respect the tabulation when processing a pdf into docx format
IDRSRD-6320	Paragraph spacing of DOCX created by iDRS is not correct when converting a specific image
IDRSRD-6292	The iDRS should support images larger than 75M pixels	ISD-31121
IDRSRD-6165	A full page table is causing an unexpected page break when converting specific image to DOCX output
IDRSRD-6079	Alignment of bullet lists in iDRS DOCX output is incorrect
IDRSRD-5933	Hanja characters no longer part of the Korean charset with latest Korean OCR engine	ISD-36198, ISD-36142
IDRSRD-2988	The iDRS does not detect the border line correctly when converting a TIF to docx.	ISD-8043

`Known Issues`

Internal ID	Description	Service desk IDs
IDRSRD-9628	Language detection feature requires really unexpected resources
IDRSRD-9392	The new page segmentation breaks down clear pictures
IDRSRD-9754	The iDRS is not compatible with VirtualBox VMs running on Windows Hosts

OCR resources required by language detection feature

Currently, the language detection feature requires the OCR lexicon files (.ilex extensions) for all languages included in the allowed list (see property CPageAnalysisParams.AllowedLanguages). This issue will be fixed in the next iDRS release.

Note that if the allowed languages list is empty (default behavior), then all languages allowed by licensing are considered allowed.
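Until this is fixed, a pre-flight check can confirm that a lexicon file is present for every allowed language before language detection is enabled. The sketch below is not iDRS code: the function name, the language names, the lexicon file names, and the resource folder layout are all hypothetical and must be replaced with the values from your own installation.

```python
from pathlib import Path


def missing_lexicons(allowed_languages, lexicon_files, resource_dir):
    """Return the allowed languages whose lexicon file is absent from resource_dir.

    `lexicon_files` maps a language name to its .ilex file name. Both are
    supplied by the caller; no iDRS naming convention is assumed here.
    Note: in iDRS an empty allowed list means "all licensed languages", so
    pass the full list of licensed languages explicitly in that case.
    """
    resource_dir = Path(resource_dir)
    missing = []
    for language in allowed_languages:
        file_name = lexicon_files.get(language)
        # A language with no known file name is reported as missing too.
        if file_name is None or not (resource_dir / file_name).is_file():
            missing.append(language)
    return missing
```

A non-empty result lists the languages to remove from the allowed list, or the lexicon files to deploy, before turning language detection on.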

Picture boundary detection

The new page segmentation tends to create graphic zones with non-rectangular boundaries around pictures, whereas the output would look better with rectangular shapes.

The detection of graphic zone boundaries will be reworked and improved in a future release.

Compatibility with VirtualBox on Windows

This release has a compatibility issue with Oracle VirtualBox virtualization software, which prevents it from running properly on Windows host systems (whatever the guest system).

VMware virtualization software is not affected by this issue.

This will be addressed in the next release.