setrcarbon.blogg.se - Pdfextractor command line

#Pdfextractor command line pdf

+ 'XMLExtractor': Added property 'IndentedXML' to control indentation.
'Remover2': making text unsearchable is now performed only for edited pages.

'Remover2': fixed handling of PDF page rotation.
+ Added property 'NormalizeText' to all extractors. It replaces Unicode spaces and hyphens in the extracted text with normal ' ' and '-' characters.
= Improved background color detection for the option 'ConsiderBackgroundColors'.
+ Added properties 'DetectUnderlineTextStyle' and 'DetectStrikeoutTextStyle' to 'CSVExtractor' and 'XLSExtractor'. They help to prevent underlined text from affecting the line grouping in table cells.
= All extractor classes now support extraction of page ranges.
= 'JSONExtractor' and 'XMLExtractor' now output the page size for each page.
= 'DocumentMerger': Improved merging of PDF forms. Now it can link fields with matching names or rename them to avoid unwanted linking. See the property 'RenameMatchingFieldsDuringMerge'.
= Improved the 'LineGroupingMode.JoinOrphanedRows' mode.
= Improved filtering of shadow-like text ('ExtractShadowLikeText' option).
= Greatly improved table detection in 'TableDetector2'.
+ New column detection mode 'ColumnDetectionMode.ContentGroupsAI' that works better on tables without borders and on pages with multiple tables.
Fixed disposing issue in 'SearchablePDFMaker'.
Fixed: line grouping was not affected by the 'ConsiderFontSizes' and 'ConsiderFontColors' properties.
The minimum required .NET Core version is now 2.1 (was 2.0).
+ Extractors and SearchablePDFMaker: Added property 'OCRDisableAutoSegmentation' to solve the OCR engine's segmentation issues.
= Improved COM/ActiveX interfaces for in-memory processing without file operations.
+ InfoExtractor: Added method 'GetFormFields()' returning information about form fields in a PDF document.
+ Added support for UniKS-UCS2-H text encoding.
= JSONExtractor: The mode 'OutputStructure.Full' is renamed to 'OutputStructure.LegacyFixed' and made maximally compatible in field names with the mode 'OutputStructure.Legacy'.
+ XLSExtractor: Added property 'CustomColumnWidths' that lets you specify exact column widths in the generated Excel spreadsheet.
+ DocumentMerger: Added property 'MergedDocumentTitle' that lets you override the title of the merged document.
'SearchablePDFMaker': fixed coordinates of transparent text in the output document when the input is an image.
Fixed parsing of names of file attachments.
Fixed: rotated text objects were combined with unrotated ones.
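The 'NormalizeText' entry above describes replacing Unicode spaces and hyphens with plain ' ' and '-' characters. As a rough illustration of that kind of normalization, here is a minimal Python sketch. It does not use the SDK: the function name `normalize_text` and the exact set of hyphen-like characters are assumptions made for this example, and the SDK's own rules may differ.

```python
import unicodedata

# Assumed set of hyphen-like characters for this sketch only;
# the SDK's actual list is not documented in this changelog.
HYPHEN_LIKE = {
    "\u2010",  # hyphen
    "\u2011",  # non-breaking hyphen
    "\u2013",  # en dash
    "\u2212",  # minus sign
}


def normalize_text(text: str) -> str:
    """Hypothetical helper: map Unicode spaces and hyphens to ASCII ' ' and '-'."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Zs":  # any Unicode space separator
            out.append(" ")
        elif ch in HYPHEN_LIKE:
            out.append("-")
        else:
            out.append(ch)
    return "".join(out)


print(normalize_text("well\u2011known\u00a0fact"))  # → well-known fact
```

Normalizing this way makes extracted text easier to search and compare, since a non-breaking space or non-breaking hyphen no longer prevents a plain-text match.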

= 'DocumentRotator' can now automatically fix rotation of PDF files using OCR.
Input photo images are now rotated according to EXIF information.
+ Added methods to all extractors that support the Variant datatype for input and output. They allow in-memory processing when using the SDK as a COM/ActiveX object from Delphi, VC++, VBScript, etc.
+ DocumentSplitter: added support for the "**" split range, which splits a document into pairs of pages.

Provides an ActiveX interface for use from legacy programming languages (Visual Basic 6, Delphi) and scripting languages (VBScript, JScript and others)
Reads text from scanned PDF documents using OCR (Optical Character Recognition)
Searches text inside a document with regex support
Extracts PDF document information (author, subject, producer, etc.)
Extracts data from a whole document page or a specified rectangular region
Splits and merges PDF files, extracts a single page or a range of pages
Extracts embedded images, files and attachments from PDF files
Extracts data from PDF files in TXT, CSV, XML, XLS, XLSX, JSON formats
.NET, ASP.NET, ActiveX, Visual Basic 6, Classic ASP, Delphi and others.
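The changelog above mentions a "**" split range that splits a document into pairs of pages. The grouping this implies can be sketched in a few lines of Python. This is only an illustration and does not call the SDK: `split_into_pairs` is a hypothetical helper, and putting a leftover last page of an odd-length document into its own part is an assumption, since the changelog does not say how that case is handled.

```python
from typing import List


def split_into_pairs(page_count: int) -> List[List[int]]:
    """Hypothetical helper: group 1-based page numbers into consecutive pairs.

    Assumption: with an odd page count, the final page forms a part by itself.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    pages = list(range(1, page_count + 1))
    # Step through the page list two at a time; slicing handles the short tail.
    return [pages[i:i + 2] for i in range(0, page_count, 2)]


print(split_into_pairs(5))  # → [[1, 2], [3, 4], [5]]
```

Each inner list corresponds to one output document, so a five-page input would produce three parts under these assumptions.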