File Converters

pdf

PDFToTextConverter

class PDFToTextConverter(BaseConverter)

__init__

 | __init__(remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None)

Arguments:

  • remove_numeric_tables: This option uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for the reader model if it has no table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained when this option is enabled.
  • valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to check for encoding errors: if the extracted text is not in one of the valid languages, it is likely an encoding error resulting in garbled text.
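
A minimal usage sketch. Assumptions: the import path haystack.file_converter.pdf (it may differ between Haystack versions) and that the pdftotext command-line utility (xpdf/poppler) is available on the system, since PDF extraction is typically delegated to it.

```python
from pathlib import Path

# Import path assumed for Haystack 0.x; adjust to your installed version.
from haystack.file_converter.pdf import PDFToTextConverter

converter = PDFToTextConverter(
    remove_numeric_tables=True,   # drop table rows that consist mostly of numbers
    valid_languages=["en"],       # flag likely encoding errors if the text is not English
)

# convert() comes from the BaseConverter interface and returns a dict
# of the form {"text": ..., "meta": ...}.
doc = converter.convert(file_path=Path("sample.pdf"), meta={"name": "sample.pdf"})
print(doc["text"][:200])
```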

txt

TextConverter

class TextConverter(BaseConverter)

__init__

 | __init__(remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None)

Arguments:

  • remove_numeric_tables: This option uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for the reader model if it has no table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained when this option is enabled.
  • valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to check for encoding errors: if the extracted text is not in one of the valid languages, it is likely an encoding error resulting in garbled text.

convert

 | convert(file_path: Path, meta: Optional[Dict[str, str]] = None, encoding: str = "utf-8") -> Dict[str, Any]

Reads text from a txt file and executes optional preprocessing steps.

Arguments:

  • file_path: Path of the file to convert
  • meta: Optional metadata that should be associated with the document (e.g. name)
  • encoding: Encoding of the file

Returns:

Dict of format {"text": "The text from file", "meta": meta}
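
A minimal usage sketch, assuming the import path haystack.file_converter.txt (it may differ between Haystack versions).

```python
from pathlib import Path

from haystack.file_converter.txt import TextConverter  # import path assumed

converter = TextConverter(remove_numeric_tables=False, valid_languages=["en"])
doc = converter.convert(
    file_path=Path("notes.txt"),
    meta={"name": "notes.txt"},  # optional metadata attached to the returned document
    encoding="utf-8",
)
# doc has the form {"text": "<contents of notes.txt>", "meta": {"name": "notes.txt"}}
```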

tika

TikaConverter

class TikaConverter(BaseConverter)

__init__

 | __init__(tika_url: str = "http://localhost:9998/tika", remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None)

Arguments:

  • tika_url: URL of the Tika server
  • remove_numeric_tables: This option uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for the reader model if it has no table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained when this option is enabled.
  • valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to check for encoding errors: if the extracted text is not in one of the valid languages, it is likely an encoding error resulting in garbled text.

convert

 | convert(file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any]

Arguments:

  • file_path: Path of file to be converted.

Returns:

A dict containing the list of extracted pages and the metadata of the file.
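
A minimal usage sketch. Assumptions: the import path haystack.file_converter.tika (it may differ between Haystack versions) and a Tika server reachable at the default URL, e.g. started with the official apache/tika Docker image.

```python
from pathlib import Path

from haystack.file_converter.tika import TikaConverter  # import path assumed

# Requires a running Tika server, e.g.: docker run -d -p 9998:9998 apache/tika
converter = TikaConverter(tika_url="http://localhost:9998/tika")

doc = converter.convert(file_path=Path("report.pdf"))
# doc contains the extracted pages plus the file metadata returned by Tika
```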

docx

DocxToTextConverter

class DocxToTextConverter(BaseConverter)

convert

 | convert(file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any]

Extract text from a .docx file. Note: As .docx files don't contain "page" information, we actually extract and return a list of paragraphs here. For consistency with the other converters, we nevertheless kept the method's name.

Arguments:

  • file_path: Path to the .docx file you want to convert
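
A minimal usage sketch, assuming the import path haystack.file_converter.docx (it may differ between Haystack versions).

```python
from pathlib import Path

from haystack.file_converter.docx import DocxToTextConverter  # import path assumed

converter = DocxToTextConverter()
doc = converter.convert(file_path=Path("manual.docx"))
# "text" is assembled from the paragraphs of the .docx file
print(doc["text"][:200])
```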

base

BaseConverter

class BaseConverter()

Base class for implementing file converters that transform input documents into text format for ingestion into a DocumentStore.

__init__

 | __init__(remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None)

Arguments:

  • remove_numeric_tables: This option uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for the reader model if it has no table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained when this option is enabled.
  • valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to check for encoding errors: if the extracted text is not in one of the valid languages, it is likely an encoding error resulting in garbled text.

convert

 | @abstractmethod
 | convert(file_path: Path, meta: Optional[Dict[str, str]]) -> Dict[str, Any]

Convert a file to a dictionary containing the text and any associated metadata.

File converters may extract file metadata such as name or size. In addition, user-supplied metadata such as author, URL, or external IDs can be supplied as a dictionary.

Arguments:

  • file_path: Path of the file to convert
  • meta: Dictionary of metadata key-value pairs to attach to the returned document.
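
A sketch of a custom converter built on this interface. The MarkdownToTextConverter class is hypothetical, and the example assumes that BaseConverter keeps valid_languages on the instance and that the module lives at haystack.file_converter.base.

```python
from pathlib import Path
from typing import Any, Dict, Optional

from haystack.file_converter.base import BaseConverter  # import path assumed


class MarkdownToTextConverter(BaseConverter):
    """Hypothetical converter that treats .md files as plain text."""

    def convert(self, file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
        text = file_path.read_text(encoding="utf-8")
        # Assumes the valid_languages passed to __init__ is stored on the instance.
        if self.valid_languages and not self.validate_language(text):
            raise ValueError(f"Text extracted from {file_path} is not in {self.valid_languages}")
        return {"text": text, "meta": meta}
```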

validate_language

 | validate_language(text: str) -> bool

Validate whether the language of the text is one of the valid languages.
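
A usage sketch: the check is driven by the valid_languages list passed at construction time (import path assumed, as above).

```python
from pathlib import Path

from haystack.file_converter.txt import TextConverter  # import path assumed

converter = TextConverter(valid_languages=["en", "de"])
doc = converter.convert(file_path=Path("notes.txt"))
if not converter.validate_language(doc["text"]):
    # Text does not look like English or German; possibly an encoding error.
    print("Warning: unexpected language or encoding problem")
```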
