
Class: HTMLReader

Extracts the significant text from an arbitrary HTML document. The contents of any head, script, style, and xml tags are removed completely. For a[href] tags, the URL is extracted along with the inner text of the tag. All other tags are removed, and their inner text is kept intact. HTML entities (e.g., &amp;) are not decoded.
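The rules above can be approximated in a few lines of plain TypeScript. This is only an illustrative sketch: the function name sketchStripHtml is hypothetical, the regular expressions are a simplification (the reader itself delegates to string-strip-html), and the position of the URL relative to the link text is an assumption rather than the library's documented output format.

```typescript
// Illustrative approximation only; not the reader's actual implementation.
function sketchStripHtml(html: string): string {
  return html
    // 1. Drop head, script, style, and xml elements together with their contents.
    .replace(/<(head|script|style|xml)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // 2. Keep the link text and append the href URL after it (assumed layout).
    .replace(
      /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a\s*>/gi,
      "$2 $1",
    )
    // 3. Remove every remaining tag but keep its inner text.
    .replace(/<[^>]+>/g, " ")
    // 4. Collapse whitespace; entities such as &amp; are left untouched.
    .replace(/\s+/g, " ")
    .trim();
}

const sample =
  "<html><head><title>T</title></head><body><script>var x = 1;</script>" +
  '<p>Tom &amp; Jerry <a href="https://example.com">site</a></p></body></html>';

console.log(sketchStripHtml(sample));
// prints "Tom &amp; Jerry site https://example.com"
```

Note that the title and script contents disappear entirely, while the entity &amp; survives undecoded, matching the behavior described above.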

Extends

FileReader

Constructors

new HTMLReader()

new HTMLReader(): HTMLReader

Returns

HTMLReader

Inherited from

FileReader.constructor

Methods

getOptions()

getOptions(): Partial<Opts>

Returns the configuration options that this reader passes to the string-strip-html library.

Returns

Partial<Opts>

An object of options for the underlying library

See

https://codsen.com/os/string-strip-html/examples

Defined in

packages/readers/html/dist/index.d.ts:32


loadData()

loadData(filePath): Promise<Document<Metadata>[]>

Parameters

filePath: string

Returns

Promise<Document<Metadata>[]>

Inherited from

FileReader.loadData

Defined in

packages/core/schema/dist/index.d.ts:187


loadDataAsContent()

loadDataAsContent(fileContent): Promise<Document<Metadata>[]>

Public method for this reader. Required by BaseReader interface.

Parameters

fileContent: Uint8Array

The content of the file.

Returns

Promise<Document<Metadata>[]>

A Promise that eventually resolves to an array of zero or one Document parsed from the given HTML content.

Overrides

FileReader.loadDataAsContent

Defined in

packages/readers/html/dist/index.d.ts:18
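Because loadDataAsContent takes a Uint8Array rather than a string, HTML that is already in memory has to be encoded first. The sketch below shows one way to do that. ContentLoader, loadHtmlString, and fakeReader are hypothetical names introduced here so the example runs without importing the reader package; in real use an HTMLReader instance would be passed in place of fakeReader.

```typescript
// Structural type covering only the method documented above (an assumption
// made so this sketch is self-contained).
interface ContentLoader<D> {
  loadDataAsContent(fileContent: Uint8Array): Promise<D[]>;
}

// Encode an in-memory HTML string as UTF-8 bytes and hand it to the reader.
async function loadHtmlString<D>(
  reader: ContentLoader<D>,
  html: string,
): Promise<D[]> {
  const bytes = new TextEncoder().encode(html);
  return reader.loadDataAsContent(bytes);
}

// Stand-in reader for demonstration only: it decodes the bytes and returns
// zero or one item, mimicking the documented "zero or one Document" shape.
const fakeReader: ContentLoader<{ text: string }> = {
  async loadDataAsContent(fileContent) {
    const text = new TextDecoder().decode(fileContent);
    return text.length > 0 ? [{ text }] : [];
  },
};

loadHtmlString(fakeReader, "<p>Hello</p>").then((docs) => {
  console.log(docs.length);
});
```

Typing the helper against a one-method interface keeps it usable with any reader that exposes the same loadDataAsContent signature.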


parseContent()

parseContent(html, options?): Promise<string>

Strips the given HTML by delegating to the string-strip-html library.

Parameters

html: string

Raw HTML content to be parsed.

options?: Partial<Opts>

An object of options for the underlying library

Returns

Promise<string>

The HTML content, stripped of unwanted tags and attributes

See

getOptions

Defined in

packages/readers/html/dist/index.d.ts:26
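As a rough guide to what an options object for parseContent might look like, the sketch below builds one from defaults plus caller overrides. The field names skipHtmlDecoding, stripTogetherWithTheirContents, and dumpLinkHrefsNearby come from the string-strip-html options. The default values are only inferred from the class description and are not confirmed by this page, and StripOptsSubset, sketchDefaultOptions, and mergeOptions are hypothetical names. This page does not say whether parseContent merges the supplied options with getOptions() or replaces them, so the merge here is done explicitly by the caller.

```typescript
// Local stand-in for a small subset of the library's Opts type (assumption).
type StripOptsSubset = {
  skipHtmlDecoding: boolean;
  stripTogetherWithTheirContents: string[];
  dumpLinkHrefsNearby: { enabled: boolean };
};

// Hypothetical defaults mirroring the class description: entities are not
// decoded, four tags are removed with their contents, link URLs are kept.
function sketchDefaultOptions(): Partial<StripOptsSubset> {
  return {
    skipHtmlDecoding: true,
    stripTogetherWithTheirContents: ["head", "script", "style", "xml"],
    dumpLinkHrefsNearby: { enabled: true },
  };
}

// Shallow merge in which caller overrides win over the defaults.
function mergeOptions(
  overrides: Partial<StripOptsSubset> = {},
): Partial<StripOptsSubset> {
  return { ...sketchDefaultOptions(), ...overrides };
}

// Example: additionally drop nav elements together with their contents.
const custom = mergeOptions({
  stripTogetherWithTheirContents: ["head", "script", "style", "xml", "nav"],
});
console.log(custom);
```

With a real reader, an object like custom would be passed as the second argument of parseContent.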


addMetaData()

static addMetaData(filePath): (doc, index) => void

Parameters

filePath: string

Returns

Function

Parameters

doc: BaseNode<Metadata>

index: number

Returns

void

Inherited from

FileReader.addMetaData

Defined in

packages/core/schema/dist/index.d.ts:188