nvtext::character_normalizer Struct Reference

Normalizer object to be used with nvtext::normalize_characters.

#include <normalize.hpp>

Public Member Functions

 character_normalizer (bool do_lower_case, cudf::strings_column_view const &special_tokens, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Normalizer object constructor.
 

Detailed Description

Normalizer object to be used with nvtext::normalize_characters.

Use nvtext::create_normalizer to create this object.

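This normalizer includes:

- adding padding around punctuation (Unicode category starts with "P") as well as certain ASCII symbols like "^" and "$"
- adding padding around the CJK Unicode block characters
- changing whitespace (e.g. "\t", "\n", "\r") to just space " "
- removing control characters (Unicode categories "Cc" and "Cf")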

The padding process adds a single space before and after the character. Details on Unicode categories can be found here: https://unicodebook.readthedocs.io/unicode.html#categories
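The padding step can be sketched on the CPU for ASCII input. This is only an illustration, not the nvtext implementation: pad_ascii_punctuation is a hypothetical helper, and it assumes the padded characters are exactly those that std::ispunct reports as punctuation in the default "C" locale.

```cpp
#include <cctype>
#include <string>

// Hypothetical CPU helper, not part of nvtext.
// Assumption: the characters to pad are the ASCII punctuation characters
// reported by std::ispunct in the default "C" locale.
std::string pad_ascii_punctuation(std::string const& input)
{
  std::string output;
  output.reserve(input.size() * 3);  // worst case: every character is padded
  for (unsigned char ch : input) {
    if (std::ispunct(ch)) {
      // one space before and one space after the character
      output += ' ';
      output += static_cast<char>(ch);
      output += ' ';
    } else {
      output += static_cast<char>(ch);
    }
  }
  return output;
}
```

With this sketch, pad_ascii_punctuation("hello,world!") yields "hello , world ! ". Adjacent padded characters produce consecutive spaces; the sketch does not collapse them.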

If do_lower_case = true, lower-casing also removes any accents. The accents cannot be removed from upper-case characters without lower-casing and lower-casing cannot be performed without also removing accents. However, if the accented character is already lower-case, then only the accent is removed.
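The coupling between lower-casing and accent removal can be illustrated with a small CPU sketch over code points. The function lower_and_strip and its four-entry accent table are hypothetical and cover only a few Latin-1 letters; the real normalizer uses its own device tables, which are not shown here.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical CPU helper, not part of nvtext.
// Assumption: a tiny hand-written table stands in for the real tables.
std::u32string lower_and_strip(std::u32string const& input, bool do_lower_case)
{
  // With do_lower_case == false nothing is transformed.
  if (!do_lower_case) { return input; }

  // Accented letter -> unaccented lower-case letter.
  static std::unordered_map<char32_t, char32_t> const table{
    {U'\u00C9', U'e'},  // upper-case E with acute: lower-cased and stripped
    {U'\u00E9', U'e'},  // lower-case e with acute: only the accent is removed
    {U'\u00D1', U'n'},  // upper-case N with tilde
    {U'\u00F1', U'n'},  // lower-case n with tilde
  };

  std::u32string output;
  output.reserve(input.size());
  for (char32_t cp : input) {
    auto const found = table.find(cp);
    if (found != table.end()) {
      output += found->second;
    } else if (cp >= U'A' && cp <= U'Z') {
      output += static_cast<char32_t>(cp - U'A' + U'a');  // plain ASCII lower-casing
    } else {
      output += cp;
    }
  }
  return output;
}
```

In this sketch both the upper-case and the lower-case accented "e" map to a plain "e" when do_lower_case is true, and the input is returned unchanged when it is false.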

If special_tokens are included, the padding after [ and before ] is not inserted when the characters between them match one of the given tokens. Each string in special_tokens is expected to include the enclosing brackets, that is, to start with [ and end with ].
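The special-token rule can be sketched on the CPU as follows. The helper pad_with_special_tokens is hypothetical and ASCII-only, and it assumes an exact, case-sensitive match against the token set; how the library itself matches tokens is not specified here.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical CPU helper, not part of nvtext.
// Assumptions: ASCII input, tokens include their brackets, exact match.
std::string pad_with_special_tokens(std::string const& input,
                                    std::unordered_set<std::string> const& tokens)
{
  std::string output;
  std::size_t i = 0;
  while (i < input.size()) {
    auto const ch = static_cast<unsigned char>(input[i]);
    if (ch == '[') {
      auto const close = input.find(']', i + 1);
      if (close != std::string::npos) {
        auto const candidate = input.substr(i, close - i + 1);
        if (tokens.count(candidate) > 0) {
          // Pad before '[' and after ']' only; keep the token intact.
          output += ' ';
          output += candidate;
          output += ' ';
          i = close + 1;
          continue;
        }
      }
    }
    if (std::ispunct(ch)) {
      // Ordinary punctuation, including unmatched brackets, is padded on both sides.
      output += ' ';
      output += static_cast<char>(ch);
      output += ' ';
    } else {
      output += static_cast<char>(ch);
    }
    ++i;
  }
  return output;
}
```

Given the token set {"[SEP]"}, the input "a[SEP]b[x]" becomes "a [SEP] b [ x ] " in this sketch: the recognized token stays whole, while the brackets around the unrecognized "x" are padded individually.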

Definition at line 143 of file normalize.hpp.

Constructor & Destructor Documentation

character_normalizer()

nvtext::character_normalizer::character_normalizer(
    bool do_lower_case,
    cudf::strings_column_view const& special_tokens,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Normalizer object constructor.

This initializes and holds the character normalizing tables and settings.

Parameters
do_lower_case    If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
special_tokens   Each row is a token including the [] brackets. For example: [BOS], [EOS], [UNK], [SEP], [PAD], [CLS], [MASK]
stream           CUDA stream used for device memory operations and kernel launches
mr               Device memory resource used to allocate the returned column's device memory

The documentation for this struct was generated from the following file: