Normalizer object to be used with nvtext::normalize_characters.
#include <normalize.hpp>
Public Member Functions
character_normalizer(bool do_lower_case, cudf::strings_column_view const& special_tokens, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Normalizer object constructor.
Normalizer object to be used with nvtext::normalize_characters.
Use nvtext::create_character_normalizer to create this object.
This normalizer includes changing whitespace characters (e.g. "\t", "\n", "\r") to just a space " ".
The padding process adds a single space before and after the character. Details on unicode category can be found here: https://unicodebook.readthedocs.io/unicode.html#categories
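The whitespace and padding rules can be illustrated with a small host-side sketch. This is not the cuDF GPU implementation: the function name `normalize_ascii_sketch` is hypothetical, the sketch handles ASCII only, and it assumes that the padded characters are those `std::ispunct` reports as punctuation.

```cpp
#include <cctype>
#include <string>

// Hypothetical, ASCII-only illustration of the rules described above.
// Assumption: the characters that receive padding are ASCII punctuation
// as classified by std::ispunct in the default "C" locale.
std::string normalize_ascii_sketch(std::string const& input)
{
  std::string out;
  for (unsigned char ch : input) {
    if (std::isspace(ch)) {
      // whitespace such as '\t', '\n', '\r' becomes a single space
      out += ' ';
    } else if (std::ispunct(ch)) {
      // padding: one space before and one space after the character
      out += ' ';
      out += static_cast<char>(ch);
      out += ' ';
    } else {
      out += static_cast<char>(ch);
    }
  }
  return out;
}
```

With this sketch, `normalize_ascii_sketch("a\tb,c")` yields `"a b , c"`: the tab becomes a space and the comma gains a space on each side.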
If do_lower_case = true, lower-casing also removes any accents. The accents cannot be removed from upper-case characters without lower-casing, and lower-casing cannot be performed without also removing accents. However, if the accented character is already lower-case, then only the accent is removed.
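The coupling between lower-casing and accent removal can be sketched per code point. The function `fold_sketch` and its four-entry table are hypothetical and exist only for illustration; the real normalizer uses its own tables, which are not reproduced here.

```cpp
#include <unordered_map>

// Hypothetical illustration of the do_lower_case rule for a few code points.
// Assumption: the tiny table below stands in for the real normalizing tables.
char32_t fold_sketch(char32_t cp, bool do_lower_case)
{
  // do_lower_case == false: accented and upper-case characters are untouched
  if (!do_lower_case) { return cp; }

  // do_lower_case == true: the result is always lower-case and unaccented
  static const std::unordered_map<char32_t, char32_t> table = {
    {U'\u00C9', U'e'},  // upper-case E with acute: lower-cased, accent removed
    {U'\u00E9', U'e'},  // lower-case e with acute: only the accent is removed
    {U'\u00DC', U'u'},  // upper-case U with diaeresis
    {U'\u00FC', U'u'},  // lower-case u with diaeresis
  };
  auto const it = table.find(cp);
  if (it != table.end()) { return it->second; }

  // plain ASCII upper-case letters are simply lower-cased
  if (cp >= U'A' && cp <= U'Z') { return static_cast<char32_t>(cp + (U'a' - U'A')); }
  return cp;
}
```

In this sketch both U+00C9 and U+00E9 map to `e` when `do_lower_case` is true, and both are returned unchanged when it is false, which mirrors the rule that accents are never stripped on their own from upper-case characters.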
If special_tokens are included, the padding after [ and before ] is not inserted if the characters between them match one of the given tokens. Also, the special_tokens are expected to include the [] characters at the beginning and end of each string.
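The special-token rule can be sketched on the host as follows. The function name `pad_with_special_tokens_sketch` is hypothetical, the sketch is ASCII-only, and it assumes that brackets are otherwise padded like any other punctuation character; it is an illustration of the described behavior, not the library's implementation.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, ASCII-only illustration of the special-token padding rule.
// Assumption: '[' and ']' are padded like other punctuation unless the
// bracketed text matches one of the given tokens exactly.
std::string pad_with_special_tokens_sketch(std::string const& input,
                                           std::vector<std::string> const& special_tokens)
{
  std::string out;
  std::size_t i = 0;
  while (i < input.size()) {
    auto const ch = static_cast<unsigned char>(input[i]);
    if (ch == '[') {
      auto const close = input.find(']', i);
      if (close != std::string::npos) {
        auto const candidate = input.substr(i, close - i + 1);
        if (std::find(special_tokens.begin(), special_tokens.end(), candidate) !=
            special_tokens.end()) {
          // matched token: no padding after '[' or before ']',
          // but the padding before '[' and after ']' is still added
          out += ' ';
          out += candidate;
          out += ' ';
          i = close + 1;
          continue;
        }
      }
    }
    if (std::ispunct(ch)) {
      // ordinary punctuation: one space on each side
      out += ' ';
      out += input[i];
      out += ' ';
    } else {
      out += input[i];
    }
    ++i;
  }
  return out;
}
```

Given the tokens `{"[BOS]", "[EOS]"}`, the input `"[BOS]hi"` becomes `" [BOS] hi"`, while the non-matching input `"[a]"` becomes `" [ a ] "`.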
Definition at line 143 of file normalize.hpp.
nvtext::character_normalizer::character_normalizer(bool do_lower_case, cudf::strings_column_view const& special_tokens, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Normalizer object constructor.
This initializes and holds the character normalizing tables and settings.
Parameters
do_lower_case: If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
special_tokens: Each row is a token including the [] brackets. For example: [BOS], [EOS], [UNK], [SEP], [PAD], [CLS], [MASK]
stream: CUDA stream used for device memory operations and kernel launches
mr: Device memory resource used to allocate the returned column's device memory