MinHashing

Files

file  minhash.hpp
 

Functions

std::unique_ptr< cudf::column > nvtext::minhash (cudf::strings_column_view const &input, uint32_t seed, cudf::device_span< uint32_t const > parameter_a, cudf::device_span< uint32_t const > parameter_b, cudf::size_type width, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns the minhash values for each string.
 
std::unique_ptr< cudf::column > nvtext::minhash64 (cudf::strings_column_view const &input, uint64_t seed, cudf::device_span< uint64_t const > parameter_a, cudf::device_span< uint64_t const > parameter_b, cudf::size_type width, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns the minhash values for each string.
 
std::unique_ptr< cudf::column > nvtext::minhash_ngrams (cudf::lists_column_view const &input, cudf::size_type ngrams, uint32_t seed, cudf::device_span< uint32_t const > parameter_a, cudf::device_span< uint32_t const > parameter_b, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns the minhash values for each input row.
 
std::unique_ptr< cudf::column > nvtext::minhash64_ngrams (cudf::lists_column_view const &input, cudf::size_type ngrams, uint64_t seed, cudf::device_span< uint64_t const > parameter_a, cudf::device_span< uint64_t const > parameter_b, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns the minhash values for each input row.
 

Detailed Description

Function Documentation

◆ minhash()

std::unique_ptr<cudf::column> nvtext::minhash ( cudf::strings_column_view const &  input,
uint32_t  seed,
cudf::device_span< uint32_t const >  parameter_a,
cudf::device_span< uint32_t const >  parameter_b,
cudf::size_type  width,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns the minhash values for each string.

This function uses MurmurHash3_x86_32 for the hash algorithm.

The input strings are first hashed using the given seed over substrings of width characters. These hash values are then combined with the a and b parameter values using the following formula:

max_hash = max of uint32
mp = (1 << 61) - 1
hv[i] = hash value of a substring at i
pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash

This calculation is performed on each substring and the minimum value is computed as follows:

mh[j,i] = min(pv[i]) for all substrings in row j
and where i=[0,a.size())

Any null row entries result in corresponding null output rows.
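
The formula above can be sketched on the host as follows. This is a minimal CPU illustration, not the library's implementation. The names `standin_hash32`, `permute32`, and `minhash_row` are hypothetical; `standin_hash32` is an FNV-1a stand-in rather than MurmurHash3_x86_32; substrings are taken over bytes rather than characters; and treating a row shorter than `width` as a single substring is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Stand-in 32-bit hash (FNV-1a mixed with the seed). NOT MurmurHash3_x86_32;
// it is used only so the sketch has no external dependencies.
inline std::uint32_t standin_hash32(std::string const& s, std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// pv = ((hv * a + b) % mp) & max_hash, evaluated in 64-bit arithmetic.
inline std::uint32_t permute32(std::uint32_t hv, std::uint32_t a, std::uint32_t b)
{
  constexpr std::uint64_t mp       = (std::uint64_t{1} << 61) - 1;
  constexpr std::uint64_t max_hash = std::numeric_limits<std::uint32_t>::max();
  // With 32-bit inputs, hv * a + b is at most 2^64 - 2^32, so it cannot wrap.
  std::uint64_t const v = static_cast<std::uint64_t>(hv) * a + b;
  return static_cast<std::uint32_t>((v % mp) & max_hash);
}

// mh[i] = min over all substrings of permute32(hash(substring), a[i], b[i]).
inline std::vector<std::uint32_t> minhash_row(std::string const& row,
                                              std::uint32_t seed,
                                              std::vector<std::uint32_t> const& a,
                                              std::vector<std::uint32_t> const& b,
                                              std::size_t width)
{
  assert(a.size() == b.size());
  std::vector<std::uint32_t> mh(a.size(), std::numeric_limits<std::uint32_t>::max());
  // Assumption: a row shorter than `width` is hashed as one substring.
  std::size_t const count = row.size() > width ? row.size() - width + 1 : 1;
  for (std::size_t pos = 0; pos < count; ++pos) {
    std::uint32_t const hv = standin_hash32(row.substr(pos, width), seed);
    for (std::size_t i = 0; i < a.size(); ++i) {
      mh[i] = std::min(mh[i], permute32(hv, a[i], b[i]));
    }
  }
  return mh;
}
```

With a[i] = 1 and b[i] = 0 the permutation leaves a 32-bit hash unchanged, so mh[i] is simply the smallest substring hash in the row.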

Exceptions
std::invalid_argument  if the width < 2
std::invalid_argument  if parameter_a is empty
std::invalid_argument  if parameter_b.size() != parameter_a.size()
std::overflow_error  if parameter_a.size() * input.size() exceeds the column size limit
Parameters
input  Strings column to compute minhash
seed  Seed value used for the hash algorithm
parameter_a  Values used for the permuted calculation
parameter_b  Values used for the permuted calculation
width  The character width of substrings to hash for each row
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
List column of minhash values for each string, one value per parameter_a/parameter_b pair
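
The documented preconditions can be checked on the host before calling the function. The helper below is a hypothetical sketch (`check_minhash_args` is not part of nvtext) that throws the same exception types listed above. It assumes the column size limit is the maximum value of a 32-bit signed size type.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical pre-check mirroring the documented exceptions.
inline void check_minhash_args(std::size_t input_rows,
                               std::size_t a_size,
                               std::size_t b_size,
                               std::int32_t width)
{
  if (width < 2) { throw std::invalid_argument("width must be at least 2"); }
  if (a_size == 0) { throw std::invalid_argument("parameter_a must not be empty"); }
  if (b_size != a_size) {
    throw std::invalid_argument("parameter_b.size() must equal parameter_a.size()");
  }
  // Assumption: the column size limit is the largest 32-bit signed value.
  constexpr std::size_t limit =
    static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
  // Compare by division so the product itself is never formed and cannot wrap.
  if (input_rows != 0 && a_size > limit / input_rows) {
    throw std::overflow_error("parameter_a.size() * input.size() exceeds the column size limit");
  }
}
```

For example, one billion input rows with four parameter pairs would need four billion output values, which is past the assumed limit and is rejected by the last check.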

◆ minhash64()

std::unique_ptr<cudf::column> nvtext::minhash64 ( cudf::strings_column_view const &  input,
uint64_t  seed,
cudf::device_span< uint64_t const >  parameter_a,
cudf::device_span< uint64_t const >  parameter_b,
cudf::size_type  width,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns the minhash values for each string.

This function uses MurmurHash3_x64_128 for the hash algorithm.

The input strings are first hashed using the given seed over substrings of width characters. These hash values are then combined with the a and b parameter values using the following formula:

max_hash = max of uint64
mp = (1 << 61) - 1
hv[i] = hash value of a substring at i
pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash

This calculation is performed on each substring and the minimum value is computed as follows:

mh[j,i] = min(pv[i]) for all substrings in row j
and where i=[0,a.size())

Any null row entries result in corresponding null output rows.
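
For the 64-bit variant, every value produced by `% mp` is already below 2^61, so the final mask with the uint64 maximum leaves it unchanged. The sketch below shows the permutation step only. `permute64` and `mersenne_prime` are hypothetical names, and letting `hv * a + b` wrap modulo 2^64 is an assumption, since this page does not state how overflow of the product is treated.

```cpp
#include <cstdint>
#include <limits>

constexpr std::uint64_t mersenne_prime = (std::uint64_t{1} << 61) - 1;

// pv = ((hv * a + b) % mp) & max_hash for 64-bit values.
// Assumption: the product and sum wrap modulo 2^64, as unsigned
// arithmetic does in C++.
constexpr std::uint64_t permute64(std::uint64_t hv, std::uint64_t a, std::uint64_t b)
{
  constexpr std::uint64_t max_hash = std::numeric_limits<std::uint64_t>::max();
  return ((hv * a + b) % mersenne_prime) & max_hash;
}

// The result is always smaller than the modulus, so the mask is a no-op here.
static_assert(permute64(2, 3, 4) == 10, "small values pass through unchanged");
static_assert(permute64(mersenne_prime, 1, 0) == 0, "values wrap at the modulus");
```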

Exceptions
std::invalid_argument  if the width < 2
std::invalid_argument  if parameter_a is empty
std::invalid_argument  if parameter_b.size() != parameter_a.size()
std::overflow_error  if parameter_a.size() * input.size() exceeds the column size limit
Parameters
input  Strings column to compute minhash
seed  Seed value used for the hash algorithm
parameter_a  Values used for the permuted calculation
parameter_b  Values used for the permuted calculation
width  The character width of substrings to hash for each row
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
List column of minhash values for each string, one value per parameter_a/parameter_b pair

◆ minhash64_ngrams()

std::unique_ptr<cudf::column> nvtext::minhash64_ngrams ( cudf::lists_column_view const &  input,
cudf::size_type  ngrams,
uint64_t  seed,
cudf::device_span< uint64_t const >  parameter_a,
cudf::device_span< uint64_t const >  parameter_b,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns the minhash values for each input row.

This function uses MurmurHash3_x64_128 for the hash algorithm.

The input row is first hashed using the given seed over a sliding window of ngrams of strings. These hash values are then combined with the a and b parameter values using the following formula:

max_hash = max of uint64
mp = (1 << 61) - 1
hv[i] = hash value of the ngram window at i
pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash

This calculation is performed on each set of ngrams and the minimum value is computed as follows:

mh[j,i] = min(pv[i]) for all ngrams in row j
and where i=[0,a.size())

Any null row entries result in corresponding null output rows.
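
The sliding window described above can be pictured with a small host-side sketch. `ngram_windows` is a hypothetical helper that only enumerates which strings of a row fall into each window; how the strings in a window are combined before hashing is not specified on this page and is left out. Returning no windows for a row with fewer than `ngrams` strings is an assumption.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Returns the [begin, end) index range of every window of `ngrams`
// consecutive strings in a row that holds `row_size` strings.
inline std::vector<std::pair<std::size_t, std::size_t>> ngram_windows(std::size_t row_size,
                                                                      std::size_t ngrams)
{
  std::vector<std::pair<std::size_t, std::size_t>> windows;
  // Assumption: a row with fewer than `ngrams` strings yields no windows.
  if (ngrams == 0 || row_size < ngrams) { return windows; }
  for (std::size_t first = 0; first + ngrams <= row_size; ++first) {
    windows.emplace_back(first, first + ngrams);
  }
  return windows;
}
```

A row of 5 strings with ngrams = 2 gives the 4 windows [0,2), [1,3), [2,4), and [3,5), and each window contributes one hv value to the minimum.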

Exceptions
std::invalid_argument  if the ngrams < 2
std::invalid_argument  if parameter_a is empty
std::invalid_argument  if parameter_b.size() != parameter_a.size()
std::overflow_error  if parameter_a.size() * input.size() exceeds the column size limit
Parameters
input  List strings column to compute minhash
ngrams  The number of strings to hash within each row
seed  Seed value used for the hash algorithm
parameter_a  Values used for the permuted calculation
parameter_b  Values used for the permuted calculation
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
List column of minhash values for each input row, one value per parameter_a/parameter_b pair

◆ minhash_ngrams()

std::unique_ptr<cudf::column> nvtext::minhash_ngrams ( cudf::lists_column_view const &  input,
cudf::size_type  ngrams,
uint32_t  seed,
cudf::device_span< uint32_t const >  parameter_a,
cudf::device_span< uint32_t const >  parameter_b,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns the minhash values for each input row.

This function uses MurmurHash3_x86_32 for the hash algorithm.

The input row is first hashed using the given seed over a sliding window of ngrams of strings. These hash values are then combined with the a and b parameter values using the following formula:

max_hash = max of uint32
mp = (1 << 61) - 1
hv[i] = hash value of the ngram window at i
pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash

This calculation is performed on each set of ngrams and the minimum value is computed as follows:

mh[j,i] = min(pv[i]) for all ngrams in row j
and where i=[0,a.size())

Any null row entries result in corresponding null output rows.

Exceptions
std::invalid_argument  if the ngrams < 2
std::invalid_argument  if parameter_a is empty
std::invalid_argument  if parameter_b.size() != parameter_a.size()
std::overflow_error  if parameter_a.size() * input.size() exceeds the column size limit
Parameters
input  List strings column to compute minhash
ngrams  The number of strings to hash within each row
seed  Seed value used for the hash algorithm
parameter_a  Values used for the permuted calculation
parameter_b  Values used for the permuted calculation
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
List column of minhash values for each input row, one value per parameter_a/parameter_b pair
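
The shape of the returned list column can be sketched on the host. The helper below is hypothetical and assumes that every output row holds exactly parameter_a.size() values stored back to back, which is also why the product parameter_a.size() * input.size() must stay within the column size limit. `list_layout` and `minhash_output_layout` are illustrative names, not cudf types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side picture of a list column: row r owns the values
// in positions [offsets[r], offsets[r + 1]) of one flat value buffer.
struct list_layout {
  std::vector<std::int32_t> offsets;
  std::size_t total_values;
};

// Assumption: each of the `rows` output rows has exactly `num_params` values.
// The caller must ensure rows * num_params fits in a 32-bit signed integer.
inline list_layout minhash_output_layout(std::int32_t rows, std::int32_t num_params)
{
  assert(rows >= 0 && num_params > 0);
  list_layout out;
  out.offsets.reserve(static_cast<std::size_t>(rows) + 1);
  for (std::int32_t r = 0; r <= rows; ++r) {
    out.offsets.push_back(r * num_params);
  }
  out.total_values = static_cast<std::size_t>(rows) * static_cast<std::size_t>(num_params);
  return out;
}
```

For 3 input rows and 4 parameter pairs this gives offsets 0, 4, 8, 12 and 12 values in total.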