K-mers/Minimizers
K-mers (also known as q-grams) are substrings of length k.
They allow convenient handling of a long string by splitting it into overlapping
substrings and performing analysis on these substrings.
Short k-mers can even be packed into a 64-bit integer, allowing quick comparison with other k-mers.
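To illustrate the packing idea, here is a minimal sketch that stores a DNA k-mer in a 64-bit integer using 2 bits per base. The helper `pack_kmer` and the rank order A=0, C=1, G=2, T=3 are assumptions made for this illustration; they are not part of the library's API. With 2 bits per base, at most 32 bases fit into 64 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper (not a library function): packs a DNA k-mer into a
// 64-bit integer using 2 bits per base, assuming A=0, C=1, G=2, T=3.
// Returns std::nullopt if the k-mer is longer than 32 bases or contains
// a character other than A, C, G, T.
std::optional<std::uint64_t> pack_kmer(std::string_view kmer)
{
    if (kmer.size() > 32) return std::nullopt; // 32 bases * 2 bits = 64 bits
    std::uint64_t value = 0;
    for (char c : kmer) {
        std::uint64_t rank = 0;
        switch (c) {
            case 'A': rank = 0; break;
            case 'C': rank = 1; break;
            case 'G': rank = 2; break;
            case 'T': rank = 3; break;
            default: return std::nullopt; // unsupported character
        }
        value = (value << 2) | rank; // append the base as the lowest 2 bits
    }
    return value;
}
```

Once packed, two k-mers of the same length can be compared with a single integer comparison, for example `pack_kmer("ACG") == pack_kmer("ACG")`.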
K-mers
template <alphabet_c Alphabet>
struct compact_encoding;
This structure creates a view over rank-based values and encodes k-mers
according to some Alphabet. If Alphabet supports complements,
this encoding yields the canonical k-mer.
It can be used with the usual range-based for loop syntax.
Example
// SPDX-FileCopyrightText: 2006-2023, Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2023, Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
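#include <ivsigma/ivsigma.h>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    // Convert the characters to their dna4 ranks
    std::vector<uint8_t> input = ivs::convert_char_to_rank<ivs::dna4>("GCGACGTAC");
    // Iterate over the encoded k-mers of length 3
    for (auto enc : ivs::compact_encoding<ivs::dna4>(input, /*._k=*/ 3)) {
        std::cout << enc << ' ';
    }
    std::cout << '\n';
}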
25 24 33 6 6 44 44 259
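The canonical encoding described above can be sketched independently of the library. The following is an illustration, not the ivsigma implementation: the function name `canonical_kmers` is hypothetical, and it assumes DNA ranks A=0, C=1, G=2, T=3 (so the complement of rank r is 3 - r) and that the canonical k-mer is the smaller of the forward k-mer and its reverse complement.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch, not the ivsigma implementation.
// Assumes ranks A=0, C=1, G=2, T=3 and 1 <= k <= 32.
std::vector<std::uint64_t> canonical_kmers(std::vector<std::uint8_t> const& ranks, std::size_t k)
{
    std::vector<std::uint64_t> result;
    if (k == 0 || k > 32 || ranks.size() < k) return result;

    // Mask that keeps the lowest 2*k bits of the forward encoding
    std::uint64_t const mask = (k == 32) ? ~std::uint64_t{0}
                                         : ((std::uint64_t{1} << (2 * k)) - 1);
    std::uint64_t fwd = 0; // encoding of the forward strand
    std::uint64_t rev = 0; // encoding of the reverse complement

    for (std::size_t i = 0; i < ranks.size(); ++i) {
        std::uint64_t const r = ranks[i] & 3u;
        // Forward: shift in the new base at the low end
        fwd = ((fwd << 2) | r) & mask;
        // Reverse complement: shift in the complemented base at the high end
        rev = (rev >> 2) | ((3 - r) << (2 * (k - 1)));
        // Emit once a full k-mer has been read
        if (i + 1 >= k) result.push_back(std::min(fwd, rev));
    }
    return result;
}
```

Under these assumptions, the ranks of GCGACGTAC (`{2, 1, 2, 0, 1, 2, 3, 0, 1}`) with k = 3 yield 25 24 33 6 6 44 44, one value per k-mer position. Both encodings are updated in constant time per base, so the whole pass is linear in the sequence length.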
Winnowing minimizers
Winnowing minimizers select a smaller subset of all k-mers that serves as a representative of the k-mers of a sequence.
Dealing with a smaller set of k-mers allows faster analysis of sequences. The winnowing_minimizer view works similarly to the compact_encoding
view.
Additionally, it takes a window size and picks the smallest value from each window.
Example
// SPDX-FileCopyrightText: 2006-2023, Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2023, Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
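#include <ivsigma/ivsigma.h>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    // Convert the characters to their dna4 ranks
    std::vector<uint8_t> input = ivs::convert_char_to_rank<ivs::dna4>("GCGACGTAC");
    // Iterate over the minimizers of k-mers of length 3, using a window of 2
    for (auto enc : ivs::winnowing_minimizer<ivs::dna4>(input, /*._k=*/ 3, /*._window=*/ 2)) {
        std::cout << enc << ' ';
    }
    std::cout << '\n';
}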
24 6 6 44
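The window selection can be sketched as a plain function over already encoded k-mers. This is an illustration under assumptions, not the library's implementation: the name `winnow` is hypothetical, the window is assumed to count consecutive k-mers, ties are resolved in favor of the leftmost k-mer, and a k-mer position is reported only once even if it is the minimum of several consecutive windows.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch, not the ivsigma implementation.
// Picks the smallest k-mer value from every window of `window` consecutive
// k-mers and reports each selected position only once.
std::vector<std::uint64_t> winnow(std::vector<std::uint64_t> const& kmers, std::size_t window)
{
    std::vector<std::uint64_t> result;
    if (window == 0 || kmers.size() < window) return result;

    std::size_t last_pos = kmers.size(); // sentinel: nothing reported yet
    for (std::size_t start = 0; start + window <= kmers.size(); ++start) {
        // Find the position of the smallest value in the current window
        std::size_t min_pos = start;
        for (std::size_t j = start + 1; j < start + window; ++j) {
            if (kmers[j] < kmers[min_pos]) min_pos = j; // ties keep the leftmost
        }
        // Report the minimizer only if its position changed
        if (min_pos != last_pos) {
            result.push_back(kmers[min_pos]);
            last_pos = min_pos;
        }
    }
    return result;
}
```

Applied to the k-mer values 25 24 33 6 6 44 44 with a window of 2, this sketch yields 24 6 6 44, which agrees with the example output above. The nested scan costs time proportional to the number of k-mers times the window size; it is kept simple here for readability.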