
K-mers/Minimizers

K-mers (also known as q-grams) are substrings of length k. They make long strings convenient to handle: the string is split into overlapping substrings, and the analysis is performed on these substrings.

Short k-mers can even be packed into a 64-bit integer, allowing quick comparison with other k-mers.
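To illustrate the packing idea, here is a minimal sketch that is independent of ivsigma. It assumes a DNA alphabet with the ranks A=0, C=1, G=2, T=3 and spends 2 bits per base, so up to 32 bases fit into one 64-bit integer. The names base_rank and pack_kmer are hypothetical and exist only for this example.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper, not part of ivsigma.
// Assumed ranks: A=0, C=1, G=2, T=3; returns -1 for any other character.
inline int base_rank(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Packs a k-mer of at most 32 bases into a single 64-bit integer (2 bits per base).
// Returns std::nullopt if the k-mer is too long or contains an unknown character.
inline std::optional<std::uint64_t> pack_kmer(std::string_view kmer) {
    if (kmer.size() > 32) return std::nullopt;
    std::uint64_t value = 0;
    for (char c : kmer) {
        int const r = base_rank(c);
        if (r < 0) return std::nullopt;
        value = (value << 2) | static_cast<std::uint64_t>(r);
    }
    return value;
}
```

Under these assumptions pack_kmer("ACG") yields 6 (binary 00 01 10), and two k-mers of the same length can be compared with a single integer comparison.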

K-mers

    template <alphabet_c Alphabet>
    struct compact_encoding;

This structure creates a view over rank-based values and encodes k-mers according to the given Alphabet. If the Alphabet supports complements, the encoding yields the canonical k-mer. The view can be used in a range-based for loop.

Example

// SPDX-FileCopyrightText: 2006-2023, Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2023, Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <ivsigma/ivsigma.h>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    std::vector<uint8_t> input = ivs::convert_char_to_rank<ivs::dna4>("GCGACGTAC");
    for (auto enc : ivs::compact_encoding<ivs::dna4>(input, /*._k=*/ 3)) {
        std::cout << enc << ' ';
    }
    std::cout << '\n';
}
Output:
25 24 33 6 6 44 44 259 
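The canonical encoding can be sketched without the library. This is not ivsigma's implementation. It assumes the ranks A=0, C=1, G=2, T=3 (so the complement of rank r is 3 - r), a 2-bit packing with the first base in the highest bits, and that "canonical" means the smaller of a k-mer and its reverse complement. The function name canonical_kmers is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not ivsigma's implementation.
// Assumes ranks A=0, C=1, G=2, T=3 and 1 <= k <= 32.
inline std::vector<std::uint64_t> canonical_kmers(std::vector<std::uint8_t> const& ranks,
                                                  std::size_t k) {
    std::vector<std::uint64_t> result;
    if (k == 0 || k > 32 || ranks.size() < k) return result;

    // Mask that keeps only the lowest 2*k bits.
    std::uint64_t const mask =
        (k == 32) ? ~std::uint64_t{0} : ((std::uint64_t{1} << (2 * k)) - 1);

    std::uint64_t fwd = 0; // k-mer on the forward strand
    std::uint64_t rev = 0; // reverse complement of that k-mer

    for (std::size_t i = 0; i < ranks.size(); ++i) {
        std::uint64_t const r = ranks[i] & 3u;
        fwd = ((fwd << 2) | r) & mask;                 // append the base at the low end
        rev = (rev >> 2) | ((3 - r) << (2 * (k - 1))); // prepend its complement at the high end
        if (i + 1 >= k) {
            result.push_back(std::min(fwd, rev));      // keep the smaller of the two
        }
    }
    return result;
}
```

With these assumptions, the seven 3-mers of GCGACGTAC (ranks 2 1 2 0 1 2 3 0 1) map to 25 24 33 6 6 44 44. Both values are updated in constant time per base, so the whole sequence is encoded in a single pass.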


Winnowing minimizers

Winnowing minimizers select a smaller subset of all k-mers that serves as a representative of the k-mers of a sequence. Working with fewer k-mers allows faster analysis of sequences. The winnowing_minimizer view works similarly to the compact_encoding view, but additionally takes a window size and picks the smallest value from each window.

Example

// SPDX-FileCopyrightText: 2006-2023, Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2023, Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <ivsigma/ivsigma.h>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    std::vector<uint8_t> input = ivs::convert_char_to_rank<ivs::dna4>("GCGACGTAC");
    for (auto enc : ivs::winnowing_minimizer<ivs::dna4>(input, /*._k=*/ 3, /*._window=*/ 2)) {
        std::cout << enc << ' ';
    }
    std::cout << '\n';
}
Output:
24 6 6 44
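The window selection described in this section can be sketched as follows. This is an illustration, not ivsigma's implementation. It assumes that the window size counts consecutive k-mer values, that ties are resolved in favour of the rightmost minimum, and that a value is emitted only when the selected position changes between windows. The function name winnow is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not ivsigma's implementation.
// 'values' holds already encoded k-mers; 'window' is the number of consecutive
// k-mers from which one minimum is selected.
inline std::vector<std::uint64_t> winnow(std::vector<std::uint64_t> const& values,
                                         std::size_t window) {
    std::vector<std::uint64_t> result;
    if (window == 0 || values.size() < window) return result;

    std::size_t last_pos = values.size(); // sentinel: nothing emitted yet
    for (std::size_t start = 0; start + window <= values.size(); ++start) {
        std::size_t best = start;
        for (std::size_t j = start + 1; j < start + window; ++j) {
            if (values[j] <= values[best]) best = j; // "<=" keeps the rightmost minimum on ties
        }
        if (best != last_pos) {                      // emit each selected position only once
            result.push_back(values[best]);
            last_pos = best;
        }
    }
    return result;
}
```

Applied to the values 25 24 33 6 6 44 44 with a window of 2, this sketch selects 24 6 6 44. The inner loop costs O(window) per position; a monotonic queue would bring that down to amortized constant time, at the price of more code.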