pyigd API
Module for reading and writing Indexable Genotype Data (IGD) files.
- class pyigd.BpPosFlags(*values)
Bases:
Enum
An enumeration of bitwise flags that describe the contents of each variant’s genotype data.
- IS_MISSING = 144115188075855872
- MASK = 18374686479671623680
- SPARSE = 72057594037927936
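Each flag value above is a single bit in the top byte of a 64-bit integer, so it can be tested with a bitwise AND. The following is a minimal sketch with the constants copied from the listing so that it runs without pyigd installed; the helper name `describe_flags` is hypothetical and not part of the library.

```python
# Constants copied from the pyigd.BpPosFlags listing above.
MASK = 18374686479671623680
SPARSE = 72057594037927936
IS_MISSING = 144115188075855872


def describe_flags(flags: int) -> dict:
    """Report which of the documented flag bits are set in `flags`."""
    return {
        "sparse": bool(flags & SPARSE),
        "is_missing": bool(flags & IS_MISSING),
    }


# Each flag is one bit inside the top byte covered by MASK.
assert MASK == 0xFF << 56
assert SPARSE == 1 << 56
assert IS_MISSING == 1 << 57

print(describe_flags(SPARSE | IS_MISSING))
```

With the library installed, the same test would presumably be written against `pyigd.BpPosFlags.IS_MISSING.value` or the `pyigd.flags_is_missing()` helper.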
- class pyigd.IGDConstants
Bases:
object
- FLAG_IS_PHASED = 1
- HEADER_FORMAT = 'QQIIQIIQQQQQQQQQQQ'
- HEADER_MAGIC = 4182841125617939585
- INDEX_ENTRY_BYTES = 16
- NUM_HEADER_BYTES = 128
- SUPPORTED_FILE_VERSION = 4
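`HEADER_FORMAT` reads as a format string for Python's standard `struct` module, and its packed size should agree with `NUM_HEADER_BYTES`. The sketch below checks that relationship with the two values copied from the listing. The `=` prefix (standard sizes, no padding) is an assumption made for portability; the byte order pyigd itself uses is not stated here.

```python
import struct

# Values copied from the pyigd.IGDConstants listing above.
HEADER_FORMAT = "QQIIQIIQQQQQQQQQQQ"
NUM_HEADER_BYTES = 128

# "=" selects standard sizes without alignment padding:
# Q is an 8-byte unsigned integer, I is a 4-byte unsigned integer.
header_size = struct.calcsize("=" + HEADER_FORMAT)
print(header_size, header_size == NUM_HEADER_BYTES)
```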
- class pyigd.IGDHeader(magic: int, version: int, ploidy: int, sparse_threshold: int, num_variants: int, num_individuals: int, reserved: int, flags: int, fp_index: int, fp_variants: int, fp_individualids: int, fp_variantids: int)
Bases:
object
- flags: int
- fp_index: int
- fp_individualids: int
- fp_variantids: int
- fp_variants: int
- magic: int
- num_individuals: int
- num_variants: int
- pack() bytes
- ploidy: int
- reserved: int
- sparse_threshold: int
- version: int
- class pyigd.IGDReader(file_obj: BinaryIO)
Bases:
object
Construct an IGDReader object for reading data from a file.
- Parameters:
file_obj – The file object to read from; should be opened in binary mode (“rb”).
- property description
Description of the IGD file.
- get_alt_allele(variant_idx: int) str
Get the alternative allele string for the given variant index.
- Parameters:
variant_idx – Variant index between 0…(num_variants-1). Variants are ordered from smallest to largest base-pair position.
- Returns:
The string representation of the alternate allele.
- get_individual_ids() List[str]
Get a list of identifiers for the individuals in this dataset. The 0th individual’s label is at list position 0, and the last individual is at list position (num_individuals-1).
- Returns:
An empty list if there are no identifiers (they are optional); otherwise a list of strings.
- get_position_and_flags(variant_idx: int) Tuple[int, int]
Given a variant index between 0…(num_variants-1), return the tuple (position, flags).
Much faster than get_samples or get_samples_bv because it only scans the variant index and does not read the actual genotype data.
- Parameters:
variant_idx – Variant index between 0…(num_variants-1). Variants are ordered from smallest to largest base-pair position.
- Returns:
The tuple (position, flags) where position is the base-pair position (integer) and flags is an integer that can be bitwise ANDed with BpPosFlags values.
- get_position_flags_copies(variant_idx: int) Tuple[int, int, int]
Given a variant index between 0…(num_variants-1), return the tuple (position, flags, num_copies).
Much faster than get_samples or get_samples_bv because it only scans the variant index and does not read the actual genotype data.
- Parameters:
variant_idx – Variant index between 0…(num_variants-1). Variants are ordered from smallest to largest base-pair position.
- Returns:
The tuple (position, flags, num_copies) where position is the base-pair position (integer), flags is an integer that can be bitwise ANDed with BpPosFlags values, and num_copies is an integer indicating how many copies of the alternate allele this variant represents (for unphased data only).
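Because this method reads only the variant index, it suits cheap whole-file summaries. As a sketch, the hypothetical helper below tallies variants by `num_copies`; it accepts any object exposing `num_variants` and `get_position_flags_copies()` as documented above.

```python
from collections import Counter


def copies_histogram(reader) -> Counter:
    """Count variants by num_copies using only the variant index."""
    counts = Counter()
    for idx in range(reader.num_variants):
        _position, _flags, num_copies = reader.get_position_flags_copies(idx)
        counts[num_copies] += 1
    return counts
```

The `num_copies` value is documented as meaningful for unphased data only, so the tally is probably not informative for phased files.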
- get_ref_allele(variant_idx: int) str
Get the reference allele string for the given variant index.
- Parameters:
variant_idx – Variant index between 0…(num_variants-1). Variants are ordered from smallest to largest base-pair position.
- Returns:
The string representation of the reference allele.
- get_samples(variant_idx: int) Tuple[int, bool, List[int]]
Given a variant index between 0…(num_variants-1), return the tuple (position, missing, samples) where position is the base-pair position, missing is True when this represents a row of missing data, and samples is a list of sample indexes that have the alt allele for the given variant. The number of copies of the alternate allele (for unphased data only) is not part of this tuple; it is returned by get_position_flags_copies().
When missing is True, the sample list contains samples that are missing the given variant.
- Parameters:
variant_idx – Variant index between 0…(num_variants-1). Variants are ordered from smallest to largest base-pair position.
- Returns:
The tuple (position, is_missing, samples).
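A common use of `get_samples()` is computing per-variant frequencies. The sketch below assumes phased data, so that `len(samples) / num_samples` is the alternate-allele frequency, and skips rows flagged as missing. Both function names are hypothetical; `pyigd` is imported lazily so the first helper can be exercised with any reader-like object.

```python
def alt_allele_frequencies(reader):
    """Yield (position, frequency) for each non-missing variant row."""
    total = reader.num_samples
    for idx in range(reader.num_variants):
        position, is_missing, samples = reader.get_samples(idx)
        if is_missing:
            continue  # the sample list holds missing samples, not carriers
        yield position, len(samples) / total


def print_frequencies(path: str) -> None:
    """Open an IGD file and print one frequency per variant."""
    import pyigd  # lazy import: only needed when reading a real file

    with open(path, "rb") as f:
        reader = pyigd.IGDReader(f)
        for position, freq in alt_allele_frequencies(reader):
            print(position, freq)
```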
- get_samples_bv(variant_idx: int)
Given a variant index between 0…(num_variants-1), return the tuple (position, missing, sample_bv) where position is the base-pair position, missing is True when this represents a row of missing data, and sample_bv is a bitvector representing samples that have the alt allele for the given variant.
When missing is True, the sample vector contains samples that are missing the given variant.
- Parameters:
variant_idx – Variant index between 0…(num_variants-1). Variants are ordered from smallest to largest base-pair position.
- Returns:
The tuple (position, is_missing, sample_bv).
- get_variant_ids() List[str]
Get a list of identifiers for the variants in this dataset. The 0th variant’s label is at list position 0, and the last variant is at list position (num_variants-1).
- Returns:
An empty list if there are no identifiers (they are optional); otherwise a list of strings.
- property is_phased
True if the data is phased. IGD doesn’t support mixed phasedness.
- lower_bound_position(position) int
Return the first variant index with position that is greater than or equal to the given position. Will return num_variants if the given position is greater than all positions in the IGD.
- Parameters:
position (int) – The position to search for.
- Returns:
The first variant index with position greater than or equal to the given position.
- Return type:
int
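Two lower-bound lookups are enough to select every variant in a half-open base-pair range. A minimal sketch, where the helper name `variant_indices_in_range` is hypothetical:

```python
def variant_indices_in_range(reader, start: int, end: int) -> range:
    """Variant indices whose position lies in the half-open range [start, end)."""
    first = reader.lower_bound_position(start)
    last = reader.lower_bound_position(end)
    # When no variant qualifies, first == last and the range is empty.
    return range(first, last)
```

Since `lower_bound_position()` returns `num_variants` for a position past the end of the file, the resulting range never runs past the last valid index.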
- property num_individuals
Number of individuals.
- property num_samples
Number of samples. For phased data, this is num_individuals * ploidy. For unphased data this is just num_individuals. Every returned sample index (from get_samples()) will be less than num_samples.
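The rule above can be written out directly. This sketch restates the documented relationship between the three properties; the function name is hypothetical.

```python
def expected_num_samples(num_individuals: int, ploidy: int, is_phased: bool) -> int:
    """num_samples as described above: haplotypes if phased, individuals if not."""
    return num_individuals * ploidy if is_phased else num_individuals


# 100 diploid individuals: 200 samples when phased, 100 when unphased.
print(expected_num_samples(100, 2, True), expected_num_samples(100, 2, False))
```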
- property num_variants
Number of variants in the file. This is not necessarily the same as the number of variants in the equivalent VCF file (for example), since IGD stores multi-allelic variants as multiple bi-allelic variants with the same base-pair position.
- property ploidy
Ploidy of each individual, between 1 and 8.
- property source
Source description of where the IGD file came from.
- property version
IGD file format version.
- class pyigd.IGDTransformer(in_stream: BinaryIO, out_stream: BinaryIO, use_bitvectors: bool = False)
Bases:
object
Class for transforming one IGD file to another.
- Parameters:
in_stream – The input stream for the input IGD file. Usually a file opened via mode “rb”.
out_stream – The output stream for the output IGD file. Usually a file opened via mode “wb”.
use_bitvectors – If True, the modify_samples callback will be invoked with a list of 1s and 0s, where position “i” being 1 means sample “i” has the alternate allele.
- modify_samples(position: int, is_missing: bool, samples: List[int], num_copies: int = 0) List[int] | None
- transform()
Transform the input file to the output file, invoking modify_samples() for every variant from the input file. If modify_samples() returns None then the variant will not be emitted to the output file. Otherwise the variant will be emitted with whatever sample list is returned from modify_samples().
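The description of `transform()` suggests that a transformation is defined by subclassing `IGDTransformer` and overriding `modify_samples()`. The sketch below drops variants carried by fewer than `min_count` samples. It assumes `use_bitvectors=False`, so that `samples` is a list of sample indexes. The names `keep_if_common`, `drop_rare_variants` and `DropRare` are hypothetical.

```python
from typing import List, Optional


def keep_if_common(samples: List[int], min_count: int) -> Optional[List[int]]:
    """Return samples unchanged if enough carry the allele, else None."""
    return samples if len(samples) >= min_count else None


def drop_rare_variants(in_path: str, out_path: str, min_count: int = 2) -> None:
    """Copy an IGD file, omitting variants with fewer than min_count carriers."""
    import pyigd  # lazy import: only needed when transforming a real file

    class DropRare(pyigd.IGDTransformer):
        def modify_samples(self, position, is_missing, samples, num_copies=0):
            if is_missing:
                return samples  # keep missing-data rows untouched
            # Returning None tells transform() not to emit this variant.
            return keep_if_common(samples, min_count)

    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        DropRare(fin, fout).transform()
```

Keeping the filtering rule in a plain function makes it testable without constructing any IGD files.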
- class pyigd.IGDWriter(out_stream: BinaryIO, individuals: int, ploidy: int = 2, phased: bool = True, source: str = '', description: str = '', sparse_threshold: int = 32)
Bases:
object
Construct an IGDWriter for a given output stream.
- Parameters:
out_stream – The output stream to write to; usually a file opened via mode “wb”.
individuals – The number of individual samples in the file. NOT the number of haploids unless ploidy=1.
ploidy – The ploidy of each individual sample.
phased – Whether the data being stored is phased.
source – A string describing where the data came from.
description – A string describing the contents of the file.
sparse_threshold – The threshold for choosing between sparse and dense sample lists when writing variant data to the file. Default is 32, which means that we still store variants sparsely if their frequency is less than or equal to 1/32. This is the threshold that is theoretically the break-even point between sparse and dense representations (since the sparse representation uses 32-bit integers, and dense uses a bit per sample).
- write_header()
Write the file header to the current output buffer position. Fails if that position is not the start of the buffer.
- write_index()
Write the variant position index. Must be called _after_ all calls to write_variant().
- write_individual_ids(labels: List[str])
Write the identifiers for the individual samples (optional).
- Parameters:
labels – Empty list or a list of strings, one for each individual.
- write_variant(position: int, ref_allele: str, alt_allele: str, samples: List[int], is_missing: bool = False, num_copies: int = 0)
Write the next variant, including sample information, to the file. Variants must be written in ascending order of their base pair position.
- Parameters:
position – Base-pair position.
ref_allele – The reference allele.
alt_allele – The alternate allele.
samples – The list of samples, as indexes. E.g. the list [4, 10] means that samples numbered 4 and 10 have this variant’s alternate allele. This list must be in ascending order.
is_missing – [Optional] Set to true if the sample list represents the list of samples that are missing allele values at this position (in which case the reference and alt allele are somewhat irrelevant).
num_copies – [Optional] For unphased data, set to the number of copies of the alternate allele that this variant represents (1…ploidy).
- write_variant_ids(labels: List[str])
Write the identifiers for the variants (optional).
- Parameters:
labels – Empty list or a list of strings, one for each variant. That is, if you called write_variant() X times, then there should be X entries in this list.
- write_variant_info()
Write the variant information table. Must be called _after_ all calls to write_variant().
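The ordering constraints on the writer methods can be summarised as one call sequence. The sketch below drives any writer-like object through that sequence. The closing `seek(0)` followed by a second `write_header()` is an assumption about how the header's file offsets get finalized, so it should be checked against pyigd's own examples. Function names and the sample data are hypothetical.

```python
def write_small_igd(writer, out_stream, variants) -> None:
    """Drive an IGDWriter-like object through the documented call order.

    `variants` is an iterable of (position, ref, alt, samples) tuples,
    already sorted by ascending base-pair position.
    """
    writer.write_header()
    for position, ref, alt, samples in variants:
        # The sample list must be in ascending order.
        writer.write_variant(position, ref, alt, sorted(samples))
    writer.write_index()          # after all write_variant() calls
    writer.write_variant_info()   # after all write_variant() calls
    writer.write_individual_ids([])  # optional; empty list means no labels
    writer.write_variant_ids([])     # optional
    # Assumption: rewinding and rewriting the header records the final
    # section offsets. Confirm against the library's own examples.
    out_stream.seek(0)
    writer.write_header()


def write_example_file(path: str) -> None:
    """Write two made-up variants for three phased diploid individuals."""
    import pyigd  # lazy import: only needed when writing a real file

    with open(path, "wb") as f:
        writer = pyigd.IGDWriter(f, individuals=3, ploidy=2, phased=True)
        write_small_igd(writer, f, [(100, "A", "G", [0, 5]), (250, "C", "T", [2])])
```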
- pyigd.flags_is_missing(flags: int)
Returns true if the flags specify that the variant represents missing data.
- Parameters:
flags – The flags, e.g. as returned from get_position_and_flags
Extra functionality that is not core to accessing an IGD, but helpful for manipulating the information in/related to an IGD.
- pyigd.extra.collect_next_site(igd_reader, variant_index: int) List[int]
Given a variant index to start at, iterate all consecutive variants that have the same position and return the variant indices for that position. The returned indices are in ascending order.
- Parameters:
igd_reader (IGDReader) – The IGDReader representing the IGD file.
variant_index (int) – The variant index to start scanning from.
- Returns:
The list of subsequent variant indexes that all share the same position.
- Return type:
List[int]
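The behaviour described for `collect_next_site()` can be pictured with a sketch built only on `get_position_and_flags()`. This illustrates the documented semantics and is not the library's implementation; `collect_site` and `iterate_sites` are hypothetical names.

```python
from typing import Iterator, List


def collect_site(reader, start: int) -> List[int]:
    """Indices of consecutive variants sharing the position found at `start`."""
    if start >= reader.num_variants:
        return []
    site_position, _flags = reader.get_position_and_flags(start)
    indices = []
    idx = start
    while idx < reader.num_variants:
        position, _flags = reader.get_position_and_flags(idx)
        if position != site_position:
            break
        indices.append(idx)
        idx += 1
    return indices


def iterate_sites(reader) -> Iterator[List[int]]:
    """Yield one list of variant indices per distinct base-pair position."""
    idx = 0
    while idx < reader.num_variants:
        site = collect_site(reader, idx)
        yield site
        idx = site[-1] + 1  # continue after the last variant of this site
```

Grouping by site like this is how the bi-allelic rows of a multi-allelic variant, which share one position, can be handled together.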
- pyigd.extra.get_inverse_sample_list(igd_reader, variant_indices: List[int]) List[List[int]]
Given a list of variant indices, compute the set of samples that are covered by those indices, and then invert that list. Works for both phased and unphased data. One usage for this function is to get an explicit list of samples that have the reference, since the reference is stored implicitly in an IGD.
If the input data has overlapping variants that add up to more than PLOIDY for some samples, then we return an empty list. In real data this can happen at sites containing indels/SVs, depending on how the dataset was created. If you don’t want this behavior (skipping the whole site) then you must filter out these problematic variants in the IGD file.
- Parameters:
igd_reader (IGDReader) – The IGDReader representing the IGD file.
variant_indices (List[int]) – The indices of the variants whose total sample set (i.e., all of them unioned together) you want the inverse of.
- Returns:
A list of sample lists, one for each number of copies between 1…PLOIDY. For phased data, there will always be a single sample list. For unphased data, there will always be PLOIDY sample lists.
- Return type:
List[List[int]]
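For phased data, the inversion amounts to a set difference against all sample indexes. The sketch below covers only that phased case and does not reproduce the library's handling of unphased copies or of over-full sites. It also treats samples listed in missing-data rows as covered, which is an assumption. The function name is hypothetical.

```python
from typing import List


def reference_samples_phased(reader, variant_indices: List[int]) -> List[int]:
    """Samples that appear in none of the given variants (phased data only)."""
    covered = set()
    for idx in variant_indices:
        _position, _is_missing, samples = reader.get_samples(idx)
        covered.update(samples)
    # Everything not covered implicitly carries the reference allele.
    return [s for s in range(reader.num_samples) if s not in covered]
```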