Example Usage
IGD Traversal
The two core concepts for IGD traversal are the IGD index and the IGD genotype data. The
IGD index is a contiguous file region that only contains the information about position
and flags for each variant, so it can be scanned very quickly. The IGD genotype data has
all the sample information for each variant, and is therefore slower to access. The IGD
Index is accessed via pyigd.IGDReader.get_position_and_flags() or
pyigd.IGDReader.get_position_flags_copies(). The genotype data is accessed via
pyigd.IGDReader.get_samples() or pyigd.IGDReader.get_samples_bv().
Traverse while skipping missing data
Missing data can be identified during traversal of the IGD index or IGD genotype data. We show the latter first:
import pyigd
with open("myfile.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
for variant_index in range(igd_file.num_variants):
position, is_missing, sample_list = igd_file.get_samples(variant_index)
if not is_missing:
pass # Do something with sample_list
If you expect there to be a lot of missing data, the above can be slightly slow, as you are retrieving the entire sample list for each variant with missing alleles. Here is the more efficient IGD index version:
import pyigd
with open("myfile.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
for variant_index in range(igd_file.num_variants):
position, flags = igd_file.get_position_and_flags(variant_index)
if not pyigd.flags_is_missing(flags):
_, _, sample_list = igd_file.get_samples(variant_index)
# Do something with sample_list
Variants in a specific range
Similar to the missing data example above, variants in a specific range can be efficiently found by scanning the IGD index:
import pyigd
my_range = (5_000_000, 10_000_000) # 5MBP to 10MBP
with open("myfile.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
for variant_index in range(igd_file.num_variants):
position, flags = igd_file.get_position_and_flags(variant_index)
if position >= my_range[0] and position < my_range[1]:
_, _, sample_list = igd_file.get_samples(variant_index)
# Do something with sample_list
Or even more efficient is to use pyigd.IGDReader.lower_bound_position() to binary search for
first position greater-than-or-equal-to the start of the range, and then traverse until the end.
import pyigd
my_range = (5_000_000, 10_000_000) # 5MBP to 10MBP
with open("myfile.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
variant_index = igd_file.lower_bound_position(my_range[0])
while variant_index < igd_file.num_variants:
position, flags = igd_file.get_position_and_flags(variant_index)
if position >= my_range[1]:
break
_, _, sample_list = igd_file.get_samples(variant_index)
# Do something with sample_list
Runs of homozygosity
Phased and unphased data are both stored in IGD with a row being equal to an alternate allele at a particular
polymorphic site (i.e., a variant). However, unphased data may have multiple rows for each such variant, one
for each number of copies that the unphased individual has. For example, for diploid individuals there may be
up to two rows for each variant: one row for the scenario where an individual has a single copy of the
alternate allele, and one row for the scenario where an individual has two copies of the alternate allele.
This is stored on the IGD index, as the num_copies field, and can be accessed via the
pyigd.IGDReader.get_position_flags_copies() method.
Whereas for phased data each sample is a haplotype, for unphased data each sample is an individual. That is,
igd_file.num_samples == igd_file.num_individuals.
Here is an example that finds runs of homozygosity beyond some given threshold, by traversing the IGD index:
import pyigd
THRESHOLD = 500_000 # Only report ROH exceeding 500kbp
with open("myfile.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
assert not igd_file.is_phased, "This example is only for unphased data"
assert igd_file.ploidy == 2, "This example is for diploids only"
last_het_site_per_idv = [0 for _ in range(igd_file.num_individuals)]
print(f"INDIV\tSTART_BP\tEND_BP")
for variant_index in range(igd_file.num_variants):
position, flags, num_copies = igd_file.get_position_flags_copies(variant_index)
# Homozygous occurs when the number of copies of the alt allele are equal to the
# organism ploidy. We check all heterozygous cases as they "break" the ROH, so
# if we break an ROH that exceeds the threshold we emit it.
if num_copies < igd_file.ploidy:
_, is_missing, sample_list = igd_file.get_samples(variant_index)
assert not is_missing, "Missing data not handled in this example"
for indiv in sample_list:
hom_span = position - last_het_site_per_idv[indiv]
if hom_span >= THRESHOLD:
print(f"{indiv}\t{last_het_site_per_idv[indiv]+1}\t{position-1}")
last_het_site_per_idv[indiv] = position
# The last ROH may have gone to the end of the chromosome, so we check for those.
for indiv in range(igd_file.num_individuals):
hom_span = position - last_het_site_per_idv[indiv]
if hom_span >= THRESHOLD:
print(f"{indiv}\t{last_het_site_per_idv[indiv]+1}\t{position-1}")
Print zygosity counts for each variant
Here is a similar example, but instead of finding runs of homozygosity we are simply printing out the zygosity information for each variant.
import pyigd
with open("myfile.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
assert not igd_file.is_phased, "This example is only for unphased data"
assert igd_file.ploidy == 2, "This example is for diploids only"
print(f"POSITION\tREF\tALT\tAA\tAa\taa")
for variant_index in range(igd_file.num_variants):
he_count = 0
hz_count = 0
position, is_missing, bitvect = igd_file.get_samples_bv(variant_index)
for indiv in range(igd_file.num_individuals):
sample0 = indiv * 2
sample1 = sample0 + 1
if bitvect[sample0] and bitvect[sample1]:
hz_count += 1
elif bitvect[sample0] or bitvect[sample1]:
he_count += 1
ref = igd_file.get_ref_allele(variant_index)
alt = igd_file.get_alt_allele(variant_index)
print(f"{position}\t{ref}\t{alt}\t{igd_file.num_individuals - (hz_count+he_count)}\t{he_count}\t{hz_count}")
Find common positions between two IGD files
Here is an example where we compare two IGD files, to see if the polymorphic sites they contain are the same:
import pyigd
unique_pos1 = set()
with open("file1.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
for variant_index in range(igd_file.num_variants):
position, flags, num_copies = igd_file.get_position_flags_copies(variant_index)
unique_pos1.add(position)
unique_pos2 = set()
with open("file2.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
for variant_index in range(igd_file.num_variants):
position, flags, num_copies = igd_file.get_position_flags_copies(variant_index)
unique_pos2.add(position)
extra1 = unique_pos1 - unique_pos2
extra2 = unique_pos2 - unique_pos1
if len(extra1) != 0:
print(f"File1 has {len(extra1)} extra positions")
if len(extra2) != 0:
print(f"File2 has {len(extra2)} extra positions")
elif len(extra1) == len(extra2):
print("Positions are identical")
Traverse by site instead of variant
Use the pyigd.extra.collect_next_site() method to more easily traverse the data by site instead of
by variant. This method collects all the variant indices for the next site into a list.
import pyigd
from pyigd.extra import collect_next_site
with open("myfile.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
next_index = 0
while next_index < igd_file.num_variants:
variant_indices = collect_next_site(igd_file, next_index)
# The indices are ordered, so the next time we iterate we start at the last index + 1
next_index = variant_indices[-1] + 1
# Count the number of samples that have _any alternate allele_ at the site:
alt_count = 0
for index in variant_indices:
position, is_missing, samples = igd_file.get_samples(index)
alt_count += len(samples)
print(f"At site {position} there are {len(variant_indices)} variants and {alt_count} total alternate alleles")
Polarize during traversal
This example uses the pyfaidx package for reading FASTA files. We assume you have an input ancestral sequence in FASTA format (like the ones in ENSEMBL release v112)
You can use pyigd.extra.collect_next_site() method to traverse the data by site and then
pyigd.extra.get_inverse_sample_list() to flip the reference allele (if needed).
import pyigd
import pyfaidx
from pyigd.extra import collect_next_site, get_inverse_sample_list
fasta_reader = pyfaidx.Fasta("ancestral.fa")
assert len(fasta_reader.values()) == 1
# Assumes FASTA is 1-based, but we need 0-based indexing.
ancestral_str = ("X" + str(list(fasta_reader.values())[0])).upper()
with open("myfile.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
while variant_index < igd_file.num_variants:
# Collect all the variant indices for the site.
site_indices = collect_next_site(igd_file, variant_index)
variant_index = site_indices[-1] + 1
# Get our position and see if there is an ancestral allele.
site_position, _, _ = igd_file.get_position_flags_copies(variant_index)
if site_position < len(ancestral_str):
flip_ref = ancestral_str[site_position]
if flip_ref in "ACTG":
# Get all the alleles
refs = set(map(lambda i: igd_file.get_ref_allele(i), site_indices))
alts = list(map(lambda i: igd_file.get_alt_allele(i), site_indices))
if len(refs) > 1:
print(f"WARNING: Multiple REF alleles at position {site_position}: {refs}. Skipping site.")
else:
old_ref = list(refs)[0]
if flip_ref == old_ref:
pass # Nothing to do. REF already matches the ancestral allele.
elif flip_ref in alts:
# The old reference samples become a new alternate allele sample list.
old_ref_samples = get_inverse_sample_list(igd_file, site_indices)
else:
pass # Skipped because the ancestral allele was not found in our alternate alleles
IGD Transformation
pyigd.IGDWriter() can create arbitrary IGD files, but often we want to just transform one
IGD file into another. Using this type of transformation for filtering is often involved in
bioinformatics pipelines, though that is not the primary use-case target of IGD. Developers of
statistical genetics and population genetics tools often find it useful to manipulate incoming
data for testing and evaluation purposes. For example, adding noise to a dataset to simulate
genotyping error is sometimes used during evaluation, to make the otherwise “clean” simulated
data more realistic. For this reason, we have pyigd.IGDTransformer(), which makes
simple modification of an existing IGD file very easy.
Filter out low-frequency variants
Here we transform “file1.igd” into “file2.igd” by removing all of the variants with minor allele
frequency less than 0.01. NOTE: this does not filter out the entire site, so if you have multi-
allelic sites there may still be another alternate allele at the site with a frequency exceeding
0.01.
import pyigd
class RemoveLF(pyigd.IGDTransformer):
def modify_samples(self, position, is_missing, samples, num_copies):
frequency = len(samples) / self.reader.num_samples
if frequency < 0.01:
return None # None means "delete this variant"
# Otherwise, return the sample list unmodified
return samples
with open("file1.igd", "rb") as fin, open("file2.igd", "wb") as fout:
xformer = RemoveLF(fin, fout, use_bitvectors=False)
xformer.transform()
Filter out variants outside of range
Here we transform “file1.igd” into “file2.igd” by only keeping variants within a specified base- pair range.
import pyigd
my_range = (5_000_000, 10_000_000) # 5MBP to 10MBP
class KeepRange(pyigd.IGDTransformer):
def modify_samples(self, position, is_missing, samples, num_copies):
if position >= my_range[0] and position < my_range[1]:
return samples
return None # Delete the entire variant
with open("file1.igd", "rb") as fin, open("file2.igd", "wb") as fout:
xformer = KeepRange(fin, fout, use_bitvectors=False)
xformer.transform()
IGD Creation
IGD creation is done with pyigd.IGDWriter().
Writing a genotype matrix
Usually IGD would be useful when you have large datasets, that cannot be loaded into RAM, but sometimes it is useful to convert a genotype matrix to an IGD file (e.g. for testing, learning about IGD, etc.). Assume we have a matrix \(X\) where each of the \(N\) rows represents a (haploid) sample and each of the \(M\) columns represents a bi-allelic variant. Then we can write this matrix to IGD:
# Matrix is NxM (N = haplotypes, M = mutations)
def igd_from_matrix(genotype_matrix: numpy.typing.NDArray, filename: str):
with open(filename, "wb") as fout:
ploidy = 1
num_individuals = genotype_matrix.shape[0]
writer = pyigd.IGDWriter(fout, num_individuals, ploidy)
# We have to write the header at the start of the file, to "hold its place",
# even though it doesn't have all the information yet.
writer.write_header()
# Write each variant as described in the matrix
for col in range(genotype_matrix.shape[1]):
sample_list = numpy.flatnonzero(genotype_matrix[:, col])
writer.write_variant(col, "0", "1", sample_list)
# Write the IGD index to the file.
writer.write_index()
# Write the variant information (allele strings, mostly)
writer.write_variant_info()
# Now we seek back to the beginning of the file and write the header again, since
# it has been updated with all of the variant information above.
writer.out.seek(0)
writer.write_header()
The above is the common pattern for writing an IGD file. Often, people just want to convert from vcf(.gz) or BGEN, which can be done more efficiently with igdtools or bgen2igd, respectively.
IGDWriter can also write identifiers for variants and individuals, via the
pyigd.IGDWriter.write_variant_ids() and pyigd.IGDWriter.write_individual_ids()
methods.
Note: the above does a column-wise traversal of the numpy matrix, but it may be more efficient to first transpose the genotype matrix and then do a row-wise traversal.
Metadata
The metadata is stored separately from the IGD file, in a *.meta/ directory. Each file contains
one metadata array associated with the variants in the IGD file. numpy.loadtxt
can be used to load the data, and each element in the resulting (1-dimensional) numpy.array is
the metadata item for that variant index in the IGD file.
For example, consider variant with index 0 <= j < igd_file.num_variants. We can use pyigd.IGDReader.get_position_and_flags()
to get the position of variant j, and given a metadata vector meta (as loaded by numpy.loadtxt) we can
get that metadata for variant j via meta[j].
This is all pretty simple until you modify the IGD file after the metadata has been exported, e.g.
by filtering out some variants. Below is an example that shows you how to match up the variant IDs from
the newly filtered IGD file to the original metadata (which always stores the variant IDs in variants.txt).
This example assumes that you converted a VCF file containing the AC (allele counts) field from the
INFO column.
import numpy
import pyigd
# The original IGD file was called "original.igd", and it was created via: igdtools some.vcf.gz -o original.igd -e all
original_ids = numpy.loadtxt("original.meta/variants.txt", dtype=str)
original_ac = numpy.loadtxt("original.meta/info.AC.txt", dtype=int)
assert len(original_ids) == len(original_ac)
# Now filter "original.igd" to remove all odd-numbered base-pair positions. This is just a silly example of filtering,
# to illustrate how metadata is handled after filtering.
class OddXformer(pyigd.IGDTransformer):
def modify_samples(self, position, is_missing, samples, num_copies):
if position % 2 == 1:
return None
return samples
with open("original.igd", "rb") as fin, open("filtered.igd", "wb") as fout:
xformer = OddXformer(fin, fout)
xformer.transform()
# Now open our filtered IGD file, and print out the variant positions plus the corresponding AC value from the metadata
with open("filtered.igd", "rb") as f:
igd_file = pyigd.IGDReader(f)
# The variant IDs that we still have after filtering
remaining_ids = numpy.array(igd_file.get_variant_ids())
# Find the indices for the intersection of the variant IDs between the metadata and the filtered IGD.
_, metadata_indices, remaining_indices = numpy.intersect1d(original_ids, remaining_ids, return_indices=True)
assert len(metadata_indices) == igd_file.num_variants
# Sort according to the indices in the filtered IGD file. This way we can just use metadata_indices directly,
# where metadata_indices[0] is the index into the metadata for the 0th variant in filtered.igd, etc.
metadata_indices = metadata_indices[numpy.argsort(remaining_indices)]
# Traverse the filtered variants and their metadata indices together
for variant_index, metadata_index in zip(range(igd_file.num_variants), metadata_indices):
assert remaining_ids[variant_index] == original_ids[metadata_index]
pos, flags = igd_file.get_position_and_flags(variant_index)
print(f"Position: {pos}, AC={original_ac[metadata_index]}")