Introduction
Pysam is a python module that makes it easy to read and manipulate mapped short read sequence data stored in SAM/BAM files. It is a lightweight wrapper of the htslib C-API.
This page provides a quick introduction in using pysam followed by the API. See Working with BAM/CRAM/SAM-formatted files for more detailed usage instructions.
To use the module to read a file in BAM format, create a
AlignmentFile object:
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
Once a file is opened you can iterate over all of the reads mapping to
a specified region using fetch(). Each
iteration returns a AlignedSegment object which
represents a single read along with its fields and optional tags:
for read in samfile.fetch('chr1', 100, 120):
print(read)
samfile.close()
To give:
EAS56_57:6:190:289:82 0 99 <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; 69 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA 0 192 1
EAS56_57:6:190:289:82 0 99 <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; 137 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC 73 64 1
EAS51_64:3:190:727:308 0 102 <<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<844 99 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG 99 18 1
...
You can also write to a AlignmentFile:
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
pairedreads = pysam.AlignmentFile("allpaired.bam", "wb", template=samfile)
for read in samfile.fetch():
if read.is_paired:
pairedreads.write(read)
pairedreads.close()
samfile.close()
An alternative way of accessing the data in a SAM file is by iterating
over each base of a specified region using the
pileup() method. Each iteration returns a
PileupColumn which represents all the reads in the SAM
file that map to a single base in the reference sequence. The list of
reads are represented as PileupRead objects in the
PileupColumn.pileups property:
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb" )
for pileupcolumn in samfile.pileup("chr1", 100, 120):
print("\ncoverage at base %s = %s" % (pileupcolumn.pos, pileupcolumn.n))
for pileupread in pileupcolumn.pileups:
if not pileupread.is_del and not pileupread.is_refskip:
# query position is None if is_del or is_refskip is set.
print('\tbase in read %s = %s' %
(pileupread.alignment.query_name,
pileupread.alignment.query_sequence[pileupread.query_position]))
samfile.close()
The above code outputs:
coverage at base 99 = 1
base in read EAS56_57:6:190:289:82 = A
coverage at base 100 = 1
base in read EAS56_57:6:190:289:82 = G
coverage at base 101 = 1
base in read EAS56_57:6:190:289:82 = G
coverage at base 102 = 2
base in read EAS56_57:6:190:289:82 = G
base in read EAS51_64:3:190:727:308 = G
...
Commands available in samtools are available as simple function calls. For example:
pysam.sort("-o", "output.bam", "ex1.bam")
corresponds to the command line:
samtools sort -o output.bam ex1.bam
Analogous to AlignmentFile, a
TabixFile allows fast random access to compressed and
tabix indexed tab-separated file formats with genomic data:
import pysam
tabixfile = pysam.TabixFile("example.gtf.gz")
for gtf in tabixfile.fetch("chr1", 1000, 2000):
print(gtf.contig, gtf.start, gtf.end, gtf.gene_id)
TabixFile implements lazy parsing in order to iterate
over large tables efficiently.
More detailed usage instructions are available at Working with BAM/CRAM/SAM-formatted files.
Note
Coordinates in pysam are always 0-based (following the python convention). SAM text files use 1-based coordinates.
Note
- The above examples can be run in the
testsdirectory of the installation directory. Type ‘make’ before running them.
See also
https://github.com/pysam-developers/pysam
The pysam code repository, containing source code and download instructions
http://pysam.readthedocs.org/en/latest/
The pysam website containing documentation
API
SAM/BAM/CRAM files
Objects of type AlignmentFile allow working with
BAM/SAM formatted files.
- class pysam.AlignmentFile
AlignmentFile(filepath_or_object, mode=None, template=None, reference_names=None, reference_lengths=None, text=None, header=None, add_sq_text=True, add_sam_header=True, check_header=True, check_sq=True, reference_filename=None, filename=None, index_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=True, ignore_truncation=False, format_options=None, threads=1)
A SAM/BAM/CRAM formatted file.
If filepath_or_object is a string, the file is automatically opened. If filepath_or_object is a python File object, the already opened file will be used.
If the file is opened for reading and an index exists (if file is BAM, a .bai file or if CRAM a .crai file), it will be opened automatically. index_filename may be specified explicitly. If the index is not named in the standard manner, not located in the same directory as the BAM/CRAM file, or is remote. Without an index, random access via
fetch()andpileup()is disabled.For writing, the header of a SAM file/BAM file can be constituted from several sources (see also the samtools format specification):
If template is given, the header is copied from another AlignmentFile (template must be a
AlignmentFile).If header is given, the header is built from a multi-level dictionary.
If text is given, new header text is copied from raw text.
The names (reference_names) and lengths (reference_lengths) are supplied directly as lists.
When reading or writing a CRAM file, the filename of a FASTA-formatted reference can be specified with reference_filename.
By default, if a file is opened in mode ‘r’, it is checked for a valid header (check_header = True) and a definition of chromosome names (check_sq = True).
- Parameters:
mode (string) –
mode should be
rfor reading orwfor writing. The default is text mode (SAM). For binary (BAM) I/O you should appendbfor compressed orufor uncompressed BAM output. Usehto output header information in text (TAM) mode. Usecfor CRAM formatted files.If
bis present, it must immediately followrorw. Valid modes arer,w,wh,rb,wb,wbu,wb0,rcandwc. For instance, to open a BAM formatted file for reading, type:f = pysam.AlignmentFile('ex1.bam','rb')
If mode is not specified, the method will try to auto-detect in the order ‘rb’, ‘r’, thus both the following should work:
f1 = pysam.AlignmentFile('ex1.bam') f2 = pysam.AlignmentFile('ex1.sam')
template (AlignmentFile) – when writing, copy header from file template.
header (dict or AlignmentHeader) – when writing, build header from a multi-level dictionary. The first level are the four types (‘HD’, ‘SQ’, …). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order. Alternatively, an
AlignmentHeaderobject can be passed directly.text (string) – when writing, use the string provided as the header
reference_names (list) – see reference_lengths
reference_lengths (list) – when writing or opening a SAM file without header build header from list of chromosome names and lengths. By default, ‘SQ’ and ‘LN’ tags will be added to the header text. This option can be changed by unsetting the flag add_sq_text.
add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permits construction SAM formatted files without a header.
add_sam_header (bool) – when outputting SAM the default is to output a header. This is equivalent to opening the file in ‘wh’ mode. If this option is set to False, no header will be output. To read such a file, set check_header=False.
check_header (bool) – obsolete: when reading a SAM file, check if header is present (default=True)
check_sq (bool) – when reading, check if SQ entries are present in header (default=True)
reference_filename (string) – Path to a FASTA-formatted reference file. Valid only for CRAM files. When reading a CRAM file, this overrides both
$REF_PATHand the URL specified in the header (URtag), which are normally used to find the reference.index_filename (string) – Explicit path to the index file. Only needed if the index is not named in the standard manner, not located in the same directory as the BAM/CRAM file, or is remote. An IOError is raised if the index cannot be found or is invalid.
filepath_index (string) – Alias for index_filename.
require_index (bool) – When reading, require that an index file is present and is valid or raise an IOError. (default=False)
filename (string) – Alternative to filepath_or_object. Filename of the file to be opened.
duplicate_filehandle (bool) – By default, file handles passed either directly or through File-like objects will be duplicated before passing them to htslib. The duplication prevents issues where the same stream will be closed by htslib and through destruction of the high-level python object. Set to False to turn off duplication.
ignore_truncation (bool) – Issue a warning, instead of raising an error if the current file appears to be truncated due to a missing EOF marker. Only applies to bgzipped formats. (Default=False)
format_options (list) – A list of
key=valuestrings, as accepted by samtools’s global fmt-options.threads (integer) – Number of threads to use for compressing/decompressing BAM/CRAM files. Setting threads to > 1 cannot be combined with ignore_truncation. (Default=1)
- check_index(self)
return True if index is present.
- Raises:
AttributeError – if htsfile is SAM formatted and thus has no index.
ValueError – if htsfile is closed or index could not be opened.
- close(self)
closes the
pysam.AlignmentFile.
- count(self, contig=None, start=None, stop=None, region=None, until_eof=False, read_callback='nofilter', reference=None, end=None)
count the number of reads in region
The region is specified by contig, start and stop. reference and end are also accepted for backward compatibility as synonyms for contig and stop, respectively. Alternatively, a samtools region string can be supplied.
A SAM file does not allow random access and if region or contig are given, an exception is raised.
- Parameters:
contig (string) – reference_name of the genomic region (chromosome)
start (int) – start of the genomic region (0-based inclusive)
stop (int) – end of the genomic region (0-based exclusive)
region (string) – a region string in samtools format.
until_eof (bool) – count until the end of the file, possibly including unmapped reads as well.
read_callback (string or function) –
select a call-back to ignore reads when counting. It can be either a string with the following values:
allskip reads in which any of the following flags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
nofilteruses every single read
Alternatively, read_callback can be a function
check_read(read)that should return True only for those reads that shall be included in the counting.reference (string) – backward compatible synonym for contig
end (int) – backward compatible synonym for stop
- Raises:
ValueError – if the genomic coordinates are out of range or invalid.
- count_coverage(self, contig, start=None, stop=None, region=None, quality_threshold=15, read_callback='all', reference=None, end=None)
count the coverage of genomic positions by reads in region.
The region is specified by contig, start and stop. reference and end are also accepted for backward compatibility as synonyms for contig and stop, respectively. Alternatively, a samtools region string can be supplied. The coverage is computed per-base [ACGT].
- Parameters:
contig (string) – reference_name of the genomic region (chromosome)
start (int) – start of the genomic region (0-based inclusive). If not given, count from the start of the chromosome.
stop (int) – end of the genomic region (0-based exclusive). If not given, count to the end of the chromosome.
region (string) – a region string.
quality_threshold (int) – quality_threshold is the minimum quality score (in phred) a base has to reach to be counted.
read_callback (string or function) –
select a call-back to ignore reads when counting. It can be either a string with the following values:
allskip reads in which any of the following flags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
nofilteruses every single read
Alternatively, read_callback can be a function
check_read(read)that should return True only for those reads that shall be included in the counting.reference (string) – backward compatible synonym for contig
end (int) – backward compatible synonym for stop
- Raises:
ValueError – if the genomic coordinates are out of range or invalid.
- Returns:
four array.arrays of the same length in order A C G T
- Return type:
tuple
- fetch(self, contig=None, start=None, stop=None, region=None, tid=None, until_eof=False, multiple_iterators=False, reference=None, end=None)
fetch reads aligned in a region.
See
parse_region()for more information on how genomic regions can be specified. reference and end are also accepted for backward compatibility as synonyms for contig and stop, respectively.Without a contig or region all mapped reads in the file will be fetched. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file. This mode of iteration still requires an index. If there is no index, use until_eof=True.
If only contig is set, all reads aligned to contig will be fetched.
A SAM file does not allow random access. If region or contig are given, an exception is raised.
- Parameters:
until_eof (bool) – If until_eof is True, all reads from the current file position will be returned in order as they are within the file. Using this option will also fetch unmapped reads.
multiple_iterators (bool) – If multiple_iterators is True, multiple iterators on the same file can be used at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file creates some overhead, so beware.
- Returns:
An iterator over a collection of reads.
- Return type:
IteratorRow
- Raises:
ValueError – if the genomic coordinates are out of range or invalid or the file does not permit random access to genomic coordinates.
- find_introns(self, read_iterator)
Return a dictionary {(start, stop): count} Listing the intronic sites in the reads (identified by ‘N’ in the cigar strings), and their support ( = number of reads ).
read_iterator can be the result of a .fetch(…) call. Or it can be a generator filtering such reads. Example samfile.find_introns((read for read in samfile.fetch(…) if read.is_reverse)
- find_introns_slow(self, read_iterator)
Return a dictionary {(start, stop): count} Listing the intronic sites in the reads (identified by ‘N’ in the cigar strings), and their support ( = number of reads ).
read_iterator can be the result of a .fetch(…) call. Or it can be a generator filtering such reads. Example samfile.find_introns((read for read in samfile.fetch(…) if read.is_reverse)
- get_index_statistics(self)
return statistics about mapped/unmapped reads per chromosome as they are stored in the index, similarly to the statistics printed by the
samtools idxstatscommand.CRAI indexes do not record these statistics, so for a CRAM file with a .crai index the returned statistics will all be 0.
- Returns:
a list of records for each chromosome. Each record has the attributes ‘contig’, ‘mapped’, ‘unmapped’ and ‘total’.
- Return type:
list
- get_tid(self, reference)
return the numerical tid corresponding to reference
returns -1 if reference is not known.
- getrname(self, tid)
deprecated, use
get_reference_name()instead
- has_index(self)
return true if htsfile has an existing (and opened) index.
- head(self, n, multiple_iterators=True)
return an iterator over the first n alignments.
This iterator is is useful for inspecting the bam-file.
- Parameters:
multiple_iterators (bool) – is set to True by default in order to avoid changing the current file position.
- Returns:
an iterator over a collection of reads
- Return type:
IteratorRowHead
- is_valid_tid(self, int tid)
return True if the numerical tid is valid; False otherwise.
Note that the unmapped tid code (-1) counts as an invalid.
- lengths
tuple of the lengths of the reference sequences. This is a read-only attribute. The lengths are in the same order as
pysam.AlignmentFile.references
- mapped
int with total number of mapped alignments according to the statistics recorded in the index. This is a read-only attribute. (This will be 0 for a CRAM file indexed by a .crai index, as that index format does not record these statistics.)
- mate(self, AlignedSegment read)
return the mate of
pysam.AlignedSegmentread.Note
Calling this method will change the file position. This might interfere with any iterators that have not re-opened the file.
Note
This method is too slow for high-throughput processing. If a read needs to be processed with its mate, work from a read name sorted file or, better, cache reads.
- Returns:
the mate
- Return type:
- Raises:
ValueError – if the read is unpaired or the mate is unmapped
- nocoordinate
int with total number of reads without coordinates according to the statistics recorded in the index, i.e., the statistic printed for “*” by the
samtools idxstatscommand. This is a read-only attribute. (This will be 0 for a CRAM file indexed by a .crai index, as that index format does not record these statistics.)
- pileup(self, contig=None, start=None, stop=None, region=None, reference=None, end=None, **kwargs)
perform a pileup within a region. The region is specified by contig, start and stop (using 0-based indexing). reference and end are also accepted for backward compatibility as synonyms for contig and stop, respectively. Alternatively, a samtools ‘region’ string can be supplied.
Without ‘contig’ or ‘region’ all reads will be used for the pileup. The reads will be returned ordered by contig sequence, which will not necessarily be the order within the file.
Note that SAM formatted files do not allow random access. In these files, if a ‘region’ or ‘contig’ are given an exception is raised.
Note
‘all’ reads which overlap the region are returned. The first base returned will be the first base of the first read ‘not’ necessarily the first base of the region used in the query.
- Parameters:
truncate (bool) – By default, the samtools pileup engine outputs all reads overlapping a region. If truncate is True and a region is given, only columns in the exact region specified are returned.
max_depth (int) – Maximum read depth permitted. The default limit is ‘8000’.
stepper (string) –
The stepper controls how the iterator advances. Possible options for the stepper are
allskip reads in which any of the following flags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
nofilteruses every single read turning off any filtering.
samtoolssame filter and read processing as in samtools pileup. For full compatibility, this requires a ‘fastafile’ to be given. The following options all pertain to filtering of the
samtoolsstepper.
fastafile (
FastaFileobject.) – This is required for some of the steppers.ignore_overlaps (bool) – If set to True, detect if read pairs overlap and only take the higher quality base. This is the default.
flag_filter (int) – ignore reads where any of the bits in the flag are set. The default is BAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP.
flag_require (int) – only use reads where certain flags are set. The default is 0.
ignore_orphans (bool) – ignore orphans (paired reads that are not in a proper pair). The default is to ignore orphans.
min_base_quality (int) – Minimum base quality. Bases below the minimum quality will not be output. The default is 13.
adjust_capq_threshold (int) – adjust mapping quality. The default is 0 for no adjustment. The recommended value for adjustment is 50.
min_mapping_quality (int) – only use reads above a minimum mapping quality. The default is 0.
compute_baq (bool) – re-alignment computing per-Base Alignment Qualities (BAQ). The default is to do re-alignment. Realignment requires a reference sequence. If none is present, no realignment will be performed.
redo_baq (bool) – recompute per-Base Alignment Quality on the fly ignoring existing base qualities. The default is False (use existing base qualities).
- Returns:
an iterator over genomic positions.
- Return type:
IteratorColumn
- text
deprecated, use
referencesandlengthsinstead
- unmapped
int with total number of unmapped reads according to the statistics recorded in the index. This number of reads includes the number of reads without coordinates. This is a read-only attribute. (This will be 0 for a CRAM file indexed by a .crai index, as that index format does not record these statistics.)
- write(self, AlignedSegment read) int
write a single
pysam.AlignedSegmentto disk.- Raises:
ValueError – if the writing failed
- Returns:
the number of bytes written. If the file is closed, this will be 0.
- Return type:
int
- class pysam.AlignmentHeader
header information for a
AlignmentFileobject- Parameters:
header_dict (dict) – build header from a multi-level dictionary. The first level are the four types (‘HD’, ‘SQ’, …). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order. Alternatively, an
AlignmentHeaderobject can be passed directly.text (string) – use the string provided as the header
reference_names (list) – see reference_lengths
reference_lengths (list) – build header from list of chromosome names and lengths. By default, ‘SQ’ and ‘LN’ tags will be added to the header text. This option can be changed by unsetting the flag add_sq_text.
add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permits construction SAM formatted files without a header.
- copy(self)
- classmethod from_dict(cls, header_dict)
- classmethod from_references(cls, reference_names, reference_lengths, text=None, add_sq_text=True)
- classmethod from_text(cls, text)
- get(self, *args)
- get_reference_length(self, reference)
- get_reference_name(self, tid)
- get_tid(self, reference)
return the numerical tid corresponding to reference
returns -1 if reference is not known.
- is_valid_tid(self, int tid)
return True if the numerical tid is valid; False otherwise.
Note that the unmapped tid code (-1) counts as an invalid.
- items(self)
- iteritems(self)
- keys(self)
- lengths
tuple of the lengths of the reference sequences. This is a read-only attribute. The lengths are in the same order as
pysam.AlignmentFile.references
- to_dict(self)
return two-level dictionary with header information from the file.
The first level contains the record (
HD,SQ, etc) and the second level contains the fields (VN,LN, etc).The parser is validating and will raise an AssertionError if if encounters any record or field tags that are not part of the SAM specification. Use the
pysam.AlignmentFile.textattribute to get the unparsed header.The parsing follows the SAM format specification with the exception of the
CLfield. This option will consume the rest of a header line irrespective of any additional fields. This behaviour has been added to accommodate command line options that contain characters that are not valid field separators.If no @SQ entries are within the text section of the header, this will be automatically added from the reference names and lengths stored in the binary part of the header.
- values(self)
An AlignedSegment represents an aligned segment within
a SAM/BAM file.
- class pysam.AlignedSegment(AlignmentHeader header=None)
Class representing an aligned segment.
This class stores a handle to the samtools C-structure representing an aligned read. Member read access is forwarded to the C-structure and converted into python objects. This implementation should be fast, as only the data needed is converted.
For write access, the C-structure is updated in-place. This is not the most efficient way to build BAM entries, as the variable length data is concatenated and thus needs to be resized if a field is updated. Furthermore, the BAM entry might be in an inconsistent state.
One issue to look out for is that the sequence should always be set before the quality scores. Setting the sequence will also erase any quality scores that were set previously.
- Parameters:
header –
AlignmentHeaderobject to map numerical identifiers to chromosome names. If not given, an empty header is created.
- aend
deprecated, use
reference_endinstead.
- alen
deprecated, use
reference_lengthinstead.
- aligned_pairs
deprecated, use
get_aligned_pairs()instead.
- bin
properties bin
- blocks
deprecated, use
get_blocks()instead.
- cigar
deprecated, use
cigarstringorcigartuplesinstead.
- cigarstring
the cigar alignment as a string.
The cigar string is a string of alternating integers and characters denoting the length and the type of an operation.
Note
The order length,operation is specified in the SAM format. It is different from the order of the
cigarproperty.Returns None if not present.
To unset the cigarstring, assign None or the empty string.
- cigartuples
the cigar alignment. The alignment is returned as a list of tuples of (operation, length).
If the alignment is not present, None is returned.
The operations are:
M
pysam.CIGAR_OPS.CMATCH
0
I
pysam.CIGAR_OPS.CINS
1
D
pysam.CIGAR_OPS.CDEL
2
N
pysam.CIGAR_OPS.CREF_SKIP
3
S
pysam.CIGAR_OPS.CSOFT_CLIP
4
H
pysam.CIGAR_OPS.CHARD_CLIP
5
P
pysam.CIGAR_OPS.CPAD
6
=
pysam.CIGAR_OPS.CEQUAL
7
X
pysam.CIGAR_OPS.CDIFF
8
B
pysam.CIGAR_OPS.CBACK
9
Note
The output is a list of (operation, length) tuples, such as
[(0, 30)]. This is different from the SAM specification and thecigarstringproperty, which uses a (length, operation) order, for example:30M.To unset the cigar property, assign an empty list or None.
- compare(self, AlignedSegment other)
return -1,0,1, if contents in this are binary <,=,> to other
- flag
properties flag
- classmethod from_dict(cls, sam_dict, AlignmentHeader header)
parses a dictionary representation of the aligned segment.
- Parameters:
sam_dict – dictionary of alignment values, keys corresponding to output from
todict().
- classmethod fromstring(cls, sam, AlignmentHeader header)
parses a string representation of the aligned segment.
The input format should be valid SAM format.
- Parameters:
sam – SAM formatted string
- get_aligned_pairs(self, matches_only=False, with_seq=False, with_cigar=False)
a list of aligned read (query) and reference positions.
Each item in the returned list is a tuple consisting of the 0-based offset from the start of the read sequence followed by the 0-based reference position.
For inserts, deletions, skipping either query or reference position may be None.
For padding in the reference, the reference position will always be None.
- Parameters:
matches_only (bool) – If True, only matched bases are returned — no None on either side.
with_seq (bool) – If True, return a third element in the tuple containing the reference sequence. For CIGAR ‘P’ (padding in the reference) operations, the third tuple element will be None. Substitutions are lower-case. This option requires an MD tag to be present.
with_cigar (bool) – If True, return an extra element in the tuple containing the CIGAR operator corresponding to this position tuple.
- Returns:
aligned_pairs
- Return type:
list of tuples
- get_blocks(self)
a list of start and end positions of aligned gapless blocks.
The start and end positions are in genomic coordinates.
Blocks are not normalized, i.e. two blocks might be directly adjacent. This happens if the two blocks are separated by an insertion in the read.
- get_cigar_stats(self)
summary of operations in cigar string.
The output order in the array is “MIDNSHP=X” followed by a field for the NM tag. If the NM tag is not present, this field will always be 0. (Accessing this field via index -1 avoids changes if more CIGAR operators are added in future.)
M
pysam.CIGAR_OPS.CMATCH
0
I
pysam.CIGAR_OPS.CINS
1
D
pysam.CIGAR_OPS.CDEL
2
N
pysam.CIGAR_OPS.CREF_SKIP
3
S
pysam.CIGAR_OPS.CSOFT_CLIP
4
H
pysam.CIGAR_OPS.CHARD_CLIP
5
P
pysam.CIGAR_OPS.CPAD
6
=
pysam.CIGAR_OPS.CEQUAL
7
X
pysam.CIGAR_OPS.CDIFF
8
B
pysam.CIGAR_OPS.CBACK
9
NM
NM tag
10 or -1
If no cigar string is present, empty arrays will be returned.
- Returns:
two arrays. The first contains the nucleotide counts within each cigar operation, the second contains the number of blocks for each cigar operation.
- Return type:
arrays
- get_forward_qualities(self)
return the original base qualities of the read sequence, in the same format as the
query_qualitiesproperty.Reads mapped to the reverse strand have their base qualities stored reversed in the BAM file. This method returns such reads’ base qualities reversed back to their original orientation.
- get_forward_sequence(self)
return the original read sequence.
Reads mapped to the reverse strand are stored reverse complemented in the BAM file. This method returns such reads reverse complemented back to their original orientation.
Returns None if the record has no query sequence.
- get_overlap(self, uint32_t start, uint32_t end)
return number of aligned bases of read overlapping the interval start and end on the reference sequence.
Return None if cigar alignment is not available.
- get_reference_positions(self, full_length=False)
a list of reference positions that this read aligns to.
By default, this method returns the (0-based) positions on the reference that are within the read’s alignment, leaving gaps corresponding to deletions and other reference skips.
When full_length is True, the returned list is the same length as the read and additionally includes None values corresponding to insertions or soft-clipping, i.e., to bases of the read that are not aligned to a reference position. (See also
get_aligned_pairs()which additionally returns the corresponding positions along the read.)
- get_reference_sequence(self)
return the reference sequence in the region that is covered by the alignment of the read to the reference.
This method requires the MD tag to be set.
- get_tag(self, tag, with_value_type=False)
retrieves data from the optional alignment section given a two-letter tag denoting the field.
The returned value is cast into an appropriate python type.
This method is the fastest way to access the optional alignment section if only few tags need to be retrieved.
Possible value types are “AcCsSiIfZHB” (see BAM format specification) as well as additional value type ‘d’ as implemented in htslib.
- Parameters:
tag (str) – Data tag.
with_value_type (Optional[bool]) – If set to True, the return value is a tuple of (tag value, type code). (default False)
- Returns:
A python object with the value of the tag. The type of the object depends on the data type in the data record.
- Return type:
Any or tuple(Any, str)
- Raises:
KeyError – If tag is not present, a KeyError is raised.
- get_tags(self, with_value_type=False)
the fields in the optional alignment section.
Returns a list of all fields in the optional alignment section. Values are converted to appropriate python values. For example:
[(NM, 2), (RG, "GJP00TM04")]If with_value_type is set, the value type as encode in the AlignedSegment record will be returned as well:
[(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]
This method will convert all values in the optional alignment section. When getting only one or few tags, please see
get_tag()for a quicker way to achieve this.
- has_tag(self, tag)
returns true if the optional alignment section contains a given tag.
- infer_query_length(self, always=False)
infer query length from CIGAR alignment.
This method deduces the query length from the CIGAR alignment but does not include hard-clipped bases.
Returns None if CIGAR alignment is not present.
If always is set to True, infer_read_length is used instead. This is deprecated and only present for backward compatibility.
- infer_read_length(self)
infer read length from CIGAR alignment.
This method deduces the read length from the CIGAR alignment including hard-clipped bases.
Returns None if CIGAR alignment is not present.
- inferred_length
deprecated, use
infer_query_length()instead.
- is_duplicate
true if optical or PCR duplicate
- is_forward
true if read is mapped to forward strand (implemented in terms of
is_reverse)
- is_mapped
true if read itself is mapped (implemented in terms of
is_unmapped)
- is_paired
true if read is paired in sequencing
- is_proper_pair
true if read is mapped in a proper pair
- is_qcfail
true if QC failure
- is_read1
true if this is read1
- is_read2
true if this is read2
- is_reverse
true if read is mapped to reverse strand
- is_secondary
true if not primary alignment
- is_supplementary
true if this is a supplementary alignment
- is_unmapped
true if read itself is unmapped
- isize
deprecated, use
template_lengthinstead.
- mapping_quality
mapping quality
- mapq
deprecated, use
mapping_qualityinstead.
- mate_is_forward
true if the mate is mapped to forward strand (implemented in terms of
mate_is_reverse)
- mate_is_mapped
true if the mate is mapped (implemented in terms of
mate_is_unmapped)
- mate_is_reverse
true if the mate is mapped to reverse strand
- mate_is_unmapped
true if the mate is unmapped
- modified_bases
Modified bases annotations from MM/ML tags. The output is Dict[(canonical base, strand, modification)] -> [ (pos,qual), …] with qual being (256*probability), or -1 if unknown. Strand==0 for forward and 1 for reverse strand modification
- modified_bases_forward
Modified bases annotations from MM/ML tags. The output is Dict[(canonical base, strand, modification)] -> [ (pos,qual), …] with qual being (256*probability), or -1 if unknown. Strand==0 for forward and 1 for reverse strand modification. The positions are with respect to the original sequence from get_forward_sequence()
- mpos
deprecated, use
next_reference_startinstead.
- mrnm
deprecated, use
next_reference_idinstead.
- next_reference_start
the position of the mate/next read.
- overlap(self)
deprecated, use
get_overlap()instead.
- pnext
deprecated, use
next_reference_startinstead.
- pos
deprecated, use
reference_startinstead.
- positions
deprecated, use
get_reference_positions()instead.
- qend
deprecated, use
query_alignment_endinstead.
- qlen
deprecated, use
query_alignment_lengthinstead.
- qname
deprecated, use
query_nameinstead.
- qqual
deprecated, use
query_alignment_qualitiesorquery_alignment_qualities_strinstead.
- qstart
deprecated, use
query_alignment_startinstead.
- qual
deprecated, use
query_qualitiesorquery_qualities_strinstead.
- query
deprecated, use
query_alignment_sequenceinstead.
- query_alignment_end
end index of the aligned query portion of the sequence (0-based, exclusive).
This the index just past the last base in
query_sequencethat is not soft-clipped. (For unmapped reads and when CIGAR is unavailable, this will be the length of the query/read.)
- query_alignment_length
length of the aligned query sequence.
This is the number of bases in
query_sequencethat are not soft-clipped. (For unmapped reads and when CIGAR is unavailable, this will equal the length of the query/read.)
- query_alignment_qualities
aligned query sequence quality values (None if not present). These are the quality values that correspond to
query_alignment_sequence, that is, they exclude qualities of soft clipped bases. This is equal toquery_qualities[query_alignment_start:query_alignment_end].Quality scores are returned as a python array of unsigned chars. Note that this is not the ASCII-encoded value typically seen in FASTQ or SAM formatted files. Thus, no offset of 33 needs to be subtracted.
This property is read-only.
- query_alignment_qualities_str
aligned query sequence quality values, returned as an ASCII-encoded string similar to that in FASTQ or SAM files, or None if base qualities are not present. These are the quality values that correspond to
query_alignment_sequence, i.e., excluding qualities corresponding to soft-clipped bases.This property is read-only.
- query_alignment_sequence
aligned portion of the read.
This is a substring of
query_sequencethat excludes flanking bases that were soft clipped (None if not present). It is equal toquery_sequence[query_alignment_start:query_alignment_end].SAM/BAM files may include extra flanking bases that are not part of the alignment. These bases may be the result of the Smith-Waterman or other algorithms, which may not require alignments that begin at the first residue or end at the last. In addition, extra sequencing adapters, multiplex identifiers, and low-quality bases that were not considered for alignment may have been retained.
- query_alignment_start
start index of the aligned query portion of the sequence (0-based, inclusive).
This the index of the first base in
query_sequencethat is not soft-clipped. (For unmapped reads and when CIGAR is unavailable, this will be zero.)
- query_length
the length of the query/read.
This value corresponds to the length of the sequence supplied in the BAM/SAM file. The length of a query is 0 if there is no sequence in the BAM/SAM file. In those cases, the read length can be inferred from the CIGAR alignment, see
pysam.AlignedSegment.infer_query_length().The length includes soft-clipped bases and is equal to
len(query_sequence).This property is read-only but is updated when a new query sequence is assigned to this AlignedSegment.
Returns 0 if not available.
- query_name
the query template name (None if not present)
- query_qualities
read sequence base qualities, including soft clipped bases (None if not present).
Quality scores are returned as a python array of unsigned chars. Note that this is not the ASCII-encoded value typically seen in FASTQ or SAM formatted files. Thus, no offset of 33 needs to be subtracted.
Note that to set quality scores the sequence has to be set beforehand as this will determine the expected length of the quality score array. Setting will raise a ValueError if the length of the new quality scores is not the same as the length of the existing sequence.
Quality scores to be set may be specified as a Python array or other iterable of ints, or as a string of ASCII-encooded FASTQ/SAM-style base quality characters.
- query_qualities_str
read sequence base qualities, including soft clipped bases, returned as an ASCII-encoded string similar to that in FASTQ or SAM files, or None if base qualities are not present.
Note that to set quality scores the sequence has to be set beforehand as this will determine the expected length of the quality score string. Setting will raise a ValueError if the length of the new quality scores is not the same as the length of the existing sequence.
- query_sequence
read sequence bases, including soft clipped bases (None if not present).
Assigning to this attribute will invalidate any quality scores. Thus, to in-place edit the sequence and quality scores, copies of the quality scores need to be taken. Consider trimming for example:
q = read.query_qualities read.query_sequence = read.query_sequence[5:10] read.query_qualities = q[5:10]
The sequence is returned as it is stored in the BAM file. (This will be the reverse complement of the original read sequence if the mapper has aligned the read to the reverse strand.)
- reference_end
aligned reference position of the read on the reference genome.
reference_end points to one past the last aligned residue. Returns None if not available (read is unmapped or no cigar alignment present).
- reference_id
reference ID
Note
This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use
get_reference_name()
- reference_length
aligned length of the read on the reference genome.
This is equal to reference_end - reference_start. Returns None if not available.
- reference_start
0-based leftmost coordinate
- rlen
deprecated, use
query_lengthinstead.
- rname
deprecated, use
reference_idinstead.
- rnext
deprecated, use
next_reference_idinstead.
- seq
deprecated, use
query_sequenceinstead.
- set_tag(self, tag, value, value_type=None, replace=True)
sets a particular field tag to value in the optional alignment section.
value_type describes the type of value that is to entered into the alignment record. It can be set explicitly to one of the valid one-letter type codes. If unset, an appropriate type will be chosen automatically based on the python type of value.
An existing value of the same tag will be overwritten unless replace is set to False. This is usually not recommended as a tag may only appear once in the optional alignment section.
If value is None, the tag will be deleted.
This method accepts valid SAM specification value types, which are:
A: printable char i: signed int f: float Z: printable string H: Byte array in hex format B: Integer or numeric array
Additionally, it will accept the integer BAM types (‘cCsSI’)
For htslib compatibility, ‘a’ is synonymous with ‘A’ and the method accepts a ‘d’ type code for a double precision float.
When deducing the type code by the python type of value, the following mapping is applied:
i: python int f: python float Z: python str or bytes B: python array.array, list or tuple
Note that a single character string will be output as ‘Z’ and not ‘A’ as the former is the more general type.
- set_tags(self, tags)
sets the fields in the optional alignment section with a list of (tag, value) tuples.
The value type of the values is determined from the python type. Optionally, a type may be given explicitly as a third value in the tuple, For example:
x.set_tags([(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]
This method will not enforce the rule that the same tag may appear only once in the optional alignment section.
- tags
deprecated, use
get_tags()instead.
- template_length
the observed query template length
- tid
deprecated, use
reference_idinstead.
- tlen
deprecated, use
template_lengthinstead.
- to_dict(self)
returns a json representation of the aligned segment.
Field names are abbreviated versions of the class attributes.
- to_string(self)
returns a string representation of the aligned segment.
The output format is valid SAM format if a header is associated with the AlignedSegment.
- tostring(self, htsfile=None)
deprecated, use
to_string()instead.- Parameters:
htsfile – (deprecated) AlignmentFile object to map numerical identifiers to chromosome names. This parameter is present for backwards compatibility and ignored.
- class pysam.PileupColumn
A pileup of reads at a particular reference sequence position (column). A pileup column contains all the reads that map to a certain target base.
This class is a proxy for results returned by the samtools pileup engine. If the underlying engine iterator advances, the results of this column will change.
- get_mapping_qualities(self)
query mapping quality scores at pileup column position.
- Returns:
a list of quality scores
- Return type:
list
- get_num_aligned(self)
return number of aligned bases at pileup column position.
This method applies a base quality filter and the number is equal to the size of
get_query_sequences(),get_mapping_qualities(), etc.
- get_query_names(self)
query/read names aligned at pileup column position.
- Returns:
a list of query names at pileup column position.
- Return type:
list
- get_query_positions(self)
positions in read at pileup column position.
- Returns:
a list of read positions
- Return type:
list
- get_query_qualities(self)
query base quality scores at pileup column position.
- Returns:
a list of quality scores
- Return type:
list
- get_query_sequences(self, bool mark_matches=False, bool mark_ends=False, bool add_indels=False)
query bases/sequences at pileup column position.
Optionally, the bases/sequences can be annotated according to the samtools mpileup format. This is the format description from the samtools mpileup tool:
Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column. At this column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward strand and `acgtn' for a mismatch on the reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the reference. The deleted bases will be presented as `*' in the following lines. Also at the read base column, a symbol `^' marks the start of a read. The ASCII of the character following `^' minus 33 gives the mapping quality. A symbol `$' marks the end of a read segment
To reproduce samtools mpileup format, set all of mark_matches, mark_ends and add_indels to True.
- Parameters:
mark_matches (bool) – If True, output bases matching the reference as “.” or “,” for forward and reverse strand, respectively. This mark requires the reference sequence. If no reference is present, this option is ignored.
mark_ends (bool) – If True, add markers “^” and “$” for read start and end, respectively.
add_indels (bool) – If True, add bases for bases inserted into or skipped from the reference. The latter requires a reference sequence file to have been given, e.g. via pileup(fastafile = …). If no reference sequence is available, skipped bases are represented as ‘N’s.
- Returns:
a list of bases/sequences per read at pileup column position.
- Return type:
list
- nsegments
number of reads mapping to this column.
Note that this number ignores the base quality filter.
- pileups
list of reads (
pysam.PileupRead) aligned to this column
- pos
deprecated, use
reference_posinstead.
- reference_id
the reference sequence number as defined in the header
- reference_pos
the position in the reference sequence (0-based).
- set_min_base_quality(self, min_base_quality)
set the minimum base quality for this pileup column.
- tid
deprecated, use
reference_idinstead.
- class pysam.PileupRead
Representation of a read aligned to a particular position in the reference sequence.
- alignment
a
pysam.AlignedSegmentobject of the aligned read
- indel
indel length for the position following the current pileup site.
This quantity peeks ahead to the next cigar operation in this alignment. If the next operation is an insertion, indel will be positive. If the next operation is a deletion, it will be negation. 0 if the next operation is not an indel.
- is_del
1 iff the base on the padded read is a deletion
- is_head
1 iff the base on the padded read is the left-most base.
- is_refskip
1 iff the base on the padded read is part of CIGAR N op.
- is_tail
1 iff the base on the padded read is the right-most base.
- level
the level of the read in the “viewer” mode. Note that this value is currently not computed.
- query_position
position of the read base at the pileup site, 0-based. None if
is_deloris_refskipis set.
- query_position_or_next
position of the read base at the pileup site, 0-based.
If the current position is a deletion, returns the next aligned base.
- class pysam.IndexedReads(AlignmentFile samfile, int multiple_iterators=True)
Index a Sam/BAM-file by query name while keeping the original sort order intact.
The index is kept in memory and can be substantial.
By default, the file is re-opened to avoid conflicts if multiple operators work on the same file. Set multiple_iterators = False to not re-open samfile.
- Parameters:
samfile (AlignmentFile) – File to be indexed.
multiple_iterators (bool) – Flag indicating whether the file should be reopened. Reopening prevents existing iterators being affected by the indexing.
- build(self)
build the index.
- find(self, query_name)
find query_name in index.
- Returns:
Returns an iterator over all reads with query_name.
- Return type:
IteratorRowSelection
- Raises:
KeyError – if the query_name is not in the index.
VCF/BCF files
- class pysam.VariantFile(*args, **kwargs)
(filename, mode=None, index_filename=None, header=None, drop_samples=False, duplicate_filehandle=True, ignore_truncation=False, threads=1)
A VCF/BCF formatted file. The file is automatically opened.
If an index for a variant file exists (.csi or .tbi), it will be opened automatically. Without an index random access to records via
fetch()is disabled.For writing, a
VariantHeaderobject must be provided, typically obtained from another VCF file/BCF file.- Parameters:
mode (string) –
mode should be
rfor reading orwfor writing. The default is text mode (VCF). For binary (BCF) I/O you should appendbfor compressed orufor uncompressed BCF output.If
bis present, it must immediately followrorw. Valid modes arer,w,wh,rb,wb,wbuandwb0. For instance, to open a BCF formatted file for reading, type:f = pysam.VariantFile('ex1.bcf','r')
If mode is not specified, we will try to auto-detect the file type. All of the following should work:
f1 = pysam.VariantFile('ex1.bcf') f2 = pysam.VariantFile('ex1.vcf') f3 = pysam.VariantFile('ex1.vcf.gz')
index_filename (string) – Explicit path to an index file.
header (VariantHeader) –
VariantHeaderobject required for writing.drop_samples (bool) – Ignore sample information when reading.
duplicate_filehandle (bool) – By default, file handles passed either directly or through File-like objects will be duplicated before passing them to htslib. The duplication prevents issues where the same stream will be closed by htslib and through destruction of the high-level python object. Set to False to turn off duplication.
ignore_truncation (bool) – Issue a warning, instead of raising an error if the current file appears to be truncated due to a missing EOF marker. Only applies to bgzipped formats. (Default=False)
threads (integer) – Number of threads to use for compressing/decompressing VCF/BCF files. Setting threads to > 1 cannot be combined with ignore_truncation. (Default=1)
- close(self)
closes the
pysam.VariantFile.
- copy(self)
- fetch(self, contig=None, start=None, stop=None, region=None, reopen=False, end=None, reference=None)
fetch records in a region, specified either by contig, start, and end (which are 0-based, half-open); or alternatively by a samtools region string (which is 1-based inclusive).
Without contig or region all mapped records will be fetched. The records will be returned ordered by contig, which will not necessarily be the order within the file.
Set reopen to true if you will be using multiple iterators on the same file at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file incurrs some overhead, so use with care.
If only contig is set, all records on contig will be fetched. If both region and contig are given, an exception is raised.
Note that a bgzipped VCF.gz file without a tabix/CSI index (.tbi/.csi) or a BCF file without a CSI index can only be read sequentially.
- get_tid(self, reference)
return the numerical tid corresponding to reference
returns -1 if reference is not known.
- is_valid_tid(self, tid)
return True if the numerical tid is valid; False otherwise.
returns -1 if reference is not known.
- new_record(self, *args, **kwargs)
Create a new empty
VariantRecord.
- open(self, filename, mode='r', index_filename=None, VariantHeader header=None, drop_samples=False, duplicate_filehandle=True, ignore_truncation=False, threads=1)
open a vcf/bcf file.
If open is called on an existing VariantFile, the current file will be closed and a new file will be opened.
- reset(self)
reset file position to beginning of file just after the header.
- subset_samples(self, include_samples)
Read only a subset of samples to reduce processing time and memory. Must be called prior to retrieving records.
- write(self, VariantRecord record) int
write a single
pysam.VariantRecordto disk.returns the number of bytes written.
- class pysam.VariantHeader
Header information for a
VariantFileobject.The VCF header contains metadata lines that define the structure of the VCF file, including the VCF format version, FILTER, INFO, ALT, and FORMAT field definitions, contig (chromosome) definitions, and sample names. Accessed via
VariantFile.headerattribute.Example
Here is an example of reading and printing VCF header information:
>>> variant_file = pysam.VariantFile('/path/to/file.vcf.gz') >>> header = variant_file.header >>> print(f"version: {header.version}") version: VCFv4.5 >>> print(f"samples: {list(header.samples)}") samples: ["NA00001", "NA00002"] >>> print(f"contigs: {list(header.contigs)}") contigs: ["chr1"] >>> print(f"info: {list(header.info)}") info: ["NS", "AN", "AC", "DP", "AF"]
- add_line(self, line)
Add a metadata line to this VCF header.
- Parameters:
line (str) – The metadata line to add to this VCF header.
- Raises:
ValueError – if the provided header line is invalid.
- add_meta(self, key, value=None, items=None)
Add a metadata line to this VCF header.
- Parameters:
key (str) – The header line key (e.g.
"INFO","contig","custom_key").value (str, optional) – The value for generic (unstructured)
##custom_key=valueheader lines. For structured lines such as INFO and FILTER, this must be None. Either value or items must be specified, but not both.items (list of (str, str) pairs, optional) – A list of key/value pairs for a structured header line. For example
[("ID", "NS"), ("Type", "Number")... ]. Eithervalueoritemsmust be specified, but not both.
- Raises:
ValueError – If value and items are both specified, or if neither is specified.
MemoryError – If a VCF header record cannot be allocated.
- add_record(self, VariantHeaderRecord record)
Add an existing
VariantHeaderRecordto this header.- Raises:
ValueError – if record is None.
- add_sample(self, name)
Add a new sample to this header.
- Parameters:
name (str) – The sample name to add.
- add_samples(self, *args)
Add several new samples to this header.
This function takes multiple arguments, each of which may be either a sample name or an iterable returning sample names (e.g., a list of sample names).
- alts
A dictionary mapping all ALT fields in the VCF header to their associated
VariantHeaderRecord.ALT meta-information lines describe the possible symbolic alternate alleles in the ALT column of VCF records. Most commonly used for structural variant types (e.g. “CNV”) and IUPAC ambiguity codes (e.g. “R”).
alts is a copy and modifications to this dictionary will not be reflected in the header metadata.
- Type:
dict[str, VariantHeaderRecord]
- contigs
The header’s
##contigmetadata lines, presented as a mapping of the contig names or their indices to their correspondingVariantContig. SeeVariantHeaderContigsfor details.- Type:
- copy(self)
Return a copy of this
VariantHeaderobject.
- filters
The header’s
##FILTERmetadata lines, presented as a mapping of the filter names to their correspondingVariantMetadata. SeeVariantHeaderMetadatafor details.- Type:
- formats
The header’s
##FORMATmetadata lines, presented as a mapping of the field names to their correspondingVariantMetadata. SeeVariantHeaderMetadatafor details.- Type:
- info
The header’s
##INFOmetadata lines, presented as a mapping of the field names to their correspondingVariantMetadata. SeeVariantHeaderMetadatafor details.- Type:
- merge(self, VariantHeader header)
Merge another
VariantHeaderinto this header.- Parameters:
header (VariantHeader) – The header to merge into this one.
- Raises:
ValueError – If header is None.
- new_record(self, contig=None, start=0, stop=0, alleles=None, id=None, qual=None, filter=None, info=None, samples=None, **kwargs)
Create a new empty
VariantRecord.- Parameters:
contig (str, optional) – The contig name for the new record.
start (int) – The 0-based start position of the variant. Defaults to 0.
stop (int) – The 0-based exclusive end position of the variant. Defaults to 0.
alleles (tuple, optional) – The alleles for the new record (reference allele first).
id (str, optional) – The variant ID (the ID column in the VCF).
qual (float, optional) – The variant quality score (the QUAL column in the VCF).
filter (str or list, optional) – A filter name or list of filter names to apply to the record.
info (dict, optional) – A mapping of INFO field names to values.
samples (list of dict, optional) – A list of mappings of FORMAT field names to values, one per sample.
Warning
Arguments are currently experimental. Use with caution and expect changes in upcoming releases.
- Raises:
MemoryError – If a record is unable to be allocated.
- records
All the header’s metadata lines of any type, as a sequence of
VariantHeaderRecord. SeeVariantHeaderRecordsfor details.- Type:
- samples
The header’s
##SAMPLEmetadata lines, as a sequence of sample names. SeeVariantHeaderSamplesfor details.- Type:
- version
The content of the
##fileformatline (e.g., “VCFv4.5”) if present, otherwise None.- Type:
str or None
- class pysam.VariantRecord(*args, **kwargs)
Variant record (containing data for one entry in a VCF file).
Note
These properties and methods will raise
ValueErrorif errors occur while unpacking the variant record or in other circumstances as documented.- alleles
A tuple containing the reference allele followed by all alternate alleles, or None if not present.
- alleles_variant_types
A tuple of variant types corresponding to the
allelesattribute.Types are one of: “REF”, “SNP”, “MNP”, “INDEL”, “BND”, “OVERLAP”, and “OTHER”.
- alts
A tuple containing just the alternate alleles, or None if not present.
- chrom
The chromosome/contig name (same as
contig).- Raises:
ValueError – If the chromosome name is not present in the headers.
- contig
The chromosome/contig name (same as
chrom).- Raises:
ValueError – If the contig name is not present in the headers.
- copy(self)
Return a copy of this
VariantRecordobject.
- filter
The FILTER field, presented as a mapping of the filter names present (or their indices) to their corresponding
VariantMetadata. SeeVariantRecordFilterfor details.- Type:
- format
The FORMAT field, presented as a mapping of the field names present to their corresponding
VariantMetadata. SeeVariantRecordFormatfor details.- Type:
- id
The variant record identifier, or None if not available.
- info
The INFO field, presented as a mapping of the field names present to their corresponding values. See
VariantRecordInfofor details.- Type:
- qual
The Phred-scaled quality score, or None if not available.
- ref
The reference allele.
- samples
The genotype fields, presented as a mapping of the sample names present (or their indices) to their corresponding
VariantRecordSample. SeeVariantRecordSamplesfor details.- Type:
- translate(self, VariantHeader dst_header)
Set the header for this
VariantRecordto a differentVariantHeader.- Parameters:
dst_header (VariantHeader) – The new variant header.
- Raises:
ValueError – If the existing and new header do not have the same number of samples, or dst_header is None.
- class pysam.VariantContig(*args, **kwargs)
Contig metadata from a
VariantHeadermetadata header line.Example
Given a VCF meta-information line such as:
##contig=<ID=20,length=62435964,assembly=B36,species="Homo sapiens">The corresponding
VariantContigobject would havename= “20” andlength=62435964.- header_record
The
VariantHeaderRecordassociated with thisVariantContigobject.- Type:
- id
The contig internal ID number. To access the
IDfield of a header contig line, usenameinstead.- Type:
int
- length
The contig length, or None if not available.
- Type:
int or None
- name
The contig name.
- Type:
str
- remove_header(self)
Mark the
VariantHeaderRecordassociated with thisVariantContigobject for deletion.Warning
Due to HTSlib limitations, the record is not immediately removed from the header.
- class pysam.VariantHeaderRecord(*args, **kwargs)
A mapping-like interface to the key-value attributes of a single VCF header line. Accessed by iterating over
VariantHeader.recordsor throughVariantMetadata.record.Note
Use of
VariantMetadatashould be generally preferred for accessing standard fields of structured FILTER / INFO / FORMAT header records. Use this class for accessing generic (##key=value) header records and non-standard (i.e. not ID, Number, Type, or Description) fields of structured header records.Example
Here is an example of iterating over
VariantHeaderRecordobjects and printing records:variant_file = pysam.VariantFile('/path/to/file.vcf.gz') for record in variant_file.header.records: if record.type == "GENERIC": print(record.key, record.value) else: print(record.type, dict(record))
- attrs
A tuple of (attribute name, value) pairs for this header record.
- Type:
tuple of 2-tuple of str or None
- get(self, key, default=None)
Retrieve the value of the attribute named key from this header record.
- Parameters:
key (str) – The attribute name to look up (e.g. “ID”, “Number”, “Type”, “Description”).
default – Value to return if key is not present in the record. Defaults to None.
- Return type:
The value associated with key, or default if not present.
- items(self)
Return a list of attribute
(name, value)pairs for this header record.
- iteritems(self)
Return an iterator over attribute
(name, value)pairs for this header record.
- iterkeys(self)
Return an iterator over the attribute names of this header record (e.g. “ID”, “Description”).
- itervalues(self)
Return an iterator over the attribute values of this header record.
- key
The header key (the part before ‘=’, e.g.,
FILTER,contig,fileformat).- Type:
str or None
- keys(self)
Return a list of attribute names in this header record (e.g. “ID”, “Description”).
- pop(self, key, default=_nothing)
Return the value of a given key and remove it from the header record.
- Parameters:
key (str) – The key to be removed.
default (Any) – The default value to return when key is not present in the header record.
- Raises:
KeyError: – If the key is not present and no default value is provided.
- remove(self)
Remove this record from the VCF header.
Warning
Please avoid. This is currently unsafe and causes segfaults.
- type
The header type. One of
FILTER,INFO,FORMAT,CONTIG,STRUCTURED, orGENERIC, or None if the header record is unavailable.- Type:
str or None
- update(self, items=(), **kwargs)
Update header record attributes, using mapping entries if a
dictor other mapping is given, or name/value tuples if an iterable is given, and also using any keyword arguments given. Existing attribute values are overwritten.
- value
The header value. Set only for generic unstructured records. None for structured records such as FILTER, INFO, or FORMAT.
- Type:
str or None
- values(self)
Return a list of attribute values in this header record.
- class pysam.VariantMetadata(*args, **kwargs)
A FILTER, INFO or FORMAT metadata line from a
VariantHeaderobject.Example
Given a VCF meta-information line such as:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">The corresponding
VariantMetadataobject would havename= “NS”,number=1,type= “Integer”, anddescription= “Number of Samples With Data”.Note
Many of these properties will raise
ValueErrorwhen the record is invalid, usually due to an invalid header ID number.- description
The metadata description, or None if the field is not present.
- Type:
str or None
- id
The metadata internal header ID number. To access the
IDfield of a header metadata line, usenameinstead.- Type:
int
- name
The metadata field name (e.g., “AD”).
- Type:
str
- number
The number of values present for this metadata field (i.e., cardinality), which is one of:
An
intn if there is a fixed number of exactly n values.Aif there is one value per alternate allele.Rif there is one value per allele (including the reference).Gif there is one value per genotype—usually used for FORMAT tags..if there is a variable or unknown number of values.None if the header line is not a FILTER, INFO, or FORMAT metadata line.
- Type:
int or str or None
- record
The
VariantHeaderRecordassociated with thisVariantMetadataobject, or None if there is no associated record.- Type:
VariantHeaderRecord or None
- remove_header(self)
Mark the
VariantHeaderRecordassociated with thisVariantMetadataobject for deletion.Warning
Due to HTSlib limitations, the record is not immediately removed from the header.
- type
The metadata value type. One of
Flag,Integer,Float, orString, or None if the header line is not a FILTER, INFO, or FORMAT metadata line.- Type:
str or None
- class pysam.VariantRecordSample(*args, **kwargs)
Data for a single sample from a
VariantRecordobject. Provides data accessors for genotypes and a mapping interface from FORMAT fields to values.The
VariantRecordSampleobject implements a mapping-like object for a specific VCF/BCF row and sample column. The keys are FORMAT fields and the values are the values in the VCF/BCF file. There is special handling for “GT”, through thealleles,allele_indicesandphasedattributes. There is also anameproperty that provides the sample’s name.Example
Here is an example of accessing and printing data in a
VariantRecordSample:variant_file = pysam.VariantFile('/path/to/file.vcf.gz') for variant_record in variant_file.fetch(): for sample_name in variant_record.samples: variant_record_sample = variant_record.samples[sample_name] variant_record_sample["GT"] = (0, 1) print(dict(variant_record_sample))
This code will print data such as the following:
{ 'AD': (0, 80), 'DP': 79, 'GQ': 32, 'GT': (0, 1), 'PL': (33, 34, 0), 'VAF': (1.0,), 'PS': None }
Note
The “GT” value must be provided as a tuple
(0, 1)and not a string"0/1".- allele_indices
Allele indices (e.g.
(0, 1)) for the called genotype (if present), otherwise None.
- alleles
Alleles (e.g.
("CT", "C")) for the called genotype (if present), otherwise None.
- clear(self)
Clear all FORMAT fields (including GT) for this sample.
- get(self, key, default=None)
Retrieve sample data for FORMAT field key (e.g. “DP”). If key is not present, returns default (or None if default is not given).
- items(self)
Return a list of all FORMAT field
(name, value)tuples for this sample.
- iteritems(self)
Return an iterator over all FORMAT field
(name, value)tuples for this sample.
- iterkeys(self)
Return an iterator over all FORMAT field names for this record.
- itervalues(self)
Return an iterator over all FORMAT field values for this sample.
- keys(self)
Return a list of all FORMAT field names for this record.
- name
The sample name.
- phased
False if the genotype is missing or any allele is unphased, otherwise True.
- pop(self, key, default=_nothing)
Remove the FORMAT field key for this sample and return its value. If key is not present, returns default if given (otherwise raises
KeyError).
- update(self, items=(), **kwargs)
Update the FORMAT field values for this sample, using mapping entries if a
dictor other mapping is given, or name/value tuples if an iterable is given, and also using any keyword arguments given. Existing field values are overwritten.
- values(self)
Return a list of all FORMAT field values for this sample.
Internal classes
These classes are used internally to represent particular header and record fields. They cannot be instantiated directly in your Python or Cython code.
- class pysam.VariantHeaderContigs(*args, **kwargs)
A mapping from contig name or integer index to a
VariantContigobject. Accessed viaVariantHeader.contigs.Example
Here is an example of iterating over the contigs in a
VariantHeader:variant_file = pysam.VariantFile('/path/to/file.vcf.gz') for name, contig in variant_file.header.contigs.items(): print(name, contig.length)
- add(self, id, length=None, **kwargs)
Add a new
##contigmetadata line to the VCF header.- Parameters:
id (str) – The contig name.
length (int, optional) – The contig length.
kwargs – Key/value pairs denoting additional non-standard optional fields.
- Raises:
ValueError – If a contig with the given id is already present in the VCF header.
- clear_header(self)
Mark all
##contigmetadata lines for removal from the VCF header.Warning
Due to HTSlib limitations, the records are not immediately removed from the header.
- get(self, key, default=None)
Retrieve the
VariantContigfor the contig named key, or default if not present.
- items(self)
Return a list of contig (
name,VariantContig) pairs in the VCF header.
- iteritems(self)
Return an iterator over contig (
name,VariantContig) pairs in the VCF header.
- iterkeys(self)
Return an iterator over the contig names in the VCF header.
- itervalues(self)
Return an iterator over the
VariantContigobjects in the VCF header.
- keys(self)
Return a list of contig names in the VCF header.
- remove_header(self, key)
Mark a
##contigmetadata line for removal from the VCF header.Warning
Due to HTSlib limitations, the record is not immediately removed from the header.
- Parameters:
key (str or int) – The contig name or integer index to remove.
- Raises:
IndexError – If key is an integer and is out of range.
KeyError – If key is a string and is not present in the VCF header.
- values(self)
Return a list of
VariantContigobjects in the VCF header.
- class pysam.VariantHeaderMetadata(*args, **kwargs)
A mapping from FILTER, INFO, or FORMAT field names to
VariantMetadataobjects. Accessed viaVariantHeader.filters,VariantHeader.info, orVariantHeader.formats.Example
Here is an example of iterating over the INFO metadata in a
VariantHeader:variant_file = pysam.VariantFile('/path/to/file.vcf.gz') for name, meta in variant_file.header.info.items(): print(name, meta.type, meta.number, meta.description)
- add(self, id, number, type, description, **kwargs)
Add a new FILTER, INFO, or FORMAT metadata line to the VCF header.
- Parameters:
id (str) – The metadata field name (the “ID” attribute of the header line).
number (int or str or None) – The number of values for this field. Use None for FILTER fields. Use an integer for a fixed count, or one of
.,A,R, orGfor variable counts.type (str) – The value type for this field. Use None for FILTER fields. Must be one of
Flag,Integer,Float, orString.description (str) – A human-readable description of the metadata field.
kwargs – Key/value pairs denoting additional non-standard optional fields.
- Raises:
ValueError – If id is already present in the VCF header, if number and type are not None when adding a filter, or if an unknown type is provided.
- clear_header(self)
Remove all FILTER, INFO, or FORMAT metadata lines of this type from the VCF header.
Warning
Due to HTSlib limitations, the records are not immediately removed from the header.
- get(self, key, default=None)
Retrieve the
VariantMetadatafor the field named key, or default if not present.- Parameters:
key (str) – The field name to look up.
default – Value to return if key is not present in the header record. Defaults to None.
- items(self)
Return a list of field (
name,VariantMetadata) pairs.
- iteritems(self)
Return an iterator over field (
name,VariantMetadata) pairs.
- iterkeys(self)
Return an iterator over the field names in this metadata collection.
- itervalues(self)
Return an iterator over the
VariantMetadataobjects in this collection.
- keys(self)
Return a list of field names in this metadata collection.
- remove_header(self, key)
Mark a FILTER, INFO, or FORMAT metadata line for deletion from the VCF header.
Warning
Due to HTSlib limitations, the record is not immediately removed from the header.
- Parameters:
key (str) – The metadata field name (“ID” attribute) to remove.
- Raises:
KeyError – If key is not present in the VCF header.
- values(self)
Return a list of
VariantMetadataobjects in this collection.
- class pysam.VariantHeaderRecords(*args, **kwargs)
A sequence of
VariantHeaderRecordobjects from aVariantHeaderobject.Supports iteration and integer indexing to access individual header records. Accessed via
VariantHeader.records.
- class pysam.VariantHeaderSamples(*args, **kwargs)
A sequence of sample names from a
VariantHeaderobject. Supports iteration and integer indexing to access individual sample names. Accessed viaVariantHeader.samples.Example
Here is an example of iterating over the samples in a
VariantHeader:variant_file = pysam.VariantFile('/path/to/file.vcf.gz') for sample_name in variant_file.header.samples: print(sample_name)
- add(self, name)
Add a new sample to the VCF header.
- Parameters:
name (str) – The sample name to add.
- class pysam.VariantRecordFilter(*args, **kwargs)
FILTER column data for a
VariantRecordobject, presented as a mapping of the filter names (e.g., “PASS”) present or their indices to their correspondingVariantMetadata.- add(self, key)
Add a new filter.
- Parameters:
key (str) – The name of the filter to add to this record. If key is “.”, the “PASS” filter is added.
- Raises:
KeyError – If the selected filter is not present in the VCF header.
- clear(self)
Clear all FILTER column data for this record.
- get(self, key, default=None)
Retrieve
VariantMetadatainformation for this record’s filter named key.If key is not present, return default.
- Parameters:
key (str) – The FILTER name for which to retrieve metadata.
default – Data to return if the key is not present in the FILTER field for this record. Defaults to None.
- items(self)
Return a list of (
filter,VariantMetadata) tuples for all filters for this record.
- iteritems(self)
Return an iterator over (
filter,VariantMetadata) tuples for all filters for this record.
- iterkeys(self)
Return an iterator over all filters for this record.
- itervalues(self)
Return an iterator over the
VariantMetadataobjects corresponding to all filters for this record.
- keys(self)
Return a list of all filters for this record.
- values(self)
Return a list of
VariantMetadataobjects corresponding to all filters for this record.
- class pysam.VariantRecordFormat(*args, **kwargs)
FORMAT field data present for each sample in a
VariantRecordobject, presented as a mapping of the format names present (e.g. “GT”) to their correspondingVariantMetadata.- clear(self)
Clear FORMAT field data for all samples within the associated
VariantRecordobject.- Raises:
ValueError – If there are errors while deleting the FORMAT field.
- get(self, key, default=None)
Retrieve
VariantMetadatafor a FORMAT field key.If key is not present, return default.
- Parameters:
key (str) – FORMAT field to retrieve metadata for.
default – Data to return if the key is not present in the FORMAT field for this record. Defaults to None.
- items(self)
Return a list of (
format,VariantMetadata) tuples for all format fields for this record.
- iteritems(self)
Return an iterator over (
format,VariantMetadata) tuples for all format fields for this record.
- iterkeys(self)
Return an iterator over all format fields for this record.
- itervalues(self)
Return an iterator over the
VariantMetadataobjects corresponding to all format fields for this record.
- keys(self)
Return a list of all format fields for this record.
- values(self)
Return a list of the
VariantMetadataobjects corresponding to all format fields for this record.
- class pysam.VariantRecordInfo(*args, **kwargs)
INFO column data stored in a
VariantRecordobject, presented as a mapping from field name to value.- clear(self)
Clear all INFO column data for this record.
- Raises:
ValueError – If there are errors while unpacking the
VariantRecordor deleting INFO.
- get(self, key, default=None)
Retrieve INFO column data for this record at field key.
- Parameters:
key (str) – The INFO field to retrieve data for.
default – Data to return if the key is not present in the INFO field for this record. Defaults to None.
- Raises:
ValueError – If there are errors unpacking the
VariantRecordor the INFO field is not present in the header.
- items(self)
Return a list of (
key,value) tuples for all INFO fields for this record.
- iteritems(self)
Return an iterator over (
key,value) tuples for all INFO fields for this record.- Raises:
ValueError – If there are errors unpacking the
VariantRecord.
- iterkeys(self)
Return an iterator over all INFO keys for this record.
- itervalues(self)
Return an iterator over all INFO values for this record.
- Raises:
ValueError – If there are errors unpacking the
VariantRecord.
- keys(self)
Return a list of all INFO keys for this record.
- pop(self, key, default=_nothing)
Remove the INFO field key for this record and returns its value.
- Parameters:
key (str) – INFO field to retrieve for this record.
default (Any) – Data to return if the key is not present.
- Raises:
KeyError – When the key is not present and default is unset.
ValueError – If there is an error unpacking the
VariantRecord.
- Returns:
The value of the removed INFO field for this record.
- Return type:
Any
- update(self, items=(), **kwargs)
Update the INFO field (
key,value) tuples for this record, using mapping entries if adictor other mapping is given, or key/value tuples if an iterable is given, and also using any keyword arguments given. Existing field values are overwritten.
- values(self)
Return a list of all INFO values for this record.
- class pysam.VariantRecordSamples(*args, **kwargs)
SAMPLE field(s) data in a
VariantRecordobject, presented as a mapping from sample names (e.g. “HG001”) or indices to their correspondingVariantRecordSample.- get(self, key, default=None)
Retrieve
VariantRecordSample(containing FORMAT field data) for sample key.- Parameters:
key (str) – The SAMPLE column to retrieve data for.
default – Data to return if the SAMPLE column key is not present for this record. Defaults to None.
- items(self)
Return a list of (
sample,VariantRecordSample) tuples for all SAMPLE columns for this record.
- iteritems(self)
Return an iterator over (
sample,VariantRecordSample) tuples for all SAMPLE columns for this record.
- iterkeys(self)
Return an iterator over all SAMPLE columns for this record.
- itervalues(self)
Return an iterator over the
VariantRecordSampleobjects corresponding to all SAMPLE columns for this record.
- keys(self)
Return a list of all SAMPLE columns for this record.
- pop(self, key, default=_nothing)
Remove the SAMPLE column key for this record and returns its data.
- Parameters:
key (str) – SAMPLE column to retrieve for this record.
default (Any) – Data to return if the key is not present.
- Raises:
KeyError – When the key is not present and default is unset.
- Returns:
The data in the removed SAMPLE column for this record.
- Return type:
Any
- update(self, items=(), **kwargs)
Update the (
sample,VariantRecordSample) tuples for this record, using mapping entries if adictor other mapping is given, or name/value tuples if an iterable is given, and also using any keyword arguments given. Existing field values are overwritten.
- values(self)
Return a list of
VariantRecordSampleobjects corresponding to all SAMPLE columns for this record.
Tabix files
TabixFile opens tabular files that have been
indexed with tabix.
- class pysam.TabixFile
Random access to bgzf formatted files that have been indexed by tabix.
The file is automatically opened. The index file of file
<filename>is expected to be called<filename>.tbiby default (see parameter index).- Parameters:
filename (string) – Filename of bgzf file to be opened.
index (string) – The filename of the index. If not set, the default is to assume that the index is called
filename.tbimode (char) – The file opening mode. Currently, only
ris permitted.parser (
pysam.Parser) – sets the default parser for this tabix file. If parser is None, the results are returned as an unparsed string. Otherwise, parser is assumed to be a functor that will return parsed data (see for exampleasTupleandasGTF).encoding (string) – The encoding passed to the parser
threads (integer) – Number of threads to use for decompressing Tabix files. (Default=1)
- Raises:
ValueError – if index file is missing.
IOError – if file could not be opened
- close(self)
closes the
pysam.TabixFile.
- contigs
list of chromosome names
- fetch(self, reference=None, start=None, end=None, region=None, parser=None, multiple_iterators=False)
fetch one or more rows in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.
Without reference or region all entries will be fetched.
If only reference is set, all reads matching on reference will be fetched.
If parser is None, the default parser will be used for parsing.
Set multiple_iterators to true if you will be using multiple iterators on the same file at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file creates some overhead, so beware.
- header
the file header.
The file header consists of the lines at the beginning of a file that are prefixed by the comment character
#.Note
The header is returned as an iterator presenting lines without the newline character.
To iterate over tabix files, use tabix_iterator():
- pysam.tabix_iterator(infile, parser)
return an iterator over all entries in a file.
Results are returned parsed as specified by the parser. If parser is None, the results are returned as an unparsed string. Otherwise, parser is assumed to be a functor that will return parsed data (see for example
asTupleandasGTF).
- pysam.tabix_compress(filename_in, filename_out, force=False)
compress filename_in writing the output to filename_out.
Raise an IOError if filename_out already exists, unless force is set.
- pysam.tabix_index(filename, force=False, seq_col=None, start_col=None, end_col=None, preset=None, meta_char='#', int line_skip=0, zerobased=False, int min_shift=-1, index=None, keep_original=False, csi=False)
index tab-separated filename using tabix.
An existing index will not be overwritten unless force is set.
The index will be built from coordinates in columns seq_col, start_col and end_col.
The contents of filename have to be sorted by contig and position - the method does not check if the file is sorted.
Column indices are 0-based. Note that this is different from the tabix command line utility where column indices start at 1.
Coordinates in the file are assumed to be 1-based unless zerobased is set.
If preset is provided, the column coordinates are taken from a preset. Valid values for preset are “gff”, “bed”, “sam”, “vcf”, psltbl”, “pileup”.
Lines beginning with meta_char and the first line_skip lines will be skipped.
If filename is not detected as a gzip file it will be automatically compressed. The original file will be removed and only the compressed file will be retained.
By default or when min_shift is 0, creates a TBI index. If min_shift is greater than zero and/or csi is True, creates a CSI index with a minimal interval size of 1<<min_shift (1<<14 if only csi is set).
index controls the filename which should be used for creating the index. If not set, the default is to append
.tbito filename.When automatically compressing files, if keep_original is set the uncompressed file will not be deleted.
returns the filename of the compressed data
- class pysam.asTuple
converts a tabix row into a python tuple.
A field in a row is accessed by numeric index.
- class pysam.asVCF
converts a tabix row into a VCF record with the following fields:
Column
Field
Contents
1
contig
chromosome
2
pos
chromosomal position, zero-based
3
id
id
4
ref
reference allele
5
alt
alternate alleles
6
qual
quality
7
filter
filter
8
info
info
9
format
format specifier.
Access to genotypes is via index:
contig = vcf.contig first_sample_genotype = vcf[0] second_sample_genotype = vcf[1]
- class pysam.asBed
converts a tabix row into a bed record with the following fields:
Column
Field
Contents
1
contig
contig
2
start
genomic start coordinate (zero-based)
3
end
genomic end coordinate plus one (zero-based)
4
name
name of feature.
5
score
score of feature
6
strand
strand of feature
7
thickStart
thickStart
8
thickEnd
thickEnd
9
itemRGB
itemRGB
10
blockCount
number of bocks
11
blockSizes
‘,’ separated string of block sizes
12
blockStarts
‘,’ separated string of block genomic start positions
Only the first three fields are required. Additional fields are optional, but if one is defined, all the preceding need to be defined as well.
- class pysam.asGTF
converts a tabix row into a GTF record with the following fields:
Column
Name
Content
1
contig
the chromosome name
2
feature
The feature type
3
source
The feature source
4
start
genomic start coordinate (0-based)
5
end
genomic end coordinate (0-based)
6
score
feature score
7
strand
strand
8
frame
frame
9
attributes
the attribute field
GTF formatted entries also define the following fields that are derived from the attributes field:
Name
Content
gene_id
the gene identifier
transcript_id
the transcript identifier
FASTA files
- class pysam.FastaFile
Random access to fasta formatted files that have been indexed by faidx.
The file is automatically opened. The index file of file
<filename>is expected to be called<filename>.fai.- Parameters:
filename (string) – Filename of fasta file to be opened.
filepath_index (string) – Optional, filename of the index. By default this is the filename + “.fai”.
filepath_index_compressed (string) – Optional, filename of the index if fasta file is. By default this is the filename + “.gzi”.
- Raises:
ValueError – if index file is missing
IOError – if file could not be opened
- close(self)
close the file.
- closed
bool indicating the current state of the file object. This is a read-only attribute; the close() method changes the value.
- fetch(self, reference=None, start=None, end=None, region=None)
fetch sequences in a region.
A region can either be specified by reference, start and end. start and end denote 0-based, half-open intervals.
Alternatively, a samtools region string can be supplied.
If any of the coordinates are missing they will be replaced by the minimum (start) or maximum (end) coordinate.
Note that region strings are 1-based, while start and end denote an interval in python coordinates. The region is specified by reference, start and end.
- Returns:
string
- Return type:
a string with the sequence specified by the region.
- Raises:
IndexError – if the coordinates are out of range
ValueError – if the region is invalid
- filename
filename associated with this object. This is a read-only attribute.
- get_reference_length(self, reference)
return the length of reference.
- is_open(self)
return true if samfile has been opened.
FASTQ files
- class pysam.FastxFile
Stream access to fasta or fastq formatted files.
The file is automatically opened.
Entries in the file can be both fastq or fasta formatted or even a mixture of the two.
This file object permits iterating over all entries in the file. Random access is not implemented. The iteration returns objects of type
FastqProxy- Parameters:
filename (string) – Filename of fasta/fastq file to be opened.
persist (bool) – If True (default) make a copy of the entry in the file during iteration. If set to False, no copy will be made. This will permit much faster iteration, but an entry will not persist when the iteration continues and an entry is read-only.
Notes
Prior to version 0.8.2, this class was called FastqFile.
- Raises:
IOError – if file could not be opened
Examples
>>> with pysam.FastxFile(filename) as fh: ... for entry in fh: ... print(entry.name) ... print(entry.sequence) ... print(entry.comment) ... print(entry.quality) >>> with pysam.FastxFile(filename) as fin, open(out_filename, mode='w') as fout: ... for entry in fin: ... fout.write(str(entry) + '\n')
- close(self)
close the file.
- closed
bool indicating the current state of the file object. This is a read-only attribute; the close() method changes the value.
- filename
string with the filename associated with this object.
- is_open(self)
return true if samfile has been opened.
- class pysam.FastqProxy
A single entry in a fastq file.
- get_quality_array(self, int offset=33) array
return quality values as integer array after subtracting offset.
- name
The name of each entry in the fastq file.
- quality
The quality score of each entry in the fastq file, represented as a string.
- sequence
The sequence of each entry in the fastq file.
HTSFile
HTSFile is the base class for pysam.AlignmentFile and
pysam.VariantFile.
- class pysam.HTSFile
Base class for HTS file types
- add_hts_options(self, format_options)
Given a list of
key=valueformat option strings (as accepted by samtools’s global fmt-options), add them to an open htsFile.
- category
General file format category. One of UNKNOWN, ALIGNMENTS, VARIANTS, INDEX, REGIONS
- check_truncation(self, ignore_truncation=False)
Check if file is truncated.
- close(self)
- closed
return True if HTSFile is closed.
- compression
File compression.
One of NONE, GZIP, BGZF, CUSTOM.
- description
Vaguely human readable description of the file format
- flush(self)
Flush any buffered data to the underlying output stream.
- format
File format.
One of UNKNOWN, BINARY_FORMAT, TEXT_FORMAT, SAM, BAM, BAI, CRAM, CRAI, VCF, BCF, CSI, GZI, TBI, BED.
- get_tid(self, contig)
return the numerical tid corresponding to contig
returns -1 if contig is not known.
- is_bam
return True if HTSFile is reading or writing a BAM alignment file
- is_bcf
return True if HTSFile is reading or writing a BCF variant file
- is_closed
return True if HTSFile is closed.
- is_cram
return True if HTSFile is reading or writing a BAM alignment file
- is_open
return True if HTSFile is open and in a valid state.
- is_read
return True if HTSFile is open for reading
- is_sam
return True if HTSFile is reading or writing a SAM alignment file
- is_valid_reference_name(self, contig)
return True if the contig name contig is valid; False otherwise.
- is_valid_tid(self, tid)
return True if the numerical tid is valid; False otherwise.
returns -1 if contig is not known.
- is_vcf
return True if HTSFile is reading or writing a VCF variant file
- is_write
return True if HTSFile is open for writing
- parse_region(self, contig=None, start=None, stop=None, region=None, tid=None, reference=None, end=None)
parse alternative ways to specify a genomic region. A region can either be specified by contig, start and stop. start and stop denote 0-based, half-open intervals. reference and end are also accepted for backward compatibility as synonyms for contig and stop, respectively.
Alternatively, a samtools region string can be supplied.
If any of the coordinates are missing they will be replaced by the minimum (start) or maximum (stop) coordinate.
Note that region strings are 1-based inclusive, while start and stop denote an interval in 0-based, half-open coordinates (like BED files and Python slices).
If contig or region or are
*, unmapped reads at the end of a BAM file will be returned. Setting either to.will iterate from the beginning of the file.- Returns:
a tuple of flag, tid, start and stop. The flag indicates whether no coordinates were supplied and the genomic region is the complete genomic space.
- Return type:
tuple
- Raises:
ValueError – for invalid or out of bounds regions.
- reset(self)
reset file position to beginning of file just after the header.
- Returns:
The file position after moving the file pointer.
- Return type:
pointer
- seek(self, uint64_t offset, int whence=io.SEEK_SET)
move file pointer to position offset, see
pysam.HTSFile.tell().
- tell(self)
return current file position, see
pysam.HTSFile.seek().
- version
Tuple of file format version numbers (major, minor)