parsnip

Overview

An interface for reading CIF files in Python.

Importing parsnip allows users to read CIF 1.1 files, as well as many features from the CIF 2.0 and mmCIF formats. Creating a CifFile object provides easy access to name-value pairs, as well as loop_-delimited loops. Data entries can be extracted as Python primitives or NumPy arrays for further use.

The CIF Format

This is an example of a simple CIF file. A key (data name or tag) must start with an underscore, and is separated from the data value by whitespace characters. A table begins with the loop_ keyword, and contains a header block and a data block. The vertical position of a tag in the table headings corresponds with the horizontal position of the associated column in the table values.

# A header describing this portion of the file
data_cif_Cu-FCC

# Several key-value pairs
_journal_year 1999
_journal_page_first 0
_journal_page_last 123

_chemical_name_mineral 'Copper FCC'
_chemical_formula_sum 'Cu'

# Key-value pairs describing the unit cell (Å and °)
_cell_length_a     3.6
_cell_length_b     3.6
_cell_length_c     3.6
_cell_angle_alpha  90.0
_cell_angle_beta   90.0
_cell_angle_gamma  90.0

# A table with 6 columns and one row
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_type_symbol
_atom_site_Wyckoff_label
Cu1 0.0000000000 0.0000000000 0.0000000000  Cu a

_symmetry_space_group_name_H-M  'Fm-3m' # One more key-value pair

# A table with two columns and four rows:
loop_
_symmetry_equiv_pos_site_id
_symmetry_equiv_pos_as_xyz
1  x,y,z
96  z,y+1/2,x+1/2
118  z+1/2,-y,x+1/2
192  z+1/2,y+1/2,x
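
The layout described above can be illustrated with a toy reader. This is a minimal sketch using only the standard library, not parsnip's parser: the function name parse_simple_cif and the dict layout it returns are hypothetical, and it only handles the simple single-line values shown in this example (shlex also strips the quotes, which parsnip keeps).

```python
import shlex


def parse_simple_cif(text):
    """Toy reader for the simple layout shown above (not parsnip's parser)."""
    pairs, loops = {}, []
    header, rows, in_header = [], [], False

    def flush():
        # Store the table collected so far, if any, and start afresh.
        nonlocal header, rows
        if header:
            loops.append({"labels": header, "rows": rows})
        header, rows = [], []

    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)  # drops '#' comments and quotes
        if not tokens:
            continue
        first = tokens[0]
        if first == "loop_":
            flush()
            in_header = True
        elif first.startswith("data_"):
            flush()
            in_header = False
        elif first.startswith("_") and in_header:
            header.append(first)  # one tag per line in the table header
        elif first.startswith("_"):
            flush()
            pairs[first] = tokens[1] if len(tokens) > 1 else None
        else:
            in_header = False
            rows.append(tokens)  # one table row per line
    flush()
    return pairs, loops


example = """\
data_cif_Cu-FCC
_cell_length_a 3.6   # a key-value pair
_chemical_name_mineral 'Copper FCC'
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_type_symbol
Cu1 0.0000000000 Cu
_symmetry_space_group_name_H-M 'Fm-3m'
"""
pairs, loops = parse_simple_cif(example)
table = loops[0]
# The third tag in the header names the third value in every row.
column = table["labels"].index("_atom_site_type_symbol")
symbols = [row[column] for row in table["rows"]]
```

Here symbols collects the values found under _atom_site_type_symbol, showing how a tag's position in the header selects a column in the data block.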

Classes:

CifFile(file[, cast_values, strict])

Parser for CIF files.

class CifFile(file, cast_values=False, strict=False)

Bases: object

Parser for CIF files.

Example

To get started, simply provide a filename:

>>> from parsnip import CifFile
>>> cif = CifFile("example_file.cif")
>>> print(cif)
CifFile(file='example_file.cif') : 12 data entries, 2 data loops

Data entries are accessible via the pairs and loops attributes:

>>> cif.pairs
{'_journal_year': '1999', '_journal_page_first': '0', ...}
>>> cif.loops[0]
array([[('Cu1', '0.0000000000', '0.0000000000', '0.0000000000', 'Cu', 'a')]],
      dtype=...)
>>> cif.loops[1]
array([[('1', 'x,y,z')],
       [('96', 'z,y+1/2,x+1/2')],
       [('118', 'z+1/2,-y,x+1/2')],
       [('192', 'z+1/2,y+1/2,x')]],
      dtype=...)

Tip

See the docs for __getitem__ and get_from_loops to query for data by key or column label.

Parameters:
  • file (str | Path | TextIO | Iterable[str]) – Path to the file to be opened.

  • cast_values (bool, optional) – Whether to convert string numerics to integers and floats. Default value = False

  • strict (bool)

Attributes:

pairs

A dict containing key-value pairs extracted from the file.

loops

A list of data tables (loop_'s) extracted from the file.

box

Read the unit cell as a freud or HOOMD box-like object.

lattice_vectors

The lattice vectors of the unit cell, with \(\vec{a_1}\perp[100]\).

loop_labels

A list of column labels for each data array.

symops

Extract the symmetry operations in a parsable algebraic form.

wyckoff_positions

Extract symmetry-irreducible, fractional x,y,z coordinates.

cast_values

Whether to cast "number-like" values to ints & floats.

PATTERNS

Regex patterns used when parsing files.

Methods:

__getitem__(index)

Return an item or list of items from pairs() and loops().

get_from_pairs(index)

Return an item or items from the dictionary of key-value pairs.

get_from_loops(index)

Return a column or columns from the matching table in loops.

read_cell_params([degrees, normalize])

Read the unit cell parameters (lengths and angles).

build_unit_cell([n_decimal_places, ...])

Reconstruct fractional atomic positions from Wyckoff sites and symops.

set_wyckoff_positions(wyckoff_sites)

Set the Wyckoff sites in the CIF file data.

structured_to_unstructured(arr)

Convert a structured (column-labeled) array to a standard unstructured array.

property pairs

A dict containing key-value pairs extracted from the file.

Numeric values will be parsed to int or float if possible. In these cases, precision specifiers will be stripped.

Return type:

dict[str, str | float | int]

property loops

A list of data tables (loop_’s) extracted from the file.

These are stored as numpy structured arrays, which can be indexed by column labels. See the structured_to_unstructured helper function below for details on converting to standard arrays.

Returns:

A list of structured arrays containing table data from the file.

Return type:

list[numpy.ndarray[str]]

__getitem__(index)

Return an item or list of items from pairs() and loops().

This getter searches the entire CIF state to identify the input keys, returning None if the key does not match any data. Matching columns from loop tables are returned as 1D arrays.

Tip

This method of accessing data is recommended for most uses, as it ensures data is returned wherever possible. get_from_loops() may be useful when multi-column slices of an array are needed.

Example

Indexing the class with a single key:

>>> cif["_journal_year"]
'1999'
>>> cif["_atom_site_label"]
array([['Cu1']], dtype='<U12')

Indexing with a list of keys:

>>> cif[["_chemical_name_mineral", "_symmetry_equiv_pos_as_xyz"]]
["'Copper FCC'",
array([['x,y,z'],
    ['z,y+1/2,x+1/2'],
    ['z+1/2,-y,x+1/2'],
    ['z+1/2,y+1/2,x']], dtype='<U14')]

Wildcards are supported for lookups with this method:

>>> cif[["_journal*", "_atom_site_fract_?"]]
[['1999', '0', '123'],
...array([['0.0000000000', '0.0000000000', '0.0000000000']], dtype='<U12')]
Parameters:

index (str | Iterable[str]) – An item key or list of keys.

get_from_pairs(index)

Return an item or items from the dictionary of key-value pairs.

Tip

This method supports Unix-style wildcards. Use * to match any number of any character, and ? to match any single character. If a wildcard matches more than one key, a list is returned for that index. The ordering of array data resulting from wildcard queries matches the ordering of the matching keys in the file. Lookups using this method are case-insensitive, per the CIF specification.

Indexing with a string returns the value from the pairs() dict. Indexing with an Iterable of strings returns a list of values, with None as a placeholder for keys that did not match any data.

Example

Indexing the class with a single key:

>>> cif.get_from_pairs("_journal_year")
'1999'

Indexing with a list of keys:

>>> cif.get_from_pairs(["_journal_page_first", "_journal_page_last"])
['0', '123']

Indexing with wildcards:

>>> cif.get_from_pairs("_journal*")
['1999', '0', '123']

Single-character wildcards can generalize keys across CIF and mmCIF files:

>>> cif.get_from_pairs("_symmetry?space_group_name_H-M")
"'Fm-3m'"
Parameters:

index (str | Iterable[str]) – An item key or list of keys.

Returns:

A list of data elements corresponding to the input key or keys. If the resulting list would have length 1, the item is returned directly instead.

Return type:

list[str|int|float]
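
The lookup rules described for get_from_pairs (wildcards, case-insensitivity, None placeholders, and unwrapping of single matches) can be sketched with the standard library's fnmatch module. This is only an illustration under those stated rules: get_from_pairs_sketch is a hypothetical stand-in operating on a plain dict, not parsnip's implementation.

```python
from fnmatch import fnmatchcase


def get_from_pairs_sketch(pairs, index):
    """Hypothetical stand-in for the lookup rules described above."""

    def lookup(pattern):
        # Compare lowercased key and pattern for a case-insensitive match.
        matches = [
            value
            for key, value in pairs.items()
            if fnmatchcase(key.lower(), pattern.lower())
        ]
        if not matches:
            return None  # placeholder for keys that match nothing
        return matches[0] if len(matches) == 1 else matches

    if isinstance(index, str):
        return lookup(index)
    return [lookup(key) for key in index]


pairs = {
    "_journal_year": "1999",
    "_journal_page_first": "0",
    "_journal_page_last": "123",
    "_symmetry_space_group_name_H-M": "'Fm-3m'",
}
all_journal = get_from_pairs_sketch(pairs, "_journal*")
upper_case = get_from_pairs_sketch(pairs, "_JOURNAL_YEAR")
with_missing = get_from_pairs_sketch(pairs, ["_journal_page_first", "_missing"])
space_group = get_from_pairs_sketch(pairs, "_symmetry?space_group_name_H-M")
```

Because dicts preserve insertion order, the wildcard result follows the order in which the keys were stored, mirroring the file-order behavior noted in the tip.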

get_from_loops(index)

Return a column or columns from the matching table in loops.

If index is a single string, a single column will be returned from the matching table. If index is an Iterable of strings, the corresponding table slices will be returned. Slices from the same table will be grouped in the output array, but slices from different arrays will be returned separately.

Tip

It is highly recommended that queries across multiple loops are made in separate calls to this function. This helps ensure output data is ordered as expected and allows for easier handling of cases where non-matching keys are provided.

Example

Extract a single column from a single table:

>>> cif.get_from_loops("_symmetry_equiv_pos_as_xyz")
array([['x,y,z'],
       ['z,y+1/2,x+1/2'],
       ['z+1/2,-y,x+1/2'],
       ['z+1/2,y+1/2,x']], dtype='<U14')

Extract multiple columns from a single table:

>>> table_1_cols = ["_symmetry_equiv_pos_site_id", "_symmetry_equiv_pos_as_xyz"]
>>> table_1 = cif.get_from_loops(table_1_cols)
>>> table_1
array([['1', 'x,y,z'],
       ['96', 'z,y+1/2,x+1/2'],
       ['118', 'z+1/2,-y,x+1/2'],
       ['192', 'z+1/2,y+1/2,x']], dtype='<U14')

Wildcard patterns are accepted for single input keys:

>>> assert (cif.get_from_loops("_symmetry_equiv_pos*") == table_1).all()

Extract multiple columns from multiple loops:

>>> table_1_cols = ["_symmetry_equiv_pos_site_id", "_symmetry_equiv_pos_as_xyz"]
>>> table_2_cols = ["_atom_site_type_symbol", "_atom_site_Wyckoff_label"]
>>> [cif.get_from_loops(cols) for cols in (table_1_cols, table_2_cols)]
[array([['1', 'x,y,z'],
       ['96', 'z,y+1/2,x+1/2'],
       ['118', 'z+1/2,-y,x+1/2'],
       ['192', 'z+1/2,y+1/2,x']], dtype='<U14'),
    array([['Cu', 'a']], dtype='<U12')]

Caution

Returned arrays will match the ordering of input index keys if all indices correspond to a single table. Indices that match multiple loops will return all possible matches, in the order of the input loops. Lists of input keys that correspond to multiple loops will return data from those loops in the order they were read from the file.

Case where ordering of output matches the input file, not the provided keys:

>>> cif.get_from_loops([*table_1_cols, *table_2_cols])
[array([['Cu', 'a']], dtype='<U12'),
 array([['1', 'x,y,z'],
        ['96', 'z,y+1/2,x+1/2'],
        ['118', 'z+1/2,-y,x+1/2'],
        ['192', 'z+1/2,y+1/2,x']], dtype='<U14')]
Parameters:

index (str | Iterable[str]) – A column name or list of column names.

Returns:

A list of unstructured arrays corresponding with matches from the input keys. If the resulting list would have length 1, the data is returned directly instead. See the note above for data ordering.

Return type:

list[numpy.ndarray] | numpy.ndarray
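
The ordering rules in the caution above can be made concrete with a small pure-Python model. This is a sketch under the stated rules only: get_from_loops_sketch and its list-of-dicts table layout are hypothetical, wildcards are omitted, and returning an empty list when nothing matches is an arbitrary choice of the sketch.

```python
def get_from_loops_sketch(loops, index):
    """Model of the ordering rules: tables in file order, columns in key order."""
    keys = [index] if isinstance(index, str) else list(index)
    results = []
    for table in loops:  # tables are visited in file order, not key order
        columns = [key for key in keys if key in table]
        if columns:
            # Build one output row per table row, with columns in requested order.
            rows = zip(*(table[column] for column in columns))
            results.append([list(row) for row in rows])
    return results[0] if len(results) == 1 else results


# Tables listed in the order they appear in the example file.
loops = [
    {"_atom_site_type_symbol": ["Cu"], "_atom_site_Wyckoff_label": ["a"]},
    {
        "_symmetry_equiv_pos_site_id": ["1", "96"],
        "_symmetry_equiv_pos_as_xyz": ["x,y,z", "z,y+1/2,x+1/2"],
    },
]
table_1_cols = ["_symmetry_equiv_pos_site_id", "_symmetry_equiv_pos_as_xyz"]
table_2_cols = ["_atom_site_type_symbol", "_atom_site_Wyckoff_label"]

mixed = get_from_loops_sketch(loops, [*table_1_cols, *table_2_cols])
swapped = get_from_loops_sketch(loops, list(reversed(table_1_cols)))
```

In this model, mixed lists the atom-site table first even though its keys were given last, while swapped keeps the requested column order because all keys belong to one table.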

read_cell_params(degrees=True, normalize=False)

Read the unit cell parameters (lengths and angles).

Parameters:
  • degrees (bool, optional) – When True, angles are returned in degrees (as in the CIF spec). When False, angles are converted to radians. Default value = True

  • normalize (bool, optional) – Whether to scale the unit cell such that the smallest lattice parameter is 1.0. Default value = False

Returns:

The box vector lengths (in angstroms) and angles (in degrees or radians) \((L_1, L_2, L_3, \alpha, \beta, \gamma)\).

Return type:

tuple[float]

Raises:

ValueError – If the stored data cannot form a valid box.
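
The effect of the degrees and normalize options can be sketched as follows. This is a minimal illustration of the behavior documented above, assuming the lengths and angles have already been read from the file; cell_params_sketch is a hypothetical helper and does not reproduce parsnip's validation or its ValueError.

```python
import math


def cell_params_sketch(lengths, angles_deg, degrees=True, normalize=False):
    """Return (L1, L2, L3, alpha, beta, gamma) following the options above."""
    if normalize:
        # Scale so that the smallest lattice parameter becomes 1.0.
        smallest = min(lengths)
        lengths = tuple(length / smallest for length in lengths)
    if degrees:
        angles = tuple(angles_deg)  # CIF files store angles in degrees
    else:
        angles = tuple(math.radians(angle) for angle in angles_deg)
    return (*lengths, *angles)


as_stored = cell_params_sketch((3.6, 3.6, 3.6), (90.0, 90.0, 90.0))
scaled = cell_params_sketch((2.0, 4.0, 6.0), (90.0, 90.0, 90.0), normalize=True)
in_radians = cell_params_sketch((3.6, 3.6, 3.6), (90.0, 90.0, 90.0), degrees=False)
```

Normalization changes only the lengths; the angles are unaffected by uniform scaling.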

build_unit_cell(n_decimal_places=4, additional_columns=None, parse_mode='python_float', verbose=False)

Reconstruct fractional atomic positions from Wyckoff sites and symops.

Rather than storing an entire unit cell’s atomic positions, CIF files instead include the data required to recreate those positions based on symmetry rules. Symmetry operations (stored as strings of x,y,z position permutations) are applied to the Wyckoff (symmetry irreducible) positions to create a list of possible atomic sites. These are then wrapped into the unit cell and filtered for uniqueness to yield the final crystal.

Tip

If the parsed unit cell has more atoms than expected, decrease n_decimal_places to account for noise. If the unit cell has fewer atoms than expected, increase n_decimal_places to ensure atoms are compared with sufficient precision. In many cases, setting parse_mode='sympy' can improve the accuracy of reconstructed unit cells.

Example

Construct the atomic positions of the FCC unit cell from its Wyckoff sites:

>>> pos = cif.build_unit_cell()
>>> pos
array([[0. , 0. , 0. ],
       [0. , 0.5, 0.5],
       [0.5, 0. , 0.5],
       [0.5, 0.5, 0. ]])

Reconstruct a unit cell with its associated atomic labels. The ordering of the auxiliary data array will match the ordering of the atomic positions:

>>> data = cif.build_unit_cell(additional_columns=["_atom_site_type_symbol"])
>>> data[0] # Chemical symbol for the atoms at each lattice site
array([['Cu'],
       ['Cu'],
       ['Cu'],
       ['Cu']], dtype='<U12')
>>> data[1] # Lattice positions
array([[0. , 0. , 0. ],
       [0. , 0.5, 0.5],
       [0.5, 0. , 0.5],
       [0.5, 0.5, 0. ]])
>>> assert (pos==data[1]).all()
Parameters:
  • n_decimal_places (int, optional) – The number of decimal places to round each position to for the uniqueness comparison. Ideally this should be set to the number of decimal places included in the CIF file, but 3 and 4 work in most cases. Default value = 4

  • additional_columns (str | Iterable[str] | None, optional) – A column name or list of column names from the loop containing the Wyckoff site positions. This data is replicated alongside the atomic coordinates and returned in an auxiliary array. Default value = None

  • parse_mode ({'sympy', 'python_float'}, optional) – Whether to parse lattice sites symbolically (parse_mode='sympy') or numerically (parse_mode='python_float'). Sympy is typically more accurate, but may be slower. Default value = 'python_float'

  • verbose (bool, optional) – Whether to print debug information about the uniqueness checks. Default value = False

Returns:

The full unit cell of the crystal structure.

Return type:

\((N, 3)\) numpy.ndarray[float]

Raises:
  • ValueError – If the stored data cannot form a valid box.

  • ValueError – If the additional_columns are not properly associated with the Wyckoff positions.

  • ImportError – If parse_mode='sympy' and Sympy is not installed.
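
The reconstruction procedure described for build_unit_cell (apply each symmetry operation, wrap into the unit cell, round, and keep unique sites) can be sketched in pure Python. This is an illustration of the idea rather than parsnip's code: apply_symop and build_unit_cell_sketch are hypothetical names, and the tiny term parser only understands sums of x, y, z and numeric constants such as 1/2 (no coefficients like 2x).

```python
import re
from fractions import Fraction

# A term is an optional sign followed by a fraction, a number, or a variable.
TERM = re.compile(r"([+-]?)\s*(\d+/\d+|\d+\.?\d*|[xyz])")


def apply_symop(symop, point):
    """Apply one 'x,y+1/2,z'-style operation to a fractional (x, y, z) point."""
    coords = dict(zip("xyz", point))
    image = []
    for component in symop.split(","):
        total = 0.0
        for sign, token in TERM.findall(component):
            value = coords[token] if token in coords else float(Fraction(token))
            total += -value if sign == "-" else value
        image.append(total)
    return tuple(image)


def build_unit_cell_sketch(wyckoff_positions, symops, n_decimal_places=4):
    """Generate all images, wrap them into [0, 1), and drop duplicates."""
    unique = {}
    for point in wyckoff_positions:
        for symop in symops:
            image = apply_symop(symop, point)
            # Round for the uniqueness comparison; the second modulo maps a
            # value that rounded up to 1.0 back onto 0.0.
            wrapped = tuple(round(c % 1.0, n_decimal_places) % 1.0 for c in image)
            unique.setdefault(wrapped, None)  # dict keys keep first-seen order
    return list(unique)


symops = ["x,y,z", "z,y+1/2,x+1/2", "z+1/2,-y,x+1/2", "z+1/2,y+1/2,x"]
fcc = build_unit_cell_sketch([(0.0, 0.0, 0.0)], symops)

# Two noisy copies of the same site merge at 4 decimal places but not at 5.
noisy = [(0.33333, 0.0, 0.0), (0.33334, 0.0, 0.0)]
merged = build_unit_cell_sketch(noisy, ["x,y,z"], n_decimal_places=4)
split = build_unit_cell_sketch(noisy, ["x,y,z"], n_decimal_places=5)
```

The last two calls show the trade-off described in the tip: fewer decimal places merge noisy duplicates, while more decimal places keep nearby sites distinct.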

property box

Read the unit cell as a freud or HOOMD box-like object.

Important

cif.box returns box extents and tilt factors, while CifFile.read_cell_params returns unit cell vector lengths and angles. See the box-like documentation linked above for more details.

Example

This method provides a convenient interface to create box objects.

>>> box = cif.box
>>> print(box)
(3.6, 3.6, 3.6, 0.0, 0.0, 0.0)
>>> import freud, hoomd
>>> freud.Box(*box)
freud.box.Box(Lx=3.6, Ly=3.6, Lz=3.6, xy=0, xz=0, yz=0, ...)
>>> hoomd.Box(*box)
hoomd.box.Box(Lx=3.6, Ly=3.6, Lz=3.6, xy=0.0, xz=0.0, yz=0.0)
Returns:

The box vector lengths (in angstroms) and unitless tilt factors. \((L_1, L_2, L_3, xy, xz, yz)\).

Return type:

tuple[float]
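
The difference between cell parameters and box extents with tilt factors can be illustrated with the conversion commonly used for HOOMD-style boxes. The sketch below assumes the first lattice vector lies along x and the second lies in the xy plane; cell_to_box_sketch is a hypothetical helper and is not taken from parsnip's source.

```python
import math


def cell_to_box_sketch(a, b, c, alpha, beta, gamma):
    """Convert lengths and angles (degrees) to (Lx, Ly, Lz, xy, xz, yz)."""
    alpha, beta, gamma = (math.radians(angle) for angle in (alpha, beta, gamma))
    lx = a
    a2x = b * math.cos(gamma)              # x component of the second vector
    ly = math.sqrt(b * b - a2x * a2x)
    xy = a2x / ly
    a3x = c * math.cos(beta)               # x component of the third vector
    a3y = (b * c * math.cos(alpha) - a2x * a3x) / ly
    lz = math.sqrt(c * c - a3x * a3x - a3y * a3y)
    return (lx, ly, lz, xy, a3x / lz, a3y / lz)


cubic = cell_to_box_sketch(3.6, 3.6, 3.6, 90.0, 90.0, 90.0)
hexagonal = cell_to_box_sketch(1.0, 1.0, 2.0, 90.0, 90.0, 120.0)
```

For the cubic example the tilt factors vanish up to floating-point rounding, so extents and lengths coincide; for the hexagonal cell Ly is shorter than b and xy is nonzero.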

property lattice_vectors: ndarray[3, 3, float64]

The lattice vectors of the unit cell, with \(\vec{a_1}\perp[100]\).

Important

The lattice vectors are stored as columns of the returned matrix, similar to freud to_matrix(). This matrix must be transposed when creating a freud box or transforming fractional coordinates to absolute.

Example

The box matrix can be used to transform fractional coordinates to absolute coordinates after transposing to row-major form.

>>> lattice_vectors = cif.lattice_vectors
>>> lattice_vectors
array([[3.6, 0.0, 0.0],
       [0.0, 3.6, 0.0],
       [0.0, 0.0, 3.6]])
>>> cif.build_unit_cell() @ lattice_vectors.T # Calculate absolute positions
array([[0.0, 0.0, 0.0],
       [0.0, 1.8, 1.8],
       [1.8, 0.0, 1.8],
       [1.8, 1.8, 0.0]])
Returns:

The lattice vectors of the unit cell \(\vec{a_1}, \vec{a_2},\vec{a_3}\).

Return type:

\((3, 3)\) numpy.ndarray
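
Why the transpose matters is easiest to see with a non-orthogonal cell, where it is no longer a no-op. The sketch below uses plain nested lists and a made-up sheared cell; frac_to_abs is a hypothetical helper that computes the same product as frac @ lattice_vectors.T for a single point.

```python
def frac_to_abs(frac, lattice_vectors):
    """Map one fractional point to absolute coordinates.

    lattice_vectors[i][j] is the i-th Cartesian component of the j-th
    lattice vector, i.e. the vectors are stored as columns.
    """
    return tuple(
        sum(lattice_vectors[i][j] * frac[j] for j in range(3)) for i in range(3)
    )


# Made-up sheared cell: a1 = (2, 0, 0), a2 = (1, 3, 0), a3 = (0, 0, 4).
sheared = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 4.0],
]
along_a2 = frac_to_abs((0.0, 1.0, 0.0), sheared)
body_center = frac_to_abs((0.5, 0.5, 0.5), sheared)
```

The fractional point (0, 1, 0) lands on a2 itself. Multiplying by the untransposed matrix would instead return the second row, (0, 3, 0), which is not a lattice vector of this cell.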

property loop_labels: list[tuple[str, ...]]

A list of column labels for each data array.

This property is equivalent to [arr.dtype.names for arr in self.loops].

Returns:

Column labels for loops, stored as a nested list of strings.

Return type:

list[tuple[str, …]]

property symops: ndarray | None

Extract the symmetry operations in a parsable algebraic form.

Example

>>> cif.symops
array([['x,y,z'],
       ['z,y+1/2,x+1/2'],
       ['z+1/2,-y,x+1/2'],
       ['z+1/2,y+1/2,x']], dtype='<U14')
Returns:

An array containing the symmetry operations, or None if none are found.

Return type:

\((N,1)\) numpy.ndarray[str]

property wyckoff_positions

Extract symmetry-irreducible, fractional x,y,z coordinates.

Returns:

Symmetry-irreducible positions of atoms in fractional coordinates.

Return type:

\((N, 3)\) numpy.ndarray

set_wyckoff_positions(wyckoff_sites)

Set the Wyckoff sites in the CIF file data.

This method updates the values of the Wyckoff position coordinates in the corresponding loop structure. The input is a NumPy array of floating point values, which will be converted to strings for storage.

If the provided array has a different number of rows than the existing data, the loop will be resized. When adding new sites, placeholder data (“?”) will be used for non-coordinate columns. When removing sites, rows are removed from the end of the loop.

Danger

Changing the Wyckoff positions may invalidate other keys in the original file, most commonly by changing the _chemical_formula_sum and space group data. Correct structures will be built when using build_unit_cell() , but use of keys related to structural or chemical data is discouraged once the basis has been modified. Refer to Refining and Experimenting with Structures for further details.

Parameters:

wyckoff_sites (\((N, 3)\) numpy.ndarray) – The new Wyckoff site data.

Raises:

ValueError – If the Wyckoff position keys cannot be found in any loop, or if the input array does not have 3 columns.
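
The resizing behavior described above (placeholders for new rows, truncation from the end) can be modeled with plain dicts. This is a sketch only: resize_sites_sketch is a hypothetical helper, each row is represented as a dict of column label to string, and converting floats with str() is an assumption about formatting rather than parsnip's actual behavior.

```python
def resize_sites_sketch(rows, columns, coord_columns, new_coords):
    """Return new table rows holding new_coords in the coordinate columns."""
    resized = []
    for i, coords in enumerate(new_coords):
        if i < len(rows):
            row = dict(rows[i])  # keep labels and other data for existing sites
        else:
            row = {column: "?" for column in columns}  # placeholder for new sites
        # Coordinates are stored as strings (str() is an assumed format).
        row.update(zip(coord_columns, (str(value) for value in coords)))
        resized.append(row)
    # Rows beyond len(new_coords) are dropped from the end of the table.
    return resized


columns = ["_atom_site_label", "_atom_site_fract_x", "_atom_site_fract_y", "_atom_site_fract_z"]
coord_columns = columns[1:]
rows = [dict(zip(columns, ["Cu1", "0.0", "0.0", "0.0"]))]

grown = resize_sites_sketch(rows, columns, coord_columns, [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)])
shrunk = resize_sites_sketch(grown, columns, coord_columns, [(0.25, 0.25, 0.25)])
```

After growing, the second row carries "?" in _atom_site_label, which is one reason keys describing chemistry or symmetry may no longer be trustworthy once the basis has been edited.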

property cast_values

Whether to cast “number-like” values to ints & floats.

Caution

When set to True after construction, the values are modified in-place. This action cannot be reversed.

Type:

bool
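
What casting "number-like" values might look like is sketched below. This is not parsnip's implementation: cast_value_sketch and its regular expression are hypothetical, and the sketch assumes that a precision specifier is a parenthesized standard uncertainty such as the (2) in 3.6(2).

```python
import re

# A number, optionally followed by a parenthesized uncertainty such as "(2)".
NUMERIC = re.compile(r"([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\(\d+\))?")


def cast_value_sketch(text):
    """Cast a number-like string to int or float; return other strings as-is."""
    match = NUMERIC.fullmatch(text)
    if match is None:
        return text  # not number-like: leave unchanged
    number = match.group(1)  # the uncertainty, if present, is dropped here
    if re.fullmatch(r"[+-]?\d+", number):
        return int(number)
    return float(number)


year = cast_value_sketch("1999")
length = cast_value_sketch("3.6(2)")
name = cast_value_sketch("'Copper FCC'")
symop = cast_value_sketch("x,y,z")
```

Since the original strings are replaced, information such as trailing zeros or uncertainties cannot be recovered afterwards, which is consistent with the caution above about the cast being irreversible.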

classmethod structured_to_unstructured(arr)

Convert a structured (column-labeled) array to a standard unstructured array.

This is useful when extracting entire tables from loops for use in other programs. This classmethod calls np.lib.recfunctions.structured_to_unstructured on the input data to ensure the resulting array is properly laid out in memory, with additional checks to ensure the output properly reflects the underlying data. See the NumPy structured array documentation for more information.

Parameters:

arr (numpy.ndarray | numpy.recarray) – The structured array to convert.

Returns:

An unstructured array containing a copy of the data from the input.

Return type:

numpy.ndarray
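
The difference between a structured and an unstructured array can be seen with NumPy alone. The sketch below calls numpy.lib.recfunctions.structured_to_unstructured directly on a small made-up table of floats; it does not include the additional checks that the classmethod above is described as performing, and real loops hold strings rather than floats.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# A made-up two-row table with labeled float columns.
table = np.array(
    [(0.0, 0.0, 0.0), (0.0, 0.5, 0.5)],
    dtype=[
        ("_atom_site_fract_x", "f8"),
        ("_atom_site_fract_y", "f8"),
        ("_atom_site_fract_z", "f8"),
    ],
)

labels = table.dtype.names            # column labels, as used by loop_labels
y_column = table["_atom_site_fract_y"]  # structured arrays index by label

# Convert to an ordinary 2D float array for use with other numerical code.
plain = rfn.structured_to_unstructured(table)
```

The structured form is convenient for lookups by column label, while the unstructured form has an ordinary (rows, columns) shape that numerical routines expect.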

PATTERNS: ClassVar = {'block_delimiter': '(data_)[\t ]*+([^\\n]*+)', 'bracket': '(\\[|\\])', 'comment': '#.*?$', 'key_list': '_\\S+?(?=\\s|$)', 'key_value_general': '^(_\\S+?)\\s++((?s:.)+?)$', 'loop_delimiter': '(loop_)[\t ]*+([^\\n]*+)', 'space_delimited_data': '(;[^;]*?;|\'(?:\'\\S|[^\'])*\'|[^\';\\"\\s]*+)'}

Regex patterns used when parsing files.

Caution

This dictionary can be modified to change parsing behavior, although doing so is not recommended. Changes to this variable are shared across all instances of the class.

Please refer to the CIF grammar for further details.
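
Two of the simpler patterns can be tried out with the standard re module. The pattern strings below are copied from the key_list and comment entries of PATTERNS; the choice of re.MULTILINE and the way the patterns are applied are assumptions of this sketch, not a description of how parsnip uses them. Several other entries use possessive quantifiers (*+ and ++), which Python's re module only accepts from version 3.11 onward.

```python
import re

KEY_LIST = r"_\S+?(?=\s|$)"  # copied from PATTERNS["key_list"]
COMMENT = r"#.*?$"           # copied from PATTERNS["comment"]

# Extract the tags from a loop header, one tag per line.
header = "_atom_site_label\n_atom_site_fract_x\n_atom_site_fract_y"
labels = re.findall(KEY_LIST, header)

# Remove a trailing comment from a key-value line.
line = "_symmetry_space_group_name_H-M  'Fm-3m' # One more key-value pair"
without_comment = re.sub(COMMENT, "", line, flags=re.MULTILINE).rstrip()
```

Experimenting on copies of the pattern strings, as done here, avoids touching the shared class-level dictionary.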