metsrw

Basic Usage

Reading METS files

import lxml.etree
import metsrw

# Reads a file
mets = metsrw.METSDocument.fromfile('path/to/file')

# Parses a string
mets = metsrw.METSDocument.fromstring("""<?xml version='1.0' encoding='ASCII'?>
<mets xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.loc.gov/METS/" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version18/mets.xsd">
    <metsHdr CREATEDATE="2015-12-16T22:38:48"/>
    <structMap ID="structMap_1" LABEL="Archivematica default" TYPE="physical"/>
</mets>""")

# Parses an lxml Element or ElementTree
tree = lxml.etree.parse('path/to/file')
mets = metsrw.METSDocument.fromtree(tree)

Writing METS files

import uuid

import metsrw

mets = metsrw.METSDocument()
file1 = metsrw.FSEntry("hello.pdf", file_uuid=str(uuid.uuid4()))
mets.append_file(file1)

mets.serialize()
# <Element {http://www.loc.gov/METS/}mets at 0x104f89c88>

mets.tostring()
# b'<?xml version=\'1.0\' encoding=\'ASCII\'?>\n<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version111/mets.xsd">\n  <mets:metsHdr CREATEDATE="2019-03-26T23:16:08"/>\n  <mets:fileSec>\n    <mets:fileGrp USE="original">\n      <mets:file ID="file-ad6a74d1-f8c1-4a33-a2e4-469608e3331a" GROUPID="Group-ad6a74d1-f8c1-4a33-a2e4-469608e3331a">\n        <mets:FLocat xlink:href="hello.pdf" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>\n      </mets:file>\n    </mets:fileGrp>\n  </mets:fileSec>\n  <mets:structMap ID="structMap_1" LABEL="Archivematica default" TYPE="physical">\n    <mets:div TYPE="Item" LABEL="hello.pdf">\n      <mets:fptr FILEID="file-ad6a74d1-f8c1-4a33-a2e4-469608e3331a"/>\n    </mets:div>\n  </mets:structMap>\n  <mets:structMap ID="structMap_2" LABEL="Normative Directory Structure" TYPE="logical">\n    <mets:div TYPE="Item" LABEL="hello.pdf"/>\n  </mets:structMap>\n</mets:mets>\n'

mets.write("/path/to/file")

API Documentation

class metsrw.METSDocument

Bases: object

all_files()

Return a set of all FSEntry objects in this METS document.

Returns:Set containing all FSEntry objects in this METS document, including descendants of those explicitly added.
append(fs_entry)

Adds an FSEntry object to this METS document’s tree. Any of the represented object’s children will also be added to the document.

A given FSEntry object can only be included in a document once, and any attempt to add an object a second time will be ignored.

Parameters:fs_entry (metsrw.mets.FSEntry) – FSEntry to add to the METS document
append_file(fs_entry)

Adds an FSEntry object to this METS document’s tree. Any of the represented object’s children will also be added to the document.

A given FSEntry object can only be included in a document once, and any attempt to add an object a second time will be ignored.

Parameters:fs_entry (metsrw.mets.FSEntry) – FSEntry to add to the METS document
classmethod fromfile(path)

Create a METS by parsing a file.

Parameters:path (str) – Path to a METS document.
classmethod fromstring(string)

Create a METS by parsing a string.

Parameters:string (str) – String containing a METS document.
classmethod fromtree(tree)

Create a METS from an ElementTree or Element.

Parameters:tree (ElementTree) – ElementTree to build a METS document from.
get_file(**kwargs)

Return the FSEntry that matches parameters.

Parameters:
  • file_uuid (str) – UUID of the target FSEntry.
  • label (str) – structMap LABEL of the target FSEntry.
  • type (str) – structMap TYPE of the target FSEntry.
Returns:

FSEntry that matches parameters, or None.
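As a rough illustration of the lookup semantics described above, the sketch below matches keyword arguments against the attributes of candidate entries and returns the first match, or None. It is not metsrw's implementation: the Entry tuple and the find_entry helper are hypothetical names used only for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for an FSEntry; only the searchable fields are modelled.
Entry = namedtuple("Entry", ["file_uuid", "label", "type"])


def find_entry(entries, **kwargs):
    """Return the first entry whose attributes equal every keyword given, else None."""
    for entry in entries:
        if all(getattr(entry, key, None) == value for key, value in kwargs.items()):
            return entry
    return None


entries = [
    Entry("uuid-1", "hello.pdf", "Item"),
    Entry(None, "objects", "Directory"),
]
match = find_entry(entries, label="objects", type="Directory")
missing = find_entry(entries, file_uuid="no-such-uuid")
```

Returning None rather than raising keeps a failed lookup cheap to test for, which matches the documented return value.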

classmethod read(source)

Read source into a METSDocument instance. This is an instance constructor. The source may be a path to a METS file, a file-like object, or a string of XML.

remove(fs_entry)

Removes an FSEntry object from this METS document.

Any children of this FSEntry will also be removed. The entry itself is also removed from its parent’s children, if it has a parent.

Parameters:fs_entry (metsrw.mets.FSEntry) – FSEntry to remove from the METS
remove_entry(fs_entry)

Removes an FSEntry object from this METS document.

Any children of this FSEntry will also be removed. The entry itself is also removed from its parent’s children, if it has a parent.

Parameters:fs_entry (metsrw.mets.FSEntry) – FSEntry to remove from the METS
serialize(fully_qualified=True)

Returns this document serialized to an XML Element.

Returns:Element for this document
tostring(fully_qualified=True, pretty_print=True, encoding='UTF-8')

Serialize and return a string of this METS document.

To write to file, see write().

The default encoding is UTF-8. This method will return a unicode string when encoding is set to unicode.

Returns:String of this document
write(filepath, fully_qualified=True, pretty_print=False, encoding='UTF-8')

Serialize and write this METS document to filepath.

The default encoding is UTF-8. This method will return a unicode string when encoding is set to unicode.

Parameters:filepath (str) – Path to write the METS document to
class metsrw.FSEntry(path=None, label=None, use='original', type='Item', children=None, file_uuid=None, derived_from=None, checksum=None, checksumtype=None, transform_files=None, mets_div_type=None)

Bases: metsrw.di.DependencyPossessor

A class representing a filesystem entry - either a file or a directory.

When passed to a metsrw.mets.METSDocument instance, the tree of FSEntry objects will be used to construct the <fileSec> and <structMap> elements of a METS document.

Unless otherwise specified, an FSEntry object is assumed to be a file; pass the type value as ‘Directory’ to specify that the object is instead a directory.

An FSEntry object must be instantiated with a path as the first argument to the constructor, which represents its path on disk.

An FSEntry object which is a Directory may have one or more children, representing files or directories contained within itself. Directory trees are designed for top-to-bottom traversal. Files cannot have children, and attempting to instantiate a file FSEntry object with children will raise a ValueError.

Any FSEntry object may have one or more metadata entries associated with it; these can take the form of either references to other XML files on disk, which should be wrapped in MDRef objects, or wrapped copies of those XML files, which should be wrapped in MDWrap objects.

Parameters:
  • path (str) – Path to the file on disk, as a bytestring. This will populate FLocat @xlink:href
  • label (str) – Label in the structMap. If not provided, will be populated with the basename of path
  • use (str) – Use for the fileGrp. Items with identical uses will be grouped together.
  • type (str) – Type of FSEntry this is. This will appear in the structMap.
  • children (list) – List of metsrw.fsentry.FSEntry that are direct children of this element in the structMap. Only allowed if type is ‘Directory’
  • file_uuid (str) – UUID of this entry. Will be used to construct the FILEID used in the fileSec and structMap, and GROUPID. Only required if type is ‘Item’.
  • derived_from (metsrw.fsentry.FSEntry) – FSEntry that this FSEntry is derived_from. This is used to set the GROUPID in the fileSec.
  • checksum (str) – Value of the file’s checksum. Required if checksumtype passed.
  • checksumtype (str) – Type of the checksum. Must be one of FSEntry.ALLOWED_CHECKSUMS. Required if checksum passed.
  • transform_files (list) – a list of dicts representing METS transform file elements, which provide “a means to access any subsidiary files listed below a <file> element by indicating the steps required to ‘unpack’ or transform the subsidiary files.”
Raises:
  • ValueError – if children passed when type is not ‘Directory’
  • ValueError – if only one of checksum or checksumtype passed
  • ValueError – if checksumtype is not in FSEntry.ALLOWED_CHECKSUMS
ALLOWED_CHECKSUMS = ('Adler-32', 'CRC32', 'HAVAL', 'MD5', 'MNP', 'SHA-1', 'SHA-256', 'SHA-384', 'SHA-512', 'TIGER WHIRLPOOL')
PREMIS_AGENT = 'PREMIS:AGENT'
PREMIS_EVENT = 'PREMIS:EVENT'
PREMIS_OBJECT = 'PREMIS:OBJECT'
PREMIS_RIGHTS = 'PREMIS:RIGHTS'
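The three ValueError conditions listed for the constructor can be sketched as a standalone check. This is an illustrative approximation, not metsrw's code: check_fsentry_args and is_valid are hypothetical helpers, and the tuple is copied from the ALLOWED_CHECKSUMS value shown above.

```python
# Copied verbatim from the documented FSEntry.ALLOWED_CHECKSUMS value.
ALLOWED_CHECKSUMS = (
    "Adler-32", "CRC32", "HAVAL", "MD5", "MNP",
    "SHA-1", "SHA-256", "SHA-384", "SHA-512", "TIGER WHIRLPOOL",
)


def check_fsentry_args(type="Item", children=None, checksum=None, checksumtype=None):
    """Raise ValueError for the three documented invalid combinations.

    The parameter is named `type` only to mirror the documented signature.
    """
    if children and type != "Directory":
        raise ValueError("only a Directory may have children")
    # Exactly one of the pair being set is an error; both or neither is fine.
    if (checksum is None) != (checksumtype is None):
        raise ValueError("checksum and checksumtype must be passed together")
    if checksumtype is not None and checksumtype not in ALLOWED_CHECKSUMS:
        raise ValueError("unsupported checksumtype: %s" % checksumtype)


def is_valid(**kwargs):
    """Convenience wrapper that turns the ValueError into a boolean."""
    try:
        check_fsentry_args(**kwargs)
    except ValueError:
        return False
    return True
```

For instance, is_valid(checksum="abc", checksumtype="MD5") passes, while is_valid(checksum="abc") fails because checksumtype is missing.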
add_child(child)

Add a child FSEntry to this FSEntry.

Only FSEntry objects with a type of ‘Directory’ can have children.

Cyclic parent/child relationships are not detected, but they will cause problems.

Parameters:

child (metsrw.fsentry.FSEntry) – FSEntry to add as a child

Returns:

The newly added child

Raises:
  • ValueError – If this FSEntry cannot have children.
  • ValueError – If the child and the parent are the same
add_digiprovmd(md, mdtype, mode='mdwrap', **kwargs)
add_dmdsec(md, mdtype, mode='mdwrap', **kwargs)
add_dublin_core(md, mode='mdwrap')
add_premis_agent(md, mode='mdwrap')
add_premis_event(md, mode='mdwrap')
add_premis_object(md, mode='mdwrap')
add_premis_rights(md, mode='mdwrap')
add_rightsmd(md, mdtype, mode='mdwrap', **kwargs)
add_techmd(md, mdtype, mode='mdwrap', **kwargs)
admids

Returns a list of ADMIDs for this entry.

children
classmethod dir(label, children)

Return FSEntry directory object.

dmdids

Returns a list of DMDIDs for this entry.

file_id()

Returns the fptr @FILEID if this is not a Directory.

classmethod from_fptr(label, type_, fptr)

Return FSEntry object.

get_premis_agents()
get_premis_event(event_uuid)
get_premis_events()
get_premis_objects()
get_premis_rights()
get_premis_rights_statement(rights_statement_uuid)
get_subsections_of_type(mdtype, md_class)
group_id()

Returns the @GROUPID.

If derived_from is set, returns that group_id.
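The delegation to derived_from can be illustrated with a small stand-in class. This is a sketch rather than metsrw's code: the Entry class is hypothetical, and the "Group-<uuid>" layout is taken from the GROUPID in the tostring() output shown under Basic Usage.

```python
class Entry:
    """Hypothetical stand-in for FSEntry; only the fields used here are modelled."""

    def __init__(self, file_uuid, derived_from=None):
        self.file_uuid = file_uuid
        self.derived_from = derived_from

    def group_id(self):
        # A derived file shares the group of the file it was derived from.
        if self.derived_from is not None:
            return self.derived_from.group_id()
        # "Group-<uuid>" mirrors the GROUPID in the sample tostring() output.
        return "Group-" + self.file_uuid


original = Entry("ad6a74d1-f8c1-4a33-a2e4-469608e3331a")
derivative = Entry("11111111-2222-3333-4444-555555555555", derived_from=original)
```

Here original and derivative report the same group ID, which is what groups a derivative with its source file in the fileSec.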

is_aip
is_empty_dir

Returns True if this fs item is a directory with no children or a directory with only other empty directories as children.
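The recursive definition above can be written out as a short function. The (type, children) tuples are a hypothetical stand-in for FSEntry objects, so this shows the rule rather than the library's own property.

```python
def is_empty_dir(node):
    """node is a hypothetical (type, children) tuple standing in for an FSEntry."""
    node_type, children = node
    if node_type != "Directory":
        return False
    # all() over an empty list is True, which covers the childless directory.
    return all(is_empty_dir(child) for child in children)


# A directory holding only (nested) empty directories counts as empty.
empty = ("Directory", [("Directory", []), ("Directory", [("Directory", [])])])
# A single file anywhere below makes the whole directory non-empty.
non_empty = ("Directory", [("Directory", [("Item", [])])])
```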

premis_agent_class

alias of metsrw.plugins.premisrw.premis.PREMISAgent

premis_event_class

alias of metsrw.plugins.premisrw.premis.PREMISEvent

premis_object_class

alias of metsrw.plugins.premisrw.premis.PREMISObject

premis_rights_class

alias of metsrw.plugins.premisrw.premis.PREMISRights

remove_child(child)

Remove a child from this FSEntry.

If child is not actually a child of this entry, nothing happens.

Parameters:child – Child to remove
serialize_filesec()

Return the file Element for this file, appropriate for use in a fileSec.

If this is not an Item or has no use, return None.

Returns:fileSec element for this FSEntry
serialize_md_inst(md_inst, md_class)

Serialize object md_inst by transforming it into an lxml.etree._ElementTree. If it already is one, return it. If not, make sure it is of the correct type and return the output of calling serialize() on it.

serialize_structmap(recurse=True, normative=False)

Return the div Element for this file, appropriate for use in a structMap.

If this FSEntry represents a directory, its children will be recursively appended to itself. If this FSEntry represents a file, it will contain a <fptr> element.

Parameters:
  • recurse (bool) – If true, serialize and append all children. Otherwise, serialize only this element and none of its children.
  • normative (bool) – If true, we are creating a “Normative Directory Structure” logical structmap, in which case we add div elements for empty directories and do not add fptr elements for files.
Returns:

structMap element for this FSEntry
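To make the effect of recurse and normative concrete, here is a standard-library approximation of the div structure described above. It is not metsrw's implementation: the dict-based node layout and the structmap_div helper are hypothetical, and the sketch does not model how empty directories are handled in the non-normative case.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"


def structmap_div(node, recurse=True, normative=False):
    """node is a hypothetical dict with 'type', 'label', optional 'file_id' and 'children'."""
    div = ET.Element("{%s}div" % METS_NS, TYPE=node["type"], LABEL=node["label"])
    if node["type"] != "Directory":
        # Files point at their fileSec entry, except in the normative structMap.
        if not normative:
            ET.SubElement(div, "{%s}fptr" % METS_NS, FILEID=node["file_id"])
        return div
    if recurse:
        for child in node.get("children", []):
            div.append(structmap_div(child, recurse=True, normative=normative))
    return div


tree = {
    "type": "Directory",
    "label": "objects",
    "children": [{"type": "Item", "label": "hello.pdf", "file_id": "file-1234"}],
}
physical = structmap_div(tree)
logical = structmap_div(tree, normative=True)
```

The physical result nests an fptr under the file's div, while the logical (normative) result contains the same div without it, as in the two structMap elements of the sample output under Basic Usage.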

Classes for metadata sections of the METS. These include amdSec, dmdSec, techMD, rightsMD, sourceMD, digiprovMD, mdRef and mdWrap.

class metsrw.metadata.AMDSec(section_id=None, subsections=None, tree=None)

Bases: object

An object representing a section of administrative metadata in a document.

This is ordinarily created by metsrw.mets.METSDocument instances and does not have to be instantiated directly.

Parameters:
  • section_id (str) – ID of the section. If not provided, will be generated from ‘amdSec’ and a random number.
  • subsections (list) – List of metsrw.metadata.SubSection that are part of this amdSec
  • tree (Element) – An lxml.Element that is an externally generated amdSec. This will overwrite any automatic serialization. If passed, section_id must also be passed.
id_string(force_generate=False)

Returns the ID string for the amdSec.

Parameters:force_generate (bool) – If True, will generate a new ID from ‘amdSec’ and a random number.
classmethod parse(root)

Create a new AMDSec by parsing root.

Parameters:root – Element or ElementTree to be parsed into an object.
serialize(now=None)

Serialize this amdSec and all children to lxml Element and return it.

Parameters:now (str) – Default value for CREATED in children if none set
Returns:amdSec Element with all children
tag = 'amdSec'
class metsrw.metadata.Agent(role, **kwargs)

Bases: object

An object representing an agent with a relationship to the METS record.

This is ordinarily created by metsrw.mets.METSDocument instances and does not have to be instantiated directly.

Parameters:
  • role (str) – Agent role, e.g. ‘CREATOR’.
  • id (str) – Optional unique identifier for an agent.
  • type (str) – Optional agent type, e.g. ‘ORGANIZATION’.
  • name (str) – Optional agent name, e.g. ‘9461beb-22eb-4942-88af-848cfc3462b2’.
  • notes (List[str]) – Optional agent notes, e.g. ‘Archivematica dashboard UUID’.
AGENT_TAG = <lxml.etree.QName object>
NAME_TAG = <lxml.etree.QName object>
NOTE_TAG = <lxml.etree.QName object>
ROLES = ('CREATOR', 'EDITOR', 'ARCHIVIST', 'PRESERVATION', 'DISSEMINATOR', 'CUSTODIAN', 'IPOWNER')
TYPES = ('INDIVIDUAL', 'ORGANIZATION')
classmethod parse(element)

Create a new Agent by parsing element.

Parameters:element – Element to be parsed into an Agent.
Raises:exceptions.ParseError – If element is not a valid agent.
serialize()
class metsrw.metadata.AltRecordID(alt_record_id, **kwargs)

Bases: object

An object representing an alternative record identifier in the METS document (alternatives to the OBJID).

This is ordinarily created by metsrw.mets.METSDocument instances and does not have to be instantiated directly.

Parameters:
  • id (str) – Optional unique identifier for the identifier.
  • type (str) – Optional identifier type, e.g. ‘Accession number’.
ALT_RECORD_ID_TAG = <lxml.etree.QName object>
classmethod parse(element)

Create a new AltRecordID by parsing element.

Parameters:element – Element to be parsed into an AltRecordID.
Raises:exceptions.ParseError – If element is not a valid altRecordID.
serialize()
class metsrw.metadata.MDRef(target, mdtype, loctype, label=None, otherloctype=None)

Bases: object

An object representing an external XML document, typically associated with a metsrw.fsentry.FSEntry object.

Parameters:
  • target (str) – Path to the external document. MDRef does not validate the existence of this target.
  • mdtype (str) – The string representing the mdtype of XML document being enclosed. Examples include “PREMIS:OBJECT” and “PREMIS:EVENT”.
  • label (str) – Optional LABEL for the mdRef element
  • loctype (str) – LOCTYPE of the mdRef. Must be one of ‘ARK’, ‘URN’, ‘URL’, ‘PURL’, ‘HANDLE’, ‘DOI’ or ‘OTHER’.
  • otherloctype (str) – OTHERLOCTYPE of the mdRef. Should be provided if loctype is OTHER.
VALID_LOCTYPE = ('ARK', 'URN', 'URL', 'PURL', 'HANDLE', 'DOI', 'OTHER')
classmethod parse(root)

Create a new MDRef by parsing root.

Parameters:root – Element or ElementTree to be parsed into an MDRef.
serialize()
class metsrw.metadata.MDWrap(document, mdtype, othermdtype=None)

Bases: object

An object representing an XML document enclosed in a METS document. The entirety of the XML document will be included; to reference an external document, use the MDRef class.

Parameters:
  • document (str) – A string copy of the document. It will be parsed into an ElementTree at the time of instantiation.
  • mdtype (str) – The MDTYPE of XML document being enclosed. Examples include “PREMIS:OBJECT”, “PREMIS:EVENT”, “DC” and “OTHER”.
  • othermdtype (str) – The OTHERMDTYPE of the XML document. Should be set if mdtype is “OTHER”.
classmethod parse(root)

Create a new MDWrap by parsing root.

Parameters:

root – Element or ElementTree to be parsed into an MDWrap.

Raises:
serialize()
class metsrw.metadata.SubSection(subsection, contents, section_id=None)

Bases: object

An object representing a metadata subsection in a document.

This is usually created automatically and does not have to be instantiated directly.

Parameters:
  • subsection (str) – Tag name for the subsection to be created. Should be one of ‘techMD’, ‘rightsMD’, ‘sourceMD’ or ‘digiprovMD’ if contained in an amdSec, or ‘dmdSec’.
  • contents (MDWrap or MDRef) – The MDWrap or MDRef contained in this subsection.
  • section_id (str) – ID of the section. If not provided, will be generated from subsection tag and a random number.
ALLOWED_SUBSECTIONS = ('techMD', 'rightsMD', 'sourceMD', 'digiprovMD', 'dmdSec')
get_status()

Returns the STATUS when serializing.

Calculates based on the subsection type and if it’s replacing anything.

Returns:None or the STATUS string.
id_string(force_generate=False)

Returns the ID string for this SubSection.

Parameters:force_generate (bool) – If True, will generate a new ID from the subsection tag and a random number.
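The ID behaviour described above (reuse an existing ID unless generation is forced) might look roughly like the following. The make_section_id helper is hypothetical, and both the "<tag>_<number>" layout and the numeric range are assumptions made for illustration; the documentation only states that the ID is built from the subsection tag and a random number.

```python
import random


def make_section_id(tag, existing_id=None, force_generate=False):
    """Return existing_id unless it is missing or regeneration is forced."""
    if existing_id is None or force_generate:
        # The "<tag>_<number>" layout and the numeric range are assumptions.
        return "%s_%d" % (tag, random.randint(1, 999999))
    return existing_id
```

With this sketch, make_section_id("techMD", "techMD_1") returns the ID it was given, while passing force_generate=True yields a fresh "techMD_..." value.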
classmethod parse(root)

Create a new SubSection by parsing root.

Parameters:

root – Element or ElementTree to be parsed into an object.

Raises:
replace_with(new_subsection)

Replace this SubSection with new_subsection.

The replacing SubSection must be of the same type. That is, you can only replace a dmdSec with another dmdSec, a rightsMD with another rightsMD, and so on.

Parameters:new_subsection (SubSection) – Updated version of this SubSection
serialize(now=None)

Serialize this SubSection and all children to lxml Element and return it.

Parameters:now (str) – Default value for CREATED if none set
Returns:dmdSec/techMD/rightsMD/sourceMD/digiprovMD Element with all children
metsrw.validate.get_schematron(sct_path)

Return an lxml isoschematron.Schematron() instance using the schematron file at sct_path.

metsrw.validate.get_xmlschema(xmlschema, mets_doc)

Return an lxml.etree.XMLSchema instance given the path to the XMLSchema (.xsd) file in xmlschema and the lxml.etree._ElementTree instance mets_doc representing the METS file being parsed. The complication here is that the METS file to be validated via the .xsd file may reference additional schemata via xsi:schemaLocation attributes. We have to find all of these and import them from within the returned XMLSchema.

For the solution that this is based on, see: http://code.activestate.com/recipes/578503-validate-xml-with-schemalocation/

For other descriptions of the problem, see:
  • https://groups.google.com/forum/#!topic/archivematica/UBS1ay-g_tE
  • https://stackoverflow.com/questions/26712645/xml-type-definition-is-absent
  • https://stackoverflow.com/questions/2979824/in-document-schema-declarations-and-lxml
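The schemaLocation problem described for get_xmlschema can be sketched with the standard library alone. The code below is an approximation of the general technique (collect every xsi:schemaLocation pair, then build a wrapper schema that imports them), not the code metsrw ships: collect_schema_locations and build_import_schema are hypothetical helpers, and compiling the wrapper into a validator, which metsrw does with lxml, is left out.

```python
import xml.etree.ElementTree as ET

XSI_SCHEMA_LOCATION = "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation"


def collect_schema_locations(root):
    """Map each namespace URI to the schema URL declared for it anywhere in the tree."""
    locations = {}
    for element in root.iter():
        tokens = element.get(XSI_SCHEMA_LOCATION, "").split()
        # The attribute value is a whitespace-separated list of (namespace, URL) pairs.
        for namespace, url in zip(tokens[0::2], tokens[1::2]):
            locations.setdefault(namespace, url)
    return locations


def build_import_schema(locations):
    """Return the text of a wrapper XSD that imports every collected schema."""
    imports = "".join(
        '<xs:import namespace="%s" schemaLocation="%s"/>' % pair
        for pair in sorted(locations.items())
    )
    return '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">%s</xs:schema>' % imports


doc = ET.fromstring(
    '<mets xmlns="http://www.loc.gov/METS/" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.loc.gov/METS/ '
    'http://www.loc.gov/standards/mets/version18/mets.xsd"><metsHdr/></mets>'
)
locations = collect_schema_locations(doc)
wrapper_xsd = build_import_schema(locations)
```

A wrapper schema of this kind lets a single validator see every schema the document references, rather than only the one .xsd file passed in.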

metsrw.validate.report_string(report)

Return a human-readable string representation of all of the validation errors.

metsrw.validate.schematron_validate(mets_doc, schematron='resources/archivematica_mets_schematron.xml')

Validate a METS file using a schematron schema. Return a boolean indicating validity and a report as an lxml.ElementTree instance.

metsrw.validate.sct_report_string(report)

Return a human-readable string representation of the error report returned by lxml’s schematron validator.

metsrw.validate.validate(mets_doc, xmlschema='resources/mets.xsd', schematron='resources/archivematica_mets_schematron.xml')

Validate a METS file using both an XMLSchema (.xsd) schema and a schematron schema, the latter of which typically places additional constraints on what a METS file can look like.

metsrw.validate.xsd_error_log_string(xsd_error_log)

Return a human-readable string representation of the error log returned by lxml’s XMLSchema validator.

metsrw.validate.xsd_validate(mets_doc, xmlschema='resources/mets.xsd')

Exceptions for metsrw.

All exceptions generated by this library will descend from MetsError.

exception metsrw.exceptions.MetsError

Bases: Exception

Base Exception for this module.

exception metsrw.exceptions.ParseError

Bases: metsrw.exceptions.MetsError

Error parsing a METS file.

exception metsrw.exceptions.SerializeError

Bases: metsrw.exceptions.MetsError

Error serializing a METS file.
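Because every exception descends from MetsError, callers can handle a specific failure first and fall back to the base class. The sketch below mirrors the documented hierarchy with local stand-in classes so that it runs without metsrw installed; in real code the classes would be imported from metsrw.exceptions, and describe is a hypothetical helper.

```python
class MetsError(Exception):
    """Stand-in for metsrw.exceptions.MetsError."""


class ParseError(MetsError):
    """Stand-in for metsrw.exceptions.ParseError."""


class SerializeError(MetsError):
    """Stand-in for metsrw.exceptions.SerializeError."""


def describe(error):
    """Show which handler catches a given error instance."""
    try:
        raise error
    except ParseError:
        # Most specific handler first.
        return "parse"
    except MetsError:
        # Catches SerializeError and any other library error.
        return "other metsrw error"
```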