metsrw
Basic Usage
Reading METS files
import lxml.etree
import metsrw

# Reads a file
mets = metsrw.METSDocument.fromfile('path/to/file')
# Parses a string
mets = metsrw.METSDocument.fromstring("""<?xml version='1.0' encoding='ASCII'?>
<mets xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.loc.gov/METS/" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version18/mets.xsd">
<metsHdr CREATEDATE="2015-12-16T22:38:48"/>
<structMap ID="structMap_1" LABEL="Archivematica default" TYPE="physical"/>
</mets>""")
# Parses an lxml Element or ElementTree
tree = lxml.etree.parse('path/to/file')
mets = metsrw.METSDocument.fromtree(tree)
Writing METS files
import uuid

import metsrw

mets = metsrw.METSDocument()
file1 = metsrw.FSEntry("hello.pdf", file_uuid=str(uuid.uuid4()))
mets.append_file(file1)
mets.serialize()
# <Element {http://www.loc.gov/METS/}mets at 0x104f89c88>
mets.tostring()
# b'<?xml version=\'1.0\' encoding=\'ASCII\'?>\n<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version111/mets.xsd">\n <mets:metsHdr CREATEDATE="2019-03-26T23:16:08"/>\n <mets:fileSec>\n <mets:fileGrp USE="original">\n <mets:file ID="file-ad6a74d1-f8c1-4a33-a2e4-469608e3331a" GROUPID="Group-ad6a74d1-f8c1-4a33-a2e4-469608e3331a">\n <mets:FLocat xlink:href="hello.pdf" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>\n </mets:file>\n </mets:fileGrp>\n </mets:fileSec>\n <mets:structMap ID="structMap_1" LABEL="Archivematica default" TYPE="physical">\n <mets:div TYPE="Item" LABEL="hello.pdf">\n <mets:fptr FILEID="file-ad6a74d1-f8c1-4a33-a2e4-469608e3331a"/>\n </mets:div>\n </mets:structMap>\n <mets:structMap ID="structMap_2" LABEL="Normative Directory Structure" TYPE="logical">\n <mets:div TYPE="Item" LABEL="hello.pdf"/>\n </mets:structMap>\n</mets:mets>\n'
mets.write("/path/to/file")
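The bytes returned by tostring() are ordinary METS XML, so they can be inspected with any XML tooling. The following is a minimal sketch that uses only the standard library's xml.etree.ElementTree on a trimmed copy of the output shown above. The helper name list_files and the SAMPLE constant are hypothetical and are not part of metsrw.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Trimmed copy of the tostring() output shown above (assumption: used as test data only).
SAMPLE = b"""<?xml version='1.0' encoding='ASCII'?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-ad6a74d1-f8c1-4a33-a2e4-469608e3331a" GROUPID="Group-ad6a74d1-f8c1-4a33-a2e4-469608e3331a">
        <mets:FLocat xlink:href="hello.pdf" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap ID="structMap_1" LABEL="Archivematica default" TYPE="physical">
    <mets:div TYPE="Item" LABEL="hello.pdf">
      <mets:fptr FILEID="file-ad6a74d1-f8c1-4a33-a2e4-469608e3331a"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def list_files(mets_bytes):
    """Return (file ID, fileGrp USE, href) for every file in the fileSec."""
    root = ET.fromstring(mets_bytes)
    href_attr = "{%s}href" % NS["xlink"]
    rows = []
    for grp in root.findall("mets:fileSec/mets:fileGrp", NS):
        for file_el in grp.findall("mets:file", NS):
            flocat = file_el.find("mets:FLocat", NS)
            href = flocat.get(href_attr) if flocat is not None else None
            rows.append((file_el.get("ID"), grp.get("USE"), href))
    return rows


files = list_files(SAMPLE)
```

Each row pairs the file ID that the structMap's fptr refers to with the fileGrp USE and the path stored in FLocat.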
API Documentation
-
class
metsrw.
METSDocument
[source]¶ Bases:
object
-
all_files
()[source]¶ Return a set of all FSEntrys in this METS document.
Returns: Set containing all FSEntry in this METS document, including descendants of ones explicitly added.
-
append
(fs_entry)¶ Adds an FSEntry object to this METS document’s tree. Any of the represented object’s children will also be added to the document.
A given FSEntry object can only be included in a document once, and any attempt to add an object the second time will be ignored.
Parameters: fs_entry (metsrw.mets.FSEntry) – FSEntry to add to the METS document
-
append_file
(fs_entry)[source]¶ Adds an FSEntry object to this METS document’s tree. Any of the represented object’s children will also be added to the document.
A given FSEntry object can only be included in a document once, and any attempt to add an object the second time will be ignored.
Parameters: fs_entry (metsrw.mets.FSEntry) – FSEntry to add to the METS document
-
classmethod
fromfile
(path)[source]¶ Creates a METS by parsing a file.
Parameters: path (str) – Path to a METS document.
-
classmethod
fromstring
(string)[source]¶ Create a METS by parsing a string.
Parameters: string (str) – String containing a METS document.
-
classmethod
fromtree
(tree)[source]¶ Create a METS from an ElementTree or Element.
Parameters: tree (ElementTree) – ElementTree to build a METS document from.
-
get_file
(**kwargs)[source]¶ Return the FSEntry that matches parameters.
Parameters: - file_uuid (str) – UUID of the target FSEntry.
- label (str) – structMap LABEL of the target FSEntry.
- type (str) – structMap TYPE of the target FSEntry.
Returns: FSEntry that matches parameters, or None.
-
classmethod
read
(source)[source]¶ Read source into a METSDocument instance. This is an instance constructor. The source may be a path to a METS file, a file-like object, or a string of XML.
-
remove
(fs_entry)¶ Removes an FSEntry object from this METS document.
Any children of this FSEntry will also be removed. The entry is also removed as a child of its parent, if any.
Parameters: fs_entry (metsrw.mets.FSEntry) – FSEntry to remove from the METS
-
remove_entry
(fs_entry)[source]¶ Removes an FSEntry object from this METS document.
Any children of this FSEntry will also be removed. The entry is also removed as a child of its parent, if any.
Parameters: fs_entry (metsrw.mets.FSEntry) – FSEntry to remove from the METS
-
serialize
(fully_qualified=True)[source]¶ Returns this document serialized to an xml Element.
Returns: Element for this document
-
-
class
metsrw.
FSEntry
(path=None, label=None, use='original', type='Item', children=None, file_uuid=None, derived_from=None, checksum=None, checksumtype=None, transform_files=None, mets_div_type=None)[source]¶ Bases:
metsrw.di.DependencyPossessor
A class representing a filesystem entry - either a file or a directory.
When passed to a metsrw.mets.METSDocument instance, the tree of FSEntry objects will be used to construct the <fileSec> and <structMap> elements of a METS document.
Unless otherwise specified, an FSEntry object is assumed to be a file; pass the type value as ‘Directory’ to specify that the object is instead a directory.
An FSEntry object must be instantiated with a path as the first argument to the constructor, which represents its path on disk.
An FSEntry object which is a Directory may have one or more children, representing files or directories contained within itself. Directory trees are designed for top-to-bottom traversal. Files cannot have children, and attempting to instantiate a file FSEntry object with children will raise a ValueError.
Any FSEntry object may have one or more metadata entries associated with it; these can take the form of either references to other XML files on disk, which should be wrapped in MDRef objects, or wrapped copies of those XML files, which should be wrapped in MDWrap objects.
Parameters: - path (str) – Path to the file on disk, as a bytestring. This will populate FLocat @xlink:href
- label (str) – Label in the structMap. If not provided, will be populated with the basename of path
- use (str) – Use for the fileGrp. Items with identical uses will be grouped together.
- type (str) – Type of FSEntry this is. This will appear in the structMap.
- children (list) – List of metsrw.fsentry.FSEntry that are direct children of this element in the structMap. Only allowed if type is ‘Directory’.
- file_uuid (str) – UUID of this entry. Will be used to construct the FILEID used in the fileSec and structMap, and GROUPID. Only required if type is ‘Item’.
- derived_from (metsrw.fsentry.FSEntry) – FSEntry that this FSEntry is derived from. This is used to set the GROUPID in the fileSec.
- checksum (str) – Value of the file’s checksum. Required if checksumtype passed.
- checksumtype (str) – Type of the checksum. Must be one of FSEntry.ALLOWED_CHECKSUMS. Required if checksum passed.
- transform_files (list) – A list of dicts representing METS transform file elements, which provide “a means to access any subsidiary files listed below a <file> element by indicating the steps required to ‘unpack’ or transform the subsidiary files.”
Raises: - ValueError – if children passed when type is not ‘Directory’
- ValueError – if only one of checksum or checksumtype passed
- ValueError – if checksumtype is not in FSEntry.ALLOWED_CHECKSUMS
-
ALLOWED_CHECKSUMS
= ('Adler-32', 'CRC32', 'HAVAL', 'MD5', 'MNP', 'SHA-1', 'SHA-256', 'SHA-384', 'SHA-512', 'TIGER WHIRLPOOL')¶
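The three ValueError conditions listed for the constructor can be read as a small set of argument checks. The sketch below restates them as a standalone function. It is an illustration of the documented rules and not the library's implementation: check_fsentry_args is a hypothetical name, the tuple is copied from the value shown above, and the case-insensitive comparison of type is an assumption.

```python
# Copied from the ALLOWED_CHECKSUMS value documented above.
ALLOWED_CHECKSUMS = (
    "Adler-32", "CRC32", "HAVAL", "MD5", "MNP",
    "SHA-1", "SHA-256", "SHA-384", "SHA-512", "TIGER WHIRLPOOL",
)


def check_fsentry_args(type="Item", children=None, checksum=None, checksumtype=None):
    """Hypothetical helper: raise ValueError for the documented bad combinations."""
    # Assumption: 'Directory' is matched case-insensitively.
    if children and type.lower() != "directory":
        raise ValueError("children are only allowed when type is 'Directory'")
    # checksum and checksumtype must be given together or not at all.
    if (checksum is None) != (checksumtype is None):
        raise ValueError("checksum and checksumtype must be passed together")
    if checksumtype is not None and checksumtype not in ALLOWED_CHECKSUMS:
        raise ValueError("unsupported checksumtype: %r" % (checksumtype,))


# A valid combination passes silently.
check_fsentry_args(checksum="d41d8cd98f00b204e9800998ecf8427e", checksumtype="MD5")
```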
-
PREMIS_AGENT
= 'PREMIS:AGENT'¶
-
PREMIS_EVENT
= 'PREMIS:EVENT'¶
-
PREMIS_OBJECT
= 'PREMIS:OBJECT'¶
-
PREMIS_RIGHTS
= 'PREMIS:RIGHTS'¶
-
add_child
(child)[source]¶ Add a child FSEntry to this FSEntry.
Only FSEntrys with a type of ‘Directory’ can have children.
Cyclic parent/child relationships are not detected, and creating one will cause problems.
Parameters: child (metsrw.fsentry.FSEntry) – FSEntry to add as a child
Returns: The newly added child
Raises: - ValueError – If this FSEntry cannot have children.
- ValueError – If the child and the parent are the same
-
admids
¶ Returns a list of ADMIDs for this entry.
-
children
¶
-
dmdids
¶ Returns a list of DMDIDs for this entry.
-
is_aip
¶
-
is_empty_dir
¶ Returns True if this fs item is a directory with no children or a directory with only other empty directories as children.
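The definition above is recursive: a directory is empty when every child is itself an empty directory. A minimal sketch of that rule follows, using a hypothetical Node class as a stand-in for FSEntry so that it runs without metsrw. It illustrates the documented behaviour and is not the library's code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for an FSEntry: just a type and a list of children."""
    type: str = "Item"
    children: List["Node"] = field(default_factory=list)


def is_empty_dir(node: Node) -> bool:
    """True for a directory whose subtree contains no files at all."""
    if node.type.lower() != "directory":
        return False
    # all() over an empty list is True, which covers the "no children" case.
    return all(is_empty_dir(child) for child in node.children)


empty_tree = Node("Directory", [Node("Directory"), Node("Directory", [Node("Directory")])])
with_file = Node("Directory", [Node("Directory"), Node("Item")])
```

Here is_empty_dir(empty_tree) holds because no file appears anywhere below the root, while a single Item in with_file is enough to make the check fail.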
-
premis_agent_class
¶
-
premis_event_class
¶
-
premis_object_class
¶
-
premis_rights_class
¶
-
remove_child
(child)[source]¶ Remove a child from this FSEntry.
If child is not actually a child of this entry, nothing happens.
Parameters: child – Child to remove
-
serialize_filesec
()[source]¶ Return the file Element for this file, appropriate for use in a fileSec.
If this is not an Item or has no use, return None.
Returns: fileSec element for this FSEntry
-
serialize_md_inst
(md_inst, md_class)[source]¶ Serialize object md_inst by transforming it into an lxml.etree._ElementTree. If it already is such, return it. If not, make sure it is the correct type and return the output of calling serialize() on it.
-
serialize_structmap
(recurse=True, normative=False)[source]¶ Return the div Element for this file, appropriate for use in a structMap.
If this FSEntry represents a directory, its children will be recursively appended to itself. If this FSEntry represents a file, it will contain a <fptr> element.
Parameters: - recurse (bool) – If true, serialize and append all children. Otherwise, only serialize this element but not any children.
- normative (bool) – If true, we are creating a “Normative Directory Structure” logical structmap, in which case we add div elements for empty directories and do not add fptr elements for files.
Returns: structMap element for this FSEntry
Classes for metadata sections of the METS, including amdSec, dmdSec, techMD, rightsMD, sourceMD, digiprovMD, mdRef and mdWrap.
-
class
metsrw.metadata.
AMDSec
(section_id=None, subsections=None, tree=None)[source]¶ Bases:
object
An object representing a section of administrative metadata in a document.
This is ordinarily created by metsrw.mets.METSDocument instances and does not have to be instantiated directly.
Parameters: - section_id (str) – ID of the section. If not provided, will be generated from ‘amdSec’ and a random number.
- subsections (list) – List of metsrw.metadata.SubSection that are part of this amdSec.
- tree (Element) – An lxml.Element that is an externally generated amdSec. This will overwrite any automatic serialization. If passed, section_id must also be passed.
-
id_string
(force_generate=False)[source]¶ Returns the ID string for the amdSec.
Parameters: force_generate (bool) – If True, will generate a new ID from ‘amdSec’ and a random number.
-
classmethod
parse
(root)[source]¶ Create a new AMDSec by parsing root.
Parameters: root – Element or ElementTree to be parsed into an object.
-
serialize
(now=None)[source]¶ Serialize this amdSec and all children to lxml Element and return it.
Parameters: now (str) – Default value for CREATED in children if none set
Returns: amdSec Element with all children
-
tag
= 'amdSec'¶
-
class
metsrw.metadata.
Agent
(role, **kwargs)[source]¶ Bases:
object
An object representing an agent with a relationship to the METS record.
This is ordinarily created by metsrw.mets.METSDocument instances and does not have to be instantiated directly.
Parameters: - role (str) – Agent role, e.g. ‘CREATOR’.
- id (str) – Optional unique identifier for an agent.
- type (str) – Optional agent type, e.g. ‘ORGANIZATION’.
- name (str) – Optional agent name, e.g. ‘9461beb-22eb-4942-88af-848cfc3462b2’.
- notes (List[str]) – Optional agent notes, e.g. ‘Archivematica dashboard UUID’.
-
AGENT_TAG
= <lxml.etree.QName object>¶
-
NAME_TAG
= <lxml.etree.QName object>¶
-
NOTE_TAG
= <lxml.etree.QName object>¶
-
ROLES
= ('CREATOR', 'EDITOR', 'ARCHIVIST', 'PRESERVATION', 'DISSEMINATOR', 'CUSTODIAN', 'IPOWNER')¶
-
TYPES
= ('INDIVIDUAL', 'ORGANIZATION')¶
-
classmethod
parse
(element)[source]¶ Create a new Agent by parsing element.
Parameters: element – Element to be parsed into an Agent.
Raises: exceptions.ParseError – If element is not a valid agent.
-
class
metsrw.metadata.
AltRecordID
(alt_record_id, **kwargs)[source]¶ Bases:
object
An object representing an alternative record identifier in the METS document (alternatives to the OBJID).
This is ordinarily created by metsrw.mets.METSDocument instances and does not have to be instantiated directly.
Parameters: - id (str) – Optional unique ID for this identifier.
- type (str) – Optional identifier type, e.g. ‘Accession number’.
-
ALT_RECORD_ID_TAG
= <lxml.etree.QName object>¶
-
classmethod
parse
(element)[source]¶ Create a new AltRecordID by parsing element.
Parameters: element – Element to be parsed into an AltRecordID.
Raises: exceptions.ParseError – If element is not a valid altRecordID.
-
class
metsrw.metadata.
MDRef
(target, mdtype, loctype, label=None, otherloctype=None)[source]¶ Bases:
object
An object representing an external XML document, typically associated with a metsrw.fsentry.FSEntry object.
Parameters: - target (str) – Path to the external document. MDRef does not validate the existence of this target.
- mdtype (str) – The string representing the mdtype of XML document being enclosed. Examples include “PREMIS:OBJECT” and “PREMIS:EVENT”.
- label (str) – Optional LABEL for the mdRef element
- loctype (str) – LOCTYPE of the mdRef. Must be one of ‘ARK’, ‘URN’, ‘URL’, ‘PURL’, ‘HANDLE’, ‘DOI’ or ‘OTHER’.
- otherloctype (str) – OTHERLOCTYPE of the mdRef. Should be provided if loctype is OTHER.
-
VALID_LOCTYPE
= ('ARK', 'URN', 'URL', 'PURL', 'HANDLE', 'DOI', 'OTHER')¶
-
class
metsrw.metadata.
MDWrap
(document, mdtype, othermdtype=None)[source]¶ Bases:
object
An object representing an XML document enclosed in a METS document. The entirety of the XML document will be included; to reference an external document, use the MDRef class.
Parameters: - document (str) – A string copy of the document, which will be parsed into an ElementTree at the time of instantiation.
- mdtype (str) – The MDTYPE of XML document being enclosed. Examples include “PREMIS:OBJECT”, “PREMIS:EVENT”, “DC” and “OTHER”.
- othermdtype (str) – The OTHERMDTYPE of the XML document. Should be set if mdtype is “OTHER”.
-
classmethod
parse
(root)[source]¶ Create a new MDWrap by parsing root.
Parameters: root – Element or ElementTree to be parsed into a MDWrap.
Raises: - exceptions.ParseError – If mdWrap does not contain MDTYPE
- exceptions.ParseError – If xmlData contains no children
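In METS, an mdWrap element carries its MDTYPE as an attribute and holds the enclosed document inside an xmlData child, which is what the two parse errors above refer to. The sketch below builds that structure with the standard library's xml.etree.ElementTree and checks the same two conditions. The helpers wrap_document and check_md_wrap are hypothetical, and the check raises ValueError rather than exceptions.ParseError so that the snippet runs without metsrw.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"


def wrap_document(document, mdtype, othermdtype=None):
    """Build <mdWrap MDTYPE=...><xmlData>document</xmlData></mdWrap>."""
    md_wrap = ET.Element("{%s}mdWrap" % METS_NS, {"MDTYPE": mdtype})
    if othermdtype is not None:
        md_wrap.set("OTHERMDTYPE", othermdtype)
    xml_data = ET.SubElement(md_wrap, "{%s}xmlData" % METS_NS)
    # The document string is parsed at wrap time, as described for MDWrap.
    xml_data.append(ET.fromstring(document))
    return md_wrap


def check_md_wrap(md_wrap):
    """Raise ValueError for the two conditions documented as parse errors."""
    if md_wrap.get("MDTYPE") is None:
        raise ValueError("mdWrap must have a MDTYPE")
    xml_data = md_wrap.find("{%s}xmlData" % METS_NS)
    if xml_data is None or len(xml_data) == 0:
        raise ValueError("xmlData must contain at least one child element")


wrapped = wrap_document("<note>hello</note>", "OTHER", othermdtype="CUSTOM")
check_md_wrap(wrapped)  # a well-formed mdWrap passes silently
```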
-
class
metsrw.metadata.
SubSection
(subsection, contents, section_id=None)[source]¶ Bases:
object
An object representing a metadata subsection in a document.
This is usually created automatically and does not have to be instantiated directly.
Parameters: - subsection (str) – Tag name for the subsection to be created. Should be one of ‘techMD’, ‘rightsMD’, ‘sourceMD’ or ‘digiprovMD’ if contained in an amdSec, or ‘dmdSec’.
- contents (MDWrap or MDRef) – The MDWrap or MDRef contained in this subsection.
- section_id (str) – ID of the section. If not provided, will be generated from subsection tag and a random number.
-
ALLOWED_SUBSECTIONS
= ('techMD', 'rightsMD', 'sourceMD', 'digiprovMD', 'dmdSec')¶
-
get_status
()[source]¶ Returns the STATUS when serializing.
Calculates based on the subsection type and if it’s replacing anything.
Returns: None or the STATUS string.
-
id_string
(force_generate=False)[source]¶ Returns the ID string for this SubSection.
Parameters: force_generate (bool) – If True, will generate a new ID from the subsection tag and a random number.
-
classmethod
parse
(root)[source]¶ Create a new SubSection by parsing root.
Parameters: root – Element or ElementTree to be parsed into an object.
Raises: - exceptions.ParseError – If root’s tag is not in SubSection.ALLOWED_SUBSECTIONS.
- exceptions.ParseError – If the first child of root is not mdRef or mdWrap.
-
replace_with
(new_subsection)[source]¶ Replace this SubSection with new_subsection.
The replacing SubSection must be of the same type. That is, you can only replace a dmdSec with another dmdSec, or a rightsMD with a rightsMD etc.
Parameters: new_subsection (SubSection) – Updated version of this SubSection
-
metsrw.validate.
get_schematron
(sct_path)[source]¶ Return an lxml isoschematron.Schematron() instance using the schematron file at sct_path.
-
metsrw.validate.
get_xmlschema
(xmlschema, mets_doc)[source]¶ Return an lxml.etree.XMLSchema instance given the path to the XMLSchema (.xsd) file in xmlschema and the lxml.etree._ElementTree instance mets_doc representing the METS file being parsed. The complication here is that the METS file to be validated via the .xsd file may reference additional schemata via xsi:schemaLocation attributes. We have to find all of these and import them from within the returned XMLSchema.
For the solution that this is based on, see: http://code.activestate.com/recipes/578503-validate-xml-with-schemalocation/
For other descriptions of the problem, see:
- https://groups.google.com/forum/#!topic/archivematica/UBS1ay-g_tE
- https://stackoverflow.com/questions/26712645/xml-type-definition-is-absent
- https://stackoverflow.com/questions/2979824/in-document-schema-declarations-and-lxml
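The discovery step described above, finding every xsi:schemaLocation in the document, can be sketched on its own. The example below uses the standard library's xml.etree.ElementTree instead of lxml and covers only the collection of (namespace, location) pairs, not the construction of the XMLSchema that imports them. The function collect_schema_locations is hypothetical, and the example.org namespace and schema URL in the sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

XSI_SCHEMA_LOCATION = "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation"


def collect_schema_locations(xml_text):
    """Map each namespace URI to the schema URL declared for it anywhere in the document."""
    locations = {}
    for element in ET.fromstring(xml_text).iter():
        tokens = element.get(XSI_SCHEMA_LOCATION, "").split()
        # xsi:schemaLocation holds whitespace-separated (namespace, location) pairs.
        for namespace, location in zip(tokens[0::2], tokens[1::2]):
            # Keep the first location seen for a namespace.
            locations.setdefault(namespace, location)
    return locations


# Sample document; the nested "extra" namespace and its schema URL are invented.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version18/mets.xsd">
  <dmdSec ID="dmdSec_1">
    <mdWrap MDTYPE="OTHER">
      <xmlData>
        <extra:record xmlns:extra="http://example.org/ns/extra"
            xsi:schemaLocation="http://example.org/ns/extra http://example.org/schemas/extra.xsd"/>
      </xmlData>
    </mdWrap>
  </dmdSec>
</mets>"""

locations = collect_schema_locations(SAMPLE)
```

Walking the whole tree matters because wrapped metadata deep inside xmlData can declare schemata that the root element does not mention.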
-
metsrw.validate.
report_string
(report)[source]¶ Return a human-readable string representation of all of the validation errors.
-
metsrw.validate.
schematron_validate
(mets_doc, schematron='resources/archivematica_mets_schematron.xml')[source]¶ Validate a METS file using a schematron schema. Return a boolean indicating validity and a report as an
lxml.ElementTree
instance.
-
metsrw.validate.
sct_report_string
(report)[source]¶ Return a human-readable string representation of the error report returned by lxml’s schematron validator.
-
metsrw.validate.
validate
(mets_doc, xmlschema='resources/mets.xsd', schematron='resources/archivematica_mets_schematron.xml')[source]¶ Validate a METS file using both an XMLSchema (.xsd) schema and a schematron schema, the latter of which typically places additional constraints on what a METS file can look like.
-
metsrw.validate.
xsd_error_log_string
(xsd_error_log)[source]¶ Return a human-readable string representation of the error log returned by lxml’s XMLSchema validator.
Exceptions for metsrw.
All exceptions generated by this library will descend from MetsError.
-
exception
metsrw.exceptions.
ParseError
[source]¶ Bases:
metsrw.exceptions.MetsError
Error parsing a METS file.
-
exception
metsrw.exceptions.
SerializeError
[source]¶ Bases:
metsrw.exceptions.MetsError
Error serializing a METS file.
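Because every exception descends from MetsError, callers can handle all library failures with a single except clause and still distinguish the specific subclasses when needed. The sketch below shows that pattern with local stand-in classes that mirror the documented hierarchy so that it runs without metsrw. With the library installed, the classes would be imported from metsrw.exceptions instead. The functions load and broken_parse are hypothetical.

```python
class MetsError(Exception):
    """Local stand-in for metsrw.exceptions.MetsError."""


class ParseError(MetsError):
    """Local stand-in for metsrw.exceptions.ParseError."""


class SerializeError(MetsError):
    """Local stand-in for metsrw.exceptions.SerializeError."""


def load(parse):
    """Run a parsing callable, returning None on any METS-related failure."""
    try:
        return parse()
    except MetsError as err:
        # Catches ParseError and SerializeError alike via the shared base class.
        print("could not process METS document: %s" % err)
        return None


def broken_parse():
    """Hypothetical parser that always fails, to exercise the handler."""
    raise ParseError("mdWrap must have a MDTYPE")


result = load(broken_parse)
```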