Parsing XML Files in Python

  • Reading time:7 mins read

What are XML files in Python used for?

XML stands for EXtensible Markup Language. It is a text-based format for structuring and sharing data across networks, between programs and between people. Despite its similarity to HTML which is also based on SGML standards, the XML format adheres to the strict formatting rules.

 

 

Furthermore, XML is more predictable and readable making it easier to spot and resolve errors. While virtually all tags used in HTML are predefined, XML tags on the other hand are not, making it even more extensible.

 

xml files in python

Structure of XML documents

All XML documents must have a root element which is considered as the parent element of all other elements. Most XML elements also contain an optional prolog. However if the prolog element exists within an XML document, then it must be in the first line in the XML document.

The XML prolog mat is used to specify a character encoding for XML documents which are often UTF-8, versioning and other international characters.

 

XML tags and XML files in Python language

In addition,  XML tags must have their respective closing tags. An XML document is considered invalid if some or all of its elements do not have their respective closing tags. Therefore, it is illegal to omit a closing tag in an XML document. On the other hand, the Prolog tag is not considered as part of the XML document and is, therefore, an exception to this rule.

 

Code
</book>

 

Tags in XML documents are case sensitive, cases used in opening tags must match those of their counterparts, the closing tags. Tags containing mismatching cases are considered invalid and thus rendering the entire document invalid as well. Below is an illustration of valid and invalid XML tags.

 

 

XML elements can also have attributes with their corresponding attribute values, in such instances attribute values should always be kept in quotes. In the example below the book element has an attribute id ==“bk101”.

 

 

Parsing XML files in Python using the ElementTree Library

The sample XML document below contains information about books with <catalogue>  </catalogue> as the root element of the document.

There is a prolog at the beginning of the document specifying the version of the document;  <?xml version=“1.0”?>. The subsequent <book id=“bk101”> </book> elements containing the id attribute as well as child elements such as <author>, <title>, <genre>,<price>,<publish_date> and <description> with their corresponding closing tags.

 

 

ElementaryTree is a built-in Python library that we can use to load and manipulate XML files using its range of functions. Navigating through an XML file is often a simple process owing to its intuitive structure.

 

xml files in python

 

Since the ElementaryTree is already provided for in the Python standard library we simply need to import it at the top of our program. Apart from the ElementaryTree library Python also provides for us the BeautifulSoup that we can use to parse XML  files as well.

 

 

Importing the ElementaryTree as an alias is a common practice that allows us to easily call its functions without having the need to type in the entire name of the library every other time. Now to load the XML file we simply need to specify the name of the XML file within the function ET.parse() and initialize it with the tree variable as shown in the code below.

 

Parsing XML files in Python with a for loop

Using a for loop we can iterate through each of the child elements of the XML document. We can also access elements with attributes and print them out. In the code below we are using a simple for loop to print out the attribute of every book. The attribute referred to in this case is the ‘id’ attribute.

 

 

We can go further into the tree and print the sub-elements of the <book> element such as the author. Instead of printing the book attribute, we will initialize the name of the sub-element that we want to access with the values returned by the root.findall() method.

 

xml files in python

 

 

Basically, this method allows us to go deeper into the root element <book> and find the element whose name we have specified within the parentheses. In the code snippet below we are accessing <title> </title>, which is the first sub-element under the book element.

 

 

So the findall() method enables us to access the first layer, we can also access other child elements in a similar manner. In the example below we are accessing both title and price elements at the same time.

 

 

This method returns the title of every single book in all the <book> elements printed alongside their corresponding price in the terminal. We can now do anything with these variables since all the information is saved there.

Alternatively, we can also use the root.iter() method to iterate through all the elements under the <book> </book> elements. This method is more precise when accessing a single element. For instance, we can access the author element as shown below.

 

 

If you’d like to see more programming tutorials, check out our Youtube channel, where we have plenty of Python video tutorials in English.

In our Python Programming Tutorials series, you’ll find useful materials which will help you improve your programming skills and speed up the learning process.

Would you like to learn how to code, online? Come and try our first 25 lessons for free at the CodeBerry Programming School.

Learn to code and change your career!

100% ONLINE

IDEAL FOR BEGINNERS

SUPPORTIVE COMMUNITY

SELF-PACED LEARNING

Not sure if programming is for you? With CodeBerry you’ll like it.