Recently I have found that when engaging with abstract theoretical texts (e.g. Tom Bailey’s Ambiguous Sovereignty), it is often helpful to make a website out of them.
The technical barrier to entry is low: LLMs can readily generate scripts that split the text into smaller files and then create separate HTML files from them. For a journal article, the whole process may take no more than 1–2 hours, and playing around with the text lets the concepts sink in and permits me to say (with only some exaggeration) that my eyes have passed over all of the text at least once (while testing the website).
Having recently listened to a talk about the COPA v Wright main judgment, I wanted to do the same for the judgment. After all, however interesting the subject matter may be, it is almost 230 pages long, and rather forbidding even for people fond of reading long discursive texts.
Plus, how long could it take if I just feed it to some Python parser and get it over with?
More than 3 hours, as I discovered.
Initially, I thought the task would be even easier than parsing a journal article, because according to the Find Case Law website, the XML files follow an international data standard, LegalDocML. If the XML files all follow a convention, how hard could parsing them be?
But I over-estimated the degree of uniformity imposed by LegalDocML.
For example, the COPA v Wright main judgment is divided into sections, which at first seemed a good way to break down the long text.
For the first 20 sections or so, this works well, in that (using HTML as the example; the XML file is structured the same way) the sections always look like
<section id="lvl_2">
<p>
<b>SUMMARY</b>
</p>
<!-- More content -->
</section>
So for those sections, just extracting everything within <section id="lvl_${number}"> works fine.
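For those well-behaved sections, the extraction can be sketched in a few lines of Python with the standard library. The sample XML and the function name below are my own illustration, not taken from the actual judgment files:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal sample mirroring the structure described above.
SAMPLE = """<root>
<section id="lvl_2">
<p><b>SUMMARY</b></p>
<p>First paragraph of the summary.</p>
</section>
</root>"""

def extract_section(xml_text, number):
    """Return the joined text of <section id="lvl_{number}">, or None if absent."""
    root = ET.fromstring(xml_text)
    section = root.find(f".//section[@id='lvl_{number}']")
    if section is None:
        return None
    return " ".join(t.strip() for t in section.itertext() if t.strip())

print(extract_section(SAMPLE, 2))
# → SUMMARY First paragraph of the summary.
```

The `[@id='...']` predicate is part of the limited XPath subset that `ElementTree` supports, so no third-party parser is needed for this happy path.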
But problems start to emerge later. Take the example of section 109:
<section id="lvl_109">
<p>
<b>THE SECOND CHRONOLOGICAL RUN</b>
</p>
</section>
The section only contains the heading, and paragraphs lie outside the section element.
I could have written (or, more accurately, asked an LLM to write) a script to clean this up, but by that time I had to move on to other things. Another day.
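The cleanup script I had in mind might look something like this — a minimal sketch over a hypothetical sample, assuming the stray paragraphs always sit between one <section> element and the next:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: section 109 holds only its heading, and the body
# paragraphs are siblings that follow it (as in the actual judgment file).
SAMPLE = """<root>
<section id="lvl_109"><p><b>THE SECOND CHRONOLOGICAL RUN</b></p></section>
<p>Para 109.1 text.</p>
<p>Para 109.2 text.</p>
<section id="lvl_110"><p><b>NEXT HEADING</b></p></section>
<p>Para 110.1 text.</p>
</root>"""

def section_with_trailing_paras(xml_text, number):
    """Gather a heading-only section plus the sibling elements that
    follow it, stopping at the next <section>."""
    root = ET.fromstring(xml_text)
    collected, inside = [], False
    for child in root:
        if child.tag == "section":
            if inside:
                break  # reached the next section: stop collecting
            inside = child.get("id") == f"lvl_{number}"
        if inside:
            collected.append(
                " ".join(t.strip() for t in child.itertext() if t.strip())
            )
    return collected

print(section_with_trailing_paras(SAMPLE, 109))
```

This assumes a flat layout under one parent element; the real file may nest things more deeply, in which case a sibling walk at each level would be needed.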
(Alternatively, I could have just used the hyperlinked table of contents to dice the judgment up into different HTML files. As Bruce the philosopher might say, the owl of Minerva takes flight only at dusk.)
According to a quick search on the Perplexity AI engine, the LegalDocML standard does not in fact impose rigid requirements on, for example, the use of sections. This may be the reason behind the lack of uniformity.
This may also be why, to my knowledge, there is no easily searchable LegalDocML parser offering a quick and easy way of, say, extracting all the sections of a judgment.
Matters such as paragraph numbering seem to be relatively uniform, however, so it may not be too difficult to add paragraph references to my Chronology of a Judgment website.
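Assuming the paragraph numbers appear as "1.", "2.", and so on at the start of each paragraph (as they do in the printed judgment), a lookup table from paragraph number to text could be built with a simple regular expression. The sample text is my own invention:

```python
import re

# Hypothetical plain-text sample with judgment-style paragraph numbering.
TEXT = """1. COPA seeks declaratory relief.
2. Dr Wright claims to be Satoshi Nakamoto.
3. The trial lasted several weeks."""

# Map each paragraph number to its text, so that anchors such as "[2]"
# can later be hyperlinked from another page.
PARA_RE = re.compile(r"^(\d+)\.\s+(.*)", re.MULTILINE)
paragraphs = {int(n): body for n, body in PARA_RE.findall(TEXT)}

print(paragraphs[2])
# → Dr Wright claims to be Satoshi Nakamoto.
```

In the XML itself the numbers may well live in dedicated elements rather than inline text, which would make the job even easier.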