Generate Chronology: first steps

TL;DR

I deployed a simple alpha website with Python code, where:

  • the user inputs a paragraph-referenced Word (.docx) document
  • the website automatically generates a paragraph-referenced Chronology in Word for download and further editing.

These are the first lines of Python I have written / glued together with large language models (LLMs). The approach taken in the code borders on the absurd.

But the story may make good reading for those interested in how current (2024) LLMs affect people early in their programming journeys.

Background and motivation

After a degree of success(?) in converting Golang code to WebAssembly, I wanted to try my hand at a project I had first thought of 4-5 years ago, long before I had any interest in programming.

For a young lawyer, a common junior task is to create a Chronology out of a court document.

For example, suppose a Witness Statement refers to 15 dates. While the narrative is usually chronological, the precise date references might be all over the place. And there is usually more than one relevant document: e.g. pleadings, underlying documentary evidence.

This is why it is often helpful to have a Chronological index, where the main events are described in brief and the relevant page and paragraph references provided. If nothing else, this provides a quick way for someone to start to “read into the papers”. In fact a chronology is even required in at least two kinds of hearings in Hong Kong (substantive interlocutory hearings; appeal hearings).

Now, no self-respecting lawyer would want to delegate the task entirely to computers. In the first place, the documents need to be read and digested, and in many ways the process of compiling the chronology gives the (human!) compiler a chance to “read in” herself.

But why can’t there at least be an automatic process to generate the relevant page and paragraph references? Checking and doing these do not help “reading in” at all. One needs to switch to a different frame of mind when one is simply mechanically copying-and-pasting (sometimes even typing!) Arabic numerals.

And it is so painfully obvious that machines can do this faster and better. iPhones can now recognise faces: it is absurd that expensive lawyers need to do even the first draft of chronologies by hand.

Abandoned approach: VBA

When I first started looking at this around 2 years ago, I discovered the leading approach was to use Microsoft’s Visual Basic for Applications for Word: a simple tool that allows you to manipulate Word documents.

There is a small online community dedicated to this: I even bought a book by an editor who specialises in this! But the book is more than 10 years old and the community is really quite a small one (compared with the mainstream programming languages).

More importantly, even if I were to succeed in creating a VBA script, the users need to go through a very complicated installation process to get the programme working on their machines. A far cry from the simple web interfaces everyone has got used to and been spoiled with.

For around a year or so, the project lay dormant in my mind. I had decided to dive deeper into web development and the JavaScript/TypeScript world. The Word .docx format just isn’t very important there. In fact, just trying to get a website to download something in a Word format took a lot of work and research and resuscitating some 10-year-old code!

Eventual approach: local machine

But after the small project with WebAssembly, I decided to try my hand again.

Perhaps if I could just get some code working on the command line on my local machine, then by the magic of WebAssembly I could get it to work in the browser too.

So I started piecing together code that would permit me to do this.

Pandoc conversion: .docx to markdown

The first tool I reached for was pandoc. Most lawyers use Microsoft Word for their word processing. Unless I wanted to run VBA on servers (seemingly not a viable option), I needed to convert the Word (.docx) file to a target format I could more easily work on.

The target format I eventually chose was markdown. The main reason is the simplicity of the converted files.

From my time trying to wrangle text out of the COPA v Wright Main Judgment, I knew that dealing with paragraph numbers in XML format is not straightforward.

My understanding is that, even internally, Word .docx files do not save paragraph numbers as immutable digits: rather, the paragraphs form some sort of list, and the numbers displayed to users are derived from the position of each paragraph in that list.

For the purpose of creating a smooth end-user experience, this makes complete sense. Paragraph insertions and deletions can be handled very gracefully. But for the purpose of converting a Word file to another format, it is a nightmare. There is no sure way to check if the conversion is correct: if some paragraphs (or sub-paragraphs) are not correctly recognised, there can be tricky off-by-one or off-by-more errors that can be hard to detect.
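To make the point concrete, here is a minimal sketch of what a numbered paragraph can look like inside word/document.xml. The XML fragment is hand-written for illustration (it is not taken from any real court document), and the helper name list_membership is made up for this example. Note that the markup records only which list a paragraph belongs to (w:numId) and at what level (w:ilvl); the visible "1.", "2." never appears in the text. Numbering can also be applied in other ways (for example through styles), so this is one case rather than a full picture.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written fragment for illustration only (not from a real file).
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>
    <w:r><w:t>I met the Defendant on 3 May 2019.</w:t></w:r>
  </w:p>
  <w:p>
    <w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>
    <w:r><w:t>We signed the agreement a week later.</w:t></w:r>
  </w:p>
</w:body>
"""


def list_membership(xml_text):
    """Return (numId, ilvl, text) for each paragraph that belongs to a list."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for para in root.iter(f"{{{W}}}p"):
        num_pr = para.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # an ordinary, unnumbered paragraph
        num_id = num_pr.find("w:numId", NS).get(f"{{{W}}}val")
        level = num_pr.find("w:ilvl", NS).get(f"{{{W}}}val")
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        found.append((num_id, level, text))
    return found


paragraphs = list_membership(SAMPLE)
for num_id, level, text in paragraphs:
    print(num_id, level, text)
```

Both paragraphs carry the same list id and level, so a converter has to count positions itself to recover "paragraph 2".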

But when converting to markdown, what Pandoc seems to do is simply transcribe the paragraph number to digits. Just what I want for doing paragraph references.
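A conversion step along these lines can be sketched by calling the pandoc command-line tool from Python. This is an assumption about how one might wire it up, not a record of the code actually used: the function names and file names are hypothetical, and pandoc must already be installed and on the PATH for the second function to run.

```python
import subprocess


def build_pandoc_command(docx_path, md_path):
    """Build the argument list for a docx-to-markdown pandoc run."""
    return [
        "pandoc",
        "-f", "docx",       # input format
        "-t", "markdown",   # output format
        "--wrap=none",      # keep each paragraph on a single line
        "-o", md_path,
        docx_path,
    ]


def convert_docx_to_markdown(docx_path, md_path):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(docx_path, md_path), check=True)


# Hypothetical file names, shown only to display the resulting command.
command = build_pandoc_command("statement.docx", "statement.md")
print(" ".join(command))
```

Passing --wrap=none is a design choice for this sketch: with one paragraph per line, later text processing does not have to stitch wrapped lines back together.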

Ordering by date: markdown to CSV and back

This is where I decided to use an approach which must be somewhat absurd for anyone who wants to write server code from scratch.

I did not feel confident/knowledgeable enough to do all the necessary data manipulation in one go. Instead, I broke down the tasks as follows:

  • Turn the markdown file into a CSV file (something that can be opened with Microsoft Excel) with 2 columns, “Paragraph number” and “Paragraph Text” (the extract_numbered_items_to_csv(input_file, output_file) function from the utils.py file).

  • Extract all references to dates in the Paragraph Texts and create a 3-column CSV file, “Date”, “Paragraph number”, “Paragraph Text” (the extract_dates(input_file, output_file) function from the utils.py file).

  • Create a Word document from the 3-column CSV file (the create_word_document_from_csv(input_file, output_file) function from the utils.py file).
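The three steps above can be sketched as follows. The function names and signatures are the ones mentioned above, but the bodies are guesses written for illustration and are not the code in utils.py. The sketch assumes that each numbered paragraph starts a markdown line as "12. text", that dates are written like "3 May 2019", and that the third-party python-docx package is used for the Word output; all three are assumptions.

```python
import csv
import os
import re
import tempfile
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Assumption: a numbered paragraph starts a line as "12. Paragraph text".
ITEM_RE = re.compile(r"^(\d+)\.\s+(.*)$")
# Assumption: dates look like "3 May 2019" (day, full month name, year).
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTH_NAMES) + r") (\d{4})\b")


def extract_numbered_items_to_csv(input_file, output_file):
    """Step 1: markdown -> 2-column CSV (paragraph number, paragraph text)."""
    rows = []
    with open(input_file, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            match = ITEM_RE.match(line)
            if match:
                rows.append([match.group(1), match.group(2)])
            elif rows and line.startswith(" ") and line.strip():
                rows[-1][1] += " " + line.strip()  # indented continuation line
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Paragraph number", "Paragraph Text"])
        writer.writerows(rows)


def extract_dates(input_file, output_file):
    """Step 2: one row per date mentioned, sorted chronologically."""
    found = []
    with open(input_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for day, month, year in DATE_RE.findall(row["Paragraph Text"]):
                try:
                    when = date(int(year), MONTHS[month], int(day))
                except ValueError:
                    continue  # e.g. "31 February 2020" is not a real date
                found.append((when, int(row["Paragraph number"]),
                              row["Paragraph Text"]))
    found.sort(key=lambda item: item[:2])  # by date, then paragraph number
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Date", "Paragraph number", "Paragraph Text"])
        for when, number, text in found:
            writer.writerow([when.isoformat(), number, text])


def create_word_document_from_csv(input_file, output_file):
    """Step 3: CSV -> Word table (assumes python-docx is installed)."""
    from docx import Document  # imported here so steps 1-2 run without it
    document = Document()
    table = document.add_table(rows=1, cols=3)
    with open(input_file, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        for cell, heading in zip(table.rows[0].cells, next(reader)):
            cell.text = heading
        for record in reader:
            for cell, value in zip(table.add_row().cells, record):
                cell.text = value
    document.save(output_file)


# Small demonstration of steps 1 and 2 on made-up text.
with tempfile.TemporaryDirectory() as tmp:
    md_path = os.path.join(tmp, "statement.md")
    paragraphs_csv = os.path.join(tmp, "paragraphs.csv")
    dates_csv = os.path.join(tmp, "dates.csv")
    with open(md_path, "w", encoding="utf-8") as f:
        f.write("1. We signed the agreement on 10 May 2019.\n\n"
                "2. I first met the Defendant on 3 May 2019.\n")
    extract_numbered_items_to_csv(md_path, paragraphs_csv)
    extract_dates(paragraphs_csv, dates_csv)
    with open(dates_csv, newline="", encoding="utf-8") as f:
        chronology = list(csv.reader(f))

for row in chronology:
    print(row)
```

In the demonstration, paragraph 2 is listed before paragraph 1 because its date is earlier, which is the whole point of the exercise. Keeping each stage as a file-to-file function means every intermediate CSV can be opened and checked by eye, at the cost of extra disk reads and writes.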

At each stage, the workflow was:

  • ask LLMs for a first draft of the code;
  • (generally) tinker with the code and test it with an example input file, almost always using the toolset and libraries in the first draft;
  • after completing one stage, move on to the next.

After a few hours, I got everything to work on my local machine. But how to deploy it?

Deployment: server, not wasm

At first, I was hopeful of achieving a pure WebAssembly solution. This would enable absolutely everything to happen in the end user’s browser. Once loaded in the browser, the App would be able to work offline. No data would be transmitted to a server, so there would be no need to worry about privacy and data security.

But alas, this approach seems to be too cutting-edge for someone at my level. There is apparently an 11-person Y Combinator-backed company (Wasmer) that is a leading developer of WASM solutions. Their py2wasm tool seems designed to make this work.

But all sorts of questions arise. What happens to code that deals with the file system on wasm? Where do the standard input and standard output go? There are answers to these questions and good resources to dig into (esp. Katie Bell’s talks). But it really felt too far above my pay grade.

Moreover, even if I could compile the Python side of the code to wasm in the browser, I still needed to get pandoc to work. One of the best / most easily searchable options seems to be pandoc-wasm, which (unlike a previous project) supports docx to md conversion.

But as the author warns in the readme, “running Pandoc under WebAssembly using this library is fairly fragile at the moment” and indeed the markdown produced by the WebAssembly version of Pandoc is different from my local pandoc and cannot be processed by the functions above.

So the only option seems to be to deploy this on some kind of Python server. The name of Flask came up, so I spent some time getting started with that tool.

After some trial and error, it worked on my local machine.

But at that point I realised that the server kept saving the uploaded Word document (and all the intermediary files) without deleting them. This is obviously very bad for privacy and security, not to mention that it occupies entirely unnecessary space.

It was the work of a moment to add a few lines to remove files by name. But it struck me: should a web server be given OS-level control to remove files? Would this not be a major security hole open to exploitation?

In fact, would the free hosting options for Flask even allow the programme to create and delete files in the root directory?
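One way to keep uploads from piling up without deleting files by name is to do all the work inside a temporary directory from Python's standard tempfile module, which is removed automatically when the block ends. The sketch below is an illustration of that idea, not the code deployed on the site: process_upload, fake_convert and seen_dirs are hypothetical names, and fake_convert merely stands in for the real pipeline (pandoc, the CSV steps and the final Word document). It also writes to the operating system's temporary location rather than the application's own directory.

```python
import tempfile
from pathlib import Path


def process_upload(docx_bytes, convert):
    """Run `convert` on an upload inside a directory that is deleted afterwards.

    `convert(input_path, output_path)` is a hypothetical stand-in for the
    whole conversion pipeline.
    """
    with tempfile.TemporaryDirectory() as workdir:
        input_path = Path(workdir) / "upload.docx"
        output_path = Path(workdir) / "chronology.docx"
        input_path.write_bytes(docx_bytes)
        convert(input_path, output_path)
        result = output_path.read_bytes()
    # Leaving the `with` block removes workdir and every file inside it.
    return result


seen_dirs = []  # records where the demo files lived, to check they are gone


def fake_convert(input_path, output_path):
    # Placeholder conversion for the demo: copy the bytes and append a marker.
    seen_dirs.append(input_path.parent)
    output_path.write_bytes(input_path.read_bytes() + b" [chronology]")


result = process_upload(b"witness statement", fake_convert)
print(result, seen_dirs[0].exists())
```

The result is held in memory and can be sent back to the user, while nothing from the upload remains on disk afterwards. This tidies up storage but does not by itself settle the broader security question raised above.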

I considered for a moment changing the code to include some kind of a call to the database, but decided to deploy it first.

To my surprise, my deployment to Koyeb worked. Admittedly there were some minor hiccups (e.g. using pip to install pypandoc-binary to ensure that pandoc is installed) and a (fairly shallow) learning curve to deployment.

But apparently Koyeb is happy to give me control (for free!) to add and remove files on their machines.

Perhaps the reason is that, given the input file is immediately turned into a markup file by pandoc, it is not straightforward for an attacker to craft an executable and take over the whole machine.

More likely, the hackers are just not interested in my code yet: there are much bigger (and more lucrative) fish in the sea.

And why doesn’t the browser stop/display a warning when the user tries to download the file? Is it because Macros are turned off by default in .docx files, so most hacks can’t be done?

Reflections: interactivity as the killer feature

People often speak of LLMs in hyperbolic terms, as something that may approach human intelligence.

I am agnostic on this issue. To me, the real selling point of LLMs is the interactivity. Assume for the sake of argument that LLMs are no more than an augmented search engine (this is at least what I use them for).

Still, we are talking about a search engine that can apparently talk back and forth with you, and also at times point out silly typos which otherwise would have escaped your notice. The user experience is simply amazing: a dumber and more likable version of Goethe’s Mephistopheles.

Recently, I read Goldberg and Larsson’s book on Minecraft and was transported back to the world of indie gaming of the early 2000s. While it was very enjoyable to read about games such as Dwarf Fortress in the abstract, the prospect of actually spending time in front of the screen exploring the fortress doesn’t appeal to me.

LLMs offer a different experience altogether. The fact that plausible-sounding texts can be generated immediately, to a level that almost passes the Turing Test, is immensely appealing.

In a way it does not really matter that most of the suggestions are wrong or misleading, or that (for the sake of argument) one could learn something more quickly without them, say by reading and digesting long and carefully hand-crafted documentation, doing exercises as one goes along. That is just not as fun.

Many Swedish software engineers of Markus Persson’s generation went into programming because of computer games. Perhaps LLMs will be the major reason young people get into programming in the next generation.

Like the French Revolution, it may be too early to tell whether this is good or bad.