In his memoir, Edward Snowden spoke of an era when computers were still the preserve of a small minority, and random strangers would go to extraordinary lengths to help each other in online chats (something which predated the World Wide Web, so it took place outside web browsers such as Chrome or Firefox).
He saw control of one’s own computer as key to privacy and integrity, and was scathing about “cloud computing”, a term which misleadingly implies that your data is naturally and comfortably held “in the cloud”. In reality, the data lives in a database owned by some conglomerate and is subject to monitoring by both big government and big data.
I grew up in a much later technological environment where the web and cloud storage were taken for granted. In fact, it came as a surprise to me to learn that emails used to be stored on users’ own computers: web interfaces were my first experience of email.
It is perhaps unsurprising, then, that when I first wanted to learn more about digital technology, I was not in the least interested in databases. I had Excel. And my local file system. How could learning to store data be intellectually stimulating in any way?
It was only perhaps a year later that it dawned on me that all the web services I rely on (search engines, webmail, instant messaging etc.) require databases, and that once upon a time these projects started with a small database that a single person (or a small group of people) could run on their own local machine.
And once I started getting my hands dirty with different database systems, common problems emerged:
What format should I save data in? Even focusing on text, there is a multitude of formats: is it plain text? Or HTML? Or Markdown?
Suppose there is a problem in saving data. Say there is a table with 3 columns; 2 columns are filled, but the last could not be because it needed data from the internet, and the WiFi had just broken down. What should the response be? Put in a placeholder for the missing column? Or discard the incomplete row altogether?
How should I search through the data? If it is a table, what should be its “key”? It seemed natural at first to isolate an aspect of the data as key, e.g. the sub-heading of a text. But what if it repeats (e.g. there are two sub-sections both called “Summary”)?
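To make the last two questions concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, helper functions, and URLs are all hypothetical, chosen only for illustration. It shows one possible answer, not the only one: keep the incomplete row and store a NULL placeholder for the missing column, and use an automatically assigned number as the key so that repeated sub-headings do not collide.

```python
import sqlite3


def build_notes_db():
    """Create an in-memory table of text sections (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sections (
            id         INTEGER PRIMARY KEY,  -- surrogate key, unique even if headings repeat
            heading    TEXT NOT NULL,
            body       TEXT NOT NULL,
            source_url TEXT                  -- nullable: unknown if the network was down
        )
        """
    )
    return conn


def save_section(conn, heading, body, source_url=None):
    """Store a row; a missing third column becomes NULL instead of losing the row."""
    cur = conn.execute(
        "INSERT INTO sections (heading, body, source_url) VALUES (?, ?, ?)",
        (heading, body, source_url),
    )
    conn.commit()
    return cur.lastrowid


conn = build_notes_db()
first = save_section(conn, "Summary", "Overview of chapter one", "https://example.com/1")
second = save_section(conn, "Summary", "Overview of chapter two")  # lookup failed

# Both "Summary" rows coexist because the key is the id, not the heading.
summaries = conn.execute(
    "SELECT id, source_url FROM sections WHERE heading = ? ORDER BY id",
    ("Summary",),
).fetchall()

# Incomplete rows can be found later and filled in once the connection is back.
incomplete = conn.execute(
    "SELECT id FROM sections WHERE source_url IS NULL"
).fetchall()
```

The opposite choice, discarding the incomplete row, would amount to declaring the third column NOT NULL and letting the insert fail. Which is better depends on whether a partial record is still useful.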
These are problems that arise even when there is only one user, and there is no need to maintain a 24/7 online presence (as most businesses do) or handle conflicting inputs from different sources (as most messaging apps do).
But they gave me a taste of the kind of issues that, for example, Google needs to deal with when it is faced with a sea of user data (e.g. this essay), or the challenges in making sure messages between Signal users are encrypted end-to-end. It also made me understand how Bitcoin is, in some sense, just a very unusual database, catering for a specific issue that other databases don’t try to address.
Moreover, the more one uses databases, the more one realises that the kind of data that can be collected and analysed is just beyond the scale of ordinary experience.
Take a classic to-do list, and add a slight twist where everyone with access to a web page can post/delete anything they like from the to-do list, so long as they log in with (say) their Google account.
In some ways this is no different from an ordinary notice board, only on a geographically wider scale. But so much data is collected in a way that is barely registered by the user. The email address; the time of posting/deleting; perhaps even the IP address from which the login took place.
In real life, a (very) diligent guard could sit in front of the notice board and take down the necessary information. But this would (at least) cost wages for a person, and there is only so much one person can do. Online, the collection costs virtually nothing: there only needs to be a tech person to make (some) sense out of the mass of data collected.
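As a rough illustration of how little effort this takes, here is a minimal sketch, again with Python's sqlite3 module. The `activity` table, the `record_action` function, and the sample email and IP addresses are all made up for the example. A real web service would obtain the email from the login provider and the IP address from the incoming request, which is assumed here rather than shown.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE activity (
        id          INTEGER PRIMARY KEY,
        email       TEXT NOT NULL,
        action      TEXT NOT NULL,   -- 'post' or 'delete'
        item        TEXT NOT NULL,
        ip_address  TEXT,
        happened_at TEXT NOT NULL    -- UTC timestamp in ISO 8601 format
    )
    """
)


def record_action(conn, email, action, item, ip_address=None):
    """Log one post/delete together with the metadata the user barely notices."""
    happened_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO activity (email, action, item, ip_address, happened_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (email, action, item, ip_address, happened_at),
    )
    conn.commit()


# Every click on the shared to-do list leaves a row behind.
record_action(conn, "alice@example.com", "post", "Buy milk", "203.0.113.7")
record_action(conn, "bob@example.com", "delete", "Buy milk", "198.51.100.23")
record_action(conn, "alice@example.com", "post", "Call the bank", "203.0.113.7")

# One query is enough to start making sense of who does what.
per_user = conn.execute(
    "SELECT email, COUNT(*) FROM activity GROUP BY email ORDER BY email"
).fetchall()
```

The guard's whole notebook is one table, and the same query works unchanged whether it holds three rows or three million.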
It doesn’t take much imagination then to see how the system can really “scale”, and a company offering a web service can collect and process a vast amount of data with only a tiny staff.
It is one thing to read about the promises and risks of big data; quite another to see it in action in person. It is for this reason that databases are of inherent interest, and are much more than a spiced-up spreadsheet.