Process Locally, Backup Remotely

Recently, a friend expressed a degree of shock that I could pull old, even very old, items of conversation from emails, Facebook Messenger, and the like with apparent ease. "But I wrote that 17 years ago!" They were even more dismayed when I revealed that this is all just stored as plain-text files, suggesting that perhaps I was like a spy, engaging in some sort of data collection on them by way of our mutual conversations.

For my own part, I was equally shocked by their reaction. Another night of fitful sleep, where feelings of self-doubt percolate. Is this yet another example of my having some sort of alien psyche? But of course this is not the case, as keeping old emails and the like as local text files is completely normal in computer science. All my colleagues, at work and in the profession, do this.

What is the cause of this disparity between the computer scientist and the everyday computer user? Once I realised that the disparity in expected behaviour was not personal but professional, there was clarity. Essentially, the convenience of cloud technologies, and their promotion of applications through Software as a Service (SaaS), has led to some very poor computational habits among general users, habits that carry significant real-world inefficiencies.


The earliest example of a SaaS application that is convenient but inefficient is webmail. Early providers such as Hotmail and Yahoo! offered the advantage of being able to access one's email from anywhere, on any device with a web browser, and that convenience far outweighed the more efficient method of processing email with a client via POP (Post Office Protocol), because POP would delete the email from the mail server as part of the transfer.

Most people opted for various webmail implementations for convenience. POP did have its advantages: storage limited only by the local computer's capacity, email processing speed independent of the local Internet connection, and the security of having emails stored locally rather than on a remote server. But these weren't enough to outweigh the convenience, and en masse people adopted cloud solutions, which have since been easily integrated into the corporate world.

During all this time a different email protocol, IMAP (Internet Message Access Protocol), became more common. IMAP provided numerous advantages over POP. Rather than deleting emails from the server, it copied them to the client and kept them on the server (although some still saw this as a security issue). IMAP clients stay connected to the server whilst active, rather than using the short retrieve-and-disconnect connection of POP. IMAP allowed for multiple simultaneous connections, message state information, and multiple mailboxes. Effectively, IMAP provided the local processing performance of POP, but also the device and location convenience of webmail.

The processing penalty that one suffers with webmail is effectively two-fold: one's ability to use and process webmail depends on the speed of the Internet connection to the mail server and on the spare capacity of that mail server to perform tasks. For the larger providers (e.g., Gmail) only the former is really an issue. But even then, the more complex the regular expression in a search, the harder it becomes, as basic regex tools (grep, sed, awk, perl) typically aren't available. With emails stored locally as plain-text files, the complexity of the search criteria depends only on the competence of the user and the performance of the local system. That is why pulling up an archaic email with a specific regular-expression search is a relatively trivial task for computer professionals who keep their old emails stored locally, but less so for computer users who store them in a webmail application.
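As a sketch of the kind of local search this enables, standard tools suffice; the paths and message content below are purely illustrative:

```shell
# Build a tiny demonstration mbox (the path and content are illustrative).
mkdir -p /tmp/mail-demo/2008
cat > /tmp/mail-demo/2008/inbox.mbox <<'EOF'
From jane@example.org Mon Jan 14 09:00:00 2008
From: Jane Citizen <jane@example.org>
Subject: HPC conference draft

See attached.
EOF

# Which files mention a conference or workshop in the Subject header?
grep -rlE '^Subject:.*(conference|workshop)' /tmp/mail-demo/
# -> /tmp/mail-demo/2008/inbox.mbox

# Pull just the matching From: headers, case-insensitively:
grep -rhiE '^From:.*jane' /tmp/mail-demo/ | sort -u
# -> From: Jane Citizen <jane@example.org>
```

The point is not these particular expressions but that any regex the user can construct runs at local-disk speed, with no round trip to a remote server.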

Messenger Systems

Where would we be without our various messenger systems? We love the convenience of instant messaging; it's like SMS on steroids. Whether it's Facebook Messenger, WhatsApp, Instagram, Zoom, WeChat, Discord, or any of a range of similar products, one is confronted with the very same issue as with webmail, because each operates as a SaaS application. Your ability to process data will depend on the speed of your Internet connection and the spare computational capacity of the remote server. Yeah, good luck with that on your phone. Have you ever tried to scroll through a few thousand Facebook Messenger posts? It's absolutely hopeless.

But when something like an extensive Facebook message chat is downloaded as a plain HTML file, searching and scrolling become trivial and blindingly fast in comparison. Google Chrome, for example, has a pretty decent Facebook Message/Chat Downloader. WhatsApp backs up chats automatically to one's mobile device and can be set up to periodically download them to Google Drive, from which they can be moved to a local device (regular expressions on 'phones are not the greatest, yet), and Slack (with some restrictions) offers a download feature. For Discord, there are open-source chat exporters. Again, the advantages of local copies become clear: speed of processing and the capability of conducting complex searches.
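Once a chat export is local, even a crude pipeline makes it searchable. A minimal sketch, with a made-up fragment standing in for a real downloaded HTML export:

```shell
# A stand-in for a downloaded chat export; the filename, markup,
# and message text are all illustrative.
cat > /tmp/messages.html <<'EOF'
<div class="message"><span class="ts">2009-03-02</span>
<p>Remember that gig in Brunswick?</p></div>
EOF

# Crudely strip the HTML tags, then search: trivial and fast
# once the data is on a local disk.
sed -e 's/<[^>]*>//g' /tmp/messages.html | grep -i 'brunswick'
# -> Remember that gig in Brunswick?
```

A real export would be messier, but the principle holds: local plain text turns an unusable scroll into a one-line search.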

Supercomputers and the Big Reveal

What could this discussion of web apps and local processing possibly have to do with the operations of big-iron supercomputers? Quite a lot, actually, as the following example illustrates the issue at scale. Essentially, the issue involved a research group with a very large dataset that was stored interstate. The normal workflow in such a situation is to transfer the dataset to the supercomputer as needed and conduct the operations there. Finding this inconvenient, they requested mount points on each of the compute nodes so the data could stay interstate, the computation could occur on the supercomputer, and they could ignore the annoying requirement of actually transferring the data. Except physics gets in the way, and physics always wins.

The problem was that the computational task primarily involved the Burrows-Wheeler Aligner (BWA), which aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. In carrying out this activity the program makes very extensive use of disk reads and writes. Now, imagine the data being fetched from interstate, the computation being run on the supercomputer, and the results being written back to the remote disk, thousands of times per second. The distance between where the data is and where the processing is carried out is meaningful; it is much more efficient to conduct the computational processing close to where the data is. As Grace Hopper, peace be upon her, said: "Mind your nanoseconds!"
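A back-of-envelope sketch shows why; the figures here are rough assumptions (signal speed in fibre of roughly 200,000 km/s, an interstate link of roughly 700 km), not measurements of the actual system:

```shell
# Rough numbers only: signal speed in fibre ~200,000 km/s,
# assumed interstate distance ~700 km, and a round trip doubles it.
awk 'BEGIN {
    dist_km  = 700
    speed_kms = 200000
    rtt_ms = 2 * dist_km / speed_kms * 1000
    printf "Minimum interstate round trip: %.1f ms\n", rtt_ms
}'
```

That lower bound of several milliseconds per round trip, before any queuing, protocol, or congestion overhead, multiplied by thousands of I/O operations per second, is the physics that wins.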

I like to hand out c. 30cm pieces of string at the start of introductory HPC training workshops, telling researchers that they should hang on to the piece of string to remind themselves to think of where their data is and where their compute is; 30cm is roughly how far light travels in a nanosecond. And that is the basic rule: keep the data that you want processed, whether the task is as trivial as searching email or as complex as aligning nucleotide sequences, close to the processor itself, and the closer the better.

What, then, is the possible advantage of cloud-enabled services? Despite their ubiquity, when it comes to certain performance-related tasks there isn't much, although centralised services do provide many advantages in themselves. Perhaps the best processing use is treating the cloud as "somebody else's computer": a resource where one can offload data for certain computational tasks to be conducted on the remote system. Rather like how an increasing number of researchers find themselves going to HPC facilities when they discover that their datasets and computational problems are too large or too complex for their personal systems.

There is also one extremely useful advantage of cloud services from the user's perspective, and that is, of course, off-site backup. Whilst various local NAS systems are very good for serving datasets within, say, a household, especially for the provision of media, they are perhaps not the wisest choice for the ultimate level of backup. Eventually disks will fail, and with disk failure there is quite a cost in data recovery. Large cloud storage providers, such as Google, offer massive levels of redundancy, which makes data loss extremely improbable, and with tools such as Rclone, synchronisation and backup of local data to remote services can be easily automated. In a nutshell: process data close to the CPU, backup remotely.
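A minimal sketch of automating this with Rclone; the remote name `gdrive`, the paths, and the schedule are all hypothetical, and the remote would first need to be set up with `rclone config`:

```shell
# "gdrive" is a hypothetical rclone remote, configured beforehand
# with `rclone config`. First, rehearse without changing anything:
rclone sync ~/archive gdrive:archive-backup --dry-run

# Once happy with the dry run, drop --dry-run and automate it,
# e.g. via a cron entry (schedule and paths are illustrative):
#   0 2 * * * rclone sync ~/archive gdrive:archive-backup --log-file "$HOME/rclone.log"
```

Note that `sync` makes the destination match the source, deleting remote files absent locally; `rclone copy` is the gentler alternative when in doubt.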


IMO it's just automatic and natural to archive your mail. It's more effort to delete it, and not worth the bother, especially when it uses so little disk space.

`find ~/mail/ -type f -ls | awk '$10 ~ /^[1-2][0-9]{3}$/ {print $10}' | sort -u` tells me that the oldest mbox file I have in my main mail archive is from 1995. I lost some earlier stuff in the late 90s due to the failure of one of my backups; by that time, I'd stopped bothering with tape backup at home (I couldn't afford tape drives capable of backing up everything).

This 25-year mail archive uses a total of 5.3GB, some of it manually compressed with gzip (mutt can open and use a gzipped mbox with no problem), but since I switched to ZFS about 10 years ago I just rely on ZFS's built-in compression.

My mboxes are ordered by YYYY-MM or just YYYY, and sometimes I use just `grep -l` if I want to find which year/month contains the regex pattern I'm looking for. Good for a quick search.

Mostly I use grepmail (or "apt-get install grepmail" in Debian), which outputs the entire message for every match to stdout (easily redirected to a new mbox to read with mutt or something, or piped into formail for further processing).

Getting the entire matching message(s) is far more useful than just the matching LINE, which is all that plain grep gets you.

grepmail also decodes and searches all MIME attachments automatically, which should solve Tim's problem above. You can use the `-M` option to tell it to ignore non-text attachments.
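The workflow described above amounts to a one-liner; a sketch only, with illustrative paths and pattern:

```shell
# Search a year's worth of mboxes, ignoring non-text attachments (-M),
# and collect every whole matching message into a fresh mbox.
# The paths and pattern are illustrative.
grepmail -M 'tape drive' ~/mail/1998* > /tmp/matches.mbox

# Then read the results as an ordinary mailbox:
mutt -f /tmp/matches.mbox
```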

Dunno what, if anything, it does with embedded uuencoded stuff (probably nothing, which probably isn't a problem because it's probably binary data anyway)

It can search the headers of encrypted messages, but (obviously) not the message bodies.
-- Craig Sanders