What's up with all those equals signs anyway? (lars.ingebrigtsen.no)
610 points by todsacerdoti 15 hours ago
kstrauser 9 hours ago
For context, this is the Lars Ingebrigtsen who wrote the manual for Gnus[0], a common Emacs package for reading email and Usenet. It’s clever, funny, and wildly informative. Lars has probably forgotten more about email parsing than 99% of us here will ever have learned.
The manual itself says[1]:
> Often when I read the manual, I think that we should take a collection up to have Lars psycho-analysed.
0: https://www.gnu.org/software/emacs/manual/html_mono/gnus.htm...
sovande 7 hours ago
Not only the manual, but Gnus itself. I remember this guy from the university (UiO) when he started working on Gnus. He was a small celebrity among us informatics students, and we all used Emacs and Gnus, of course.
rurban 4 hours ago
Also gmane. The once popular mailing lists search site.
jd3 3 hours ago
kstrauser 5 hours ago
I'd forgotten that! Yeah, I believe Lars also wrote a huge chunk of the current Gnus. I stopped using it a while back and maybe someone else came along and rewrote it again, replacing all his code, but I don't think that's the case.
Gnus was absolutely delightful back in the day. I moved on around the time I had to start writing non-plaintext emails for work reasons. It's also handy to be using the same general email apps and systems as 99.99% of the rest of the world. I still have a soft spot in my heart for it.
PS: Also, I have no idea whatsoever why someone would downvote you for that. Weird.
ruhith 13 hours ago
The real punchline is that this is a perfect example of "just enough knowledge to be dangerous." Whoever processed these emails knew enough to know emails aren't plain text, but not enough to know that quoted-printable decoding isn't something you hand-roll with find-and-replace. It's the same class of bug as manually parsing HTML with regex: it works right up until it doesn't, and then you get congressional evidence full of mystery equals signs.
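For anyone curious what the hand-rolled version gets wrong, here's a minimal sketch (sample text made up) contrasting Python's stdlib quoted-printable decoder with the find-and-replace approach:

```python
import quopri

raw = b"Hello=2C world =3D fun=20"

# The right way: a real quoted-printable decoder.
print(quopri.decodestring(raw))  # b'Hello, world = fun '

# The find-and-replace way handles the escapes you thought of...
naive = raw.replace(b"=2C", b",").replace(b"=20", b" ")
# ...but misses the rest, leaving stray "=3D" artifacts in the text.
print(naive)  # b'Hello, world =3D fun '
```

The naive version looks fine on most messages, which is exactly why the bug survives until the one message where it doesn't.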
lvncelot 12 hours ago
> It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't
I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454
josefx 12 hours ago
I prefer the question about CPU pipelines that gets explained using a railroad switch as an example. That one does a decent job of answering the question instead of going off on a, how best to put it, mentally deranged one-page rant about regexes, with the lazy throwaway line at the end being the only thing that makes it qualify as an answer at all.
kapep 12 hours ago
MrGilbert 11 hours ago
bityard 9 hours ago
perching_aix 8 hours ago
It took me years to notice, but did you catch that the answer actually subtly misinterprets what the question is asking for?
Guy (in my reading) appears to talk about matching an entire HTML document with regex. Indeed, that is not possible due to the grammars involved. But that is not what was being asked.
What was being asked is whether the individual HTML tags can be parsed via regex. And to my understanding those are very much workable, and there's no grammar capability mismatch either.
somat 6 hours ago
tiagod 8 hours ago
bayesnet 11 hours ago
I know this is grumpy, but I've never liked this answer. It is a perfect encapsulation of the elitism in the SO community: if you're new, your questions are closed and your answers are edited and downvoted. Meanwhile, this is tolerated only because it's posted by a member with high rep and username recognition.
1718627440 11 hours ago
throwaway_61235 11 hours ago
Cthulhu_ 12 hours ago
HE COMES
umanwizard 9 hours ago
Funny how differently people can perceive things. That's my least favorite SO answer of all time, and I cringe every time I see it.
It's a very bad answer. First of all, processing HTML with regex can be perfectly acceptable depending on what you're trying to do. Yes, this doesn't include full-blown "parsing" of arbitrary HTML, but there are plenty of ways in which you might want to process or transform HTML that either don't require producing a parse tree, don't require perfect accuracy, or are operating on HTML whose structure is constrained and known in advance. Second, it doesn't even attempt to explain to OP why parsing arbitrary HTML with regex is impossible or poorly-advised.
The OP didn't want his post to be taken over by someone hamming it up with an attempt at creative writing. He wanted a useful answer. Yes, this answer is "quirky" and "whimsical" and "fun" but I read those as euphemisms for "trying to conscript unwilling victims into your personal sense of nerd-humor".
chucksmash 9 hours ago
philistine 9 hours ago
ErigmolCt 8 hours ago
And because the output still looks mostly readable, nobody questions it until years later when it's suddenly evidence in front of Congress
V__ 13 hours ago
They have top men working on it right now.
tiborsaas 14 hours ago
> We see that that’s a quite a long line. Mail servers don’t like that
Why do mail server care about how long a line is? Why don't they just let the client reading the mail worry about wrapping the lines?
direwolf20 13 hours ago
SMTP is a line-based protocol, including the part that transfers the message body.
The server needs to parse the message headers, so it can't be an opaque blob. If the client uses IMAP, the server needs to fully parse the message. The only alternative is POP3, where the client downloads all messages as blobs and you can only read your email from one location, which made sense in the year 2000 but not now when everyone has several devices.
fluoridation 11 hours ago
Hey, POP3 still makes sense. Having a local copy of your emails is useful.
direwolf20 11 hours ago
Jaxan 7 hours ago
ahoka 9 hours ago
encom 7 hours ago
layer8 13 hours ago
Mails are (or used to be) processed line-by-line, typically using fixed-length buffers. This avoids dynamic memory allocation and having to write a streaming parser. RFC 821 finally limited the line length to at most 1000 bytes.
Given a mechanism for soft line breaks, breaking already at below 80 characters would increase compatibility with older mail software and be more convenient when listing the raw email in a terminal.
This is also why MIME Base64 typically inserts line breaks after 76 characters.
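Both behaviors are easy to see with Python's stdlib (a sketch; the 200-byte line is arbitrary):

```python
import base64
import quopri

long_line = b"x" * 200 + b"\n"  # one logical line, well over the limit

# Quoted-printable inserts "=" soft line breaks to keep lines short...
qp = quopri.encodestring(long_line)
assert max(len(l) for l in qp.split(b"\n")) <= 76
assert b"=\n" in qp  # the soft break, removed again on decode

# ...and decoding removes them, restoring the original exactly.
assert quopri.decodestring(qp) == long_line

# MIME base64 likewise wraps its output at 76 characters per line.
b64 = base64.encodebytes(b"x" * 200)
assert max(len(l) for l in b64.split(b"\n")) <= 76
```

The soft break is purely a transport artifact: a conforming decoder deletes it, which is why mishandling it produces the mystery `=` signs.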
SoftTalker 11 hours ago
In early days, many/most people also read their email on terminals (or printers) with 80-column lines, so breaking lines at 72-ish was considered good email etiquette (to allow for later quoting prefix ">" without exceeding 80 characters).
bjourne 8 hours ago
GMoromisato 8 hours ago
I don't think kids today realize how little memory we had when SMTP was designed.
For example, the PDP-11 (early 1970s), which was shared among dozens of concurrent users, had 512 kilobytes of RAM. The VAX-11 (late 1970s) might have as much as 2 megabytes.
Programmers were literally counting bytes to write programs.
NetMageSCW 4 hours ago
I assure you we were not, at least it wasn’t really necessary. Virtual Memory is a powerful drug.
liveoneggs 12 hours ago
This is how email work(ed) over SMTP. When each command was sent, it would get a 2xx-class response (success) or a 4xx/5xx-class response (failure). Sound familiar?
telnet smtp.mailserver.com 25
HELO
MAIL FROM: [email protected]
RCPT TO: [email protected]
DATA
blah blah blah
how's it going?
talk to you later!
.
QUIT
1718627440 11 hours ago
For anyone who wants to try this against a modern server:
openssl s_client -connect smtp.mailserver.com:smtps -crlf
220 smtp.mailserver.com ESMTP Postfix (Debian/GNU)
EHLO example.com
250-smtp.mailserver.com
250-PIPELINING
250-SIZE 10240000
250-VRFY
250-ETRN
250-AUTH PLAIN LOGIN
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-DSN
250-SMTPUTF8
250 CHUNKING
MAIL FROM:[email protected]
250 2.1.0 Ok
RCPT TO:postmaster
250 2.1.5 Ok
DATA
354 End data with <CR><LF>.<CR><LF>
Hi
.
250 2.0.0 Ok: queued as BADA579CCB
QUIT
221 2.0.0 Bye
Telemakhos 12 hours ago
This brings back some fun memories from the 1990s when this was exactly how we would send prank emails.
kstrauser 9 hours ago
fix4fun 10 hours ago
xg15 11 hours ago
I like how SMTP was at least honest in calling it the "receipt to" address and not the "sender" address.
Edit: wrong.
1718627440 11 hours ago
jcynix 10 hours ago
"BITNET was a co-operative university computer network in the United States founded in 1981 by Ira Fuchs at the City University of New York (CUNY) and Greydon Freeman at Yale University."
https://en.wikipedia.org/wiki/BITNET
BITNET connected mainframes, had gateways to the Unix world and was still active in the 90s. And limited line lengths … some may remember SYSIN DD DATA … oh my goodness …
https://www.ibm.com/docs/en/zos/2.1.0?topic=execution-systsi...
citrin_ru 13 hours ago
Back in the 80s-90s it was common to use static buffers to simplify implementation: you allocate a fixed-size buffer and reject a message if it has a line longer than the buffer size. The SMTP RFC specifies a 1000-character limit (including \r\n), but it's common to wrap around 78 characters so it is easy to examine the source (on a small screen).
thephyber 14 hours ago
The simplest reason: Mail servers have long had features which will send the mail client a substring of the text content without transferring the entire thing. Like the GMail inbox view, before you open any one message.
I suspect this is relevant because Quoted Printable was only a useful encoding for MIME types like text and HTML (the human readable email body), not binary (eg. Attachments, images, videos). Mail servers (if they want) can effectively treat the binary types as an opaque blob, while the text types can be read for more efficient transfer of message listings to the client.
Pinus 14 hours ago
As far as I can remember, most mail servers were fairly sane about that sort of thing, even back in the 90’s when this stuff was introduced. However, there were always these more or less motivated fears about some server somewhere running on some ancient IBM hardware using EBCDIC encoding and truncating everything to 72 characters because its model of the world was based on punched cards. So standards were written to handle all those bizarre systems. And I am sure that there is someone on HN who actually used one of those servers...
jcynix 10 hours ago
EBCDIC wasn't the problem, this was (part of) the problem:
https://www.ibm.com/docs/en/zos/2.1.0?topic=execution-systsi...
And BITNET …
kstrauser 9 hours ago
tiborsaas 13 hours ago
Thanks, I really expected a tale from the 70's, but did not see punch cards coming :)
jibal 13 hours ago
josefx 13 hours ago
RFC822 explicitly says it is for readability on systems with simple display software. Given that the protocol is from 1982 and systems back then had between 4 and 16kb RAM in total it might have made sense to give the lower end thin client systems of the day something preprocessed.
sumtechguy 12 hours ago
Also, it is an easy way to stop a denial-of-service attack. If you let an infinite amount into that field, I can remotely overflow your system memory. The mail system can just error out and hang up on the person trying the attack instead of crashing.
fluoridation 11 hours ago
badc0ffee 7 hours ago
You could expect a lot more (512kB, 1MB, 2MB) in an internet-connected machine running Unix or VMS.
codingdave 13 hours ago
Keep in mind that in ye olden days, email was not a worldwide communication method. It was more typical for it to be an internal-only mail system, running on whatever legacy mainframe your org had, and working within whatever constraints that forced. So in the 90s when the internet began to expand, and email to external organizations became a bigger thing, you were just as concerned with compatibility with all those legacy terminal-based mail programs, which led to different choices when engineering the systems.
liveoneggs 13 hours ago
This is incorrect
kstrauser 9 hours ago
heikkilevanto 13 hours ago
I thought the article would be about the various meanings of operators like = == === .=. <== ==> <<== ==>> (==) => =~=
direwolf20 13 hours ago
What is this, a Haskell for ants?
dkga 13 hours ago
It has to be at least… three times bigger than this
fix4fun 10 hours ago
My first association was the brainf..k (*.bf) programming language
ErigmolCt 8 hours ago
This ended up being way more interesting
thedanbob 13 hours ago
I wrote my own email archiving software. The hardest part was dealing with all the weird edge cases in my 20+ year collection of .eml files. For being so simple conceptually, email is surprisingly complicated.
jandrese 5 hours ago
Email is one of those cursed standards where the committee wasn't building a protocol from scratch, but rather trying to build a universal standard by gluing together all of the independently developed existing systems in some way that might allow them to interoperate. Verifying that a string a user has typed is a valid email address is close to impossible short of just throwing up your hands and allowing anything with a @ somewhere in it.
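A pragmatic version of that "anything with an @" check might look like this (a sketch, not RFC 5322 validation; it will reject some exotic-but-legal addresses such as quoted local parts, and the only real test is delivering a message):

```python
import re

def plausible_email(addr: str) -> bool:
    """Pragmatic sanity check, not RFC 5322 validation: exactly one
    "@" with non-empty, whitespace-free text on both sides."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+", addr) is not None

assert plausible_email("user@example.com")
assert not plausible_email("no-at-sign")
assert not plausible_email("two@@ats")
```

Anything stricter tends to reject real addresses; anything looser accepts strings no mail server could route.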
btown an hour ago
Email is one of the very few success cases of the xkcd Standards meme: https://xkcd.com/927/ - and it's due to practicality and ingenuity on the part of people who made very creative parsers and placed real-world understanding behind every word of the early RFCs.
Without a unified email standard, the world would look incredibly different today, especially as it bootstrapped open communication between different countries and institutions in developing every protocol since.
btmiller an hour ago
RFC 2822, for the curious :)
stevekemp 10 hours ago
I wrote a console-based mail client, which was 25% C++ and 75% Lua for defining the UI and the processing.
It never got too popular, but I had users for a few years and I can honestly say MIME was the bane of my life for most of those years.
thedanbob 7 hours ago
Indeed. A big chunk of my email parser deals with missing or incorrect content headers. Most of the rest attempts to sensibly interpret the infinite combinations of parts found in multipart (and single-part!) emails.
TazeTSchnitzel 10 hours ago
The most interesting thing to me wasn't the equals signs, which I knew are from quoted-printable, but the fact that when an equals sign appears, a letter that should have been preceding or following it is missing. It's as if an off-by-one error has occurred, where instead of getting rid of the equals sign, it's gotten rid of part of the actual text. Perhaps the CRLF/LF thing is part of it.
btown 9 hours ago
The article goes into exactly why this happens!
ErigmolCt 8 hours ago
That's exactly how you end up with mystery missing characters in something that's supposed to be evidence
xg15 13 hours ago
I'm just wondering why this problem shows up now. Why do lots of people suddenly post their old emails with a defective QP decoder?
> For some reason or other, people have been posting a lot of excerpts from old emails on Twitter over the last few days.
On the risk of having missed the latest meme or social media drama, but does anyone know what this "some reason or other" is?
Edit: Question answered.
SCdF 13 hours ago
Presumably the Epstein files, but I'm not on twitter so not sure
grimgrin 11 hours ago
you presume cor=rect
xg15 11 hours ago
xg15 12 hours ago
Ooh, that reason. Sorry for having been dense. Thanks!
avemg 11 hours ago
Jeff Epstein? The New York financier?
ropp 13 hours ago
the DOJ published another bunch of Epstein emails
beejiu 14 hours ago
> So what’s happened here? Well, whoever collected these emails first converted from CRLF (i.e., “Windows” line ending coding) to “NL” (i.e., “Unix” line ending coding). This is pretty normal if you want to deal with email. But you then have one byte fewer:
I think there is a second possible conclusion, which is that the transformation happened historically. Everyone assumes these emails are an exact dump from Gmail, but isn't it possible that Epstein was syncing emails from Gmail to a third party mail server?
Since the Stackoverflow post details the exact situation in 2011, I think we should be open to the idea that we're seeing data collected from a secondary mail server, not Gmail directly.
Do we have anything to discount this?
(If I'm not mistaken, I think you can also see the "=" issue simply by applying the Quoted-Printable encoding twice, not just by mishandling the line-endings, which also makes me think two mail servers. It also explains why the "=" symbol is retained.)
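Whichever pipeline did it, the character-eating failure mode itself is easy to reproduce (a Python sketch with made-up sample text, mirroring the CRLF-to-LF explanation):

```python
import re
import quopri

# A QP-encoded body with a soft line break: "=" followed by CRLF.
body = b"This line is long enough that the encoder wrapp=\r\ned it."

# A real decoder removes the soft break and restores the text.
assert quopri.decodestring(body) == (
    b"This line is long enough that the encoder wrapped it."
)

# Now simulate the mishap: convert CRLF to LF first, then "decode"
# by deleting "=" plus the next two bytes, as if every "=" began a
# hex escape. The "=\ne" sequence is eaten whole, taking the "e".
unixified = body.replace(b"\r\n", b"\n")
mangled = re.sub(rb"=..", b"", unixified, flags=re.DOTALL)
assert b"wrappd it." in mangled  # a letter silently vanished
```

That's the signature seen in the released emails: wherever a soft break fell, a letter next to the `=` is missing.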
TazeTSchnitzel 10 hours ago
In one of the email PDFs I saw an XML plist with some metadata that looked like it was from Apple's Mail.app, so these might be extracted from whatever internal format that uses.
topspin 3 hours ago
What happened here is what always happens with all printed and digital material that goes through some evidentiary process.
The shot-callers demand the material, which is a task fobbed off onto some nobody intern who doesn't matter (deliberately, because the lawyers and career LEOs don't want any "officer of the court" or other "party" to put eyes on things they might need to deny knowing about later.) They use only the most primitive, mechanical method possible, with little to no discretion. The collected mass of mangled junk is then shipped to whoever, either in boxes or on CD-ROM/DVD (yes, still) or something. Then, the reverse process is done, equally badly, again by low-level staff, also with zero discretion and little to no technical knowledge or ability, for exactly the same reasons, to get the material into some form suitable for filing or whatever.
Through all of this, the subtle details of data formats and encodings are utterly lost, and the legal archive fills with mangled garbage like raw quoted-printable emails. The parties involved have other priorities, such as minimizing the number of people involved in the process, and tight control over the number of copies created. Their instinct is not to bring in a bunch of clever folk that might make the work product come out better, because "better" for them is different than "better" for Twitter or Facebook. Also, these disclosures are inevitably and invariably challenged by time: the obligation to provide one thing or another is fought to the last possible minute, and when the word does finally go out there is next to no time to piddle around with details.
In the Epstein case, the disclosures were done years ago, the original source material (computers, accounts, file systems, etc.) have all long since been (deliberately) destroyed, and what the feds have is the shrapnel we see today.
flomo 6 hours ago
When they process these emails, it's fairly common to import everything into a MS Outlook PST file (using whatever buggy tool). That's probably why these look like Outlook printouts even though its Yahoo mail or etc.
ErigmolCt 8 hours ago
Yeah, I wouldn't bet on this being a single bad Gmail export; it smells much more like the accumulated scars of multiple mail systems doing "helpful" things to the same messages over time
MoltenMan 14 hours ago
This seems like the most likely reason to me!
maartin0 12 hours ago
Fun how the archive.today article near the top has this exact issue
cachius 3 hours ago
I'd like a good .eml viewer that undoes the quoted-printable transformation for the contained plain and HTML text. Useful for mails downloaded from Outlook.
jojomodding 15 hours ago
https://web.archive.org/web/20260203094902/https://lars.inge...
Did the site get the HN kiss of death?
JKCalhoun 11 hours ago
(The title of the blog reminded me the late Bob Pease [1] who had the signature, "What's all this XXX stuff, anyhow?" [2] where XXX might be "noise gain", "capacitor leakage"…)
lordnacho 15 hours ago
I love how HN always floats up the answers to questions that were in my mind, without occupying my mind.
I, too, was reading about the new Epstein files, wondering what text artifact was causing things to look like that.
AlphaAndOmega0 15 hours ago
Same here. I did notice what I think was an actual error on someone's part, there was a chart in the files comparing black to white IQ distributions, and well, just look at it:
https://nitter.net/AFpost/status/2017415163763429779?s=201
Something clearly went wrong in the process.
fredley 14 hours ago
Me too. I first assumed it was an OCR error, then remembered they were emails and wouldn't need to go through OCR. Then I thought that the US Government is exactly the kind of place to print out millions of emails only to scan them back in again.
I'm glad to know the real reason!
rireads 6 hours ago
I just want to add that I would expect the exact same thing from the German government. Glad to see we're not all that different
quibono 15 hours ago
CLRF vs LF strikes again. Partly at least.
I wonder why even have a max line length limit in the first place? I.e. is this for a technical reason or just display related?
brk 12 hours ago
Wait, now we have to deal with Carriage Line Return Feeds too?
I wonder if the person who had the idea of virtualizing the typewriter carriage knew how much trouble they would cause over time.
keybored 11 hours ago
Yeah, and using two bytes for a single line termination (or separation or whatever)? Why make things more complicated and take more space at the same time?
floren 11 hours ago
OJFord 14 hours ago
I haven't seen them other than in the submission - but if the length matches up it may be that they were processed from raw email, the RFC defines a length to wrap at.
Edit: yes I think that's most likely what it is (and it's SHOULD 78ch; MUST 998ch) - I was forgetting that it also specifies the CRLF usage, it's not (necessarily) related to Windows at all here as described in TFA.
Here it is in my 'notmuch-more' email lib: https://github.com/OJFord/amail/blob/8904c91de6dfb5cba2b279f...
FabHK 14 hours ago
> it's not (necessarily) related to Windows at all here as described in TFA.
The article doesn't claim that it's Windows related. The article is very clear in explaining that the spec requires =CRLF (3 characters), then mentions (in passing) that CRLF is the typical line ending on Windows, then speculates that someone replaced the two characters CRLF with a one character new line, as on Unix or other OSs.
OJFord 14 hours ago
dgan 14 hours ago
I am just wondering how it is a good idea for a server to insert some characters into a user's input. If a colleague were to propose this, I'd laugh in his face.
It's just so hacky, I can't believe it's a real-life solution.
jagged-chisel 14 hours ago
“Insert characters”?
Consider converting the original text (maintaining the author’s original line wrapping and indentation) to base64. Has anything been “inserted” into the text? I would suggest not. It has been encoded.
Now consider an encoding that leaves most of the text readable, translates some things based on a line length limit, and some other things based on transport limitations (e.g. passing through 7-bit systems.) As long as one follows the correct decoding rules, the original will remain intact - nothing “inserted.” The problem is someone just knowledgeable enough to be aware that email is human readable but not aware of the proper decoding has attempted to “clean up” the email for sharing.
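A round trip makes the encoding-vs-insertion distinction concrete (Python stdlib sketch; the sample text is arbitrary):

```python
import base64
import quopri

original = b"Lines,\n  indentation,\n\tand tabs survive intact."

# Base64 makes the text unreadable in transit, but decoding
# restores it byte for byte -- nothing was "inserted".
assert base64.b64decode(base64.b64encode(original)) == original

# Quoted-printable leaves most of the text readable and is just as
# lossless, *if* the decoder follows the rules.
assert quopri.decodestring(quopri.encodestring(original)) == original
```

The damage in the released emails happened precisely because the decoding half of the round trip was skipped or hand-rolled.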
dgan 14 hours ago
flexagoon 14 hours ago
When you post a comment on HN, the server inserts HTML tags into your input. Isn't that essentially the same thing?
dgan 14 hours ago
direwolf20 13 hours ago
It's called escaping, and almost every protocol has it. HN must convert the & symbol to &amp; for displaying in HTML. Many wire protocols like SATA or Ethernet must insert a 1 after a certain number of consecutive 0s to maintain electrical balance. Don't remember which ones, so don't quote me that it's SATA and Ethernet.
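For example, with Python's stdlib HTML helpers (a small sketch):

```python
import html

comment = 'Use "&" and <b> carefully'

escaped = html.escape(comment)
# The markup-significant characters are encoded for display...
assert escaped == 'Use &quot;&amp;&quot; and &lt;b&gt; carefully'
# ...and decoding restores the author's text unchanged.
assert html.unescape(escaped) == comment
```

Like quoted-printable, it's lossless only when the matching decode step actually happens.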
zoho_seni 9 hours ago
layer8 13 hours ago
Just wait until you learn what mess UTF-8 will turn your characters into. ;)
ErigmolCt 8 hours ago
What's funny is that the failure mode here is so quietly destructive
voxelghost 13 hours ago
My main takeaway from this article, is that I want to know what happened to the modified pigs with non-cloven hoofs
anthk 2 hours ago
Dear GNUs: rewrite the fetching core so it's performant enough not to crawl under 10,000 headers from either Usenet or email. Not even native compilation is fast enough.
Hanzklatil369 5 hours ago
27b09b80f93cecf1-000000001b5e2c7f-0000000069825928
lucb1e 13 hours ago
cat title | sed 's/anyway/in email/'
would save a click for those already familiar with =20 etc.
noduerme 14 hours ago
Great. Can't wait for equal signs to be the next (((whatever this is))). Maybe it's a secret code. j/k
On a side note: There are actually products marketed as kosher bacon (it's usually beef or turkey). And secular Jews frequently make jokes like this about our kosher bros who aren't allowed to eat the real stuff for some dumb reason like it has too many toes.
rsynnott 9 hours ago
But... pigs _do_ have cloven hooves! The issue is that they're not ruminant.
That said, there is a _possibly_ kosher pig: https://en.wikipedia.org/wiki/Babirusa#Relationship_with_hum...
LAC-Tech 6 hours ago
> Great. Can't wait for equal signs to be the next (((whatever this is))). Maybe it's a secret code. j/k
Yeah clearly you guys are the biggest victims in all this... get in there and make it about you!
MarginalGainz 12 hours ago
It’s a fascinating case of 'Abstraction Leak'.
We’ve become so accustomed to modern libraries handling encoding transparently that when raw data surfaces (like in these dumps), we often lack the 'Digital Archeology' skills to recognize basic Quoted-Printable.
These artifacts (=20, =3D) are effectively fossils of the transport layer. It’s a stark reminder that underneath our modern AI/React/JSON world, the internet is still largely held together by 7-bit ASCII constraints and protocols from the 1980s.
seydor 15 hours ago
TLDR "=\r\n" was converted to "=\n"
netsharc 15 hours ago
Author seems to think Unix uses a character called "NL" instead of "LF"...
debugnik 14 hours ago
Unicode labels U+000A as all of "LINE FEED (LF)", "new line (NL)" and "end of line (EOL)". I'm guessing different names were imported from slightly different character sets, although I understand the all-uppercase name to be the main/official one.
netsharc 7 hours ago
matsemann 14 hours ago
NL, or New Line, is a character in some character sets, like old mainframe computers. No need to be snarky just because he mistyped or uses a different name for something.
db_admin 14 hours ago
I am more surprised by the description of “rock döts”. A Norwegian certainly knows that ASCII is not enough for all our alphabetical needs.
topaz0 12 hours ago
thaumasiotes 14 hours ago
No, the article is quite explicit that that isn't what happened.
brador 14 hours ago
Could be worsened by inaccurate optical character recognition in some cases.
Back in those days optical scanners were still used.
zabzonk 13 hours ago
People posting Excel formulae?
ccppurcell 14 hours ago
Rock dots? You mean diacritics? Yeah someone invented them: the ancient Greeks, idiöt.
RHSeeger 14 hours ago
It's not the character, its the way / context in which it's used
ccppurcell 12 hours ago
I know what he was referring to. But the use case is obviously languages other than English, not the Motörhead fan club newsletter.
topaz0 12 hours ago
Some combination of people misunderstood some other people's joke, not totally clear which and which.
chr 13 hours ago
Yeah, that dude oughta read books and learn about computers, too.
gerikson 12 hours ago
And live in a country where they use these in their alphabets.