Hacker News

by Ryan Harman

Log level 'error' should mean that something needs to be fixed (utcc.utoronto.ca)

366 points by todsacerdoti 4 days ago

layer8 14 hours ago

> When implementing logging, it's important to distinguish between an error from the perspective of an individual operation and an error from the perspective of the overall program or system. Individual operations may well experience errors that are not error level log events for the overall program. You could say that an operation error is anything that prevents an operation from completing successfully, while a program level error is something that prevents the program as a whole from working right.

This is a nontrivial problem when using properly modularized code and libraries that perform logging. They can’t tell whether their operational error is also a program-level error, which can depend on usage context, but they still want to log the operational error themselves, in order to provide the details that aren’t accessible to higher-level code. This lower-level logging has to choose some status.

Should only “top-level” code ever log an error? That can make it difficult to identify the low-level root causes of a top-level failure. It also can hamper modularization, because it means you can’t repackage one program’s high-level code as a library for use by other programs, without somehow factoring out the logging code again.

Too 13 hours ago

This is why it’s almost always wrong for library functions to log anything, even on “errors”. Pass the status up through return values or exceptions. As a library author you have no clue as to how an application might use it. Multithreading, retry loops and expected failures will turn what’s a significant event in one context into what’s not even worthy of a debug log in another. No rule without exceptions of course, one valid case could be for example truly slow operations where progress reports are expected. Modern tracing telemetry with sampling can be another solution for the paranoid.

cogman10 12 hours ago

Depending on the language and logging framework, debug/trace logging can be acceptable in a library. But you have to be extra careful to make sure that it's ultimately a no-op.

A common problem in Java is someone will drop a log that looks something like this `log.trace("Doing " + foo + " to " + bar);`

The problem is, especially in a hot loop, that throwaway string concatenation can ultimately be a performance problem. Especially if `foo` or `bar` have particularly expensive `toString` functions.

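The proper way to do something like this in Java is either

    // Parameterized message (SLF4J / Log4j 2 style): only formatted if TRACE is enabled.
    log.trace("Doing {} to {}", foo, bar);

or

    // Explicit guard around the concatenation.
    if (log.isTraceEnabled()) {
      log.trace("Doing " + foo + " to " + bar);
    }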

usefulcat 9 hours ago

rr808 4 hours ago

TZubiri 12 hours ago

MobiusHorizons 12 hours ago

What you are proposing sounds like a nightmare to debug. The high level perspective of the operation is of course valuable for determining if an investigation is necessary, but the low level perspective in the library code is almost always where the relevant details are hiding. Not logging these details means you are in the dark about anything your abstractions are hiding from higher level code (which is usually a lot)

cwillu 11 hours ago

Too 11 hours ago

TZubiri 12 hours ago

esrauch 12 hours ago

I think an example where libraries could sensibly log error is if you have a condition which is recoverable but may cause a significant slowdown, including a potential DoS issue, and the application owner can remediate.

You don't want to throw because destroying someone's production isn't worth it. You don't want to silently continue in that state because realistically there's no way for the application owner to understand what is happening and why.

TZubiri 11 hours ago

ivan_gammel 12 hours ago

cyphar 7 hours ago

On paper, USDT probes are the best way for libraries (and binaries) to provide information for debugging because they can be used programmatically and have no performance overhead until they are measured but unfortunately they are not widely used.

Etherlord87 12 hours ago

This seems like such an obvious answer to the problem, your program isn't truly modularized if logging is global. If an error is unexpected it should bubble all the way up, but if it's expected and dealt with, the error message should be suppressed or its type changed to a warning.

dpark 10 hours ago

pca006132 8 hours ago

Wonder if someone used effect handlers for error logging. Sounds like a natural and modular way of handling this problem.

paulddraper 6 hours ago

It may be unwise to log errors at low layers, but logging informational and debug messages is useful (at least, when the caller enables them).

echelon 13 hours ago

You need a tuple: (context, level)

The application owner should be able to adjust the contexts up or down. This is the point of ownership and where responsibility over which logs matter is handled.

A library author might have ideas and provide useful suggestions, but it's ultimately the application owner who decides. Some libraries have huge blast radius and their `error` might be your `error` too. In other contexts, it could just be a warning. Library authors should make a reasonable guess about who their customer is and try to provide semantic, granular, and controllable failure behavior.

As an example, Rust's logging ecosystem provides nice facilities for fine-grained tamping down of errors by crate (library) or module name. Other languages and logging libraries let you do this as well.

That capability just isn't adopted everywhere.

Izkata 12 hours ago

renewiltord 11 hours ago

Conflicting goals for the predominant libraries is what causes this. Log4J2 has a rewrite appender that solves the problem. But if you want zero-copy etc I don’t think there’s such a solution.

HendrikHensen 11 hours ago

> Should only “top-level” code ever log an error? That can make it difficult to identify the low-level root causes of a top-level failure.

Some languages (e.g. Java) include a stack trace when reporting an error, which is extremely useful when logging the error. It shows at exactly which point in the code the error was generated, and what the full call stack was to get there.

It's a real shame that "modern" languages or "low level" languages (e.g. Go, Rust) don't include this out of the box, it makes troubleshooting errors in production much more difficult, for exactly the reason you mention.

StellarScience 9 hours ago

C++ with Boost has let you grab a stacktrace anywhere in the application for years. But in April 2024 Boost 1.85 added a big new feature: stacktrace from arbitrary exception ( https://www.boost.org/releases/1.85.0/ ), which shows the call stack at the time of the throw. We added it to our codebase, and suddenly errors where exceptions were thrown became orders of magnitude easier to debug.

C++23 added std::stacktrace, but until it includes stacktrace from arbitrary exception, we're sticking with Boost.

dolmen 5 hours ago

The idiomatic practice in Go for libraries is to wrap returned errors and errors can be unwrapped with stdlib tooling. This is more useful to handle errors at runtime than digging into a stack trace.

layer8 10 hours ago

The point in the code is not the same information as knowing the time, or knowing the order with respect to operations performed during stack unwinding. Stacktraces are very useful, but they don’t replace lower-level logging.

ivan_gammel 14 hours ago

Libraries should not log on levels above DEBUG, period. If there’s something worthy for reporting on higher levels, pass this information to client code, either as an event, or as an exception or error code.

layer8 13 hours ago

From a code modularization point of view, there shouldn’t really be much of a difference between programs and libraries. A program is just a library with a different calling convention. I like to structure programs such that their actual functionality could be reused as a library in another program.

This is difficult to reconcile with libraries only logging on a debug level.

schrodinger 13 hours ago

ivan_gammel 13 hours ago

lanstin 8 hours ago

I have a logging level I call "log lots" where it will log the first time with probability 1, but as the same line is hit more often, it will log with lower and lower probability, bottoming out around 1/20000 times. Sort of a "log with probability proportional to the unlikeliness of the event". So if I get e.g. sporadic failures to some back end, I will see them all, but if it goes down hard I will see it is still down but also be able to read other log msgs.
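
A rough sketch of how such a decaying sampler could look in Java. The class name `DecayingSampler`, the 1/n decay curve, and the per-call-site key are all assumptions for illustration; the comment above only specifies "probability 1 the first time" and a floor around 1/20000.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.DoubleSupplier;

// Hypothetical helper: decides whether the n-th hit of a call site gets logged.
class DecayingSampler {
    // Lowest probability the sampler ever drops to (the "1/20000" floor).
    static final double FLOOR = 1.0 / 20000;

    private final Map<String, AtomicLong> hits = new ConcurrentHashMap<>();
    private final DoubleSupplier random; // must return values in [0, 1)

    DecayingSampler(DoubleSupplier random) {
        this.random = random;
    }

    DecayingSampler() {
        this(() -> ThreadLocalRandom.current().nextDouble());
    }

    // Probability of logging the n-th occurrence (n starts at 1).
    // Assumed decay: 1/n, clamped to FLOOR.
    static double probability(long n) {
        return Math.max(1.0 / n, FLOOR);
    }

    // Call once per event; true means "emit the log line this time".
    boolean shouldLog(String callSite) {
        long n = hits.computeIfAbsent(callSite, k -> new AtomicLong()).incrementAndGet();
        return random.getAsDouble() < probability(n);
    }
}
```

A caller would wrap the noisy line as `if (sampler.shouldLog("backend-timeout")) log.warn(...)`. Passing the random source in makes the decision testable; whether the counters should ever reset (say, per hour) is left open here.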

1718627440 13 hours ago

Why? What's wrong with logging it and passing the log object to the caller? The caller can still modify the log entry however it pleases.

ivan_gammel 13 hours ago

kelnos 12 hours ago

Eh, as with anything there are always exceptions. I generally agree with WARN and ERROR, though I can imagine a few situations where it might be appropriate for a library to log at those levels. Especially for a warning, like a library might emit "WARN Foo not available; falling back to Bar" on initialization, or something like that. And I think a library is fine logging at INFO (and DEBUG) as much as it wants.

Ultimately, though, it's important to be using a featureful logging framework (all the better if there's a "standard" one for your language or framework), so the end user can enable/disable different levels for different modules (including for your library).

ivan_gammel 12 hours ago

hinkley 13 hours ago

Log4j has had the ability to filter log levels by subject matter for twenty years. In Java you end up having to use that a lot for this reason.
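
For illustration, here is the same idea with the JDK's built-in `java.util.logging` rather than Log4j (a substitution made only to keep the sketch dependency-free). The logger names `com.example.noisylib` and `com.example.app` are made up.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class PerLibraryLevels {
    // Keep strong references: java.util.logging may garbage-collect
    // loggers that nothing else references, losing their configured level.
    static final Logger NOISY_LIB = Logger.getLogger("com.example.noisylib");
    static final Logger NOISY_LIB_INTERNAL = Logger.getLogger("com.example.noisylib.internal");
    static final Logger APP = Logger.getLogger("com.example.app");

    static void configure() {
        APP.setLevel(Level.INFO);
        // Only SEVERE gets through for the whole library subtree;
        // child loggers without their own level inherit this one.
        NOISY_LIB.setLevel(Level.SEVERE);
    }
}
```

After `configure()`, a `WARNING` from anywhere under `com.example.noisylib` is dropped while the application's own `INFO` lines still pass. In practice this usually lives in a configuration file, not in code.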

PartiallyTyped 13 hours ago

Logging in rust also does that, you can set logging levels for individual modules deep within your dependency tree.

TZubiri 11 hours ago

Oh that library that gives you a write() wrapper in exchange for RCE vulns

ivan_gammel 11 hours ago

peacebeard 5 hours ago

I've been thinking about this all day. I think the best approach is probably twofold:

1) Thrown errors should track the original error to retain its context. In JavaScript errors have a `cause` option which is perfect for this. You can use the `cause` to hold a deep stack trace even if the error has been handled and wrapped in a different error type that may have a different semantics in the application.

2) For logging that does not stop program execution, I think this is a great case for dependency injection. If a library allows its consumer to provide a logger, the application has complete control over how and when the library logs, and can even change it at runtime. If you have a disagreement with a library, for example it logs errors that you want to treat as warnings, your injected logger can handle that.
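
A minimal sketch of point 2 in Java, assuming a made-up `LibLogger` interface and a made-up `CacheClient` library class; neither belongs to any real library. The application wraps its own logger so that the library's `ERROR` arrives as `WARN`.

```java
// Hypothetical minimal logging interface that the library accepts from its consumer.
interface LibLogger {
    void log(String level, String message);
}

// Hypothetical library class: it only ever logs through the injected logger.
class CacheClient {
    private final LibLogger logger;

    CacheClient(LibLogger logger) {
        this.logger = logger;
    }

    String get(String key) {
        // The library considers this an error...
        logger.log("ERROR", "primary cache unreachable for key " + key + ", using fallback");
        return "fallback-value";
    }
}

class LevelRemapping {
    // ...but the application disagrees, so it wraps its logger to downgrade ERROR to WARN.
    static LibLogger errorsAsWarnings(LibLogger delegate) {
        return (level, message) ->
                delegate.log("ERROR".equals(level) ? "WARN" : level, message);
    }
}
```

The application would construct the client as `new CacheClient(LevelRemapping.errorsAsWarnings(appLogger))`. Because the wrapper is an ordinary object, it can also be swapped at runtime.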

0x696C6961 13 hours ago

Libraries should not log, instead they should allow registering hooks which get called with errors and debug info.

kelnos 12 hours ago

I think this is useful for libraries in a language like C, where there is no standardized logging framework, so there's no way for the application to control what the library logs. But in a language (Java, Rust, etc.) where there are standard, widely-used logging frameworks that give people fine-grained control over what gets logged, libraries should just use those frameworks.

(Even in C, though... errors should be surfaced as return values from functions causing the error, not just logged somewhere. Debug info, sure, have a registerable callback for that.)

dolmen 5 hours ago

ivan_gammel 13 hours ago

They can log if platform permits, i.e. when you can set TRACE and DEBUG to no-op, but of course it should be done reasonably. Having hooks is often an overkill compared to this.

esrauch 12 hours ago

It doesn't seem to work this way in practice, not least because most libraries will be transitive deps of the application owner.

I think creating the hooks is very close to just not doing anything here, if no one is going to use the hooks anyway then you might as well not have them.

Blackthorn 10 hours ago

Libraries should log in a way that is convenient to the developer rather than a way that is ideologically consistent. Oftentimes, that means logging as we know it.

bytefish an hour ago

Making software is 20% actual development and 80% is maintenance. Your code and your libraries need to be easy to debug, and this means logs, logs, logs, logs and logs. The more the better. It makes your life easy in the long run.

So the library you are using fires too many debug messages? You know, that you can always turn it off by ignoring specific sources, like ignoring namespaces? So what exactly do you lose? Right. Almost nothing.

As for my code and libraries I always tend to do both, log the error and then throw an exception. So I am on the safe side both ways. If the consumer doesn’t log the exception, then at least my code does it. And I give them the chance to do logging their way and ignore mine. I am doing a best-guess for you… thinking to myself, what’s an error when I’d use the library myself.

You don’t trust me? Log it the way you need to log it, my exception is going to transport all relevant data to you.

This has saved me so many times, when getting bug reports by developers and customers alike.

There are duplicate error logs? Simply turn my logging off and use your own. Problem solved.

If it is a program level error, maybe a warning and returning the error is the correct way to do. Maybe it’s not? It depends on the context.

And this basically is the answer to any software design question: It depends.

eterm 15 hours ago

How I'd personally like to treat them:

  - Critical / Fatal:  Unrecoverable without human intervention, someone needs to get out of bed, now.
  - Error : Recoverable without human intervention, but not without data / state loss. Must be fixed asap. An assumption didn't hold.
  - Warning: Recoverable without intervention. Must have an issue created and prioritised. ( If business as usual, this could be downgrading to INFO. )

The main difference therefore between error and warning is, "We didn't think this could happen" vs "We thought this might happen".

So for example, a failure to parse JSON might be an error if you're responsible for generating that serialisation, but might be a warning if you're not.

arwhatever 13 hours ago

I like to think of “warning” as something to alert on statistically, e.g. incorrect password attempt rate jumps from 0.4% of login attempts to 99%.

lanstin 8 hours ago

This point is important - the value of a log is inextricably tied to its unlikelihood. Which depends on so many things in the context.

masswerk 13 hours ago

Also, warnings for ambiguous results.

For example, when a process implies a conversion according to the contract/convention, but we know that this conversion may not be the expected result and the input may be based on semantic misconceptions. E.g., assemblers and contextually truncated values for operands: while there's no issue with the grammar or syntax or intrinsic semantics, a higher level misconception may be involved (e.g., regarding address modes), resulting in a correct but still non-functional output. So, "In this individual case, there may or may not be an issue. Please, check. (Not resolvable on our end.)"

(Disclaimer: I know that this is very much classic computing and that this is now mostly moved to the global TOS, but still, it's the classic example for a warning.)

RaftPeople 14 hours ago

> The main difference therefore between error and warning is, "We didn't think this could happen" vs "We thought this might happen".

What about conditions like "we absolutely knew this would happen regularly, but it's something that prevents the completion of the entire process which is absolutely critical to the organization"

The notion of an "error" is very context dependent. We usually use it to mean "can not proceed with action that is required for the successful completion of this task"

wizzwizz4 13 hours ago

Those conditions would be "Critical", no? The error vs warning distinction doesn't apply.

fhcuvyxu 10 hours ago

mewpmewp2 14 hours ago

What if you are integrated to a third party app and it gives you 5xx once? What do you log it as, and let's say after a retry it is fine.

kiicia 14 hours ago

As always „it depends”

- info - when this was expected and system/process is prepared for that (like automatic retry, fallback to local copy, offline mode, event driven with persistent queue etc)
- warning - when system/process was able to continue but in degraded manner, maybe leaving decision to retry to user or other part of system, or maybe just relying on someone checking logs for unexpected events, this of course depends if that external system is required for some action or in some way optional
- error - when system/process is not able to continue and particular action has been stopped immediately, this includes situation where retry mechanism is not implemented for step required for completion of particular action
- fatal - you need to restart something, either manually or by external watchdog, you don’t expect this kind of logs for simple 5xx

bqmjjx0kac 14 hours ago

I would log a warning when an attempt fails, and an error when the final attempt fails.
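
As a sketch, that policy could look like this in Java. `RetryLogging.withRetries` is a hypothetical helper, and the log is collected into a plain list so the example has no logging dependency.

```java
import java.util.List;
import java.util.function.Supplier;

class RetryLogging {
    // Runs op up to maxAttempts times. Each failed attempt is recorded as WARN,
    // except the last one, which is recorded as ERROR and rethrown.
    static <T> T withRetries(Supplier<T> op, int maxAttempts, List<String> log) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                boolean last = attempt == maxAttempts;
                log.add((last ? "ERROR" : "WARN") + " attempt " + attempt + "/" + maxAttempts
                        + " failed: " + e.getMessage());
                if (last) {
                    throw e;
                }
            }
        }
        // Only reachable when maxAttempts < 1.
        throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
}
```

A call that fails twice and then succeeds leaves two `WARN` lines and no `ERROR`; a call that never succeeds ends with a single `ERROR` for the final attempt.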

mewpmewp2 14 hours ago

marcosdumay 14 hours ago

Well, the GP's criteria are quite good. But what you should actually do depends on a lot more things than the ones you wrote in your comment. It could be so irrelevant to only deserve a trace log, or so important to get a warning.

Also, you should have event logs you can look to make administrative decisions. That information surely fits into those, you will want to know about it when deciding to switch to another provider or renegotiate something.

cpburns2009 14 hours ago

It really depends on the third party service.

For service A, a 500 error may be common and you just need to try again, and a descriptive 400 error indicates the original request was actually handled. In these cases I'd log as a warning.

For service B, a 500 error may indicate the whole API is down, in which case I'd log a warning and not try any more requests for 5 minutes.

For service C, a 500 error may be an anomaly, in which case I'd treat it as a hard error and log it as an error.

srdjanr an hour ago

eterm 14 hours ago

This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning.

Because what I'd want to know is how often does it fail, which is a metric not a log.

So expose <third party api failure rate> as a metric not a log.

If feeding logs into datadog or similar is the only way you're collecting metrics, then you aren't treating your observability with the respect it deserves. Put in real counters so you're not just reacting to what catches your eye in the logs.
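
A bare-bones sketch of such a counter in Java. `ThirdPartyCallMetrics` is a made-up in-process class; a real setup would export these numbers through whatever metrics library is already in use.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-process metric for calls to one third-party API.
class ThirdPartyCallMetrics {
    private final LongAdder total = new LongAdder();
    private final LongAdder failed = new LongAdder();

    // Record the outcome of every call, including ones that later succeed on retry.
    void record(boolean success) {
        total.increment();
        if (!success) {
            failed.increment();
        }
    }

    // Fraction of recorded calls that failed, or 0.0 before any call was recorded.
    double failureRate() {
        long n = total.sum();
        return n == 0 ? 0.0 : (double) failed.sum() / n;
    }
}
```

An alert can then fire on `failureRate()` crossing a threshold, without a log line per failed attempt.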

If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).

By implementing a retry you planned for that third party to be down, so it's just business as usual if it succeeds on retry.

mewpmewp2 14 hours ago

hk__2 14 hours ago

hamandcheese 12 hours ago

sysguest 14 hours ago

hmm maybe we need extra representation?

eg: 2.0 for "trace" / 1.0 for "debug" / 0.0 for "info" / -1.0 for "warn" / -2.0 for "error that can be handled"

wredcoll 14 hours ago

I said this elsewhere, but the point here is what the humans involved are supposed to do with this info. Do I literally get out of bed on an error log or do I grep for them once or twice a month?

ivan_gammel 13 hours ago

mfuzzey 16 hours ago

I think it's difficult to say without knowing how the system is deployed and administered. "If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system"

Maybe or maybe not. If the connection problem is really due to the remote host then that's not the problem of the sender. But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...

If you know the deployment scenario then you can make reasonable decisions on logging levels but quite often code is generic and can be deployed in multiple configurations so that's hard to do

greatgib 15 hours ago

The point is that if your program itself takes note of the error from the library, that is OK. You, as the program owner, can decide what to do with it (error log or not).

But if you are the SMTP library and you unilaterally log that as an error, that is an issue.

dminuoso 13 hours ago

This would require a complete new ecosystem and likely new language where any degradation of code flow becomes communicatable in a standardized and fully documented fashion.

The closest we have is something like Java with exceptions in type signatures, but we would have to ban any kind of exception capture except from final programs, and promote basically any logger call into an exception that you could remotely suppress.

We could philosophize about a world with compilers made out of unobtanium - but in this reality a library author cannot know what conditions are fixable or necessitate a fix or not. And structured logging has way too many deficiencies to make it work from that angle.

zamadatix 15 hours ago

The counterpoint made above is while what you describe is indeed the way the author likes to see it that doesn't explain why "an error is something which failed that the program was unable to fix automatically" is supposed to be any less valid a way to see it. I.e. should error be defined as "the program was unable to complete the task you told it to do" or only "things which could have worked but you need to explicitly change something locally".

I don't even know how to say whether these definitions are right or wrong, it's just whatever you feel like it should be. The important thing is what your program logs should be documented somewhere, the next most important thing is that your log levels are self consistent and follow some sort of logic, and that I would have done it exactly the same is not really important.

At the end of the day, this is just bikeshedding about how to collapse ultra specific alerting levels into a few generic ones. E.g. RFC 5424 defines 8 separate log levels for syslog and, while that's not a ceiling by any means, it's easy to see how there's already not really going to be a universally agreed way to collapse even just these down to 4 categories.

hinkley 13 hours ago

solatic 11 hours ago

> But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...

That's exactly why you log it as a warning. People get warned all the time about the dangers of smoking. It's important that people be warned about smoking; these warnings save lives. People should pay attention to warnings, which let them know about worrisome concerns that should be heeded. But guess what? Everyone has a story about someone who smoked until they were 90 and died in a car accident. It is not an error that somebody is smoking. Other systems will make their own bloody decisions and firewalling you off might be one of them. That is normal.

What do you think a warning means?

colechristensen 15 hours ago

How about this:

- An error is an event that someone should act on. Not necessarily you. But if it's not an event that ever needs the attention of a person then the severity is less than an error.

Examples: Invalid credentials. HTTP 404 - Not Found, HTTP 403 Forbidden, (all of the HTTP 400s, by definition)

It's not my problem as a site owner if one of my users entered the wrong URL or typed their password wrong, but it's somebody's problem.

A warning is something that A) a person would likely want to know and B) wouldn't necessarily need to act on

INFO is for something a person would likely want to know and unlikely needs action

DEBUG is for something likely to be helpful

TRACE is for just about anything that happens

EMERG/CRIT are for significant errors of immediate impact

PANIC the sky is falling, I hope you have good running shoes

adrianmonk 12 hours ago

> An error is an event that someone should act on. Not necessarily you.

Personally, I'd further qualify that. It should be logged as an error if the person who reads the logs would be responsible for fixing it.

Suppose you run a photo gallery web site. If a user uploads a corrupt JPEG, and the server detects that it's corrupt and rejects it, then someone needs to do something, but from the point of view of the person who runs the web site, the web site behaved correctly. It can't control whether people's JPEGs are corrupt. So this shouldn't be categorized as an error in the server logs.

But if you let users upload a batch of JPEG files (say a ZIP file full of them), you might produce a log file for the user to view. And in that log file, it's appropriate to categorize it as an error.

Arrowmaster 9 hours ago

colechristensen 11 hours ago

DanHulton 14 hours ago

If you're logging and reporting on ERRORs for 400s, then your error triage log is going to be full of things like a user entering a password with insufficient complexity or trying to sign up with an email address that already exists in your system.

Some of these things can be ameliorated with well-behaved UI code, but a lot cannot, and if your primary product is the API, then you're just going to have scads of ERRORs to triage where there's literally nothing you can do.

I'd argue that anything that starts with a 4 is an INFO, and if you really wanted to be thorough, you could set up an alert on the frequency of these errors to help you identify if there's a broad problem.

lanstin 8 hours ago

colechristensen 14 hours ago

jayofdoom 15 hours ago

In OpenStack, we explicitly document what our log levels mean; I think this is valuable from both an Operator and Developer perspective. If you're a new developer, without a sense of what log levels are for, it's very prescriptive and helpful. For an operator, it sets expectations.

https://docs.openstack.org/oslo.log/latest/user/guidelines.h...

FWIW, "ERROR: An error has occurred and an administrator should research the event." (vs WARNING: Indicates that there might be a systemic issue; potential predictive failure notice.)

quectophoton 15 hours ago

Thank you, this (and jillesvangurp's comment) sounds way more reasonable than the article's suggestion.

Say I have a daily cron job that is copying files to a remote location (e.g. backups), and the _operation_ fails because, for some reason, the destination is not writable.

Your suggestion would get me _both_ alerts, as I want; the article's suggestion would not alert me about the operation failing because, after all, it's not something happening in the local system, the local program is well configured, and it's "working as expected" because it needs neither code nor configuration fixing.

turbobrew 13 hours ago

Agreed, I don’t get the OP's delineation between local and non-local error sources. If your code has a job to do it doesn’t matter if the error was local or non-local, the operator needs to know that the code is not doing its job. In the case of something like being unable to back up files to a remote, you can try to contact the humans who own the remote or come up with an alternative backup mechanism.

Xss3 9 hours ago

Some programs are error resistant and need an additional level: Fatal.

A warning can be ignored safely. Warnings may be 'debugging enabled, results cannot be certified' or something similar.

An error should not be ignored, an operation is failing, data loss may be occurring, etc.

Some users may be okay with that data loss or failing operation. Maybe it isn't important to them. If the program continues and does not error in the parts that matter to the user, then they can ignore it, but it is still objectively an error occurring.

A fatal message cannot be ignored, the system has crashed. It's the last thing you see before shutdown is attempted.

rwmj 15 hours ago

And the second rule is make all your error messages actionable. By that I mean it should tell me what action to take to fix the error (even if that action means hard work, tell me what I have to do).

chongli 15 hours ago

Suppose I'm writing an http server and the error is caused by a flaky power supply causing the disk to lose power when the server attempts to read a file that's been requested. How is the http server supposed to diagnose this or any other hardware fault? Furthermore, why should it even be the http server's responsibility to know about hardware issues at all?

uniq7 14 hours ago

The error doesn't need to be extremely specific or point to the actual root cause.

In your example,

- "Error while serving file" would be a bad error message,

- "Failed to read file 'foo/bar.html'" would be acceptable, and

- "Failed to read file 'foo/bar.html' due to EIO: Underlying device error (disk failure, I/O bus error). Please check the disk integrity." would be perfect (assuming the http server has access to the underlying error produced by the read operation).
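
A small Java sketch of building the third kind of message. `FileServing.describeFailure` is a hypothetical helper, and the hint texts are illustrative guesses, not diagnoses.

```java
import java.io.IOException;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

class FileServing {
    // Builds a message that names the file, the underlying exception,
    // and a suggested next step for whoever reads the log.
    static String describeFailure(Path path, IOException e) {
        String hint = (e instanceof NoSuchFileException)
                ? "Check that the file exists and the document root is configured correctly."
                : "Check permissions and the health of the underlying disk.";
        return "Failed to read file '" + path + "' due to "
                + e.getClass().getSimpleName() + ": " + e.getMessage() + ". " + hint;
    }
}
```

The server would call this from the `catch (IOException e)` block around the read, so the path and the original exception are never dropped on the way to the log.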

Copenjin 10 hours ago

Some of these replies make me wonder if you have ever written any code at all, nonsensical example.

andoando 15 hours ago

Error: Possible race condition, rewrite codebase

morkalork 15 hours ago

I have written out-of-band sanity checks that have caught race conditions, the recommendation is more like "<Thing> that should be locked, isn't. Check what was merged and deployed in the last 24h, someone ducked it up"

1123581321 15 hours ago

Can you please explain this? That sounds like identifying bugs but not fixing them but I realize you don’t mean that. One hopes the context information in the error will make it actionable when it occurs, never completely successfully, of course.

rwmj 14 hours ago

Here's an example of a bug that I filed about non-actionable error messages: https://github.com/karmab/kcli/issues/456

The first error message was "No usable public key found, which is required for the deployment" which doesn't tell me what I have to do to correct the problem. Nothing about even where it's looking for keys, what is supposed to create the key or how I am supposed to create the key.

There are other examples and discussion of what they should say in the link.

Edit: Here's another one that I filed: https://github.com/containers/podman/issues/20775

1123581321 13 hours ago

Copenjin 10 hours ago

You can hope that the person reading the context will always be able to understand it like you would have. Bad assumption in my experience.

1123581321 10 hours ago

pixl97 14 hours ago

So what error do you put if the server is over 500 miles away?

https://web.mit.edu/jemorris/humor/500-miles

Or you can't connect because of a path MTU error.

Or because the TTL is set too low?

Your software at the server level has no idea what's going wrong at the network level, all you can send is some kind of network problem message.

lanstin 8 hours ago

Also put the fucking data in the message that led to the decision to emit the logs. I can't remember how many times I have had a three part test trigger a log "blah: called with illegal parameters, shouldn't happen" and the illegal parameters were not logged.

throw3e98 15 hours ago

Maybe that makes sense for a single-machine application where you also control the hardware. But for a networked/distributed system, or software that runs on the user's hardware, the action might involve a decision tree, and a log line is a poor way to convey that. We use instrumentation, alerting and runbooks for that instead, with the runbooks linking into a hyperlinked set of articles.

My 3D printer will try to walk you through basic fixes with pictures on the device's LCD panel, but for some errors it will display a QR code to their wiki which goes into a technical troubleshooting guide with complex instructions and tutorial videos.

magicalhippo 15 hours ago

This can be difficult or just not possible.

What is possible is to include as much information as you can about what the system was trying to do. If there's a file IO error, include the full path name. Saying "file not found" without saying which file was not found infuriates me like few other things.

If some required configuration option is not defined, include the name of the configuration option and from where it tried to find said configuration (config files, environment, registry etc). And include the detailed error message from the underlying system if any.

Regular users won't have a clue how to deal with most errors anyway, but by including details at least someone with some system knowledge has a chance of figuring out how to fix or work around the issue.
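As a rough illustration of the two cases above (file errors and missing configuration), here is a Python sketch. The function `read_setting`, the key=value file format and the logger name are assumptions made for the example, not anything from a real application:

```python
import logging
import os

logger = logging.getLogger("app.config")  # hypothetical logger name


def read_setting(name, config_path):
    """Look up a setting in the environment, then in a key=value file."""
    if name in os.environ:
        return os.environ[name]
    full_path = os.path.abspath(config_path)
    try:
        with open(full_path, encoding="utf-8") as fh:
            for line in fh:
                key, sep, value = line.partition("=")
                if sep and key.strip() == name:
                    return value.strip()
    except OSError as exc:
        # Name the file and pass along the operating system's own message.
        logger.error("cannot read config file %s: %s", full_path, exc)
        return None
    # Name the option and every place that was searched.
    logger.error(
        "required option %r not set; looked in environment variable %s "
        "and file %s",
        name, name, full_path,
    )
    return None
```

Logging the absolute path rather than the path as given saves the reader from guessing what the working directory was at the time.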

Copenjin 10 hours ago

Exactly. Some applications keep running way after you have long gone. If there is useful information to provide give it.

hyperadvanced 15 hours ago

This is just plain wrong, I vehemently disagree. What happens if a payment fails on my API, and today that means I need to go through a 20-step process with this pay provider, my database, etc. to correct that. But what’s worse is if this error happens 11,000 times and I run a script to do my 20 step process 11,000 times, but it turns out the error was raised in error. Additionally, because the error was so explicit about how to fix it, I didn’t talk to anyone. And of course, the suggested fix was out of date because docs lag vs. production software. Now I have 11,000 pissed off customers because I was trying to be helpful.

teo_zero 14 hours ago

This doesn't resonate with my experience. I draw the line between a warning and an error at whether the operation can or can't be completed.

A connection timed out, retrying in 30 secs? That's a warning. Gave up connecting after 5 failed attempts? Now that's an error.

I don't care so much if the origin of the error is within the program, or the system, or the network. If I can't get what I'm asking for, it can't be a mere warning.
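That rule could be sketched in Python roughly as follows. The helper `connect_with_retries` and its parameters are hypothetical; `connect` is any callable that raises `OSError` (which includes timeouts and refused connections) on failure:

```python
import logging
import time

logger = logging.getLogger("client")  # hypothetical logger name


def connect_with_retries(connect, attempts=5, delay=30.0, sleep=time.sleep):
    """Call connect() up to `attempts` times; return its result or None."""
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as exc:
            if attempt == attempts:
                # The operation cannot be completed: this is the error.
                logger.error(
                    "gave up connecting after %d failed attempts: %s",
                    attempts, exc,
                )
                return None
            # Still retrying, so the outcome is not decided yet.
            logger.warning(
                "connection failed (%s), retrying in %g secs", exc, delay
            )
            sleep(delay)
```

The `sleep` parameter is only there so the delay can be replaced in tests; a failing run with the defaults logs four warnings followed by one error.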

AndroTux 14 hours ago

“cannot contact port 25 on <remote host>” may very well be a configuration error. How should the program know?

notatoad 14 hours ago

>How should the program know?

if we're talking about logs from our own applications that we have written, the program should know because we can write it in a way that it knows.

user-defined config should be verified before it is used. make a ping to port 25 to see if it works before you start using that config for actual operation. if it fails the verification step, that's not an error that needs to be logged.
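A minimal sketch of such a verification step, assuming a plain TCP connection attempt is an acceptable stand-in for "make a ping to port 25" (the function name `check_smtp_endpoint` is made up for the example, and a successful connect only shows the port is reachable, not that the server speaks SMTP):

```python
import socket


def check_smtp_endpoint(host, port=25, timeout=5.0):
    """Return (True, "") if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        return False, f"cannot connect to {host}:{port}: {exc}"
```

The caller would run this when the config is submitted and hand the reason back to whoever entered it, instead of writing an error-level log entry.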

tcpkump 13 hours ago

What about when the mail server endpoint has changed, and for whatever reason, this configuration wasn’t updated? This is a common scenario when dealing with legacy infrastructure in my experience.

notatoad 12 hours ago

1718627440 13 hours ago

So when the random error on a remote party happens at one time your system ignores it, but when it happens at another time, it prevents the server from booting? That's a very brittle system.

notatoad 12 hours ago

HankB99 14 hours ago

Would it make sense to consider anything that prevents a process from completing its intended function an error? It seems like this message would fall into that category and, as you pointed out, could result from a local fault as well.

kijin 14 hours ago

SMTP clients are designed to try again with exponential backoff. If the final attempt fails and your email gets bounced, now that's an error. Until then, it's just a delay, business as usual.

Insanity 3 hours ago

Coincidentally was reviewing code yesterday that had a confusing/contradictory statement..

  error_msg = "xyz went wrong"
  log.warn(error_msg)

My comment on the CR was about this being an inherent contradiction and incredibly confusing to know if it's actually an error or a warning..

jillesvangurp 16 hours ago

Errors mean I get alerted. Zero tolerance on that from my side.

aunty_helen 7 hours ago

Good logging is critical, and so is actually having the logs turned on in production. No point writing logs if you silence them.

My company now has a log aggregator that scans the logs for errors, when it finds one, creates a Trello card, uses opus to fix the issue and then propose a PR against the card. These then get reviewed, finished if tweaks are necessary and merged if appropriate.

hedayet 12 hours ago

I agree with the principle: log level error should mean someone needs to fix something.

This post frames the problem almost entirely from a sysadmin-as-log-consumer perspective, and concludes that a correctly functioning system shouldn’t emit error logs at all. That only holds if sysadmins are the only "someone" who can act.

In practice, if there is a human who needs to take action - whether that’s a developer fixing a bug, an infra issue, or coordinating with an external dependency - then it’s an error. The solution isn’t to downgrade severity, but to route and notify the right owner.

Severity should encode actionability, not just system correctness.

jedberg 11 hours ago

I feel like it's more nuanced than OP writes. Presumably every log line comes from something like a try/catch. An edge case was identified, and the code did something differently.

Did it do what it was supposed to do, but in a different way or defer for retrying later? Then WARN.

Did it fail to do what it needed to do? ERROR

Did it do what it needed to do in the normal way because it was totally recoverable? INFO

Did data get destroyed in the process? FATAL

It should be about what the result was, not who will fix it or how. Because that might change over time.
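One way to write that down, as a sketch in Python: the `Outcome` enum and `log_outcome` helper are invented for illustration, and since Python's `logging` module has no separate FATAL level (`logging.FATAL` is an alias for `logging.CRITICAL`), CRITICAL stands in for it here.

```python
import logging
from enum import Enum


class Outcome(Enum):
    DONE_NORMALLY = "recovered and completed in the normal way"
    DONE_DIFFERENTLY = "completed another way, or deferred for retry"
    FAILED = "did not do what it needed to do"
    DATA_DESTROYED = "data was destroyed in the process"


# The level depends only on the result, not on who will fix it or how.
LEVEL_FOR_OUTCOME = {
    Outcome.DONE_NORMALLY: logging.INFO,
    Outcome.DONE_DIFFERENTLY: logging.WARNING,
    Outcome.FAILED: logging.ERROR,
    Outcome.DATA_DESTROYED: logging.CRITICAL,
}


def log_outcome(logger, outcome, message):
    """Log `message` at the level that corresponds to `outcome`."""
    logger.log(LEVEL_FOR_OUTCOME[outcome], "%s (%s)", message, outcome.value)
```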

umpalumpaaa 10 hours ago

What I like about Objective-C’s error handling approach is that a method that can fail is able to tell whether the caller intends to handle errors. If the passed *error is NULL, you know there is no way for the caller to properly handle the error. My implementations usually have this logic:

    if error == NULL and operationFailed:
        log the error
    otherwise:
        let the client side do the error handling (in terms of logging)

georgefrowny 13 hours ago

Easy to say, but there's "yes we know this is wrong but this will have to do for now" and "we don't expect to see this in real life unless something has gone sideways".

oofbey 13 hours ago

At scale the rare events start to happen reliably. Hardware failures almost certainly cause ERROR conditions. Network glitches.

Our production system pages oncall for any errors. At night it will only wake somebody up for a whole bunch of errors. This discipline forces us to take a look at every ERROR and decide if it is spurious and out of our control or something we can deal with. At some point our production system will reach a scale where there are errors logged constantly and this strategy doesn't make sense any more. But for now it helps keep our system clean.

aqme28 14 hours ago

I agree with this take in a steady state, but the process of building software is just that-- it's a process.

So it's natural for error messages to be expected, as you progressively add and then clear up edge cases.

raldi 14 hours ago

Exactly: When you're building software, it has lots of defects (and, thus, error logging). When it's mature, it should have few defects, and thus few error logs, and each one that remains is a bug that should be fixed.

plorkyeran 12 hours ago

I don't understand why you seem to think you're disagreeing with the article? If you're producing a lot of error logs because you have bugs that you need to fix then you aren't violating the rule that an error log should mean that something needs to be fixed.

raldi 11 hours ago

jmull 10 hours ago

I encourage people to think a few moments about what to log and at what level.

You’re kind of telling a story to future potential trouble-shooters.

When you don’t think about it at all (it doesn’t take much), you tend to log too much and too little and at the wrong level.

But this article isn’t right either. Lower-level components typically don’t have the context to know whether a particular fault requires action or not. And since systems are complex, with many levels of abstractions and boxes things live in, actually not much is in a position to know this, even to a standard of “probably”.

HarHarVeryFunny 15 hours ago

I agree with the sentiment, although not sure if "error" is the right category/verbiage for actionable logs.

In an ideal world things like logs and alarms (alerting product support staff) should certainly cleanly separate things that are just informative, useful for the developer, and things that require some human intervention.

If you don't do this then it's like "the boy who cried wolf", and people will learn to ignore errors and alarms since you've trained them to understand that usually no action is needed. It's also useful to be able to grep through log files and distinguish failures of different categories, not just grep for specific failures.

rsanek 11 hours ago

If something needs to be fixed, why is it just a log? How is someone supposed to even notice a random error log? At the places that I've worked, trying to make alerting be triggered on only logs was always quite brittle, it's just not best practice. Throw an exception / exit the program if it's something that actually needs fixing!

Copenjin 10 hours ago

> If something needs to be fixed, why is it just a log?

What he meant is that it is an unexpected condition, one that should never have happened but did, so it needs to be fixed.

> How is someone supposed to even notice a random error log?

Logs should be monitored.

> At the places that I've worked, trying to make alerting be triggered on only logs was always quite brittle, it's just not best practice.

Because the logs sucked. It's not common practice, but it should be best practice.

> Throw an exception / exit the program if it's something that actually needs fixing!

I understand the sentiment, but some programs cannot/should not exit. Or you have an error in a subsystem that should not bring down everything.

I completely agree with the approach of the author, but also understand that good logging discipline is rare. I worked in many places where logs sucked, they just dumped stuff, and had to restructure them.

lanstin 6 hours ago

While it is fun to have your code run for 500 days without restart, it is a bad architecture. You should be able to move load around from host to host or network to network without losing any work. This involves graceful draining and then shutting down the old.

For impossible errors exiting and sending the dev team as much info as possible (thread dump, memory dump, etc) is helpful.

In my experience logs are good for finding out what is wrong once you know something is wrong. Also if the server is written to have enough but not too much logging you can read them over and get a feel for normal operation.

raldi 16 hours ago

Yes. Examples of non-defects that should not be in the ERROR loglevel:

* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)

* ISE in downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)

* Network error

* Downstream service overloaded

* Invalid request

Basically, when you make a request to another service and get back a status code, your handler should look like:

    logfunc = logger.error if 400 <= status <= 499 and status != 429 else logger.warning

(Unless you have an SLO with the service about how often you’re allowed to hit it and they only send 429 when you’re over, which is how it’s supposed to work but sadly rare.)

Hizonner 16 hours ago

> Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)

So people writing software are supposed to guess how your organization assigns responsibilities internally? And you're sure that the database timeout always happens because there's something wrong with the database, and never because something is wrong on your end?

raldi 16 hours ago

No; I’m not understanding your point about guessing. Could you restate?

As for queries that time out, that should definitely be a metric, but not pollute the error loglevel, especially if it’s something that happens at some noisy rate all the time.

electroly 15 hours ago

makeitdouble 15 hours ago

Hizonner 13 hours ago

zbentley 16 hours ago

I wish I lived in a world where that worked. Instead, I live in a world where most downstream service issues (including database failures, network routing misconfigurations, giant cloud provider downtime, and more ordinary internal service downtime) are observed in the error logs of consuming services long before they’re detected by the owners of the downstream service … if they ever are.

My rough guess is that 75% of incidents on internal services were only reported by service consumers (humans posting in channels) across everywhere I’ve worked. Of the remaining 25% that were detected by monitoring, the vast majority were detected long after consumers started seeing errors.

All the RCAs and “add more monitoring” sprints in the world can’t add accountability equivalent to “customers start calling you/having tantrums on Twitter within 30sec of a GSO”, in other words.

The corollary is “internal databases/backend services can be more technically important to the proper functioning of your business, but frontends/edge APIs/consumers of those backend services are more observably important by other people. As a result, edge services’ users often provide more valuable telemetry than backend monitoring.”

raldi 16 hours ago

But everything you’re describing can be done with metrics and alerts; there’s no need to spam the ERROR loglevel.

zbentley 15 hours ago

jonathrg 16 hours ago

4xx is for invalid requests. You wouldn't log a 404 as an error

raldi 16 hours ago

I’m talking about codes you receive from services you call out to.

mewpmewp2 14 hours ago

jonathrg 16 hours ago

makeitdouble 15 hours ago

> This assumes an error/warning/info/debug set of logging levels instead of something more fine grained, but that's how many things are these days.

Does it ?

Don't most stacks have an additional level of triaging logs to detect anomalies etc.? It can be your New Relic/DataDog/Sentry or a self-made filtering system, but nowadays I'd assume the base log levels are only a rough estimate of whether a single event has any chance of being problematic.

I'd bet the author also has strong opinions about http error codes, and while I empathize, those ships have long sailed.

Waterluvian 11 hours ago

I think this is one of those discussions where there's no one right answer (though there's many wrong answers). All you have to do is pick a reasonable definition, write it down, socialize it, and be consistent when using it.

I think discussions that argue over a specific approach are a form of playing checkers.

peanut-walrus 12 hours ago

Disagree. If you have an error that NEEDS fixing, your program should exit. Error level logs for operation level errors are fine.

Glyptodon 10 hours ago

I agree errors should be errors. Many things that are logged for other reasons should use a different label.

That said, the thing I've come to find useful as a subcategory of error is errors due to data problems vs errors due to other issues.

alexwasserman 15 hours ago

I have been particularly irritated in the past where people use a lower log level and include the higher log level string in the message, especially where it's then parsed, filtered, and alerted on my monitoring.

eg. log level WARN, message "This error is...", but it then trips an error in monitoring and pages out.

Probably breaching multiple rules here around not parsing logs like that, etc. But it's cropped up so many times I get quite annoyed by it.

dragonwriter 15 hours ago

> I have been particularly irritated in the past where people use a lower log level and include the higher log level string in the message, especially where it's then parsed, filtered, and alerted on my monitoring.

If your parsing, filtering, and monitoring setup parses strings that happen to correspond to log level names in positions other than that of log levels as having the semantics of log levels, then that's a parsing/filtering error, not a logging error.

jonathrg 15 hours ago

Stuff like that is a good argument for using structured logging, but even if you are just parsing text logs, surely you can make the parser be a bit more specific when retrieving the log level.
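A small sketch of what that looks like with Python's standard `logging` and `json` modules. `JsonFormatter` and `is_error` are names made up for the example; real setups usually carry more fields (timestamp, request id and so on):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the level in its own field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def is_error(line):
    """Decide from the level field only; the message text is never inspected."""
    return json.loads(line)["level"] in {"ERROR", "CRITICAL"}
```

With this, a WARN record whose message happens to start with "This error is..." no longer trips the alert, because the monitoring side never looks at the message to determine severity.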

dpc_01234 11 hours ago

Error log level should be renamed. It's just a terrible name that confuses usage.

Kinrany 12 hours ago

Why are logs usually assumed to be for human consumption only? It seems weird to me that log storage usually exists outside of the system and isn't a general purpose message bus.

Too 13 hours ago

Agree with the post. The job of blackbox is to turn probes into metrics. If a probe fails, that should just become a probe_success=0 metric. Blackbox did its job and should not log an error.

dnautics 15 hours ago

let's say you get a bunch of database timeouts in a row. this might mean that nothing needs to be fixed. But also, the "thing that needs to be fixed" might be "the ethernet cable fell out the back of your server".

How do you know?

raldi 15 hours ago

You have an alert on what users actually care about, like the overall success rate. When it goes off, you check the WARNING log and metric dashboard and see that requests are timing out.

ImPostingOnHN 15 hours ago

That is a lagging indicator. By the time you're alerted, you've already failed by letting users experience an issue.

raldi 14 hours ago

danaris 13 hours ago

theli0nheart 16 hours ago

I agree with this.

Not everything that a library considers an error is an application error. If you log an error, something is absolutely wrong and requires attention. If you consider such a log as "possibly wrong", it should be a warning instead.

tgv 14 hours ago

I log authorization errors as errors. Are they errors? It depends on how you read the logs. Perhaps you want to distinguish between internal, external and non-attributable errors for easier grepping.

BiraIgnacio 13 hours ago

It means something is wrong, yes. Now, if it's worth fixing (granted, most of the time it would), that's another story.

leni536 14 hours ago

I make error logs fail happy path functional/integration tests for the backend applications I'm currently writing.

mycall 10 hours ago

Severity is the value and you set thresholds based on context of the error type.

plandis 13 hours ago

I agree. Error or higher should result in an alarm and indicates that some corrective action needs to be taken.

shadowgovt 16 hours ago

This is the standard I use as well. In general, my rule of thumb is that if something is logging error, it would have been perfectly reasonable for the program to respond by crashing, and the only reason it didn't is that it's executing in some kind of larger context that wants to stay up in the event of the failure of an individual component (like one handler suffering a query that hangs it and having to be terminated by its monitoring program in a program with multiple threads serving web requests). In contrast, something like an ill-formed web query from an untrusted source isn't even an error because you can't force untrusted sources to send you correctly formed input.

Warning, in contrast, is what I use for a condition that the developer predicted and handled but probably indicates the larger context is bad, like "this query arrived from a trusted source but had a configuration so invalid we had to drop it on the floor, or we assumed a default that allowed us to resolve the query but that was a massive assumption and you really should change the source data to be explicit." Warning is also where I put things like "a trusted source is calling a deprecated API, and the deprecation notification has been up long enough that they really should know better by now."

Where all of this matters is process. Errors trigger pages. Warnings get bundled up into a daily report that on-call is responsible for following up on, sometimes by filing tickets to correct trusted sources and sometimes by reaching out to owners of trusted sources and saying "Hey, let's synchronize on your team's plan to stop using that API we declared is going away 9 months ago."
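The process described above (errors page, warnings go into a daily bundle) might be wired up along these lines in Python. Both handler classes are illustrative stand-ins, and `page` is a hypothetical callable representing whatever actually notifies on-call:

```python
import logging


class PagingHandler(logging.Handler):
    """Stand-in for a pager integration: receives ERROR and above."""

    def __init__(self, page):
        super().__init__(level=logging.ERROR)
        self.page = page  # hypothetical callable that notifies on-call

    def emit(self, record):
        self.page(self.format(record))


class DailyReportHandler(logging.Handler):
    """Collects WARNING records only, to be bundled into a daily report."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.pending = []
        # Exclude ERROR and above; those are already paged.
        self.addFilter(lambda record: record.levelno < logging.ERROR)

    def emit(self, record):
        self.pending.append(self.format(record))

    def flush_report(self):
        """Return the collected warnings and start a new batch."""
        report, self.pending = self.pending, []
        return report
```

Attaching both handlers to the same logger keeps the decision in one place: the call site picks a level, and the level alone determines whether someone is woken up or reads about it the next day.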

nlawalker 15 hours ago

It seems that the easier rule of thumb, then, is that "application logic should never log an error on its own behalf unless it terminates immediately after", and that error-level log entries should only ever be generated from a higher-level context by something else that's monitoring for problems that the application code itself didn't anticipate.

raldi 16 hours ago

Right. If staging or the canary is logging errors, you block/abort the deploy. If it’s logging warnings, that’s normal.

lanstin 6 hours ago

Unless it is logging more warnings because your new code is failing somehow; maybe it stopped parsing the reply correctly from a "is this request rate limited" service so it is only returning 429 to callers never accepting work.

azov 12 hours ago

If my system doesn’t work - I want to be alerted. If notification was supposed to be sent but wasn’t - it’s an error regardless of whether it wasn’t sent because of a bug in my code or external service being down. It may be a warning if I’m still retrying, but if I gave up - it’s an error.

“External service down, not my problem, nothing I can do” is hardly ever the case - e.g. you may need to switch to a backup provider, initiate a support call, or at least try to figure out why it’s down and for how long.

29athrowaway 12 hours ago

Input errors do not need fixing, so no.

lanstin 6 hours ago

If they cause your customers to ditch your product but calling them and saying "your calls are all getting 4xx because you are not putting the state code into the call parameters" would keep them as customers, then you would be wise to make that communication.

dolmen 3 hours ago

But first ensure that the input error is properly reported to the client in the response body (ideally in a structured way), so the client could have figured it out on their own.

If a fix is needed on your side for this matter, having a conversation with a customer might be useful before breaking more stuff. ("We have no state code in EU. Why is that mandatory?").

blkflcn3 12 hours ago

> What an error log level should mean (a system administrator's view)

That says it all:

- Backseat driving

- Not a developer by trade

mschuster91 13 hours ago

> If error level messages are not such a sign, I can assure you that most system administrators will soon come to ignore all messages from your program rather than try to sort out the mess, and any actual errors will be lost in the noise and never be noticed in advance of actual problems becoming obvious.

Bold of you to assume that there are system administrators. All too often these days it's "devops" aka some devs you taught how to write k8s yamls.

mkoubaa 14 hours ago

To me it's always a neat trick when you're not allowed to use print() in production code

vpribish 15 hours ago

I just started playing in the Erlang ecosystem and they have EIGHT levels of logging messages. it seems crazily over-specific, but they are the champions of robust systems.

I could live with 4

Error - alert me now.

Warning - examine these later,

Info - important context for investigations.

Debug - usually off in prod.

emmelaich 5 hours ago

I need Notice (between Info and Warning), for important events such as start and shutdown, and successfully connecting to the database, and ready to start serving. These otherwise would be in Info; and enabling Info level produces a torrent of uninteresting muck.
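In Python's standard `logging` module, a level like that can be added in a few lines. The numeric value 25 and the names `NOTICE`, `announce` and the `"server"` logger are choices made for this sketch; any integer between INFO (20) and WARNING (30) would do:

```python
import logging

NOTICE = 25  # assumed value: between logging.INFO (20) and logging.WARNING (30)
logging.addLevelName(NOTICE, "NOTICE")

logger = logging.getLogger("server")  # hypothetical logger name
logger.setLevel(NOTICE)               # INFO and DEBUG are suppressed


def announce(message, *args):
    """Log lifecycle events such as startup, shutdown or 'ready to serve'."""
    logger.log(NOTICE, message, *args)
```

Running production at NOTICE then shows the start, shutdown and "connected to the database" events without enabling the full Info stream.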

regularfry 15 hours ago

The eight levels in Erlang are inherited from syslog, rather than something specific to Erlang itself.

groundzeros2015 12 hours ago

The first one should be crashing.
