
Feature #4311

blocking stderr output inhibits the program?

Added by neels 10 months ago. Updated about 2 months ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
12/06/2019
Due date:
% Done:

10%

Spec Reference:

Description

I'm currently using urxvt (rxvt-unicode) as my terminal program (not sure why, but...),
and every now and then I see odd behavior when the terminal is scrolled up into its scrollback, so that new output lines are not visible.

It seems to me that after a while urxvt stops reading from stdout/stderr. That is weird enough in itself, but here is the Osmocom-related problem:
apparently the server program stops interacting entirely once nothing is reading its log output?

Even though stderr (or any other log target) is not writable, IMO we should still continue to serve all protocols as usual.
A blocking log target should not affect server operation.

Investigate whether my vague impression is correct and find ways to fix that in logging.c.

History

#1 Updated by fixeria 10 months ago

ACK. I observed the described behaviour in tmux while being in "scroll" (Ctrl + Mod + [) mode. Maybe we can solve this problem by introducing write queues?

#2 Updated by laforge 10 months ago

This is pretty much the expected behavior for virtually any program, I would say.

If the program logs to stdout/stderr, then it relies on those file descriptors being
writable every so often. If they are blocked on the reader side, then (after filling some glibc buffering)
things will stall.

I don't really see this as an Osmocom-specific problem, or as a problem at all. I mean, what are
we supposed to do? Allocate ever-growing buffers until we are OOM? Throw away output messages,
even though file descriptor / pipe / stream semantics don't normally do that?

If you don't disable stdout/stderr output, or redirect it to /dev/null, you'd better keep reading from it.

#3 Updated by laforge 10 months ago

On Fri, Dec 06, 2019 at 01:40:46PM +0000, fixeria [REDMINE] wrote:

ACK. I observed the described behaviour in tmux while being in "scroll" (Ctrl + Mod + [) mode. Maybe we can solve this problem by introducing write queues?

but then how do you size those queues and protect against OOM? Without
knowing what is stopping stdout/stderr from being writable, you cannot
really know what the best solution is. If somebody wants to discard the
output, let's have them do that discarding based on their own
preference/policy, outside of osmo-*?

#4 Updated by neels 10 months ago

well, if we "crash" the entire core network just because we can't log to stderr at the moment, then I'd rather drop logging to stderr (and keep logging to gsmtap anyway, in that sense).
It's also a dangerous, non-recoverable situation, in that osmo-x doesn't actually crash and get restarted; it just sits there hogging its place, doing nothing.
Granted, in a systemd service environment that's probably not going to happen (is it?).

It is also often an inconvenience when testing. I would rather lose stderr logging than have a test fail because I forgot to scroll the terminal window back down.
The current way keeps a sword hanging over your head: "how long am I allowed to read this backscroll before everything breaks down?"

Todo: Is there a simple shell magic workaround that could prevent blocking? Like '| tee' or something?
(I still always want stderr, in order to also see the MNCC domain socket communication,
and as a "live" pipe, not cat'ing to a file and tailing that file, to see whether the program is still running.)

#5 Updated by laforge 10 months ago

Hi Neels,

On Tue, Dec 10, 2019 at 12:42:07PM +0000, neels [REDMINE] wrote:

well, if we "crash" the entire core network just because we can't log to stderr at the moment, then I'd rather drop logging to stderr (and keep logging to gsmtap anyway, in that sense).

I think this is flawed, sorry. If you configure logging to any type of output, and that output then
stalls for some reason, then the process stalls. I would consider this the expected behavior of virtually
any 'daemon' style of software.

A logging subsystem that loses events under high system I/O load or otherwise is not worth anything.

How would you ever want to debug anything if you don't know if your debug log contains everything?

Or what if you want to use the log postprocessing for accounting? Or for investigating security incidents?

It's also a dangerous / non-recoverable situation, in that osmo-x doesn't really crash and gets restarted, it just sits there and hogs the space doing nothing.

I would like to invite you to do a test on other 'daemon' style software that is configured to write to stdout (or any file descriptor) for that matter. I would consider this to be perfectly normal behavior for multiple decades.

If a user doesn't want to log to a file descriptor (with its related blocking semantics), he should simply not
configure it, and instead use a non-blocking mechanism like syslog (or GSMTAP, in our case).

Granted, in a systemd service environment that's probably not going to happen (is it?).

I don't know systemd in detail, but I would suppose at some point there could be back-pressure if journald
cannot write sufficiently fast to disk?

It is also often an inconvenience when testing. I would prefer losing stderr logging to the test not working from forgetting to scroll the terminal window.

Once again, I think this argument is (extremely) flawed. If you don't want logging on stdout/stderr, simply disable it.

If you want to ever test or debug or analyze anything, and you cannot know if your log is complete, how would you ever get to any type of reliable result?

The current way keeps a sword over your head of "how long am I allowed to read this backscroll before everything breaks down"

I'm not really following you here. Why would scrolling back in a terminal window block the ability of the
program to write to stdout/stderr? And if so, why would you ever blame anything but said terminal emulator for the results of those actions?

Todo: Is there a simple shell magic workaround that could prevent blocking?

I still don't really understand where the problem is. Why would you (or tools you use) intentionally block stdout/stderr in the first place? I use 'uxterm' here, and I don't recall any situation where I would have accidentally blocked stdout/stderr.

#6 Updated by laforge 10 months ago

On Tue, Dec 10, 2019 at 12:42:07PM +0000, neels [REDMINE] wrote:

I still believe your argument is in favor of fundamentally breaking unix stream / file semantics.

Todo: Is there a simple shell magic workaround that could prevent blocking? like '| tee' or something?

'buffer' is likely what you want, also see
https://stackoverflow.com/questions/8554568/add-a-big-buffer-to-a-pipe-between-two-commands

#7 Updated by neels 9 months ago

On Tue, Dec 10, 2019 at 03:33:36PM +0000, laforge [REDMINE] wrote:

I'm not really following you here. Why would scrolling back in a terminal window block the ability of the
program to write to stdout/stderr? And if so, why would you ever blame anything but said terminal emulator for the results of those actions?

I think that indeed this ends up being the cause, which is why I mentioned
urxvt at the start. Probably this terminal fails to buffer and forward output properly,
and all I need to fix is... my choice of terminal.

I still don't really understand where the problem is? Why would you (or tools you use) intentionally block the stdout/stderr in the first place? I use 'uxterm' here and I don't recall any situation where I would have accidentially blocked stdout/stderr?

Will try uxterm...

#8 Updated by laforge about 2 months ago

So when a program is started interactively in a shell, its STDOUT + STDERR go through a pty (pseudo-terminal).

The size of the buffer inside any PTY on Linux is not configurable. It is a hard-coded default of 8192 bytes (see drivers/tty/pty.c:pty_common_install(), which calls tty_buffer_set_limit(.., 8192)).

So this means: if stderr is in blocking mode (which it is by default and, contrary to other I/O, we don't change this in the Osmocom universe), and we have written 8 kBytes of log output, there is a chance that the process blocks until the consumer of the PTY (your xterm, ...) reads the data.

If I/O redirection is used, there's no PTY at the stderr, but typically a pipe. The size of pipe buffers is configurable, and there are both a soft and a hard system-wide limit on the buffer size. How large that buffer is depends entirely on who creates that pipe, which happens outside of the respective Osmocom program, for example in the shell. The default size is PIPE_DEF_BUFFERS (16) pages (typically 4 kBytes each), so 16*4 = 64 kBytes.

Once the amount of unread log data exceeds 64 kBytes, writes to stderr will stall until the consumer has been scheduled to read the data.

#9 Updated by laforge about 2 months ago

For the log file target of libosmocore, we are using the default size of the stdio buffered I/O system of glibc, which is only 8 kBytes. However, we as the application can easily increase it by using setvbuf(), and I will be submitting a related patch to libosmocore shortly. I guess file-based logging has not yet caused any stalling problems (that we know of), as the kernel I/O system (which has its own buffers) and modern SSDs are very fast in consuming the data, so it's much less likely to ever stall.

Related link: https://stackoverflow.com/questions/10904067/in-c-whats-the-size-of-stdout-buffer/36139638#36139638

We could start to buffer log output inside libosmocore itself (put log messages in msgb's and put them in an osmo_wqueue), just like we deal with any other data going to sockets or the like. The disadvantages would be:
  • if we crash, whatever log messages happened ahead of the crash may not yet have been written to stderr, and one would always have to manually inspect the log buffer with some kind of gdb script. So if you crash without coredump+gdb in place, you will never know what actually happened prior to the crash.
  • unless we introduce some kind of queue limit, there is a risk of going out-of-memory.
I would still argue we should not change the stderr logging, but
  • use setvbuf() to configure a large amount of buffering for log file logging
  • advise users to generally prefer file (or gsmtap) based logging over stderr, particularly when configuring a lot of log output and particularly in production setups

#10 Updated by laforge about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to laforge
  • % Done changed from 0 to 10
