Feature #5904

open

HNBAP Register Request handling does checks that seem to make no sense

Added by neels about 1 year ago. Updated 6 months ago.

Status: New
Priority: Normal
Assignee:
Target version: -
Start date: 02/11/2023
Due date:
% Done: 0%
Spec Reference:

Description

Looking at this code, a couple of things seem to make no sense:

https://gitea.osmocom.org/cellular-infrastructure/osmo-hnbgw/src/commit/a3c7f750a2f77c0c0e0ade1800f5b3a97a2f0b6e/src/osmo-hnbgw/hnbgw_hnbap.c#L431

  • we are using getpeername() to compare the remote address of the current ctx to all other hnb. But we are handling a PDU in an accept()ed connection, and the hnb_context *ctx has already been looked up according to the remote address the PDU was received from.
  • assuming there are two connections from the same remote address, what do we care about the IP address? if there are two HNB entities connecting via two distinct conns, so be it, from the same address or not. We should only care about the HNB id (LAC,SAC,RAC,CID,MCC,MNC identity)
  • assuming we want to check the remote address: in the error checking path of failed getpeername(), we are still going on to compare as if the result from getpeername() was valid.
  • the loop is guarding against two HNB registering with identical HNB id.
    • When this occurs from the same remote address, we discard the old hnb_context. However, in practice it seems that no second hnb_context is allocated for the same remote address. Instead, the HNB-Register-Req is dispatched to the same hnb_context in osmo-hnbgw, and this loop is essentially skipped (the log often shows the "(duplicated)" at the bottom of the function).
    • When this occurs from a different remote address, we reject the new hnb_context (bottom of the loop body). Now, in recent efforts to hunt down UE state leaks in osmo-hnbgw, it became clear that a HNB disconnecting may go unnoticed. Imagine a HNB's conn being disrupted, its state still present in osmo-hnbgw. If a new conn is established for this HNB from another remote address, we refuse this new connection, favoring the disrupted connection. It may be frustrating for a user trying to reconnect a HNB via a different link. The comment names two reasons:
      • misconfiguration: it is the user's responsibility to configure distinct HNBs with distinct identities. Is it helpful to refuse a new connection when a previous connection became stale and lingers? It may avoid oscillation of two HNBs with same id competing, but it is also hard to recover from a failed link when refusing new connections.
      • impersonation: we have no authentication of HNB, we are utterly incapable of avoiding impersonation anyway.
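
To illustrate the third point above (comparing after a failed getpeername()), here is a minimal sketch of an address comparison that reports the lookup failure instead of comparing uninitialized data. The helper name `same_peer_addr()` is hypothetical and not taken from osmo-hnbgw; it compares the full socket address including the port, which may differ from what the real code compares.

```c
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of osmo-hnbgw.
 * Returns 1 if both sockets have the same peer address (including port),
 * 0 if the peers differ, and -1 if either getpeername() call fails. */
static int same_peer_addr(int fd_a, int fd_b)
{
	struct sockaddr_storage a, b;
	socklen_t alen = sizeof(a);
	socklen_t blen = sizeof(b);

	/* Zero both buffers so that padding bytes compare equal. */
	memset(&a, 0, sizeof(a));
	memset(&b, 0, sizeof(b));

	/* On failure, return early: the buffers hold no valid address. */
	if (getpeername(fd_a, (struct sockaddr *)&a, &alen) < 0)
		return -1;
	if (getpeername(fd_b, (struct sockaddr *)&b, &blen) < 0)
		return -1;

	return alen == blen && memcmp(&a, &b, alen) == 0;
}
```

A caller would treat -1 as "unknown" and decide explicitly what to do, rather than falling through into the comparison.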

3GPP TS 25.469 8.2.4 Abnormal Conditions:
"""
If the HNB-GW receives a duplicate HNB REGISTER REQUEST (i.e. for an already registered HNB identified by the
unique HNB identity), then the new HNB REGISTER REQUEST shall override the existing registration and the
handling of the new HNB REGISTER REQUEST is according to section 8.2.
"""

Context:

It seems the initial motion to reject a duplicate HNB REGISTER was wrong, and the newer patch tried to alleviate this, but still kept the rejection based on the remote address for reasons not named.

My idea for this is:
  • do not check remote addresses at all.
  • compare only HNB IDs. If they match, always discard the old hnb_context, allow the new HNB register.
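
The two points above could be sketched roughly as follows. This is not the code from the gerrit patch: `struct hnb_id`, `struct hnb_ctx` and `hnb_register_override()` are simplified, hypothetical stand-ins, and the real code would also have to release the old connection and its UE state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real osmo-hnbgw structures. */
struct hnb_id {
	uint16_t mcc, mnc, lac, sac;
	uint8_t rac;
	uint32_t cid;
};

struct hnb_ctx {
	bool registered;
	struct hnb_id id;
};

static bool hnb_id_equal(const struct hnb_id *a, const struct hnb_id *b)
{
	return a->mcc == b->mcc && a->mnc == b->mnc && a->lac == b->lac
		&& a->sac == b->sac && a->rac == b->rac && a->cid == b->cid;
}

/* Register new_ctx with the given identity. Any other context holding the
 * same identity is overridden, regardless of its remote address.
 * Returns the number of older registrations that were discarded. */
static size_t hnb_register_override(struct hnb_ctx *all, size_t n,
				    struct hnb_ctx *new_ctx,
				    const struct hnb_id *id)
{
	size_t dropped = 0;

	for (size_t i = 0; i < n; i++) {
		struct hnb_ctx *other = &all[i];

		if (other == new_ctx || !other->registered)
			continue;
		if (hnb_id_equal(&other->id, id)) {
			/* Real code would close the conn and free UE state here. */
			other->registered = false;
			dropped++;
		}
	}

	new_ctx->id = *id;
	new_ctx->registered = true;
	return dropped;
}
```

Note that no remote address appears anywhere in the decision, which matches the 3GPP TS 25.469 8.2.4 wording quoted above.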

To test this, I implemented this behavior, and HNBGW_Tests.ttcn3 still succeeds
(with only the expected failure of TC_hnb_register_duplicate(), which no longer gets the HNB Register Reject it wants to see):
https://gerrit.osmocom.org/c/osmo-hnbgw/+/31289

feedback welcome

Actions #1

Updated by laforge 11 months ago

neels wrote:

>   • assuming there are two connections from the same remote address, what do we care about the IP address? if there are two HNB entities connecting via two distinct conns, so be it, from the same address or not. We should only care about the HNB id (LAC,SAC,RAC,CID,MCC,MNC identity)

ack, doesn't make sense.

>   • When this occurs from a different remote address, we reject the new hnb_context (bottom of the loop body). Now, in recent efforts to hunt down UE state leaks in osmo-hnbgw, it became clear that a HNB disconnecting may go unnoticed. Imagine a HNB's conn being disrupted, its state still present in osmo-hnbgw. If a new conn is established for this HNB from another remote address, we refuse this new connection, favoring the disrupted connection. It may be frustrating for a user trying to reconnect a HNB via a different link. The comment names two reasons:
>     • misconfiguration: it is the user's responsibility to configure distinct HNBs with distinct identities. Is it helpful to refuse a new connection when a previous connection became stale and lingers? It may avoid oscillation of two HNBs with same id competing, but it is also hard to recover from a failed link when refusing new connections.

I guess the point is that a disrupted old connection should time out shortly (due to keepalives or some SCTP timers). So yes, there might be a period of time when new connections are refused while the old one still lingers. But after a few [dozens of] seconds, I would expect any dead old connection to disappear. If it doesn't, some timers and/or keepalives are not configured reasonably, and that should be fixed.

I'm always happy to replace dead old connections. But replacing old connections just because there is a new one is a bad strategy. Imagine two HNBs configured by accident to have the same identity. In that situation, the old connection is very much still alive. The existing behavior would make the first HNB to connect work flawlessly, while the second one would get rejected and never provide service.

Your suggested alternative would be to ping-pong between the two HNBs, disrupting service to both.

So IMHO we need reasonable timers/keepalives/... that clean up dead old connections reasonably fast, or we need some other kind of metric or mechanism to determine whether the old connection is still alive. One could probably find some way at application level to probe it (in the worst case, send a malformed message and expect a response). But once again, I think this would just be a work-around for non-working SCTP level timers/keepalive timeouts.

> 3GPP TS 25.469 8.2.4 Abnormal Conditions:
> """
> If the HNB-GW receives a duplicate HNB REGISTER REQUEST (i.e. for an already registered HNB identified by the
> unique HNB identity), then the new HNB REGISTER REQUEST shall override the existing registration and the
> handling of the new HNB REGISTER REQUEST is according to section 8.2.
> """

Ok, so the spec says it should be different than my decades of networking gut feeling described above. So you have a strong argument to change the behavior to what is specified, and I won't veto it.

Actions #2

Updated by laforge 11 months ago

  • Assignee set to neels

Actions #3

Updated by neels 6 months ago

Just now, I power cycled a 3G femto cell attached to osmo-hnbgw, and I got this:

There were 5764 HNB-Register-Req and NACK roundtrips in 12 seconds, with the log indicating:

DHNBAP ERROR (xxx) rejecting HNB-REGISTER-REQ with duplicate cell identity 901-70-L1000-R55-S23-C23 (hnbgw_hnbap.c:459)
DHNBAP ERROR (xxx) rejecting HNB-REGISTER-REQ with duplicate cell identity 901-70-L1000-R55-S23-C23 (hnbgw_hnbap.c:459)
DHNBAP ERROR (xxx) rejecting HNB-REGISTER-REQ with duplicate cell identity 901-70-L1000-R55-S23-C23 (hnbgw_hnbap.c:459)
DHNBAP ERROR (xxx) rejecting HNB-REGISTER-REQ with duplicate cell identity 901-70-L1000-R55-S23-C23 (hnbgw_hnbap.c:459)
...

In the log timestamps, it looks like one reject per millisecond.
On average (5764 rejects in 12 seconds), it works out to roughly one reject per two milliseconds.
Finally, a successful Register Accept follows and the cell comes up.

Two aspects stand out:

inadvertent DoS attack by femto

It is certainly not hnbgw's fault that the femto decides to ask hundreds of times per second for a HNB Register.
It seems odd that the femto cell should decide to DoS the HNBGW while getting nothing but rejects.
This particular model seems to be missing a hold-off timer.

I wonder, is there anything that osmo-hnbgw could do to counter an hNodeB behaving like this?
We could add an artificial delay before responding with an HNB Register Reject.

re-attach identical cell

In this scenario, where I have knowingly restarted a femto cell, the behavior that I desire is that osmo-hnbgw accepts the re-attach immediately. (Which incidentally also avoids the above situation...)

I still think the current code is optimising for the wrong case:
We are protecting a user from odd behavior in case they have a configuration error (identical CID),
but we are causing this same odd behavior in a case where the configuration has no errors and a cell has merely restarted.

So I do think osmo-hnbgw should immediately accept HNB Register Request with identical CID,
and clean out the previous connection.

Actions #4

Updated by laforge 6 months ago

On Thu, Aug 31, 2023 at 10:17:45PM +0000, neels wrote:

> inadvertent DoS attack by femto
>
> It is certainly not hnbgw's fault that the femto decides to ask hundreds of times per second for a HNB Register.
> It seems odd that the femto cell should decide to DoS the HNBGW while getting nothing but rejects.
> This particular model seems to be missing a hold-off timer.
>
> I wonder, is there anything that osmo-hnbgw could do to counter an hNodeB behaving like this?
> We could add an artificial delay before responding with an HNB Register Reject.

Yes, we can introduce some kind of (exponentially or linearly increasing) delay in the reject messages we send.
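
As a rough sketch of the exponential variant, the delay before each consecutive reject could be computed like this. The function name and both constants are hypothetical and not taken from osmo-hnbgw; the counter would have to be tracked per HNB identity or per connection and reset after a successful registration.

```c
#include <stdint.h>

/* Hypothetical parameters, chosen only for illustration. */
#define REJECT_DELAY_BASE_MS 100u
#define REJECT_DELAY_MAX_MS 10000u

/* Delay to apply before sending the next HNB Register Reject, given how
 * many consecutive rejects were already sent. The delay doubles with
 * each reject and is capped at REJECT_DELAY_MAX_MS. */
static uint32_t reject_delay_ms(unsigned int n_rejects)
{
	uint32_t delay = REJECT_DELAY_BASE_MS;

	/* Stop doubling once the cap is reached, so the value cannot overflow. */
	while (n_rejects-- > 0 && delay < REJECT_DELAY_MAX_MS)
		delay *= 2;

	return delay > REJECT_DELAY_MAX_MS ? REJECT_DELAY_MAX_MS : delay;
}
```

With these example values, a femto retrying in a tight loop would see its rejects spaced 100 ms, 200 ms, 400 ms, ... apart, up to 10 s, instead of one every two milliseconds.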

> In this scenario, where I have knowingly restarted a femto cell, the behavior that I desire is that osmo-hnbgw accepts the re-attach immediately. (Which incidentally also avoids above situation...)

So the question is: how can the hnbgw differentiate this situation from the misconfiguration case, where two live HNBs share one identity?

Why is the HNBGW unable to determine that the old connection is dead? How long does it take today
to reach that conclusion? Can we improve heartbeat or other SCTP timeouts to improve that detection
time?

> So I do think osmo-hnbgw should immediately accept HNB Register Request with identical CID,
> and clean out the previous connection.

I disagree. We should be quicker in detecting the fault of the old connection. If that happens in
(for example) less than 10s, I'd be more than happy to say we reject new registrations within that period.

Changing the behavior as you request would just plaster over the real issue, IMHO.
