Feature #5917

open

immediately detect SCTP SHUTDOWN in SCCP link / in SCCP user / in active SCCP connections

Added by neels about 1 year ago. Updated about 1 year ago.

Status: Feedback
Priority: Normal
Assignee:
Target version: -
Start date: 02/21/2023
Due date:
% Done: 0%
Spec Reference:

Description

This is a question about SCCP concepts for link loss detection.

I'm trying to trigger an SCCP link loss to examine the cleanup / leak behavior of osmo-hnbgw.
I'd like to find out how an SCTP link loss would propagate to osmo-hnbgw code and signal an SCCP link loss.
(IOW looking for a place that would trigger an FSM event like MY_SCCP_EV_LINK_LOST)

I have two scenarios: in one, osmo-stp is killed; in the other, the remote entity behind osmo-stp is disconnected.


(1) kill osmo-stp

I tried this:
  • in HNBGW_Tests.ttcn, cranked up T_guard to 10000.0 s
  • in a RAB Assignment test, just after the RAB is established, insert f_sleep(5000.0)
  • run the test to establish a context mapping with SCCP connection in osmo-hnbgw
  • kill osmo-stp

I immediately see lots of low level DLINP, DLSS7 and some DLSCCP logging showing that an SCTP SHUTDOWN event was processed, and that the XUA AS restarts and tries to reconnect. But none of this makes its way up into osmo-hnbgw.

After about 15 minutes(!), I receive sccp_sap_up(N-DISCONNECT.indication) on the SCCP connection.
So, we do have a cleanup trigger, but

  • Is it expected to take this long, given that an SCTP SHUTDOWN is detected in libosmo-sigtran immediately?
  • Do we only get an N-DISCONNECT on individual SCCP conns? My idea was to trigger a LINK_LOST event on the SCCP link to the CN, i.e. a signal that the entire SCCP layer is gone. Does that exist, conceptually?

(log attached)

In contrast, when the RUA side of osmo-hnbgw sees an SCTP SHUTDOWN, osmo-hnbgw immediately registers that all HNB are disconnected, by means of the read cb() passed to osmo_stream_srv_create().


(2) disconnect remote SCCP peer, STP still up
i.e. in ttcn, after the conn is established, call f_ran_adapter_cleanup(g_msc); f_ran_adapter_cleanup(g_sgsn);
and continue to f_sleep(5000.0) keeping the HNB connected.

Here I immediately see an N-PCSTATE.indication containing a DUNA (Destination Unavailable) coming up the SCCP user SAP, one each for the MSC and the SGSN point-code. osmo-hnbgw ignores N-PCSTATE so far, I guess it might be a good idea to implement acting on the DUNA messages. I see now that we can simply read out prim->u.pcstate.

Since the DUNA is so far ignored, the same as above happens. After about 15 minutes, sccp_scoc.c sends up an N-DISCONNECT for the individual SCCP connection and we do clean up, eventually.
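
To illustrate acting on N-PCSTATE instead of ignoring it, here is a minimal sketch. All types and names below are simplified stand-ins invented for this example, not the libosmo-sigtran definitions; in osmo-hnbgw the affected point code and status would be read out of prim->u.pcstate. It only tracks reachability per CN peer and leaves open what the application does with that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only, NOT the libosmo-sigtran
 * types. Real code would take these values from prim->u.pcstate. */
enum sp_status { SP_INACCESSIBLE, SP_CONGESTED, SP_ACCESSIBLE };

struct pcstate_ind {
	uint32_t affected_pc;
	enum sp_status sp_status;
};

/* One entry per CN peer (e.g. MSC, SGSN) known to the application. */
struct cn_peer {
	const char *name;
	uint32_t remote_pc;
	bool reachable;
};

/* Update the reachability of every CN peer behind the affected point code.
 * Returns the number of peers whose state actually changed. */
static int cn_peers_rx_pcstate(struct cn_peer *peers, size_t n,
			       const struct pcstate_ind *ind)
{
	int changed = 0;
	size_t i;
	bool now_reachable;

	assert(peers && ind);

	/* Congestion does not change reachability in this sketch. */
	if (ind->sp_status == SP_CONGESTED)
		return 0;
	now_reachable = (ind->sp_status == SP_ACCESSIBLE);

	for (i = 0; i < n; i++) {
		if (peers[i].remote_pc != ind->affected_pc)
			continue;
		if (peers[i].reachable == now_reachable)
			continue;
		peers[i].reachable = now_reachable;
		changed++;
	}
	return changed;
}
```

A caller could dispatch something like a LINK_LOST event for each peer that changed to unreachable, or merely log it.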


So in summary:
  • when a remote entity behind STP goes bust, I can already trigger my LINK_LOST event when osmo-hnbgw sees a PCSTATE indicating that the remote point-code for CS / PS CN becomes unavailable.
  • when the first SCTP hop goes bust (kill osmo-stp), maybe we can implement some prim going up the SCCP user SAP? Would that also be a DUNA, based on active SCCP conns' remote point-code, or is that a hacky layer violation?

Files

stp_killed.log 139 KB neels, 02/21/2023 02:50 AM
stp_killed.pcapng 200 KB neels, 02/21/2023 02:56 AM
cn_disconnected.pcapng 107 KB neels, 02/21/2023 03:24 AM
cn_disconnected.log 64 KB neels, 02/21/2023 03:24 AM
Actions #1

Updated by laforge about 1 year ago

On Tue, Feb 21, 2023 at 03:39:24AM +0000, neels wrote:

This is a question about SCCP concepts for link loss detection.

If I want to be strict to my understanding of SS7, what you are asking doesn't exist.

A signaling link is 20 miles below SCCP in the protocol stack. SCCP should never be concerned with that. A link is part of a linkset. From the SS7 PoV, a linkset should already contain multiple links, and there can and should be multiple routes (via multiple linksets) to reach any given destination point code.

In the SIGTRAN world, each of those SS7 links is mapped to an SCTP association, which in turn can and should have multiple IP addresses on each side. And below that you have one (or multiple!) IP networks, each of which should have dynamic routing, with multiple routes. And yet below that you can have redundant ethernet links.

So the deep stack provides ample opportunity at every layer for telecom companies, with their traditional "you can never have too much reliability" attitude, to make sure that an SS7 link loss does not matter to the applications.

From a layering point of view, SCCP doesn't even know (nor should know) what a signaling link, SCTP or (god forbid) IP is. All it is aware of is point codes and global titles, and whether or not those are reachable right now.

I immediately see lots of low level DLINP, DLSS7 and some DLSCCP logging showing that an SCTP SHUTDOWN event was processed, and that the XUA AS restarts and tries to reconnect. But none of this makes its way up into osmo-hnbgw.

I think it's a question of whether it should, i.e. whether a non-STP entity in the SS7 network should generate SCCP USER SAP primitives that are normally originated by (in my understanding) STP/SCPs.

Furthermore, SCCP state is not reset on any loss of a single (or even all) underlying SS7 links. They could recover some milliseconds or seconds later, and both sides would continue as if nothing happened. So closing SCCP connections just because an underlying signalling link has potentially temporarily disappeared could possibly lead to error amplification.

Look at an analogy: Do all your TCP connections close just because you disconnect your ethernet link? No. Do all your TCP sockets get notified of the Ethernet link loss? No. Why? Because TCP runs on top of IP, and TCP has no notion of what happens below IP. All that TCP would note is that at some point its own timeouts are triggering, if neither IP nor underlying transport layers are recovering in time. Note that TCP timeouts / keepalives are entirely optional, and it is valid for an open TCP connection to stay that way indefinitely, until/unless either side starts to transmit something, which will eventually lead to timeouts if ACKs for that are never received.

After about 15 minutes(!), I receive sccp_sap_up(N-DISCONNECT.indication) on the SCCP connection.

Yes, that's the normal SCCP level timeout. Which can of course (from the SS7 point of view) be reconfigured by any user as they please.

  • Is it expected to take this long, given that an SCTP SHUTDOWN is detected in libosmo-sigtran immediately?

I would presume the default of 15 minutes is taken directly out of the SCCP specs. It's only a bug if our default is outside what Q.7xx specifies.

My idea was to trigger a LINK_LOST event on the SCCP link to the CN, i.e. a signal that the entire SCCP layer is gone. Does that exist, conceptually?

No. SCCP should not care about a given single SS7 link, see above. If at all, conceptually one could think of notifications about the loss of the last of all potential SS7 links that could route to a given destination point code, which is what we can indicate using N-PCSTATE indications.

In contrast, when the RUA side of osmo-hnbgw sees an SCTP SHUTDOWN, osmo-hnbgw immediately registers that all HNB are disconnected, by means of the read cb() passed to osmo_stream_srv_create().

You're comparing RUA, a protocol directly over SCTP with an application that runs over a complex, multi-layered SS7 signaling network. Compare that with sending raw ethernet frames and using a TCP socket. Of course they will behave differently.

Also, an SCTP SHUTDOWN doesn't happen at the time an (ethernet, or whatever physical) link is lost. That shutdown would only happen once SCTP itself has either been disconnected, or determined (via its own internal timeouts) that the association is dead.

Here I immediately see an N-PCSTATE.indication containing a DUNA (Destination Unavailable) coming up the SCCP user SAP, one each for the MSC and the SGSN point-code.

This reflects my understanding of SS7.

osmo-hnbgw ignores N-PCSTATE so far, I guess it might be a good idea to implement acting on the DUNA messages. I see now that we can simply read out prim->u.pcstate.

It could, if it wanted to. However, keep in mind that immediately closing all SCCP connections could lead to error amplification (see above). So if at all, it might make sense to have a timer for each destination point code, starting on a negative N-PCSTATE.ind and stopping at a positive one. But at that point you're basically replicating functionality that could be achieved via lower SCCP connection-level timeouts, right?

I think the application (SCCP user) behavior to those indications is user-defined, i.e. up to the application.
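
The per-point-code timer idea could look roughly like the following sketch. Everything in it (struct pc_guard, the function names, the millisecond clock passed in by the caller) is hypothetical and chosen for illustration; real code would more likely hook into the application's existing timer infrastructure than poll a deadline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-destination-point-code state; names are illustrative. */
struct pc_guard {
	uint32_t pc;
	bool unavailable;     /* last N-PCSTATE.ind was negative */
	uint64_t deadline_ms; /* only meaningful while unavailable is set */
};

/* Negative N-PCSTATE.ind: start the grace timer, unless it already runs. */
static void pc_guard_rx_unavailable(struct pc_guard *g, uint64_t now_ms,
				    uint64_t grace_ms)
{
	assert(g);
	if (g->unavailable)
		return; /* keep the original deadline */
	g->unavailable = true;
	g->deadline_ms = now_ms + grace_ms;
}

/* Positive N-PCSTATE.ind: stop the grace timer. */
static void pc_guard_rx_available(struct pc_guard *g)
{
	assert(g);
	g->unavailable = false;
}

/* True once the point code has stayed unavailable for the whole grace
 * period, i.e. when the application would release its SCCP connections. */
static bool pc_guard_expired(const struct pc_guard *g, uint64_t now_ms)
{
	assert(g);
	return g->unavailable && now_ms >= g->deadline_ms;
}
```

A short outage that recovers within the grace period then causes no connection teardown at all, which is the point of not reacting immediately.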

So in summary:
  • when a remote entity behind STP goes bust, I can already trigger my LINK_LOST event when osmo-hnbgw sees a PCSTATE indicating that the remote point-code for CS / PS CN becomes unavailable.

Again, it's not a link that is lost, but the reachability of a given point code that has (potentially temporarily) changed. This would typically mean that the last SS7 link of any SS7 linkset in the path between HNBGW and the given point code is lost at the last potential path/route in the SS7 network. So all redundancy mechanisms at all layers on all potential redundant routes have failed.

  • when the first SCTP hop goes bust (kill osmo-stp), maybe we can implement some prim going up the SCCP user SAP? Would that also be a DUNA, based on active SCCP conns' remote point-code, or is that a hacky layer violation?

I think that could potentially make sense. It's a bit of a question of whether we think the SCCP on an SSP (end node) should behave towards its local user as an SGP (STP, router) would. Feel free to check the relevant Q.7xx and see if you can find any guidance on that.

Actions #2

Updated by neels about 1 year ago

Thanks for these clarifications!

Again I am amazed that, with all the detail I know about GSM and 3G
networks and after all the direct work I've been doing with SCCP, even
introducing the A interface into osmo-msc, I got through all that while
still being ignorant of some of the most basic concepts embedded in it =)
Probably because it is easy to stay on layer 3 and up and leave the rest to
libosmo-sigtran.

What I take from it:

  • we can detect that a remote point-code becomes unreachable (PCSTATE).
  • but we don't necessarily want an application to tear down every conn immediately.
  • we could immediately act if an application's own entrypoint to the
    SS7 network goes down entirely (STP unreachable, or more like, all
    redundant STP are unreachable?)

About PCSTATE, I know that in osmo-bsc, we have this transmission failure
counter: if an MSC in the MSC pool fails to respond to SCCP CR three times in a
row, we regard it as unreachable and start transmitting BSSMAP RESET messages
to it. Instead, or in addition, we might want to do so based on a timeout after
receiving an N_PCSTATE with "point-code unreachable" -- then we don't
necessarily need three connections to fail before noticing a problem.
Something similar could be added in osmo-hnbgw.
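
As a rough sketch of combining both triggers (this is not the actual osmo-bsc code; the struct, the event names and the threshold constant are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold, mirroring "three times in a row" from above. */
#define MAX_CR_FAILURES 3

enum msc_event {
	MSC_EV_CR_ANSWERED,            /* peer responded to an SCCP CR */
	MSC_EV_CR_TIMEOUT,             /* an SCCP CR went unanswered */
	MSC_EV_PC_UNREACHABLE_TIMEOUT, /* timeout after a negative PCSTATE */
};

struct msc_health {
	unsigned int cr_failures; /* consecutive unanswered SCCP CR */
	bool usable;
};

/* Feed one event into the health state. Returns true exactly when the MSC
 * just turned unusable, i.e. when the caller would start sending RESET.
 * Becoming usable again (e.g. after a RESET ACK) is outside this sketch. */
static bool msc_health_event(struct msc_health *m, enum msc_event ev)
{
	assert(m);

	switch (ev) {
	case MSC_EV_CR_ANSWERED:
		m->cr_failures = 0;
		return false;
	case MSC_EV_CR_TIMEOUT:
		m->cr_failures++;
		if (m->cr_failures < MAX_CR_FAILURES)
			return false;
		break;
	case MSC_EV_PC_UNREACHABLE_TIMEOUT:
		/* No need to wait for three failed connections. */
		break;
	}

	if (!m->usable)
		return false; /* already known to be unusable */
	m->usable = false;
	return true;
}
```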

Actions #3

Updated by laforge about 1 year ago

  • Status changed from New to Feedback
  • Assignee set to neels

neels wrote in #note-2:

About PCSTATE, I know that in osmo-bsc, we have this transmission failure
counter: if an MSC in the MSC pool fails to respond to SCCP CR three times in a
row, we regard it as unreachable and start transmitting BSSMAP RESET messages
to it.

The question is whether or not one can reliably expect the underlying SS7/MTP/SIGTRAN network to inform the application about unreachable point codes. My understanding is that this was the case in classic SS7 networks, but it may not always be the case in the SIGTRAN world, as there are so many incomplete implementations out there.

Instead, or in addition, we might want to do so based on a timeout after
receiving an N_PCSTATE with "point-code unreachable" -- then we don't
necessarily need three connections to fail before noticing a problem.

Indeed. As soon as an application has received a PCSTATE/unreachable, it is pointless to send further traffic to any of the affected point codes, until it has received the corresponding PCSTATE/reachable.

Something similar could be added in osmo-hnbgw.

Ack. Or virtually any code using SCCP.
