osmo-msc creates evil-twin entries in the VLR when an already attached IMSI does a LU by an unknown TMSI (was: MSC_Tests.TC_lu_by_tmsi_noauth_unknown fails sporadically locally)
When running this test locally against a (sanitize-enabled) osmo-msc, I get sporadic failures.
The difference in the log files seems to start here:
Wed Aug 19 08:10:24 2020 DVLR <000e> gsm_04_08.c:1394 SUBSCR(IMSI-262420000000013:MSISDN-491230000013:TMSI-0x01020304) VLR: update for IMSI=262420000000013 (MSISDN=491230000013)
Wed Aug 19 08:10:29 2020 DVLR <000e> gsm_04_08.c:1394 SUBSCR(IMSI-262420000000013:MSISDN-491230000013:TMSI-0x3BEDCD7C) VLR: update for IMSI=262420000000013 (MSISDN=491230000013) (NO CONN!)
The pcap file containing both cases is attached. They seem exactly identical; the only difference is that in the first (successful) case there is a LU ACCEPT immediately, while in the second (failing) case there is a LU REJECT after a timeout of 5s.
I've seen the failure both when running a single test case and when running the entire MSC_Tests.control in one batch.
I added some pointer value logging, and apparently the vlr_subscr from the first test run sticks around in the VLR.
The second test run creates a duplicate vlr_subscr for the same IMSI.
When the second test run sends the GSUP subscriber update, it gets directed to the first vlr_subscr, while the active connection is associated with the second vlr_subscr.
There should be all sorts of provisions to avoid duplicate vlr_subscr.
I am now trying to figure out how this evil twin vlr_subscr is possible at all.
On the test suite side, the question is whether we should clear the state of osmo-msc, i.e. issue some vty command to clear the entire VLR at the start of each test.
On the osmo-msc stability side, we should still figure out how this can happen and fix it.
Looking at the code, there apparently is a gaping hole in osmo-msc's implementation: the case that an already attached IMSI does a LU with an unknown TMSI is not covered.
Upon LU by TMSI, a new vlr_subscr gets created with that (so far unknown) TMSI.
The VLR asks for the IMSI identity.
When the response comes back, osmo-msc should in fact look up whether this IMSI is already attached in the VLR, which it fails to do.
Instead the new vlr_subscr also gets assigned that IMSI, and hence we have an evil twin in the VLR.
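To make the sequence above concrete, here is a minimal toy model of it. This is a sketch only: `struct toy_subscr`, `struct toy_vlr` and all `toy_*()` functions are hypothetical names invented for illustration and are not the osmo-msc VLR API. It only shows how storing the IMSI from the ID Response without a lookup leaves two entries for one IMSI, and how a first-match IMSI lookup then returns the older one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_VLR_MAX 8

/* Hypothetical stand-in for a VLR entry; not the real struct vlr_subscr. */
struct toy_subscr {
	int used;
	char imsi[16];	/* empty string while the IMSI is still unknown */
	uint32_t tmsi;
};

struct toy_vlr {
	struct toy_subscr entries[TOY_VLR_MAX];
};

/* Return the first entry carrying this IMSI, or NULL. */
static struct toy_subscr *toy_find_by_imsi(struct toy_vlr *vlr, const char *imsi)
{
	for (size_t i = 0; i < TOY_VLR_MAX; i++) {
		struct toy_subscr *s = &vlr->entries[i];
		if (s->used && strcmp(s->imsi, imsi) == 0)
			return s;
	}
	return NULL;
}

/* LU by TMSI: reuse the entry that owns this TMSI, else create a new one. */
static struct toy_subscr *toy_lu_by_tmsi(struct toy_vlr *vlr, uint32_t tmsi)
{
	struct toy_subscr *free_slot = NULL;

	for (size_t i = 0; i < TOY_VLR_MAX; i++) {
		struct toy_subscr *s = &vlr->entries[i];
		if (s->used && s->tmsi == tmsi)
			return s;
		if (!s->used && !free_slot)
			free_slot = s;
	}
	if (!free_slot)
		return NULL;	/* toy VLR is full */
	memset(free_slot, 0, sizeof(*free_slot));
	free_slot->used = 1;
	free_slot->tmsi = tmsi;
	return free_slot;
}

/* ID Response as described above: the IMSI is stored without checking
 * whether another entry already carries it. */
static void toy_id_response_unchecked(struct toy_subscr *s, const char *imsi)
{
	strncpy(s->imsi, imsi, sizeof(s->imsi) - 1);
	s->imsi[sizeof(s->imsi) - 1] = '\0';
}

/* Count entries carrying this IMSI; anything above 1 means an evil twin. */
static size_t toy_count_by_imsi(struct toy_vlr *vlr, const char *imsi)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_VLR_MAX; i++) {
		if (vlr->entries[i].used && strcmp(vlr->entries[i].imsi, imsi) == 0)
			n++;
	}
	return n;
}
```

In this model, once the first entry has moved on to a different TMSI, a second `toy_lu_by_tmsi()` with the old TMSI followed by `toy_id_response_unchecked()` leaves two entries for the same IMSI, and `toy_find_by_imsi()` keeps returning the older one, which is not the one the new connection belongs to.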
This occurs because TC_lu_by_tmsi_noauth_unknown does not do an IMSI-Detach at the end,
but it still acks the TMSI-Reallocation, after which the initial TMSI is no longer kept in the VLR.
The second test run starts with the initial TMSI again, which is then regarded as unknown...
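The TMSI hand-over that makes the initial TMSI "unknown" on the second run can be sketched like this. It is a toy model under assumptions: the `toy_tmsi_*` names are hypothetical, 0xffffffff is used here as the "no TMSI" marker, and matching the pending TMSI while a reallocation is in flight is my simplification, not a statement about the real VLR code.

```c
#include <stdint.h>

/* Assumption: 0xffffffff marks "no TMSI" in this sketch. */
#define TOY_TMSI_NONE 0xffffffffu

/* Hypothetical per-subscriber TMSI state. */
struct toy_tmsi_state {
	uint32_t tmsi;		/* TMSI currently valid for the subscriber */
	uint32_t tmsi_new;	/* TMSI offered in a pending reallocation */
};

/* The network offers a new TMSI; the old one stays valid for now. */
static void toy_tmsi_realloc_start(struct toy_tmsi_state *s, uint32_t new_tmsi)
{
	s->tmsi_new = new_tmsi;
}

/* The MS acked the reallocation: the new TMSI becomes current and the
 * previous one is forgotten. */
static void toy_tmsi_realloc_complete(struct toy_tmsi_state *s)
{
	if (s->tmsi_new == TOY_TMSI_NONE)
		return;
	s->tmsi = s->tmsi_new;
	s->tmsi_new = TOY_TMSI_NONE;
}

/* Does this entry answer to the given TMSI? */
static int toy_tmsi_matches(const struct toy_tmsi_state *s, uint32_t tmsi)
{
	if (tmsi == TOY_TMSI_NONE)
		return 0;
	return s->tmsi == tmsi || s->tmsi_new == tmsi;
}
```

After `toy_tmsi_realloc_complete()`, a lookup with the initial TMSI no longer matches the existing entry, which is the situation the second test run ends up in when there was no IMSI-Detach in between.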
We need a separate test case playing through this scenario: attached subscr does LU with an unknown TMSI.
(btw, the failure is not at all related to the invalid TMSI that is also sent in this test case.)
- Assignee changed from neels to laforge
it's not that trivial though: at the time of the ID Response, the subscriber may not yet be authenticated.
So anyone could come along, send an arbitrary IMSI in the ID Response, and essentially DoS the VLR state of an already attached authentic subscriber.
The pre-existing VLR state must not be affected by an unauthenticated request.
It seems that osmo-msc must keep an unvalidated duplicate vlr_subscr entry, and only ensure a single validated vlr_subscr entry at the time of successful auth.
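One possible shape of "deduplicate only at successful auth", as a toy sketch. All `dup_*` names are hypothetical and the entry layout is invented; it only illustrates the invariant that at most one validated entry per IMSI exists, while unvalidated duplicates are tolerated until they either authenticate or go away.

```c
#include <stddef.h>
#include <string.h>

#define DUP_VLR_MAX 8

/* Hypothetical VLR entry with a validation flag. */
struct dup_entry {
	int used;
	int authenticated;	/* set only after successful auth */
	char imsi[16];
};

struct dup_vlr {
	struct dup_entry entries[DUP_VLR_MAX];
};

/* Called when `winner` has passed authentication: drop every other
 * authenticated entry with the same IMSI, then mark `winner` as validated.
 * Unauthenticated duplicates are left alone, so an unauthenticated request
 * never touches validated state. Returns the number of dropped entries. */
static size_t dup_auth_success(struct dup_vlr *vlr, struct dup_entry *winner)
{
	size_t dropped = 0;

	for (size_t i = 0; i < DUP_VLR_MAX; i++) {
		struct dup_entry *e = &vlr->entries[i];
		if (e == winner || !e->used || !e->authenticated)
			continue;
		if (strcmp(e->imsi, winner->imsi) != 0)
			continue;
		memset(e, 0, sizeof(*e));
		dropped++;
	}
	winner->authenticated = 1;
	return dropped;
}

/* Count validated entries for an IMSI; the invariant is that this is <= 1. */
static size_t dup_count_authenticated(const struct dup_vlr *vlr, const char *imsi)
{
	size_t n = 0;

	for (size_t i = 0; i < DUP_VLR_MAX; i++) {
		const struct dup_entry *e = &vlr->entries[i];
		if (e->used && e->authenticated && strcmp(e->imsi, imsi) == 0)
			n++;
	}
	return n;
}
```

The sketch ignores everything a real entry drags along (FSMs, connections, timers), which is where the actual difficulty lies.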
This potentially goes pretty deep into the VLR design: so far the assumption is that at most one vlr_subscr per IMSI exists.
The GSUP response is identified by IMSI, hence it potentially has to update multiple vlr_subscr entries.
Are there other code paths that need to deal with multiple vlr_subscr for the same IMSI?
Idea: make the current vlr_subscr_find_by_imsi() return only the one entry that has passed authentication.
Code paths that deal with unauthenticated vlr_subscr could use a separate vlr_subscr_find_by_imsi2() (or so) API.
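A rough sketch of how such a split lookup could look, again as a toy model: `lk_find_by_imsi()` and `lk_find_all_by_imsi()` are hypothetical names standing in for the two APIs proposed above, not existing osmo-msc functions. The second one returns all matches so that a caller such as the GSUP handling could update every entry for an IMSI.

```c
#include <stddef.h>
#include <string.h>

#define LK_VLR_MAX 8

/* Hypothetical VLR entry with a validation flag. */
struct lk_entry {
	int used;
	int authenticated;
	char imsi[16];
};

struct lk_vlr {
	struct lk_entry entries[LK_VLR_MAX];
};

/* Default lookup: only the entry that has passed authentication, or NULL. */
static struct lk_entry *lk_find_by_imsi(struct lk_vlr *vlr, const char *imsi)
{
	for (size_t i = 0; i < LK_VLR_MAX; i++) {
		struct lk_entry *e = &vlr->entries[i];
		if (e->used && e->authenticated && strcmp(e->imsi, imsi) == 0)
			return e;
	}
	return NULL;
}

/* Lookup for code paths that must see unauthenticated entries too.
 * Writes up to out_len matches into out[] and returns how many were
 * written. */
static size_t lk_find_all_by_imsi(struct lk_vlr *vlr, const char *imsi,
				  struct lk_entry **out, size_t out_len)
{
	size_t n = 0;

	for (size_t i = 0; i < LK_VLR_MAX && n < out_len; i++) {
		struct lk_entry *e = &vlr->entries[i];
		if (e->used && strcmp(e->imsi, imsi) == 0)
			out[n++] = e;
	}
	return n;
}
```

With this split, existing callers keep their "at most one result" assumption, and only the code paths that explicitly ask for it have to cope with several entries per IMSI.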
A quick solution for now could remove the previous vlr_subscr from the VLR and add the new one, in the hope that it will also authenticate later...
(considering that the new vlr_subscr may already have lu_fsm, auth_fsm etc. associated with it)
So, how important is this aspect at this point in time?
- Subject changed from MSC_Tests.TC_lu_by_tmsi_noauth_unknown fails sporadically locally to osmo-msc creates evil-twin entries in the VLR when an already attached IMSI does a LU by an unknown TMSI (was: MSC_Tests.TC_lu_by_tmsi_noauth_unknown fails sporadically locally)
all of this seemed vaguely familiar, and now I found this patch from about a year ago:
That patch does not cover the DoS aspect, I guess that is why I did not submit it for review.
test case for LU: https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/19718
In the above-mentioned patch from a year ago, I made provision to also handle ID Responses during CM Service Request and Paging Response.
Adding ttcn3 tests, I realize that a CM Service Request for an unknown TMSI gets rejected immediately (pending a LU from the MS).
Tests for those were still missing:
Paging Response: https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/19721
actually uncovers a crash in osmo-msc, see #4724
- Status changed from Feedback to Stalled
- Assignee changed from laforge to neels
> So, how important is this aspect at this point in time?
I think we care more about the consistency of our internal state (and passing the existing test suite) than about a DoS possibility at this point. OsmoMSC is primarily used in lab / test environments, for small/private networks etc.
Probably best to separate that second part out as a separate issue?