Project

General

Profile

Actions

Bug #5559

open

OsmoMSC at 100% CPU and unresponsive for up to several minutes!

Added by keith 7 days ago. Updated 1 day ago.

Status:
In Progress
Priority:
Urgent
Assignee:
Category:
SMS
Target version:
-
Start date:
05/13/2022
Due date:
% Done:

20%

Resolution:
Spec Reference:

Description

Not much more to say than the title I'm afraid.

So far, I've actually only noticed it on a system using the RBS and osmo-e1d. But I do not have conclusive proof that it is exclusively happening here.

I'm assuming a culprit might be the sms queue, but I'm not convinced because I'm not seeing it on other systems with more messages in the queue in the sqlite db - and this can be upwards of 1,000 SMS queued.


Related issues

Related to OsmoMSC - Bug #5563: OsmoMSC sometimes stalls for dozens of seconds in a production deploymentIn Progresslaforge05/14/2022

Actions
Actions #1

Updated by laforge 7 days ago

I'm logged into one of the production systems which is currently seeing moderate load (osmo-msc 20-50% (peak) CPU load) and in the perf top for osmo-msc alone I can see the majority of the time is spent in sqlite3 and (most likely) sqlite3 triggered malloc/free:

using perf top -p 3914 --call-graph fp

+    6.58%     6.48%  libc-2.28.so                  [.] _int_malloc
+    5.37%     5.08%  libpthread-2.28.so            [.] __pthread_mutex_lock                                                                                                                                        
+    5.09%     5.03%  libc-2.28.so                  [.] _int_free           
+    3.77%     3.53%  libc-2.28.so                  [.] malloc                        
+    3.71%     3.45%  libsqlite3.so.0.8.6           [.] sqlite3_str_vappendf                                                                                                                                        
+    3.70%     0.07%  libc-2.28.so                  [.] __fcntl64_nocancel_adjusted                                                                                                                                 
+    3.19%     2.94%  libpthread-2.28.so            [.] __pthread_mutex_unlock_usercnt                                                                                                                              
+    3.16%     0.09%  [kernel]                      [k] entry_SYSCALL_64_after_hwframe                                                                                                                              
+    3.08%     0.89%  [kernel]                      [k] do_syscall_64                                                                                                                                               
+    2.33%     1.86%  libc-2.28.so                  [.] __strlen_avx2                 
+    2.29%     2.26%  libc-2.28.so                  [.] malloc_consolidate                                                                                                                                          
+    2.02%     1.43%  libc-2.28.so                  [.] __memmove_avx_unaligned_erms                                                                                                                                
+    1.87%     1.07%  libsqlite3.so.0.8.6           [.] sqlite3_free                                                                                                                                                
+    1.81%     0.07%  [kernel]                      [k] __x64_sys_fcntl                                                                                                                                             
+    1.61%     0.07%  [kernel]                      [k] do_fcntl                                                                                                                                                    
+    1.48%     1.30%  libc-2.28.so                  [.] cfree@GLIBC_2.2.5                                                                                                                                           
+    1.47%     0.12%  [kernel]                      [k] fcntl_setlk                                                                                                                                                 
     1.23%     0.73%  libsqlite3.so.0.8.6           [.] sqlite3_mutex_enter@plt                                                                                                                                     
     1.19%     0.00%  [unknown]                     [.] 0000000000000000                                                                                                                                            
+    1.14%     1.01%  libc-2.28.so                  [.] __strcmp_avx2                                                                                                                                               
+    1.03%     0.30%  libsqlite3.so.0.8.6           [.] sqlite3_mutex_enter                                                                                                                                         
     1.01%     0.02%  [kernel]                      [k] do_lock_file_wait                                                                                                                                           
+    0.91%     0.57%  libsqlite3.so.0.8.6           [.] sqlite3_column_text                                                                                                                                         
+    0.88%     0.76%  libsqlite3.so.0.8.6           [.] sqlite3_exec                                                                                                                                                
     0.86%     0.51%  libsqlite3.so.0.8.6           [.] sqlite3_mutex_leave@plt                                                                                                                                     
     0.82%     0.71%  libsqlite3.so.0.8.6           [.] sqlite3_vsnprintf                                                                                                                                           
     0.79%     0.00%  [unknown]                     [k] 0x0000000000000511                                                                                                                                          
+    0.77%     0.22%  [kernel]                      [k] posix_lock_inode                                                                                                                                            
+    0.75%     0.23%  libsqlite3.so.0.8.6           [.] sqlite3_mutex_leave                                                                                                                                         
+    0.67%     0.23%  libsqlite3.so.0.8.6           [.] pthread_mutex_lock@plt                                                                                                                                      
     0.66%     0.00%  [unknown]                     [.] 0x0000000000000200                                                                                                                                          
     0.64%     0.00%  [unknown]                     [.] 0x7374706d65747461                                                                                                                                          
     0.64%     0.00%  [unknown]                     [.] 0x0000555555ba9a08                                                                                                                                          
     0.63%     0.54%  libc-2.28.so                  [.] __strspn_sse42                                                                                                                                              
     0.62%     0.62%  libsqlite3.so.0.8.6           [.] 0x0000000000011ad5                                                                                                                                          
     0.61%     0.54%  libsqlite3.so.0.8.6           [.] memcpy@plt                                                                                                                                                  
     0.59%     0.00%  [kernel]                      [k] ret_from_intr                                                                                                                                               
     0.59%     0.00%  [kernel]                      [k] do_IRQ                                                                                                                                                      
     0.57%     0.00%  [unknown]                     [k] 0x000000000098159b                                                                                                                                          
     0.57%     0.00%  [unknown]                     [k] 0x00005555557b0d08                                                                                                                                          
     0.56%     0.17%  libsqlite3.so.0.8.6           [.] pthread_mutex_unlock@plt                                                                                                                                    
     0.56%     0.46%  libsqlite3.so.0.8.6           [.] sqlite3_snprintf                                                                                                                                            
+    0.56%     0.47%  libsqlite3.so.0.8.6           [.] sqlite3_step                                                                                                                                                
     0.54%     0.49%  libc-2.28.so                  [.] __strtok_r                                                                                                                                                  
     0.52%     0.36%  libsqlite3.so.0.8.6           [.] sqlite3_initialize                                                                                                                                          
     0.52%     0.37%  libsqlite3.so.0.8.6           [.] strlen@plt                                                                                                                                                  
     0.50%     0.01%  [kernel]                      [k] __softirqentry_text_start                                                                                                                                   
+    0.50%     0.50%  libsqlite3.so.0.8.6           [.] 0x0000000000097161                                                                                                                                          
     0.48%     0.31%  libsqlite3.so.0.8.6           [.] sqlite3_malloc64                                                                                                                                            
     0.45%     0.45%  libsqlite3.so.0.8.6           [.] 0x0000000000018060                                                                                                                                          
Actions #2

Updated by laforge 7 days ago

A number of the messages showing up are mutex related. As osmo-msc is single-threaded, this is completely pointless. We should probably use sqlite3_config(SQLITE_CONFIG_SINGLETHREAD)

Actions #3

Updated by laforge 7 days ago

laforge wrote in #note-2:

A number of the messages showing up are mutex related. As osmo-msc is single-threaded, this is completely pointless. We should probably use sqlite3_config(SQLITE_CONFIG_SINGLETHREAD)

https://gerrit.osmocom.org/c/osmo-msc/+/28108 should implement just that.

Actions #4

Updated by laforge 7 days ago

using the following bpftrace, we can see where the sqlite3 calls are coming from:

uprobe:/usr/lib/x86_64-linux-gnu/libdbi.so.1.1.0:dbi_conn_queryf { 
        @queries[str(arg1)]=count() ;
}

@queries[INSERT INTO SMS (created, valid_until, reply_path_req, status_r]: 7
@queries[DELETE FROM SMS WHERE id = %llu AND sent is NOT NULL]: 9
@queries[UPDATE SMS SET sent = datetime('now') WHERE id = %llu]: 9
@queries[SELECT id,valid_until FROM SMS ORDER BY valid_until LIMIT 1]: 11
@queries[UPDATE SMS SET deliver_attempts = deliver_attempts + 1 WHERE id]: 12
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 19
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 2400

So it's the db_sms_get_next_unsent_rr_msisdn() function.

Actions #5

Updated by laforge 7 days ago

if we plot this every second, we see some very bursty behaviour:

========================
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 1
@queries[UPDATE SMS SET deliver_attempts = deliver_attempts + 1 WHERE id]: 1

========================
@queries[DELETE FROM SMS WHERE id = %llu AND sent is NOT NULL]: 1
@queries[UPDATE SMS SET sent = datetime('now') WHERE id = %llu]: 1
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 1
@queries[SELECT id,valid_until FROM SMS ORDER BY valid_until LIMIT 1]: 2
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 148

========================
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 1
@queries[SELECT id,valid_until FROM SMS ORDER BY valid_until LIMIT 1]: 2

========================
@queries[INSERT INTO SMS (created, valid_until, reply_path_req, status_r]: 2

========================
@queries[UPDATE SMS SET deliver_attempts = deliver_attempts + 1 WHERE id]: 4
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 174

========================
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 3

========================

========================
@queries[DELETE FROM SMS WHERE id = %llu AND sent is NOT NULL]: 1
@queries[UPDATE SMS SET sent = datetime('now') WHERE id = %llu]: 1
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 1
@queries[INSERT INTO SMS (created, valid_until, reply_path_req, status_r]: 1
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 6

========================
@queries[SELECT id,valid_until FROM SMS ORDER BY valid_until LIMIT 1]: 2
@queries[UPDATE SMS SET deliver_attempts = deliver_attempts + 1 WHERE id]: 2
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 410

========================

========================
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 5

========================
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 2

========================
@queries[INSERT INTO SMS (created, valid_until, reply_path_req, status_r]: 69

========================
@queries[DELETE FROM SMS WHERE id = %llu AND sent is NOT NULL]: 2
@queries[SELECT id,valid_until FROM SMS ORDER BY valid_until LIMIT 1]: 2
@queries[UPDATE SMS SET sent = datetime('now') WHERE id = %llu]: 3
@queries[UPDATE SMS SET deliver_attempts = deliver_attempts + 1 WHERE id]: 3
@queries[INSERT INTO SMS (created, valid_until, reply_path_req, status_r]: 45
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 235
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 343

========================
@queries[UPDATE SMS SET deliver_attempts = deliver_attempts + 1 WHERE id]: 2
@queries[SELECT id,valid_until FROM SMS ORDER BY valid_until LIMIT 1]: 4
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 55

========================
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 1
@queries[UPDATE SMS SET deliver_attempts = deliver_attempts + 1 WHERE id]: 2
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 150

========================

========================
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 3

========================
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 2

========================
@queries[SELECT id,valid_until FROM SMS ORDER BY valid_until LIMIT 1]: 1
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 3
@queries[INSERT INTO SMS (created, valid_until, reply_path_req, status_r]: 8
@queries[DELETE FROM SMS WHERE id = %llu AND sent is NOT NULL]: 149
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 175
@queries[UPDATE SMS SET sent = datetime('now') WHERE id = %llu]: 195

========================
@queries[UPDATE SMS SET sent = datetime('now') WHERE id = %llu]: 2
@queries[UPDATE SMS SET deliver_attempts = deliver_attempts + 1 WHERE id]: 3
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 4
@queries[DELETE FROM SMS WHERE id = %llu AND sent is NOT NULL]: 9
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 208

========================
@queries[DELETE FROM SMS WHERE id = %llu AND sent is NOT NULL]: 3
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr = %s AND del]: 4
@queries[UPDATE SMS SET sent = datetime('now') WHERE id = %llu]: 150
@queries[SELECT id,valid_until FROM SMS ORDER BY valid_until LIMIT 1]: 175
@queries[UPDATE SMS SET deliver_attempts = deliver_attempts + 1 WHERE id]: 196
@queries[SELECT * FROM SMS WHERE sent IS NULL AND dest_addr > %s AND del]: 383

So sometimes nothing and sometimes hundreds of SQL statement executions.

Needless to say, the fact that the statements are not prepared doesn't help either.

Actions #6

Updated by laforge 6 days ago

So while ...
  • we see tons of DB querys
  • none of them use prepared statements (how can libdbi as a database library not support the most essential bit of every DB application?)
  • we are using sqlite3 in multi-thread mode
    ... all means osmo-msc is using way more CPU than it should, it still doesn't mean that we are stalling the MSC with database work.

This is a long-term (8 hours?) bpftrace uprobe/uretprobe tracking the number of milli-seconds between calling dbi_conn_queryf and returning:

@dbi_query: 
[0]                37008 |@                                                   |
[1]              1640233 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)           1245771 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[4, 8)             21406 |                                                    |
[8, 16)              325 |                                                    |
[16, 32)              71 |                                                    |
[32, 64)              17 |                                                    |

So the longest duration DB query was in the order of 32..63 ms. Not good, but not a problem either with all the MSC (MM, CC, SMS, BSSAP, SCCP, ...) time-outs being in the multiple-second range.

Actions #7

Updated by keith 5 days ago

  • Related to Bug #5563: OsmoMSC sometimes stalls for dozens of seconds in a production deployment added
Actions #8

Updated by laforge 3 days ago

  • Category set to SMS
  • Status changed from New to In Progress
  • % Done changed from 0 to 20
After many days of looking into #5563 I think this ticket is basically the logical concequence of #5563. The osmo-msc code so far
  • did not prepare (compile) sql statements, but compiled them on every execution
  • did not disable multi-threading support in libsqltie3 (causing lots of bogus CPU cycles regarding mutexes showing up in the profiles)
  • executed (and compiled!) up to 100 SQL statements in one select/poll loop iteration

So if the MSC is under a higher load (by using a coverage/capacity BTSs) this increases the CPU load significantly.

Given that we now have disabled multi-theading support in libsqlite3 (already in master) and are switching to libsqlite3 with prepared statements (laforge/sqlite3 branch), I would think this issue should be very much improved.

Actions #9

Updated by laforge 2 days ago

A perf top -p of current osmo-msc (build is >= 24h old, with direct libsqlite3 support) on the huautla-bsc still looks like this:

  30.80%  libsqlite3.so.0.8.6           [.] sqlite3VdbeExec                                                                                                    
  19.56%  libsqlite3.so.0.8.6           [.] sqlite3VdbeMemAboutToChange                                                                                        
  13.73%  libsqlite3.so.0.8.6           [.] btreeParseCellPtr                                                                                                  
   5.48%  libsqlite3.so.0.8.6           [.] sqlite3VdbeCheckMemInvariants                                                                                      
   2.89%  libsqlite3.so.0.8.6           [.] sqlite3_mutex_held                                                                                                 
   2.58%  libsqlite3.so.0.8.6           [.] fetchPayload                                                                                                       
   2.10%  libsqlite3.so.0.8.6           [.] sqlite3BtreeNext                                                                                                   
   1.88%  libsqlite3.so.0.8.6           [.] cursorOwnsBtShared                                                                                                 
   1.85%  libsqlite3.so.0.8.6           [.] btreeParseCell                                                                                                     
   1.75%  libsqlite3.so.0.8.6           [.] sqlite3VdbeOneByteSerialTypeLen                                                                                    
   1.51%  libsqlite3.so.0.8.6           [.] cursorHoldsMutex                                                                                                   
   1.24%  libsqlite3.so.0.8.6           [.] getCellInfo                                                                                                        
   1.22%  libsqlite3.so.0.8.6           [.] sqlite3DbMallocSize                                                                                                
   1.02%  libsqlite3.so.0.8.6           [.] isLookaside                                                                                                        
   0.96%  libc-2.28.so                  [.] __malloc_usable_size                                                                                               
   0.72%  libsqlite3.so.0.8.6           [.] sqlite3BtreePayloadSize                                                                                            
   0.69%  libsqlite3.so.0.8.6           [.] sqlite3MemSize                                                                                                     
   0.67%  libsqlite3.so.0.8.6           [.] sqlite3VdbeCursorMoveto                                                                                            
   0.65%  libsqlite3.so.0.8.6           [.] sqlite3BtreeCursorIsValid                                                                                          
   0.52%  libsqlite3.so.0.8.6           [.] sqlite3BtreeCursorHasMoved                                                                                         
   0.35%  libsqlite3.so.0.8.6           [.] sqlite3BtreePayloadFetch                                                                                           
   0.33%  libc-2.28.so                  [.] __memmove_avx_unaligned_erms                                                                                       
   0.29%  libsqlite3.so.0.8.6           [.] sqlite3_mutex_held@plt                                                                                             
   0.27%  osmo-msc                      [.] _vlr_subscr_find_by_msisdn                                                                                         
   0.25%  libsqlite3.so.0.8.6           [.] sqlite3PcacheFetchFinish                                                                                           
   0.25%  libsqlite3.so.0.8.6           [.] pcache1FetchNoMutex                                                                                                
   0.22%  libsqlite3.so.0.8.6           [.] pcache1PinPage                                                                                                     
   0.21%  libsqlite3.so.0.8.6           [.] malloc_usable_size@plt   

So basically allmost all of the CPU cycles of osmo-msc is spent inside libsqlite3 :(

Actions #10

Updated by laforge 2 days ago

sqlite> select count(*) from sms;
9431
sqlite> select count(*) from sms where sent is null;
247
almost all (9 out of 10) of the core files file during long queries show that we are in the following query:
  • SELECT id,strftime('%s',created),sent,deliver_attempts,strftime('%s', valid_until),reply_path_req,status_rep_req,is_report,msg_ref,protocol_id,data_coding_scheme,ud_hdr_ind,src_addr,src_ton,src_npi,dest_addr,dest_ton,dest_npi,user_data,header,text FROM SMS WHERE sent IS NULL AND dest_addr > $dest_addr AND deliver_attempts <= $attempts ORDER BY dest_addr, id LIMIT 1
Actions #11

Updated by laforge 2 days ago

Some time later with perf -g -p:

  Children      Self  Shared Object          Symbol                                                       
+   99.52%     0.00%  osmo-msc               [.] smsq_take_next_sms
+   98.88%     0.00%  libsqlite3.so.0.8.6    [.] sqlite3_step
+   98.71%     0.01%  libsqlite3.so.0.8.6    [.] sqlite3Step
+   97.48%    31.95%  libsqlite3.so.0.8.6    [.] sqlite3VdbeExec
+   21.74%    21.66%  libsqlite3.so.0.8.6    [.] sqlite3VdbeMemAboutToChange
+   17.79%     0.79%  libsqlite3.so.0.8.6    [.] sqlite3BtreePayloadSize
+   15.90%     1.42%  libsqlite3.so.0.8.6    [.] getCellInfo
+   14.36%     1.61%  libsqlite3.so.0.8.6    [.] btreeParseCell
+   12.57%    12.01%  libsqlite3.so.0.8.6    [.] btreeParseCellPtr
+   11.87%     0.00%  [unknown]              [.] 0x41fd89415541f689
+   11.87%     0.00%  libc-2.28.so           [.] __libc_start_main
+   11.87%     0.00%  osmo-msc               [.] main
+   11.87%     0.00%  libosmocore.so.18.0.0  [.] osmo_select_main_ctx
+   11.87%     0.00%  libosmocore.so.18.0.0  [.] _osmo_select_main
+   11.87%     0.00%  [uprobes]              [.] 0x00000000f7ffd000
+    9.59%     5.26%  libsqlite3.so.0.8.6    [.] sqlite3VdbeCheckMemInvariants
+    6.66%     2.16%  libsqlite3.so.0.8.6    [.] sqlite3BtreeNext
+    5.18%     0.40%  libsqlite3.so.0.8.6    [.] sqlite3BtreePayloadFetch
+    4.70%     2.32%  libsqlite3.so.0.8.6    [.] fetchPayload
+    4.18%     1.26%  libsqlite3.so.0.8.6    [.] sqlite3DbMallocSize
+    4.15%     1.84%  libsqlite3.so.0.8.6    [.] cursorOwnsBtShared
+    3.61%     1.63%  libsqlite3.so.0.8.6    [.] cursorHoldsMutex
+    3.35%     2.99%  libsqlite3.so.0.8.6    [.] sqlite3_mutex_held
+    3.15%     0.17%  libsqlite3.so.0.8.6    [.] btreeNext
+    2.13%     1.88%  libsqlite3.so.0.8.6    [.] sqlite3VdbeOneByteSerialTypeLen
+    1.84%     0.11%  libsqlite3.so.0.8.6    [.] moveToLeftmost
+    1.60%     0.11%  libsqlite3.so.0.8.6    [.] moveToChild
+    1.38%     0.16%  libsqlite3.so.0.8.6    [.] getAndInitPage
+    1.30%     0.69%  libsqlite3.so.0.8.6    [.] sqlite3VdbeCursorMoveto
+    1.23%     0.03%  libsqlite3.so.0.8.6    [.] sqlite3PagerGet
+    1.20%     0.19%  libsqlite3.so.0.8.6    [.] getPageNormal
+    1.19%     0.93%  libc-2.28.so           [.] __malloc_usable_size
+    1.14%     0.99%  libsqlite3.so.0.8.6    [.] isLookaside
     0.93%     0.01%  osmo-msc               [.] db_sms_get_next_unsent_rr_msisdn
+    0.80%     0.75%  libsqlite3.so.0.8.6    [.] sqlite3MemSize
+    0.73%     0.71%  libsqlite3.so.0.8.6    [.] sqlite3BtreeCursorIsValid
     0.72%     0.15%  libsqlite3.so.0.8.6    [.] moveToParent
+    0.63%     0.08%  libsqlite3.so.0.8.6    [.] out2Prerelease
+    0.62%     0.53%  libsqlite3.so.0.8.6    [.] sqlite3BtreeCursorHasMoved
+    0.60%     0.10%  libsqlite3.so.0.8.6    [.] sqlite3PcacheFetch
+    0.57%     0.02%  libsqlite3.so.0.8.6    [.] sqlite3VdbeIdxKeyCompare
+    0.57%     0.35%  libsqlite3.so.0.8.6    [.] sqlite3_mutex_held@plt
     0.51%     0.05%  libsqlite3.so.0.8.6    [.] pcache1Fetch
     0.47%     0.08%  libsqlite3.so.0.8.6    [.] releasePageNotNull
     0.45%     0.18%  libsqlite3.so.0.8.6    [.] pcache1FetchNoMutex

So we can see > 98% of all CPU cycles spent inside osmo-msc are inside libsqlite3. WTF.

Actions #12

Updated by laforge 2 days ago

SMS queue (0):
 Attempted MT SMS deliveries to subscriber:    12684 (2/s 50/m 2513/h 12225/d)
 Successful MT SMS delivery to subscriber:     9092 (1/s 24/m 1708/h 8769/d)
 Erroneous MT SMS delivery:        0 (0/s 0/m 0/h 0/d)
 Failed MT SMS delivery due to no memory on MS:        0 (0/s 0/m 0/h 0/d)
 Failed MT SMS delivery due to paging timeout (MS gone?):     2909 (1/s 15/m 717/h 2798/d)
visitor location register (0):
 Received GSUP messages for unknown IMSI:        0 (0/s 0/m 0/h 0/d)
 Received GSUP purge for unknown subscriber:        0 (0/s 0/m 0/h 0/d)
 Received GSUP authentication tuples:        0 (0/s 0/m 0/h 0/d)
 Received GSUP Update Location Result messages:    10750 (0/s 40/m 986/h 10484/d)
 Received GSUP Update Location Error messages:       46 (0/s 0/m 9/h 45/d)
 Received GSUP Send Auth Info Result messages:        0 (0/s 0/m 0/h 0/d)
 Received GSUP Send Auth Info Error messages:        0 (0/s 0/m 0/h 0/d)
 Received GSUP Insert Subscriber Data Request messages:    10750 (0/s 40/m 986/h 10484/d)
 Received GSUP Cancel Subscriber messages:        0 (0/s 0/m 0/h 0/d)
 Received GSUP Check IMEI Result messages:    10804 (0/s 40/m 997/h 10537/d)
 Received GSUP Check IMEI Error messages:        0 (0/s 0/m 0/h 0/d)
 Received GSUP Purge MS Result messages:        0 (0/s 0/m 0/h 0/d)
 Received GSUP Purge MS Error messages:        0 (0/s 0/m 0/h 0/d)
 Received GSUP Delete Subscriber Data Request messages:        0 (0/s 0/m 0/h 0/d)
 Received GSUP message of unknown type:        0 (0/s 0/m 0/h 0/d)
 Transmitted GSUP Update Location Request messages:    10796 (0/s 40/m 995/h 10529/d)
 Transmitted GSUP Insert Subscriber Data Result messages:    10750 (0/s 40/m 986/h 10484/d)
 Transmitted GSUP Send Auth Info Request messages:        0 (0/s 0/m 0/h 0/d)
 Transmitted GSUP Purge MS Request messages:        0 (0/s 0/m 0/h 0/d)
 Transmitted GSUP Check IMEI Request messages:    10804 (0/s 40/m 997/h 10537/d)
 Transmitted GSUP Auth Fail Report messages:        0 (0/s 0/m 0/h 0/d)
 Transmitted GSUP Cancel Result messages:        0 (0/s 0/m 0/h 0/d)
 VLR Subscriber Detach by IMSI DETACH REQ:       98 (0/s 4/m 15/h 89/d)
 VLR Subscriber Detach by GSUP CANCEL REQ:        0 (0/s 0/m 0/h 0/d)
 VLR Subscriber Detach by T3212 timeout:      327 (0/s 5/m 45/h 311/d)
mobile switching center (0):
 Received Location Update (IMSI Attach) requests.:      173 (0/s 1/m 25/h 167/d)
 Received Location Update (LAC change) requests.:     1423 (0/s 5/m 152/h 1379/d)
 Received (periodic) Location Update requests.:     9329 (0/s 36/m 847/h 9102/d)
 Received IMSI Detach indications.:       99 (0/s 4/m 15/h 90/d)
 Rejected Location Update requests.:      203 (0/s 0/m 42/h 194/d)
 Successful Location Update procedures.:    10607 (0/s 40/m 963/h 10341/d)
 Rejected CM Service Requests.:      168 (0/s 1/m 1/h 167/d)
 Accepted CM Service Requests.:     7268 (1/s 49/m 1456/h 6913/d)
 Rejected Paging Responses.:       73 (0/s 0/m 0/h 73/d)
 Accepted Paging Responses.:     8022 (1/s 29/m 1540/h 7700/d)
 Rejected CM Re-Establishing Requests.:        0 (0/s 0/m 0/h 0/d)
 Accepted CM Re-Establishing Requests.:        0 (0/s 0/m 0/h 0/d)
 Total MO SMS received from the MS.:     6268 (1/s 28/m 1244/h 5998/d)
 Failed MO SMS delivery attempts (no receiver found).:        0 (0/s 0/m 0/h 0/d)
 Total MT SMS delivery attempts.:    12684 (2/s 50/m 2513/h 12225/d)
 Failed MT SMS delivery attempts (no memory).:       19 (0/s 0/m 0/h 19/d)
 Failed MT SMS delivery attempts (other reason).:        4 (0/s 0/m 0/h 4/d)
 Failed MO SMS delivery attempts (other reason).:        0 (0/s 0/m 0/h 0/d)
 Received MO SETUP messages (MO call establishment).:     1308 (0/s 9/m 277/h 1225/d)
 Received MO CONNECT messages (MO call establishment).:      610 (0/s 5/m 129/h 573/d)
 Sent MT SETUP messages (MT call establishment).:      752 (0/s 4/m 129/h 714/d)
 Sent MT CONNECT messages (MT call establishment).:      560 (0/s 3/m 94/h 536/d)
 Calls that ever reached the active state.:     1169 (0/s 8/m 223/h 1108/d)
 Calls terminated by DISCONNECT message after reaching the active state.:     1042 (0/s 13/m 199/h 979/d)
 Calls terminated by any other reason after reaching the active state.:      126 (0/s 0/m 27/h 123/d)
 Received MS-initiated call independent SS/USSD requests.:        3 (0/s 0/m 0/h 3/d)
 Established MS-initiated call independent SS/USSD sessions.:        3 (0/s 0/m 0/h 3/d)
 Received network-initiated call independent SS/USSD requests.:        0 (0/s 0/m 0/h 0/d)
 Established network-initiated call independent SS/USSD sessions.:        0 (0/s 0/m 0/h 0/d)
 Number of CIPHER MODE REJECT messages processed by BSSMAP layer:        0 (0/s 0/m 0/h 0/d)
 Number of CIPHER MODE COMPLETE messages processed by BSSMAP layer:        0 (0/s 0/m 0/h 0/d)
SMS queue (0):
 Number of SMSs in the in-RAM pending delivery queue:        8 
visitor location register (0):
 Number of subscribers present in VLR:      382 
 Number of PDP records present in VLR:        0 
network statistics (0):
 Currently active calls :        1 
 Currently active SS/USSD sessions:        0 
Actions #13

Updated by laforge 2 days ago

I think what's obvious is that there is quite a bit of SMS activity:

 Total MT SMS delivery attempts.:    12684 (2/s 50/m 2513/h 12225/d)
 Total MO SMS received from the MS.:     6268 (1/s 28/m 1244/h 5998/d)

and some figures for the SMS queue:

 Attempted MT SMS deliveries to subscriber:    12684 (2/s 50/m 2513/h 12225/d)
 Successful MT SMS delivery to subscriber:     9092 (1/s 24/m 1708/h 8769/d)
 Erroneous MT SMS delivery:        0 (0/s 0/m 0/h 0/d)
 Failed MT SMS delivery due to no memory on MS:        0 (0/s 0/m 0/h 0/d)
 Failed MT SMS delivery due to paging timeout (MS gone?):     2909 (1/s 15/m 717/h 2798/d)

While LU load is relatively low (we trigger some SMS stuff on LU):

 Received Location Update (IMSI Attach) requests.:      173 (0/s 1/m 25/h 167/d)
 Received Location Update (LAC change) requests.:     1423 (0/s 5/m 152/h 1379/d)
 Received (periodic) Location Update requests.:     9329 (0/s 36/m 847/h 9102/d)

Actions #14

Updated by laforge 2 days ago

  • Assignee changed from keith to laforge

While we can solve #5563 by moving the sms_queue to a separate thread, it of course still sucks if that thread then consumes significant amounts of cpu for dealing with very few SMS messages.

I therefore think a more radical approach is needed: The complete removal fo sqlite3 from the code. I think there really is no need for the kind of networks we are seeing osmo-msc used in. Even if we estimate about 1kByte of RAM used for each SMS state, we can easily keep the entire SMS queue in RAM. At 10k SMS that would still only be 10MB of RAM usage. That's fine even for todays low-cost embedded platforms. Having the SMS in RAM means we can keep them in an ordered list and simply iterate that list every time we want to transmit one message.

The only other problem we need to solve is persistency. For this I suggest a rather simple scheme: Every SMS gets stored to a separate file on disk (a bit like a mail spool). The storing is done by a separate thread. All operations by that thread are asynchronous, so the main thread never needs to wait for some result:

New proposal for in-RAM SMSC with persistent, non-blocking storage

This really is rether simple. There's no need for any sophisticated thread-safety in the codebase except the very lean interface between the threads with only two operations:
  • store a SMS on disk
  • remove a SMS from disk

SMS received by SMPP, RAN or SGs

  • main thread:
    • allocate SMS memory
    • insert SMS into linked list
    • notify storage thread about this new SMS requiring storage to disk (enqueue of related inter-thread notification to inter-thread queue)
  • storage thread:
    • write that SMS to disk (use SMS ID as file name)
    • release memory of notification (either hand back to main thread or use some mutex-guarded talloc wrapper)

SMS delivery attempt

  • main thread:
    • pick SMS from list using whatever policy
    • check if validity period is exceeded
      • if yes, notify storeage thread that this SMS can be deleted (same as delivery success below)
      • if no, trigger paging/delivery via gsm411_send_sms()

SMS delivery attempt success

  • main thread:
    • notify storage thread that this SMS can be removed (via inter-thread queue)
  • storage thread:
    • unlink the file
    • release memory of notification (either hand back to main thread or use some mutex-guarded talloc wrapper)

SMS delivery attempt failure

  • main thread:
    • increment delivery_attempts for this SMS
    • do not notify the storage thread or update the number of attempted deliveries on disk, it is not worth it. If we later terminate, we will re-start the number of delivery attempt counter; nothing lost here, just causes a few extra pagings
    • check if delivery_attempts now exceeds maximum specified delivery attempts
      • if true, notify storage thread that message can be deleted (same as after delivery)

osmo-msc start

  • storage thread:
    • cycle through SMS files in filesystem
    • read each of them, and send them to main thread via inter-thread queue
      • attention: memory allocation. We somehow need to make sure a talloc context from the main thread owns all the associated memory, so we don't run into both threads operating on a single talloc context.
  • main thread:
    • pull SMSs from the inter-thread queue, put them in the global list of pending SMS
Actions #15

Updated by laforge 2 days ago

  • Priority changed from Normal to Urgent
Actions #16

Updated by laforge 1 day ago

Some more thoughts in addition to the above "new" concept:

  • if external applications are deleting SMS from storage (rhizomatica has a need for this), we can use a inotify/fanotify mechanism to get notified of such deletions, and remove our RAM copy of the SMS automatically at the same time
  • for those subscribers known to the VLR, we could actually move the in-RAM SMS from the global list to a per-subscriber list. This would
    • make the global list much shorter to iterate
    • allow us to quickly delivery all pending messages for the subscriber once he is online without iterating the global list
Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)