Feature #5848

open

builders need ccache

Added by Hoernchen about 1 month ago. Updated 14 days ago.

Status: In Progress
Priority: Normal
Assignee:
Target version: -
Start date: 12/28/2022
Due date:
% Done: 90%
Spec Reference:

Description

This is not just a power consumption issue (we keep rebuilding objects that rarely change). It would also cut the build time on our weakest devices (rpi4) from 30 min to seconds, as far as osmo-trx is concerned.


Related issues

Related to Core testing infrastructure - Bug #5863: Building the jenkins image on rpi4 takes > 1h (Resolved, osmith, 01/18/2023)

Actions #1

Updated by Hoernchen about 1 month ago

  • Tracker changed from Bug to Feature
Actions #2

Updated by osmith 23 days ago

I've thought about how to make this work with untrusted code from gerrit and to have limited fallout from cache poisoning.

It would work if we use several separate ccache dirs, to ensure that:
  • ccache dirs are never shared between code from gerrit and code that has been merged
  • osmo-trx in particular builds with multiple CPU-related flags (--with-neon, --with-neon-vfpv4 etc.). Correction: separate ccache dirs are not needed for those, since ccache hashes the flags used too.
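
The directory separation described above could be sketched roughly as follows. This is only an illustration, not the actual osmo-ci implementation: the base path `/var/cache/ccache`, the function names and the two trust labels are hypothetical. `CCACHE_DIR` is the environment variable ccache reads to locate its cache directory.

```python
import os

# Hypothetical base directory; the real CI layout may differ.
CCACHE_BASE = "/var/cache/ccache"

# Hypothetical labels: "gerrit" for unreviewed patches, "master" for merged code.
TRUST_LEVELS = ("gerrit", "master")


def ccache_dir_for(trust_level, base=CCACHE_BASE):
    """Return a cache directory that is never shared across trust levels."""
    if trust_level not in TRUST_LEVELS:
        raise ValueError(f"unknown trust level: {trust_level!r}")
    return os.path.join(base, trust_level)


def build_env(trust_level, environ=None):
    """Return a copy of the environment with CCACHE_DIR set for this job."""
    env = dict(os.environ if environ is None else environ)
    env["CCACHE_DIR"] = ccache_dir_for(trust_level)
    return env
```

The returned environment would be passed to whatever starts the build (for example via `subprocess.run(..., env=build_env("gerrit"))`). No extra split per CPU flag is modelled, since ccache already includes the compiler flags in its hash.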

But then, as you mentioned, it should lead to significant speed up of builds. I like the idea.

Actions #3

Updated by osmith 23 days ago

Related: due to a bug actually only two of four build verification jobs were running for osmo-trx:
https://gerrit.osmocom.org/c/osmo-ci/+/30985

Actions #4

Updated by Hoernchen 22 days ago

Is it reasonable to assume that we will be fighting nation-state actors that are capable of forging blake2b hashes to poison our ccache with bad code that has to go through gerrit where no one will notice it? I don't think that is much of a concern unless you use the ancient md4 ccache.

Actions #5

Updated by osmith 21 days ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30
Actions #6

Updated by osmith 20 days ago

  • % Done changed from 30 to 50

I have a proof of concept ready. However the problem is that we have lots of calls considered "uncachable" by ccache, e.g. when building osmo-trx + deps for x86_64 it's about 500 of them. Looking into what's causing them.

Actions #7

Updated by osmith 16 days ago

  • % Done changed from 50 to 90

osmith wrote in #note-6:

I have a proof of concept ready. However the problem is that we have lots of calls considered "uncachable" by ccache, e.g. when building osmo-trx + deps for x86_64 it's about 500 of them. Looking into what's causing them.

These come from autotools, ltmain.sh. It runs $CC -V to figure out if it's a Sun C++ compiler. GCC doesn't support this flag:

gcc -V
gcc: error: unrecognized command-line option ‘-V’
gcc: fatal error: no input files
compilation terminated.

Not sure why this runs 500x during a build, but at least it's not a performance bottleneck as gcc exits instantly here.


Now with ccache enabled, the whole CI job for osmo-trx runs in 9 min, 15 seconds instead of 23 min. It still spends a lot of time in osmo-trx (not the dependencies) on the raspberry pis, apparently most of the time linking binaries.

For x86_64 --with-sse and without manuals, it's running in 1 min 55s instead of 4 min 52s.
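
For reference, the speedup factors implied by these timings can be worked out with a few lines of arithmetic. The helper names are made up for this sketch; the numbers are the ones quoted above.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


def speedup(before_s, after_s):
    """Return how many times faster the 'after' run is."""
    return before_s / after_s


# Whole CI job for osmo-trx: 23 min before, 9 min 15 s with ccache.
rpi = speedup(to_seconds(23, 0), to_seconds(9, 15))
# x86_64 --with-sse, without manuals: 4 min 52 s before, 1 min 55 s with ccache.
x86 = speedup(to_seconds(4, 52), to_seconds(1, 55))

print(f"rpi: {rpi:.2f}x, x86_64: {x86:.2f}x")
# prints: rpi: 2.49x, x86_64: 2.54x
```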

(I guess if we changed the rpi OS to aarch64, we could get another speed improvement.)

So all in all, it's not done in a few seconds, but a significant improvement nonetheless.

Patches:
Actions #8

Updated by osmith 16 days ago

  • Related to Bug #5863: Building the jenkins image on rpi4 takes > 1h added
Actions #9

Updated by Hoernchen 15 days ago

My pi4 with CPU freq fixed to max and 2 usable cores, building libusrp+osmo-trx with -j5 and all options, no ccache:

./buildall.sh  406.40s user 266.79s system 152% cpu 7:20.66 total

Why are our office pis so slow? Ultracheap slow sd card, maybe? Is my btrfs doing some magic?
$ uname -a
Linux raspberrypi 5.15.50-v8-osnoise-raspi #1 SMP PREEMPT Sun Oct 16 15:11:41 CEST 2022 aarch64 aarch64 aarch64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

same numbers for my host:
./buildall.sh  89,01s user 20,37s system 201% cpu 54,306 total
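
As a rough cross-check of the two timing lines above, the reported CPU percentage is (user + system) / wall time, and the wall times can be compared directly. This is just a sketch with a hypothetical helper name; truncating to an integer reproduces the percentages shown in the output.

```python
def cpu_percent(user_s, system_s, wall_s):
    """CPU utilisation in percent: CPU seconds spent per wall-clock second."""
    return int(100 * (user_s + system_s) / wall_s)


# Values copied from the two timing lines (decimal commas written as dots).
pi_wall = 7 * 60 + 20.66   # 7:20.66 total
host_wall = 54.306

pi_cpu = cpu_percent(406.40, 266.79, pi_wall)     # 152
host_cpu = cpu_percent(89.01, 20.37, host_wall)   # 201

# How much longer the pi4 build takes in wall-clock time (about 8.1x).
ratio = pi_wall / host_wall
print(pi_cpu, host_cpu, round(ratio, 1))
```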

Actions #10

Updated by laforge 15 days ago

On Mon, Jan 23, 2023 at 04:55:34PM +0000, Hoernchen wrote:

Why are our office pis so slow? Ultracheap slow sd card, maybe? Is my btrfs doing some magic?

possibly completely worn out by now in terms of flash writes? I want to mount them
differently mechanically (using bgt-rpi-mount) soon-ish, maybe we should swap SD cards
at that time.

Also, this is yet another reason why building on our LX2 should be considered: It
has an NVMe SSD (faster, likely more TBW permitted than any SD card), and it has 64GB RAM,
so tmpfs-only builds are also an option.

Actions #11

Updated by osmith 15 days ago

Hoernchen wrote in #note-9:

Why are our office pis so slow? Ultracheap slow sd card, maybe? Is my btrfs doing some magic?

differences in your setup that probably make it faster:

  • aarch64
  • clang instead of gcc
  • different SD card

laforge wrote in #note-10:

On Mon, Jan 23, 2023 at 04:55:34PM +0000, Hoernchen wrote:

Why are our office pis so slow? Ultracheap slow sd card, maybe? Is my btrfs doing some magic?

possibly completely worn out by now in terms of flash writes? I want to mount them
differently mechanically (using bgt-rpi-mount) soon-ish, maybe we should swap SD cards
at that time.

Also, this is yet another reason why building on our LX2 should be considered: It
has an NVMe SSD (faster, likely more TBW permitted than any SD card), and it has 64GB RAM,
so tmpfs-only builds are also an option.

I didn't realize moving the arm jenkins builds over to lx2 was an option! That would make it much faster of course. Do you want me to set up an lxc on the lx2 and set it up as jenkins node?

Actions #12

Updated by laforge 15 days ago

On Tue, Jan 24, 2023 at 08:19:30AM +0000, osmith wrote:

I didn't realize moving the arm jenkins builds over to lx2 was an option!

I believe I mentioned it several times during weekly review, when I added the grafana
setup showing how little utilization this unit has (check for yourself).

Do you want me to set up an lxc on the lx2 and set it up as jenkins node?

It would at least make sense to test that and see if it solves our performance troubles.

Actions #13

Updated by osmith 15 days ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

laforge wrote in #note-12:

On Tue, Jan 24, 2023 at 08:19:30AM +0000, osmith wrote:

I didn't realize moving the arm jenkins builds over to lx2 was an option!

I believe I mentioned it several times during weekly review, when I added the grafana
setup showing how little utilization this unit has (check for yourself).

It's still the case.

Do you want me to set up an lxc on the lx2 and set it up as jenkins node?

It would at least make sense to test that and see if it solves our performance troubles.

Opened #5873 to follow up.

Ccache related patches are merged, marking this issue as resolved.

Also, I've changed the number of executors from 2 to 1 for each rpi, since running two builds in parallel made the individual builds slower and sometimes made them abort (probably because something like linking consumes a lot of RAM and was running twice at the same time).

Actions #14

Updated by osmith 14 days ago

  • Status changed from Resolved to In Progress
  • % Done changed from 100 to 90

Fixups for not running ccache on e.g. simtester: https://gerrit.osmocom.org/q/topic:ccache-fixup
