GX Reboot due to mDNS storm

How can I determine, after a reboot, what caused my GX to restart due to high load? The following error is in my messages file:

shutting down the system because of error 253 = ‘load average too high’

Look in /tmp (assuming you have the device rooted) for a file called /tmp/last_boot_type.orig. Print that out using:

cat /tmp/last_boot_type.orig

The number is zero for a normal boot; a non-zero code records the reason for the last restart.

If the load average is over about 6 or so, the watchdog will reboot it. This should be significantly better in Venus 2.80 and later. In older versions, especially on the CCGX with the slower CPU, a daily reboot is almost normal.
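To see how close a box is to that threshold, here is a quick sketch that reads the 1-minute load from /proc/loadavg. The ~6 cutoff is just the rule of thumb from above, not a documented watchdog constant:

```shell
# Classify the 1-minute load average against the rough ~6 threshold
# mentioned above (a rule of thumb, not a documented watchdog value).
load_state() {
    awk -v l="$1" 'BEGIN { print ((l + 0 > 6) ? "high" : "ok") }'
}

# The first field of /proc/loadavg is the 1-minute load average.
load_state "$(cut -d' ' -f1 /proc/loadavg)"
```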

OK, it confirms the watchdog reboot with code 30253 for my Venus GX running 2.84. Looking at the process list, the following seem higher than the rest (I removed all the 0%–2% jobbies):

> Mem: 497248K used, 11928K free, 1832K shrd, 104K buff, 16976K cached
> CPU:  40% usr  47% sys   0% nic   0% idle   0% io   6% irq   4% sirq
> Load average: 13.06 7.40 4.77 7/300 26144
>   581     1 avahi    R     4852   1%   9% avahi-daemon: running [venus.local]
>    32     2 root     SW       0   0%   9% [kswapd0]
>   908   876 root     D<   88408  17%   8% /opt/victronenergy/gui/gui -nomouse -display VNC:size=480x272:depth=32:passwordFile=/data/conf/vncpassword.txt:0
>   824   815 root     D    37488   7%   7% {vrmlogger.py} /usr/bin/python3 -u /opt/victronenergy/vrmlogger/vrmlogger.py
>   823   811 root     S    49648  10%   6% {dbus-modbus-cli} /usr/bin/python3 -u /opt/victronenergy/dbus-modbus-client/dbus-modbus-client.py
>   864   845 root     D    21628   4%   6% {dbus_systemcalc} /usr/bin/python3 -u /opt/victronenergy/dbus-systemcalc-py/dbus_systemcalc.py
> 20057   841 root     R    38904   8%   4% {dbus_mqtt.py} /usr/bin/python3 -u /opt/victronenergy/dbus-mqtt/dbus_mqtt.py --init-broker
>   952   921 simple-u D    12844   3%   4% /bin/simple-upnpd --xml /var/run/simple-upnpd.xml -d
>   915   898 root     D     8760   2%   4% /opt/victronenergy/hub4control/hub4control
> 26143 26138 root     R     2704   1%   4% top -b -n1
>   544   542 messageb S     227m  46%   3% dbus-daemon --system --nofork
>  1451  1428 root     D     3612   1%   3% /opt/victronenergy/mk2-dbus/mk2-dbus --log-before 25 --log-after 25 --banner -w -s /dev/ttyO5 -i -t mk3 --settings /
>   660     1 avahi-au S     1936   0%   3% avahi-autoipd: [ll-eth0] bound
>     7     2 root     RW       0   0%   3% [ksoftirqd/0]
> 25641     2 root     IW<      0   0%   3% [kworker/0:2H-kb]
>  1296     1 root     S    42012   8%   2% python ./dbus-pvoutput-p3.py
>   623     1 nobody   S    20844   4%   2% /usr/sbin/hiawatha
>   858   835 root     S    11192   2%   2% /opt/victronenergy/dbus-fronius/dbus-fronius
>  1560   934 root     S     9800   2%   2% /opt/victronenergy/dbus-modbustcp/dbus-modbustcp
> 10924 10922 root     S     8524   2%   2% /opt/victronenergy/dbus-cgwacs/dbus-cgwacs /dev/ttyUSB0
>  1595   940 root     S     3176   1%   2% /opt/victronenergy/can-bus-bms/can-bus-bms --log-before 25 --log-after 25 -vv -c socketcan:can1 --banner
>   599     1 root     S    22748   4%   1% php-fpm: master process (/etc/php-fpm.conf)
>   916   900 root     S    21636   4%   1% {dbus_digitalinp} /usr/bin/python3 -u /opt/victronenergy/dbus-digitalinputs/dbus_digitalinputs.py /dev/gpio/digital_
>  1482   872 root     S    19684   4%   1% {dbus_tempsensor} /usr/bin/python3 -u /opt/victronenergy/dbus-tempsensor-relay/dbus_tempsensor_relay.py
>   863   831 root     S    19560   4%   1% {dbus_vebus_to_p} /usr/bin/python3 -u /opt/victronenergy/dbus-vebus-to-pvinverter/dbus_vebus_to_pvinverter.py
>  1623   843 mosquitt S    12336   2%   1% /usr/sbin/mosquitto -c /etc/mosquitto/mosquitto.conf
> 10278 10276 root     S     3360   1%   1% /opt/victronenergy/vedirect-interface/vedirect-dbus -v --log-before 25 --log-after 25 -t 0 --banner -s /dev/ttyO2
> 24781     2 root     IW<      0   0%   1% [kworker/0:1H-kb]
>  1576   825 root     S    41272   8%   0% {mqtt-rpc.py} /usr/bin/python3 -u /opt/victronenergy/mqtt-rpc/mqtt-rpc.py
>   913   892 root     S    34304   7%   0% {venus-button-ha} /usr/bin/python3 -u /opt/victronenergy/venus-button-handler/venus-button-handler -D
>   854   817 root     S    25568   5%   0% {localsettings.p} /usr/bin/python3 -u /opt/victronenergy/localsettings/localsettings.py --path=/data/conf
>   861   829 root     S    24572   5%   0% {netmon} /usr/bin/python3 -u /opt/victronenergy/netmon/netmon
>   600   599 www-data S    22748   4%   0% php-fpm: pool www
>   601   599 www-data S    22748   4%   0% php-fpm: pool www
>   852   819 root     S    15472   3%   0% /opt/victronenergy/dbus-qwacs/dbus-qwacs

Now I can’t say that this is not self-inflicted: I upgraded to 2.84 and ran an updated pvoutput.org script at roughly the same time. But looking at the process list, that script isn’t causing significant load.

Any suggestions?

Avahi is using a crapload of CPU (it does multicast-DNS, i.e. the equivalent of Apple’s Bonjour). That’s not normal; it usually happens on very busy networks, especially where Apple hardware is involved. You could try stopping Avahi. The only loss would be that you cannot look up venus.local on your network.

Try stopping it:

/etc/init.d/avahi-daemon stop

That is very interesting. I have no Apple devices on the network but will put a sniffer on to see if there is mDNS traffic floating around.

I’ve stopped avahi-daemon in the meantime and will report back in a few days, to see whether the reboots have stopped or I’ve found a misbehaving mDNS source on my local network. It only happens maybe once a week or so.
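For anyone else reaching for a sniffer: a rough way to see who is shouting, assuming tcpdump is available and eth0 is the right interface (mDNS is UDP port 5353), is to tally packets per source address:

```shell
# Tally mDNS (UDP port 5353) packets per source IP to spot the noisiest hosts.
# tcpdump prints lines like:
#   12:00:00.000000 IP 192.168.1.10.5353 > 224.0.0.251.5353: ...
top_talkers() {
    awk '$2 == "IP" { split($3, a, "."); print a[1]"."a[2]"."a[3]"."a[4] }' |
        sort | uniq -c | sort -rn
}

# eth0 and the 1000-packet sample size are assumptions; adjust to taste.
if command -v tcpdump >/dev/null 2>&1; then
    tcpdump -i eth0 -nn -c 1000 udp port 5353 2>/dev/null | top_talkers | head
fi
```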

Thanks for the help.

Well, maybe I’m jumping to conclusions there. But I have seen this issue once (only once) before, and it was in an office block. When people came in in the mornings, the load average would soar as all their devices piled onto the network. At the time I saw a really high number of Apple devices looking for each other, and assumed that probably contributed to the problem. For that customer, the solution was to put his GX device on a subnet not used by employees.

I totally agree, user traffic and ‘management’ traffic should not share the same network. It’s just out of convenience that I mix them, as my VPN does not do split-tunnel, so I lose connectivity to my toys when I am connected to work.

Running a sniffer I quickly realized that Home Assistant and my UniFi router were in a death match to see which could flood the most mDNS packets. Disabling mDNS on the UniFi router calmed things down, so now I want to restart the avahi process and monitor from there.

Thanks for the nudge in the right direction.

3 posts were split to a new topic: Funny network problems

PS. Updated the topic to something more relevant.

So far so good. Three days down and no reboot yet but I am still monitoring the system.

@plonkster Is there a way to retrieve the uptime of the GX programmatically? Via Modbus or something? Maybe D-Bus under Mgmt perhaps? I am a lazy bum, and if I can automate the uptime display I don’t have to check it manually, and can perhaps even get Home Assistant to tell me my GX reloaded. All I need is the first field of ‘uptime’ below:

root@beaglebone:~# cat /proc/uptime
284746.61 158953.63

Then I can format the whole display from that single number, however large it gets:

root@beaglebone:~# awk '{print int($1/86400)" day(s), "int(($1/3600)%24)" Hours "int(($1%3600)/60) " Minutes ago."}' /proc/uptime
3 day(s), 7 Hours 12 Minutes ago.
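I’m not aware of an official register for uptime, but as a sketch: the first field of /proc/uptime resets to zero on reboot, so a poller only has to notice the value shrinking between samples. The state-file path here is my own choice:

```shell
# Detect a reboot by watching the first field of /proc/uptime: it only
# ever grows, so a sample lower than the previous one means a restart.
STATE=/tmp/last_uptime    # state-file location is an arbitrary choice

current_uptime() {
    # Integer seconds since boot.
    cut -d' ' -f1 /proc/uptime | cut -d. -f1
}

check_reboot() {
    # $1 = previously recorded uptime (s), $2 = current uptime (s)
    if [ "$2" -lt "$1" ]; then
        echo "reboot detected"
    else
        echo "still up"
    fi
}

prev=$(cat "$STATE" 2>/dev/null || echo 0)
now=$(current_uptime)
check_reboot "$prev" "$now"
echo "$now" > "$STATE"
```

Run it from cron, or over SSH from another box, and a “reboot detected” line can tell Home Assistant (e.g. via a command-line sensor) that the GX restarted.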

I find those two services generate too many log messages in the system log, and mostly I see no reason for them to run if all network settings are correct and the internet connection is working without issues, so I disabled both:

update-rc.d avahi-autoipd disable
update-rc.d avahi-daemon disable
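To check which services are actually filling the log, a quick size ranking helps (assuming the logs live under /var/log on your image):

```shell
# Rank files under a log directory by size to spot the chattiest writers.
biggest_logs() {
    du -a "$1" 2>/dev/null | sort -rn | head
}

biggest_logs /var/log
```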

Yeah I’ve seen that before. A lot of broadcast/multicast traffic causes the load on the GX device to rise, and the watchdog reboots it. Indeed, disabling avahi merely means you can no longer access the device as venus.local; it continues to work normally otherwise.

Another service (I think it has been part of Venus OS since v2.80) is filling up the logs, and I do not think it is important for system operation:

update-rc.d rng-tools disable 

It is possible it has some performance impact; that should be tested more thoroughly. Disabling it is probably the cause of the faster boot time compared with v2.73.

That happens when you install something with opkg, and for some reason it then decides to also install rng-tools. It is not installed by default.

You can simply remove it (with opkg). Also, when installing packages with opkg, pass the --no-install-recommends option to stop it from installing rng-tools.

More specifically, for those reading here and wanting more of the picture: ipkg is a slimmed-down version of the packaging system used by Debian, which is of course “deb” packages. It is so similar that you can unpack and inspect ipkg packages using Debian tools. In any case, Debian (and derivatives) have two concepts of dependencies: hard dependencies (without which the package cannot work) and soft dependencies (or recommendations).

Since about Venus 2.80 there is a dangling recommendation on rng-tools. I don’t know why, but it is there. The stock image doesn’t include that package, but some other package feels like it needs it to be happier. The moment you install anything else, opkg (which, again, is similar to apt on Debian) resolves that recommendation along with installing whatever you asked for.

It is not a big enough problem to spend hours on, since it works just fine in the base OS (which 99.9% of users never touch), and it only really affects tinkerers. So the solution is simply to always skip recommended “soft dependencies” when you install anything.

@plonkster, perfectly agree.
I used to have a good experience with OpenWrt, which uses the same packaging system as Venus OS.
My experience with embedded systems running on fast-ageing flash memory is to keep write operations to a minimum. That is why I am in the habit of inspecting the system logs, checking for services and applications that write more than needed, and removing or disabling anything that is not needed.
I am now working on getting log2ram to work as a service and moving all rolling logs to tmpfs. I expect problems with some of the system tools like dbus, but I hope to get it working.
There seems to be no interest at the moment. If I manage to get a working solution I will post a new thread here as well.
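As a sketch of the tmpfs side: an fstab entry along these lines mounts the log directory in RAM. The mount point and the 32 MB size are my assumptions; anything in tmpfs is lost on reboot, which is the trade-off for sparing the flash:

```
tmpfs  /var/log  tmpfs  defaults,noatime,size=32m  0  0
```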