GX Reboot due to mDNS storm

How can I determine, after a reboot, what caused my GX to restart due to load? The following error is in my messages file:

shutting down the system because of error 253 = 'load average too high'

Assuming you have the device rooted, look for the file /tmp/last_boot_type.orig and print it out:

cat /tmp/last_boot_type.orig

The number is zero for a normal boot; any other value indicates the reason for the reboot.
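
As a small convenience, the check can be wrapped in a couple of shell lines. This is just a minimal sketch; it relies only on the file path above and on 0 meaning a normal boot:

code=$(cat /tmp/last_boot_type.orig)
[ "$code" = "0" ] && echo "last boot: normal" || echo "last boot: abnormal, code $code"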

If the load average goes above about 6, the watchdog will reboot the device. This should be significantly better in Venus 2.80 and later; in older versions, especially on the CCGX with its slower CPU, a daily reboot is almost normal.
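
You can read the load the watchdog reacts to straight from the kernel; the numbers match the "Load average" line that top prints:

cat /proc/loadavg    # e.g. "13.06 7.40 4.77 7/300 26144": 1-, 5- and 15-minute load, runnable/total tasks, last PID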

OK, that confirms a watchdog reboot, code 30253, on my Venus GX running 2.84. Looking at the process list, the following are higher than the rest (I removed everything sitting at 0% to 2%):

> 
> Mem: 497248K used, 11928K free, 1832K shrd, 104K buff, 16976K cached
> CPU:  40% usr  47% sys   0% nic   0% idle   0% io   6% irq   4% sirq
> Load average: 13.06 7.40 4.77 7/300 26144
>   PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
>   581     1 avahi    R     4852   1%   9% avahi-daemon: running [venus.local]
>    32     2 root     SW       0   0%   9% [kswapd0]
>   908   876 root     D<   88408  17%   8% /opt/victronenergy/gui/gui -nomouse -display VNC:size=480x272:depth=32:passwordFile=/data/conf/vncpassword.txt:0
>   824   815 root     D    37488   7%   7% {vrmlogger.py} /usr/bin/python3 -u /opt/victronenergy/vrmlogger/vrmlogger.py
>   823   811 root     S    49648  10%   6% {dbus-modbus-cli} /usr/bin/python3 -u /opt/victronenergy/dbus-modbus-client/dbus-modbus-client.py
>   864   845 root     D    21628   4%   6% {dbus_systemcalc} /usr/bin/python3 -u /opt/victronenergy/dbus-systemcalc-py/dbus_systemcalc.py
> 20057   841 root     R    38904   8%   4% {dbus_mqtt.py} /usr/bin/python3 -u /opt/victronenergy/dbus-mqtt/dbus_mqtt.py --init-broker
>   952   921 simple-u D    12844   3%   4% /bin/simple-upnpd --xml /var/run/simple-upnpd.xml -d
>   915   898 root     D     8760   2%   4% /opt/victronenergy/hub4control/hub4control
> 26143 26138 root     R     2704   1%   4% top -b -n1
>   544   542 messageb S     227m  46%   3% dbus-daemon --system --nofork
>  1451  1428 root     D     3612   1%   3% /opt/victronenergy/mk2-dbus/mk2-dbus --log-before 25 --log-after 25 --banner -w -s /dev/ttyO5 -i -t mk3 --settings /
>   660     1 avahi-au S     1936   0%   3% avahi-autoipd: [ll-eth0] bound 169.254.9.102
>     7     2 root     RW       0   0%   3% [ksoftirqd/0]
> 25641     2 root     IW<      0   0%   3% [kworker/0:2H-kb]
>  1296     1 root     S    42012   8%   2% python ./dbus-pvoutput-p3.py
>   623     1 nobody   S    20844   4%   2% /usr/sbin/hiawatha
>   858   835 root     S    11192   2%   2% /opt/victronenergy/dbus-fronius/dbus-fronius
>  1560   934 root     S     9800   2%   2% /opt/victronenergy/dbus-modbustcp/dbus-modbustcp
> 10924 10922 root     S     8524   2%   2% /opt/victronenergy/dbus-cgwacs/dbus-cgwacs /dev/ttyUSB0
>  1595   940 root     S     3176   1%   2% /opt/victronenergy/can-bus-bms/can-bus-bms --log-before 25 --log-after 25 -vv -c socketcan:can1 --banner
>   599     1 root     S    22748   4%   1% php-fpm: master process (/etc/php-fpm.conf)
>   916   900 root     S    21636   4%   1% {dbus_digitalinp} /usr/bin/python3 -u /opt/victronenergy/dbus-digitalinputs/dbus_digitalinputs.py /dev/gpio/digital_
>  1482   872 root     S    19684   4%   1% {dbus_tempsensor} /usr/bin/python3 -u /opt/victronenergy/dbus-tempsensor-relay/dbus_tempsensor_relay.py
>   863   831 root     S    19560   4%   1% {dbus_vebus_to_p} /usr/bin/python3 -u /opt/victronenergy/dbus-vebus-to-pvinverter/dbus_vebus_to_pvinverter.py
>  1623   843 mosquitt S    12336   2%   1% /usr/sbin/mosquitto -c /etc/mosquitto/mosquitto.conf
> 10278 10276 root     S     3360   1%   1% /opt/victronenergy/vedirect-interface/vedirect-dbus -v --log-before 25 --log-after 25 -t 0 --banner -s /dev/ttyO2
> 24781     2 root     IW<      0   0%   1% [kworker/0:1H-kb]
>  1576   825 root     S    41272   8%   0% {mqtt-rpc.py} /usr/bin/python3 -u /opt/victronenergy/mqtt-rpc/mqtt-rpc.py
>   913   892 root     S    34304   7%   0% {venus-button-ha} /usr/bin/python3 -u /opt/victronenergy/venus-button-handler/venus-button-handler -D
>   854   817 root     S    25568   5%   0% {localsettings.p} /usr/bin/python3 -u /opt/victronenergy/localsettings/localsettings.py --path=/data/conf
>   861   829 root     S    24572   5%   0% {netmon} /usr/bin/python3 -u /opt/victronenergy/netmon/netmon
>   600   599 www-data S    22748   4%   0% php-fpm: pool www
>   601   599 www-data S    22748   4%   0% php-fpm: pool www
>   852   819 root     S    15472   3%   0% /opt/victronenergy/dbus-qwacs/dbus-qwacs

Now I can't claim this is not self-inflicted: I upgraded to 2.84 and started running an updated pvoutput.org script at around the same time. Looking at the process list, though, that script is not causing significant load.

Any suggestions?

Avahi is using a crapload of CPU. Avahi provides multicast DNS (mDNS), the equivalent of Apple's Bonjour, and that much load is not normal. It usually happens on very busy networks, especially where Apple hardware is involved. You could try stopping Avahi; the only loss is that you can no longer look up venus.local on your network.

Try stopping it:

/etc/init.d/avahi-daemon stop
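
To confirm the daemon is actually gone (note that stopping it via the init script does not survive a reboot, so the command would have to be repeated, or scripted at startup, if the experiment becomes permanent):

ps | grep [a]vahi-daemon    # no output means avahi-daemon is no longer running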

That is very interesting. I have no Apple devices on the network, but I will put a sniffer on to see whether there is any mDNS traffic floating around.
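
If tcpdump is available on whatever machine does the sniffing, watching UDP port 5353 (the mDNS port, multicast group 224.0.0.251) is enough to see the chatter; eth0 here is just an assumed interface name:

tcpdump -i eth0 -nn udp port 5353    # every line is a host querying or announcing via mDNS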

I've stopped avahi-daemon in the meantime and will report back in a few days on whether the reboots have stopped, or whether I've found a misbehaving mDNS source on my local network. The reboot only happens about once a week or so.

Thanks for the help.

Well, maybe I'm jumping to conclusions there. But I have seen this issue once (only once) before, in an office block. When people came in in the mornings, the load average would soar as all their devices piled onto the network. At the time I saw a really high number of Apple devices looking for each other, and assumed that this contributed to the problem. For that customer, the solution was to put his GX device on a subnet not used by employees.

I totally agree: user traffic and 'management' traffic should not share the same network. I only mix them out of convenience, because my VPN does not do split tunneling, so I lose connectivity to my toys when I am connected to work.

Running a sniffer, I quickly realized that Home Assistant and my Unifi router were in a death match to see who could flood the most mDNS packets. Disabling mDNS on the Unifi router calmed things down, so now I want to restart the avahi process and monitor from there.
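
For anyone repeating this, a rough way to rank the noisiest mDNS talkers is to capture a fixed number of packets and count them per source address. This is a sketch, assuming tcpdump and standard awk/sort/uniq are available on the capture machine:

tcpdump -i eth0 -nn -c 500 udp port 5353 2>/dev/null | awk '{print $3}' | sort | uniq -c | sort -rn | head    # highest counts are the worst offenders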

Thanks for the nudge in the right direction.



PS. Updated the topic to something more relevant.

So far so good. Three days in and no reboot yet, but I am still monitoring the system.

@plonkster Is there a way to retrieve the uptime of the GX programmatically? Via Modbus, or perhaps D-Bus under Mgmt? I am a lazy bum, and if I can automate the uptime display I don't have to check it manually, and can perhaps even get Home Assistant to tell me when my GX has rebooted. All I need is the first field of /proc/uptime below:

root@beaglebone:~# cat /proc/uptime
284746.61 158953.63
root@beaglebone:~#

Then I can do all of this from a single 'int' value, however large that gets:

root@beaglebone:~# awk '{print int($1/86400)" day(s), "int(($1/3600)%24)" Hours "int(($1%3600)/60) " Minutes ago."}' /proc/uptime
3 day(s), 7 Hours 12 Minutes ago.
root@beaglebone:~#
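
If a single integer is easier for something like Home Assistant to poll (for example over SSH), the first field can simply be truncated to whole seconds; this is just a sketch of one way to expose it:

awk '{print int($1)}' /proc/uptime    # uptime in whole seconds; if this number drops between two polls, the GX rebooted in between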