How to debug pixelblaze pro + expander pro board hanging once a day or so?

From time to time, a few times a day, the device locks up (animations stop working and the website also stops working).
Can a bad wifi connection cause some thread lockup and in turn cause animation lockup?
Does the device/code have any watchdog to restart wifi/connection and/or reboot?

As of right now, I had to write a web interface watchdog (it makes sure I get back the expected web page with full content), because the device still answers pings when it’s locked up. Here are the timestamps of the last lockups, to show it’s not on an exact interval:

Date: Fri, 27 Dec 2024 10:10:41 -0800
Date: Sat, 28 Dec 2024 11:20:42 -0800
Date: Sat, 28 Dec 2024 15:01:03 -0800
Date: Sat, 28 Dec 2024 23:00:57 -0800
Date: Sat, 28 Dec 2024 23:10:41 -0800
Date: Sun, 29 Dec 2024 04:01:09 -0800
Date: Sun, 29 Dec 2024 12:20:41 -0800
Date: Sun, 29 Dec 2024 19:10:41 -0800
Date: Sun, 29 Dec 2024 21:10:40 -0800
Date: Mon, 30 Dec 2024 04:10:47 -0800
Date: Mon, 30 Dec 2024 05:50:45 -0800
Date: Mon, 30 Dec 2024 07:11:52 -0800
Date: Mon, 30 Dec 2024 07:20:52 -0800
Date: Mon, 30 Dec 2024 08:31:29 -0800
Date: Mon, 30 Dec 2024 08:50:44 -0800
Date: Mon, 30 Dec 2024 09:50:43 -0800
Date: Mon, 30 Dec 2024 10:10:43 -0800
Date: Mon, 30 Dec 2024 10:20:53 -0800
Date: Mon, 30 Dec 2024 13:20:44 -0800
Date: Mon, 30 Dec 2024 16:40:40 -0800
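
The spacing between these lockups can be checked mechanically. This is a minimal sketch, assuming the `Date:` lines above are saved in time order to a file (the name `lockups.txt` is hypothetical) and that plain POSIX `sh` and `awk` are available. `lockup_gaps` is an illustrative helper, not part of any Pixelblaze tooling.

```sh
# Print the gap, in whole minutes, between consecutive "Date: ..." lines on stdin.
# Assumes RFC 2822 style dates such as: Date: Fri, 27 Dec 2024 10:10:41 -0800
lockup_gaps() {
  awk '
    BEGIN {
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
      for (i = 1; i <= n; i++) month[name[i]] = i
    }
    # Days since 1970-01-01 in the Gregorian calendar.
    function days(y, m, d) {
      if (m <= 2) { y--; m += 12 }
      return 365 * y + int(y / 4) - int(y / 100) + int(y / 400) + int((153 * (m - 3) + 2) / 5) + d - 719469
    }
    $1 == "Date:" {
      split($6, t, ":")
      # Convert the UTC offset (for example -0800) to seconds.
      sign = (substr($7, 1, 1) == "-") ? -1 : 1
      offset = sign * (substr($7, 2, 2) * 3600 + substr($7, 4, 2) * 60)
      now = days($5, month[$4], $3) * 86400 + t[1] * 3600 + t[2] * 60 + t[3] - offset
      if (seen) printf "%d\n", int((now - prev) / 60 + 0.5)
      prev = now
      seen = 1
    }
  '
}
```

Running `lockup_gaps < lockups.txt` prints one gap per line, starting with 1510 and 220 for the first three timestamps. The gaps come out as rough multiples of ten minutes because the cron check itself only runs every ten minutes, so a lockup is noticed up to ten minutes after it happens.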

For now I have code that catches the lockups and power cycles the outlet, but obviously this is not desirable.
Is this a known issue?
Is it indeed a wifi problem?
Is there any way to get a serial console/debug over the USB port to get more info?

This is my workaround:

/etc/cron.d/log_monitoring:*/10 * * * * root export LOCK=/var/lock/pixelblaze; sleep 3; lockfile-create -r 0 --use-pid $LOCK || exit; screen -d -m bash -c "/var/local/scr/alarm 5 links 'http://pixelblaze1/#wifiSettingsPanel' &>/var/tmp/pixelblaze.out"; grep -q 'Client MAC' /var/tmp/pixelblaze.out || ( echo "pixelblaze down :("; ls -l /var/tmp/pixelblaze.out; echo "vvvvvvvvvvvvvvvvvvvvvvvvv"; ansi2txt < /var/tmp/pixelblaze.out; echo "^^^^^^^^^^^^^^^^^^^^^^^^^"; cat /var/tmp/pixelblaze.out; echo "--------------------------------" ; campower cam8 ) | Mail -Es "pixelblaze unresponsive, power cycling"  Email; rm $LOCK.lock

Sometimes I get partial web page output; most of the time I get a connection on port 80, but no output.
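
The failure modes described here (full page, partial page, connection but no output) can be told apart by the check itself. This is a minimal sketch, assuming `curl` is installed and that the `Client MAC` marker from the cron entry above is present in the raw HTML that `curl` receives (the cron entry reads it through `links`, so this is an assumption). The function names and the `PB_URL` variable are hypothetical. The `#wifiSettingsPanel` fragment is never sent to the server, so fetching `/` requests the same page.

```sh
# Classify a saved copy of the Pixelblaze page.
# Prints one of: ok | partial | empty
classify_page() {
  if [ ! -s "$1" ]; then
    echo empty      # nothing received: refused, timed out, or zero bytes
  elif grep -q 'Client MAC' "$1"; then
    echo ok         # the marker used by the cron check is present
  else
    echo partial    # some bytes arrived, but not the expected content
  fi
}

# Fetch the page with a hard timeout, then classify what arrived.
# PB_URL defaults to the hostname used in the cron entry (an assumption).
check_pixelblaze() {
  url=${PB_URL:-http://pixelblaze1/}
  out=$(mktemp) || return 2
  curl -sS --compressed --max-time 10 -o "$out" "$url" 2>/dev/null
  state=$(classify_page "$out")
  rm -f "$out"
  echo "$state"
  [ "$state" = ok ]
}
```

Calling `check_pixelblaze || echo "pixelblaze down"` from cron would log which state was seen before any power cycle. A page that is cut off after the marker would still be reported as `ok`, so a marker near the end of the page is the safer choice.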

Hi @marcmerlin, this is pretty interesting. Is there anything different about your wifi network? Is the controller connected to a separate 2.4GHz network? Does that network share an SSID and password with a 5GHz network?

So, I’m guessing wifi because I have another ESP32, running my own code, that has issues with wifi disconnections. That one is kind of far out in my yard, so I can see that the signal is marginal for that device. On that one (not PB), when wifi disconnects, the wifi thread hangs my device and it needs a reboot (the Arduino code gets stuck in the ESP wifi driver and never returns).
Here, I did notice that the signal on that Pixelblaze was not great, but I put it on the other side of my window, and the AP is literally on the other side of that same window, indoors. Before that, the PB was under the roof just 1 meter away, and the stucco (stone) killed enough signal that the PB barely worked.

Is there any signal debugging or serial console logging possible on PB, either locally via USB port, or remotely via wifi? (although sending a wifi failure via wifi isn’t going to work).
Or do I need to solder on the RX/TX pins of the board to get serial console out of it?

Oh, I forgot: I have high-end Ubiquiti APs, and I made a 2.4GHz-only network with its own SSID to make things simpler and, ideally, more reliable. But the point is that wifi can and will fail or hang eventually, so is there some code in the PB’s firmware to deal with wifi issues?
Is wifi a separate background thread, or can wifi hang everything? The latter would explain why I’ve seen animations hang, which is obviously quite undesirable.

Are you using the websockets API by any chance?

Thanks for asking, I am not. Actually I’m not using the web API at all outside of my cron job that checks if the PB is hung or not by retrieving one web page and seeing what happens.

But before I added the check, I did notice the code hangs on occasion and the animations stop.

What LEDs and code are running on the PB?

* I haven’t seen wifi cause a lockup, but I have definitely seen my script doing “bad math” lock up a PB.

Thanks @ctag. Ah yes, some division by zero or whatnot could hang the ESP32. Again, is there any easy way to get the console with error messages so we don’t shoot in the dark?

Uhoh. Are all of those the stock scripts that PB comes with?

I’m not aware of the script troubleshooting processes, hopefully someone who knows more can chime in.

Bone stock, didn’t change a single line.
But the problem right now is getting some clue about what’s hanging. Guessing and shooting in the dark is not really the proper way to do this.
Is there a supported way to get logs and errors and crashdumps, or not?
ESP32 is quite helpful with tracebacks when it crashes.

Hi @marcmerlin,
I’m out of town, getting some much needed sun, so late to reply.

Some quick info: Yes, you can connect to the PB expansion header pins and get serial console output. No, the USB data pins are dead; the port is just for power, and there is no onboard USB serial converter since everything on PB is done over wifi. You can get uptime and reset causes from the websocket status messages.

Tracking uptime might be useful because if PB is resetting frequently it will eventually go into fail safe modes and eventually stop driving LEDs and can drop off the network after many resets. More info here.

I don’t know what would be causing what you are seeing, it’s not something I’ve seen/heard a lot of. Power is often a potential issue when people have issues with a PB going unresponsive. It’s entirely possible that there’s a bug somewhere in the networking stack that is locking things up. For example, a long time ago, in an older version, there was a certain router that would crash a PB in AP mode.

There is a wifi health check in PB in that if it detects problems it will try to reconnect. If it doesn’t have an IP address or wifi isn’t connected it will attempt to stop and start wifi again once a minute. You’d see a message like main reconnect <counter> done, result=<wl_status_t> in the serial output. This attempts to restart the wifi network stack, but if something is very wrong, like a network task is hung or crashed or something, it might not be able to recover from that.
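
Once serial output is being captured, these reconnect messages can be pulled out of the log. This is a minimal sketch, assuming the firmware prints the result as the numeric `wl_status_t` value (the thread only gives the `result=<wl_status_t>` pattern, so that is an assumption). The number-to-name table follows the Arduino WiFi library enum, and `reconnect_events` is a hypothetical helper name.

```sh
# Read a serial log on stdin and print "<counter> <status name>" for every
# "main reconnect <counter> done, result=<n>" message found.
reconnect_events() {
  awk '
    BEGIN {
      # wl_status_t values as defined by the Arduino WiFi library.
      name[0] = "WL_IDLE_STATUS";    name[1] = "WL_NO_SSID_AVAIL"
      name[2] = "WL_SCAN_COMPLETED"; name[3] = "WL_CONNECTED"
      name[4] = "WL_CONNECT_FAILED"; name[5] = "WL_CONNECTION_LOST"
      name[6] = "WL_DISCONNECTED";   name[255] = "WL_NO_SHIELD"
    }
    match($0, /main reconnect [0-9]+ done, result=[0-9]+/) {
      msg = substr($0, RSTART, RLENGTH)
      # Fields: main, reconnect, <counter>, done, result, <n>
      split(msg, part, /[ =,]+/)
      status = (part[6] in name) ? name[part[6]] : "unknown(" part[6] ")"
      printf "%s %s\n", part[3], status
    }
  '
}
```

For example, `reconnect_events < pixelblaze-serial.log` (a hypothetical log file name) lists each reconnect attempt with its outcome. A run of attempts that never reaches `WL_CONNECTED` would point at the wifi side rather than at the pattern code.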

I know some PB users also have Ubiquiti equipment, and we’ve used it for some mission critical events (though for short periods of time). That said, there’s a million variables with settings and versions that could come into play.

If you want to attempt to manually reboot it, you can send an HTTP POST to /reboot. This of course would only work if enough of the network stack and main task was working to properly handle it.
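
A watchdog could try this soft reboot first and only cut power if the device stays unresponsive. This is a minimal sketch, assuming `curl` is available. The `/reboot` POST comes from the post above; the `Client MAC` health marker and the `campower cam8` command are taken from the cron workaround earlier in the thread. All function names and the 30-second default wait are hypothetical.

```sh
# Ask the Pixelblaze to reboot itself over HTTP.
request_reboot() {
  curl -sS --max-time 5 -X POST "http://$1/reboot" >/dev/null 2>&1
}

# Health check: the page must contain the marker the cron job greps for.
is_healthy() {
  curl -sS --compressed --max-time 10 "http://$1/" 2>/dev/null | grep -q 'Client MAC'
}

# Site-specific power cycle (the outlet command from the cron workaround).
power_cycle() {
  campower cam8
}

# Try a soft reboot, wait, re-check, and fall back to a power cycle.
# Prints "soft" or "hard" to record which recovery was needed.
recover() {
  host=$1
  wait_s=${2:-30}
  # The result is ignored: the device may drop the connection while restarting.
  request_reboot "$host" || true
  sleep "$wait_s"
  if is_healthy "$host"; then
    echo soft
  else
    power_cycle
    echo hard
  fi
}
```

`recover pixelblaze1` would then replace the bare `campower cam8` call. Logging whether `soft` or `hard` was printed also shows how often enough of the network stack is still alive to handle the request.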

Thank you for all those details. Do they happen to be documented somewhere? If not, maybe you can add a troubleshooting header and simply a link to this thread?
Thanks for confirming that the only way to get debugging output is via the serial pins. I’ll connect something to them and put a rPi nearby to act as a serial port server and save the output.
For power, I’m using the 12V to 5V step down in the expander pro board I got from you, so I’m expecting that even if somehow my 12V were to dip a bit due to the 12V pixels, the 5V output should still be good. Also if it were an animation based brownout, I expect it’d be happening every few minutes when my animation loop cycles.
So I’ll get home this weekend, wire some serial pins, and look into capturing the serial output so that we can have some info instead of guessing :slight_smile:
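
For the capture itself, timestamping each serial line makes it possible to line the log up with the lockup times from the watchdog. This is a minimal sketch in plain POSIX `sh`; `stamp_lines` is a hypothetical helper, and the device path and baud rate in the usage note are assumptions, since the thread does not state them.

```sh
# Prefix every line read from stdin with the local time it arrived.
# Trailing carriage returns (serial consoles usually send CR LF) are dropped.
stamp_lines() {
  cr=$(printf '\r')
  while IFS= read -r line; do
    line=${line%"$cr"}
    printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$line"
  done
}
```

On a Linux machine with a USB serial adapter this could be run as `stty -F /dev/ttyUSB0 115200 raw -echo && stamp_lines < /dev/ttyUSB0 >> pixelblaze-serial.log`. Here `/dev/ttyUSB0` and 115200 baud are guesses to adjust, and `stty -F` is the GNU form of the option.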

Good to hear there is a health checker on PB for wifi. Do you know whether the main display loop will hang if wifi fails, or whether wifi can fail independently while LED output keeps working? (In my case, I’ve seen the output fully dead, and I’m now using a wifi check to reboot, but the output might still be working some of the times that wifi dies.) And actually it may not be wifi that is dead but the TCP stack. Let me add a few pings to my check loop, but I believe I’m doing an HTTP pull because I was able to ping while HTTP wasn’t working anymore.
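
Adding pings to the check loop could look like the following layered probe, which records whether the device is gone at the IP level or only at the HTTP level. This is a minimal sketch, assuming Linux `ping` (where `-W` is a timeout in seconds) and `curl`. The function names are hypothetical and the `Client MAC` marker is the one from the cron workaround.

```sh
# One ICMP echo with a 2 second timeout (Linux iputils syntax).
probe_ping() {
  ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Full page fetch; succeeds only if the expected marker is in the response.
probe_http() {
  curl -sS --compressed --max-time 10 "http://$1/" 2>/dev/null | grep -q 'Client MAC'
}

# Report which layer still answers: no-ping | ping-only | healthy
diagnose() {
  if ! probe_ping "$1"; then
    echo no-ping      # nothing answers at the IP level
  elif ! probe_http "$1"; then
    echo ping-only    # IP still answers, but the web server does not
  else
    echo healthy
  fi
}
```

Logging `diagnose pixelblaze1` before each power cycle would show over time whether every hang is of the `ping-only` kind already observed, or whether some of them take the whole network connection down.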

Right, I think it’s less likely given your power setup. Saw your post in LEDs are Awesome:

If I’m getting hangs on ESP32 when used as a wifi client, which basically fully hangs my code.
When this happens with telnetd running on ESP32, I see:
[D][WiFiClient.cpp:509] connected(): Disconnected: RES: 0, ERR: 128
At that point, the ESP32 needs a power cycle to recover.

That you have another ESP32 experiencing similar issues does lead me to think that it’s likely some gear or environment related incompatibility. Hopefully your serial capture sheds light! Would be interesting to know if they happen around the same time, like is it some poison broadcast signal that kills ESP32s in range, or is it client specific.

This forum perhaps, scattered in release notes or replies. This isn’t something more than a very small handful of people ever have needed and/or bothered with. This forum has become a much better/deeper resource for information on Pixelblaze than I could have assembled myself in any kind of documentation attempt.

The main task does not hang if wifi does any of the normal things wifi does, like disconnect suddenly, get slow, lose packets, etc. It is possible that an API call gets hung due to some lockup in the network stack due to a bug in the IDF or a library, or deadlocks due to other tasks holding mutexes for things if they get stuck likewise. I don’t know the internals of the ESP network stack, but I have seen other RTOS tasks occasionally emit console messages.

The wifi health check was a workaround for one such bug in the IDF, where it wouldn’t reconnect automatically in some edge cases.

Thanks for the details. Until I get home tomorrow and get a chance to wire things to get more logs, a few notes:

  1. my other ESP32 hangs more often because it’s at the edge of wifi reach. I think that one genuinely loses signal. I need to spend more time debugging it.

  2. documentation and notes in forum. I’m not quite volunteering to become documentation manager, but I also hate to cause or see repeated effort, so if it’s not easy enough for you to add a “read more” section in the main docs with quick notes and links to chosen forum posts, but you are comfortable giving me docs access, I’m happy to at least add the bits relevant to the questions I already asked ( from pixelblaze/guides/the_guides_have_moved.txt at master · simap/pixelblaze · GitHub it sounds like the docs are hosted in a new system? )

  3. good to hear about health checks. I haven’t used the IDF directly much outside of a couple of snippets in an Arduino project, but a quick Google search found this:
    Watchdog - reset esp32 if stuck more than 120 seconds - #2 by kenb4 - Programming - Arduino Forum
    I will look at adding that to my own code as a last resort in case the IDF wifi code hangs outside of my control in the Arduino code. I realize that PB is not OSS, so I’m not sure whether you have access to check if such a last-resort watchdog is already in there or can be added, or whether there is a process for me to file a bug for someone to look at (because regardless of what I end up finding, I know that I’m seeing animation hangs, and someone else reported the same).