So, my guess is wifi, because I have another ESP32 with my own code that has issues with wifi disconnections. That one is pretty far out in my yard, though, so I can see that the signal is marginal for that device. On that one (not the PB), when wifi disconnects, the wifi thread hangs the device and it needs a reboot (the Arduino code gets stuck in the ESP wifi driver and never returns).
Here, I did notice that the signal on that Pixelblaze was not great, but I put it on the other side of my window and the AP is literally on the other side of that same window, indoors. Before that, the PB was under the roof just 1 meter away, and the stucco (stone) killed enough signal that the PB barely worked.
Is there any signal debugging or serial console logging possible on PB, either locally via the USB port, or remotely via wifi? (Although reporting a wifi failure via wifi isn’t going to work.)
Or do I need to solder onto the RX/TX pins of the board to get a serial console out of it?
Oh, I forgot: I have high-end Ubiquiti APs, and I made a 2.4GHz-only network with its own SSID, ideally to make things simpler and more reliable. But the point is that wifi can and will fail or hang eventually, so is there some code in the PB’s firmware to deal with wifi issues?
Is wifi a separate background thread, or can wifi hang everything? The latter would explain why I’ve seen animations hang, which is obviously quite undesirable.
Thanks for asking, I am not. Actually, I’m not using the web API at all outside of my cron job that checks whether the PB is hung by retrieving one web page and seeing what happens.
But before I added the check, I did notice that the code hangs on occasion and the animations stop.
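For reference, here’s roughly what that cron check looks like (a minimal sketch; `pixelblaze.local` is a placeholder for the device’s real address):

```python
#!/usr/bin/env python3
"""Cron health check: fetch one page from the Pixelblaze, exit non-zero on failure.

A sketch of the idea; pixelblaze.local is a placeholder for the real address.
"""
import sys
import urllib.request

URL = "http://pixelblaze.local/"  # placeholder, use your PB's IP or hostname
TIMEOUT = 10  # seconds to wait before declaring the device hung

try:
    with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
        if resp.status == 200:
            sys.exit(0)  # device answered, all good
except OSError as e:
    print(f"Pixelblaze check failed: {e}", file=sys.stderr)
sys.exit(1)  # non-zero exit: cron/wrapper can alert or power cycle
```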
Thanks @ctag. Ah yes, some division by zero or whatnot could hang the ESP32. Again, is there any easy way to get at the console error messages so we’re not shooting in the dark?
bone stock, didn’t change a single line.
but the problem right now is getting some clue about what’s hanging. Guessing and shooting in the dark is not really the proper way to do this.
Is there a supported way to get logs, errors, and crash dumps, or not?
ESP32 is quite helpful with tracebacks when it crashes.
Hi @marcmerlin ,
I’m out of town, getting some much needed sun, so late to reply.
Some quick info: yes, you can connect to the PB expansion header pins and get serial console output. No, the USB data pins are dead; the port is just for power, with no on-board USB-serial converter since PB is all done over wifi. You can get uptime and reset causes from the websocket status messages.
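If you want to watch those from a script, something along these lines should work (just a sketch: it assumes Python’s websocket-client package and the websocket on port 81, and the exact status field names may differ by firmware version, so verify against the frames you actually receive):

```python
#!/usr/bin/env python3
"""Watch Pixelblaze websocket frames for uptime / reset-cause status.

A sketch: assumes the websocket-client package (pip install websocket-client)
and the PB websocket on port 81; the status field names are an assumption,
so verify them against the frames your firmware actually sends.
"""
import json
import websocket

ws = websocket.create_connection("ws://pixelblaze.local:81")  # placeholder host
while True:
    frame = ws.recv()
    if isinstance(frame, bytes):
        continue  # binary preview frames share this socket; skip them
    try:
        msg = json.loads(frame)
    except ValueError:
        continue
    if "uptime" in msg:  # field name assumed, check your actual frames
        print(f"uptime={msg['uptime']}", msg)
```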
Tracking uptime might be useful, because if the PB is resetting frequently it will go into fail-safe modes, eventually stop driving LEDs, and can drop off the network after many resets. More info here.
I don’t know what would be causing what you are seeing; it’s not something I’ve seen/heard a lot of. Power is often the culprit when a PB goes unresponsive. It’s entirely possible that there’s a bug somewhere in the networking stack that is locking things up. For example, a long time ago, in an older version, there was a certain router that would crash a PB in AP mode.
There is a wifi health check in PB: if it detects problems, it will try to reconnect. If it doesn’t have an IP address or wifi isn’t connected, it will attempt to stop and start wifi again once a minute. You’d see a message like `main reconnect <counter> done, result=<wl_status_t>` in the serial output. This restarts the wifi network stack, but if something is very wrong, like a network task that is hung or crashed, it might not be able to recover from that.
I know some PB users also have Ubiquiti equipment, and we’ve used it for some mission-critical events (though for short periods of time). That said, there are a million variables in settings and versions that could come into play.
If you want to attempt to manually reboot it, you can send an HTTP POST to /reboot. This of course only works if enough of the network stack and the main task are still working to handle it.
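For example, from a monitoring script (a sketch; the /reboot path is the only PB-specific part, and the hostname is a placeholder):

```python
#!/usr/bin/env python3
"""Ask a Pixelblaze to reboot via HTTP POST /reboot."""
import urllib.request

# pixelblaze.local is a placeholder for your PB's address
req = urllib.request.Request("http://pixelblaze.local/reboot",
                             data=b"", method="POST")
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("reboot requested, HTTP status", resp.status)
except OSError as e:
    # if the network stack is wedged, this is exactly where it fails
    print("reboot request failed:", e)
```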
Thank you for all those details. Do they happen to be documented somewhere? If not, maybe you could add a troubleshooting header and simply link to this thread?
Thanks for confirming that the only way to get debugging output is via the serial pins. I’ll connect something to them and put an rPi nearby to act as a serial port server and save the output.
For power, I’m using the 12V-to-5V step-down on the Expander Pro board I got from you, so I expect that even if my 12V rail dips a bit due to the 12V pixels, the 5V output should still be good. Also, if it were an animation-driven brownout, I’d expect it to happen every few minutes, whenever my animation loop cycles.
So I’ll get home this weekend, wire up the serial pins, and look into capturing the serial output so that we have some data instead of guesses.
Good to hear there’s a wifi health checker on PB. Do you know if the main display loop will hang when wifi fails, or can wifi fail independently while LED output keeps working? (In my case, I’ve seen the output fully dead, and I’m now using a wifi check to reboot, but the output might still be working some of the times that wifi dies.) And actually, it may not be wifi that dies but the TCP stack: I believe I was able to ping the device while HTTP wasn’t working anymore, so let me add a few pings to my check loop to tell the two apart.
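Something like this should tell the two failure modes apart (a sketch assuming a Linux box with the ping command; the hostname is again a placeholder):

```python
#!/usr/bin/env python3
"""Tell 'wifi/IP dead' apart from 'TCP/HTTP dead' on the Pixelblaze.

A sketch for a Linux monitoring host: assumes the ping(8) command,
and pixelblaze.local is a placeholder for the real address.
"""
import subprocess
import urllib.request

HOST = "pixelblaze.local"

def pingable() -> bool:
    # one ICMP echo with a 5 second deadline (Linux ping flags)
    r = subprocess.run(["ping", "-c", "1", "-w", "5", HOST],
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return r.returncode == 0

def http_ok() -> bool:
    try:
        with urllib.request.urlopen(f"http://{HOST}/", timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

ping, http = pingable(), http_ok()
if ping and not http:
    print("IP stack answers but HTTP is dead: TCP/app layer problem")
elif not ping:
    print("no ping reply: wifi/IP layer down, or device fully hung")
else:
    print("both ping and HTTP fine")
```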
Right, I think it’s less likely given your power setup. Saw your post in LEDs are Awesome:
I’m getting hangs on ESP32 when used as a wifi client, which basically fully hangs my code.
When this happens with telnetd running on ESP32, I see:
`[D][WiFiClient.cpp:509] connected(): Disconnected: RES: 0, ERR: 128`
At that point, the ESP32 needs a power cycle to recover.
That you have another ESP32 experiencing similar issues does lead me to think it’s likely some gear- or environment-related incompatibility. Hopefully your serial capture sheds light! It would be interesting to know if the hangs happen around the same time: is it some poison broadcast signal that kills every ESP32 in range, or is it client-specific?
This forum, perhaps, scattered in release notes or replies. This isn’t something more than a very small handful of people have ever needed and/or bothered with. This forum has become a much better/deeper resource for information on Pixelblaze than I could have assembled myself in any kind of documentation attempt.
The main task does not hang if wifi does any of the normal things wifi does, like disconnecting suddenly, getting slow, or losing packets. It is possible that an API call gets hung by some lockup in the network stack due to a bug in the IDF or a library, or by deadlocks if other tasks get stuck while holding mutexes. I don’t know the internals of the ESP network stack, but I have seen other RTOS tasks occasionally emit console messages.
The wifi health check was a workaround for one such bug in the IDF, where it wouldn’t reconnect automatically in some edge cases.
Thanks for the details. Until I get home tomorrow and have a chance to wire things up for more logs, a few notes:
My other ESP32 hangs more often because it’s at the edge of wifi reach; I think that one genuinely loses signal. I need to spend more time debugging it.
Documentation and notes in the forum: I’m not quite volunteering to become documentation manager, but I hate to cause or see repeated effort. If it’s not easy for you to add a “read more” section in the main docs with quick notes and links to chosen forum posts, but you’re comfortable giving me docs access, I’m happy to at least add the bits relevant to the questions I already asked. (From pixelblaze/guides/the_guides_have_moved.txt at master · simap/pixelblaze · GitHub, it sounds like the docs are hosted in a new system?)
Good to hear about the health checks. I haven’t used the IDF directly much myself, outside of a couple of calls in an Arduino project, but a quick google search found this: Watchdog - reset esp32 if stuck more than 120 seconds - #2 by kenb4 - Programming - Arduino Forum. I’ll look at adding that to my own code as a last resort in case the IDF wifi code hangs outside of my control in the Arduino code. I realize that PB is not OSS, so I’m not sure if you have a way to check whether such a last-resort watchdog is already in there or could be added, or if there’s a process for me to file a bug for someone to look at (because regardless of what I end up finding, I know I’m seeing animation hangs, and someone else reported the same).
As an update, I have the serial port wired now, no issues. Ironically, it’s plugged into an rPi that is also on the same wireless network, so that I can get at the serial port over ssh.
I’m not sure if exposing the serial port via some 3-pin connector on the board would be a good idea; it was a bit awkward to wire, and the wires are pinched by the case, but it works.
And what I’ve found so far: when I get hangs over wifi (the web page doesn’t reply, and at times ping doesn’t reply either), the device has not fully hung; it has always self-recovered, and I have not had to power cycle it yet.
I’m going to write some extra code to capture the last lines of the serial port dump when I detect a network hang, and see what I get.
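Roughly along these lines (a sketch assuming pyserial, the adapter at /dev/ttyUSB0, and the ESP32’s usual 115200 baud console):

```python
#!/usr/bin/env python3
"""Keep the last N lines of serial output; dump them on a signal.

A sketch: assumes pyserial (pip install pyserial), the USB-serial adapter
at /dev/ttyUSB0, and the ESP32's usual 115200 baud console rate.
"""
import collections
import signal
import serial

PORT, BAUD, KEEP = "/dev/ttyUSB0", 115200, 200
last_lines = collections.deque(maxlen=KEEP)

def dump(signum, frame):
    # the network health check can `kill -USR1` this process on a hang
    with open("/tmp/pb-serial-tail.log", "w") as f:
        f.writelines(last_lines)

signal.signal(signal.SIGUSR1, dump)

with serial.Serial(PORT, BAUD, timeout=1) as ser:
    while True:
        line = ser.readline().decode("utf-8", errors="replace")
        if line:
            last_lines.append(line)
```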
As an update, while working on the wiring, I saw everything go dark and the device reboot on its own (not initiated by me, that I know of).
The logs don’t really say why it rebooted. That said, it didn’t hang either, and the reboot was fast, so that’s not too much of a problem for me.
For what it’s worth, `rr0=1 rr1=14 rebootCounter=0` means that the chip was reset by a hard power event, not a crash or brownout or anything. rr0 is the reset reason for CPU 0, and 1 is the code for power-on reset.
The `rebootCounter=0` also indicates that power was lost long enough to lose RAM contents, and that the fail-safe system isn’t kicking in or anything.
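If you want to decode those codes in a capture script, the ROM reset reasons map roughly like this (values from the ESP32 ROM headers; worth double-checking against your IDF version):

```python
# Decode ESP32 ROM reset-reason codes like the rr0/rr1 values above.
# Values from the ESP32 ROM headers (rtc.h); names vary slightly between
# IDF releases, so double-check against your version.
RESET_REASONS = {
    1: "POWERON_RESET",            # hard power-on; matches rr0=1
    3: "SW_RESET",                 # software reset, e.g. esp_restart()
    12: "SW_CPU_RESET",            # software reset of a CPU core
    14: "EXT_CPU_RESET",           # APP CPU reset by the PRO CPU; rr1=14
    15: "RTCWDT_BROWN_OUT_RESET",  # brownout detector fired
}

for cpu, code in (("rr0", 1), ("rr1", 14)):
    print(f"{cpu}={code} -> {RESET_REASONS.get(code, 'unknown')}")
```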
Oh, thank you @wizard, I totally missed another safety watchdog I had that reboots the device daily. I just removed it, and now we’ll see whether I can get that hang again or not.
I just got home from a trip, and now that I have a Pi watching the serial port, it hasn’t hung once. I have no idea what’s going on, but it seems OK now.