Firestorm with HWv3.6 PB(SWv3.30) using Sensor Expansion after standby PB is unresponsive until PB reboot and disconnecting Firestorm

thatcalifire · April 13, 2023, 6:32pm

Greetings-
I am running a few PixelBlazes in our residence: two Picos and three PB Standards. I’m running Firestorm on a hardwired Raspberry Pi with 8 GB of RAM, on port 80, set up following the documentation in the Firestorm repo.

One of the PBs has the Sensor Expansion board, and I am thinking of installing another Sensor Expansion board on another PB Standard (or Pico, depending on how wild I’m feeling) if I can figure out what’s going on with this one PB when Firestorm is running in pm2 and connected to the network.

It smells like a WebSocket or UDP flood, but I do not know where to start troubleshooting this without busting out something like Wireshark to sniff the network and see what’s happening.

The short and sweet of it is that I have all the PixelBlazes (Standard and Pico) set to turn on and off at certain times in the evening. Firestorm remains on. I have a few feature requests for Firestorm that I might be willing to contribute upstream in time, but I’ll save those for another thread.

The PixelBlaze attached to the Sensor Board cannot be reached by IP address or by its local DNS name (configured on my pfSense router) from any device or browser. Additionally, discover.electromage.com reports that the PixelBlaze with the SB attached was last seen 800+ minutes ago (this varies depending on when I decide I want some blinky in my life during the day).

All of the Pixelblazes will turn back on as usual when scheduled, but they are not in sync with the Firestorm playlist, which can be a bit wild since I have a few running simultaneously. Hence the strong desire to have them all synced with Firestorm, or in the interim, I’ll have to settle with just a single pattern until I can figure out how to resolve the Firestorm issue.

To restore access via IP address, local DNS, and the discover.electromage.com service, I have to disconnect power to the PixelBlaze with the Sensor Board attached, effectively rebooting it. I also need to restart the Firestorm server using pm2 and entirely disconnect the Pi from the network, effectively stopping whatever is clogging the WebSockets on the PB.

Eventually, I can reconnect the Firestorm to the network, and the PixelBlazes will sync up as desired, and all is well in the world.

Any ideas on how to troubleshoot this so that we might be able to resolve this for other users? Thanks for your time.

For further information, each Pixelblaze is running v3.30. The PixelBlaze with the SB attached is HW v3.6, the Picos are HW v1.7, and I am also running a PB v3.4 w/o the SB.

wizard · April 14, 2023, 11:16pm

It could very well be a flood of websocket connections DOSing the Pixelblaze, and the reconnection attempts may need some tweaking.

When you say turn off, I assume you mean you are powering them off completely, and not using the auto off feature?

Firestorm completely forgets about a Pixelblaze after a few minutes, and won’t change anything until the next event like a playlist transition. So if a PB is offline for a long time and comes online, it won’t sync up what the current pattern is. Quick resets are OK, and will switch back when it comes online if it’s only out for a short time. And of course that will only work if it can talk to that PB.

That does sound like some issue around how the websocket is handled. Can you get log messages from around this time? Something like pm2 logs should spit out some interesting details, and the full logs are stored in the pm2 directory (like ~/.pm2 or something).

thatcalifire · April 15, 2023, 1:35am

I’ll get the deets for ya in a bit. Thanks for the response!

Yes, this is correct: I’m basically unplugging the device for a few seconds and plugging it back in.

thatcalifire · April 17, 2023, 3:19pm

There isn’t anything of particular interest in the server-error.log and in the server-out.log I see the following:

ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
Error: connect ETIMEDOUT 10.0.0.34:81
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16)
    at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'connect',
  address: '10.0.0.34',
  port: 81
}
ws not alive 10.0.0.34
stopping 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34
ws not alive 10.0.0.34

I’ve got some PRs up on GitHub and am working on another. In the meantime, are there some places in controller.js or discovery.js where I could spit more out to the log to help troubleshoot?

This device (10.0.0.34) is the PB with the SB connected and is currently not reachable on my network. I’ll have to physically unplug the device to get it responsive again, but its LED is on and steady, which has piqued my curiosity to get it sorted out. However, discoveries.map() still renders the “downed” PB in the UI, since it is still present in the discovery state.

The other device, 10.0.0.35, is still reachable on the network and is a Pico PB. I disconnected the other Standard PB and Pico PBs from my network to assist in troubleshooting, so I only have two listed here.
I am running the branch in PR #44 related to ticket #25 on the repo, which is stable and has no substantial changes to the app/* (API) files.

Thanks for the response!

EDIT: I restarted the server to see if I could get something different. Again, there is nothing really interesting in the server-error.log, but the server-out.log shows:

0|server   | ws not alive 10.0.0.34
0|server   | ws not alive 10.0.0.34
0|server   | listening on 80
0|server   | > Pixelblaze Discovery Server listening on 0.0.0.0: 1889
0|server   | done
0|server   | data: [ { id: 9, name: 'rainbow fonts', duration: 15 } ]

0|server  | connected to 10.0.0.35
0|server  | sending to 10225758 at 10.0.0.35 {"getConfig":true,"listPrograms":true,"sendUpdates":false}
0|server  | ws not alive 10.0.0.34
0|server  | ws not alive 10.0.0.34
0|server  | ws not alive 10.0.0.34
0|server  | ws not alive 10.0.0.34
0|server  | ws not alive 10.0.0.34
0|server  | ws not alive 10.0.0.34
0|server  | ws not alive 10.0.0.34
0|server  | ws not alive 10.0.0.34

and the UI rendered by discoveries.map() is the following:

It is odd that it just rendered the HW name and nothing else…

If I try to dump from the device via Firestorm (the code suggests I should be able to by hitting /), I don’t get a zip or anything, just a weird downloaded HTML “not found” page… perhaps this is a bug in the UI that needs to be addressed later.

thatcalifire · April 22, 2023, 12:28pm

Okay, so after some overhaul on FS, I have installed the new firmware v3.40 on all the Picos and PBs on my network. The PB XL with the SB installed is the leader. The followers all see the PB XL with the SB as the leader and even acknowledge that an SB is installed (per the recover page for each follower device).

When I view my controllers in FS, only the leader is listed, which is the PB XL with the SB installed. However, FS has newplaylist running, with just a single pattern. The PB leader (with the SB installed) and followers, are stuck on whatever pattern the leader last played.

If I try to navigate to the PB leader UI, the UI never connects and is stuck on the spinning logo, never reaching the “reading config” step.

Is there a way I can help debug what is happening with FS and the PB with the SB on it? I could give up on FS now that sync is running on the PBs but it just seems so fun to have in the network.

EDIT: I just want to say that if FS is disconnected with v3.40, the sync between the leader and followers works perfectly, so I’m torn lol … Solve for FS or just use the new sync?

wizard · April 22, 2023, 3:48pm

Hi @thatcalifire ,
Just to clarify, without Firestorm everything works OK, but with FS running your leader no longer connects via browser? With FS running, your leader isn’t responding to browser/FS commands, but the leader and followers are still synced? That would make sense if the leader’s websocket server was hosed (like the connections were full), but the rest of the networking was otherwise net-working.

The followers shouldn’t show up in FS. That’s intentional so that FS doesn’t try to send them commands, which would confuse things since they are getting everything they need from their leader, and the beacon used for discovery doubles as time-sync, but they are timesyncing from their leader directly. They will still show up on discover.electromage.com and can be seen from the leader.

The status menu has been reworked, and you can get sensor board expansion as well as leader / follower / peer links right from inside the PB interface. The status is now a drop-down menu:

Looking at the FS code, I would expect to see some log entries around 10.0.0.34 that I’m not seeing.

So the discovery.js module handles the UDP beacons and whatnot, and is what is emitting the “ws not alive” messages. When it sees a controller via UDP it will make a PixelblazeController instance (from controller.js) to communicate with it via websocket and start it, which tries to connect.

On the controller side, it should emit messages whenever it connects, fails to connect, or closes. You should see something like “connected to 10.0.0.34” if the WS connects. On error, console.log is invoked with whatever the error was, which is where the “Error: connect ETIMEDOUT” comes from. You can change this like so (around line 91) so it’s a bit more clear in case the error doesn’t have the IP:

this.ws.on('error', (err) => {console.log("ws error to " + this.props.address, err)}); // arrow function so `this` stays the controller, not the ws

When a websocket closes, which should get called if the WS connection drops for any reason, it will try to reconnect after 1 second:

  handleClose() {
    console.log("closing " + this.props.address);
    this.reconectTimeout = setTimeout(this.connect, 1000);
  }

I was thinking that might be happening over and over again, though I don’t see the “closing” message in your log. If it was happening over and over again, it might be consuming all available websocket connections on the PB, which would explain why it stops responding.
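If repeated reconnects ever did turn out to be a problem, one possible tweak to the fixed 1 second retry would be exponential backoff. This is only a sketch under assumptions: reconnectDelay and BackoffReconnector are hypothetical names that do not exist in Firestorm, and the base and cap values are arbitrary.

```javascript
// Hypothetical helper (not part of Firestorm): the delay doubles with each
// consecutive failed attempt and is capped at maxMs.
function reconnectDelay(attempt, baseMs = 1000, maxMs = 30000) {
  const delay = baseMs * 2 ** Math.max(0, attempt);
  return Math.min(delay, maxMs);
}

// Minimal stand-in for a controller that counts consecutive closes.
// `schedule` defaults to setTimeout and is injectable for testing.
class BackoffReconnector {
  constructor(connect, schedule = setTimeout) {
    this.connect = connect;
    this.schedule = schedule;
    this.attempt = 0;
  }

  // Call when the websocket opens: the next failure starts from baseMs again.
  handleOpen() {
    this.attempt = 0;
  }

  // Call when the websocket closes: retry later, waiting longer each time.
  handleClose() {
    const delay = reconnectDelay(this.attempt);
    this.attempt += 1;
    return this.schedule(this.connect, delay);
  }
}
```

With these defaults the retries would be spaced 1 s, 2 s, 4 s, and so on up to 30 s, so a PB that keeps dropping the socket would see far fewer connection attempts than with a constant 1 second loop.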

However, that isn’t what is happening: the connection is never made, so close isn’t called. Eventually the discovery code notices the controller hasn’t been seen for a while (discovery.js line 16) and calls stop, which terminates the websocket. That shouldn’t do anything, since it never connected.

My guess is that the PB is still sending out beacons, and the discovery code will fire up a new controller, repeating every expireMs interval. That should be way too slow to cause any exhaustion of connections on the PB side.
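The cycle guessed at above can be sketched roughly as follows. This is not the actual discovery.js: DiscoveryTable, onBeacon, sweep, makeController and isAlive are hypothetical names, and it assumes lastSeen is only refreshed while the websocket is alive, which is inferred from the log output rather than checked against the code.

```javascript
// Sketch only. expireMs, lastSeen, start and stop come from the discussion;
// everything else here is a made-up name for illustration.
class DiscoveryTable {
  constructor(makeController, expireMs) {
    this.makeController = makeController;
    this.expireMs = expireMs;
    this.discoveries = new Map(); // id -> { address, controller, lastSeen }
  }

  // Called for every UDP beacon heard from a Pixelblaze.
  onBeacon(id, address, now) {
    let record = this.discoveries.get(id);
    if (!record) {
      // First beacon (or first after expiry): make a controller and start it.
      const controller = this.makeController({ id, address });
      controller.start();
      record = { address, controller, lastSeen: now };
      this.discoveries.set(id, record);
    } else if (record.controller.isAlive()) {
      record.lastSeen = now;
    } else {
      console.log("ws not alive " + address);
    }
    return record;
  }

  // Called periodically: forget anything not seen within expireMs.
  sweep(now) {
    for (const [id, record] of this.discoveries) {
      if (now - record.lastSeen > this.expireMs) {
        console.log("stopping " + record.address);
        record.controller.stop();
        this.discoveries.delete(id);
      }
    }
  }
}
```

Under that assumption, a PB whose websocket never connects but which keeps sending beacons would get at most one new controller per expireMs interval, which matches the pattern of many “ws not alive” lines followed by a single “stopping” line.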

I wonder if there’s something up with the WebSocket library… it might be worth trying different versions. The old version works for me, but I do see lots and lots of fixes, some related to closing cleanly, etc. Everything still seems to work with the latest "ws": "^8.13.0", I’d give that a shot and see if anything changes.

thatcalifire · April 22, 2023, 5:05pm

Yeep that’s what I am seeing! Thanks for the response! I’ll mess around with it today and see if I can make it burp some interesting things and report back my findings

thatcalifire · April 22, 2023, 5:17pm

Interestingly enough, the first thing I did was upgrade WS, and it seems okay… I’ll run with it for a few hours and see how it goes; if all is well, I’ll push up the upgrade

thatcalifire · April 22, 2023, 7:38pm

@wizard So far, so good, but before I push up an upgrade, I did notice the version and custom name aren’t being transmitted to FS running WS ^8.13.0. I’d rather not regress/introduce a bug if possible. I’ll see if I can find this in the request in the controller. Perhaps the attributes changed.

In the code, I see we set the name when name is missing from the request, here on line 47

        d.name = d.name || "Pixelblaze_" + d.id // set name if missing

Below is a typical entry seen by FS when dumping the discoveries object in app/api.js, here on line 25:

  app.get("/discover", function (req, res) {
    res.send(_.map(discoveries, function (v, k) {
      console.log(`${k}:${JSON.stringify(v)}`)
      let res = _.pick(v, ['lastSeen', 'address']);
      _.assign(res, v.controller.props);
      return res;
    }));
  })

I’ve abbreviated the response for brevity.

{
	"lastSeen": 1682189872667,
	"address": "10.0.0.34",
	"port": 1889,
	"controller": {
		"props": {
			"id": 16573876,
			"address": "10.0.0.34",
			"programList": [<...>]
		},
		"command": {},
		"partialList": [<...>],
		"lastSeen": 1682189873064,
		"ws": {<...>}
	}
}
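Put together, the /discover mapping and the name fallback behave roughly like this plain-JavaScript sketch (no lodash). summarizeDiscovery is a hypothetical name, and combining the two steps in one function is only for illustration; in the app they live in different places.

```javascript
// Hypothetical helper mirroring the two snippets: pick lastSeen and address
// from the record, merge in controller.props, then fall back on the name.
function summarizeDiscovery(record) {
  const summary = {
    lastSeen: record.lastSeen,
    address: record.address,
    ...record.controller.props,
  };
  summary.name = summary.name || "Pixelblaze_" + summary.id; // set name if missing
  return summary;
}

// Abbreviated record shaped like the dump (programList left empty here).
const example = {
  lastSeen: 1682189872667,
  address: "10.0.0.34",
  port: 1889,
  controller: { props: { id: 16573876, address: "10.0.0.34", programList: [] } },
};

console.log(summarizeDiscovery(example).name);
```

Since props in the dump carries only id, address and programList, the name has to come from the fallback, which fits the UI showing just the generated name.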

It’s a nit, of course, since we fall back cleanly, but it would be cool to maintain parity.

Did something change in the newest release of PB, or is the new version of WS handling the message differently?

Here is the view of FS showing the leader. A screenshot of the leader, with its custom name and followers:

thatcalifire · April 22, 2023, 11:03pm

I guess I spoke too soon. I see the errors again, but nothing of use other than what I get from dumping logs at various points in the code. I did find that this.ws in the stop() function, which is called first in the connect() block in app/controller.js, has a _closeCode of 1006, which means Abnormal Closure, as seen here in the MDN docs.
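For logging, a small lookup like the one below could turn close codes into readable text. The table holds a subset of the standard WebSocket close codes; describeCloseCode is a hypothetical helper, not something in Firestorm or the ws library.

```javascript
// Subset of the standard WebSocket close codes (RFC 6455, section 7.4.1).
const CLOSE_CODES = {
  1000: "Normal Closure",
  1001: "Going Away",
  1002: "Protocol Error",
  1003: "Unsupported Data",
  1005: "No Status Received",
  1006: "Abnormal Closure",
  1009: "Message Too Big",
  1011: "Internal Error",
};

// Hypothetical helper: readable text for a close code, for log output.
function describeCloseCode(code) {
  return CLOSE_CODES[code] || "Unknown (" + code + ")";
}

console.log("close code 1006: " + describeCloseCode(1006));
```

Code 1006 is never sent on the wire; it is reported locally when the connection went away without a close frame, which fits a socket that is accepted and then dropped.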

As another test, for giggles, I tried to telnet to the PB on port 81. The connection opens and is then closed immediately, so something is definitely up with the WebSocket service on the PB at 10.0.0.34:

$ telnet 10.0.0.34 81
Trying 10.0.0.34...
Connected to hypershift-center.pixelblaze.lan.
Escape character is '^]'.
Connection closed by foreign host.

Port 80 still works though:

Lastly, if I power-cycle the device and attempt to telnet again, like magic, it works again:

$ telnet 10.0.0.34 81
Trying 10.0.0.34...
Connected to hypershift-center.pixelblaze.lan.
Escape character is '^]'.

wizard · April 23, 2023, 2:47am

@thatcalifire ,
Indeed, it does look like a breaking change from the WS upgrade. None of the text/json messages were getting parsed at all, which would have all the interesting bits including name. Looks like they now pass in a second arg isBinary and we use that instead of the typeof the message. It passes in a Buffer object that isn’t a String, but can be cast to one.

So handleMessage now has this signature:

handleMessage(msg, isBinary) {

and the check changes from a typeof to checking this argument:

    let props = this.props;
    if (!isBinary) {
      try {
        _.assign(this.props, _.pick(JSON.parse(msg), PROPFIELDS));
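A self-contained sketch of that check, for anyone who wants to try it without the rest of the controller. It uses only Node built-ins; ControllerSketch is a hypothetical stand-in, and the PROPFIELDS list here is a made-up subset (the real list lives in controller.js).

```javascript
// Hypothetical subset; the real PROPFIELDS is defined in controller.js.
const PROPFIELDS = ["name", "ver"];

class ControllerSketch {
  constructor(props) {
    this.props = { ...props };
  }

  // ws 8.x calls 'message' listeners with (data, isBinary). Text frames now
  // arrive as a Buffer, so a typeof check for "string" no longer matches.
  handleMessage(msg, isBinary) {
    if (isBinary) return false; // binary frames are not JSON
    try {
      const parsed = JSON.parse(msg.toString("utf8"));
      for (const key of PROPFIELDS) {
        if (key in parsed) this.props[key] = parsed[key];
      }
      return true;
    } catch (err) {
      console.log("bad JSON from " + this.props.address + ": " + err.message);
      return false;
    }
  }
}
```

In the real controller this would be wired up with something like this.ws.on('message', ...) passing both arguments through; the explicit toString just makes the Buffer to string conversion visible.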

thatcalifire · April 23, 2023, 4:49am

Fantastic, I’ll look to add this in a PR tomorrow sometime <3, if you do not have one already in the works!

Thanks for the response

thatcalifire · April 23, 2023, 12:21pm

Validated that this works. I’ll continue tinkering with the upgrade to see if any interesting errors show up. When dumping things to the console, I did notice we were not processing the message as JSON, but I didn’t know enough about the message to realize it’s now binary vs. string.

Regarding the disconnection and beacon packet still being sent, I did validate that in fact the beacon packet is still being sent out from the PB and heard on FS when FS cannot connect to the WS on the PB.

For now it’s working. If it gets weird again, I’ll continue poking at it.

system · August 21, 2023, 12:21pm

This topic was automatically closed 120 days after the last reply. New replies are no longer allowed.