OpenGarage › Forums › OpenGarage Firmware › Automatic WiFi reconnect
- This topic has 12 replies, 3 voices, and was last updated 5 years, 5 months ago by idxman01.
July 18, 2019 at 6:15 pm #1775
jagos (Participant)
Is the OG supposed to be robust against losing its WiFi connection? That is, does it detect that it has lost the connection and try to reestablish it, and if so, how soon after the loss should it reconnect? I have MQTT configured and have set notifications for door state, so I normally get those every 15 seconds or so.
Over the last 3 or 4 days, I have found my OG non-responsive twice, a few days in between. When I look at it, I see that its blue LED is flashing about every 5 sec. If I power cycle it, by the time I get back upstairs and check, it is working fine again.
I have an ASUS router running in AP mode with Merlin firmware. I run a script to modify settings so that each of the 6 enabled guest SSIDs (3 @ 2.4 GHz and 3 @ 5 GHz) is on its own VLAN. I have one of those guest SSIDs reserved for IoT on 2.4 GHz, and right now the OG is the only thing on it. I never have trouble with it connecting on boot. The WiFi signal according to the OG web page is labeled Good at -63 dBm.
I recall that in the past, if I ran my AP script resetting one of the other guest SSIDs, the OG would lose connection and never get it back until I power cycle the OG.
I also checked uptime on my AP in case it may have glitched but it has been up for 29 days.
Over the 10 months or so I have had the OG, it has been quite stable with the WiFi and distance readings. Never before the last few days have I lost WiFi from the OG unless something quite overt happened. Only once did it get in a period where the distance readings were flailing fairly rapidly. Power cycle fixed that.
If it is supposed to reconnect automatically, any ideas on why it is not doing so? Could reconnection be made more robust?
Thanks.
July 18, 2019 at 7:01 pm #1778
Ray (Keymaster)
Yes, the firmware already handles reconnection. The specific code is here:
https://github.com/OpenGarage/OpenGarage-Firmware/blob/master/OpenGarage/main.cpp#L1463
The basic logic is that if it gets disconnected from the router, it waits for 60 seconds, and if it is still disconnected, it reboots to try to reconnect from the beginning.
July 21, 2019 at 5:29 pm #1785
jagos (Participant)
I understand that logic for connection testing. But there is still a failure.
I used my phone to reboot the OG and watched while it rebooted. I saw the blue LED flash quickly a number of times and then settle into one blink every 5 seconds. I did not hear any sounds – I thought I should. Anyway, this was a way I could identify the reboot by the quick flashes. And here I still had WiFi access to the OG once rebooted.
Next I enabled one of my disabled guest SSIDs which reconfigures the ASUS AP and causes a loss of connection with the OG. I continuously watched the OG for more than 2 minutes looking for the quick flashes indicating reboot. I did not see this. It continued to flash once every 5 seconds. I could not get the OG web page. I power cycled the OG, saw the quick LED boot flashes and then connectivity was restored. I tried this several times.
So it would appear to me that the OG test “if(WiFi.status() == WL_CONNECTED && WiFi.localIP())” may be failing to detect the disconnected status.
Have you tried disabling your WiFi AP while the OG is up and running and accessible to see if the disconnection is in fact detected and causing reboot? And if once the AP is restored the OG connects again?
Maybe there should be a ping with timeout test periodically to make sure there is actual connectivity, say every 60 or 120 seconds. This could even be optional with a user provided ip address to ping and possibly a user provided interval.
July 22, 2019 at 7:39 am #1787
Ray (Keymaster)
Yes, this feature has been tested. Here is how it’s tested: OG is connected to our WiFi router in station mode, then we unplug the router and wait for a while (we’ve tested gaps of both less than a minute and much longer, like 10 minutes). Then the router is powered back on, and OG is able to reconnect to the router afterwards without having to be manually rebooted.
Where did you see the condition of “if(WiFi.status() == WL_CONNECTED && WiFi.localIP())” — that is not the condition it’s testing in station mode. The only place I can think of that checks that condition is when the controller is in AP mode itself, which is not the mode you are referring to.
July 22, 2019 at 11:36 am #1788
jagos (Participant)
On the test, yes, I accidentally copied the code 10 lines up (line 1438) from where I meant (line 1448). The line I meant just tests the status and not the localIP.
Your test case is clearly detected and the code recovers appropriately.
But my case is a clear example, I think, that there are cases where connectivity is lost and it is not detected. I did not see any evidence of reboot yet in every case, a power cycle regains connectivity immediately suggesting that the AP is fine at that time. And I have waited as much as 10 minutes without regaining connectivity.
This is why I think an optional ping test is a more reliable test. If the user provides a ping IP address, then ping it with timeout every interval number of seconds where that can be user specified as well but defaults to something reasonable say between 60 and 300 seconds. Doing a ping test with 1 second wait every 60 seconds should not compromise other functionality.
What do you think about this idea? It is really unfortunate when you are away from home and find you have lost connectivity.
I appreciate your excellent customer service and being so responsive.
July 22, 2019 at 12:46 pm #1789
Ray (Keymaster)
I am having trouble understanding how it is possible that, when your router is powered down, the condition (WiFi.status() == WL_CONNECTED) is still true. If this is the case, what if the ping also returns successfully? This doesn’t make sense to me. What firmware version are you on? Note that the auto-reconnect logic was not included in earlier firmwares. I’ve tried two different routers, as well as my phone’s WiFi hotspot; as soon as the network is down, WiFi.status() == WL_CONNECTED becomes false. I have not seen a counterexample so far.
July 22, 2019 at 1:11 pm #1791
jagos (Participant)
First, I use an ASUS router in AP mode. My router is an EdgeRouter Lite from Ubiquiti, which has no WiFi.
I am on the 1.1.0 firmware and have been since shortly after it came out.
As I stated earlier, my AP is not powered down. This problem occurs when I enable or disable one of my 6 guest SSIDs – not the one that the OG is connected to. In this case the router has to do some reconfiguration of its bridge and then I run a script to reconfigure the bridges and VLANs so that each enabled guest SSID is on its own VLAN.
Connectivity is lost as soon as I click the Enable button on the web page for the guest SSID. And it remains lost indefinitely after I run the script. Note the AP does not reboot during this change. During this I never see the OG blue led flash quickly indicating a reboot. Yet once I power cycle the OG forcing the reboot, it connects and works just fine with the new AP guest SSID configuration.
I thought it might be related to my VLAN stuff. I separate all guest SSIDs to individual VLANs so that I can use firewalls to isolate each one. When the ASUS enables or disables a guest SSID, it reconfigures bridges and VLANs but puts all the guest SSIDs in the same VLAN as the regular SSIDs. That is why I have to run a script to rebuild the bridges and VLANs to separate them. I have a reserved IP address for the OG within the VLAN subnet it is connected to. I was thinking during the reconfiguration from enable / disable of a guest SSID maybe the OG got a different IP address from the wrong VLAN and then once I ran the script to fix the VLANs, the OG was simply on the wrong IP address. But this would require the OG to have rebooted and I have seen no evidence of that. So I dismissed this idea but explained it for completeness.
When I have lost connectivity to the OG, I can neither get a web page nor ping it successfully. And I have waited at least 10 minutes in some tests.
I do not believe when the OG is in this disconnected state that a ping from the OG will succeed. But I have no way to prove that short of new firmware.
July 22, 2019 at 2:24 pm #1792
Ray (Keymaster)
Hmm… Sounds complicated. I don’t know how to reproduce this exactly since I don’t have the kind of router setup you have. I have reservations about adding the ping test, because it could introduce other issues such as false positives. I still don’t quite understand why the condition WiFi.status() == WL_CONNECTED fails. How is resetting an SSID different from a reboot of the router?
July 23, 2019 at 12:14 am #1793
jagos (Participant)
First, I would like to emphasize that the ping test would be optional, contingent on a user-supplied IP address. If no such address is supplied, there is no change to the current behavior.
Before, I suggested an optional interval setting but I no longer think this is necessary. I would set the ping IP address to the IP address of my router in the same subnet as the OG. I tested the speed of this from my desktop and a successful ping takes 2 ms (ping -c 1 -W 1000 192.168.193.1). Other environments will differ but choosing a local ip in the subnet should be quite fast.
I would suggest putting the test right after line 1448 WiFi.status test. If the WiFi.status test fails then the code branches and the ping test is not considered. If WiFi.status succeeds, then the ping test is only done if an ip address for it has been provided. If it succeeds, then the code proceeds as before. If it fails, then it can wait and try again just like WiFi.status failure. There would need to be a way to mark which kind of failure occurred and which to retest. A second failure would again cause a reboot.
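The flow described above (start a timer on the first failed check, clear it if a later check succeeds, reboot if the failure outlasts the timeout) could be sketched roughly as follows. This is a minimal illustration, not the OpenGarage firmware code: `ConnectivityWatchdog`, `Action`, and `update` are hypothetical names, and `connected` is assumed to be the combined result of the WiFi.status check and the optional ping, so the sketch does not track which of the two failed.

```cpp
#include <cstdint>

// Hypothetical names throughout; this is not the OpenGarage firmware code.
enum class Action { None, Reboot };

struct ConnectivityWatchdog {
    uint32_t timeoutMs;          // how long a failure may persist before a reboot
    bool     failing = false;    // true while a failure streak is in progress
    uint32_t failStartMs = 0;    // time at which the current streak began

    explicit ConnectivityWatchdog(uint32_t timeout) : timeoutMs(timeout) {}

    // Call once per main-loop pass with the current time and the check result.
    Action update(uint32_t nowMs, bool connected) {
        if (connected) {         // any successful check clears the streak
            failing = false;
            return Action::None;
        }
        if (!failing) {          // first failure: start the timer, do not reboot yet
            failing = true;
            failStartMs = nowMs;
            return Action::None;
        }
        // Unsigned subtraction stays correct if the millisecond counter wraps.
        return (nowMs - failStartMs >= timeoutMs) ? Action::Reboot : Action::None;
    }
};
```

On an Arduino-style platform one would presumably feed `update` with `millis()` and a 60000 ms timeout to match the 60-second wait Ray described, and trigger the restart when it returns `Action::Reboot`.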
I do not think the ping test ever needs to be done during boot. This is to catch conditions during normal running where connectivity has been lost but not detected by WiFi.status. And I do not understand why this happens in my case but it clearly does. There have been a few other times (no more than 5 and maybe less) over the 11 months or so I have had the OG where I have found it non-responsive but I cannot remember the circumstances and some or all may have been on older firmware. But on v1.1.0 I can reliably reproduce the problem I have described. If there is one such case, who is to say there are not other as yet unidentified scenarios? And it is bad if you are away from home, want to check the OG, and find it non-responsive. I think this approach can improve such cases making the OG more robust.
There might be a bit of an issue if the user enters a bad IP address and then does not have much time after a reboot to correct it. But I am not sure this is any worse than if the user enters a bad static Device IP, Gateway IP, subnet, or DNS. In such a case the user would need to reset to defaults and start over.
I understand your reluctance when you cannot generate a test case. But you could simulate the ping failing by having the OG ping another host on your network and then taking that host down for a desired amount of time. This way you could verify the code. If you made such a test firmware that passed such a test properly and you believed would not brick my OG, I would be willing to give it a try in my situation and give you feedback.
I understand this takes effort on your part for a case you have never seen so it might seem not to be worth it. I do believe it would be a good addition to make things more robust and should fix the problem I know I have. It is certainly your choice. I just hope I have made a compelling case.
Thanks for taking the time to consider this.
July 23, 2019 at 1:27 pm #1794
jagos (Participant)
If you have not read the post immediately before this one, please read that one first.
I have thought about this a bit more and I realized that I do not know if a ping with timeout is even available on this platform. If it is not, then I guess end of discussion. So for now I will assume it is.
I thought about the code change and I think it would be really simple. There is of course the need for the UI to get the pingIP. Other than that, the following pseudo-code should work, a one line change:
replace line 1448
if(WiFi.status() == WL_CONNECTED) {
with
if((WiFi.status() == WL_CONNECTED) && (!havePingIP || pingSucceeds()) ) {
and that is it. If havePingIP is false, then !havePingIP is true, so the || is true and the test reduces to the current WiFi.status check. If havePingIP is true, then !havePingIP is false and the overall test requires both WiFi.status and pingSucceeds to be true; otherwise it starts the failure timer. Generally both will be true. If one of the two fails, it is very likely to be the one that continues to fail. But even if one fails and then after 60 seconds it is the other that fails, it is still worth a reboot. Also, if one fails and then later both succeed, there is no need for a reboot. I really do not think the logic needs to be any more complicated.
It is probably worth having a boolean variable havePingIP as to whether the pingIP has been defined since this is in the main loop. I have no idea how on this platform you can test a ping with timeout so as a placeholder I merely used pingSucceeds() for that.
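To make the truth table above concrete, here is a small sketch of the combined condition as a standalone function. `connectivityOk` is a hypothetical helper, and the `ping` callback stands in for the `pingSucceeds()` placeholder, since whether the platform offers a ping with timeout is still an open question here.

```cpp
#include <functional>

// Hypothetical helper: returns true when the link should be treated as up.
// `ping` stands in for whatever ping-with-timeout call the platform offers.
bool connectivityOk(bool wifiConnected, bool havePingIP,
                    const std::function<bool()>& ping) {
    // Short-circuit evaluation: ping() is only called when WiFi reports
    // connected AND a ping address has been configured.
    return wifiConnected && (!havePingIP || ping());
}
```

Because of the short-circuit, a device with no ping address configured never issues a ping, so its behavior matches the existing WiFi.status-only check.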
I trust this makes sense and you see it is a rather straightforward modification. Thanks again for your consideration. I am trying to make this as easy and well thought out as I can. If you see any issues, I would love to try to resolve them.
ETA: A way to make this safer still, when the user submits a ping IP address, before it is saved, a test ping is issued. If it succeeds then it is saved. If it fails, it is not saved and the user is informed.
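A rough sketch of that save-time check, under stated assumptions: `PingSettings` and `trySavePingIP` are hypothetical names, the `ping` callback again stands in for a platform ping with timeout, and treating an empty address as "disable the ping check" is an added assumption rather than something specified above.

```cpp
#include <functional>
#include <string>

// Hypothetical settings holder; an empty string means "ping check disabled".
struct PingSettings {
    std::string pingIP;
};

// Saves `candidate` only if a test ping to it succeeds.
// Returns true if the setting was stored; on false the previous value is
// left untouched so the UI can report the failure to the user.
bool trySavePingIP(PingSettings& settings, const std::string& candidate,
                   const std::function<bool(const std::string&)>& ping) {
    if (candidate.empty()) {            // assumption: clearing the field disables the check
        settings.pingIP.clear();
        return true;
    }
    if (!ping(candidate)) return false; // unreachable address: reject, keep old value
    settings.pingIP = candidate;
    return true;
}
```

Rejecting an unreachable address at save time means a mistyped IP cannot put the device into a reboot loop.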
August 4, 2019 at 9:06 pm #1837
idxman01 (Participant)
Sounds like you want to add a watchdog feature. Yeah, I use that on Ubiquiti NanoStations for a P2P link, which is handy.
Anyway it sounds like a wonky edge case between asus AP and OG. Have you considered getting a UniFi AP? I have an AC lite with several SSID’s and vlans with an edge router X SFP at the head.
Do you enable/disable these ssid’s on a regular basis?
August 5, 2019 at 10:06 am #1842
jagos (Participant)
idxman01, there is already a watchdog feature in the OG which checks WiFi.status. My point is that there are cases where it fails to detect the lack of connection. Whether my case is unusual is not the point. As I pointed out in a post above, if there is one such case, who knows what other cases may have been overlooked or not yet found. I am proposing an optional feature which I have boiled down to a fairly simple change that has no impact if not used.
I have an EdgeRouter Lite and, as backup, an EdgeRouter X. I am quite familiar with Ubiquiti’s routers, though I have not used their APs, and I am one of the top 40 solution providers on their forums (different username there). My ASUS works just fine and does what I want. I am not going to buy new hardware which would at best fix the single case I have identified but would do nothing for any other case that disconnects but is not detected.
No, I do not change SSIDs often but again what I am after is a more robust way to ensure auto reconnection.
August 5, 2019 at 4:13 pm #1843
idxman01 (Participant)
lol, I think you forgot to reply in all caps….
I missed where Ray mentioned the connection check and a subsequent reboot, but it sounds like we’re describing the same watchdog behavior. In the nano and other network equipment I’ll have it ping the next hop forward and reboot after X failures/timeframe. (and/or do wan failover) It’s certainly a good option to have, no argument there.
As for unifi, I’m simply mentioning it as having their AP’s is common if you have other ubiquiti gear… I’m not necessarily suggesting you buy more, though having others attempt to replicate the behavior would be helpful. My interest is to not have OG or opensprinkler start dropping on me either regardless of the root cause. Like you said, there’s nothing to guarantee similar behavior wouldn’t occur in other equipment that inadvertently affects OG.
I’m also curious whether anything else changed recently, since you said it has been fine for 11 months. I mean, even if this ping-based watchdog is added and works perfectly, we’re not solving the real problem. (though periodic reboots are better than being offline 😀 )
I have noticed the Blynk app reporting various times the board has been online which changes throughout the day. My immediate concern is that it was dropping, but can’t find any evidence yet. Maybe this is simply an indicator it’s still alive.
As a test I’ve been pinging my OG and OSPr on and off for 10-30 minutes at a time. (1 ping per second, size=1472 over the last few hours) So far they haven’t gone offline and I’m seeing the normal http traffic to blynk as well every 5-10 seconds. I really expected them to fall over based on Ray’s indication of the software-based network stack for that chip. (in other threads) For now I’ve also enabled unifi client statistics and will see if anything jumps out in there or the event logs. This tracks signal, packets, tx attempts, etc…
At this point I only have 3 devices in this IoT VLAN without any special client isolation or broadcast settings. Just basic firewall rules to keep this traffic isolated.
top 40 poster: 😀 ……………..