Lone Wolf Development Forums

SteveT July 30th, 2020 06:12 PM

[COMPLETED] 5 Mins Planned downtime Thurs (07/30) @9 PM Pacific (4 AM UTC 7/31)
 
We are planning a quick hotfix deployment for Hero Lab Online on Thursday (July 30th) at 9 PM Pacific (4 AM UTC 7/31). We expect downtime to be 5 minutes or less.

EDIT: This has been completed.

flyteach July 30th, 2020 08:03 PM

HLO Down
 
No notice, and right in the middle of Gencon sessions.

rob July 30th, 2020 08:09 PM

Within the product itself, there should have appeared a notice regarding the planned outage. That notice should have appeared roughly TWO HOURS prior to the time and provided an ongoing update of when the outage would occur.

The actual outage lasted for only a few minutes. And given the two hours of advance notice within the product, it should have been practical for GMs to plan for a 5-minute "bio break" at 9pm.

We are in the middle of GenCon, so there is simply no good time to deploy anything. However, there were issues that absolutely needed to be addressed. So we waited until later in the night as a "less bad" option.

Parody July 30th, 2020 08:51 PM

As with PaizoCon Online, anyone can go look at Gen Con Online's list of events. Paizo (who is organizing the majority of Starfinder and Pathfinder 2nd Edition events) runs their Thursday/Friday/Saturday events starting at 8 AM, 2 PM, and 8 PM Eastern. Slots are 5 hours long, so shutting off the server at 9 PM Pacific (Midnight Eastern, 4 hours into the slot) meant you probably interrupted the final encounter of a bunch of events.
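
The time zone arithmetic above can be checked with a short Python sketch. It assumes fixed daylight-time offsets for late July 2020 (PDT = UTC-7, EDT = UTC-4) and uses only the Thursday slot starts and 5-hour slot length listed in this post; the function name `hours_into_slot` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed summer offsets, valid for July 30-31, 2020.
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")

SLOT_LENGTH = timedelta(hours=5)
# Slot start hours in Eastern time, as listed in the post above.
SLOT_START_HOURS = (8, 14, 20)


def hours_into_slot(outage):
    """Return (slot_start, elapsed) if the outage falls inside a slot, else None."""
    outage_et = outage.astimezone(EDT)
    # Check the outage's Eastern calendar day and the day before,
    # because an 8 PM slot runs past midnight.
    for day_offset in (0, -1):
        day = outage_et.date() + timedelta(days=day_offset)
        for hour in SLOT_START_HOURS:
            start = datetime(day.year, day.month, day.day, hour, tzinfo=EDT)
            if start <= outage_et < start + SLOT_LENGTH:
                return start, outage_et - start
    return None


outage = datetime(2020, 7, 30, 21, 0, tzinfo=PDT)  # 9 PM Pacific, Thursday
print(outage.astimezone(timezone.utc))  # 2020-07-31 04:00:00+00:00
print(outage.astimezone(EDT))           # 2020-07-31 00:00:00-04:00
slot_start, elapsed = hours_into_slot(outage)
print(slot_start, elapsed)              # 2020-07-30 20:00:00-04:00 4:00:00
```

Under those assumptions, the outage lands 4 hours into the 8 PM Eastern slot, which matches the numbers above.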

Sunday's events start at 9 AM Eastern, if there's an emergency.

I played in our normal non-virtual game (with paper character sheets!) tonight so I don't know what warnings went out.

rob July 30th, 2020 10:29 PM

The information we had showed numerous games going late into the night. So there was simply no "good" time to do it. We consulted with the person on staff most familiar with the convention gaming schedule (the rest of us have been working round the clock), and she didn't flag a conflict with the outage timing. So we did our best to pick a "less bad" time, and it sounds like we could have been more thorough. I apologize for that.

We have literally been working around the clock to get everything into place for GenCon. And to address the rough edges over the past couple of days for things we didn't catch during our own testing. There are limits to what a tiny team like ours can achieve, and I'm proud of what we've managed to put into place this week. It hasn't been perfect, but it's been pretty darn good.

We're all exhausted on this end. I hope everybody has a great weekend gaming and that all of that hard work pays off overall.

rob July 30th, 2020 10:32 PM

Addendum: If there are things we can do to improve the outage notification mechanism, please share your suggestions. We've striven to achieve a balance that accurately conveys upcoming outages without being obtrusive. If we need to adjust that balance, or if there's a use-case we haven't covered adequately, we can make the appropriate changes.

slate July 30th, 2020 11:17 PM

Hey Rob,

Not knowing your infrastructure, is it possible that you could spin up a second front-end cluster, deploy the update to it, drain traffic from A to B, and then remove A?

Obviously, if there are DB migrations, this might be less ideal and would require a lot more heavy lifting.
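
To make the idea concrete, here is a toy Python sketch of the A-to-B drain for a code-only hotfix. It is purely illustrative and says nothing about how HLO is actually built: the `Cluster` class, the weights, and the session counts are all hypothetical, and a real setup would do this through a load balancer rather than in-process objects.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    version: str
    weight: int = 0           # share of NEW traffic routed here, 0-100
    active_sessions: int = 0  # sessions already connected to this cluster


def shift_traffic(old, new, step=25):
    """Move new-traffic weight from old to new in steps, yielding each state."""
    while old.weight > 0:
        moved = min(step, old.weight)
        old.weight -= moved
        new.weight += moved
        yield old.weight, new.weight


def can_retire(cluster):
    """A cluster is safe to remove once it gets no traffic and has no sessions."""
    return cluster.weight == 0 and cluster.active_sessions == 0


a = Cluster("A", version="current", weight=100, active_sessions=3)
b = Cluster("B", version="hotfix")  # second cluster with the update deployed

steps = list(shift_traffic(a, b))
print(steps)          # [(75, 25), (50, 50), (25, 75), (0, 100)]
print(can_retire(a))  # False: existing sessions on A have not finished yet
a.active_sessions = 0
print(can_retire(a))  # True
```

The sketch ignores DB migrations entirely, which is exactly the part that needs the heavy lifting.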

flyteach July 31st, 2020 05:25 AM

Rob,
We had 4 in our group, all using HLO. If there was a warning, none of us caught it. Maybe making it a persistent toast until we close it? I know I've seen it in the past, but didn't seem to yesterday, when it was most critical.
Also, yeah, the convention schedule, just like at PaizoCon, has been out for weeks and is very public. Please at least consider going outside of the main 5-hour blocks as @Parody suggests.
And there are no release notes, so we don't even know what was fixed.

rob July 31st, 2020 02:09 PM

Quote:

Originally Posted by slate (Post 290000)
Not knowing your infrastructure, is it possible that you could spin up a second front-end cluster, deploy the update to it, drain traffic from A to B, and then remove A?

Obviously, if there are DB migrations, this might be less ideal and would require a lot more heavy lifting.

You make it sound so easy when you say it like that! ;)

This was something I wanted in place for more than a year. Alas, I then found out the server code had to be completely rewritten (see my comments here for more info). During the rewrite process, I've probably put about 50% of the necessary infrastructure into place to accomplish this, but there's still a meaningful chunk of work left to do. And then a TON of testing.

As you surmised, an additional factor has been that most releases (aside from these GenCon hotfixes) entail a bunch of database changes to incorporate the new capabilities we've been steadily adding. That increases the complexity greatly, and definitely wouldn't be supported at first, but we could still use the transition approach for hotfixes that are code changes only, like we've needed the past few days.

So it's definitely something I want to do - and have been working towards in pieces - but we're not there yet. My goal is to be there by the end of the year, finishing up the missing pieces interspersed with all the other new stuff that's in the queue. :)

rob July 31st, 2020 02:12 PM

Quote:

Originally Posted by flyteach (Post 290007)
Rob,
We had 4 in our group, all using HLO. If there was a warning, none of us caught it. Maybe making it a persistent toast until we close it? I know I've seen it in the past, but didn't seem to yesterday, when it was most critical.
Also, yeah, the convention schedule, just like at PaizoCon, has been out for weeks and is very public. Please at least consider going outside of the main 5-hour blocks as @Parody suggests.
And there are no release notes, so we don't even know what was fixed.

We'll be changing the behavior to make the toast persistent henceforth.

We're gonna figure out a better way to get the convention game schedule clearly known by the dev team in the future.

The release notes went out this morning. We were wiped yesterday. The release notes were properly staged in advance, but we forgot to unveil them once the hotfix was officially deployed.

flyteach July 31st, 2020 03:30 PM

Rob, thanks. Yeah, I think a persistent toast would be nice. You could also get rid of the persistent toast about multiple logins. I'd think that one only needs to be there for 5 or 10 seconds. Right now, I have to x it each time it comes up. I'll also suggest that the system stabilize for the week of a premium convention. It would certainly prevent last-minute changes, especially during the final battle. While it's nice for a few to have nice shiny things on day 1, the rest of us have to contend with the fallout of any issues, usually manifesting in several outages over the past couple of years.
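
The two toast behaviors being discussed (persistent until closed vs. auto-dismiss after a few seconds) could be modeled roughly like this. It is only a sketch in Python, not HLO's actual client code; the `Toast` class, its fields, and the message text are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Toast:
    message: str
    shown_at: float                  # seconds on any monotonic clock
    timeout: Optional[float] = None  # None means persistent until dismissed
    dismissed: bool = False

    def visible(self, now):
        """Return True while the toast should still be on screen."""
        if self.dismissed:
            return False
        if self.timeout is None:
            return True
        return now - self.shown_at < self.timeout


# Outage warning: stays up until the user closes it.
outage = Toast("Planned downtime at 9 PM Pacific", shown_at=0.0)
# Multiple-login notice: goes away on its own after 10 seconds.
login = Toast("Multiple logins detected", shown_at=0.0, timeout=10.0)

print(outage.visible(now=3600.0))  # True
print(login.visible(now=5.0))      # True
print(login.visible(now=10.0))     # False
outage.dismissed = True
print(outage.visible(now=3600.0))  # False
```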

rob July 31st, 2020 04:55 PM

Quote:

Originally Posted by flyteach (Post 290043)
I'll also suggest that the system stabilize for the week of a premium convention. It would certainly prevent last-minute changes, especially during the final battle. While it's nice for a few to have nice shiny things on day 1, the rest of us have to contend with the fallout of any issues, usually manifesting in several outages over the past couple of years.

The problem is that we can't release the books until their "street date". And we don't receive the books from the publishers far enough in advance to be able to get them finished in advance. And even if we did get them far enough in advance, we're gonna overlook something in all of our testing that will be uncovered by actual users.

Given that a LOT of users want the nice shiny things the day they're released, it's a no-win situation for us when the book launches in the middle of a big show (e.g. PaizoCon or GenCon).

slate July 31st, 2020 09:59 PM

Quote:

Originally Posted by rob (Post 290028)
You make it sound so easy when you say it like that! ;)

This was something I wanted in place for more than a year. Alas, I then found out the server code had to be completely rewritten (see my comments here for more info). During the rewrite process, I've probably put about 50% of the necessary infrastructure into place to accomplish this, but there's still a meaningful chunk of work left to do. And then a TON of testing.

As you surmised, an additional factor has been that most releases (aside from these GenCon hotfixes) entail a bunch of database changes to incorporate the new capabilities we've been steadily adding. That increases the complexity greatly, and definitely wouldn't be supported at first, but we could still use the transition approach for hotfixes that are code changes only, like we've needed the past few days.

So it's definitely something I want to do - and have been working towards in pieces - but we're not there yet. My goal is to be there by the end of the year, finishing up the missing pieces interspersed with all the other new stuff that's in the queue. :)

Awesome man, as a DevOps engineer for a long time, I feel your pain. It takes a lot to get there and do it correctly. Starting off on the wrong foot didn't help. Glad you're getting closer and closer!

slate July 31st, 2020 10:14 PM

Quote:

Originally Posted by rob (Post 290048)
The problem is that we can't release the books until their "street date". And we don't receive the books from the publishers far enough in advance to be able to get them finished in advance. And even if we did get them far enough in advance, we're gonna overlook something in all of our testing that will be uncovered by actual users.

Given that a LOT of users want the nice shiny things the day they're released, it's a no-win situation for us when the book launches in the middle of a big show (e.g. PaizoCon or GenCon).

I'm with you on this one. It's lose-lose - people are gonna be upset either way.

flyteach August 1st, 2020 05:07 AM

Rob, I guess we'll have to agree to disagree. Sure a LOT of users want shiny the first day. But aren't there a LOT MORE users who want stability during the biggest game convention of the year and not having a session interrupted? Are you saying that the majority of your customer base has already purchased APG?

rob August 1st, 2020 02:18 PM

Quote:

Originally Posted by flyteach (Post 290061)
Rob, I guess we'll have to agree to disagree. Sure a LOT of users want shiny the first day. But aren't there a LOT MORE users who want stability during the biggest game convention of the year and not having a session interrupted? Are you saying that the majority of your customer base has already purchased APG?

Your argument assumes that ALL of our users will use HLO over the GenCon weekend, which is not correct. If the focus is instead on the portion of our users that are actually using HLO over the GenCon weekend, then the answer to your second question may actually be yes. I don't have exact numbers, so I can't say that as an absolute.

The bigger factor to consider is that a large contingent of our users consider it a huge selling point of Hero Lab to always have access to the latest shiny bits the day they get released by the publisher. To them, if we didn't have the new books available the day they become available, then they would view HLO as "not usable" for the ENTIRE GenCon weekend. Which is a whole different calculus compared to a 5-minute outage once in a 24-hour period to deploy a hotfix.

So this is definitely a no-win situation. And we will continue releasing the books on the publisher street dates for the reasons above.

Hopefully, by next GenCon (ideally PaizoCon), we'll have the transparent server transition solution in place, and this will be a non-concern. Everyone will get the books they want on the street date, and there will be no service interruptions for anyone. <fingers crossed>

flyteach August 1st, 2020 02:55 PM

Rob, I will certainly cross my fingers with you. I know it's been a longer, hard road than originally anticipated. OTOH, the new Starfinder book is not available, so I guess you can put me in the box of the user that didn't get new shiny on release day. And yet, HLO is perfectly usable, the same as it was before, albeit without shiny new stuff. But, it's certainly NOT unusable.
Also, FWIW, that 5-minute outage was mainly during the boss fight at a major con. I do appreciate the one last night being outside of main table games.

rob August 1st, 2020 03:27 PM

As for the bad timing of the release on Thursday, it was regrettable. I apologized already, and I'm happy to do it again. :)

FWIW, I circled back on our end and discovered that we weren't given complete information by the publisher regarding event timing. That's why our person who supposedly was "in the know" didn't flag a conflict when we asked her. Could we have double-checked all that information ourselves? Yes, we could have. SHOULD we have needed to double-check it? That's a separate question that we'll be discussing with the publisher. Suffice to say that we were so focused on bug fixing that we (wrongly) assumed we were given accurate and complete information. The rest, as they say, is history.

At least we got our timing corrected for the next night's hotfix. :)


All times are GMT -8.
