Why scale node pools with Google Calendar?¶
The scheduler isn’t perfect for us, especially when large classes have assignments due and a hub is flooded with students. This “hack” was introduced to improve cluster scaling prior to known events.
These ‘placeholder’ nodes are used to minimize the delay that occurs when GCP creates new node pools during mass user logins. This common, especially for larger classes.
Structure¶
There is a Google Calendar calendar, Cal-ICOR Scaling
Events
shared with all infrastructure staff. The event descriptions should
contain a YAML fragment, and are of the form pool_name: count, where the name is the corresponding hub name
(data100, stat20) and the count is the number of extra nodes you want.
There can be several pools defined, one per line.
By default, we usually have one spare node ready to go, so if the count in the calendar event is set to 0 or 1, there will be no change to the cluster. If the value is set to >=2, additional hot spares will be created. If a value is set more than once, the entry with the greater value will be used.
You can determine how many placeholder nodes to have up based on how many people you expect to log in at once. Some of the bigger courses may require 2 or more placeholder nodes, but during “regular” hours, 1 is usually sufficient.
The scaling mechanism is implemented as the
node-placeholder-node-placeholder-scaler deployment within
the node-placeholder namespace. The source code is within
https://
Calendar Autoscaler¶
The code for the calendar autoscaler is a python 3.11 script, located
here:
https://
How the scaler works¶
There is a k8s pod running in the node-placeholder
namespace, which simply runs python3 -m
scaler.
This script runs in an infinite loop, and every 60 seconds checks the
scaler config and calendar for entries. It then uses the highest value
provided as the number of placeholder replicas for any given hub. This
means that if there’s a daily evening event to ‘cool down’ the number
of replicas for all hubs to 0, and a simultaneous event to set one or
more hubs to a higher number, the scaler will see this and keep however
many node placeholders specified up and ready to go.
After determining the number of replicas needed for each hub, the scaler
will create a k8s template and run kubectl in the pod.
Updating the scaler config¶
The scaler config sets the default number of node-placeholders that are running at any given time. These values can be overridden by creating events in the Cal-ICOR Scaling Events calendar.
When classes are in session, these defaults are all typically set to
1, and during breaks (or when a hub is not expected to be
in use) they can be set to 0.
After making changes to values.yaml, create a PR normally
and our CI will push the new config out to the node-placeholder pod.
There is no need to manually restart the node-placeholder pod as the
changes will be picked up
automatically.
Working on, testing and deploying the calendar scaler¶
All file locations in this section will assume that you are in the
cal-icor-hubs/images/node-placeholder-scaler/ directory.
It is strongly recommended that you create a new python 3.11 environment
before doing any dev work on the scaler. With conda, you
can run the following commands to create one:
conda create -ny scalertest python=3.11
pip install -r images/node-placeholder-scaler/requirements.txtAny changes to the scaler code will require you to run
chartpress to redeploy the scaler to GCP.
Here is an example of how you can test any changes to
scaler/calendar.py locally in the python interpreter:
# these tests will use some dates culled from the calendar with varying numbers of events.
import scaler.calendar
import datetime
import zoneinfo
tz = zoneinfo.ZoneInfo(key='America/Los_Angeles')
zero_events_noon_june = datetime.datetime(2023, 6, 14, 12, 0, 0, tzinfo=tz)
one_event_five_pm_april = datetime.datetime(2023, 4, 27, 17, 0, 0, tzinfo=tz)
three_events_eight_thirty_pm_march = datetime.datetime(2023, 3, 6, 20, 30, 0, tzinfo=tz)
calendar = scaler.calendar.get_calendar('https://calendar.google.com/calendar/ical/c_s47m3m1nuj3s81187k3b2b5s5o%40group.calendar.google.com/public/basic.ics')
zero_events = scaler.calendar.get_events(calendar, time=zero_events_noon_june)
one_event = scaler.calendar.get_events(calendar, time=one_event_five_pm_april)
three_events = scaler.calendar.get_events(calendar, time=three_events_eight_thirty_pm_march)
assert len(zero_events) == 0
assert len(one_event) == 1
assert len(three_events) == 3get_events returns a list of ical
ical.event.Event class objects.
The method for testing scaler/scaler.py is similar to
above, but the only things you’ll be able test locally are the
make_deployment() and get_replica_counts()
functions.
When you’re ready, create a PR. The deployment workflow is as follows:
Get all authed-up for
chartpressby performing the documented steps.Run
chartpress --pushfrom the rootcal-icor-hubs/directory. If this succeeds, check yourgit statusand addcal-icor-hubs/node-placeholder/Chart.yamlandcal-icor-hubs/node-placeholder/values.ymlto your PR.Merge to
stagingand thenprod.
Changing python imports¶
The python requirements file is generated using
requirements.in and pip-compile. If you need
to change/add/update any packages, you’ll need to do the following:
Ensure you have the correct python environment activated (see above).
Pip install
pip-toolsEdit
requirements.inand save your changes.Execute
pip-compile requirements.in, which will update therequirements.txt.Check your git status and diffs, and create a pull request if necessary.
Get all authed-up for
chartpressby performing the documented steps.Run
chartpress --pushfrom the rootcal-icor-hubs/directory. If this succeeds, check yourgit statusand addcal-icor-hubs/node-placeholder/Chart.yamlandcal-icor-hubs/node-placeholder/values.ymlto your PR.Merge to
stagingand thenprod.
Monitoring¶
You can monitor the scaling by watching for events:
kubectl -n support get events -wAnd by tailing the logs of the pod with the scalar process:
kubectl -n support logs -l app.kubernetes.io/name=node-placeholder-scaler -fFor example if you set epsilon: 2, you might see in the
pod logs:
2022-10-17 21:36:45,440 Found event Stat20/Epsilon test 2 2022-10-17 14:21 PDT to 15:00 PDT
2022-10-17 21:36:45,441 Overrides: {'epsilon': 2}
2022-10-17 21:36:46,475 Setting epsilon to have 2 replicas