We need to begin using a k8s-managed 'jam-cloud', replacing the current ansible/linode-vm managed approach.
Let's first identify our infrastructure as it relates to jam-cloud.
- postgres 9.3 - our main database. Users, music sessions, connections, and so on.
- redis - used to distribute jobs to 'anyjob' workers, which run rails as resque workers. Holds no state; safe to reboot with loss of data; it's fine.
- rabbitmq - used to relay messages between web nodes and websocket-gateway, for a custom messaging protocol. Also stateless; safe to reboot with loss of data.
- web - the jam-cloud 'web' site: controllers, web UI, etc. It's a rails app. This is at jam-cloud/web.
- admin - another rails app, used by backend administrators. It's our 'control panel'. This is at jam-cloud/admin.
- websocket-gateway - a ruby app that brokers messages between the browser pages hosted by 'web' and rabbitmq. It does support running more than one instance at once. It's CPU bound, rather slow.
- jam-ui - another web site, all frontend, deployed to cloudfront/s3. It is, over time, replacing the frontend of 'web'. It's live, and uses the controllers/endpoints of the web app.
Servers:
- db - runs postgresql, redis, and rabbitmq
- The web nodes run on web1, web2, web3, and web4. A strange thing we've had to do, though: web1/web2/web3 all run the 'develop' branch of jam-cloud, while web4 runs the 'promised_api_branch_iteration' branch. The branches have diverged, so fundamentally we have two web apps that are *very* similar. Call web1/web2/web3 the classic web site, and web4 the new-client site. jam-ui uses the controllers of web1/web2/web3. web4 is only accessed directly by our native client; it's configured as https://www.jamkazam.com:444/client#, so realistically only people using our desktop app ever go to web4. This is why it's called the 'new-client' site.
Roles:
- Job Workers (resque anyjob workers) run on web1/web2/web3, and pick up any jobs sent to them via the redis/resque job queue. They have a connection to redis and the db.
- A Scheduler job, also built as a CLI invocation of 'web', runs on db as well; it triggers an 'HourlyJob' every hour, and a few others on a schedule. (There is an hourly_job.rb if curious.)
We have a staging environment and a production environment. We have both for the 'linode vms', and for an existing k8s environment. The k8s environment is managed mostly in video/video-iac.
Note, the description above of all the nodes is for production. For 'staging', for the linode-vms in particular, there is just one server that runs everything, called 'int'. It also runs jenkins.
So here's how we can attack the migration and remove all the linode servers. Say we want to decommission 'int'; i.e. we need to:
- Implement daily backups on the staging and production database servers. We'd do this in a rather trivial way: create a script, placed on both database servers, that we can invoke over ssh from a cron on a machine in my house; that's good enough. It'd create the database dump and scp it back to my machine. So my machine would be on a cron: run the script over ssh, then scp the output when done.
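The backup step above could be sketched in ruby. The host names, database name, and destination directory here are assumptions, not settled values; a real version would run the commands with system() rather than printing them:

```ruby
#!/usr/bin/env ruby
require 'time'

# Build the pg_dump command we'd run on the database server over ssh.
# Hosts, database name, and paths are hypothetical.
def dump_command(host, db, stamp)
  "ssh #{host} \"pg_dump -Fc #{db} -f /tmp/#{db}-#{stamp}.dump\""
end

# Build the scp command that pulls the finished dump back to this machine.
def fetch_command(host, db, stamp, dest_dir)
  "scp #{host}:/tmp/#{db}-#{stamp}.dump #{dest_dir}/"
end

stamp = Time.now.strftime('%Y-%m-%d')
%w[db.staging.jamkazam.com db.jamkazam.com].each do |host|
  # A real version would run these with system(); here we just print them.
  puts dump_command(host, 'jamcloud', stamp)
  puts fetch_command(host, 'jamcloud', stamp, File.expand_path('~/backups'))
end
```

The crontab on my machine would then invoke this daily; `-Fc` gives a compressed custom-format dump that restores via pg_restore.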
- Get latest redis running on the staging k8s (again, video/video-iac would house this; won't repeat that).
  - Figure out how to stop redis on int, and have everything use the redis at k8s.
  - We may need to start a very old redis instead, if the existing code on int can't use newer redis.
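As a sketch, the staging redis in video/video-iac could look something like this; the namespace, labels, and image tag are all assumptions:

```yaml
# Hypothetical manifest; namespace, names, and image tag are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: jam-cloud-stg
spec:
  replicas: 1
  selector:
    matchLabels: {app: redis}
  template:
    metadata:
      labels: {app: redis}
    spec:
      containers:
        - name: redis
          image: redis:7   # may need a much older tag if int's gems can't use it
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: jam-cloud-stg
spec:
  selector: {app: redis}
  ports:
    - port: 6379
```

The rabbitmq step below would follow the same Deployment + Service shape.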
- Get latest rabbitmq running on the staging k8s.
  - Figure out how to stop rabbitmq on int, and have everything use the rabbitmq at k8s.
  - We may need to start a very old rabbitmq instead, if the existing code on int can't use newer rabbitmq.
- Get the rails 'admin' site functionality running in k8s.
  - The admin site needs to connect to the 9.3 postgresql running on int, and MAYBE redis. It currently gets 'mounted' behind 'web' on int (same domain name, behind a path prefix); we may need to temporarily expose it directly, say as admin.staging.jamkazam.com, so we can just vet that it works.
  - Turn off admin on staging.
- Get the rails 'web' site functionality (not the job workers yet) running in k8s.
  - The web site needs to connect to the 9.3 postgresql running on int, and also rabbitmq and redis.
  - Point the linode loadbalancer for staging.jamkazam.com, ports :443 and :80, to the k8s ingress.
  - Stop the 'web' site on int; all normal web traffic should then be working out of k8s.
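For the staging.jamkazam.com cut-over, a minimal Ingress sketch; the ingress class, service name, and namespace are assumptions about how video/video-iac is laid out:

```yaml
# Hypothetical Ingress; class, names, and namespace are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  namespace: jam-cloud-stg
spec:
  rules:
    - host: staging.jamkazam.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port: {number: 80}
```

The linode loadbalancer would then forward :443 and :80 to whatever node port or address backs this ingress controller.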
- Get the websocket-gateway into k8s.
  - Needs rabbitmq and db access.
  - It listens on a particular port, I believe 6767, that I'll have to divert from the linode loadbalancer to the ingress of the k8s cluster.
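Since the linode loadbalancer stays in front for now, one way to divert that port is a NodePort Service; the nodePort value, labels, and namespace here are assumptions:

```yaml
# Hypothetical Service; nodePort value, labels, and namespace are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: websocket-gateway
  namespace: jam-cloud-stg
spec:
  type: NodePort
  selector: {app: websocket-gateway}
  ports:
    - port: 6767
      targetPort: 6767
      nodePort: 30767   # linode LB :6767 would forward to this node port
```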
- Get the scheduler out of int and into k8s. The scheduled jobs should then continue to start.
  - Scheduler needs redis and db access.
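One natural shape for the scheduler in k8s is a CronJob that makes the CLI invocation of 'web' on a schedule. The image name and the exact invocation of HourlyJob are assumptions (the doc only says there's an hourly_job.rb):

```yaml
# Hypothetical CronJob; image and the HourlyJob invocation are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scheduler-hourly
  namespace: jam-cloud-stg
spec:
  schedule: "0 * * * *"   # top of every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: scheduler
              image: jam-cloud/web:staging
              command: ["bundle", "exec", "rails", "runner", "HourlyJob.perform"]
```

The other scheduled jobs would each get their own schedule line, or we keep the existing scheduler loop running as a plain Deployment if splitting it up is too invasive.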
- Get the job workers out of int and into k8s.
  - Jobs need redis and db access.
- Consider removing the loadbalancer at linode and moving DNS to point at something in k8s. Unsure if we should do this right now, or ever.
The 1st phase of development is going to skip CI/CD entirely, or to the greatest extent possible.
We are going to focus on a local, iteration-driven flow: say I make a change locally in video/video-iac that includes k8s changes related to any of the above. I should always have a local terminal command I can run to sync that straight to the cluster and have the change take effect immediately.
We want to codify all operations done from my machine under a jkctl master command. This command will have many useful subcommands, such as tailing logs or other operational helpers. One such command would be jkctl sync k8s, which would sync the k8s configuration related to jam-cloud (for now leaving alone the rest of the video/video-iac cluster). jkctl backup db is another. So that's the actual 1st command we'd build.
Let's make jkctl ruby-based for now, given the order above.
jkctl will take a --stg or --prd flag so it knows which environment it's targeting.
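A minimal ruby sketch of the jkctl dispatcher, assuming only the two subcommands named above; the overlay path and backup invocation are hypothetical:

```ruby
#!/usr/bin/env ruby

# Pull the --stg/--prd environment flag out of the argument list.
def parse_env(args)
  return :stg if args.delete('--stg')
  return :prd if args.delete('--prd')
  abort 'jkctl: pass --stg or --prd'
end

# Dispatch a subcommand; returns the shell command a real version
# would execute via system(). Paths and host names are hypothetical.
def dispatch(args)
  env = parse_env(args)
  case args
  in ['sync', 'k8s']
    "kubectl apply -k video-iac/jam-cloud/overlays/#{env}"
  in ['backup', 'db']
    "ssh db.#{env}.jamkazam.com ./backup_db.sh"
  else
    abort "jkctl: unknown subcommand: #{args.join(' ')}"
  end
end

puts dispatch(ARGV) unless ARGV.empty?
```

Usage would look like `jkctl sync k8s --stg`; new operational helpers (log tailing, etc.) become new `in` branches.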
Other acceptable build technologies are dagger and gitea. We will retire jenkins, but that's one of the last steps; we'll rebuild any pipelines we need via dagger/gitea, focusing on local use of dagger anyway, when useful.
When we are done, we'll need to do a similar cut-over for production. It won't be much different, and we can make a plan then, once we've gotten through this staging part successfully.