infra-migration
We need to move to a k8s-managed 'jam-cloud', replacing the current ansible/linode-vm managed approach.

Let's first identify our infrastructure as it relates to jam-cloud.
1. postgres 9.3 - our main database: users, music sessions, connections, and so on.

2. redis - used to distribute jobs to 'anyjob' workers running rails as resque workers. Has no state; safe to reboot with loss of data.

3. rabbitmq - used to relay messages between the web nodes and websocket-gateway, for a custom messaging protocol. Also stateless; safe to reboot with loss of data.

4. web - the jam-cloud 'web' site: controllers, web UI, etc. It's a rails app, at jam-cloud/web.

5. admin - another rails app, used for backend administration; it's our 'control panel'. At jam-cloud/admin.

6. websocket-gateway - a ruby app that brokers messages between the browser pages hosted by 'web' and rabbitmq. It does support running more than one instance at once. It's CPU bound, rather slow.

7. jam-ui - another web site, all frontend, deployed to cloudfront/s3. It is, over time, replacing the frontend of 'web'. It's live, and uses the controllers/endpoints of the web app.
Servers:

* **db** - runs postgresql, redis, and rabbitmq

* The web nodes run on **web1**, **web2**, **web3**, and **web4**. One strange thing we've had to do, though: web1/web2/web3 all run the 'develop' branch of jam-cloud, while web4 runs the 'promised_api_branch_iteration' branch. So they have diverged, but fundamentally we now have two web apps that are *very* similar. Call web1/web2/web3 the classic web site, and web4 the new-client site. jam-ui uses the controllers of web1/web2/web3. web4 is only directly accessed by our native client; it's configured at https://www.jamkazam.com:444/client#, so realistically only people using our desktop app ever go to web4. This is why it's called the 'new-client' site.
Roles:

* Job Workers (resque anyjob workers) run on web1/web2/web3 and pick up any jobs sent to them via the redis/resque job queue. They have connections to redis and the db.

* A Scheduler job, also built as a CLI invocation of 'web', runs on **db** as well; it triggers an 'HourlyJob' every hour, and a few other jobs on a schedule. (There is an hourly_job.rb if curious.)
We have a staging environment and a production environment. This is true both for the 'linode vms' and for an existing k8s environment. The k8s environment is managed mostly in video/video-iac.

Note, the description above of all the nodes is for production. For staging, for the linode-vms in particular, there is just one server that runs everything, called 'int'. It also runs jenkins.
So here's how we can attack the migration and remove all the 'linode' pieces.

Say we want to decommission 'int'. We need to:
1. Implement daily backups on the staging and production database servers. We'd do this in a rather trivial way: create a script, placed on both database servers, that we can invoke over ssh from a cron on a machine in my house; that's good enough. It creates the database dump, which we scp back to my machine. So my machine runs the cron, invokes the script over ssh, and scps the output back when done.
2. Get the latest redis running on the staging k8s (again, video/video-iac would house this; won't repeat that for each item below).

* Figure out how to stop redis on `int` and have everything use the redis in k8s.

* We may need to start a very old redis instead, if the existing code on `int` can't use a newer redis.

3. Get the latest rabbitmq running on the staging k8s.

* Figure out how to stop rabbitmq on `int` and have everything use the rabbitmq in k8s.

* We may need to start a very old rabbitmq instead, if the existing code on `int` can't use a newer rabbitmq.
4. Get the rails 'admin' site functionality running in k8s.

* The admin site needs to connect to the 9.3 postgresql running on `int`, and MAYBE redis. It currently gets 'mounted' behind 'web' on `int` (same domain name, behind a path prefix); we may need to temporarily expose it directly, say at admin.staging.jamkazam.com, so we can vet that it works.

* Turn off `admin` on `int` for staging.
5. Get the rails 'web' site functionality (not the job workers yet) running in k8s.

* The web site needs to connect to the 9.3 postgresql running on `int`, and also rabbitmq and redis.

* Point the linode loadbalancer for staging.jamkazam.com:443 and :80 at the k8s ingress.

* Stop the 'web' site on `int`; all normal web traffic should then be served out of k8s.
6. Get the `websocket-gateway` into k8s.

* Needs rabbitmq and db access.

* It uses a particular port, I believe 6767, that we'll have to divert from the linode loadbalancer to the k8s ingress.
7. Get the scheduler out of `int` and into k8s. The scheduled jobs should then continue to start.

* The scheduler needs redis and db access.
8. Get the job workers out of `int` and into k8s.

* Jobs need redis and db access.

* Consider removing the loadbalancer at linode and moving DNS to point somewhere in k8s. Unsure if we should do this right now, or ever.
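As a rough sketch of the step-1 backup flow: a cron on the home machine runs a small driver that invokes the dump over ssh and copies it back with scp. Hostnames, the database name, and paths below are illustrative assumptions, not our real values.

```ruby
#!/usr/bin/env ruby
# Sketch only: builds the ssh/scp commands for the daily backup flow.
# "stg-db.example", "prod-db.example", "jamcloud", and "/backups" are
# placeholder names, not the real hosts/db.

# Build the remote dump command and the copy-back command for one host.
def backup_commands(host, db, dest_dir)
  stamp  = Time.now.strftime("%Y%m%d")
  remote = "/tmp/#{db}-#{stamp}.sql.gz"
  dump   = "ssh #{host} \"pg_dump #{db} | gzip > #{remote}\""
  copy   = "scp #{host}:#{remote} #{dest_dir}/"
  [dump, copy]
end

if __FILE__ == $0
  # One run per environment's database server, e.g. from cron.
  %w[stg-db.example prod-db.example].each do |host|
    backup_commands(host, "jamcloud", "/backups").each do |cmd|
      puts cmd # dry-run; a real version would system(cmd) and abort on failure
    end
  end
end
```

Keeping the command construction in a function makes the dry-run trivially checkable before wiring it into cron.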
The first pass of development here is going to skip CI/CD entirely, or to the greatest extent possible.

We are going to focus on a local, iteration-driven flow: say I make a change locally in video/video-iac that includes k8s changes related to any of the above. I should always have a local terminal command I can run to sync that straight to the cluster and have the change take effect immediately.
We want to codify all operations done from my machine under a `jkctl` master command. This command will have many useful subcommands, such as tailing logs and other operational helpers. One such command would be `jkctl sync k8s`, which would sync the k8s configuration related to jam-cloud (for now, leaving the rest of the video/video-iac cluster alone). `jkctl backup db` is another. So that's the actual first command we'd build.

Let's make `jkctl` ruby-based for now, given the order above.

`jkctl` will take a --stg or --prd flag so it knows which environment it is operating on.
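A minimal skeleton for `jkctl` could look like the following. The kubectl context names, the `video-iac/jam-cloud/` path, and the `backup_db.sh` script name are assumptions for illustration; only the `--stg`/`--prd` flags and the `sync k8s`/`backup db` subcommands come from the plan above.

```ruby
#!/usr/bin/env ruby
# Hypothetical jkctl skeleton: parse the environment flag, then dispatch
# subcommands. Returns the underlying command as a string so the dispatch
# logic is easy to test; a real jkctl would system() it.

def parse(argv)
  env  = nil
  rest = []
  argv.each do |arg|
    case arg
    when "--stg" then env = "staging"
    when "--prd" then env = "production"
    else rest << arg
    end
  end
  raise ArgumentError, "jkctl needs --stg or --prd" unless env
  [env, rest]
end

def jkctl(argv)
  env, rest = parse(argv)
  case rest
  when %w[sync k8s]
    # Apply only the jam-cloud part of video/video-iac to the cluster,
    # leaving the rest of the cluster's config alone (path is assumed).
    "kubectl --context #{env} apply -k video-iac/jam-cloud/"
  when %w[backup db]
    # Invokes the step-1 backup script on the env's db server (name assumed).
    "ssh #{env}-db ./backup_db.sh"
  else
    raise ArgumentError, "unknown subcommand: #{rest.join(' ')}"
  end
end
```

New operational helpers (log tailing, etc.) then become additional `when` branches, so every machine-local operation stays discoverable under one command.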
Other acceptable build technologies will be dagger and gitea. We will retire the use of jenkins, but that's one of the last steps; we'll rebuild any pipelines we need via dagger/gitea, focusing on local use of dagger anyway, where useful.
When we are done, we'll need to do a similar cut-over for production. It won't be much different, and we can put together a plan for it once we've gotten through this staging part successfully.