diff --git a/web/ai/specs/infra-migration.md b/web/ai/specs/infra-migration.md
new file mode 100644
index 000000000..976d1d3ca
--- /dev/null
+++ b/web/ai/specs/infra-migration.md
@@ -0,0 +1,74 @@

We need to begin using a k8s-managed 'jam-cloud' in place of the current ansible/linode-VM managed approach.

Let's first identify our infrastructure as it relates to jam-cloud.

1. postgres 9.3 - our main database. Users, music sessions, connections, and so on.
2. redis - used to distribute jobs to 'anyjob' workers running rails as resque workers. Holds no durable state; safe to reboot with loss of data.
3. rabbitmq - used to relay messages between web nodes and websocket-gateway, for a custom messaging protocol. Also stateless; safe to reboot with loss of data.
4. web - the jam-cloud 'web' site: controllers, web UI, etc. It's a rails app, at jam-cloud/web.
5. admin - another rails app, used by backend administrators. It's our 'control panel', at jam-cloud/admin.
6. websocket-gateway - a ruby app that brokers messages between the browser pages hosted by 'web' and rabbitmq. It does support running more than one instance at once. It's CPU-bound and rather slow.
7. jam-ui - another web site, all frontend, deployed to cloudfront/s3. It is, over time, replacing the frontend of 'web'. It's live, and uses the controllers/endpoints of the web app.

Servers:
* **db** - runs postgresql, redis, and rabbitmq.
* The web nodes run on **web1**, **web2**, **web3**, and **web4**. One strange thing we've had to do: web1/web2/web3 all run the 'develop' branch of jam-cloud, while web4 runs the 'promised_api_branch_iteration' branch. They have diverged, so fundamentally we have two web apps that are *very* similar. Call web1/web2/web3 the classic web site, and web4 the new-client site. jam-ui uses the controllers of web1/web2/web3.
web4 is only directly accessed by our native client; it's configured to use https://www.jamkazam.com:444/client#, so realistically only people using our desktop app ever reach web4. This is why it's called the 'new-client' site.

Roles:
* Job workers (resque anyjob workers) run on web1/web2/web3 and pick up any jobs sent to them from the redis/resque job queue. They have connections to redis and the db.
* A scheduler job, also built as a CLI invocation of 'web', runs on **db** as well; it triggers an 'HourlyJob' every hour, plus a few others on a schedule. (There is an hourly_job.rb if curious.)

We have a staging environment and a production environment. We have both for the linode VMs and for an existing k8s environment. The k8s environment is managed mostly in video/video-iac.

Note, the description above of all the nodes is for production. For staging, for the linode VMs in particular, there is just one server that runs everything, called 'int'. It also runs jenkins.

So here's how we can attack the migration and remove all the linode pieces.

Say we want to decommission 'int'; i.e. we need to:

1. Implement daily backups on the staging and production database servers. We'd do this in a rather trivial way: create a script, placed on both database servers, that we can invoke over ssh; that's good enough. My machine at home would be on a cron: it would run the script over ssh, the script would create the database dump, and then my machine would scp the output back when done.
2. Get the latest redis running on the staging k8s (again, video/video-iac would house this; won't repeat that).
   * Figure out how to stop redis on `int` and have everything use the redis in k8s.
   * We may need to start a very old redis if the existing code on `int` can't use newer redis.
3. Get the latest rabbitmq running on the staging k8s.
   * Figure out how to stop rabbitmq on `int` and have everything use the rabbitmq in k8s.
   * We may need to start a very old rabbitmq if the existing code on `int` can't use newer rabbitmq.
4. Get the rails 'admin' site functionality running in k8s.
   * The admin site needs to connect to the 9.3 postgresql running on `int`, and redis, and MAYBE rabbitmq. It currently gets 'mounted' behind 'web' on `int` (same domain name, behind a path prefix); we may need to temporarily expose it directly, say at admin.staging.jamkazam.com, so we can just vet that it works.
   * Turn off `admin` on staging.
5. Get the rails 'web' site functionality (not the job workers yet) running in k8s.
   * The web site needs to connect to the 9.3 postgresql running on `int`, and also rabbitmq and redis.
   * Point the linode loadbalancer for staging.jamkazam.com:443 and :80 at the k8s ingress.
   * Stop the 'web' site on `int`; all normal web traffic should now be served out of k8s.
6. Get `websocket-gateway` into k8s.
   * Needs rabbitmq and db access.
   * It has a particular port, I believe 6767, that I'll have to divert from the linode loadbalancer to the ingress of the k8s cluster.
7. Get the scheduler off `int` and into k8s. The scheduled jobs should then continue to start.
   * The scheduler needs redis and db access.
8. Get the job workers off `int` and into k8s.
   * Jobs need redis and db access.
   * Consider removing the loadbalancer at linode and moving DNS to point somewhere in k8s. Unsure if we should do this right now, or ever.

The first development pass is going to skip CI/CD entirely, or to the greatest extent possible.

We are going to focus on a local, iteration-driven flow: say I make a change locally in video/video-iac that has k8s changes related to any of the above. I should always have a local terminal command I can run to sync that straight to the cluster and have the change take effect immediately.
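That local sync step could be as small as the sketch below, assuming the jam-cloud manifests live in a per-environment kustomize overlay inside video/video-iac. The `KUSTOMIZE_ROOT` path and the `jam-stg`/`jam-prd` kubeconfig context names are illustrative assumptions, not real inventory:

```ruby
#!/usr/bin/env ruby
# Sketch: push local jam-cloud k8s manifests straight to a cluster.
# KUSTOMIZE_ROOT and the kubeconfig context names are assumptions.
KUSTOMIZE_ROOT = "video/video-iac/jam-cloud/overlays".freeze

# Build the kubectl invocation for an environment ("stg" or "prd").
def sync_command(env)
  raise ArgumentError, "env must be stg or prd" unless %w[stg prd].include?(env)
  ["kubectl", "--context", "jam-#{env}",
   "apply", "-k", "#{KUSTOMIZE_ROOT}/#{env}"]
end

# When invoked directly, print the command rather than running it.
puts sync_command(ARGV.fetch(0, "stg")).join(" ") if __FILE__ == $PROGRAM_NAME
```

`kubectl apply -k` applies a whole kustomize directory in one shot, which keeps "sync everything related to jam-cloud" a single idempotent command.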
We want to codify all operations done from my machine behind a `jkctl` master command. This command will have many useful subcommands, such as tailing logs or other operational helpers. One such command would be `jkctl sync k8s`, which would sync the k8s configuration related to jam-cloud (for now leaving the rest of the video/video-iac cluster alone). `jkctl backup db` is another. So that's the actual first command we'd build.

Let's make `jkctl` ruby-based for now, given the order above.

jkctl will take a --stg or --prd flag so it knows which environment it is targeting.

Other acceptable build technologies are dagger and gitea. We will retire use of jenkins, but that's one of the last steps; we'll rebuild any pipelines we need via dagger/gitea, focusing on local use of dagger anyway, when useful.

When we are done, we'll need to do a similar cut-over for production. It won't be much different, and we can make a plan then, once we've gotten through this staging part successfully.
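A skeleton for `jkctl` itself might look like the following Ruby sketch: it parses --stg/--prd, dispatches on the subcommand, and returns a description of what it would do. The subcommand bodies are placeholders; the real ones would shell out to kubectl, ssh, and scp:

```ruby
#!/usr/bin/env ruby
# Sketch of the `jkctl` master command. Subcommand bodies are
# placeholders; real ones would shell out to kubectl/ssh/scp.
require "optparse"

# Parse --stg/--prd; return [env, remaining args].
def parse(argv)
  env = nil
  rest = OptionParser.new do |o|
    o.on("--stg") { env = "stg" }
    o.on("--prd") { env = "prd" }
  end.parse(argv)
  raise ArgumentError, "pass --stg or --prd" unless env
  [env, rest]
end

# Dispatch a subcommand for the chosen environment.
def run(argv)
  env, rest = parse(argv)
  case rest.join(" ")
  when "sync k8s"
    # would run: kubectl apply -k <jam-cloud overlay for env>
    "syncing jam-cloud k8s config (#{env})"
  when "backup db"
    # would run: ssh the db host to dump, then scp the dump back
    "backing up database (#{env})"
  else
    raise ArgumentError, "unknown subcommand: #{rest.join(' ')}"
  end
end

puts run(ARGV) if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
```

Usage would look like `jkctl --stg sync k8s` or `jkctl --prd backup db`; new operational helpers become new `when` branches (or, later, a proper subcommand registry).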