infra-migration
We need to move to a k8s-managed 'jam-cloud', replacing the current ansible/linode-vm managed approach.

Let's first identify our infrastructure as it relates to jam-cloud.
1. postgres 9.3 - our main database: users, music sessions, connections, and so on.

2. redis - used to distribute jobs to 'anyjob' workers running rails as resque workers. Has no state; safe to reboot with loss of data.

3. rabbitmq - used to relay messages between the web nodes and websocket-gateway, for a custom messaging protocol. Also stateless; safe to reboot with loss of data.

4. web - the jam-cloud 'web' site: controllers, web UI, etc. It's a rails app, at jam-cloud/web.

5. admin - another rails app, used for backend administration; it's our 'control panel'. At jam-cloud/admin.

6. websocket-gateway - a ruby app that brokers messages between the browser pages hosted by 'web' and rabbitmq. It does support running more than one instance at once. It's CPU bound, rather slow.

7. jam-ui - another web site, all frontend, deployed to cloudfront/s3. It is, over time, replacing the frontend of 'web'. It's live, and uses the controllers/endpoints of the web app.
Servers:

* **db** - runs postgresql, redis, and rabbitmq

* The web nodes run on **web1**, **web2**, **web3**, and **web4**. One strange thing we've had to do, though: web1/web2/web3 all run the 'develop' branch of jam-cloud, while web4 runs the 'promised_api_branch_iteration' branch. So they have diverged, but fundamentally we now have two web apps that are *very* similar. Call web1/web2/web3 the classic web site, and web4 the new-client site. jam-ui uses the controllers of web1/web2/web3. web4 is only directly accessed by our native client; it's configured at https://www.jamkazam.com:444/client#, so realistically only people using our desktop app ever go to web4. This is why it's called the 'new-client' site.
Roles:

* Job Workers (resque anyjob workers) run on web1/web2/web3 and pick up any jobs sent to them via the redis/resque job queue. They have connections to redis and the db.

* A Scheduler job, also built as a CLI invocation of 'web', runs on **db** as well; it triggers an 'HourlyJob' every hour, and a few other jobs on a schedule. (There is an hourly_job.rb if curious.)
We have a staging environment and a production environment. This is true both for the 'linode vms' and for an existing k8s environment. The k8s environment is managed mostly in video/video-iac.

Note, the description above of all the nodes is for production. For staging, for the linode-vms in particular, there is just one server that runs everything, called 'int'. It also runs jenkins.
So here's how we can attack the migration and remove all the 'linode' pieces.

Say we want to decommission 'int'. We need to:
1. Implement daily backups on the staging and production database servers. We'd do this in a rather trivial way: create a script, placed on both database servers, that we can invoke over ssh from a cron on a machine in my house; that's good enough. It creates the database dump, which we scp back to my machine. So my machine runs the cron, invokes the script over ssh, and scps the output back when done.
2. Get the latest redis running on the staging k8s (again, video/video-iac would house this; won't repeat that for each item below).

* Figure out how to stop redis on `int` and have everything use the redis in k8s.

* We may need to start a very old redis instead, if the existing code on `int` can't use a newer redis.

3. Get the latest rabbitmq running on the staging k8s.

* Figure out how to stop rabbitmq on `int` and have everything use the rabbitmq in k8s.

* We may need to start a very old rabbitmq instead, if the existing code on `int` can't use a newer rabbitmq.
4. Get the rails 'admin' site functionality running in k8s.

* The admin site needs to connect to the 9.3 postgresql running on `int`, and MAYBE redis. It currently gets 'mounted' behind 'web' on `int` (same domain name, behind a path prefix); we may need to temporarily expose it directly, say at admin.staging.jamkazam.com, so we can vet that it works.

* Turn off `admin` on `int` for staging.
5. Get the rails 'web' site functionality (not the job workers yet) running in k8s.

* The web site needs to connect to the 9.3 postgresql running on `int`, and also rabbitmq and redis.

* Point the linode loadbalancer for staging.jamkazam.com:443 and :80 at the k8s ingress.

* Stop the 'web' site on `int`; all normal web traffic should then be served out of k8s.
6. Get the `websocket-gateway` into k8s.

* Needs rabbitmq and db access.

* It uses a particular port, I believe 6767, that we'll have to divert from the linode loadbalancer to the k8s ingress.
7. Get the scheduler out of `int` and into k8s. The scheduled jobs should then continue to start.

* The scheduler needs redis and db access.
8. Get the job workers out of `int` and into k8s.

* Jobs need redis and db access.

* Consider removing the loadbalancer at linode and moving DNS to point somewhere in k8s. Unsure if we should do this right now, or ever.
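As a rough sketch of the step-1 backup flow: a cron on the home machine runs a small driver that invokes the dump over ssh and copies it back with scp. Hostnames, the database name, and paths below are illustrative assumptions, not our real values.

```ruby
#!/usr/bin/env ruby
# Sketch only: builds the ssh/scp commands for the daily backup flow.
# "stg-db.example", "prod-db.example", "jamcloud", and "/backups" are
# placeholder names, not the real hosts/db.

# Build the remote dump command and the copy-back command for one host.
def backup_commands(host, db, dest_dir)
  stamp  = Time.now.strftime("%Y%m%d")
  remote = "/tmp/#{db}-#{stamp}.sql.gz"
  dump   = "ssh #{host} \"pg_dump #{db} | gzip > #{remote}\""
  copy   = "scp #{host}:#{remote} #{dest_dir}/"
  [dump, copy]
end

if __FILE__ == $0
  # One run per environment's database server, e.g. from cron.
  %w[stg-db.example prod-db.example].each do |host|
    backup_commands(host, "jamcloud", "/backups").each do |cmd|
      puts cmd # dry-run; a real version would system(cmd) and abort on failure
    end
  end
end
```

Keeping the command construction in a function makes the dry-run trivially checkable before wiring it into cron.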
The first pass of development here is going to skip CI/CD entirely, or to the greatest extent possible.

We are going to focus on a local, iteration-driven flow: say I make a change locally in video/video-iac that includes k8s changes related to any of the above. I should always have a local terminal command I can run to sync that straight to the cluster and have the change take effect immediately.
We want to codify all operations done from my machine under a `jkctl` master command. This command will have many useful subcommands, such as tailing logs and other operational helpers. One such command would be `jkctl sync k8s`, which would sync the k8s configuration related to jam-cloud (for now, leaving the rest of the video/video-iac cluster alone). `jkctl backup db` is another. So that's the actual first command we'd build.

Let's make `jkctl` ruby-based for now, given the order above.

`jkctl` will take a --stg or --prd flag so it knows which environment it is operating on.
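A minimal skeleton for `jkctl` could look like the following. The kubectl context names, the `video-iac/jam-cloud/` path, and the `backup_db.sh` script name are assumptions for illustration; only the `--stg`/`--prd` flags and the `sync k8s`/`backup db` subcommands come from the plan above.

```ruby
#!/usr/bin/env ruby
# Hypothetical jkctl skeleton: parse the environment flag, then dispatch
# subcommands. Returns the underlying command as a string so the dispatch
# logic is easy to test; a real jkctl would system() it.

def parse(argv)
  env  = nil
  rest = []
  argv.each do |arg|
    case arg
    when "--stg" then env = "staging"
    when "--prd" then env = "production"
    else rest << arg
    end
  end
  raise ArgumentError, "jkctl needs --stg or --prd" unless env
  [env, rest]
end

def jkctl(argv)
  env, rest = parse(argv)
  case rest
  when %w[sync k8s]
    # Apply only the jam-cloud part of video/video-iac to the cluster,
    # leaving the rest of the cluster's config alone (path is assumed).
    "kubectl --context #{env} apply -k video-iac/jam-cloud/"
  when %w[backup db]
    # Invokes the step-1 backup script on the env's db server (name assumed).
    "ssh #{env}-db ./backup_db.sh"
  else
    raise ArgumentError, "unknown subcommand: #{rest.join(' ')}"
  end
end
```

New operational helpers (log tailing, etc.) then become additional `when` branches, so every machine-local operation stays discoverable under one command.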
Other acceptable build technologies will be dagger and gitea. We will retire the use of jenkins, but that's one of the last steps; we'll rebuild any pipelines we need via dagger/gitea, focusing on local use of dagger anyway, where useful.
When we are done, we'll need to do a similar cut-over for production. It won't be much different, and we can put together a plan for it once we've gotten through this staging part successfully.