# Manual

## Overview

Because there are many ways to install a Kubernetes cluster, this document does not go into the specifics of setting one up. Instead, it provides guidance and requirements for your cluster.

## Nodes

Depending on whether you want to use GPUs or not, you need the following nodes:

Nodes that are always required:

1. "main": These nodes run the control plane. The AMESA controller does not interact with these nodes, so they should be provisioned as recommended by the Kubernetes distribution you use.
2. "composabl": The AMESA controller and Historian software are scheduled on this node or these nodes.
3. "envrunners": These nodes handle training workloads. If you're not using GPUs, all training is done on these nodes. If you are, these nodes manage the communication with the simulators and can be reduced in size.
4. "simscpu": These nodes are where the simulators will be scheduled. Sizing depends on the simulator.

If you want to use GPU training, you need the following node pool:

5. "learners": These nodes with GPUs will accelerate the learning step of the training process.

If your simulator can be accelerated using GPU, you can add the final node pool:

6. "simsgpu": These will run simulators, assigning a GPU to them.

A note on GPUs: Currently, only Nvidia GPUs are supported. The cluster must have the [nvidia-gpu-operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) installed for training on GPU to be enabled.
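As a planning aid, the pool selection described in this section can be written as a small shell function. This is only an illustrative sketch: `required_pools` and its `yes`/`no` arguments are hypothetical names and are not part of AMESA.

```bash
#!/bin/sh
# Hypothetical helper: print the node pools this guide calls for.
#   $1: "yes" if training should use GPUs
#   $2: "yes" if the simulator can be accelerated with a GPU
required_pools() {
  # These four pools are always required.
  pools="main composabl envrunners simscpu"
  if [ "$1" = "yes" ]; then
    pools="$pools learners"
  fi
  if [ "$2" = "yes" ]; then
    pools="$pools simsgpu"
  fi
  printf '%s\n' "$pools"
}

required_pools no no    # CPU-only training, CPU-only simulators
required_pools yes no   # GPU training, CPU-only simulators
```

The first call prints the four pools that are always required; the second adds `learners`.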

### 1. Sizing

Whether or not you use autoscaling with [cluster-autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler), each node type must be sized as follows.

1. `main`: As required by your Kubernetes distribution.
2. `composabl`: A total of 16 GB of memory and 4 CPUs across the pool, with at least one node having 8 GB of memory or more.
3. `envrunners`: If not using GPUs, we recommend 8 CPUs and 8 or 16 GB of memory. In any case, the number of simulators that each envrunner instance can manage depends on the number of CPUs.
4. `simscpu`: The sizing of these nodes depends on the resource requirements of your simulator.
5. `learners`: These nodes should have 1 Nvidia GPU. Other resources can be limited: 2 CPUs and 8 GB of memory are sufficient.
6. `simsgpu`: As with `simscpu`, sizing depends on the simulator requirements.

### 2. Labels

All groups of nodes must be labeled accordingly. Use the name given in the sizing list above as the value of the `agentpool` label.

You may be able to define this during your cluster setup, but if not, you can use the following commands:

```bash
kubectl label node <my-amesa-node> agentpool=composabl --overwrite
kubectl label node <my-envrunners-node> agentpool=envrunners --overwrite
kubectl label node <my-simulator-node> agentpool=simscpu --overwrite
kubectl label node <my-learners-node> agentpool=learners --overwrite
kubectl label node <my-simulator-gpu-node> agentpool=simsgpu --overwrite
```

Replace the placeholders in `<>` with the names of the nodes you'd like to assign to each pool.
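To reduce the chance of a typo in a pool name, the label commands can be generated by a helper that only accepts the five `agentpool` values used in this guide. This is a minimal sketch: `label_cmd` and the `worker-*` node names are hypothetical, and the function only prints commands rather than running them.

```bash
#!/bin/sh
# Hypothetical helper: print the label command for one node,
# rejecting any pool name that this guide does not define.
label_cmd() {
  node="$1"
  pool="$2"
  case "$pool" in
    composabl|envrunners|simscpu|learners|simsgpu)
      printf 'kubectl label node %s agentpool=%s --overwrite\n' "$node" "$pool"
      ;;
    *)
      printf 'unknown agentpool: %s\n' "$pool" >&2
      return 1
      ;;
  esac
}

# Hypothetical node names; replace them with names from `kubectl get nodes`.
label_cmd worker-1 composabl
label_cmd worker-2 envrunners
label_cmd worker-3 simscpu
```

After reviewing the printed commands, you can run them as they are. Running `kubectl get nodes -L agentpool` afterwards shows each node's `agentpool` label as an extra column, which is a quick way to confirm the result.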

## Storage

The components also need access to (semi)persistent, shared storage. This section details the types and amounts of storage needed.

The components need the following `PersistentVolumeClaim`s in the `composabl-train` namespace:

1. `pvc-controller-data` with a size of approximately `1Gi` and an `accessMode` of `ReadWriteOnce` (or better). When using Azure, you will need to set the `nobrl` mountOption for this PVC, as this is required for the AMESA controller to function.
2. `pvc-training-results` with a suitable size. This is where your final agent system data will be stored before it is uploaded to the No-code application. It **needs an `accessMode` of `ReadWriteMany` (RWX)**. A good initial size is the same as `historian-tmp`.
3. `historian-tmp` is used as temporary storage for historian data. It needs an `accessMode` of `ReadWriteOnce`, and its size depends on the length of your training sessions. We recommend starting with `5Gi`.

The sizes of `pvc-training-results` and `historian-tmp` depend on the number and size of the training jobs you want to run simultaneously on your cluster. If you plan on running long-lived training sessions with many cycles, you may want to increase the capacity of both.
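As a starting point, the three claims could be written to a manifest file like the one below. This is a sketch under stated assumptions: the file name `amesa-pvcs.yaml` is arbitrary, no `storageClassName` is set (so the cluster's default StorageClass is assumed), and the sizes are the initial values suggested above.

```bash
#!/bin/sh
# Write a manifest containing the three claims described in this section.
# Assumption: the cluster's default StorageClass is used, because no
# storageClassName is set. Adjust this for your environment.
cat > amesa-pvcs.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-controller-data
  namespace: composabl-train
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-training-results
  namespace: composabl-train
spec:
  accessModes:
    - ReadWriteMany   # requires a storage backend that supports RWX
  resources:
    requests:
      storage: 5Gi    # initial size, matching historian-tmp
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: historian-tmp
  namespace: composabl-train
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
EOF
```

After reviewing the file, you can apply it with `kubectl apply -f amesa-pvcs.yaml`; the `composabl-train` namespace must already exist. Mount options are configured on the StorageClass or PersistentVolume rather than on the claim itself, so the Azure `nobrl` option is not covered by this sketch.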

## Private image registry

If you want to use a private registry for simulator images, you will need to set up this private registry yourself, and make sure the cluster is able to pull images from this registry.

## Next steps

Once your cluster is running and you have verified that your setup is working, you can continue to [Installing AMESA](/clusters/creating-a-cluster/manual.md).

