# Analyze Data in Detail with the Historian

In this tutorial, we will explore how to use the historian to validate a trained AI agent system in AMESA by analyzing its training logs. The historian stores historical time-series data in an optimized format, [Parquet](https://www.databricks.com/glossary/what-is-parquet), which helps in evaluating how the agent is performing during training.

### **Step 1: Accessing the Historian Data**

The historian file stores the time-series data needed to validate agent system training. There are several ways to access and store the historian data, but the recommended format is a **delta file** (Delta Lake, which keeps its data in Parquet files).

1. **Understanding the Format**:
   * The historian data is typically large, around 500 megabytes for standard operations. It is stored in the **Delta Lake** format, which is optimized for time-series data and supports efficient queries.
2. **Downloading the Historian File**:
   * From the AMESA UI, download the historian file. This file will come in a compressed format (e.g., `.gz`).
   * After extracting it, you should see the delta file containing time-series data.

### **Step 2: Setting Up for Validation**

1. **Unpacking the Historian File**:
   * If the historian file is compressed (e.g., `.gz`), unpack it with `gunzip`. The `-k` flag keeps the original archive alongside the extracted file:

     ```bash
     gunzip -k historian_file.gz
     ```
   * Once unzipped, you’ll see a **10 MB+ delta file** with historical time-series data.
2. **Understanding the Delta File**:
   * The delta file is optimized for fast reads and writes of time-series data.
   * It supports an append-only structure, which ensures that each new piece of data can be added efficiently without modifying the existing data.
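
If `gunzip` is not available on your machine, the unpacking step above can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `gunzip_keep` is hypothetical, and it assumes the download is a plain `.gz` file rather than a multi-file archive.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_keep(src: str) -> Path:
    """Decompress a .gz file next to itself, keeping the original archive.

    Hypothetical helper that mirrors `gunzip -k`; not part of AMESA.
    """
    src_path = Path(src)
    if src_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src_path.name}")
    # Drop the trailing ".gz": historian_file.gz -> historian_file
    dest_path = src_path.with_suffix("")
    with gzip.open(src_path, "rb") as f_in, open(dest_path, "wb") as f_out:
        # Copy in chunks so large historian files are not loaded into memory
        shutil.copyfileobj(f_in, f_out)
    return dest_path
```

Calling `gunzip_keep("historian_file.gz")` would write `historian_file` to the same directory and return its path.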

### **Step 3: Querying the Historian Data**

1. **Setting Up a Query Environment**:
   * To validate your agent system’s training, you’ll need to set up an environment that allows you to query the delta file. Delta Lake integrates well with systems like **Apache Spark**, but for simple querying, you can use tools like **pandas** in Python.
2. **Querying for Agent Training Logs**:

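   * Extract and analyze relevant historical data from the delta file. Here's a simple Python example for querying the delta file using pandas. Note that `pd.read_parquet` needs a Parquet engine such as `pyarrow` to be installed, and the plots require `matplotlib`: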

   ```python
   import ast
   import json

   import matplotlib.pyplot as plt
   import pandas as pd

   # Load the historian delta file and order the events by time
   df = pd.read_parquet('historian_delta_file.parquet')
   df = df.sort_values(by=['timestamp'])

   # Keep step events whose "data" column carries composabl_obs, plus the
   # skill-training and skill-training-cycle events that label those steps
   is_marker = df['category_sub'].isin(['skill-training', 'skill-training-cycle'])
   is_obs_step = (df['category_sub'] == 'step') & df['data'].str.contains('composabl_obs', na=False, regex=False)
   df_data = df[is_obs_step | is_marker].copy()

   def convert_to_dict(x):
       """Parse a payload stored as JSON or as a Python literal; {} on failure."""
       for parser in (json.loads, ast.literal_eval):
           try:
               parsed = parser(x)
           except (TypeError, ValueError, SyntaxError):
               continue
           if isinstance(parsed, dict):
               return parsed
       return {}

   df_data['data'] = df_data['data'].apply(convert_to_dict)

   # Skill name and cycle come from the marker events; back-fill them onto the steps
   df_data['skill_name'] = df_data['data'].apply(lambda x: x.get('name') if 'is_done' in x else None).bfill()
   df_data['cycle'] = df_data['data'].apply(lambda x: x.get('cycle')).bfill()

   # Reward and observation come from the step events
   df_data['reward'] = pd.to_numeric(
       df_data['data'].apply(lambda x: x.get('teacher_reward') if 'composabl_obs' in x else None),
       errors='coerce',
   )
   df_data['obs'] = df_data['data'].apply(lambda x: x.get('composabl_obs'))

   # Keep only step events that have a parsed observation
   df_data = df_data[(df_data['category_sub'] == 'step') & df_data['obs'].notna()]
   print(df_data)

   # Mean reward per run, skill, and cycle
   df_group = df_data.groupby(['run_id', 'skill_name', 'cycle'])['reward'].mean()

   # Process observation data: one column per observation key, taking the
   # first element of each value; reuse df_data's index so rows stay aligned
   df_obs = pd.DataFrame(
       [{k: v[0] for k, v in obs.items()} for obs in df_data['obs']],
       index=df_data.index,
   )
   df_obs['cycle'] = df_data['cycle']
   df_obs['run_id'] = df_data['run_id']
   df_obs['skill_name'] = df_data['skill_name']

   # Mean episode reward per cycle, one figure per run and skill
   for (run_id, skill), rewards in df_group.groupby(level=['run_id', 'skill_name']):
       plt.plot(rewards.index.get_level_values('cycle'), rewards.values)
       plt.ylabel('Mean Episode Reward')
       plt.xlabel('Cycle')
       plt.title(f'{run_id} - {skill}')
       plt.show()
   ```
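
The example above tries two parsers on each `data` value. The sketch below shows why, using only the standard library: a payload written as JSON is handled by `json.loads`, while one written as a Python dict literal (single quotes, `False` instead of `false`) needs `ast.literal_eval`. The helper name `parse_payload` and the sample strings are invented for illustration; real historian payloads may contain different keys.

```python
import ast
import json


def parse_payload(raw):
    """Return the payload as a dict, or None if it cannot be parsed."""
    if isinstance(raw, dict):
        return raw
    if not isinstance(raw, str):
        return None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        try:
            # Fall back to Python literal syntax, e.g. "{'a': True}"
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError, TypeError):
            return None
    # Only dict payloads are useful to the analysis
    return parsed if isinstance(parsed, dict) else None


# Hypothetical payload strings, invented for illustration only
samples = [
    '{"teacher_reward": 0.5, "composabl_obs": {"temperature": [21.0]}}',  # JSON
    "{'name': 'skill-a', 'is_done': False}",  # Python literal
    "not a payload",  # unparseable
]
parsed = [parse_payload(s) for s in samples]
```

Returning `None` for unparseable rows, rather than raising, lets you drop or count those rows explicitly before computing rewards.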

***

### Key Benefits of Using the Historian for Validation:

* **Optimized Data Handling**: The Delta Lake format is designed for fast querying, making it ideal for time-series data.
* **Efficient Storage**: The append-only nature ensures that new data can be added without overwriting or modifying existing data, making it easy to track data over time.
* **Continuous Monitoring**: By continuously adding data to the historian, you can validate your agent system's long-term impact on machine performance, uptime, and safety.


