TIR/AI Platform
Introduction
The TIR platform is an AI Development Platform built to tackle the friction of training and serving large AI models.
Components of TIR Platform
TIR Dashboard: Notebooks, Datasets, Models, Inference, Token Management
Python SDK: Work with TIR objects from the comfort of a Python shell or Jupyter notebooks (hosted or local)
CLI: Bring the power of the E2E GPU cloud to your local desktop
Why Is AI Model Development So Hard?
Software Stack Complexity: Taking a model from development to production requires a variety of toolsets and environments. Some of these toolsets have strict version dependencies, which makes things even harder.
Data Loading and Processing
Training frameworks and libraries (e.g. PyTorch, transformers, etc.)
GPU drivers with library optimizations (some libraries depend on the GPUs)
Fault tolerance handling (through pipelines and stateful jobs that can restart)
Deployment Management
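The idea of a stateful job that can restart can be pictured with a minimal sketch in plain Python. This is not TIR's pipeline implementation: the `run_job` function, the checkpoint file layout, and the `fail_at` parameter are hypothetical, chosen only to show how a job can persist its progress and resume after a failure.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_job(path, num_steps, fail_at=None):
    """Run a toy job that resumes from the last checkpoint.

    fail_at simulates a crash at a given step (hypothetical, for illustration).
    """
    state = load_checkpoint(path)
    for step in range(state["step"], num_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        state["total"] += step  # stand-in for one unit of training work
        state["step"] = step + 1
        save_checkpoint(path, state)
    return state
```

Calling `run_job` again with the same path after a failure continues from the last saved step instead of starting over, which is the property a restartable job relies on.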
Scaling Up, Out and to Zero: Training and serving large models requires a platform with high GPU availability and the ability to scale up, scale out, and scale to zero to avoid paying for idle capacity.
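Scaling out and scaling to zero can be illustrated with a small, hypothetical autoscaling rule. This is a sketch under stated assumptions, not how TIR decides replica counts: the function `desired_replicas`, its parameters, and the default thresholds are invented for illustration.

```python
import math


def desired_replicas(in_flight, per_replica_capacity, idle_seconds,
                     idle_timeout=300, max_replicas=8):
    """Return how many replicas to run for the current load.

    in_flight: requests currently being served or queued
    per_replica_capacity: concurrent requests one replica can handle
    idle_seconds: time since the last request arrived
    """
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    if in_flight == 0:
        # Scale to zero only after the endpoint has been idle long enough,
        # so brief gaps in traffic do not trigger a cold start.
        return 0 if idle_seconds >= idle_timeout else 1
    # Scale out: enough replicas to cover the load, capped at max_replicas.
    return min(max_replicas, math.ceil(in_flight / per_replica_capacity))
```

The idle timeout is the main design choice here: a short timeout saves more idle cost, while a long one reduces how often the first request after a quiet period has to wait for a replica to start.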
Collaboration: The work of AI researchers and engineers requires a high degree of collaboration. Being able to reproduce your team members' work is an important part of pushing the boundaries of that work. Software engineering tools like git help, but they are not sufficient to handle the large datasets and models needed to make work reproducible.
Taking Models to Production: Packaging open-source or your own models for production use requires a whole different set of skills (containers, API development, security and authentication, etc.). The good news is that this process is repetitive in nature, so it can be easily automated.
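To give a sense of the repetitive work involved, here is a minimal sketch of the authentication and request handling that usually has to wrap a model before it is exposed as an API. Everything in it is an assumption made for illustration: `predict` is a placeholder model, `handle_request` is a hypothetical handler, and the bearer-token scheme is one common choice rather than TIR's documented mechanism.

```python
import hmac
import json


def predict(features):
    """Placeholder model: returns the sum of the inputs (hypothetical)."""
    return sum(features)


def handle_request(headers, body, api_token):
    """Authenticate a request and return (status_code, json_body)."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    # compare_digest avoids leaking token contents through timing differences.
    token_ok = hmac.compare_digest(token.encode(), api_token.encode())
    if scheme != "Bearer" or not token_ok:
        return 401, json.dumps({"error": "unauthorized"})
    try:
        features = json.loads(body)["features"]
        result = predict(features)
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing "features" key, or non-numeric inputs.
        return 400, json.dumps({"error": "invalid request body"})
    return 200, json.dumps({"prediction": result})
```

The same three steps (check credentials, validate input, call the model) recur for nearly every served model, which is why this layer lends itself to automation.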
Key Features of TIR Platform
GPU-Optimized Containers (NVIDIA)
Manage End-to-End Lifecycle of Training and Serving large AI models
Pre-Configured Notebooks:
Easily launch notebooks with a variety of environment options (e.g. transformers) and the desired hardware
Persistent notebook workspaces for reproducibility of work
Datasets: EOS (E2E Object Storage) and PVC-backed datasets for easier data sharing and availability
Models and Endpoints: Track models with an EOS-backed repository and serve them through an endpoint with simple configuration
Pipelines: Define end-to-end training and deployment pipelines
Jobs: Want to quickly run your Python code? Just start a job with the desired hardware and we take care of the rest
Project and Team Management
User and Access management
Integrations: git, Hugging Face, Weights & Biases (Experiment Management), Neptune