Datasets

Datasets let you organize, share, and easily access your data from notebooks and training code. At the moment, TIR supports EOS (Object Storage) backed datasets; PVC (Disk) backed datasets will be introduced soon.

How does it work?

Datasets allow you to mount and access EOS storage buckets as a local file system. This enables you to read objects in your bucket using standard file system semantics.

With datasets, you can load training data from your local machine or other cloud providers and access it from your notebook as a mounted file system (under the /datasets directory).
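
As an illustration of the file system semantics described above, here is a minimal Python sketch that lists the files of a mounted dataset. It assumes that each dataset appears as a subdirectory of /datasets named after the dataset (for example /datasets/paws); that layout and the helper name `list_dataset_files` are assumptions for illustration, not part of the TIR SDK.

```python
from pathlib import Path

# Assumption: datasets are mounted as /datasets/<dataset-name>.
DATASETS_ROOT = Path("/datasets")


def list_dataset_files(dataset_name, root=DATASETS_ROOT, pattern="*"):
    """Return all files under the mounted dataset, sorted by path."""
    dataset_dir = Path(root) / dataset_name
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"dataset is not mounted: {dataset_dir}")
    # rglob walks the mount recursively, like any local directory.
    return sorted(p for p in dataset_dir.rglob(pattern) if p.is_file())


# Example (inside a notebook with a dataset named "paws" mounted):
#   for path in list_dataset_files("paws", pattern="*.jpg"):
#       print(path)
```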

Note

Define datasets to manage and control your data, even if you don’t plan to mount them on notebooks or training jobs.

Benefits

Using datasets offers the following benefits:

  • You can create a shared EOS bucket for your team's training data. Using common data sources across the team improves the reproducibility of results.

  • Less configuration overhead. When you create a dataset through TIR, we automatically create a new storage bucket and access credentials. You can copy the mc (MinIO CLI) commands shown in the UI and run them right away on your local desktop or hosted notebooks to upload data.

  • You can access your training data without having to set up access information on hosted notebooks or training jobs.

  • You can stream training data instead of downloading all of it to the disk. This is also useful in distributed training jobs.
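
The streaming benefit in the last item can be sketched in Python as a generator that reads a file from the mounted dataset in fixed-size chunks, so the whole file never has to sit in memory at once. The helper name `iter_chunks` and the example path are hypothetical; whether each read is served lazily from the bucket depends on how the mount is implemented.

```python
def iter_chunks(path, chunk_size=1024 * 1024):
    """Yield the file at `path` in chunks of at most `chunk_size` bytes."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # An empty read means end of file.
                break
            yield chunk


# Example (hypothetical path under the /datasets mount):
#   total = sum(len(c) for c in iter_chunks("/datasets/paws/train.bin"))
```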

Usage

WebUI

You can define a dataset through the TIR dashboard. The UI also allows you to browse the objects in the bucket and upload files. We recommend using the mc client or any S3-compliant CLI to upload large data. If you have data with other cloud providers, visit the Import Data from Other Cloud Providers section.

SDK

Using the TIR SDK, you can quickly set up the mc CLI and start importing or exporting data from EOS buckets.

Getting Started

Prerequisites

Install the MinIO CLI (mc) on your local desktop from here. Skip this step if you already have mc installed on your local machine.

Create a new dataset

  • Go to the TIR dashboard

  • Make sure you are in the right project, or create a new project

  • Go to the Datasets tab

  • Click CREATE DATASET

  • Choose the bucket type New EOS Bucket. This creates a new EOS bucket tied to your account, along with access keys for it.

  • Enter a name for your dataset (e.g. paws)

  • Click on CREATE

  • You will see a popup with Bucket name, Access Key and Secret Key

  • In the popup, go to the Setup Minio CLI (mc) tab and copy the mc command that sets up the host

  • Run the copied mc command in the command line on your local desktop. This sets up an mc alias that you can transfer data to.

Copy data into Dataset

On your local desktop, open a command line and enter the following command:

mc ls <dataset-name>/

You should see the EOS bucket name in the response. Copy this bucket name, then edit and run the following copy command:

mc cp --recursive <mylocaldirectory> <dataset-name>/<eos-bucket>
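
If you would rather script the copy step than type it by hand, the command above can be wrapped in a few lines of Python. This is a minimal sketch: `build_mc_cp_command` and `copy_to_dataset` are hypothetical helper names, and it assumes mc is on your PATH and the alias for the dataset is already configured.

```python
import subprocess


def build_mc_cp_command(source_dir, alias, bucket):
    """Build the argument list for a recursive mc copy into an EOS bucket."""
    return ["mc", "cp", "--recursive", source_dir, f"{alias}/{bucket}/"]


def copy_to_dataset(source_dir, alias, bucket):
    """Run the copy; raises CalledProcessError if mc exits with an error."""
    subprocess.run(build_mc_cp_command(source_dir, alias, bucket), check=True)


# Example (placeholders as in the command above):
#   copy_to_dataset("<mylocaldirectory>", "<dataset-name>", "<eos-bucket>")
```

Passing the arguments as a list avoids shell quoting problems when the local directory name contains spaces.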

Working with Datasets

Import Data from local machine

To import data from your local machine, TIR provides two options:

  • Upload from browser (TIR Dashboard >> Datasets >> Click on Dataset >> Objects Browser tab)

  • Use a CLI (S3-compliant, such as mc):
    1. Install the MinIO CLI (mc) on your local desktop from here.

    2. In the TIR Dashboard, locate and click your dataset in the Datasets section. At the bottom of the screen, you will find the setup tabs. Choose the Minio (mc) tab and copy the setup instructions for the alias.

      mc alias set <dataset-name> <eos-endpoint-url> <access key> <secret key>
      
    3. After setting up the alias, you can upload a complete directory using the command below:

      mc cp --recursive <source_dir> <dataset-name>/<eos-bucket>/
      

Import Data from Amazon S3

After configuring aliases for your dataset and for Amazon S3 in mc (MinIO CLI), run the following command:

mc cp --recursive <s3-alias>/<s3-bucket> <dataset-name>/<eos-bucket>/ --attr Cache-Control=max-age=90000,min-fresh=9000\;key1=value1\;key2=value2

To configure the dataset alias for the EOS bucket, follow the instructions in Import Data from local machine.
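
The --attr value in the command above is a list of key=value pairs separated by semicolons; the backslashes are only there to stop the shell from interpreting the semicolons. The following Python sketch builds the same value from a dictionary and assembles the copy command as an argument list, where no shell escaping is needed. The helper names are hypothetical, and the attribute keys are simply the ones from the command above.

```python
def build_attr_value(attributes):
    """Join key/value pairs into the 'k1=v1;k2=v2' form passed to --attr."""
    return ";".join(f"{key}={value}" for key, value in attributes.items())


def build_s3_import_command(s3_alias, s3_bucket, dataset_alias, eos_bucket, attributes):
    """Build the mc argument list for copying an S3 bucket into an EOS bucket."""
    return [
        "mc", "cp", "--recursive",
        "--attr", build_attr_value(attributes),
        f"{s3_alias}/{s3_bucket}",
        f"{dataset_alias}/{eos_bucket}/",
    ]


# Example: reproduce the attribute string used in the command above.
attrs = {
    "Cache-Control": "max-age=90000,min-fresh=9000",
    "key1": "value1",
    "key2": "value2",
}
print(build_attr_value(attrs))
```

The resulting list can be passed to subprocess.run(..., check=True) on a machine where both aliases are configured.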

Import Data from GCS

After configuring aliases for your dataset and for GCS in mc (MinIO CLI), run the following command:

mc cp --recursive <gcs-alias>/<gcs-bucket> <dataset-name>/<eos-bucket>/ --attr Cache-Control=max-age=90000,min-fresh=9000\;key1=value1\;key2=value2

To configure the dataset alias for the EOS bucket, follow the instructions in Import Data from local machine.

Upload data from TIR Jupyter Notebook

You will need to configure the dataset alias in mc to upload data to the EOS bucket. You can achieve this in two ways:

  • From the TIR Dashboard, you can find the mc alias and mc cp commands for the respective dataset. Use these commands to upload data from notebook cells.

  • Using our Python SDK, you can configure the mc alias without having to go to the TIR Dashboard. Run the following steps from notebook cells:

from e2enetworks.cloud import tir

tir.init()
tir.set_dataset_alias("dataset-name")

To upload or download data to the notebook instance:

# from the output copy the eos bucket name and use it in cp command
!mc ls <dataset-name>/
!mc cp --recursive ~/data/ <dataset-name>/<eos-bucket>/
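
Instead of copying the bucket name from the listing by hand, you could parse it in Python. The sketch below assumes that `mc ls --json` prints one JSON object per line and that each object carries the entry name in a "key" field; treat that output shape, and the helper name `parse_bucket_names`, as assumptions to verify against your mc version.

```python
import json


def parse_bucket_names(mc_ls_json_output):
    """Extract entry names from newline-delimited `mc ls --json` output."""
    names = []
    for line in mc_ls_json_output.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        key = entry.get("key")  # assumed field name for the entry
        if key:
            names.append(key.rstrip("/"))
    return names


# Example (in a notebook cell, after the alias is configured):
#   import subprocess
#   out = subprocess.run(["mc", "ls", "--json", "<dataset-name>/"],
#                        capture_output=True, text=True, check=True).stdout
#   print(parse_bucket_names(out))
```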

Permanently Migrate from Other Cloud

If you have a data science workflow or an application that uploads data to another cloud storage service, you can point its endpoints to EOS instead and get the best performance on your training jobs.
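
For an application that already uses an S3 client library, changing the endpoint usually means changing one client setting. The sketch below assumes the application uses boto3 and that EOS is reached through its S3-compatible API; the helper name `eos_client_kwargs` is hypothetical, and the endpoint URL and credentials are placeholders you would take from the dataset's setup page.

```python
def eos_client_kwargs(endpoint_url, access_key, secret_key):
    """Keyword arguments for an S3-compatible client pointed at EOS."""
    return {
        "endpoint_url": endpoint_url,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


# Usage (requires boto3, which is not imported here):
#   import boto3
#   s3 = boto3.client("s3", **eos_client_kwargs(
#       "<eos-endpoint-url>", "<access key>", "<secret key>"))
#   s3.upload_file("model.pt", "<eos-bucket>", "checkpoints/model.pt")
```

The rest of the application code (uploads, downloads, listings) can stay as it is, since only the client construction changes.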

How to Create Datasets?

To create a dataset, click on the Dataset button in the side navigation bar.

../_images/Dataset1.png

Click on the Create Dataset button.

../_images/Dataset2.png

After you click the Create Dataset button, the dataset creation form opens. Select the bucket type: either create a new bucket or select an existing one. This example uses a new EOS bucket, so enter the dataset details and click on the Create button.

../_images/Dataset3.png

To create a dataset with an existing bucket, select the Existing EOS bucket option from the dropdown, enter all the details, and click on the Create button.

../_images/Dataset4.png

After you click on the Create button, a popup appears showing the Dataset Credentials.

../_images/Dataset5.png

The Dataset Credentials dialog box shows the dataset name, Bucket Name, Access Key, and Secret Key. After the dataset is created, it is shown as follows. Click on the OK button to continue.

../_images/Dataset6.png

After you click on the OK button, the page with the Details, Upload & Setup tabs is shown.

../_images/Dataset7.png

Delete Dataset

Click on the Delete button to delete the dataset.

../_images/Dataset8.png

After you click on the Delete button, a popup to delete the dataset is shown.

../_images/Dataset9.png

Click on the Upload tab to show the Bucket path and upload files fields.

../_images/Dataset10.png

Click on the Setup tab to show the Setup Minio CLI and Setup s3cmd options.

../_images/Dataset11.png