# RDT

Robotics Diffusion Transformer (RDT) is a 1B-parameter Diffusion Transformer trained with imitation learning. To finetune RDT with RoboVerse data, follow the steps below.

## Installation

First, clone the RoboVerse [version](https://github.com/wbjsamuel/RoboVerse-RDT) of RDT and install the environment as required. We provide the official transform and configuration for the RoboVerse fine-tuning setup there.

```bash
# Clone this repo
git clone https://github.com/wbjsamuel/RoboVerse-RDT
cd RoboticsDiffusionTransformer

# Create a Conda environment
conda create -n rdt python=3.10.0
conda activate rdt

# Install PyTorch
# Look up https://pytorch.org/get-started/previous-versions/ with your CUDA version for the correct command
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121

# Install packaging
pip install packaging==24.0

# Install flash-attn
pip install flash-attn --no-build-isolation

# Install other prerequisites
pip install -r requirements.txt
```

Then, download the multi-modal encoders from Hugging Face and create symbolic links to them under the `google` directory. For example:

```bash
huggingface-cli download google/t5-v1_1-xxl --cache-dir YOUR_CACHE_DIR
huggingface-cli download google/siglip-so400m-patch14-384 --cache-dir YOUR_CACHE_DIR

# Under the root directory of this repo
mkdir -p google

# Link the downloaded encoders to this repo
ln -s /YOUR_CACHE_DIR/t5-v1_1-xxl google/t5-v1_1-xxl
ln -s /YOUR_CACHE_DIR/siglip-so400m-patch14-384 google/siglip-so400m-patch14-384
```

Lastly, add your buffer path to `configs/base.yaml`. We directly reuse the pre-training pipeline for fine-tuning, so the buffer path is required.

```yaml
# ...
dataset:
  # ...
  # ADD YOUR buf_path: the path to the buffer (at least 400GB)
  buf_path: PATH_TO_YOUR_BUFFER
  # ...
```

## Data Preparation

For RDT, we still employ RLDS-format data for fine-tuning. Please make sure your data have been successfully converted to the RLDS format; if not, please refer to the data conversion part of the [OpenVLA](https://roboverse.wiki/roboverse_learn/openvla) documentation. Then, all you need to do is create a symbolic link from the converted RLDS data to the required path:

```bash
ln -s PATH_TO_RLDS_DATA data/datasets/openx_embod/YOUR_TASK_NAME
```
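Before launching training, it can help to confirm that the link resolves and the dataloader will actually see the converted dataset. A minimal sanity check; the `1.0.0` version directory and exact file layout are illustrative and depend on how your RLDS conversion was produced:

```bash
# Confirm the symlink resolves to the converted RLDS dataset
ls -L data/datasets/openx_embod/YOUR_TASK_NAME

# RLDS/TFDS datasets typically contain a version directory (e.g. 1.0.0/) holding
# dataset_info.json, features.json, and the tfrecord shards; adjust the path if
# your conversion produced a different layout
ls -L data/datasets/openx_embod/YOUR_TASK_NAME/1.0.0
```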
## Finetune RDT

After the above steps, you can start finetuning RDT with the following command. We have modified the `finetune.sh` script for you, so you can use it directly:

```bash
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=lo
export NCCL_DEBUG=INFO
export NCCL_NVLS_ENABLE=1

export TEXT_ENCODER_NAME="google/t5-v1_1-xxl"
export VISION_ENCODER_NAME="google/siglip-so400m-patch14-384"
export OUTPUT_DIR="./checkpoints/rdt-finetune-1b-pickcube"
export CFLAGS="-I/usr/include"
export LDFLAGS="-L/usr/lib/x86_64-linux-gnu"
# export CUTLASS_PATH="/path/to/cutlass"

export WANDB_PROJECT="rdt-finetune-1b-pickcube"

if [ ! -d "$OUTPUT_DIR" ]; then
    mkdir "$OUTPUT_DIR"
    echo "Folder '$OUTPUT_DIR' created"
else
    echo "Folder '$OUTPUT_DIR' already exists"
fi

# For running on multiple nodes/machines
# deepspeed --hostfile=hostfile.txt main.py \
#     --deepspeed="./configs/zero2.json" \
#     ...

accelerate launch main.py \
    --deepspeed="./configs/zero2.json" \
    --pretrained_model_name_or_path="robotics-diffusion-transformer/rdt-1b" \
    --pretrained_text_encoder_name_or_path=$TEXT_ENCODER_NAME \
    --pretrained_vision_encoder_name_or_path=$VISION_ENCODER_NAME \
    --output_dir=$OUTPUT_DIR \
    --train_batch_size=16 \
    --sample_batch_size=32 \
    --max_train_steps=200000 \
    --checkpointing_period=1000 \
    --sample_period=500 \
    --checkpoints_total_limit=40 \
    --lr_scheduler="constant" \
    --learning_rate=1e-4 \
    --mixed_precision="bf16" \
    --dataloader_num_workers=8 \
    --image_aug \
    --dataset_type="pretrain" \
    --state_noise_snr=40 \
    --report_to=wandb

    # Use this to resume training from a previous checkpoint
    # --resume_from_checkpoint="checkpoint-36000" \

    # Use this to load saved language instruction embeddings
    # instead of computing them during training
    # --precomp_lang_embed \
```
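The script reports to Weights & Biases via `--report_to=wandb`, so authenticate once before the first run. A minimal launch sketch, assuming the modified script above is saved as `finetune.sh` in the repo root:

```bash
# Log in to Weights & Biases once (needed because the script passes --report_to=wandb)
wandb login

# Launch fine-tuning; checkpoints accumulate under ./checkpoints/rdt-finetune-1b-pickcube
bash finetune.sh
```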