# RDT
Robotics Diffusion Transformer (RDT) is a 1B-parameter Diffusion Transformer trained with imitation learning. To fine-tune RDT with RoboVerse data, follow the steps below.
## Installation
First, clone the RoboVerse fork of RDT and set up the environment as described below. The fork provides the official data transform and configuration for the RoboVerse fine-tuning setup.
```bash
# Clone this repo
git clone https://github.com/wbjsamuel/RoboVerse-RDT
cd RoboVerse-RDT

# Create a Conda environment
conda create -n rdt python=3.10.0
conda activate rdt

# Install PyTorch
# Look up https://pytorch.org/get-started/previous-versions/ with your CUDA version for the correct command
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121

# Install packaging
pip install packaging==24.0

# Install flash-attn
pip install flash-attn --no-build-isolation

# Install other prerequisites
pip install -r requirements.txt
```
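As an optional sanity check (not part of the original setup), you can confirm that the CUDA build of PyTorch sees your GPU and that `flash-attn` imports cleanly before moving on:

```bash
# Optional: verify the CUDA build of PyTorch and the flash-attn install
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
```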
Then, download the multi-modal encoders from Hugging Face and create symbolic links to them under the `google` directory:
```bash
huggingface-cli download google/t5-v1_1-xxl --cache-dir YOUR_CACHE_DIR
huggingface-cli download google/siglip-so400m-patch14-384 --cache-dir YOUR_CACHE_DIR

# Under the root directory of this repo
mkdir -p google

# Link the downloaded encoders to this repo
# Note: with --cache-dir, the CLI stores files under
# YOUR_CACHE_DIR/models--google--<name>/snapshots/<revision>;
# point the links at the snapshot directories if that is the layout you see.
ln -s /YOUR_CACHE_DIR/t5-v1_1-xxl google/t5-v1_1-xxl
ln -s /YOUR_CACHE_DIR/siglip-so400m-patch14-384 google/siglip-so400m-patch14-384
```
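You can optionally confirm that the links resolve to real model directories (this check is our addition, not part of the original instructions):

```bash
# Optional: follow the symlinks and list the encoder files
ls -lL google/t5-v1_1-xxl
ls -lL google/siglip-so400m-patch14-384
```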
Lastly, add your buffer path to `configs/base.yaml`. We directly reuse the pre-training pipeline for fine-tuning, so the buffer path is required.
```yaml
# ...
dataset:
  # ...
  # ADD YOUR buf_path: the path to the buffer (at least 400GB)
  buf_path: PATH_TO_YOUR_BUFFER
  # ...
```
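Before training, it is worth making sure the buffer location actually exists and has enough free space. A minimal check (our addition, with `PATH_TO_YOUR_BUFFER` standing in for your actual path):

```bash
# Create the buffer directory if needed and check free space (at least 400 GB)
mkdir -p PATH_TO_YOUR_BUFFER
df -h PATH_TO_YOUR_BUFFER
```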
## Data Preparation
RDT also consumes RLDS-format data for fine-tuning. Make sure your data has been successfully converted to RLDS format; if not, refer to the data-conversion section of the OpenVLA documentation. Then, all you need to do is create a symbolic link from the converted RLDS data to the required path:
```bash
ln -s PATH_TO_RLDS_DATA data/datasets/openx_embod/YOUR_TASK_NAME
```
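To verify the link (an optional step we add here), list the dataset directory and check that the RLDS shards (TFRecord files) are visible through it:

```bash
# Optional: confirm the RLDS shards are reachable through the symlink
ls -L data/datasets/openx_embod/YOUR_TASK_NAME | head
```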
## Finetune RDT
After the above steps, you can start fine-tuning RDT with the following command. We have already modified the `finetune.sh` script, so you can use it directly:
```bash
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=lo
export NCCL_DEBUG=INFO
export NCCL_NVLS_ENABLE=1

export TEXT_ENCODER_NAME="google/t5-v1_1-xxl"
export VISION_ENCODER_NAME="google/siglip-so400m-patch14-384"
export OUTPUT_DIR="./checkpoints/rdt-finetune-1b-pickcube"
export CFLAGS="-I/usr/include"
export LDFLAGS="-L/usr/lib/x86_64-linux-gnu"
# export CUTLASS_PATH="/path/to/cutlass"
export WANDB_PROJECT="rdt-finetune-1b-pickcube"

if [ ! -d "$OUTPUT_DIR" ]; then
    mkdir "$OUTPUT_DIR"
    echo "Folder '$OUTPUT_DIR' created"
else
    echo "Folder '$OUTPUT_DIR' already exists"
fi

# To run on multiple nodes/machines:
# deepspeed --hostfile=hostfile.txt main.py \
#     --deepspeed="./configs/zero2.json" \
#     ...

accelerate launch main.py \
    --deepspeed="./configs/zero2.json" \
    --pretrained_model_name_or_path="robotics-diffusion-transformer/rdt-1b" \
    --pretrained_text_encoder_name_or_path=$TEXT_ENCODER_NAME \
    --pretrained_vision_encoder_name_or_path=$VISION_ENCODER_NAME \
    --output_dir=$OUTPUT_DIR \
    --train_batch_size=16 \
    --sample_batch_size=32 \
    --max_train_steps=200000 \
    --checkpointing_period=1000 \
    --sample_period=500 \
    --checkpoints_total_limit=40 \
    --lr_scheduler="constant" \
    --learning_rate=1e-4 \
    --mixed_precision="bf16" \
    --dataloader_num_workers=8 \
    --image_aug \
    --dataset_type="pretrain" \
    --state_noise_snr=40 \
    --report_to=wandb

# Use this to resume training from a previous checkpoint:
#     --resume_from_checkpoint="checkpoint-36000" \
# Use this to load precomputed language instruction embeddings
# instead of computing them during training:
#     --precomp_lang_embed \
```
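If you have not set up Hugging Face Accelerate on this machine before, configure it once so `accelerate launch` knows your GPU count and precision settings; a typical invocation then looks like the following (our suggestion, not part of the script itself):

```bash
# One-time: describe your hardware (number of GPUs, mixed precision, etc.)
accelerate config
# Launch fine-tuning with the script above
bash finetune.sh
```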