Google's TPU
--------------------------------------------------------------------

We document the Google TPU v2/v3 in order to support it in tinygrad without the XLA compiler.

## Creating a Google Cloud TPU VM

A TPU v2-8 machine, the cheapest TPU VM, costs $4.50/hr.

```bash
# create the VM and log in
gcloud alpha compute tpus tpu-vm create test --zone=us-central1-b --accelerator-type=v2-8 --version=v2-alpha
gcloud alpha compute tpus tpu-vm ssh test --zone us-central1-b
# and for when you are done
gcloud alpha compute tpus tpu-vm delete test --zone us-central1-b
# list what is left, to confirm nothing is still running
gcloud alpha compute tpus tpu-vm list --zone us-central1-b
```

Aside from the usual VM stuff, there are 4 accelerators on the PCIe bus (a v2-8 is 4 chips with 2 cores each).

```
# lspci
00:04.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:05.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:06.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:07.0 Unassigned class [ff00]: Google, Inc. Device 0027
```

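One way to find these devices programmatically is to walk sysfs instead of parsing `lspci` output. The sketch below is an illustration and not part of the original notes: `find_tpus` is a hypothetical helper, the device ID `0x0027` comes from the listing above, and the vendor ID `0x1ae0` for Google is an assumption that should be checked against `lspci -nn` on the VM.

```python
from pathlib import Path

GOOGLE_VENDOR_ID = 0x1AE0  # assumption: Google's PCI vendor ID, verify with `lspci -nn`
TPU_DEVICE_ID = 0x0027     # taken from the lspci listing above


def find_tpus(sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses under sysfs_root that match the TPU IDs."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    found = []
    for dev in sorted(root.iterdir()):
        try:
            # sysfs stores the IDs as text such as "0x1ae0\n"
            vendor = int((dev / "vendor").read_text().strip(), 16)
            device = int((dev / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # not a PCI device directory, or unreadable
        if vendor == GOOGLE_VENDOR_ID and device == TPU_DEVICE_ID:
            found.append(dev.name)
    return found


if __name__ == "__main__":
    for address in find_tpus():
        print(address)
```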
They show up in `/sys/class/accel` (tons of files here), and the userspace driver lives in `/lib/libtpu.so`. The device nodes are `/dev/accel[0-3]`, and a bunch of stuff is mmapped from them. They are "ba16c7433" chips.

We grab the minimal TPU [example from TensorFlow](https://github.com/tensorflow/tensorflow/blob/695b4c93d5da7277eb845937b79b66f9f363ed94/tensorflow/compiler/xla/python/tpu_driver/client/libtpu_client.c). When the compiler runs, it produces tons of great logs in `/tmp/tpu_logs`.

```bash
cd tfexample
gcc -o libtpu_client libtpu_client.c -ltpu
TPU_VLOG_LEVEL=99 ./libtpu_client
```

From these logs, we find the "LLO Instructions".

## VLIW Instruction (322b VLIW bundle)

```
spare : 0 (0,1)
vex_mxu : 0 (1,1)
* 1 misc slot
msc_targ : 0 (2,3)
msc_opnd : 0 (5,3)
msc_op : 0 (8,5)
msc_pred : 31 (13,5)
* 2 matrix slots (push, pop)
vres_dest : 28 (18,2)
vres_op : 28 (20,2)
vres_pred : 31 (22,5)
vex_source : 28 (27,2)
vex_subop : 24 (29,3)
vex_op : 24 (32,3)
vex_pred : 31 (35,5)
* 4 vector slots (2 for load/store)
vld_ttu : 30 (40,1)
vld_stride : 24 (41,3)
vld_offset : 24 (44,2)
vld_base : 24 (46,2)
vld_submsk : 24 (48,3)
vld_dest : 0 (51,5)
vld_op : 0 (56,2)
vld_pred : 31 (58,5)
vst_ttu : 30 (63,1)
vst_iar : 30 (64,1)
vst_value_two : 24 (65,3)
vst_offset : 24 (68,2)
vst_base : 24 (70,2)
vst_value_one : 24 (72,3)
vst_source : 0 (75,5)
vst_op : 0 (80,5)
vst_pred : 31 (85,5)
* 4 vector slots (2 for ALU)
v1_dest : 0 (90,5)
v1_y_vreg : 0 (95,5)
v1_y_src : 0 (100,5)
v1_x : 0 (105,5)
v1_op : 0 (110,6)
v1_pred : 31 (116,5)
v0_dest : 0 (121,5)
v0_y_vreg : 0 (126,5)
v0_y_src : 0 (131,5)
v0_x : 0 (136,5)
v0_op : 0 (141,6)
v0_pred : 31 (147,5)
* 3 scalar registers copied into the vector units?
vs2 : 0 (152,5)
vs1 : 0 (157,5)
vs0 : 0 (162,5)
* 6 immediates (16-bit each, two can be merged for 32)
imm_5 : 0 (167,16)
imm_4 : 0 (183,16)
imm_3 : 0 (199,16)
imm_2 : 0 (215,16)
imm_1 : 0 (231,16)
imm_0 : 0 (247,16)
* ttu? what's a ttu?
ttu_set_btr : 0 (263,1)
ttu_iterate : 0 (264,1)
ttu_row : 0 (265,3)
* 2 scalar slots
s1_dest : 0 (268,5)
s1_y : 0 (273,6)
s1_x : 0 (279,5)
s1_op : 0 (284,6)
s1_pred : 31 (290,5)
s0_dest : 0 (295,5)
s0_y : 0 (300,6)
s0_x : 0 (306,5)
s0_op : 0 (311,6)
s0_pred : 15 (317,5)
```

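Reading each `(a,b)` pair above as (bit offset, width), the listed fields tile bits 0 through 321 with no gaps, which matches the 322-bit bundle size. Under that reading, and assuming bit 0 is the least significant bit of the bundle (the logs do not confirm the bit order), fields can be read and written with shifts and masks on a Python integer. `get_field` and `set_field` are hypothetical helpers, and only a few fields are copied in to keep the sketch short.

```python
BUNDLE_BITS = 322

# name -> (bit offset, width), copied from the layout above (subset only)
FIELDS = {
    "msc_pred": (13, 5),
    "vex_pred": (35, 5),
    "v0_op": (141, 6),
    "v0_pred": (147, 5),
    "imm_0": (247, 16),
    "s0_op": (311, 6),
    "s0_pred": (317, 5),
}


def get_field(bundle: int, name: str) -> int:
    """Extract one field, assuming bit 0 is the bundle's least significant bit."""
    offset, width = FIELDS[name]
    return (bundle >> offset) & ((1 << width) - 1)


def set_field(bundle: int, name: str, value: int) -> int:
    """Return a copy of bundle with the named field replaced by value."""
    offset, width = FIELDS[name]
    mask = (1 << width) - 1
    if value < 0 or value > mask:
        raise ValueError(f"{name} is {width} bits wide, cannot hold {value}")
    return (bundle & ~(mask << offset)) | (value << offset)


if __name__ == "__main__":
    bundle = 0
    bundle = set_field(bundle, "s0_pred", 15)    # value seen in the dump above
    bundle = set_field(bundle, "imm_0", 0x1234)  # arbitrary illustrative immediate
    assert bundle < (1 << BUNDLE_BITS)
    print(hex(bundle), get_field(bundle, "s0_pred"), hex(get_field(bundle, "imm_0")))
```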
## Running a Program (WIP)

Our goal is to run a program on the TPU without the driver.

```
...
openat(AT_FDCWD, "/dev/accel3", O_RDWR) = 184
mmap(NULL, 27799736, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_LOCKED, 184, 0) = 0x7f59a74b3000
# size is 0x1a830b8, about 27.8 MB (26.5 MiB)
```

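As a starting point for reproducing this without `libtpu.so`, the same open and mmap can be issued from Python. This is a minimal sketch, not a working driver: `map_region` is a hypothetical helper, the size is the one from the trace above, and `MAP_LOCKED = 0x2000` is the Linux x86-64 flag value written out by hand, on the assumption that Python's `mmap` module does not export it. What lives inside the mapping is still unknown.

```python
import mmap
import os

REGION_SIZE = 0x1A830B8  # 27799736 bytes, from the strace output above
MAP_LOCKED = 0x2000      # assumption: Linux x86-64 value, defined here by hand


def map_region(path, size, extra_flags=0):
    """Open path read/write and map size bytes MAP_SHARED, like the trace."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(
            fd,
            size,
            flags=mmap.MAP_SHARED | extra_flags,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
    finally:
        # mmap keeps its own duplicate of the descriptor, so ours can be closed
        os.close(fd)


# On a TPU VM (untested assumption that no prior ioctl setup is required):
#   regs = map_region("/dev/accel3", REGION_SIZE, MAP_LOCKED)
#   first_word = int.from_bytes(regs[0:4], "little")
```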