Google's TPU
--------------------------------------------------------------------

We document the Google TPU v2/v3 in order to support it in tinygrad without the XLA compiler.

## Creating a Google Cloud TPU VM

The cheapest VM, a TPU v2-8 machine, costs $4.50/hr.

```bash
gcloud alpha compute tpus tpu-vm create test --zone=us-central1-b --accelerator-type=v2-8 --version=v2-alpha
gcloud alpha compute tpus tpu-vm ssh test --zone us-central1-b
# when you are done, delete the VM
gcloud alpha compute tpus tpu-vm delete test --zone us-central1-b
# then list what is left in the zone
gcloud alpha compute tpus tpu-vm list --zone us-central1-b
```

Aside from the usual VM stuff, there are 4 accelerators on the PCIe bus (a v2-8 is 4 chips with 2 cores each).

```
# lspci
00:04.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:05.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:06.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:07.0 Unassigned class [ff00]: Google, Inc. Device 0027
```

They show up in `/sys/class/accel` (tons of files here), and the driver lives in `/lib/libtpu.so`. The device nodes are `/dev/accel[0-3]`, and a bunch of stuff is mmapped. They are "ba16c7433" chips.
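
To check from Python which accelerator nodes a VM exposes, something like the following sketch may help. It only lists `accelN` entries in a directory. The helper name `list_accel_devices` is made up for illustration, and it assumes the nodes follow the `/dev/accel[0-3]` naming described above.

```python
import glob
import os


def list_accel_devices(dev_dir="/dev"):
    """Return paths of accelN nodes under dev_dir, sorted by device number."""
    found = []
    for path in glob.glob(os.path.join(dev_dir, "accel*")):
        suffix = os.path.basename(path)[len("accel"):]
        # Keep only names of the form accel<digits>; skip anything else.
        if suffix.isdigit():
            found.append((int(suffix), path))
    # Sort numerically so accel10 would come after accel9.
    return [path for _, path in sorted(found)]
```

On a v2-8 VM, `list_accel_devices()` would be expected to return the four `/dev/accel0` to `/dev/accel3` paths.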

We grab the minimal TPU [example from TensorFlow](https://github.com/tensorflow/tensorflow/blob/695b4c93d5da7277eb845937b79b66f9f363ed94/tensorflow/compiler/xla/python/tpu_driver/client/libtpu_client.c). When the compiler runs, it produces tons of great logs in `/tmp/tpu_logs`.

```bash
cd tfexample
gcc -o libtpu_client libtpu_client.c -ltpu
TPU_VLOG_LEVEL=99 ./libtpu_client
```

From these logs, we find the "LLO Instructions".

## VLIW Instruction (322b VLIW bundle)

```
  spare         : 0   (0,1)
  vex_mxu       : 0   (1,1)
* 1 misc slot
  msc_targ      : 0   (2,3)
  msc_opnd      : 0   (5,3)
  msc_op        : 0   (8,5)
  msc_pred      : 31  (13,5)
* 2 matrix slots (push, pop)
  vres_dest     : 28  (18,2)
  vres_op       : 28  (20,2)
  vres_pred     : 31  (22,5)
  vex_source    : 28  (27,2)
  vex_subop     : 24  (29,3)
  vex_op        : 24  (32,3)
  vex_pred      : 31  (35,5)
* 4 vector slots (2 for load/store)
  vld_ttu       : 30  (40,1)
  vld_stride    : 24  (41,3)
  vld_offset    : 24  (44,2)
  vld_base      : 24  (46,2)
  vld_submsk    : 24  (48,3)
  vld_dest      : 0   (51,5)
  vld_op        : 0   (56,2)
  vld_pred      : 31  (58,5)
  vst_ttu       : 30  (63,1)
  vst_iar       : 30  (64,1)
  vst_value_two : 24  (65,3)
  vst_offset    : 24  (68,2)
  vst_base      : 24  (70,2)
  vst_value_one : 24  (72,3)
  vst_source    : 0   (75,5)
  vst_op        : 0   (80,5)
  vst_pred      : 31  (85,5)
* 4 vector slots (2 for ALU)
  v1_dest       : 0   (90,5)
  v1_y_vreg     : 0   (95,5)
  v1_y_src      : 0   (100,5)
  v1_x          : 0   (105,5)
  v1_op         : 0   (110,6)
  v1_pred       : 31  (116,5)
  v0_dest       : 0   (121,5)
  v0_y_vreg     : 0   (126,5)
  v0_y_src      : 0   (131,5)
  v0_x          : 0   (136,5)
  v0_op         : 0   (141,6)
  v0_pred       : 31  (147,5)
* 3 scalar registers copied into the vector units?
  vs2           : 0   (152,5)
  vs1           : 0   (157,5)
  vs0           : 0   (162,5)
* 6 immediates (16-bit each, two can be merged for 32)
  imm_5         : 0   (167,16)
  imm_4         : 0   (183,16)
  imm_3         : 0   (199,16)
  imm_2         : 0   (215,16)
  imm_1         : 0   (231,16)
  imm_0         : 0   (247,16)
* ttu? what's a ttu?
  ttu_set_btr   : 0   (263,1)
  ttu_iterate   : 0   (264,1)
  ttu_row       : 0   (265,3)
* 2 scalar slots
  s1_dest       : 0   (268,5)
  s1_y          : 0   (273,6)
  s1_x          : 0   (279,5)
  s1_op         : 0   (284,6)
  s1_pred       : 31  (290,5)
  s0_dest       : 0   (295,5)
  s0_y          : 0   (300,6)
  s0_x          : 0   (306,5)
  s0_op         : 0   (311,6)
  s0_pred       : 15  (317,5)
```
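
The pair in parentheses reads naturally as (bit offset, bit width): each field starts where the previous one ends, and the last field ends at bit 322. The sketch below checks that and packs/unpacks fields from a bundle held as a Python integer. Two things here are assumptions, not facts from the logs: that offset 0 is the least significant bit, and the helper names `check_layout`, `encode`, and `decode`. The middle column of the listing is not interpreted.

```python
# (name, bit offset, bit width), copied from the field listing above.
FIELDS = [
    ("spare", 0, 1), ("vex_mxu", 1, 1),
    # misc slot
    ("msc_targ", 2, 3), ("msc_opnd", 5, 3), ("msc_op", 8, 5), ("msc_pred", 13, 5),
    # matrix slots
    ("vres_dest", 18, 2), ("vres_op", 20, 2), ("vres_pred", 22, 5),
    ("vex_source", 27, 2), ("vex_subop", 29, 3), ("vex_op", 32, 3), ("vex_pred", 35, 5),
    # vector load/store slots
    ("vld_ttu", 40, 1), ("vld_stride", 41, 3), ("vld_offset", 44, 2), ("vld_base", 46, 2),
    ("vld_submsk", 48, 3), ("vld_dest", 51, 5), ("vld_op", 56, 2), ("vld_pred", 58, 5),
    ("vst_ttu", 63, 1), ("vst_iar", 64, 1), ("vst_value_two", 65, 3), ("vst_offset", 68, 2),
    ("vst_base", 70, 2), ("vst_value_one", 72, 3), ("vst_source", 75, 5), ("vst_op", 80, 5),
    ("vst_pred", 85, 5),
    # vector ALU slots
    ("v1_dest", 90, 5), ("v1_y_vreg", 95, 5), ("v1_y_src", 100, 5), ("v1_x", 105, 5),
    ("v1_op", 110, 6), ("v1_pred", 116, 5),
    ("v0_dest", 121, 5), ("v0_y_vreg", 126, 5), ("v0_y_src", 131, 5), ("v0_x", 136, 5),
    ("v0_op", 141, 6), ("v0_pred", 147, 5),
    # scalar registers for the vector units
    ("vs2", 152, 5), ("vs1", 157, 5), ("vs0", 162, 5),
    # immediates
    ("imm_5", 167, 16), ("imm_4", 183, 16), ("imm_3", 199, 16),
    ("imm_2", 215, 16), ("imm_1", 231, 16), ("imm_0", 247, 16),
    # ttu
    ("ttu_set_btr", 263, 1), ("ttu_iterate", 264, 1), ("ttu_row", 265, 3),
    # scalar slots
    ("s1_dest", 268, 5), ("s1_y", 273, 6), ("s1_x", 279, 5), ("s1_op", 284, 6),
    ("s1_pred", 290, 5),
    ("s0_dest", 295, 5), ("s0_y", 300, 6), ("s0_x", 306, 5), ("s0_op", 311, 6),
    ("s0_pred", 317, 5),
]


def check_layout(fields):
    """Return the total bit width, raising if the fields leave gaps or overlap."""
    pos = 0
    for name, offset, width in fields:
        if offset != pos:
            raise ValueError(f"{name} starts at {offset}, expected {pos}")
        pos += width
    return pos


def decode(bundle):
    """Split a bundle (as an int) into a {field name: value} dict."""
    # ASSUMPTION: offset 0 is the least significant bit of the bundle.
    return {name: (bundle >> offset) & ((1 << width) - 1)
            for name, offset, width in FIELDS}


def encode(values):
    """Pack a {field name: value} dict into a bundle; unset fields are 0."""
    layout = {name: (offset, width) for name, offset, width in FIELDS}
    bundle = 0
    for name, value in values.items():
        offset, width = layout[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width}-bit field {name}")
        bundle |= value << offset
    return bundle
```

`check_layout(FIELDS)` returns 322, matching the bundle size in the heading, and `decode(encode(...))` round-trips any fields that were set. If the real bit order turns out to be the other way around, only the shift in `decode` and `encode` needs to change.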

## Running a Program (WIP)

Our goal is to run a program on TPU without the driver.

```
...
openat(AT_FDCWD, "/dev/accel3", O_RDWR) = 184
mmap(NULL, 27799736, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_LOCKED, 184, 0) = 0x7f59a74b3000
# size is 0x1a830b8 bytes, about 27.8 MB (26.5 MiB)
```
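
As a starting point for reproducing that trace without `libtpu.so`, here is a minimal sketch that repeats the same `openat` + `mmap` pair from Python. It is untested against real hardware: whether the device accepts the mapping without other setup the driver may do first is unknown, and the `MAP_LOCKED` flag from the trace is left out. The function name `map_device` is hypothetical.

```python
import mmap
import os

# Values taken from the strace output above.
DEVICE_PATH = "/dev/accel3"
MAP_SIZE = 27799736  # 0x1a830b8 bytes


def map_device(path=DEVICE_PATH, size=MAP_SIZE):
    """Open path read/write and map size bytes from offset 0 as a shared mapping."""
    fd = os.open(path, os.O_RDWR)
    try:
        # ASSUMPTION: a plain MAP_SHARED mapping is enough; the driver's trace
        # also passes MAP_LOCKED, which this sketch does not.
        return mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # The mmap object keeps its own duplicate of the descriptor.
        os.close(fd)
```

On the TPU VM this would be called as `regs = map_device()`, after which `regs[a:b]` reads raw bytes from the mapped region. What those bytes mean is still to be worked out.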