- 01 Apr, 2024 1 commit
-
Jing Chen authored
PiperOrigin-RevId: 620968670
-
- 31 Mar, 2024 1 commit
-
Andrei Vagin authored
PiperOrigin-RevId: 620610190
-
- 30 Mar, 2024 1 commit
-
Etienne Perot authored
I am hitting the 15-second deadline for llama2-70b @ fp16. It's a beast of a model with ~130GiB of RAM. PiperOrigin-RevId: 620382594
-
- 29 Mar, 2024 3 commits
-
Kevin Krakauer authored
PiperOrigin-RevId: 620360076
-
Jing Chen authored
PiperOrigin-RevId: 620321751
-
Jing Chen authored
PiperOrigin-RevId: 620107765
-
- 28 Mar, 2024 4 commits
-
Kevin Krakauer authored
These are just little optimizations found while working on other things. PiperOrigin-RevId: 620083599
-
gVisor bot authored
PiperOrigin-RevId: 620018746
-
Lucas Manning authored
Bug scenario:
T1: Creates waiter queue, adds waiter, emits mount promise block event, waits.
T2: Gets waiter queue from vfs.mountPromises with read lock.
T1: Daemon does mount, notifies original waiter, deletes promise.
T2: Emits another mount promise block event, but mount already happened!
T2: Waits forever for a mount that will never come.
PiperOrigin-RevId: 619974202
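A minimal Go sketch of the race-free pattern the fix implies (the `mountTable` type, field names, and functions below are hypothetical illustrations, not gVisor's actual code): the "already mounted?" check and the waiter registration must happen under the same lock, so a mount that completes in between can never be missed.
```go
package mountwait

import "sync"

// mountTable is an illustrative stand-in for the real mount-promise state.
type mountTable struct {
	mu       sync.Mutex
	mounted  map[string]bool
	promises map[string][]chan struct{}
}

func newMountTable() *mountTable {
	return &mountTable{
		mounted:  map[string]bool{},
		promises: map[string][]chan struct{}{},
	}
}

// waitForMount blocks until path is mounted, without the lost-wakeup race:
// the mounted check and the waiter registration happen under one lock.
func (m *mountTable) waitForMount(path string) {
	m.mu.Lock()
	if m.mounted[path] {
		// Re-check under the lock: the mount may already have happened.
		m.mu.Unlock()
		return
	}
	ch := make(chan struct{})
	m.promises[path] = append(m.promises[path], ch)
	m.mu.Unlock()
	<-ch
}

// completeMount marks the mount done, wakes every registered waiter while
// still holding the lock, and deletes the promise.
func (m *mountTable) completeMount(path string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.mounted[path] = true
	for _, ch := range m.promises[path] {
		close(ch)
	}
	delete(m.promises, path)
}
```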
-
Etienne Perot authored
This utility creates a nested structure out of a flat list of fully-qualified test names, and can then execute them using nested `t.Run`s that reflect the hierarchy properly. This is useful for CUDA sample tests, which are organized in a hierarchy. This hierarchy isn't known at compile time, so it cannot be reflected using plain `t.Run`s. PiperOrigin-RevId: 619730658
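A hedged Go sketch of how such a utility could work (the `node` type and its methods are illustrative assumptions, not gVisor's actual implementation): fully-qualified names like "a/b/c" are split into a tree, which is then walked with nested `t.Run` calls.
```go
package nested

import (
	"strings"
	"testing"
)

// node is one level of the test hierarchy.
type node struct {
	children map[string]*node
	run      func(t *testing.T) // non-nil only for leaf tests
}

func newNode() *node { return &node{children: map[string]*node{}} }

// Add inserts a fully-qualified test name such as "suite/group/test" along
// with its test function, creating intermediate levels as needed.
func (n *node) Add(name string, fn func(t *testing.T)) {
	cur := n
	for _, part := range strings.Split(name, "/") {
		child, ok := cur.children[part]
		if !ok {
			child = newNode()
			cur.children[part] = child
		}
		cur = child
	}
	cur.run = fn
}

// Run walks the tree with nested t.Run calls so that test output mirrors the
// hierarchy. (Map iteration order is random; a real implementation would
// likely sort child names first.)
func (n *node) Run(t *testing.T) {
	if n.run != nil {
		n.run(t)
	}
	for name, child := range n.children {
		t.Run(name, child.Run)
	}
}
```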
-
- 27 Mar, 2024 4 commits
-
Jing Chen authored
PiperOrigin-RevId: 619691808
-
Jing Chen authored
PiperOrigin-RevId: 619663125
-
Etienne Perot authored
Callers may request a container from the pool, and must release it back when they are done with it. This is useful for large tests which can `exec` individual test cases inside the same set of reusable containers, to avoid the cost of creating and destroying containers for each test. It also supports reserving the whole pool ("exclusive"), which locks out all other callers from getting any container from the pool. This is useful for tests where running in parallel may induce unexpected errors that running serially would not cause. This allows the test to first try to run in parallel, and then re-run failing tests exclusively to make sure their failure is not due to parallel execution. I plan to use this for CUDA sample tests. PiperOrigin-RevId: 619377512
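A rough Go sketch of the pool semantics described above (the `Pool` and `Container` types and method names are assumptions for illustration, not the actual API): `Get`/`Release` hand out one container at a time, and `Exclusive` waits for every container to be idle and then blocks other callers until released.
```go
package pool

import "sync"

// Container is a placeholder for a reusable test container.
type Container struct{ Name string }

// Pool hands out containers one at a time, or all at once ("exclusive").
type Pool struct {
	mu        sync.Mutex
	cond      *sync.Cond
	idle      []*Container
	total     int  // number of containers the pool owns
	exclusive bool // set while one caller holds the whole pool
}

func New(containers []*Container) *Pool {
	p := &Pool{idle: containers, total: len(containers)}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// Get blocks until a container is idle and no exclusive reservation is held.
func (p *Pool) Get() *Container {
	p.mu.Lock()
	defer p.mu.Unlock()
	for p.exclusive || len(p.idle) == 0 {
		p.cond.Wait()
	}
	c := p.idle[len(p.idle)-1]
	p.idle = p.idle[:len(p.idle)-1]
	return c
}

// Release returns a container to the pool and wakes waiters.
func (p *Pool) Release(c *Container) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, c)
	p.cond.Broadcast()
}

// Exclusive waits until every container is idle, then locks out all other
// callers until the returned release function is called.
func (p *Pool) Exclusive() (release func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for p.exclusive || len(p.idle) != p.total {
		p.cond.Wait()
	}
	p.exclusive = true
	return func() {
		p.mu.Lock()
		defer p.mu.Unlock()
		p.exclusive = false
		p.cond.Broadcast()
	}
}
```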
-
Etienne Perot authored
PiperOrigin-RevId: 619375261
-
- 26 Mar, 2024 4 commits
-
Zeling Feng authored
There is a potential panic if the keep-alive timer fires while the socket is going through cleanup: `e.route` might go away while the timer handler continues to execute, causing panics. The fix is to return early from the handler if the route has already been removed. PiperOrigin-RevId: 619268432
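An illustrative Go sketch of the guard described in the fix (the `endpoint` and `route` types and field names here are invented, not netstack's actual code): the timer handler re-checks, under the endpoint's lock, that the route still exists before using it.
```go
package keepalive

import "sync"

type route struct{ /* details elided */ }

// endpoint is a stand-in for the real TCP endpoint state.
type endpoint struct {
	mu    sync.Mutex
	route *route // cleared (set to nil) during socket cleanup
}

// keepaliveTimerExpired is the timer handler. It may race with cleanup, so it
// must verify the route is still present before touching it.
func (e *endpoint) keepaliveTimerExpired() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.route == nil {
		// The socket is being cleaned up and the route is gone; returning
		// early avoids dereferencing removed state and panicking.
		return
	}
	e.sendKeepaliveProbe(e.route)
}

func (e *endpoint) sendKeepaliveProbe(r *route) { /* send the probe; elided */ }
```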
-
Jing Chen authored
PiperOrigin-RevId: 619265846
-
gVisor bot authored
This is necessary to ensure errno is not updated while allocating. Allocators are allowed to update errno, even in case of success. Since gVisor uses matchers to check the value of errno, the tests might fail if errno is overridden by an allocation done while building the matcher. Using a custom implementation of new and delete ensures this does not happen. PiperOrigin-RevId: 619238390
-
Jing Chen authored
PiperOrigin-RevId: 619045692
-
- 25 Mar, 2024 3 commits
-
Jing Chen authored
PiperOrigin-RevId: 618998464
-
Etienne Perot authored
This is useful to debug long-running or stuck tests, as with `--test_output=errors` the log is only shown when the test fails or times out. PiperOrigin-RevId: 618908605
-
Etienne Perot authored
PIL requires a string format name, whereas `Format.PNG` is an "enum string" which is not quite the same type. I had not noticed this problem because I had manually tested on a VM where the cached Stable Diffusion XL Docker image predated this change; it has now been tested on a fresh VM with a freshly-built Docker image. This is split into two changes because the image needs to be rebuilt: the first change updates the image, and the second change re-enables the test. PiperOrigin-RevId: 618874414
-
- 23 Mar, 2024 2 commits
-
gVisor bot authored
Allocations are not guaranteed to preserve errno, even in case of success. Because the test matchers test against errno, preserve errno when allocating new matchers. PiperOrigin-RevId: 618437007
-
Etienne Perot authored
In a previous change, the GPU images changed from any-architecture to x86-only, so they are no longer available on ARM. Therefore, rules like:
```
gpu-smoke-images: load-basic_cuda-vector-add load-gpu_cuda-tests
.PHONY: gpu-smoke-images
```
... fail if these images don't exist on ARM. This change makes these image loads ignored instead. PiperOrigin-RevId: 618349678
-
- 22 Mar, 2024 7 commits
-
Steve Silva authored
PiperOrigin-RevId: 618286015
-
Kevin Krakauer authored
In a redis-benchmark PING_INLINE test, this reduces allocations by 32%. PiperOrigin-RevId: 618248114
-
gVisor bot authored
PiperOrigin-RevId: 618246127
-
Jing Chen authored
PiperOrigin-RevId: 618077589
-
Etienne Perot authored
PIL requires a string format name, whereas `Format.PNG` is an "enum string" which is not quite the same type. I had not noticed this problem because I had manually tested on a VM where the cached Stable Diffusion XL Docker image predated this change; it has now been tested on a fresh VM with a freshly-built Docker image. This is split into two changes because the image needs to be rebuilt: the first change updates the image, and the second change re-enables the test. PiperOrigin-RevId: 618048266
-
Jing Chen authored
PiperOrigin-RevId: 618047305
-
Jing Chen authored
PiperOrigin-RevId: 618034948
-
- 21 Mar, 2024 6 commits
-
Kevin Krakauer authored
This was the classic "os.File loves to close important file descriptors" problem.
-
Etienne Perot authored
PiperOrigin-RevId: 617956471
-
Ayush Ranjan authored
When the sandbox process has a large memory footprint (say >25 GiB), it can take more than 5 seconds for the sandbox process to disappear from the process table after receiving SIGKILL. This is more applicable for GPU applications, which can use very large amounts of memory. PiperOrigin-RevId: 617931970
-
Nayana Bidari authored
CPUUsage() returns the CPU usage used in calculating the pod CPU utilization. The cgroup v1 version returns this value in nanoseconds, but the v2 version was returning it in microseconds, which resulted in incorrect CPU usage values when cgroup v2 was used as the default. Fix this by changing the return value of CPUUsage() in cgroup v2 to nanoseconds. PiperOrigin-RevId: 617903260
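For reference, a tiny Go sketch of the unit conversion in question (the function name is hypothetical): cgroup v2's `usage_usec` is reported in microseconds and must be scaled by 1000 to match the nanoseconds that cgroup v1 reports.
```go
package cgroup

// cpuUsageNanosFromV2 converts cgroup v2's usage_usec (microseconds) to
// nanoseconds so that it matches the cgroup v1 CPUUsage() value.
func cpuUsageNanosFromV2(usageUsec uint64) uint64 {
	return usageUsec * 1000 // 1 microsecond = 1000 nanoseconds
}
```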
-
Ayush Ranjan authored
Tested on a T4 GPU with driver version 535.161.07:
```
$ docker run --runtime=runsc --rm -it --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
PiperOrigin-RevId: 617867263
-
Jing Chen authored
PiperOrigin-RevId: 617760440
-
- 20 Mar, 2024 4 commits
-
Kevin Krakauer authored
cl/591090544 introduced a flag that was not added to tcp_benchmark, breaking it. PiperOrigin-RevId: 617622867
-
Jing Chen authored
PiperOrigin-RevId: 617621125
-
Kevin Krakauer authored
In a redis-benchmark PING_INLINE test, this reduces allocations by 84%. PiperOrigin-RevId: 617542009
-
Ayush Ranjan authored
PiperOrigin-RevId: 617535894
-