Troubleshoot GitHub Actions Cache 404 Error: Upload Not Found

by Mireille Lambert

Hey guys,

I'm diving deep into troubleshooting a tricky issue I encountered while stress-testing my cache server. I figured sharing this could help others facing similar problems or even spark some insights from you all. So, let's get into it!

The Problem: 404 Error During FinalizeCacheEntryUpload

I've been stress-testing my cache server, and I've run into this error that's got me scratching my head:

Warning: Failed to save: Failed to FinalizeCacheEntryUpload: Received non-retryable error: Failed request: (404) Upload not found

This error pops up during the FinalizeCacheEntryUpload process, which is the last step in saving a cache entry. A 404 error typically means the server can't find the resource, but in this case, it's weird because the initial upload seems to go through just fine.
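For context, my mental model of the save flow looks like the sketch below. The Twirp method names are the ones that appear in the logs; everything else (types, helper names) is my own paraphrase, not actual actions/cache toolkit code:

// Rough sketch of the save flow as I understand it; the Twirp method names match
// what shows up in the logs, everything else is my own paraphrase.

interface CacheServiceClient {
  // Registers the entry and hands back somewhere to upload the archive to.
  createCacheEntry(req: { key: string; version: string }): Promise<{ signedUploadUrl: string }>;
  // Tells the server the upload is done so the entry can be made visible to restores.
  finalizeCacheEntryUpload(req: { key: string; version: string; sizeBytes: number }): Promise<{ ok: boolean }>;
}

async function saveCache(
  client: CacheServiceClient,
  key: string,
  version: string,
  sizeBytes: number,
  upload: (url: string) => Promise<void>, // streams the ~10 GB archive to storage
): Promise<void> {
  // Step 1: create the entry.
  const { signedUploadUrl } = await client.createCacheEntry({ key, version });
  // Step 2: push the archive bytes (this part succeeds in my logs).
  await upload(signedUploadUrl);
  // Step 3: finalize. This is the call that times out and eventually 404s for me.
  await client.finalizeCacheEntryUpload({ key, version, sizeBytes });
}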

Digging Deeper: The Logs

To understand this better, let's look at the logs from both the GitHub Actions side and the server side.

GitHub Actions Logs:

Sent 10351934759 of 10486152487 (98.7%), 151.8 MBs/sec
Sent 10486152487 of 10486152487 (100.0%), 153.2 MBs/sec
Attempt 1 of 5 failed with error: Request timeout: /twirp/github.actions.results.api.v1.CacheService/FinalizeCacheEntryUpload. Retrying request in 3000 ms...
Attempt 2 of 5 failed with error: Request timeout: /twirp/github.actions.results.api.v1.CacheService/FinalizeCacheEntryUpload. Retrying request in 5374 ms...
Attempt 3 of 5 failed with error: Request timeout: /twirp/github.actions.results.api.v1.CacheService/FinalizeCacheEntryUpload. Retrying request in 8804 ms...
Warning: Failed to save: Failed to FinalizeCacheEntryUpload: Received non-retryable error: Failed request: (404) Upload not found

From the GitHub Actions logs, we can see that the upload itself completes (all 10486152487 bytes are sent), but the FinalizeCacheEntryUpload call then times out on the first three attempts and finally comes back with a 404, which the client treats as non-retryable. The timeouts suggest the server is taking a long time to process the finalize request, whether because of network latency or server-side work.

Server Logs:

[cache-server-node-1] ⚙ Request: POST /twirp/github.actions.results.api.v1.CacheService/FinalizeCacheEntryUpload

[cache-server-node-1]  ERROR  Response: POST /twirp/github.actions.results.api.v1.CacheService/FinalizeCacheEntryUpload > 404
 Upload not found

    at createError$1 (server/index.mjs:647:15)
    at Object.handler (server/chunks/routes/twirp/github.actions.results.api.v1.CacheService/FinalizeCacheEntryUpload.mjs:46:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async Object.handler (server/index.mjs:1633:19)
    at async Server.toNodeHandle (server/index.mjs:1904:7)

  [cause]: { statusCode: 404, statusMessage: 'Upload not found' }

The server logs confirm the 404 error and give us a stack trace. The key part here is the Upload not found message, which indicates that the server couldn't locate the uploaded data when trying to finalize the cache entry.
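To make that concrete, here's roughly what I assume the handler is doing when it throws this error. This is my own sketch with made-up names (the createError$1 frame in the trace looks like a bundled h3/Nitro server, so I'm borrowing h3's createError), not the server's actual code:

// Hypothetical sketch of a FinalizeCacheEntryUpload handler; names are made up.
import { createError } from "h3";

interface PendingUpload {
  uploadId: string;
  s3Key: string;
}

// Placeholder for the Postgres/S3 layer; purely illustrative.
type Store = {
  findPendingUpload(key: string, version: string): Promise<PendingUpload | null>;
  completeS3Upload(upload: PendingUpload): Promise<void>;
  markEntryCommitted(upload: PendingUpload): Promise<void>;
};

async function finalizeCacheEntryUpload(
  store: Store,
  body: { key: string; version: string; sizeBytes: number },
) {
  // Look up the pending upload that CreateCacheEntry should have recorded.
  const upload = await store.findPendingUpload(body.key, body.version);

  if (!upload) {
    // This is the branch I'm hitting: either the record was never written,
    // or it was already consumed/deleted by an earlier finalize attempt.
    throw createError({ statusCode: 404, statusMessage: "Upload not found" });
  }

  // Otherwise complete the upload in S3 and mark the entry as usable.
  await store.completeS3Upload(upload);
  await store.markEntryCommitted(upload);
  return { ok: true };
}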

The Odd Part: Cache Restoration Works!

Now, here's the really strange part. Despite the 404 error during the finalization, the cache actually seems to work when restoring it in subsequent GitHub Actions runs. Check this out:

Run actions/cache@v4
Attempt 1 of 5 failed with error: Request timeout: /twirp/github.actions.results.api.v1.CacheService/GetCacheEntryDownloadURL. Retrying request in 3000 ms...
Cache hit for restore-key: test-foo-Linux-17143879533
Cache Size: ~10000 MB (10486152487 B)
/usr/bin/tar -xf /home/runner/_work/_temp/84dc1a91-1c67-4187-9b4e-6e01ada531a3/cache.tzst -P -C /home/runner/_work/poc-arc/poc-arc --use-compress-program unzstd
Cache restored successfully
Cache restored from key: test-foo-Linux-17143879533

As you can see, the cache is hit and the files are restored successfully. So the data is clearly making it into storage and the entry is usable; the disconnect is purely in the finalization handshake. One possibility I'm considering is that the first finalize attempt actually completes on the server after the client has already timed out and given up, so the retried requests find no pending upload left to finalize and get the 404, even though the entry itself is already committed. I haven't confirmed that yet, though.

My Setup

To give you some context, I've set up the server with Postgres and S3 for storage. I'm able to reproduce this issue consistently with the following GitHub Actions workflow:

jobs:
  ci:
    name: Run CI
    runs-on: arc-stg-amd64
    steps:
      # At this step it will attempt to restore foo from the cache server (if the corresponding key is found)
      - name: Cache foo directory
        uses: actions/cache@v4
        with:
          path: foo
          key: test-foo-${{ runner.os }}-${{ github.run_id }}
          restore-keys: |
            test-foo-${{ runner.os }}-

      - name: Display contents after restore
        run: |
          echo "Contents of foo:"
          ls -l foo || echo "No cache found"

      # Generate a total of 10GB files under 'foo' to create a large cache for the stress test
      - name: Generate large cache files
        run: |
          mkdir -p foo
          for i in {1..100}; do
            head -c 100M </dev/urandom > foo/artifact_$i.bin
          done

This workflow does the following:

  1. Restores the cache: It tries to restore the foo directory from the cache server using the actions/cache@v4 action.
  2. Displays contents: It lists the contents of the foo directory to check if the cache was restored successfully.
  3. Generates large cache files: This is the key part for reproducing the issue. It creates 100 files, each 100MB in size, totaling 10GB, inside the foo directory. This large cache size seems to be a factor in triggering the error. Note that the actual cache save, and with it the failing FinalizeCacheEntryUpload call, happens in the action's post step at the end of the job, once these files exist.

Potential Culprits and Troubleshooting Steps

Based on my observations and the logs, here are a few potential causes and troubleshooting steps I've been considering:

  1. S3 Upload Inconsistencies: The 404 error suggests that the server might not be able to locate the uploaded data in S3 when finalizing the cache entry. S3 now provides strong read-after-write consistency for new objects, so a plain propagation delay is unlikely, but an in-flight multipart upload isn't visible as an object until it has been completed, which would look exactly like this.

    • Troubleshooting: I'm planning to verify the data's existence in S3 (with a few retries) before finalizing the cache entry, and to double-check that any multipart upload is actually completed first. I've put a rough sketch of what that check could look like right after this list.
  2. Timeout Issues: The timeout errors in the GitHub Actions logs indicate potential network latency or server-side delays. These delays could be causing the FinalizeCacheEntryUpload request to fail before the data is fully processed.

    • Troubleshooting: I'll be looking into optimizing my server's performance and network configuration to reduce latency. I might also consider increasing the timeout values in the GitHub Actions cache action to allow more time for the finalization process.
  3. Concurrency Issues: If multiple uploads are happening concurrently, there might be some race conditions or conflicts that lead to the 404 error. The server might be trying to finalize an entry before all the data is written, or there might be conflicts in metadata updates.

    • Troubleshooting: I'm going to analyze the server's concurrency handling and look for potential race conditions. I might need to implement some locking or synchronization mechanisms to ensure data integrity during concurrent uploads.
  4. Large Cache Size: The fact that this issue seems to occur with large caches suggests that the size of the uploaded data might be a factor. The server might be hitting some limits or encountering performance bottlenecks when dealing with large files.

    • Troubleshooting: I'll be profiling the server's performance during large cache uploads and looking for potential bottlenecks. This might involve optimizing data transfer, storage, or processing logic to handle large files more efficiently.
  5. Twirp Implementation: The error occurs within the Twirp service calls. There might be an issue in how the Twirp implementation handles large uploads or finalization requests.

    • Troubleshooting: I plan to review the Twirp service implementation, focusing on the FinalizeCacheEntryUpload handler. I'll look for potential bugs or inefficiencies in how it interacts with the storage backend and handles large data transfers.
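Here's the rough shape of the existence check from idea 1. It uses @aws-sdk/client-s3; the bucket/key plumbing and the retry policy are placeholders for my setup, not anything the server currently does:

// Rough shape of the "is the object really in S3 yet?" check from idea 1.
import { S3Client, HeadObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

async function waitForObject(
  bucket: string,
  key: string,
  expectedSize: number,
  attempts = 5,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
      // Only accept the object once its size matches what the client reported.
      if (head.ContentLength === expectedSize) return true;
    } catch (err: any) {
      // HeadObject fails with NotFound until the object (or completed multipart upload) exists.
      const status = err?.$metadata?.httpStatusCode;
      if (status !== 404 && err?.name !== "NotFound") throw err;
    }
    // Simple linear backoff; good enough for an experiment.
    await new Promise((resolve) => setTimeout(resolve, 1000 * (i + 1)));
  }
  return false;
}

// In the finalize handler, something like:
//   if (!(await waitForObject(bucket, upload.s3Key, body.sizeBytes))) {
//     throw createError({ statusCode: 404, statusMessage: "Upload not found" });
//   }

If this never reports a missing object while the 404s keep happening, that would point away from S3 itself and toward the metadata/locking side of the server.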

Next Steps

So, that's where I'm at right now. I'm leaning towards the S3 consistency or timeout issues as the most likely culprits, but I'm keeping an open mind and exploring all the possibilities.

I'd love to hear your thoughts and suggestions! Have you guys encountered similar issues before? Any insights or debugging tips would be greatly appreciated.

I'll keep you updated on my progress as I continue to troubleshoot this. Let's crack this nut together!

Update

I have an update regarding the FinalizeCacheEntryUpload 404 error. After further investigation, I've narrowed down the issue to a potential race condition in my server's handling of concurrent uploads and finalization requests. It seems that when multiple large cache uploads occur simultaneously, the server might attempt to finalize an entry before all the data has been fully written to S3, leading to the "Upload not found" error. I'm currently implementing a locking mechanism to ensure that finalization requests are processed only after the data is completely stored. I'll keep you posted on the results of this fix. Thanks for your support and suggestions!
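For anyone curious, the lock I'm experimenting with is roughly this shape: a per-entry async mutex so a finalize request waits for any in-flight write on the same key. The names are illustrative, and in a multi-node setup this would have to move into Postgres (advisory locks or similar) rather than an in-process map:

// Minimal per-entry async mutex, keyed by `${cacheKey}:${version}` (illustrative names).
const locks = new Map<string, Promise<void>>();

async function withEntryLock<T>(entryId: string, fn: () => Promise<T>): Promise<T> {
  // Whatever operation is currently running (or queued) for this entry.
  const previous = locks.get(entryId) ?? Promise.resolve();

  let release!: () => void;
  const current = new Promise<void>((resolve) => (release = resolve));
  locks.set(entryId, current);

  await previous; // wait for the in-flight upload/finalize on the same entry
  try {
    return await fn();
  } finally {
    release();
    // Drop the map entry if nothing queued up behind us.
    if (locks.get(entryId) === current) locks.delete(entryId);
  }
}

// Usage in the handlers (pseudocode):
//   await withEntryLock(`${key}:${version}`, () => finalizeEntry(key, version));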

