Downloading files
Reworkd can automatically handle downloading files on your behalf. They are stored in our own infrastructure and links to the downloaded files are exposed in all export formats.
Setting up downloads
- Create a field in your schema with the URL type.
- Select the Download file from URL check box.
- Run a job that saves data with the new URL download field populated.
- Once the data is saved the file will be automatically downloaded and available within exports. Note that this occurs asynchronously so it may take time for your file downloads to appear.
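Because downloads complete asynchronously, a client may want to poll until the file record appears. The following is only a minimal sketch: `wait_for_file` and the `fetch_export` callable are hypothetical names, not part of any Reworkd SDK, and the sketch assumes the export is available as a dict with a `files` list like the one shown under Retrieving files.

```python
import time
from typing import Any, Callable, Optional


def wait_for_file(
    fetch_export: Callable[[], dict[str, Any]],  # hypothetical: returns the latest export as a dict
    field: str,
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
) -> Optional[dict[str, Any]]:
    """Poll until a file record for `field` shows up, or give up after `timeout_s`."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Look for a file record attached to the requested output field.
        for record in fetch_export().get("files", []):
            if record.get("field") == field:
                return record
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)
```

The function returns `None` on timeout instead of raising, so the caller can decide whether a missing file is an error or simply not ready yet.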
Note: This download url field in the actual scraped data will
Retrieving files
Both export formats expose the file download information once the files have been downloaded successfully. To create an export, refer to the Exporting data section.
Below is an example of file data as returned by the API. The important fields to consider are:
- s3_url: The pre-signed URL to our S3 bucket used to actually retrieve the file.
- source_url: The URL on the source website where the file is from. Will point to our S3 bucket if there is no canonical source URL. In that case, file_metadata.dynamic_download will also be set to true.
- field: What field in the output data the file is associated with.
{
  ...
  "files": [
    {
      "id": "70057eca-d05c-4a33-ae84-4af8dce83ce3",
      "field": "attachments[0].url",
      "url_etag_hash": "92359181252f9b52a4da21599fbf8f8d.pdf",
      "s3_key": "test_key.pdf",
      "s3_url": "https://files.reworkd.dev/test_url",
      "source_url": "https://source-website.com/download/49a42973",
      "create_date": "2024-08-26T18:49:31.575000",
      "file_url": "s3://deworkd-prod-files/11ee111ee.pdf",
      "file_type": "pdf",
      "file_checksum": "7eec76e4bd1fed22f5d7d5fa7efbeaf717a77da771bb5c61e09b0d7ae46bbd",
      "file_metadata": {
        "url": "https://source-website.com/download/49a42973",
        "filename": "Test document.pdf",
        "dynamic_download": "true"
      }
    }
  ],
  ...
}
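As a rough illustration of consuming this structure, the sketch below indexes the `files` array by its `field` value so that each downloaded file can be matched back to the scraped record. It assumes the export has already been loaded into a Python dict shaped like the example; the helper name `index_files_by_field` is hypothetical and not part of any Reworkd SDK.

```python
from typing import Any


def index_files_by_field(export: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Map each file's `field` path to its file record.

    Assumes `export` has a top-level "files" list shaped like the example.
    """
    return {record["field"]: record for record in export.get("files", [])}


# Trimmed copy of the example record.
export = {
    "files": [
        {
            "field": "attachments[0].url",
            "s3_url": "https://files.reworkd.dev/test_url",
            "source_url": "https://source-website.com/download/49a42973",
            "file_type": "pdf",
        }
    ]
}

files = index_files_by_field(export)
# The pre-signed link is the one to use when actually retrieving the file.
download_link = files["attachments[0].url"]["s3_url"]
print(download_link)
```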
How are files downloaded?
Regular downloads
- Regular downloads are downloads where the file is available directly at a URL. For example, the URL may link to a PDF or trigger the PDF download directly.
- The canonical download URL will be used.
- We execute these downloads asynchronously via Lambda. These files go through a separate download queue, so there may be some delay before your files are actually downloaded.
Dynamic downloads
- Sometimes no canonical link for the download exists, for example when the file download is triggered via JavaScript or the site uses a link that only works for the active session.
- In such cases, because no canonical link exists, we use the link to our S3 bucket.
- When this occurs, we download the file in the browser worker itself, which guarantees that the file is actually downloaded and is correct.
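On the consuming side, a file record can be classified as a regular or dynamic download by reading `file_metadata.dynamic_download`. This is a minimal sketch with a hypothetical helper name. The example payload shows the flag as the string "true", so the sketch accepts both string and boolean values as a precaution, which is an assumption rather than documented behavior.

```python
from typing import Any


def is_dynamic_download(file_record: dict[str, Any]) -> bool:
    """Return True when the record is flagged as a dynamic download."""
    metadata = file_record.get("file_metadata") or {}
    flag = metadata.get("dynamic_download", False)
    if isinstance(flag, str):
        # Assumption: the flag may arrive as the string "true", as in the example payload.
        return flag.strip().lower() == "true"
    return bool(flag)


record = {"file_metadata": {"dynamic_download": "true"}}
kind = "dynamic" if is_dynamic_download(record) else "regular"
print(kind)
```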
For more information on how downloads are handled in code, read the following:
Handling file downloading: A technical look at how downloads are handled in code
Why am I getting an S3 URL for the download?
Read the Dynamic downloads section above!
Deduplicating Files
File downloads are de-duplicated to ensure that:
- There is an accurate count of unique files downloaded.
- We do not download the same file multiple times.
File-level de-duplication occurs in two stages:
- We make a HEAD request to the URL. Occasionally the website will provide ETag metadata, which we can use to determine whether a file has changed since we last downloaded it.
- If ETag metadata is not present, we download the file directly and compare its hash with the hash we calculated when we last downloaded the file. If the hash has changed, we save the new version of the file.
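The two stages can be sketched roughly as follows. This is an illustration of the logic described here, not Reworkd's actual implementation: the function names are hypothetical, SHA-256 is an assumed hash algorithm, and the ETag is passed in as a plain value so that no particular HTTP client is implied.

```python
import hashlib
from typing import Callable, Optional


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of the file body (the hash algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def needs_new_version(
    remote_etag: Optional[str],     # ETag from the HEAD response, or None if absent
    stored_etag: Optional[str],     # ETag recorded at the previous download, if any
    stored_hash: Optional[str],     # content hash recorded at the previous download
    download: Callable[[], bytes],  # fetches the file body; only called in stage 2
) -> bool:
    """Decide whether a new version of the file should be saved."""
    # Stage 1: an ETag is available, so compare it without downloading the body.
    if remote_etag is not None:
        return remote_etag != stored_etag
    # Stage 2: no ETag, so download the body and compare content hashes.
    return content_hash(download()) != stored_hash


body = b"%PDF-1.7 example"
# ETag unchanged: nothing to save, and the body is never fetched.
print(needs_new_version('"abc"', '"abc"', None, lambda: body))            # False
# No ETag: fall back to hashing the downloaded body.
print(needs_new_version(None, None, content_hash(body), lambda: body))    # False
print(needs_new_version(None, None, content_hash(b"old"), lambda: body))  # True
```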
File storage
All files are stored within our internal S3 buckets. The file links we provide on export are pre-signed links to the files within our bucket. These links expire after 30 days. We can only guarantee that your downloaded files remain stored within our S3 buckets for 90 days. If your use case requires longer retention periods, please let us know!
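To plan around these limits, the arithmetic can be sketched as below. The helper names are hypothetical, and the sketch assumes that the 30-day link lifetime starts when the export (and its links) is generated and that the 90-day retention period starts at the file's create_date. Neither starting point is confirmed here, so treat both as assumptions.

```python
from datetime import datetime, timedelta

LINK_LIFETIME = timedelta(days=30)     # pre-signed link expiry
RETENTION_PERIOD = timedelta(days=90)  # guaranteed storage period


def link_expired(link_issued_at: datetime, now: datetime) -> bool:
    """True once a pre-signed link is past its 30-day lifetime.

    Assumption: the lifetime is counted from when the export was generated.
    """
    return now - link_issued_at >= LINK_LIFETIME


def retention_guaranteed(downloaded_at: datetime, now: datetime) -> bool:
    """True while the file is still inside the 90-day guaranteed retention window.

    Assumption: the window is counted from the file's create_date.
    """
    return now - downloaded_at < RETENTION_PERIOD


downloaded_at = datetime.fromisoformat("2024-08-26T18:49:31.575000")
now = datetime(2024, 10, 1)
print(retention_guaranteed(downloaded_at, now))   # True: about 35 days old
print(link_expired(datetime(2024, 8, 27), now))   # True: link issued 35 days earlier
```

If a link has expired while the file is still retained, creating a fresh export should yield a new pre-signed link.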