Handling file downloading

A technical look at how downloads are handled in code

Regular downloads are file downloads where the file link is available directly within the HTML (typically the href of an a tag). Visiting these links will directly take you to the file.

To download from these links, we simply save the URL of the file as if it were regular data. Once the data has been saved, an AWS Lambda function asynchronously visits each link and uses curl-cffi to download and save the file. The following is an example of how this might be done:

# Select the link element itself
# This might look like: <a href="https://www.test.com/file.pdf" class="download">Download</a>
link = await sdk.page.query_selector('a.download')

# Get the href on the page directly
href = await link.get_attribute("href")

# Save the href directly. Our Lambda will visit the URL and download the file
await sdk.save_data({"download_url": href})
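An href read from the page can be relative (for example `files/q1.pdf`), while a worker that visits the link later needs an absolute URL. The sketch below shows one way to resolve the href before saving it, plus a rough outline of what the download step could look like with curl-cffi. The function names, the `/tmp` destination, and the `impersonate="chrome"` setting are assumptions made for illustration; this is not the actual Lambda code.

```python
import posixpath
from pathlib import Path
from urllib.parse import unquote, urljoin, urlsplit


def absolute_download_url(page_url: str, href: str) -> str:
    """Resolve an href against the URL of the page it was found on."""
    return urljoin(page_url, href.strip())


def filename_from_url(url: str, default: str = "download.bin") -> str:
    """Use the last path segment of the URL as the file name."""
    name = posixpath.basename(unquote(urlsplit(url).path))
    return name or default


def fetch_file(url: str, dest_dir: str = "/tmp") -> Path:
    """Hypothetical worker-side download; requires the curl_cffi package."""
    # Imported lazily so the helpers above work without curl_cffi installed.
    from curl_cffi import requests

    response = requests.get(url, impersonate="chrome", timeout=30)
    response.raise_for_status()
    target = Path(dest_dir) / filename_from_url(url)
    target.write_bytes(response.content)
    return target
```

In a scraper, `absolute_download_url` would be called with the current page URL and the href just before `sdk.save_data`, so that the saved `download_url` is always absolute.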

One download edge case is when the download link is not directly provided on the site, but available through clicking a button or link.

To get the file URL in this case, we must click the button/link to open the page, record the URL once it loads, and then navigate back to the original URL. The sdk.capture_url method does this entire process automatically. After calling it, we save the URL just as we did above. The following is an example of how this might be done:

# Select the element that will open the page
# This might be an element like: <button class="download">Download</button>
element = await sdk.page.query_selector('button.download')

# Capture the URL that is opened
# This will click the element and record the page it ultimately navigates to
download_url = await sdk.capture_url(element)

# Save the URL directly. Our Lambda will visit the URL and download the file as usual
await sdk.save_data({"download_url": download_url})
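For intuition, the click, record, and go-back sequence described above can be sketched by hand. This is not the implementation of sdk.capture_url; it assumes that sdk.page behaves like a Playwright async page (with `wait_for_load_state`, `url`, and `go_back`), and the function name `capture_url_sketch` is made up for illustration.

```python
async def capture_url_sketch(page, element) -> str:
    """Click `element`, record the URL the page lands on, then go back."""
    await element.click()
    # Wait for the navigation triggered by the click to finish loading.
    await page.wait_for_load_state("load")
    captured = page.url
    # Return to the original page so later selectors still work.
    await page.go_back()
    return captured
```

A link that opens in a new tab or window would need different handling, which this sketch does not attempt.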

Handling JavaScript/dynamic downloads

The trickier download edge case is when buttons/links on the page trigger file downloads dynamically. Instead of opening a link to the file, a download event is triggered on the browser itself and your browser will ask you if you want to download the file or not.

In this case, there is no direct link to the file that we can save. We’re forced to handle the download in the browser itself. To do this, we can use the capture_download method. Because it actually triggers the download, it can:

  1. Create a “url” to be used when saving the data

  2. Provide the title of the file, since the file has already been downloaded

It returns an object whose url and title fields expose this information. The following is an example of how this might be done:

# Select the element that will trigger the download
# This might also be an element like: <button class="download">Download</button>
element = await sdk.page.query_selector('button.download')

# Capture the download event
# This will click the element and download the file directly
download_metadata = await sdk.capture_download(element)

# The metadata is an object since we've already handled the entirety of the download
await sdk.save_data({
    "attachment": {
        "download_url": download_metadata["url"],
        "title": download_metadata["title"],
    },
})
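To make the idea of a browser-level download event concrete, here is a minimal sketch of how such an event can be captured with Playwright's `expect_download`. It assumes sdk.page is a Playwright async page and is not the implementation of capture_download: the function name is hypothetical, and the browser-reported URL stands in for whatever "url" the real method creates.

```python
async def capture_download_sketch(page, element) -> dict:
    """Click `element` and collect metadata about the download it triggers."""
    # Start listening before the click so the download event is not missed.
    async with page.expect_download() as download_info:
        await element.click()
    download = await download_info.value
    return {
        "url": download.url,  # may be a blob: or data: URL for generated files
        "title": download.suggested_filename,
    }
```

The returned dictionary has the same `url` and `title` keys used in the example above, so it could be passed to a save call in the same way.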

Handling downloads requiring cookies/session information

Some websites strictly enforce that the same browser session that visited the page is the one that views/downloads the file. In these cases, our Lambda approach fails since it cannot emulate the original browser session that opened the page.

For these websites, use the same approach described in "Handling JavaScript/dynamic downloads" above: capture the download in the browser itself with capture_download, so the file is fetched within the original session.