Additional functionality

Progress Meter

By default, nothing is shown while data is downloading. To display a progress meter, pass true as the second argument to urldownload:

using UrlDownload

url = "https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-December-2019-quarter/Download-data/business-price-indexes-december-2019-quarter-csv.csv"

urldownload(url, true)
# Progress: 45%|████████████████████                      | Time: 0:00:01

Custom parsers

If a file type is not supported by UrlDownload.jl, a custom parser can be used to process the data. The parser must accept one positional argument of type Vector{UInt8} and may take optional keyword arguments.

Pass it in the parser argument of urldownload:

using UrlDownload
using DataFrames
using CSV

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, parser = x -> DataFrame(CSV.File(IOBuffer(x))))
# 2×2 DataFrame
# │ Row │ x     │ y     │
# │     │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1   │ 1     │ 2     │
# │ 2   │ 3     │ 4     │

Alternatively, one can use parser = identity and process the data outside of the function:

using UrlDownload
using DataFrames
using CSV

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, parser = identity) |>
  x -> DataFrame(CSV.File(IOBuffer(x)))

If the custom parser takes keyword arguments, they receive their values from the keyword arguments of urldownload:

using UrlDownload
using DataFrames
using CSV

wrapper(x; kw...) = DataFrame(CSV.File(IOBuffer(x); kw...))

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolon.csv"
res = urldownload(url, parser = wrapper, delim = ';')
# 2×2 DataFrame
# │ Row │ x     │ y     │
# │     │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1   │ 1     │ 2     │
# │ 2   │ 3     │ 4     │
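
The parser contract described above (one positional Vector{UInt8} argument plus optional keyword arguments) can be tried without any network access. The following is a minimal sketch that uses only Base Julia; line_parser and its keepempty keyword are hypothetical names chosen for illustration and are not part of UrlDownload.jl.

```julia
# Hypothetical custom parser: turns raw bytes into a vector of text lines.
# It follows the contract from the text: one positional Vector{UInt8}
# argument and an optional keyword argument.
function line_parser(bytes::Vector{UInt8}; keepempty::Bool = false)
    # String(...) takes ownership of its argument, so copy to leave `bytes` intact
    text = String(copy(bytes))
    return split(text, '\n'; keepempty = keepempty)
end

# Stand-in for downloaded data, so the sketch runs offline
sample = Vector{UInt8}("x,y\n1,2\n3,4\n")

lines = line_parser(sample)
all_lines = line_parser(sample; keepempty = true)

println(length(lines))      # 3
println(length(all_lines))  # 4, the trailing empty line is kept
```

Following the rule stated above, such a parser would be passed as parser = line_parser, with keepempty = true supplied as a keyword argument of urldownload.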

Compressed files

UrlDownload.jl can process compressed data, detecting the compression format automatically. The following formats are currently supported: :xz, :gzip, :bzip2, :lz4, :zstd, :zip.

using UrlDownload
using DataFrames

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.gz"
res = urldownload(url) |> DataFrame
# 2×2 DataFrame
# │ Row │ x     │ y     │
# │     │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1   │ 1     │ 2     │
# │ 2   │ 3     │ 4     │

To override the compression type, either pass one of the formats :xz, :gzip, :bzip2, :lz4, :zstd, :zip in the compress argument or specify :none. In the latter case, a custom parser, if used, must decompress the data itself:

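using UrlDownload
using DataFrames
using CodecZlib
using CSV

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.gz"
# Force gzip decompression instead of relying on autodetection
res = urldownload(url, compress = :gzip) |> DataFrame

# Disable built-in decompression; the custom parser decompresses the data itself
res = urldownload(url, compress = :none, parser = x -> DataFrame(CSV.File(GzipDecompressorStream(IOBuffer(x)))))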

For all compression types except :zip, urldownload automatically applies the CSV.File transformation. If an archive holds any other kind of data, it should be processed with a custom parser.

Files in a :zip archive are processed one at a time, with the usual auto-detection rules applied to each. If the archive contains a single file, it is decompressed as a single object; otherwise only the first file is unpacked. This behavior can be overridden with multifiles = true, in which case urldownload returns a Vector of processed objects.

using UrlDownload
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test2.zip"
res = urldownload(url, multifiles = true)

length(res) # 2

Undetected file types

Sometimes the file type can't be detected from the URL. In this case, supply the optional format argument to force the necessary behavior:

using UrlDownload
using DataFrames

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/noextcsv"
df = urldownload(url, format = :CSV) |> DataFrame

# 2×2 DataFrame
# │ Row │ x     │ y     │
# │     │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1   │ 1     │ 2     │
# │ 2   │ 3     │ 4     │

Storing raw data

Sometimes the data is too big to download multiple times, so you may want to store the original version of the file locally. To do so, use the save_raw argument, which can be either a String with the file name or an IOStream.

using UrlDownload
using DataFrames

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url; save_raw = "/tmp/data.csv") |> DataFrame
# 2×2 DataFrame
# │ Row │ x     │ y     │
# │     │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1   │ 1     │ 2     │
# │ 2   │ 3     │ 4     │

You can then verify that the original data was stored locally:

sh> cat /tmp/data.csv
x,y
1,2
3,4

Since urldownload also supports reading local files, you can load the stored data as follows:

res = urldownload("/tmp/data.csv") |> DataFrame

Resource type autodetection

As mentioned in the previous section, urldownload can autodetect the resource type and download and process the data accordingly. If autodetection fails, you can always fall back to defining the resource manually with the File and URL structs or the @f_str and @u_str macros:

using DataFrames
using UrlDownload
using UrlDownload: File, URL, @f_str, @u_str

# All of these are equivalent
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url) |> DataFrame

url = u"https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url) |> DataFrame

url = @u_str "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url) |> DataFrame

url = URL("https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv")
res = urldownload(url) |> DataFrame

The same applies to local files:

using DataFrames
using UrlDownload
using UrlDownload: File, URL, @f_str, @u_str

# All of these are equivalent
url = "/tmp/data.csv"
res = urldownload(url) |> DataFrame

url = f"/tmp/data.csv"
res = urldownload(url) |> DataFrame

url = @f_str "/tmp/data.csv"
res = urldownload(url) |> DataFrame

url = File("/tmp/data.csv")
res = urldownload(url) |> DataFrame