Additional functionality
Progress Meter
By default, nothing is shown while data is downloading. To display a progress meter, pass true as the second argument to urldownload:
using UrlDownload
url = "https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-December-2019-quarter/Download-data/business-price-indexes-december-2019-quarter-csv.csv"
urldownload(url, true)
# Progress: 45%|████████████████████ | Time: 0:00:01
Custom parsers
If a file type is not supported by UrlDownload.jl, a custom parser can be used to process the data. The parser must accept one positional argument of type Vector{UInt8} and may take optional keyword arguments.
Pass it via the parser argument of urldownload.
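To make this contract concrete, the following is a minimal sketch of such a parser using only Base Julia. The name parse_int_table and its delim and header keywords are hypothetical, chosen for illustration; they are not part of UrlDownload.jl.

```julia
# Hypothetical parser: one positional Vector{UInt8} argument plus optional keywords.
function parse_int_table(data::Vector{UInt8}; delim = ',', header = true)
    # Copy first, because String(::Vector{UInt8}) takes ownership of its argument.
    lines = split(strip(String(copy(data))), '\n')
    rows = header ? lines[2:end] : lines
    return [parse.(Int, split(row, delim)) for row in rows]
end

# Raw bytes, in the form a downloaded file body would arrive in.
raw = Vector{UInt8}("x;y\n1;2\n3;4\n")
parse_int_table(raw; delim = ';')   # [[1, 2], [3, 4]]
```

A function of this shape could then be passed as parser = parse_int_table, with delim = ';' given as a keyword argument of urldownload.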
using UrlDownload
using DataFrames
using CSV
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, parser = x -> DataFrame(CSV.File(IOBuffer(x))))
# 2×2 DataFrame
# │ Row │ x │ y │
# │ │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1 │ 1 │ 2 │
# │ 2 │ 3 │ 4 │
Alternatively, use parser = identity and process the data outside of the function:
using UrlDownload
using DataFrames
using CSV
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, parser = identity) |>
x -> DataFrame(CSV.File(IOBuffer(x)))
If the custom parser takes keyword arguments, they receive their values from the keyword arguments passed to urldownload:
using UrlDownload
using DataFrames
using CSV
wrapper(x; kw...) = DataFrame(CSV.File(IOBuffer(x); kw...))
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolon.csv"
res = urldownload(url, parser = wrapper, delim = ';')
# 2×2 DataFrame
# │ Row │ x │ y │
# │ │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1 │ 1 │ 2 │
# │ 2 │ 3 │ 4 │
Compressed files
UrlDownload.jl can process compressed data and detects the compression format automatically. The following formats are currently supported: :xz, :gzip, :bzip2, :lz4, :zstd, :zip.
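One common way to detect a compression format is to inspect the leading "magic" bytes of the data. The sketch below illustrates that idea for the six formats listed above. It is not UrlDownload.jl's actual detection code, and the names SKETCH_MAGIC and sniff_compression are hypothetical.

```julia
# Illustrative only: leading magic bytes of each supported container format.
const SKETCH_MAGIC = [
    :xz    => UInt8[0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00],
    :gzip  => UInt8[0x1F, 0x8B],
    :bzip2 => UInt8[0x42, 0x5A, 0x68],        # "BZh"
    :lz4   => UInt8[0x04, 0x22, 0x4D, 0x18],  # LZ4 frame format
    :zstd  => UInt8[0x28, 0xB5, 0x2F, 0xFD],
    :zip   => UInt8[0x50, 0x4B, 0x03, 0x04],  # "PK\x03\x04"
]

# Return the first format whose magic bytes prefix `data`, or :none.
function sniff_compression(data::Vector{UInt8})
    for (kind, magic) in SKETCH_MAGIC
        n = length(magic)
        if length(data) >= n && data[1:n] == magic
            return kind
        end
    end
    return :none
end

sniff_compression(UInt8[0x1F, 0x8B, 0x08, 0x00])   # :gzip
sniff_compression(Vector{UInt8}("x,y\n1,2\n"))     # :none
```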
using UrlDownload
using DataFrames
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.gz"
res = urldownload(url) |> DataFrame
# 2×2 DataFrame
# │ Row │ x │ y │
# │ │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1 │ 1 │ 2 │
# │ 2 │ 3 │ 4 │
To override the detected compression type, pass one of the formats :xz, :gzip, :bzip2, :lz4, :zstd, :zip in the compress argument, or specify :none. In the latter case, a custom parser must decompress the data itself:
using UrlDownload
using DataFrames
using CodecZlib
using CSV
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.gz"
# Declare the compression type explicitly instead of relying on autodetection
res = urldownload(url, compress = :gzip) |> DataFrame
# Disable built-in decompression; the custom parser decompresses the data itself
res = urldownload(url, compress = :none, parser = x -> DataFrame(CSV.File(GzipDecompressorStream(IOBuffer(x)))))
For all compression types except :zip, urldownload automatically applies the CSV.File transformation. If an archive holds any other kind of data, process it with a custom parser.
Files in a :zip archive are processed one by one, with the usual auto-detection rules applied to each. If the archive contains a single file, it is decompressed as a single object; otherwise only the first file is unpacked. To override this behavior, set multifiles = true, in which case urldownload returns a Vector of processed objects.
using UrlDownload
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test2.zip"
res = urldownload(url, multifiles = true)
length(res) # 2
Undetected file types
Sometimes the file type cannot be detected from the URL. In this case, supply the optional format argument to force the required behavior.
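A quick way to see why detection can fail, assuming the format is inferred from the file extension in the URL (an assumption about the mechanism, not something stated here), is to look at what Base's splitext returns. The helper ext_of is hypothetical.

```julia
# Assumption: the format is inferred from the extension of the last path segment.
ext_of(url::AbstractString) = lowercase(last(splitext(url)))

ext_of("https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv")    # ".csv"
ext_of("https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/noextcsv")   # ""
```

The second URL carries no extension at all, so nothing in the name hints at CSV, and format = :CSV has to be given explicitly.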
using UrlDownload
using DataFrames
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/noextcsv"
df = urldownload(url, format = :CSV) |> DataFrame
# 2×2 DataFrame
# │ Row │ x │ y │
# │ │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1 │ 1 │ 2 │
# │ 2 │ 3 │ 4 │
Storing raw data
Sometimes the data is too large to download repeatedly, so you may want to store the original file locally. To do so, use the save_raw argument, which can be either a String with the file name or an IOStream.
using UrlDownload
using DataFrames
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url; save_raw = "/tmp/data.csv") |> DataFrame
# 2×2 DataFrame
# │ Row │ x │ y │
# │ │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1 │ 1 │ 2 │
# │ 2 │ 3 │ 4 │
You can verify that the original data was stored locally:
sh> cat /tmp/data.csv
x,y
1,2
3,4
Since urldownload also reads local files, you can load the stored data as follows:
res = urldownload("/tmp/data.csv") |> DataFrame
Resource type autodetection
As mentioned in the previous section, urldownload autodetects the resource type (URL or local file) and downloads and processes the data accordingly. If autodetection fails, you can always fall back to defining the resource manually with the File and URL structs or the @f_str and @u_str macros.
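The relationship between autodetection and the explicit wrappers can be pictured with a small dispatch sketch. Everything in it is hypothetical: SketchURL, SketchFile and classify are stand-ins, and the scheme-prefix heuristic is an assumption for illustration, not the package's actual rule.

```julia
# Hypothetical stand-ins for the package's URL and File wrappers.
struct SketchURL
    url::String
end

struct SketchFile
    path::String
end

# Assumed heuristic: strings starting with an http(s) scheme are URLs,
# anything else is treated as a local path.
classify(s::AbstractString) = occursin(r"^https?://", s) ? SketchURL(s) : SketchFile(s)

# Explicit wrappers skip the heuristic entirely.
classify(x::Union{SketchURL, SketchFile}) = x

classify("https://example.com/data.csv")                # a SketchURL
classify("/tmp/data.csv")                               # a SketchFile
classify(SketchFile("http://looks-like-a-url.csv"))     # stays a SketchFile
```

Wrapping the string by hand is what removes the guesswork: the caller, not a heuristic, decides how the resource is treated.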
using DataFrames
using UrlDownload
using UrlDownload: File, URL, @f_str, @u_str
# All of these are equivalent
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url) |> DataFrame
url = u"https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url) |> DataFrame
url = @u_str "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url) |> DataFrame
url = URL("https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv")
res = urldownload(url) |> DataFrame
The same applies to local files:
using DataFrames
using UrlDownload
using UrlDownload: File, URL, @f_str, @u_str
# All of these are equivalent
url = "/tmp/data.csv"
res = urldownload(url) |> DataFrame
url = f"/tmp/data.csv"
res = urldownload(url) |> DataFrame
url = @f_str "/tmp/data.csv"
res = urldownload(url) |> DataFrame
url = File("/tmp/data.csv")
res = urldownload(url) |> DataFrame