Usage examples
Here you can find snippets that can help in building fast and simple data pipelines with PCloud.jl. They are not necessarily the best way to solve each problem, but they can serve as a starting point. They also illustrate how Julia techniques such as broadcasting and anonymous functions can be combined with pCloud to achieve results with little effort.
Uploading and downloading CSV
CSV is a common format for storing data, and CSV.jl provides a convenient function, CSV.write, which can write data to an IOBuffer, which in turn can be uploaded to pCloud.
Let's create a DataFrame:
using CSV
using DataFrames
using Random
df = DataFrame(x = rand(10), y = rand(1:10, 10), z = [randstring(5) for _ in 1:10])
# 10×3 DataFrame
# │ Row │ x │ y │ z │
# │ │ Float64 │ Int64 │ String │
# ├─────┼───────────┼───────┼────────┤
# │ 1 │ 0.0756344 │ 6 │ H3BIk │
# │ 2 │ 0.396882 │ 5 │ Rv2SB │
# │ 3 │ 0.797529 │ 5 │ M61Hw │
# │ 4 │ 0.856915 │ 5 │ jLc7K │
# │ 5 │ 0.0120147 │ 1 │ HgZMA │
# │ 6 │ 0.493593 │ 3 │ ENfu3 │
# │ 7 │ 0.27618 │ 2 │ MIU5B │
# │ 8 │ 0.492329 │ 10 │ QflU7 │
# │ 9 │ 0.398613 │ 10 │ 4XioP │
# │ 10 │ 0.40273 │ 10 │ PQs14 │
To store this DataFrame in pCloud, we write its contents to an IOBuffer and upload the resulting buffer with the uploadfile function:
using PCloud
using PCloud: uploadfile, getfilelink
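token = "YOUR_TOKEN" # replace with your pCloud auth token
client = PCloudClient(auth_token = token)
# CSV.write needs an IOBuffer instance (not the type) and returns the buffer it wrote to
buf = CSV.write(IOBuffer(), df)
res = uploadfile(client, files = "data.csv" => buf)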
The returned response res contains the necessary information about the resulting file. To get the file back, we can use getfilelink:
using UrlDownload
using Underscores
df2 = @_ getfilelink(client, fileid = first(res.fileids)) |>
urldownload("https://" * first(__.hosts) * __.path) |> DataFrame
# 10×3 DataFrame
# │ Row │ x │ y │ z │
# │ │ Float64 │ Int64 │ String │
# ├─────┼───────────┼───────┼────────┤
# │ 1 │ 0.0756344 │ 6 │ H3BIk │
# │ 2 │ 0.396882 │ 5 │ Rv2SB │
# │ 3 │ 0.797529 │ 5 │ M61Hw │
# │ 4 │ 0.856915 │ 5 │ jLc7K │
# │ 5 │ 0.0120147 │ 1 │ HgZMA │
# │ 6 │ 0.493593 │ 3 │ ENfu3 │
# │ 7 │ 0.27618 │ 2 │ MIU5B │
# │ 8 │ 0.492329 │ 10 │ QflU7 │
# │ 9 │ 0.398613 │ 10 │ 4XioP │
# │ 10 │ 0.40273 │ 10 │ PQs14 │
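The download step above builds the URL from the hosts and path fields of the getfilelink response. As a minimal sketch, the same step can be wrapped in a small helper. Note that filelink_url is a hypothetical name, not part of PCloud.jl, and the response below is a mock NamedTuple with made-up values standing in for a real API response.

```julia
# Hypothetical helper (not part of PCloud.jl): build a direct download URL
# from any object exposing `hosts` and `path`, as the pipeline above does.
filelink_url(resp) = "https://" * first(resp.hosts) * resp.path

# Mock response with made-up values, used only to show the shape of the data.
mock = (hosts = ["host1.example.com", "host2.example.com"],
        path = "/some/path/data.csv")

url = filelink_url(mock)
println(url)
```

With a real response, the resulting string could be passed to urldownload exactly as in the pipeline above.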
Working with compressed CSV
Since CSV files can be rather large, it is common practice to compress them before uploading. This can be done as follows (assuming the same df as in the previous example):
using CodecZlib
buf = CSV.write(IOBuffer(), df) |> seekstart |> GzipCompressorStream
res = uploadfile(client, files = "data.csv.gz" => buf)
Note that we need seekstart here: after the IOBuffer is written, its pointer is located at the end, so a subsequent read of the buffer in uploadfile would return an empty array. Also, this example uses GzipCompressorStream, but any other compression algorithm can be used; refer to TranscodingStreams.jl.
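The effect of seekstart can be checked without pCloud at all. The sketch below uses only Base Julia and a small hand-written string in place of the CSV output; it illustrates IOBuffer behaviour in general and does not show how uploadfile itself reads the buffer.

```julia
# Write some bytes into a fresh in-memory buffer.
io = IOBuffer()
write(io, "x,y\n1,2\n")

# After writing, the read/write position sits at the end of the data,
# so reading from the current position yields nothing.
pos_after_write = position(io)
bytes_without_rewind = read(io)

# Rewinding moves the position back to the start, making the data readable.
seekstart(io)
content = String(read(io))

println(pos_after_write)
println(isempty(bytes_without_rewind))
println(repr(content))
```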
To verify the result of the upload:
using UrlDownload
using Underscores
df2 = @_ getfilelink(client, fileid = first(res.fileids)) |>
urldownload("https://" * first(__.hosts) * __.path) |> DataFrame
# 10×3 DataFrame
# │ Row │ x │ y │ z │
# │ │ Float64 │ Int64 │ String │
# ├─────┼───────────┼───────┼────────┤
# │ 1 │ 0.0756344 │ 6 │ H3BIk │
# │ 2 │ 0.396882 │ 5 │ Rv2SB │
# │ 3 │ 0.797529 │ 5 │ M61Hw │
# │ 4 │ 0.856915 │ 5 │ jLc7K │
# │ 5 │ 0.0120147 │ 1 │ HgZMA │
# │ 6 │ 0.493593 │ 3 │ ENfu3 │
# │ 7 │ 0.27618 │ 2 │ MIU5B │
# │ 8 │ 0.492329 │ 10 │ QflU7 │
# │ 9 │ 0.398613 │ 10 │ 4XioP │
# │ 10 │ 0.40273 │ 10 │ PQs14 │
Uploading generated image
In this example we use Luxor.jl to generate an image and getfilepublink to create a public link to the resulting image.
using Luxor
d = Drawing(600, 400, :png)
origin()
background("white")
for θ in range(0, step=π/8, length=16)
gsave()
scale(0.25)
rotate(θ)
translate(250, 0)
randomhue()
julialogo(action=:fill, color=false)
grestore()
end
gsave()
scale(0.3)
juliacircles()
grestore()
translate(200, -150)
scale(0.3)
julialogo()
finish()
Note that we passed the :png argument to Drawing to force in-memory image processing.
using PCloud
using PCloud: uploadfile, getfilepublink
token = "YOUR_TOKEN" # replace with your pCloud auth token
client = PCloudClient(auth_token = token)
res = uploadfile(client, files = "logo.png" => d.buffer)
getfilepublink(client, fileid = first(res.fileids)).link
# "https://u.pcloud.link/publink/show?code=XZh8FEkZ6vBed7DI1Wys8g7BHl8FFVuhUSSX"
If you follow this link, you can see that it is a valid PNG file.
Project Gutenberg and downloadfile
The PCloud.downloadfile method can download files from URLs directly to pCloud. This can be very useful during web crawling, when information of interest should be saved for further investigation. As an example, we download the top 10 books from Project Gutenberg:
using Underscores
using Gumbo
using Cascadia
using Cascadia: matchFirst
using UrlDownload
using PCloud
using PCloud: createfolder, downloadfile
token = "YOUR_TOKEN" # replace with your pCloud auth token
client = PCloudClient(auth_token = token)
# folder where books will be stored
folderid = createfolder(client, folderid = 0, name = "Gutenberg").metadata.folderid
host = "https://www.gutenberg.org"
# helper function for parsing data downloaded by `urldownload` to a more useful format
pageparser(x) = parsehtml(String(x)).root
# helper function which finds download url on each book page
# should be used for parsing each individual book page, for example
# getlink("https://www.gutenberg.org/ebooks/1342") would produce url to
# "Pride and Prejudice" in epub format.
getlink(url) = @_ urldownload(url, parser = pageparser) |>
host*matchFirst(sel"a[type='application/epub+zip']", __).attributes["href"]
# This is the central pipeline: it parses the top scores page, extracts the top 10 books,
# extracts the download url for each book with the help of `getlink` and finally
# downloads everything to pCloud
@_ urldownload("https://www.gutenberg.org/browse/scores/top", parser = pageparser) |>
matchFirst(sel"ol", __) |> eachmatch(sel"li", __)[1:10] |>
matchFirst.(Ref(sel"a"), __) |> map(host*_.attributes["href"], __) |>
getlink.(__) |> join(__, " ") |>
downloadfile(client, url = __, folderid = folderid)
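The URL-assembly part of this pipeline can be tried offline. The sketch below uses only Base Julia and two illustrative relative links in place of the scraped ones (the second link is an assumption added for the example). It shows an anonymous function and broadcasting producing the same absolute URLs, which are then joined into a single space-separated string, the form passed to downloadfile above.

```julia
host = "https://www.gutenberg.org"

# Illustrative relative links; in the pipeline above they come from the
# `href` attributes of the scraped page.
hrefs = ["/ebooks/1342", "/ebooks/84"]

# Variant 1: anonymous function with `map`.
pages_map = map(h -> host * h, hrefs)

# Variant 2: broadcasting; the string `host` is treated as a scalar.
pages_bcast = host .* hrefs

# Join everything into one space-separated string, as done before the
# `downloadfile` call in the pipeline above.
urls = join(pages_map, " ")
println(urls)
```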