Upload to Cloud and Then Unzip File
Automatic Unzipping of Google Cloud Storage files
Welcome to the Data Analytics/Warehousing realm on Google Cloud Platform. I am sure you are taking advantage of the infrastructure and services that Google Cloud Platform provides, in a way that is most convenient and efficient for you.
Data warehousing pipelines on Google Cloud usually (for most use cases) start with ingestion of data into a landing/staging zone. Google Cloud Storage is the service that serves as this landing/staging area. GCS (Google Cloud Storage) holds all kinds of data and is preferred for unstructured data such as CSV files or blobs.
Ingestion of data can be achieved in several ways.
- If data is coming to GCP from on-prem and security is a requirement, it is ordinarily transferred over VPN.
- We can extract data from an API using code and put it in the bucket (see the sketch after this list).
- Some vendors also send out data from an automated email platform to a business email address that we use.
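As a quick illustration of the API option, here is a minimal sketch of pulling data from a vendor API and landing it in a GCS bucket. The API URL, bucket name, and object name are placeholders I introduced for illustration, not details from the original pipeline.

# Minimal sketch: fetch data from a (hypothetical) vendor API and land it
# in a GCS staging bucket. URL, bucket, and object names are placeholders.
import requests
from google.cloud import storage

def land_api_data(bucket_name: str, object_name: str) -> None:
    # Fetch raw bytes from the vendor API.
    response = requests.get("https://api.example.com/daily-export", timeout=60)
    response.raise_for_status()

    # Write the payload to the landing/staging bucket.
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(object_name).upload_from_string(response.content)

land_api_data("my-landing-bucket", "vendor/daily-export.csv")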
From both on-prem transfers and automated emails, data arrives in compressed form, either .zip or .tar.gz. That makes sense, since transmission is faster and size is reduced. But we cannot use data in its compressed/zipped format. We need to decompress the zip file in order to use the data that resides within it, or to be able to use it with another service. I had a similar challenge of decompressing zipped files that I was getting from an external vendor.
Let's move on to the solution and set up an automated pipeline to get the extracted content. There are two approaches one can take, as there is no native capability that allows us to extract zipped files within Google Cloud Storage. Both approaches are fully automatic once set up and run in near real time (of course, if files are big, processing time will be longer).
In the first approach, a file drops into Cloud Storage, which triggers a Cloud Function on the object-creation event. Code (written in any supported programming language) in the Cloud Function extracts the file and writes the contents back to Cloud Storage. By the way, Cloud Functions is a serverless compute platform on GCP that runs a piece of code when triggered. This works, but you will have to write the code yourself, and computing capacity is limited for a Cloud Function (a minimal sketch of this first approach follows below). I chose the second approach, which I describe next.
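Before moving on, here is what that first approach might look like: a minimal sketch of a Cloud Function that unzips a newly uploaded object in memory and writes the extracted files to another bucket. The output bucket name is a placeholder, and because everything is held in memory this only suits reasonably small archives.

# Minimal sketch of the first approach: a Cloud Function, triggered by
# object creation, that unzips the uploaded object and writes the
# extracted files to another bucket. OUTPUT_BUCKET is a placeholder.
import io
import zipfile
from google.cloud import storage

OUTPUT_BUCKET = "<output-bucket-extracted-content>"

def unzip_gcs_object(event, context):
    """Triggered by google.storage.object.finalize on the landing bucket."""
    client = storage.Client()
    source_blob = client.bucket(event["bucket"]).blob(event["name"])
    archive = io.BytesIO(source_blob.download_as_bytes())

    output_bucket = client.bucket(OUTPUT_BUCKET)
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            output_bucket.blob(member).upload_from_string(zf.read(member))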
For the second approach, let me explain a bit what the Cloud Dataflow service is. It is fully managed, serverless Apache Beam, ready to work for you, and it auto-scales as resources are required. You may be wondering whether you need to write code for Dataflow as well. Luckily, Google provides a few tailored Dataflow templates that we can use, and Bulk Decompress Cloud Storage Files is the template we need. There is no way to automatically run or schedule a Dataflow template as a file arrives, or to use a cron job from within the Dataflow service. Hence, we need to take the help of a Cloud Function, which will be triggered when an object arrives and in turn launch our template through the Dataflow API, which will do our work and write the extracted data back to Google Cloud Storage. Below is sample Python code which you can use in the Cloud Function to invoke the Dataflow job.
from googleapiclient.discovery import build
def main(event, context):
    """Triggered by a change to a Cloud Storage bucket.

    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    file = event
    print(f"Processing file: {file['name']}.")

    project = "<your-gcp-project-id>"
    job = "<unique-dataflow-jobname>"
    template = "gs://<path-to-bucket-storing-template>/Bulk_Decompress_GCS_Files"

    # Template parameters: where the zipped file lives, where to write the
    # extracted content, and where to record any failures.
    parameters = {
        "inputFilePattern": "gs://<bucketname-with-zip-file>/" + file["name"],
        "outputDirectory": "gs://<output-bucket-extracted-content>",
        "outputFailureFile": "gs://<temp-bucket-to-store-failure>/failed.csv",
    }

    # Execution environment for the Dataflow workers.
    environment = {
        "tempLocation": "gs://<temp-bucket-location>/",
        "workerRegion": "us-central1",
        "maxWorkers": 2,
        "subnetwork": "regions/us-central1/subnetworks/default",
    }

    # Launch the Bulk Decompress template through the Dataflow API.
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().templates().launch(
        projectId=project,
        gcsPath=template,
        body={
            "jobName": job,
            "parameters": parameters,
            "environment": environment,
        },
    )
    response = request.execute()
    print(response)
Below is requirements.txt:
google-api-python-client==1.7.11
oauth2client==4.1.3
google-auth-oauthlib==0.4.1
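As a follow-up, you may want to check whether the launched Dataflow job actually started. This is a minimal sketch of such a check, assuming it is appended at the end of main() after request.execute(); job_id and status are variables I introduced, while response, dataflow, and project come from the code above, and projects().jobs().get() is part of the same Dataflow v1b3 API.

    # Sketch (not from the original article): look up the launched job's
    # current state using the same Dataflow API client.
    job_id = response["job"]["id"]
    status = dataflow.projects().jobs().get(
        projectId=project,
        jobId=job_id,
    ).execute()
    print(status.get("currentState", "JOB_STATE_UNKNOWN"))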
I hope this is helpful for your use case. I tried finding a good piece of code to unzip files from a serverless coding platform, but it was a bit tricky, so I wanted to share it thinking it might help you. Feel free to drop me a message at my email, darpan.3073@yahoo.com, for any concerns or queries!
Source: https://darpan-3073.medium.com/automated-unzipping-of-google-cloud-storage-files-577835c1f45f