- Python commands in Jupyter:
- %%writefile spark_analysis.py : create or replace a file.
  - %%writefile -a spark_analysis.py : append to the existing file.
- import google.cloud.storage as gcs : import the Google Cloud Storage client library (see the sketch below).
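- A minimal sketch (not from the lab) of how the imported storage client might be used; the bucket name and object paths are hypothetical, and Application Default Credentials are assumed:
```python
import google.cloud.storage as gcs

BUCKET = "my-project-id"  # hypothetical bucket name

client = gcs.Client()  # uses Application Default Credentials
bucket = client.bucket(BUCKET)

# Upload a local file (similar to `gcloud storage cp spark_analysis.py gs://$BUCKET/...`).
blob = bucket.blob("sparktodp/spark_analysis.py")
blob.upload_from_filename("spark_analysis.py")

# List objects under a prefix (similar to `gcloud storage ls gs://$BUCKET/sparktodp/**`).
for b in client.list_blobs(BUCKET, prefix="sparktodp/"):
    print(b.name)
```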
- !wget https://storage.googleapis.com/cloud-training/dataengineering/lab_assets/sparklab/kddcup.data_10_percent.gz : download a file.
- !hadoop fs -put kddcup* / : copy all local files matching "kddcup*" into the root of the default Hadoop file system (fs : file system).
- !hadoop fs -ls / : list the contents of the root directory of the Hadoop file system.
- !gcloud storage cp spark_analysis.py gs://$BUCKET/sparktodp/spark_analysis.py : save a copy of the Python file to a Cloud Storage bucket.
- Import PySpark, initialize a Spark session, get the Spark context, then read data from a Cloud Storage bucket and cache it:
from pyspark.sql import SparkSession, SQLContext, Row
spark = SparkSession.builder.appName("kdd").getOrCreate()
sc = spark.sparkContext
data_file = "gs://{}/kddcup.data_10_percent.gz".format(BUCKET)
raw_rdd = sc.textFile(data_file).cache()
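- A follow-up sketch showing how the cached RDD might be used; it assumes the variables above and that column index 1 of the KDD Cup CSV is the protocol type:
```python
# Split each CSV line into columns and count connections per protocol.
csv_rdd = raw_rdd.map(lambda row: row.split(","))
protocol_counts = (
    csv_rdd.map(lambda cols: (cols[1], 1))   # assumed: column 1 = protocol_type
           .reduceByKey(lambda a, b: a + b)
)
print(protocol_counts.collect())
```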
- Linux commands in Cloud Shell:
- export PROJECT_ID=$(gcloud config get-value project) : create an environment variable.
  - BUCKET=$PROJECT_ID : also create an environment variable (here the bucket name is the project ID).
- echo $PROJECT_ID : show an environment variable.
- cat /tmp/output-* : concatenate and show the contents of all files in /tmp whose names start with "output-".
- ls -al /tmp : list all files in /tmp, regardless of the current working directory.
  - ls -al /tmp/myoutput* : the same, but only for files whose names start with "myoutput".
- nano submit_onejob.sh : create and edit a new script file:
```bash
#!/bin/bash
gcloud dataproc jobs submit pyspark \
  --cluster sparktodp \
  --region REGION \
  spark_analysis.py \
  -- --bucket=$1
```
  - chmod +x submit_onejob.sh : grant execute permission.
  - ./submit_onejob.sh $PROJECT_ID : execute the script.
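- For context, a sketch of how spark_analysis.py might read the --bucket value passed after the -- separator; the argparse usage here is an assumption, not the lab's exact code:
```python
# Hypothetical top of spark_analysis.py: consume the --bucket flag from submit_onejob.sh.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--bucket", help="GCS bucket used for input and output")
args = parser.parse_args()
BUCKET = args.bucket
print("Using bucket:", BUCKET)
```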
- gsutil cp gs://your-bucket-name/path/to/file.csv.gz . : download a gzip-compressed file.
  - zcat file.csv.gz | head -n 10 : read the first 10 lines of the compressed file without unzipping it.
  - gunzip file.csv.gz : decompress the file.
- cat /proc/cpuinfo | grep "model name" : check the CPU model in Cloud Shell.
  - top -bn1 | grep "Cpu(s)" : check current CPU usage.
- free -h : check memory usage.
- Cloud Storage commands in Cloud Shell:
- gcloud auth list : show the authenticated accounts.
- gcloud config list project : show the project ID and more.
  - gcloud config get-value project : show only the project ID (the same value as the GOOGLE_CLOUD_PROJECT default environment variable in Cloud Shell).
  - gcloud projects list : show all project IDs and project numbers.
  - gcloud projects describe $(gcloud config get-value project) --format="value(projectNumber)" : show only the project number.
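- If the project ID is needed from Python rather than gcloud, a small sketch using the google-auth library (an addition, not part of the lab; the project is inferred from Application Default Credentials):
```python
# Resolve the current project ID from Application Default Credentials.
import google.auth

credentials, project_id = google.auth.default()
print(project_id)
```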
- gcloud storage buckets create gs://$PROJECT_ID : create a new bucket in Cloud Storage.
  - gcloud storage ls gs://$BUCKET/sparktodp/** : show all the files inside a Cloud Storage bucket.
  - !gcloud storage ls gs://$BUCKET/sparktodp/** : the leading '!' lets the same command run inside a Jupyter notebook cell.
- gcloud services disable dataflow.googleapis.com --force : forcefully disable the Dataflow service.
- gcloud services enable dataflow.googleapis.com : enable the Dataflow service.
- gcloud services disable datafusion.googleapis.com : disable the Data Fusion API.
- gcloud pubsub topics create sandiego : create a new Pub/Sub topic (see the Python client sketch below).
  - gcloud pubsub subscriptions create --topic sandiego mySub1 : create a new Pub/Sub subscription for the topic "sandiego".
  - gcloud pubsub topics publish sandiego --message "hello" : send a message.
  - gcloud pubsub subscriptions pull --auto-ack mySub1 : pull a message.
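- A minimal sketch of the same publish/pull flow with the google-cloud-pubsub Python client; the project ID is a placeholder, and it assumes the topic and subscription above already exist:
```python
from google.cloud import pubsub_v1

PROJECT = "your-project-id"  # hypothetical project ID

# Publish a message to the "sandiego" topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, "sandiego")
publisher.publish(topic_path, b"hello").result()  # block until the publish succeeds

# Pull one message from the "mySub1" subscription and acknowledge it.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, "mySub1")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
for msg in response.received_messages:
    print(msg.message.data)
if response.received_messages:
    subscriber.acknowledge(request={
        "subscription": sub_path,
        "ack_ids": [m.ack_id for m in response.received_messages],
    })
```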
- gcloud dataflow jobs run job2 … --disable-public-ips : run the job without public (external) IPs, which requires Private Google Access (PGA) to be enabled.
  - python3 -m apache_beam.examples … --no_use_public_ips : the equivalent flag when running a Python pipeline.
- Example of a full command to run a Dataflow job (a Beam-level sketch follows below):
python3 -m apache_beam.examples.wordcount \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://$BUCKET/results/outputs --runner DataflowRunner \
  --project $PROJECT --temp_location gs://$BUCKET/tmp/ --region $REGION \
  --subnetwork regions/$REGION/subnetworks/$SUBNETWORK \
  --num_workers 20 --machine_type n1-standard-4 --no_use_public_ips
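- As an alternative to the packaged wordcount example, a hedged sketch of a similar pipeline written directly with the Apache Beam Python SDK; the bucket, project, and region values are placeholders:
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values for illustration only.
BUCKET, PROJECT, REGION = "my-bucket", "my-project-id", "us-central1"

options = PipelineOptions(
    runner="DataflowRunner",
    project=PROJECT,
    region=REGION,
    temp_location=f"gs://{BUCKET}/tmp/",
)

# Count words in King Lear and write the results to Cloud Storage.
with beam.Pipeline(options=options) as p:
    (
        p
        | beam.io.ReadFromText("gs://dataflow-samples/shakespeare/kinglear.txt")
        | beam.FlatMap(lambda line: line.split())
        | beam.combiners.Count.PerElement()
        | beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | beam.io.WriteToText(f"gs://{BUCKET}/results/outputs")
    )
```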
- BigQuery commands in Cloud Shell:
- bq show --schema --format=prettyjson demo_dataset.external_table : show the table schema.
- Create an external table definition for a CSV file in Cloud Storage (using a BigQuery connection) and save it to a file named "tabledef.json":
export PROJECT_ID=$(gcloud config get-value project)
bq mkdef \
  --autodetect \
  --connection_id=$PROJECT_ID.US.my-connection \
  --source_format=CSV \
  "gs://$PROJECT_ID/invoice.csv" > /tmp/tabledef.json
- have a look at the file
cat /tmp/tabledef.json
- get the schema, then save it into a file “schema”
bq show --schema --format=prettyjson demo_dataset.external_table > /tmp/schema
- Update the external table "demo_dataset.external_table" in BigQuery with the table definition and the schema (a Python-client sketch follows the command):
bq update --external_table_definition=/tmp/tabledef.json --schema=/tmp/schema demo_dataset.external_table
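- A hedged sketch of roughly the same operation using the google-cloud-bigquery Python client (creating or keeping an external CSV table with an autodetected schema); the project ID is a placeholder and the connection setup is omitted:
```python
from google.cloud import bigquery

PROJECT_ID = "my-project-id"  # placeholder

client = bigquery.Client(project=PROJECT_ID)

# Describe the external CSV source, mirroring `bq mkdef --autodetect --source_format=CSV`.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [f"gs://{PROJECT_ID}/invoice.csv"]
external_config.autodetect = True

# Attach the definition to the table and create it if it does not exist yet.
table = bigquery.Table(f"{PROJECT_ID}.demo_dataset.external_table")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```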
- create a new bucket in Cloud Storage:
gsutil mb -p PROJECT_ID gs://PROJECT_ID
- create a new dataset in BigQuery:
bq mk -d dataset_name
- run a standard SQL query from a Compute Engine SSH session:
bq query \
  --use_legacy_sql=false \
  'SELECT * FROM `loadavro.campaigns`;'
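- The same query could also be issued from Python with the BigQuery client library; a small hedged sketch (assumes the dataset and table above exist in the default project):
```python
from google.cloud import bigquery

client = bigquery.Client()

# Run a standard SQL query and print each row as a dict.
query = "SELECT * FROM `loadavro.campaigns` LIMIT 10"
for row in client.query(query).result():
    print(dict(row.items()))
```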
- Enable or disable the Cloud Run functions API:
gcloud services disable cloudfunctions.googleapis.com
gcloud services enable cloudfunctions.googleapis.com
- Download a file into a staging area, then copy it into a bucket:
wget https://.../file_name
gcloud storage cp campaigns.avro gs://PROJECT_ID
- use bts; : after selecting a database, SQL statements do not need "FROM db_name.table1"; "FROM table1" is enough.
- Create a new Cloud SQL instance "taxi":
gcloud sql instances create taxi \
  --tier=db-n1-standard-1 \
  --activation-policy=ALWAYS
- Set a root password for the Cloud SQL instance "taxi" ('%' means users can connect from any host, localhost or otherwise):
gcloud sql users set-password root \
  --host % \
  --instance taxi \
  --password 123@KhanH
- wget -qO - http://ipecho.net/plain : get the external IP address of the Cloud Shell VM.
- Get the IP address of a Cloud SQL instance:
MYSQLIP=$(gcloud sql instances describe taxi \
  --format="value(ipAddresses.ipAddress)")
- gcloud sql instances patch taxi --authorized-networks $ADDRESS : authorize the Cloud Shell IP address ($ADDRESS) to connect to the Cloud SQL instance.
- Log into MySQL and enable local data loading with --local-infile:
mysql --host=$MYSQLIP \
  --user=root \
  --password --local-infile
- create a database & a table with a schema in MySQL:
create database if not exists bts;
use bts;
drop table if exists trips;
create table trips (
  vendor_id VARCHAR(16),
  pickup_datetime DATETIME,
  passenger_count INT,
  trip_distance FLOAT,
  rate_code VARCHAR(16),
  fare_amount FLOAT,
  total_amount FLOAT,
  pickup_location_id VARCHAR(16),
  ...
);
- Describe a table: describe trips;
- Download a file from Cloud Storage into the local file 'trips.csv-1':
gcloud storage cp \
  gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_1.csv \
  trips.csv-1
- Load data into a table in MySQL:
LOAD DATA LOCAL INFILE 'trips.csv-1' INTO TABLE trips
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,imp_surcharge,total_amount,pickup_location_id,dropoff_location_id);
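- If the same load had to be scripted from Python, a hedged sketch using PyMySQL (not part of the lab; the host placeholder stands in for $MYSQLIP, and local_infile must be enabled on both client and server):
```python
import pymysql

# Placeholder connection details; the host and password come from the steps above.
conn = pymysql.connect(
    host="MYSQLIP_PLACEHOLDER",
    user="root",
    password="123@KhanH",
    database="bts",
    local_infile=True,  # client-side opt-in for LOAD DATA LOCAL INFILE
)

load_sql = """
LOAD DATA LOCAL INFILE 'trips.csv-1' INTO TABLE trips
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code,
 store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,
 imp_surcharge,total_amount,pickup_location_id,dropoff_location_id)
"""

with conn.cursor() as cur:
    cur.execute(load_sql)
conn.commit()
conn.close()
```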
- Append data from Cloud Storage into a table ‘2018trips’ in a BQ dataset ‘nyctaxi’:
bq load \
  --source_format=CSV \
  --autodetect \
  --noreplace \
  nyctaxi.2018trips \
  gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_2.csv
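- An equivalent append-load can be done with the BigQuery Python client; a hedged sketch (dataset and table from above, default project assumed, header row assumed):
```python
from google.cloud import bigquery

client = bigquery.Client()

# Mirror `bq load --source_format=CSV --autodetect --noreplace`.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # --noreplace
    skip_leading_rows=1,  # assumption: the CSV has a header row
)

uri = "gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_2.csv"
load_job = client.load_table_from_uri(uri, "nyctaxi.2018trips", job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table("nyctaxi.2018trips").num_rows, "rows in 2018trips")
```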
- Run a Dataflow job from the Word_Count template:
gcloud dataflow jobs run job1 \
  --gcs-location gs://dataflow-templates-"REGION"/latest/Word_Count \
  --region $REGION \
  --staging-location gs://$PROJECT/tmp \
  --parameters inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://$PROJECT/results/outputs --disable-public-ips
- Enable PGA (Private Google Access), with the required role:
```bash
# Grant the Compute Network Admin role
gcloud projects add-iam-policy-binding $PROJECT \
  --member=user:$USER_EMAIL \
  --role=roles/compute.networkAdmin

# Enable PGA on the default subnet
gcloud compute networks subnets update default \
  --region=$REGION \
  --enable-private-ip-google-access
```