Commit d6fef5b0 (Unverified), authored Apr 23, 2018 by brianhw; committed by GitHub, Apr 23, 2018
Merge pull request #496 from edx/brian/fix-gs-rsync
Exclude unloadable files at rsync time
Parents: 89139226, 819b5fbd
Showing 1 changed file with 11 additions and 12 deletions
edx/analytics/tasks/common/bigquery_load.py
@@ -267,20 +267,19 @@ class BigQueryLoadTask(BigQueryLoadDownstreamMixin, luigi.Task):
 
     def _copy_data_to_gs(self, source_path, destination_path):
         if self.is_file(source_path):
-            return_code = subprocess.call(['gsutil', 'cp', source_path, destination_path])
+            command = ['gsutil', 'cp', source_path, destination_path]
         else:
-            log.debug(" ".join(['gsutil', '-m', 'rsync', source_path, destination_path]))
-            return_code = subprocess.call(['gsutil', '-m', 'rsync', source_path, destination_path])
-            if return_code == 0:
-                # Remove any files that were copied whose names have leading underscores, since
-                # these files cannot be uploaded to BigQuery. It is easier to remove them here
-                # than to exclude them either in the rsync or in the load steps.
-                underscore_path = url_path_join(destination_path, '_*')
-                log.debug(" ".join(['gsutil', 'rm', underscore_path]))
-                return_code = subprocess.call(['gsutil', 'rm', underscore_path])
-
+            # Exclude any files which should not be uploaded to
+            # BigQuery. It is easier to remove them here than in the
+            # load steps. The pattern is a Python regular expression.
+            exclusion_pattern = ".*_SUCCESS$|.*_metadata$"
+            command = ['gsutil', '-m', 'rsync', '-x', exclusion_pattern, source_path, destination_path]
+
+        log.debug(" ".join(command))
+        return_code = subprocess.call(command)
         if return_code != 0:
-            raise RuntimeError('Error while syncing {source} to {destination}'.format(
+            raise RuntimeError('Error {code} while syncing {source} to {destination}'.format(
+                code=return_code,
                 source=source_path,
                 destination=destination_path,
             ))
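To see which file names the new exclusion pattern would skip, here is a minimal sketch that applies the same regular expression locally. The helper names `is_excluded` and `build_copy_command` are hypothetical and not part of the commit, and the sketch assumes gsutil matches the pattern from the start of each path relative to the source directory.

```python
import re

# Pattern copied verbatim from the diff; the commit's comment says it is
# a Python regular expression.
EXCLUSION_PATTERN = ".*_SUCCESS$|.*_metadata$"


def is_excluded(relative_path, pattern=EXCLUSION_PATTERN):
    """Return True if the pattern matches from the start of the path.

    Assumption: gsutil applies the -x pattern with match-from-start semantics.
    Because both alternatives begin with '.*', search semantics would give
    the same answers for this particular pattern.
    """
    return re.match(pattern, relative_path) is not None


def build_copy_command(source_path, destination_path, source_is_file):
    """Hypothetical helper mirroring the two command lists built in the diff."""
    if source_is_file:
        return ['gsutil', 'cp', source_path, destination_path]
    return ['gsutil', '-m', 'rsync', '-x', EXCLUSION_PATTERN,
            source_path, destination_path]


if __name__ == "__main__":
    samples = ["part-00000", "_SUCCESS", "dt=2018-04-23/_SUCCESS",
               "_metadata", "_SUCCESS.crc", "my_metadata"]
    for name in samples:
        print(name, is_excluded(name))
```

Under these assumptions the pattern excludes any path ending in `_SUCCESS` or `_metadata`, including a name such as `my_metadata`. It does not cover other underscore-prefixed names such as `_SUCCESS.crc`, which the removed `gsutil rm` on `_*` would have matched.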