For partners who are running courses on edx.org and edge.edx.org, edX regularly
makes research data available for download from the Amazon S3 storage service.
The *data package* that data czars download from Amazon S3 consists of a set of
compressed and encrypted files that contain event logs and database snapshots
for all of their organizations' edx.org and edge.edx.org courses.
* :ref:`Data Package Files`
* :ref:`Amazon S3 Buckets and Directories`
* :ref:`Download Data Packages from Amazon S3`
* :ref:`Data Package Contents`
.. _Data Package Files:
**********************
Data Package Files
**********************
A data package consists of different files that contain event data and database
data.
.. note:: In all file names, the date is in {YYYY}-{MM}-{DD} format.
You download these files from different Amazon S3 "buckets". See :ref:`Amazon
S3 Buckets and Directories`.
============
Event Data
============
The ``{org}-{site}-events-{date}.log.gz.gpg`` file contains a daily log of
course events. A separate file is available for courses running on edge.edx.org
(with "edge" for {site} in the file name) and on edx.org (with "edx" for
{site}).
For a partner organization named UniversityX, these daily files are identified
by the organization name, the edX site name, and the date: for example,
``universityx-edge-events-2014-07-25.log.gz.gpg``.
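As a minimal sketch of this naming convention, the following Python helper assembles a daily file name from its parts. The function name ``daily_event_file_name`` is hypothetical and is not part of any edX tooling. Lowercasing the organization name is an assumption based on the UniversityX example.

```python
from datetime import date


def daily_event_file_name(org: str, site: str, day: date) -> str:
    """Build a name of the form {org}-{site}-events-{date}.log.gz.gpg."""
    if site not in ("edx", "edge"):
        raise ValueError("site must be 'edx' or 'edge'")
    # date.isoformat() yields the {YYYY}-{MM}-{DD} form used in all file names.
    # Assumption: the organization name appears in lowercase.
    return f"{org.lower()}-{site}-events-{day.isoformat()}.log.gz.gpg"


print(daily_event_file_name("UniversityX", "edge", date(2014, 7, 25)))
```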
An alternative option for event data is available. The
``{date}-{org}-tracking.tar`` file is available each week. It contains a
cumulative log of events in all of an organization's courses. Data for courses
running on both the edx.org and edge.edx.org sites is included in this file.
.. remove this paragraph ^ when weekly file is removed.
.. important:: The ``{org}-{site}-events-{date}.log.gz.gpg`` file is designed to replace the ``{date}-{org}-tracking.tar`` file. Both files will be produced for several weeks, and then production of the ``{date}-{org}-tracking.tar`` file will be discontinued.
.. remove this paragraph ^ when weekly file is removed.
For information about the contents of these files, see :ref:`Data Package
Contents`.
==================
Database Data
==================
The ``{org}-{date}.zip`` file contains views on database tables. This file
includes data as of the time of the export, for all of an organization's
courses on both the edx.org and edge.edx.org sites. A new file is available
every week, representing the database at that point in time.
For a partner organization named UniversityX, each weekly file is identified by
the organization name and its extraction date: for example,
``universityx-2013-10-27.zip``.
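If you keep several weekly snapshots side by side, it can help to recover the organization and extraction date from each file name. The sketch below is one way to do that in Python. The helper name ``parse_database_file_name`` and the regular expression are illustrative assumptions, not part of the data package.

```python
import re
from datetime import date
from typing import Tuple

# Matches names of the form {org}-{date}.zip, with the date as YYYY-MM-DD.
WEEKLY_DB_NAME = re.compile(r"^(?P<org>.+)-(?P<date>\d{4}-\d{2}-\d{2})\.zip$")


def parse_database_file_name(name: str) -> Tuple[str, date]:
    """Split a weekly database file name into (org, extraction date)."""
    match = WEEKLY_DB_NAME.match(name)
    if match is None:
        raise ValueError(f"not a weekly database file name: {name!r}")
    return match.group("org"), date.fromisoformat(match.group("date"))


org, extracted_on = parse_database_file_name("universityx-2013-10-27.zip")
print(org, extracted_on)
```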
For information about the contents of this file, see :ref:`Data Package
Contents`.
.. _Amazon S3 Buckets and Directories:
********************************************
Amazon S3 Buckets and Directories
********************************************
Data package files are located in the following buckets on Amazon S3:
* The **edx-course-data** bucket contains the daily
``{org}-{site}-events-{date}.log.gz.gpg`` files of course event data.
* The **course-data** bucket contains the weekly ``{org}-{date}.zip`` database
snapshot. It also contains the weekly ``{date}-{org}-tracking.tar`` file of
cumulative course event data (until production of this file is discontinued).
.. remove the last sentence ^ when weekly event file is removed.
For information about accessing Amazon S3, see :ref:`Access Amazon S3`.
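The mapping from file type to bucket can be summarized in a few lines of Python. This is only a sketch: the function name ``bucket_for`` is hypothetical, and it identifies each file type purely from the file name patterns described above.

```python
def bucket_for(file_name: str) -> str:
    """Return the Amazon S3 bucket that holds a given data package file."""
    # Daily event logs: {org}-{site}-events-{date}.log.gz.gpg
    if "-events-" in file_name and file_name.endswith(".log.gz.gpg"):
        return "edx-course-data"
    # Weekly database snapshot ({org}-{date}.zip) and weekly cumulative
    # event file ({date}-{org}-tracking.tar).
    if file_name.endswith(".zip") or file_name.endswith("-tracking.tar"):
        return "course-data"
    raise ValueError(f"unrecognized data package file: {file_name!r}")


for name in (
    "universityx-edx-events-2014-07-25.log.gz.gpg",
    "universityx-2013-10-27.zip",
):
    print(name, "->", bucket_for(name))
```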
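.. _Download Data Packages from Amazon S3:
********************************************
Download Data Packages from Amazon S3
********************************************
You download the files in your data package from the Amazon S3 storage service.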
==========================
Download Daily Event Files
==========================
#. To download daily event files, use the AWS Command Line Interface or a
third-party tool to connect to the **edx-course-data** bucket on Amazon S3.
For information about providing your credentials to connect to Amazon S3,
see :ref:`Access Amazon S3`.
#. Navigate the directory structure in the **edx-course-data** bucket to locate
the files that you want:
``{org}/{site}/events/{year}``
The event logs in the ``{year}`` directory are in compressed, encrypted
files named ``{org}-{site}-events-{date}.log.gz.gpg``.
#. Download the ``{org}-{site}-events-{date}.log.gz.gpg`` file.
If your organization has courses running on both edx.org and edge.edx.org,
separate log files are available for the "edx" site and the "edge" site.
Repeat this step to download the file for the other site.
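The steps above can be sketched in Python for partners who use the AWS Command Line Interface. The example only builds the ``aws s3 cp`` command and does not run it. It assumes that your credentials are already configured, that ``{year}`` is the four-digit year of the log date, and that the organization name appears in lowercase in both the directory and the file name. The helper names are hypothetical.

```python
from datetime import date
from typing import List

BUCKET = "edx-course-data"


def daily_event_s3_uri(org: str, site: str, day: date) -> str:
    """Locate one daily event file under {org}/{site}/events/{year}."""
    file_name = f"{org}-{site}-events-{day.isoformat()}.log.gz.gpg"
    # Assumption: {year} is the four-digit year of the log date.
    return f"s3://{BUCKET}/{org}/{site}/events/{day.year}/{file_name}"


def download_command(org: str, site: str, day: date, dest: str = ".") -> List[str]:
    """Build (but do not run) an AWS CLI command that copies the file locally."""
    return ["aws", "s3", "cp", daily_event_s3_uri(org, site, day), dest]


# One command per site, for organizations with courses on both sites.
for site in ("edx", "edge"):
    print(" ".join(download_command("universityx", site, date(2014, 7, 25))))
```

To run a command rather than print it, you could pass the returned list to ``subprocess.run``.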
============================
Download Weekly Files
============================
.. note:: If you are using a third-party tool to connect to Amazon S3, you may not be able to navigate from one edX bucket to the other in a single session. You may need to disconnect from Amazon S3 and then reconnect to the other bucket.
#. To download a weekly database data file or cumulative event file, connect to
the edX **course-data** bucket on Amazon S3 using the AWS Command Line
Interface or a third-party tool.
.. revise this sentence ^ when weekly event logs are no longer available
For information about providing your credentials to connect to Amazon S3,
see :ref:`Access Amazon S3`.
#. Download the ``{org}-{date}.zip`` database data file from the
**course-data** bucket.
The **course-data** bucket also contains the weekly, cumulative
``{date}-{org}-tracking.tar`` files.
.. remove this step ^ when weekly event logs are no longer available
.. _AWS Command Line Interface: http://aws.amazon.com/cli/
.. _Data Package Contents:
**********************
Data Package Contents
**********************
Each of the files you download contains one or more files of research data.
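====================================================
Extracted Contents of ``{date}-{org}-tracking.tar``
====================================================
The ``{date}-{org}-tracking.tar`` file contains cumulative event data for all
of an organization's courses, running on both edx.org and edge.edx.org.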
.. note:: Over time, these cumulative files could become large (25GB and larger) and difficult for many data czars to download without encountering session timeouts and other problems. As a result, this file will be superseded by daily ``{org}-{site}-events-{date}.log.gz.gpg`` files in the **edx-course-data** bucket.
After you download the ``{date}-{org}-tracking.tar`` file for your
institution, complete these steps:
#. Extract the contents of the downloaded .tar file.
To balance the load of traffic to edX courses, every course is served by
multiple edX servers. A different set of servers handles traffic for the two
edX sites: edx.org ("prod") and edge.edx.org ("edge"). When you extract the
contents of this file, a separate subdirectory is created for events that
took place on each edX server.
For example, subdirectories with these names can be created:
``prod-edx-001/``
``prod-edx-002/``
``prod-edx-003/``
``prod-edge-001/``
``prod-edge-002/``
The subdirectory names identify the site on which events took place.
Each of these subdirectories contains an encrypted log file of event data
for every day that events occurred on that server. These event tracking data
files are named ``{date}-{org}.log.gpg``.
#. Use your private key to decrypt the extracted log files. See :ref:`Decrypt
an Encrypted File`.
.. note:: During analysis, you must combine events from different servers to get a complete picture of the activity in each course.
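One possible way to combine the decrypted logs from the per-server subdirectories is sketched below in Python. It rests on several assumptions that the text above does not state: that each decrypted file ends in ``.log``, that each line holds one JSON event, and that each event carries an ISO 8601 ``time`` string in a single time zone, so that sorting the strings orders the events chronologically. The helper names are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def site_of(server_dir: str) -> str:
    """Map a name such as 'prod-edge-002' to 'edge', or 'prod-edx-001' to 'edx'."""
    return server_dir.split("-")[1]


def combine_events(root: Path) -> Dict[str, List[dict]]:
    """Merge decrypted per-server logs under root into one event list per site."""
    events = defaultdict(list)
    for log_file in sorted(root.glob("prod-*/*.log")):
        site = site_of(log_file.parent.name)
        with log_file.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    # Assumption: one JSON-encoded event per line.
                    events[site].append(json.loads(line))
    # Assumption: every event has a sortable "time" string.
    for site_events in events.values():
        site_events.sort(key=lambda event: event.get("time", ""))
    return dict(events)
```

Calling ``combine_events(Path("extracted"))`` on the directory that holds the ``prod-edx-001/`` style subdirectories returns a dictionary keyed by ``"edx"`` and ``"edge"``.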
.. remove this section ^ when weekly file is removed
============================================
Extracted Contents of ``{org}-{date}.zip``
============================================
After you download the ``{org}-{date}.zip`` file for your
institution, complete these steps:
#. Extract the contents of the file. When you extract (or unzip) this file, all
of the files that it contains are placed in the same directory. All of the
extracted files end in ``.gpg``, which indicates that they are encrypted.
#. Use your private key to decrypt the extracted files. See
:ref:`Decrypt an Encrypted File`.
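The two steps above can be sketched in Python as follows. The example extracts the archive with the standard ``zipfile`` module and builds one ``gpg`` command per encrypted file, without running it. It assumes that GnuPG is installed and that your private key is already in your keyring. The helper names are hypothetical, and the procedure in :ref:`Decrypt an Encrypted File` remains the reference.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_package(zip_path: Path, dest: Path) -> List[Path]:
    """Unzip the weekly database file and list the encrypted files it contained."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(dest.rglob("*.gpg"))


def decrypt_command(encrypted: Path) -> List[str]:
    """Build (but do not run) a gpg command that writes the decrypted file."""
    # Path.with_suffix("") drops the trailing ".gpg" from the output name.
    decrypted = encrypted.with_suffix("")
    return ["gpg", "--output", str(decrypted), "--decrypt", str(encrypted)]


# Example usage (the archive name follows the {org}-{date}.zip pattern):
# for path in extract_package(Path("universityx-2013-10-27.zip"), Path("package")):
#     print(" ".join(decrypt_command(path)))
```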
The result of extracting and decrypting the ``{org}-{date}.zip`` file is the
In the data package, wiki data is delivered in two SQL files:
...
* The wiki_articlerevision file stores data about the articles, including data about changes and deletions. The full name of this file is in this format: edX-*organization*-*course*-wiki_articlerevision-*source*-analytics.sql.
.. _wiki_article:
***********************************
Fields in the wiki_article file
***********************************
...
other_write
----------------------
Defines whether others have write access to the article. 1 if so, 0 if not.