Commit 65463429 by Will Daly

Merge pull request #331 from edx/will/ai-grading-arch-revisions-3

AI Grading architecture doc revisions
parents f94334ce 6d384fab
@@ -212,9 +212,9 @@ Recovery from Failure
a. We assume that the task queue is (for the most part) *reliable*: If a task is scheduled, then a worker will pick it up and execute it (although it might not complete the task successfully). Even if tasks occasionally are dropped, however, we can rely on the error recovery procedure below.
b. In general, we avoid retrying tasks, since an error that occurs once is likely to occur again. However, for errors that may be recoverable (such as transient network connectivity issues), the worker should reschedule the task once. If the task fails again, it should not be rescheduled.
b. If an error occurs, first retry the task. This allows the worker to gracefully handle recoverable errors such as temporary network connectivity issues.
c. For non-recoverable errors, the task should fail without being rescheduled. Failures should be logged and monitored. Once the issue has been fixed, failed tasks should be rescheduled manually (e.g. by a command that queries for incomplete workflows and reschedules tasks).
c. If a task fails repeatedly, the worker should log the failure as a non-recoverable error and stop retrying the task. Once the issue has been fixed, failed tasks should be rescheduled manually (e.g. by a command that queries for incomplete workflows and reschedules tasks).
2. A course author publishes a problem without training classifiers.
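
The revised retry policy in steps b and c (retry on error, then log the failure as non-recoverable and stop) could be sketched roughly as follows. This is a minimal illustration only, not the project's worker code: `run_with_retries`, `max_retries`, and the logging messages are hypothetical names chosen for the example, and a real worker would likely delegate retries to its task queue rather than loop in-process.

```python
import logging

logger = logging.getLogger(__name__)


def run_with_retries(task, max_retries=2):
    """Run `task` (a zero-argument callable), retrying if it raises.

    Hypothetical helper: returns True as soon as one attempt succeeds.
    If every attempt fails, logs the failure as non-recoverable and
    returns False so the task can be rescheduled manually later.
    """
    attempts = max_retries + 1  # the first attempt plus the retries
    for attempt in range(1, attempts + 1):
        try:
            task()
            return True
        except Exception:
            # Possibly a recoverable error (e.g. a temporary network issue),
            # so record it and fall through to the next attempt.
            logger.warning(
                "Task attempt %d of %d failed", attempt, attempts, exc_info=True
            )
    # Repeated failure: stop retrying and leave a record for monitoring.
    logger.error("Task failed %d times; treating as non-recoverable", attempts)
    return False
```

A caller would treat a `False` return as the signal to leave the workflow incomplete, so that a later manual command (as described in step c) can find it and reschedule the task once the underlying issue is fixed.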