Introduction
Rust is a systems programming language (think C-like) that makes it easier to write memory-safe code than languages like C or C++. It accomplishes this by making it harder to perform memory-unsafe operations, and by catching these sorts of issues at compile time instead of runtime.
In order to accomplish this, Rust imposes some constraints on the engineer through the borrow checker and immutable-by-default types. I’m not going to write about those things here as they have been covered in depth by others.
My focus for this post (and other posts in this potential series) is on other language features and idioms that may be unfamiliar to managed-language developers.
In my first post in this series, I talked about the fact that Rust does not have the concept of null.
No Exceptions!
Rust does not have the concept of exceptions or the associated concept of try-catch blocks. This is because once you get code to compile in Rust you can be sure there are no errors anywhere… just kidding.
Instead, in Rust we use an enum type called `std::result::Result<T, E>`. The `T` in the generic signature is the type of the return value. The `E` represents the type of the error, should one occur. The two variants of `Result`, `Ok(value)` and `Err(error)`, are always in scope, similarly to the `Some(value)` and `None` variants of the `Option` type.
A Naive Example
Consider the following made-up function:
```rust
fn find_data(i: u32) -> Result<u32, String> {
    match i {
        1 => Err("1 is not a valid value".to_string()),
        _ => Ok(i * 2),
    }
}
```
This function accepts an integer and doubles it. For whatever reason, `1` is not considered to be a valid value, so an error message is returned instead. Notice that `Ok` and `Err` are used to wrap the return and error values.
Now let's look at how we would use the `Result` type in another function:
```rust
let result = find_data(5);

match result {
    Ok(value) => {
        println!("The result was {}", value);
    }
    Err(message) => {
        println!("{}", message);
    }
}
```
The type of `result` is `std::result::Result<u32, String>`. We then treat it like any other enum, matching on the variants and doing the correct processing.
Adding Complexity
Things start to get a little complicated if we have a series of potential errors. Consider retrieving some data from a database. We could fail to connect to the database, fail to construct our query correctly, or fail to map the raw data to our intended representation.
```rust
fn get_employee_by_id(id: i32) -> Result<Employee, DataRetrievalError> {
    let connection = Database::create_connection();
    match connection {
        Ok(conn) => {
            let raw_data = conn.execute("EmployeeByIdQuery", id);
            match raw_data {
                Ok(data) => Employee::from_raw_data(data),
                Err(_) => Err(DataRetrievalError::QueryFailed),
            }
        }
        Err(_) => Err(DataRetrievalError::ConnectionFailed),
    }
}
```
Yuck! This is pretty ugly. We could improve readability by removing the nesting:
```rust
fn get_employee_by_id(id: i32) -> Result<Employee, DataRetrievalError> {
    let connection_result = Database::create_connection();
    if connection_result.is_err() {
        return Err(DataRetrievalError::ConnectionFailed);
    }
    let connection = connection_result.unwrap();

    let raw_data = connection.execute("EmployeeByIdQuery", id);
    if raw_data.is_err() {
        return Err(DataRetrievalError::QueryFailed);
    }
    let data = raw_data.unwrap();

    Employee::from_raw_data(data)
}
```
This is better, but still pretty ugly. Fortunately, Rust offers some syntactic sugar to clean this up a lot in the form of the `?` operator. The `?` operator early-returns the error if there is one and unwraps the result if there is not. Here is the function rewritten to use the `?` operator:
```rust
fn get_employee_by_id(id: i32) -> Result<Employee, DataRetrievalError> {
    let connection = Database::create_connection()?;
    let data = connection.execute("EmployeeByIdQuery", id)?;
    Employee::from_raw_data(data)
}
```
Much nicer!
If the error returned from an inner function does not match the error type expected by the outer function, the `?` operator will look for a `From` implementation and perform the conversion for you.
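As a sketch of what that looks like, suppose `Database::create_connection` returns a hypothetical `ConnectionError` type:

```rust
// Hypothetical lower-level error type; stands in for whatever
// Database::create_connection actually returns.
pub struct ConnectionError;

// With this implementation in place, the `?` operator converts a
// ConnectionError into a DataRetrievalError via From::from.
impl From<ConnectionError> for DataRetrievalError {
    fn from(_: ConnectionError) -> Self {
        DataRetrievalError::ConnectionFailed
    }
}
```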
Comparing to Exception-based Languages
Rust's error handling strategy does a great job of communicating possible failure modes, since the error states are part of the signature of any function you call. This is a clear advantage over exception-based languages, in which you (usually) have to read the documentation to know what exceptions can possibly occur.
On the other hand, it’s fairly common in exception-based languages to have some root handler for unhandled exceptions that provides standard processing for most errors.
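Rust's closest analog is letting errors bubble all the way up to `main`, which can itself return a `Result`. A minimal sketch (the config file name is invented):

```rust
use std::error::Error;

// If main returns an Err, Rust prints the error and exits with a
// non-zero status code: a crude root handler of sorts.
fn main() -> Result<(), Box<dyn Error>> {
    let config = std::fs::read_to_string("app.toml")?; // hypothetical file
    println!("loaded {} bytes of config", config.len());
    Ok(())
}
```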
In Rust, adding error handling can force you to edit much more code than in exception-based languages. Consider the following set of functions:
```rust
fn top_level() -> i32 {
    mid_level1() + mid_level2()
}

fn mid_level1() -> i32 {
    low_level1() + low_level2()
}

fn mid_level2() -> i32 {
    low_level1() * low_level2()
}

fn low_level1() -> i32 {
    5
}

fn low_level2() -> i32 {
    10
}
```
The `top_level` function depends on the two `mid_level` functions, which in turn depend on the two `low_level` functions. Consider what happens to our program if `low_level2` is modified to potentially return an error:
```rust
fn top_level() -> Result<i32, String> { // had to change this signature
    Ok(mid_level1()? + mid_level2()?)
}

fn mid_level1() -> Result<i32, String> { // had to change this signature
    Ok(low_level1() + low_level2()?)
}

fn mid_level2() -> Result<i32, String> { // had to change this signature
    Ok(low_level1() * low_level2()?)
}

fn low_level1() -> i32 {
    5
}

fn low_level2() -> Result<i32, String> {
    Ok(10)
}
```
This sort of signature change will often bubble through the entire call stack, resulting in a much larger code change than you would see in an exception-based language. This can be a good thing, because it clearly communicates the fact that a low-level function now returns an error. On the other hand, if there really is no error handling strategy except returning an `InternalServerError` at an API endpoint, then requiring every calling function to change its signature to bubble the error is a fairly heavy tax to pay (these signature changes can also have similar side effects in other call paths).
I’m not making the argument that Rust error handling is therefore bad. I’m just pointing out that this error design has its own challenges.
Error Design Strategies
While the mechanism by which errors are generated and handled in Rust is fairly simple to understand, the principles you should use in designing your errors are not so straightforward.
There are essentially three dominant strategies available for designing the error handling in your library or application:
| Strategy | Description | Pros | Cons |
| --- | --- | --- | --- |
| Error Per Crate | Define one error enum per crate. Contains all variants relevant to all functions in the crate. | Few error types to maintain; callers only ever handle one type. | Every function's signature implies failure modes it can never actually produce. |
| Error Per Module | Define one error per module. Contains all variants relevant to functions in that module. | Fewer types than one per function; errors live close to the code that produces them. | Functions are still polluted by the failure modes of their neighbors in the module. |
| Error Per Function | Define one error per function. Only contains variants relevant to that function. | Each function communicates exactly how it can fail. | Proliferation of error types and of conversions between them. |
Hybrid Strategy
I don't think I have the right answer yet, but this hybrid strategy is the one I've settled on in my personal development. It basically creates an error hierarchy for the crate that gets more specific as you approach a given function.
- Define an error enum per function.
- Define an error per module, the variants of which “source” the errors per function.
- Define an error per crate, the variants of which “source” the errors per module.
```rust
pub enum ConfigFileErrors {
    FileNotFound { path: String },
}

fn load_config_file(path: String) -> Result<ConfigFile, ConfigFileErrors> {
    // snipped
}

pub enum ParsingError {
    InvalidFormat,
}

fn parse_config(config: ConfigFile) -> Result<ConfigurationItems, ParsingError> {
    // snipped
}

pub enum ValidationError {
    RequiredDataMissing { message: String },
}

fn validate_config(input: ConfigurationItems) -> Result<ConfigurationItems, ValidationError> {
    // snipped
}

pub enum ConfigErrors {
    File { source: ConfigFileErrors },
    Parsing { source: ParsingError },
    Validation { source: ValidationError },
}

fn get_config() -> Result<ConfigurationItems, ConfigErrors> {
    let file = load_config_file("path/to/config".to_string())?;
    let parsed = parse_config(file)?;
    validate_config(parsed)
}
```
This approach has many of the pros and cons of the other approaches so it’s not a panacea.
Pros:
- Each function clearly communicates how it can fail and is not polluted by the failure modes of other functions.
- No information is lost as you bubble up the call-stack as each low-level error is packaged in a containing error.
- The caller gets to match on the top-level error and decide for themselves if they wish to take finer-grained control of inner errors (see the sketch following these lists).
Cons:
- Proliferation of error types.
- New failure modes potentially impact the top-level crate design (e.g., adding a failure mode becomes a breaking change requiring a major revision if you are practicing Semantic Versioning).
- It’s not obvious how to deal with error variants that may be shared across multiple functions (e.g., parsing errors).
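To make the first pro concrete, here is a sketch of a caller that matches at the top level and only drills into the inner error it cares about (`run` is a hypothetical consumer of the configuration):

```rust
match get_config() {
    Ok(config) => run(config),
    // Drill into the one inner error this caller wants to handle specially.
    Err(ConfigErrors::File { source: ConfigFileErrors::FileNotFound { path } }) => {
        eprintln!("no config file at {}, using defaults", path);
    }
    // Everything else gets coarse-grained handling.
    Err(_) => {
        eprintln!("could not load configuration");
    }
}
```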
“I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.”
― Frank Herbert, Dune
That iconic passage from Frank Herbert is something I think about when I encounter code that engineers are afraid to touch. When an engineering team is afraid to touch a piece of code then the software is no longer “soft.” Instead, current and future choices become constrained by decisions and mistakes of the past.
This fear will often lead teams to try to rewrite and replace the software rather than modify it. The result of this effort is almost always bad. Some of the problems you encounter during the Great Rewrite are:
- The legacy software is still running in production.
- It still gets bug fixes.
- It still gets new mission-critical features that have to be replicated in the new software.
- No one wants to cut over to the new software until it has feature parity with the legacy software.
- Feature parity is hindered by the inability to modify the old software.
- If the cutover happens before parity is achieved, I've seen people express that they'd rather go back to the old and busted system because at least it did what they needed it to.
- The planning and engineering techniques used to produce the first set of software–and the corresponding rigidity that led to the Great Rewrite–have not changed.
- The Great Rewrite will ultimately go through the same lifecycle and have to be rewritten again.
These problems multiply if the authors of The Great Rewrite are a different team than the one that maintains the existing system.
What’s the Alternative?
The alternative is to save your existing software. If you are afraid to change a piece of code, you need to take steps to remove that fear.
- Spend time with the code to understand it.
- Stand the system up in a test environment so that you can experiment with it to learn its edge cases and wrinkles.
- Before making any change, cover the test environment with black-box automated tests that can verify behavior (see the sketch after this list).
- If the tests cannot be fully automated (sadly the case with some old “Smart UI” style applications), then document the test cases and automate what you can.
- Analyze the error logs to make sure you understand the existing failures in the system as well as the rate at which they occur.
- Make the desired change to the system.
- At this point you will have to be careful. You will need to do a post-change analysis of the error logs to look for anomalous errors. The first log analysis is your baseline.
- Once you are confident in the change, cover it with as many automated tests as you can.
- Once you have great test coverage, aggressively refactor the code for clarity.
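As an example of what such a black-box test might look like (the endpoint, URL, and expected payload are hypothetical, and the sketch uses the `reqwest` crate's blocking client):

```rust
// Black-box: the test knows nothing about the system's internals,
// only its externally observable behavior in the test environment.
#[test]
fn employee_endpoint_returns_known_record() {
    let body = reqwest::blocking::get("http://test-env.local/employees/42")
        .expect("request failed")
        .text()
        .expect("body was not text");
    assert!(body.contains("\"id\":42"));
}
```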
This process is time-consuming and expensive. For this reason, teams try to find shortcuts around it. Unfortunately, there are no shortcuts. The path of the Great Rewrite is even longer and more expensive; it just has better marketing.
But I Really Have to Rewrite!
There are times when a rewrite is unavoidable. Possible reasons might be:
- The technology that was used in the original software is no longer supported.
- It may also be old enough that it’s difficult to find people willing or able to support it–which amounts to the same thing.
- The component is small enough that a rewrite is virtually no risk.
- The component cannot be ported to the new infrastructure it is intended to run on (e.g., cloud, mobile, Docker, etc.).
In these cases, the process is the same as above: stand up a test environment, wrap the old system in automated black-box acceptance tests, and limit yourself to targeting parity (no new features!) until the replacement is done.
Testing software is critically important to ensuring quality. Automated tests provide a lower Mean Time to Feedback (MTTF) for errors and enable developers to make changes without fear of breaking things. The earlier in the SDLC that errors can be detected and corrected, the better (see the Test Pyramid). As engineers on the platform we should practice TDD in order to generate a thorough bed of unit tests. Unit tests alone do not ensure that everything works as expected, so we will also need gradually more sophisticated forms of testing.
There are different approaches to testing software. This document chooses to articulate types of automated testing by the point in the SDLC at which they are executed and by what they cover. There may be different strategies for testing at each of these lifecycle points (e.g., deterministic, fuzz, property-based, load, perf, etc.).
| SDLC Stage | Type | Target | Who Runs Them? | Description |
| --- | --- | --- | --- | --- |
| Design / Build Time | Unit | Single Application | Engineer, CI | In-process, no external resources. Mock at the architectural boundaries but otherwise avoid mocks where possible. |
| | Integration | Single Application | Engineer, CI | These tests mostly target the adapters for external systems (e.g., file I/O, databases, 3rd-party APIs, 1st-party APIs that are not the component under test). Integration tests differ from acceptance tests in that they should never fail due to an issue with an external service. |
| Post Deployment to Test Environment | Acceptance | Entire System or Platform | CI, CD | Largely black-box, end-to-end testing. For bonus points, tie failures into telemetry to see if your monitors are alerting you. |
| | Manual UX Testing | Entire System or Platform | Engineer, QA, Users | This testing is qualitative and pertains to the "feel" of the platform with respect to the user experience. |
| Post Production Release | Smoke | Entire System or Platform | Engineer, CD | A small suite of manual tests to validate production configuration. |
| | Synthetic Transactions | Entire System or Platform | System | Black-box, end-to-end use-case testing; automated and safe for production. These tests are less about correctness and more about proving the service is running. |
| | Other? | | | This is not an exhaustive list. |
Emphasize Unit Tests
In general, our heaviest investment in testing should be done at the time the code is written. This means that unit tests should far outweigh other testing efforts. Why?
Unit tests are very low-cost to write and have a very low Mean Time to Feedback (MTTF). This means they have the greatest ROI of any kind of test.
The other kinds of testing are important, but they get more complex as you move through the SDLC, which makes covering finicky edge cases challenging from both an implementation and a maintenance perspective. Unit tests don't have these drawbacks, provided you follow good TDD guidance.
TDD
TDD is the strongly preferred manner of writing unit tests as it ensures that all code written is necessary (required by a test) and correct. Engineers who are not used to writing code in a TDD style often struggle with the practice in the early stages. If this describes your experience, be satisfied with writing tests for the code you’ve written in the same commit until it starts to feel natural.
The activity of TDD consists of three steps:
- (RED) Write a failing unit test.
- (GREEN) Write just enough production code to make it pass.
- (REFACTOR) Now make the code pretty.
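As a tiny illustration of the cycle in Rust (the `add` function is an invented example):

```rust
// RED: write this test first; it fails to compile until `add` exists.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn adds_two_numbers() {
        assert_eq!(add(2, 3), 5);
    }
}

// GREEN: the simplest production code that makes the test pass.
fn add(a: i32, b: i32) -> i32 {
    a + b
}

// REFACTOR: with the test green, the implementation can now be
// reshaped freely; the test guards against regressions.
```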
The unit tests you write should strive to obey the three laws of TDD:
- Don’t write any production code unless it is to make a failing unit test pass.
- Don’t write any more of a unit test than is sufficient to fail; and compilation failures are failures.
- Don’t write any more production code than is sufficient to pass the one failing unit test.
Good unit tests have the following attributes:
- The test must fail reliably for the reason intended.
- The test must never fail for any other reason.
- There must be no other test that fails for this reason.
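For instance, continuing the hypothetical `add` example above, a test with no shared state, no I/O, and a single focused assertion satisfies all three attributes:

```rust
#[test]
fn add_handles_negative_numbers() {
    // Only `add` can make this fail, and only for this one behavior.
    assert_eq!(add(-2, 3), 1);
}
```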
Further Reading
It’s impossible to fully convey the scope of what you should know about test automation in this document. Below are some resources you may be interested in as you move through your career.
- Test Driven Development: By Example by Kent Beck
- The Art of Unit Testing: 2nd Edition by Roy Osherove
- Working Effectively With Legacy Code by Michael Feathers
- Refactoring: Improving the Design of Existing Code (2nd Edition) by Martin Fowler