Learn some basic Rust by trying to cache with a HashMap

2021-08-08

2023-11-08

Lately I’ve been working on my first big Rust project, a part of which involves fetching remote assets over HTTP and caching them into a std::collections::HashMap.

Previously, I had been threading the HashMap throughout my code, making requests and inserting entries as necessary. Recently, though, I’ve decided to further abstract this into a struct where one can declare that it wants the content of an asset—if it hasn’t been fetched yet, then the request is made and the response is cached. Otherwise, the cached response is returned.

This means that code that deals with the content of assets does not have to care about having to fetch it. Yay, separation of concerns! Abstractions! Hopefully “zero cost” ones! (…joking.)

A simplified rendition of this abstraction looks a bit like this:

use std::collections::HashMap;

struct Cacher {
    cache: HashMap<String, String>,
}

We start off with our Cacher struct. The key is an opaque asset ID that can be any UTF-8 string. The asset content is also a UTF-8 string. It is appropriate for both to be owned by the struct itself, so we use String instead of &str. This also implies that we cannot have any binary asset content.

To make this more clear, we can use type aliases instead:

type AssetId = String;
type AssetContent = String;

struct Cacher {
    cache: HashMap<AssetId, AssetContent>,
}

That’s better.

It would be ✨ even better ✨ to make “newtypes”, like this:

struct AssetId(String);
struct AssetContent(String);

struct Cacher {
    cacher: HashMap<AssetId, AssetContent>,
}

This restricts what user code is able to do, what assumptions it’s able to make, and how the types are able to be made. But enough API design! Let’s just keep things simple, yeah?

Now that fetching happens implicitly, any operation that obtains asset content is now fallible, since the (potential) HTTP request can fail. This means that we’ll have to return a Result<T, E>.

I’ll assume you know about this, and the ? (try) operator too. If you don’t, check out the link(s) above. Rust docs 👍

Since I’m just trying to give an example here, I’ll use the unit type as our error:

// amazing error type :100:
type Error = ();

Implementing `Cacher`

Let’s sketch out the actual implementation of our Cacher. Again, we want the somewhat declarative behavior where user code just says “hey, I want the content of this asset!” and it’ll be fetched and cached (if necessary), returning the content to whoever is being pesky about it.

// it's important that this lives outside of the `Cacher` struct so we don't
// have to borrow `self` again when fetching which could be a problem
//
// PLUS fetching has nothing to do with caching so separation of concerns x2!
fn fetch(id: &str) -> Result<String, Error> {
    // how we exactly do this isn't important yaknow
    Ok(String::from("remember to hug your blahaj"))
}

impl Cacher {
    fn content(&mut self, id: &str) -> Result<&str, Error> {
        match self.cache.get(id) {
            // If we have the cached asset content, just return it. This can't
            // fail.
            Some(content) => Ok(content),

            // We haven't cached the asset yet!
            None => {
                // Fetch the asset.
                //
                // If something goes wrong, propagate that error by returning
                // early.
                let content = fetch(id)?;

                // Cache, so we don't have fetch this asset again.
                //
                // Here, we use `to_owned` to copy the asset ID so the `HashMap`
                // can own it as a `String`.
                self.cache.insert(id.to_owned(), content);

                // Return a reference to the asset content we just fetched.
                //
                // If it's not there for whatever reason, panic with an error
                // message.
                Ok(&self
                    .cache
                    .get(id) // kinda sucks that we have to do this awkward dance,
                             // cuz we end up hashing `id` *twice* overall if
                             // we take this path

                    // aaand this is a little pointless too.
                    .expect("key we just inserted is not there?!")
                )
            }
        }
    }
}

Obligatory playground link!

Dissecting the signature of content a bit, we take an asset ID &str and return another yet another &str. We use string slices because we only want read-only views into strings.

A view we accept and a view we return. We don’t need to own the requested asset ID string, and it’d be awfully inefficient to copy the asset content every time we returned it.

The self.cache.get call borrows the value for us, returning an Option<&String>, an optional reference to the cache’s value. Due to deref coercion though, we can treat it as if it were a Option<&str>.

Hey, wanna know the difference between str, &str, and String? Or maybe you already do and want a review? Yeah you do! 😎

A String value is an owned string buffer, with data that lives on the heap.

Internally, it’s a wrapper over a Vec of bytes (Vec<u8>) with the additional guarantee that the bytes are valid UTF-8.

Meanwhile, a str is a string slice, which represents a region of UTF-8 text. It isn’t a pointer—it directly refers to the bytes that make up a string. This means we can’t make a variable of type str or pass them into functions, since dynamically sized types can’t live on the stack.

str in its borrowed form, &str, is a shared reference to a string slice. Slices are represented as a pointer and length, so we can pass &strs into functions and so on, since we know the size of both at compile time. &strs are useful for encoding the idea of “a read-only reference to some UTF-8 text”.

String implements Deref<Target=str>, so all of str’s methods implicitly become available on String. Likewise, you can treat &String as &str. This makes sense because a reference to a String fits the bill of “a read-only reference to a string somewhere”, which is what &str is.

The big difference is that with String, you own the bytes. &str is just a reference to some UTF-8 bytes somewhere else in memory.

To summarize: str is a UTF-8 text region, &str is a reference to that region, String is a string somewhere on the heap. Since &String is conceptually a reference to a text region, you can treat it as &str.

What about `entry`?

You may notice that we are expressing the following behavior: “if this key isn’t present, insert some value beforehand; then, get key’s value”. This procedure is so common that HashMap has an entry API that lets you do exactly that:

let mut map: HashMap<&str, u32> = HashMap::new();

map.entry("poneyland").or_insert(3);
assert_eq!(map["poneyland"], 3);

However, this particular API isn’t well-suited for our use case, as we’d have to make the request every time in order to pass it as an argument. Luckily, there is an or_insert_with API, which is similar to or_insert, but takes a closure instead:

let mut map: HashMap<&str, String> = HashMap::new();
let s = "hoho".to_string();

map.entry("poneyland").or_insert_with(|| s);

The closure is only called if the entry is vacant, which seems perfect!

Unfortunately, we can’t use this API either, because you can’t propagate any errors that happen during fetching inside of the closure. It’d try to return the error from inside the closure, and the or_insert_with isn’t expecting a Result. Darn.

The error

Okay, now we know we can’t use the or_insert or or_insert_with APIs to get around this issue. Let’s try to compile our Cacher implementation:

error[E0502]: cannot borrow `self.cache` as mutable because it is also borrowed as immutable
  --> src/lib.rs:19:17
   |
14 |     fn content(&mut self, id: &str) -> Result<&str, Error> {
   |                - let's call the lifetime of this reference `'1`
15 |         match self.cache.get(id) {
   |               ---------- immutable borrow occurs here
16 |             Some(content) => Ok(content),
   |                              ----------- returning this value requires that `self.cache` is borrowed for `'1`
...
19 |                 self.cache.insert(id.to_owned(), content);
   |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ mutable borrow occurs here

Examining the code again:

fn content(&mut self, id: &str) -> Result<&str, Error> {
    match self.cache.get(id) {
        Some(content) => Ok(content),
        None => {
            let content = self.fetch(id)?;

            self.cache.insert(id.to_owned(), content);

            Ok(&self
                .cache
                .get(id)
                .expect("key we just inserted is not there?!"))
        }
    }
}

To check for a potential entry in the cache, we use self.cache.get, which borrows the HashMap, because it’s syntax sugar for HashMap::get(&self.cache, id). If it’s there, then we return the reference to it, wrapped in Ok.

However, the borrow “lives” into the None arm of the match, preventing us from calling self.cache.insert, because you can’t borrow mutably and immutably at the same time.

By the time we reach the None arm, though, we don’t need the borrow from the get call anymore. So what gives?

The issue

Our issue arises from the fact that the lifetime of the &str we return is tied to the lifetime of the borrow from the self.cache.get call. We can make this more obvious if we expand out the lifetimes ourselves.

// "lifetimes are just boneless generics"
//        vvvvvvvvvvvvvvvvvvvvv
fn content<'ourself, 'asset_id>(

    // `self` contains the all-important hashmap we care about
    &'ourself mut self,

    // the asset id that is being requested
    id: &'asset_id str

    //      the string slice we return has the same lifetime as `self`
    //      since it comes from the hashmap on `self`
    //      vvvvvvvvvvvvv
) -> Result<&'ourself str, Error> {

    //    borrowing 'ourself here...
    //    vvvvvvvvvvvvvvvvvv
    match self.cache.get(id) {
        //                  ...and returning it here...
        //                  vvvvvvv
        Some(content) => Ok(content),

        // [snip]
    }
}

Both &mut self and the &str we return have to have the same lifetime 'ourself because the reference to the hash map data can only live as long as the Cacher struct (and therefore the HashMap inside) is still around. We don’t really care for how long the id string lives, because we’re only using it in the lookup.

Returning the content we get from the cache.get call means that the borrow has to live for the duration of whole function, meaning that it doesn’t even matter that we don’t need the borrow in the None arm of the match!

This is more evident when we examine the lifetimes in the signature. mut self is annotated with 'ourself, and the return value has 'ourself, too. That means if we borrow self (which we do in the self.cache.get call) and return something from that borrow, the borrow would have to live until the end of the function.

//                these things have the same lifetimes; because we borrow
//                `self` and return something from that borrow, the borrow
//                would have to "last" for the whole function
//                vvvvvvvvvvvvv                         vvvvvvv
fn content<'a, 'b>(&'a mut self, id: &'b str) -> Result<&'a str, Error> {
//                                   ------- this lifetime is unimportant

OK, I know I basically just restated what rustc said. But I think it’s good to explain the details behind that error, because I was pretty confused on why it was erroring in the first place.

Why NLL doesn’t help

If you happen to know about Rust’s “non-lexical lifetimes”, you might’ve been confused as to why the code wasn’t accepted by the borrow checker.

As a Compiler That’s Trying Its Best, rustc is inherently conservative in that it’s perfectly willing to reject programs that cannot be adequately proven to be safe under the compiler’s verification model, even if they are actually safe in reality. This can be frustrating, especially in situations where we simply know better. NLL was an effort to loosen some of these restrictions by making the compiler smarter:

fn print(string: &str) {
    println!("{string}");
}

fn main() {
    let mut cool = String::from("hello,");

    // we borrow `cool` mutably…
    let exclusive_borrow = &mut cool;
    exclusive_borrow.push_str(" world!");

    // since we never use `exclusive_borrow` again, we can borrow immutably
    print(&cool);
}

Traditionally, this code would be rejected because the borrow would last until the end of the enclosing block (hence, lexical lifetimes).

OK, so how do we fix it?

We can fix this with HashMap’s entry API, but not with the or_insert nor the or_insert_with methods. Instead of those helpers, we use the Entry enum instead.

fn content(&mut self, id: &str) -> Result<&str, Error> {
    use std::collections::hash_map::Entry;

    match self.cache.entry(id.to_owned()) {
        Entry::Occupied(entry) => Ok(entry.into_mut()),
        Entry::Vacant(entry) => Ok(entry.insert(fetch(id)?))
    }
}

entry returns an Entry. If occupied, we call into_mut to get a mutable reference to the value with the same lifetime as the HashMap itself. (We don’t use get because that’d borrow from the entry itself instead, and that’s going to be dropped by the end of the function.) Mutable references are coercible to immutable references. OK!

If vacant, insert into the entry and return the value simultaneously. Now we don’t need to call get again and deal with the Option, nice!

It’s a bit weird that we need to clone the asset ID when grabbing the entry, but that seems to be a hard problem.

Conclusion

I hope you took something away from this glorified error dissection! Here’s a small review outline you might find useful:

Use type aliases or newtypes to increase readability and establish API boundaries.
str is a UTF-8 text region, &str is an immutable reference to it. String is an owned UTF-8 string buffer. String impls Deref<Target=str>, so String gets str’s methods and you can treat &String as &str.
NLL permits some programs that strict lexical lifetimes would reject, but it has its limitations.
Returning something from a borrow with a lifetime in both the self and return position means that borrow lives for the whole function.
The Entry API allows for fluent and performant HashMap usage, letting succinct patterns emerge. If you run into ownership problems vis-à-vis HashMap, see if Entry can help.

Hopefully the tangents and asides weren’t too disorienting, but I’d like this post to function as a branching point for other useful resources, ideas, and whatnot. Plus some things are just too important to gloss over!

Implementing Cacher#

What about entry?#

The error#

The issue#

Why NLL doesn’t help#

OK, so how do we fix it?#

Conclusion#