Wrapping up work on the RLS for GSoC

Work summary

It’s time to wrap up the work done on the Rust Language Server (RLS) for this year’s edition of Google Summer of Code. Time surely flew fast since the first blog post!

In general, the goal of the project was to improve the user experience and IDE support story for Rust. Apart from regular bugfixes and QoL changes, the primary and biggest goal in terms of scope was to implement support for multiple package targets and for multiple packages in general, including Cargo workspaces. Another, a stretch goal of sorts, was to implement previewing macro expansions in the editor; in the end, however, there was not enough time to pursue it.

At the time of writing, the workspace support has been implemented (it still requires more polishing). In total, I managed to land 30 PRs for the RLS and push 50 commits (+2,837 insertions and -1,301 deletions).

Additionally, my talk about the RLS and my work on it has been accepted for RustFest Zürich 2017! It’s titled “Teaching an IDE to understand Rust” and I’m looking forward to presenting it on September 30th!

Work done for multiple packages and workspaces

For supporting multiple crate targets, one of the most useful implemented features is the ability to specify which target (the [lib] or a particular [[bin]]) should be analyzed by the RLS, as well as support for analyzing bin targets that require the lib from the same package. Additionally, the RLS itself can now detect which target should be analyzed if the user didn’t explicitly specify one in the configuration.

Furthermore, a new workspace_mode has been added. When it’s enabled, the RLS no longer panics when analyzing Cargo workspaces and happily provides all the compiler diagnostics for every package inside (assuming files are saved on disk), unless a specific package has been explicitly selected by the user via the analyze_package option.
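
To make this concrete, here is what these options might look like in an editor’s settings, shown as a VS Code settings sketch (the rust. prefix and the placeholder target/package names are illustrative, not reference documentation; in practice you’d use either the target options or the workspace ones, not both at once):

```json
{
    "rust.build_lib": false,
    "rust.build_bin": "my-binary",
    "rust.workspace_mode": true,
    "rust.analyze_package": "my-package"
}
```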

The work for workspace support was split into 3 stages:

  1. The first was to introduce a new, separate mode, under which the RLS would not crash and would provide very basic diagnostics.
  2. The second was to improve both performance and support by executing the linked compiler and gathering, from every package in the workspace, the diagnostics and analysis data that drive IDE features such as goto-definition.
  3. The third, the biggest, was to create and store a dependency graph of units of work (compiler calls), which would make it possible to improve the latency of the analysis and to support in-memory modified files.

As of now, all three stages are completed. The last one is still being polished - concrete areas of improvement are accurately determining which changed files require rebuilding which packages, and the reliability of gathering the provided analysis for multiple packages.

GSoC Experiences

Difficulties

One of the things I struggled with the most was prioritising the work I was tasked with. Even though the planned work was clearly defined, I tried to find a balance between working on the bigger, planned feature and implementing smaller QoL changes or fixes that made using the RLS easier (or possible at all) for some people. In the end, I found myself frequently working on the side tasks longer than I should have, effectively delaying the main job I was tasked with in the first place.

Having more freedom over one’s work is both good and bad. There are no strict deadlines looming over you; however, it’s easy to lose your way if you don’t organize the work yourself. When working on side projects I have sometimes used organization tools such as Trello or good ol’ Post-Its, and in hindsight, I should’ve used one for this project. I managed to do most of the planned work, but I think it would have helped me focus better on the task at hand and prioritise the work better.

I very much feel like this is the area I need to improve on the most.

Takeaways

Before I started working on an open source project, contributing to one seemed really daunting. I’m not sure what causes this; maybe it’s the stories about core developers of traditional open source projects posting rants or aggressively rejecting contributions.

However, it turns out the open source community is very welcoming and helpful. I think I’ve been bitten by the OSS bug by now, and thanks to working on one project during Google Summer of Code, I definitely find myself contributing to other projects as well.

I also have to give a shout-out to the Rust community specifically here. Whether I was posting issues, asking (sometimes silly) questions on IRC, or getting features the RLS required into Cargo, people were open and helpful every step of the way.

Speaking from a contributor’s perspective, it means a lot when people take the time to slow down and thoroughly explain why and how certain things work the way they do, so I’m really grateful for that. It encourages further contribution and makes you feel even better about doing it! After all, the work is useful not only to yourself, but to other people as well.

I also learned quite a lot about how IDEs may be structured and how they interact with other tools and systems, most notably the compiler and a build system, to provide the desired functionality. Previously I treated an IDE as a black box that performed some arcane compiler invocations where everything Just Worked™. Now that I’ve had the occasion to work on one, it has helped demystify how a compiler and an IDE may work, both in general and in tandem. As it turns out, IDEs are not as complex as I thought, and working on them was both really interesting and accessible.

Closing words

Google Summer of Code turned out to be a wonderful experience. It changed the way I think about the open source movement and opened that world up for me. Professionally, it also helped me grow as a programmer, both in terms of skill and of the communication required to do the job.

Furthermore, it let me connect with and work alongside a lot of creative, capable people who are smarter than me. Here, I’d like to especially give a shout-out to @nrc, Nick Cameron, who was a great mentor during this project, guided me through it perfectly, and managed to put up with me! :smile:

If you’re wondering whether to apply for next year’s edition, Google Summer of Code 2018 - do it! I can wholeheartedly recommend it.

The GSoC project may be coming to an end, but there’s so much more that can be done for the RLS and Rust in general! Working on it is a joy and I’m probably here to stay. See you over at the RLS repo and related projects - stay tuned! :sunglasses:

Most notable PRs

Supporting multiple packages and targets
Smaller features and bugfixes
Refactoring / project organization
rust-lang-nursery/rls-vscode
rust-lang/cargo

Taking a closer look at Cargo metadata

In this blog post I’m going to describe the Cargo metadata format and how it’s used to build a project.

Cargo does a great job managing different kinds of dependencies. It can easily resolve them, taking into account different package features and targets, to understand how a project needs to be built. I’ll use the word package to describe a Cargo crate, and by project I mean either a single package or multiple packages.

You can use the cargo metadata command to retrieve the metadata for a project. When executed in a Cargo project, it prints a JSON structure that describes all the relevant data Cargo could resolve for the project. The output is compact, so to get a better look at it, it’s a good idea to run cargo metadata | python -m json.tool (requires Python 2.6+).

Today we’re mostly concerned with packages and resolve objects.
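
To give a feel for the shape of the data, here is a heavily abridged, illustrative sketch of the output for a hypothetical package foo with a single libc dependency (real output contains many more fields per object):

```json
{
    "packages": [
        {
            "name": "foo",
            "version": "0.1.0",
            "id": "foo 0.1.0 (path+file:///path/to/foo)",
            "dependencies": [
                { "name": "libc", "req": "^0.2", "kind": null }
            ],
            "features": {}
        }
    ],
    "resolve": {
        "nodes": [
            {
                "id": "foo 0.1.0 (path+file:///path/to/foo)",
                "dependencies": [
                    "libc 0.2.30 (registry+https://github.com/rust-lang/crates.io-index)"
                ]
            },
            {
                "id": "libc 0.2.30 (registry+https://github.com/rust-lang/crates.io-index)",
                "dependencies": []
            }
        ]
    },
    "version": 1
}
```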

Packages

The packages object is a list of Packages that are in the project scope, i.e. are a dependency (direct or indirect) of the packages in the workspace.

Each Package contains further information about its own direct, declared Package dependencies, along with their SemVer requirements and kinds (whether each is a regular, build-, or dev-dependency). Furthermore, it also contains the features specified for the package and other metadata, such as the package name, description, or license.

Resolve

Since declared dependencies specify SemVer constraints, these need to be further resolved in order to create a simple, acyclic dependency graph with concrete, locked-down package versions that will be used during the build.

A resolve object is such a graph. The data representation is a dependency graph between PackageIds, where a PackageId is essentially a (name, exact version, source) triple. Not all package information is directly included in the graph representation, but the resolve object is constructed with the specified features taken into account, and more details about any given package can be retrieved from the packages object via its PackageId.

This information allows the project to be built bottom-up, starting from the dependencies, and guarantees that, after doing so, all required dependencies will have been built for the packages in the project.
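
As a minimal sketch of that bottom-up order, here is a post-order walk over a toy resolve graph, with PackageIds simplified to plain strings (in reality each node is the full (name, exact version, source) triple):

```rust
use std::collections::{HashMap, HashSet};

// Post-order walk over the resolve graph: a package is emitted only after
// all of its dependencies, yielding a valid bottom-up build order.
fn build_order<'a>(deps: &HashMap<&'a str, Vec<&'a str>>, root: &'a str) -> Vec<&'a str> {
    fn visit<'a>(
        pkg: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        seen: &mut HashSet<&'a str>,
        order: &mut Vec<&'a str>,
    ) {
        if !seen.insert(pkg) {
            return; // already scheduled
        }
        for &dep in deps.get(pkg).into_iter().flatten() {
            visit(dep, deps, seen, order);
        }
        order.push(pkg); // dependencies first, then the package itself
    }

    let (mut seen, mut order) = (HashSet::new(), Vec::new());
    visit(root, deps, &mut seen, &mut order);
    order
}

fn main() {
    let mut deps = HashMap::new();
    deps.insert("app", vec!["liba", "libb"]);
    deps.insert("liba", vec!["libb"]);
    deps.insert("libb", vec![]);
    // "libb" is the shared leaf dependency, so it has to come first.
    assert_eq!(build_order(&deps, "app"), ["libb", "liba", "app"]);
}
```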

Footnotes

The exact cargo metadata JSON schema is defined over at doc.crates.io.

One environment to rule them all

Environment variables are a set of values that can alter the way a process works. They are part of the environment in which a process runs and, as such, can be globally accessed throughout its execution. Cargo and rustc are no exception, and make heavy use of them to drive the compilation process or to pass additional configuration to the runtime (e.g. via RUST_BACKTRACE). Since the RLS uses both to perform its analysis build, it must pass appropriate environment variables to them. However, since these programs are run inside the RLS process rather than in their own, we can cause environment-related race conditions.

One example is compiling two packages in parallel. Both compilations would start and an appropriate environment would be set twice. However, since the compilations share the same process, the second one would overwrite the environment and introduce invalid data for the first. Because of that, we must guard the environment and provide mutually exclusive access to it for the programs that normally rely on it and which the RLS runs in-process.
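
A minimal sketch of that guarded-environment idea (the names and exact mechanism here are illustrative, not the actual RLS code): one process-wide lock is held for the whole duration of any work that depends on the environment, and the previous values are restored afterwards.

```rust
use std::env;
use std::sync::Mutex;

// A single process-wide lock guarding the (global, mutable) environment.
static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Runs `f` with the given variables set, holding the lock throughout so
/// no other in-process build can observe or clobber them mid-flight.
fn with_env<F: FnOnce()>(vars: &[(&str, &str)], f: F) {
    let _guard = ENV_LOCK.lock().unwrap();
    // Remember the previous values so the environment can be restored.
    let old: Vec<_> = vars.iter().map(|&(k, _)| (k, env::var(k).ok())).collect();
    for &(k, v) in vars {
        env::set_var(k, v);
    }
    f(); // e.g. run a compilation that reads these variables
    for (k, v) in old {
        match v {
            Some(v) => env::set_var(k, v),
            None => env::remove_var(k),
        }
    }
    // `_guard` is dropped here, releasing the environment for the next build.
}
```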

I was recently trying to change the way the compiler is invoked while working on supporting Cargo workspaces in the RLS. During that time, an implementation of mutually exclusive access to the environment landed, using a simple Mutex<()> and an Environment RAII guard, which held the Mutex lock guard. After that, I encountered a strange regression in the workspace support test - the environment was effectively locked (somewhat) recursively, leading to a deadlock.

The initial implementation of the scope locking the environment with a Mutex used only a single lock. Originally, during the Cargo build routine, rustc compiler instances were run as separate processes, each one having its own environment appropriately set by Cargo. However, executing the linked compiler led to an attempt to lock the environment for the duration of the compilation, while it was still locked at the outer scope for the entire duration of the Cargo execution. Oops.

What was needed was a consistent, scoped environment both at the outer scope (the Cargo execution) and at the inner one (the compiler scope). Locking two separate, unrelated Mutexes for each scope would not work: because the rustc build routine can be executed both separately and as part of the Cargo build routine, we need a single, shared outer lock for both situations. If we don’t have one, Cargo could acquire the outer lock and access the environment while another rustc build routine acquires only the inner lock, still leading to both sharing the environment.

The solution seemed like an easy one - all builds should first acquire a single, shared lock and, whilst holding it, possibly acquire an inner one for no longer than the first is held. It’d be great to encode that information using RAII guards with specified lock lifetimes, all while guaranteeing the order of locking; however, there was one problem with this approach.

To intercept planned rustc calls, Cargo exposes an Executor trait, which effectively provides a callback (exec()) with all the arguments and environment rustc would be called with, so the compiler execution can be freely customized. What’s unfortunate is that the API expects an Arc<Executor>. This means there are no guarantees that the passed pointer will not be further copied out of scope, so we can’t really limit the lifetime of the inner lock here (and it has to live shorter than the outer lock).

At the moment, the implementation does not strictly enforce lock order. Ideally, access to the second, inner lock should be given if and only if the outer lock is held. The outer lock() function returns a (MutexGuard<'a, ()>, InnerLock), where InnerLock is a unit struct that has access to the static inner lock and should not live longer than the returned outer lock guard. While it can technically be copied outside the scope of the initial lock guard, this seems acceptable for the time being.
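
In code, the shape of that API looks roughly like this (an illustrative sketch; names and details differ from the actual RLS implementation):

```rust
use std::sync::{Mutex, MutexGuard};

static OUTER_LOCK: Mutex<()> = Mutex::new(());
static INNER_LOCK: Mutex<()> = Mutex::new(());

/// A token proving the outer lock has been taken; only it can reach the
/// inner lock. Ideally it would not outlive the outer guard, but that is
/// not enforced here (nor, currently, in the RLS).
pub struct InnerLock(());

impl InnerLock {
    /// Acquire the inner (compiler-scope) lock.
    pub fn lock(&self) -> MutexGuard<'static, ()> {
        INNER_LOCK.lock().unwrap()
    }
}

/// Every build first acquires the shared outer lock, receiving alongside
/// it the only means of acquiring the inner lock.
pub fn lock() -> (MutexGuard<'static, ()>, InnerLock) {
    (OUTER_LOCK.lock().unwrap(), InnerLock(()))
}
```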

The final implemented solution isn’t ideal, but it does the job. The requirements were fairly straightforward and the scope limited, so it was reasonable to implement a correct solution rather than a more generic one. While not foolproof, it still allows encoding some of the logic and ordering in the type system, such as acquiring the InnerLock only after locking the outer Mutex. If needed, it can be improved later, but right now it provides the guarantees we needed to further the workspace support work in the RLS.

Supporting workspaces in RLS

One of the main goals of my GSoC project is to increase the usability of the RLS and the range of projects it supports. As I mentioned in the previous post, I managed to implement support for a common project layout - a single package split into a library and a binary that uses it - but a larger goal remains in sight: multiple packages and, with them, Cargo workspaces.

To briefly explain what a workspace is, using the second edition of The Book as a reference: a workspace is a set of packages that all share the same Cargo.lock and output directory. Workspaces help organize multiple related packages in a single repository, mainly by keeping a single, well-defined dependency set for all of them (a single Cargo.lock file) and by allowing the resulting build artifacts to be shared and reused across packages (a single, shared output directory).
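
For illustration, a minimal (hypothetical) workspace is just a root manifest listing its member packages, which then share the root Cargo.lock and target/ directory:

```toml
# Cargo.toml at the repository root
[workspace]
members = [
    "foo",   # ./foo/Cargo.toml
    "bar",   # ./bar/Cargo.toml
]
```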

Designing the solution

When designing the new implementation, creating a design document helped me immensely. Not only did it help me lay down and materialize a design that existed only as an idea, but it also serves as a reference for why certain design decisions were made in the end, along with their context.

As of now, the RLS works in a single package mode. Using Cargo, it still resolves dependencies and generates all the intermediate build artifacts along with analysis data for a given package; after doing so, however, it only runs the linked rustc compiler for the single active target in the package. This means that, depending on which target is configured as active in the RLS, it will collect diagnostic messages and check compilation only for that specific target. That’s what the build_lib, build_bin and, partially, cfg_test configuration options are currently for.

The end goal is to change how the RLS operates and allow it to support multiple targets by default. This is often desirable even for a single package, as in the previously described project layout with both a library and a dependent binary target.

However, since this will fundamentally change how the RLS works, a complete switch to the final design in one move would not only be risky in terms of possible regressions or implementation bugs, but would also require a substantial amount of work, during which the implementation would have to be constantly adapted to frequent code changes and would provide no features until completion.

The plan

That is why the work on supporting multiple active packages will be done incrementally and initially gated behind a workspace_mode configuration switch. The plan can be briefly laid out as follows:

  1. First, a prototype implementation is created that moves away from mostly running the linked rustc to using Cargo exclusively for every build. After every package compilation, the generated output and analysis data files are consumed to build the analysis database.
  2. From there, we can use Cargo only for build coordination and run the linked rustc compiler instead of the one Cargo would spawn. Thanks to this, the RLS will be able to provide more accurate analysis on the fly by feeding in-memory file buffer contents to the compiler, as well as by fetching analysis data directly from memory instead of serializing it and reading it back from a file.
  3. Finally, the RLS will coordinate builds on its own, at first using the dependency graph from Cargo. By doing so, it stops being directly dependent on Cargo and can be further extended to work with other build systems. Additionally, build queuing and management can be more fine-grained, leaving room for more optimizations and reduced analysis latency.

Supporting custom build systems later on could probably be done by consuming some sort of build plan in a standard format (support for Cargo emitting one is already planned), but that is probably outside the scope of the current project given the time constraints.

And now the details

After specifying a high-level plan, it’s time to delve into the details. These are not strictly tied to any particular step of the plan and mostly explain some design motivations and the various caveats involved.

Compilation times can be quite high when analyzing every package in the workspace, so an additional analyze_package option will be provided as a convenience for when you’re working on, and interested in, only a single workspace package; it will act as if cargo check -p <analyze_package> were executed.

Build Queue

There is one tricky bit once the RLS starts running the linked compiler in-process instead of letting Cargo execute separate rustc processes: environment variables. We want to parallelize the build as much as possible to improve compilation times; however, since everything runs inside a single RLS process, we only have a single environment. The environment is essentially global mutable state, and changing it for every parallel build that starts is just asking for trouble, since we cannot guarantee that it stays the same throughout a build (Cargo provides certain guarantees during compilation).

To deal with this, the initial, not-yet-cached build for the project will be run as it would be in the prototype, in a parallel fashion. Very often, the full project build will include many directly unrelated dependencies. Because of this, it makes sense to compile them all in parallel and suffer the cost of serializing and reading an analysis data file for every package, rather than compiling the packages one by one with the analysis data read directly into memory.

However, once the necessary dependencies have been built, subsequent changes to the workspace won’t necessarily require a rebuild of all packages. This means it’ll probably be acceptable to run the requested builds sequentially, since we need to ensure a consistent environment during package compilation.

Nevertheless, it’s best to profile first, measure the performance of both scenarios, and only then decide on the final behaviour.

Files

Since we do not generate the dependency graph ourselves, we need to rebuild it via Cargo whenever a source file is added or deleted, or when a Cargo.toml file is modified. Moving files around, as well as modifying Cargo.toml, can change the workspace structure or even the implicit targets (e.g. src/main.rs or src/lib.rs), and that’s why the graph has to be rebuilt to avoid stale data. Once it’s regenerated, the RLS can continue as usual.

Cargo relies on the notion of registries and hash fingerprints to determine whether a package needs rebuilding. However, when files are modified in-editor, there are no changes on disk, so Cargo can’t know that a certain package needs rebuilding. While the RLS does provide a virtual file loader facility for the linked compiler, package freshness is checked by Cargo before the compiler has a chance to run. Providing and injecting our own virtual registry just to fake fingerprints, or modifying the Cargo API solely for the RLS’ purposes, seems a bit too extreme. However, once the RLS has a dependency graph of its own, the problem effectively solves itself: it won’t rely on Cargo for compilation and coordination anymore, and it can simply run the compiler itself with the virtual file system providing the modified source files.

One more thing to add: LSP servers aren’t explicitly forbidden from performing disk I/O or watching files themselves; however, the protocol provides a workspace/didChangeWatchedFiles notification for whenever a file watched by the client changes, and it seems that watching on the client side is preferred, as per this discussion. With this, the RLS won’t (nor probably should it) do the file watching itself and will rely on the protocol messages instead.
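
Such a notification, as sent by the client, looks roughly like this (the file path is made up; a type of 2 stands for “Changed”):

```json
{
    "jsonrpc": "2.0",
    "method": "workspace/didChangeWatchedFiles",
    "params": {
        "changes": [
            { "uri": "file:///home/user/project/src/main.rs", "type": 2 }
        ]
    }
}
```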

Analysis

During the build routine, publishing compiler diagnostics and updating the current analysis data can be done at two different points in time: either directly after each crate compilation, or in one fell swoop once the whole build is finished. Processing the data after each crate instead of all at once reduces latency for the user, but requires more work to keep the analysis data consistent when mixing old and new analysis.

Preferably, we want to stream the data in a per-crate fashion. If the build is run in parallel, a separate thread should process incoming messages and analysis after each crate compilation, updating the data on the fly until the build is finished.
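
A minimal sketch of that streaming setup, using a channel and a dedicated consumer thread (all types and names here are made up for illustration):

```rust
use std::sync::mpsc::channel;
use std::thread;

// Stand-in for whatever a single crate's compilation produces.
struct CrateAnalysis {
    crate_name: String,
    // diagnostics, definitions, references, ...
}

fn main() {
    let (tx, rx) = channel::<CrateAnalysis>();

    // Dedicated consumer: folds results into the analysis database (and
    // publishes diagnostics) as soon as each crate finishes compiling.
    let consumer = thread::spawn(move || {
        for analysis in rx {
            println!("updating analysis for `{}`", analysis.crate_name);
        }
    });

    // Compilations, possibly running in parallel, each report when done.
    let workers: Vec<_> = ["liba", "libb"]
        .iter()
        .map(|&name| {
            let tx = tx.clone();
            thread::spawn(move || {
                // ... compile the crate here ...
                tx.send(CrateAnalysis { crate_name: name.to_string() }).unwrap();
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }
    drop(tx); // close the channel so the consumer loop ends
    consumer.join().unwrap();
}
```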

Work to come

A few days ago, a PR with the prototype implementation of workspace support was merged. So far I have mostly tested it on the webrender project and was thrilled to see it working!

It still has a few drawbacks compared to the current single package mode - most notably longer compilation times, and it requires file changes to be saved to disk for correct analysis. It’s still rough around the edges and may not work perfectly, but I’d like to encourage people working with Cargo workspaces to switch it on. It’s currently opt-in via workspace_mode; any additional testing and feedback will be greatly appreciated!

With the prototype done, what’s left now is to improve it and slowly move towards coordinating the build ourselves, becoming less reliant on Cargo. I’m very excited to finally see some results of my work and I hope it will prove useful to others as well. I’ll be back with more updates on how the work is progressing. Stay tuned!

Working on Rust Language Server for GSoC 2017

Introduction

This year’s edition of Google Summer of Code began on May 30th and I managed to enroll as a student! This means I’ll spend the summer hacking away at the Rust Language Server. This post marks the start of a series about my experiences with GSoC and, specifically, working on the RLS, which I hope will prove interesting and provide some insight into what such work looks like.

The project

My GSoC project is about working on and extending the Rust Language Server, a tool providing analysis for Rust projects. It is not another IDE, but rather a standalone tool that uses the Language Server Protocol to communicate with different frontends (clients). This approach to creating analysis tools is fairly new, since the protocol has only recently been standardized and has since gained widespread adoption. This means there’s a variety of things to work on, and more freedom to experiment with the final design of the project.

Language Server Protocol

The protocol has been created and standardized by Microsoft and, as explained in its repository,

The Language Server protocol is used between a tool (the client) and a language smartness provider (the server) to integrate features like auto complete, goto definition, find all references and alike into the tool.

The core idea is to extract and standardize the glue used to integrate all the language features into any text editor (one which supports the protocol, that is). It uses a slightly extended JSON-RPC protocol to define the request/response flow and specifies the supported set of requests that can be used during communication.
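
For example, a goto-definition request sent from the client to the server looks roughly like this (each message is additionally framed with HTTP-style Content-Length headers - that’s the “slight extension”; the file path here is made up):

```json
{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/definition",
    "params": {
        "textDocument": { "uri": "file:///home/user/project/src/main.rs" },
        "position": { "line": 10, "character": 6 }
    }
}
```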

What’s really cool is that each language needs only one server implementation, significantly reducing the amount of work required to support every editor. By separating the core language service from the editor, the analysis tool can focus on providing a quality, well-defined set of tools for working with the language, and the text editors can focus on what they were primarily designed for - editing text.

More information, along with a more detailed list of available client and server implementations (including RLS!), can be found over at langserver.org.

Why not just develop an IDE or a plugin?

Most importantly, because it’s a costly and time-consuming endeavour. Writing a fully-fledged, mature IDE from scratch takes a lot of focused effort. Doing so also often entails writing an internal compiler to provide or facilitate detailed language analysis. However, since Rust is a complex language that encodes a lot of information at compile time (e.g. complex generics, the trait system), writing such a compiler would require a huge amount of work or would lead to inaccuracies in the analysis.

What if we instead chose to write an analysis plugin for one of the editors we prefer? The amount of work required would be somewhat less than developing a complete IDE, but we would still need to create an entire analysis tool from scratch, just like in the previous case. We might also have to tailor our solution to a specific editor, which could be limiting, depending on our choice. Furthermore, since such a tool would more often than not be bound to that editor, we would leave out users of other editors, e.g. Vim or Sublime.

I’d also like to note that the end goal isn’t to shrink IDEs in terms of capabilities. Rather, it can be thought of as splitting an IDE into components, where the LSP is just a custom inter-process protocol used to communicate between them.

Architecture

Thankfully, the Rust Language Server does not try to reimplement all the functionality on its own. Most importantly, it uses the core Rust tools: rustc, the official compiler, and Cargo, the Rust package manager. Cargo is used to detect the project workspace and its configuration, while the analysis data is emitted by the compiler during a special check compilation, in which the final build stage is omitted.
The RLS mostly coordinates the build process with the help of Cargo and manages the analysis data for diagnostics and request handling.
Finally, the LSP is used for communication between the RLS and a language server client; the latter is responsible for forwarding requests made by the user in the editor, as well as getting the information back to it.

Planned work and beyond

As I mentioned earlier, there’s a variety of work that can be done on the RLS. My main focus will be on extending the integration with Cargo, specifically supporting project workspaces, which are used to manage bigger projects consisting of multiple packages. Bigger projects are where IDEs shine brightest, so supporting them would mean a considerable leap in usability for the RLS. One useful feature that is already implemented is the ability to specify which target the analysis data should be provided for (using --bin <target>, specifically). This should prove helpful for people working on packages that have multiple possible executable targets.

Having done that, I’d like to improve the ergonomics and the user-facing side in general. An exciting feature to add would be macro expansion previews in the editor.

All in all, I’m very excited to be working on this project. It’s a great opportunity to learn more about the inner workings of Rust and of IDEs in general, and to build something usable by others. I’ve always felt that Rust lacked proper IDE support, so now’s my chance to change that. More to come - stay tuned.