Data-Driven, Descriptor Based Generative C++ • memdump

Today I’d like to share with you an idiom I’ve been enjoying more and more. I doubt it is novel, but I haven’t heard about it before. It all started while authoring a new domain-specific scripting language and its compiler/data-generator. I was fighting with Google FlatBuffers limitations when a virus (aka an idea) entered my mind. Why use a separate schema notation and schema compiler to generate C++ serialization code, when we can do it all in C++ directly?

What I’ll be discussing today is a bit more philosophical than usual. Ultimately, it is just a different way to write templated C++ and type traits. However, I’ve found data-driven descriptors extremely useful as they model my thought process very well. Some of my colleagues, who dislike templates, have commented favorably on the readability of this technique. To me this is a huge win. I don’t particularly enjoy writing undecipherable code.

The main goal here is to remove copy-paste, catch errors at compile time and make our architecture configurable (at no extra runtime cost). Gathering our schemas into a central location gives us a go-to header for reading documentation. And since schemas are used to generate our code, the documentation is actually up-to-date (unheard of before)! Of course, pushing more work to compile time also enables nifty performance optimizations.

What Are Descriptors

What we’ll look into is using plain old structs, with compile-time static variables, to generate C++. This system will decouple instantiation parameters of your templated classes from their declaration. We will have all our template arguments gathered in a human-readable header which acts as our schema. Again, this schema doubles as always up-to-date documentation and a single source of truth. And since we don’t use anything too fancy there, anyone can read our schema and understand our system characteristics. In a nutshell, we’ll be exploring data-driven generative C++.

If you prefer, you can think of our descriptors as a JSON file, or an XML file, used to generate C++. But instead of going through the hassle of actually doing so (à la protobufs, flatbuffers), we just write the schema in C++ directly. Because we can.

Under-the-hood, it isn’t particularly fancy. But from a comprehension standpoint, it works wonders.

Benefits And Draw-Backs

To list a few, here are some pros and cons I’ve gathered since I started organizing my code this way.

Pros

Descriptors are very easy to read & understand.
They act as up-to-date comments, which don’t lie.
Data-driven code generation eradicates a lot of copy-paste.
Clear mental model.
We get the benefits of compile-time code, namely catching mistakes early and performance.
They are, quite simply, fun to use.

Cons

To keep things “safe” or “correct”, you must write a lot of parsing and static_assert code.
You’ll often end up in circular include hell and must be thoughtful of forward declarations to prevent this.
High up-front programming cost.
Descriptors still require some potentially voodoo template code underneath the hood.

Dipping Our Toes

Let’s jump right in with an (overly simplified) illustration of what we’re talking about, so we’re all on the same page. Traditionally, you’d write a heavily templated class like so.

template <class Container, bool StackOptim,
		size_t NumElements, class... MaybeSomeVarArgs>
struct potato {
	// Some magical things, unicorns, etc.

	potato(const std::string& name);
};

There’s nothing particularly wrong here, but at the call site, it is quite cryptic and dare I say, rude.

// Ehhh, yeah no?
// #amifiredyet
potato<std::vector<float>, true, 10, int, int, float> p{ "russet_potato" };

The problem with the above is multi-fold, but let’s focus on the case where we have multiple potato types that are used throughout our codebase. First, it is quite unclear which parameter does what. If it was a thing™, designated template initialization might help, but it is not a thing™. Another problem is enforcement, since you can pass in whatever variadic arguments you desire. If a developer other than yourself must create new potatoes, this is very risky and uncomfortable. In 5 years, you likely won’t remember the details either. Finally, even though you can typedef this instantiation, someone trying to understand what is happening will eventually end up at this instantiation.

To clarify the above instantiation for our colleagues and future selves, we will use a data-driven descriptor.

// In a header far-far-away.

// Our descriptor keys. If you only have 1 you shouldn't be using this idiom ;)
enum class potato_variety {
	russet,
	count,
};

// A descriptor. Basically a C++ schema.
// The great thing here is not only do these variables have names, we can add comments!
struct RussetPotatoDescriptor {
	static constexpr potato_variety key = potato_variety::russet;
	static constexpr std::string_view name = "russet_potato";

	using container = std::vector<float>;

	// A comment explaining this amazing stack optimization we are very proud of.
	// We would appreciate a trophy or a medal for this optimization.
	// CANT YOU SEE I POURED MY SWEAT BLOOD AND TEARS INTO THIS OPTIMIZATION!?!?11!
	// ...
	// ahem
	static constexpr bool stack_optim = true;

	static constexpr size_t num_elements = 10;
	using a_descriptive_name = std::tuple<int, int, float>;

	// etc.
};


// In another header, not so far away...
// Our new templated class.
template <class Descriptor>
struct potato {
	/* A lot of static_asserts for the descriptor. */
};


// Later on...
potato<RussetPotatoDescriptor> p;

This is a contrived example, but should illustrate the general idea. We move our parameters into a reusable struct, and use that to instantiate our potato. If we want, we can declare an alias to the pre-configured type. Of course, with just one potato type, it doesn’t make much sense. But if we choose to offer many predefined potatoes, this becomes quite useful. Furthermore, with if constexpr and the is_detected idiom, we can make schema members optional. You may think that is already the case for traditional template arguments, but it is not so simple (or achievable) when using variadic arguments.
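To make the optional-member idea a bit more concrete, here is a minimal sketch, assuming C++17. It uses a hand-rolled std::void_t detector in place of the library is_detected helper, which lives in std::experimental and isn’t available everywhere. The names has_stack_optim, yukon_descriptor and uses_stack_optim are made up for illustration, and the descriptors are reduced versions of the russet example above.

```cpp
#include <cstddef>
#include <string_view>
#include <type_traits>
#include <vector>

// Detector for an optional `stack_optim` member (hypothetical helper).
// The primary template is picked when the member is missing.
template <class T, class = void>
struct has_stack_optim : std::false_type {};

// Picked when `T::stack_optim` is a valid expression.
template <class T>
struct has_stack_optim<T, std::void_t<decltype(T::stack_optim)>> : std::true_type {};

// Reduced descriptor which opts into the optimization.
struct russet_descriptor {
	static constexpr std::string_view name = "russet_potato";
	using container = std::vector<float>;
	static constexpr bool stack_optim = true;
	static constexpr std::size_t num_elements = 10;
};

// Hypothetical second descriptor which omits the optional member.
struct yukon_descriptor {
	static constexpr std::string_view name = "yukon_potato";
	using container = std::vector<float>;
	static constexpr std::size_t num_elements = 4;
};

template <class Descriptor>
struct potato {
	// Required members: simply using them refuses to compile if one is missing.
	static_assert(!Descriptor::name.empty(), "Descriptor needs a non-empty name.");
	static_assert(Descriptor::num_elements > 0, "Descriptor needs num_elements > 0.");

	// Optional member: falls back to false when the descriptor doesn't provide it.
	static constexpr bool uses_stack_optim() {
		if constexpr (has_stack_optim<Descriptor>::value) {
			return Descriptor::stack_optim;
		} else {
			return false;
		}
	}
};
```

The discarded if constexpr branch is never instantiated, which is why potato<yukon_descriptor> compiles even though it has no stack_optim member.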

Usage Patterns

So far, two main patterns have emerged while integrating descriptors into code. One is a centralized, queryable map of descriptors. The other is inheritance from a templated base class: a CRTP pattern of sorts, where you pass the descriptor instead of yourself. Both are useful, so let’s explore.

Compile-time Descriptor Databases

By gathering your descriptors in a central database-like object, you can grow your architecture easily over time. In the database map, I also like to gather the descriptor members into arrays, which makes them queryable at runtime. As I often do, I use an enum to uniquely identify descriptors, because enums are great and you should use them everywhere.

There is a ton of static_asserting that needs to take place when using this in production, but I won’t go over that here. Just know that you’d probably want to assert the order of descriptors matches their key index (not required, but greatly simplifies the implementation). You’d also assert various particulars related to your problem space. By accessing the expected members, you guarantee this won’t compile if a descriptor is missing a member, etc.

// A schema defined language and compiler beginning.
// OK OK, prelude.
enum class lang_directive {
	include,
	foreach,
	var,
	count,
};

enum class argument_requirement {
	required,
	optional,
	prohibited,
	count,
};

enum class argument_type {
	expression, // (auto v : vec)
	string, // "a string with quotes"
	name, // a single word without quotes
	count,
};

enum class compile_phase {
	preprocessor, // Parser will search for '#' + 'token'.
	compile,
	count,
};

// etc.

// I personally like to hide the descriptors in 'detail', if at all possible.
namespace detail {

struct include_descriptor {
	static constexpr auto key = lang_directive::include;
	static constexpr std::string_view token = "include";
	static constexpr auto arg_requirement = argument_requirement::required;
	static constexpr auto arg_type = argument_type::string;
	static constexpr auto phase = compile_phase::preprocessor;

	// etc.
};

struct foreach_descriptor {
	static constexpr auto key = lang_directive::foreach;
	static constexpr std::string_view token = "foreach";
	static constexpr auto arg_requirement = argument_requirement::required;
	static constexpr auto arg_type = argument_type::expression;
	static constexpr auto phase = compile_phase::compile;
};

struct var_descriptor {
	static constexpr auto key = lang_directive::var;
	static constexpr std::string_view token = "var";
	static constexpr auto arg_requirement = argument_requirement::required;
	static constexpr auto arg_type = argument_type::name;
	static constexpr auto phase = compile_phase::compile;
};


template <class Key, class... Descriptors>
struct lang_db {
	static constexpr size_t size = sizeof...(Descriptors);

	// Create compile-time or runtime accessible arrays, indexable with descriptors' key.
	static constexpr std::array<std::string_view, size> tokens{ Descriptors::token... };
	static constexpr std::array<argument_requirement, size> arg_requirements{ Descriptors::arg_requirement... };
	static constexpr std::array<argument_type, size> arg_types{ Descriptors::arg_type... };

	// Add useful helpers.
	static constexpr auto get_preprocessor_directives() {
		// Find all preprocessor directives and return std::array of their lang_directive keys.
	}

	// A LOT of static_asserts

	// etc.
};
} // namespace detail

// This is the "global" database which we'll be interrogating and interacting with.
using lang_directive_db = detail::lang_db<
	lang_directive,
	detail::include_descriptor,
	detail::foreach_descriptor,
	detail::var_descriptor
>;

Hopefully this sheds some light on the sort of abstraction you can accomplish with this idiom. What you see here is far from enough to implement a language, but with descriptors that isn’t an issue. Our schemas will evolve over time and get more (or less) complex as required. When things evolve, we’ll catch problems at compile time and simply update our code, no biggie. Of course, we’ll probably have a multitude of databases which use this pattern for various things.
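For the curious, the key-order assert and the preprocessor helper mentioned above could be filled in roughly like this. It is a trimmed-down sketch assuming C++17, not a definitive implementation: keys_match_indices, token_of, phases and preprocessor_count are hypothetical names, and the descriptors only keep the members needed here.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

enum class lang_directive { include, foreach, var, count };
enum class compile_phase { preprocessor, compile, count };

namespace detail {

// Reduced descriptors: only the members this sketch needs.
struct include_descriptor {
	static constexpr auto key = lang_directive::include;
	static constexpr std::string_view token = "include";
	static constexpr auto phase = compile_phase::preprocessor;
};

struct foreach_descriptor {
	static constexpr auto key = lang_directive::foreach;
	static constexpr std::string_view token = "foreach";
	static constexpr auto phase = compile_phase::compile;
};

struct var_descriptor {
	static constexpr auto key = lang_directive::var;
	static constexpr std::string_view token = "var";
	static constexpr auto phase = compile_phase::compile;
};

// Hypothetical helper: true when the descriptor at index i has key i.
template <class Key, std::size_t N>
constexpr bool keys_match_indices(const std::array<Key, N>& keys) {
	for (std::size_t i = 0; i < N; ++i) {
		if (static_cast<std::size_t>(keys[i]) != i) {
			return false;
		}
	}
	return true;
}

template <class Key, class... Descriptors>
struct lang_db {
	static constexpr std::size_t size = sizeof...(Descriptors);

	static_assert(size == static_cast<std::size_t>(Key::count),
			"Expected exactly one descriptor per key.");
	static_assert(keys_match_indices(std::array<Key, size>{ Descriptors::key... }),
			"Descriptors must be listed in key order.");

	static constexpr std::array<Key, size> keys{ Descriptors::key... };
	static constexpr std::array<std::string_view, size> tokens{ Descriptors::token... };
	static constexpr std::array<compile_phase, size> phases{ Descriptors::phase... };

	// Number of preprocessor-phase descriptors, counted with a fold expression.
	static constexpr std::size_t preprocessor_count
			= (std::size_t(Descriptors::phase == compile_phase::preprocessor) + ... + std::size_t(0));

	// Works at compile time or runtime, since descriptor order matches key order.
	static constexpr std::string_view token_of(Key k) {
		return tokens[static_cast<std::size_t>(k)];
	}

	// Keys of all preprocessor directives, in declaration order.
	static constexpr auto get_preprocessor_directives() {
		std::array<Key, preprocessor_count> result{};
		std::size_t n = 0;
		for (std::size_t i = 0; i < size; ++i) {
			if (phases[i] == compile_phase::preprocessor) {
				result[n++] = keys[i];
			}
		}
		return result;
	}
};
} // namespace detail

using lang_directive_db = detail::lang_db<
	lang_directive,
	detail::include_descriptor,
	detail::foreach_descriptor,
	detail::var_descriptor
>;
```

Swapping two descriptors in the alias, or forgetting one, trips the static_asserts instead of silently mis-indexing the arrays.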

To facilitate this specific pattern, I’ve written a simple descriptor_map type, which asserts a multitude of things I always want. Then I simply inherit it to customize my descriptor databases. If you use this technique, I’d highly recommend you do something similar. static_assert code isn’t really interesting to write over and over again.
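The descriptor_map type itself isn’t shown in this post, so the following is only a guess at its general shape, assuming C++17. Everything in it is hypothetical: the specific asserts, the descriptor_t alias, the keys_match_indices helper, and the potato_map database that inherits from it.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <tuple>
#include <type_traits>

namespace detail {
// Hypothetical helper: true when the descriptor at index i has key i.
template <class Key, std::size_t N>
constexpr bool keys_match_indices(const std::array<Key, N>& keys) {
	for (std::size_t i = 0; i < N; ++i) {
		if (static_cast<std::size_t>(keys[i]) != i) {
			return false;
		}
	}
	return true;
}
} // namespace detail

// Reusable base which holds the asserts every database wants (a sketch).
template <class Key, class... Descriptors>
struct descriptor_map {
	static constexpr std::size_t size = sizeof...(Descriptors);

	static_assert(std::is_enum_v<Key>, "Keys are expected to be an enum.");
	static_assert(size == static_cast<std::size_t>(Key::count),
			"Expected exactly one descriptor per key.");
	static_assert((std::is_same_v<std::remove_const_t<decltype(Descriptors::key)>, Key> && ...),
			"Every descriptor needs a 'key' member of the Key type.");
	static_assert(detail::keys_match_indices(std::array<Key, size>{ Descriptors::key... }),
			"Descriptors must be listed in key order.");

	// The descriptor type registered for a given key.
	template <Key K>
	using descriptor_t
			= std::tuple_element_t<static_cast<std::size_t>(K), std::tuple<Descriptors...>>;
};

// A database customizes the map by inheriting it and adding its own arrays.
enum class potato_variety { russet, yukon, count };

struct russet_descriptor {
	static constexpr auto key = potato_variety::russet;
	static constexpr std::string_view name = "russet_potato";
	static constexpr std::size_t num_elements = 10;
};

struct yukon_descriptor {
	static constexpr auto key = potato_variety::yukon;
	static constexpr std::string_view name = "yukon_potato";
	static constexpr std::size_t num_elements = 4;
};

template <class... Descriptors>
struct potato_map : descriptor_map<potato_variety, Descriptors...> {
	static constexpr std::array<std::string_view, sizeof...(Descriptors)> names{
		Descriptors::name...
	};
	static constexpr std::array<std::size_t, sizeof...(Descriptors)> num_elements{
		Descriptors::num_elements...
	};
};

using potato_db = potato_map<russet_descriptor, yukon_descriptor>;
```

The derived database only declares what is specific to potatoes. The generic checks run as soon as potato_db is instantiated.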

Descriptor Kind Of CRTP But Not Really

The other pattern I described was CRTP-like inheritance, which I will dub today as “Curiously Recurring Descriptor Pattern” (CRDP). With this technique, we still declare our descriptors as before, but instead of gathering them in a database of sorts, we use them to add conditional functionality to our current class. This removes copy-paste, and unlike traditional inheritance, can participate in constexpr fun with all the friends. Once again, you can use enums for various cool things, but uniquely identifying descriptors isn’t as necessary with this pattern.

I’ve personally used this to great effect for a reflection framework, where you describe your struct variables, their names, etc. and inherit a reflectable<my_descriptor> in your final class.
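The reflection framework isn’t shown here, so take the following as a toy sketch of what a reflectable base could look like, assuming C++17. All names apart from reflectable are invented for illustration (player_descriptor, member_types, member_names, index_of, get), and a real framework would need far more than this.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <tuple>

// Hypothetical descriptor: describes the members of a 'player' struct.
struct player_descriptor {
	static constexpr std::string_view name = "player";
	using member_types = std::tuple<int, float>;
	static constexpr std::array<std::string_view, 2> member_names{ "health", "speed" };
};

// Toy base class which stores the described members and exposes their names.
template <class Descriptor>
struct reflectable {
	using member_types = typename Descriptor::member_types;
	static constexpr std::size_t member_count = std::tuple_size_v<member_types>;

	static_assert(Descriptor::member_names.size() == member_count,
			"Expected exactly one name per member.");

	static constexpr std::string_view type_name() { return Descriptor::name; }

	// Index of the member called 'n', or member_count when there is none.
	static constexpr std::size_t index_of(std::string_view n) {
		for (std::size_t i = 0; i < member_count; ++i) {
			if (Descriptor::member_names[i] == n) {
				return i;
			}
		}
		return member_count;
	}

	template <std::size_t I>
	constexpr auto& get() { return std::get<I>(values); }

	template <std::size_t I>
	constexpr const auto& get() const { return std::get<I>(values); }

	member_types values{};
};

// The final class simply inherits the base, configured by its descriptor.
struct player : reflectable<player_descriptor> {};

// Usage: a name lookup resolved at compile time.
constexpr int example_health() {
	player p{};
	p.get<0>() = 100;
	return p.get<player::index_of("health")>();
}
```

A misspelled member name makes index_of return member_count, so the get call fails to compile instead of failing at runtime.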

For this blog post however, I’ll demo what would be the start of a node-based system, executable at compile time of course. Note this is as easily accomplished with a descriptor database and is simply presented as an alternative style. The beauty with the inheritance pattern is user expandability. Consumers of a library can create their own types, whereas databases are hard-coded somewhere. It is also useful to optimize the size of objects, which may opt into removing certain member variables if they aren’t required by the descriptor config.

// First, let's declare our descriptor and the node that will use it.

namespace detail {
struct add_node_descriptor {
	static constexpr std::string_view name = "add";

	enum class inputs {
		a,
		b,
		count,
	};

	enum class outputs {
		out,
		count,
	};

	// If only C++ had enum to strings.
	static constexpr std::array<std::string_view, 2> input_names{"a", "b"};
	static constexpr std::array<std::string_view, 1> output_names{"out"};

	// Constexpr babay!
	static constexpr auto compute = [](auto a, auto b) { return a + b; };

};
} // namespace detail

// Somewhere, a node.
template <class Descriptor>
struct node {
	// Setup and bind node UI.
	// Create a list of compatible input nodes, output nodes, etc.
	// Abstract string_view apis to the actual input and outputs.

	// static_asserts for enums' count, the names, required schema members, etc.
};


// Finally, the add node.
struct add_node : node<detail::add_node_descriptor> {
	// More "traditional" customization, when compared to the database approach.
};

Once again, a lot of static_assert boilerplate is required in your node definition. Hopefully your imagination can fill in the blanks. Like I mentioned before, you may opt to remove variables from your inherited class according to the input descriptor. To do this, simply create an inheritance hierarchy which is specialized on specific descriptor variables. In this example, we’d likely inherit an input and output base class and specialize for count == 0.
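As a rough sketch of that last point, assuming C++17: the input storage below is a base class specialized for count == 0, so a node without inputs carries no input array at all. The input_storage type, the constant_node_descriptor, the evaluate function and the choice of float values are all assumptions made for this illustration. Output storage is left out to keep it short.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <tuple>
#include <type_traits>

struct add_node_descriptor {
	static constexpr std::string_view name = "add";
	enum class inputs { a, b, count };
	enum class outputs { out, count };
	static constexpr std::array<std::string_view, 2> input_names{ "a", "b" };
	static constexpr std::array<std::string_view, 1> output_names{ "out" };
	static constexpr auto compute = [](auto a, auto b) { return a + b; };
};

// Hypothetical node without any inputs.
struct constant_node_descriptor {
	static constexpr std::string_view name = "constant";
	enum class inputs { count };
	enum class outputs { out, count };
	static constexpr std::array<std::string_view, 0> input_names{};
	static constexpr std::array<std::string_view, 1> output_names{ "out" };
	static constexpr auto compute = [] { return 42.0f; };
};

// Input storage, specialized so that zero inputs means zero members.
template <class T, std::size_t N>
struct input_storage {
	std::array<T, N> inputs{};
};

template <class T>
struct input_storage<T, 0> {};

template <class Descriptor>
struct node : input_storage<float, static_cast<std::size_t>(Descriptor::inputs::count)> {
	static constexpr std::size_t input_count
			= static_cast<std::size_t>(Descriptor::inputs::count);
	static constexpr std::size_t output_count
			= static_cast<std::size_t>(Descriptor::outputs::count);

	static_assert(Descriptor::input_names.size() == input_count,
			"Expected one name per input.");
	static_assert(Descriptor::output_names.size() == output_count,
			"Expected one name per output.");

	// Feeds the stored inputs (if any) to the descriptor's compute function.
	constexpr auto evaluate() const {
		if constexpr (input_count == 0) {
			return Descriptor::compute();
		} else {
			return std::apply(Descriptor::compute, this->inputs);
		}
	}
};

struct add_node : node<add_node_descriptor> {};

// Usage: the whole node runs in a constant expression.
constexpr float add_example() {
	add_node n{};
	n.inputs[0] = 1.5f;
	n.inputs[1] = 2.0f;
	return n.evaluate();
}
```

Because the zero-input specialization is an empty base class, node<constant_node_descriptor> ends up as an empty type, while add_node pays only for its two floats.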

A Simple Example Is Impossible

To conclude, I did try finding a good, simple and concrete example to demonstrate this idiom succinctly. I have failed. Over the past year, I’ve come up with 3 such prototypes. All of which are much too large and complex for an introductory post on the subject. This failure of mine does illustrate something fundamental.

Compile-time, descriptor based architecture generators are meant for very large and complex systems. Applying this idiom to small problems is overkill and silly.

I may revisit this topic to dive in deeper with a sizeable prototype, but for now, I wish you well and I hope this was useful to you.

Special thanks to Alexandre Arsenault for testing these out and pushing the idiom further.