Using Feature Flags to Build a Better AI

In a world where AI is becoming increasingly prevalent in applications, accuracy matters more than ever. We hear a lot about chatbots going rogue, image generators producing physically impossible images, and plainly wrong answers to simple questions. So how can you ensure that your model works, and, more importantly, that new models you create work in the real world before you make them available to the general public? How do you go about testing new AI models? With feature flags, that's how.

We encountered these challenges while developing Helix, an advanced player tracking and behavior analysis solution that seamlessly integrates in-game actions with mobile interactions. Through machine learning, Helix monitors player activities such as logging in, viewing, or clicking on virtual items, as well as scanning virtual items using the mobile app. By utilizing image recognition technology, Helix accurately associates in-game behavior with mobile app usage, offering deeper insights into user engagement and optimizing the gaming experience.

Helix - Game Crafters Guild t-shirt

Part of the Helix technology stack is an image recognition system, built on an internally managed image classification model. When we add new image classifications, how do we know whether they will work in the real world with real users and return the correct classification?

To handle this, we built model selection into our image recognition microservice behind a feature flag. That way, we could see how well a new model performed for a specific set of internal and external beta users before making it the default model for all users.
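
In essence, the flag acts as a model router. A minimal sketch of the idea (the names are illustrative, not from the Helix codebase): the flag's value picks which model serves a given user's request, with a safe default for everyone else.

```typescript
// A flag lookup returns the targeted model for a user, or undefined
// if the user is not in the flag's audience. In production this would
// wrap a feature-flag SDK call such as DevCycle's variable().
type FlagLookup = (userId: string) => string | undefined;

function resolveActiveModel(
  lookupFlag: FlagLookup,
  userId: string,
  defaultModel: string,
): string {
  // Beta users targeted by the flag get the test model; all other
  // users fall back to the current production model.
  return lookupFlag(userId) ?? defaultModel;
}
```

The key property is the fallback: a user outside the targeted audience, or a flag-service outage, always resolves to the known-good production model.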

DevCycle Variable for the model to use

When we created a new model, our product manager would add it as a new variation in DevCycle.

The Test Model would then only be shown to a specific set of users.

Our image recognition microservice is an internal service that takes an image as input and returns a list of matching classifications, along with their corresponding confidence scores. It is accessed through external APIs, which require either user or machine authentication. For authenticated users, we passed in the user's email address from the external API as a custom header.

  async scanImageV4(
    @Request() req,
    @UploadedFile() file: Express.Multer.File,
    @Headers('X-GATHERER-USER-EMAIL') email?: string,
  ) {

NestJS controller for the image scanner showing customer email header

Then we could use DevCycle to return the correct model to use for the specific image classification call.

    // Build the DevCycle user from the authenticated caller.
    const user: DevCycleUser = {
      user_id: origin.uuid,
      email: email,
    };
    await this.devcycleClient.onClientInitialized();
    // Returns this user's variation of the flag, falling back to the
    // model configured in the environment.
    const activeRekognitionModel = this.devcycleClient.variable(
      user,
      'helix-active-rekognition-model',
      process.env.AMAZONREKOG_MODEL,
    );

Call to DevCycle to get the correct version of the model to use for this user.
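
The variable the SDK returns carries the selected model, which can then be threaded into the Rekognition request. A hedged sketch of that hand-off (the helper and the default threshold are our assumptions; `ProjectVersionArn`, `Image`, and `MinConfidence` are real parameters of Amazon Rekognition's `DetectCustomLabels` API):

```typescript
// Hypothetical helper: turn the flag-selected model into the request
// parameters for Amazon Rekognition's DetectCustomLabels call.
interface DetectCustomLabelsParams {
  ProjectVersionArn: string;    // ARN of the model version to invoke
  Image: { Bytes: Uint8Array }; // the uploaded image
  MinConfidence: number;        // drop low-confidence classifications
}

function buildScanParams(
  modelArn: string,        // e.g. the flag variable's value
  imageBytes: Uint8Array,
  minConfidence = 70,      // assumed threshold, not from the post
): DetectCustomLabelsParams {
  return {
    ProjectVersionArn: modelArn,
    Image: { Bytes: imageBytes },
    MinConfidence: minConfidence,
  };
}
```

Because the model ARN is resolved per request, two users hitting the same endpoint at the same time can be served by different model versions.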

This became a bit trickier for our mobile application, which didn't have an authentication system. To handle this, we used the device identifier provided by Android and iOS. Since the mobile application already had a scanner built in, we created a QR code that users could scan and placed it on our internal wiki. Within the UI, we displayed the device ID along with a copy button, so testers could add it to DevCycle and try the new model.
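
On the mobile side, the device identifier simply stands in for the email as the targeting key. A minimal sketch (the function name is ours; the user shape mirrors the `DevCycleUser` used server-side):

```typescript
// Sketch: build the flag-targeting user for an unauthenticated mobile
// session from the platform-provided device identifier.
interface FlagUser {
  user_id: string;
  email?: string;
}

function userFromDevice(deviceId: string): FlagUser {
  // The device ID becomes the user_id, so targeting a tester in
  // DevCycle just means adding their device ID to the flag's audience.
  return { user_id: deviceId };
}
```

This keeps the server-side flag evaluation identical for both paths: one branch keys on email, the other on device ID, and DevCycle treats both as ordinary users.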

Device ID in the mobile app

Conclusion

Incorporating feature flags into the Helix image recognition system allowed us to test new AI models safely and effectively before full deployment. By targeting specific users, whether through authenticated email addresses or device identifiers, we could gather real-world feedback and performance data without risking the reliability of the overall application. This approach not only improved the accuracy and robustness of our AI models but also accelerated iteration and innovation. As AI continues to shape user experiences, embedding tools like feature flags into your development pipeline is essential for controlled experimentation, rapid validation, and delivering trustworthy results at scale.