Luma Dream Machine vs Kling - A brief comparsion using APIs

A logo of PiAPI
PiAPI

Introduction

As OpenAI's text-to-video model Sora rapidly gained popularity after the release of its several high-definition previews on Feb 15th 2024, the AI ecosystem was eagerly waiting for months for its launch to experiment with this newest text-to-video technology but they were of no avail. However, on June 12th, Luma's Dream Machine was launched for public use. And two days prior, on June 10th, the Chinese short video platform Kuaishou released their own text-to-video model Kling. Both product launches were able to make significant dents in the text-to-video generation space, with developers, tech enthusiast and content creator quickly utilizing the tools as part of their workflow.

Thus, PiAPI as the API provider for both Luma's Dream Machine and Kling, would like to briefly compare the two models and share as a reference for our users.

Evaluation Framework

Regarding the evaluation framework used for this comparison, we'd like to adopt the same framework as we had used in our previous blog comparing Luma Dream Machines 1.0 vs 1.5, namely:

  • Prompt Adherence
  • Text Adherence
  • Video Realism
  • Artifacts

For what each of these aspects mean and how they could be rated, feel free to read our previous blog for more detail.

Same Prompt Comparison

To effectively compare the outputs of these two models, we have decided to use the same prompt for both models. The prompts are submitted to Dream Machine and Kling using our Dream Machine API (or Luma API), and our Kling API and the returned results are provided along with their analysis, as shown below.

Two GIFs converted from videos of a little playing with sand castle on the beach, generated by Lum's Dream Machine and Kling
Prompt: On a warm summer afternoon with sunlight shining on the beach, the sea meets the sky in the far distance, gentle waves lap against the shore, and a beautiful little girl, about five or six years old, is playing on the beach. Her golden curls glisten, and her face beams with smile. She wears a pink dress, its hem gently fluttering in the breeze, and small beach sandals, her toes occasionally burying into the soft sand. With a tiny plastic shovel in her hand, she is trying to build a sandcastle. There are seagulls circle in the sky, and a few white sailboats in the distance.


Prompt Adherence
Both models shows important elements described in the prompt: an afternoon, sunlight, sandy beach, ocean, the litter girl with blonde hair and pink dress, and the boats in the far distance.
The Luma model was able to get the seagull and sandal details well; whereas the Kling model was able to get the "toe buried in the sand" part well.
The Luma model wasn't able to get shovel and the sand castle details where Kling depicted them well.

Video Realism
The Luma Model offered a slightly different visual effect, perhaps a bit too ideal and too unblemished. Kling's output on the other hand resembles the real-world more closely, with natural lighting and proportions.

Video Resolution
The Luma Model's resolution is high, with details like the texture of the hair, clothing, and the sparkle of the sea clearly visible, but the overall effect appears somewhat smooth due to the artistic processing. Kling's output also has high resolution, with details such as the sand, waves, and the girl’s hair strands appearing clear and natural, consistent with real-world resolution.


Artifacts
The depiction of the little girl's knees and legs are quite off in Luma's output, perhaps it is due to there is very little training data on that particular w-sitting position. And since Kling's output has the little girl standing up, one could make the assumption that Kling would have a similarly difficult time portraying the sitting posture accurately and realistically.

Two GIFs converted from videos of a night city covered by shining lights, generated by Lum's Dream Machine and Kling
Prompts: On a night in a modern metropolis, the camera overlooks from high above, outlining the city skyline with blinking lights. Then, the camera slowly descends, focusing on bustling streets. The neon lights shine brilliantly in colors of red, blue, green, and purple, illuminating the entire block. Pedestrians hustle along the sidewalks, some pausing to gaze at shop windows or check their phones. Vehicles passes by on the streets, their headlights flickering in the night.

Prompt Adherence
Both models captured most of the prompt descriptions: the night city, the neon lights, the overviewing camera angle, busy streets with people and cars. However, both models missed the slowly descending camera view part, perhaps it was due to the usage of the word "slowly".

Video Realism
Both videos showed a decent level of realism with the night lighting. The city skyline portrayed in the Luma video seems a bit whimiscal.

Video Resolution
Both videos had a relatively good quality of resolution.


Artifacts
The video from Luma shows some unnatural transitions between the lights in the background and those on the streets, particularly around the edges of buildings, where there is slight distortion and blurring. For Kling's video, the way that the vehicles shrinks in size and rapidly disappears as they approach the end of the street look very unnatural.

Conclusion

So above are the two examples of both text-to-video generation model processing complex, dynamic scenes with multiple details. It is up to readers like you to decide which model fared better, our team at PiAPI simply proposes a framework to evaluate the outputs and provide real test outputs for you :)

We hope you found this comparison blog useful! If you are interested, please also check out PiAPI for our other generative AI APIs!


More Stories