Table 2. Quality of DBRX Instruct compared to leading closed-source models. Except for Inflection Corrected MTBench (data we measured ourselves at the model endpoints), all other numbers are reported by the creators of these models in their respective white papers. For details, see the footnotes.
DBRX Instruct was trained with a context window of up to 32K tokens. Table 3 compares its performance with Mixtral Instruct and the latest versions of the GPT-3.5 Turbo and GPT-4 Turbo APIs on a set of long-context benchmarks: KV-Pairs from the Lost in the Middle paper, and HotpotQAXL, a modified version of HotPotQA that extends the task to longer sequence lengths. GPT-4 Turbo is generally the best model on these tasks. However, with one exception, DBRX Instruct outperforms GPT-3.5 Turbo at every context length and at every part of the sequence. Overall, the performance of DBRX Instruct and Mixtral Instruct is similar.
Model | DBRX Instruct | Mixtral Instruct | GPT-3.5 Turbo (API) | GPT-4 Turbo (API) |
---|---|---|---|---|
Answer in the first third of the context | 45.1% | 41.3% | 37.3%* | 49.3% |
Answer in the middle third of the context | 45.3% | 42.7% | 37.3%* | 49.0% |
Answer in the last third of the context | 48.0% | 44.4% | 37.0%* | 50.9% |
2K context | 59.1% | 64.6% | 36.3% | 69.3% |
4K context | 65.1% | 59.9% | 35.9% | 63.5% |
8K context | 59.5% | 55.3% | 45.0% | 61.5% |
16K context | 27.0% | 20.1% | 31.7% | 26.0% |
32K context | 19.9% | 14.0% | — | 28.5% |
Table 3. Average performance of models on the KV-Pairs and HotpotQAXL benchmarks. Bold indicates the highest score. Underline indicates the highest score excluding GPT-4 Turbo. GPT-3.5 Turbo supports a maximum context length of 16K, so we could not evaluate it at 32K. The beginning, middle, and end averages for GPT-3.5 Turbo (marked with *) therefore include only contexts of up to 16K.
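For context on how these long-context probes work, here is a minimal sketch of a KV-Pairs style task in the spirit of Lost in the Middle: a block of random key-value pairs is serialized into the prompt, the position of the queried pair is controlled, and the response is scored by string match. The helper names and pair count are our own illustration, not the harness used for Table 3.

```python
import json
import random
import uuid

def make_kv_prompt(num_pairs: int, target_position: float) -> tuple[str, str]:
    """Build a KV-Pairs style probe: a JSON blob of random UUID pairs plus a
    question about one key, placed at a chosen relative position in the context."""
    pairs = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(num_pairs)]
    target_idx = min(int(target_position * num_pairs), num_pairs - 1)
    target_key, target_value = pairs[target_idx]
    context = json.dumps(dict(pairs), indent=0)
    prompt = (
        "Extract the value for the given key from the JSON object below.\n\n"
        f"{context}\n\n"
        f"Key: {target_key}\nValue:"
    )
    return prompt, target_value

def score(response: str, target_value: str) -> bool:
    """Simple string-match scoring: did the model reproduce the value?"""
    return target_value in response

# Example: answer placed in the middle third of a short (roughly 2K-token) context.
prompt, answer = make_kv_prompt(num_pairs=25, target_position=0.5)
```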
Using RAG (retrieval-augmented generation) is one of the most popular ways to leverage a model's context. In RAG, content relevant to the prompt is retrieved from a database and presented to the model alongside the prompt, giving it more information than it would have on its own. Table 4 shows the quality of DBRX on two RAG benchmarks, Natural Questions and HotPotQA, when the model is also provided with the top 10 passages retrieved from a Wikipedia article corpus using the embedding model bge-large-en-v1.5. DBRX Instruct is competitive with open models such as Mixtral Instruct and LLaMA2-70B Chat, as well as the current version of GPT-3.5 Turbo.
Model | DBRX Instruct | Mixtral Instruct | LLaMA2-70B Chat | GPT-3.5 Turbo (API) | GPT-4 Turbo (API) |
---|---|---|---|---|---|
Natural Questions | 60.0% | 59.1% | 56.5% | 57.7% | 63.9% |
HotPotQA | 55.0% | 54.2% | 54.7% | 53.0% | 62.9% |
Table 4. Performance of the models when provided with the top 10 passages retrieved from the Wikipedia corpus using bge-large-en-v1.5. Accuracy is measured by matching the correct answer within the model's response. Bold indicates the highest score. Underline indicates the highest score excluding GPT-4 Turbo.
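As a rough sketch of the retrieval side of this setup, the following uses the sentence-transformers implementation of bge-large-en-v1.5 to pick the top passages for a question; the tiny in-memory passage list stands in for the Wikipedia corpus, and the prompt template is illustrative rather than the one used for Table 4.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# bge-large-en-v1.5 is the embedding model named in Table 4.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

# `passages` is a tiny placeholder for the Wikipedia article corpus.
passages = ["...passage 1...", "...passage 2...", "...passage 3..."]
passage_emb = embedder.encode(passages, normalize_embeddings=True)

def retrieve_top_k(question: str, k: int = 10) -> list[str]:
    """Embed the question and return the k most similar passages by cosine similarity."""
    # bge v1.5 models recommend this query instruction prefix for retrieval.
    q_emb = embedder.encode(
        "Represent this sentence for searching relevant passages: " + question,
        normalize_embeddings=True,
    )
    scores = passage_emb @ q_emb
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

def build_rag_prompt(question: str) -> str:
    """Concatenate the retrieved passages with the question into a single prompt."""
    context = "\n\n".join(retrieve_top_k(question))
    return (
        f"Use the passages below to answer the question.\n\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```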
Model quality must be weighed against how efficient the model is to train and use. This is especially important at Databricks, because we build models like DBRX to establish the process by which our customers can train their own foundation models.
We found that training mixture-of-experts models provides substantial improvements in training compute efficiency (Table 5). For example, a smaller member of the DBRX family, DBRX MoE-B (23.5B total parameters, 6.6B active parameters), required 1.7x fewer FLOPs to reach 45.5% on the Databricks LLM Gauntlet than LLaMA2-13B needed to reach 43.8%. DBRX MoE-B also has only about half as many active parameters as LLaMA2-13B.
Overall, our end-to-end LLM pre-training pipeline has become nearly 4x more compute-efficient over the past ten months. On May 5, 2023, we released MPT-7B, a 7B-parameter model trained on 1 trillion tokens that scored 30.9% on the Databricks LLM Gauntlet. A member of the DBRX family called DBRX MoE-A (7.7B total parameters, 2.2B active parameters) scored 30.5% on the Gauntlet while requiring 3.7x fewer FLOPs than MPT-7B. This efficiency improvement is the result of many enhancements, including the use of an MoE architecture, other architectural changes to the network, better optimization strategies, better tokenization, and, importantly, better pre-training data.
On its own, better pre-training data has a significant impact on model quality. We trained a 7B model, called DBRX Dense-A, on 1 trillion tokens of the DBRX pre-training data. It scored 39.0% on the Databricks Gauntlet, compared to 30.9% for MPT-7B. We estimate that our new pre-training data is at least twice as good token for token as the data used to train MPT-7B; in other words, we estimate that roughly half as many tokens are needed to reach the same model quality. We confirmed this by training DBRX Dense-A on only 500 billion tokens; it still outperformed MPT-7B on the Databricks Gauntlet, scoring 32.1%. Beyond better data quality, another significant contributor to token efficiency may be the GPT-4 tokenizer, which has a large vocabulary and is widely regarded as especially token-efficient. These lessons about improving data quality translate directly into practices and tools that our customers can use to train foundation models on their own data.
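The tokenizer's contribution is easy to get a feel for empirically. Below is a small, illustrative comparison using the tiktoken library: cl100k_base is the GPT-4 tokenizer mentioned above, the older GPT-2 BPE serves as a smaller-vocabulary baseline, and the input file name is a placeholder for any representative text sample.

```python
import tiktoken

text = open("sample_corpus.txt").read()  # placeholder: any representative text sample

# cl100k_base is the GPT-4 tokenizer referenced above; gpt2 is an older,
# smaller-vocabulary BPE included purely as a point of comparison.
for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(text))
    print(f"{name}: {n_tokens} tokens ({len(text) / n_tokens:.2f} chars per token)")
```

Fewer tokens for the same text means more content fits into a fixed training token budget, which is one way a tokenizer can improve token efficiency.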
Model | Total Parameters | Active Parameters | Gauntlet Score | Relative FLOPs |
---|---|---|---|---|
DBRX MoE-A | 7.7B | 2.2B | 30.5% | 1x |
MPT-7B (1T tokens) | 6.7B | 6.7B | 30.9% | 3.7x |
DBRX Dense-A (1T tokens) | 6.7B | 6.7B | 39.0% | 3.7x |
DBRX Dense-A (500B tokens) | 6.7B | 6.7B | 32.1% | 1.85x |
DBRX MoE-B | 23.5B | 6.6B | 45.5% | 1x |
LLaMA2-13B | 13.0B | 13.0B | 43.8% | 1.7x |
Table 5. Details of several test models we used to validate the DBRX MoE architecture and end-to-end training process.
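The Relative FLOPs column can be roughly sanity-checked with the common approximation that training a transformer costs about 6 FLOPs per active parameter per token (the 6ND rule of thumb). The sketch below is ours, not part of the Gauntlet tooling, and it only uses the rows of Table 5 whose token budgets are stated explicitly.

```python
def train_flops(n_active_params: float, n_tokens: float) -> float:
    """Rough transformer training cost: ~6 FLOPs per active parameter per token."""
    return 6 * n_active_params * n_tokens

# Entries from Table 5 where the token budget is stated explicitly.
mpt_7b      = train_flops(6.7e9, 1.0e12)   # MPT-7B, 1T tokens
dense_a_1t  = train_flops(6.7e9, 1.0e12)   # DBRX Dense-A, 1T tokens
dense_a_05t = train_flops(6.7e9, 0.5e12)   # DBRX Dense-A, 500B tokens

print(dense_a_1t / dense_a_05t)   # -> 2.0, matching the 3.7x vs 1.85x entries in Table 5
print(mpt_7b / dense_a_05t)       # -> 2.0, MPT-7B vs the 500B-token dense run
```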
Figure 2 shows the end-to-end inference efficiency of serving DBRX and comparable models with NVIDIA TensorRT-LLM at 16-bit precision on our optimized serving infrastructure. We aim for this benchmark to reflect real-world usage as closely as possible, including multiple users hitting the same inference server simultaneously. We spawn one new user per second; each user request contains a prompt of approximately 2,000 tokens, and each response contains 256 tokens.
In general, MoE models are faster at inference than their total parameter counts would suggest, because they use relatively few parameters for each input. We found that DBRX is no exception: its inference throughput is 2 to 3 times higher than that of a 132B-parameter non-MoE model.
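To make the connection between active parameters and per-token compute concrete, here is a minimal, illustrative top-k expert-routing layer in PyTorch. It is a readability-first sketch rather than the MegaBlocks-based implementation used to train DBRX, and the expert count and k below are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

class TopKMoE(torch.nn.Module):
    """Minimal top-k mixture-of-experts layer: every expert's weights exist
    (total parameters), but each token is processed by only k of them
    (active parameters)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = torch.nn.Linear(d_model, n_experts, bias=False)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, d_ff),
                torch.nn.GELU(),
                torch.nn.Linear(d_ff, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        weights = F.softmax(self.router(x), dim=-1)         # [tokens, n_experts]
        topk_w, topk_idx = weights.topk(self.k, dim=-1)     # route each token to k experts
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize the k gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_w[mask, slot, None] * expert(x[mask])
        return out

# Illustrative sizes only: 8 experts with 2 active per token means roughly a
# quarter of the expert parameters participate in any single forward pass.
layer = TopKMoE(d_model=512, d_ff=2048, n_experts=8, k=2)
y = layer(torch.randn(16, 512))
```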
Inference efficiency and model quality are typically in tension: larger models usually reach higher quality, but smaller models are more efficient at inference. Using an MoE architecture makes it possible to reach a better trade-off between model quality and inference efficiency than dense models typically allow. For example, DBRX is higher quality than LLaMA2-70B, and because it has roughly half as many active parameters, DBRX's inference throughput is about twice that of LLaMA2-70B (Figure 2). Mixtral is another point on the improved Pareto frontier achieved by MoE models: it is smaller than DBRX, so it scores lower on quality but has higher inference throughput. Users of the Databricks Foundation Model APIs can expect DBRX to reach 150 tokens per second on our optimized model serving platform with 8-bit quantization.
Figure 2. Inference throughput for various model configurations using NVIDIA TensorRT-LLM at 16-bit precision on our optimized serving infrastructure. Models are run with tensor parallelism across nodes. Input prompts contain approximately 2,000 tokens, and we generate 256 output tokens. A new user spawns every second.
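For readers who want to reproduce a similar load pattern, here is a hedged sketch of the arrival process described in the caption (one new user per second, a roughly 2,000-token prompt, 256 output tokens) against a generic OpenAI-compatible endpoint. The endpoint URL, credentials, and model id are placeholders; this is not the harness used to produce Figure 2.

```python
import asyncio
import time

from openai import AsyncOpenAI

# Placeholder endpoint, credentials, and model id; any OpenAI-compatible
# TensorRT-LLM or vLLM server could stand in here.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPT = "word " * 2000  # rough stand-in for a ~2,000-token prompt

async def one_user() -> float:
    """Send a single request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    await client.chat.completions.create(
        model="dbrx-instruct",  # placeholder model id
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
    )
    return time.perf_counter() - start

async def main(n_users: int = 30) -> None:
    tasks = []
    for _ in range(n_users):
        tasks.append(asyncio.create_task(one_user()))
        await asyncio.sleep(1.0)  # a new user arrives every second
    latencies = await asyncio.gather(*tasks)
    mean = sum(latencies) / len(latencies)
    print(f"mean latency {mean:.1f}s, ~{256 / mean:.1f} output tokens/s per user")

asyncio.run(main())
```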
DBRX was trained on 3072 NVIDIA H100s connected by 3.2Tbps Infiniband. The main process of building DBRX, including pre-training, post-training, evaluation, red-teaming, and refinement, took place over the course of three months. It built on several months of science, dataset research, and scaling experiments at Databricks, not to mention Databricks' years of experience in LLM development, including the MPT and Dolly projects and the thousands of models we have built and deployed in production with our customers.
To build DBRX, we used the same suite of Databricks tools that is available to our customers. We managed and governed our training data with Unity Catalog. We explored this data with the recently acquired Lilac AI. We processed and cleaned the data with Apache Spark™ and Databricks notebooks. We trained DBRX with optimized versions of our open-source training libraries: MegaBlocks, LLM Foundry, Composer, and Streaming. We managed large-scale model training and fine-tuning across thousands of GPUs with our Mosaic AI Training service. We logged our results with MLflow. We collected human feedback for quality and safety improvements through Mosaic AI Model Serving and Inference Tables. We experimented with the models by hand using the Databricks Playground. We found that the Databricks tools are excellent at each of these uses, and that we benefited from their being part of a unified product experience.
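As an illustration of the kind of experiment tracking described above, a minimal MLflow sketch might look like the following; the run name and loss values are invented placeholders, while the parameter and score figures echo the DBRX MoE-B row of Table 5.

```python
import mlflow

# Hypothetical run layout with placeholder loss values; the parameter and
# score figures mirror the DBRX MoE-B entry in Table 5.
with mlflow.start_run(run_name="dbrx-moe-b-ablation"):
    mlflow.log_params({"total_params": "23.5B", "active_params": "6.6B"})
    for step, loss in enumerate([2.31, 2.05, 1.94]):  # placeholder values
        mlflow.log_metric("train_loss", loss, step=step)
    mlflow.log_metric("gauntlet_score", 0.455)
```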
If you want to start using DBRX right away, it is easy to do so with Databricks Mosaic AI Foundation Model APIs. You can get started quickly with our pay-as-you-go pricing and query the model from our AI Playground chat interface. For production applications, we offer a provisioned throughput option that provides performance guarantees, support for fine-tuned models, and additional safety and compliance. To host DBRX privately, you can download the model from the Databricks Marketplace and deploy it on Model Serving.
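As a sketch of what querying DBRX programmatically can look like, Databricks serving endpoints expose an OpenAI-compatible interface; the workspace URL and token below are placeholders, and the endpoint name should be checked against the Foundation Model APIs documentation for your workspace.

```python
from openai import OpenAI

# Placeholder workspace URL and token; the endpoint name follows the pattern
# used by Databricks Foundation Model APIs but may differ in your workspace.
client = OpenAI(
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
    api_key="<DATABRICKS_TOKEN>",
)

response = client.chat.completions.create(
    model="databricks-dbrx-instruct",
    messages=[{"role": "user", "content": "Summarize what a mixture-of-experts model is."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```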
At Databricks, we believe that every enterprise should be able to take control of its data and its destiny in the emerging world of GenAI. DBRX is a central pillar of our next generation of GenAI products, and we look forward to the exciting journey our customers take as they leverage the capabilities of DBRX and the tools we used to build it. Over the past year, we have trained thousands of LLMs with our customers. DBRX is just one example of the powerful and efficient models Databricks builds for a wide range of applications, from internal features to our customers' ambitious use cases.
As with any new model, the journey with DBRX is just the beginning; the best work will be done by those who build on it: enterprises and the open community. This is also only the beginning of our work on DBRX, and you should expect much more to come.
The development of DBRX was led by the Mosaic team, which previously built the MPT model series, in collaboration with dozens of engineers, lawyers, procurement and finance specialists, program managers, marketers, designers, and other contributors from across Databricks. We are grateful to our colleagues, friends, families, and communities for their patience and support over the past months.
In creating DBRX, we stand on the shoulders of giants in the open and academic communities. By making DBRX available openly, we hope to give back to the community and look forward to building even greater technology together in the future. In this context, we are especially grateful for the work and collaboration of Trevor Gale and his MegaBlocks project (Trevor's PhD advisor is Databricks CTO Matei Zaharia), the PyTorch team and the FSDP project, NVIDIA and the TensorRT-LLM project, the vLLM team and project, EleutherAI and their LLM evaluation project, Daniel Smilkov and Nikhil Thorat of Lilac AI, and our friends at the Allen Institute for Artificial Intelligence (AI2).